Saturday, July 18, 2015

Data scraping with C++

Hello. 

My name is CText. I am a C++ class. I can read a text file, usually an HTML or XML file, and extract a lot of useful information from it. I can find a piece of text enclosed by given strings, either in a particular line or in the entire file. I can parse tables. I can parse lists. I have worked with various data sources, including Amazon, Christie's, EPO, Factiva, and many more. I significantly speed up coding and increase code reliability. Although I am probably not perfectly optimized, I do my job well. I have to admit that I have been tested only with Visual C++ on Microsoft Windows, and I require an additional header file (anutil.h) to function. But I still hope I can be useful.

The following program illustrates my capabilities. The program
  1. downloads the GDP data from the World Bank website,
  2. converts the GDP data and saves it to a CSV file,
  3. finds the first line of the table in the file and displays this line, and
  4. retrieves the list of the countries and displays it.

Sample code

Download the source code or copy and paste it:

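#include <cstdlib>
#include <iostream>
#include <string>
#include <vector>
#include <andatathresher.h>

using namespace std;

int main()
{
    cout << "Hello World!" << endl;

    // 1. Download the GDP page from the World Bank website.
    download("http://data.worldbank.org/indicator/NY.GDP.MKTP.CD",
        "file.html");
    CText txt("file.html");

    // 2. Parse the first table and save it to a CSV file.
    CCSV csv;
    csv.data = txt.parsetable(0);
    csv.savetofile("gdp.csv");

    // 3. Find the first line of the table and display it.
    int line = txt.findline("<table");
    if (line >= 0)
        cout << "The line containing the first \"table\" has the number "
            << line << " and its content is: " << endl << txt.line(line)
            << endl;

    // 4. Retrieve the list of countries and display it.
    vector<string> ctrs = txt.selectiveharvest("Country name",
        "</table>", "<tr", "</tr>", "<a href", ">", "<");
    cout << "Here is the list of countries covered:" << endl;
    // i + 1 < size() avoids unsigned underflow when the list is empty.
    for (size_t i = 0; i + 1 < ctrs.size(); i++)
        cout << ctrs[i] << "; ";
    if (!ctrs.empty()) cout << ctrs.back() << endl;

    system("pause"); // Windows only: keeps the console window open
    return 0;
}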

Class members

string content
A field with the content of the file.

vector<int> separators
A field that contains positions of new line characters in the file.

CText::CText(const string &fname)
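A class constructor. Loads a file and performs its preliminary analysis. A constructor without parameters is available as well, and the file can then be loaded manually. If you fill the content field yourself, call update() before using any line-based function (findline, findinline, geturl, line), because these rely on the separators vector.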

bool CText::load(const string &filename)
This function loads a file with the given file name, stores its content in the field content, and performs its preliminary analysis (see update). Returns true upon success and false otherwise.

void CText::update()
A method performing preliminary analysis of the file. It updates the separators vector.

int CText::findline(const string &text, int pos = 0)
This function returns the number of the first line that contains the given string. It starts searching in the file from the offset pos. Lines are numbered from 0. Returns -1 if no line is found.
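
The line bookkeeping behind update, findline, and line can be sketched on an in-memory string. This is a standalone illustration, not the class code: build_separators, line_of_offset, and line_at are hypothetical free functions, and the offset-to-line lookup uses a binary search (std::lower_bound) instead of the linear scan found in the class.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Positions of every '\n', plus the end of the text as a final sentinel
// (the same layout that update() stores in separators).
std::vector<std::size_t> build_separators(const std::string &text)
{
    std::vector<std::size_t> seps;
    for (std::size_t i = 0; i < text.size(); ++i)
        if (text[i] == '\n') seps.push_back(i);
    seps.push_back(text.size());
    return seps;
}

// 0-based number of the line that contains the given offset:
// the index of the first separator located at or after the offset.
std::size_t line_of_offset(const std::vector<std::size_t> &seps,
                           std::size_t offset)
{
    return std::lower_bound(seps.begin(), seps.end(), offset) - seps.begin();
}

// Text of the line with the given 0-based index, without the newline.
// Returns an empty string for an index that is out of range.
std::string line_at(const std::string &text,
                    const std::vector<std::size_t> &seps,
                    std::size_t index)
{
    if (index >= seps.size()) return "";
    std::size_t start = (index == 0) ? 0 : seps[index - 1] + 1;
    return text.substr(start, seps[index] - start);
}
```

With the text "alpha\nbeta <table>\ngamma", the offset of "<table" maps to line 1, and line_at returns "beta <table>" for that index.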

int CText::findinline(string &res, int line, const string &before, const string &after)
This function extracts a string from the line with the line number given by line. The string that is directly after before and directly before after is returned through the reference res. The function returns the offset of the found string or -1 if it cannot be found.

int CText::findafter(string &res, const string &prefix, const string &before, const string &after, int pos = 0)
This function extracts a string from the file. It starts searching from the offset pos. It looks for the first subsequent occurrence of prefix and then for the first subsequent occurrence of before. The string that is directly after before and directly before after is returned through the reference res. The function returns the offset of the found string or -1 if it cannot be found.

int CText::findbetween(string &res, const string &before, const string &after, int pos = 0)
This function acts as findafter but with an empty string in prefix.
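
The prefix/before/after search used by findafter and findbetween can be tried out on a plain string. The sketch below is a hypothetical free function (find_after is not a member of CText) that follows the description above; it takes the text as its first argument and tests against std::string::npos explicitly.

```cpp
#include <cstddef>
#include <string>

// Looks for prefix, then for before, then for after, starting at pos.
// On success, stores the text between before and after in res and
// returns its offset. Returns -1 if any of the three strings is missing.
long find_after(const std::string &content, std::string &res,
                const std::string &prefix, const std::string &before,
                const std::string &after, std::size_t pos = 0)
{
    const std::size_t npos = std::string::npos;
    std::size_t k1 = content.find(prefix, pos);
    if (k1 == npos) return -1;
    std::size_t k2 = content.find(before, k1 + prefix.size());
    if (k2 == npos) return -1;
    std::size_t begin = k2 + before.size();
    std::size_t k3 = content.find(after, begin);
    if (k3 == npos) return -1;
    res = content.substr(begin, k3 - begin);
    return static_cast<long>(begin);
}
```

For the fragment <td><a href="/country/poland">Poland</a></td>, calling find_after with the prefix "<a href", before ">", and after "<" stores Poland in res. Passing an empty prefix gives the behavior described for findbetween.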

int CText::geturl(string &res, const string &pattern)
This function looks for the first line that contains the string pattern. Then it searches that line for the first href HTML attribute and returns its value through the reference res. The function returns the offset of the found string or -1 if it cannot be found.

string CText::line(int index)
This function returns the line of the file with the line number given by index. Lines are numbered from 0.

vector<string> CText::harvest(const string &ldelimit, const string &udelimit, const string &prefix, const string &before, const string &after)
This function acts as selectiveharvest but with empty strings in starter and stopper. That is, it operates on the entire file.

vector<string> CText::selectiveharvest(const string& starter, const string& stopper, const string &ldelimit, const string &udelimit, const string &prefix, const string &before, const string &after)
This function locates all strings that satisfy given criteria and returns them as a vector. It operates only on the fragment of the file that is after the first occurrence of starter and before the first subsequent occurrence of stopper. Within this range, the function looks for pieces of text enclosed by ldelimit and udelimit. Each such enclosure produces one element of the resulting vector. Within each enclosure, the function performs a procedure similar to findafter and adds the result as an element to the resulting vector; an enclosure without a match yields an empty string.
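
The harvesting procedure can be sketched as a standalone function working on a string. This is an illustration under assumptions, not the class code: harvest_between is a hypothetical name, the text is passed as the first argument, the search for prefix, before, and after is restricted to each enclosure, and empty delimiters are rejected to guarantee that the loop advances.

```cpp
#include <cstddef>
#include <string>
#include <vector>

std::vector<std::string> harvest_between(const std::string &content,
    const std::string &starter, const std::string &stopper,
    const std::string &ldelimit, const std::string &udelimit,
    const std::string &prefix, const std::string &before,
    const std::string &after)
{
    const std::size_t npos = std::string::npos;
    std::vector<std::string> res;
    if (ldelimit.empty() || udelimit.empty()) return res;

    // Range to scan: from starter up to the first following stopper.
    std::size_t begin = starter.empty() ? 0 : content.find(starter);
    if (begin == npos) return res;
    std::size_t finish = stopper.empty() ? content.size()
                                         : content.find(stopper, begin);
    if (finish == npos) finish = content.size();

    std::size_t start = content.find(ldelimit, begin);
    while (start != npos && start < finish) {
        std::size_t stop = content.find(udelimit, start + ldelimit.size());
        if (stop == npos) break;
        // One enclosure, from ldelimit up to (not including) udelimit.
        std::string chunk = content.substr(start, stop - start);
        std::string item;  // stays empty if the enclosure has no match
        std::size_t k1 = chunk.find(prefix);
        std::size_t k2 = (k1 == npos) ? npos
                                      : chunk.find(before, k1 + prefix.size());
        if (k2 != npos) {
            std::size_t from = k2 + before.size();
            std::size_t k3 = chunk.find(after, from);
            if (k3 != npos) item = chunk.substr(from, k3 - from);
        }
        res.push_back(item);
        start = content.find(ldelimit, stop + udelimit.size());
    }
    return res;
}
```

Called with the same seven strings as in the sample program ("Country name", "</table>", "<tr", "</tr>", "<a href", ">", "<"), the function returns one element per table row after the header, taking the text of the first link in each row.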

int CText::occurencies(const string& text)
This function counts how many times the given string can be found in the file. Overlapping occurrences are counted separately, and an empty string yields 0.
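
The counting loop is easy to check in isolation. The following sketch is a hypothetical free function (count_occurrences is not part of the class) that restarts the search one character after each hit, which is why overlapping matches are included.

```cpp
#include <cstddef>
#include <string>

// Counts occurrences of text in content, including overlapping ones.
int count_occurrences(const std::string &content, const std::string &text)
{
    if (text.empty()) return 0;
    int count = 0;
    for (std::size_t k = content.find(text); k != std::string::npos;
         k = content.find(text, k + 1))
        ++count;
    return count;
}
```

For example, "aa" is counted three times in "aaaa", because the search resumes at the character following the start of the previous match.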

vector<TDataRec> CText::parsetable(int pos)
This function parses an HTML table and returns it as a vector of vectors of strings. It looks for the first table after offset pos. It will not work for tables inside tables. For the definition of TDataRec type see file anutil.h.
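
Since the result is a vector of vectors of strings, turning it into CSV text is a short loop. The sketch below is only an illustration: Row is a stand-in for TDataRec (assumed here to be a vector of strings), csv_field and to_csv are hypothetical helpers, and the quoting rule (quote a field that contains a comma, a quote, or a line break, and double any embedded quotes) is the common CSV convention rather than a description of what CCSV does.

```cpp
#include <cstddef>
#include <string>
#include <vector>

typedef std::vector<std::string> Row;  // stand-in for TDataRec (assumption)

// Quotes a field only when it contains a comma, a quote, or a line break.
std::string csv_field(const std::string &cell)
{
    if (cell.find_first_of(",\"\r\n") == std::string::npos) return cell;
    std::string out = "\"";
    for (std::size_t i = 0; i < cell.size(); ++i) {
        if (cell[i] == '"') out += '"';  // double embedded quotes
        out += cell[i];
    }
    out += '"';
    return out;
}

// Joins the cells of each row with commas, one row per line.
std::string to_csv(const std::vector<Row> &table)
{
    std::string out;
    for (std::size_t r = 0; r < table.size(); ++r) {
        for (std::size_t i = 0; i < table[r].size(); ++i) {
            if (i > 0) out += ',';
            out += csv_field(table[r][i]);
        }
        out += '\n';
    }
    return out;
}
```

The quoting matters for this kind of data: a country name such as "Korea, Rep." or a figure formatted as "1,410" would otherwise be split into two columns.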

Class code

Download the source code or copy and paste it:

#pragma once

#include <algorithm>
#include <cctype>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>
#include "anutil.h"

using namespace std;

/* INTERFACE */

class CText {
public:
    string content;
    vector<int> separators;
    CText() {}
    CText(const string &fname);
    void update();
    bool load(const string &filename);
    int findline(const string &text, int pos);
    int findinline(string &res, int line,
        const string &before, const string &after);
    int findafter(string &res, const string &prefix,
        const string &before, const string &after, int pos);
    int findbetween(string &res, const string &before,
        const string &after, int pos);
    int geturl(string &res, const string &pattern);
    vector<string> harvest(const string &ldelimit,
        const string &udelimit, const string &prefix,
        const string &before, const string &after);
    vector<string> selectiveharvest(const string &starter,
        const string &stopper, const string &ldelimit,
        const string &udelimit, const string &prefix,
        const string &before, const string &after);
    string line(int index);
    int occurencies(const string &text);
    vector<TDataRec> parsetable(int pos);
};

/* IMPLEMENTATION */

CText::CText(const string &fname)
{
    load(fname);
}

void CText::update()
{
    // Record the position of every newline, then the end of the content.
    separators.clear();
    for (int i=0;i<(int)content.length();i++)
        if (content[i]=='\n') separators.push_back(i);
    separators.push_back(content.length());
}

bool CText::load(const string &filename)
{
    ifstream infile;
    infile.open(filename.c_str(),ios::binary);
    if (!infile.is_open())
    {
        content = "";
        separators.clear();
        return false;
    }
    stringstream cont;
    string line;
    // Test the read itself rather than eof(), so that no spurious
    // empty line is appended after the last one.
    while (getline(infile,line))
        cont << line << '\n';
    content = cont.str();
    infile.close();
    update();
    return true;
}

int CText::findline(const string &text, int pos = 0)
{
    if (separators.size()==0) return -1;
    int k = content.find(text,pos);
    if (k<0) return -1;
    // The line number equals the count of separators located before k.
    int i;
    for (i=separators.size()-1; i>=0; i--)
        if (separators[i]<k) break;
    return i+1;
}

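int CText::findinline(string &res, int line, 
    const string &before, const string &after)
{
    if ((line<0) || (line>=(int)separators.size())) return -1;
    // A line spans from just after the previous separator up to
    // separators[line], using the same numbering as CText::line.
    int first = (line==0) ? 0 : separators[line-1]+1;
    int last = separators[line];
    int start = content.find(before,first);
    if (start<0) return -1;
    int pos = start + before.length();
    int stop = content.find(after,pos);
    // Both delimiters must lie inside the requested line.
    if ((stop<0) || (stop+(int)after.length()>last)) return -1;
    res = content.substr(pos,stop-pos);
    return pos;
}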

int CText::findafter(string &res, const string &prefix, 
    const string &before, const string &after, int pos = 0)
{
    int k1 = content.find(prefix,pos);
    if (k1<0) return -1;
    int k2 = content.find(before,k1+prefix.length());
    if (k2<0) return -1;
    int k3 = content.find(after,k2+before.length());
    if (k3<0) return -1;
    pos = k2 + before.length();
    res = content.substr(pos,k3-pos);
    return pos;
}

int CText::findbetween(string &res, const string &before, 
    const string &after, int pos = 0)
{
    return findafter(res,"",before,after,pos);
}

int CText::geturl(string &res, const string &pattern)
{
    int line = findline(pattern);
    if (line<0) return -1;
    int pos = findinline(res,line,"href=\"","\"");
    return pos;
}

string CText::line(int index)
{
    // Guard against line numbers outside the loaded content.
    if ((index<0) || (index>=(int)separators.size())) return "";
    int start;
    if (index==0) start = 0; else start = separators[index-1]+1;
    int stop = separators[index];
    return content.substr(start,stop-start);
}

vector<string> CText::harvest(const string &ldelimit, const string &udelimit, 
    const string &prefix, const string &before, const string &after)
{
    return selectiveharvest("","",ldelimit,udelimit,prefix,before,after);
}

vector<string> CText::selectiveharvest(const string& starter, 
    const string& stopper, const string &ldelimit, const string &udelimit, 
    const string &prefix, const string &before, const string &after)
{
    vector<string> res;
    // Range to scan: from starter up to the first following stopper.
    int beginning;
    if (starter=="") beginning = 0; else beginning = content.find(starter);
    if (beginning<0) return res;
    int finish;
    if (stopper=="") finish = content.length(); 
    else finish = content.find(stopper,beginning);
    if (finish<0) finish = content.length();
    int start = content.find(ldelimit,beginning);
    if (start<0) return res;
    int stop = content.find(udelimit,start);
    while ((stop>=0) && (start<finish))
    {
        string item;
        int q = findafter(item,prefix,before,after,start);
        // A match that begins beyond this enclosure does not count.
        if (q>stop) item="";
        res.push_back(item);
        start = content.find(ldelimit,stop);
        if (start<0) return res;
        stop = content.find(udelimit,start);
    }
    return res;
}

int CText::occurencies(const string& text)
{
    if (text=="") return 0;
    int count = 0;
    int pos = 0;
    int k = content.find(text);
    while (k>=0)
    {
        count++;
        pos = k+1;
        k = content.find(text,pos);
    }
    return count;
}

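vector<TDataRec> CText::parsetable(int pos)
{
    vector<TDataRec> res;
    // Upper-case copy used for case-insensitive tag matching; the cell
    // text itself is still taken from the original content.
    string temp = content;
    transform(temp.begin(),temp.end(),temp.begin(),
        [](unsigned char c) { return (char)toupper(c); });
    int p1 = temp.find("<TABLE",pos+1);
    if (p1<0) return res;
    int p2 = temp.find("</TABLE",p1+1);
    if (p2<0) return res;
    int p3 = temp.find("<TR",p1+1);
    while ((p3>=0) && (p3<p2))
    {
        int p4 = temp.find("</TR",p3+1);
        // An unterminated row would restart the search from offset 0.
        if (p4<0) break;
        TDataRec row;
        int p5 = temp.find("<T",p3+1);
        while ((p5>=0) && (p5<p4))
        {
            int p6 = temp.find(">",p5+1)+1;
            // Search from p6 (not p6+1) so that an empty cell is
            // closed by its own end tag.
            int p7 = temp.find("</",p6);
            if (p7<0) break;
            string cell = htmltostring(content.substr(p6,p7-p6));
            row.push_back(cell);
            p5 = temp.find("<T",p7+1);
        }
        res.push_back(row);
        p3 = temp.find("<TR",p4+1);
    }
    return res;
}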

Let me know your experience!
