Framework to Crawl and Acquire Open Data

Web crawling is used to acquire open data. It is the process of downloading web pages and extracting useful information from them. The framework contains different crawlers (spiders) that crawl different sources, such as websites, blogs, and social networking sites like Twitter, and acquire information about emergency services such as hospitals, police stations, and fire stations.

Different crawlers are used so that the maximum amount of data is gathered. Each crawler traverses pages using breadth-first search (BFS). BFS is used because it visits the pages closest to the seed URLs first, which tends to reach the important pages early and improves the overall download rate.
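
As a sketch, such a breadth-first crawler can be written with an ordinary FIFO queue. The snippet below is a minimal Python illustration, assuming the third-party requests and lxml libraries; the link-extraction XPath, page limit, and error handling are illustrative choices, not part of the framework itself.

from collections import deque
from urllib.parse import urljoin

import requests
from lxml import html

def bfs_crawl(seed_url, max_pages=50):
    # A FIFO queue gives breadth-first order: pages nearest the seed come first
    queue = deque([seed_url])
    visited = {seed_url}
    crawled = []
    while queue and len(crawled) < max_pages:
        url = queue.popleft()
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue                      # skip unreachable pages
        crawled.append(url)
        tree = html.fromstring(resp.content)
        for href in tree.xpath('//a/@href'):
            link = urljoin(url, href)
            if link not in visited:       # avoid revisiting pages
                visited.add(link)
                queue.append(link)
    return crawled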


The crawlers also make use of rules based on XPath and regular expressions. XPath is a language for navigating the elements of an XML document; here it is used to extract specific content and tags from a website. XPath is preferred because navigating the HTML tree step by step, as with libraries such as BeautifulSoup, is a more tedious task.

For example, //tr//text() extracts the text of all <tr> tags on a given page.
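
As a brief illustration of that rule, the following Python snippet applies //tr//text() with the lxml library; the HTML fragment is an invented stand-in for a crawled page, with placeholder names and numbers.

from lxml import html

# An invented fragment standing in for a crawled hospital-listing page
page = html.fromstring(
    '<table>'
    '<tr><td>Hospital A</td><td>0471-0000000</td></tr>'
    '<tr><td>Hospital B</td><td>0484-0000000</td></tr>'
    '</table>')

# //tr//text() selects every text node under every <tr> element
print(page.xpath('//tr//text()'))
# ['Hospital A', '0471-0000000', 'Hospital B', '0484-0000000']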

Regular expressions are used to select the set of pages whose URLs follow a specific pattern.

For example, (keralapolice\.org/newsite/ps_)\w+(\.html) matches all the HTML pages whose URLs follow that pattern (the dots are escaped so they match literally).
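
A sketch of applying such a pattern with Python's re module follows; the candidate URLs are invented for illustration.

import re

# Escaped form of the pattern from the text; dots match literally
pattern = re.compile(r'(keralapolice\.org/newsite/ps_)\w+(\.html)')

# Invented candidate URLs, as might be collected while crawling
candidates = [
    'http://keralapolice.org/newsite/ps_cantonment.html',
    'http://keralapolice.org/newsite/contact.html',
]
print([u for u in candidates if pattern.search(u)])
# only the ps_* page matches the rule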


The crawling algorithm below shows how data is crawled irrespective of the crawling pattern:


Let Q be a queue containing the set of seed URLs
Until Q is empty:
    Pop a URL from Q
    Find all URLs on the fetched page that match a specific rule
    Store the found URLs in an array FURL
    Until FURL is empty:
        Pop a URL from FURL
        Search the page for content matching a particular XPath or regular expression
        Extract the text in the matching tag
        Store the text in a data store
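
A minimal Python sketch of this algorithm is given below, again assuming the requests and lxml libraries; the seed list, URL rule, content XPath, and data store in the example call are placeholders chosen for illustration.

import re
from urllib.parse import urljoin

import requests
from lxml import html

def fetch(url):
    # Download a page and parse it into an lxml tree; skip on network errors
    try:
        resp = requests.get(url, timeout=10)
        return html.fromstring(resp.content, base_url=url)
    except requests.RequestException:
        return None

def crawl(seeds, url_rule, content_xpath, data_store):
    Q = list(seeds)                          # Q contains the set of seed URLs
    while Q:                                 # Until Q is empty
        page = fetch(Q.pop(0))               # Pop a URL from Q
        if page is None:
            continue
        # Find all URLs matching the rule; store them in FURL
        furl = [urljoin(page.base_url, href)
                for href in page.xpath('//a/@href')
                if url_rule.search(href)]
        while furl:                          # Until FURL is empty
            target = fetch(furl.pop(0))      # Pop a URL from FURL
            if target is None:
                continue
            # Search for content with the given XPath and extract its text
            for text in target.xpath(content_xpath):
                data_store.append(text.strip())  # Store the text in a data store

# Illustrative call, reusing the police-station pattern from the text;
# the seed URL is a placeholder, not a verified page.
store = []
crawl(['http://keralapolice.org/newsite/index.html'],
      re.compile(r'ps_\w+\.html'),
      '//tr//text()',
      store)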


The table below lists the crawled data for hospitals.





