Framework to Crawl and Acquire Open Data

Web crawling is used to acquire open data. It is the process of downloading web pages and extracting useful information from them. The framework contains different crawlers (spiders) that crawl different sources, such as websites, blogs, and social networking sites like Twitter, and acquire information about emergency services such as hospitals, police stations, and fire stations.

Different crawlers are used so that the maximum amount of data is gathered. Each crawler traverses pages using breadth-first search (BFS). BFS is used because it visits the pages closest to the seed URLs first, which tends to reach the important pages early and improves the overall download rate.
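
As a sketch, such a breadth-first crawler can be written with an ordinary FIFO queue. The snippet below is a minimal Python illustration, assuming the third-party requests and lxml libraries; the link-extraction XPath, page limit, and error handling are illustrative choices, not part of the framework itself.

from collections import deque
from urllib.parse import urljoin

import requests
from lxml import html

def bfs_crawl(seed_url, max_pages=50):
    # A FIFO queue gives breadth-first order: pages nearest the seed come first
    queue = deque([seed_url])
    visited = {seed_url}
    crawled = []
    while queue and len(crawled) < max_pages:
        url = queue.popleft()
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue                      # skip unreachable pages
        crawled.append(url)
        tree = html.fromstring(resp.content)
        for href in tree.xpath('//a/@href'):
            link = urljoin(url, href)
            if link not in visited:       # avoid revisiting pages
                visited.add(link)
                queue.append(link)
    return crawled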


The crawlers also make use of rules based on XPath and regular expressions. XPath is a language for navigating the elements of an XML document; here it is used to extract specific content and tags from a website. XPath is preferred because navigating the HTML tree step by step, as with libraries such as BeautifulSoup, is a more tedious task.

For example, //tr//text() extracts the text of all <tr> tags on a given page.
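
As a brief illustration of that rule, the following Python snippet applies //tr//text() with the lxml library; the HTML fragment is an invented stand-in for a crawled page, with placeholder names and numbers.

from lxml import html

# An invented fragment standing in for a crawled hospital-listing page
page = html.fromstring(
    '<table>'
    '<tr><td>Hospital A</td><td>0471-0000000</td></tr>'
    '<tr><td>Hospital B</td><td>0484-0000000</td></tr>'
    '</table>')

# //tr//text() selects every text node under every <tr> element
print(page.xpath('//tr//text()'))
# ['Hospital A', '0471-0000000', 'Hospital B', '0484-0000000']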

Regular expressions are used to select the set of pages whose URLs follow a specific pattern.

For example, (keralapolice\.org/newsite/ps_)\w+(\.html) matches all the HTML pages whose URLs follow that pattern (the dots are escaped so they match literally).
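
A sketch of applying such a pattern with Python's re module follows; the candidate URLs are invented for illustration.

import re

# Escaped form of the pattern from the text; dots match literally
pattern = re.compile(r'(keralapolice\.org/newsite/ps_)\w+(\.html)')

# Invented candidate URLs, as might be collected while crawling
candidates = [
    'http://keralapolice.org/newsite/ps_cantonment.html',
    'http://keralapolice.org/newsite/contact.html',
]
print([u for u in candidates if pattern.search(u)])
# only the ps_* page matches the rule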


The crawling algorithm below shows how data is crawled irrespective of the crawling pattern:


Let Q be a queue containing the set of seed URLs
Until Q is empty:
    Pop a URL from Q
    Find all URLs on the fetched page that match a specific rule
    Store the found URLs in an array FURL
    Until FURL is empty:
        Pop a URL from FURL
        Search the page for content matching a particular XPath or regular expression
        Extract the text in the matching tag
        Store the text in a data store
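
A minimal Python sketch of this algorithm is given below, again assuming the requests and lxml libraries; the seed list, URL rule, content XPath, and data store in the example call are placeholders chosen for illustration.

import re
from urllib.parse import urljoin

import requests
from lxml import html

def fetch(url):
    # Download a page and parse it into an lxml tree; skip on network errors
    try:
        resp = requests.get(url, timeout=10)
        return html.fromstring(resp.content, base_url=url)
    except requests.RequestException:
        return None

def crawl(seeds, url_rule, content_xpath, data_store):
    Q = list(seeds)                          # Q contains the set of seed URLs
    while Q:                                 # Until Q is empty
        page = fetch(Q.pop(0))               # Pop a URL from Q
        if page is None:
            continue
        # Find all URLs matching the rule; store them in FURL
        furl = [urljoin(page.base_url, href)
                for href in page.xpath('//a/@href')
                if url_rule.search(href)]
        while furl:                          # Until FURL is empty
            target = fetch(furl.pop(0))      # Pop a URL from FURL
            if target is None:
                continue
            # Search for content with the given XPath and extract its text
            for text in target.xpath(content_xpath):
                data_store.append(text.strip())  # Store the text in a data store

# Illustrative call, reusing the police-station pattern from the text;
# the seed URL is a placeholder, not a verified page.
store = []
crawl(['http://keralapolice.org/newsite/index.html'],
      re.compile(r'ps_\w+\.html'),
      '//tr//text()',
      store)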


The table below lists the crawled data for hospitals.





