On Wednesday, December 17th at Telecom ParisTech, at 2 pm in the Amphi Grenat, Muhammad Faheem will defend his thesis on Intelligent Content Acquisition in Web Archiving. Here is the abstract:
Web sites are dynamic in nature, with content and structure changing over time; many pages on the Web are produced by content management systems (CMSs). Tools currently used by Web archivists to preserve the content of the Web blindly crawl and store Web pages, disregarding the CMS the site is based on and whatever structured content is contained in Web pages. We present in this thesis intelligent systems that crawl the Web in an intelligent manner.
The application-aware helper (AAH) fits into an archiving crawl processing chain to perform intelligent and adaptive crawling of Web applications. Because the AAH is aware of the Web application currently being crawled, it is able to refine the list of URLs to process and to extend the archive with semantic information about extracted content. The AAH introduces a semi-automatic crawling approach that relies on hand-written descriptions of known Web sites.
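To make the idea of application-aware URL refinement concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the AAH implementation: the `AppDescription` structure, the `example-forum` entry, its patterns, and the function names are all hypothetical, and the thesis does not specify the format of its hand-written descriptions here.

```python
import re
from dataclasses import dataclass


@dataclass
class AppDescription:
    """Hypothetical hand-written description of one known Web application."""
    name: str
    detect: re.Pattern   # pattern that identifies the application in page HTML
    follow: list         # compiled URL patterns worth crawling
    skip: list           # compiled URL patterns that lead to redundant views


# Assumed example entry; a real knowledge base would hold one per application.
DESCRIPTIONS = [
    AppDescription(
        name="example-forum",
        detect=re.compile(r'<meta name="generator" content="ExampleForum'),
        follow=[re.compile(r"/thread/\d+$"), re.compile(r"/forum/\d+$")],
        skip=[re.compile(r"[?&](sort|print)=")],
    ),
]


def detect_application(html):
    """Return the first description whose detection pattern matches, else None."""
    for description in DESCRIPTIONS:
        if description.detect.search(html):
            return description
    return None


def refine_urls(html, urls):
    """Keep only URLs the detected application's description marks as useful.

    If no known application is detected, fall back to the unfiltered list,
    which corresponds to an ordinary, application-unaware crawl.
    """
    app = detect_application(html)
    if app is None:
        return list(urls)
    kept = []
    for url in urls:
        if any(p.search(url) for p in app.skip):
            continue  # redundant view of content reachable elsewhere
        if any(p.search(url) for p in app.follow):
            kept.append(url)
    return kept
```

With a page that matches the hypothetical `example-forum` description, a call such as `refine_urls(html, candidate_urls)` would keep thread and forum listing URLs and drop print views and unrelated links; on an unrecognized page it returns the candidates unchanged.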
We also propose a fully automatic system that does not require any human intervention to crawl Web pages. We introduce ACEBot (Adaptive Crawler Bot for data Extraction), a structure-driven crawler that utilizes the inner structure of Web pages and guides the crawling process based on the importance of their content.
A large part of the information on the Web is hidden behind Web forms (known as the deep Web, invisible Web, or hidden Web). The systems described above do not crawl hidden Web pages. To address this problem, we propose OWET (Open Web Extraction Toolkit), a free, publicly available data extraction framework.