MBR: Data Crawling
The growing volume of structured data on the Internet is becoming an important source for research in economics and management. However, much of this data cannot be downloaded directly and is accessible only through websites. Extracting content from websites by hand is burdensome and quickly becomes infeasible as the underlying data grows. One solution is to extract the data systematically with automated programs written for this purpose, so-called crawlers. The goal of this course is to provide a good understanding of what crawling can do, while leaving enough time to work on your own crawling project.
The following topics will be covered in the course:
• When is crawling useful?
Before starting a crawling project, it is important to assess carefully whether crawling is really the appropriate tool for obtaining the desired data. We discuss topics such as data size, the structure of the data, and technical countermeasures taken by website owners.
• How to determine the observations to download?
One key challenge for many crawling projects is that there is no readily available list of all the subpages of a domain that should be included in the crawl. We consider different ways around this problem, including the use of sitemaps, APIs, consecutive IDs, or snowball approaches.
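As an illustration of two of these approaches, the following minimal sketch builds a list of candidate URLs either from a sitemap or by counting through consecutive IDs. The function names, the `example.org` addresses, and the URL template are assumptions made for this example, and the sitemap is assumed to follow the standard sitemaps.org XML format.

```python
import xml.etree.ElementTree as ET

# Namespace used by sitemaps that follow the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text):
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


def urls_from_ids(template, first_id, last_id):
    """Build candidate URLs by counting through a range of numeric IDs."""
    return [template.format(id=i) for i in range(first_id, last_id + 1)]


# Hypothetical sitemap content; a real one would be downloaded from the site.
EXAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/item/1</loc></url>
  <url><loc>https://example.org/item/2</loc></url>
</urlset>"""

sitemap_urls = urls_from_sitemap(EXAMPLE_SITEMAP)
id_urls = urls_from_ids("https://example.org/item/{id}", 1, 3)
```

Not every ID in a range necessarily corresponds to an existing page, so a crawler working from such a list should expect some requests to fail.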
• How to do the actual crawling?
A first important consideration when setting up a crawling process is whether it should run as a one-off task or as a repeated process in which the same websites are visited regularly to build a panel. We then look at how crawling actually works by programming simple first crawlers in Python, and also address more advanced topics such as running multiple instances of a crawler in parallel.
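To give a rough idea of what such a first crawler might look like, here is a minimal sketch that uses only the Python standard library. It is not the course's own code: the function names, the user-agent string, and the default delay are assumptions chosen for illustration. Setting `workers` above 1 fetches several pages at the same time through a thread pool.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch(url, timeout=10):
    """Download one page and return its body as text."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "course-crawler-example"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def crawl(urls, fetch_page=fetch, delay=1.0, workers=1):
    """Fetch every URL; return a dict mapping URL to page text (None on failure)."""

    def fetch_one(url):
        try:
            page = fetch_page(url)
        except OSError:
            # Network and HTTP errors raised by urllib are subclasses of OSError.
            page = None
        time.sleep(delay)  # pause after each request to limit load on the server
        return url, page

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(fetch_one, urls))
```

A call such as `crawl(list_of_urls, delay=2.0, workers=4)` would download the pages with four threads, each pausing two seconds after every request. Passing a different `fetch_page` function makes it possible to try the logic without any network access.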
• How to extract content?
The raw HTML code downloaded from the web server usually cannot be used directly. Before the acquired data can be used, the desired information first has to be extracted (“parsed”) from the raw data. Depending on the complexity of the project, parsing is done either on the fly while crawling or as a separate process. We introduce the concept of regular expressions and a parsing framework and show how they can be used to identify structured information within a website.
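A small example may help to show what parsing with regular expressions looks like in practice. The HTML snippet, the class names `product`, `name`, and `price`, and the function name are all invented for this sketch. A pattern like this only works as long as the page layout stays exactly as assumed.

```python
import re

# One match per product entry; named groups capture the fields of interest.
PRODUCT_PATTERN = re.compile(
    r'<li class="product">\s*'
    r'<span class="name">(?P<name>.*?)</span>\s*'
    r'<span class="price">(?P<price>\d+(?:\.\d+)?)</span>\s*'
    r'</li>',
    re.DOTALL,
)


def parse_products(html):
    """Return one dict per product entry found in the HTML text."""
    return [
        {"name": match.group("name").strip(), "price": float(match.group("price"))}
        for match in PRODUCT_PATTERN.finditer(html)
    ]


# Hypothetical page fragment used only to demonstrate the pattern.
EXAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Widget</span> <span class="price">9.99</span></li>
  <li class="product">
    <span class="name">Gadget</span>
    <span class="price">24.50</span>
  </li>
</ul>
"""

products = parse_products(EXAMPLE_HTML)
```

Regular expressions are fragile when the markup varies between pages, which is one reason a dedicated parsing framework can be the better choice for more complex sites.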
• How to process the acquired data?
Depending on the time dimension of the crawling process and the size of the crawl, managing the data and converting it to a format that statistical packages can read can be challenging. For bigger projects it might be useful to store data directly in a relational database, while for smaller ones it may be sufficient to save flat text files that can be imported into statistical packages. We also discuss how crawled data can be enriched with data from proprietary databases and go through the process of preparing the data to the point where regressions can be run.
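The following sketch illustrates both storage options with the Python standard library: observations from repeated crawls are written to an SQLite database and then exported as CSV for a statistical package. The table layout, the column names, and the sample rows are assumptions made for this example.

```python
import csv
import io
import sqlite3


def store_observations(connection, rows):
    """Insert (url, crawl_date, price) rows; a repeated row replaces the old one."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS observations ("
        "url TEXT, crawl_date TEXT, price REAL, "
        "PRIMARY KEY (url, crawl_date))"
    )
    connection.executemany(
        "INSERT OR REPLACE INTO observations (url, crawl_date, price) "
        "VALUES (?, ?, ?)",
        rows,
    )
    connection.commit()


def export_csv(connection, out_file):
    """Write the whole panel, with a header row, to an open text file."""
    cursor = connection.execute(
        "SELECT url, crawl_date, price FROM observations "
        "ORDER BY url, crawl_date"
    )
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow([column[0] for column in cursor.description])
    writer.writerows(cursor)


# In-memory database and made-up rows, purely for demonstration.
connection = sqlite3.connect(":memory:")
store_observations(connection, [
    ("https://example.org/item/1", "2019-04-15", 9.99),
    ("https://example.org/item/1", "2019-04-16", 10.49),
])
buffer = io.StringIO()
export_csv(connection, buffer)
csv_text = buffer.getvalue()
```

Using the URL and the crawl date together as the primary key keeps one row per page and day, which matches the panel structure that results from a repeated crawl.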
- The course is held in English
- The number of participants is limited to 20
|Dates & Location||Monday, 15.04.2019 - Thursday, 18.04.2019, 09:00-17:00, Kaulbachstr. 45, 2nd floor, Room 202 (Seminar Room)|
|Credits||2 SWS towards module B/I or B/II|
|Examination||Successful participation will be determined by the course project.|