Web Crawler



  • Introduction

    Web crawling is a common technique for efficiently collecting information from across the web. As an introduction to web crawling, this project uses Scrapy, a free and open-source web crawling framework written in Python[1]. Originally designed for web scraping, it can also be used to extract data through APIs or as a general-purpose web crawler. Even though Scrapy is a comprehensive web-crawling framework, real applications present additional challenges, e.g., dynamically generated JavaScript content or your IP address being blocked.

    The project contains three parts, each an extension of the previous one. The end goal is a Scrapy project that can crawl tens of thousands of apps from the Xiaomi AppStore, or any other app store with which you are familiar.

    Project Description

    Instructions

    First stage: Create a Scrapy project that crawls the content of the Xiaomi AppStore homepage, or the homepage of any other app store.
    Second stage: Save the crawled content in MongoDB[2]. Install the Python MongoDB driver and modify pipelines.py to insert the crawled data into MongoDB.
    Third stage: Crawl more content by following "next page" links. So far you have likely only crawled the content of the homepage. If the next-page link is generated by JavaScript, use Splash[3] and ScrapyJS[4] to re-render the page, turning its dynamic parts into static content.

    Setup Requirements

    python2.7
    Scrapy 1.0+
    Splash
    ScrapyJS
    MongoDB
    Suggested Prerequisite Knowledge

    Basic Python

    Submission Instructions

    Please upload your final code to your GitHub account.
    Please record a video explaining the design choices you made, including: the structure of your code, how you efficiently collected and stored the gathered data, and how you handled gathering data from non-static sources. Please keep the video under 5 minutes.

    References

    [1] Scrapy: http://scrapy.org
    [2] MongoDB: https://www.mongodb.org/
    [3] Splash & scrapy-splash: https://github.com/scrapinghub/scrapy-splash
    [4] Handling JavaScript in Scrapy with Splash: https://blog.scrapinghub.com/2015/03/02/handling-javascript-in-scrapy-with-splash/

