This is a step-by-step tutorial on how to scrape web data into an SQLite database, using the Scrapy module for Python 2.7. This tutorial is for Windows, but you can easily adapt it for Linux.
On another page I explained the rationale for doing this. You will have your own, but for the sake of the tutorial I need to show you a real-life example, step by step.
The site we will work on here is the TU mobile website from Towson University, Maryland. This is an average four-year public college, with around 22,000 students, 62% of them female.
The TU mobile site reports enrollment and class availability in real time, and its peak usage is during enrollment periods, at the end of each semester.
What I propose you do is this:
You need at least these tools:
It’s pretty straightforward, as it’s a portable package. Just install it somewhere and make sure to uncheck “integrate into system path“. You may want to install another version of Python someday, and having this one on the system path would cause conflicts.
You’re going to work with the command line, and the PATH variable needs to be set up with the Python environment. Since you unchecked the box that would have integrated it during install, we need a batch file to set that up and open a Windows command prompt for you.
Open the Python folder and create an MS-DOS batch file named Python27.cmd with this code:
set PATH=%~dp0;%~dp0Scripts;%PATH%
start cmd.exe
Launch it… and voila:
Python 2.7 and its scripts (pip and any module you will install) are now in the environment of this command line.
This sub-folder contains pip, the Python module installer, and the compiled modules you will install. It needs to be in your PATH environment.
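Once the prompt is open, you can sanity-check that the portable Python is the one being picked up (the version number and paths will differ on your machine):
python --version
where python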
We can now install the scrapy module and its dependencies with pip:
pip install pypiwin32 scrapy arrow
Calling pip directly may fail here with a launcher error. I suspect this is because pip is compiled during installation and would need to be recompiled, but I didn’t bother to find out. Anyway, just adapt the command a little bit and you should be set:
python Scripts\pip.exe install pypiwin32 scrapy arrow
Keep in mind that any time you call a compiled Python module that throws this error message, the fallback command will be the same:
python Scripts\command.exe [parameters]
Now it’s time to create your scraper project. Let’s call it tuScraper:
scrapy startproject tuScraper
This will create a sub-folder tuScraper with the following structure:
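The layout is the standard one Scrapy generates; the exact files vary a little between Scrapy versions, but it should look roughly like this:
tuScraper/
    scrapy.cfg            # deploy configuration file
    tuScraper/            # the project's Python package
        __init__.py
        items.py          # item definitions
        pipelines.py      # item pipelines (a natural home for database code later)
        settings.py       # project settings
        spiders/          # your spiders go here
            __init__.py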
Spiders are the actual Python code that scrapes the data from the web. Each project can have multiple spiders for multiple uses, depending on the type of pages and how the data is formatted.
To create a spider:
C:\Python27> cd tuScraper
C:\Python27\tuScraper> scrapy genspider --template=basic tuscraper mytumobile.towson.edu
Now we have a very basic tuscraper.py spider under the project’s spiders folder:
# -*- coding: utf-8 -*-
import scrapy


class TuscraperSpider(scrapy.Spider):
    name = 'tuscraper'
    allowed_domains = ['mytumobile.towson.edu']
    start_urls = ['http://mytumobile.towson.edu/']

    def parse(self, response):
        pass
It doesn’t do a thing yet; it’s an empty shell. We will modify this spider to process the URLs we need to scrape into the database. But before playing with Scrapy’s superpowers, we need to analyze the website to understand how the data is organized.
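To give you an idea of where this is heading, here is a minimal sketch of what the filled-in spider could look like. Everything in it is a placeholder: the CSS selectors, the parse_subject callback, and the field names all depend on how the TU mobile pages are actually built, which is exactly what we analyze next.
# -*- coding: utf-8 -*-
# Hypothetical sketch only: selectors, callbacks and field names are placeholders.
import scrapy


class TuscraperSpider(scrapy.Spider):
    name = 'tuscraper'
    allowed_domains = ['mytumobile.towson.edu']
    start_urls = ['http://mytumobile.towson.edu/']

    def parse(self, response):
        # Follow every link found on the start page (placeholder selector).
        for href in response.css('a::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_subject)

    def parse_subject(self, response):
        # Yield one dict per row of class data (placeholder selectors and keys).
        for row in response.css('tr.class-row'):
            yield {
                'course': row.css('td.course::text').extract_first(),
                'seats': row.css('td.seats::text').extract_first(),
            }
Scrapy collects whatever the spider yields as items, and an item pipeline is the usual place to push those items into SQLite, which is where this tutorial is going.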