IT Cooking

IT Stuff and Silly Shell

python 2.7 scrapy 1.5 PyPI 1.5.0 SQLite 3.x

This is a step-by-step tutorial on how to scrape web data into an SQLite database, using the Scrapy module for Python 2.7. This tutorial is for Windows, but you can easily adapt it for Linux.

Another project of mine is to port it to a Linux-based Thecus NAS or, even better, to a DD-WRT router. More updates coming soon.

Introduction

On another page I explained my rationale for doing this. You will have your own, but for the sake of the tutorial I need to walk you through a real-life example, step by step.

The site we will work on here is the TU mobile website from Towson University, Maryland. This is a mid-tier four-year public university with about 22,000 students, 62% of them female.

The TU mobile site reports enrollment and class availability in real time, and its peak usage is during enrollment periods, at the end of each semester.

Here is what I propose we do:

  1. install and configure Python 2.7 portable
  2. install and configure Scrapy
  3. analyze the TU mobile website for data to scrape
  4. play with the Scrapy console to check how the data is scraped
  5. initialize the database
  6. connect Python to the database
  7. test and build some insert queries to the SQLite database
  8. analyze the full script with comments
  9. wrap up with a batch script to automate the scraping experience

 

Tools needed

You need at least these tools:

 

Installation of Python 2.7 + Scrapy for Windows

Install Python 2.7

It’s pretty straightforward, as it’s portable. Just install the package somewhere and make sure to uncheck “integrate into system path”. You may want to install another version of Python someday, and having this one in the system path would cause conflicts.

Create a batch for initializing the environment

You’re gonna work with the command line, so the PATH variable needs to include the Python folders. Since you unchecked the box that would have done this during install, we need a batch file that sets it up and opens a Windows command line for you.

Open the Python folder and create a Python27.cmd batch file with this code:

set PATH=%~dp0;%~dp0Scripts;%PATH%
start cmd.exe

Launch it… and voila:

Python 2.7 and its scripts (pip and any module you will install) are now in the environment of this command line.
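If you want to double-check that the batch did its job, here is a minimal sketch you could save as check_env.py (a hypothetical file name, not part of the original setup) and run with `python check_env.py` from the new console. It assumes the Scripts folder sits right next to python.exe, as it does in a standard Python 2.7 for Windows install.

```python
# check_env.py - sanity check for the console opened by Python27.cmd
from __future__ import print_function  # keeps print() consistent on Python 2.7

import os
import sys


def folder_in_path(folder, path_value):
    """Return True if `folder` is one of the entries of a PATH-style string."""
    wanted = os.path.normcase(os.path.normpath(folder))
    entries = [e for e in path_value.split(os.pathsep) if e]
    return any(os.path.normcase(os.path.normpath(e)) == wanted for e in entries)


if __name__ == "__main__":
    # Assumption: Scripts lives next to the running interpreter.
    python_dir = os.path.dirname(os.path.abspath(sys.executable))
    scripts_dir = os.path.join(python_dir, "Scripts")
    current_path = os.environ.get("PATH", "")
    print("Python version:", sys.version.split()[0])
    for folder in (python_dir, scripts_dir):
        status = "OK" if folder_in_path(folder, current_path) else "MISSING"
        print(status, folder)
```

Both folders should be reported as OK. If one shows up as MISSING, the console was probably not started from Python27.cmd.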

Scripts

This sub-folder contains pip, the Python module installer, and the compiled scripts of the modules you will install. It needs to be in your PATH environment variable.

Install Scrapy module

We can now install the scrapy module and its dependencies with pip:

pip install pypiwin32 scrapy arrow
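Once pip is done, a quick way to confirm that the modules really landed in this Python is to try importing them. The snippet below is only a sketch: the helper name `installed_version` is mine, and `win32api` is used as the probe for pypiwin32 because it is one of the modules that package provides.

```python
from __future__ import print_function  # keeps print() consistent on Python 2.7

import importlib


def installed_version(module_name):
    """Return the module's version, "unknown" if it has none, or None if the import fails."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return str(getattr(module, "__version__", "unknown"))


if __name__ == "__main__":
    # win32api is one of the modules provided by pypiwin32
    for name in ("scrapy", "arrow", "win32api"):
        version = installed_version(name)
        print(name, "->", version if version else "NOT INSTALLED")
```

Anything reported as NOT INSTALLED means the pip command above did not complete for that package.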

 

Fatal error in launcher: Unable to create process using ‘”‘
If you moved the Python27 folder after installing it, you may run into the error above when launching pip.

 

I suspect this is because pip.exe is generated during installation and needs to be regenerated after a move, but I didn’t bother to find out how. Anyway, just adapt the command a little and you should be set:

python Scripts\pip.exe install pypiwin32 scrapy arrow

Keep in mind that whenever a compiled Python command from the Scripts folder throws this error message, the fallback is the same:

python Scripts\command.exe [parameters]

 

Initialize a Scrapy project

Now it is time to create your scraper project. Let’s call it tuScraper:

scrapy startproject tuScraper

This will create a sub-folder tuScraper with the following structure:

  • \Python27
    • \Scripts compiled binaries and pip.exe are here
    • \tuScraper  place your database.sqlite3 here
      • \tuScraper  these are your config files, pipelines, etc
        • \spiders  this is where your spiders will be created
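To compare what `scrapy startproject` actually generated with the layout above, you can print the folder tree. This is just a small sketch using the standard library; the function name `tree_lines` is my own. You should see scrapy.cfg at the top level, and files such as settings.py, items.py and pipelines.py in the inner tuScraper folder.

```python
from __future__ import print_function  # keeps print() consistent on Python 2.7

import os


def tree_lines(root):
    """Return an indented listing of the folders and files under `root`."""
    root = os.path.abspath(root)
    lines = []
    for current, dirs, files in os.walk(root):
        dirs.sort()  # in-place sort makes os.walk visit sub-folders alphabetically
        relative = os.path.relpath(current, root)
        depth = 0 if relative == os.curdir else relative.count(os.sep) + 1
        lines.append("  " * depth + os.path.basename(current) + os.sep)
        for name in sorted(files):
            lines.append("  " * (depth + 1) + name)
    return lines


if __name__ == "__main__":
    # Run from C:\Python27 after "scrapy startproject tuScraper".
    if os.path.isdir("tuScraper"):
        print("\n".join(tree_lines("tuScraper")))
    else:
        print("tuScraper folder not found in", os.getcwd())
```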

Create a Scrapy Spider

Spiders are the actual Python code that will scrape the data from the web. Each project can have multiple spiders for multiple uses, depending on the type of pages and how the data is formatted.

To create a spider:

C:\Python27> cd tuScraper
C:\Python27\tuScraper> scrapy genspider --template=basic tuscraper mytumobile.towson.edu

 

Now we have a very basic tuscraper.py spider under Python27\tuScraper\tuScraper\spiders:

# -*- coding: utf-8 -*-
import scrapy


class TuscraperSpider(scrapy.Spider):
    name = 'tuscraper'
    allowed_domains = ['mytumobile.towson.edu']
    start_urls = ['http://mytumobile.towson.edu/']

    def parse(self, response):
        pass

It doesn’t do a thing; it’s an empty shell. We will modify this spider to process the URLs we need to scrape into the database. Before playing with Scrapy’s super powers, we need to analyze the website to know how the data is organized.
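One attribute in that empty shell already matters: `allowed_domains` keeps the spider from wandering off to other sites. The sketch below only illustrates the idea with the standard library. It is not Scrapy’s own code, and the function name `is_allowed` is hypothetical, but it follows the same rule: a link is followed when its host is the listed domain or one of its sub-domains.

```python
from __future__ import print_function  # keeps print() consistent on Python 2.7

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2.7


def is_allowed(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or one of its sub-domains."""
    host = (urlparse(url).hostname or "").lower()
    for domain in allowed_domains:
        domain = domain.lower()
        if host == domain or host.endswith("." + domain):
            return True
    return False


if __name__ == "__main__":
    allowed = ["mytumobile.towson.edu"]
    for url in ("http://mytumobile.towson.edu/", "http://www.towson.edu/"):
        print(url, is_allowed(url, allowed))
```

With the value generated by genspider, links to the main www.towson.edu site would be skipped, so keep that in mind if the data you want lives on another host.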

 

