Screen Scraping: A Hands-on Introduction
1 Goals
- Get a working python environment installed on your own computer
- Make python seem less scary
- Understand some of the differences between python and other languages
- Understand what screen scraping is all about
- Learn the tools to scrape web sites (and other structured text) effectively
This may take some time! We can keep meeting for as long as we need: depending on how quickly we go, it could take a number of sessions. My intention is to play it by ear to see what we need to focus on.
1.1 Who am I?
My name is Alex Storer, and I'm part of the Research Technology Consulting team at IQSS. I have a PhD in Computational Neuroscience, and have done a lot of programming and scripting to interact with data.
Our team can help you with your research questions, both with the statistics and the technology. If you want to chat with us, simply e-mail support@help.hmdc.harvard.edu.
1.2 What is this page?
This is a tutorial that I wrote using the org-mode in emacs. It is hosted here:
http://www.people.fas.harvard.edu/~astorer/scraping/scraping.html
You can always find details about our ongoing workshops here:
2 Basic Python
Python is a powerful interpreted language that people often use for scraping. We'll highlight here a few of the most helpful features for understanding Python code and writing scrapers. This is by no means a complete or thorough introduction to Python! It's just enough to get by.
2.1 Installation
Python comes in two modern flavors, version 2 and version 3. There are some important language differences between them. This tutorial was written when almost everyone used version 2, and all of its examples use version 2 syntax. To install it, go here and select the relevant operating system.
2.1.1 IDE
An IDE, or Integrated Development Environment, is used to facilitate programming. A good IDE does things like code highlighting, error checking, one-click-running, and easy integration across multiple files. An example of a crappy IDE is notepad. I like to use emacs. Most people prefer something else.
2.1.2 Wing IDE 101
For this session, I recommend Wing 101. It's a free version of a more fully-featured IDE, but for beginners, it's perfect. If you don't already have an IDE that you're invested in, or you want your intro to python to be as painless as possible, you should install it. It's cross platform.
- Getting Started in Wing
Once you have Wing installed, you might want to use the tutorial to learn how to navigate around in it.
Opening the tutorial in Wing 101.
2.2 Further Python Resources
But wait, I want to spend four months becoming a Python guru!
Dude, you're awesome. Here are some resources that will help you:
- Python Programming Fundamentals Uses WingIDE to teach basic computer science tactics using python
- Python Challenge A fun programming riddle that will increase your chops.
- Learn Python The Hard Way The Hard Way means by actually writing code. Maybe it should be called The Good Way?
2.3 Diving In
In Wing, there is a window open called Python Shell
- If you know R, think of this just like the R command line
- If you've never programmed before, think of this as a graphing calculator
print 2+4
6
2.3.1 Basic Text Handling
- Of course, this graphing calculator can handle text, too!
mystr = "Hello, World!"
print mystr
print len(mystr)

Hello, World!
13
| Python Code | R Code | English Translation |
|---|---|---|
| print 2+4 | print(2+4) | Print the value of 2+4 |
| mystr = "Hello World" | mystr <- "Hello World" | Assign the string "Hello World" to the variable mystr |
| len(mystr) | nchar(mystr) | How "long" is the variable mystr? Note: R can tell you how long it is, but if you want the number of characters, that's what you need to ask for. |
Note to Stata Users:
Assigning a variable is not the same as adding a "column" to your dataset.
2.3.2 Indexing and Slicing
Get the first element of a string.
- Note: Python counts from 0. This is a common convention in most languages constructed by computer scientists.
mystr = "Dogs in outer space"
print mystr[0]

D

Get the last element of a string

mystr = "Dogs in outer space"
print mystr[-1]
print mystr[len(mystr)-1]

e
e

mystr = "Dogs in outer space"
print mystr[1:3]
print mystr[3:]
print mystr[:-3]

og
s in outer space
Dogs in outer sp
2.3.3 Including Other Packages
- By default, python doesn't include every possible "package"
- This is similar to R, but unlike Matlab
- Use the import statement to load a library

import math
print math.sin(math.pi)

1.22464679915e-16
After we import from a package, we have to access sub-elements of that
package using the . operator. Notice also that while the value
1.22464679915e-16 is very nearly 0, the math module doesn't know
that sin(π) = 0. There are smarter modules for doing math in
Python, like scipy and numpy. Some people love using Python for
Math. I think it makes more sense to use R.
- If you want to import something into your namespace, use from math import <myfunction> or from math import *

from math import *
print sin(pi)

1.22464679915e-16
2.3.4 Objects and methods
Python makes extensive use of objects. An object has
- Methods: functions that work only on that object
- Fields: data that only that type of object has
For example, let's imagine a fruit object. A fruit might have a
field called hasPeel, which tells you whether this fruit still has
its peel. It could also have a method called peel, which alters the
state of the fruit.
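Here is a minimal sketch of what that fruit object could look like. The names hasPeel and peel come from the description above; everything else (the class name Fruit, the name argument, starting every fruit with its peel on) is an assumption made just for illustration. The print calls use parentheses so the sketch also runs on version 3.

```python
class Fruit(object):
    """A toy fruit with one field (hasPeel) and one method (peel)."""

    def __init__(self, name):
        self.name = name
        self.hasPeel = True   # field: data stored on this particular object

    def peel(self):
        # method: changes the state of this particular fruit
        self.hasPeel = False


banana = Fruit("banana")
print(banana.hasPeel)   # the fruit starts out with its peel
banana.peel()
print(banana.hasPeel)   # the method changed the field
```

Notice that peel is called with the same . operator we used for math.sin.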
str = "THE World is A BIG and BEAUTIFUL place. "
print str.upper()
name = "Alex Storer"
print name.swapcase()

THE WORLD IS A BIG AND BEAUTIFUL PLACE.
aLEX sTORER
Here we defined two strings, str and name, and used these to
invoke string methods which affect the case of the string.
- You can write your own objects and methods
- Objects can be sub-classes of other objects
  - e.g., a psychologist is a type of researcher, who does everything a researcher does but also some other things only a psychologist does.
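The psychologist example can be sketched in a few lines. The class names follow the bullet above, but the methods publish and run_experiment are made up for this illustration.

```python
class Researcher(object):
    def __init__(self, name):
        self.name = name

    def publish(self):
        return self.name + " publishes a paper"


class Psychologist(Researcher):
    # Inherits __init__ and publish from Researcher, and adds one more method
    def run_experiment(self):
        return self.name + " runs an experiment"


p = Psychologist("Ann")
print(p.publish())                 # something every Researcher can do
print(p.run_experiment())          # something only a Psychologist can do
print(isinstance(p, Researcher))   # a Psychologist is also a Researcher
```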
2.3.5 Defining Functions
You can write your own functions, pieces of code that can be used to
take specific inputs and give outputs. You can create a function by
using the def command.
def square(x):
    return x*x
print square(9)
81
Pay close attention to the whitespace that is used in Python!
Unlike other languages, it is not ignored. Everything with the same
indentation is in the same level. Above, the statement return x*x
is part of the square function, but the following line is outside of
the function definition.
2.3.6 Logical Flow

The xkcd guide to writing good code
You can think about this logical process as being in pseudocode.
IF do things right ---> code well
OTHERWISE ---> do things fast

A lot of programming is figuring out how to fit things into this sort
of if/else structure. Let's look at an example in Python.
- The method find returns the index of the first location of a string match

mystr = "This is one cool looking string!"
if mystr.find("string")>len(mystr)/2:
    print "The word 'string' is in the second half"
else:
    print "The word 'string' is not in the second half"

The word 'string' is in the second half
What happens if the word "string" is not there at all?
- The method find returns -1 if the string isn't found

mystr = "I don't know about you, but I only use velcro."
print mystr.find("string")
if mystr.find("string")>len(mystr)/2:
    print "The word 'string' is in the second half"
elif mystr.find("string")>=0:
    print "The word 'string' is not in the second half"
else:
    print "The word 'string' isn't there!"

-1
The word 'string' isn't there!
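- Important Note: In Python, most everything evaluates to True. Exceptions include 0 and None, as well as empty strings and empty containers such as "" and []. This means that you can say things like if (result) where the result may be a computation, a string search, or anything like that. As long as it evaluates to True, it will work! Be careful with find, though: it returns -1 when nothing is found, and -1 evaluates to True.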
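To make the True/False rules concrete, here is a small sketch. The variable names are made up for the example, and the print calls use parentheses so it runs on version 2 and version 3 alike.

```python
# Values that count as False in an if statement
for value in [0, None, "", [], {}]:
    assert not value

# Values that count as True, including -1 and the string "0"
for value in [1, -1, "0", [0]]:
    assert value

mystr = "I only use velcro."
position = mystr.find("string")   # -1, because "string" is absent

# Misleading: -1 is truthy, so this says "found" even though nothing was found
found_naive = bool(position)

# Safer: compare against -1 explicitly, or use the in operator
found_checked = position != -1
found_in = "string" in mystr

print(found_naive)
print(found_checked)
print(found_in)
```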
2.3.7 Review
- if, elif and else can be used to control the flow of a program
- strings are a type of object, and have a number of methods that come with them, including find, upper and swapcase
  - methods are called using mystring.method()
  - The list of methods for strings can be found in the Python documentation
- def can be used to define a function
  - The return statement determines what the function returns
2.4 For Loops
The for loop is a major component of how python is used. You can iterate over lots of different things, and python is smart enough to know how to do it.
- Note: the following is what's called pseudocode - something that looks like code, but isn't going to run. It's a helpful way to clarify the steps that you need to take to get things to work.
for (item in container):
    process item
    print item
print "done processing items!"
Notice the use of the <TAB> (or spacing) - that's how python knows whether we're inside the loop or not!
2.4.1 Example
str = "Daddy ran to help Ann. Up and down went the seesaw."
for word in str.split():
    print word

Daddy
ran
to
help
Ann.
Up
and
down
went
the
seesaw.
Notice the use of str.split(): this is an example of calling a
method of a string object. It returns a list of words after
splitting the string on whitespace.
2.5 Lists
- A list is a data type that can hold anything.
- Lists are iterable (you can pass them to a for loop)
- You can .append, .extend, and otherwise manipulate lists. Python Documentation

mylist = ['dogs',1,4,"fishes",["hearts","clovers"],list]
for element in mylist:
    print element
mylist.reverse()
print mylist

dogs
1
4
fishes
['hearts', 'clovers']
<type 'list'>
[<type 'list'>, ['hearts', 'clovers'], 'fishes', 4, 1, 'dogs']
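The difference between .append and .extend trips up a lot of people, so here is a quick sketch with made-up values: append adds its argument as a single element, while extend adds each element of its argument.

```python
mylist = ["dogs", 1]

mylist.append(["hearts", "clovers"])   # adds ONE element, which is itself a list
print(len(mylist))

mylist.extend(["fishes", 4])           # adds each element of the argument
print(len(mylist))

print(mylist)
```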
2.6 Exercise
- Write a function that takes in a string, and outputs the square of its length.
- Write a function that returns the number of capitalized letters in a string. Hint: try using lower and the == operator
- Write a function that returns everything in a string up to "dog", and returns "not found" if the string is not present.
2.6.1 Exercise Solutions
- Exercise 1:
Write a function that takes in a string, and outputs the square of its length. Notice that a function can call another function that you wrote.

def square(x):
    return x*x

def sqlen(x):
    return square(len(x))

print sqlen("Feet")

16
- Exercise 2
Write a function that returns the number of capitalized letters in a string.

def numcaps(x):
    lowerstr = x.lower()
    ncaps = 0
    for i in range(len(x)):
        if lowerstr[i]!=x[i]:
            ncaps += 1
    return ncaps

teststr = "Dogs and Cats are both Animals"
print teststr, "has", str(numcaps(teststr)), "capital letters"

Dogs and Cats are both Animals has 3 capital letters
- Exercise 3

def findDog(x):
    mylist = x.split("dog")
    if len(mylist) < 2:
        return "not found"
    else:
        return mylist[0]

print findDog("i have a dog but not a cat")
print findDog("i have a fish but not a cat")
print findDog("i have a dog but not a dogwood")

i have a
not found
i have a
2.7 dict type
A dict, short for dictionary, is a helpful data structure in
Python for building mappings between inputs and outputs.
2.7.1 Examples
mydict = dict()
mydict["dogs"] = 14
mydict["fish"] = "slumberland"
mydict["dogs"]+= 3
print mydict

{'fish': 'slumberland', 'dogs': 17}
len(mydict["fish"])
One of the nice things about python is that even when very condensed, it is still readable. People talk about coding in a pythonic way, meaning to write very tight, readable code.
print dict([(x, x**2) for x in (2, 4, 6)])
{2: 4, 4: 16, 6: 36}
Let's use a dictionary to store word counts from a sentence.
str = "Up and down went the seesaw. Up it went. Down it went. Up, up, up!"
print str
for i in [",",".","!"]:
    str = str.replace(i," ")
print str
str = str.lower()
print str
print set(str.lower().split())

Up and down went the seesaw. Up it went. Down it went. Up, up, up!
Up and down went the seesaw Up it went Down it went Up up up
up and down went the seesaw up it went down it went up up up
set(['and', 'up', 'it', 'down', 'seesaw', 'went', 'the'])
We see that a set contains an unordered collection of the
elements of the list returned by split(). Let's make a dictionary
with keys that are pulled from this set.
str = "Up and down went the seesaw. Up it went. Down it went. Up, up, up!"
for i in [",",".","!"]:
    str = str.replace(i," ")
words = str.lower().split()
d = dict.fromkeys(set(words),0)
print d
for w in words:
    d[w]+=1
print d

{'and': 0, 'down': 0, 'seesaw': 0, 'went': 0, 'the': 0, 'up': 0, 'it': 0}
{'and': 1, 'down': 2, 'seesaw': 1, 'went': 3, 'the': 1, 'up': 5, 'it': 2}
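Counting things is so common that the standard library has a shortcut for it. This is a different technique from the dict.fromkeys approach above, not a replacement for understanding it: collections.Counter (available from Python 2.7 on) is a dict that does the counting for you. The variable names here are my own.

```python
from collections import Counter

sentence = "Up and down went the seesaw. Up it went. Down it went. Up, up, up!"
for mark in [",", ".", "!"]:
    sentence = sentence.replace(mark, " ")

# Counter builds the word -> count mapping in one step
counts = Counter(sentence.lower().split())

print(counts["up"])
print(counts["went"])
print(counts.most_common(1))   # the single most frequent word with its count
```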
2.7.2 Writing to CSV
A very useful feature of dictionaries is that there is an easy method to write them out to a CSV (comma-separated values) file.
import csv
f = open('/tmp/blah.csv','w')
nums = [1,2,3]
c = csv.DictWriter(f,nums)
for i in range(0,10):
    c.writerow(dict([(x, x**i) for x in nums]))
f.close()
This writes out the following csv file:
1,1,1
1,2,3
1,4,9
1,8,27
1,16,81
1,32,243
1,64,729
1,128,2187
1,256,6561
1,512,19683
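The file above has no header row, so a reader has to guess what each column means. Here is a sketch of the same idea with a header, read back in to check the result. Assumptions: it is written for version 3 (where csv files are opened in text mode with newline=""), it writes only three rows, and the file name powers.csv in a temporary directory is made up.

```python
import csv
import os
import tempfile

nums = [1, 2, 3]
path = os.path.join(tempfile.mkdtemp(), "powers.csv")   # hypothetical file name

# Python 3: open in text mode with newline="" so csv controls the line endings
with open(path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=nums)
    writer.writeheader()                        # first row holds the column names
    for i in range(3):
        writer.writerow({x: x ** i for x in nums})

# Read the file back as plain rows to see exactly what was written
with open(path, newline="") as f:
    rows = list(csv.reader(f))

print(rows)
```

Everything comes back as text: csv does not remember that the values were numbers.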
2.7.3 A Note on File Objects
- Think about file objects like a book
- If a file is open, you don't want other people to mess with it
- Files can be opened for reading or writing
- There are methods to move around an open file
- Close the book when you're done reading it!
- Python documentation on "File I/O" is here
| English | Python | Output |
|---|---|---|
| Open blah.txt just for reading | f = open('blah.txt','r') | file object f |
| Get the next line in a file | str = f.readline() | string containing a single line |
| Get the entire file | str = f.read() | string containing entire file |
| Go to the beginning of a file | f.seek(0) | None |
| Close blah.txt | f.close() | None |
To play with this, download this file somewhere on your hard drive.
I'm putting it on my hard drive as /tmp/gaga.txt. On Windows, it
may look more like C:\temp\gaga.txt - just make sure you get the
path correct when you tell Python where to look!
f = open('/tmp/gaga.txt','r')
print f
str = f.read()
print "str has length: ", len(str)
str2 = f.read()
print "str2 has length: ", len(str2)
f.seek(0)
str3 = f.readline()
print "str3 has length: ", len(str3)
f.close()

<open file '/tmp/gaga.txt', mode 'r' at 0x10045e8a0>
str has length: 1220
str2 has length: 0
str3 has length: 77
You'll use file objects a lot. As we see them, I'll try to point out what's important about them.
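One way to never forget to close the book is the with statement, which closes the file for you when the indented block ends. This sketch writes its own small file first so that it runs anywhere; the file name sample.txt and its contents are made up.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "sample.txt")   # hypothetical file

with open(path, "w") as f:
    f.write("first line\nsecond line\n")

with open(path, "r") as f:
    everything = f.read()      # reads to the end of the file
    nothing_left = f.read()    # already at the end, so this is empty
    f.seek(0)                  # rewind to the beginning
    first = f.readline()

print(len(everything))
print(len(nothing_left))
print(first.strip())
print(f.closed)                # the with block closed the file for us
```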
2.7.4 Exercise
- Exercise 1
Write a function that counts the number of unique letters in a word.
- Exercise 2
Write a function that takes in a string, and returns a dict that tells you how many words there are with each number of unique letters.
"Dogs and cats are all animals"
dogs and cats are al animls
4 3 4 3 2 6
{2: 1, 3: 2, 4: 2, 6: 1}
- Exercise 3
Loop over a list of strings, and write a csv that contains a column for each number and a row for each string.
1,2,3,4,5,6,7,8,9,10,11,12,13
2,3,2,3,4,5,2,3,2,1,0,0,0
5,2,1,0,1,2,0,0,0,0,0,0,0
etc.
2.7.5 Exercise Solutions
- Exercise 1
Write a function that counts the number of unique letters in a word.

def uniqueletters(w):
    d = dict()
    for char in w:
        d[char] = 1
    return len(d.keys())

print uniqueletters("dog")
print uniqueletters("dogged")

3
4
- Exercise 2
Write a function that takes in a string, and returns a dict that tells you how many words there are with each number of unique letters.
"Dogs and cats are all animals"
dogs and cats are al animls
4 3 4 3 2 6
{2: 1, 3: 2, 4: 2, 6: 1}

def uniqueletters(w):
    d = dict()
    for char in w:
        d[char] = 1
    return len(d.keys())

def wordcounter(str):
    d = dict()
    for w in str.split():
        u = uniqueletters(w)
        if u in d.keys():
            d[u]+=1
        else:
            d[u] = 1
    return d

print wordcounter("Dogs and cats are all animals")

{2: 1, 3: 2, 4: 2, 6: 1}
- Exercise 3
Loop over a list of strings, and write a csv that contains a column for each number and a row for each string.
1,2,3,4,5,6,7,8,9,10,11,12,13
2,3,2,3,4,5,2,3,2,1,0,0,0
5,2,1,0,1,2,0,0,0,0,0,0,0
etc.

import csv

def uniqueletters(w):
    d = dict()
    for char in w:
        d[char] = 1
    return len(d.keys())

def wordcounter(str):
    d = dict()
    for w in str.split():
        u = uniqueletters(w)
        if u in d.keys():
            d[u]+=1
        else:
            d[u] = 1
    return d

def listwriter(l):
    emptydict = dict([(x, 0) for x in range(1,26)])
    f = open('/tmp/blah.csv','w')
    c = csv.DictWriter(f,sorted(emptydict.keys()))
    c.writeheader()
    for str in l:
        c.writerow(dict(emptydict.items()+wordcounter(str).items()))
    f.close()

listwriter(["Five score years ago, a great American, in whose symbolic shadow we stand today, signed the Emancipation Proclamation.",
            "We observe today not a victory of party, but a celebration of freedom -- symbolizing an end, as well as a beginning -- signifying renewal, as well as change.",
            "So, first of all, let me assert my firm belief that the only thing we have to fear is fear itself -- nameless, unreasoning, unjustified terror which paralyzes needed efforts to convert retreat into advance."])
Here is the resulting CSV file:
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25
1,2,1,2,5,3,0,1,2,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5,8,4,1,2,4,3,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,7,6,9,4,2,3,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3 Regular Expressions
Regular expressions are a framework for doing complicated text manipulation: you describe a pattern, and the computer finds every piece of text that fits it.
3.1 A first example
For example, consider the following text:
Joseph Schmoe (15), Phone:(934) 292-2390, SSN:295-48-2311
A first guess for a rule to get the area code would be to find a grouping of three numbers. Let's look at the source code for this in python.
import re
str = "Joseph Schmoe (15), Phone:(934) 292-2390, SSN:295-48-2311"
print re.findall("\d\d\d",str)

['934', '292', '239', '295', '231']
3.1.1 What the code does
- import re
  - tells python to use the regular expression library. (Like library(zelig))
- str = ...
  - defines a string
  - Python will figure out that the type is a string based on the fact that it's in quotes
  - There is a difference between foo = '333' and foo = 333
- re.findall("\d\d\d",str)
  - From the re library, call the findall function
    - When in doubt, Google it.
    - By the way, googling things effectively is the most important modern research skill there is.
  - Finds all of the matches of the regular expression \d\d\d in str
  - Returns it as a list

import re
str = "Joseph Schmoe (15), Phone:(934) 292-2390, SSN:295-48-2311"
print re.findall("\((\d\d\d)\)",str)

['934']
3.1.2 Different expressions
"Joseph Schmoe (15), Phone:(934) 292-2390, SSN:295-48-2311"
| English | Regex | findall Output |
|---|---|---|
| Any three numbers | \d\d\d | ['934', '292', '239', '295', '231'] |
| Any three numbers that start with ( | \(\d\d\d | ['(934'] |
| One or more adjacent numbers | \d+ | ['15', '934', '292', '2390', '295', '48', '2311'] |
| One or more numbers in parentheses | \(\d+\) | ['(15)', '(934)'] |
| Three numbers in parentheses | \(\d\d\d\) | ['(934)'] |
| Three numbers in parentheses, but group only the numbers | \((\d\d\d)\) | ['934'] |
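If you want to check the table yourself, this sketch runs every pattern in it against the same string. Two choices here are mine rather than the tutorial's: the patterns are written as raw strings (the r prefix, which stops Python from interpreting the backslashes itself), and the variable is called text instead of str. It is written for version 3 because of the two-argument print.

```python
import re

text = "Joseph Schmoe (15), Phone:(934) 292-2390, SSN:295-48-2311"

patterns = [
    r"\d\d\d",         # any three numbers
    r"\(\d\d\d",       # any three numbers that start with (
    r"\d+",            # one or more adjacent numbers
    r"\(\d+\)",        # one or more numbers in parentheses
    r"\(\d\d\d\)",     # three numbers in parentheses
    r"\((\d\d\d)\)",   # the same, but group only the numbers
]

results = {}
for pattern in patterns:
    results[pattern] = re.findall(pattern, text)
    print(pattern, results[pattern])
```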
3.2 Further examples
import re
str = "Joseph Schmoe, Bowling High Score:(225), Phone:(934) 292-2390"
print re.findall("\w+:\((\d+)\)",str)

['225', '934']

- The \w is code for any alphanumeric character and the underscore.
- The : is code for only the character :.

import re
str = "I called his phone after he phoned me, but he has two phones!"
print re.findall("phone\w*",str)

['phone', 'phoned', 'phones']

- We match all instances of "phone" with any number of characters after it
  - Note the difference between \w+ (1 or more) and \w* (0 or more)

import re
str = "I called his phone after he phoned me, but he has two phones!"
print re.findall("phone\w+",str)

['phoned', 'phones']
3.3 Other helpful regex tools
Regular expressions are extremely powerful, and are used extensively for text processing. Here are some good places to look for regex help:
- Python re library has documentation of how to use regex in python with
examples
- I can never remember regex syntax, so I go here all the time.
- Regexr is an interactive regex checker
- Textbooks on regex will tell you not just how to use them, but how they are implemented. They help answer the question "what is the best regex for this situation?"
3.4 Exercises
This file contains 100 blogs about dogs in a structured text format that may be familiar to you.
- Exercise 1
Use regular expressions to parse this file and write a csv file containing the article number and the number of words. (I'm going to start by downloading it to my hard drive, but if you're macho you want to figure out how to use the urllib module to parse it without downloading.)
- Exercise 2
Write a CSV file that investigates whether articles contain certain words. In particular, do dog bloggers write more about 'pets' or 'companions'?
3.5 Solutions
- Exercise 1
import csv, re
f = open('/tmp/example.txt')
fp = open('/tmp/result.csv','wb')
c = csv.DictWriter(fp,["Article Number","Words"])
articlenum = 0
for line in f:
    d = dict()
    r = re.match("LENGTH:\s*(\d+)",line)
    if r:
        articlenum+=1
        d["Article Number"] = articlenum
        d["Words"] = r.groups()[0]
        c.writerow(d)
f.close()
fp.close()

The result.csv file is:
1,305
2,303
3,425
4,275
5,197
6,615
7,281
8,466
9,692
10,656
11,294
12,674
13,1455
14,1454
15,1063
16,1066
17,512
18,433
19,294
20,528
21,758
22,497
23,598
24,957
25,163
26,661
27,616
28,521
29,331
30,275
31,266
32,762
33,365
34,781
35,753
36,442
37,1251
38,462
39,230
40,281
41,564
42,510
43,316
44,1060
45,402
46,990
47,392
48,536
49,509
50,636
51,973
52,234
53,675
54,416
55,488
56,487
57,546
58,596
59,326
60,312
61,369
62,1507
63,2398
64,183
65,1718
66,280
67,302
68,302
69,1326
70,549
71,460
72,302
73,288
74,288
75,269
76,308
77,2241
78,515
79,526
80,320
81,400
82,301
83,302
84,263
85,297
86,300
87,953
88,308
89,1019
90,787
91,307
92,371
93,512
94,303
95,285
96,302
97,666
98,490
99,551
100,411
- Exercise 2
Let's begin just by checking some basic regular expressions
import re
str = "A competition between Pets and Animal Companions! How do you refer to your dog?"
print "\w*:"
print re.findall("\w*",str)
print "[p]et:"
print re.findall("[p]et",str)
print "[pP]et:"
print re.findall("[pP]et",str)

\w*:
['A', '', 'competition', '', 'between', '', 'Pets', '', 'and', '', 'Animal', '', 'Companions', '', '', '', 'How', '', 'do', '', 'you', '', 'refer', '', 'to', '', 'your', '', 'dog', '', '']
[p]et:
['pet']
[pP]et:
['pet', 'Pet']
Great! So we know how to match "pet" or "Pet", but it still matches "competition"! Let's write out some patterns that we would like to match:
Do Match
- I own a dog - pets are great!
- Do you have a pet?
- Pets are wonderful.
- I've got to tell you--pets are the best!

Don't Match
- Great competition!
- Petabytes of data are needed.
- I went to the petting zoo with my companion!
- She owns a whippet.

It looks to me like we need the word "pet" with a space or punctuation at the beginning or the end, with an optional s at the end.

[-,\s.;][pP]et

| Regex piece | Meaning |
|---|---|
| [-,\s.;] | Either a dash, a comma, whitespace, a period or a semicolon |
| [pP] | Either p or P |
| et | the letters et |

import re
strlist = ["I own a dog - pets are great!",
           "Do you have a pet?",
           "Pets are wonderful.",
           "I've got to tell you--pets are the best!",
           "Great competition!",
           "Petabytes of data are needed.",
           "I went to the petting zoo with my companion!",
           "She owns a whippet."]
for str in strlist:
    print str
    print re.findall("[-,\s.;][pP]et",str)

I own a dog - pets are great!
[' pet']
Do you have a pet?
[' pet']
Pets are wonderful.
[]
I've got to tell you--pets are the best!
['-pet']
Great competition!
[]
Petabytes of data are needed.
[]
I went to the petting zoo with my companion!
[' pet']
She owns a whippet.
[]
This isn't good enough! We're going to need to change the endings, too.
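[-,\s.;?][pP]et[s]?[,\s.;-?]

| Regex piece | Meaning |
|---|---|
| [-,\s.;?] | Either a dash, a comma, whitespace, a period, a semicolon or a question mark |
| [pP] | Either p or P |
| et | the letters et |
| [s]? | an optional s |
| [,\s.;-?] | Either a comma, whitespace, a period, a semicolon or a question mark |

A word of caution about that last piece: inside square brackets, ;-? is read as a range of characters running from ; to ?, so it matches the question mark but not a literal dash. Putting the dash first, as in the opening piece, is the way to match a dash itself.

import re
strlist = ["I own a dog - pets are great!",
           "Do you have a pet?",
           "Pets are wonderful.",
           "I've got to tell you--pets are the best!",
           "Great competition!",
           "Petabytes of data are needed.",
           "I went to the petting zoo with my companion!",
           "She owns a whippet."]
for str in strlist:
    print str
    print re.findall("[-,\s.;?][pP]et[s]?[,\s.;-?]",str)

I own a dog - pets are great!
[' pets ']
Do you have a pet?
[' pet?']
Pets are wonderful.
[]
I've got to tell you--pets are the best!
['-pets ']
Great competition!
[]
Petabytes of data are needed.
[]
I went to the petting zoo with my companion!
[]
She owns a whippet.
[]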
We're almost there! We just need to make it so a string can also begin with Pets.
^[pP]et[s]?[,\s.;-?]

| Regex piece | Meaning |
|---|---|
| ^ | Only match the beginning of a string |
| [pP] | Either p or P |
| et | the letters et |
| [s]? | an optional s |
| [,\s.;-?] | Either a comma, whitespace, a period, a semicolon or a question mark (;-? is read as the range from ; to ?) |

So we will either match the regular expression ^[pP]et[s]?[,\s.;-?] or the expression [-,\s.;?][pP]et[s]?[,\s.;-?]. The syntax for this is the pipe operator |.

Our regular expression just to check for pets is:

[-,\s.;?][pP]et[s]?[,\s.;-?]|^[pP]et[s]?[,\s.;-?]

This looks like a sloppy mess, but we built it up by hand ourselves, and it's really not so bad!
import re
strlist = ["I own a dog - pets are great!",
           "Do you have a pet?",
           "Pets are wonderful.",
           "I've got to tell you--pets are the best!",
           "Great competition!",
           "Petabytes of data are needed.",
           "I went to the petting zoo with my companion!",
           "She owns a whippet."]
for str in strlist:
    print str
    print re.findall("[-,\s.;?][pP]et[s]?[,\s.;-?]|^[pP]et[s]?[,\s.;-?]",str)

I own a dog - pets are great!
[' pets ']
Do you have a pet?
[' pet?']
Pets are wonderful.
['Pets ']
I've got to tell you--pets are the best!
['-pets ']
Great competition!
[]
Petabytes of data are needed.
[]
I went to the petting zoo with my companion!
[]
She owns a whippet.
[]
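For comparison, regular expressions also have a built-in notion of "the edge of a word", written \b. This is a different technique from the hand-built pattern above, and the two do not behave identically: \b matches the word alone, without the surrounding space or punctuation, and it would also accept something like "pet's". The sketch uses raw strings and version 3 printing; those choices and the variable names are mine.

```python
import re

strlist = ["I own a dog - pets are great!",
           "Do you have a pet?",
           "Pets are wonderful.",
           "I've got to tell you--pets are the best!",
           "Great competition!",
           "Petabytes of data are needed.",
           "I went to the petting zoo with my companion!",
           "She owns a whippet."]

# \b marks a boundary between a word character and a non-word character
pattern = re.compile(r"\b[pP]ets?\b")

matches = [pattern.findall(s) for s in strlist]
for s, m in zip(strlist, matches):
    print(s, m)
```

The first four strings match and the last four don't, which is the split we were after.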
Having constructed this regex for pets, we can now do the same for companion. Because the word companion isn't going to be inside words the way pet is, we don't have to be as careful. Let's say we need to match companion and companions, but not companionship. We can copy the same regex for pets, but remove the gunk from the beginning (although it probably can't hurt for correctness to include it!)
Let's try:

[cC]ompanion[s]?[,\s.;-?]

Note: Remember to use re.match to match the beginning of the string only, and re.search to match anywhere!

import csv, re
f = open('/tmp/example.txt')
fp = open('/tmp/pets.csv','wb')
c = csv.DictWriter(fp,["Article Number","Words","Pet","Companion"])
articlenum = 0
for line in f:
    r = re.match("LENGTH:\s*(\d+)",line)
    if r:
        if articlenum>0:
            c.writerow(d)
        d = dict()
        articlenum+=1
        d["Article Number"] = articlenum
        d["Words"] = r.groups()[0]
        d["Pet"] = 0
        d["Companion"] = 0
    else:
        pets = re.search("[-,\s.;?][pP]et[s]?[,\s.;-?]|^[pP]et[s]?[,\s.;-?]",line)
        companions = re.search("[cC]ompanion[s]?[,\s.;-?]",line)
        if pets:
            d["Pet"] = 1
        if companions:
            d["Companion"] = 1
f.close()
fp.close()
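Let's take a look at the csv file. Notice that it has only 99 rows: the loop writes an article's row when it reaches the next LENGTH line, so the final article never gets written. One more c.writerow(d) after the loop would fix that.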
1,305,0,0
2,303,1,0
3,425,1,0
4,275,1,0
5,197,0,0
6,615,0,0
7,281,1,1
8,466,1,0
9,692,1,0
10,656,0,0
11,294,1,0
12,674,0,0
13,1455,1,0
14,1454,1,0
15,1063,1,0
16,1066,1,0
17,512,0,0
18,433,1,0
19,294,1,0
20,528,1,0
21,758,1,0
22,497,0,0
23,598,0,0
24,957,0,0
25,163,0,0
26,661,0,0
27,616,0,1
28,521,0,0
29,331,0,1
30,275,1,0
31,266,1,0
32,762,0,0
33,365,0,0
34,781,0,1
35,753,0,0
36,442,0,0
37,1251,0,0
38,462,0,0
39,230,0,0
40,281,0,0
41,564,0,0
42,510,1,0
43,316,1,0
44,1060,1,1
45,402,1,0
46,990,1,0
47,392,0,0
48,536,1,0
49,509,1,0
50,636,1,0
51,973,1,0
52,234,0,0
53,675,1,0
54,416,1,0
55,488,1,0
56,487,1,0
57,546,1,0
58,596,1,0
59,326,1,0
60,312,1,0
61,369,0,0
62,1507,0,1
63,2398,1,0
64,183,1,0
65,1718,1,0
66,280,1,0
67,302,0,0
68,302,1,0
69,1326,1,0
70,549,1,0
71,460,1,0
72,302,1,0
73,288,1,0
74,288,0,0
75,269,0,0
76,308,0,0
77,2241,0,0
78,515,1,1
79,526,0,0
80,320,1,0
81,400,0,0
82,301,1,0
83,302,1,0
84,263,1,0
85,297,1,0
86,300,0,0
87,953,0,0
88,308,1,0
89,1019,1,0
90,787,1,0
91,307,0,0
92,371,0,0
93,512,1,0
94,303,1,0
95,285,0,0
96,302,1,0
97,666,0,0
98,490,0,0
99,551,1,1
4 Web Sites
4.1 Example: Egypt Independent / المصري اليوم
- Example article: http://www.egyptindependent.com/node/725861
- Tells us nothing about the date or content
- A node number is automatically created by a content-management system
4.1.1 Aside: "Brittleness"
- A brittle system is one that is not resistant to change
- For example, between early April and late April of 2012, Egypt
Independent transitioned from
http://www.egyptindependent.com/node/725861
to a new URL naming scheme that involves the title:
http://www.egyptindependent.com/news/european-union-will-keep-mubarak-assets-ice-illicit-gains-authority-head-says
All scrapers are brittle.
- The assumptions you're forced to make about how information is organized on a given website will not hold forever.
- In fact, the legality of scraping is not entirely clear, and some sites may not be interested in you hammering their servers!
4.1.2 Metadata
Sometimes, metadata is included which tells us important things about our article
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="msvalidate.01" content="F1F61CF0E5EC4EC2940FCA062AB13A53" />
<meta name="google-site-verification" content="Q8FKHdNoQ2EH7SH1MzwH_JNcgVgMYeCgFnzNlXlR4N0" />
<title>European Union will keep Mubarak assets on ice, Illicit Gains Authority head says | Egypt Independent</title>
<!-- tC490Uh18j-7O_rp7nG0_e6U9QY -->
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<link rel="canonical" href="http://www.egyptindependent.com/node/725861" />
<meta name="keywords" content="Assem al-Gohary, corruption, EU, freezing Mubarak’s assets, Hosni Mubarak, Illicit Gains Authority (IGA), News, Top stories" />
<meta name="description" content="The European Union will continue to freeze the assets of former President Hosni Mubarak, his family and other former officials although Egypt has thus far been unsuccessful in recovering funds siphoned abroad by the regime." />
<meta name="abstract" content="Al-Masry Al-Youm - Egypt's leading independent media group المصرى اليوم للصحافة والنشر هى مؤسسة إعلامية مصرية مستقلة تأسست عام ,2003." />
- Keywords, abstract, description and title are all clear
- Lots of other gunk that isn't relevant to us!
- Pulling information out of this document requires that we know how they organize and title their metadata!
- What if keywords were called terms?
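As a rough sketch of how the metadata could be pulled out, here is a small parser built on the standard library's HTML parser. Assumptions to keep in mind: the import is the version 3 one (the version 2 module is called HTMLParser), the class name MetaCollector is made up, and the HTML is a trimmed copy of the metadata above with a shortened keyword list and description.

```python
from html.parser import HTMLParser   # Python 3; the Python 2 module is HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta name=... content=...> pairs into a dict."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            # Skip meta tags without a name, such as the http-equiv ones
            if "name" in attrs and "content" in attrs:
                self.meta[attrs["name"]] = attrs["content"]


# Trimmed, shortened copy of the page's metadata
html = '''
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="keywords" content="Assem al-Gohary, corruption, EU, News, Top stories" />
<meta name="description" content="The European Union will continue to freeze the assets." />
'''

collector = MetaCollector()
collector.feed(html)
keywords = [k.strip() for k in collector.meta["keywords"].split(",")]
print(keywords)
```

If the site renamed keywords to terms, collector.meta["keywords"] would raise a KeyError, which is exactly the kind of brittleness described earlier.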
4.1.3 Body
The actual body of the article can be found by right-clicking on the text we're interested in from Chrome or Firefox and selecting "Inspect Element"
<div class="panel-region-separator"></div><div class="panel-pane pane-node-body" > <div class="pane-content"> <p>The European Union will continue to freeze the assets of former President Hosni Mubarak, his family and other former officials although Egypt has thus far been unsuccessful in recovering funds siphoned abroad by the regime.</p> <p>The Illicit Gains Authority (IGA), the judicial committee responsible for recovering the money, on Wednesday received an official notification from the European Union, confirming its freeze on the assets would be renewed another year as of 19 March, state-run MENA news service reported on Wednesday. </p> <p>“This was in response to a request by Egypt,” the state news agency quoted IGA head Assem al-Gohary as saying. </p> <p>Egypt formally asked European Union countries earlier this month to continue freezing funds belonging to Mubarak, his two sons and other members of his administration.</p> <p>Shortly after Mubarak was forced to step down in February 2011, the public prosecutor ordered that the foreign assets of the deposed president and his family be frozen.</p> <p>Mubarak's actual worth is still unknown after more than a year of investigations into his foreign and domestic assets. 
Last year claims that Mubarak, in his nearly 30-year reign as head of state, may have amassed a fortune of up to US$70 billion — greater than that of Microsoft's Bill Gates — helped drive the protests that eventually brought him down.</p> <p>Last year Swiss authorities also froze Mubarak’s assets, acting more speedily than when the EU froze the assets of another deposed North African ruler, former Tunisian President Zine al-Abidine Ben Ali.</p> <p>On Wednesday, the IGA met with the Swiss ambassador in Cairo to discuss the difficulties it faces in recovering those funds, in light of the obligations of the United Nations Convention Against Corruption on the member states, reported MENA.</p> <p>Gohary once estimated the frozen assets at 410 million Swiss francs (LE2.7 billion), which Egypt is trying to repatriate in cooperation with the Foreign Ministry.</p> </div>
The whole article body sits inside the panel-pane pane-node-body section
of the page, within its pane-content sub-section. Our "algorithm"
for extracting the text therefore has to locate exactly that section
and pull data only from it. If you skip this step, any terms that
appear in the sidebar will end up in your analysis!
4.1.4 Scraping Articles
Every News Feature is on a page in the following scheme:
http://www.egyptindependent.com/subchannel/News%20features?page=5
The paper's archive goes back 77 pages, to April 2009.
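As a small illustration of this URL scheme, here is a sketch that builds the list of index-page addresses. It assumes the pages are numbered 1 through 77, which the scheme above does not actually promise, and page_url is a made-up helper name.

```python
# Base address of the News features index, copied from the scheme above.
BASE_URL = "http://www.egyptindependent.com/subchannel/News%20features?page="

def page_url(n):
    """Return the address of index page number n (hypothetical helper)."""
    return BASE_URL + str(n)

# Assumption: the pages are numbered 1..77.
urls = [page_url(n) for n in range(1, 78)]
print(len(urls))
print(urls[4])
```

If the site turns out to number its pages from 0, only the range needs to change.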
Investigating the source for a single search page can tell us what we have to do to get at the relevant information:
<div class="views-row views-row-4 views-row-even"> <div class="views-field-field-published-date-value"> <span class="field-content"><span class="date-display-single">09 Feb 2012</span></span> </div> <div class="views-field-title"> <span class="field-content"><a href="http://www.egyptindependent.com/node/647936">Parliament Review: A week of comedy and disappointment</a></span> </div> <div class="views-field-body"> <span class="field-content">This week’s parliamentary sessions had the public joking about airing future sessions on comedy channels instead of news, and those who abstained from the polls telling those who participated, in hope of having a legitimate authority...</span> </div>
Our algorithm to scrape articles from this page will be as follows:
- Initialize FOO=1
- Go to http://www.egyptindependent.com/subchannel/News%20features?page=FOO
- Repeat until complete:
  - Find the next occurrence of views-row...
    - Find the sub-field called views-field-field-published-date-value and retrieve its value (the date)
    - Find the sub-field called views-field-title and retrieve its value (the title)
    - Follow the link from above
    - Within the link, find the meta-data keywords and retrieve their values (the keywords)
    - Within the link, find the panel-pane pane-node-body section, and retrieve the text (the article itself)
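The per-row steps of this algorithm can be sketched with nothing but the standard library. This is only an illustration, not the scraper we build later: it works on a trimmed, hand-closed copy of the row shown above (real pages are usually not well-formed enough for this parser), it skips fetching pages and following links, and scrape_row is a made-up helper name.

```python
import xml.etree.ElementTree as ET

# A trimmed, well-formed copy of one views-row from the listing page.
ROW = """
<div class="views-row views-row-4 views-row-even">
  <div class="views-field-field-published-date-value">
    <span class="field-content"><span class="date-display-single">09 Feb 2012</span></span>
  </div>
  <div class="views-field-title">
    <span class="field-content"><a href="http://www.egyptindependent.com/node/647936">Parliament Review: A week of comedy and disappointment</a></span>
  </div>
</div>
"""

def scrape_row(row):
    """Return (date, title, link) for one views-row element (hypothetical helper)."""
    date = row.find('.//span[@class="date-display-single"]')
    link = row.find('.//div[@class="views-field-title"]/span[@class="field-content"]/a')
    return date.text, link.text, link.get("href")

row = ET.fromstring(ROW)
date, title, url = scrape_row(row)
print(date)
print(title)
print(url)
```

Following the link and pulling out the keywords and body would repeat the same find-then-read pattern on the article page.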
4.1.5 Scraping Exercise!
Not all web sites are designed in the same way. Go to the site of your choice, and figure out how to get the articles you're interested in. Write out pseudocode that will tell you:
- How to download individual articles
- How to get the Author of an article
- How to get the Title of an article
- How to get the Date of an article
- How to get the text of the article
If you need a site to practice on that isn't too challenging, check out Roger Ebert's Blog.
5 Web scraping in python
Now that we know how we want to scrape and have some grasp on the tools that are necessary, let's try and pull the articles and their metadata off of this website.
import urllib

baseurl = "http://www.egyptindependent.com/subchannel/News%20features?page="
destpath = "/tmp/"
npages = 10
# range(1, npages) stops before npages, so this saves pages 1 through 9
for i in range(1,npages):
    urllib.urlretrieve(baseurl+str(i), destpath+"page"+str(i)+".html")

Note: Windows users, backslashes in your destination need to be doubled inside a Python string, e.g.
C:\\Python27\\tmp\\
If we take a look at what exists after running this script, we can see that it worked.
bash-3.2$ ls /tmp/page*
/tmp/page1.html /tmp/page3.html /tmp/page5.html /tmp/page7.html /tmp/page9.html
/tmp/page2.html /tmp/page4.html /tmp/page6.html /tmp/page8.html
Aside: The os module
If you're doing lots of things in a script that will involve files or
paths, but you want it to work cross-platform, consider using the os
and os.path modules.
They let you do things like:
- change the current directory
- get the directory or filename of a file
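Here is a rough sketch of those two tasks. The path tmp/pages/page1.html is just an invented example, and the temporary directory is only used so there is somewhere safe to change into.

```python
import os
import os.path
import tempfile

# Build a path using whatever separator this platform uses.
path = os.path.join("tmp", "pages", "page1.html")

dirname = os.path.dirname(path)     # the directory part
filename = os.path.basename(path)   # the filename part
print(dirname)
print(filename)

# Change the current directory, then go back to where we started.
start = os.getcwd()
os.chdir(tempfile.gettempdir())
print(os.getcwd())
os.chdir(start)
```

Because os.path.join picks the separator for you, the same script runs on Windows, Mac and Linux without editing the slashes by hand.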
5.1 Using ElementTree
Here is a very basic html tree which we can work with.
import urllib

fileloc = 'http://www.people.fas.harvard.edu/~astorer/scraping/test.html'
f = urllib.urlopen(fileloc)
print f.read()
<html>
<head>
<title>Example page</title>
</head>
<body>
<p>Moved to <a href="http://example.org/">example.org</a>
or <a href="http://example.com/">example.com</a>.</p>
</body>
</html>
- The ElementTree is a hierarchical structure of Elements.
- list() returns a list of the children of a single Element
- An Element contains
  - A tag (what kind of element is it)
  - text of what lives in the element
from xml.etree.ElementTree import ElementTree

fileloc = '/Users/astorer/Work/presentations/scraping/test.html'
tree = ElementTree()
tree.parse(fileloc)

elem = tree.find('body')    # the <body> element
print elem
print list(elem)            # its children
elem = tree.find('body/p')  # the <p> inside <body>
print elem
print list(elem)
print elem.tag
print elem.text             # only the text before the first child element
<Element 'body' at 0x1004c5310>
[<Element 'p' at 0x1004c5350>]
<Element 'p' at 0x1004c5350>
[<Element 'a' at 0x1004c5390>, <Element 'a' at 0x1004c53d0>]
p
Moved to
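Notice that elem.text only gave us "Moved to": the rest of the sentence belongs to the child elements. The sketch below parses the same test page from a string (so it needs no file on disk) and shows where the remaining pieces live, using each link's attributes, text and tail.

```python
import xml.etree.ElementTree as ET

PAGE = """<html>
<head>
<title>Example page</title>
</head>
<body>
<p>Moved to <a href="http://example.org/">example.org</a>
or <a href="http://example.com/">example.com</a>.</p>
</body>
</html>"""

root = ET.fromstring(PAGE)
p = root.find('body/p')

# Each child <a> carries its attributes, its own text, and a "tail":
# the text that follows its closing tag inside the parent.
links = [(a.get('href'), a.text) for a in list(p)]
print(links)
print(repr(p.text))      # text before the first child
print(repr(p[0].tail))   # text between the two links
print(repr(p[1].tail))   # text after the second link
```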
5.2 Using lxml
Now let's see how we can parse out the list of article URLs from a real HTML page. Our basic approach isn't going to work here, because ElementTree expects well-formed XML and most web pages are not, so we need to install an external package.
5.2.1 Installing a Package
External packages can be easily installed in python using the
easy_install command. The only challenge is in making sure that
if you have multiple versions of python installed, you are
installing the libraries to the correct location. I'm on a mac,
but the Python version on a mac is 2.6, and I prefer using 2.7.
Make sure you install the setuptools for 2.7 following
these instructions. Then, run
sudo easy_install-2.7 lxml
If you have no idea what I'm talking about, it will probably be fine if you simply use the following:
sudo easy_install lxml
On windows, try doing
easy_install lxml
To verify that this installed for you, open up python, and type
import lxml
If you get an error, check your setup and try reinstalling.
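If you would rather check from a script than from the interactive prompt, something like the following sketch works. have_module is a made-up helper name; it simply attempts the import and reports whether it succeeded.

```python
import importlib

def have_module(name):
    """Return True if the module called name can be imported (hypothetical helper)."""
    try:
        importlib.import_module(name)
        return True
    except ImportError:
        return False

for name in ("os", "lxml"):
    status = "installed" if have_module(name) else "missing"
    print(name + ": " + status)
```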
5.2.2 Using lxml
lxml will generate an ElementTree for us after parsing the page.
Let's review some of the functions that will be useful for us in
this example.
| English | Python |
|---|---|
| Construct a parser | lxml.etree.HTMLParser() |
| Parse an HTML file | lxml.etree.parse(file,parser) |
| Get all instances of <span class="..."> | MyTree.xpath('.//span[@class="..."]') |
| Get all instances of <span class="date"> within <div class="article"> | MyTree.xpath('.//div[@class="article"]/span[@class="date"]') |
| Make a list of tuples that we can iterate over | zip(iterable1,iterable2,...) |
| Encode a unicode string foo as UTF-8 bytes | foo.encode("UTF-8") |
The xpath syntax is described in more detail here. Briefly, we
are finding every occurrence of spans with the class
date-display-single, no matter where they live in the tree.
Then we can iterate over them to get the actual dates. Similarly,
we can iterate over all links that are within the <span class="field-content"> that are within the <div class="views-field-title"> and zip them with the dates to iterate
over both simultaneously. Notice that titles containing non-ASCII
characters are held as unicode strings, which Python 2 may fail to
print unless we first encode them as UTF-8. The following code makes this explicit.
from lxml import etree

fname = '/tmp/page1.html'
fp = open(fname, 'rb')
parser = etree.HTMLParser()
tree = etree.parse(fp, parser)
# every date span, wherever it lives in the tree
dateelems = tree.xpath('.//span[@class="date-display-single"]')
# every link inside a title field
linkelems = tree.xpath('.//div[@class="views-field-title"]/span[@class="field-content"]/a')
for (d,l) in zip(dateelems,linkelems):
    print d.text
    print l.get('href')
    print l.text.encode("utf-8")
13 Apr 2012
http://www.egyptindependent.com/news/muslim-brotherhood-returns-street-politics-fills-square
Muslim Brotherhood returns to street politics, fills square
12 Apr 2012
http://www.egyptindependent.com/news/meet-your-presidential-candidate-omar-suleiman-phantom
Meet your presidential candidate: Omar Suleiman, the phantom
11 Apr 2012
http://www.egyptindependent.com/news/administrative-court-ruling-leaves-transition-timetable-disarray
Administrative court ruling leaves transition timetable in disarray
11 Apr 2012
http://www.egyptindependent.com/news/shater-faces-early-hiccups-campaign-trail
Shater faces early hiccups on campaign trail
11 Apr 2012
http://www.egyptindependent.com/news/new-alternatives-may-bolster-moussa%E2%80%99s-chances
New alternatives may bolster Moussa’s chances
09 Apr 2012
http://www.egyptindependent.com/news/profile-kamal-al-helbawy-defector-conscience
Profile: Kamal al-Helbawy, a defector of conscience
09 Apr 2012
http://www.egyptindependent.com/news/suleiman-president-game-changer-or-set-plan
Suleiman for president: Game changer or set plan?
09 Apr 2012
http://www.egyptindependent.com/news/despite-prison-time-revolutionaries-uniform-continue-struggle
Despite prison time, revolutionaries in uniform continue struggle
06 Apr 2012
http://www.egyptindependent.com/news/parliament-review-constitution-crisis-continues-brothers-spread-their-wings
Parliament Review: Constitution crisis continues as Brothers spread their wings
06 Apr 2012
http://www.egyptindependent.com/news/profile-april-6-genealogy-youth-movement
Profile: April 6, genealogy of a youth movement
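To see what the encode step is doing, here is a tiny sketch using one of the titles above. The curly apostrophe in "Moussa’s" is the single character U+2019, written here as an escape so the snippet itself stays plain ASCII.

```python
title = u"New alternatives may bolster Moussa\u2019s chances"

encoded = title.encode("utf-8")     # unicode text -> UTF-8 bytes
decoded = encoded.decode("utf-8")   # UTF-8 bytes -> unicode text

print(repr(encoded))
print(len(title))
print(len(encoded))
print(decoded == title)
```

The encoded form is two bytes longer than the text, because that one apostrophe takes three bytes (\xe2\x80\x99) in UTF-8.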
5.2.3 XPath Examples
- Get all links under <div class="views-field-title">

from lxml import etree

fname = '/tmp/page1.html'
fp = open(fname, 'rb')
parser = etree.HTMLParser()
tree = etree.parse(fp, parser)
elems = tree.xpath('.//div[@class="views-field-title"]//a')
for e in elems:
    print e.text.encode('utf-8')
Muslim Brotherhood returns to street politics, fills square Meet your presidential candidate: Omar Suleiman, the phantom Administrative court ruling leaves transition timetable in disarray Shater faces early hiccups on campaign trail New alternatives may bolster Moussa’s chances Profile: Kamal al-Helbawy, a defector of conscience Suleiman for president: Game changer or set plan? Despite prison time, revolutionaries in uniform continue struggle Parliament Review: Constitution crisis continues as Brothers spread their wings Profile: April 6, genealogy of a youth movement Parliament Review: A week of political exclusion and inclusion Friday's protest to unite, or further divide On campaign trail, Moussa speaks villagers' language Elections commission's disqualifications dubious, say experts Lawyers blame flawed justice system for acquittal of protesters’ accused killers Halloween! An Ode to Love Letters to Treze علشان مننشاش المغربي Wedding dance of "Beja" tribe Egyptian protester passes out after harassment "Pearly Pink Flower" I Cry! "The Amazing Bibliotheca Alexandrina!" am agree title am agree title Divorce between Margaret Scobey and Mr. Baradei
- Get all clickable images
These will look like:
<a href="www.webpage.com"><img src="laksjdasldkj.jpg"></a>
from lxml import etree

fname = '/tmp/page1.html'
fp = open(fname, 'rb')
parser = etree.HTMLParser()
tree = etree.parse(fp, parser)
elems = tree.xpath('.//a/img')
for e in elems:
    print e.get('src')
/sites/default/files/img/english_logo.png
http://www.egyptindependent.com//sites/default/files/imagecache/video_thumbnail/video/2011/03/28/22597/lshn_mnnshsh_lmgrby_87049_1558797204.jpg
http://www.egyptindependent.com//sites/default/files/imagecache/video_thumbnail/video/2010/05/09/3685/wedding_dance_17135_1577232530.jpg
http://www.egyptindependent.com//sites/default/files/imagecache/video_thumbnail/video/2010/04/14/2252/rabw_14731_451336279.jpg
http://www.egyptindependent.com//sites/default/files/imagecache/news-featured/photo/2011/11/13/27866/pink_flower.jpg
http://www.egyptindependent.com//sites/default/files/imagecache/news-featured/photo/2011/11/09/27866/ips.jpg
http://www.egyptindependent.com//sites/default/files/imagecache/news-featured/photo/2011/10/29/27866/amazing.jpg
http://www.egyptindependent.com//sites/default/files/imagecache/news-featured/caricature/2010/04/14/120/piioioioioi.jpg
http://www.egyptindependent.com//sites/default/files/imagecache/news-featured/caricature/2010/04/13/120/wewweweww.jpg
http://www.egyptindependent.com//sites/default/files/imagecache/news-featured/caricature/2010/04/05/489/separation.jpg
/sites/default/files/W300.jpg
5.2.4 lxml Exercise
Write a csv file that contains every image along with the location that it links to. If the webpage has:
<a href="www.webpage.com"><img src="laksjdasldkj.jpg"></a>
Your entry in the csv file would look like:
www.webpage.com, laksjdasldkj.jpg
Hint: use the elt.getparent() method to query elements 'above' a given element elt.
5.2.5 Solutions to exercise
import csv
from lxml import etree

fname = '/tmp/page1.html'
fp = open(fname, 'rb')
f = open('/tmp/links.csv','w')
entries = ["Image","Link"]
c = csv.DictWriter(f,entries)
parser = etree.HTMLParser()
tree = etree.parse(fp, parser)
lnkelems = tree.xpath('.//a/img')   # images that sit directly inside a link
for lnk in lnkelems:
    d = dict()
    d["Image"] = lnk.get('src')
    d["Link"] = lnk.getparent().get('href')   # the enclosing <a>
    c.writerow(d)
fp.close()
f.close()
The resulting CSV file looks like this:
/sites/default/files/img/english_logo.png,/
http://www.egyptindependent.com//sites/default/files/imagecache/video_thumbnail/video/2011/03/28/22597/lshn_mnnshsh_lmgrby_87049_1558797204.jpg,http://www.egyptindependent.com/node/377786
http://www.egyptindependent.com//sites/default/files/imagecache/video_thumbnail/video/2010/05/09/3685/wedding_dance_17135_1577232530.jpg,http://www.egyptindependent.com/node/40195
http://www.egyptindependent.com//sites/default/files/imagecache/video_thumbnail/video/2010/04/14/2252/rabw_14731_451336279.jpg,http://www.egyptindependent.com/node/26345
http://www.egyptindependent.com//sites/default/files/imagecache/news-featured/photo/2011/11/13/27866/pink_flower.jpg,http://www.egyptindependent.com/node/514135
http://www.egyptindependent.com//sites/default/files/imagecache/news-featured/photo/2011/11/09/27866/ips.jpg,http://www.egyptindependent.com/node/513040
http://www.egyptindependent.com//sites/default/files/imagecache/news-featured/photo/2011/10/29/27866/amazing.jpg,http://www.egyptindependent.com/node/509791
http://www.egyptindependent.com//sites/default/files/imagecache/news-featured/caricature/2010/04/14/120/piioioioioi.jpg,http://www.egyptindependent.com/node/26275
http://www.egyptindependent.com//sites/default/files/imagecache/news-featured/caricature/2010/04/13/120/wewweweww.jpg,http://www.egyptindependent.com/node/26269
http://www.egyptindependent.com//sites/default/files/imagecache/news-featured/caricature/2010/04/05/489/separation.jpg,http://www.egyptindependent.com/node/24767
/sites/default/files/W300.jpg,http://www.almasryalyoum.com/en/your-guide
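For comparison, the same pairing can be done by walking down from each link instead of up from each image. The sketch below is an alternative rather than the solution above: it uses the standard library's ElementTree (which has no getparent()) on a small made-up snippet, and the second link and the unlinked image are invented to show what gets skipped.

```python
import xml.etree.ElementTree as ET

# A hand-written, well-formed snippet; only the first <a> wraps an image.
SNIPPET = """<div>
  <a href="www.webpage.com"><img src="laksjdasldkj.jpg" /></a>
  <a href="www.example.com">a text-only link</a>
  <img src="unlinked.jpg" />
</div>"""

root = ET.fromstring(SNIPPET)

pairs = []
for a in root.iter('a'):            # every link in the snippet
    for img in a.findall('img'):    # only images directly inside that link
        pairs.append((a.get('href'), img.get('src')))

for href, src in pairs:
    print(href + ", " + src)
```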
5.2.6 Downloading articles from each page
Goal: A file with the dates, titles and location of each article. Save each article in html form to the hard drive.
from lxml import etree
import csv
import urllib
import re

# CSV index of every article we download
f = open('/tmp/files.csv','w')
entries = ["Day","Month","Year","Title","Remote","Local"]
c = csv.DictWriter(f,entries)

destpath = '/tmp/'
fname = '/tmp/page1.html'
fp = open(fname, 'rb')
parser = etree.HTMLParser()
tree = etree.parse(fp, parser)

dateelems = tree.xpath('.//div[@class="views-field-field-published-date-value"]/span[@class="field-content"]/span[@class="date-display-single"]')
linkelems = tree.xpath('.//div[@class="panel-pane pane-views pane-subchannel-news subchannel-pane"]//div[@class="views-field-title"]/span[@class="field-content"]/a')

for (d,l) in zip(dateelems,linkelems):
    entry = dict()
    myDate = d.text.split()   # "13 Apr 2012" -> ['13', 'Apr', '2012']
    urlname = l.get('href')
    print urlname
    entry["Day"] = myDate[0]
    entry["Month"] = myDate[1]
    entry["Year"] = myDate[2]
    # use everything after the last slash of the URL as the local file name
    remotename = re.match('.*/(.*)',urlname)
    dest = destpath+remotename.group(1)+".html"
    urllib.urlretrieve(urlname,dest)
    entry["Local"] = dest
    entry["Remote"] = urlname
    entry["Title"] = l.text.encode("utf-8")
    c.writerow(entry)
    print entry
f.close()
fp.close()
http://www.egyptindependent.com/news/muslim-brotherhood-returns-street-politics-fills-square
{'Remote': 'http://www.egyptindependent.com/news/muslim-brotherhood-returns-street-politics-fills-square', 'Title': 'Muslim Brotherhood returns to street politics, fills square', 'Month': 'Apr', 'Year': '2012', 'Local': '/tmp/muslim-brotherhood-returns-street-politics-fills-square.html', 'Day': '13'}
http://www.egyptindependent.com/news/meet-your-presidential-candidate-omar-suleiman-phantom
{'Remote': 'http://www.egyptindependent.com/news/meet-your-presidential-candidate-omar-suleiman-phantom', 'Title': 'Meet your presidential candidate: Omar Suleiman, the phantom', 'Month': 'Apr', 'Year': '2012', 'Local': '/tmp/meet-your-presidential-candidate-omar-suleiman-phantom.html', 'Day': '12'}
http://www.egyptindependent.com/news/administrative-court-ruling-leaves-transition-timetable-disarray
{'Remote': 'http://www.egyptindependent.com/news/administrative-court-ruling-leaves-transition-timetable-disarray', 'Title': 'Administrative court ruling leaves transition timetable in disarray', 'Month': 'Apr', 'Year': '2012', 'Local': '/tmp/administrative-court-ruling-leaves-transition-timetable-disarray.html', 'Day': '11'}
http://www.egyptindependent.com/news/shater-faces-early-hiccups-campaign-trail
{'Remote': 'http://www.egyptindependent.com/news/shater-faces-early-hiccups-campaign-trail', 'Title': 'Shater faces early hiccups on campaign trail', 'Month': 'Apr', 'Year': '2012', 'Local': '/tmp/shater-faces-early-hiccups-campaign-trail.html', 'Day': '11'}
http://www.egyptindependent.com/news/new-alternatives-may-bolster-moussa%E2%80%99s-chances
{'Remote': 'http://www.egyptindependent.com/news/new-alternatives-may-bolster-moussa%E2%80%99s-chances', 'Title': 'New alternatives may bolster Moussa\xe2\x80\x99s chances', 'Month': 'Apr', 'Year': '2012', 'Local': '/tmp/new-alternatives-may-bolster-moussa%E2%80%99s-chances.html', 'Day': '11'}
http://www.egyptindependent.com/news/profile-kamal-al-helbawy-defector-conscience
{'Remote': 'http://www.egyptindependent.com/news/profile-kamal-al-helbawy-defector-conscience', 'Title': 'Profile: Kamal al-Helbawy, a defector of conscience', 'Month': 'Apr', 'Year': '2012', 'Local': '/tmp/profile-kamal-al-helbawy-defector-conscience.html', 'Day': '09'}
http://www.egyptindependent.com/news/suleiman-president-game-changer-or-set-plan
{'Remote': 'http://www.egyptindependent.com/news/suleiman-president-game-changer-or-set-plan', 'Title': 'Suleiman for president: Game changer or set plan?', 'Month': 'Apr', 'Year': '2012', 'Local': '/tmp/suleiman-president-game-changer-or-set-plan.html', 'Day': '09'}
http://www.egyptindependent.com/news/despite-prison-time-revolutionaries-uniform-continue-struggle
{'Remote': 'http://www.egyptindependent.com/news/despite-prison-time-revolutionaries-uniform-continue-struggle', 'Title': 'Despite prison time, revolutionaries in uniform continue struggle', 'Month': 'Apr', 'Year': '2012', 'Local': '/tmp/despite-prison-time-revolutionaries-uniform-continue-struggle.html', 'Day': '09'}
http://www.egyptindependent.com/news/parliament-review-constitution-crisis-continues-brothers-spread-their-wings
{'Remote': 'http://www.egyptindependent.com/news/parliament-review-constitution-crisis-continues-brothers-spread-their-wings', 'Title': 'Parliament Review: Constitution crisis continues as Brothers spread their wings', 'Month': 'Apr', 'Year': '2012', 'Local': '/tmp/parliament-review-constitution-crisis-continues-brothers-spread-their-wings.html', 'Day': '06'}
http://www.egyptindependent.com/news/profile-april-6-genealogy-youth-movement
{'Remote': 'http://www.egyptindependent.com/news/profile-april-6-genealogy-youth-movement', 'Title': 'Profile: April 6, genealogy of a youth movement', 'Month': 'Apr', 'Year': '2012', 'Local': '/tmp/profile-april-6-genealogy-youth-movement.html', 'Day': '06'}
The resulting CSV file looks like this:
13,Apr,2012,"Muslim Brotherhood returns to street politics, fills square",http://www.egyptindependent.com/news/muslim-brotherhood-returns-street-politics-fills-square,/tmp/muslim-brotherhood-returns-street-politics-fills-square.html
12,Apr,2012,"Meet your presidential candidate: Omar Suleiman, the phantom",http://www.egyptindependent.com/news/meet-your-presidential-candidate-omar-suleiman-phantom,/tmp/meet-your-presidential-candidate-omar-suleiman-phantom.html
11,Apr,2012,Administrative court ruling leaves transition timetable in disarray,http://www.egyptindependent.com/news/administrative-court-ruling-leaves-transition-timetable-disarray,/tmp/administrative-court-ruling-leaves-transition-timetable-disarray.html
11,Apr,2012,Shater faces early hiccups on campaign trail,http://www.egyptindependent.com/news/shater-faces-early-hiccups-campaign-trail,/tmp/shater-faces-early-hiccups-campaign-trail.html
11,Apr,2012,New alternatives may bolster Moussa’s chances,http://www.egyptindependent.com/news/new-alternatives-may-bolster-moussa%E2%80%99s-chances,/tmp/new-alternatives-may-bolster-moussa%E2%80%99s-chances.html
09,Apr,2012,"Profile: Kamal al-Helbawy, a defector of conscience",http://www.egyptindependent.com/news/profile-kamal-al-helbawy-defector-conscience,/tmp/profile-kamal-al-helbawy-defector-conscience.html
09,Apr,2012,Suleiman for president: Game changer or set plan?,http://www.egyptindependent.com/news/suleiman-president-game-changer-or-set-plan,/tmp/suleiman-president-game-changer-or-set-plan.html
09,Apr,2012,"Despite prison time, revolutionaries in uniform continue struggle",http://www.egyptindependent.com/news/despite-prison-time-revolutionaries-uniform-continue-struggle,/tmp/despite-prison-time-revolutionaries-uniform-continue-struggle.html
06,Apr,2012,Parliament Review: Constitution crisis continues as Brothers spread their wings,http://www.egyptindependent.com/news/parliament-review-constitution-crisis-continues-brothers-spread-their-wings,/tmp/parliament-review-constitution-crisis-continues-brothers-spread-their-wings.html
06,Apr,2012,"Profile: April 6, genealogy of a youth movement",http://www.egyptindependent.com/news/profile-april-6-genealogy-youth-movement,/tmp/profile-april-6-genealogy-youth-movement.html
5.2.7 Exercise
Modify the above code so that instead of iterating over only the first page, it iterates over all pages.
- Consider using the glob library to look for all of the html files in a directory.
- Can you do this so you don't save the pages, but parse them directly?
- Use google and the python documentation to help figure it out!
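As a nudge for the first hint, here is roughly how glob behaves. The sketch makes a throwaway directory with a few empty stand-in files (the names are invented) rather than touching your real downloads, and it leaves the parsing part of the exercise to you.

```python
import glob
import os
import shutil
import tempfile

# Build a throwaway directory holding a few empty stand-in files.
workdir = tempfile.mkdtemp()
for name in ("page1.html", "page2.html", "notes.txt"):
    open(os.path.join(workdir, name), "w").close()

# glob expands the * wildcard much like the shell does.
pages = sorted(glob.glob(os.path.join(workdir, "page*.html")))
for fname in pages:
    print(os.path.basename(fname))

shutil.rmtree(workdir)  # clean up the temporary directory
```

glob returns names in no particular order, and sorted() puts them in alphabetical order, so page10.html would come before page2.html.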
Now that we've seen lxml in action, let's figure out how to use it to pull out just the text of the article. Recall that all of the original text is in the following tags:
<div class="panel-pane pane-node-body" > <div class="pane-content">
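One way this could look is sketched below. To keep it runnable without any downloads it uses the standard library's ElementTree on a tiny made-up page instead of lxml on a real article; the sidebar and paragraph text are invented, and the path expression is the part that should carry over to lxml's xpath().

```python
import xml.etree.ElementTree as ET

# A made-up, well-formed page with a sidebar we do not want.
ARTICLE = """<html><body>
<div class="sidebar"><p>Most read stories</p></div>
<div class="panel-pane pane-node-body">
  <div class="pane-content">
    <p>First paragraph of the <b>article</b>.</p>
    <p>Second paragraph.</p>
  </div>
</div>
</body></html>"""

root = ET.fromstring(ARTICLE)
content = root.find('.//div[@class="panel-pane pane-node-body"]/div[@class="pane-content"]')

# itertext() yields every piece of text inside an element,
# including text nested in child tags such as <b>.
paragraphs = ["".join(p.itertext()).strip() for p in content.findall('p')]
text = "\n".join(paragraphs)
print(text)
```

Because the search starts from the pane-content section, the sidebar text never makes it into the result.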
5.3 Stripping text
Can be included if there's interest!
5.4 Parallelization to increase speed
Can be included if there's interest!