Web Scraping with Python

  • Hello, everyone.
    Today I will be demonstrating how to connect to web pages in Python 2.7, scrape information you need from them, and post back to them using standard Python libraries. I will be performing this in a Python terminal, but obviously these actions can be scripted as well.

    My first demonstration will be simply accessing a webpage and seeing its contents.

    First, you will need to import urllib and urllib2:

    >>> import urllib
    >>> import urllib2



    These libraries contain the functions necessary to communicate with a server. Why both? Because there are minor differences between the two. urllib2 can accept a Request object and modify the headers for a URL request, while urllib can only accept a URL. Also, urllib has the urlencode method for generating GET query strings, while urllib2 doesn't.
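
    To make the difference concrete, here's a quick sketch of both (the URL and the User-Agent value are just illustrations, not anything you need for the mission):

    >>> query = urllib.urlencode({'q': 'enigma group'})    # urllib only
    >>> query
    'q=enigma+group'
    >>> req = urllib2.Request('http://example.com/?' + query)
    >>> req.add_header('User-Agent', 'Mozilla/5.0')        # urllib2 only
    >>> page = urllib2.urlopen(req)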

    Now, we'll go ahead and use the first programming mission as an example web page. Set a variable equal to the URL you desire. For your own purposes, you can set this to take user input or just keep it static as I will.

    >>> url = 'http://challenges.enigmagroup.org/programming/1/'



    Now, the mission calls for sending a POST request containing your IP and username, along with a cookie with mission set to yes.

    The next thing we need to do then is set up our variables. There are dozens of ways to find your IP address, but I will just go to https://www.whatismyip.com/ and check there. After you have it, store it as well as your username in a dictionary like so:

    >>> headers = {'ip': '111.1.11.111', 'username': 'Anonanonamous' }



    The next step is to get your cookie data. For the EG missions, you will need your PHPSESSID. Once again, there are dozens of ways to check your cookies on a site. My preferred method is just to check them in Firebug. Once you've obtained it, add it along with the mission variable to a single string:

    >>> cookiedata = 'PHPSESSID=xxxxxxxxxxxxxxxxxxxxxxxxxx;mission=yes'



    This is all the information we need to complete programming 1. Now we just have to build and send the request.

    The first step is to URL-encode the data. Why do we need to URL-encode it? Good question. Because when you hand data to the opener, urllib2 sends it as the body of a POST request, and the standard form-body format is the same key=value&key=value encoding used in URL query strings. The following command:

    >>> data = urllib.urlencode(headers)



    will form the data into something like the following:
    'ip=xx.xxx.xx.xx&username=Anonanonamous'
    which is the same sort of string you would see trailing an index.php page.

    Now that the data is ready to be sent, let's build an opener that will carry the request:

    >>> request = urllib2.build_opener()



    Then add the cookie data:

    >>> request.addheaders.append(('COOKIE', cookiedata))



    And then send the request:

    >>> response = request.open(url, data)
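
    If you want to confirm the request went through before digging into the contents, the response object also carries the HTTP status code, which should be 200 if all went well:

    >>> response.getcode()
    200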



    Now, everything the webpage sent back is stored in response. I won't go into all the things Python's libraries can extract for you, but I will cover the two most useful.

    First of all, you can check your header information. This is everything you would see from using something like Tamper Data or curl -I [url]:

    >>> print response.info()
    Date: Sun, 19 Jun 2016 23:18:08 GMT
    Server: Apache/2.2.15 (CentOS)
    Expires: Thu, 19 Nov 1981 08:52:00 GMT
    Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
    Pragma: no-cache
    Connection: close
    Transfer-Encoding: chunked
    Content-Type: text/html; charset=utf-8



    This might be useful for other missions or projects you are working on, but not this one.
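
    If you only need one of those headers, you don't have to parse that block yourself; the message object returned by response.info() can look it up for you:

    >>> response.info().getheader('Content-Type')
    'text/html; charset=utf-8'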

    The primary - and most obvious - thing you want back is the HTML code of the page you accessed. This is very easy to do as well:

    >>> html = response.read()
    >>> print html
    ...
    <title>Enigma Group :: Programming 1</title>
    ...
    Your objective is to send a POST header to this script. The POST content must contain your IP Address and your USERNAME. The variable names are ip & username!

    ONE MORE THING... MAKE SURE you send a cookie with the value mission=yes

    Congratulations on beating the mission again.
    No credits were rewarded, but it's good to see you're testing your skills again.

    Return to Enigma Main
    ...




    I'm going to spare you most of what was printed, as it isn't useful to us. This is exactly what you would see if you opened the web page and viewed its source. Now, because we don't want to wade through all that nasty stuff, and we know what tags the juicy part is wrapped in, we can use regular expressions to filter out anything we don't want to see. Let's practice here (I'm using <p> tags as a stand-in below; view the source and use whatever uniquely wraps the success message):

    >>> import re
    >>> responseFinder = re.compile(r'<p>(.*?)</p>', re.DOTALL)
    >>> responseList = re.findall(responseFinder, html)
    >>> print responseList[0]
    Congratulations on beating the mission again.
    No credits were rewarded, but it's good to see you're testing your skills again.


    Yay!!! You will also need a case for when you fail, but I will leave that up to you. So re is Python's regular expression library. Regular expressions are a topic for a different time, but essentially they're very powerful (and greedy by default) text-matching tools. You could use something like BeautifulSoup if you're lazy, but then I won't have taught you anything and you will never learn.

    re.compile() sets up a regular expression statement. The r before the quotes denotes a raw string, which means backslashes don't need to be escaped. The (.*) in the middle is what will be returned: it matches anything it finds in between both sides of the regular expression. Try r'<title>(.*)</title>' on a webpage and see what you get :) re.findall() searches for the regular expression in the string you give it and returns all the matches in a list. You can then look through the list and get whatever you need from it. In this instance, however, I know that there is only one occurrence of this particular expression, so I can just print what is in the 0 index location.
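
    If you want to see findall in action before pointing it at a live page, you can test it on a plain string first (the HTML here is made up):

    >>> titleFinder = re.compile(r'<title>(.*)</title>')
    >>> re.findall(titleFinder, '<head><title>Hello</title></head>')
    ['Hello']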

    In future EG programming missions, you will need to scrape information from the pages and perform some sort of operation on it. Following what I've shown you, it's simple enough to get the information. After that, it's just a matter of arranging it and performing whatever transformations the mission calls for on the data.
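
    As a hypothetical example (the tags and the task here are invented for illustration, not taken from an actual mission), suppose a page listed some numbers and asked you to send back their sum; the scrape-then-transform pattern looks like this:

    >>> numberFinder = re.compile(r'<span>(\d+)</span>')
    >>> numbers = re.findall(numberFinder, '<span>3</span><span>7</span><span>12</span>')
    >>> numbers
    ['3', '7', '12']
    >>> print sum(int(n) for n in numbers)
    22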

    I believe the last topic to go over is taking images from a webpage. This should be pretty simple to a seasoned Python user, but I will give a quick example.

    To do this, you will need to install a library. PIL doesn't come bundled with Python, and the original project is no longer maintained; its actively developed fork is called Pillow. Regardless, install it with pip (pip install Pillow) or whatever way you choose; you still import it as PIL.

    Say we wanted to do programming mission 3. The description page gives you the link to the image. We can therefore connect to that page:

    >>> url = 'http://challenges.enigmagroup.org/programming/3/image.php'
    >>> request = urllib2.build_opener()
    >>> request.addheaders.append(('COOKIE', 'PHPSESSID=XXXX'))
    >>> response = request.open(url)



    Now, if you print response.read(), you'll get a lot of unintelligible garbage. That's because you're trying to print the raw bytes of a JPEG file as text! It does mean your response is an image, though, so instead you can write it to a file or manipulate it directly. I usually write it to a file so that I have an easier time debugging:

    >>> imagedata = response.read()
    >>> output = open("image.jpg", "wb")
    >>> output.write(imagedata)
    >>> output.close()



    Now that you have the image stored, you can mess with it using PIL:

    >>> from PIL import Image
    >>> pic = Image.open("image.jpg")
    >>> pix = pic.load()
    >>> r, g, b = pix[1,1]
    >>> print r, g, b
    6 29 229



    PIL is a very powerful imaging library; even the smallest grasp of what it is capable of would take another class and is out of scope for this lesson. However, that is all the information you need to pass the first four programming missions. After that, it takes a little more effort to manipulate the data, but it's still completely possible with the tools I've shown you and a grasp of programmatic problem solving.
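
    As a tiny taste of what you can do once the pixels are loaded (a generic sketch, not the solution to any particular mission), here's how you could walk every pixel and average the red channel:

    >>> width, height = pic.size
    >>> total = 0
    >>> for x in range(width):
    ...     for y in range(height):
    ...         total += pix[x, y][0]    # [0] is the red channel
    ...
    >>> print total / (width * height)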

    I usually use the following simple algorithm when doing these missions (a skeleton script follows the list):
    1. set up variables
    2. connect to the webpage
    3. scrape what I need
    4. manipulate that data
    5. send back to the webpage
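
    Put together as an actual script, it looks something like the minimal sketch below. The regex and the answer parameter name are placeholders you'd swap out for whatever the mission at hand expects:

    import urllib
    import urllib2
    import re

    # 1. set up variables
    url = 'http://challenges.enigmagroup.org/programming/1/'
    headers = {'ip': '111.1.11.111', 'username': 'Anonanonamous'}
    cookiedata = 'PHPSESSID=xxxxxxxxxxxxxxxxxxxxxxxxxx;mission=yes'

    # 2. connect to the webpage
    opener = urllib2.build_opener()
    opener.addheaders.append(('COOKIE', cookiedata))
    html = opener.open(url, urllib.urlencode(headers)).read()

    # 3. scrape what I need (placeholder regex)
    matches = re.findall(r'<p>(.*?)</p>', html, re.DOTALL)

    # 4. manipulate that data (mission-specific)
    answer = matches[0].strip() if matches else ''

    # 5. send back to the webpage ('answer' is a made-up parameter name)
    data = urllib.urlencode({'answer': answer})
    print opener.open(url, data).read()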

    For more thorough documentation, here's an excellent site: http://www.voidspace.org.uk/python/articles/urllib2.shtml

    Simple as that.
