Print HTML links using Python HTML Parser

Now that we know how to fetch an HTML page with Python using urllib, we take another step and try to extract all the links from the HTML. For this we are going to use the HTMLParser module.

examples/python/print_links_html_parser.py

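# Python 2 script: in Python 3 these modules are urllib.request/urllib.error and html.parser.
from __future__ import print_function
import sys
import urllib2
from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    # Called by feed() for every opening tag; attrs is a list of (name, value) tuples.
    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attr = dict(attrs)
        print(attr)

def extract():
    if len(sys.argv) != 2:
        print('Usage: {} URL'.format(sys.argv[0]))
        return
    url = sys.argv[1]

    try:
        f = urllib2.urlopen(url)
        html = f.read()
        f.close()
    except urllib2.URLError as e:
        # URLError also covers HTTPError (its subclass) and network-level failures.
        print(e, 'while fetching', url)
        return

    parser = MyHTMLParser()
    parser.feed(html)

extract()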

The extract function expects a URL on the command line. Using the urllib2 library, it then fetches the HTML served at that URL.
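The script above targets Python 2. As a rough sketch of the same fetching step under Python 3, assuming the standard urllib.request and urllib.error modules, it might look like the following. The function name fetch_html and the UTF-8 fallback are illustrative choices, not part of the original script.

```python
import urllib.error
import urllib.request


def fetch_html(url):
    """Return the body served at url as text, or None if fetching fails.

    fetch_html is a hypothetical helper name used only for this sketch.
    """
    try:
        with urllib.request.urlopen(url) as response:
            raw = response.read()  # bytes in Python 3
            # Assumption: fall back to UTF-8 when no charset is declared.
            charset = response.headers.get_content_charset() or "utf-8"
    except urllib.error.URLError as e:
        # URLError is the base class of HTTPError, so both are caught here.
        print(e, "while fetching", url)
        return None
    return raw.decode(charset, errors="replace")
```

The decoding step matters because in Python 3 the feed method of the parser expects a str, not bytes.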

Then we create a parser instance and call its feed method, passing the HTML to it. More precisely, we subclass HTMLParser and create an instance of that subclass.

HTMLParser goes over all the elements of the HTML, and every time it encounters an opening tag it calls the handle_starttag method with two parameters (besides the object itself): the name of the tag and the attributes as a list of (name, value) tuples.

When it encounters an end-tag it calls handle_endtag with the name of the tag.

When it encounters text inside an element (for example the anchor text of a link), it calls the handle_data method with the text.
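To make the three callbacks concrete, here is a small sketch that records every call the parser makes. It assumes Python 3, where the class is imported from html.parser; the class name EventLogger and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical subclass that records each callback in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples
        self.events.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


logger = EventLogger()
logger.feed('<p>See <a href="https://example.com/" class="ext">the site</a></p>')
logger.close()

for event in logger.events:
    print(event)
```

Running it shows the start tag of p first, then the text, then the a tag with its href and class attributes as tuples, and finally the two end tags in closing order.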

If we subclass HTMLParser and implement some or all of the above methods, then when we call the feed method, it will call the methods we have overridden in the subclass.

So we have created a subclass called MyHTMLParser and implemented the handle_starttag method in it. In this task we are only interested in the URLs of the links, and those are the href attributes in the opening part of the a tags.

Inside the method we check the tag, and if it is not an a tag we simply return: we don't need to do anything with such tags.

If it is an a tag, we convert the attributes to a dictionary and then print them out.
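Since only the href is of interest, a possible variation is to print just that value instead of the whole dictionary. The following is a sketch under the assumption of Python 3 (html.parser); the names href_from_attrs and LinkPrinter and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser


def href_from_attrs(attrs):
    """Return the href value from a list of (name, value) tuples, or None."""
    # dict() keeps the last value if an attribute name is repeated.
    return dict(attrs).get("href")


class LinkPrinter(HTMLParser):
    """Hypothetical subclass that prints only the href of each a tag."""

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = href_from_attrs(attrs)
        if href is not None:  # skip anchors without an href, e.g. <a name="top">
            print(href)


parser = LinkPrinter()
parser.feed('<a href="/about">About</a> <a name="top"></a> <a href="/contact">Contact</a>')
parser.close()
```

Using get rather than indexing avoids a KeyError for a tags that have no href attribute.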

In the next article we'll see how we can collect this information for later use.
