Extract HTML links using Python HTML Parser

We have seen how to parse an HTML page and print the links using the HTMLParser module of Python, but instead of printing them to the screen we would like to process the links somehow.

That's what we are going to do now. We are going to extract the links and let some other code collect or process them.

Using a global variable

examples/python/extract_links_html_parser_global.py

from __future__ import print_function
import urllib2, sys
from HTMLParser import HTMLParser

links = []

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attr = dict(attrs)
        links.append(attr)

def extract():
    if len(sys.argv) != 2:
        print('Usage: {} URL'.format(sys.argv[0]))
        return
    url = sys.argv[1]

    try:
        f = urllib2.urlopen(url)
        html = f.read()
        f.close()
    except urllib2.HTTPError as e:
        print(e, 'while fetching', url)
        return

    parser = MyHTMLParser()
    parser.feed(html)
    for l in links:
        print(l)

extract()
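
To run the script, pass a single URL on the command line, for example (any reachable URL will do):

python extract_links_html_parser_global.py https://code-maven.com/

It will print one dictionary of attributes for each <a> tag found on the page.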

In this solution we added a global variable called links that starts out as an empty list (line 5).

links = []

Then within MyHTMLParser, the subclass of HTMLParser, we append each link to this list (line 12).

links.append(attr)
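
For reference, handle_starttag receives the attributes as a list of (name, value) pairs, and dict(attrs) turns that list into a dictionary. A tiny stand-alone example (the HTML snippet here is made up for illustration):

from __future__ import print_function
from HTMLParser import HTMLParser

class ShowAttrs(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        print(attrs)        # [('href', '/python'), ('class', 'menu')]
        print(dict(attrs))  # e.g. {'href': '/python', 'class': 'menu'} (key order may vary)

ShowAttrs().feed('<a href="/python" class="menu">Python</a>')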

Finally, once the processing is over, we can go over the extracted list of links and print them or do whatever else we need to do with them (lines 30-31).

for l in links:
    print(l)

An alternative would be to do some processing instead of collecting the links; that would mean calling a function from within the parser (on line 12) and passing the link to it. That would still probably need some global variable where we collect the results of that processing.
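
A minimal sketch of that first alternative might look like this; the process_link function and the hrefs list are made up for illustration and are not part of the original example:

from HTMLParser import HTMLParser

hrefs = []

def process_link(attr):
    # hypothetical processing: keep only the href value, if the tag has one
    if 'href' in attr:
        hrefs.append(attr['href'])

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        process_link(dict(attrs))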

Another alternative would be to use an attribute of the parser object to collect the links:

Using an attribute of the parser

examples/python/extract_links_html_parser_attribute.py

from __future__ import print_function
import urllib2, sys
from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attr = dict(attrs)
        self.links.append(attr)

def extract():
    if len(sys.argv) != 2:
        print('Usage: {} URL'.format(sys.argv[0]))
        return
    url = sys.argv[1]

    try:
        f = urllib2.urlopen(url)
        html = f.read()
        f.close()
    except urllib2.HTTPError as e:
        print(e, 'while fetching', url)
        return

    parser = MyHTMLParser()
    parser.links = []
    parser.feed(html)
    for l in parser.links:
        print(l)

extract()

In this solution, right after creating the parser, on line 27, we attached a new attribute to it:

parser.links = []

Then, inside the parser subclass, we append the links to this attribute (line 10):

self.links.append(attr)

And finally we loop over the collected links (lines 29-30):

for l in parser.links:
    print(l)

This seems to be cleaner than the previous solution because we don't use a global variable, but we take a risk: we have assumed that the HTMLParser class does not have an attribute called links, and if it gets one in the future, our code will break. This means we need to be more careful with upgrades, and we should probably add a comment explaining the problem.
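
If we accept this approach, one option, which is not part of the examples above, is to set the attribute in the subclass itself, in its own __init__ method, and keep the warning comment right next to it:

from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)  # HTMLParser is an old-style class in Python 2, so no super()
        # WARNING: we assume HTMLParser itself never gains a 'links' attribute
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        self.links.append(dict(attrs))

With this version the extract function would not need the parser.links = [] line, though the underlying assumption stays the same.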
