
Extract HTML links using Python HTML Parser


We have seen how to parse an HTML file and print the links using Python's HTMLParser class. Instead of just printing them to the screen, however, we usually want to process the links somehow.

That's what we are going to do now. We are going to extract the links and let some other code collect or process them.

Using a global variable

examples/python/extract_links_html_parser_global.py

import sys
import urllib.error, urllib.request
from html.parser import HTMLParser

links = []  # global list that collects the attributes of every <a> tag

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attr = dict(attrs)  # attrs is a list of (name, value) pairs
        links.append(attr)

def extract():
    if len(sys.argv) != 2:
        print('Usage: {} URL'.format(sys.argv[0]))
        return
    url = sys.argv[1]

    try:
        f = urllib.request.urlopen(url)
        html = f.read().decode('utf-8', errors='replace')  # bytes to str, assuming UTF-8
        f.close()
    except urllib.error.HTTPError as e:
        print(e, 'while fetching', url)
        return

    parser = MyHTMLParser()
    # handle_starttag fills the global links list while feed() runs
    parser.feed(html)
    for l in links:
        print(l)

extract()

In this solution we added a global variable called links that starts out as an empty list (line 5).

links = []

Then within MyHTMLParser, our subclass of HTMLParser, we append the attributes of each a tag, converted to a dictionary, to this list (line 12).

links.append(attr)

Finally, once the processing is over, we can go over the extracted list of links and print them or do whatever we need to do with them (lines 31-32).

for l in links:
    print(l)
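As one example of "whatever we need to do", here is a minimal sketch that pulls only the href values out of the collected dictionaries. It uses Python 3's html.parser, and the sample_html string and the hrefs name are made up for this illustration so that nothing has to be downloaded.

```python
from html.parser import HTMLParser

links = []  # global list, as in the example above


class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        links.append(dict(attrs))


# Hypothetical sample page, used here instead of downloading one
sample_html = '<p><a href="https://example.com/">Home</a> <a name="top">anchor</a></p>'

MyHTMLParser().feed(sample_html)

# Keep only the entries that actually have an href attribute
hrefs = [attr['href'] for attr in links if 'href' in attr]
print(hrefs)
```

Not every a tag has an href (named anchors, for instance), which is why the sketch checks for the key before using it.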

An alternative would be to process each link right away instead of collecting it. That would mean calling a function from within the parser (on line 12) and passing the link to it. We would probably still need some global variable to collect the results of that processing.
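A rough sketch of that idea, assuming Python 3's html.parser. The process_link function, the results list, and the sample HTML are hypothetical names and data invented for this illustration; the processing step simply keeps the href value.

```python
from html.parser import HTMLParser

results = []  # global list that collects the processed results


def process_link(attr):
    # Hypothetical processing step: keep only the href value, if any
    href = attr.get('href')
    if href is not None:
        results.append(href)


class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        # Call the processing function instead of collecting the raw attributes
        process_link(dict(attrs))


# Hypothetical sample page, used here instead of downloading one
parser = MyHTMLParser()
parser.feed('<a href="/about">About</a><a id="x">no href</a><a href="/contact">Contact</a>')
print(results)
```

The parser no longer stores anything itself, but the results still end up in a global list, which is the limitation mentioned above.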

Another alternative would be to use an attribute of the parser object to collect the links:

Using an attribute of the parser

examples/python/extract_links_html_parser_attribute.py

import sys
import urllib.error, urllib.request
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attr = dict(attrs)  # attrs is a list of (name, value) pairs
        self.links.append(attr)

def extract():
    if len(sys.argv) != 2:
        print('Usage: {} URL'.format(sys.argv[0]))
        return
    url = sys.argv[1]

    try:
        f = urllib.request.urlopen(url)
        html = f.read().decode('utf-8', errors='replace')  # bytes to str, assuming UTF-8
        f.close()
    except urllib.error.HTTPError as e:
        print(e, 'while fetching', url)
        return

    parser = MyHTMLParser()
    parser.links = []
    parser.feed(html)
    for l in parser.links:
        print(l)

extract()

In this solution, right after creating the parser, we attach a new attribute to it (line 27):

parser.links = []

Then, inside the parser subclass, we append the attributes of each link to this attribute (line 10):

self.links.append(attr)

And finally we loop over the collected links (lines 29-30):

for l in parser.links:
    print(l)

This seems to be cleaner than the previous solution because we don't use a global variable, but we take a risk. We have assumed that the HTMLParser class does not have an attribute called links. If it gets one in the future, our code will break, which means we need to be more careful with upgrades. Maybe we should also add a comment explaining the problem.
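One possible way to write that comment down, as a sketch rather than a definitive fix: create the attribute in the subclass's own __init__ and, optionally, check up front that a plain HTMLParser object has nothing called links. This assumes Python 3's html.parser; the guard and the sample HTML are additions made for this illustration and are not part of the example file above.

```python
from html.parser import HTMLParser

# Optional guard for the assumption discussed above: a plain HTMLParser
# object is expected NOT to have anything called "links".
if hasattr(HTMLParser(), 'links'):
    raise RuntimeError('HTMLParser already has a "links" attribute')


class MyHTMLParser(HTMLParser):
    def __init__(self):
        super().__init__()
        # NOTE: "links" is our own attribute, not part of HTMLParser.
        # If a future HTMLParser defines one, rename this attribute.
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        self.links.append(dict(attrs))


# Hypothetical sample page, used here instead of downloading one
parser = MyHTMLParser()
parser.feed('<a href="/python">Python</a>')
print(parser.links)
```

This does not remove the risk of a name clash, but it keeps the attribute and the warning in one place and makes a clash fail loudly instead of silently.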


Written by
Gabor Szabo

Published on 2015-07-08
