Accessing web sites from a Python program is not very difficult, but using the requests library makes it downright fun.

Let's see a few examples.

We are going to use the httpbin.org site, which provides an excellent set of end-points for us to experiment with. The site has both an http and an https version. (BTW, it was created by the same author as the requests package.)

examples/python/requests_get_html.py

import requests

r = requests.get('http://httpbin.org/')
print(type(r))
print(r.status_code)
print(r.headers)
print(r.headers['content-type'])
#print(r.text)

After importing the module we can call its get method, passing a URL to it. It will return a requests.models.Response object.

Some of the attributes and methods it has are:

r.status_code is the HTTP status code. On success it is 200.
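
If you want your script to fail loudly on error pages, the Response object also has a raise_for_status method that throws an exception for 4xx and 5xx status codes. A minimal sketch:

import requests

r = requests.get('http://httpbin.org/')
if r.status_code == 200:
    print('OK')
else:
    print('Request failed with status {}'.format(r.status_code))

# raise_for_status() is a shortcut: it raises requests.exceptions.HTTPError
# for 4xx and 5xx responses and does nothing on success
r.raise_for_status()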

r.headers is a dictionary of all the entries in the response header. For example, for the above request I got the following header:

{'content-length': '12150', 'server': 'nginx', 'connection': 'keep-alive', 'access-control-allow-credentials': 'true',
'date': 'Thu, 05 Jan 2017 09:45:02 GMT', 'access-control-allow-origin': '*', 'content-type': 'text/html; charset=utf-8'}

We can look at the Content-Type the server sent us using simple dictionary access: r.headers['content-type'], which, in the above case, will print

text/html; charset=utf-8

r.text contains the full content of the page. This is the content you would get if you opened the page in your browser and then clicked on "view source", or if you ran curl with the given URL.
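
A related attribute: if you need the raw bytes of the response instead of the decoded string, r.content holds them, and r.encoding shows which character set requests used for the decoding. A short sketch:

import requests

r = requests.get('http://httpbin.org/')
print(r.encoding)       # e.g. utf-8, derived from the Content-Type header
print(len(r.content))   # the raw body as bytes
print(len(r.text))      # the same body decoded to a string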

You will get the exact same behavior if you change the URL to https.
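
For example:

import requests

r = requests.get('https://httpbin.org/')
print(r.status_code)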

Asking for JSON

Asking for a page that would return JSON is exactly the same as asking for an HTML page. The difference is only in the returned values:

examples/python/requests_get_json_ip.py

import requests

r = requests.get('http://httpbin.org/ip')
print(r.headers['content-type'])
print(r.text)
data = r.json()
print(data)
print(data['origin'])

The content-type is

application/json

The text is a JSON string. In this case it looked like this:

{
  "origin": "31.168.125.94"
}

The json method converts the JSON string into a Python data structure. In our case it is a dictionary holding a single key and the IP address of our computer as the value:

{u'origin': u'31.168.125.94'}

We can access the individual values (in this case the IP address), just as we do with any other dictionary in Python:

data['origin']
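
One thing to keep in mind: if the response body is not valid JSON (for example, an HTML error page), the json method raises a ValueError. A defensive sketch:

import requests

r = requests.get('http://httpbin.org/ip')
try:
    data = r.json()
    print(data['origin'])
except ValueError:
    # the body could not be parsed as JSON
    print('Response was not JSON')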

Get the User Agent

Another URL the httpbin.org site provides returns the user-agent of our client. When I visited that URL with my Firefox browser I got the following response:

{
  "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:50.0) Gecko/20100101 Firefox/50.0"
}

When I visited it with Chrome, I got this:

{
  "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36"
}

When I ran the Python script:

examples/python/requests_get_json_ua.py

import requests

r = requests.get('http://httpbin.org/user-agent')
print(r.headers['content-type'])
print(r.text)
data = r.json()
print(data)
print(data['user-agent'])

I got the following:

{
  "user-agent": "python-requests/2.3.0 CPython/2.7.12 Darwin/16.3.0"
}

This means that the server can tell the IP address I use, the browser I use when accessing the site, and even the operating system on my computer. Based on this information it can return different data to different users.

I've heard that certain sites will show you higher prices if you are using a Mac, as they assume you are willing to pay a higher price.

Other sites will refuse to serve you unless you use a "human browser". They notice the "user-agent" of your Python application and return an error.
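
A quick way to see the User-Agent string your own script sends is to look at the prepared request that requests attaches to every response:

import requests

r = requests.get('http://httpbin.org/user-agent')
# r.request is the PreparedRequest object that was actually sent
print(r.request.headers['User-Agent'])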

Setting the User-agent in Python requests

Luckily the requests library makes it very easy to change the User-Agent string supplied in the request and to fake any browser.

examples/python/requests_get_json_ua_changed.py

import requests

r = requests.get('http://httpbin.org/user-agent',
    headers={'User-agent': 'Internet Explorer/2.0'})
print(r.headers['content-type'])
print(r.text)
data = r.json()
print(data)
print(data['user-agent'])

The only thing you need to do is to pass the headers parameter with a dictionary that includes the User-Agent:

    headers={'User-agent': 'Internet Explorer/2.0'}

If you run this program it will send the request as if it were Internet Explorer 2.0, leaving the system administrator to wonder whether you are really stuck with such an old browser.
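
If you make many requests and want the same fake User-Agent on all of them, you can set it once on a requests.Session object instead of repeating it in every call. A minimal sketch:

import requests

session = requests.Session()
session.headers.update({'User-agent': 'Internet Explorer/2.0'})

# every request made through the session now carries the fake User-Agent
r = session.get('http://httpbin.org/user-agent')
print(r.json()['user-agent'])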