Download an HTML page using Ruby


While a page on a website is very different from a file, several languages provide a way to read one as if it were a regular file. I am not sure this is a good idea, but it certainly works for some people.

In Ruby, the open-uri module provides this simplified interface.

After loading the module with require, URLs can be 'opened' much like regular files. In older versions of Ruby the module overrides the standard open function, so a plain open(url) works. Since Ruby 3.0 that form no longer accepts URLs, and URI.open(url), which open-uri also provides, has to be used instead. Of course, URLs can only be opened read-only, as we can only fetch pages, not push them out, but the call accepts all kinds of additional parameters.

examples/ruby/download.rb

require 'open-uri'
url = 'http://code-maven.com/'
# URI.open works on Ruby 2.5 and later; a plain open(url) only works before Ruby 3.0
fh = URI.open(url)
html = fh.read
puts html

In such cases open returns an instance of StringIO (or, for larger responses, a Tempfile). If we printed fh itself instead of its content, we would get:

puts fh   #  #<StringIO:0x007fc41c8bc238>

Once we have the object, we can apply the same methods as on a regular filehandle. For example, we can use the read method to read in the content of the whole page. Unlike with regular files, there is no efficiency reason to read the content line by line here. By the time open returns, the whole document has already arrived and, in the case of a StringIO, is sitting in the memory of our program. We might as well copy it to our own variable using the read method.
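To make the file-like behaviour concrete, here is a minimal sketch that runs without Internet access. It assumes Ruby 2.5 or later (for URI.open). The helpers serve_once and fetch_page are hypothetical names made up for this illustration, and the tiny TCPServer only stands in for a real website. Besides read, the sketch also looks at status and content_type, which open-uri adds to the returned object.

```ruby
require 'open-uri'
require 'socket'

# Hypothetical helper: answer exactly one HTTP request on a free local port
# with the given body, so the example does not need a real website.
def serve_once(body)
  server = TCPServer.new('127.0.0.1', 0)   # port 0: let the OS pick a free port
  port = server.addr[1]
  thread = Thread.new do
    client = server.accept
    # Skip the request line and the headers, up to the empty line.
    loop do
      line = client.gets
      break if line.nil? || line == "\r\n"
    end
    client.write [
      'HTTP/1.1 200 OK',
      'Content-Type: text/html; charset=utf-8',
      "Content-Length: #{body.bytesize}",
      'Connection: close',
      '',
      body
    ].join("\r\n")
    client.close
    server.close
  end
  [port, thread]
end

# Hypothetical helper: open the URL and collect what the "filehandle" offers.
def fetch_page(url)
  URI.open(url) do |fh|
    {
      klass: fh.class,                # the kind of object open-uri handed back
      status: fh.status,              # e.g. ["200", "OK"]
      content_type: fh.content_type,  # e.g. "text/html"
      html: fh.read                   # the whole page in one call
    }
  end
end

port, thread = serve_once('<html><body>Hello</body></html>')
page = fetch_page("http://127.0.0.1:#{port}/")
thread.join

puts page[:klass]
puts page[:status].inspect
puts page[:content_type]
puts page[:html]
```

For a response this small the class printed should be StringIO. The block form of URI.open used here closes the object when the block ends, the same way the block form of File.open does.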

Lie about who we are

When a browser accesses a website, it tells the site what kind of browser it is, which version, and so on. The same happens when we "open" a web page using the open function supplied by the open-uri module.

By default, open-uri identifies itself as 'Ruby', which does not say much. We can change this to whatever we want by passing "User-Agent" to the open call:

examples/ruby/download_user_agent.rb

require 'open-uri'
url = 'http://code-maven.com/'
# URI.open works on Ruby 2.5 and later; a plain open(url, ...) only works before Ruby 3.0
fh = URI.open(url,
   "User-Agent" => "Code-Maven-Example (see: http://code-maven.com/download-a-page-using-ruby )"
)
html = fh.read
puts html

This string is sent along with the request, and it will usually be recorded in the access log of the web server we connect to.
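To see what the server actually receives, the following sketch replaces the real website with a throwaway local server that records the request headers. It assumes Ruby 2.5 or later. The helper names capture_request_headers and user_agent_seen_by_server, as well as the 'Code-Maven-Example' string, are made up for this illustration. A real web server would write the header to its log instead of returning it to us.

```ruby
require 'open-uri'
require 'socket'

# Hypothetical helper: accept one request on a free local port, remember its
# headers, and reply with a tiny body. The thread's value is the headers hash.
def capture_request_headers
  server = TCPServer.new('127.0.0.1', 0)   # port 0: let the OS pick a free port
  port = server.addr[1]
  thread = Thread.new do
    client = server.accept
    client.gets                            # request line, e.g. "GET / HTTP/1.1"
    headers = {}
    loop do
      line = client.gets
      break if line.nil? || line == "\r\n"
      name, value = line.chomp.split(': ', 2)
      headers[name.downcase] = value
    end
    client.write "HTTP/1.1 200 OK\r\nContent-Length: 2\r\nConnection: close\r\n\r\nok"
    client.close
    server.close
    headers
  end
  [port, thread]
end

# Hypothetical helper: fetch a page from the local server and report which
# User-Agent header the server saw.
def user_agent_seen_by_server(extra_headers = {})
  port, thread = capture_request_headers
  URI.open("http://127.0.0.1:#{port}/", extra_headers).read
  thread.value['user-agent']
end

puts user_agent_seen_by_server                                        # the default identification
puts user_agent_seen_by_server('User-Agent' => 'Code-Maven-Example')  # our own string
```

Any other request header can be passed the same way: open-uri treats string keys as HTTP headers.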


Written by
Gabor Szabo

Published on 2015-10-11

