Saturday, March 12, 2011

The case of a non-raised exception

As you probably know, Python uses exceptions for error handling. It is considered good style to avoid writing error-handling code as conditional statements. Instead, one should rely on the fact that an appropriate exception is raised once an error condition is detected, and caught where it can be dealt with.

So, let's assume that you are given the task of downloading a file in Python, given its URL and the file name to save it under on disk. You may want to write the following code and hope that you don't have to add any error handling because (as you think) all errors that can happen are either network errors or file write errors, and both types of errors already raise exceptions for you.

#!/usr/bin/python

import urllib2
import sys
import socket

def download(url, fname):
    net = urllib2.urlopen(url)
    f = open(fname, "wb")
    
    while True:
        data = net.read(4096)
        if not data:
            break
        f.write(data)
    
    net.close()
    f.close()

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print "Usage: download.py URL filename"
        # Stop here; otherwise sys.argv[1] below raises IndexError.
        sys.exit(1)
    
    url = sys.argv[1]
    fname = sys.argv[2]
    
    socket.setdefaulttimeout(30)
    
    download(url, fname)

Indeed, this code downloads existing files via HTTP just fine. It also provides sensible tracebacks for nonexistent hosts, 404 errors, full-disk situations, and socket timeouts. So, it looks like the result of calling the download() function is either a successfully downloaded file or an exception that another part of the application will likely be able to deal with.

But it only looks that way. Consider a situation where the HTTP server closes the connection gracefully at the TCP level, but prematurely. You can test this by starting your own Apache web server, putting a large file there, and calling "apache2ctl restart" while the client is downloading the file. Result: an incompletely downloaded file, and no exceptions.

I don't know whether this should be considered a bug in urllib2 or in the example download() function above. In fact, urllib2 could have noticed the mismatch between the total number of bytes received before EOF and the value of the Content-Length HTTP header.

Here is a version of the download() function that detects incomplete downloads based on the Content-Length header:

def download(url, fname):
    net = urllib2.urlopen(url)
    contentlen = net.info().get("Content-Length", "")
    f = open(fname, "wb")
    datalen = 0
    
    while True:
        data = net.read(4096)
        if not data:
            break
        f.write(data)
        datalen += len(data)
    
    net.close()
    f.close()

    try:
        contentlen = int(contentlen)
    except ValueError:
        contentlen = None

    if contentlen is not None and contentlen != datalen:
        raise urllib2.URLError("Incomplete download")
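
To try the check without a web server, here is a minimal sketch of the same idea run against an in-memory stand-in for the response. FakeResponse and copy_checked are hypothetical names made up for this illustration, not part of urllib2. The sketch raises IOError (the base class of urllib2.URLError in Python 2) so that it does not depend on urllib2 at all, and it assumes the response only needs info(), read() and close().

```python
import io


class FakeResponse(object):
    """Hypothetical stand-in for the object returned by urllib2.urlopen()."""

    def __init__(self, body, declared_length):
        self._body = io.BytesIO(body)
        # A plain dict is enough here: only get() is used on the headers.
        self._headers = {"Content-Length": str(declared_length)}

    def info(self):
        return self._headers

    def read(self, size):
        return self._body.read(size)

    def close(self):
        self._body.close()


def copy_checked(net, out, chunk_size=4096):
    """Copy net to out and compare the byte count with Content-Length."""
    contentlen = net.info().get("Content-Length", "")
    datalen = 0

    while True:
        data = net.read(chunk_size)
        if not data:
            break
        out.write(data)
        datalen += len(data)

    try:
        expected = int(contentlen)
    except ValueError:
        # Header missing or malformed: nothing to compare against.
        expected = None

    if expected is not None and expected != datalen:
        raise IOError("Incomplete download: got %d of %d bytes"
                      % (datalen, expected))
    return datalen


# A complete body passes the check.
copy_checked(FakeResponse(b"x" * 10, 10), io.BytesIO())

# A truncated body (4 of the 10 declared bytes) is reported as an error.
try:
    copy_checked(FakeResponse(b"x" * 4, 10), io.BytesIO())
except IOError as e:
    print("download failed: %s" % e)
```

A caller that gets this exception could, for example, delete the partial file and retry. Note that the check is only as good as the header: when the server sends no Content-Length, a truncated download still goes unnoticed.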
