Home » LibraryHippo

Category Archives: LibraryHippo

Moving LibraryHippo to Python 2.7 – OpenID edition

Now that Google has announced that Python 2.7 is fully supported on Google App Engine, I figured I should get my act in gear and make convert LibraryHippo over. I’d had a few aborted attempts earlier, but this time things are going much better.

How We Got Here – Cloning LibraryHippo

One of the requirements for moving to Python 2.7 is that the app must use the High Replication Datastore, and LibraryHippo did not. Moreover, the only way to convert to the HRD is to copy your data to a whole new application. So I bit the bullet, and made a new application from the LibraryHippo source.

When you set up a new application, you have the option of allowing federated authentication via OpenID. I’d wanted to do this for some time, so I thought, “While I’m changing the datastore, template engine, and version of Python under the hood, why not add a little complexity?”, and I picked it.

The Simplest Thing That Should Work – Google as Provider

In theory, LibraryHippo should be able to support any OpenID provider, but I wanted to start with Google as provider for a few reasons:

  • concentrating one one provider would get the site running quickly and I could add additional providers over time
  • I need to support existing users – they’ve already registered with App Engine using Google, and I want things to keep working for them, and
  • I wanted to minimize my headaches – I figure, if an organization supports both an OpenID client feature and an OpenID provider, they must work together as well as any other combination.

Even though there’s been official guidance around using OpenID in App Engine since mid-2010, I started with Nick Johnson’s article for an overview – he’s never steered me wrong before. And I’m glad I did. While the official guide is very informative, Nick broke things down really well. To quote him,

Once you’ve enabled OpenID authentication for your app, a few things change:

  • URLs generated by create_login_url without a federated_identity parameter specified will redirect to the OpenID login page for Google Accounts.
  • URLs that are protected by “login: required” in app.yaml or web.xml will result in a redirect to the path “/_ah/login_required”, with a “continue” parameter of the page originally fetched. This allows you to provide your own openid login page.
  • URLs generated by create_login_url with a federated_identity provider will redirect to the specified provider.

That sounded pretty good – the existing application didn’t use login: required anywhere, just create_login_url (without a federated_identity, of course).
So, LibraryHippo should be good to go – every time create_login_url is used to generate a URL, it’ll send users to Google Accounts. I tried it out.

It just worked, almost. When a not-logged-in user tried to access a page that required a login, she was directed to the Google Accounts page. There were cosmetic differences, but I don’t think they’re worth worrying about:

standard Google login page

standard Google login page

federated Google login page

federated Google login page

Approve access to e-mail address

After providing her credentials, the user was redirected to a page that asked her if it was okay for LibraryHippo to know her e-mail address. After that approval was granted, it was back to the LibaryHippo site and everything operated as usual.

However, login: admin is still a problem. I really shouldn’t have been surprised by this, but login: admin seems to do the same thing that login: required does – redirect to /_ah/login_required, which is not found.

Login Required Not Found

This isn’t a huge problem – it only affects administrators (me), and I could workaround by visiting a page that required any kind of login first, but it still stuck in my craw.
Fortunately, the fix is very easy – just handle /_ah/login_required. I ripped off Nick’s OpenIdLoginHandler, only instead of offering a choice of providers using users.create_login_url, this one always redirects to Google’s OpenId provider page. With this fix, admins are able to go directly from a not-logged-in state to any admin required page.

class OpenIdLoginHandler(webapp2.RequestHandler):
    def get(self):
        continue_url = self.request.GET.get('continue')
        login_url = users.create_login_url(dest_url=continue_url)

        self.redirect(login_url)        

...

handlers = [ ...
    ('/_ah/login_required$', OpenIdLoginHandler),
    ... ]

Using Other Providers

With the above solution, LibraryHippo’s authentication system has the same functionality as before – users can login with a Google account. It’s time to add support for other OpenID providers.

username.myopenid.com

I added a custom provider picker page as Nick suggested, and tried to login with my myOpenID account, with my vanity URL as provider – blair.conrad.myopenid.com. The redirect to MyOpenID worked just as it should, and once I was authenticated, I landed back at LibraryHippo, at the “family creation” page, since LibraryHippo recognized me as a newly-authenticated user, with no history.

myopenid.com

Buoyed by my success, I tried again, this time using the “direct provider federated identity” MyOpenID url – myopenid.com. It was a complete disaster.

Error: Server Error  The server encountered an error and could not complete your request. If the problem persists, please report your problem and mention this error message and the query that caused it.

Once MyOpenID had confirmed my identity, and I was redirected back to the LibraryHippo application, App Engine threw a 500 Server Error. There’s nothing in the logs – just the horrible error on the screen. In desperation, I stripped down my login handler to the bare minimum, using the example at Using Federated Authentication via OpenID in Google App Engine as my guide. I ended up with this class that reproduces the problem:

class TryLogin(webapp2.RequestHandler):
    def get(self):
        providers = {
            'Google'   : 'www.google.com/accounts/o8/id',
            'MyOpenID' : 'myopenid.com',
            'Blair Conrad\'s MyOpenID' : 'blair.conrad.myopenid.com',
            'Blair Conrad\'s WordPress' : 'blairconrad.wordpress.com',
            'Yahoo' : 'yahoo.com',
            'StackExchange': 'openid.stackexchange.com',
            }
        
        user = users.get_current_user()
        if user:  # signed in already
            self.response.out.write('Hello <em>%s</em>! [<a href="%s">sign out</a>]' % (
                user.nickname(), users.create_logout_url(self.request.uri)))
        else:     # let user choose authenticator
            self.response.out.write('Hello world! Sign in at: ')
            for name, uri in providers.items():
                self.response.out.write('[<a href="%s">%s</a>]' % (
                    users.create_login_url(dest_url= '/trylogin', federated_identity=uri), name))

...

handlers = [
    ('/trylogin$', TryLogin),
    ('/_ah/login_required$', OpenIdLoginHandler),
    ...
]

Interestingly, both Yahoo! and WordPress work, but StackExchange does not. If it weren’t for Yahoo!, I’d guess that it’s the direct provider federated identities that give App Engine problems (yes, Google is a direct provider, but I consider it to be an exception in any case).

Next steps

For now, I’m going to use the simple “just Google as federated ID provider” solution that I described above. It seems to work, and I’d rather see if I can find out why these providers fail before implementing an OpenID selector that excludes a few providers. Also, implementing the simple solution will allow me to experiment with federated IDs on the side, since I don’t know how e-mail will work with federated IDs, or how best to add federated users as families’ responsible parties. But that’s a story for another day.

Advertisements

libraryhippo.com lives

For a while now, I’ve been dithering over getting LibraryHippo its own domain name. I hadn’t, mostly because I’m a little lazy and cheap and afraid of the domain registration process.
Yesterday at the Day Job, a co-worker (whose name just might be an anagram of Yen Waster), on learning that libraryhippo.com was available, plunked a $10 bill on my desk and insisted that I buy the domain. I called him foolish and tried to return the money, but he was adamant.

So, after an hour’s work last night, and as a result of Mr. Waster’s generosity, I invite you to check out the majesty of

IE binds to id attributes, or “How I learned to love var

I recently converted the LibraryHippo “Family Status” page to use AJAX to fetch individual card statuses, instead of having the server aggregate all the statuses and send the complete summary back to the user. It was fairly straightforward, with one notable exception – Internet Explorer.

AJAX LibraryHippoWhen using Firefox or Chrome, as soon as the page loaded, the user would see a list of cards that LibraryHippo was checking, complete with throbbers. As results came in, the matching progress line would disappear and other tables would fill in, holding the results – books that have to go back, holds ready for pickup, etc. I don’t mind admitting that I was a little proud of my first foray into AJAXy web programming.

The morning after I finished the update, a co-worker signed up. Unlike everyone else I knew, she used Internet Explorer. She hit the summary page and everything stalled. The progress list was populated, the throbbers were throbbing, and… that’s it. They just kept going. Oh, and a little indicator woke up in the status bar, saying that there was an error on the page: “Object doesn’t support this property or method”. The reported line numbers didn’t match my source file, but via judicious application of alerts()s, I was able to isolate the problem to a callback that’s executed on a successful card check to update a span that holds a row count:

 function refresh_table_count(table_selector)
{  
    count = $(table_selector + ' tbody tr').length;
    $(table_selector + ' thead #count').html(count);
}

That seemed pretty innocuous, and not dissimilar from code that I had elsewhere in the <script> block. Quick web searches revealed nothing, so I resorted to cutting and renaming bits until I could see what was going on. I was down to an HTML body with a single table definition, and the function above. The error persisted. Suspicious, I renamed the count variable to c, and the problem disappeared.

At this point, I was convinced that IE’s Javascript interpreter reserved the count keyword for itself. I made this claim to a friend, who was sceptical. Eager to show him, I whipped up a quick example, and… it worked. There were no problems with the word count. I was stymied again, but not for long: my sample HTML file didn’t include an element with a "count" id. Once I added the count id, the sample broke.

It turns out that IE is actually creating a global object that matches the item’s ID! As Rick Strahl explains, the problem is a little worse than that, because the assignment on line 3 above should’ve overwritten the variable reference, but there’s “some whacky scoping going on”.

Workarounds:

  1. do away with the temporary variable (possible in this case)
  2. rename the temporary variable (always possible, but lame)
  3. use more specific id attribute values (probably a good idea in any case)
  4. use the var statement to declare all variables – this is safest and probably the easiest to remember:
function refresh_table_count(table_selector)
{
    var count = $(table_selector + ' tbody tr').length;
    $(table_selector + ' thead #count').html(count);
}

Now everything is working on the new page, and I’ve every confidence that var will help keep it so.

Automated Testing using App Engine Service APIs (and a Memcaching Memoizer)

I’m a fan of Test-driven development, and automated testing in general. As such, I’ve been trying ensure that the LibraryHippo code has an adequate set of automated tests before deploying new versions.

Importing Google App Engine Modules

Unfortunately, testing code that relies on the Google App Engine SDK is a little tricky, as I found when working with one of the LibraryHippo entities. There’s an entity called a Card, which extends db.Model and represents a user’s library card.

The Card definition is not entirely unlike this:

class Card(db.Model):
    family = db.ReferenceProperty(Family)
    number = db.StringProperty()
    name = db.StringProperty()
    pin = db.StringProperty()
    library = db.ReferenceProperty(Library)

    def pin_is_valid(self):
        return self.pin != ''

Unfortunately, testing this class isn’t as straightforward as one would hope. Suppose I have this test file:

from card import Card

def test_card_blank_pin_is_invalid():
    c = Card()
    c.pin = ''
    assert not c.pin_is_valid()

It fails miserably, spewing out a string of import errors. Here’s the tidied-up stack:

> from card import Card
> from google.appengine.ext import db
> from google.appengine.api import datastore
> from google.appengine.datastore import datastore_index
> from google.appengine.api import validation
> import yaml
E ImportError: No module named yaml

Not so good. Fortunately, it’s not that hard to find out what needs to be done in order to make the imports work:

import sys
import dev_appserver
sys.path = dev_appserver.EXTRA_PATHS + sys.path 

from card import Card

def test_card_blank_pin_is_invalid():
    c = Card()
    c.pin = ''
    assert not c.pin_is_valid()

Now Python can find all the imports it needs. For a while this was good enough, since I wasn’t testing any code that hit the datastore or actually used any of the app Engine Service APIs.

Running the App Engine Service APIs

However, I recently found a need to use Memcache to store partially-calculated results and decided (like everyone else) to write a memoizing decorator to do the job. There’s enough logic in my memoizer that I felt it needed an automated test. I tried this:

import sys
import dev_appserver
sys.path = dev_appserver.EXTRA_PATHS + sys.path 

from google.appengine.api import memcache
from gael.memcache import *

def test_memoize_formats_string_key_using_kwargs():
    values = [1, 2]
    @memoize('hippo %(animal)s zebra', 100)
    def pop_it(animal):
        return values.pop()

    result = pop_it(animal='rabbit')
    assert 2 == result

    cached_value = memcache.get('hippo rabbit zebra')
    assert 2 == cached_value

(gael is Google App Engine Library – my extension/utility package – as it grows and I gain experience, I may spin it out of LibraryHippo to be its own project.) Again, it failed miserably. Here’s a cleaned-up version of the failure:

> result = pop_it(animal='rabbit')
> cached_result = google.appengine.api.memcache.get(key_value)
> self._make_sync_call('memcache', 'Get', request, response)
> return apiproxy.MakeSyncCall(service, call, request, response)
> assert stub, 'No api proxy found for service "%s"' % service
E AssertionError: No api proxy found for service "memcache";

This was puzzling. All the imports were in place, so why the failure? This time the answer was a little harder to find, but tenacious searching paid off, and I stumbled on a Google Group post  called Unit tests / google apis without running the dev app server. The author had actually done the work to figure out what initialization code had to be run in order to get have the Service APIs work. The solution relied on hard-coded paths to the App Engine imports, but it was obvious how to combine it with the path manipulation I used earlier to produce this:

import sys

from dev_appserver import EXTRA_PATHS
sys.path = EXTRA_PATHS + sys.path 

from google.appengine.tools import dev_appserver
from google.appengine.tools.dev_appserver_main import ParseArguments
args, option_dict = ParseArguments(sys.argv) # Otherwise the option_dict isn't populated.
dev_appserver.SetupStubs('local', **option_dict)

from google.appengine.api import memcache
from gael.memcache import *

def test_memoize_formats_string_key_using_kwargs():
    values = [1, 2]
    @memoize('hippo %(animal)s zebra', 100)
    def pop_it(animal):
        return values.pop()

    result = pop_it(animal='rabbit')
    assert 2 == result

    cached_value = memcache.get('hippo rabbit zebra')
    assert 2 == cached_value

There’s an awful lot of boilerplate here, so I tried to clean up the module, moving the App Engine setup into a new module in gael:

import sys

def add_appsever_import_paths():
    from dev_appserver import EXTRA_PATHS
    sys.path = EXTRA_PATHS + sys.path 

def initialize_service_apis():
    from google.appengine.tools import dev_appserver

    from google.appengine.tools.dev_appserver_main import ParseArguments
    args, option_dict = ParseArguments(sys.argv) # Otherwise the option_dict isn't populated.
    dev_appserver.SetupStubs('local', **option_dict)

Then the top of the test file becomes

import gael.testing
gael.testing.add_appsever_import_paths()
gael.testing.initialize_service_apis()

from google.appengine.api import memcache
from gael.memcache import *

def test_memoize_formats_string_key_using_kwargs():
    ...

The Decorator

In case anyone’s curious, here’s the memoize decorator I was testing. I needed something flexible, so it takes a key argument that can either be a format string or a callable. I’ve never cared for positional format arguments – not in Python, C#, Java, nor C/C++ – so both the format string and the callable use the **kwargs to construct the key. I’d prefer to use str.format instead of the % operator, but not until App Engine moves to Python 2.6+

def memoize(key, seconds_to_keep=600):
    def decorator(func):
        def wrapper(*args, **kwargs):
            if callable(key):
                key_value = key(args, kwargs)
            else:
                key_value = key % kwargs

            cached_result = google.appengine.api.memcache.get(key_value)
            if cached_result is not None:
                logging.debug('found ' + key_value)
                return cached_result
            logging.info('calling func to get '  + key_value)
            result = func(*args, **kwargs)
            google.appengine.api.memcache.set(key_value, result, seconds_to_keep)
            return result
        return wrapper
    return decorator

Faking out Memcache – Unit Testing the Decorator

The astute among you are probably thinking that I could’ve saved myself a lot of trouble if I’d just faked out memcache and unit tested the decorator instead of trying to hook everything up for an integration test. That’s true, but at first I couldn’t figure out how to do that cleanly, and it was my first foray into memcache, so I didn’t mind working with the service directly.

Still, the unit testing approach would be better, so I looked at my decorator and rebuilt it to use a class rather than a function. It’s my first time doing this, and it’ll probably not be the last – I really like the separation between initialization and execution that the __init__/__call__ methods give me; I think it makes things a lot easier to read.

def memoize(key, seconds_to_keep=600):
    class memoize():
        def __init__(self, func):
            self.key = key
            self.seconds_to_keep=600
            self.func = func
            self.cache=google.appengine.api.memcache

        def __call__(self, *args, **kwargs):
            if callable(self.key):
                key_value = self.key(args, kwargs)
            else:
                key_value = self.key % kwargs

            cached_result = self.cache.get(key_value)
            if cached_result is not None:
                logging.debug('found ' + key_value)
                return cached_result
            logging.info('calling func to get '  + key_value)
            result = self.func(*args, **kwargs)

            self.cache.set(key_value, result, self.seconds_to_keep)
            return result

    return memoize

Then the test can inject its own caching mechanism to override self.cache:

class MyCache:
    def __init__(self):
        self.cache = {}

    def get(self, key):
        return self.cache.get(key, None)

    def set(self, key, value, *args):
        self.cache[key] = value

def test_memoize_formats_string_key_using_kwargs():
    values = [1, 2]
    @memoize('hippo %(animal)s zebra', 100)
    def pop_it(animal):
        return values.pop()

    cache = MyCache()
    pop_it.cache = cache
    result = pop_it(animal='rabbit')
    assert 2 == result

    cached_value = cache.get('hippo rabbit zebra')
    assert 2 == cached_value

And that’s it. Now I have a unit-tested implementation of my memoizer and two new helpers in my extension library.

A first look at Appstats – where’s my time spent?

After hearing about the release of Google’s App Engine SDK 1.3.1, I rushed out to try the new Appstats Event Recorder to help profile LibraryHippo. I didn’t expect great things, as I’m generally happy with the performance, with one notable exception, but I was curious about the tool.

App Engine Fan has posted a great introduction of some of the features that make Appstats a useful and powerful tool – it’s very easy to hook up, and seems to add very little overhead. In addition, it has very rich configuration options – one can omit classes of calls, fold calls together, select the amount of information retained about each call, and specify how many such records are retained (in what amounts to a circular buffer).

I didn’t use (or need) any particularly advanced configuration, so I just installed the Event Recorder and let it go.

Here’s what I saw:

Appstats result for checking one family with just Waterloo and Kitchener accounts

Checking one family, with 3 Waterloo and Kitchener library cards

I don’t have an in-depth analysis, but here are some impressions:

  • it’s pretty
  • the information is presented very well – with only minimal reading, I can see that LibraryHippo made a handful of datastore queries, as well as a series of urlfetch.Fetch calls for each library card it checked
  • I can get a quick view of what’s taking what proportion of the time – for example, the fetches easily dominate
  • total time (about 2.3 seconds) is easy to find, as well as the amount taken by the underlying API – 73 milliseconds
  • there’s something else that’s going on – 1056 ms for cpu + api – nearly half the elapsed time. I’m not sure what that means exactly

So far, no big surprises – I knew that most of the time was taken up by queries to the library web pages, but it’s very cool to see it this way, and to see how much time is taken up going to the Datastore (not much). There’s room for improvement, but 2.3 seconds is more than acceptable for this family – one of LibraryHippo’s heaviest users.

Two things did stand out, though. First, in the first group of urlfetch.Fetches, there are gaps between the fourth, fifth, and sixth calls (the ones that take 128 ms, 91ms, and 52ms) and the pattern repeats (with smaller gaps) in the second batches. This is where the retrieved records are processed and transformed into a normalized representation before rendering. The total time taken is a small, but I didn’t expect to see anything.

Second, there’s a datastore_v3.Get call before each card is checked. This is not an explicit call that LibraryHippo makes, so I clicked on the line in the graph and got a detailed view of what was going on:

Detail of implicit datastore_v3.get call

Detail of implicit datastore_v3.get call

It looks like the call is coming from the create method on line 8 of the all_libraries.py file. Curious, I click on that line and lo and behold, I get a view of the source. This very cool.

   1: #!/usr/bin/env python
   2: 
   3: import sys
   4: 
   5: modules = {}
   6: 
   7: def create(card, fetcher):
   8:     id = card.library.type
   9:     if not modules.has_key(id):
  10:         modules[id] = __import__(id)
  11:     return modules[id].LibraryAccount(card, fetcher)
  12: 
  13: def main(args=None):
  14:     if args == None:
  15:         args = sys.argv[1:]
  16:     return 0

Correlating the detail view and the source code, we see that create is handed a card parameter that has an as-yet-unresolved library instance. Accessing the library attribute on the card must complete what was a lazy load initiated when I loaded the Family entity – the cards come from the Family.card_set member.

Ordinarily, I might start investigating the gaps and the implicit gets, but I know there’s a much greater threat to LibraryHippo usability, which I confirm by checking out the record for another family’s notification:

Appstats results of checking one family, with a Region of Waterloo Account

Checking one family, with 4 Waterloo and Kitchener cards, and one Region of Waterloo

Here’s where the presentation really packs a wallop – there’s clearly a qualitative difference here. And what a difference – instead of 2.5 seconds on the horizontal axis, it’s 25 seconds, and most of the fetches are compressed to nigh-invisibility.

There are two differences between this family’s accounts and the first family’s: they have an extra Kitchener Library card that the first family didn’t, and they have a Region of Waterloo Library card. It’s the RWL card that makes the difference: you can see it being checked in the last batch of urlfetch.Fetches.
The 4 Waterloo and Kitchener library card checks are completely done after 3154ms, but the Region of Waterloo checking goes on for a further 21 seconds – for one library, and it’s not an aberration – the library web site is slow.

This post is already long enough, so I’ll use an upcoming one to talk about what this slowness means for LibraryHippo and how I’ve tried to keep it from destroying the user experience.

Cookies, Redirects, and Transcripts – Supercharging urlfetch

LibraryHippo‘s main function is fetching current library account status for patrons. Since I have no special relationship with any of the libraries involved, LibraryHippo web scrapes the libraries’ web interfaces.

The library websites issue cookies and redirects, so I needed to do something to augment the URL Fetch Python API.
I wrote a utility class that worked with the urllib2 interface, but that didn’t allow me to set the deadline argument, and I wanted to increase its value to 10 seconds. I resigned myself to wiring up a version that used urlfetch, when I found Scott Hillman’s URLOpener, which uses cookielib to follow redirects and handle any cookies met along the way.

URLOpener looked like it would work for me, with a few tweaks – it didn’t support relative URLs in redirects, it doesn’t allow one to specify headers in requests, and it lacked one feature that I really wanted – a transcript.

Why a transcript?

The libraries don’t provide a spec for their output, so I built the web scraper by trial and error, sometimes putting books on hold or taking them out just to get test data. Every once in a while something comes up that I haven’t coded for and the application breaks. In these cases, I can’t rely on the problem being reproducible, since the patron could’ve returned (or picked up) the item whose record was troublesome or some other library state might’ve changed. I need to know what the web site looked like when the problem occurred, and since the ultimate cause might be several pages back, I need a history.

I started adding a transcript feature to the URLOpener – recording every request and response including headers. As I worked, I worried about two things:

  • the fetch logic was becoming convoluted, and
  • the approach was inflexible – what if later I didn’t want to follow redirects, or to keep a transcript?

Decorators to the rescue

I decided to separate each bit of functionality – following redirects, tracking cookies, and keeping a transcript – into its own decorator, to be applied as needed. First I teased out the code that followed redirects, with my change to allow relative URLs:

class RedirectFollower():
    def __init__(self, fetcher):
        self.fetcher = fetcher

    def __call__(self, url, payload=None, method='GET', headers={},
                 allow_truncated=False, follow_redirects=False, deadline=None):
        while True:
            response = self.fetcher(url, payload, method, headers,
                                    allow_truncated, False, deadline)
            new_url = response.headers.get('location')
            if new_url:
                # Join the URLs in case the new location is relative
                url = urlparse.urljoin(url, new_url)

                # Next request should be a get, payload needed
                method = 'GET'
                payload = None
            else:
                break

        return response

After that, the cookie-handling code was easy to put in its own class:

class CookieHandler():
    def __init__(self, fetcher):
        self.fetcher = fetcher
        self.cookie_jar = Cookie.SimpleCookie()

    def __call__(self, url, payload=None, method='GET', headers={},
                 allow_truncated=False, follow_redirects=True, deadline=None):
            headers['Cookie'] = self._make_cookie_header()
            response = self.fetcher(url, payload, method, headers,
                                    allow_truncated, follow_redirects, deadline)
            self.cookie_jar.load(response.headers.get('set-cookie', ''))
            return response

    def _make_cookie_header(self):
        cookieHeader = ""
        for value in self.cookie_jar.values():
            cookieHeader += "%s=%s; " % (value.key, value.value)
        return cookieHeader

Now I had the URLOpener functionality back, just by creating an object like so:

fetch = RedirectFollower(CookieHandler(urlfetch.fetch))

Implementing transcripts

I still needed one more decorator – the transcriber.

class Transcriber():
    def __init__(self, fetcher):
        self.fetcher = fetcher
        self. transactions = []

    def __call__(self, url, payload=None, method='GET', headers={},
                 allow_truncated=False, follow_redirects=True, deadline=None):
        self.transactions.append(Transcriber._Request(vars()))
        response = self.fetcher(url, payload, method, headers,
                                    allow_truncated, follow_redirects, deadline)
        self.transactions.append(Transcriber._Response(response))
        return response

    class _Request:
        def __init__(self, values):
            self.values = dict((key, values[key])
                               for key in ('url', 'method', 'payload', 'headers'))
            self.values['time'] = datetime.datetime.now()

        def __str__(self):
            return '''Request at %(time)s:
  url = %(url)s
  method = %(method)s
  payload = %(payload)s
  headers = %(headers)s''' % self.values

    class _Response:
        def __init__(self, values):
            self.values = dict(status_code=values.status_code,
                               headers=values.headers,
                               content=values.content,
                               time=datetime.datetime.now())

        def __str__(self):
            return '''Response at %(time)s:
  status_code = %(status_code)d
  headers = %(headers)s
  content = %(content)s''' % self.values

To record all my transactions, all I have to do is wrap my fetcher one more time. When something goes wrong, I can examine the whole chain of calls and have a better shot at fixing the scraper.

fetch = Transcriber(RedirectFollower(CookieHandler(urlfetch.fetch)))
response = fetch(patron_account_url)
try:
    process(response)
except:
    logging.error('error checking account for ' + patron, exc_info=True)
    for action in fetch.transactions:
            logging.debug(action)

Extra-fine logging without rewriting fetch

The exercise of transforming URLOpener into a series of decorators may seem like just that, an exercise that doesn’t provide real value, but provides a powerful debugging tool for your other decorators. By moving the Transcriber to the inside of the chain of decorators, you can see each fetch that’s made due to a redirect, and which cookies are set when:

fetch = RedirectFollower(CookieHandler(Transcriber(urlfetch.fetch)))

The only trick is that the Transcriber.transactions attribute isn’t available from the outermost decorator. This is easily solved by extracting a base class and having it delegate to the wrapped item.

class _BaseWrapper:
    def __init__(self, fetcher):
        self.fetcher = fetcher

    def __getattr__(self, name):
        return getattr(self.fetcher, name)

Then the other decorators extend _BaseWrapper, either losing their __init__ or having them modified. For example, CookieHandler becomes:

class CookieHandler(_BaseWrapper):
    def __init__(self, fetcher):
        _BaseWrapper.__init__(self, fetcher)
        self.cookie_jar = Cookie.SimpleCookie()
...

And then the following code works, and helped me diagnose a small bug I’d originally had in my RedirectFollower. As a bonus, if I ever need to get at CookieHandler.cookie_jar, it’s right there too.

fetch = RedirectFollower(CookieHandler(Transcriber(urlfetch.fetch)))
fetch(patron_account_url)
for action in fetch.transactions:
    logging.debug(action)

Meet LibraryHippo

I enjoy reading and using my local libraries. My wife and I have four library cards between us – one each for the Waterloo Public Library, one for the Kitchener Public Library, and one for the Region of Waterloo Library. Using our cards, we were able to find all kinds of books to read and DVDs to watch, but organizing our borrowing was a little annoying, since:

  • we had to log into four different library accounts to get an overview of our current borrowings and holds,
  • each account had a long, hard-to-remember ID, and
  • the library would send e-mail when items were overdue, not in time to take them back.

I’d been using Library Elf to manage our cards, but they’d recently moved to a for-pay model, so I combined a sense of frugality with the desire to build something using a new technology and created LibraryHippo, a Google App Engine-powered web application that takes care of my library cards.

LibraryHippo logoLibraryHippo:

  • manages multiple cards per family
  • shows a comprehensive overview of a family’s current library status
  • sends e-mail every morning if
    • a family has items that are nearly due
    • there are items ready to be picked up, or
    • there’s a problem checking an account

Feel free to check out the project, hosted on Google Code. A fair number of my future posts will talk about the adventures I’ve had implementing and improving LibraryHippo.