Monthly Archives: February 2010

Automated Testing using App Engine Service APIs (and a Memcaching Memoizer)

I’m a fan of Test-driven development, and automated testing in general. As such, I’ve been trying to ensure that the LibraryHippo code has an adequate set of automated tests before deploying new versions.

Importing Google App Engine Modules

Unfortunately, testing code that relies on the Google App Engine SDK is a little tricky, as I found when working with one of the LibraryHippo entities. There’s an entity called a Card, which extends db.Model and represents a user’s library card.

The Card definition is not entirely unlike this:

class Card(db.Model):
    family = db.ReferenceProperty(Family)
    number = db.StringProperty()
    name = db.StringProperty()
    pin = db.StringProperty()
    library = db.ReferenceProperty(Library)

    def pin_is_valid(self):
        return self.pin != ''

Unfortunately, testing this class isn’t as straightforward as one would hope. Suppose I have this test file:

from card import Card

def test_card_blank_pin_is_invalid():
    c = Card()
    c.pin = ''
    assert not c.pin_is_valid()

It fails miserably, spewing out a string of import errors. Here’s the tidied-up stack:

> from card import Card
> from google.appengine.ext import db
> from google.appengine.api import datastore
> from google.appengine.datastore import datastore_index
> from google.appengine.api import validation
> import yaml
E ImportError: No module named yaml

Not so good. Fortunately, it’s not that hard to find out what needs to be done in order to make the imports work:

import sys
import dev_appserver
sys.path = dev_appserver.EXTRA_PATHS + sys.path 

from card import Card

def test_card_blank_pin_is_invalid():
    c = Card()
    c.pin = ''
    assert not c.pin_is_valid()

Now Python can find all the imports it needs. For a while this was good enough, since I wasn’t testing any code that hit the datastore or actually used any of the App Engine Service APIs.

Running the App Engine Service APIs

However, I recently found a need to use Memcache to store partially-calculated results and decided (like everyone else) to write a memoizing decorator to do the job. There’s enough logic in my memoizer that I felt it needed an automated test. I tried this:

import sys
import dev_appserver
sys.path = dev_appserver.EXTRA_PATHS + sys.path 

from google.appengine.api import memcache
from gael.memcache import *

def test_memoize_formats_string_key_using_kwargs():
    values = [1, 2]
    @memoize('hippo %(animal)s zebra', 100)
    def pop_it(animal):
        return values.pop()

    result = pop_it(animal='rabbit')
    assert 2 == result

    cached_value = memcache.get('hippo rabbit zebra')
    assert 2 == cached_value

(gael is the Google App Engine Library – my extension/utility package. As it grows and I gain experience, I may spin it out of LibraryHippo into its own project.) Again, the test failed miserably. Here’s a cleaned-up version of the failure:

> result = pop_it(animal='rabbit')
> cached_result = google.appengine.api.memcache.get(key_value)
> self._make_sync_call('memcache', 'Get', request, response)
> return apiproxy.MakeSyncCall(service, call, request, response)
> assert stub, 'No api proxy found for service "%s"' % service
E AssertionError: No api proxy found for service "memcache";

This was puzzling. All the imports were in place, so why the failure? This time the answer was a little harder to find, but tenacious searching paid off, and I stumbled on a Google Group post called Unit tests / google apis without running the dev app server. The author had done the work to figure out what initialization code had to be run to make the Service APIs work. The solution relied on hard-coded paths to the App Engine imports, but it was obvious how to combine it with the path manipulation I used earlier to produce this:

import sys

from dev_appserver import EXTRA_PATHS
sys.path = EXTRA_PATHS + sys.path 

from google.appengine.tools import dev_appserver
from google.appengine.tools.dev_appserver_main import ParseArguments
args, option_dict = ParseArguments(sys.argv) # Otherwise the option_dict isn't populated.
dev_appserver.SetupStubs('local', **option_dict)

from google.appengine.api import memcache
from gael.memcache import *

def test_memoize_formats_string_key_using_kwargs():
    values = [1, 2]
    @memoize('hippo %(animal)s zebra', 100)
    def pop_it(animal):
        return values.pop()

    result = pop_it(animal='rabbit')
    assert 2 == result

    cached_value = memcache.get('hippo rabbit zebra')
    assert 2 == cached_value

There’s an awful lot of boilerplate here, so I tried to clean up the module, moving the App Engine setup into a new module in gael:

import sys

def add_appserver_import_paths():
    from dev_appserver import EXTRA_PATHS
    sys.path = EXTRA_PATHS + sys.path 

def initialize_service_apis():
    from google.appengine.tools import dev_appserver

    from google.appengine.tools.dev_appserver_main import ParseArguments
    args, option_dict = ParseArguments(sys.argv) # Otherwise the option_dict isn't populated.
    dev_appserver.SetupStubs('local', **option_dict)

Then the top of the test file becomes

import gael.testing
gael.testing.add_appserver_import_paths()
gael.testing.initialize_service_apis()

from google.appengine.api import memcache
from gael.memcache import *

def test_memoize_formats_string_key_using_kwargs():
    ...

The Decorator

In case anyone’s curious, here’s the memoize decorator I was testing. I needed something flexible, so it takes a key argument that can be either a format string or a callable. I’ve never cared for positional format arguments – not in Python, C#, Java, or C/C++ – so both the format string and the callable use **kwargs to construct the key. I’d prefer to use str.format instead of the % operator, but not until App Engine moves to Python 2.6+.

import logging

import google.appengine.api.memcache

def memoize(key, seconds_to_keep=600):
    def decorator(func):
        def wrapper(*args, **kwargs):
            if callable(key):
                key_value = key(args, kwargs)
            else:
                key_value = key % kwargs

            cached_result = google.appengine.api.memcache.get(key_value)
            if cached_result is not None:
                logging.debug('found ' + key_value)
                return cached_result
            logging.info('calling func to get '  + key_value)
            result = func(*args, **kwargs)
            google.appengine.api.memcache.set(key_value, result, seconds_to_keep)
            return result
        return wrapper
    return decorator
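
To show the callable-key path, here’s a minimal usage sketch. The card_status_key and fetch_card_status functions and the card_number argument are hypothetical names, not LibraryHippo code; the callable receives the same (args, kwargs) pair the wrapper collects:

def card_status_key(args, kwargs):
    # build the cache key from the keyword arguments
    return 'card-status %(card_number)s' % kwargs

@memoize(card_status_key, seconds_to_keep=300)
def fetch_card_status(card_number):
    # stands in for an expensive library-website lookup
    return 'status for card ' + card_number

fetch_card_status(card_number='123')  # computed, then cached
fetch_card_status(card_number='123')  # served from memcache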

Faking out Memcache – Unit Testing the Decorator

The astute among you are probably thinking that I could’ve saved myself a lot of trouble if I’d just faked out memcache and unit tested the decorator instead of trying to hook everything up for an integration test. That’s true, but at first I couldn’t figure out how to do that cleanly, and it was my first foray into memcache, so I didn’t mind working with the service directly.

Still, the unit testing approach would be better, so I looked at my decorator and rebuilt it to use a class rather than a function. It’s my first time doing this, and it probably won’t be the last – I really like the separation between initialization and execution that the __init__/__call__ methods give me; I think it makes things a lot easier to read.

def memoize(key, seconds_to_keep=600):
    class memoize():
        def __init__(self, func):
            self.key = key
            self.seconds_to_keep = seconds_to_keep
            self.func = func
            self.cache = google.appengine.api.memcache

        def __call__(self, *args, **kwargs):
            if callable(self.key):
                key_value = self.key(args, kwargs)
            else:
                key_value = self.key % kwargs

            cached_result = self.cache.get(key_value)
            if cached_result is not None:
                logging.debug('found ' + key_value)
                return cached_result
            logging.info('calling func to get '  + key_value)
            result = self.func(*args, **kwargs)

            self.cache.set(key_value, result, self.seconds_to_keep)
            return result

    return memoize

Then the test can inject its own caching mechanism to override self.cache:

class MyCache:
    def __init__(self):
        self.cache = {}

    def get(self, key):
        return self.cache.get(key, None)

    def set(self, key, value, *args):
        self.cache[key] = value

def test_memoize_formats_string_key_using_kwargs():
    values = [1, 2]
    @memoize('hippo %(animal)s zebra', 100)
    def pop_it(animal):
        return values.pop()

    cache = MyCache()
    pop_it.cache = cache
    result = pop_it(animal='rabbit')
    assert 2 == result

    cached_value = cache.get('hippo rabbit zebra')
    assert 2 == cached_value

And that’s it. Now I have a unit-tested implementation of my memoizer and two new helpers in my extension library.

A first look at Appstats – where’s my time spent?

After hearing about the release of Google’s App Engine SDK 1.3.1, I rushed out to try the new Appstats Event Recorder to help profile LibraryHippo. I didn’t expect great things, as I’m generally happy with the performance, with one notable exception, but I was curious about the tool.

App Engine Fan has posted a great introduction to some of the features that make Appstats a useful and powerful tool – it’s very easy to hook up, and it seems to add very little overhead. In addition, it has very rich configuration options – one can omit classes of calls, fold calls together, select the amount of information retained about each call, and specify how many such records are retained (in what amounts to a circular buffer).

I didn’t use (or need) any particularly advanced configuration, so I just installed the Event Recorder and let it go.
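
For reference, hooking up the Event Recorder amounts to a few lines in appengine_config.py. This is a sketch of the recipe from the Appstats documentation, not LibraryHippo’s actual file:

# appengine_config.py
def webapp_add_wsgi_middleware(app):
    # wrap the application so Appstats can record service calls
    from google.appengine.ext.appstats import recording
    return recording.appstats_wsgi_middleware(app)

If I understand the docs correctly, the richer configuration options mentioned above also live in this file, as appstats_* module-level variables.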

Here’s what I saw:

[Screenshot: Appstats results for checking one family, with 3 Waterloo and Kitchener library cards]

I don’t have an in-depth analysis, but here are some impressions:

  • it’s pretty
  • the information is presented very well – with only minimal reading, I can see that LibraryHippo made a handful of datastore queries, as well as a series of urlfetch.Fetch calls for each library card it checked
  • I can get a quick view of what’s taking what proportion of the time – for example, the fetches easily dominate
  • total time (about 2.3 seconds) is easy to find, as well as the amount taken by the underlying API – 73 milliseconds
  • there’s something else that’s going on – 1056 ms for cpu + api – nearly half the elapsed time. I’m not sure what that means exactly

So far, no big surprises – I knew that most of the time was taken up by queries to the library web pages, but it’s very cool to see it this way, and to see how much time is taken up going to the Datastore (not much). There’s room for improvement, but 2.3 seconds is more than acceptable for this family – one of LibraryHippo’s heaviest users.

Two things did stand out, though. First, in the first group of urlfetch.Fetches, there are gaps between the fourth, fifth, and sixth calls (the ones that take 128 ms, 91 ms, and 52 ms), and the pattern repeats (with smaller gaps) in the second batch. This is where the retrieved records are processed and transformed into a normalized representation before rendering. The total time taken is small, but I didn’t expect to see anything there at all.

Second, there’s a datastore_v3.Get call before each card is checked. This is not an explicit call that LibraryHippo makes, so I clicked on the line in the graph and got a detailed view of what was going on:

[Screenshot: Detail of the implicit datastore_v3.Get call]

It looks like the call is coming from the create method, on line 8 of the all_libraries.py file. Curious, I clicked on that line and, lo and behold, I got a view of the source. This is very cool.

   1: #!/usr/bin/env python
   2: 
   3: import sys
   4: 
   5: modules = {}
   6: 
   7: def create(card, fetcher):
   8:     id = card.library.type
   9:     if not modules.has_key(id):
  10:         modules[id] = __import__(id)
  11:     return modules[id].LibraryAccount(card, fetcher)
  12: 
  13: def main(args=None):
  14:     if args == None:
  15:         args = sys.argv[1:]
  16:     return 0

Correlating the detail view and the source code, we see that create is handed a card parameter that has an as-yet-unresolved library instance. Accessing the library attribute on the card must complete what was a lazy load initiated when I loaded the Family entity – the cards come from the Family.card_set member.
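
If I wanted to eliminate those implicit gets, one approach would be to read the stored Key without dereferencing the ReferenceProperty, then fetch all the libraries in one batch call. This sketch uses the real get_value_for_datastore API, but prefetch_libraries is a hypothetical helper of my own:

from google.appengine.ext import db

def prefetch_libraries(cards):
    # get_value_for_datastore returns the referenced entity's Key
    # without triggering an implicit datastore_v3.Get
    keys = [Card.library.get_value_for_datastore(card) for card in cards]
    # one batch Get instead of one lazy Get per card
    return db.get(keys)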

Ordinarily, I might start investigating the gaps and the implicit gets, but I know there’s a much greater threat to LibraryHippo usability, which I confirm by checking out the record for another family’s notification:

[Screenshot: Appstats results for checking one family, with 4 Waterloo and Kitchener cards and one Region of Waterloo card]

Here’s where the presentation really packs a wallop – there’s clearly a qualitative difference here. And what a difference – instead of 2.5 seconds on the horizontal axis, it’s 25 seconds, and most of the fetches are compressed to nigh-invisibility.

There are two differences between this family’s accounts and the first family’s: they have an extra Kitchener Library card that the first family didn’t, and they have a Region of Waterloo Library card. It’s the RWL card that makes the difference: you can see it being checked in the last batch of urlfetch.Fetches. The 4 Waterloo and Kitchener library card checks are completely done after 3154 ms, but the Region of Waterloo checking goes on for a further 21 seconds – for one library. And it’s not an aberration – the library web site is just slow.

This post is already long enough, so I’ll use an upcoming one to talk about what this slowness means for LibraryHippo and how I’ve tried to keep it from destroying the user experience.

Waiting pays off again – Google App Engine gets Datastore retries

Sometimes when we wait a little, the universe provides for us. Last year, I was holding off taking LibraryHippo live because it’s almost useless without daily notifications. At the time, Google App Engine had no scheduled tasks. I’d just resigned myself to soliciting users and setting up a cron job on one of my own computers to trigger the notifications, or use some other kind of hackery. Right after I made this decision, the Google App Engine team announced that Scheduled Tasks were available. Well, it’s happened again.

I really enjoy working on the Google App Engine framework, but one of the more frustrating aspects (lately) has been the timeouts I get when reading from the Datastore. I’ve recently taken some steps that will reduce the impact on users, but was on the verge of implementing an automatic Datastore retry mechanism.
Fortunately, it looks like I may not have to, with the arrival of App Engine SDK 1.3.1:

“App Engine now automatically retries all datastore calls (with the exception of transaction commits) when your application encounters a datastore error caused by being unable to reach Bigtable. Datastore retries automatically builds in what many of you have been doing in your code already, and our tests have shown it drastically reduces the number of errors your application experiences (by up to 3-4x error reduction for puts, 10-30x for gets).”
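
For the curious, the hand-rolled mechanism I was contemplating would have looked something like this sketch. The retry_on_timeout decorator and its parameters are my own invention; only the SDK’s db.Timeout is real:

import logging
import time

from google.appengine.ext import db

def retry_on_timeout(attempts=3, seconds_between=0.1):
    def decorator(func):
        def wrapper(*args, **kwargs):
            # retry all but the last attempt; on the last, let Timeout escape
            for attempt in range(attempts - 1):
                try:
                    return func(*args, **kwargs)
                except db.Timeout:
                    logging.warning('Timeout in %s, attempt %d',
                                    func.__name__, attempt + 1)
                    time.sleep(seconds_between)
            return func(*args, **kwargs)
        return wrapper
    return decorator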

There are other Datastore improvements, including Datastore Query Cursors and unlimited result set size for queries (previously the maximum was 1000). There’s also instrumentation for the Python version of the SDK, and unit-testing support for Java.
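
I haven’t tried cursors yet, but judging from the release notes, resuming a query looks something like this sketch (assuming the Family model from earlier has a name property):

query = Family.all().order('name')
first_page = query.fetch(20)
cursor = query.cursor()  # an opaque string, safe to stash away

# later, perhaps in a different request:
query = Family.all().order('name')
query.with_cursor(cursor)
next_page = query.fetch(20)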

I’m keen to try out the Python instrumentation, but I’m pretty sure I already know where my bottleneck is…

Using Subversion to Evangelize PowerShell

I’ve never been really comfortable with the Windows Command prompt – whenever I can, I grab Cygwin to give myself a more familiar (and powerful) command-line environment. I really appreciate the tools included with the Unix command shells, as well as the easy composability of the utilities that come with Unix.

Unsurprisingly, I was immediately attracted to PowerShell – a powerful replacement shell for Windows, with .NET integration, a Unix-like pipeline that works on objects rather than strings, and plenty of built-in cmdlets. I installed it and tried to work with it. There were familiar commands (many of the Unix and Windows command names are aliased to their PowerShell equivalents), and these both helped and hindered – it was easy to find a command, but the options were slightly off, so the commands my fingers knew produced errors or unexpected results. Eventually I fell off the wagon, reverting to cmd.exe. Over the next year or so, I would return to PowerShell, only to stop using it again.

A few months ago, I was talking to the Guy in the Next Cubicle. He was also interested in PowerShell. We talked a little about the few scripts we’d made (neither of us had written new cmdlets), and shared them. Having someone to talk to about PowerShell sparked something, and gave me the impetus to change my shortcuts to start PowerShell instead of cmd.exe. Unfortunately, sharing our scripts was awkward – usually via e-mail or instant messaging. In addition, as we improved our scripts, adding some to perform common tasks, we found that the answer to some questions, when people came to us for help, was, “Well, if you had this PowerShell script, you’d just…”, with no convenient way to get them the scripts.

Finally we decided to do something about it, and initiated a plan to help share our setup among our coworkers. We wanted a system that:

  • was easy to adopt,
  • made it easy to get and share scripts, and
  • was customizable – so people could have different prompts, for example

Inspired by Joey Hess’s keeping your life in svn, we tried putting the whole profile directory under Subversion control. It was an easy choice, since everyone at the Day Job is already using Subversion to wrangle our source code.

Getting Started

Getting started with the profile is almost as easy as running

cd %USERPROFILE%\My Documents
svn checkout http://svn.dayjob.com/path/to/PowerShellProfile/trunk WindowsPowerShell

After this, the user will have a directory inside their My Documents directory that looks something like this:

[Screenshot: profile directory skeleton]

It’s not quite usable, though. By default PowerShell doesn’t allow scripts to be run, so the new profile will be of no benefit to users. To ease their pain, the profile directory contains a setup_powershell.bat which runs

powershell -Command "Set-ExecutionPolicy RemoteSigned"

After running setup_powershell.bat, all a user has to do is start PowerShell and they will benefit from the new profile.

Inside the WindowsPowerShell Directory

The first item of note inside the WindowsPowerShell directory is the Includes directory, which is populated with .ps1 files. Each .ps1 file contains functions to be loaded into memory and made available for the user’s session. At startup, each of these .ps1 files is dot-sourced (using the dot-source from a function trick I talked about last time).

Next is the Scripts directory, which is added to the user’s $env:PATH. Each of the .ps1 files in the directory contains a standalone script that a user might choose to execute as they work. We have a number of Day Job-specific scripts, as well as some Subversion helpers and two meta-scripts designed to make it easier to work with the PowerShell profile.

Since new users won’t be familiar with all the scripts in the profile, and because new scripts might be added at any time, we include a script to provide a quick synopsis of the available scripts: Get-ProfileHelp. It scans the Scripts directory, printing out, in an easy-to-read table, the Synopsis from the top of each script.

#<#
#.Synopsis
#  Get help for the PowerShellProfile scripts
##>
Get-ScriptDirectory | Get-ChildItem -include "*.ps1" -recurse |
 ForEach-Object {
    $name = $_.Name; $name = $name.Remove($name.Length-4)
    $synopsis = ""
    $content = (Get-Content $_.PSPath)
    for ($i = 0; $i -le ($content.length - 1); $i += 1)
    {
       if ( $content[$i] -like '*.Synopsis*' )
       {
           $synopsis = $content[$i+1].Substring(1).Trim()
           break
       }
    }
    $o = New-Object Object
    $o | Add-Member NoteProperty Name $name
    $o | Add-Member NoteProperty Synopsis $synopsis
    $o
} | Format-Table -AutoSize

The Get-ScriptDirectory function just finds the location of the currently executing script. We’ll see it later. Running the script gives output like this:

Name                     Synopsis
----                     --------
Copy-Branch              Copy an SVN branch, and optionally switch to the new branch
Get-ProfileHelp          Get help for the PowerShellProfile scripts
Get-SslCertificate       Load and display an SSL Certificate from a host
Import-ResharperSettings Import Resharper 4.5 settings from a file
Merge-Branch             Merge SVN commits back into the current working directory
Switch-Branch            Switch to a new SVN branch
Update-Profile           Get the latest version of the PowerShell profile from SVN

Rather than forcing users to navigate to the profile directory and run an svn command, the Update-Profile.ps1 script will automatically update the profile source:

#<#
# .SYNOPSIS
#     Get the latest version of the PowerShell profile from SVN
##>
svn update (Split-Path $profile)

User Profile Directories

In addition to the Includes and Scripts directories, each user of the PowerShell profile can have their own directory full of customizations. The name of the directory is taken from the $env:USERNAME variable. On startup, if the directory exists, any Includes and Scripts directories inside it are processed – dot-sourced or added to the path, respectively. This allows users to have their own personal scripts and functions. In addition, the profile.ps1 from the directory is dot-sourced.

If a user runs PowerShell and doesn’t already have a profile directory, a welcome message is printed to the screen, explaining basic usage of the profile. Then a profile directory is created and populated with an empty profile.ps1 to get the user started and to keep them from being welcomed again.

Some users choose to commit their personal profile directories to the repository, and some don’t – there’s no requirement either way. If someone chose, they could even use an svn:externals definition on the WindowsPowerShell directory and host their personal directory in another part of the repository, or even in a different repository.
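
For example, pulling a hypothetical jdoe directory from elsewhere in the repository might look like this (the URL is made up, following the earlier examples):

cd "%USERPROFILE%\My Documents\WindowsPowerShell"
svn propset svn:externals "jdoe http://svn.dayjob.com/path/to/users/jdoe" .
svn update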

Putting it All Together

Finally we see the Microsoft.PowerShell_profile.ps1 file that orchestrates all this:

# Will turn on extra output to help debug profile-loading.
# Don't check in as "true"
$verbose = $false

# A convenience function to get the directory the current script lives in
# - useful for importing from relative paths
function Get-ScriptDirectory
{
  $Invocation = (Get-Variable MyInvocation -Scope 1).Value
  Split-Path $Invocation.MyCommand.Path
}

function Include-ProfileDirectory([string] $directory)
{
    # Load every file in the Includes subdirectory -
    # hopefully they can be loaded in any order.
    # The Includes directory should contain files that define functions and 
    # filters to be executed later, but not scripts that need to do
    # something when the file is sourced.
    if ( Test-Path ($directory + '\Includes') )
    {
        Get-ChildItem -Path:($directory + '\Includes') -Filter:*.ps1 | ForEach-Object {
            if ( $verbose )
            {
                Write-Output ("importing " + $_.PSPath)
            }
            . $_.PSPath
        }
    }

    # The Scripts directory should contain PowerShell scripts that someone
    # might want to be executed, so we'll add it to our path.
    if ( Test-Path "$directory\Scripts" )
    {
        $env:PATH = "$($env:PATH);$directory\Scripts"
    }
}

# dot-source the Include-ProfileDirectory function. If we run it 
# directly,  all the included functions will be scoped inside the
# Include-ProfileDirectory command, and inaccessible to the user.
. Include-ProfileDirectory(Get-ScriptDirectory)

# Look for user-specific customizations. If they're there, load them.
$userProfileDir = ((Get-ScriptDirectory) + '\' + $env:USERNAME)
if ( Test-Path $userProfileDir )
{
    if ( $verbose )
    {
        Write-Output "including $userProfileDir"
    }
    . Include-ProfileDirectory($userProfileDir)

    $userProfile = ($userProfileDir + '\profile.ps1')
    if ( Test-Path $userProfile )
    {
        . $userProfile
    }
}
else
{
    Write-Host -foregroundColor yellow -backgroundColor darkblue @"

Welcome to the DayJob PowerShell Profile.  It looks like this is your
first time here, so I'll create a new profile for you. This profile
will be called

   $userProfile

If you want to customize your PowerShell experience, you can edit this
file. Eventually you may want to modify files in the containing directories,
but keep in mind that those changes will affect other users.

Have fun!

"@

    New-Item -path $userProfile -itemType "file" -Force | Out-Null
}