
Tag Archives: Python

Moving LibraryHippo to Python 2.7 – OpenID edition

Now that Google has announced that Python 2.7 is fully supported on Google App Engine, I figured I should get my act in gear and convert LibraryHippo over. I’d had a few aborted attempts earlier, but this time things are going much better.

How We Got Here – Cloning LibraryHippo

One of the requirements for moving to Python 2.7 is that the app must use the High Replication Datastore, and LibraryHippo did not. Moreover, the only way to convert to the HRD is to copy your data to a whole new application. So I bit the bullet, and made a new application from the LibraryHippo source.

When you set up a new application, you have the option of allowing federated authentication via OpenID. I’d wanted to do this for some time, so I thought, “While I’m changing the datastore, template engine, and version of Python under the hood, why not add a little complexity?”, and I picked it.

The Simplest Thing That Should Work – Google as Provider

In theory, LibraryHippo should be able to support any OpenID provider, but I wanted to start with Google as provider for a few reasons:

  • concentrating on one provider would get the site running quickly, and I could add additional providers over time
  • I need to support existing users – they’ve already registered with App Engine using Google, and I want things to keep working for them, and
  • I wanted to minimize my headaches – I figure, if an organization supports both an OpenID client feature and an OpenID provider, they must work together as well as any other combination.

Even though there’s been official guidance around using OpenID in App Engine since mid-2010, I started with Nick Johnson’s article for an overview – he’s never steered me wrong before. And I’m glad I did. While the official guide is very informative, Nick broke things down really well. To quote him,

Once you’ve enabled OpenID authentication for your app, a few things change:

  • URLs generated by create_login_url without a federated_identity parameter specified will redirect to the OpenID login page for Google Accounts.
  • URLs that are protected by “login: required” in app.yaml or web.xml will result in a redirect to the path “/_ah/login_required”, with a “continue” parameter of the page originally fetched. This allows you to provide your own openid login page.
  • URLs generated by create_login_url with a federated_identity provider will redirect to the specified provider.

That sounded pretty good – the existing application didn’t use login: required anywhere, just create_login_url (without a federated_identity, of course).
So, LibraryHippo should be good to go – every time create_login_url is used to generate a URL, it’ll send users to Google Accounts. I tried it out.

It just worked, almost. When a not-logged-in user tried to access a page that required a login, she was directed to the Google Accounts page. There were cosmetic differences, but I don’t think they’re worth worrying about:

standard Google login page

federated Google login page

Approve access to e-mail address

After providing her credentials, the user was redirected to a page that asked her if it was okay for LibraryHippo to know her e-mail address. After that approval was granted, it was back to the LibraryHippo site and everything operated as usual.

However, login: admin is still a problem. I really shouldn’t have been surprised by this, but login: admin seems to do the same thing that login: required does – redirect to /_ah/login_required, which is not found.

Login Required Not Found

This isn’t a huge problem – it only affects administrators (me), and I could work around it by visiting a page that required any kind of login first, but it still stuck in my craw.
Fortunately, the fix is very easy – just handle /_ah/login_required. I ripped off Nick’s OpenIdLoginHandler, only instead of offering a choice of providers using users.create_login_url, this one always redirects to Google’s OpenID provider page. With this fix, admins are able to go directly from a not-logged-in state to any admin-required page.

import webapp2
from google.appengine.api import users

class OpenIdLoginHandler(webapp2.RequestHandler):
    def get(self):
        # Redirect straight to Google's OpenID login page, preserving
        # the page the user was originally trying to reach.
        continue_url = self.request.GET.get('continue')
        login_url = users.create_login_url(dest_url=continue_url)

        self.redirect(login_url)

...

handlers = [ ...
    ('/_ah/login_required$', OpenIdLoginHandler),
    ... ]

Using Other Providers

With the above solution, LibraryHippo’s authentication system has the same functionality as before – users can login with a Google account. It’s time to add support for other OpenID providers.

username.myopenid.com

I added a custom provider picker page as Nick suggested, and tried to login with my myOpenID account, with my vanity URL as provider – blair.conrad.myopenid.com. The redirect to MyOpenID worked just as it should, and once I was authenticated, I landed back at LibraryHippo, at the “family creation” page, since LibraryHippo recognized me as a newly-authenticated user, with no history.

myopenid.com

Buoyed by my success, I tried again, this time using the “direct provider federated identity” MyOpenID URL – myopenid.com. It was a complete disaster.

Error: Server Error  The server encountered an error and could not complete your request. If the problem persists, please report your problem and mention this error message and the query that caused it.

Once MyOpenID had confirmed my identity, and I was redirected back to the LibraryHippo application, App Engine threw a 500 Server Error. There’s nothing in the logs – just the horrible error on the screen. In desperation, I stripped down my login handler to the bare minimum, using the example at Using Federated Authentication via OpenID in Google App Engine as my guide. I ended up with this class that reproduces the problem:

class TryLogin(webapp2.RequestHandler):
    def get(self):
        providers = {
            'Google'   : 'www.google.com/accounts/o8/id',
            'MyOpenID' : 'myopenid.com',
            'Blair Conrad\'s MyOpenID' : 'blair.conrad.myopenid.com',
            'Blair Conrad\'s WordPress' : 'blairconrad.wordpress.com',
            'Yahoo' : 'yahoo.com',
            'StackExchange': 'openid.stackexchange.com',
            }
        
        user = users.get_current_user()
        if user:  # signed in already
            self.response.out.write('Hello <em>%s</em>! [<a href="%s">sign out</a>]' % (
                user.nickname(), users.create_logout_url(self.request.uri)))
        else:     # let user choose authenticator
            self.response.out.write('Hello world! Sign in at: ')
            for name, uri in providers.items():
                self.response.out.write('[<a href="%s">%s</a>]' % (
                    users.create_login_url(dest_url='/trylogin', federated_identity=uri), name))

...

handlers = [
    ('/trylogin$', TryLogin),
    ('/_ah/login_required$', OpenIdLoginHandler),
    ...
]

Interestingly, both Yahoo! and WordPress work, but StackExchange does not. If it weren’t for Yahoo!, I’d guess that it’s the direct provider federated identities that give App Engine problems (yes, Google is a direct provider, but I consider it to be an exception in any case).

Next steps

For now, I’m going to use the simple “just Google as federated ID provider” solution that I described above. It seems to work, and I’d rather see if I can find out why these providers fail before implementing an OpenID selector that excludes a few providers. Also, implementing the simple solution will allow me to experiment with federated IDs on the side, since I don’t know how e-mail will work with federated IDs, or how best to add federated users as families’ responsible parties. But that’s a story for another day.


Prime Time Programming, Part 2

Last time I presented a truly horrible prime number generator I was using for Project Euler problems. Then I presented a revamped generator that used trial division. By adding various refinements to the generator, we saw the time required to generate primes less than 10⁷ shrink from hours to 123 seconds. Today I’ll describe a different approach.

Attempt 2: Sieve of Eratosthenes

The Sieve of Eratosthenes is another method for finding prime numbers.
The algorithm is basically this:

  1. make a big array of numbers, from 2 to the highest prime you’re hoping to find
  2. look for the next number that’s not crossed off
  3. this number is your next prime
  4. cross off every multiple of the number you just found
  5. so long as the prime you just found is less than the square root of your limit, go to step 2
  6. the uncrossed numbers are prime

Suppose we want primes less than or equal to 20. We start with this list:

2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

The first uncrossed-off number is 2. That’s our first prime. Cross off all the multiples of 2 (crossed-off numbers are shown in brackets):

2 3 [4] 5 [6] 7 [8] 9 [10] 11 [12] 13 [14] 15 [16] 17 [18] 19 [20]

The next uncrossed-off number is 3. Cross off the multiples:

2 3 [4] 5 [6] 7 [8] [9] [10] 11 [12] 13 [14] [15] [16] 17 [18] 19 [20]

Next, we have 5. Cross off its multiples (actually, they’re already crossed off because they’re all also multiples of either 2 or 3):

2 3 [4] 5 [6] 7 [8] [9] [10] 11 [12] 13 [14] [15] [16] 17 [18] 19 [20]

5 is greater than √20, so stop looping. All the uncrossed-off numbers are prime:

2 3 [4] 5 [6] 7 [8] [9] [10] 11 [12] 13 [14] [15] [16] 17 [18] 19 [20]

A “borrowed” implementation

The Wikipedia article even provides a Python implementation, which I reproduce here in slightly altered form:

def eratosthenes_sieve(m):
    # Create a candidate list within which non-primes will be
    # marked as None; only candidates below sqrt(m) need be checked. 
    candidates = range(m+1)
    fin = int(m**0.5)
 
    # Loop over the candidates, marking out each multiple.
    for i in xrange(2, fin+1):
        if not candidates[i]:
            continue
 
        candidates[2*i::i] = [None] * (m//i - 1)
 
    # Filter out non-primes and return the list.
    return [i for i in candidates[2:] if i]
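
As a quick sanity check (run at an ordinary Python 2 prompt), it reproduces the worked example above:

print eratosthenes_sieve(20)   # prints [2, 3, 5, 7, 11, 13, 17, 19]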

I ran this to find my standard list of primes less than 10⁷, and was surprised at the results. The time to completion varied wildly on successive runs – sometimes over 50 seconds, sometimes as low as 13. I noticed that when the run times were high, the laptop’s hard drive was thrashing, and just afterward my other applications were unresponsive.

I reran the test and, with a little help from PerfMon, found out that the memory usage was off the chart. No, seriously. Right off the top. I had to rescale the graph to get everything to fit. The Private Bytes went way up over 200 MB. On my 512 MB laptop, with Firefox, emacs, and a few background processes running, things slowed to a crawl. With a smaller set of primes, or on more powerful iron, this implementation may work, but it’s not going to meet my needs.

Attempt 3: The Sieve, but slowly cross out the composites

If allocating a big array of numbers just to cross most of them out right away doesn’t work, how about we start by allocating nothing and just cross out numbers at the last moment?
The idea is pretty simple: start counting at 2, and keep a record of upcoming composite numbers that we’ve discovered by looking at multiples of primes so far. Essentially we maintain a rolling wave of the next multiples of 2, 3, 5, …:

  1. let composites = {}, an empty associative array, where each key is a composite number and its value is the prime that it’s a multiple of
  2. let n = 2
  3. if n is a known composite, remove it and insert the next multiple in the list
  4. otherwise, it’s prime. Yield it and insert a new composite, n²
  5. increment n
  6. go to step 3

Let’s see how it works

n  composites
2  {}                          2 isn’t in composites, so yield it. Then insert 2² = 4 and increment n.
3  {4:2}                       3 isn’t in composites, so yield it. Then insert 3² = 9 and increment n.
4  {4:2, 9:3}                  4 is in composites, with value 2. Remove it, insert 4 + 2 = 6 and increment n.
5  {6:2, 9:3}                  5 isn’t in composites, so yield it. Then insert 5² = 25 and increment n.
6  {6:2, 9:3, 25:5}            6 is in composites, with value 2. Remove it, insert 6 + 2 = 8 and increment n.
7  {8:2, 9:3, 25:5}            7 isn’t in composites, so yield it. Then insert 7² = 49 and increment n.
8  {8:2, 9:3, 25:5, 49:7}      8 is in composites, with value 2. Remove it, insert 8 + 2 = 10 and increment n.
9  {9:3, 10:2, 25:5, 49:7}     9 is in composites, with value 3. Remove it, insert 9 + 3 = 12 and increment n.
10 {10:2, 12:3, 25:5, 49:7}    10 is in composites, with value 2. Remove it, and… wait.

12 is already in the list, because it’s a multiple of 3. We can’t insert it. We’ll have to amend the algorithm to account for collisions. If a multiple is already accounted for, keep moving until we find one that isn’t in the list.

In this case, add another 2 to 12 to get 14 and insert it. Then increment n.

n  composites
11 {12:3, 14:2, 25:5, 49:7}    11 isn’t in composites, so yield it, insert 11² = 121, increment n, and continue…

Show me the code

Here’s an implementation of the naive algorithm presented above:

def sieve():
    composites = {}
    n = 2
    while True:
        factor = composites.pop(n, None)
        if factor:
            q = n + factor
            while composites.has_key(q):
                q += factor
            composites[q] = factor
        else:
            # not there - prime
            composites[n*n] = n
            yield n
        n += 1
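
Since sieve() never terminates on its own, a quick smoke test just peels off the first few values (itertools is from the standard library):

import itertools
print list(itertools.islice(sieve(), 10))   # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]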

This implementation takes 26.8 seconds to generate all primes below 10⁷, close to ¼ the time the best trial division algorithm took.

Why is this method so great?

  • Using the associative array values to remember which “stream” of multiples each composite comes from means that the array is no bigger than the number of primes we’ve seen so far
  • The primes can be yielded as soon as they’re found instead of waiting until the end
  • Adding p² when we find a new prime cuts down on collisions but still ensures that we’ll keep finding all multiples of p, because 2p, 3p, …, (p-1)p will be weeded out as multiples of lower primes.
  • There’s very little wasted work – finding a new prime number takes O(1) operations – just checking that the number isn’t in the associative array and adding the square to the array. Many composites will take more work, but an amount proportional to the number of distinct prime factors the number has. For example, 12 has prime factors 2, 2, and 3. We tried to add 12 to the array twice, once as a multiple of 2 and once as a multiple of 3. Fortunately, the number of distinct factors is severely limited. For numbers less than 10⁷, 9699690 has the most distinct prime factors: 2, 3, 5, 7, 11, 13, 17, 19 (a quick check follows this list). This sure beats the 446 divisions trial division took to find that 9999991 was prime.
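
For what it’s worth, the 9699690 claim is easy to check – it’s just the product of the first eight primes, and tacking on the ninth prime (23) would already push past 10⁷:

print 2 * 3 * 5 * 7 * 11 * 13 * 17 * 19   # prints 9699690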

The method already incorporates some of the advantages of the souped-up trial division methods from last time. We only worry about multiples of primes, so there’s no need to cut out the composite factors. And when checking to see if n is prime, we never consider prime factors larger than √n. Still, there are some optimizations to make.

That’s Odd

In the sample runthrough above, the algorithm checks 4, 6, 8, 10, … for primality, even though no even number larger than 2 is prime. Here’s an implementation that avoids that:

def odd_sieve():
    composites = {}
    yield 2
    n = 3
    while True:
        factor = composites.pop(n, None)
        if factor:
            q = n + 2 * factor
            while composites.has_key(q):
                q += 2 * factor
            composites[q] = factor
        else:
            # not there - prime
            composites[n*n] = n
            yield n
        n += 2

This method generates primes less than 10⁷ in 13.4 seconds – about half the time it took when we didn’t pre-filter the evens. In the trial division case, when we cut out the even numbers, we were saving almost no work – one division per even number, out of potentially dozens or hundreds of divisions being performed. This time, we cut out an associative array lookup and insertion, and most numbers are checked using only a small number of these operations. Let’s see what else we can do.

What about 3?

If skipping the numbers that are divisible by 2 paid off, will skipping those divisible by 3 as well pay off? Probably.

def sixish_sieve():
    composites = {}
    yield 2
    yield 3
    step = 2
    n = 5
    while True:
        factor = composites.pop(n, None)
        if factor:
            q = n + 2 * factor
            while q % 6 == 3 or composites.has_key(q):
                q += 2 * factor
            composites[q] = factor
        else:
            # not there - prime
            composites[n*n] = n
            yield n
        n += step
        step = 6 - step

Now the time to generate primes less than 10⁷ is 11.9 seconds. Again, I think we’ve hit diminishing returns. We didn’t get the 1/3 reduction I’d hoped for, probably due to the more complicated “next multiple” calculation.

YAGNI

Things are going pretty well. There’s only one thing that bothers me about the latest generator – we’re storing too many composites in the associative array. Every time we find a prime number, its square is inserted in the array. Even 9999991² is put in the array, even though we’ll never check to see if any number greater than 10⁷ is prime. So, modifying the algorithm to omit storing composites that are too large, we get:

def sixish_sieve_max():
    composites = {}
    yield 2
    yield 3
    step = 2
    n = 5
    while True:
        factor = composites.pop(n, None)
        if factor:
            q = n + 2 * factor
            while q % 6 == 3 or composites.has_key(q):
                q += 2 * factor
            composites[q] = factor
        else:
            # not there - prime; only store its square if we'll ever reach it
            # ('highest' is the global upper bound used by the timing harness)
            if n*n <= highest:
                composites[n*n] = n
            yield n
        n += step
        step = 6 - step

This generator takes 10.8 seconds to generate primes below 10⁷ – a modest improvement, and one I’d keep anyhow, since the code is barely more complicated than the previous version. However, the real boost, if there is any, is in the memory usage. When sixish_sieve generates primes below 10⁷, the private bytes climb up to 52 MB, but sixish_sieve_max stays below 25 MB. The advantage continues as the problem set grows – when the upper limit is 2×10⁷, sixish_sieve takes 100 MB, but sixish_sieve_max remains at a cool 25 MB – I guess that’s the difference between storing 1270607 composites and 607.

Conclusion

This was a fun and interesting exercise. Being able to look back at your old code and say “Boy, that was horrible. I’m glad I’m smarter now,” makes me happy. And embarrassed. I enjoyed seeing how applying incremental changes and even rudimentary profiling yielded provably better results, right up until they showed that I probably needed to abandon my current path (trial division) and jump on a new one.

I’m sticking with sixish_sieve_max for now. It’s fast enough to meet my current needs, and will likely remain so until “CPU inflation” forces the Project Euler team to increase the size of their problem sets. Of course, maybe by then I’ll have a faster processor and I won’t care.

Prime Time Programming, Part 1

From time to time, I find myself caught up in the heady world of Project Euler. It’s almost like playing Professor Layton or some other puzzle-solving game – mathematical or programming brain teasers, served in bite-sized pieces.

If you look at the Project Euler problems for any length of time, you’ll notice common themes. One theme is prime numbers – many problems can’t be solved without generating varying quantities of primes. To that end, I’d written a prime generator to be shared between problem solutions. Over time, I added functionality and “optimized” the generator to improve the running time of my solutions. Everything was great, until I tried Problem 315, whose solution required a list of primes between 10⁷ and 2×10⁷. My solution worked, but it ran really slowly – something like 12 minutes. Now, I’m not doing myself any favours by writing in Python and running on a 7-year-old laptop, but that’s still too long.

My Original Generator

This is the slow-performing generator that I replaced when working on Problem 315. The squeamish reader may want to avert his eyes.

import math

class PrimeGenerator:
    __primes_so_far = [5]
    __increment = 2

    def __init__(self):
        self.__next_index = -1


    def is_prime(self, n):
        if n == 2 or n == 3:
            return True
        elif n % 2 == 0 or n %3 == 0:
            return False
        elif n <= PrimeGenerator.__primes_so_far[-1]:
            # bisecting didn't really help
            return n in PrimeGenerator.__primes_so_far
        else:
            return self.__is_prime(n)

    def __is_prime(self, n):
        limit = math.sqrt(n)
        g = PrimeGenerator()
        g.next() # skip 2
        g.next() # skip 3
        p = g.next()
        while p <= limit:
            if  n % p == 0:
                return False
            p = g.next()
        return True
        
    def next(self):
        self.next = self.__next3
        return 2

    def __next3(self):
        self.next = self.__next5
        return 3

    def __next5(self):
        self.__next_index += 1
        if self.__next_index < len(PrimeGenerator.__primes_so_far):
            return PrimeGenerator.__primes_so_far[self.__next_index]
            
        candidate = PrimeGenerator.__primes_so_far[-1]

        while True:
            candidate += PrimeGenerator.__increment
            PrimeGenerator.__increment = 6 - PrimeGenerator.__increment
        
            if self.__is_prime(candidate):
                PrimeGenerator.__primes_so_far.append(candidate)
                return candidate

    def __iter__(self):
        return self

My eyes!

When I first went back to this code, I exclaimed, “What was I thinking?”. I can think of two things:

  1. The is_prime member was intended to help out for problems where I didn’t have to create too many primes, but just had to check a few numbers for primality. This doesn’t really belong here, and clutters the class. I’d be better off focusing on prime generation.
  2. I was optimizing at least partly for the case where I’d want to get lists of primes a couple times—hence the indexing into an already-generated list of primes. If the generator were fast enough, this wouldn’t be necessary.

Things I can’t understand, even today. I’ll have to blame them on then-ignorance, foolishness, and hasty modifications to the class:

  • Why the mucking about with __next3 and __next5? What did I have against yield?
  • Why is is_prime fussing with a whole new PrimeGenerator and skipping 2 and 3? Why not just go straight for the saved list of primes starting with 5?

In fact, just about the only defensible things (as we’ll see below) in the whole class are:

  • ending the modulus checks once the candidate divisor is greater than the square root of the candidate number, and
  • the bit where the next __increment is formed by subtracting the current one from 6 – I was exploiting the fact that, past 3, for any prime p, p ≡ 1 (mod 6) or p ≡ 5 (mod 6).

How slow was it?

The generator took 551.763 seconds to generate primes less than 10⁷, as measured by the following:

import datetime

def run(f):
    global highest
    start = datetime.datetime.now()
    count = 0
    for p in f:
        if p > highest: break
        count += 1
    end = datetime.datetime.now()
    elapsed = end - start
    return highest, count, elapsed.seconds + (elapsed.microseconds/1000000.0)

Where f is an instance of PrimeGenerator passed into the run method, and highest is a global that’s been set to 10⁷.
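
For example, a timing run looks something like this (a sketch – my actual test script sets things up a little differently):

highest = 10 ** 7
print run(PrimeGenerator())   # -> (highest, prime count, elapsed seconds)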

Moving forward

Based on the extreme slowness and general horribleness of the code, I figured I’d be better off starting over. So, I chucked the generator and started afresh with the simplest generator I could write. I resolved to make incremental changes, ensuring that at each step, the code was:

  1. correct (otherwise, why bother)
  2. faster than the previous version
  3. simple enough for me to understand

Let’s see what happened.

Attempt 1: Trial Division

Trial division is one of the simplest methods for generating primes—you start counting at 2, and if no smaller positive integers (other than 1) divide the current number, it’s prime. The naive implementation is very simple.

def naive_trial():
    n = 2
    while True:
        may_be_prime = True
        for k in xrange(2, n):
            if n % k == 0:
                may_be_prime = False
                break
        if may_be_prime:
            yield n
        n += 1

This method takes 113.804 seconds to generate primes below 100000 – I couldn’t wait for the full 10⁷; it would probably be over 3 hours.

Trial until root

Fortunately, there are some pretty obvious optimizations one can make to the algorithm. The first comes from the observation that if there is a number k, 1 < k < n, that divides n, then there is a number j that divides n with 1 < j ≤ √n (if k > √n, take j = n/k), so we can stop our trial once we’ve hit that point.

def trial_until_root():
    n = 2
    while True:
        may_be_prime = True
        for k in xrange(2, int(n**0.5)+1):
            if n % k == 0:
                may_be_prime = False
                break
        if may_be_prime:
            yield n
        n += 1

This method takes 468 seconds to generate primes up to 10⁷. A definite improvement (and already faster than my original generator), but there’s still room for more.

Trial by primes

Here’s another observation about divisors of n: if there’s a number k that divides n, then there’s a prime number p ≤ k that divides n, since either k is prime or has prime factors. So if we keep a list of the primes found so far, we only need to check prime divisors that are less than √n.

def trial_by_primes():
    primes_so_far = []
    n = 2
    while True:
        may_be_prime = True
        for p in primes_so_far:
            if n % p == 0:
                may_be_prime = False
                break
            if p * p > n: # it's prime
                break
        if may_be_prime:
            primes_so_far.append(n)
            yield n
        n += 1

Now we’re down to 136 seconds to generate primes below 10⁷. That was a worthwhile change, but we have to balance it against the fact that the generator now requires additional storage – the list of primes encountered so far. In this case, we’re storing over 660,000 prime numbers in a list. Even an older laptop can handle this burden, but it’s something to keep in mind.

That’s odd

And by “that”, I mean “all the prime numbers except for 2”. There’s no point checking the even numbers to see if they’re prime. Let’s see what happens when we skip them. The only tricky part (and it’s not that tricky) is to make sure we still return 2 as our first prime.

def odd_trial_by_primes():
    primes_so_far = []
    yield 2
    n = 3
    while True:
        may_be_prime = True
        for p in primes_so_far:
            if n % p == 0:
                may_be_prime = False
                break
            if p * p > n: # it's prime
                break
        if may_be_prime:
            primes_so_far.append(n)
            yield n
        n += 2

This method takes 127 seconds to generate primes less than 10⁷. Not a huge improvement, but better than nothing, and it doesn’t really complicate the code that much. I’ll keep it. The reason we don’t get a huge improvement here is that checking the even numbers for primeness doesn’t take that much effort – they were eliminated as soon as we modded them by the first prime in primes_so_far. Still, it’s a little quicker to jump right over them than to perform the division.

What about 3?

If skipping the numbers that are divisible by 2 paid off, will skipping those divisible by 3 pay off too? As I noted above, every prime p greater than 3 satisfies p ≡ 1 (mod 6) or p ≡ 5 (mod 6). Let’s use that. We’ll take advantage of these facts:

  • if p ≡ 1 (mod 6) , then p+4 ≡ 5 (mod 6)
  • if p ≡ 5 (mod 6) , then p+2 ≡ 1 (mod 6)

So we want to alternate our step between 2 and 4. Fortunately 6 – 4 = 2 and 6 – 2 = 4, so we can use 6 – step as our next step.

def sixish_trial_by_primes():
    primes_so_far = []
    yield 2
    yield 3
    step = 2
    n = 5
    while True:
        may_be_prime = True
        for p in primes_so_far:
            if n % p == 0:
                may_be_prime = False
                break
            if p * p > n: # it's prime
                break
        if may_be_prime:
            primes_so_far.append(n)
            yield n
        n += step
        step = 6 - step

Now the time drops to 123 seconds to generate primes less than 10⁷. Clearly we’ve hit diminishing returns – we’re saving two modulus operations on numbers that are divisible by 3 (but not 2), at the cost of a more complicated “step” calculation. We could continue on in this vein, but the gains are not likely to be large, and the complexity increases rapidly. Consider the next step: we’d make sure we don’t test numbers divisible by 2, 3, or 5. That means (after 5) we only consider numbers whose remainders when divided by 30 are one of 1, 7, 11, 13, 17, 19, 23, or 29. The steps between numbers are 6, 4, 2, 4, 2, 4, 6, and 2. Who has the energy? (If you’re curious, a sketch of that stepping pattern follows.)
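
For the curious, here's a minimal sketch of what that mod-30 candidate stepping would look like – just the candidate stream, not a full prime generator (wheel30_candidates is my name for it; it isn't in my solutions):

from itertools import cycle

def wheel30_candidates():
    # After 2, 3, and 5, every prime is congruent to one of
    # 1, 7, 11, 13, 17, 19, 23, or 29 (mod 30).
    n = 7
    for step in cycle([4, 2, 4, 2, 4, 6, 2, 6]):
        yield n
        n += step

The first values are 7, 11, 13, 17, 19, 23, 29, 31, 37, … – exactly the numbers that aren't divisible by 2, 3, or 5.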

The problem with Trial Division

Trial division has a few things going for it:

  • it’s simple to understand
  • there are some obvious optimizations that can make its performance tolerable

Ultimately, though, its downfall is that it takes a lot of work to verify that a large number is prime. Consider the largest prime number below 10⁷: 9999991. In order to verify that this is prime, we have to consider all prime numbers less than √9999991. There are 446 of these. That’s 446 divisions, just to verify that one number is prime.
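
If you want to double-check that count, a throwaway sieve does the trick (this snippet is just a sanity check, not part of any generator):

def primes_below(limit):
    flags = [True] * limit
    for i in xrange(2, int(limit ** 0.5) + 1):
        if flags[i]:
            flags[i * i::i] = [False] * len(flags[i * i::i])
    return [i for i in xrange(2, limit) if flags[i]]

print len(primes_below(int(9999991 ** 0.5) + 1))   # prints 446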

We’re unlikely to radically improve performance by continuing to tinker with trial division. It’s time to throw the whole thing away again and try a new approach. Next time we’ll do just that.

Automated Testing using App Engine Service APIs (and a Memcaching Memoizer)

I’m a fan of Test-driven development, and automated testing in general. As such, I’ve been trying to ensure that the LibraryHippo code has an adequate set of automated tests before deploying new versions.

Importing Google App Engine Modules

Unfortunately, testing code that relies on the Google App Engine SDK is a little tricky, as I found when working with one of the LibraryHippo entities. There’s an entity called a Card, which extends db.Model and represents a user’s library card.

The Card definition is not entirely unlike this:

class Card(db.Model):
    family = db.ReferenceProperty(Family)
    number = db.StringProperty()
    name = db.StringProperty()
    pin = db.StringProperty()
    library = db.ReferenceProperty(Library)

    def pin_is_valid(self):
        return self.pin != ''

Unfortunately, testing this class isn’t as straightforward as one would hope. Suppose I have this test file:

from card import Card

def test_card_blank_pin_is_invalid():
    c = Card()
    c.pin = ''
    assert not c.pin_is_valid()

It fails miserably, spewing out a string of import errors. Here’s the tidied-up stack:

> from card import Card
> from google.appengine.ext import db
> from google.appengine.api import datastore
> from google.appengine.datastore import datastore_index
> from google.appengine.api import validation
> import yaml
E ImportError: No module named yaml

Not so good. Fortunately, it’s not that hard to find out what needs to be done in order to make the imports work:

import sys
import dev_appserver
sys.path = dev_appserver.EXTRA_PATHS + sys.path 

from card import Card

def test_card_blank_pin_is_invalid():
    c = Card()
    c.pin = ''
    assert not c.pin_is_valid()

Now Python can find all the imports it needs. For a while this was good enough, since I wasn’t testing any code that hit the datastore or actually used any of the App Engine Service APIs.

Running the App Engine Service APIs

However, I recently found a need to use Memcache to store partially-calculated results and decided (like everyone else) to write a memoizing decorator to do the job. There’s enough logic in my memoizer that I felt it needed an automated test. I tried this:

import sys
import dev_appserver
sys.path = dev_appserver.EXTRA_PATHS + sys.path 

from google.appengine.api import memcache
from gael.memcache import *

def test_memoize_formats_string_key_using_kwargs():
    values = [1, 2]
    @memoize('hippo %(animal)s zebra', 100)
    def pop_it(animal):
        return values.pop()

    result = pop_it(animal='rabbit')
    assert 2 == result

    cached_value = memcache.get('hippo rabbit zebra')
    assert 2 == cached_value

(gael is Google App Engine Library – my extension/utility package; as it grows and I gain experience, I may spin it out of LibraryHippo to be its own project.) Again, it failed miserably. Here’s a cleaned-up version of the failure:

> result = pop_it(animal='rabbit')
> cached_result = google.appengine.api.memcache.get(key_value)
> self._make_sync_call('memcache', 'Get', request, response)
> return apiproxy.MakeSyncCall(service, call, request, response)
> assert stub, 'No api proxy found for service "%s"' % service
E AssertionError: No api proxy found for service "memcache";

This was puzzling. All the imports were in place, so why the failure? This time the answer was a little harder to find, but tenacious searching paid off, and I stumbled on a Google Group post called Unit tests / google apis without running the dev app server. The author had actually done the work to figure out what initialization code had to be run in order to get the Service APIs to work. The solution relied on hard-coded paths to the App Engine imports, but it was obvious how to combine it with the path manipulation I used earlier to produce this:

import sys

from dev_appserver import EXTRA_PATHS
sys.path = EXTRA_PATHS + sys.path 

from google.appengine.tools import dev_appserver
from google.appengine.tools.dev_appserver_main import ParseArguments
args, option_dict = ParseArguments(sys.argv) # Otherwise the option_dict isn't populated.
dev_appserver.SetupStubs('local', **option_dict)

from google.appengine.api import memcache
from gael.memcache import *

def test_memoize_formats_string_key_using_kwargs():
    values = [1, 2]
    @memoize('hippo %(animal)s zebra', 100)
    def pop_it(animal):
        return values.pop()

    result = pop_it(animal='rabbit')
    assert 2 == result

    cached_value = memcache.get('hippo rabbit zebra')
    assert 2 == cached_value

There’s an awful lot of boilerplate here, so I tried to clean up the module, moving the App Engine setup into a new module in gael:

import sys

def add_appsever_import_paths():
    from dev_appserver import EXTRA_PATHS
    sys.path = EXTRA_PATHS + sys.path 

def initialize_service_apis():
    from google.appengine.tools import dev_appserver

    from google.appengine.tools.dev_appserver_main import ParseArguments
    args, option_dict = ParseArguments(sys.argv) # Otherwise the option_dict isn't populated.
    dev_appserver.SetupStubs('local', **option_dict)

Then the top of the test file becomes

import gael.testing
gael.testing.add_appsever_import_paths()
gael.testing.initialize_service_apis()

from google.appengine.api import memcache
from gael.memcache import *

def test_memoize_formats_string_key_using_kwargs():
    ...

The Decorator

In case anyone’s curious, here’s the memoize decorator I was testing. I needed something flexible, so it takes a key argument that can either be a format string or a callable. I’ve never cared for positional format arguments – not in Python, C#, Java, nor C/C++ – so both the format string and the callable use the **kwargs to construct the key. I’d prefer to use str.format instead of the % operator, but not until App Engine moves to Python 2.6+.

import logging

import google.appengine.api.memcache

def memoize(key, seconds_to_keep=600):
    def decorator(func):
        def wrapper(*args, **kwargs):
            if callable(key):
                key_value = key(args, kwargs)
            else:
                key_value = key % kwargs

            cached_result = google.appengine.api.memcache.get(key_value)
            if cached_result is not None:
                logging.debug('found ' + key_value)
                return cached_result
            logging.info('calling func to get '  + key_value)
            result = func(*args, **kwargs)
            google.appengine.api.memcache.set(key_value, result, seconds_to_keep)
            return result
        return wrapper
    return decorator
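
And here's a hypothetical use of the callable-key flavour (status_key and fetch_account_status are made-up names for illustration – the callable receives the positional args and kwargs and returns the cache key):

def status_key(args, kwargs):
    return 'status-%s' % kwargs['card_number']

@memoize(status_key, seconds_to_keep=300)
def fetch_account_status(card_number):
    # stand-in for the real scraping work
    return 'nothing due'

print fetch_account_status(card_number='12345')   # cached under 'status-12345'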

Faking out Memcache – Unit Testing the Decorator

The astute among you are probably thinking that I could’ve saved myself a lot of trouble if I’d just faked out memcache and unit tested the decorator instead of trying to hook everything up for an integration test. That’s true, but at first I couldn’t figure out how to do that cleanly, and it was my first foray into memcache, so I didn’t mind working with the service directly.

Still, the unit testing approach would be better, so I looked at my decorator and rebuilt it to use a class rather than a function. It’s my first time doing this, and it probably won’t be the last – I really like the separation between initialization and execution that the __init__/__call__ methods give me; I think it makes things a lot easier to read.

def memoize(key, seconds_to_keep=600):
    class memoize():
        def __init__(self, func):
            self.key = key
            self.seconds_to_keep = seconds_to_keep
            self.func = func
            self.cache = google.appengine.api.memcache

        def __call__(self, *args, **kwargs):
            if callable(self.key):
                key_value = self.key(args, kwargs)
            else:
                key_value = self.key % kwargs

            cached_result = self.cache.get(key_value)
            if cached_result is not None:
                logging.debug('found ' + key_value)
                return cached_result
            logging.info('calling func to get '  + key_value)
            result = self.func(*args, **kwargs)

            self.cache.set(key_value, result, self.seconds_to_keep)
            return result

    return memoize

Then the test can inject its own caching mechanism to override self.cache:

class MyCache:
    def __init__(self):
        self.cache = {}

    def get(self, key):
        return self.cache.get(key, None)

    def set(self, key, value, *args):
        self.cache[key] = value

def test_memoize_formats_string_key_using_kwargs():
    values = [1, 2]
    @memoize('hippo %(animal)s zebra', 100)
    def pop_it(animal):
        return values.pop()

    cache = MyCache()
    pop_it.cache = cache
    result = pop_it(animal='rabbit')
    assert 2 == result

    cached_value = cache.get('hippo rabbit zebra')
    assert 2 == cached_value

And that’s it. Now I have a unit-tested implementation of my memoizer and two new helpers in my extension library.

A first look at Appstats – where’s my time spent?

After hearing about the release of Google’s App Engine SDK 1.3.1, I rushed out to try the new Appstats Event Recorder to help profile LibraryHippo. I didn’t expect great things, as I’m generally happy with the performance, with one notable exception, but I was curious about the tool.

App Engine Fan has posted a great introduction to some of the features that make Appstats a useful and powerful tool – it’s very easy to hook up, and seems to add very little overhead. In addition, it has very rich configuration options – one can omit classes of calls, fold calls together, select the amount of information retained about each call, and specify how many such records are retained (in what amounts to a circular buffer).

I didn’t use (or need) any particularly advanced configuration, so I just installed the Event Recorder and let it go.

Here’s what I saw:

Checking one family, with 3 Waterloo and Kitchener library cards

I don’t have an in-depth analysis, but here are some impressions:

  • it’s pretty
  • the information is presented very well – with only minimal reading, I can see that LibraryHippo made a handful of datastore queries, as well as a series of urlfetch.Fetch calls for each library card it checked
  • I can get a quick view of what’s taking what proportion of the time – for example, the fetches easily dominate
  • total time (about 2.3 seconds) is easy to find, as well as the amount taken by the underlying API – 73 milliseconds
  • there’s something else that’s going on – 1056 ms for cpu + api – nearly half the elapsed time. I’m not sure what that means exactly

So far, no big surprises – I knew that most of the time was taken up by queries to the library web pages, but it’s very cool to see it this way, and to see how much time is taken up going to the Datastore (not much). There’s room for improvement, but 2.3 seconds is more than acceptable for this family – one of LibraryHippo’s heaviest users.

Two things did stand out, though. First, in the first group of urlfetch.Fetches, there are gaps between the fourth, fifth, and sixth calls (the ones that take 128 ms, 91 ms, and 52 ms), and the pattern repeats (with smaller gaps) in the second batch. This is where the retrieved records are processed and transformed into a normalized representation before rendering. The total time taken is small, but I didn’t expect to see anything.

Second, there’s a datastore_v3.Get call before each card is checked. This is not an explicit call that LibraryHippo makes, so I clicked on the line in the graph and got a detailed view of what was going on:

Detail of implicit datastore_v3.get call

It looks like the call is coming from the create method on line 8 of the all_libraries.py file. Curious, I click on that line and lo and behold, I get a view of the source. This is very cool.

   1: #!/usr/bin/env python
   2: 
   3: import sys
   4: 
   5: modules = {}
   6: 
   7: def create(card, fetcher):
   8:     id = card.library.type
   9:     if not modules.has_key(id):
  10:         modules[id] = __import__(id)
  11:     return modules[id].LibraryAccount(card, fetcher)
  12: 
  13: def main(args=None):
  14:     if args == None:
  15:         args = sys.argv[1:]
  16:     return 0

Correlating the detail view and the source code, we see that create is handed a card parameter that has an as-yet-unresolved library instance. Accessing the library attribute on the card must complete what was a lazy load initiated when I loaded the Family entity – the cards come from the Family.card_set member.
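
Here's a sketch of that distinction (get_value_for_datastore is the documented way to read a ReferenceProperty's key without dereferencing it; card stands in for one of the family's cards):

# Reading the raw key does not touch the datastore.
library_key = Card.library.get_value_for_datastore(card)

# Reading the attribute itself resolves the reference, triggering the
# implicit datastore_v3.Get that shows up in the Appstats trace.
library = card.library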

Ordinarily, I might start investigating the gaps and the implicit gets, but I know there’s a much greater threat to LibraryHippo usability, which I confirm by checking out the record for another family’s notification:

Checking one family, with 4 Waterloo and Kitchener cards, and one Region of Waterloo

Here’s where the presentation really packs a wallop – there’s clearly a qualitative difference here. And what a difference – instead of 2.5 seconds on the horizontal axis, it’s 25 seconds, and most of the fetches are compressed to nigh-invisibility.

There are two differences between this family’s accounts and the first family’s: they have an extra Kitchener Library card that the first family didn’t, and they have a Region of Waterloo Library card. It’s the RWL card that makes the difference: you can see it being checked in the last batch of urlfetch.Fetches.
The 4 Waterloo and Kitchener library card checks are completely done after 3154 ms, but the Region of Waterloo checking goes on for a further 21 seconds – for one library. And it’s not an aberration – the library web site is slow.

This post is already long enough, so I’ll use an upcoming one to talk about what this slowness means for LibraryHippo and how I’ve tried to keep it from destroying the user experience.

Cookies, Redirects, and Transcripts – Supercharging urlfetch

LibraryHippo‘s main function is fetching current library account status for patrons. Since I have no special relationship with any of the libraries involved, LibraryHippo web scrapes the libraries’ web interfaces.

The library websites issue cookies and redirects, so I needed to do something to augment the URL Fetch Python API.
I wrote a utility class that worked with the urllib2 interface, but that didn’t allow me to set the deadline argument, and I wanted to increase its value to 10 seconds. I resigned myself to wiring up a version that used urlfetch, when I found Scott Hillman’s URLOpener, which uses cookielib to follow redirects and handle any cookies met along the way.

URLOpener looked like it would work for me, with a few tweaks – it didn’t support relative URLs in redirects, it didn’t allow one to specify headers in requests, and it lacked one feature that I really wanted – a transcript.

Why a transcript?

The libraries don’t provide a spec for their output, so I built the web scraper by trial and error, sometimes putting books on hold or taking them out just to get test data. Every once in a while something comes up that I haven’t coded for and the application breaks. In these cases, I can’t rely on the problem being reproducible, since the patron could’ve returned (or picked up) the item whose record was troublesome or some other library state might’ve changed. I need to know what the web site looked like when the problem occurred, and since the ultimate cause might be several pages back, I need a history.

I started adding a transcript feature to the URLOpener – recording every request and response including headers. As I worked, I worried about two things:

  • the fetch logic was becoming convoluted, and
  • the approach was inflexible – what if later I didn’t want to follow redirects, or to keep a transcript?

Decorators to the rescue

I decided to separate each bit of functionality – following redirects, tracking cookies, and keeping a transcript – into its own decorator, to be applied as needed. First I teased out the code that followed redirects, with my change to allow relative URLs:

import urlparse

class RedirectFollower():
    def __init__(self, fetcher):
        self.fetcher = fetcher

    def __call__(self, url, payload=None, method='GET', headers={},
                 allow_truncated=False, follow_redirects=False, deadline=None):
        while True:
            response = self.fetcher(url, payload, method, headers,
                                    allow_truncated, False, deadline)
            new_url = response.headers.get('location')
            if new_url:
                # Join the URLs in case the new location is relative
                url = urlparse.urljoin(url, new_url)

                # Next request should be a get, payload needed
                method = 'GET'
                payload = None
            else:
                break

        return response

After that, the cookie-handling code was easy to put in its own class:

import Cookie

class CookieHandler():
    def __init__(self, fetcher):
        self.fetcher = fetcher
        self.cookie_jar = Cookie.SimpleCookie()

    def __call__(self, url, payload=None, method='GET', headers={},
                 allow_truncated=False, follow_redirects=True, deadline=None):
        headers['Cookie'] = self._make_cookie_header()
        response = self.fetcher(url, payload, method, headers,
                                allow_truncated, follow_redirects, deadline)
        self.cookie_jar.load(response.headers.get('set-cookie', ''))
        return response

    def _make_cookie_header(self):
        cookieHeader = ""
        for value in self.cookie_jar.values():
            cookieHeader += "%s=%s; " % (value.key, value.value)
        return cookieHeader

Now I had the URLOpener functionality back, just by creating an object like so:

fetch = RedirectFollower(CookieHandler(urlfetch.fetch))
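
Calls through the composed fetcher look just like calls to urlfetch.fetch – for example, with the 10-second deadline I mentioned wanting earlier:

response = fetch(patron_account_url, deadline=10)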

Implementing transcripts

I still needed one more decorator – the transcriber.

import datetime

class Transcriber():
    def __init__(self, fetcher):
        self.fetcher = fetcher
        self.transactions = []

    def __call__(self, url, payload=None, method='GET', headers={},
                 allow_truncated=False, follow_redirects=True, deadline=None):
        self.transactions.append(Transcriber._Request(vars()))
        response = self.fetcher(url, payload, method, headers,
                                    allow_truncated, follow_redirects, deadline)
        self.transactions.append(Transcriber._Response(response))
        return response

    class _Request:
        def __init__(self, values):
            self.values = dict((key, values[key])
                               for key in ('url', 'method', 'payload', 'headers'))
            self.values['time'] = datetime.datetime.now()

        def __str__(self):
            return '''Request at %(time)s:
  url = %(url)s
  method = %(method)s
  payload = %(payload)s
  headers = %(headers)s''' % self.values

    class _Response:
        def __init__(self, values):
            self.values = dict(status_code=values.status_code,
                               headers=values.headers,
                               content=values.content,
                               time=datetime.datetime.now())

        def __str__(self):
            return '''Response at %(time)s:
  status_code = %(status_code)d
  headers = %(headers)s
  content = %(content)s''' % self.values

To record all my transactions, all I have to do is wrap my fetcher one more time. When something goes wrong, I can examine the whole chain of calls and have a better shot at fixing the scraper.

fetch = Transcriber(RedirectFollower(CookieHandler(urlfetch.fetch)))
response = fetch(patron_account_url)
try:
    process(response)
except:
    logging.error('error checking account for ' + patron, exc_info=True)
    for action in fetch.transactions:
        logging.debug(action)

Extra-fine logging without rewriting fetch

The exercise of transforming URLOpener into a series of decorators may seem like just that – an exercise – but it provides a powerful debugging tool for your other decorators. By moving the Transcriber to the inside of the chain of decorators, you can see each fetch that’s made due to a redirect, and which cookies are set when:

fetch = RedirectFollower(CookieHandler(Transcriber(urlfetch.fetch)))

The only trick is that the Transcriber.transactions attribute isn’t available from the outermost decorator. This is easily solved by extracting a base class and having it delegate to the wrapped item.

class _BaseWrapper:
    def __init__(self, fetcher):
        self.fetcher = fetcher

    def __getattr__(self, name):
        return getattr(self.fetcher, name)

Then the other decorators extend _BaseWrapper, either losing their __init__ methods or having them modified. For example, CookieHandler becomes:

class CookieHandler(_BaseWrapper):
    def __init__(self, fetcher):
        _BaseWrapper.__init__(self, fetcher)
        self.cookie_jar = Cookie.SimpleCookie()
...

And then the following code works, and helped me diagnose a small bug I’d originally had in my RedirectFollower. As a bonus, if I ever need to get at CookieHandler.cookie_jar, it’s right there too.

fetch = RedirectFollower(CookieHandler(Transcriber(urlfetch.fetch)))
fetch(patron_account_url)
for action in fetch.transactions:
    logging.debug(action)

New Year’s Python Meme

It’s a little late, but I’m participating in Tarek Ziadé’s Python Meme (via Richard Jones):

  1. What’s the coolest Python application, framework or library you have discovered in 2009?

    Google App Engine. I’d known of it before, but hadn’t tried it until early this year when I started to work on LibraryHippo.

  2. What new programming technique did you learn in 2009?

    I’m not sure if this counts as a technique, but I recently found (and found a use for) Jean-Paul S. Boodhoo’s Static Gateway Pattern. At the Day Job, we have a lot of hard-coded dependencies and reliance on well-known static methods for authorization. The Static Gateway Pattern made it easy to provide an injectable implementation without rewriting the whole application. I expect it to continue to be useful, at least until we take the time to introduce a full Inversion of Control container.

  3. What’s the name of the open source project you contributed the most in 2009? What did you do?

    I didn’t, really. Unless you count LibraryHippo. I’ve an interest in working on Noda Time, but I haven’t managed to yet.

  4. What was the Python blog or website you read the most in 2009?

    Word Aligned

  5. What are the three top things you want to learn in 2010?