
Prime Time Programming, Part 2

Last time I presented a truly horrible prime number generator I was using for Project Euler problems. Then I presented a revamped generator that used trial division. By adding various refinements to the generator, we saw the time required to generate primes less than 10⁷ shrink from hours to 123 seconds. Today I’ll describe a different approach.

Attempt 2: Sieve of Eratosthenes

The Sieve of Eratosthenes is another method for finding prime numbers.
The algorithm is basically this:

  1. make a big array of numbers, from 2 to the highest prime you’re hoping to find
  2. look for the next number that’s not crossed off
  3. this number is your next prime
  4. cross off every multiple of the number you just found
  5. so long as the prime you just found is less than the square root of your limit, go to step 2
  6. the uncrossed numbers are prime

Suppose we want primes less than or equal to 20. We start with this list:

2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

The first uncrossed off number is 2. That’s our first prime. Cross off all the multiples of 2 (crossed-off numbers are shown in brackets from here on):

2 3 [4] 5 [6] 7 [8] 9 [10] 11 [12] 13 [14] 15 [16] 17 [18] 19 [20]

The next uncrossed off number is 3. Cross off the multiples:

2 3 [4] 5 [6] 7 [8] [9] [10] 11 [12] 13 [14] [15] [16] 17 [18] 19 [20]

Next, we have 5. Cross off its multiples (actually, they’re already crossed off because they’re all also multiples of either 2 or 3):

2 3 [4] 5 [6] 7 [8] [9] [10] 11 [12] 13 [14] [15] [16] 17 [18] 19 [20]

5 is greater than √20, so stop looping. All the uncrossed off numbers are prime:

2 3 [4] 5 [6] 7 [8] [9] [10] 11 [12] 13 [14] [15] [16] 17 [18] 19 [20]

A “borrowed” implementation

The Wikipedia article even provides a Python implementation, which I reproduce here in slightly altered form:

def eratosthenes_sieve(m):
    # Create a candidate list within which non-primes will be
    # marked as None; only candidates below sqrt(m) need be checked. 
    candidates = range(m+1)
    fin = int(m**0.5)
 
    # Loop over the candidates, marking out each multiple.
    for i in xrange(2, fin+1):
        if not candidates[i]:
            continue
 
        candidates[2*i::i] = [None] * (m//i - 1)
 
    # Filter out non-primes and return the list.
    return [i for i in candidates[2:] if i]
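As a quick sanity check of the borrowed code (this is Python 2 throughout, as the xrange call suggests):

print eratosthenes_sieve(30)
# [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]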

I ran this to find my standard list of primes less than 10⁷, and was surprised at the results. The time to completion varied wildly on successive runs – sometimes over 50 seconds, sometimes as low as 13. I noticed that when the run times were high, the laptop’s hard drive was thrashing, and just afterward my other applications were unresponsive.

I reran the test and, with a little help from PerfMon, found out that the memory usage was off the chart. No, seriously. Right off the top. I had to rescale the graph to get everything to fit: Private Bytes went way up over 200MB. That’s no great surprise – in Python 2, a list of all the integers up to 10⁷ means ten million boxed int objects plus the list’s pointer array, which easily accounts for a couple hundred megabytes. On my 512MB laptop, with Firefox, emacs, and a few background processes running, things slow to a crawl. With a smaller set of primes, or on more powerful iron, this implementation may work, but it’s not going to meet my needs.

Attempt 3: The Sieve, but slowly cross out the composites

If allocating a big array of numbers just to cross most of them out right away doesn’t work, how about we start by allocating nothing and just cross out numbers at the last moment?
The idea is pretty simple: start counting at 2, and keep a record of upcoming composite numbers that we’ve discovered by looking at multiples of primes so far. Essentially we maintain a rolling wave of the next multiples of 2, 3, 5, …:

  1. let composites = {}, an empty associative array, where each key is a composite number and its value is the prime that it’s a multiple of
  2. let n = 2
  3. if n is a known composite, remove it and insert the next multiple in the list
  4. otherwise, it’s prime. Yield it and insert a new composite, n²
  5. increment n
  6. go to step 3

Let’s see how it works:

n  | composites               | what happens
---|--------------------------|------------------------------------------------------------------
2  | {}                       | 2 isn’t in composites, so yield it. Then insert 2² = 4 and increment n.
3  | {4:2}                    | 3 isn’t in composites, so yield it. Then insert 3² = 9 and increment n.
4  | {4:2, 9:3}               | 4 is in composites, with value 2. Remove it, insert 4 + 2 = 6 and increment n.
5  | {6:2, 9:3}               | 5 isn’t in composites, so yield it. Then insert 5² = 25 and increment n.
6  | {6:2, 9:3, 25:5}         | 6 is in composites, with value 2. Remove it, insert 6 + 2 = 8 and increment n.
7  | {8:2, 9:3, 25:5}         | 7 isn’t in composites, so yield it. Then insert 7² = 49 and increment n.
8  | {8:2, 9:3, 25:5, 49:7}   | 8 is in composites, with value 2. Remove it, insert 8 + 2 = 10 and increment n.
9  | {9:3, 10:2, 25:5, 49:7}  | 9 is in composites, with value 3. Remove it, insert 9 + 3 = 12 and increment n.
10 | {10:2, 12:3, 25:5, 49:7} | 10 is in composites, with value 2. Remove it, and… wait.

12 is already in the list, because it’s a multiple of 3. We can’t insert it. We’ll have to amend the algorithm to account for collisions. If a multiple is already accounted for, keep moving until we find one that isn’t in the list.

In this case, add another 2 to 12 to get 14 and insert it. Then increment n.

n  | composites               | what happens
11 | {12:3, 14:2, 25:5, 49:7} | 11 isn’t in composites, so yield it, insert 11² = 121, increment n, and continue…

Show me the code

Here’s an implementation of the naive algorithm presented above:

def sieve():
    # keys are upcoming composites; each value is a prime that divides the key
    composites = {}
    n = 2
    while True:
        factor = composites.pop(n, None)
        if factor:
            # n is composite: slide this prime's marker forward to its
            # next multiple that isn't already claimed by another prime
            q = n + factor
            while composites.has_key(q):
                q += factor
            composites[q] = factor
        else:
            # not there - prime
            composites[n*n] = n
            yield n
        n += 1

This implementation takes 26.8 seconds to generate all primes below 10⁷, close to ¼ the time the best trial division algorithm took.
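Since sieve() never stops on its own, the caller decides when to quit. Here’s a minimal consumption sketch (the limit of 100 is just for illustration):

import itertools

limit = 100
primes = list(itertools.takewhile(lambda p: p < limit, sieve()))
print primes   # the 25 primes below 100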

Why is this method so great?

  • Using the associative array values to remember which “stream” of multiples each composite comes from means that the array is no bigger than the number of primes we’ve seen so far
  • The primes can be yielded as soon as they’re found instead of waiting until the end
  • Adding p² when we find a new prime cuts down on collisions but still ensures that we’ll find all multiples of p, because 2p, 3p, …, (p-1)p will be weeded out as multiples of lower primes.
  • There’s very little wasted work – finding a new prime number takes O(1) operations: just checking that the number isn’t in the associative array and adding its square to the array. Many composites will take more work, but an amount proportional to the number of distinct prime factors the number has. For example, 12 has prime factors 2, 2, and 3; we tried to add 12 to the array twice, once as a multiple of 2 and once as a multiple of 3. Fortunately, the number of distinct prime factors is severely limited. For numbers less than 10⁷, 9699690 has the most distinct prime factors: 2, 3, 5, 7, 11, 13, 17, and 19 (see the quick check after this list). This sure beats the 446 divisions trial division took to find that 9999991 was prime.
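A quick check on that claim – the product of those eight primes (what number theorists call a primorial) just squeaks in under the limit:

factors = [2, 3, 5, 7, 11, 13, 17, 19]
product = 1
for p in factors:
    product *= p
print product           # 9699690
print product < 10**7   # True

Any number with nine distinct prime factors would have to be at least 9699690 × 23 = 223092870, far above 10⁷.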

The method already incorporates some of the advantages of the souped-up trial division methods from last time. We only worry about multiples of primes, so there’s no need to cut out the composite factors. And when checking to see if n is prime, we never consider prime factors larger than √n. Still, there are some optimizations to make.

That’s Odd

In the sample runthrough above, the algorithm checks 4, 6, 8, 10, … for primality, even though no even number larger than 2 is prime. Here’s an implementation that avoids that:

def odd_sieve():
    composites = {}
    yield 2
    n = 3
    while True:
        factor = composites.pop(n, None)
        if factor:
            q = n + 2 * factor
            while composites.has_key(q):
                q += 2 * factor
            composites[q] = factor
        else:
            # not there - prime
            composites[n*n] = n
            yield n
        n += 2

This method generates primes less than 10⁷ in 13.4 seconds. This is about half the time it took when we didn’t pre-filter the evens. In the trial division case, when we cut out the even numbers, we were saving almost no work – one division per even number, out of potentially dozens or hundreds of divisions being performed. This time, we cut out an associative array lookup and insertion, and most numbers are checked by using only a small number of these operations. Let’s see what else we can do.

What about 3?

If skipping the numbers that are divisible by 2 paid off, will skipping those divisible by 3 pay off as well? Probably.

def sixish_sieve():
    composites = {}
    yield 2
    yield 3
    step = 2
    n = 5
    while True:
        factor = composites.pop(n, None)
        if factor:
            q = n + 2 * factor
            # q stays odd, so q % 6 == 3 exactly when 3 divides q; skip those
            while q % 6 == 3 or composites.has_key(q):
                q += 2 * factor
            composites[q] = factor
        else:
            # not there - prime
            composites[n*n] = n
            yield n
        n += step
        step = 6 - step

Now the time to generate primes less than 10⁷ is 11.9 seconds. Again, I think we’ve hit diminishing returns. We didn’t get the 1/3 reduction I’d hoped for, probably due to the more complicated “next multiple” calculation.

YAGNI

Things are going pretty well. There’s only one thing that bothers me about the latest generator – we’re storing too many composites in the associative array. Every time we find a prime number, its square is inserted in the array. Even 9999991² is put in the array, even though we’ll never check to see if any number greater than 10⁷ is prime. So, modifying the algorithm to omit storing composites that are too large, we get:

def sixish_sieve_max():
    composites = {}
    yield 2
    yield 3
    step = 2
    n = 5
    while True:
        factor = composites.pop(n, None)
        if factor:
            q = n + 2 * factor
            while q % 6 == 3 or composites.has_key(q):
                q += 2 * factor
            composites[q] = factor
        else:
            # not there - prime
            # 'highest' is a global bound (set elsewhere to 10**7) on the
            # primes we'll request, so bigger squares need never be stored
            if n*n <= highest:
                composites[n*n] = n
            yield n
        n += step
        step = 6 - step

This generator takes 10.8 seconds to generate primes below 10⁷ – a modest improvement, but one I’d keep anyhow, since the code is barely more complicated than the previous version. However, the real boost, if there is any, is in the memory usage. When sixish_sieve generates primes below 10⁷, the private bytes climb up to 52MB, but sixish_sieve_max stays below 25MB. The advantage grows with the problem size – when the upper limit is 2×10⁷, sixish_sieve takes 100MB, but sixish_sieve_max remains at a cool 25MB. I guess that’s the difference between storing 1,270,607 composites and 607.
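Those two counts are roughly “primes below the limit” versus “primes whose square fits below the limit”. A sketch that recomputes them with the generator itself (it takes a while to run, and the counts should land near the figures above):

highest = 2 * 10**7
below_limit = 0   # one stored composite per prime, in sixish_sieve's case
squares_fit = 0   # what sixish_sieve_max stores instead
for p in sixish_sieve_max():
    if p > highest:
        break
    below_limit += 1
    if p * p <= highest:
        squares_fit += 1
print below_limit, squares_fit   # about 1270607 and 607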

Conclusion

This was a fun and interesting exercise. Being able to look back at your old code and say “Boy, that was horrible. I’m glad I’m smarter now,” makes me happy. And embarrassed. I enjoyed seeing how applying incremental changes and even rudimentary profiling yielded provably better results, right up until they showed that I probably needed to abandon my current path (trial division) and jump on a new one.

I’m sticking with sixish_sieve_max for now. It’s fast enough to meet my current needs, and will likely remain so until “CPU inflation” forces the Project Euler team to increase the size of their problem sets. Of course, maybe by then I’ll have a faster processor and I won’t care.

Prime Time Programming, Part 1

From time to time, I find myself caught up in the heady world of Project Euler. It’s almost like playing Professor Layton or some other puzzle-solving game – mathematical or programming brain teasers, served in bite-sized pieces.

If you look at the Project Euler problems for any length of time, you’ll notice common themes. One theme is prime numbers – many problems can’t be solved without generating varying quantities of primes. To that end, I’d written a prime generator to be shared between problem solutions. Over time, I added functionality and “optimized” the generator to improve the running time of my solutions. Everything was great, until I tried Problem 315, whose solution required a list of primes between 10⁷ and 2×10⁷. My solution worked, but it ran really slowly – something like 12 minutes. Now, I’m not doing myself any favours by writing in Python and running on a 7-year-old laptop, but that’s still too long.

My Original Generator

This is the slow-performing generator that I replaced when working on Problem 315. The squeamish reader may want to avert his eyes.

import math

class PrimeGenerator:
    __primes_so_far = [5]
    __increment = 2

    def __init__(self):
        self.__next_index = -1


    def is_prime(self, n):
        if n == 2 or n == 3:
            return True
        elif n % 2 == 0 or n % 3 == 0:
            return False
        elif n <= PrimeGenerator.__primes_so_far[-1]:
            # bisecting didn't really help
            return n in PrimeGenerator.__primes_so_far
        else:
            return self.__is_prime(n)

    def __is_prime(self, n):
        limit = math.sqrt(n)
        g = PrimeGenerator()
        g.next() # skip 2
        g.next() # skip 3
        p = g.next()
        while p <= limit:
            if n % p == 0:
                return False
            p = g.next()
        return True
        
    def next(self):
        self.next = self.__next3
        return 2

    def __next3(self):
        self.next = self.__next5
        return 3

    def __next5(self):
        self.__next_index += 1
        if self.__next_index < len(PrimeGenerator.__primes_so_far):
            return PrimeGenerator.__primes_so_far[self.__next_index]
            
        candidate = PrimeGenerator.__primes_so_far[-1]

        while True:
            candidate += PrimeGenerator.__increment
            PrimeGenerator.__increment = 6 - PrimeGenerator.__increment
        
            if self.__is_prime(candidate):
                PrimeGenerator.__primes_so_far.append(candidate)
                return candidate

    def __iter__(self):
        return self

My eyes!

When I first went back to this code, I exclaimed, “What was I thinking?”. I can think of two things:

  1. The is_prime member was intended to help out for problems where I didn’t have to create too many primes, but just had to check a few numbers for primality. This doesn’t really belong here, and clutters the class. I’d be better off focusing on prime generation.
  2. I was optimizing at least partly for the case where I’d want to get lists of primes a couple times—hence the indexing into an already-generated list of primes. If the generator were fast enough, this wouldn’t be necessary.

Things I can’t understand, even today. I’ll have to blame them on then-ignorance, foolishness, and hasty modifications to the class:

  • Why the mucking about with __next3 and __next5? What did I have against yield?
  • Why is is_prime fussing with a whole new PrimeGenerator and skipping 2 and 3? Why not just go straight for the saved list of primes starting with 5?

In fact, just about the only defensible things (as we’ll see below) in the whole class are:

  • ending the modulus checks once the candidate divisor is greater than the square root of the candidate number, and
  • the bit where the next __increment is formed by subtracting the current one from 6 – I was exploiting the fact that, past 3, every prime p satisfies p ≡ 1 (mod 6) or p ≡ 5 (mod 6), since any other residue mod 6 would make p divisible by 2 or 3.

How slow was it?

The generator took 551.763 seconds to generate primes less than 10⁷, as measured by the following:

import datetime

def run(f):
    global highest
    start = datetime.datetime.now()
    count = 1
    for p in f:
        count += 1
        if p > highest: break
    end = datetime.datetime.now()
    elapsed = end-start
    return highest, count, elapsed.seconds + (elapsed.microseconds/1000000.0)

Where f is an instance of PrimeGenerator passed into the run method, and highest is a global that’s been set to 10⁷.
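For instance, a typical invocation looks like this (the count and elapsed time will obviously vary from run to run):

highest = 10**7
print run(PrimeGenerator())   # prints a (highest, count, seconds) tuple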

Moving forward

Based on the extreme slowness and general horribleness of the code, I figured I’d be better off starting over. So, I chucked the generator and started afresh with the simplest generator I could write. I resolved to make incremental changes, ensuring that at each step, the code was:

  1. correct (otherwise, why bother)
  2. faster than the previous version
  3. simple enough for me to understand

Let’s see what happened.

Attempt 1: Trial Division

Trial division is one of the simplest methods for generating primes—you start counting at 2, and if no smaller positive integers (other than 1) divide the current number, it’s prime. The naive implementation is very simple.

def naive_trial():
    n = 2
    while True:
        may_be_prime = True
        for k in xrange(2, n):
            if n % k == 0:
                may_be_prime = False
                break
        if may_be_prime:
            yield n
        n += 1

This method takes 113.804 seconds to generate primes below 100000 – I couldn’t wait for the full 10⁷; it would probably be over 3 hours.

Trial until root

Fortunately, there are some pretty obvious optimizations one can make to the algorithm. The first comes from the observation that if there is a number k, 1 < k < n, that divides n, then there is a number j that divides n with 1 < j ≤ √n, so we can stop our trial once we’ve hit that point.

def trial_until_root():
    n = 2
    while True:
        may_be_prime = True
        for k in xrange(2, int(n**0.5)+1):
            if n % k == 0:
                may_be_prime = False
                break
        if may_be_prime:
            yield n
        n += 1

This method takes 468 seconds to generate primes up to 10⁷. A definite improvement (and already faster than my original generator), but there’s still room for more.

Trial by primes

Here’s another observation about divisors of n: if there’s a number k that divides n, then there’s a prime number p ≤ k that divides n, since k is either prime or has prime factors. So if we keep a list of the primes found so far, we only need to check prime divisors up to √n.

def trial_by_primes():
    primes_so_far = []
    n = 2
    while True:
        may_be_prime = True
        for p in primes_so_far:
            if n % p == 0:
                may_be_prime = False
                break
            if p * p > n: # it's prime
                break
        if may_be_prime:
            primes_so_far.append(n)
            yield n
        n += 1

Now we’re down to 136 seconds to generate primes below 10⁷. That was a worthwhile change, but we have to balance it against the fact that the generator now requires additional storage – the list of primes encountered so far. In this case, we’re storing over 660,000 prime numbers in a list. Even an older laptop can handle this burden, but it’s something to keep in mind.

That’s odd

And by “that”, I mean “all the prime numbers except for 2”. There’s no point checking the even numbers to see if they’re prime. Let’s see what happens when we skip them. The only tricky part (and it’s not that tricky) is to make sure we still return 2 as our first prime.

def odd_trial_by_primes():
    primes_so_far = []
    yield 2
    n = 3
    while True:
        may_be_prime = True
        for p in primes_so_far:
            if n % p == 0:
                may_be_prime = False
                break
            if p * p > n: # it's prime
                break
        if may_be_prime:
            primes_so_far.append(n)
            yield n
        n += 2

This method takes 127 seconds to generate primes less than 10⁷. Not a huge improvement, but better than nothing, and it doesn’t really complicate the code that much. I’ll keep it. The reason we don’t get a huge improvement here is that checking the even numbers for primeness doesn’t take that much effort – they were eliminated as soon as we modded them by the first prime in primes_so_far. Still, it’s a little quicker to jump right over them than to perform the division.

What about 3?

If skipping the numbers that are divisible by 2 paid off, will skipping those divisible by 3 pay off as well? As I noted above, every prime p greater than 3 satisfies p ≡ 1 (mod 6) or p ≡ 5 (mod 6). Let’s use that. We’ll take advantage of these facts:

  • if p ≡ 1 (mod 6) , then p+4 ≡ 5 (mod 6)
  • if p ≡ 5 (mod 6) , then p+2 ≡ 1 (mod 6)

So we want to alternate our step between 2 and 4. Fortunately 6 - 4 = 2 and 6 - 2 = 4, so we can use 6 - step as our next step.

def sixish_trial_by_primes():
    primes_so_far = []
    yield 2
    yield 3
    step = 2
    n = 5
    while True:
        may_be_prime = True
        for p in primes_so_far:
            if n % p == 0:
                may_be_prime = False
                break
            if p * p > n: # it's prime
                break
        if may_be_prime:
            primes_so_far.append(n)
            yield n
        n += step
        step = 6 - step

Now the time drops to 123 seconds to generate primes less than 10⁷. Clearly we’ve hit diminishing returns – we’re saving two modulus operations on numbers that are divisible by 3 (but not 2), at the cost of a more complicated “step” calculation. We could continue on in this vein, but the gains are not likely to be large, and the complexity increases rapidly. Consider the next step: we’d make sure we don’t test numbers divisible by 2, 3, or 5. That means (after 5) we only consider numbers whose remainders when divided by 30 are one of 1, 7, 11, 13, 17, 19, 23, or 29. The steps between numbers are 6, 4, 2, 4, 2, 4, 6, and 2. Who has the energy?
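If you did have the energy, the step pattern for any such “wheel” can be computed rather than hand-derived. Here’s a throwaway sketch (wheel_steps is my own helper, not part of the generators above):

def wheel_steps(spokes):
    # spokes: the small primes whose multiples the wheel skips, e.g. (2, 3, 5)
    modulus = 1
    for p in spokes:
        modulus *= p
    # residues mod the wheel size that aren't divisible by any spoke
    residues = [r for r in range(1, modulus + 1)
                if all(r % p != 0 for p in spokes)]
    # gaps between consecutive residues, wrapping around the wheel
    return [(residues[(i + 1) % len(residues)] - residues[i]) % modulus
            for i in range(len(residues))]

print wheel_steps((2, 3))      # [4, 2] - the sixish alternation
print wheel_steps((2, 3, 5))   # [6, 4, 2, 4, 2, 4, 6, 2]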

The problem with Trial Division

Trial division has a few things going for it:

  • it’s simple to understand
  • there are some obvious optimizations that can make its performance tolerable

Ultimately, though, its downfall is that it takes a lot of work to verify that a large number is prime. Consider the largest prime number below 10⁷: 9999991. In order to verify that this is prime, we have to try every prime up to √9999991 as a divisor. There are 446 of these. That’s 446 divisions, just to verify that one number is prime.
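That count of 446 is easy to double-check with the trial_by_primes generator from earlier:

import itertools

root = int(9999991 ** 0.5)   # 3162
print sum(1 for p in itertools.takewhile(lambda p: p <= root,
                                         trial_by_primes()))   # 446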

We’re unlikely to radically improve performance by continuing to tinker with trial division. It’s time to throw the whole thing away again and try a new approach. Next time we’ll do just that.