Tuesday, July 26, 2016

Sequences vs. Lists in F# (with Euler Problems #7, #8, #9, and #10)

In going through the Euler Problems in F#, I've found myself using lists and sequences a lot. Probably because I'm comfortable with the map, filter, collect, sum and similar functions because they are so much like the functions I love in LINQ in C#.

This means that my solutions have tended to be a bit of "brute force". But I have done some refinement later; this will continue.

I ended up working on the next set of Euler Problems at the same time -- bouncing between them and sharing ideas where I could. I ended up with solutions that were reasonably performant. But I picked up something else from this process:
I saw how choosing a sequence over a list (or a list over a sequence) can drastically impact performance.
Let's take a look at the solutions, and then we'll look at my learnings about sequences and lists.

[Update: Collected articles for the first 10 Euler Problems in F# are available here: Jeremy Explores Functional Programming.]

Euler Problem #7
Here's Euler Problem #7:
By listing the first six prime numbers: 2, 3, 5, 7, 11, and 13, we can see that the 6th prime is 13.
What is the 10,001st prime number?
And here's my solution:


And here's running the function with the target value:


This is a bit of a brute force solution. The "isPrime" determines whether there are any factors of a number. This is what we used when looking at Euler Problem #3 (the "factor" function), but then I added the "Seq.isEmpty" check on to the end. If there are no factors, then the number is prime.

Why Sequence is Important Here
The reason that using a sequence is important here is that sequences are lazy-evaluated -- similar to how IEnumerable in C# is only evaluated once you start enumerating through it.

So when we use "isEmpty", this pulls the first item in the sequence (and only the first item). If it finds an item, then it short-circuits the evaluation of the rest of the sequence. So, we don't actually end up calculating all the factors for a number, we only calculate the *first* factor. If it exists, then the number is not prime, and we return false. If there are no factors, then the number is prime, and we return true.

Why the "match"?
In the "euler7" function, we use pattern matching on our number. The reason is that I wanted to do a bit of optimization.

The second pattern uses another sequence. It goes from 3 to 1 million. But I specified "[3..2..1000000]", The middle term means "step 2", so we're only taking every other number in this sequence. This skips all of the even numbers (which we know are not prime), and cuts our processing time in half.

But we have one exception: "2" (an even number) is prime. So, I manually add a pattern for that in case someone asks for the first prime number.

Why Sequence is Important Here (Again)
You'll notice that we're using a sequence in the "euler7" function as well. This is because we ask for a particular "item". This will evaluate the sequence up to that item (and no further).

The "n-2" looks odd. That's because "item" is zero-based. So to get the 10001st prime, we need to get the item at index 10000. But we also skipped "2" in this sequence, so we need to subtract one more number from our index to get an accurate match.

Performance
It takes 18 seconds to evaluate the 10,001st prime number. That's not too bad for a brute force attack. There's sure to be a better mathematical approach to this problem, but I haven't researched that yet.

Euler Problem #8
Let's move on to Euler Problem #8:
The four adjacent digits in the 1000-digit number that have the greatest product are 9 × 9 × 8 × 9 = 5832.
73167176531330624919225119674426574742355349194934
96983520312774506326239578318016984801869478851843
85861560789112949495459501737958331952853208805511
12540698747158523863050715693290963295227443043557
66896648950445244523161731856403098711121722383113
62229893423380308135336276614282806444486645238749
30358907296290491560440772390713810515859307960866
70172427121883998797908792274921901699720888093776
65727333001053367881220235421809751254540594752243
52584907711670556013604839586446706324415722155397
53697817977846174064955149290862569321978468622482
83972241375657056057490261407972968652414535100474
82166370484403199890008895243450658541227588666881
16427171479924442928230863465674813919123162824586
17866458359124566529476545682848912883142607690042
24219022671055626321111109370544217506941658960408
07198403850962455444362981230987879927244284909188
84580156166097919133875499200524063689912560717606
05886116467109405077541002256983155200055935729725
71636269561882670428252483600823257530420752963450
Find the thirteen adjacent digits in the 1000-digit number that have the greatest product. What is the value of this product?
First of all, 1000 digit number?!?

My solution starts out by treating this as a giant string:


When we run this to find the 13 numbers that produce the largest product, we get this result:


Just a warning when you're looking for Euler solutions online. For this particular problem, there are different target values. This one uses 13, but there are also solutions that are looking for 5 adjacent numbers or another variation.

We don't need to treat the original number as a number, we can treat it as a string. What I decided to do was create a 13-character "window" on the string that I could slide through the entire 1000 characters.

So the "windowProduct" function takes a starting index and the size of the window (13 in our case), takes those characters and turns them into integers, and then multiplies them together.

A couple notes: I extracted out a "charToInt64" function to make converting a character to a number a bit more obvious. And I'm using 64-bit values here because our product will overflow a 32-bit integer.

For the "euler8" function, I create a sequence that includes all of the possible windows in our 1000-digit number. It finds the product of all those numbers, and picks out the biggest one.

Why Sequence is Important Here
The sequence inside "windowProduct" is important because we're dealing with a string here. In C#, a string implements the IEnumerable<char> interface; in F#, string is a "seq<char>". Since we already have a sequence, we'll just keep working with it.

Why List is Important Here
Inside the "euler8" function, we use a list rather than a sequence. The reason is that we need all of the possible values in order to find the "max" value. This means that all of the items need to be evaluated. Because of this (and the fixed size of the collection), a list is a bit more efficient here.

Performance is good (less than a 10th of a second), but in my (extremely unscientific) tests, using a sequence in this function took about twice as long. With these particular values, it's not a significant difference. But it's something to keep in mind.

Euler Problem #9
So let's move on to Euler Problem #9:
A Pythagorean triplet is a set of three natural numbers, a < b < c, for which,
a2 + b2 = c2
For example, 32 + 42 = 9 + 16 = 25 = 52.
There exists exactly one Pythagorean triplet for which a + b + c = 1000.
Find the product abc.
I did a very severe brute-force on this:


This does give the correct answer:


(And in case you're wondering, the three values are 200, 375, and 425.)

The "squareList" is a list that contains the squares for the numbers from 1 to half of our target value (I used half because it sounded reasonable since we were adding and squaring numbers in various ways -- I can't say that I have a mathematical proof for why this is a good value).

The "collect" function creates a Cartesian product of 2 sets of these lists of squares (similar to what we did for Euler Problem #4). This creates all of the permutations of values -- adding each square together and recording the value. The result is a list of tuples with the first square, the second square, and the sum of the two squares.

The "filter" compares the sum of the two squares against our "squareList" (since this is a list of known squares). It hangs on to only those values where the sum is also a square.

The "map" takes the square root of all of our values (yes, I know this isn't very efficient, but when I was trying to come up with a data structure that held both the squares and the original values, I ended up with a bit of a mess).

The "find" function gets the values where the original (non-squared) values add up to our target (1000).

Finally, we have the "prodTuple" function which will multiply the 3 values in our tuple together to get us the ultimate result.

Why Lists are Important Here
Like with our previous example, we have a fixed number of values (the Cartesian product of all of the squares from 1 to half of our target). Since this is a fixed size, and we're evaluating everything, a sequence wouldn't buy us any short-circuiting advantages.

Performance
The performance is okay. It takes a little over a second to evaluate this. Again, there is probably a better mathematical approach to this. But computers are very good a brute-forcing things (that's why they're so good at playing chess).

Euler Problem #10
Finally today, we have Euler Problem #10:
The sum of the primes below 10 is 2 + 3 + 5 + 7 = 17.
Find the sum of all the primes below two million.
Here's my solution (that performs very well):


And when we run this with our target value, we get the correct answer:


This isn't my first try at this. My first try was a brute force approach. That didn't come out so well:


It worked, but it took almost an hour and a half to complete. This is really not acceptable.

Sieve of Eratosthenes
To get acceptable performance, I abandoned the brute-force approach and used something known as the Sieve of Eratosthenes. This is really good at finding the prime numbers below a certain value.

The short version is that we can eliminate all values that are multiples of a prime. Let's take the values from 2 to 20:

[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]

First, we can eliminate everything evenly divisible by 2 (our first prime):

[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]

Then we can eliminate everything evenly divisible by 3 (our second prime):

[2, 3, 4, 5, 6, 7, 8, 910, 11, 12, 13, 14, 1516, 17, 18, 19, 20]

We only have to go up to the square root of our target value (that's the number I mistakenly applied to a different problem previously) Since the square root of 20 is 4.x, we can stop at 4.

What's left is the set of primes less than 20:

[2, 3, 5, 7, 11, 13, 17, 19]

A Clumsy Approach
I'll admit that my implementation of the Sieve of Eratosthenes is a bit clumsy. But it works. You'll see that we have the same "isPrime" function that we used in Euler Problem #7 (although it's just on a single line here).

Then I calculate a set of "sievePrimes". This uses brute-force to get the prime numbers up to 1414 (which happens to be the square root of our target number).

The "divisibleByRange" function takes a number and sees if it is evenly divisible by any value in a range of values. When we use our "sievePrimes" as the range, we can find out which values can be eliminated in our sieve.

The "euler10" function applies the "divisibleByRange" against the entire set of numbers from 2 to our target value. We have to use the "not" because "divisibleByRange" returns true if it's divisible; and these are the numbers we do *not* want to include in our list.

Why Sequence is Important
Our "isPrime" uses a sequence. The importance of this is the short-circuiting that we saw earlier.

The sequence in "sievePrimes" is important because we don't know how many values we'll end up with. All we know is that there will be fewer than 1414. So rather than allocating a list, we'll use a sequence.

Then we take that sequence and turn it into a list. Why?

Why List is Important
The reason why we turn this sequence into a list is that once we've evaluated it one time, we *do* know how many items are in the list (and there's no reason to re-evaluate them later).

"divisibleByRange" works with a list because it's designed to work with our "sievePrimes" value, which we just saw is a list.

Why Sequence is Important
In "euler10" I start with a sequence because I don't know how many prime numbers will be in our collection.

The rest of the function will "map" the values to 64-bit integers (because our sum will overflow a 32-bit integer), and then we call "sum" to get our result.

Sequences and Lists
In going through these examples, I had to think quite a bit about whether I wanted to use a sequence or a list. I found that if I picked the wrong one, I would end up taking a performance hit.

For example, when I used a sequence for the "sievePrimes" (instead of converting it to a list), things slowed down significantly -- so much that I didn't even wait for the process to finish. Since this value is used over and over again, it's better for it to be in a list that's fully evaluated.

At the same time, for "isPrime" it works much better to use a sequence. As soon as the first value returns from the sequence, the "isEmpty" function returns false. This short-circuits evaluation of the rest of the collection.

Short Version:
Sequences are good when we don't know how many values we'll have or when we want to short-circuit evaluation.
Lists are good when we know how many values we have or when we want to evaluate everything (such as a max or sum).
This is something that I "knew" from reading. But when I was working with these particular problems, I saw the difference in action.

At times, the differences were insignificant. But at other times, the performance differences were dramatic -- to the point of making one solution completely unusable.

Wrap Up
I've been doing a lot of experimentation lately. This has been a great help in showing me the real differences in different solutions. In this case, I feel like I've got a much better handle on when sequences are appropriate and when lists are appropriate. (You'll notice that I didn't include any arrays here -- I guess that's something I'll have to look into.)

I know that there are better solutions to the Euler Problems than then ones that I've got here -- for example, I know that there are recursive implementations of the Sieve of Eratosthenes. But I'm not done looking into these problems just yet. It's time for me to do a bit more exploration online to see what other folks have come up with.

If you have solutions or ideas that you'd like to share, feel free. The comments on the other Euler articles that I've written have been very helpful to me (and hopefully to others as well).

Happy Coding!

1 comment:

  1. You've got the list vs seq down pat here.
    There are a few things I'd do slightly differently -- in the isPrime function, the shorter sequence

    Seq.initInfinite (i -> i+2) |> Seq.takeWhile (i -> i*i <= x)

    instead of

    for i = 2 to x/2 do

    the multiplication being safe from overflow since x is a representable number in Int32.

    When the problem gives you the small primes, it would also be OK to pre-sieve by having 2 and 3 known and then only looking at values of the form (6n ± 1).

    Prime generation is also a good candidate for solutions involving memoization, with isPrime only starting the exhaustive sequence after running out of known prime values.

    ReplyDelete