Friday, December 14, 2012

Not Optimizing Haskell

The flu can go fuck itself in its nonexistent, viral ass. This shit will not beat me. While I run down the clock, I'm profiling more things to make me feel a bit better.

First off, neither GHCi nor haskell-mode comes with an interactive profiler, or, as far as I can tell, any utilities to make batch profiling easier. The way you profile Haskell programs is by installing the profiling versions of your libraries

apt-get install libghc-mtl-dev libghc-mtl-prof

compiling your program with the profiling flags on

ghc -prof -auto-all -o outFile yourFile.hs

and then running the result with some different profiling flags.

./outFile +RTS -p

That should create a file called outFile.prof in the directory you just ran it from, and that file will contain a couple of well-formatted tables that will tell you where your space and time cost-centers are.

So... let's automate this.

(defun ha-custom-profile-buffer ()
  "Profile the Haskell file in the current buffer and display the report."
  (interactive)
  (find-file-other-window 
   (ha-custom-profile-haskell-file (buffer-file-name))))

(defun ha-custom-profile-haskell-file (abs-filename)
  "Compiles the given file with profiling, 
runs it with the +RTS -p flags and returns
the filename of the profiling output."
  (assert (string= "hs" (file-name-extension abs-filename)))
  (let* ((f-name (file-name-sans-extension abs-filename))
         (tmp (make-temp-file f-name))
         (tmp-name (file-name-nondirectory tmp))
         (tmp-dir (file-name-directory tmp)))
    (message "Compiling...")
    (shell-command (format "ghc -prof -auto-all -o %s '%s'" tmp abs-filename))
    (message "Profiling...")
    (shell-command (format "%s./%s +RTS -p" tmp-dir tmp-name))
    (concat tmp-name ".prof")))

Those functions are both now part of my ha-custom mode. The big one takes a Haskell file, compiles it to a tempfile with the appropriate flags, runs the result with the other appropriate flags, and returns the name of the profiling output file. The little function takes the current buffer and runs it through the big one, then opens the result in a new window. That should make it a bit easier to actually do the profiling.
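
If you'd rather not go through M-x every time, binding the command to a key works too; the key choice below is just an example, not something that ships with ha-custom.

;; example binding only; pick whatever key you like
(global-set-key (kbd "C-c h p") 'ha-custom-profile-buffer)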

Actually Profiling Haskell

We started with pretty much the same thing as the Lisp code. And I'll strip the printing elements again for the purposes of this exercise; we're not interested in how inefficient it is to actually produce a grid based on our model of the world.

module Main where
import Data.List (group, sort, concatMap)
import Data.Set

lifeStep :: Set (Int, Int) -> Set (Int, Int)
lifeStep cells = fromList [head g | g <- grouped cells, viable g]
  where grouped = group . sort . concatMap neighbors . toList
        neighbors (x, y) = [(x+dx, y+dy) | dx <- [-1..1], dy <- [-1..1], (dx,dy) /= (0,0)]
        viable [_,_,_] = True           -- three live neighbors; alive next step regardless
        viable [c,_] = c `member` cells -- two live neighbors; survives only if currently alive
        viable _ = False

runLife :: Int -> Set (Int, Int) -> Set (Int, Int)
runLife steps cells = rec (steps - 1) cells
  where rec 0 cells = cells
        rec s cells = rec (s - 1) $! lifeStep cells

glider = fromList [(1, 0), (2, 1), (0, 2), (1, 2), (2, 2)]
blinker = fromList [(1, 0), (1, 1), (1, 2)]
gosperGliderGun = fromList [(24, 0), (22, 1), (24, 1), (12, 2), (13, 2), (20, 2), (21, 2), (34, 2), (35, 2), (11, 3), (15, 3), (20, 3), (21, 3), (34, 3), (35, 3), (0, 4), (1, 4), (10, 4), (16, 4), (20, 4), (21, 4), (0, 5), (1, 5), (10, 5), (14, 5), (16, 5), (17, 5), (22, 5), (24, 5), (10, 6), (16, 6), (24, 6), (11, 7), (15, 7), (12, 8), (13, 8)]

main :: IO ()
main = putStrLn . show $ runLife 5000 gosperGliderGun

It's almost the same, actually, except that we determine frequencies differently. Instead of doing a single traversal of the corpus, we do what looks like a much more expensive operation composing group onto sort onto concatMap neighbors. In a book, that would be called "foreshadowing".
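
If the group . sort trick isn't obvious, here's the idea on a toy input of mine (not taken from the program): sorting puts duplicate neighbor coordinates next to each other, and group collects the duplicates, so each group's head is a candidate cell and its length is that cell's live-neighbor count.

-- illustration only; worked by hand
-- (group . sort) [(1,1), (0,0), (1,1), (1,1)]
--   => [[(0,0)], [(1,1),(1,1),(1,1)]]
-- meaning (0,0) has one live neighbor and (1,1) has three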

A first run-through of M-x ha-custom-profile-buffer gives us

        Fri Dec 14 21:48 2012 Time and Allocation Profiling Report  (Final)

           life21765U60 +RTS -p -RTS

        total time  =       30.15 secs   (30153 ticks @ 1000 us, 1 processor)
        total alloc = 24,382,856,840 bytes  (excludes profiling overheads)

COST CENTRE        MODULE    %time %alloc

lifeStep.grouped   Main       57.4   53.6
lifeStep.neighbors Main       24.7   40.9
lifeStep           Main       11.4    5.5
lifeStep.viable    Main        6.5    0.0


                                                                    individual     inherited
COST CENTRE               MODULE                  no.     entries  %time %alloc   %time %alloc

MAIN                      MAIN                     46           0    0.0    0.0   100.0  100.0
 CAF                      Main                     91           0    0.0    0.0   100.0  100.0
  gosperGliderGun         Main                     97           1    0.0    0.0     0.0    0.0
  main                    Main                     92           1    0.0    0.0   100.0  100.0
   runLife                Main                     93           1    0.0    0.0   100.0  100.0
    runLife.rec           Main                     94        5000    0.0    0.0   100.0  100.0
     lifeStep             Main                     95        4999   11.4    5.5   100.0  100.0
      lifeStep.viable     Main                     99    10002308    6.5    0.0     6.5    0.0
      lifeStep.grouped    Main                     96        4999   57.4   53.6    82.1   94.5
       lifeStep.neighbors Main                     98     2314620   24.7   40.9    24.7   40.9
 CAF                      Data.Set                 90           0    0.0    0.0     0.0    0.0
 CAF                      GHC.Conc.Signal          87           0    0.0    0.0     0.0    0.0
 CAF                      GHC.IO.Handle.FD         80           0    0.0    0.0     0.0    0.0
 CAF                      GHC.IO.Encoding          74           0    0.0    0.0     0.0    0.0
 CAF                      GHC.IO.Encoding.Iconv    62           0    0.0    0.0     0.0    0.0

We're actually only interested in that small table, so I'll omit the exhaustive one from here on out. Basically, yes: grouped and neighbors are the resource hogs here. Still, this compares favorably against the Common Lisp infinite-plane version, both in terms of program complexity and in terms of runtime. Not to mention that the initial CL version actually crashed at ~3000 iterations because it doesn't like tail recursion.

Anyhow, the first thing we're doing this time is limiting the size of the world.

inRange :: Ord a => a -> a -> a -> Bool
inRange low n high = low < n && n < high

lifeStep :: Int -> Set (Int, Int) -> Set (Int, Int)
lifeStep worldSize cells = fromList [head g | g <- grouped cells, viable g]
  where grouped = group . sort . concatMap neighbors . toList
        neighbors (x, y) = [(x+dx, y+dy) | dx <- [-1..1], dy <- [-1..1], 
                            (dx,dy) /= (0,0), inSize (dx+x) (dy+y)]
        inSize x y = inR x worldSize && inR y worldSize
        inR = inRange 0
        viable [_,_,_] = True
        viable [c,_] = c `member` cells
        viable _ = False

runLife :: Int -> Int -> Set (Int, Int) -> Set (Int, Int)
runLife worldSize steps cells = rec (steps - 1) cells
  where rec 0 cells = cells
        rec s cells = rec (s - 1) $! lifeStep worldSize cells

main :: IO ()
main = putStrLn . show $ runLife 50 5000 gosperGliderGun

That's gonna do the same thing it did yesterday: prevent massive, processor-fucking overpopulation.

        Fri Dec 14 22:03 2012 Time and Allocation Profiling Report  (Final)

           life21765GEE +RTS -p -RTS

        total time  =        1.61 secs   (1608 ticks @ 1000 us, 1 processor)
        total alloc = 1,132,473,192 bytes  (excludes profiling overheads)

COST CENTRE        MODULE  %time %alloc

lifeStep.grouped   Main     46.5   41.2
lifeStep.neighbors Main     23.4   37.8
inRange            Main     11.1   11.9
lifeStep           Main      6.2    3.0
lifeStep.viable    Main      6.1    0.0
lifeStep.inSize    Main      3.6    6.0
lifeStep.inR       Main      2.9    0.0

Granted, inRange is now on the map as a cost center, but this shaved ~28 seconds off the final run time, so I'm gonna call that fair enough. Given the numbers we were posting yesterday, I'm almost tempted to call this good enough. Let's see where it all goes, shall we? A step count of

50

        Fri Dec 14 22:06 2012 Time and Allocation Profiling Report  (Final)

           life21765TOK +RTS -p -RTS

        total time  =        0.03 secs   (29 ticks @ 1000 us, 1 processor)
        total alloc =  18,129,192 bytes  (excludes profiling overheads)

COST CENTRE        MODULE  %time %alloc

lifeStep.grouped   Main     55.2   42.0
lifeStep.neighbors Main     17.2   36.9
inRange            Main     13.8   11.7
main               Main      3.4    0.1
lifeStep           Main      3.4    3.2
lifeStep.inSize    Main      3.4    5.8
lifeStep.inR       Main      3.4    0.0

We've seen 5000 already, so

50 000

        Fri Dec 14 22:07 2012 Time and Allocation Profiling Report  (Final)

           life21765gYQ +RTS -p -RTS

        total time  =       15.94 secs   (15942 ticks @ 1000 us, 1 processor)
        total alloc = 11,262,873,192 bytes  (excludes profiling overheads)

COST CENTRE        MODULE    %time %alloc

lifeStep.grouped   Main       45.3   41.2
lifeStep.neighbors Main       23.0   37.8
inRange            Main       12.7   11.9
lifeStep           Main        6.6    3.0
lifeStep.viable    Main        5.9    0.0
lifeStep.inSize    Main        3.8    6.0
lifeStep.inR       Main        2.4    0.0

5 000 000

        Fri Dec 14 22:37 2012 Time and Allocation Profiling Report  (Final)

           big +RTS -p -RTS

        total time  =     1594.43 secs   (1594429 ticks @ 1000 us, 1 processor)
        total alloc = 1,125,606,873,896 bytes  (excludes profiling overheads)

COST CENTRE        MODULE    %time %alloc

lifeStep.grouped   Main       45.4   41.2
lifeStep.neighbors Main       23.6   37.8
inRange            Main       12.5   11.9
lifeStep           Main        6.2    3.0
lifeStep.viable    Main        5.8    0.0
lifeStep.inSize    Main        3.6    6.0
lifeStep.inR       Main        2.6    0.0

It's funny: after just clipping the board, we start getting much better numbers with unoptimized Haskell than we saw with unoptimized Common Lisp. That's not really much of a victory, since optimized Lisp was handily beating the numbers we're putting down today, but it's also not the showdown I want to see. I want to know how optimized Haskell stacks up, and I want to know how Gridless Life stacks up against a gridded implementation. Back to Rosetta Code, I guess. Second verse, same as the first: I added a grid-appropriate gun[1] and stripped all but the final printing code.

import Data.Array.Unboxed
import Data.List (unfoldr) 

type Grid = UArray (Int,Int) Bool
 -- The grid is indexed by (y, x).
 
life :: Int -> Int -> Grid -> Grid
{- Returns the given Grid advanced by one generation. -}
life w h old =
    listArray b (map f (range b))
  where b@((y1,x1),(y2,x2)) = bounds old
        f (y, x) = ( c && (n == 2 || n == 3) ) || ( not c && n == 3 )
          where c = get x y
                n = count [get (x + x') (y + y') |
                    x' <- [-1, 0, 1], y' <- [-1, 0, 1],
                    not (x' == 0 && y' == 0)]
 
        get x y | x < x1 || x > x2 = False
                | y < y1 || y > y2 = False
                | otherwise       = old ! (y, x)
 
count :: [Bool] -> Int
count = length . filter id

grid :: [String] -> (Int, Int, Grid)
grid l = (width, height, a)
  where (width, height) = (length $ head l, length l)
        a = listArray ((1, 1), (height, width)) $ concatMap f l
        f = map g
        g '.' = False
        g _   = True
 
printGrid :: Int -> Grid -> IO ()
printGrid width = mapM_ f . split width . elems
  where f = putStrLn . map g
        g False = '.'
        g _     = '#'
 
split :: Int -> [a] -> [[a]]
split n = takeWhile (not . null) . unfoldr (Just . splitAt n)

gosperGliderGun = grid
    ["........................#.........................",
     "......................#.#.........................",
     "............##......##............##..............",
     "...........#...#....##............##..............",
     "##........#.....#...##............................",
     "##........#...#.##....#.#.........................",
     "..........#.....#.......#.........................",
     "...........#...#..................................",
     "............##....................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     ".................................................."]

printLife :: Int -> (Int, Int, Grid) -> IO ()
printLife n (w, h, g) = printGrid w . last . take n $ iterate (life w h) g
 
main = printLife 50 gosperGliderGun

Ok, let's rev this sucker up.

50

        Fri Dec 14 22:29 2012 Time and Allocation Profiling Report  (Final)

           life-grid21765tiW +RTS -p -RTS

        total time  =        1.32 secs   (1319 ticks @ 1000 us, 1 processor)
        total alloc = 891,555,608 bytes  (excludes profiling overheads)

COST CENTRE MODULE  %time %alloc

life.get    Main     59.9   50.4
life.f.n    Main     30.5   41.7
life.f      Main      3.9    3.0
count       Main      3.5    0.8
life        Main      2.0    3.9

5000

        Fri Dec 14 22:32 2012 Time and Allocation Profiling Report  (Final)

           life-grid217656sc +RTS -p -RTS

        total time  =      133.77 secs   (133771 ticks @ 1000 us, 1 processor)
        total alloc = 90,810,516,640 bytes  (excludes profiling overheads)

COST CENTRE MODULE  %time %alloc

life.get    Main     59.1   50.5
life.f.n    Main     31.3   41.8
count       Main      3.5    0.8
life.f      Main      3.4    3.0
life        Main      2.4    3.9

That's ... almost sad enough not to be funny. Almost. Do note for the record that this is an order of magnitude up from the gridless version with the same inputs. And when you think about what's involved in each traversal of each corpus, it kind of becomes obvious why that is. The grid's corpus traversal always has 2500 stops. The gridless traversal is somewhere between 50 and 100 for a comparably populated board of the same size; 2500 is our worst case, and we'll probably never hit it.
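
To put rough numbers on that, here's the back-of-the-envelope version. These two helpers are mine, purely for illustration; they aren't part of either program.

import Data.Set (Set, size)

gridStops :: Int -> Int
gridStops worldSize = worldSize * worldSize -- the grid visits every cell, every step; 50 * 50 = 2500

gridlessStops :: Set (Int, Int) -> Int
gridlessStops = size                        -- the gridless corpus is just the live cells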

I'm not even going to bother profiling the higher steps with this approach if 5000 took two minutes. I do still want to see how low we can go, and how we'd go about it.

The first thought I have is to try out that iterate approach, rather than recursing manually

runLife :: Int -> Int -> Set (Int, Int) -> Set (Int, Int)
runLife worldSize steps cells = last . take steps $ iterate (lifeStep worldSize) cells

main :: IO ()
main = putStrLn . show $ runLife 50 50000 gosperGliderGun

Yes, it's more elegant. But will it blend?

        Fri Dec 14 22:45 2012 Time and Allocation Profiling Report  (Final)

           life21765UId +RTS -p -RTS

        total time  =        0.01 secs   (12 ticks @ 1000 us, 1 processor)
        total alloc =  20,022,728 bytes  (excludes profiling overheads)

COST CENTRE      MODULE  %time %alloc

runLife          Main     41.7   36.0
lifeStep.grouped Main     33.3   52.1
lifeStep         Main     25.0   11.7

Hm.

I'm gonna go ahead and put that one down to a profiler error, especially since running the same program in interactive mode confers no such magical acceleration. This does kind of call the process into question, though...
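
One thing worth trying later, just to rule laziness in or out: keep the iterate-style one-liner but force each generation as it goes. This is a sketch I haven't profiled, not a fix I'm claiming.

import Data.List (foldl')

runLife :: Int -> Int -> Set (Int, Int) -> Set (Int, Int)
runLife worldSize steps cells =
  -- foldl' evaluates each intermediate generation instead of stacking
  -- tens of thousands of unevaluated lifeStep applications
  foldl' (\world _ -> lifeStep worldSize world) cells [1 .. steps - 1]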

Oh, well, I'm meant to be exploring. Let's pull the same incremental stuff we did with CL yesterday. Firstly, we're already using Set here, so the member check is already as tight as it's going to get. Our last valid profiler ping told us that lifeStep.grouped is where the big costs are paid, so let's see if we can't reduce them somewhat.

import qualified Data.List -- for the qualified Data.List.filter call below
import qualified Data.Map as Map 
  (Map, lookup, insert, adjust, delete, fromList, toList)

frequencies :: [(Int, Int)] -> Map.Map (Int, Int) Int
frequencies list = rec list $ Map.fromList []
  where inc = Map.adjust (+1)
        rec [] m = m
        rec (cell:rest) m = rec rest newM
          where newM = if Nothing == Map.lookup cell m
                       then Map.insert cell 1 m
                       else inc cell m

lifeStep :: Int -> Set (Int, Int) -> Set (Int, Int)
lifeStep worldSize cells = fromList [fst g | g <- grouped cells, viable g]
  where grouped = Data.List.filter viable . Map.toList . frequencies . concatMap neighbors . toList
        neighbors (x, y) = [(x+dx, y+dy) | dx <- [-1..1], dy <- [-1..1], 
                            (dx,dy) /= (0,0), inSize (dx+x) (dy+y)]
        inSize x y = inR x worldSize && inR y worldSize
        inR = inRange 0
        viable (_,3) = True
        viable (c,2) = c `member` cells
        viable _ = False

We've added a Map of frequencies, rather than doing the naive group . sort thing. We've also had to tweak viable just a bit to accommodate.
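
Just to sanity-check the new piece, here's frequencies on the same toy input I used earlier (mine, worked by hand); it carries the same information as the group . sort groups, just keyed by cell.

-- frequencies [(1,1), (0,0), (1,1), (1,1)]
--   => fromList [((0,0),1),((1,1),3)]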

        Fri Dec 14 22:54 2012 Time and Allocation Profiling Report  (Final)

           life21765ucp +RTS -p -RTS

        total time  =        2.41 secs   (2406 ticks @ 1000 us, 1 processor)
        total alloc = 1,216,439,760 bytes  (excludes profiling overheads)

COST CENTRE          MODULE    %time %alloc

frequencies.rec.newM Main       41.2   17.4
lifeStep.neighbors   Main       16.6   35.2
frequencies.inc      Main       12.4   12.2
lifeStep.viable      Main        9.4    4.7
inRange              Main        8.1   11.1
lifeStep.grouped     Main        3.4    8.2
lifeStep             Main        3.4    2.8
lifeStep.inSize      Main        2.3    5.6
lifeStep.inR         Main        1.4    0.0
frequencies.rec      Main        1.3    2.8

That's ... hm. Actually an increase of about a second. Maybe it does comparatively better on bigger data-sets?

main :: IO ()
main = putStrLn . show $ runLife 50 50000 gosperGliderGun

        Fri Dec 14 22:57 2012 Time and Allocation Profiling Report  (Final)

           life217657mv +RTS -p -RTS

        total time  =       23.96 secs   (23961 ticks @ 1000 us, 1 processor)
        total alloc = 12,100,319,760 bytes  (excludes profiling overheads)

COST CENTRE          MODULE    %time %alloc

frequencies.rec.newM Main       39.7   17.4
lifeStep.neighbors   Main       16.0   35.2
frequencies.inc      Main       13.3   12.2
lifeStep.viable      Main        9.5    4.7
inRange              Main        8.6   11.1
lifeStep.grouped     Main        3.6    8.2
lifeStep             Main        3.4    2.8
lifeStep.inSize      Main        2.6    5.6
lifeStep.inR         Main        1.8    0.0
frequencies.rec      Main        1.4    2.8

Nope. It actually does comparatively worse.

Hmmm.

I'm going to cut it here for now. I think I've done enough damage. I won't be putting the latest up[2] for obvious reasons. Yes, I peeked ahead, which is why I knew this particular optimization wouldn't work in Haskell early enough to foreshadow it, but I still wanted to formalize my thoughts about it.

It's hard not to learn something from playing with a language's profiler. This experience tells me that I might have the wrong model in my head, or it might be that predicting where a traversal will happen is a lot more difficult in lazy languages, or, as I suspect from the latest profiler readouts, it might be that a Haskell Map's lookup speed isn't constant time. The reason I suspect this is that some of our biggest cost centers are now frequencies.rec.newM (which does a Map.lookup each call) and frequencies.inc (which manipulates a particular element of a Map, so I assume a lookup is part of it).
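
If that's right, the obvious first thing to try (a sketch, untested as of this writing) is folding the lookup and the insert/adjust into a single Map operation with insertWith. The lookup inside is still O(log n), but at least each cell only pays for it once.

import qualified Data.Map as Map
import Data.List (foldl')

frequencies :: [(Int, Int)] -> Map.Map (Int, Int) Int
frequencies = foldl' step Map.empty
  where step m cell = Map.insertWith (+) cell 1 m -- one traversal covers both the "new cell" and "seen it" cases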

I'm off to read up on Haskell data structures and test these hypotheses.

Oh. And heal up.


Footnotes

1 - [back] - (by the way, this makes clear that whatever the performance comparisons come down to, the gridless version has a more elegant notation)

2 - [back] - (though the limited-size version and the gridded competitor will be checked in)

8 comments:

  1. "total time = 133.77 secs (133771 ticks @ 1000 us, 1 processor)"

    I got 6.5s for this on my machine. How are you building?

    Replies
    1. Oh, I see. You put how you're building it at the top of the article. Try adding -O2 to that.

    2. Or perhaps I missed the point and this was on purpose?

    3. Arbitrarily didn't use optimization flags, deliberately used the same command for each program. I guess since there's interest I'll go back through later and re-bench these with `-O` and `-O2` as well.

    4. -O2 makes a difference, but it's not as big on my machine as yours. The grid-based solution does 5000 in 17.72 seconds and 50000 in 177.71 seconds. The gridless version does marginally better too; 5000 goes down to 1.02 sec from 1.61 and 50000 goes down to 10.27 from 15.94. 5 000 000 claims to run in 102.46 seconds, but the extended report only lists 500 000 entries for `lifeStep`, so I assume the actual number would be closer to ~700 seconds.

  2. As I understand it, Haskell's Map is implemented as trees/forests, not a hashmap, so lookup is not constant time. It's a blend of linear/log time depending on how wide the forests are internally.

    Replies
    1. I figured as much. I'll try this again with `Data.HashMap` instead as soon as I get the profiling libraries for it installed.

    2. Hmm. Data.HashMap doesn't seem to yield any sort of improvement over `Data.Map`, with or without -O flags. Do you know if there IS a constant-time lookup construct available somewhere in Haskell? Implementing my own hash function isn't something I was expecting to do for this.
