Thursday, December 27, 2012

Angular Reflections

I've been working with Angular.js for the past little while, both on Web-Mote (still listed as a Perl project for some reason) and on various work projects. Overall, the impression is very good. For the most part, the Angular approach saves a lot of lines and cycles over jQuery-style DOM-traversal and manipulation. It always saves a lot of lines over Handlebars-style HTML templating, which was honestly a bit of a surprise at first. Proper routing is slightly more annoying than its backbone.js counterpart, but it forces you to break your app out into discrete, composable pieces, which seems like it would help you scale up down the line.

There are a couple of small places where DOM-traversal seems to be the easier way forward[1], and there's one omission made by the Angular guys[2], but otherwise, I can heartily recommend the library, even after initial frustration.

The big perspective change you need to come to grips with is the shift from imperative/functional-ish programming to a model-centric, reactive approach. Using plain jQuery, you might define a dynamic list as a $.each/append call wrapped in a render method somewhere. You might define the template manually in your js code if it's simple enough, or do it using Handlebars or a similar HTML-templating library if it's more involved. If you need to collect the contents/order of the list later, you traverse the DOM and pull out the chunks you need.

It's not an unreasonable way of going about things, but Angular does it better; the same task is done by using the HTML-DSL to describe the relationship of a model (a literal JS list of objects) to the markup, and then populating that model. The framework reactively updates the DOM whenever model changes occur. Later, when you need the data back, you don't need to traverse anything. You just send the model out to wherever it needs to go.
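To make the contrast concrete, here's a minimal, framework-free sketch of the idea (plain JS, with names of my own choosing; Angular's actual change-detection machinery is obviously far more involved):

```javascript
// The model is the single source of truth; the view is *derived* from it.
// This render step stands in for what Angular's ng-repeat does for you.
var model = [{cmd: "stop"}, {cmd: "pause"}];

function render(items) {
  return "<ul>" + items.map(function (c) {
    return '<li class="' + c.cmd + '"></li>';
  }).join("") + "</ul>";
}

var view = render(model);  // derive the markup from the model
model[1] = {cmd: "play"};  // change the model...
view = render(model);      // ...and the view follows (Angular re-runs this for you)

// Getting the data back out is just reading `model`; no DOM traversal needed.
```

The point being that the framework takes over the render-on-change step, so your own code only ever touches the model.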

Let's go through some before-and-after shots of Web-Mote for illustrative purposes. Specifically, let's take a look at the controls, since that's the simpler piece. Incidentally, I'm not claiming that this is the most elegant code either before or after. I just want to show you the structural and philosophical differences between approaches.

Before

First, the relevant HTML markup

...

<script id="tmp-control" type="text/x-handlebars-template">
  <li class="{{cmd}}">
    <button class="btn" onclick="mote.command('{{cmd}}');"
            {{#if held}}
            onmousedown="mote.hold('{{cmd}}');" onmouseup="mote.release();" onmouseout="mote.release();"
            {{/if}}>
      <i class="icon-{{cmd}}"></i>
    </button>
  </li>
</script>

<script id="tmp-control-block" type="text/x-handlebars-template">
  <ul>
    {{#each this}}
    {{#control-button this}}{{/control-button}}
    {{/each}}
  </ul>
</script>

...

<div id="controls"></div>

...

The target element gets its own id so that we can refer to it from our jQuery code. The script blocks are Handlebars template declarations. I've elided the rest of the HTML markup because it's all include/template/meta overhead, but you can see it in the appropriate Web-Mote commit if you are so inclined.

Here are the relevant JS declarations

Handlebars.registerHelper("control-button", function (ctrl) {
    return new Handlebars.SafeString(templates.control(ctrl));
});

var templates = {
    control: Handlebars.compile($("#tmp-control").html()),
    controlBlock : Handlebars.compile($("#tmp-control-block").html())
}

var mote = {
    pressed: false,
    hold: function (cmd) {
        mote.release();
        mote.pressed = setInterval(function(){
            mote.command(cmd);
        }, 200);
    },
    release: function () {
        clearInterval(mote.pressed);
        mote.pressed = false;
    },
    renderControls: function (controlLists) {
        $.each(controlLists,
               function (index, e) {
                   $("#controls").append(templates.controlBlock(e));
               })
    },
    command: function (cmd) {
        console.log(cmd);
        $.post("/command", {"command": cmd},
               function () {
                   if (cmd == "pause") {
                       var btn = templates.control({cmd: "play"});
                       $("#controls .pause").replaceWith(btn);
                   } else if (cmd == "play") {
                       var btn = templates.control({cmd: "pause"});
                       $("#controls .play").replaceWith(btn);
                   }
               })
    }
}

// older versions of safari don't like `position: fixed`.
// they also don't like when you set `position: fixed` in a stylesheet,
//   then override that with inline styles.
// what I'm saying is that older versions of safari are assholes
if ($.browser.safari) {
    $("#controls").css({ "position": 'absolute' });
    window.onscroll = function() {
        $("#controls").css({ 
            "top" : window.pageYOffset + 'px'
        });
    };
} else {
    $("#controls").css({ "position": 'fixed' });    
}

command is only relevant because it switches out the pause button for a play button when it's pressed successfully. Observe that all of the rendering here is happening through DOM manipulations. We run .append over the result of calling the controlBlock template on each group of player controls, and each call to controlBlock itself applies the control template. When we need to do that button switch I mentioned, we do it by calling .replaceWith on the appropriate DOM selector. We probably could have avoided going to sub-templates for control buttons, but that would have saved us five lines at the outside; just the script tag boilerplate in the HTML markup, and that Handlebars helper definition.

Finally, here's the .ready() call

$(document).ready(function() {
    mote.renderControls(
        [[//{cmd: "step-backward"},
          {cmd: "backward", held: true},
          {cmd: "stop"},
          {cmd: "pause"},
          {cmd: "forward", held: true}
          //{cmd: "step-forward"}
         ],
         [{cmd: "volume-down", held: true},
          {cmd: "volume-off"},
          {cmd: "volume-up", held: true}]]);
});

That's that. Like I said, this isn't the most elegant code I've ever written. If I really put my mind to it, I might be able to shave off ten lines or so, and clarify my intent in a couple of places, but I think it would be pretty difficult to do much better without fundamentally changing the approach.

After

HTML markup first

<div id="controls" ng-controller="CommandCtrl" ng-style="style">
  <ul ng-repeat="controlsList in controlTree">
    <li ng-repeat="control in controlsList" class="{{control.cmd}}" ng-switch="control.held">
      <button class="btn" ng-switch-when="true" 
              ng-mousedown="command(control.cmd); hold(control.cmd)"
              ng-mouseup="release()" ng-mouseleave="release()">
        <i class="icon-{{control.cmd}}"></i>
      </button>
      <button class="btn" ng-switch-default ng-click="command(control.cmd)">
        <i class="icon-{{control.cmd}}"></i>
      </button>
    </li>
  </ul>
</div>

It should be fairly self-explanatory. That's not the clearest code you're likely to find, but it's illustrative. We've got a bunch of non-HTML directives strewn about; all the stuff starting with ng- is part of the Angular DSL. While we need the {{}} notation to evaluate expressions inside of standard HTML attributes, any code inside of ng- attributes is automatically evaluated against the scope set up by the controller CommandCtrl.

function CommandCtrl ($scope, $http) {
// older versions of safari don't like `position: fixed`.
// they also don't like when you set `position: fixed` in a stylesheet,
//   then override that with inline styles.
// what I'm saying is that older versions of safari are assholes
    if (util.browser().agent == 'safari') {
        window.onscroll = function() { 
            $scope.style = { position: "absolute", top : window.pageYOffset + 'px' };
        };
    } else {
        $scope.style = { position: "fixed" };
    }

    $scope.held = false;

    $scope.controlTree = [
        [ //{cmd: "step-backward"},
            {cmd: "backward", held: true},
            {cmd: "stop"},
            {cmd: "pause"},
            {cmd: "forward", held: true}
            //{cmd: "step-forward"}
        ],
        [{cmd: "volume-down", held: true}, 
         {cmd: "volume-off"}, 
         {cmd: "volume-up", held: true}]
    ]

    $scope.command = function (cmd) { 
        util.post($http, "/command", {"command": cmd})
            .success(function (data, status, headers, config) {
                $scope.data = data;
                if (cmd == "pause") $scope.controlTree[0][2] = {cmd: "play"}
                else if (cmd == "play") $scope.controlTree[0][2] = {cmd: "pause"}
            })
    }

    $scope.hold = function (cmd) {
        $scope.held = setInterval(function() { $scope.command(cmd) }, 200);
    }

    $scope.release = function () { 
        clearInterval($scope.held);
        $scope.held = false;
    }
}

That's all, by the way. You've seen all the code for the Angular version, and the two are functionally identical from the user's point of view.

Unlike in the jQuery solution, there's no DOM manipulation here. We've got a model called controlTree which contains the same specification of controls that the earlier version did, but this time, the actual construction of the relevant templates is taken care of by the framework. We just specify the relationship between that model and the front-end in the form of the HTML code above, and Angular updates the DOM automatically. The clearest demonstration of that is these lines

if (cmd == "pause") $scope.controlTree[0][2] = {cmd: "play"}
else if (cmd == "play") $scope.controlTree[0][2] = {cmd: "pause"}

That's part of sending a command, and all it does is change the contents of our model. The view is updated as soon as this change is made. The equivalent from "Before" is

if (cmd == "pause") {
    var btn = templates.control({cmd: "play"});
    $("#controls .pause").replaceWith(btn);
} else if (cmd == "play") {
    var btn = templates.control({cmd: "pause"});
    $("#controls .play").replaceWith(btn);
}

Where we're back to templating ourselves. You can also see the same principles affecting that code hacking around older versions of Safari; we're just setting up some objects rather than doing DOM traversal ourselves.

if ($.browser.safari) {
  $("#controls").css({ "position": 'absolute' });
  window.onscroll = function() {
    $("#controls").css({ 
        "top" : window.pageYOffset + 'px'
    });
  };
} else {
  $("#controls").css({ "position": 'fixed' });    
}

vs

if (util.browser().agent == 'safari') {
  window.onscroll = function() { 
    $scope.style = { position: "absolute", top : window.pageYOffset + 'px' };
  };
} else {
  $scope.style = { position: "fixed" };
}

The effect is the same, but the particulars of updating and rendering are kept comfortably away from us.

As I said, the above example was picked to clearly illustrate the differences between approaches, not necessarily because it's the biggest gain in clarity I've gotten out of porting over[3]. I'm sure a headache or two will pop up down the line, but I submit that this is a fundamentally more humane way to craft responsive web front-ends than the alternatives.

And I'll be using it where I can from now on.


Footnotes

1 - [back] - (re-ordering complex elements is really the only one I've observed; stuff that's too complex to do like this, but where you still need to pass the current order of some set of UI elements back to the server for persistence. As I said already, angular-ui does it just fine for simple constructs, but for anything more complicated, the Angular solution is ~30 lines of sub-module, where the DOM-traversal solution is a mere 5)

2 - [back] - (the $http.post function doesn't do the jQuery thing of encoding an object as POST parameters. The default behavior is to dump the parameter object to a JSON string and pass that to the server as a post body. I could actually see that being the easier approach if you had perfect control of the server, since that would let you do some not-exactly-HTTP processing on the incoming structure. If you're using a pre-built one, though, you're probably stuck doing something manual and annoying like this

...
post: function ($http, url, data) {
  var encoded = _.map(data, function (val, k) { return encodeURIComponent(k) + "=" + encodeURIComponent(val); });
  $http.defaults.headers.post["Content-Type"] = "application/x-www-form-urlencoded";
  return $http.post(url, encoded.join("&"));
},
...
or (if you're concerned about going with the grain of the framework) this
myModule.config(function ($httpProvider) {
  $httpProvider.defaults.headers.post['Content-Type'] = 'application/x-www-form-urlencoded';
  $httpProvider.defaults.transformRequest = function(data){
    return _.map(data, function (val, k) { return encodeURIComponent(k) + "=" + encodeURIComponent(val); }).join("&");
  }
});
Not too ugly once you throw in the usual pinch of underscore, but this is the sort of thing that really seems like it should be built in as a default behavior. Unless the Angular devs really think some large portion of their users are going to build their own servers to work the other way)

3 - [back] - (in fact, this is probably the least clarity I've gained by moving over to the reactive approach. As I said earlier, the line-count is usually halved without breaking a sweat)

Friday, December 14, 2012

Not Optimizing Haskell

The flu can go fuck itself in its nonexistent, viral ass. This shit will not beat me. While I run down the clock, I'm profiling more things to make me feel a bit better.

First off, neither GHCi nor Haskell mode comes with an interactive profiler. Or, as far as I can tell, any utilities to make batch profiling any easier. The way you profile Haskell programs is by installing the profiling extensions

apt-get install libghc-mtl-dev libghc-mtl-prof

compiling your program with the profiling flags on

ghc -prof -auto-all -o outFile yourFile.hs

and then running the result with some different profiling flags.

./outFile +RTS -p

That should create a file called outFile.prof in the directory you just ran it from, and that file will contain a well-formatted couple of tables that will tell you where your space and time cost-centers are.

So... let's automate this.

(defun ha-custom-profile-buffer ()
  (interactive)
  (find-file-other-window 
   (ha-custom-profile-haskell-file (buffer-file-name))))

(defun ha-custom-profile-haskell-file (abs-filename)
  "Compiles the given file with profiling, 
runs it with the +RTS -p flags and returns
the filename of the profiling output."
  (assert (string= "hs" (file-name-extension abs-filename)))
  (let* ((f-name (file-name-sans-extension abs-filename))
         (tmp (make-temp-file f-name))
         (tmp-name (file-name-nondirectory tmp))
         (tmp-dir (file-name-directory tmp)))
    (message "Compiling...")
    (shell-command (format "ghc -prof -auto-all -o %s '%s'" tmp abs-filename))
    (message "Profiling...")
    (shell-command (format "%s./%s +RTS -p" tmp-dir tmp-name))
    (concat tmp-name ".prof")))

Those functions are both now part of my ha-custom mode. The big one takes a Haskell file, compiles it to a tempfile with the appropriate flags, runs the result with the other appropriate flags, and returns the name of the profiling output file. The little function takes the current buffer and runs it through the big one, then opens the result in a new window. That should make it a bit easier to actually do the profiling.

Actually Profiling Haskell

We started with pretty much the same thing as the Lisp code. And, I'll strip the printing elements again for the purposes of this exercise; we're not interested in how inefficient it is to actually produce a grid based on our model of the world.

module Main where
import Data.List (group, sort, concatMap)
import Data.Set

lifeStep :: Set (Int, Int) -> Set (Int, Int)
lifeStep cells = fromList [head g | g <- grouped cells, viable g]
  where grouped = group . sort . concatMap neighbors . toList
        neighbors (x, y) = [(x+dx, y+dy) | dx <- [-1..1], dy <- [-1..1], (dx,dy) /= (0,0)]
        viable [_,_,_] = True
        viable [c,_] = c `member` cells
        viable _ = False

runLife :: Int -> Set (Int, Int) -> Set (Int, Int)
runLife steps cells = rec (steps - 1) cells
  where rec 0 cells = cells
        rec s cells = rec (s - 1) $! lifeStep cells

glider = fromList [(1, 0), (2, 1), (0, 2), (1, 2), (2, 2)]
blinker = fromList [(1, 0), (1, 1), (1, 2)]
gosperGliderGun = fromList [(24, 0), (22, 1), (24, 1), (12, 2), (13, 2), (20, 2), (21, 2), (34, 2), (35, 2), (11, 3), (15, 3), (20, 3), (21, 3), (34, 3), (35, 3), (0, 4), (1, 4), (10, 4), (16, 4), (20, 4), (21, 4), (0, 5), (1, 5), (10, 5), (14, 5), (16, 5), (17, 5), (22, 5), (24, 5), (10, 6), (16, 6), (24, 6), (11, 7), (15, 7), (12, 8), (13, 8)]

main :: IO ()
main = putStrLn . show $ runLife 5000 gosperGliderGun

It's not quite the same, actually, because we determine frequencies differently. Instead of doing a single traversal of the corpus, we do what looks like a much more expensive operation, composing group onto sort onto concatMap neighbors. In a book, that would be called "foreshadowing".
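For what it's worth, that grouped pipeline is just frequency counting: every live cell votes for its eight neighbors, and a cell's vote count is its live-neighbor count. Here's the same idea sketched in JavaScript (my own naming; board clipping omitted):

```javascript
// Gridless life step: count neighbor "votes", keep the viable cells.
// Mirrors grouped/neighbors/viable from the Haskell above.
function neighbors(cell) {
  var out = [];
  for (var dx = -1; dx <= 1; dx++)
    for (var dy = -1; dy <= 1; dy++)
      if (dx !== 0 || dy !== 0) out.push([cell[0] + dx, cell[1] + dy]);
  return out;
}

function lifeStep(cells) {
  var alive = {};                      // set of live cells, keyed "x,y"
  cells.forEach(function (c) { alive[c.join(",")] = true; });
  var counts = {};                     // neighbor-vote frequencies
  cells.forEach(function (c) {
    neighbors(c).forEach(function (n) {
      var k = n.join(",");
      counts[k] = (counts[k] || 0) + 1;
    });
  });
  // viable: exactly 3 votes (birth), or 2 votes on an already-live cell
  return Object.keys(counts).filter(function (k) {
    return counts[k] === 3 || (counts[k] === 2 && alive[k]);
  });
}

var blinker = [[1, 0], [1, 1], [1, 2]];
lifeStep(blinker);  // flips the vertical blinker to its horizontal phase
```

Note that only live cells and their neighbors ever get touched, which is the whole appeal of the gridless representation.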

A first run-through of M-x ha-custom-profile-buffer gives us

        Fri Dec 14 21:48 2012 Time and Allocation Profiling Report  (Final)

           life21765U60 +RTS -p -RTS

        total time  =       30.15 secs   (30153 ticks @ 1000 us, 1 processor)
        total alloc = 24,382,856,840 bytes  (excludes profiling overheads)

COST CENTRE        MODULE    %time %alloc

lifeStep.grouped   Main       57.4   53.6
lifeStep.neighbors Main       24.7   40.9
lifeStep           Main       11.4    5.5
lifeStep.viable    Main        6.5    0.0


                                                                    individual     inherited
COST CENTRE               MODULE                  no.     entries  %time %alloc   %time %alloc

MAIN                      MAIN                     46           0    0.0    0.0   100.0  100.0
 CAF                      Main                     91           0    0.0    0.0   100.0  100.0
  gosperGliderGun         Main                     97           1    0.0    0.0     0.0    0.0
  main                    Main                     92           1    0.0    0.0   100.0  100.0
   runLife                Main                     93           1    0.0    0.0   100.0  100.0
    runLife.rec           Main                     94        5000    0.0    0.0   100.0  100.0
     lifeStep             Main                     95        4999   11.4    5.5   100.0  100.0
      lifeStep.viable     Main                     99    10002308    6.5    0.0     6.5    0.0
      lifeStep.grouped    Main                     96        4999   57.4   53.6    82.1   94.5
       lifeStep.neighbors Main                     98     2314620   24.7   40.9    24.7   40.9
 CAF                      Data.Set                 90           0    0.0    0.0     0.0    0.0
 CAF                      GHC.Conc.Signal          87           0    0.0    0.0     0.0    0.0
 CAF                      GHC.IO.Handle.FD         80           0    0.0    0.0     0.0    0.0
 CAF                      GHC.IO.Encoding          74           0    0.0    0.0     0.0    0.0
 CAF                      GHC.IO.Encoding.Iconv    62           0    0.0    0.0     0.0    0.0

We're actually only interested in that first small table, so I'll omit the exhaustive one from now on. Basically, yes: grouped and neighbors are the resource hogs here. Even so, this compares favorably with the Common Lisp infinite-plane version, both in terms of program complexity and in terms of runtime. Not to mention that the initial CL version actually crashed at ~3000 iterations because it doesn't like tail recursion.

Anyhow, the first thing we're doing this time is limiting the size of the world.

inRange :: Ord a => a -> a -> a -> Bool
inRange low n high = low < n && n < high

lifeStep :: Int -> Set (Int, Int) -> Set (Int, Int)
lifeStep worldSize cells = fromList [head g | g <- grouped cells, viable g]
  where grouped = group . sort . concatMap neighbors . toList
        neighbors (x, y) = [(x+dx, y+dy) | dx <- [-1..1], dy <- [-1..1], 
                            (dx,dy) /= (0,0), inSize (dx+x) (dy+y)]
        inSize x y = inR x worldSize && inR y worldSize
        inR = inRange 0
        viable [_,_,_] = True
        viable [c,_] = c `member` cells
        viable _ = False

runLife :: Int -> Int -> Set (Int, Int) -> Set (Int, Int)
runLife worldSize steps cells = rec (steps - 1) cells
  where rec 0 cells = cells
        rec s cells = rec (s - 1) $! lifeStep worldSize cells

main :: IO ()
main = putStrLn . show $ runLife 50 5000 gosperGliderGun

That's gonna do the same thing it did yesterday: prevent massive, processor-fucking overpopulation.

        Fri Dec 14 22:03 2012 Time and Allocation Profiling Report  (Final)

           life21765GEE +RTS -p -RTS

        total time  =        1.61 secs   (1608 ticks @ 1000 us, 1 processor)
        total alloc = 1,132,473,192 bytes  (excludes profiling overheads)

COST CENTRE        MODULE  %time %alloc

lifeStep.grouped   Main     46.5   41.2
lifeStep.neighbors Main     23.4   37.8
inRange            Main     11.1   11.9
lifeStep           Main      6.2    3.0
lifeStep.viable    Main      6.1    0.0
lifeStep.inSize    Main      3.6    6.0
lifeStep.inR       Main      2.9    0.0

Granted, inRange is now on the map as a cost center, but this shaved ~28 seconds off the final run time, so I'm gonna call that fair enough. Given the numbers we were posting yesterday, I'm almost tempted to call this good enough. Let's see where it all goes, shall we? A step count of

50

        Fri Dec 14 22:06 2012 Time and Allocation Profiling Report  (Final)

           life21765TOK +RTS -p -RTS

        total time  =        0.03 secs   (29 ticks @ 1000 us, 1 processor)
        total alloc =  18,129,192 bytes  (excludes profiling overheads)

COST CENTRE        MODULE  %time %alloc

lifeStep.grouped   Main     55.2   42.0
lifeStep.neighbors Main     17.2   36.9
inRange            Main     13.8   11.7
main               Main      3.4    0.1
lifeStep           Main      3.4    3.2
lifeStep.inSize    Main      3.4    5.8
lifeStep.inR       Main      3.4    0.0

We've seen 5000 already, so

50 000

        Fri Dec 14 22:07 2012 Time and Allocation Profiling Report  (Final)

           life21765gYQ +RTS -p -RTS

        total time  =       15.94 secs   (15942 ticks @ 1000 us, 1 processor)
        total alloc = 11,262,873,192 bytes  (excludes profiling overheads)

COST CENTRE        MODULE    %time %alloc

lifeStep.grouped   Main       45.3   41.2
lifeStep.neighbors Main       23.0   37.8
inRange            Main       12.7   11.9
lifeStep           Main        6.6    3.0
lifeStep.viable    Main        5.9    0.0
lifeStep.inSize    Main        3.8    6.0
lifeStep.inR       Main        2.4    0.0

5 000 000

        Fri Dec 14 22:37 2012 Time and Allocation Profiling Report  (Final)

           big +RTS -p -RTS

        total time  =     1594.43 secs   (1594429 ticks @ 1000 us, 1 processor)
        total alloc = 1,125,606,873,896 bytes  (excludes profiling overheads)

COST CENTRE        MODULE    %time %alloc

lifeStep.grouped   Main       45.4   41.2
lifeStep.neighbors Main       23.6   37.8
inRange            Main       12.5   11.9
lifeStep           Main        6.2    3.0
lifeStep.viable    Main        5.8    0.0
lifeStep.inSize    Main        3.6    6.0
lifeStep.inR       Main        2.6    0.0

It's funny: after just clipping the board, we start getting much better numbers with unoptimized Haskell than we saw with unoptimized Common Lisp. That's not really much of a victory, since optimized Lisp was handily beating the numbers we're putting down today, but it's also not the showdown I want to see. I want to know how optimized Haskell stacks up, and I want to know how Gridless Life stacks up against a gridded implementation. Back to Rosetta Code, I guess. Second verse, same as the first: I added a grid-appropriate gun[1] and stripped all but the final printing code.

import Data.Array.Unboxed
import Data.List (unfoldr) 

type Grid = UArray (Int,Int) Bool
 -- The grid is indexed by (y, x).
 
life :: Int -> Int -> Grid -> Grid
{- Returns the given Grid advanced by one generation. -}
life w h old =
    listArray b (map f (range b))
  where b@((y1,x1),(y2,x2)) = bounds old
        f (y, x) = ( c && (n == 2 || n == 3) ) || ( not c && n == 3 )
          where c = get x y
                n = count [get (x + x') (y + y') |
                    x' <- [-1, 0, 1], y' <- [-1, 0, 1],
                    not (x' == 0 && y' == 0)]
 
        get x y | x < x1 || x > x2 = False
                | y < y1 || y > y2 = False
                | otherwise       = old ! (y, x)
 
count :: [Bool] -> Int
count = length . filter id

grid :: [String] -> (Int, Int, Grid)
grid l = (width, height, a)
  where (width, height) = (length $ head l, length l)
        a = listArray ((1, 1), (height, width)) $ concatMap f l
        f = map g
        g '.' = False
        g _   = True
 
printGrid :: Int -> Grid -> IO ()
printGrid width = mapM_ f . split width . elems
  where f = putStrLn . map g
        g False = '.'
        g _     = '#'
 
split :: Int -> [a] -> [[a]]
split n = takeWhile (not . null) . unfoldr (Just . splitAt n)

gosperGliderGun = grid
    ["........................#.........................",
     "......................#.#.........................",
     "............##......##............##..............",
     "...........#...#....##............##..............",
     "##........#.....#...##............................",
     "##........#...#.##....#.#.........................",
     "..........#.....#.......#.........................",
     "...........#...#..................................",
     "............##....................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     "..................................................",
     ".................................................."]

printLife :: Int -> (Int, Int, Grid) -> IO ()
printLife n (w, h, g) = printGrid w . last . take n $ iterate (life w h) g
 
main = printLife 50 gosperGliderGun

Ok, lets rev this sucker up.

50

        Fri Dec 14 22:29 2012 Time and Allocation Profiling Report  (Final)

           life-grid21765tiW +RTS -p -RTS

        total time  =        1.32 secs   (1319 ticks @ 1000 us, 1 processor)
        total alloc = 891,555,608 bytes  (excludes profiling overheads)

COST CENTRE MODULE  %time %alloc

life.get    Main     59.9   50.4
life.f.n    Main     30.5   41.7
life.f      Main      3.9    3.0
count       Main      3.5    0.8
life        Main      2.0    3.9

5000

        Fri Dec 14 22:32 2012 Time and Allocation Profiling Report  (Final)

           life-grid217656sc +RTS -p -RTS

        total time  =      133.77 secs   (133771 ticks @ 1000 us, 1 processor)
        total alloc = 90,810,516,640 bytes  (excludes profiling overheads)

COST CENTRE MODULE  %time %alloc

life.get    Main     59.1   50.5
life.f.n    Main     31.3   41.8
count       Main      3.5    0.8
life.f      Main      3.4    3.0
life        Main      2.4    3.9

That's ... almost sad enough not to be funny. Almost. Do note for the record that this is an order of magnitude up from the gridless version with the same inputs. And when you think about what's involved in each traversal of each corpus, it kind of becomes obvious why that is. The grid's corpus traversal always has 2500 stops. The gridless traversal is somewhere between 50 and 100 for a comparably populated board of the same size. There, 2500 is our worst case, and we'll probably never hit it.
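The back-of-envelope for those numbers, in case it helps (illustrative arithmetic only; the population figure is a rough assumption from watching the board):

```javascript
// Per-step corpus sizes for a clipped 50x50 board.
// The grid version visits every cell; the gridless one only visits live cells.
function stopsPerStep(worldSize, population) {
  return {
    grid: worldSize * worldSize,  // fixed cost, however sparse the board is
    gridless: population          // scales with the live population instead
  };
}

var s = stopsPerStep(50, 75);  // ~75 live cells: a mid-range guess for the gun
// s.grid is 2500, s.gridless is 75; call it a ~30x difference per step
```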

I'm not even going to bother profiling the higher steps with this approach if 5000 took two minutes. I do still want to see how low we can go, and how we'd go about it.

The first thought I have is to try out that iterate approach, rather than recurring manually

runLife :: Int -> Int -> Set (Int, Int) -> Set (Int, Int)
runLife worldSize steps cells = last . take steps $ iterate (lifeStep worldSize) cells

main :: IO ()
main = putStrLn . show $ runLife 50 50000 gosperGliderGun

Yes, it's more elegant. But will it blend?

        Fri Dec 14 22:45 2012 Time and Allocation Profiling Report  (Final)

           life21765UId +RTS -p -RTS

        total time  =        0.01 secs   (12 ticks @ 1000 us, 1 processor)
        total alloc =  20,022,728 bytes  (excludes profiling overheads)

COST CENTRE      MODULE  %time %alloc

runLife          Main     41.7   36.0
lifeStep.grouped Main     33.3   52.1
lifeStep         Main     25.0   11.7

Hm.

I'm gonna go ahead and put that one down to a profiler error, especially since running the same program in interactive mode confers no such magical acceleration. This does kind of call the process into question somewhat though...

Oh, well, I'm meant to be exploring. Let's pull the same incremental stuff we did with CL yesterday. Firstly, we're already using Set here, so the member check is already as tight as it's going to get. Our last valid profiler ping told us that lifeStep.grouped is where the big costs are paid, so let's see if we can't reduce them somewhat.

import qualified Data.List (filter)
import qualified Data.Map as Map 
  (Map, lookup, insert, adjust, delete, fromList, toList)

frequencies :: [(Int, Int)] -> Map.Map (Int, Int) Int
frequencies list = rec list $ Map.fromList []
  where inc = Map.adjust (+1)
        rec [] m = m
        rec (cell:rest) m = rec rest newM
          where newM = if Nothing == Map.lookup cell m
                       then Map.insert cell 1 m
                       else inc cell m

lifeStep :: Int -> Set (Int, Int) -> Set (Int, Int)
lifeStep worldSize cells = fromList [fst g | g <- grouped cells, viable g]
  where grouped = Data.List.filter viable . Map.toList . frequencies . concatMap neighbors . toList
        neighbors (x, y) = [(x+dx, y+dy) | dx <- [-1..1], dy <- [-1..1], 
                            (dx,dy) /= (0,0), inSize (dx+x) (dy+y)]
        inSize x y = inR x worldSize && inR y worldSize
        inR = inRange 0
        viable (_,3) = True
        viable (c,2) = c `member` cells
        viable _ = False

We've added a Map of frequencies, rather than doing the naive group . sort thing. We've also had to tweak viable just a bit to accommodate.
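In Python terms (illustrative only, using the standard library's Counter rather than anything from the Haskell above), the single-pass counting idea looks like this:

```python
from collections import Counter

# One pass over the neighbor list, accumulating counts in a hash-backed
# map, instead of sorting the list and grouping equal runs.
neighbor_list = [(1, 2), (2, 2), (1, 2), (3, 4), (1, 2)]
freqs = Counter(neighbor_list)
print(freqs[(1, 2)])  # 3
print(freqs[(3, 4)])  # 1
```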

        Fri Dec 14 22:54 2012 Time and Allocation Profiling Report  (Final)

           life21765ucp +RTS -p -RTS

        total time  =        2.41 secs   (2406 ticks @ 1000 us, 1 processor)
        total alloc = 1,216,439,760 bytes  (excludes profiling overheads)

COST CENTRE          MODULE    %time %alloc

frequencies.rec.newM Main       41.2   17.4
lifeStep.neighbors   Main       16.6   35.2
frequencies.inc      Main       12.4   12.2
lifeStep.viable      Main        9.4    4.7
inRange              Main        8.1   11.1
lifeStep.grouped     Main        3.4    8.2
lifeStep             Main        3.4    2.8
lifeStep.inSize      Main        2.3    5.6
lifeStep.inR         Main        1.4    0.0
frequencies.rec      Main        1.3    2.8

That's ... hm. Actually an increase of about a second. Maybe it does comparatively better on bigger data-sets?

main :: IO ()
main = putStrLn . show $ runLife 50 50000 gosperGliderGun

        Fri Dec 14 22:57 2012 Time and Allocation Profiling Report  (Final)

           life217657mv +RTS -p -RTS

        total time  =       23.96 secs   (23961 ticks @ 1000 us, 1 processor)
        total alloc = 12,100,319,760 bytes  (excludes profiling overheads)

COST CENTRE          MODULE    %time %alloc

frequencies.rec.newM Main       39.7   17.4
lifeStep.neighbors   Main       16.0   35.2
frequencies.inc      Main       13.3   12.2
lifeStep.viable      Main        9.5    4.7
inRange              Main        8.6   11.1
lifeStep.grouped     Main        3.6    8.2
lifeStep             Main        3.4    2.8
lifeStep.inSize      Main        2.6    5.6
lifeStep.inR         Main        1.8    0.0
frequencies.rec      Main        1.4    2.8

Nope. It actually does comparatively worse.

Hmmm.

I'm going to cut it here for now. I think I've done enough damage. I won't be putting the latest up[2] for obvious reasons. Yes, I peeked ahead, which is why I knew this particular optimization wouldn't work in Haskell early enough to foreshadow it, but I still wanted to formalize my thoughts about it.

It's hard not to learn something from playing with a language's profiler. This experience tells me that I might have the wrong model in my head, or it might be that predicting where a traversal will happen is a lot more difficult in lazy languages, or, as I suspect from the latest profiler readouts, it might be that a Haskell Map's lookup isn't constant time. The reason I suspect this is that some of our biggest cost centers are now frequencies.rec.newM (which does a Map.lookup each call) and frequencies.inc (which manipulates a particular element of a Map, so I assume a lookup is part of it).

I'm off to read up on Haskell data structures and test these hypotheses.

Oh. And heal up.


Footnotes

1 - [back] - (by the way, this makes clear that whatever the performance comparisons come down to, the gridless version has a more elegant notation)

2 - [back] - (though the limited-size version and the gridded competitor will be checked in)

Thursday, December 13, 2012

Life Extreme - Optimizing for Time in Common Lisp

I'm sick today.

I fucking hate being sick.

So, to make myself feel better, I'm profiling things. Specifically, the Common Lisp version of Life I wrote last time. I'll be using Emacs and SLIME, but I'm pretty sure you can do at least some of this using time in whatever REPL you've got lying around.

(defpackage :life (:use :cl))
(in-package :life)

(defun moore-neighborhood (cell)
  (let ((r '(-1 0 1)))
    (mapcan
         (lambda (delta-x)
           (loop for delta-y in r
              unless (and (= delta-x 0) (= delta-y 0))
              collect (cons (+ (car cell) delta-x) (+ (cdr cell) delta-y))))
         r)))

(defun frequencies (cells)
  (let ((h (make-hash-table :test #'equal)))
    (loop for c in cells
       do (incf (gethash c h 0)))
    h))

(defun life-step (cells)
  (let ((f (frequencies (mapcan #'moore-neighborhood cells))))
    (loop for k being the hash-keys in f
       when (or 
             (= (gethash k f) 3) 
             (and (= (gethash k f) 2) (member k cells :test #'equal)))
         collect k)))

(defun print-world (live-cells &optional (world-size 10))
  (dotimes (y world-size)
    (dotimes (x world-size)
      (if (member (cons x y) live-cells :test #'equal)
          (format t "X")
          (format t ".")))
    (format t "~%")))

(defun run-life (world-size steps cells)
  (when (< 0 steps)
    (run-life world-size (- steps 1) (life-step cells))))

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; data related
(defun .cells->list (filename)
  (with-open-file (stream filename)
    (apply #'append
           (loop with y = 0
              for line = (read-line stream nil 'eof) until (eq line 'eof)
              unless (char= (aref line 0) #\!)
              collect (let ((line (loop for x from 0
                                     for char being the elements of line
                                     when (char= char #\O) collect (cons x y)))) 
                        (incf y)
                        line)))))

(defparameter *blinker* '((1 . 2) (2 . 2) (3 . 2)))
(defparameter *glider* '((1 . 0) (2 . 1) (0 . 2) (1 . 2) (2 . 2)))
(defparameter *gosper-glider-gun* 
  '((24 . 0) (22 . 1) (24 . 1) (12 . 2) (13 . 2) (20 . 2) (21 . 2) (34 . 2)
    (35 . 2) (11 . 3) (15 . 3) (20 . 3) (21 . 3) (34 . 3) (35 . 3) (0 . 4) (1 . 4)
    (10 . 4) (16 . 4) (20 . 4) (21 . 4) (0 . 5) (1 . 5) (10 . 5) (14 . 5) (16 . 5)
    (17 . 5) (22 . 5) (24 . 5) (10 . 6) (16 . 6) (24 . 6) (11 . 7) (15 . 7)
    (12 . 8) (13 . 8)))

Gosper's gun is the simplest emitter I could find, and I need to test that sort of thing to convince myself of the performance of this abstract machine. The .cells->list function exists purely to convert files like this into inputs suitable for our peculiar model of the Life world. You'll also notice that I stripped all printing code from run-life; I'm not interested in how inefficient the conversion between sparse-array and grid is, and I imagine that it would have been the main cost-center had I kept it. Let's hop into the REPL.

CL-USER> (load "life.lisp")
T
CL-USER> (in-package :life)
#<PACKAGE "LIFE">

Remember to turn on profiling with M-x slime-profile-package life, and answer yes to the options it asks about.

LIFE> (run-life 50 5000 *gosper-glider-gun*)
Control stack guard page temporarily disabled: proceed with caution
Control stack guard page temporarily disabled: proceed with caution
Control stack guard page temporarily disabled: proceed with caution
Control stack guard page temporarily disabled: proceed with caution
Control stack guard page temporarily disabled: proceed with caution
Control stack guard page temporarily disabled: proceed with caution
Control stack guard page temporarily disabled: proceed with caution
Control stack guard page temporarily disabled: proceed with caution
Control stack guard page temporarily disabled: proceed with caution
Control stack guard page temporarily disabled: proceed with caution
Control stack guard page temporarily disabled: proceed with caution
Control stack guard page temporarily disabled: proceed with caution
; Evaluation aborted on #<SB-KERNEL::CONTROL-STACK-EXHAUSTED {100F0230D3}>.

Ok, I guess that's not entirely unexpected. After all, run-life is still recursive, and Common Lisp doesn't guarantee tail-call optimization. Still, we probably got some pretty decent data, even from a failed attempt. M-x slime-profile-report says

  seconds  |     gc     |    consed   |   calls   |  sec/call  |  name  
-------------------------------------------------------------
    11.820 |      0.000 |     590,592 |     3,322 |   0.003558 | LIFE::LIFE-STEP
     4.082 |      1.796 | 534,176,080 |     3,322 |   0.001229 | LIFE::FREQUENCIES
     0.887 |      0.428 | 378,046,784 | 1,073,904 |   0.000001 | LIFE::MOORE-NEIGHBORHOOD
     0.000 |      0.000 |  12,360,624 |     3,322 |   0.000000 | LIFE::RUN-LIFE
-------------------------------------------------------------
    16.790 |      2.224 | 925,174,080 | 1,083,870 |            | Total

estimated total profiling overhead: 1.99 seconds
overhead estimation parameters:
  8.000001e-9s/call, 1.84e-6s total profiling, 8.24e-7s internal profiling

frequencies and life-step are obviously the culprits here, and since we now know what the cost-centers are, we can mitigate them. Discounting micro-optimization[1], there are essentially three ways to optimize a piece of code for time[2]:

  1. reduce the number of traversals of your corpus
  2. reduce the time taken per traversal
  3. eliminate sequential data dependencies and do more traversals at once through parallelism

We won't be doing the third because the Game of Life problem doesn't inherently lend itself to it; you need to compute step N before you can compute step N+1, and that can't really be helped. We might be able to take advantage of parallelism in a couple of places during each step, but that tends to have its own costs associated and typically doesn't pay off except on very large data sets.

There are a bunch of ways to do one and two. We can re-write pieces of our code as tight loops, trading a bit of readability for fewer traversals. We can change the representation of our corpus to something more easily searchable, or we can be more aggressive up-front about throwing out elements we know we won't need later. We'll probably end up doing all of that.

But first...

(defun run-life (world-size steps cells)
  (declare (ignore world-size))
  (let ((world (copy-list cells)))
    (loop 
       repeat steps
       do (setf world (life-step world)))
    world))

That should get around our snippy little stack warning.

LIFE> (run-life 50 5000 *gosper-glider-gun*)

now returns with a list of 884 living cells. Which makes perfect sense, since this is a generator we're testing. The profiler says "moo".

  seconds  |     gc     |     consed    |   calls   |  sec/call  |  name  
---------------------------------------------------------------
    37.257 |      0.000 |       622,592 |     5,000 |   0.007451 | LIFE::LIFE-STEP
     5.156 |      0.588 | 1,058,382,096 |     5,000 |   0.001031 | LIFE::FREQUENCIES
     1.692 |      0.564 |   821,550,624 | 2,315,504 |   0.000001 | LIFE::MOORE-NEIGHBORHOOD
     0.000 |      0.000 |             0 |         1 |   0.000000 | LIFE::RUN-LIFE
---------------------------------------------------------------
    44.105 |      1.152 | 1,880,555,312 | 2,325,505 |            | Total

estimated total profiling overhead: 4.28 seconds
overhead estimation parameters:
  8.000001e-9s/call, 1.84e-6s total profiling, 8.24e-7s internal profiling

So frequencies takes up a fuckton of conses, and the second-most execution time, right behind life-step. This preliminary survey probably wasn't worth doing on a program this size; just looking at our defined functions would probably have told you who the culprits are and aren't.

(defun life-step (cells)
  (let ((f (frequencies (mapcan #'moore-neighborhood cells))))
    (loop for k being the hash-keys in f
       when (or 
             (= (gethash k f) 3) 
             (and (= (gethash k f) 2) (member k cells :test #'equal)))
         collect k)))

First off, (mapcan #'moore-neighborhood cells) is one traversal of the input. Not much we can do about that; we need to do it at least once. Calling frequencies on the result is a second traversal, and we can probably tweak our code so that those two happen at once. The subsequent loop is another traversal of (* ~8 cells). We do need to traverse f, but it's currently longer than it needs to be, since it's a hash-table containing every cell in any living cell's Moore neighborhood. Fixing that would mean tweaking frequencies so that it automatically threw out cells with fewer than two or more than three neighbors, since those couldn't possibly be alive next time.

Finally, it might not be entirely obvious, but member is a linked-list operation that traverses its list argument each time it's called. I put it in the tail end of an and, which means it should only be getting called for cells with two neighbors, but each time it does get called, it traverses some part of cells; all of it, if its argument wasn't alive last time. We'll fix that by using a data type with a more efficient membership check than a linked list.
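The member cost in particular is easy to see in miniature. Here's the same distinction in Python, with a list standing in for CL's linked list and a set for the hash-backed structure we'll switch to later:

```python
# Membership in a list is a linear scan; membership in a set is a hash
# lookup. Same answer, very different amount of traversal.
cells = [(x, 0) for x in range(1000)]
cell_set = set(cells)

print((999, 0) in cells)     # True, found after scanning ~1000 entries
print((999, 0) in cell_set)  # True, found after a hash probe
```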

Ok, firstly, do it in one traversal:

(defun compute-frequencies (cells)
  (let ((h (make-hash-table :test #'equal)))
    (loop for a-cell in cells
       do (loop for c in (moore-neighborhood a-cell)
             do (incf (gethash c h 0))))
    h))

Oh, by the by, I have to apologize for the poor frequencies implementation last time. It turns out that Common Lisp has something like Python's defaultdict built in; gethash takes an optional third argument which it returns as the default value when the key is missing. Which is nice, because (incf (gethash [key] [table] 0)) will do exactly what you think it should. Now then, one traversal eliminated; let's hook the new thing into life-step:
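For reference, the Python analogue being alluded to (a sketch, not the CL above):

```python
from collections import defaultdict

# defaultdict(int) makes missing keys read as 0, so a bare increment does
# the right thing -- just like (incf (gethash key table 0)).
counts = defaultdict(int)
for cell in [(1, 2), (1, 2), (3, 4)]:
    counts[cell] += 1
print(counts[(1, 2)])  # 2
print(counts[(3, 4)])  # 1
```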

(defun life-step (cells)
  (let ((f (compute-frequencies cells)))
    (loop for k being the hash-keys in f
       when (or 
             (= (gethash k f) 3) 
             (and (= (gethash k f) 2) (member k cells :test #'equal)))
         collect k)))

How did we do?

  seconds  |     gc     |     consed    |   calls   |  sec/call  |  name  
---------------------------------------------------------------
    33.800 |      0.012 |    42,812,560 |     5,000 |   0.006760 | LIFE::LIFE-STEP
     4.404 |      0.172 | 1,098,791,040 |     5,000 |   0.000881 | LIFE::COMPUTE-FREQUENCIES
     1.276 |      0.072 |   738,782,288 | 2,315,504 |   0.000001 | LIFE::MOORE-NEIGHBORHOOD
     0.000 |      0.000 |             0 |         1 |   0.000000 | LIFE::RUN-LIFE
---------------------------------------------------------------
    39.481 |      0.256 | 1,880,385,888 | 2,325,505 |            | Total

estimated total profiling overhead: 4.28 seconds
overhead estimation parameters:
  8.000001e-9s/call, 1.84e-6s total profiling, 8.24e-7s internal profiling

Not bad, actually. compute-frequencies conses more, but both it and life-step got cheaper, for a ~5 second drop in computing time on a 5000 step Gosper gun. Not too shabby for five minutes' worth of work. Next up, let's try to ignore irrelevant cells. That means not adding them to the result hash unless they've got at least two neighbors, and it means knocking them out if they have more than three. In other words, we'll be wanting more hash-table.

(defun compute-frequencies (cells)
  (let ((lonely (make-hash-table :test #'equal)) 
        (h (make-hash-table :test #'equal)))
    (loop for a-cell in cells
       do (loop for c in (moore-neighborhood a-cell)
             do (let ((res (incf (gethash c lonely 0))))
                  (cond 
                    ((or (= res 2) (= res 3)) (setf (gethash c h) res))
                    ((= res 4) (remhash c h))))))
    h))

My memory is gonna cry, but the processor will have a slightly easier time of this, because it will only need to deal with cells that have a decent shot of being alive in the next iteration.
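If the two-table dance is hard to follow in loop syntax, here's my Python translation of it (a sketch, not the post's code): counts sees every candidate, while viable only ever holds cells whose neighbor count currently sits in the live window.

```python
from collections import defaultdict

def moore_neighborhood(cell):
    # All 8 neighbors of a cell (Python rendering of the CL function).
    x, y = cell
    return [(x + dx, y + dy)
            for dx in (-1, 0, 1) for dy in (-1, 0, 1)
            if (dx, dy) != (0, 0)]

def compute_frequencies(cells):
    counts = defaultdict(int)  # plays the role of `lonely`
    viable = {}                # plays the role of `h`
    for cell in cells:
        for c in moore_neighborhood(cell):
            counts[c] += 1
            if counts[c] in (2, 3):
                viable[c] = counts[c]  # in the live window: keep it
            elif counts[c] == 4:
                viable.pop(c, None)    # overcrowded: knock it out
    return viable

blinker = [(1, 2), (2, 2), (3, 2)]
freqs = compute_frequencies(blinker)
print(freqs[(2, 1)])    # 3 -- a cell about to be born
print((1, 2) in freqs)  # False -- only one neighbor, so never recorded
```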

  seconds  |     gc     |     consed    |   calls   |  sec/call  |  name  
---------------------------------------------------------------
    32.981 |      0.000 |    37,683,200 |     5,000 |   0.006596 | LIFE::LIFE-STEP
     5.931 |      0.288 | 1,718,346,016 |     5,000 |   0.001186 | LIFE::COMPUTE-FREQUENCIES
     1.220 |      0.080 |   638,848,352 | 2,315,504 |   0.000001 | LIFE::MOORE-NEIGHBORHOOD
     0.000 |      0.000 |             0 |         1 |   0.000000 | LIFE::RUN-LIFE
---------------------------------------------------------------
    40.133 |      0.368 | 2,394,877,568 | 2,325,505 |            | Total

estimated total profiling overhead: 4.28 seconds
overhead estimation parameters:
  8.000001e-9s/call, 1.84e-6s total profiling, 8.24e-7s internal profiling

Hm. Very slightly easier, it turns out. All the time we buy in reducing the number of cells we need to traverse seems to get eaten by the more complex check. I'm not entirely sure it was worth it, but let's keep that optimization where it is for now. We've got one trick left up our sleeves, and it's changing the representation of cells. At the moment, it's represented by that Lisp mainstay, the Linked List. In Clojure, we'd use a set, but we don't have ready access to those here. So, we'll need to use the next best thing: a data structure with quick insertion and constant-time lookup.

(defun compute-frequencies (cells)
  (let ((lonely (make-hash-table :test #'equal)) 
        (h (make-hash-table :test #'equal)))
    (loop for a-cell being the hash-keys of cells
       do (loop for c in (moore-neighborhood a-cell)
             do (let ((res (incf (gethash c lonely 0))))
                  (cond 
                    ((or (= res 2) (= res 3)) (setf (gethash c h) res))
                    ((= res 4) (remhash c h))))))
    h))

(defun cells->hash (cells)
  (let ((h (make-hash-table :test #'equal :size 800)))
    (loop for c in cells
         do (setf (gethash c h) 0))
    h))

(defun life-step (cells)
  (let ((f (compute-frequencies cells)))
    (loop for k being the hash-keys in f
       when (and (= (gethash k f) 2) (not (gethash k cells)))
       do (remhash k f))
    f))

(defun run-life (world-size steps cells)
  (declare (ignore world-size))
  (let ((world (cells->hash cells)))
    (loop 
       repeat steps
       do (setf world (life-step world)))
    world))

Subtle changes happen to each of those functions to support the overarching change, which is that we're using hash-tables everywhere now. Because member has to traverse the entire list of cells while gethash is constant time, this should knock the shit out of our performance problems.
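For a compact view of where we've ended up, here's the whole step in Python (my translation, not the post's code; a set with constant-time membership plays the part of the cells hash-table):

```python
from collections import Counter

def life_step(cells):
    # Count how many live neighbors each candidate cell has, then keep
    # births (exactly 3) and survivors (exactly 2, and already alive).
    counts = Counter(
        (x + dx, y + dy)
        for (x, y) in cells
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    return {c for c, n in counts.items()
            if n == 3 or (n == 2 and c in cells)}

blinker = {(1, 2), (2, 2), (3, 2)}
print(sorted(life_step(blinker)))  # [(2, 1), (2, 2), (2, 3)]
```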

  seconds  |     gc     |     consed    |   calls   |  sec/call  |  name  
---------------------------------------------------------------
     5.871 |      0.268 | 1,716,630,608 |     5,000 |   0.001174 | LIFE::COMPUTE-FREQUENCIES
     1.252 |      0.108 |   641,132,304 | 2,315,504 |   0.000001 | LIFE::MOORE-NEIGHBORHOOD
     0.000 |      0.000 |             0 |     5,000 |   0.000000 | LIFE::LIFE-STEP
     0.000 |      0.000 |             0 |         1 |   0.000000 | LIFE::RUN-LIFE
     0.000 |      0.000 |        30,080 |         1 |   0.000000 | LIFE::CELLS->HASH
---------------------------------------------------------------
     7.123 |      0.376 | 2,357,792,992 | 2,325,506 |            | Total

estimated total profiling overhead: 4.28 seconds
overhead estimation parameters:
  8.000001e-9s/call, 1.84e-6s total profiling, 8.24e-7s internal profiling

Boom. Headshot.

Granted, we're still consing like crazy, but removing that member check has pushed life-step down so low that it actually takes up significantly fewer resources than friggin' moore-neighborhood. We've cut our total running time from ~40 seconds to under 10. In fact, let's crank this fucker up to eleven.

LIFE> (run-life 50 50000 *gosper-glider-gun*)

  seconds  |     gc     |      consed     |    calls    |  sec/call  |  name  
-------------------------------------------------------------------
   371.779 |     23.251 |  96,030,964,864 |      40,182 |   0.009252 | LIFE::COMPUTE-FREQUENCIES
    74.690 |      8.977 |  44,912,583,584 | 136,410,636 |   0.000001 | LIFE::MOORE-NEIGHBORHOOD
     0.000 |      0.000 |          77,920 |      40,182 |   0.000000 | LIFE::LIFE-STEP
     0.000 |      0.000 |               0 |           1 |   0.000000 | LIFE::RUN-LIFE
     0.000 |      0.000 |          21,840 |           1 |   0.000000 | LIFE::CELLS->HASH
-------------------------------------------------------------------
   446.468 |     32.228 | 140,943,648,208 | 136,491,002 |            | Total

estimated total profiling overhead: 251.14 seconds
overhead estimation parameters:
  8.000001e-9s/call, 1.84e-6s total profiling, 8.24e-7s internal profiling

Aaaaand I got bored. Yeah, that took a while. What were you expecting? We get plenty of new cells each iteration, and we don't actually throw any away unless they die naturally. Which doesn't happen often when you're dealing with a generator. That's the last "optimization" we can make; instead of (declare (ignore ...))-ing the world-size, we can use it to forget cells that lie outside of our target area. It won't help all patterns, but the *gosper-glider-gun* won't create a Malthusian disaster for our computing resources.

(defun run-life (world-size steps cells)
  (let ((world (cells->hash cells)))
    (loop 
       repeat steps
       do (setf world (life-step world world-size)))
    world))

(defun life-step (cells world-size)
  (let ((f (compute-frequencies cells)))
    (loop for k being the hash-keys in f
       when (or 
             (> (car k) world-size)
             (> (cdr k) world-size)
             (and (= (gethash k f) 2) (not (gethash k cells))))
       do (remhash k f))
    f))

There. Now we just chuck any cells that run off the edge of the world. Provided the world is small enough, that keeps the population from exploding tribble-style.
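In Python terms, the culling amounts to one extra guard on the keep-set (again my translation, matching the CL above in only checking the upper bound):

```python
from collections import Counter

def life_step_bounded(cells, world_size):
    # Same neighbor counting as before, but any candidate past the edge of
    # the world is discarded before it can breed. Like the CL version, only
    # the upper bound is checked.
    counts = Counter(
        (x + dx, y + dy)
        for (x, y) in cells
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    return {(x, y) for (x, y), n in counts.items()
            if x <= world_size and y <= world_size
            and (n == 3 or (n == 2 and (x, y) in cells))}

# A vertical blinker hugging the edge: the cell that would be born at
# x = 51 falls off the world.
edge = {(50, 0), (50, 1), (50, 2)}
print(sorted(life_step_bounded(edge, 50)))  # [(49, 1), (50, 1)]
```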

LIFE> (run-life 50 50000 *gosper-glider-gun*)

  seconds  |     gc     |     consed    |   calls   |  sec/call  |  name  
---------------------------------------------------------------
     8.938 |      0.500 | 2,520,410,944 |    50,000 |   0.000179 | LIFE::COMPUTE-FREQUENCIES
     1.801 |      0.100 | 1,100,839,008 | 3,352,906 |   0.000001 | LIFE::MOORE-NEIGHBORHOOD
     0.000 |      0.000 |             0 |    50,000 |   0.000000 | LIFE::LIFE-STEP
     0.000 |      0.000 |             0 |         1 |   0.000000 | LIFE::RUN-LIFE
     0.000 |      0.000 |        52,144 |         1 |   0.000000 | LIFE::CELLS->HASH
---------------------------------------------------------------
    10.739 |      0.600 | 3,621,302,096 | 3,452,908 |            | Total

estimated total profiling overhead: 6.35 seconds
overhead estimation parameters:
  8.000001e-9s/call, 1.84e-6s total profiling, 8.24e-7s internal profiling

Pretty good, right? All things considered? Before we go, let's take a look at how this approach compares to the traditional grid Life technique. Here's the code, pulled from Rosetta Code, using a two-dimensional array instead of a list of the living. Oh, I've commented out printing of intermediate steps, and included a 50x50 field with the Gosper Gun, just to make sure this is as even as possible. And I also have to reset the starting world for :life-grid each time, since its process is destructive.

(defpackage :life-grid (:use :cl))
(in-package :life-grid)

(defun next-life (array &optional results)
  (let* ((dimensions (array-dimensions array))
         (results (or results (make-array dimensions :element-type 'bit))))
    (destructuring-bind (rows columns) dimensions
      (labels ((entry (row col)
                 "Return array(row,col) for valid (row,col) else 0."
                 (if (or (not (< -1 row rows))
                         (not (< -1 col columns)))
                   0
                   (aref array row col)))
               (neighbor-count (row col &aux (count 0))
                 "Return the sum of the neighbors of (row,col)."
                 (dolist (r (list (1- row) row (1+ row)) count)
                   (dolist (c (list (1- col) col (1+ col)))
                     (unless (and (eql r row) (eql c col))
                       (incf count (entry r c))))))
               (live-or-die? (current-state neighbor-count)
                 (if (or (and (eql current-state 1)
                              (<=  2 neighbor-count 3))
                         (and (eql current-state 0)
                              (eql neighbor-count 3)))
                   1
                   0)))
        (dotimes (row rows results)
          (dotimes (column columns)
            (setf (aref results row column)
                  (live-or-die? (aref array row column)
                                (neighbor-count row column)))))))))
 
(defun print-grid (grid &optional (out *standard-output*))
  (destructuring-bind (rows columns) (array-dimensions grid)
    (dotimes (r rows grid)
      (dotimes (c columns (terpri out))
        (write-char (if (zerop (aref grid r c)) #\+ #\#) out)))))
 
(defun run-life (&optional world (iterations 10) (out *standard-output*))
  (let* ((world (or world (make-array '(10 10) :element-type 'bit)))
         (result (make-array (array-dimensions world) :element-type 'bit)))
    (do ((i 0 (1+ i))) ((eql i iterations) world)
;;      (terpri out) (print-grid world out)
      (psetq world (next-life world result)
             result world))))

(defparameter *gosper-glider-gun*
  (let ((w (make-array '(50 50) :element-type 'bit)))
    (loop for (x . y) in '((24 . 0) (22 . 1) (24 . 1) (12 . 2) (13 . 2) (20 . 2) (21 . 2) (34 . 2)
                           (35 . 2) (11 . 3) (15 . 3) (20 . 3) (21 . 3) (34 . 3) (35 . 3) (0 . 4) (1 . 4)
                           (10 . 4) (16 . 4) (20 . 4) (21 . 4) (0 . 5) (1 . 5) (10 . 5) (14 . 5) (16 . 5)
                           (17 . 5) (22 . 5) (24 . 5) (10 . 6) (16 . 6) (24 . 6) (11 . 7) (15 . 7)
                           (12 . 8) (13 . 8))
       do (setf (aref w y x) 1))
    w))

Let's start small, with 50 steps apiece:

  seconds  |     gc     |   consed  | calls |  sec/call  |  name  
-------------------------------------------------------
     0.005 |      0.000 | 1,588,176 |    50 |   0.000103 | LIFE::COMPUTE-FREQUENCIES
     0.002 |      0.000 |   786,432 | 2,534 |   0.000001 | LIFE::MOORE-NEIGHBORHOOD
     0.000 |      0.000 |    20,768 |     1 |   0.000000 | LIFE::CELLS->HASH
     0.000 |      0.000 |         0 |    50 |   0.000000 | LIFE::LIFE-STEP
     0.000 |      0.000 |         0 |     1 |   0.000000 | LIFE::RUN-LIFE
-------------------------------------------------------
     0.007 |      0.000 | 2,395,376 | 2,636 |            | Total

estimated total profiling overhead: 0.00 seconds
overhead estimation parameters:
  8.000001e-9s/call, 1.808e-6s total profiling, 7.04e-7s internal profiling

  seconds  |     gc     |   consed   | calls |  sec/call  |  name  
--------------------------------------------------------
     0.048 |      0.000 | 23,997,056 |    50 |   0.000959 | LIFE-GRID::NEXT-LIFE
     0.000 |      0.000 |          0 |     1 |   0.000000 | LIFE-GRID::RUN-LIFE
--------------------------------------------------------
     0.048 |      0.000 | 23,997,056 |    51 |            | Total

estimated total profiling overhead: 0.00 seconds
overhead estimation parameters:
  8.000001e-9s/call, 1.808e-6s total profiling, 7.04e-7s internal profiling

LIFE> (run-life 50 5000 *gosper-glider-gun*)
  seconds  |     gc     |    consed   |  calls  |  sec/call  |  name  
-----------------------------------------------------------
     0.828 |      0.016 | 241,614,336 |   5,000 |   0.000166 | LIFE::COMPUTE-FREQUENCIES
     0.237 |      0.032 | 118,408,160 | 334,156 |   0.000001 | LIFE::MOORE-NEIGHBORHOOD
     0.000 |      0.000 |      57,248 |       1 |   0.000000 | LIFE::CELLS->HASH
     0.000 |      0.000 |           0 |   5,000 |   0.000000 | LIFE::LIFE-STEP
     0.000 |      0.000 |           0 |       1 |   0.000000 | LIFE::RUN-LIFE
-----------------------------------------------------------
     1.064 |      0.048 | 360,079,744 | 344,158 |            | Total

estimated total profiling overhead: 0.62 seconds
overhead estimation parameters:
  8.000001e-9s/call, 1.808e-6s total profiling, 7.04e-7s internal profiling

LIFE-GRID> (run-life *gosper-glider-gun* 5000)
  seconds  |     gc     |     consed    | calls |  sec/call  |  name  
-----------------------------------------------------------
     4.760 |      0.288 | 2,400,164,160 | 5,000 |   0.000952 | LIFE-GRID::NEXT-LIFE
     0.014 |      0.000 |             0 |     1 |   0.014479 | LIFE-GRID::RUN-LIFE
-----------------------------------------------------------
     4.775 |      0.288 | 2,400,164,160 | 5,001 |            | Total

estimated total profiling overhead: 0.01 seconds
overhead estimation parameters:
  8.000001e-9s/call, 1.808e-6s total profiling, 7.04e-7s internal profiling

Hm. Honestly wasn't expecting to be cleaning the grid's clock yet, but we're using about a quarter of the time and about a sixth of the memory. Remember, at the low end of the spectrum, the difference between a poor algorithm and a good one isn't very big. If you've got a corpus of length 20, it really doesn't matter whether you pick bubble-sort, quicksort or timsort. In fact, you'd expect the better algorithms to do mildly worse on smaller data sets, since their optimizations don't have as much opportunity to pay for themselves.

Let's crank it up a bit to figure out how these numbers diverge.

LIFE> (run-life 50 50000 *gosper-glider-gun*)
  seconds  |     gc     |     consed    |   calls   |  sec/call  |  name  
---------------------------------------------------------------
     8.812 |      0.456 | 2,418,859,120 |    50,000 |   0.000176 | LIFE::COMPUTE-FREQUENCIES
     2.265 |      0.092 | 1,202,364,272 | 3,352,906 |   0.000001 | LIFE::MOORE-NEIGHBORHOOD
     0.000 |      0.000 |        58,240 |         1 |   0.000000 | LIFE::CELLS->HASH
     0.000 |      0.000 |             0 |    50,000 |   0.000000 | LIFE::LIFE-STEP
     0.000 |      0.000 |             0 |         1 |   0.000000 | LIFE::RUN-LIFE
---------------------------------------------------------------
    11.077 |      0.548 | 3,621,281,632 | 3,452,908 |            | Total

estimated total profiling overhead: 6.24 seconds
overhead estimation parameters:
  8.000001e-9s/call, 1.808e-6s total profiling, 7.04e-7s internal profiling

LIFE-GRID> (run-life *gosper-glider-gun* 50000)
  seconds  |     gc     |     consed     |  calls |  sec/call  |  name  
-------------------------------------------------------------
    48.140 |      3.116 | 24,001,592,224 | 50,000 |   0.000963 | LIFE-GRID::NEXT-LIFE
     0.025 |      0.000 |              0 |      1 |   0.024799 | LIFE-GRID::RUN-LIFE
-------------------------------------------------------------
    48.165 |      3.116 | 24,001,592,224 | 50,001 |            | Total

estimated total profiling overhead: 0.09 seconds
overhead estimation parameters:
  8.000001e-9s/call, 1.808e-6s total profiling, 7.04e-7s internal profiling

The optimized gridless approach is holding steady at about 1/4 the time taken and about 1/6 the memory used. Again, because this is a garbage-collected language, those affect each other. Each trip of the collector adds precious seconds to the tally of consumed resources, so being a memory hog does come back to bite you in the ass, even if you're not directly optimizing for space. Last one. Don't try this at home, unless you have something else to do for a little while.

LIFE> (run-life 50 5000000 *gosper-glider-gun*)
  seconds  |     gc     |      consed     |    calls    |  sec/call  |  name  
-------------------------------------------------------------------
   924.769 |     45.233 | 252,714,140,464 |   5,000,000 |   0.000185 | LIFE::COMPUTE-FREQUENCIES
   145.698 |     10.865 | 109,648,457,760 | 335,415,406 |   0.000000 | LIFE::MOORE-NEIGHBORHOOD
     0.000 |      0.000 |              64 |   5,000,000 |   0.000000 | LIFE::LIFE-STEP
     0.000 |      0.000 |               0 |           1 |   0.000000 | LIFE::RUN-LIFE
     0.000 |      0.000 |          20,320 |           1 |   0.000000 | LIFE::CELLS->HASH
-------------------------------------------------------------------
  1070.467 |     56.098 | 362,362,618,608 | 345,415,408 |            | Total

estimated total profiling overhead: 635.56 seconds
overhead estimation parameters:
  8.000001e-9s/call, 1.84e-6s total profiling, 8.24e-7s internal profiling

LIFE-GRID> (run-life *gosper-glider-gun* 5000000)
  seconds  |     gc     |       consed      |   calls   |  sec/call  |  name  
-------------------------------------------------------------------
  4818.453 |    340.823 | 2,400,161,918,448 | 5,000,000 |   0.000964 | LIFE-GRID::NEXT-LIFE
     5.769 |      0.000 |             3,008 |         1 |   5.768999 | LIFE-GRID::RUN-LIFE
-------------------------------------------------------------------
  4824.222 |    340.823 | 2,400,161,921,456 | 5,000,001 |            | Total

estimated total profiling overhead: 9.04 seconds
overhead estimation parameters:
  8.000001e-9s/call, 1.808e-6s total profiling, 7.04e-7s internal profiling

We're still ahead by roughly the same fraction, but the absolute numbers have grown pretty drastically. I know which version I'd rather rely on for crunching large life patterns.

These aren't all the optimizations we could pull, by the way. If we wanted to do better, we could inline moore-neighborhood within compute-frequencies, and we could keep it from consing nearly as much by working with its coordinate integers directly rather than allocating a fresh list of conses on every call. One optimization in particular would be relatively difficult with the grid approach: checking for a barren world before each step. If we ever get an empty set as a result, we can return immediately rather than spinning our wheels until we reach the end of the step counter. That's cheap for us, since we just need to check (= 0 (hash-table-count cells)), whereas doing the same thing the obvious way would add another full traversal of the corpus per step to the already much slower traditional approach.
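Sketched in Common Lisp, that early exit might look like the following. This is a sketch only: it assumes the hash-set representation (live cells stored as keys of a hash table) and a life-step function shaped like the one in the profiles above; the exact names and signatures are guesses, not the actual source.

```lisp
;; Must appear before moore-neighborhood's definition to take effect;
;; this is the inlining optimization mentioned above.
(declaim (inline moore-neighborhood))

(defun run-life (steps cells)
  "Step CELLS up to STEPS times, bailing out early on a barren world."
  (dotimes (i steps cells)
    ;; A dead world stays dead, so there's no point spinning through
    ;; the remaining steps. (zerop ...) is equivalent to (= 0 ...).
    (when (zerop (hash-table-count cells))
      (return cells))
    (setf cells (life-step cells))))
```

Since hash-table-count is O(1), the check costs effectively nothing per step, which is exactly why it's free here and expensive for the grid representation.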

Ok. I'm going to sleep. I was going to do a similar writeup using the Haskell profiler, but this one took a lot out of me. Hopefully, you've learned something from all this. Fresh code is up at my Life github. Feel free to beat those numbers. I'd be particularly interested if someone wanted to do some micro-optimization on the same problem and put forth an explanatory article.


Footnotes

1 - [back] - Which is a huge topic in its own right, and involves things like hand-optimizing memory cache interactions for minimum fall-through and various other low-level, machine oriented optimizations.

2 - [back] - We're optimizing for time because space tends to be cheap, and running things fast is fun.