Friday, March 29, 2013

Actors Again

That cl-actors fork has gotten a few hours thrown at it. This time around, I integrated the Optima pattern-matching library, and made certain things a little easier. I'm nowhere near done yet though; in addition to the rest of the ToDos from last time, and seeing what I can do with green threads, I need to settle things that the Erlang guys have clearly been thinking about for a few years.

First

How do you deal with error reporting/handling here? And I specifically mean

(defmethod initialize-instance :after ((self actor) &key)
  "Uses the main function name to create a thread"
  (with-slots (behavior in name thread) self
    (setf thread 
          (bt:make-thread 
           (lambda () 
             (loop
                (handler-case
                    (let ((res (funcall behavior (dequeue in))))
                      (loop for target in (targets self)
                         do (enqueue res target)))
                  (match-error (e)
                    (format t "There isn't a match clause that fits. Do something more intelligent with unmatched messages.~%~a~%~%" e))
                  (error (e)
                    (format t "BLEARGH! I AM SLAIN! (this should kill the actor, and possibly call some fall-back mechanism)~%~a~%~%" e)))))
           :name name))))

here. Spitting unmatched messages out at *standard-output* sounds kind of ok, until you start thinking about how you'd deal with any kind of non-trivial system, or a situation where you want to defer those messages to someone else who might know how to deal with them. The standard Supervisor infrastructure that Erlang implements looks like it would be a good solution, and will probably be easier to put together in Lisp. That's more or less the only sane option for things like unmatched messages, because you don't ever want those to derail the whole system.
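To make that concrete, here's the direction I'm tempted to take the match-error clause. This is purely a sketch: supervisor is an imagined extra slot on the actor class, the :unmatched message format is made up on the spot, and the success path from the real loop is elided.

(let ((msg (dequeue in)))
  (handler-case
      (funcall behavior msg) ; result-forwarding elided for brevity
    (match-error ()
      (if (supervisor self)
          ;; defer the message to something that might know what to do with it
          (enqueue (list :unmatched self msg) (supervisor self))
          (format t "Unmatched message: ~a~%" msg)))))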

The second case there is more ambiguous though.

  (error (e)
    (format t "BLEARGH! I AM SLAIN! (this should kill the actor, and possibly call some fall-back mechanism)~%~a~%~%" e))

That handles all other errors: run-time snafus like passing the wrong number of arguments to format. For these, you really, truly do want to take the actor out of commission until it gets fixed; there's no point whatsoever in trying to process further messages until then. So another reasonable approach here would be to use Common Lisp's built-in condition system. That is, re-raise the error and give the user a restart option to define a new behavior in-line.
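Concretely, the inner loop might grow something like the following. This is just a sketch: replace-behavior and skip-message are restart names I'm making up, and behavior and in are the slots from the class shown above.

(let ((msg (dequeue in)))
  (restart-case
      (funcall behavior msg)
    (replace-behavior (new-fn)
      :report "Define a new behavior for this actor and retry the message."
      (setf behavior new-fn)
      (funcall behavior msg))
    (skip-message ()
      :report "Drop the offending message and carry on."
      nil)))

Whether an unhandled error in a background thread actually lands you in the debugger, rather than just killing the thread, depends on the implementation, which is one of the things I'd need to check.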

I don't know, there might be pitfalls there that I'm not seeing, which is why I need to think pretty hard about it, and then try it out.

Second

I want to make sure that networks resulting from this system are flexible enough to withstand change. That's a tougher one, and the built-in behav function from the original cl-actors doesn't quite satisfy. The two things we want to preserve from an actor when we're modifying it, if we want robust networks, are its state and its inbound message queue. The second is hopefully obvious, but the first might not be, given how that behav function I just linked is implemented. It leaves the task of assigning new state up to the user, who may not know what the latest state of the actor is. Worse than that, there may be no way for them to find out, because that state is locked away in a closure with no hooks other than the ones manually defined by the behavior they're trying to replace. I'm not entirely sure what the solution there is, but it probably won't be straightforward.

What I apparently want is a macro that takes a series of ematch clauses, and returns a macro that accepts a list of state variables/values, which returns a function that takes a message and returns whatever we want to pass on. I'd call that first macro to define the skeleton of my new behavior, then pass the result through to an actor which would provide its own internal state to fill in the gaps, then take the result of that operation and assign it to a new behavior. The end result would be a function which I can define on the outside somewhere, but which will take into consideration an actor's most current internal state when finally applied. The queue, of course, goes nowhere, and assuming the new behavior function doesn't error out anywhere, the actor should continue dequeueing on its merry way.
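In very rough code, flattening the macro-that-returns-a-macro idea into a macro that defines a state-parameterized constructor, it might look something like this. defbehavior and counting-greeter are names I'm making up here; only ematch comes from Optima.

(defmacro defbehavior (name state-vars &body ematch-clauses)
  "Define NAME as a function that takes current state values and
returns a fresh message handler closed over them."
  `(defun ,name ,state-vars
     (lambda (message)
       (optima:ematch message ,@ematch-clauses))))

;; Define the skeleton on the outside somewhere...
(defbehavior counting-greeter (count)
  ((list :greet name)
   (format nil "Hello, ~a! You're visitor #~a." name (incf count))))

;; ...and let the actor fill in the gaps with its own current state:
;; (setf (behavior some-actor) (counting-greeter current-count))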

Finally

That "assuming" isn't to be glossed over. Replacing an actor or behavior with another is straightforward in toy examples, but in an actual, running system, you want good fail-over capabilities. Specifically, if the new guy chokes on his first couple of messages, you want the option of slotting in the old process before too much time has elapsed. You definitely don't want to grind the entire downstream system to a halt while the programmers figure out what the issue is with the new code. Another thing that might be useful is a default :test message that you can throw at an actor which should tell the sender whether it can respond to the given message. In a situation where you're replacing an existing actor with a new one, or just replacing the behavior of an existing actor, you want to know that the messages sent out by the new thing are going to be intelligible to their targets before committing to the change-over. How the reporting for this is going to work, I haven't the first clue, but I've got more than enough implementation to do already, so I'll probably leave that one for the next article.

Friday, March 22, 2013

Actors

Two things this time. First...

...An Admission

I'm weak.

It turns out there are exactly two things I'm willing to run non-free software for, and one is wifi access[1]. The other option is, of course, buying an Atheros wifi card, which I intend to do eventually but don't have the spare $100 for right at this very moment. Let's move on and say no more about this.

Actors

I've been on vacation for a little while now, which finally gave me the chance to get back into some Common Lisp[2]. You know, since I've mostly been hacking Python at work for the past five months or so. Specifically, I got to do some long-overdue thinking on that Actors library I forked forever and a fucking day ago.

The big problem with actors as they're implemented here is that, while they don't care where their messages come from, they very much do care where their messages go. To be fair, this seems to be a very common implementation, and not limited to cl-actors, so I don't think it's worth holding against the author. What it does is force you to choose between three fairly shitty alternatives for composability:

1. Message Targets

Define a communication convention whereby a piece of the message is going to specify the actor that it needs to be passed to next.

(define-actor greeter () (target name)
  (send target (format nil "Hello, ~a!" name))
  next)

(define-actor printer (stream) (msg)
  (format stream "~a~%" msg)
  next)

(defparameter *greeter* (greeter))
(defparameter *printer* (printer :stream *standard-output*))

(send *greeter* *printer* "whoeverthefuck")

The problem here is that you're setting up a situation where each sender is going to have to know the entire call chain its message will be going through. That's not good because changing any node suddenly becomes an exercise in frustration if you've got a non-trivial actor network set up, and it only gets worse if you want to do anything other than straight chaining actors. For instance, think about how you would implement an imbalanced tree here; a situation where you have actors A through F, and what needs to happen is

    actor-A
    ├──> actor-B
    ├──> actor-C
    └──> actor-D
         └──> actor-E ──> actor-F

2. Globals

The Erlang equivalent is "registered processes"; you define a global name which will refer to your actor instance, and any other actors that need to interact with it use that global name.

(define-actor greeter () (name)
  (send *printer* (format nil "Hello, ~a!" name))
  next)

(define-actor printer (stream) (msg)
  (format stream "~a~%" msg)
  next)

(defparameter *greeter* (greeter))
(defparameter *printer* (printer :stream *standard-output*))

(send *greeter* "whoeverthefuck")

The problem has moved from the last line to the second line. This approach requires you to re-write pieces of every non-leaf actor if you want to use them in a new context. Ideally, an actor wouldn't have to care where its messages go, or at least it wouldn't have to care until after it's instantiated. That would let you increase the isolation of your components, thereby giving you more and easier opportunities for code reuse.

3. Local State

Instead of manually specifying targets, make the actor track its targets with a piece of local state. You'd then have to pass targets in along with the other initialization parameters.

(define-actor greeter (targets) (name)
  (let ((msg (format nil "Hello, ~a!" name)))
    (mapcar (lambda (trg) ;; blast you, canonical truth value T!
              (send trg msg))
            targets))
  next)

(define-actor printer (stream) (msg)
  (format stream "~a~%" msg)
  next)

(defparameter *printer* (printer :stream *standard-output*))
(defparameter *greeter* (greeter :targets (list *printer*)))

(send *greeter* "whoeverthefuck")

The two problems with this are complexity and definition dependencies. Complexity because, as you can see from that new greeter definition, most of the body code is now dealing with where the message is meant to go next, rather than with the business logic of what this actor is supposed to be doing. I'm tempted to call this the Yak Shaving Anti-pattern, except that someone else has certainly identified and named it already.

The other problem is apparent in the ordering of those two defparameter lines. Note that *greeter* is now defined second, and that this isn't an accident. If you did it the other way around, you'd discover that *printer* must already be defined in order to be specified as a message target. It may be a minor annoyance, but I prefer to avoid those where I can.

The Solution?

As far as I can see (and thanks to Paul Tarvydas for pointing me in this direction), it's to separate the actors from their call chains. That is, define an actor as essentially a queue, a thread and a function that returns some value given some message, then introduce an external mechanism by which to get that return value to the next node in the network. What we really want to be able to do is something like

(define-actor greeter () (name)
  (format nil "Hello, ~a!" name))

(define-actor printer (stream) (msg)
  (format stream "~a~%" msg))

(defparameter *greeter* (greeter))
(defparameter *printer* (printer :stream *standard-output*))

(link *greeter* *printer*)

(send *greeter* "whoeverthefuck")

which concentrates the links entirely into that call to link, and leaves the actors themselves cheerfully oblivious to what they're interacting with at the time. It also separates out the general patterns of communication[3] from the business logic of an actor body, so your define-actors are only dealing with the stuff they want to do, rather than the minutiae of who needs to do the next bit. So, here's how we do it. Firstly, we'll want to change the definition of an actor to take into account the fact that others may be watching.

(defclass actor ()
  ((name :initarg :name
         :initform (error ":name must be specified")
         :accessor name)
   (behavior :initarg :behavior
             :initform (error ":behavior must be specified")
             :accessor behavior
             :documentation "Behavior")
   (watched-by :initarg :watched-by :initform nil 
             :accessor watched-by)
   (in :initform (make-queue) :accessor in
       :documentation "Queue of incoming messages")
   thread))

watched-by is the addition there; it'll hold a list of all actors and/or queues that might need to be notified about this actor's output. Next, we'll want to simplify define-actor slightly, because we want to collect the return value from its behavior rather than assuming it sends the message on itself.

(defmacro define-actor (name state vars &body body)
  "Macro for creating actors with the behavior specified by body"
  `(defun ,name (&key (self) ,@state)
     (declare (ignorable self)) ;; You might not care about referencing self, and that's ok
     (setf self (make-actor (lambda ,vars (progn ,@body)) ,(string name)))
     self))

I also took the opportunity to do away with the need for an explicit next. Near as I can tell, this is just going to prevent me from changing out behaviors at runtime and creating one-cycle actors. My intuition about the first is that it'd be easier to define a new actor and insert it into the network than it would be to reliably and correctly rip out the behavior function of one we already have in place, so I don't mind losing that, though I reserve the right to change my mind if experience teaches me the contrary. The second one is a situation where I'd really want to use a thread with an embedded lambda anyway, so not being able to use an actor there doesn't sound particularly disastrous.

Finally, we'll need to change what each cycle through the message queue does.

(defmethod initialize-instance :after ((self actor) &key)
  "Uses the main function name to create a thread"
  (with-slots (behavior in name thread) self
    (setf thread 
          (bt:make-thread 
           (lambda () 
             (loop 
                for res = (apply behavior (dequeue in))
                when (watched-by self)
                  ;; TODO -- Add customization to protocol, rather than always send-all
                do (loop for target in (watched-by self)
                      do (enqueue (list res) target))))
           :name name))))

so, instead of just applying behavior to each message, we get the result and send it on to any watchers. The TODO is there because, as written, an actor always notifies all watchers, and we might want to do something like round-robin scheduling instead. The main reason I'm thinking along those lines is that I'm planning to use this library in the construction of a non-blocking web-server, where I'd want a single listener but multiple, parallel parsers/response-generators picking up some percentage of total load. Doing something other than "send one to everyone" is an integral part of that strategy. We'll see how it goes.
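For what it's worth, the round-robin version looks like it would be a small change; something like the following sketch, where round-robin-index is a hypothetical counter slot I haven't actually added to the class.

(defun notify-one (self res)
  "Send RES to exactly one watcher, rotating through the list."
  (let ((watchers (watched-by self)))
    (when watchers
      (enqueue (list res)
               (nth (mod (incf (round-robin-index self))
                         (length watchers))
                    watchers)))))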

I should note that you don't have to decide to use only one of send/link here; even with the connection system working[4] there are use cases where you really do want a manual send in an actor body. To be fair, most of those use cases seem to be places where you wouldn't really want to use actors in the first place, but I've reserved judgment and left in both options in the interests of flexibility.

Still ToDo

I've already mentioned separating out the send pattern for an actor so that you can have more flexibility in deciding targets. Although, to be fair, I'm not entirely sure whether that's the best approach; it might be possible to implement different behaviors by just specifying different network shapes rather than by complicating actors further. I'll think on it, and probably solicit some advice from people smarter than I am.

Some additional network-oriented constructs would be nice. We've already got link and chain, but it seems like splice, prune and nip might be useful too. splice would take two linked actors and a third actor, and insert the third one between the first two. prune would take an actor, kill it and remove it from any watched-by lists it might be on. nip would basically do the opposite of splice; take three linked actors, remove the middle one and connect the first to the last.
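In their naive, hard-editing forms, those come out to a few lines each. A sketch only, assuming link does the (push target (watched-by self)) mentioned below, and inventing unlink for the occasion:

(defun unlink (from target)
  (setf (watched-by from) (remove target (watched-by from))))

(defun splice (from new to)
  "Insert NEW between the linked actors FROM and TO."
  (unlink from to)
  (link from new)
  (link new to))

(defun nip (from middle to)
  "The reverse of SPLICE: cut MIDDLE out and reconnect FROM to TO."
  (unlink from middle)
  (unlink middle to)
  (link from to))

;; PRUNE is the hard one; finding everything that watches the victim
;; needs either a central registry or two-way links (see below).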

While I'm at it, it would be nice if all these functions, real and notional, played by Actor Model rules rather than doing hard edits. For instance, instead of link doing (push target (watched-by self)), it would send a message to self which would get processed when it came up in the queue. This has a bit more background complexity than the straight-up side-effect, but it prevents the actor from dropping any messages that might be getting processed while the change is taking place.
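So link, for instance, would turn into something like this sketch, assuming messages are single objects and that enqueue can be pointed at an actor's in queue directly:

(defun link (self target)
  "Ask SELF to add TARGET, instead of editing its state directly."
  (enqueue (list :link target) (in self)))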

While I'm at that, it would be nice if actors automatically responded to certain messages without being specified explicitly in the define-actor body. Off the top of my head, :link (for creating connections), :drop (for breaking them), :set (for changing actor state) and :ping (to allow for supervisor-style constructs later).
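That might shake out to a dispatch layer in front of the user-supplied behavior. Again, just a sketch, with a hypothetical plist-style state slot standing in for whatever :set would actually manipulate:

(defun dispatch (self msg)
  (optima:match msg
    ((list :link target)
     (push target (watched-by self)))
    ((list :drop target)
     (setf (watched-by self) (remove target (watched-by self))))
    ((list :set key value)
     (setf (getf (state self) key) value))
    ((list :ping sender)
     (enqueue (list :pong self) (in sender)))
    (_ (funcall (behavior self) msg))))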

The reason I'm just listing these rather than building them right now is that some of them would require a fundamental change to the way the system works. For one thing, accepting default messages implies that a message is a single object that we pattern-match on, rather than the argument list we currently apply the behavior to. For another, things like prune imply either a centralized storage method for all actors, or two-way links between nodes, neither of which I'm sure is a good idea. It might be better to assume that connections are only going to be created at startup.

Anyhow, in the meanwhile, what I've got here is a trivially composable actor system, which lets you re-use any actor at will in any context that applies. That by itself makes the effort worth it as far as I'm concerned. I'll see what I can do for the next pass.


Footnotes

1 - [back] - The other is vintage gaming. Which doesn't pollute my main machine, but I do have a desktop set up at home which has a virtual Win XP machine where I installed a bunch of games from the golden age of fantasy gaming; copies of Icewind Dale 2, Planescape Torment, Baldur's Gate 2, and Dungeon Keeper.

2 - [back] - And a bunch of sketching, but this isn't the place for that. If you're interested, go to my deviantart instead, I'll be uploading a new batch shortly.

3 - [back] - Though, as you'll see later, those could probably be separated further still. I'll be working on it for the next little while.

4 - [back] - And in most cases, producing much more elegant and flexible code, I might add.

Thursday, March 14, 2013

What Programming Language Should I Learn?

I've seen this question pop up on various forums with disturbing frequency lately. Enough that I just wrote this so that I can link people to it instead of typing the advice out each time. The stuff I cover here has already been touched on in a post called Self Titled. Go read that if you want more perspective on my opinion, but it's fairly long so I need to put something shorter and more accessible together.

A Better Question

is "What do I want to do with my programming skills?"

If your goal is merely employment in commercial IT in the shallow future, then the answer is simple. You should learn C++. Then one of Java, C#, Objective C or PHP depending on what niche you want to work in. Then you should stop learning things. After a little bit of a grind, and as long as you're not a complete asshole, or idiot, or both, you'll get promoted to team lead somewhere. At that point you're not writing code, so it doesn't really matter what languages you know or how well.

That's it, you can go.

Pretend the rest of this article doesn't exist.

If You're Still Reading

your goal is to push the bleeding edge of Comp Sci/IT forward, and you have your work cut out for you.

If you're serious about this goal, then you need to understand something. Being a new programmer and asking "What language should I learn?" is roughly like being an aspiring carpenter and asking "Should I learn to use hammers or screwdrivers?". You won't get good answers because it's the wrong question. Usually, you get an avalanche of people coming in to push their pet language forward ("Definitely just stick to the hammer"), or to push the currently fashionable answers ("No one uses hammers anymore, learn to use a nail gun"), and you shouldn't listen to any of them.

Languages are tools, but they're not like physical tools. A language is not a bandsaw. It's the Theory of Relativity. A collection of cognitive tools and abstractions that help you think about unfamiliar and counterintuitive concepts precisely enough to explain them to very stupid machines, and perhaps to very inattentive humans. I say this because the askers of the big question often say that someone has told them something like "Blub is old; don't bother with it". That's not a valid argument against a language. Theories don't rust. Occasionally they're disproven, or revised, but merely being old isn't enough to discredit them[1].

If you want to be a brilliant programmer with a chance of impacting the deep future, sure, you need to understand how the underlying physical machine actually works, and C/C++ helps with that, but it's nowhere near sufficient. You need to really understand OO, which means spending a while hacking on Smalltalk or Ruby or Simula. You need to understand the different kinds of OO on offer, which means dealing with class-based systems (like C++/Java et al), prototype systems (JavaScript or Self) and generic-based systems (Common Lisp) at minimum.

You need to go beyond OO; understand functional and declarative programming, in both strongly/statically and dynamically typed flavors. If you just want a list of languages, that means a whirlwind tour of Haskell/Scala/an ML, a Lisp, Prolog or Erlang, and more than I can reasonably list here besides. It's probably a good bet to just look at the Programming Paradigms page on Wikipedia and read anything linked off the right sidebar, it's all relevant.

You need a thorough understanding of compilers, which you can get by putting a few years into really, truly understanding Lisp macros and/or reading the Purple Dragon book[2] and/or writing one. You'll need to know about data structures, both traditional and functional[3], about set theory, and graph theory, and probability theory, and advanced algebra and probably a hundred other things I missed. Including things that are only incidentally related to programming, like source control, human management/interaction, hardware maintenance, writing, security, typing and the social impacts of the work we do.

Learning to program is not a thing you pick up in seven days, and you could do a lot worse than to start by reading that article. Just make sure to also disabuse yourself of the idea that you do it by picking one language and sticking to that.

TL;DR

So, in case you skipped directly to this line, the short answer is "all of them, and that's just for starters". Good luck; I'll see you on the other side.


Footnotes

1 - [back] -"Blub has an inactive community" or "Blub's community is principally composed of assholes" are valid arguments against using a language. But keep in mind that you can still learn a lot by understanding a language that assholes use, or that very few people decided to use. Also, keep in mind that the metrics related to these arguments are relative and necessarily personal; if you're close friends with five or six people who use Io, then it really doesn't matter much what the rest of the world is doing.

2 - [back] - If the price-tag scares you, I should mention that there's a way to get a softcover edition for something like $40, but it doesn't include the same exercise sections or cover, and is printed on pretty shitty stock. That's what I ended up doing, and mine's still in one piece even after a few years, but I can't find the link to that deal anymore, even though one of the customer images is that edition of the book.

3 - [back] - I'm putting the Amazon link there, but the first link in a google search about "Purely Functional Data Structures" seems to be a legitimate, free PDF copy of the same from CMU.

Sunday, March 10, 2013

Haskell Profiling: Third Impact, or, AcidState vs. The World

The other question mentioned last time was "How does AcidState stack up against the other database back-ends?". So, here's a crack at the answer.

My benchmarking method was basically to port the GoGet back-end to HDBC and MongoDB, then see how each one does at

  • user insertion
  • user listing
  • item insertion
  • user querying

All of this was done on a 64-bit Debian Wheezy machine running on top of a Core i3. Data for other platforms/architectures welcome, but I'm not going there myself. Without further ado:

  • SM - Starting with empty tables/collections and dealing with user #42
  • MD - Starting with 1000 user records and dealing with user #789
  • LG - Starting with 50000 user records and dealing with user #42
  • LG2 - Starting with 50000 user records and dealing with user #45678
  • LG-O - Starting with 50000 user records and dealing with user #45678, compiled with ghc -O --make instead of just ghc --make

These are hosted on my own server because Blogger doesn't seem to like the Criterion markup. You'll find the same files in the codebase if you prefer viewing local copies.

You'll note that I only managed the tiny benchmark for MySQL, and it's absent from the rest; this is because the connection kept choking randomly, which is consistent with my real-world experience. Not pictured are the four or five attempts/clears/restarts that I had to pull before even getting the numbers for a 100-user corpus. No other database, including SQLite, did anything similar.

So let's do a rundown.

Obvious

  • The vast majority of time spent inserting a user goes to generating the scrypt hash. This is obvious because of the huge difference between inserting an item and inserting a user. And, really, this is what you want in a real-world scenario. It should take fairly significant resources to try a password so as to make brute-forcing them a poor option, but in hindsight I could have saved myself a lot of time and compute by cutting that portion of user insertion across the board for benchmarking purposes.
  • The ghc optimization flag approximately halves most numbers, and improves AcidState lookups by about 5x.
  • MongoDB consistently outperforms all comers when it comes to user insertion, and performs very well on sub-insertion with small data-sets. The $push directive seems to be much faster than merely popping a new top-level record in, which I assume is why it manages to take about 1/3 the time of the next contender until we get up to the 50k corpus.
  • SQLite loses in every category at every corpus size, but not by as much as I was expecting. It's actually a pretty good little lightweight DB engine, assuming you don't need to support too many simultaneous requests or too much data.
  • AcidState is an absolute fucking monster. The benchmarks it loses, it loses narrowly[1], but the benchmarks it wins, it wins by an 8x or larger margin. Take special note that while the other engines bench in the high single/low double digit milliseconds, Acid consistently posts list and select numbers in the low double-digit microseconds. Granted, insertion speed goes down a bit based on corpus size, but selection speed is always the same range of extremely low numbers. That's excellent for web applications, which tend to have a usage profile of "rare-ish insertions coupled with large and common lookups". It performs suspiciously well on selects. Well enough that I went back to GHCi and tried mapM (getUserAcid acid) [40000..40050] and mapM_ (getUserAcid acid) [40000..45000] on the large corpus, just to make sure it wasn't recording thunk time instead of actual result time. It isn't. An IxSet lookup is actually just that fast.

Not-So Obvious

There isn't as big a difference in code size as I was expecting. Honestly, I thought AcidState would be much chunkier than the competition, but it only turns out to be the longest by about 10 lines. This might be because I was determined to work with Haskell-style type declarations in each of the models. The reasoning there was that I'd typically be wanting to pull out data then convert it to some external data format after the fact[2], so getting a record in a canonical typed format was going to happen sooner or later anyway. This ends up working hardest against MongoDB, where conversion from very loosely-typed k/v pairs ends up looking something like

itemFromMongo [nameField, commentField, statusField, countField] = 
  Item { itemName =  name, itemComment = comment, itemStatus = status, itemCount = count }
  where Database.MongoDB.String n = value nameField
        name = Text.unpack n
        Database.MongoDB.String c = value commentField
        comment = Text.unpack c
        Database.MongoDB.String s = value statusField
        status = read $ Text.unpack s
        Int32 co = value countField
        count = fromIntegral co

...which is ugly as fuck. A couple of those lines could have been eliminated by declaring itemName and itemComment as Text rather than String, but that would only make it very slightly less ugly.

MySQL crashes like a champ. I use it in a couple of real applications, and I remember having configuration problems there too. It really seems to want a fresh connection each time you do any significant chunk of work, and that seems like it would slow the whole thing down further. Like I said, this is probably a misconfiguration somewhere, and I welcome help if someone wants to go over the tests again on a different machine, giving MySQL more airtime. For the benchmarks it completed, it performs marginally better than AcidState on insertion and very marginally better than SQLite on selection.

It is almost trivially easy to port between HDBC back-ends. You need to call a different connect function and pass it different arguments, but that's more or less it. The only other hiccup I ran into here is the different table creation syntax; SQLite barfs if you try to declare something AUTO_INCREMENT[3], but MySQL requires the statement or leaves you to specify the ID manually. I'm not sure what the differences are between implementations of the SQL standard across other engines, but they seem minimal enough that hopping around wouldn't be difficult.

MongoDB really really doesn't mesh with the Haskell way of doing things. I already mentioned this in the first Not-So Obvious point, but I wanted to highlight it a bit more. This is not to say it's bad. In fact, it would be my top choice if not for the massive impedance mismatch it has with the language. My negative opinion may also be exacerbated by the fact that I've used it in Python and Clojure, where there are no such problems because both languages deal with loosely typed k/v pairs as their primary hash construct[4]. As always, it's possible that I'm doing it wrong, in which case, do point that out.

Finally, a hidden advantage that AcidState and to a lesser extent SQLite have is ease of deployment. The other engines all require some degree of setup beyond coding. MySQL needs an installed, running, properly configured server, with appropriate databases and users created, and your program needs to use the appropriate credentials when communicating. MongoDB needs an installed, running server[5]. SQLite just requires that the deployment machine have libsqlite3.so or sqlite3.dll as appropriate. You need to create your tables the first time, but that's it. AcidState doesn't even require that much. All you need to make sure of is that you have the AcidState Haskell library installed when you're compiling your program. The resulting binary has no external deps whatsoever, so you can just run it on any machine of the appropriate architecture and platform. Personally, I'd be willing to give up non-trivial amounts of performance for a simplified setup process, so I'm quite happy that the easiest DB to work with from that perspective is also benching at or near the top for everything, at every corpus size.

Code Notes

That's all my thoughts on results; if that's all you were here for, you can safely disregard the rest of the article.

The code for these tests is here, but it's very raw at the moment. I plan to write some docs and make it easier to run these tests shortly. There's just a couple of things I want to call attention to explicitly so I don't forget about them.

First, note that I'm making one connection and handing it to each function.

main = do
  acid <- openLocalState Acid.initialDB
  mongo <- Mongo.newConn
  sqlite <- Database.HDBC.Sqlite3.connectSqlite3 "GoGetDB"
--  mysql <- Database.HDBC.MySQL.connectMySQL defaultMySQLConnectInfo
  defaultMain 
    [ 
      benchBlock "AcidState" acid 
      (insertUserAcid, insertItemAcid, getUserAcid, (\acid -> query' acid Acid.GetAccounts)),
      bgroup "HDBC" 
      [ 
--        hdbcBlock "MySQL" mysql,
        hdbcBlock "SQLite" sqlite
      ], 
      benchBlock "MongoDB" mongo 
      (insertUserMongo, insertItemMongo, getUserMongo, Mongo.getAccounts)
    ]
  Mongo.close mongo
  Database.HDBC.disconnect sqlite
--  Database.HDBC.disconnect mysql
  createCheckpointAndClose acid

This is the recommended usage for AcidState and MongoDB, but I'm not entirely convinced it's the best approach for arbitrary SQL databases, or entirely sure how HDBC handles connection-pooling. The end result is, I think, to somewhat deflate the times attributed to using the HDBC back-ends.

Second, if you look at how the SQL database is organized

createTables :: Connection -> IO [Integer]
createTables conn = withCommit conn q
  where q conn = mapM (\s -> run conn s []) 
            ["CREATE TABLE accounts (id INTEGER PRIMARY KEY AUTO_INCREMENT, name VARCHAR(120), passphrase VARCHAR(250))",
             "CREATE TABLE items (user VARCHAR(120), name VARCHAR(120), comment VARCHAR(120), status VARCHAR(4), count INTEGER)"]

You'll see that I slightly re-configured the storage approach to match the back-end's relational model. Also, while the individual account selectors return a Maybe Account, getAccounts just does the naive SQL thing of SELECT * FROM

getAccounts :: Connection -> IO [[SqlValue]]
getAccounts conn = withCommit conn q
  where q conn = quickQuery' conn "SELECT * FROM accounts" []

getAccountBy conn column value = withCommit conn q
  where qString = "SELECT * FROM accounts WHERE " ++ column ++ " = ?"
        q conn = do
          res <- quickQuery' conn qString [toSql value]
          case res of
            [] -> return $ Nothing
            (u@[_, SqlByteString name, _]:rest) -> do
              items <- getAccountItems conn $ unpack name
              return $ Just $ accountFromSql u items

accountByName :: Connection -> String -> IO (Maybe Account)
accountByName conn name = getAccountBy conn "name" name

accountById :: Connection -> Integer -> IO (Maybe Account)
accountById conn id = getAccountBy conn "id" id

The MongoDB back-end does the same sort of thing

getAccounts pipe = do
  res <- run pipe $ find (select [] "accounts") >>= rest
  return $ case res of
    Right accounts -> accounts
    _ -> []

getAccountBy pipe property value = do
  res <- run pipe $ findOne $ select [property =: value] "accounts"
  return $ case res of
    Right (Just acct) -> Just $ accountFromMongo acct
    _ -> Nothing

accountById :: Pipe -> Integer -> IO (Maybe Account)
accountById pipe id = getAccountBy pipe "id" id

accountByName :: Pipe -> String -> IO (Maybe Account)
accountByName pipe name = getAccountBy pipe "name" name

That means that both Mongo and the HDBC back-ends should have a massive advantage over Acid for this particular function. Really, if I wanted to make it fair and get everyone to return the same data type, I'd have to write a JOIN for the SQL approach and map a conversion function over the whole thing. Acid gets that for free and, just in case I haven't pointed it out thoroughly enough yet, still schools all the rest on listing accounts.

Thirdly, I used manual auto-incrementing for MongoDB

newAccount :: Val v => Pipe -> v -> v -> IO Account
newAccount pipe name pass = do
  Right ct <- run pipe $ count $ select [] "accounts"
  let u = ["id" =: succ ct, "items" =: ([] :: [Database.MongoDB.Value]), "name" =: name, "passphrase" =: pass]
  run pipe $ insert "accounts" u
  return $ accountFromMongo u

I don't think this adds too much overhead to user insertion, since the Mongo docs imply that count is not an expensive operation, but I thought I'd mention it. This is not how I'd do it in the real world, but I didn't feel like figuring out how to reliably predict a MongoDB ID hash for the purposes of benching it.

Now, to be fair, while the field is tilted slightly towards HDBC, the task actually favors noSQL data stores because of how Accounts and Items relate to one another. Where I'd have to pull some JOIN trickery in a relational database, a mere IxSet lookup gives me the same effect with AcidState, and a recursively nesting Document type does it for Mongo.

Next Steps

What I'd really like at this point is some peer review. Either refinements to the tasks, or implementations for other database back-ends[6], or data from different machines/environments, or general comments on approach/results. It would also be nice if someone did benchmarks with a large enough corpus that the entire thing didn't fit in memory. Remember, the question I'm trying to answer here is "How well does AcidState stack up against other data storage options in Haskell", and at this point the answer looks to be "It destroys them and pisses on the ashes". If that's not your experience, or if I missed some approach or edge case in my testing, it would be nice to find out before I start outright recommending it to everyone.

And that error checking is something I'll have to leave to the internet.


Footnotes

1 - [back] - Except that MongoDB is significantly better at item insertion for data sets in the ~1k user range.

2 - [back] - Probably JSON, for most of my purposes.

3 - [back] - It just automatically does the Right Thing©™ with a PRIMARY KEY field.

4 - [back] - If you've never tried Clojure with Monger, incidentally, I suggest you stop reading now and go give it a shot. It is goddamned glorious.

5 - [back] - I've heard that at higher traffic levels, Mongo tends to need more configuration and apparently assumes that it has the run of all server resources itself. I haven't run into such problems, and it seems like you could fix this by putting it on its own server in any case.

6 - [back] - Specifically, I'm looking for pointers on how to get MySQL working properly, and I wouldn't mind someone putting together running code for PostgreSQL, but I won't be doing either myself.