Monday, April 21, 2014

cl-notebook Thoughts

No nuts and bolts this time. Here's a random collection of insights I've had while trying to put together version 0.1 of a notebook-style editor:

Being async Pays

My initial assumption was that evaluation would be an inline thing. That is, you send out a POST request, that code is evaluated synchronously, saved out to your fact base, and the HTTP response consists of that evaluated result which you can then display prettily on the front-end. Turns out that's a big fat no. Even discounting any issues with the Common Lisp return model, and the eventual attempt at making this a multi-user editor, the synchronous approach doesn't quite work. Which I found out the first time I accidentally wrote a loop without the required while. Something like

(loop for next = (pop stack) do (something-to-next))

The trouble should be obvious. That's the sort of computation that you want off in its own isolated thread. And you want to be able to kill that thread if it turns out to be as boneheaded a mistake as that one was, which means that you have to notify the front-end that a computation is in progress, go off and perform it, then notify the front-end again when you're done. And that's not something you can do synchronously.

In the end, I spawn a single tracked thread to do the work of evaluation. I notify the front-ends when it begins work, and then again when it completes. At any point in between, a front-end can send a termination signal to kill that thread rather than waiting it out to completion. You can see the implementation here, here, and here. This will, entirely coincidentally, make it much easier to extend cl-notebook into a multi-user editor later.
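Just to make the shape of that concrete, here's a minimal sketch using bordeaux-threads. The helpers notify-front-ends! and eval-cell-contents are hypothetical stand-ins for brevity, not the actual cl-notebook internals.

(defvar *eval-thread* nil)

(defun start-eval! (cell-id contents)
  ;; notify-front-ends! and eval-cell-contents are made-up helpers:
  ;; the first would push an event-stream message to listening clients,
  ;; the second would evaluate the cell and record its result.
  (notify-front-ends! :starting-eval cell-id)
  (setf *eval-thread*
        (bt:make-thread
         (lambda ()
           (eval-cell-contents cell-id contents)
           (notify-front-ends! :finished-eval cell-id))
         :name "cell-evaluation")))

(defun kill-eval! ()
  ;; Called when a front-end sends the termination signal.
  (when (and *eval-thread* (bt:thread-alive-p *eval-thread*))
    (bt:destroy-thread *eval-thread*)
    (notify-front-ends! :killed-eval)))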

Being surgical pays. Sometimes. Not in the way you'd think.

The first version of this editor basically spat out the complete current notebook state at the front-end on each action, and the front-end re-drew the entire thing on every change. You'd think that the sheer inefficiency of this approach would get to me, but you'd be wrong. It was actually fast enough that had performance been the only factor, I'd have left it at that. The problem with that approach is that you end up clobbering giant chunks of client-side state each time. In particular, re-drawing every cell (and hence, re-initializing each of their CodeMirror instances) meant that I was blowing away undo information for each editor window every time anything happened. And that's annoying as fuck. I'm not sure anyone's formalized the point into a commandment yet, but you really should act as though they have: Thou Shalt Not Mangle Thy Clients' Data. That applies no matter how temporary the data is. Even if your job is to take said data and transform it in some way, say for instance by evaluating it, you should do so with a copy instead of a destructive change. And even when you absolutely must be destructive, be as selectively destructive as you possibly can.
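Just to make the "work on a copy" point concrete with a trivial example (nothing to do with cl-notebook specifically):

;; Destructive: SORT is allowed to reorder the caller's list in place,
;; which quietly mangles whatever they were still holding on to.
(defun ranked (scores) (sort scores #'>))

;; Non-destructive: operate on a copy and hand back a fresh list.
(defun ranked-safely (scores) (sort (copy-list scores) #'>))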

Thinking About Space Doesn't Pay. No, not even that much.

I almost left this one out entirely because it seemed so self-evident, but on reflection, it's something I've had to learn too. Space doesn't even begin to matter for systems like this. I'm talking both about memory and about disk space. Yes, there are some applications for which this is not the case, but it's true as a rule.

The fact-base project in particular has been sticking in my craw in this sense. I kept thinking things like 'Holy shit, I'm keeping history for every cell, on every edit forever. This is going to be huge! I'll have to figure out a way to condense these files, or start them off from a non-zero state so that I can throw out history at some point!'

Completely pointless.

At the beginning of the cl-notebook effort, I started a scratch file which I've been steadily editing, putting through various save/auto-save tests and just all round mauling with updates. This thing has been around for months at this point, and it's taken much harder beatings than the average notebook ever will. Wanna know what its current total size is?

inaimathi@self:~/.cl-notebook/books$ ls -lh base-zg1mm23z
-rw-r--r-- 1 inaimathi inaimathi 675K Apr 20 21:47 base-zg1mm23z
inaimathi@self:~/.cl-notebook/books$

That's smaller than many Word and PDF documents I've worked with, and those don't bother keeping my entire editing history around. So I figure I can get away with treating disk space as if it were infinite for my purposes. Absolute worst case scenario, I'll compress it. And since I'm dealing with plaintext files, that should be rather effective.

inaimathi@self:~/.cl-notebook/books$ pack -t tar.gz base-zg1mm23z
base-zg1mm23z
inaimathi@self:~/.cl-notebook/books$ ls -lh base-zg1mm*
-rw-r--r-- 1 inaimathi inaimathi 675K Apr 20 21:47 base-zg1mm23z
-rw-r--r-- 1 inaimathi inaimathi  23K Apr 20 21:47 base-zg1mm23z.tar.gz

Yeah. I think that'll be good enough.

Return Values are Complicated

I mean, I knew that already, but it turns out there are even more intricacies here. I initially assumed I'd be able to just keep a single return value per cell (by which I mean the return from a single Lisp function, which can be zero, one or more values). Then it hit me that a cell might have more than one expression in it. Then it hit me that return values aren't enough; you need to be able to handle *standard-output* emissions and warnings on a per-expression basis rather than on a per-cell basis, and that we'd want type annotations in some places, since we'll be serializing various things to strings and it would otherwise get confusing. Then I sat down and wrote out something workable. Each cell now stores a result, which is zero or more values, each of which is actually a value and a type.
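Purely as an illustration of that shape (this is my sketch of the structure, not necessarily the exact representation cl-notebook serializes):

;; One entry per expression in the cell; each carries whatever it printed
;; to *standard-output* plus zero or more value/type pairs.
'((:stdout "Testing " :values ((:value "NIL" :type "null")))
  (:stdout ""         :values ((:value "7"   :type "integer")))
  (:stdout "Testing." :values ((:value "NIL" :type "null")
                               (:value "9"   :type "integer"))))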

That lets the front end figure out what it needs to do on a per-cell basis, which means that the server-side implementation of a cell's noise becomes very mechanically simple. It's basically just an extra fact we keep around as a label, which the front-end queries to decide how to transform the high-detail result.

cell-type is not the same thing as cell-language

Early on, I had this idea that I'd be semi-implicit about what's in a cell. At that point there were two cell-types: common-lisp and cl-who. The idea would be that this single cell-type would determine both display and evaluation properties of the contained code. Nope, as it turns out. And the thing that finally made this clear to me was thinking about how I'd treat test code. It's still Common Lisp, you see, so I'd still be evaluating it the same way as any other code cell, but I didn't want it showing up in certain exports.

The solution I ended up settling on is to be explicit about everything. Each cell now has a cell-type as well as a cell-language. The first is one of markup (for prose blocks), code (for code blocks), and tests (for code blocks that I'd want to separate from actual runtime code); the second names the language the cell is written in.
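Since a notebook is itself a fact base, I picture a cell ending up as a handful of facts along these lines (illustrative only; 347 is a made-up cell id, and the property names are my guesses rather than cl-notebook's exact keys):

'((347 :cell-type :code)
  (347 :cell-language :common-lisp)
  (347 :contents "(+ 1 2)")
  (347 :noise :normal))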

Naming things is difficult

I think there's a joke about this somewhere. Something along the lines of

The only two difficult problems in programming are naming things, cache invalidation and off-by-one errors.

and man did that ever bite me this time. It's obviously bad to tie the file-system name of something to its display name, if for no reason other than it opens up various injection vectors that you'd rather not open up. It turns out it gets even more complicated when you're dealing with history trees of various documents, and you're trying to reduce headaches for your users. Here, think about this problem for a bit. Say you had a document type that you'd let your users name. We're obviously not naming the on-disk file after the display name of the document, so this is a matter of storing a record in the document that'll keep a user-entered name around for reference purposes. That gives you the additional benefit of being able to roll back renames, and the ability to see what a given document was called at some point in the past. Now, say you want to be able to branch said document. That is, instead of being a single history line, you want to be able to designate certain timelines as belonging to different branches than others. What you need now is four different levels of naming. Five, depending on how ham-handedly you've decided to store and relate those branches. At minimum you need

  • The filename of the document, which is different from
  • The display name of the same document (which might be different in different branches, and at different points in time), which is different from
  • The display name of a particular branch of the document (which might need to be human readable, or user entered) which is different from
  • The collective, still human-readable name for a set of branches belonging to one document.

If you've stored a branch collective as a file/folder on disk, you'll have to name that too. So, what would you do?

Confronted with this problem, I punted. Branching is basically going to be a copying operation. What you'll eventually get, once I put the branching system together and you try to use it, is a complete, self-contained fact base that happens to (possibly temporarily) have the same display-name as its origin (plus the word 'branch'), and a fact or two that point to a specific time point in that origin. From there, they'll diverge and be entirely separate entities. No, I'm not entirely sure this is the best approach, or even an acceptable approach, but it seems to be the only way to avoid asking the user to manage four different names in the space of one document. So I'll take it.

There's probably more where those came from, but they're all I could pull out of my head at short-ish notice. I'll try to let you know how the rest of the project goes, as it happens.

Sunday, April 20, 2014

cl-notebook Introductory Thoughts

So it's about time I talked about this thing, and what the hell exactly I'm thinking. Because I've been working on it for a while, and while it's still kind of buggy, I've already found myself wanting some of the features it has when working with Emacs, or other text editors I've had to work with.

Notebooks

Actually, before I get to that, a little exposition. As far as I know, notebook-style editors already exist for Python, R, and Clojure. And a second one for Clojure. The general idea is to have a web-based interface, with code being divided into small, re-arrangeable chunks called cells, each of which is associated with its evaluation results. Some cells are code in whichever language the notebook supports, others are just prose, usually in markdown. The idea is that you get a dynamic environment that lets you selectively evaluate small chunklets of code, and intersperse relevant documentation in the form of prose and tests.

As someone who's been eyeing literate programming techniques for a while, this predictably appeals to me. So I've built my own, for my language of choice.

cl-notebook

You can find it at the other end of that github link I start out with. Last time I mentioned this project in passing, I noted that the ultimate goal was replacing Emacs as my Common Lisp IDE of choice, and that's no small task. Despite the existence of subpar, I don't have proper s-expression navigation yet, and I haven't wired up proper auto-completion or argument hinting yet, and there's a bunch of other stuff I still want to build, ranging from the necessary to the frivolous. On the whole, I think I'm on the right track, because certain things are somewhat easier here, and because there are some features that I find myself missing when I hop back into emacs.

Let's just get those out of the way right now, actually. Firstly, I get to program in my browser, which is surprisingly elegant once I hop into full-screen mode. It lets me tab over to search for relevant links to talk about, and since my browser can be set to start up with previously open tabs, I get to resume editing exactly where I was in a later session. Secondly, because of the back-end storage system I'm using, I get to have a running history of all the edits I've ever made, which is updated every time I evaluate a cell (I'm working on having it implicitly updated every so often between evaluations, but don't have that part checked in). Thirdly, I've got exporters wired up that let me put together a book, then export it as an HTML page, or as a .lisp file. And I'm planning to add two more, one to just extract tests and a second to just hand me an executable from the given book.

The first one is minor, and makes it all the easier to randomly check my email or github notifications, so pros and cons. The third could conceivably be wired together in Emacs. The second one is huge. I don't know about you, but I've been programmed to hit save every few seconds in whatever editor I've got open just because crashes happen, and I don't want them to be too painful. I guess I could have wired up Emacs to do that every so often, but it sounds fiddly as hell. You don't particularly want a standard editor saving every three seconds or so; you might be in the middle of an edit the currently keyed-in part of which doesn't make sense by itself, and most editors 'save' by overwriting your existing file. Which is exactly what you don't want when you've got an unfinished code edit. Hopefully, adding total-history retention to the equation softens the blow.

Core Concepts

Code is organized into books. Each book is the complete history of a bunch of cells. A cell can contain code, tests, or markup in a particular language (currently just Common Lisp, but given how many languages I blog about, it'll probably need at least highlighting support for a few more). A cell's language and type determine the evaluation approach we take on the back end, as well as which exports it appears in, and in what form. Specifically, common-lisp/markup cells are evaluated as :cl-who forms, don't appear in .lisp exports and only contribute their results to an .html export. By contrast, common-lisp/code cells are straight-up evaluated (capturing warnings, errors and standard-output), contribute their contents to .lisp exports, and both their contents and results to .html exports.
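One way to picture that dispatch (a sketch of my own, not a transcript of cl-notebook's internals; eval-as-cl-who and eval-capturing-output are hypothetical helpers):

(defgeneric evaluate-cell (language type contents))

(defmethod evaluate-cell ((language (eql :common-lisp)) (type (eql :markup)) contents)
  ;; Markup cells render their cl-who forms to an HTML string.
  (eval-as-cl-who contents))

(defmethod evaluate-cell ((language (eql :common-lisp)) (type (eql :code)) contents)
  ;; Code cells are evaluated directly, capturing warnings, errors
  ;; and *standard-output* along the way.
  (eval-capturing-output contents))

(defmethod evaluate-cell ((language (eql :common-lisp)) (type (eql :tests)) contents)
  ;; Tests are still Common Lisp, so they evaluate exactly like code cells;
  ;; the type only changes which exports they show up in.
  (eval-capturing-output contents))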

In addition to a type, language and id, a cell has a contents, a result, and a noise. The contents is what the user has typed in, the result is what that contents evaluates to, and the noise dictates how the results are displayed. This is a normal cell:

(+ 1 2)
(+ 2 3)
(format t "Testing ")
(+ 3 4)
(format t "testing")
(format t ". One two.~%")
(format t "Testing.")
(+ 4 5)

Testing testing. One two. Testing.

  • 9 :: integer

This is the same cell with a noise of verbose:

(+ 1 2)
(+ 2 3)
(format t "Testing ")
(+ 3 4)
(format t "testing")
(format t ". One two.~%")
(format t "Testing.")
(+ 4 5)
  • 3 :: integer
  • 5 :: integer

Testing

  • NIL :: null
  • 7 :: integer

testing

  • NIL :: null

. One two.

  • NIL :: null

Testing.

  • NIL :: null
  • 9 :: integer

And again with terse

(+ 1 2)
(+ 2 3)
(format t "Testing ")
(+ 3 4)
(format t "testing")
(format t ". One two.~%")
(format t "Testing.")
(+ 4 5)
  • 9 :: integer

There's also a silent setting which lets you ignore the evaluation result entirely.

You can edit a cell (changing its contents), evaluate it (changing its result), delete it, change any of its mentioned properties, or change the order of cells in a notebook. Each of these is an event that gets initiated by a POST request and gets completed with an event-stream message to any listening front-ends (which means I'll relatively easily be able to make this a multi-user editor when I get to that point). Enough low level stuff; here's an example.

Example

This is a piece of code I actually wrote using cl-notebook.


A parameter is a thing that starts with #\-. It might be nullary or unary. A parameter followed by a parameter or an empty list is interpreted as nullary. A parameter followed by a non-parameter is unary. Command line args are more complicated in the general case, but not in cl-notebook.

(defun parse-args! (raw)
   (pop raw)
   (flet ((param? (str) (eql #\- (first-char str)))
          (->param (str) (intern (string-upcase (string-left-trim "-" str)) :keyword))
          (->arg (str) (or (parse-integer str :junk-allowed t) str)))
      (loop for next = (pop raw) while next
            if (and (param? next) (or (not raw) (param? (car raw))))
              collect (cons (->param next) t) into params
            else if (param? next)
              collect (cons (->param next) (->arg (pop raw))) into params
            else collect next into args
            finally (return (values params args)))))

(defun get-param (names params) 
  (loop for (a . b) in params 
     if (member a names) do (return b)))
  • GET-PARAM :: symbol
(parse-args! '("./prog-name" "-f" "-h" "something"))
(parse-args! '("./prog-name" "a" "b" "c" "d"))
(parse-args! '("./prog-name" "-p" "4040"))

(get-param '(:f :force) (parse-args! '("./prog-name" "-f" "-h" "something")))
(get-param '(:p :port) (parse-args! '("./prog-name" "-p" "4040")))
(get-param '(:p :port) (parse-args! '("./prog-name" "--port" "4040")))
(get-param '(:p :port) (parse-args! '("./prog-name" "--frob" "4040")))
  • ((:F . T) (:H . "something")) :: cons
  • NIL :: null
  • NIL :: null
  • ("a" "b" "c" "d") :: cons
  • ((:P . 4040)) :: cons
  • NIL :: null
  • T :: boolean
  • 4040 :: integer
  • 4040 :: integer
  • NIL :: null

It's a small utility function for parsing command line arguments in :cl-notebook. You can see all the relevant features on display there; it starts with some documentation prose in a markup cell, has definitions in a code cell, and finally a bunch of example invocations of each thing in a tests cell. They're not really tests, because they don't encode my assumptions about the return values of those calls, but you could imagine them doing so. The point is, they won't be part of a .lisp export, but will show up in an html export like this one.

That's it for the introductory thoughts. I'll try to gather some insights into such editors into the next piece I put together. And I'll continue dogfooding until it gets good enough to call "delicious".

Saturday, April 12, 2014

Querying Fact Bases Redux

So I put together that thing I talked about last time, only to discover three things.

Firstly, the only places I actually needed lazy operation could be handled by passing a body directly to the query macro.

Secondly, when I loaded the thing up, worked up a 40k fact corpus[1] and ran

(for-all (and (?id :user ?name) (?id :time ?time) (?id :number 62))
         :in *base* :get (list ?id ?time ?name))

I'd get this:

You may have noticed that this isn't an animated gif. It hangs there for something on the order of thirty seconds, more if profiling is on, and then returns the expected result. So that won't really do. There are some interesting points I'll talk about later, that have to do with clause order and the underlying operations. But, even though this is probably the worst way to write this particular query, it should return in under a second.

Thirdly, that I had exactly zero use cases for or goals. This might change, but until then, it looks like I don't even need unification[2].

So as a result, I sat down and took the precise opposite approach to traversal that I tried last time. Instead of trying to keep it elegant and lazy, let's make it hacky and eager. Here's our problem, once again:

(for-all (?id :user ?name) 
         :in *base* :get (list ?id ?name))

should basically be the same as

(loop for (a b c) in (current *base*) when (eq b :user) collect (list a c))

and

(for-all (and (?id :user ?name) (?id :time ?time) (?id :number 62))
         :in *base* :get ?time)

should more or less be equivalent to

(loop for (a b c) in (current *base*) 
   when (eq b :user)
   append (loop for (d e f) in (current *base*)
             when (and (eq d a) (eq e :time))
             append (loop for (g h i) in (current *base*)
                       when (and (eq g d) (eq h :number) (= i 62))
                       collect f)))

Except, you know, it should be smarter about using indices where it can. But that's a pretty straight-forward specification.

lookup and decide-index changes - take 1

The first thing I had to do was change lookup and decide-index a bit, because I wanted them to be mildly less naive. And yeah, I broke down and added some macrology to pull out all the repetition in the index-related functions. Turns out that was a good thing.

(defmacro lookup-index (state &rest indices)
  (with-gensyms (st)
    `(let ((,st ,state))
       (cond ,@(loop for i in indices 
                  for syms = (key->symbols i)
                  collect `((and (indexed? (index ,st) ,i)
                                 ,@syms)
                            (list ,i ,@syms)))))))

(defmethod decide-index ((state fact-base) &optional a b c)
  (lookup-index state :abc :ab :ac :bc :a :b :c))

Short version is, the function now takes a fact-base in addition to an a, b and c, and checks whether a particular type of index is kept for that fact base as well as whether it would be appropriate for the current query.

(defmethod lookup ((state fact-base) &key a b c)
  (if (every #'not (list a b c))
      (current state)
      (let ((ix (aif (decide-index state a b c)
                     (gethash (rest it) (gethash (first it) (table (index state))))
                     (current state))))
        (loop for f in ix
           when (and (or (not a) (equal a (first f)))
                     (or (not b) (equal b (second f)))
                     (or (not c) (equal c (third f))))
           collect f))))

lookup now has to be mindful of this, and has to check that the indexed facts match the incoming query. Because we're now potentially using a more general index than the query calls for. My gut tells me this is still a net increase in performance since last time, even though our best case is now O(n) in the size of the result rather than O(1). If it comes to it, I'll go back and make that more efficient.

Actually, let's fix it right now.

lookup and decide-index changes - take 2

(defmethod lookup ((state fact-base) &key a b c)
  (if (every #'not (list a b c))
      (current state)
      (multiple-value-bind (index ideal-index) (decide-index state a b c)
        (let ((ix (if index
                      (gethash (rest index) (gethash (first index) (table (index state))))
                      (current state))))
          (if (and index (eq (first index) ideal-index))
              ix
              (loop for f in ix
                 when (and (or (not a) (equal a (first f)))
                           (or (not b) (equal b (second f)))
                           (or (not c) (equal c (third f))))
                 collect f))))))

That more complicated version of lookup expects two values instead of one; which index we're using, and which index we'd ideally use. If the two are the same, we just return the results of our lookup, otherwise we have to do the narrowing traversal. That's about as efficient as it's going to get without making it lazy. Which I guess I could, but not right now. However, we also need a modified decide-index to pull this little trick off. And that's going to be fugly.

(defmacro lookup-index (state &rest indices)
  (with-gensyms (ix ideal applicable?)
    `(let ((,ix (index ,state))
           (,ideal))
       ,@(loop for i in indices 
            for syms = (key->symbols i)
            collect `(let ((,applicable? (and ,@syms)))
                       (when (and (null ,ideal) ,applicable?) (setf ,ideal ,i))
                       (when (and (indexed? ,ix ,i) ,applicable?)
                         (return-from decide-index 
                           (values (list ,i ,@syms) ,ideal)))))
       (values nil ,ideal))))

(defmethod decide-index ((state fact-base) &optional a b c)
  (lookup-index state :abc :ab :ac :bc :a :b :c))

Say what you will about imperative programming; it's efficient. That's a single pass over the relevant indices that returns both the least general applicable index, and the ideal index for a given query. Which means we can now profitably compare the two in lookup, which means that our best case is back up to O(1), since we don't need to traverse queries for things we've indexed.

With those modifications, I can pull some fancier crap in translating for-all calls into loops. Specifically, I can do this:


(defun goal->destructuring-form (goal &key (bindings (make-hash-table)))
  (labels ((rec (elem)
             (cond ((listp elem)
                    (mapcar #'rec elem))
                   ((or (eq '? elem) (not (variable? elem)))
                    (gensym))
                   ((and (variable? elem) (gethash elem bindings))
                    (gensym))
                   ((variable? elem)
                    (setf (gethash elem bindings) t)
                    elem)
                   (t (error "Somethings' up. goal->destructuring-form~%     ~s~%     ~s~%     ~s"
                             bindings goal elem)))))
    (mapcar #'rec goal)))

(defun goal->lookup (base goal &key (bindings (make-hash-table)))
  (flet ((->ix (elem)
           (cond ((and (variable? elem) (gethash elem bindings))
                  elem)
                 ((any-variables? elem)
                  nil)
                 (t elem))))
    (destructuring-bind (a b c) goal
      `(lookup ,base 
               :a ,(->ix a) 
               :b ,(->ix b)
               :c ,(->ix c)))))

(defun goal->or-expression (a b c goal)
  (flet ((test (term elem) `(equal ,term ,elem)))
    `(and ,(test a (first goal))
          ,(test b (second goal))
          ,(test c (third goal)))))

(defmethod handle-goals ((goal-type (eql 'and)) base goals collecting)
  (let ((bindings (make-hash-table)))
    (labels ((single-goal (destruct lookup tail)
               `(loop for ,destruct in ,lookup ,@tail))
             (rec (goals)
               ;; We want to generate the lookups first, 
               ;; because the bindings are going to be generated
               ;; from the result of the lookup. Meaning, if the bindings 
               ;; are established in a given destruct clause,
               ;; they won't be usable until the NEXT lookup.
               ;; Therefore, even though it isn't immediately obvious, 
               ;; order matters in this let* form

               (let* ((lookup (goal->lookup base (first goals) :bindings bindings))
                      (destruct (goal->destructuring-form (first goals) :bindings bindings)))
                 (if (null (cdr goals))
                     (single-goal destruct lookup `(collect ,collecting))
                     (single-goal destruct lookup `(append ,(rec (rest goals))))))))
      (rec (rest goals)))))

(defmethod handle-goals (goal-type base goals collecting)
  ;; Same story here as in handle-goals (eql 'and) method
  (let* ((bindings (make-hash-table))
         (lookup (goal->lookup base goals :bindings bindings))
         (destruct (goal->destructuring-form goals :bindings bindings)))
    `(loop for ,destruct in ,lookup collect ,collecting)))

(defmacro for-all (goal-term &key in get)
  (with-gensyms (base)
    (let ((template (replace-anonymous (or get `(list ,@(variables-in goal-term))))))
      `(let ((,base ,in))
         ,(handle-goals (first goal-term) base goal-term template)))))

We'll go through it in a minute, but the point of these changes is that writing

(for-all (and (?id :user ?name) (?id :time ?time) (?id :number 62))
         :in *base* :get (list ?id ?time ?name))

should expand directly into something like

(LET ((#:BASE1122 *BASE*))
  (LOOP FOR (?ID #:G1123 ?NAME) 
     IN (LOOKUP #:BASE1122 :A NIL :B :USER :C NIL)
     APPEND (LOOP FOR (#:G1124 #:G1125 ?TIME) 
               IN (LOOKUP #:BASE1122 :A ?ID :B :TIME :C NIL)
               APPEND (LOOP FOR (#:G1126 #:G1127 #:G1128) 
                         IN (LOOKUP #:BASE1122 :A ?ID :B :NUMBER :C 62)
                         COLLECT (LIST ?ID ?TIME ?NAME)))))

rather than the lazy-ish generator tree from last time. Thanks to our re-structuring of lookup, this is about as efficient as it's going to get without re-jigging goal order. The only edge case we have is what happens if the entire goal is perfectly indexable, except it seems that the programmer would use lookup directly in those situations[3].
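That is, if every term in your goal is already ground, there's nothing for for-all to add over a direct call like

;; 42 standing in for some known fact id; this hits the index directly.
(lookup *base* :a 42 :b :number :c 62)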

On to the code review. Reading. Whatever.

(defun goal->destructuring-form (goal &key (bindings (make-hash-table)))
  (labels ((rec (elem)
             (cond ((listp elem)
                    (mapcar #'rec elem))
                   ((or (eq '? elem) (not (variable? elem)))
                    (gensym))
                   ((and (variable? elem) (gethash elem bindings))
                    (gensym))
                   ((variable? elem)
                    (setf (gethash elem bindings) t)
                    elem)
                   (t (error "Somethings' up. goal->destructuring-form~%     ~s~%     ~s~%     ~s"
                             bindings goal elem)))))
    (mapcar #'rec goal)))

Step one of the transformation is to put together the destructuring-form for a particular goal:

;;             this thing
;;          vvvvvvvvvvvvvvvvvvv
  (LOOP FOR (?ID #:G1123 ?NAME) IN (LOOKUP #:BASE1122 :A NIL :B :USER :C NIL)
...

In order to do that, we have to replace everything other than variables with gensym calls, but keep the same tree structure. loop does deep destructuring, so we can get away with using this as a pattern-matching strategy. We also need to replace already bound variables from previous destructuring-forms with the same gensym calls so they don't get re-assigned unnecessarily.
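For instance, loop happily destructures nested patterns:

(loop for (a (b c)) in '((1 (2 3)) (4 (5 6)))
   collect (list a b c))
;; => ((1 2 3) (4 5 6))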

(defun goal->lookup (base goal &key (bindings (make-hash-table)))
  (flet ((->ix (elem)
           (cond ((and (variable? elem) (gethash elem bindings))
                  elem)
                 ((any-variables? elem)
                  nil)
                 (t elem))))
    (destructuring-bind (a b c) goal
      `(lookup ,base 
               :a ,(->ix a) 
               :b ,(->ix b)
               :c ,(->ix c)))))

The next thing we need to put together is a given goal's lookup clause:

;;                                       this thing
;;                                 vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
  (LOOP FOR (?ID #:G1123 ?NAME) IN (LOOKUP #:BASE1122 :A NIL :B :USER :C NIL)
...

We're being conservative at the moment, but there's an optimization or two I could still make here. The way we're dealing with these is:

  1. if a given goal-component is a variable, then look it up by its value in the current bindings[4]
  2. if a given goal-component is a compound form which contains any variables, don't index by it[5]
  3. otherwise, use it as an index
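So, for example, the goal (?id :time ?time) with ?id already bound by an earlier goal turns into (with base standing in for the gensym'd fact-base variable):

(lookup base :a ?id :b :time :c nil)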

Onwards to handle-goals, the real meat of this approach. Let's take a look at how we deal with singleton goals first.

(defmethod handle-goals (goal-type base goals collecting)
  ;; Same story here as in handle-goals (eql 'and) method
  (let* ((bindings (make-hash-table))
         (lookup (goal->lookup base goals :bindings bindings))
         (destruct (goal->destructuring-form goals :bindings bindings)))
    `(loop for ,destruct in ,lookup collect ,collecting)))

Easy, right? Grab the results of goal->lookup and goal->destructuring-form and stitch them into a loop along with the collecting clause. Nothing fancy here, except for that cryptic note about a different method definition.

(defmethod handle-goals ((goal-type (eql 'and)) base goals collecting)
  (let ((bindings (make-hash-table)))
    (labels ((single-goal (destruct lookup tail)
               `(loop for ,destruct in ,lookup ,@tail))
             (rec (goals)
               ;; We want to generate the lookups first, 
               ;; because the bindings are going to be generated
               ;; from the result of the lookup. Meaning, if the bindings 
               ;; are established in a given destruct clause,
               ;; they won't be usable until the NEXT lookup.
               ;; Therefore, even though it isn't immediately obvious, 
               ;; order matters in this let* form
               (let* ((lookup (goal->lookup base (first goals) :bindings bindings))
                      (destruct (goal->destructuring-form (first goals) :bindings bindings)))
                 (if (null (cdr goals))
                     (single-goal destruct lookup `(collect ,collecting))
                     (single-goal destruct lookup `(append ,(rec (rest goals))))))))
      (rec (rest goals)))))

And this is the full story[6]. Because of the specific way we want lookup and destruct to interact with their containing bindings, their order matters quite a bit. Play around with the macroexpander if you don't quite see it from just the definition.

Anyhow, the way we deal with and goals is by building up a chain of loop forms, each one dealing with a single goal while taking the previous goals into account. All but the last one need to append their results, while the last needs to collect them. The only part we've got left is the now trivial step of putting together the for-all macro interface to the rest of this compilation pipeline[7].

(defmacro for-all (goal-term &key in collecting)
  (with-gensyms (base)
    (let ((template (replace-anonymous (or collecting `(list ,@(variables-in goal-term))))))
      `(let ((,base ,in))
         ,(handle-goals (first goal-term) base goal-term template)))))

Tadaah!

I haven't yet put together an equivalent facility for the old apply keyword arg, but because of how we've changed up the underlying code processors, collecting can now trivially handle things like

(for-all (and (?id :user ?name) (?id :time ?time) (?id :number 62))
         :in *base* :collecting (list ?name (+ ?id ?time 62)))

This concludes the part of this post wherein I talk about implementation details. The rest is just one or two interesting notes about traversals. If you're getting bored, or tired, this is a pretty good break-point for you.

Traversal Notes

Near the beginning of this piece, I said

...this is probably the worst way to write this particular query...-Inaimathi

referring to

(for-all (and (?id :user ?name) (?id :time ?time) (?id :number 62))
         :in *base* :get (list ?id ?time ?name))

and the reason should be fairly obvious now that we know exactly how we go about finding these answers. Remember, the expansion for this form, after compensating for the different keyword argument in our new for-all, is

(LET ((#:BASE1262 *BASE*))
  (LOOP FOR (?ID #:G1263 ?NAME) 
    IN (LOOKUP #:BASE1262 :A NIL :B :USER :C NIL)
    APPEND (LOOP FOR (#:G1264 #:G1265 ?TIME) 
              IN (LOOKUP #:BASE1262 :A ?ID :B :TIME :C NIL)
              APPEND (LOOP FOR (#:G1266 #:G1267 #:G1268) 
                        IN (LOOKUP #:BASE1262 :A ?ID :B :NUMBER :C 62)
                        COLLECT (LIST ?ID ?TIME ?NAME)))))

and just so that we're perfectly clear on what that means, here's the Lisp-esque pseudo-code

(for-each goal-1
    append (for-each goal-2 
               append (for-each goal-3
                          collect [some list of components])))

Now granted, we're aggressively using indices where we can, so we can slice a lot of the constant time out of this equation depending on how often such an operation happens, but no matter how efficiently we slice it, we're going to take a number of steps equal to goal-3 * (goal-2 * goal-1). That is, we're going O(n) over the candidates for the last goal, for each candidate of the previous goal, for each candidate of the previous goal and so on.

This is why the indices help us a lot. If we couldn't effectively discount swathes of our initial corpus, the performance characteristic would be O(n^m), where n is the size of our fact base and m is the number of goals. Meaning that it behooves us to cut as many candidates as early as possible, since early reductions in our problem space will give us much better returns.

In other words, to paraphrasingly re-iterate Norvig, even though

(for-all (and (?id :user ?name) (?id :time ?time) (?id :number 62))
         :in *base* :get (list ?id ?time ?name))

and

(for-all (and (?id :number 62) (?id :time ?time) (?id :user ?name))
         :in *base* :get (list ?id ?time ?name))

are logically equivalent, the latter is going to perform noticeably better, because (?id :number 62) has a much smaller set of candidate facts than (?id :user ?name) in our particular corpus. One interesting exercise, which I'll leave for next time, would be to have for-all try to optimally sort its and goals by putting the smallest candidate lists at the beginning so as to reduce the search-space with no thought required from the user. The above is a trivial example; there's one goal that has more indexable terms in it than the others, so in general[8] it will probably yield a smaller candidate list. The real way to go about this feels like it would be to aggressively index goals at the start of a query and sample their corpus size, then sort on that. Not sure if that would cost more than it buys me though, since it feels like that would get complex fast.
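A rough and entirely speculative sketch of that goal-sorting idea, using lookup on each goal's fully-ground components as the selectivity estimate (none of this is in fact-base):

;; Speculative; not part of fact-base. Smaller candidate lists sort first,
;; so the most selective goal ends up in the outermost loop.
(defun estimate-candidates (base goal)
  (destructuring-bind (a b c) goal
    (length (lookup base
                    :a (unless (any-variables? a) a)
                    :b (unless (any-variables? b) b)
                    :c (unless (any-variables? c) c)))))

(defun sort-goals (base goals)
  (sort (copy-list goals) #'< :key (lambda (goal) (estimate-candidates base goal))))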

Anyway, like I said, I'll leave it for next time.

If I end up seeing performance issues in the things I'm building out of fact-base.

And I get bored.


Footnotes

1 - [back] - Like this, if you're interested:

(defparameter *base* (make-fact-base :indices '(:a :ab :abc)))

(defmethod test-generate! (n)
  (loop repeat n
     do (multi-insert! 
         *base* `((:number ,(random 100)) (:type :digit) 
                  (:time ,(get-universal-time)) 
                  (:user ,(nth (random 7) '("Inaimathi" "Anon" "Someone Else" "Albert" "Beatrice" "Charles" "Daria")))))))

(test-generate! 10000)

2 - [back] - Which makes things much simpler for this approach. Hopefully, you'll see why as we go.

3 - [back] - and they can, since it's still an :exported symbol itself.

4 - [back] - if it has been bound by a previous destructuring-form, it'll be assigned by this point, which means we'll be able to index by it. Otherwise, gethash will return nil, which is exactly what we want.

5 - [back] - This is where we could be a bit more efficient, in case you're interested. If we wanted to be very precise about it, we'd say that we could use a compound form with variables as an index, provided that all of its variables have been bound prior to this point in the traversal. I'm leaving it out for now because

  • it would further complicate an already tricky chunk of code
  • I'm not sure how often this edge case would happen in practice and
  • if it does happen, the current result will be a slightly less efficient traversal, which doesn't sound too bad. If the consequence were incorrect results instead, I'd have reconsidered

6 - [back] - As an aside, this is the first place I've seen in something like 8 years where a comment is appropriate. It doesn't mirror the code to which it pertains and it explains a non-obvious but necessary facet of the implementation. Usually, I'd either work out some naming scheme that would make the point obvious, or just factor out the chunk of code that needs explanation. There doesn't seem to be a simple way of doing either here[9].

7 - [back] - And just to highlight this, it is a compilation pipeline. I mentioned this at a semi-Lisp-related meet-up lately, and it's true enough to repeat to the internets: a good way of conceptualizing a Common Lisp macro is as a compiler that takes some Lisp code and emits different Lisp code. Because of the way Lisp is structured, we get the first chunk of an actual compilation pipeline for free, and essentially start with a tokenized input. It's a pretty powerful technique once you get your head around it.

8 - [back] - Though not necessarily in plenty of specific cases.

9 - [back] - Though I guess I could factor that let out into a with-for-all-forms if it turned out I had to repeat it in many places.

Wednesday, April 9, 2014

Querying Fact Bases

So it seems that a lot of people are into this logic programming thing I've been reading about lately. There's the already mentioned Reasoned Schemer trio of Friedman/Byrd/Kiselyov behind the beautiful but arcane miniKanren language, a prolog-like language contained in Peter Norvig's Paradigms of Artificial Intelligence Programming chapters 11, 12 and 14, another one in Graham's On Lisp chapters 19, 22, 23, 24 and yet another in chapter 4.4 of Abelson and Sussman's SICP. So there's a lot of literature around dealing with how you go about building a unifier or pattern-matcher[1].

Anyway, I've been consuming this literature for a while, and the part I want to zoom in on is searching the database. The other stuff is easy; a unifier can be straight-forwardly built in about ten lines[2] and handling variables is the same tree-traversal stuff you've seen a hundred times before, but the actual search never seems to be the focus of these things. And I'm looking for a particular type of search for fact-base. I showed it off recently at a Toronto Lisp Group meeting along with the app it's supposed to enable and mentioned that querying is mildly annoying when you get to compound queries. Specifically, I took as an example the very simple database

(defparameter base (make-fact-base))
(insert! base (list 0 :message "This is a sample message"))
(insert! base (list 1 :message "This is another one"))
(insert! base (list 1 :author "Inaimathi"))
(insert! base (list 2 :message "That second one was written by me. This one is a meta-message (also by me)."))
(insert! base (list 2 :author "Inaimathi"))
(insert! base (list 2 :type :meta))

and pointed out that in order to find All the message bodies authored by Inaimathi, you'd have to take two steps

  1. traverse the database looking for facts that have '(:author "Inaimathi") for a rest
  2. get all the facts that have the same first as one of the above and the second :message, and collect their thirds

In actual lisp, that looks like

(loop for (a b c) in (lookup base :b :author :c "Inaimathi") 
      collect (caddar (lookup base :a a :b :message)))

That's even passably fast, thanks to our index system, but it's annoying to write, and it forces me to do a fact-base->objects conversion in some places rather than write out these multi-stage iterations myself. What I'd like to be able to do in the above is something like

(for-all (and (?id :author "Inaimathi") (?id :message ?message)) :in my-fact-base :get ?message)

and have the system figure it out for me. Granted in this situation, you don't gain very much, but it would be a compounding gain for more complex queries. For instance, if I suddenly decided I want to select All the message bodies authored by Inaimathi pertaining to other messages, the query language version handles it very simply:

(for-all (and (?id :author "Inaimathi") (?id :message ?message) (?id :type :meta)) :in my-fact-base :get ?message)

whereas the manual version would add another level of iteration I'd need to work through. Oh, and have fun with the situation where you only want the first 5 or so hits. The easiest solution with the manual approach is searching the entire space and throwing away all but the first n results. You could do better, but you're suddenly in the supremely annoying situation where your queries all look mildly different, but perform the same basic task.

What I figure I'd want is a lazy or lazy-ish way of getting the results. The lazy solution can easily be converted to the eager solution later, but it's really painful to take the eager approach and then find out that you only needed about 4% of the work you did. I'll be using generators, rather than outright lazy sequences, just because they're mildly easier to put together. For a single goal, that's trivial.

(for-all (?id :author "Inaimathi") :in my-fact-base)

All you have to do here is have a generator that runs over the facts in my-fact-base and returns the next matching one it finds. Something like

(defun match-single (goal bindings facts)
  (let ((fs facts))
    (lambda ()
      (loop for res = (unify goal (pop fs) bindings)
         unless (fail? res) do (return res)
         while fs
         finally (return (fail))))))

would do fine. I'm pointedly refusing to commit to an implementation of (fail), unify and bindings at each point for the purposes of this post, but am using the stuff out of Norvig's PAIP source code. For the uninitiated: A goal is the thing you're trying to match; it's an expression that may contain some variables. A variable is a thing that you can substitute; it can either be unbound or assigned a value in a particular set of bindings. If a unification fails, it returns (fail), and if it's successful it returns the set of bindings that would make that unification expression true. For instance, if you unified ?a with 5, starting with empty bindings, unify would return the set of bindings in which ?a is bound to 5.
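Concretely, in terms of the sample base from earlier (with +succeed+ standing in for the empty set of bindings, as in the rest of the code here):

;; Unifying the goal against a matching fact returns extended bindings,
;; in which ?ID is bound to 1 and ?NAME to "Inaimathi"...
(unify '(?id :author ?name) '(1 :author "Inaimathi") +succeed+)

;; ...while unifying it against a fact with a different second element
;; returns a failing result (fail? of it is true).
(unify '(?id :author ?name) '(0 :message "This is a sample message") +succeed+)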

So the above match-single definition would return a generator which, when called, would either (fail), or return the environment resulting from unifying the next element of facts with goal. Hopefully, straight-forward, though you may need to do a bit of reading up on it if you've never seen the terms before.
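To give a rough feel for how that generator behaves (illustrative only; I'm handing it the plain list of facts from the sample base, which is what the (pop fs) traversal wants):

(let ((gen (match-single '(?id :author "Inaimathi") +succeed+ (current base))))
  (funcall gen)   ;; bindings for one matching fact (?ID bound to 1 or 2)
  (funcall gen)   ;; bindings for the other matching fact
  (funcall gen))  ;; a failing result; the matching facts are exhausted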

The next easiest thing to do would be handling a set of ored goals. That is

(for-all (or (?id :author "Aaron") (?id :author "Bradley")) :in my-fact-base)

It's basically the same thing, except that instead of applying a single goal and checking if it sticks, we're applying several goals in sequence and seeing if any of them stick. Something like

(defun match-ors (goals bindings facts)
  (let ((fs facts))
    (flet ((try-goals (f)
             (loop for g in goals
                for res = (unify g f bindings)
                when res do (return res)
                finally (return (fail)))))
      (lambda ()
        (loop for res = (try-goals (pop fs))
           unless (fail? res) do (return res)
           while fs
           finally (return (fail)))))))

which is by my estimation only marginally more complicated. The tricky part is traversing a fact-base in order to satisfy anded goals. Like in that example I mentioned near the beginning:

(for-all (and (?id :author "Inaimathi") (?id :message ?message) (?id :type :meta)) :in my-fact-base :get ?message)

Think about it.

What you want here is fairly complicated to express in English. I'm still trying to return a generator from the whole thing, but expressing its behavior is complex.

If you only get one goal, you want to fall through to a call to match-single; that's still fairly straight-forward. The magic happens at more than one goal. And I just deleted about four paragraphs of prose that would have thoroughly confused you. It's not a very easy set of concepts to express in English because it refers to pieces of itself fairly often.

Let's try it this way:

(defun match-ands (goals bindings facts)
  (let ((generator (match-single (first goals) bindings facts))
        (rest-generator))
    (if (null (cdr goals))
        generator
        (labels ((next-gen ()
                   (let ((res (funcall generator)))
                     (if (fail? res)
                         (fail)
                         (setf rest-generator (match-ands (rest goals) res facts)))))
                 (backtrack! ()
                   (if (fail? (next-gen))
                       (fail)
                       (next)))
                 (next ()
                   (if (null rest-generator)
                       (backtrack!)
                       (let ((res (funcall rest-generator)))
                         (if (fail? res)
                             (backtrack!)
                             res)))))
          #'next))))

The image you want, once you've put the initial generator tower together, is one of those combo bike-locks.

If you want to search its entire possibility space, you spin the last ring until it runs out of values. Then you spin the second-to-last ring once, and retry the last ring. When you run out of values on the second-to-last ring, you spin the third-to-last ring once and so on. It's an incredibly tedious exercise, which is why I'd prefer a machine to handle it.

Now, chunk by chunk

(defun match-ands (goals bindings facts)
  (let ((generator (match-single (first goals) bindings facts))
        (rest-generator))
    (if (null (cdr goals))
        generator
        ...

By the time we're calling this function, I assume that it'll be handed at least one goal. You always want the generator of your first goal, and if you only get the one goal, you just return said generator and you're done. Multiple goals are where you need to pull fancy footwork. Again, one chunk at a time:

        ...
        (labels ((next-gen ()
                   (let ((res (funcall generator)))
                     (if (fail? res)
                         (fail)
                         (setf rest-generator (match-ands (rest goals) res facts)))))
                 ...

This is where we set the rest-generator from earlier. It's just the procedure that will return the next result from proving the rest of the goals, given the set of bindings built from proving the first goal into the starting set of bindings given to match-ands initially. If calling the first goal's generator fails, we likewise fail; otherwise we set rest-generator to the generator we create by passing the result back up to match-ands.

                 ...
                 (backtrack! ()
                   (if (fail? (next-gen))
                       (fail)
                       (next)))
                 ...

Occasionally, we have to backtrack. Which in this context means we try to call next-gen. If that fails, we likewise fail, otherwise we invoke next. Which...

                 ...
                 (next ()
                   (if (null rest-generator)
                       (backtrack!)
                       (let ((res (funcall rest-generator)))
                         (if (fail? res)
                             (backtrack!)
                             res)))))
                       ...

...sets up an initial rest-generator if there isn't one, then tries to call it. If that fails, we backtrack![3], otherwise we return the result.

          ...
          #'next))))

That next function I just described is the generator we want for a multi-goal proof, which means that it's the final return value from match-ands.

The only required piece of infrastructure left is for-all itself. We want it to be able to provide results, or do something with them lazily. Which means it'll look something like

(defmacro for-all (goal-term &key in get apply)
  (assert in nil "Need a database to query...")
  (when (and get apply)
    (format t ":apply and :get arguments passed in; ignoring :get"))
  (with-gensyms (template gen res facts)
    `(let* ((,facts ,in)
            (,gen ,(cond ((eq 'and (car goal-term))
                          `(match-ands ',(replace-anonymous (rest goal-term)) +succeed+ ,facts))
                         ((eq 'or (car goal-term))
                          `(match-ors ',(replace-anonymous (rest goal-term)) +succeed+ ,facts))
                         (t
                          `(match-single ',goal-term +succeed+ ,facts))))
            ,@(unless apply
              `((,template ',(replace-anonymous (or get goal-term))))))
       (loop for ,res = (funcall ,gen)
          while ,res collect ,(if apply
                                  `(apply (lambda ,(variables-in apply) ,apply)
                                          (subst-bindings ,res ',(variables-in apply)))
                                  `(subst-bindings ,res ,template))))))

Which isn't nearly as complicated as it seems at first glance. Let's go through that too.

(defmacro for-all (goal-term &key in get apply)
  (assert in nil "Need a database to query...")
  (when (and get apply)
    (format t ":apply and :get arguments passed in; ignoring :get"))
    ...

Taking arguments, and trying to be helpful with invocation errors. We want the thing to be readable too, which is why I use "mandatory" keyword arguments in this one.

  ...
  (with-gensyms (template gen res facts)
    `(let* ((,facts ,in)
            (,gen ,(cond ((eq 'and (car goal-term))
                          `(match-ands ',(replace-anonymous (rest goal-term)) +succeed+ ,facts))
                         ((eq 'or (car goal-term))
                          `(match-ors ',(replace-anonymous (rest goal-term)) +succeed+ ,facts))
                         (t
                          `(match-single ',goal-term +succeed+ ,facts))))
   ...

we're setting up some name sanitation for certain words we'd like to use in the definition that should still be usable by the callers of for-all. Note the use of replace-anonymous; its definition can be found in Norvig's prolog implementation. The entirety of that cond decides which of our matchers we're going to use to traverse our corpus.

            ...
            ,@(unless apply
              `((,template ',(replace-anonymous (or get goal-term))))))
            ...

If we get passed the apply argument, we'll be doing something special later. Otherwise, we'll want to slot our results into the template given by :get, and failing that, just slot them back into the querying goal form.

       ...
       (loop for ,res = (funcall ,gen)
          while ,res collect ,(if apply
                                  `(apply (lambda ,(variables-in apply) ,apply)
                                          (subst-bindings ,res ',(variables-in apply)))
                                  `(subst-bindings ,res ,template))))))

And that's the meat of it. We're going to be grabbing results out of our generator. As you can see, the special thing we're doing with the apply argument is stitching up a function to apply to a substituted list of our results. If we didn't get an apply, we're just slotting said result back into the template we defined earlier. I find that seeing some macroexpansions really helps understanding at this stage. So, here are the basics:

Plain single-goal:

CL-USER> (macroexpand '(for-all (?id :author "Inaimathi") :in my-fact-base))
(LET* ((#:FACTS1073 MY-FACT-BASE)
       (#:GEN1071
        (MATCH-SINGLE '(?ID :AUTHOR "Inaimathi") +SUCCEED+ #:FACTS1073))
       (#:TEMPLATE1070 '(?ID :AUTHOR "Inaimathi")))
  (LOOP FOR #:RES1072 = (FUNCALL #:GEN1071)
        WHILE #:RES1072
        COLLECT (SUBST-BINDINGS #:RES1072 #:TEMPLATE1070)))
T

or-goals:

CL-USER> (macroexpand '(for-all (or (?id :author "Inaimathi") (?id :message ?message)) :in my-fact-base))
(LET* ((#:FACTS1077 MY-FACT-BASE)
       (#:GEN1075
        (MATCH-ORS '((?ID :AUTHOR "Inaimathi") (?ID :MESSAGE ?MESSAGE))
                   +SUCCEED+ #:FACTS1077))
       (#:TEMPLATE1074 '(OR (?ID :AUTHOR "Inaimathi") (?ID :MESSAGE ?MESSAGE))))
  (LOOP FOR #:RES1076 = (FUNCALL #:GEN1075)
        WHILE #:RES1076
        COLLECT (SUBST-BINDINGS #:RES1076 #:TEMPLATE1074)))
T

and-goals:

CL-USER> (macroexpand '(for-all (and (?id :author "Inaimathi") (?id :message ?message)) :in my-fact-base))
(LET* ((#:FACTS1081 MY-FACT-BASE)
       (#:GEN1079
        (MATCH-ANDS '((?ID :AUTHOR "Inaimathi") (?ID :MESSAGE ?MESSAGE))
                    +SUCCEED+ #:FACTS1081))
       (#:TEMPLATE1078
        '(AND (?ID :AUTHOR "Inaimathi") (?ID :MESSAGE ?MESSAGE))))
  (LOOP FOR #:RES1080 = (FUNCALL #:GEN1079)
        WHILE #:RES1080
        COLLECT (SUBST-BINDINGS #:RES1080 #:TEMPLATE1078)))
T

Using the :get option

CL-USER> (macroexpand '(for-all (and (?id :author "Inaimathi") (?id :message ?message)) :in my-fact-base :get ?message))
(LET* ((#:FACTS1085 MY-FACT-BASE)
       (#:GEN1083
        (MATCH-ANDS '((?ID :AUTHOR "Inaimathi") (?ID :MESSAGE ?MESSAGE))
                    +SUCCEED+ #:FACTS1085))
       (#:TEMPLATE1082 '?MESSAGE))
  (LOOP FOR #:RES1084 = (FUNCALL #:GEN1083)
        WHILE #:RES1084
        COLLECT (SUBST-BINDINGS #:RES1084 #:TEMPLATE1082)))
T

Using the :apply option

CL-USER> (macroexpand '(for-all (and (?id :author "Inaimathi") (?id :message ?message)) :in my-fact-base :apply (format t "~s~%   -Inaimathi" ?message)))
(LET* ((#:FACTS1089 MY-FACT-BASE)
       (#:GEN1087
        (MATCH-ANDS '((?ID :AUTHOR "Inaimathi") (?ID :MESSAGE ?MESSAGE))
                    +SUCCEED+ #:FACTS1089)))
  (LOOP FOR #:RES1088 = (FUNCALL #:GEN1087)
        WHILE #:RES1088
        COLLECT (APPLY
                 (LAMBDA (?MESSAGE) (FORMAT T "~s~%   -Inaimathi" ?MESSAGE))
                 (SUBST-BINDINGS #:RES1088 '(?MESSAGE)))))
T
CL-USER> 

And that's that. Granted, the implementation is a bit more complicated than just writing manual loops, but I'm convinced there are a couple wins here. Firstly, the invocation is simpler, which means that the above definitions will eventually "pay for themselves" in terms of complexity. Secondly, it seems like I could fairly easily mod this into parenscript-friendly forms, which means this'll save me from having to convert fact-bases to object lists on the client side. But that's something I'll tell you about next time.


Footnotes

1 - [back] - Almost always using the question-mark-prefix notation for logic variables for some reason. I'm not sure what the approach gains or loses you yet. I guess in the case of miniKanren, it gains you the ability to unify on vectors since there's no ambiguity, and it might make it easier to read the resulting programs, but I'm not banking on that.

2 - [back] - Though do go over Norvig's version to see a dissection of the common bugs.

3 - [back] - And remember, backtrack! itself fails if it runs out of search space.