The absolute awesomeness of anchored font-lock matchers

Date	Change
2017-07-03	The font-lock spec annotations were updated to better reflect the looping nature of the matcher

People who know my rants on Emacs, and especially on font-lock-mode, know that I consider it a rather crappy hack. Parsing complex context-sensitive languages with a bunch of very weak regexes¹ just screams "This is a really bad idea!" Well, either way, I was always forced to admit that yes, it is a hack, but damn does it work in practice! Very rarely is there a problem you can't solve, and if the need arises, you can actually use arbitrary elisp code as the matcher, so long as it returns non-nil and sets match-data the same way re-search-forward would.
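To make the "arbitrary elisp code as the matcher" part concrete, here is a minimal sketch of a function matcher. The names my-dsl-match-arrow and my-dsl-add-arrow-keyword are made up for this illustration; the only contract taken from font-lock is that the function receives the search limit, and on success returns non-nil with point moved and match-data set.

```elisp
;; Hypothetical helper, not part of font-lock: a function MATCHER that
;; finds the next `->' outside of strings and comments.
(defun my-dsl-match-arrow (limit)
  "Search forward for `->' before LIMIT, skipping strings and comments.
On success return non-nil with point after the match and the match
data set, just as `re-search-forward' would."
  (let (found)
    (while (and (not found)
                (re-search-forward "->" limit t))
      ;; Slots 3 and 4 of the parser state are non-nil inside a
      ;; string or a comment respectively.
      (let ((state (save-match-data (syntax-ppss))))
        (unless (or (nth 3 state) (nth 4 state))
          (setq found t))))
    found))

(defun my-dsl-add-arrow-keyword ()
  "Register `my-dsl-match-arrow' in the current buffer, e.g. from a mode hook."
  (font-lock-add-keywords
   nil '((my-dsl-match-arrow (0 font-lock-variable-name-face)))))
```

A plain regexp cannot express the "not inside a string" condition, which is the kind of case where a function matcher earns its keep.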

Today I had a problem I thought would finally prove my point about how bad font-lock is and that we should all bike-shed and invent totally awesome formal parsers… then I went back to the docstring and of course Emacs can actually solve the problem.

The issue is the following: I'm writing a DSL which looks kind of like Haskell types, but written in sexps. So where in Haskell one writes

function :: Int -> String -> (String -> Int) -> [Float]

in my DSL it would look something like

(type function :: int -> string -> (string -> int) -> [float])

Now, how would I fontify those string and int occurrences only when they occur inside the type form? Turns out font-lock supports anchored matchers.

The anchored matchers work by first searching for an anchor and only then searching for the thing you want to highlight. This basically allows you to do look-ahead, context-sensitive fontification in the sense that the subsequent matchers are tried, but if they fail, the process continues from where the anchor match ended.²

For the longest time I struggled to understand how the font-lock specifications worked because there are so many different ways to write them. What actually helped me understand this once and for all was to simply look into the source code and read how it works. I remembered the recent post by Irreal on reading source code. It really is an effective way to learn, especially with software like Emacs being absolutely transparent about everything that is going on inside.

A font-lock rule starts with a matcher followed by one or more HIGHLIGHT forms. A HIGHLIGHT form either specifies how to fontify a group matched by the matcher or is actually another matcher (this is the anchored matcher). The highlight forms are tried in order and applied one after another, whatever their type is.

The specification is not completely recursive because it only allows one level of nesting, so an anchored matcher cannot have other anchored matchers inside it. The anchored matcher has the following syntax:

(MATCHER PRE-MATCH-FORM POST-MATCH-FORM MATCH-HIGHLIGHT ...)

where MATCHER is the search regexp (or function) that is tried after the anchor was found. PRE-MATCH-FORM and POST-MATCH-FORM are evaluated before the first and after the last run of MATCHER, so you can set search limits and do other magic if necessary: if PRE-MATCH-FORM returns a position greater than point, that position becomes the search limit. MATCH-HIGHLIGHT are the usual forms with the groups and faces.
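As a warm-up, here is a minimal sketch of a complete rule with every slot labeled. The (import NAME ...) form and the names my-dsl-import-keywords and my-dsl-add-import-keywords are hypothetical, invented only to show the shape. Both forms are nil, so the search limit falls back to the end of the line.

```elisp
;; Hypothetical `(import NAME ...)' form, used only to show the rule shape.
(defvar my-dsl-import-keywords
  '(("(\\(import\\)\\_>"                ; MATCHER: the anchor
     (1 font-lock-keyword-face)         ; HIGHLIGHT: plain group/face pair
     ("\\_<\\(?:\\sw\\|\\s_\\)+\\_>"    ; HIGHLIGHT: anchored MATCHER
      nil                               ; PRE-MATCH-FORM: nil, limit is end of line
      nil                               ; POST-MATCH-FORM: nothing to reset
      (0 font-lock-constant-face))))    ; MATCH-HIGHLIGHT for each match
  "Font-lock rule for a hypothetical `import' form.")

(defun my-dsl-add-import-keywords ()
  "Add `my-dsl-import-keywords' to the current buffer, e.g. from a mode hook."
  (font-lock-add-keywords nil my-dsl-import-keywords))
```

With this rule, the names on the same line as the import anchor get the constant face, while the same names on any other line are left alone.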

The cool and crucial ingredient is that the MATCHER is run in a loop until it either fails to match or point reaches the limit. This means that we in a sense "fontify" the region from the anchor to the limit we provide (or the end of the line by default). We can then reset the position in the POST-MATCH-FORM so the next HIGHLIGHT (anchored matcher) will start from the beginning of the same "region" again. This allows us to define "region-specific" font-locking. So cool!
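The loop is easier to see outside font-lock. The following is a simplified model of what happens for one anchored matcher, not the real implementation: the function name my-collect-anchored-matches is made up, and instead of applying faces it just collects the matched strings.

```elisp
(defun my-collect-anchored-matches (matcher-re &optional limit)
  "Collect every match of MATCHER-RE between point and LIMIT.
LIMIT defaults to the end of the current line, as it does in font-lock
when PRE-MATCH-FORM yields no usable position.  Call this with point
just after an anchor match; point is left after the last match."
  (let ((limit (or limit (line-end-position)))
        (matches '()))
    ;; Keep the anchor's match data intact while the loop runs.
    (save-match-data
      (while (and (< (point) limit)
                  (re-search-forward matcher-re limit t))
        ;; font-lock would apply the MATCH-HIGHLIGHT forms here.
        (push (match-string-no-properties 0) matches)))
    (nreverse matches)))
```

Calling it twice from the same spot only works if you move point back in between, which is exactly the job POST-MATCH-FORM does in a real rule:

```elisp
(with-temp-buffer
  (insert "(type f :: int -> string)\nint")
  (goto-char (point-min))
  (re-search-forward "::")                     ; the "anchor"
  (let ((start (point)))
    (list (my-collect-anchored-matches "[a-z]+")
          (progn (goto-char start)             ; stand-in for POST-MATCH-FORM
                 (my-collect-anchored-matches "->")))))
;; => (("int" "string") ("->"))
```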

The final annotated rule looks as follows:

(font-lock-add-keywords
 nil
 ;; the first regexp is the anchor of the fontification, meaning the
 ;; "starting point" of the region
 '(("(\\(type\\) +\\(\\(?:\\sw\\|\\s_\\)+\\) +::"
    ;; fontify the `type' as keyword
    (1 font-lock-keyword-face)
    ;; fontify the function name as function
    (2 font-lock-function-name-face)
    ;; look for symbols after the `::', they are types
    ("\\_<\\(\\(?:\\sw\\|\\s_\\)+\\)\\_>"
     ;; set the limit of search to the current `type' form only
     (save-excursion (up-list) (point))
     ;; when we found all the types in the region (`type' form) go
     ;; back to the `::' marker
     (re-search-backward "::")
     ;; fontify each matched symbol as type
     (0 font-lock-type-face))
    ;; when done with the symbols look for the arrows
    ("->"
     ;; we are starting from the `::' again, so set the same limit as
     ;; for the previous search (the `type' form)
     (save-excursion (up-list) (point))
     ;; do not move back when we've found all matches to ensure
     ;; forward progress.  At this point we are done with the form
     nil
     ;; fontify the found arrows as variables (whatever...)
     (0 font-lock-variable-name-face t)))))
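If you would rather check a rule like this without eyeballing colors, here is a small sketch that fontifies a string in a temporary buffer and reports the face at a given word. Everything named my-dsl-... is hypothetical. The rule is a trimmed copy of the one above (anchor plus the symbol matcher only), and borrowing the Emacs Lisp syntax table is just my assumption for how the DSL buffer treats symbols and parens.

```elisp
(defvar my-dsl-type-keywords
  '(("(\\(type\\) +\\(\\(?:\\sw\\|\\s_\\)+\\) +::"
     (1 font-lock-keyword-face)
     ("\\_<\\(\\(?:\\sw\\|\\s_\\)+\\)\\_>"
      ;; limit the search to the enclosing `type' form
      (save-excursion (up-list) (point))
      nil
      (0 font-lock-type-face))))
  "Trimmed copy of the rule above: anchor plus the symbol matcher.")

(defun my-dsl-face-at (text needle)
  "Fontify TEXT with `my-dsl-type-keywords' in a temporary buffer.
Return the `face' property at the first occurrence of NEEDLE."
  (with-temp-buffer
    (insert text)
    ;; Assumption: the DSL uses Lisp-like symbol and paren syntax.
    (set-syntax-table emacs-lisp-mode-syntax-table)
    (setq-local font-lock-defaults '(my-dsl-type-keywords))
    (font-lock-ensure)
    (goto-char (point-min))
    (search-forward needle)
    (get-text-property (match-beginning 0) 'face)))
```

With this in place, (my-dsl-face-at "(type f :: int -> string)" "int") should give font-lock-type-face, while (my-dsl-face-at "(defun string (string int))" "int") should give nil.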

And the forms are fontified in very much the same way as the Haskell code above (thanks to Emacs's amazing consistency with font-lock faces, another brilliant design decision).

(type function :: int -> string -> (string -> int) -> [float])
(type constant :: int)

(defun string (string int)
  "The keywords outside of the type form are *not* fontified!")

I repeat the Haskell version here just for completeness:

function :: Int -> String -> (String -> Int) -> [Float]
constant :: Int

Awesome.

Footnotes:

¹ The Emacs regexp engine is a lot less powerful than PCRE engines: it supports neither look-ahead nor look-behind assertions, among other less commonly used features.

² For those familiar with Parsec, this is basically the try combinator.