Suppose you have a markup language that requires different lexers for different contexts (you have different verbatim environments, for instance). You can often solve this problem entirely within the lexer. This is the approach typically used for dealing with comments when parsing programming language source code, for example. What if, however, the markup language is complex enough that you need some help from the parser to know in which context you sit at any given moment, and therefore which lexer is the right one to call?

The situation above makes the familiar OCamllex (or Sedlex) + Menhir combination problematic. So much so that, even if you otherwise have a strong preference for these tools and the grammar of your language is a nice fit for an LR(1) parser generator, you may be forced to adopt some scannerless parsing technique or some exotic parser generator. Nevertheless, I would like to share a couple of approaches which allow you to have your cake and eat it too: performing on-the-fly lexer switching within an OCamllex (Sedlex, actually) + Menhir framework.

The markup we want to parse is a very simplified sibling of Lambtex, supporting only paragraphs, quote environments, and two different kinds of verbatim-like environments (verbatim proper and source). Within paragraphs, only plain text, bold text, and hyperlinks are supported. Here you'll find a complete sample of our markup language. (Note that this dumbed-down Lambtex is indeed so simple that you could get away with parsing the verbatim-like environments entirely within the lexer context. You'll have to indulge me here!)

The first approach relies on Menhir's new Inspection API, which essentially allows for the current parser state to be inspected from the outside. This solution demands that Menhir be run in incremental mode, which in turn demands the use of the table-based back-end. As the Menhir manual notes, the table-based back-end is generally slower than the default code-based back-end. On the plus side, this solution does not require any hacks within the parser specification itself. It does, however, require a mildly complex Tokenizer layer between the Lexer and the Parser.

The second approach relies on a hack made practical by Menhir's ability to produce parameterised parsers, i.e., parsers which are in fact OCaml functors. Suppose that we declare our parser to be parameterised by a module C obeying signature Context.S:

%parameter <C: Context.S>

The hack itself consists of using side-effects within the parser specification to set the current lexing context (note the set_general and set_literal rules below). The fact that the parser is parameterised allows us to contain the side-effects within each instantiation of the functor. Though the hack would also work without parser parameterisation, the resulting parser would not be reentrant and could not be safely used in a multi-threaded application. Finally, note that this approach also inserts a Tokenizer layer between the Lexer and the Parser. It is, however, much simpler than the one required by the first approach.
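To make the idea more concrete, here is a minimal sketch of what the context module and the tokenizer's dispatch might look like. Only the names C, Context.S, set, General and Literal come from the text above; the get function, the Make functor, the Tokenizer functor and the string-labelling "lexers" are assumptions made up for illustration, not the code from the actual project.

```ocaml
module Context = struct
  type t = General | Literal

  module type S = sig
    val set : t -> unit   (* called from the parser's semantic actions *)
    val get : unit -> t   (* assumed accessor, consulted by the tokenizer *)
  end

  (* Generative functor: every application owns a fresh, private cell,
     so two parser instances never share lexing state. *)
  module Make () : S = struct
    let current = ref General
    let set ctx = current := ctx
    let get () = !current
  end
end

(* Hypothetical tokenizer: the two branches stand in for calls to the
   real general and literal lexers; here they merely label their input. *)
module Tokenizer (C : Context.S) = struct
  let next raw =
    match C.get () with
    | Context.General -> "general:" ^ raw
    | Context.Literal -> "literal:" ^ raw
end

(* Two independent instantiations, as two parsers would have. *)
module C1 = Context.Make ()
module C2 = Context.Make ()
module T1 = Tokenizer (C1)
module T2 = Tokenizer (C2)

let () =
  C1.set Context.Literal;
  print_endline (T1.next "*x*");   (* lexed in the literal context *)
  print_endline (T2.next "*x*")    (* C2 is untouched: general context *)
```

Because the mutable cell lives inside each application of Make, switching the context of one instance leaves the other alone, which is the property that keeps the hack reentrant.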

block:
  | set_literal BEGIN_VERBATIM set_general TEXT END_VERBATIM  {Ast.Verbatim $4}

set_general:
  | /* empty */  {C.(set General)}

set_literal:
  | /* empty */  {C.(set Literal)}

You may have noticed the seemingly odd placement of the set_literal and set_general producers within the sole production of the block rule. These seem to be placed one position before where they should be. The reason is simple: remember that we have to take into account the lookahead token! By the time the parser reduces one of these empty rules, it has already asked the tokenizer for the token that follows, so the new context only takes effect one token later.
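The effect of the lookahead token can be traced with a toy model. The sketch below is not Menhir's actual machinery: it simply assumes that the parser has always fetched one token of lookahead before it runs a reduction, and it records which context was in force when each token of the verbatim production was lexed. The names context, lex and simulate are hypothetical.

```ocaml
type context = General | Literal
type token = BEGIN_VERBATIM | TEXT | END_VERBATIM

(* Context cell set by the (simulated) semantic actions. *)
let context = ref General

(* Log of (token, context in force when that token was lexed). *)
let log = ref []

(* Stand-in for the tokenizer: note the context at lexing time. *)
let lex tok = log := (tok, !context) :: !log

(* Walk through
     set_literal BEGIN_VERBATIM set_general TEXT END_VERBATIM
   under the assumption stated above. *)
let simulate () =
  context := General;
  log := [];
  lex BEGIN_VERBATIM;     (* lookahead fetched before reducing set_literal *)
  context := Literal;     (* reduce set_literal, then shift BEGIN_VERBATIM *)
  lex TEXT;               (* lookahead fetched before reducing set_general *)
  context := General;     (* reduce set_general, then shift TEXT *)
  lex END_VERBATIM;       (* lexed after the switch back to General *)
  List.rev !log
```

Under that assumption, TEXT is the only token lexed in the literal context, even though set_literal appears before BEGIN_VERBATIM in the production.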

And that's it. Both of these approaches work and each has its advantages and disadvantages. I'm leaning towards the second approach for a cleaner reimplementation of Lambtex's current parser, though I can imagine that even hairier markups may require the extra flexibility afforded by the first approach. To conclude, note that I've been deliberately terse in my explanations because the complete code is available on GitHub. Just bear in mind that it is littered with debug statements.