Nleyten


Tag - lambdoc


Thursday 29 October 2015

On-the-fly lexer switching with Menhir

Suppose you have a markup language that requires different lexers for different contexts (you have different verbatim environments, for instance). You can often solve this problem entirely within the lexer. This is the approach typically used for dealing with comments when parsing programming language source code, for example. What if, however, the markup language is complex enough that you need some help from the parser to know in which context you sit at any given moment, and therefore which lexer is the right one to call?

The situation above makes the familiar OCamllex (or Sedlex) + Menhir combination problematic. So much so that even if you otherwise have a strong preference for these tools and the grammar of your language is a nice fit for an LR(1) parser generator, you may be forced to adopt some scannerless parsing technique or some exotic parser generator. Nevertheless, I would like to share a couple of approaches which allow you to have your cake and eat it too, that is, to perform on-the-fly lexer switching within an OCamllex (Sedlex, actually) + Menhir framework.

The markup we want to parse is a very simplified sibling of Lambtex, supporting only paragraphs, quote environments, and two different kinds of verbatim-like environments (verbatim proper and source). Within paragraphs, only plain text, bold text, and hyperlinks are supported. Here you'll find a complete sample of our markup language. (Note that this dumbed-down Lambtex is indeed so simple that you could get away with parsing the verbatim-like environments entirely within the lexer context. You'll have to indulge me here!)

The first approach relies on Menhir's new Inspection API, which essentially allows for the current parser state to be inspected from the outside. This solution demands that Menhir be run in incremental mode, which in turn demands the use of the table-based back-end. As the Menhir manual notes, the table-based back-end is generally slower than the default code-based back-end. On the plus side, this solution does not require any hacks within the parser specification itself. It does, however, require a mildly complex Tokenizer layer between the Lexer and the Parser.

The second approach relies on a hack made practical by Menhir's ability to produce parameterised parsers, i.e., parsers which are in fact OCaml functors. Suppose thus that we declared our parser to be parameterised by a module C obeying signature Context.S:

%parameter <C: Context.S>
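
The signature Context.S itself is not shown in this post. As a rough sketch only, assuming it needs no more than a set function and two constructors, General and Literal, it could look like the following. The get function and the generative functor Make are hypothetical names introduced purely for illustration; the real definitions may differ.

```ocaml
(* Hypothetical sketch of a context module. Only [set], [General] and
   [Literal] are assumed to be required; [get] and [Make] are
   illustrative additions. *)
module Context = struct
  module type S = sig
    type t = General | Literal
    val set : t -> unit   (* called from semantic actions in the parser *)
    val get : unit -> t   (* read by the tokenizer before picking a lexer *)
  end

  (* Generative functor: every application allocates a fresh reference,
     so two parser instances never share lexing state. *)
  module Make () : S = struct
    type t = General | Literal
    let current = ref General
    let set ctx = current := ctx
    let get () = !current
  end
end

(* Two independent instances of the context. *)
module C1 = Context.Make ()
module C2 = Context.Make ()

(* Switching C1 to the literal context leaves C2 untouched. *)
let () = C1.(set Literal)
```

With a module of this shape, each parser instance could be handed its own context (Menhir exposes a parameterised parser as a functor, so something along the lines of applying it to Context.Make () once per instance), which keeps the mutable state local to that instance.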

The hack itself consists of using side-effects within the parser specification to set the current lexing context (note the set_general and set_literal rules below). The fact that the parser is parameterised allows us to contain the side-effects within each instantiation of the functor. Though the hack would also work without parser parameterisation, the resulting parser would not be reentrant and could not be safely used in a multi-threaded application. Finally, note that this approach also inserts a Tokenizer layer between the Lexer and the Parser. It is, however, much simpler than the one required by the first approach.

block:
  | set_literal BEGIN_VERBATIM set_general TEXT END_VERBATIM  {Ast.Verbatim $4}
set_general:
  | /* empty */  {C.(set General)}
set_literal:
  | /* empty */  {C.(set Literal)}

You may have noticed the seemingly odd placement of the set_literal and set_general producers within the sole production of the block rule. These seem to be placed one position before where they should be. The reason is simple: remember that we have to take into account the lookahead token!
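
To make the lookahead argument concrete, here is a small hand-unrolled trace of the block rule. It is only a sketch under one explicit assumption: whenever an empty production such as set_literal is reduced, the parser has already fetched the next token as lookahead. The tokenizer below is a hypothetical stand-in that hands out canned tokens and records which context was active when each token was requested; it is not the Tokenizer from the actual code.

```ocaml
type context = General | Literal
type token = BEGIN_VERBATIM | TEXT of string | END_VERBATIM

(* Current lexing context, as set by the parser actions. *)
let context = ref General

(* Hypothetical tokenizer: returns canned tokens and records the
   context that was active at the time of each request. *)
let pending = ref [BEGIN_VERBATIM; TEXT "raw *text*"; END_VERBATIM]
let trace = ref []

let next_token () =
  match !pending with
  | [] -> failwith "no more tokens"
  | tok :: rest ->
    pending := rest;
    trace := (tok, !context) :: !trace;
    tok

(* Hand-unrolled parse of the block rule, ASSUMING the lookahead token
   has already been read whenever an empty production is reduced. *)
let () =
  let _begin = next_token () in  (* lookahead BEGIN_VERBATIM, read as General *)
  context := Literal;            (* reduce set_literal *)
  let _text = next_token () in   (* shift BEGIN_VERBATIM; TEXT read as Literal *)
  context := General;            (* reduce set_general *)
  let _end = next_token () in    (* shift TEXT; END_VERBATIM read as General *)
  ()

(* Tokens paired with the context in which each was requested. *)
let observed = List.rev !trace
```

Under that assumption, the verbatim contents are the only token requested while the context is Literal, which is exactly what the "one position early" placement is meant to achieve.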

And that's it. Both of these approaches work and each has its advantages and disadvantages. I'm leaning towards the second approach for a cleaner reimplementation of Lambtex's current parser, though I can imagine that even hairier markups may require the extra flexibility afforded by the first approach. To conclude, note that I've been deliberately terse in explanation because the complete code is available on GitHub. Just bear in mind that it is littered with debug statements.

Monday 17 August 2015

Announcing Lambdoc 1.0-beta4

I'm happy to announce the release of version 1.0-beta4 of Lambdoc, a library providing support for semantically rich documents in web applications. Though Lambdoc was designed with Ocsigen/Eliom integration in mind, it does not actually depend on the Ocsigen server or Eliom, and you may use it with other frameworks. In fact, you may find it useful outside the web application domain altogether.

An overview of Lambdoc's features may be found in previous posts announcing the beta1 and beta3 releases. Between beta3 and beta4, the most salient changes are as follows:

  • Introduction of Lambdoc_core_foldmap, a module for aiding the construction of functions for deep traversal and transformation of a document tree. The basic idea is inspired by the compiler's Ast_mapper module, so it should be widely familiar. Moreover, the foldmapper is the result of a functor parameterised by a custom monad, so it's easily integrated in an application using Lwt or Async if the foldmapping requires doing monadic I/O. The tutorial directory includes some examples using Lambdoc_core_foldmap:

    • Tutorial 7 illustrates one of the simplest possible applications of this feature: a function that counts the number of bold sequences used in a document.
    • Tutorial 8 depicts a link validator that uses Cohttp to verify that all external links are live. Note that it registers itself as a parsing postprocessor, allowing any found errors to be reported together with other unrelated document errors. Moreover, it lives under the Lwt monad.
    • Tutorial 9 implements a simple document transformer which replaces all instances of Eastasia with Eurasia and vice-versa.
  • The addition of Lambdoc_core_foldmap enabled the simplification of the extension mechanism. Previous versions of Lambdoc featured hooks for reading/writing link and image URLs. All of those hooks are now gone.

  • Lambdoc documents may now carry information about the parsed source (location, etc) in every attribute. I briefly entertained the possibility of making the attribute polymorphic, thus allowing documents to carry custom metadata. However, at this moment I have no practical need for this extra flexibility, and I am wary of increasing complexity in the name of hypothetical use cases.
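
To give a flavour of the foldmapper idea mentioned above, here is a self-contained toy version. It is emphatically not the Lambdoc_core_foldmap API: the inline type, the MONAD signature, the Foldmap functor, and the on_inline field are all hypothetical names made up for this sketch. It only illustrates the general pattern of an open-recursion record, parameterised by a monad, in the spirit of Ast_mapper.

```ocaml
(* Toy inline AST; not the real Lambdoc document type. *)
type inline =
  | Plain of string
  | Bold of inline list
  | Link of string * inline list

module type MONAD = sig
  type 'a t
  val return : 'a -> 'a t
  val bind : 'a t -> ('a -> 'b t) -> 'b t
end

module Foldmap (M : MONAD) = struct
  let ( >>= ) = M.bind

  (* Open recursion: the handler receives the whole mapper, so an
     override can delegate to [default] and still see its own handler
     applied to child nodes. *)
  type 'acc mapper = {
    on_inline : 'acc mapper -> 'acc -> inline -> ('acc * inline) M.t;
  }

  (* Thread the accumulator left to right through a list of nodes. *)
  let rec seq fm acc = function
    | [] -> M.return (acc, [])
    | x :: xs ->
      fm.on_inline fm acc x >>= fun (acc, x) ->
      seq fm acc xs >>= fun (acc, xs) ->
      M.return (acc, x :: xs)

  (* Default behaviour: rebuild each node unchanged, visiting children. *)
  let default = {
    on_inline = (fun fm acc node ->
      match node with
      | Plain _ -> M.return (acc, node)
      | Bold xs ->
        seq fm acc xs >>= fun (acc, xs) -> M.return (acc, Bold xs)
      | Link (url, xs) ->
        seq fm acc xs >>= fun (acc, xs) -> M.return (acc, Link (url, xs)));
  }
end

(* The identity monad, for traversals that need no I/O. *)
module Identity = struct
  type 'a t = 'a
  let return x = x
  let bind x f = f x
end

module FM = Foldmap (Identity)

(* Count bold sequences: bump the accumulator, then delegate. *)
let count_bold doc =
  let fm = {
    FM.on_inline = (fun fm acc node ->
      let acc = match node with Bold _ -> acc + 1 | _ -> acc in
      FM.(default.on_inline) fm acc node);
  } in
  fst (FM.seq fm 0 doc)

(* Swap two words, here only when they form an entire Plain node. *)
let swap doc =
  let fm = {
    FM.on_inline = (fun fm () node ->
      match node with
      | Plain "Eastasia" -> ((), Plain "Eurasia")
      | Plain "Eurasia" -> ((), Plain "Eastasia")
      | _ -> FM.(default.on_inline) fm () node);
  } in
  snd (FM.seq fm () doc)
```

Swapping Identity for a monad such as Lwt would, in the same spirit, let a handler perform I/O while traversing, which is the reason for parameterising over the monad in the first place.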

Lambdoc 1.0-beta4 should be available in OPAM any moment now. Documentation is still a work in progress, and since OCamldoc gets terribly confused by Lambdoc's heavy use of module aliases, we may have to wait for Codoc before proper API docs can be generated. In a small effort to ameliorate this situation, the examples directory includes a tutorial with self-contained demos of Lambdoc's various features.

Monday 30 March 2015

Announcing Lambdoc 1.0-beta3

I'm happy to announce the release of version 1.0-beta3 of Lambdoc, a library providing support for semantically rich documents in web applications. Lambdoc was designed with Ocsigen/Eliom integration in mind, though you may of course use it with other frameworks (it does not actually depend on the Ocsigen server or Eliom). In fact, you may find it useful outside the web application domain altogether.

An overview of Lambdoc's features may be found in the post I wrote announcing the first beta of Lambdoc. The good news is that in the intervening months, some of the most pressing issues with the library have been fixed, and it is now much closer to completion. The bad news is that backward-incompatible changes were required. For most uses these amount to no more than a module renaming fixable by search-and-replace. The extension mechanism suffered a complete overhaul, however (more on that below), and is manifestly incompatible with the first beta. My apologies if anyone was inconvenienced by this, and the caveat emptor regarding beta software remains.

Lambdoc 1.0-beta3 should hit the OPAM repos any moment now.

Salient changes since beta 1

  • The module structure was reorganised, with the module packs being ditched in favour of a flatter structure reliant on module aliases.
  • OASIS is now used for the build system.
  • Completely revamped extension mechanism. Extensions are now easily composable, and output raw AST values instead of Lambdoc_core values, allowing for greater flexibility. Within the examples directory are some illustrations of the power offered by the extension mechanism.

What to expect before a 1.0 release

Though Lambdoc is perfectly useful right now, there are still some issues to resolve before I'm willing to tag a final 1.0 release. The Markdown support, in particular, is still far from complete. Other prominent issues include #24, #28, #29, #31, #32, and #33. Fortunately, though some of these issues may require backward-incompatible changes, the changes should be pretty minor.

Acknowledgements

Massive kudos to Gabriel "Drup" Radanne and Edwin Török for their feedback and code contributions.

Thursday 18 September 2014

Announcing Lambdoc 1.0-beta1

I'm happy to announce release 1.0-beta1 of Lambdoc, a library providing support for semantically rich documents in web applications. Lambdoc was designed with Ocsigen/Eliom integration in mind, though you may of course use it with other frameworks (it does not actually depend on the Ocsigen server or Eliom).

A brief overview of Lambdoc's features

  • A rich set of supported document features, including tables, figures, math, and source-code blocks with syntax-highlighting.
  • Built-in support for multiple input markups (see below), and easy integration of additional custom markups.
  • Runtime customisation of available document features. You may, for instance, declare that a certain class of users may not format text passages in bold.
  • Detailed error messages for mistakes in input markup.
  • A simple macro mechanism.
  • An extension mechanism.
  • The CLI application lambcmd, which allows conversion from any input markup to any output markup from the comfort of the command line.
  • Ships with decent looking CSS, easily customisable to your needs. Note that you'll need CCSS (available on OPAM) if you wish to modify the source for the CSS.

Supported input markups

This first beta of Lambdoc ships with built-in support for four different input markup languages:

  • Lambtex: Shamelessly inspired by LaTeX, Lambtex is my take on what LaTeX should look like if one were to get rid of all legacy baggage and gear it towards publishing on the web. Lambtex supports all of Lambdoc's features, and even has a complete manual (which by the way I also recommend if you want to get a comprehensive list of all document features supported in Lambdoc).
  • Lambwiki: Largely inspired by the Wiki Creole syntax, Lambwiki is a light-weight markup language. Though it does not support some of Lambdoc's more advanced features, it is veritably light and its syntactic conventions are IMHO more memorable than Markdown's. Moreover, it also has a complete manual.
  • Lambxml: An XML markup largely compatible with HTML. I don't find XML to be particularly human-friendly, but Lambxml might prove useful as a gateway for external XML-outputting tools.
  • Markdown: Love it or hate it, Markdown is ubiquitous, and as such supporting it is practically mandatory. Lambdoc supports Markdown via the OMD library, and therefore you should refer to OMD's documentation to learn about the supported flavour of Markdown. Note that Lambdoc's integration of OMD is still experimental, and there are still some issues to be resolved before the final 1.0 release. Prominently, OMD does not currently preserve location information, which is required for Lambdoc's error reporting mechanism. Fortunately, this issue has been acknowledged upstream.

Supported output markups

The only supported output markup is HTML5 via Tyxml. However, the functorial implementation used allows easy integration with Eliom.

Developer documentation

Unfortunately, developer documentation for this beta release is still sparse. Ocsigen/Eliom users are advised to take a look at the four-part tutorial included in the examples directory. The first step of the tutorial is a very minimalistic and straightforward illustration of how Lambdoc can be integrated in Eliom applications. Each subsequent step builds upon this foundation by introducing one new feature. Hopefully this will be enough to get you started.

About the extension mechanism

The extension mechanism is the latest addition to Lambdoc. It allows for the attachment of custom hooks to the processing of inline links, inline and block images, and the generic extern block. It is still somewhat experimental, but hopefully flexible enough to cover most use cases. Check out the last step of the tutorial for a basic example, or the source of lambcmd for a more complex real-world example which uses Bookaml to enable the special protocol isbn for links to books.

On the betaness of this release

Besides the aforementioned issues with the OMD integration, the lack of proper documentation, and the experimental character of the extension mechanism, the beta moniker for this release is also justified by the somewhat ad-hoc build system (I'm not sure OASIS even supports a project using module packs internally). Fortunately, using OPAM should spare you the trouble of worrying about this issue.

One important caveat: though I have no plans for further changes to the API, the betaness of this release also means I'll have no compunction in making them should the need arise.

Concluding remarks

The package is now available on OPAM. It has a tonne of dependencies, but since they are all packaged in OPAM, this shouldn't be a hassle. Note that some of the dependencies (Lwt, Ocsigenserver, Bookaml) apply only to the lambcmd CLI utility, and not the library itself. (Yes, I'm considering simplifying lambcmd for subsequent releases.)

Your comments/suggestions/criticisms are of course welcome. Feel free to send me an email or to open a ticket on the project's page at GitHub. I'll be particularly thankful if you find any bugs.