Blah, blah, blah
Lambdoc supports mathematical expressions in both TeX and MathML formats. The former was chosen because it is the de facto standard for inputting mathematics in a human-friendly manner (for slightly unconventional definitions of both "human" and "friendly"). The latter was chosen because several graphical tools for designing math expressions can now convert their own internal (often proprietary) formats into MathML.
Though browser support for MathML is still largely broken, that will hopefully change with the imminent release of the STIX fonts. So rather than doing a server-side rendering of all maths into images and serving those to the client (the solution that Wikipedia currently uses), I have chosen to take the plunge into the MathML world and serve that instead. It follows that I must somehow convert all those equations in TeX format into MathML.
Now, I was hoping this task would be a no-brainer. The tool that Wikipedia itself currently uses is texvc, which is written (drumroll) in Ocaml. Unfortunately, texvc's forte is the conversion of TeX into images, and it has only very minimal support for converting TeX into MathML (it basically chokes on any expression more complex than "x = y"). I was therefore forced to look into alternatives.
I ended up settling on Blahtex. It is being developed with the intention of being integrated into the MediaWiki package that powers Wikipedia. Moreover, besides the commonplace rendering of TeX equations into images, Blahtex also aims at converting TeX into MathML, and is already very competent at that task. Unfortunately, and unlike texvc, Blahtex is developed in C++. This is how I ended up spending some time learning about the ways of interfacing C/C++ code with Ocaml.
Ocaml has very good support for interfacing with C code (and indirectly also with C++, since inside a C++ programme you can declare functions to have C linkage). You can invoke C functions from the Ocaml side, Ocaml functions from the C side, and mix and match as per your requirements. The authoritative reference on this lore is the Ocaml manual itself, though I recommend that newcomers start with Florent Monnier's excellent tutorial. While not rocket science, the interfacing does require some attention, particularly on matters where Ocaml's garbage collector is involved. Moreover, when writing the C stubs we are constantly reminded that we have left the safety of Ocaml shores and that we now find ourselves in "Here Be Dragons" territory.
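To give a flavour of the C-linkage trick, here is a minimal sketch of a C++ class hidden behind plain C functions. Everything named in it (`ToyConverter`, `toy_convert`, `toy_free`) is made up for illustration; none of it is Blahtex's or Blahcaml's actual API. A real Ocaml stub would go one step further and take and return the runtime's `value` type, using the `CAMLparam`/`CAMLreturn` macros from the Ocaml headers so that the garbage collector knows about the values it holds.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a C++ converter class (NOT Blahtex's real API).
// It throws on bad input, as C++ libraries commonly do.
class ToyConverter {
public:
    std::string convert(const std::string& tex) const {
        if (tex.empty())
            throw std::invalid_argument("empty expression");
        // Placeholder output; a real converter would produce MathML here.
        return "[MathML for: " + tex + "]";
    }
};

// extern "C" gives these functions C linkage (no C++ name mangling),
// so code that only knows how to call C can find and call them.
extern "C" {

// Returns a malloc'd, NUL-terminated copy of the result, or NULL on failure.
// The caller must release the result with toy_free.
// No C++ exception is allowed to escape across the C boundary.
char* toy_convert(const char* tex) noexcept {
    assert(tex != nullptr);
    try {
        ToyConverter converter;
        const std::string out = converter.convert(tex);
        char* copy = static_cast<char*>(std::malloc(out.size() + 1));
        if (copy == nullptr)
            return nullptr;
        std::memcpy(copy, out.c_str(), out.size() + 1);  // includes the NUL
        return copy;
    } catch (...) {
        return nullptr;  // translate any exception into a C-style error value
    }
}

// Releases a string previously returned by toy_convert.
void toy_free(char* s) noexcept {
    std::free(s);
}

}  // extern "C"
```

The two points worth noticing are that every C++ exception is caught before it can cross into C territory, and that ownership of the returned buffer is spelled out explicitly, since nobody on the other side of the fence will do it for you.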
Another issue that needed handling was Blahtex's lack of any sanity checks on its input. Since web applications are the most obvious use for a TeX to MathML converter, it worried me that a crafty user could pull off a cross-site scripting attack via a math expression. To solve this issue I chose an approach that, though heavy-handed, is simple to implement and offers beneficial side-effects: the Ocaml side can ensure the sanity of the generated MathML by validating it against the official MathML2 DTD from the W3C. During this process I discovered that the very capable though complex PXP library is the only Ocaml XML parser that can handle the complexity of the MathML2 DTD.
The end result is Blahcaml, a library that offers basic bindings to Blahtex, but with added (optional) sanity checking. I have just released version 1.0, and though it still hasn't been through a lot of testing, it's performing nicely and can be considered stable. Try it out and let me know if you find any problems!
