S-expressions for long-term storage of Ocaml values
Call it marshalling, pickling, serialisation, or whatever else you wish: this operation, where a value (or "variable" in non-FP languages) is extracted from a programme at runtime in order to be stored on disk or transferred over the wire, is critical for many applications. Most programming languages therefore include in their standard libraries some means of performing it.
Ocaml ships with the venerable Marshal module. It is extremely simple to use: suppose we wish to store in a string str the marshalled representation of a value manuscript of type Manuscript.t. This suffices: let str = Marshal.to_string manuscript []. The inverse operation, taking a string with a marshalled representation and converting it back into a programme value, is also dead easy: let manuscript : Manuscript.t = Marshal.from_string str 0.
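As a self-contained sketch of the round trip just described, the snippet below uses a small stand-in record, since the real Manuscript.t is not shown in this post. The manuscript type, its fields and the helper names are assumptions made purely for illustration.

```ocaml
(* Hypothetical stand-in for Manuscript.t; the real type is not shown in this post. *)
type manuscript = {
  title : string;
  paragraphs : string list;
}

(* Marshal a value into a string; the empty list means "no special flags". *)
let to_marshalled (m : manuscript) : string = Marshal.to_string m []

(* Read the value back, starting at offset 0. The annotation is what tells the
   compiler which type to expect; nothing checks it at runtime. *)
let of_marshalled (s : string) : manuscript = Marshal.from_string s 0

let original = { title = "On S-expressions"; paragraphs = [ "First."; "Second." ] }
let restored = of_marshalled (to_marshalled original)

let () =
  (* Structural equality holds, although the restored value is a fresh copy. *)
  assert (restored = original);
  print_endline restored.title
```

Wrapping the two calls in functions with explicit signatures keeps the unchecked type annotation in a single place, which at least makes it harder to get wrong by accident.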
The attentive observer will have noticed the type annotation in the unmarshalling example above; in general, unmarshalling does require an explicit type annotation (which otherwise is rarely needed in Ocaml). The reason is that the type inference mechanism may not have enough information to determine the type of the marshalled representation (imagine, for example, that the routines performing the marshalling and unmarshalling reside in different programmes!). And here lies the Achilles' heel of marshalling in Ocaml: should the programmer make a mistake in specifying the type to be unmarshalled, the programme will most likely segfault. This problem also occurs if one version of the programme stores a marshalled value, which is then read back by a subsequent version of the same programme where the value's type has been modified (even if ever so slightly).
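To make the absence of runtime checking concrete, here is a deliberately benign sketch. It relies on a detail of the standard Ocaml runtime (true and the integer 1 share the same in-memory representation), so the mismatch goes unnoticed instead of crashing. With structured types, the same kind of mistake is what leads to the segfaults described above. The names are made up for the example.

```ocaml
(* Marshal a bool... *)
let marshalled_bool : string = Marshal.to_string true []

(* ...and claim, wrongly, that the string holds an int. The compiler accepts
   the annotation on trust, and the runtime never verifies it. *)
let wrongly_typed : int = Marshal.from_string marshalled_bool 0

let () =
  (* No exception is raised: the value has silently changed type. *)
  Printf.printf "read back as int: %d\n" wrongly_typed
```

The point is not the particular number that comes out, but that neither the compiler nor the runtime objects to the wrong annotation.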
Over the years, there appeared a number of extensions to the Ocaml language offering type-safe marshalling (HashCaml may be the best known, but it's not the only one). And judging from comments by Xavier Leroy (the primary developer of the Ocaml language) at the Ocaml users meeting in Paris this last January, there's a good chance that type-safe marshalling will make it into the core language in the near future.
In Lambdium-light, stories and comments are stored in the database backend using the marshalled representation offered by the Marshal module. While this works and is extremely fast, it does have the problem of not being very resilient to future changes in the story and comment format. Therefore, I have been looking at alternatives to Marshal that provide some degree of backwards compatibility while not sacrificing too much speed.
I suspect that the XML-fanboys in the audience wouldn't even think twice, but personally I am far from finding XML a good solution for many of the problem domains where it is applied. For this reason, I asked about this problem on the Caml-list.
Anyway, I am currently leaning towards choosing a solution based on Sexplib, a library that converts Ocaml values to/from S-expressions. It comes with a syntax extension that, given a type t, automatically writes the sexp_of_t and t_of_sexp "marshalling" functions. Making the transition from Marshal to Sexplib is therefore very straightforward. Another advantage is that S-expressions are essentially just text and are therefore human-readable. Moreover, the format is very compact (a lot more so than XML!) and fairly easy to parse. Speed-wise, while obviously not being as fast as Marshal, it is still reasonably fast, especially in native code.
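To give a feel for what a pair of sexp_of_t and t_of_sexp functions does, and why a textual format copes better with evolving types, here is a hand-written sketch. It does not use Sexplib itself: the tiny sexp type, the story record and the default for the missing field are all assumptions for illustration, and the code that Sexplib actually generates will differ.

```ocaml
(* A minimal S-expression type, similar in shape to Sexplib's Sexp.t. *)
type sexp = Atom of string | List of sexp list

(* Hypothetical story type; imagine that [tags] was added in a later version. *)
type story = { title : string; body : string; tags : string list }

(* Convert a story into a list of (name value) pairs. *)
let sexp_of_story (s : story) : sexp =
  List [
    List [ Atom "title"; Atom s.title ];
    List [ Atom "body"; Atom s.body ];
    List [ Atom "tags"; List (List.map (fun t -> Atom t) s.tags) ];
  ]

(* Look up a field by name among (name value) pairs. *)
let rec find_field name = function
  | List [ Atom n; v ] :: _ when n = name -> Some v
  | _ :: rest -> find_field name rest
  | [] -> None

let atom_exn = function
  | Atom a -> a
  | List _ -> failwith "expected an atom"

(* Convert back. A missing or malformed required field raises an ordinary
   exception rather than crashing, and the newer [tags] field falls back to a
   default so that values stored by the older version can still be read. *)
let story_of_sexp : sexp -> story = function
  | Atom _ -> failwith "expected a list of fields"
  | List fields ->
    let required name =
      match find_field name fields with
      | Some v -> atom_exn v
      | None -> failwith ("missing field: " ^ name)
    in
    let tags =
      match find_field "tags" fields with
      | Some (List ts) -> List.map atom_exn ts
      | Some (Atom _) -> failwith "tags: expected a list"
      | None -> []
    in
    { title = required "title"; body = required "body"; tags }

(* A value as an older version would have stored it, without tags. *)
let old_format =
  List [ List [ Atom "title"; Atom "Hello" ]; List [ Atom "body"; Atom "World" ] ]

let from_old = story_of_sexp old_format
```

Because fields are looked up by name, a reader can decide field by field how to treat data written by an older version, which is precisely what the opaque Marshal format does not allow.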
Suppose I have a fairly large story of type Manuscript.t. On my machine, and using Ocaml bytecode, marshalling and unmarshalling this value 100,000 times takes approximately 19.68 seconds. Using Sexplib, these operations take 1175 seconds, which is about 60 times slower. However, the times in native code are respectively 17.98 and 105.3 seconds, so Sexplib is less than 6 times slower than Marshal. Given the other advantages of Sexplib, these are numbers I can live with.
If you are curious about how I got these numbers (and you should — never take anyone's word at face value when benchmarks are involved!), here follows the run-down of the small programme I used for testing.
run_marshal is a function that, given a manuscript, does a marshalling followed by an unmarshalling using the Marshal module. run_sexplib does the same thing but using Sexplib. Note that the latter function actually first converts the Manuscript.t into its Sexp.t representation and then this latter value into a string (and vice-versa for the reverse operation):
let run_marshal manuscript () =
  (* Round trip through the Marshal module. *)
  let marshalled = Marshal.to_string manuscript [] in
  let unmarshalled : Manuscript.t = Marshal.from_string marshalled 0 in
  ignore (unmarshalled)

let run_sexplib manuscript () =
  (* Manuscript.t -> Sexp.t -> string, and then all the way back. *)
  let manuscript_sexp_old = Manuscript.sexp_of_t manuscript in
  let mach_str = Sexplib.Sexp.to_string_mach manuscript_sexp_old in
  let manuscript_sexp_new = Sexplib.Sexp.of_string mach_str in
  let manuscript_new = Manuscript.t_of_sexp manuscript_sexp_new in
  ignore (manuscript_new)
I also define a generic benchmarking function that loops a provided function 100,000 times. It uses Unix.gettimeofday to retrieve timing information:
let benchmark test =
  let start = Unix.gettimeofday () in
  for i = 1 to 100000 do
    test ()
  done;
  let finish = Unix.gettimeofday () in
  let duration = finish -. start in
  duration
Finally, the main programme simply creates a new manuscript (assume that function get_manuscript returns a new parsed manuscript) and calls the benchmark function with the marshal and sexplib routines:
let () =
  let manuscript = get_manuscript () in
  let duration_marshal = benchmark (run_marshal manuscript) in
  let duration_sexplib = benchmark (run_sexplib manuscript) in
  Printf.printf "Marshal: %f\n" duration_marshal;
  Printf.printf "Sexplib: %f\n" duration_sexplib
