Caml Weekly News

Hello

Here is the latest Caml Weekly News, for the week of March 17 to 24, 2009.

XML output
ocaml-http is looking for a new maintainer
Google summer of Code proposal

XML output

Archive: http://groups.google.com/group/fa.caml/browse_thread/thread/165323537f036e34#

Rémi Dewitte asked and Gerd Stolpmann answered:

> I have used pxp to parse xml and I am happy with it. I'd like now to
> produce xml and wonder what are the options to do so (possibly the
> simpliest).

Maybe not the simplest: Use the PXP preprocessor to create the output
tree, and print the tree:

http://projects.camlcity.org/projects/dl/pxp-1.2.1/doc/manual/html/ref/Intro_preprocessor.html
http://projects.camlcity.org/projects/dl/pxp-1.2.1/doc/manual/html/ref/Pxp_document.document.html#2_WritingdocumentsasXMLtext


> 
> I think I am going to start with the Printf module. I wonder how well
> it handles utf8 for example. 

UTF-8 are just bytes for printf.

> And I'll have to write a kind of xml_encode function. I am pretty sure
> it has already be done somewhere !

let xml_encode =
  Netencoding.Html.encode 
    ~in_enc:`Enc_utf8
    ~out_enc:`Enc_usascii
    ~prefer_names:false
    ()

That would assume the input is UTF-8 encoded, and the output is
ASCII-encoded. You can control which ASCII characters get the special
XML representation &...; with the unsafe_chars optional argument.
Docs are at
http://projects.camlcity.org/projects/dl/ocamlnet-2.2.9/doc/html-main/Netencoding.Html.html

Sylvain Le Gall also replied:

Maybe it is a bit overkilling, but there is also ocamlduce.

See there:
http://www.cduce.org/ocaml
(dev for ocaml 3.11:)
http://ocamlduce.forge.ocamlcore.org/
http://git.ocamlcore.org/cgi-bin/gitweb.cgi?p=ocamlduce/ocamlduce.git;a=summary

OCamlduce can also be used with Eliom/OCsigen.

AFAIK, using ocamlduce can help you to type check your output tree
directly within OCaml compiler...

Matthieu Wipliez also replied:

Yet another solution is Xmlm by Daniel Bünzli.

http://erratique.ch/software/xmlm

This is probably the easiest and lightweight solution: Xmlm comes as a single
module and its interface, and it's BSD so you can just copy/paste it into your
project.

Michael Ekstrand then added:

I second the xmlm suggestion.  Polling event-based parsing is very slick 
and maps well into the functional paradigm, and its XML writing support 
(generating a stream of events identical to those you read) makes 
generation quite intuitive and reliable.

ocaml-http is looking for a new maintainer

Archive: http://groups.google.com/group/fa.caml/browse_thread/thread/3e99820fd9b65b57#

Stefano Zacchiroli announced:

Hi all, ocaml-http [1] is looking for a new maintainer.
More details have been written on my blog [2].

If you are interested in taking over the maintenance, please mail me
in private.

Cheers.

[1] http://upsilon.cc/~zack/hacking/software/ocaml-http/
[2] http://upsilon.cc/~zack/blog/posts/2009/03/ocaml-http_is_looking_for_a_new_maintainer/

Google summer of Code proposal

Archive: http://groups.google.com/group/fa.caml/browse_thread/thread/3249f410fe061579#

In this long thread full of technical posts, Andrey Riabushenko said and Xavier Leroy replied:

> I would like to develop LLVM frontend to Ocaml language. LLVM does participate
> in GSoC. LLVM do not mind to developed a ocaml frontend as LLVM GSoC project.
> I want to discuss details with you before I will make an official proposal to
> LLVM. [...]

Do authors of ocaml has something to say about the idea?

Da. A number of things, actually.

1- I know of at least 3, maybe 4 other projects that want to do
something with OCaml and LLVM.  Clearly, some coordination between
these efforts is needed.

2- If you're applying for funding through a summer of code project,
you need to articulate good reasons why you want to combine OCaml and
LLVM.  "Ocaml is cool, LLVM is cool, so OCaml+LLVM would be extra
cool" is not enough.  "It will generate PIC-16 code" isn't either.
Run-time code generation could be a good enough reason, but you still
need to weigh the development cost of the LLVM approach against, for
example, hacking the current OCaml code generator so that it emits
machine code in memory instead of assembly code.

3- A language implementation like OCaml breaks down in four big parts:
      1- Front-end compiler
      2- Back-end compiler and code emitter
      3- Run-time system
      4- OS interface
Of these four, the back-end is not the biggest part nor the most
complicated part.  LLVM gives you part 2, but you still need to
consider the other 3 parts.  Saying "I'll do 1, 3 and 4 from scratch",
Harrop-style, means a 5-year project.  To get somewhere within a
reasonable amount of time, you really need to reuse large parts of the
existing OCaml code base.

4- From a quick look at LLVM specs, the two aspects that appear most
problematic w.r.t. Caml are exception handling and GC interface.
LLVM seems to implement C++/Java "zero-cost" exceptions, which are
known to be quite costly for Caml.  (Been there, done that, in the
early 1990s.)  I regret that LLVM does not provide support for
alternate implementations of exceptions, like the C-- intermediate
language of Ramsey et al does:
 http://portal.acm.org/citation.cfm?id=349337

The big issue, however, is GC interface.  There are GC techniques that
need no support from the back-end: stack maps and conservative
collection.  Stack maps are very costly in execution time.
Conservative GC like the Boehm-Weiser GC is already much better.  But
for full efficiency, back-end support is required.  LLVM documents a
minimal interface to produce stack maps and make them available to the
GC at run-time:
 http://llvm.org/docs/GarbageCollection.html

The first thing you want to investigate is whether this interface is
enough to support an exact, relocating collector such as
OCaml's. According to
 http://www.nabble.com/Garbage-collection-td22219430.html
Gordon Henriksen did succeed in interfacing OCaml's GC with LLVM.
Maybe there is still some more work to do here, in which case I
recommend you look into this first.

6- Here is a schematic of the Caml compiler.  (Use a fixed-width font.)

           |
           | parsing and preprocessing
           v
        Parsetree (untyped AST)
           |
           | type inference and checking
           v
        Typedtree (type-annotated AST)
           |
           | pattern-matching compilation, elimination of modules, classes
           v
        Lambda
         /  \
        /    \ closure conversion, inlining, uncurrying,
       v      \  data representation strategy
    Bytecode   \
                \
               Cmm
                |
                | code generation
                v
             Assembly code

In my opinion, the simplest approach is to start at Cmm and generate
LLVM code from there.  Starting at one of the higher-level
intermediate forms would entail reimplementing large chunks of the
OCaml compiler.  Note that Cmm is quite close to the C-- intermediate
language mentioned earlier, so there is a lot to learn from Fermin
Reig's experience with an OCaml/C-- back-end.

7- To finish, I'll preventively deflect some likely reactions by Jon
Harrop:

"But you'll be tied to OCaml's data representation strategy!"
 Right, but 1- implementing you own data representation strategy is
 a lot of work, especially if it is type-based, and 2- OCaml's
 strategy is close to optimal for symbolic computing.

"But LLVM assembly is typed, so you must have types!"
 Just use int32 or int64 as your universal type and cast to
 appropriate pointer types in loads or stores.

"But your code will be tainted by OCaml's evil license!"
 It is trivial to make ocamlopt dump Cmm code in a file or pipe.
 (The -dcmm debug option already does this.)  Then, you can have a
 separate, untainted program that reads the Cmm code and transforms it.

"But shadow stacks are the only way to go for GC interface!"
 No, it's probably the worst approach performance-wise; even a
 conservative GC should work better.

Jon Harrop replied:

> 3- A language implementation like OCaml breaks down in four big parts:
>        1- Front-end compiler
>        2- Back-end compiler and code emitter
>        3- Run-time system
>        4- OS interface
> Of these four, the back-end is not the biggest part nor the most
> complicated part.  LLVM gives you part 2, but you still need to
> consider the other 3 parts.  Saying "I'll do 1, 3 and 4 from scratch",
> Harrop-style, means a 5-year project.

On the contrary, my "style" was to provide the features that I value 
(multicore & FFI) in a usable form (stop-the-world) with the shortest 
possible development time (i.e. <<6 months to create something useful). 
Specifically:

1- Front-end compiler: use camlp4 to provide an embedded DSL for 
high-performance parallel numerics and/or reuse front-ends from existing 
compilers like OCaml, PolyML, MosML, NekoML to build completely new language 
implementations.

2- Back-end compiler and code emitter: reuse LLVM.

3- Run-time system: write the simplest possible precise GC and use 
stop-the-world to apply it to threads that may then run in parallel.

4- OS interface: make it as easy as possible to call C directly.

HLVM had solved (2), (3) and (4) after only 3 months of part-time work. I 
vindicated my style!

> 7- To finish, I'll preventively deflect some likely reactions by Jon
> Harrop:
> 
> "But you'll be tied to OCaml's data representation strategy!"
>   Right, but 1- implementing you own data representation strategy is
>   a lot of work, especially if it is type-based, and

Actually I found that easy, not least because I wanted a user-friendly FFI so 
I just used C's data representation whenever possible.

>   2- OCaml's strategy is close to optimal for symbolic computing.

Is MLton not several times faster than OCaml for symbolic computing?

> "But LLVM assembly is typed, so you must have types!"
>   Just use int32 or int64 as your universal type and cast to
>   appropriate pointer types in loads or stores.

That is entirely possible and could be useful as an incremental improvement to 
OCaml's existing bytecode interpreter but it is not a step toward my goals.

> "But your code will be tainted by OCaml's evil license!"
>   It is trivial to make ocamlopt dump Cmm code in a file or pipe.
>   (The -dcmm debug option already does this.)  Then, you can have a
>   separate, untainted program that reads the Cmm code and transforms it.

Again, that is another technically-feasible step away from my goals because 
OCaml's CMM has already been mangled for its data representation (e.g. 31-bit 
ints, boxed floats).

> "But shadow stacks are the only way to go for GC interface!"
>   No, it's probably the worst approach performance-wise; even a
>   conservative GC should work better.

Building a state-of-the-art optimized concurrent GC Leroy-style means an 
infinity-year project. =:-p

Seriously though, I think it is essential to get a first working version of a 
GC that permits parallel threads. HLVM will be useful to a lot of people long 
before its GC gets optimized.

Using folding to read the cwn in vim 6+

Here is a quick trick to help you read this CWN if you are viewing it using vim (version 6 or greater).

:set foldmethod=expr
:set foldexpr=getline(v:lnum)=~'^=\\{78}$'?'&lt;1':1
zM

If you know of a better way, please let me know.

Old cwn

If you happen to miss a CWN, you can send me a message and I'll mail it to you, or go take a look at the archive or the RSS feed of the archives.

If you also wish to receive it every week by mail, you may subscribe online.

Alan Schmitt