

Here is the latest Caml Weekly News, for the week of February 19 to 26, 2013.

  1. llpp v14
  2. Beta-release of Merlin
  3. Extracting information from HTML documents
  4. strange typechecking result
  5. What is triggering a lot of GC work?
  6. Other Caml News

llpp v14


malc announced:
New version of llpp (tagged v14) is now available at


llpp is a graphical PDF viewer which aims to superficially resemble


* Bugfixes
* Keyboard handling improvements (keypad, Neo layout, AltGr)
* Some functionality to make integration with synctex easier
(shift click, -remote command line option)

Beta-release of Merlin


Frédéric Bour announced:
We are very pleased to announce the beta release of Merlin. Merlin is a tool
which provides smart completion, among other things, in your favorite editor.
As of today, Vim and Emacs are supported.

See it at work:

Its features include:
- completion of values, constructors and modules based on the local scope
- retrieving the type of identifiers and/or expressions under the cursor
- integration with findlib to manage the build path
- highlighting of syntax errors, type errors and warnings inside the editor
- to a certain extent, resilience to syntax and type errors

It currently works only with OCaml 4.00.1 (it may work with newer versions).

If you happen to have an opam installation with the right version
(opam switch 4.00.1), you can try it right away with:

$ opam remote add kiwi 
$ opam install merlin

Then, to get started and set up your editor:

Check it out at:

The current version still needs to be tested under various systems and
configurations, your feedback is welcome.
Gabriel Scherer then said:
I looked into Merlin in the last few days, so here is a bit more
information if you, like me, are interested in design information
about projects before deciding whether to use (and potentially
contribute to) them.

Interface-wise: using Merlin (at least on emacs) feels a lot like
ProofGeneral (or CoqIde): there is a "checked zone" from the start of
the buffer to the first error, and reliable type information is
available in all this checked zone. The interface is rather simple and
easy to use.

(If you're considering porting Merlin to another editor: merlin is an
OCaml program that inputs and outputs JSON queries, so it seems rather
easy to port to any editor having plugin support in any language with
JSON support. Given its youth, you should however expect Merlin's
protocol to evolve over the next development period, so editor plugin
developers would need to follow Merlin's internal changes for now.)

The major difference between Merlin and -annot-based tools such as
Ocamlspotter or Typerex2 is that Merlin does the work of parsing OCaml
sentences into chunks and sending them incrementally to the
type-checker, instead of sending the whole buffer at once and using
the output. This means that Merlin is much more robust with respect to
files that have errors and cannot be compiled as a whole: you'll get
reliable type information from the start of the file up to the first
error (plus some resilience heuristics to recover more after it). The OCaml
type-checker is in fact able to provide some typing information even
for incorrect programs, so -annot-based tools do support some of those
features, but in a less reliable and more coarse-grained way.

The design trade-off is that Merlin has to copy/adapt more logic from
the compiler: it embeds an OCaml parser derived from the compiler's
own (so you won't get the parsing bugs of regexp-based elisp modes), and
currently also copies and slightly modifies the type-checker (I hope
this can be changed in the future, possibly by adding some flexibility
to the type-checker regarding eg. production of warnings). This means
more maintenance work to port Merlin to future OCaml versions. On the
other hand, it's probably not reasonable to expect the compiler
type-checker to be re-engineered for optimal incrementality support,
so having an external implementation of incrementality that piggybacks
on the batch typer interface makes sense.

My personal intuition is that Merlin's design is a good long-term
choice for an editor service (as opposed to code-analysis services).
You want tools such as ocamldoc, dead code analysis, or everything
concerned with analysis of an existing correct codebase to be based on
-bin-annot and work on the typedtrees directly (without local support
for parsing and typing programs). You want tools designed to work on
incomplete programs in the process of being created to embed
incrementality logic (insofar as it hasn't been pushed into the
compiler), understanding partial files and local modifications to
cleverly control the parsing and typing process with reactivity in mind.
The One True Editor Mode will merge both aspects, working on .cmt of
already-compiled dependencies on one side, and partial/incremental
information in editing buffers on the other. Merlin can evolve in this
way, as its type-directed analyses (those performed in the "checked
zone") work directly on the typedtree, so could work with .cmt files.
Simon Cruanes then added:
I started using Merlin a few days ago, on vim, and I must say it's a
pleasure to have direct access to completion/type checking/type
querying. Per-project configuration is really simple (just define a list
of source directories, build directories, and ocamlfind-able packages);
then completion and type-checking work seamlessly even with external
code. Maybe that's no big deal for emacs users, but on vim I think this
is quite a big improvement.

It's already very useful for me, so I'd like to thank the authors :)

Extracting information from HTML documents


José Romildo Malaquias asked and Gerd Stolpmann replied:
> tagsoup[1][2] is a Haskell library for parsing and extracting
> information from (possibly malformed) HTML/XML documents.
> tagsoup provides a basic data type for a list of unstructured tags, a
> parser to convert HTML into this tag type, and useful functions and
> combinators for finding and extracting information.
> Is there a similar library for OCaml?
> I want to write an application which will need to extract some
> information from HTML documents from the web. tagsoup helps a lot in the
> Haskell version of my program. Which OCaml libraries can help me with
> that when porting the application to OCaml?
> [1]
> [2]

Well, not really identical, but there is at least a robust HTML parser
in OCamlnet:


This parser was once used for Mylife's profile extractor (grabbing
data from profile pages of social networks), and is proven to handle
absolutely bad HTML well. XML should also be no problem.
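Driving this parser from OCaml is straightforward. A minimal sketch, assuming the "netstring" findlib package is installed (the [texts] helper is mine for illustration, not part of the API):

```ocaml
(* Parse deliberately malformed HTML with OCamlnet's tolerant Nethtml
   parser and print the text content of every element. *)
let () =
  let html = "<p>first<p><b><i>second</b></i>" in
  let docs = Nethtml.parse (new Netchannels.input_string html) in
  (* collect all Data nodes, depth-first *)
  let rec texts acc doc =
    match doc with
    | Nethtml.Data s -> s :: acc
    | Nethtml.Element (_, _, children) -> List.fold_left texts acc children
  in
  List.iter print_endline (List.rev (List.fold_left texts [] docs))
```

Note that the second `<p>` implicitly closes the first, as described below: the result is a list of elements, not raw tags.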
Florent Monnier then said and Gerd Stolpmann added:
> There's also xmlerr:
> but xmlerr is an alpha, experimental, hobbyist, not professional thing.
> Not all but some parts of its code are very quick-n-dirty.
> I wrote it for my own use to read HTML web pages,
> and I've been using it quite often for several years now.
> 99.9% of the time it does what I expect from it.
> It's not able to read XML files that are several gigabytes in size,
> because it first loads the whole content into a string and then parses it,
> which was a very poor choice, but at the beginning I was only using it
> to load HTML web pages.
> Don't expect something of the quality of Nethtml, xmlm and xml-light!
> I've never used Nethtml so I cannot say anything about it, but
> from what I can see from the interface is that the type is:
> type document =
>   | Element of (string * (string * string) list * document list)
>   | Data of string
> XmlErr's type is:
> type attr = string * string
> type t =
>   | Tag of string * attr list  (** opening tag *)
>   | ETag of string  (** closing tag *)
>   | Data of string  (** PCData *)
>   | Comm of string  (** Comments *)
> type html = t list
> As a result xmlerr will be able to return a plain representation of:
> <bold><i>text</bold></i>

Right, in quirks mode browsers understand this, although it has always
been against the specs. Note that even this is possible in quirks mode:
<b>bold <i>bold+italics </b>only italics </i>normal text

Nethtml cannot interpret this in the obviously intended way. In
practice, this was never a problem, though (fortunately, 99% of the
code on the web is cleaner than this).

> So it seems that Nethtml will return something corrected.
> Xmlerr doesn't, it only returns what it sees.

Nethtml returns the logical view, i.e. it doesn't return tags but  
elements. (NB Tags are the lexical delimiters of elements.) This is  
actually what you normally want to see, because HTML is specified in
terms of elements (unless you are writing something like an HTML editor,
where knowing the tags as such is also important). Nethtml also processes omitted
tags, e.g. for <a><b>text</a> it will implicitly close the "b" element  
when closing "a". Or even this: <p>para1 <p>para2 - here, Nethtml  
closes the first "p" when it sees the second (because it knows that "p"  
elements cannot contain other "p" elements). Note that this was always  
the tricky part of HTML parsing, and we had most problems in this area.

> Also Xmlerr parses comments because sometimes what I want to get is  
> there.

This is also possible with Nethtml, but optional. Nethtml can also parse
processing instructions, but these are rarely used even in XML files.

> Xmlerr only returns junk for the very XML specific things like <?xml
> and <! things,
> as a result it's not possible to use xmlerr to read, correct and print
> back corrected HTML when there are these kind of elements.

But anyway, an XML token reader like Xmlerr is certainly something  


> The last release also provides a command line utility "htmlxtr".
> This "thing" doesn't require any ocaml programming, it's a basic
> command line tool.
> What htmlxtr does is to "untemplate" templated parts of a web-page
> (but in a very basic way) and print the extracted things on stdout
> (read man ./htmlxtr.1 for more information).
> I'm interested in suggestions to improve it.
> I'm using xmlerr to write quick scripts; for example,
> Xmlerr.print_code prints HTML content as OCaml code of the Xmlerr.t
> type, so that I can just quickly copy-paste a piece of it into a
> pattern match and get something from this piece in less than a
> minute.
> When the template of a website changes, I can usually fix my script in
> less than 3 minutes.
> I know that some other programming languages provide utilities and
> libraries for these kinds of tasks, and that some use tricks and
> concepts to extract things from web pages as easily as possible,
> but I don't know them. If you do and have some time, please tell me
> about it.
> Anyway, even if xmlerr is very amateurish,
> I would be interested in any kind of suggestions about how to
> improve it.

strange typechecking result


Matej Kosik asked and Leo White replied:
> For one of my modules, the typechecker started to raise strange 
> complaints. I was not able to figure out exactly why, but at least I 
> wanted to narrow down the problem.
> This small program:
>    type r1 = {l1 : unit list}
>    and r2 = {l2 : int64 list}
>    let rec f1 _ =
>      ()
>    and _ r1 =
>      f1 r1.l1
>    and _ r2 =
>      f1 r2.l2
> is rejected by the typechecker with a following error message:
>    File "", line 12, characters 5-10:
>    Error: This expression has type int64 list
>           but an expression was expected of type unit list
> I do not understand why the given program was rejected.

I think that by default recursive uses of a function are monomorphic. You 
can fix this with an explicit polymorphic annotation:

   let rec f1: 'a. 'a -> unit = fun _ -> ()
   and f2 r1 = 
     f1 r1.l1
   and f3 r2 = 
     f1 r2.l2;;
Matej Kosik then asked and Leo White replied:
> I would like to ask:
> - Where is that restriction explained?
>    (I've searched Ocaml reference manual for "monomorphic" but nothing
>     relevant seemed to come up)
> - Where are things like:
>       'a. 'a -> unit
>    described?

I'm not sure if the restriction is explained anywhere but the solution is 
described in:
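For reference, Leo's fix expands into a self-contained, runnable program; the explicit `'a.` annotation makes f1 polymorphic even within its own let rec group, which the default monomorphic treatment of recursive occurrences does not allow:

```ocaml
(* Within a [let rec] group, occurrences of a function being defined
   are monomorphic by default, so without the annotation [f1] would be
   pinned to a single type such as [unit list -> unit].  The explicit
   polymorphic annotation ('a. ...) lifts the restriction. *)
type r1 = { l1 : unit list }
type r2 = { l2 : int64 list }

let rec f1 : 'a. 'a -> unit = fun _ -> ()
and f2 r1 = f1 r1.l1
and f3 r2 = f1 r2.l2

let () =
  f2 { l1 = [ () ] };
  f3 { l2 = [ 1L ] };
  print_endline "typechecks and runs"
```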

What is triggering a lot of GC work?


Francois Berenger asked:
Is there a way to profile a program in order
to know which places in the source code
trigger a lot of garbage collection work?

I've seen some profiling traces of OCaml programs
of mine; sometimes the trace is very flat,
and the only obvious entries are GC-related.

I think it may mean some performance-critical part
is written in a functional style and might benefit
from a more imperative style.
Mark Shinwell replied:
Well, as of last week, there is!

I'm working on a compiler and runtime patch which allows the
identification, without excessive overhead, of every location (source
file name / line number) which causes a minor or major heap allocation
together with the number of words allocated at that point.

There should be something available within the next couple of weeks.
It only works on native code compiled for x86-64 machines at present.
Currently it has only been tested on Linux---although I expect it to
work on other Unix-like platforms with little or no modification.
ygrek then added:
Meanwhile, you can use a poor man's allocation profiler:
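A very low-tech variant of this idea (a sketch of mine, not necessarily what ygrek's profiler does, and [allocated_words] is an illustrative name, not an existing API) is to bracket a suspect computation with Gc.quick_stat and compute the words it allocated:

```ocaml
(* Measure how many words a computation allocates, using the usual
   formula: minor + major - promoted (promoted words are counted in
   both the minor and major totals). *)
let allocated_words f x =
  let s0 = Gc.quick_stat () in
  let result = f x in
  let s1 = Gc.quick_stat () in
  let words =
    s1.Gc.minor_words -. s0.Gc.minor_words
    +. (s1.Gc.major_words -. s0.Gc.major_words)
    -. (s1.Gc.promoted_words -. s0.Gc.promoted_words)
  in
  (result, words)

let () =
  let _, w = allocated_words (fun n -> Array.init n string_of_int) 10_000 in
  Printf.printf "allocated ~%.0f words\n" w
```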
Gerd Stolpmann also replied to the original question:
This is really a hard question, and I fear an allocation profiler
cannot always answer it. Imperative style means using assignments, and
assignments often have to go through caml_modify, so they are not as
cheap as you might think. In contrast, allocating something new can
usually avoid caml_modify.

This can have counter-intuitive consequences. Yesterday I sped an  
imperative program up by adding allocations! The idea is so strange  
that I need to report it here. The program uses an array for storing  
intermediate values. Originally, there was only one such array, and  
sooner or later this array was moved to the major heap by the GC.  
Assigning the elements of an array in the major heap with young values  
is the most expensive form of assignment - the array elements are  
temporarily registered as roots by the OCaml runtime. So my idea was to  
create a fresh copy of the array now and then so it is more often in  
the minor heap (the array was quite small). Assignments within the  
minor heap are cheaper - no root registration. In the end, the program
was 10% faster.

My general experience is that optimizing the memory behavior is one of  
the most difficult tasks, especially because the OCaml runtime is  
designed for functional programming, and short-living allocations are  
really cheap. Usual rules like "assignment is cheaper than new  
allocation" just do not hold. It depends.
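Gerd's trick can be sketched as follows (a synthetic reconstruction, not his actual code; the names, the array size and the refresh period are all illustrative):

```ocaml
(* Periodically replace a small scratch array with a fresh copy so
   that it tends to live in the minor heap, where storing freshly
   allocated (young) values avoids the expensive root registration
   done by the caml_modify write barrier for major-heap arrays. *)
let refresh_period = 10_000

let run iterations =
  let frame = ref (Array.make 16 "") in
  for i = 1 to iterations do
    if i mod refresh_period = 0 then
      (* fresh, minor-heap copy of the array *)
      frame := Array.copy !frame;
    (* store a young value, as the interpreter's stack frame would *)
    (!frame).(i mod 16) <- string_of_int i
  done;
  !frame

let () =
  let a = run 100_000 in
  Printf.printf "last cell: %s\n" a.(0)
```

Whether this wins depends on how often the array is refreshed versus how many stores happen between refreshes; the copies themselves are cheap minor-heap allocations.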
Alain Frisch then said and Gerd Stolpmann replied:
> > This can have counter-intuitive consequences. Yesterday I sped an
> > imperative program up by adding allocations!
> This is really an interesting scenario, thanks for sharing!
> Two other approaches to addressing the same performance issue could have 
> been:
>   1. increase the size of the minor heap so that your array stays in it 
> long enough;
>   2. try to reduce the number of other allocations.
> Did you try one of these approaches as well?  (1 in particular is 
> particularly easy to test.)

No, there was no chance of keeping this array in the minor heap
otherwise; the program was running for too long.

> Gabriel Scherer recently called the community to share representative 
> "benchmarks", in order to help core developers target optimization 
> efforts to where they are useful:
> Gabriel: except from LexiFi's contribution, did you get any code?  Gerd: 
> it would be great if you could share the code you mention above; is it 
> an option?  

Unfortunately not - it's an interpreter I developed for my customer. I
can try to create a synthetic demo case just to show the effect. (The
array is in this program actually a kind of stack frame, and it is
interpreting some data manipulation code. When executing a statement,
the current data item is put into the first cell of the frame, so we
have really a lot of assignments here. The data items are strings, and
every data manipulation creates new strings, and this results in a
moderate allocation rate (though not as high as e.g. in a term rewriter).)


> There are a number of optimizations which have been proposed 
> (related to boxing of floats, compilation strategy for let-binding on 
> tuples, etc), which could reduce significantly the allocation rate of 
> some programs.  In my experience, this reduction can be observed on 
> real-sized programs, but it does not translate to noticeable speedups. 
> It might be the case that your program would benefit from such 
> optimizations.  Having access to the code would be very useful!
Gabriel Scherer then said:
Thanks for the friendly poking. I did get some code (I've actually
been surprised by how dedicated some submitters were, e.g. Edwin Török),
but my plate has been full non-stop since and I haven't yet taken the
time to put this into shape. It's on my TODO list and I hope to share
some results in the coming weeks.

Regarding the interesting battle story from Gerd, my own idea was to
"oldify" the values before inserting them in the array, in order not
to fire the write barrier. Oldifying values is costly as well, so I'm
not sure if that's interesting if the array is long-lived but the
elements short-lived. And more importantly, the oldifying interface
is, to my knowledge, not exposed to end-users (while it's possible
through the C interface to allocate directly in the old region), so
this cannot be written and tested without ugly hacks right now. I'd
still be curious to know how this solution would compare to the
Török Edwin then replied:
Thanks, it's not yet finished though: I meant to add a benchmark for
ocaml-re too and then publish it. I got sidetracked trying to find
some meaningful way to easily represent the results though (the text
output is a bit too verbose).

But since you brought it up I'd like your opinion on plots:

Currently I'm thinking of generating from the .csv:
- one SVG boxplot for (weighted) median/mean of OCaml version X vs Y
- one SVG paired barplot with confidence intervals for the individual
- instead of X-axis labels have on-mouse-over tooltips (SVG title
  element) describing benchmark name and time statistics

Initially I tried boxplots for the individual measurements (using
PNG/PDF output of archimedes), but the graphs either looked too
crowded (not enough room to place all labels), or there were too many
graphs and hard to get an overall picture (if I put fewer

Other Caml News

From the ocamlcore planet blog:
Thanks to Alp Mestan, we now include in the Caml Weekly News the links to the
recent posts from the ocamlcore planet blog at

On the n-ary cartesian product:

Piqi 0.6.0:


Quick tip: the ocamlbuild -documentation option:

Old cwn

If you happen to miss a CWN, you can send me a message and I'll mail it to you, or go take a look at the archive or the RSS feed of the archives.

If you also wish to receive it every week by mail, you may subscribe online.

Alan Schmitt