OCaml Weekly News

Hello

Here is the latest OCaml Weekly News, for the week of September 30 to October 07, 2014.

slacko-0.9.0
Feedback on -safe-string migration attempts
Thoughts on targeting windows
Other OCaml News

slacko-0.9.0

Archive: https://sympa.inria.fr/sympa/arc/caml-list/2014-10/msg00011.html

Marek Kubica announced:

I'd like to announce the first public release of Slacko, a REST API
binding to Slack https://slack.com/. I'm looking for input on it, so
I can publish a 1.0 soon.

The source code can be found on GitHub,
https://github.com/Leonidas-from-XIV/slacko and it is already
uploaded in OPAM http://opam.ocaml.org/packages/slacko/slacko.0.9.0/.

More information can be found on my blog,
http://xivilization.net/~marek/blog/2014/10/01/introducing-slacko/.

Malcolm Matalka then said and Marek Kubica replied:

> I would either change the 'apierror' type to normal variants, or
> spread out the apierror to multiple types.  Can every API function
> really return every one of those errors?  The value of polymorphic
> variants is they are module, and you can share the constructors
> across types.

You're right, there is a common set of errors that most functions return
and then there are some specific errors.

Originally I didn't want to create such a huge type but have every
function have its own polymorphic variants that apply to it, so users
of the API know exactly what errors might be thrown and have to handle
these. The huge type is a direct result of the "validate" function
https://github.com/Leonidas-from-XIV/slacko/blob/bc6be47cca8d2088d3d7cfb9af0603a350ab0215/src/slacko.ml#L79
which tests all cases.

I tried specifying a smaller subset of polymorphic variants, but the
compiler always selected the full set. I don't know how to fix this,
because on the one hand side I don't want to have a big monolithic
error type but on the other hand, I don't want to duplicate a subset of
the validate function in every function.

Looking forward for ideas on how to solve this in a more elegant way.

> I would also structure the API calls in terms of the Result.t type in
> Core.  Right now success is part of the error type.

If I understand you correctly, this should do the trick with the
current code:

type result = Success of Yojson.Basic.json | Error of apierror

For future releases I plan to replace the strings in the signatures
with proper types (so, some "string" turn into a "token" types, others
into timestamp types) and to process the resulting JSON into some more
convenient data structures. But for now, I hope to get the error
handling right and then build upon this.

Gabriel Scherer then suggested:

Re. the "validate" function, you can do it using the ability to
pattern-match on a named subset of the polymorphic variants.

Let's consider a simplification of your problem with only two classes
of errors (taken from your code):

  type error_cant_kick = [
  | `Cant_kick_from_general
  | `Cant_kick_from_last_channel
  | `Cant_kick_self
  ]

  type error_cant_invite = [
  | `Cant_invite
  | `Cant_invite_self
  ]

  type 'a error = [
  | `Success of 'a
  | error_cant_kick
  | error_cant_invite
  ]

The `validate` function indeed processes all cases, returning an
everything-possible error type:

  let validate : _ -> _ error = function
    | `String "cant_invite_self" -> `Cant_invite_self
    | `String "cant_invite" -> `Cant_invite
    | `String "cant_kick_self" -> `Cant_kick_self
    | `String "cant_kick_from_general" -> `Cant_kick_from_general
    | `String "cant_kick_from_last_channel" -> `Cant_kick_from_last_channel

But you can refine its result if you know (from the informal
API specification) that it can only be some subset, using the #foo
syntax. In the example below, `kick` and `invite` have been refined
with distinct subsets:

  let api_mismatch err =
    failwith ("API mismatch: unexpected error " (*^ to_string err *))

  let kick f =
    f ()
    |> validate
    |> function
      | `Success v -> `Success v
      | #error_cant_kick as err -> err
      | other -> api_mismatch other

  let invite f =
    f ()
    |> validate
    |> function
      | `Success v -> `Success v
      | #error_cant_invite as err -> err
      | other -> api_mismatch other

`kick` now returns a [> error_cant_kick | `Success of 'a ], while `invite`
returns [> error_cant_invite | `Success of 'a ].

Marek Kubica then said and Jacques Garrigue replied:

> Thank you for this idea, I was actually looking for a way to define
> subsets of polymorphic variants, I didn't know that I can just put them
> into other types and was unaware of the # syntax. Now my functions
> return just the possible error variants, not all of them anymore.
> 
> I've reworked the code to get rid of the huge apierror type (which I
> suppose I can throw away completely, since a user of the API will never
> need it anyway), but it lead to quite huge blow-up of signatures:
> 
> https://github.com/Leonidas-from-XIV/slacko/blob/37c3626bb9574bc99325267fb4c9f9c3c4f4730c/src/slacko.mli
> 
> I am going to simplify these subsets a bit (since e.g. all history
> methods use the same polymorphic variants and some possible error types
> imply other error types) but it kinda looks very verbose now - any hints
> on what can be done?

I don’t think it’s particularly verbose.
You have documented all the possible errors for each function, which is
valuable information, and it would take the same space (or more) if you
were to write this in comments.
If you think this is hard to read, you can try some factorization, but I’m not
sure it would help. Also, from a stylistic point of view I would write variant
definitions on a single line, since there are never more than 3 cases.

This said, there is always a tension between safety an complexity.
If in practice static checking of possible errors doesn’t matter that much,
it would be easier to write this code using normal variants.

Feedback on -safe-string migration attempts

Archive: https://sympa.inria.fr/sympa/arc/caml-list/2014-10/msg00019.html

Gabriel Scherer said:

I recently converted Extlib to work with safe-string ( the patch can
be found in the ocaml-lib-devel archives,
http://sourceforge.net/p/ocaml-lib/mailman/message/32877133/ ), and
while it mostly went smoothly, there was a pain point that I think
would be worth discussing.

The question is, when converting an existing library interface, how to
decide whether any given part of the API should remain a "string" or
be moved to "bytes" (
http://caml.inria.fr/pub/docs/manual-ocaml/libref/Bytes.html ) -- or
maybe provide two functions, one for each type.

# The problem

The new distinction between bytes and string, added in 4.02, actually
plays on two different intuitions:
- bytes represents (1) mutable (2) sequences of bytes
- string represents (1) immutable (2) end-user text (which happen to
be represented as sequence of bytes, but we could think of
representing them as eg. Javascript strings in the future and with
js_of_ocaml, or with ropes, etc.)

The problem is that aspects (1) and (2) are somewhat orthogonal. I
don't think we're interested in mutable end-user texts, but I
encountered a few notable cases of (1) immutable (2) sequences of
bytes. The problem is: should those be typed as string, or bytes?

(There may be a difference between functions that assume their
arguments are immutable, and function that simply guarantee that they
won't themselves mutate their arguments. For now I'll assume those two
cases count as "immutable sequences of bytes").

Right now, the standard library itself does a strange job of making a
choice. The Marshal module (
http://caml.inria.fr/pub/docs/manual-ocaml/libref/Marshal.html )
appears to favor the choice of "bytes" for non-mutated byte sequences
(eg. data_size, total_size), while the Digest module (
http://caml.inria.fr/pub/docs/manual-ocaml/libref/Digest.html )
remained in the land of strings.


# An ideal solution

In an ideal world, I claim the best solution would be the following.
Given that it is clear (to me) that mutable byte sequences and
immutable byte sequences share the same representation, we should use
phantom type to distinguish them:

  type mut
  type immut
  type 'a bytes

  val get : 'a bytes -> int -> char
  val set : mut bytes -> int -> char
  Digest.t = immut bytes

Using phantom types had been considered at the time of the
bytes/string split, but rejected because suddenly adding polymorphism
to string literals and string functions broke a lot of code ("The type
of this expression, ..., contains a type variable that cannot be
generalized", or suddenly-polymorphic method return types). More
importantly, we do not want to enforce string and bytes to always have
the same underlying representation. Neither arguments hold for
mutable/immutable bytes.


# Going forward

It is probably a bit too late to change the "bytes" type in the
compiler standard library. (Well, feel free to disagree on this.)
And maybe we don't need to: just as more featureful, higher-level
libraries have been developed outside the OCaml distribution, we could
think of having a safer, higher-level phantom representation of byte
sequences, as an external library.

Regardless of what we do about this, I would recommend that immutable
byte sequences (things that are, by design, not text) be represented
as "bytes" rather than "string"¹. If/whenever a consensus on a safer
phantom representation appear, it will be possible to convert to it
without changing the representation.
Similarly, if your bytes-taking function does not mutate or capture
its input, you should mention it informally in its
specification/documentation (and maybe express this with a phantom
type later): this is important to reason about, for example, (un)safe
conversions on those byte sequences.

¹: a dissenting opinion could suggest that it is more important to get
the type-checker help re. mutability than expose the distinction
between byte-level data and text (which should be an abstract type in
some UTF8 library anyway), and thus immutable anything should rather
be "string". I think the phantom type approach is superior, and we
should design interfaces with it in mind.

Jacques Garrigue replied:

I think this is an interesting proposal.
We didn’t consider it when adding the -safe-string option, but I do
not think it is too late to change things in the compiler:
keeping compatibility with pre-4.02 code is essential, but Bytes
itself is still experimental, so changing the bytes type should be ok.

Actually, I think no decision was reached whether the ability to have
different internal representations for string and bytes is really important.
This may matter when you use javascript as backend, but otherwise?
If we decided to keep the same representation by default, then a
reasonable approach would be to adopt your proposal with the following
extra twist:

* in safe-string mode, string is an alias for immut bytes
	type 'a bytes
	type string = immut bytes
* in legacy mode, bytes is an alias for string
	type string
	type 'a bytes = string

To keep good compatibility, the functions in the String module would
only have monomorphic types.
The notation "s.[n]" is a subtle case, but a solution could be to have it
expanded to the monomorphic String.get in legacy mode, and to the polymorphic
Bytes.get in safe-string mode. This would allow to keep the "s.[n] <- e”
notation too.

I think this would be more comfortable to use than the current state.

Alain Frisch then said:

I think it's good to keep the freedom to change the representation of
immutable text.  This could simplify some migration path towards
Unicode and/or allow using more clever representations (such as ropes,
strings with lazy concatenation, etc).

Generally speaking, I agree with Gabriel that the distinction between
strings and bytes is more about "text" vs "compact byte array" data,
than between immutable vs mutable.  I'm not sure about the need to
track immutability (or read-only permission) on "byte array", though.
Why would this be more important than on other kind of arrays?   In
particular, for numerical code, it would be equally useful to specify
immutability of vectors/matrices.  Do we want to go into this
direction of tracking mutation of arrays?

Actually, I think it would be interesting to introduce a module type
"ARRAY" for arrays with a fixed type "elt" of elements, add a Make
functor to the Array module, and arrange so that "Bytes" is a subtype
of "ARRAY with type elt = char" (in particular, it's "t" type
shouldn't be more parametric than the one in the ARRAY signature).  We
could similarly provide a compact "BoolArray" implementation, and if
we ever decide to drop the automatic runtime unboxing of float arrays,
we would of course provide a "FloatArray" replacement. "Polymorphic"
algorithms on arrays could be parametrized by a first-class ARRAY
module argument (and this would possibly work nicely with modular
implicits, and possibly with a more aggressive inliner).

This is quite independent from the current discussion, but perhaps it
shows that we shouldn't treat "Bytes" too specifically compared to
other kinds of arrays.

Gabriel Scherer also replied:

I'm strongly convinced that allowing a difference in representation is
the right choice. I'm not actively pushing for using another
representation, but in my experience thinking of them as distinct is
a very good thought-experiment to design better interfaces.

I would compare this to the idea that "data created at an immutable
type might be allocated in read-only memory": I'm not pushing for this
to be implemented, but it's an excellent thought-experiment to explain
why certains use of Obj.magic are assuredly wrong.

(In my experience so far working with the stdlib, Extlib and
Batteries, I have not felt that the distinct representations were
painful. In particular, I needed Bytes.get rarely enough that the lack
of s.[n] syntax was never an issue.)


I would thus rather suggest the following interfaces:
  type 'a bytes (* = string  under -unsafe-string *)
  type string

An interesting interface change would be the conversion functions. We
currently have:

  Bytes.of_string : string -> bytes
  Bytes.to_string : bytes -> string

  Bytes.copy : bytes -> bytes

  (* see Bytes documentation *)
  Bytes.unsafe_of_string : string -> bytes
  Bytes.unsafe_to_string : bytes -> string

We could instead have something like the following:

  Bytes.immut_of_string : string -> immut bytes
  Bytes.immut_to_string : immut bytes -> string

  Bytes.copy : 'a bytes -> 'b bytes

  (* same usage restrinction than Bytes.unsafe_{of,to}_string,
     see the documentation *)
  Bytes.unsafe_mut : immut bytes -> mut bytes
  Bytes.unsafe_immut : bytes bytes -> immut bytes

with the aliases

  Bytes.of_string : string -> 'a bytes
  of_string s = copy (immut_of_string s)

  Bytes.to_string : 'a bytes -> string
  to_string s = immut_to_string (copy s)

  Bytes.unsafe_to_string : mut bytes -> string
  unsafe_to_string s = immut_to_string (unsafe_immut s)

(unsafe_of_string could go away: it can only be correctly used if the
resulting bytes is used immutably, so it is superseded by the safe
immut_of_string function.)

Gerd Stolpmann also replied to the original post:

> # The problem
> 
> The new distinction between bytes and string, added in 4.02, actually
> plays on two different intuitions:
> - bytes represents (1) mutable (2) sequences of bytes
> - string represents (1) immutable (2) end-user text (which happen to
> be represented as sequence of bytes, but we could think of
> representing them as eg. Javascript strings in the future and with
> js_of_ocaml, or with ropes, etc.)

Well, I think there are different views on this: In the OCaml stdlib
there is no distinction between character and byte, and it is left to
the user how to represent text (e.g. to use multibyte UTF-8 text). From
that point of view it is clear that "string" is for immutable data, no
matter weather text or bytes, and "bytes" is for mutable data of either
kind. However, as you found out this is sometimes impractical. If you
have some data you don't want to draw a somewhat arbitrary line between
mutable and immutable appearances of it.

I also will have to convert a fairly amount of code, and esp. for
Ocamlnet I really don't see how to do this cleanly, because the kind of
data suddenly changes. For instance the HTTP client reads bytes into a
bytes buffer, but the HTTP headers have more the characteristics of
text.

> The problem is that aspects (1) and (2) are somewhat orthogonal. I
> don't think we're interested in mutable end-user texts, but I
> encountered a few notable cases of (1) immutable (2) sequences of
> bytes. The problem is: should those be typed as string, or bytes?
> 
> (There may be a difference between functions that assume their
> arguments are immutable, and function that simply guarantee that they
> won't themselves mutate their arguments. For now I'll assume those two
> cases count as "immutable sequences of bytes").
> 
> Right now, the standard library itself does a strange job of making a
> choice. The Marshal module (
> http://caml.inria.fr/pub/docs/manual-ocaml/libref/Marshal.html )
> appears to favor the choice of "bytes" for non-mutated byte sequences
> (eg. data_size, total_size), while the Digest module (
> http://caml.inria.fr/pub/docs/manual-ocaml/libref/Digest.html )
> remained in the land of strings.

I also noticed that some functionality is now available over two
interfaces. In particular, there are now functions for writing from a
bytes buffer and for writing from a string buffer (e.g. Unix.write and
Unix.write_substring). My thinking is that all stdlib functions should
now be provided in that manner.

> # An ideal solution
> 
> In an ideal world, I claim the best solution would be the following.
> Given that it is clear (to me) that mutable byte sequences and
> immutable byte sequences share the same representation, we should use
> phantom type to distinguish them:
> 
>   type mut
>   type immut
>   type 'a bytes
> 
>   val get : 'a bytes -> int -> char
>   val set : mut bytes -> int -> char
>   Digest.t = immut bytes
> 
> Using phantom types had been considered at the time of the
> bytes/string split, but rejected because suddenly adding polymorphism
> to string literals and string functions broke a lot of code ("The type
> of this expression, ..., contains a type variable that cannot be
> generalized", or suddenly-polymorphic method return types). More
> importantly, we do not want to enforce string and bytes to always have
> the same underlying representation. Neither arguments hold for
> mutable/immutable bytes.

A new variant! The problem is still that it introduces polymorphisms.

For OCamlnet I was more thinking of providing all internal string
functionality also with string_reader interface. A string_reader is a
little abstraction on top of string/bytes/char bigarray that abstracts
the representation, and provides the most needed functions (at least
what String/Bytes have, plus extensions like searching, conversions,
maybe even regexps). I think that's the missing piece to make the
"bytes" change acceptable:

module String_reader : sig
  type t
  val for_string : string -> t
  val for_bytes : bytes -> t
  val for_memory : (char,...,...) Bigarray.Array1.t -> t

  val get : t -> int -> char
  val sub_string : t -> int -> int -> string
  val sub_bytes : t -> int -> int -> bytes
  val blit_to_bytes : t -> int -> bytes -> int -> int -> unit
  val blit_to_memory : t -> int -> (char,...) Bigarray.Array1.t -> int
-> int -> unit

  val index : t -> char -> int

  val search_leftmost : t -> t -> int -> int
  val search_rightmost : t -> t -> int -> int

  val get_int32_le : t -> int -> int32
  val get_int32_be : t -> int -> int32
  val get_int64_le : t -> int -> int64
  val get_int64_be : t -> int -> int64

  (* plus more ... not sure yet what to cover exactly *)

end

Thoughts on targeting windows

Archive: https://sympa.inria.fr/sympa/arc/caml-list/2014-10/msg00020.html

Anil Madhavapeddy announced, continuing this old thread:

At a recent compiler hacking session in Cambridge, Don Syme pointed
out a great Travis-like service for running regular Windows builds
called Appveyor(.com).  In order to get more familiar with the Windows
toolchain, I ported some of Thomas Braibant's instructions for
compiling OPAM on Windows using it here:

Cygwin scripts: https://github.com/ocaml/opam/blob/master/appveyor.yml
Build output:   https://ci.appveyor.com/project/avsm/opam/build/1.0.38

Appveyor can be used much like Travis and build every Git checkin on
Windows, except that they unfortunately overwrite each other's status
flags (the green tick or red cross against each commit), so they
cannot be simultaneously used on one GitHub repository right now.
I contacted GitHub support and they indicated that they are adding
support for multiple CI tools into the UI, but do not have a time
estimate for when that would be ready.

In the meanwhile though, I hope Appveyor comes in useful for anyone
wanting to automate Windows testing via a free hosted service.  Pull
requests to improve OPAM's Appveyor scripts in this regard (for MinGW
or Cygwin or ideally testing them both) would be welcome.

Other OCaml News

From the ocamlcore planet blog:

Thanks to Alp Mestan, we now include in the OCaml Weekly News the links to the
recent posts from the ocamlcore planet blog at http://planet.ocaml.org/.

Recursive list:
  http://shayne-fletcher.blogspot.com/2014/10/recursive-list.html

Ocsigen: releases of Js_of_ocaml, Eliom, Server, Macaque, Tyxml:
  http://ocsigen.org/

Senior Software Engineer at McGraw-Hill Education (Full-time):
  http://functionaljobs.com/jobs/8744-senior-software-engineer-at-mcgraw-hill-education

Old cwn

If you happen to miss a CWN, you can send me a message and I'll mail it to you, or go take a look at the archive or the RSS feed of the archives.

If you also wish to receive it every week by mail, you may subscribe online.

Alan Schmitt