Re: [containers-users] Possible additions to Containers and Friends

2018-03-06 Thread SP

Speaking-too-soon is a valid and powerful code verification technique; it 
exploits tempting the bugs to make their move.

--
SP
___
Containers-users mailing list
Containers-users@lists.ocaml.org
http://lists.ocaml.org/listinfo/containers-users


Re: [containers-users] Possible additions to Containers and Friends

2018-03-06 Thread Simon Cruanes
Of course I spoke too soon, and missed some validation cases (that would
have been accepted by Peter's code).

In particular, I just learnt about some interesting corner cases of UTF8,
namely overlong encodings.
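
To make the overlong case concrete: '/' (U+002F) fits in the single byte 0x2F, but the two bytes 0xC0 0xAF carry the same payload bits, and a strict decoder has to reject them. Below is a minimal sketch of such a check; `min_code_for_length` and `is_overlong` are hypothetical names used for illustration only, not functions from Containers or Uutf.

```ocaml
(* Hypothetical helper: smallest code point that legitimately needs
   [n] bytes in standard UTF-8 (n = 1..4). *)
let min_code_for_length = function
  | 1 -> 0x0
  | 2 -> 0x80
  | 3 -> 0x800
  | 4 -> 0x10000
  | _ -> invalid_arg "min_code_for_length"

(* [is_overlong ~len code] is true when [code] was decoded from [len]
   bytes although a shorter encoding exists. *)
let is_overlong ~len code = code < min_code_for_length len

(* Payload bits of the two-byte sequence 0xC0 0xAF: an overlong '/'. *)
let overlong_slash = ((0xC0 land 0x1f) lsl 6) lor (0xAF land 0x3f)
```

A decoder that already knows how many bytes it consumed can call `is_overlong` on the assembled value before returning it.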

If anyone is knowledgeable about UTF8, reviewing the code would be
greatly appreciated!

Cheers! :)


-- 
Simon Cruanes

http://weusepgp.info/
key 49AA62B6, fingerprint 949F EB87 8F06 59C6 D7D3  7D8D 4AC0 1D08 49AA 62B6




Re: [containers-users] Possible additions to Containers and Friends

2018-03-06 Thread Simon Cruanes
I merged and adapted the code from Peter:

https://github.com/c-cube/ocaml-containers/blob/master/src/core/CCUtf8_string.mli
https://github.com/c-cube/ocaml-containers/blob/master/src/core/CCUtf8_string.ml

It's stricter (it only accepts valid UTF8), and the random tests should
ensure that it agrees with Uutf on what is valid UTF8, and on the list
of codepoints of a valid UTF8 string.

The code is not that complicated: encoding is 25 lines, decoding is 67
lines. I had to rewrite part of it to make it strictly UTF8 compliant.

Comments very welcome! And thanks again to Peter, without whom I'd never
have had the courage to do it.

-- 
Simon Cruanes





Re: [containers-users] Possible additions to Containers and Friends

2018-03-01 Thread peter frey


> I'm not sure I understand, what is the point of supporting "more" than
> utf8?

In the original utf8 standard the encoding is:
The code is encoded as a string of length 1 + additional length.
The additional length is a unary encoding in the leading bits of the first
char: '110' to '1111110' (i.e.: total length 2 .. 6); a leading '0' means
there are no additional chars.

The first char supplies 1 to 7 bits; the following chars supply 6 bits each.
The maximal # bits is 31 bits (5 * 6 + the low bit of byte 0).
I am using this encoding but it is no longer 'standard'.  Instead the range
  0xD7FF .. 0xE000 (exclusive, i.e. the surrogates) is excluded from the
  TOTAL range 0 .. 0x10FFFF.
In Uutf only this range is accepted.
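
The two ranges described above can be written down directly. This is only a sketch: `utf31_length` and `is_unicode_scalar` are illustrative names, and the literal 0x7FFFFFFF assumes a 64-bit OCaml (it does not fit in a 31-bit int).

```ocaml
(* Number of bytes the original 31-bit scheme uses for a code in
   0 .. 0x7FFFFFFF.  The payload sizes are 7, 11, 16, 21, 26 and 31 bits. *)
let utf31_length k =
  if k < 0 || k > 0x7FFFFFFF then invalid_arg "utf31_length"
  else if k <= 0x7f then 1
  else if k <= 0x7ff then 2
  else if k <= 0xffff then 3
  else if k <= 0x1fffff then 4
  else if k <= 0x3ffffff then 5
  else 6

(* Current standard: at most 0x10FFFF, with the surrogates
   0xD800 .. 0xDFFF excluded. *)
let is_unicode_scalar k =
  (k >= 0 && k <= 0xD7FF) || (k >= 0xE000 && k <= 0x10FFFF)
```

Every value accepted by `is_unicode_scalar` has `utf31_length` at most 4, which is why the current standard never needs the 5 and 6 byte forms.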


All I tried to say is: My code does not encode the current standard; in
fact it does little checking. (Encodes more - checks less).

Calling it utf31 would be an informal way of signaling this;
we can call anything what we want to call it.

I will write a filter that does verification; especially:
A code that has length 1 + n must have n bytes following with format
10xxxxxx;  if the decoder encounters 0xxxxxxx or x1xxxxxx or end of
string, that is an error.
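
A rough sketch of that continuation-byte check, using only the standard library. The names `is_continuation` and `check_continuations` are made up for this example and are not the filter itself.

```ocaml
(* A continuation byte has the form 10xxxxxx. *)
let is_continuation c = Char.code c land 0xC0 = 0x80

(* [check_continuations s i n] checks that the [n] bytes following
   position [i] exist and are all continuation bytes.  It returns false
   on 0xxxxxxx, on x1xxxxxx and when the string ends too early. *)
let check_continuations s i n =
  let rec go j =
    j > n
    || (i + j < String.length s && is_continuation s.[i + j] && go (j + 1))
  in
  go 1
```

For instance `check_continuations "\xC3\xA9" 0 1` holds, while the truncated string `"\xC3"` and the ill-formed `"\xC3A"` both fail.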

Uutf replaces such sequences with an error code.

peter




Re: [containers-users] Possible additions to Containers and Friends

2018-02-26 Thread peter frey
Simon occasionally includes code from some other part of the libraries
to avoid requiring, say, Gen to access Sequence or Containers; I don't
remember offhand.  In the case of some tiny piece of code that's
sensible. (And so far that is all I have provided.)


The standard library now has a type Uchar.t, which Uutf uses consistently.
In the case that one is not dealing with the many exigencies that Uutf
deals with, that's a bit overkill?  Alain Frisch used ints in Ulex and I
remember another usage of uchar only in Camomile.


Uutf has a smaller range of codes that it accepts; namely Utf8.
That's 0 to 0x10FFFF, i.e. (1024 * 1024) + (64 * 1024) codes, where a
portion in the middle of that range (the surrogates) is also excluded.


Originally Utf8 encoded all possible codes in the positive int32 range.
I prefer to revert to the old standard (and call it Utf31), since this
allows me to encode alphabets that are larger than, but include, the
utf8 range.  (This may not work with js_of_ocaml; but not all
applications involve the web.)

When it comes to comparing utf8 chars (codes; ints) consider the following:

utop # ((=));;
- : 'a -> 'a -> bool = <fun>      after loading utop
utop # open Containers;;
utop # ((=));;
- : int -> int -> bool = <fun>    type is overloaded
utop #

(I noticed this also in Jane Street's "Sequence".  Possibly they want
to avoid 'polymorphic' comparisons; i.e.: comparisons that examine the
internal structure.)
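
The same effect can be reproduced without Containers. In the sketch below, `Mono` is a hypothetical stand-in for a library that shadows `(=)` with an int-only version; it is not the actual Containers code, only an illustration of the mechanism and of what client code then has to do for other types.

```ocaml
(* Hypothetical stand-in for a library that shadows (=) with an
   int-only version, so that opening it hides polymorphic equality. *)
module Mono = struct
  let ( = ) : int -> int -> bool = fun a b -> compare a b = 0
end

open Mono

(* Code points represented as ints can still be compared with (=). *)
let same_code (a : int) (b : int) = a = b

(* Other types need an explicit, type-specific equality. *)
let same_text a b = String.equal a b
```

With `Mono` opened, writing `"a" = "b"` is rejected by the type checker, so an accidental structural comparison shows up at compile time.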


Did you intend to do that?

Peter


On 2018-02-25 05:12 AM, SP wrote:

On Sat, Feb 24, 2018 at 12:21:52PM -0600, Simon Cruanes wrote:

We could build on uutf, it's relatively small and doesn't have too many
deps. However, I also don't think utf8 is that complicated that we
couldn't just redo the codepoint<-> byte conversions in a simpler


Make it uutf compatible then, so one can either use uutf for full
functionality or use a few basic converters provided in Containers.





Re: [containers-users] Possible additions to Containers and Friends

2018-02-25 Thread Simon Cruanes
Well, there's the standard uchar type, I think compatibility is achievable :)


Re: [containers-users] Possible additions to Containers and Friends

2018-02-24 Thread Simon Cruanes
Le Sat, 24 Feb 2018, Drup wrote:
> Shouldn't we just standardize on Bünzli's libraries (including the new
> https://github.com/dbuenzli/utext) instead of trying to re-write code that
> usually ends up being quite subtle in each standard library ?

We could build on uutf, it's relatively small and doesn't have too many
deps. However, I also don't think utf8 is that complicated that we
couldn't just redo the codepoint<-> byte conversions in a simpler (and
arguably lower overhead) way. In particular,
`Utf8string.to_seq : t -> codepoint sequence` could be faster than
calling uutf with all its poly variants.

For utext, meh. It's not stable yet, and relies on a complicated non
standard underlying vec structure. Ustring (or Utf8string, possibly, as
it's the only reasonable string to support) should be a (possibly
private) alias to string. There should be a similar Utf8buffer where you
can push/pop codepoints and append other Utf8buffers.
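
To illustrate the "private alias to string" idea, here is a minimal sketch. The module name `Utf8_string`, its functions, and the `valid` argument are all placeholders invented for this example; the validity check is supplied by the caller, so nothing here should be read as the eventual Containers API.

```ocaml
(* Sketch only: a string type whose values can be read as plain strings
   but can only be created through a checking constructor. *)
module Utf8_string : sig
  type t = private string
  val of_string : valid:(string -> bool) -> string -> t option
  val to_string : t -> string
end = struct
  type t = string
  let of_string ~valid s = if valid s then Some s else None
  let to_string s = s
end

(* Placeholder check used in the usage note below: accepts ASCII only. *)
let ascii_only s =
  let ok = ref true in
  String.iter (fun c -> if Char.code c > 127 then ok := false) s;
  !ok
```

Because the alias is private, a value `u : Utf8_string.t` can be passed to any string function through the coercion `(u :> string)` at no runtime cost, while `Utf8_string.of_string ~valid:ascii_only "abc"` is the only way to build one.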


-- 
Simon Cruanes





Re: [containers-users] Possible additions to Containers and Friends

2018-02-22 Thread SP
> Thanks for the suggestions. I'm no expert in unicode, but I do agree
> that such basic functionalities should be more easily available.
> Maybe a `Ustring` module in containers would make sense (as a private
> alias to `string`); most functionalities below would fit there

Is this for facilitation or implementation of UTF?

There is an implementation here 

-- 
SP


[containers-users] Possible additions to Containers and Friends

2018-02-10 Thread peter frey

(*
Reading recent posts on discuss.ocaml.org gives me the impression that a
small number of utf related routines should be more easily available.
Container's Sequence.t and Gen.t, in particular, could benefit from a couple
of simple routines.  The code below fits well into that framework.
I am treating it as public domain code but feel free to make it your own and
to include it where appropriate. (Perhaps some of the tests could go into the
example directory...)
The routines here DO NOT verify unless it's unavoidable.  In particular they
accept ALL code points that can be encoded by the original Utf8 definition.
It is only a matter of language; we could call it utf31 ...
Restricting the range is trivial, as would be including some verification
code.
*)

open Containers

(* Create a generator from a utf8-string. Each call produces a code point.
 * The optional parameter srcIdx specifies the start point in the string.
 * srcIdx must point to a valid suffix of a utf8 string.
 * *)
let gen_of_utf8 ?(srcIdx=ref 0) str =
  let lim = String.length str in
  let assemble_next () =
    (* cv is reached only for multi-byte characters *)
    let cv jmax accu =  (* jmax: following chars; accu: bits of 1st char *)
      let rec cv' j accu' =  (* inner loop j = 1..jmax *)
        let ch = Char.code str.[ !srcIdx + j] in
        (* except for the 1st, each char gives 6 bits *)
        let next = (accu' lsl 6) lor (ch land 0x3f) in
        if j = jmax then begin
          srcIdx := !srcIdx + j + 1;    (* +1 for the 1st char *)
          Some next
        end else cv' (succ j) next
      in cv' 1  (* 1st char is already processed! *) accu
    in
    if !srcIdx >= lim then None else
    let n = str.[ !srcIdx ] in
    match n with
    (* 0xxxxxxx *) | '\000' .. '\127' -> incr srcIdx; Some (int_of_char n)
    (* no check: a stray 10xxxxxx byte also lands in the next case *)
    (* 110yyyyy *) | '\128' .. '\223' -> cv 1 ((Char.code n) land 0b11111 )
    (* 1110zzzz *) | '\224' .. '\239' -> cv 2 ((Char.code n) land 0b1111 )
    (* 11110uuu *) | '\240' .. '\247' -> cv 3 ((Char.code n) land 0b111 )
    (* 111110vv *) | '\248' .. '\251' -> cv 4 ((Char.code n) land 0b11 )
    (* 1111110w *) | '\252' .. '\253' -> cv 5 ((Char.code n) land 0b1 )
    (* 1111111X *) | '\254' .. '\255' -> raise (Failure "Bad stream")
  in assemble_next;;


(* The 'natural' stream representation of a utf-string is a generator.
 * But Sequences are not far away ... *)
let makeUtf8Seq ?(srcIdx=ref 0) str =
  Sequence.of_gen (gen_of_utf8 ~srcIdx str)




(* Convert a code point to a string; Hopefully some day this will be in the
 * standard library. There are various equally trivial versions of this
 * around.
 * The returned string is created (allocated) fresh for each k.
 * *)

(* Note: the result has type Bytes.t; it is a string only when
 * -safe-string is off. *)
let code_to_string k =
  let mask = 0b111111 in  (* 6 payload bits per continuation char *)
  if k < 0 || k >= 0x4000000 then begin
    (* 6 chars; Char.chr (not unsafe_chr) rejects codes beyond 31 bits *)
    let s = Bytes.create 6 in
    Bytes.unsafe_set s 0 (Char.chr (0xfc + (k lsr 30)));
    Bytes.unsafe_set s 1 (Char.unsafe_chr (0x80 lor ((k lsr 24) land mask)));
    Bytes.unsafe_set s 2 (Char.unsafe_chr (0x80 lor ((k lsr 18) land mask)));
    Bytes.unsafe_set s 3 (Char.unsafe_chr (0x80 lor ((k lsr 12) land mask)));
    Bytes.unsafe_set s 4 (Char.unsafe_chr (0x80 lor ((k lsr 6) land mask)));
    Bytes.unsafe_set s 5 (Char.unsafe_chr (0x80 lor (k land mask)));
    s end
  else if k <= 0x7f then
    Bytes.make 1 (Char.unsafe_chr k)
  else if k <= 0x7ff then begin
    let s = Bytes.create 2 in
    Bytes.unsafe_set s 0 (Char.unsafe_chr (0xc0 lor (k lsr 6)));
    Bytes.unsafe_set s 1 (Char.unsafe_chr (0x80 lor (k land mask)));
    s end
  else if k <= 0xffff then begin
    let s = Bytes.create 3 in
    Bytes.unsafe_set s 0 (Char.unsafe_chr (0xe0 lor (k lsr 12)));
    Bytes.unsafe_set s 1 (Char.unsafe_chr (0x80 lor ((k lsr 6) land mask)));
    Bytes.unsafe_set s 2 (Char.unsafe_chr (0x80 lor (k land mask)));
    s end
  else if k <= 0x1fffff then begin
    let s = Bytes.create 4 in
    Bytes.unsafe_set s 0 (Char.unsafe_chr (0xf0 + (k lsr 18)));
    Bytes.unsafe_set s 1 (Char.unsafe_chr (0x80 lor ((k lsr 12) land mask)));
    Bytes.unsafe_set s 2 (Char.unsafe_chr (0x80 lor ((k lsr 6) land mask)));
    Bytes.unsafe_set s 3 (Char.unsafe_chr (0x80 lor (k land mask)));
    s end
  else begin
    let s = Bytes.create 5 in
    Bytes.unsafe_set s 0 (Char.unsafe_chr (0xf8 + (k lsr 24)));
    Bytes.unsafe_set s 1 (Char.unsafe_chr (0x80 lor ((k lsr 18) land mask)));
    Bytes.unsafe_set s 2 (Char.unsafe_chr (0x80 lor ((k lsr 12) land mask)));
    Bytes.unsafe_set s 3 (Char.unsafe_chr (0x80 lor ((k lsr 6) land mask)));
    Bytes.unsafe_set s 4 (Char.unsafe_chr (0x80 lor (k land mask)));
    s end

let string_to_code str =
    let cv jmax accu =  (* utf8 character length; construction of uchar *)
    (* chars 1..jmax are read below, so str needs more than jmax chars *)
    if jmax >= String.length str then raise (Failure "string_to_code")
    else let rec cv' j accu' =   (* inner loop j = 1..jmax ; each uchar *)
  let ch = Char.code (String.unsafe_get str j) in
  let next = (