Re: [containers-users] Possible additions to Containers and Friends

2018-03-06 Thread SP

Speaking-too-soon is a valid and powerful code verification technique; it 
exploits tempting the bugs to make their move.

--
SP
___
Containers-users mailing list
Containers-users@lists.ocaml.org
http://lists.ocaml.org/listinfo/containers-users


Re: [containers-users] Possible additions to Containers and Friends

2018-03-06 Thread Simon Cruanes
Of course I spoke too soon, and missed some validation cases (that would
have been accepted by Peter's code).

In particular, I just learnt about some interesting corner cases of UTF8,
namely overlong encodings.
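
For readers who have not met the term: an overlong encoding spends more bytes than a code point needs, e.g. the two bytes 0xC0 0x80 for U+0000. A strict decoder rejects these by comparing the decoded value with the smallest value that legitimately needs that many bytes. The following is only a minimal sketch for the two-byte case; `min_code_for_length` and `decode2` are hypothetical names, not taken from CCUtf8_string.

```ocaml
(* Smallest code point that legitimately needs [n] bytes in UTF-8. *)
let min_code_for_length = function
  | 1 -> 0x0
  | 2 -> 0x80
  | 3 -> 0x800
  | 4 -> 0x10000
  | _ -> invalid_arg "min_code_for_length"

(* Decode a 2-byte sequence, rejecting overlong forms.
   Hypothetical helper, for illustration only. *)
let decode2 (b0 : char) (b1 : char) : int option =
  let b0 = Char.code b0 and b1 = Char.code b1 in
  if b0 land 0xE0 <> 0xC0 || b1 land 0xC0 <> 0x80 then None
  else
    let code = ((b0 land 0x1F) lsl 6) lor (b1 land 0x3F) in
    if code < min_code_for_length 2 then None (* overlong *)
    else Some code
```

A lenient decoder that skips the final comparison would accept 0xC0 0x80 and return 0.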

If anyone is knowledgeable about UTF8, reviewing the code would be
greatly appreciated!

Cheers! :)


-- 
Simon Cruanes

http://weusepgp.info/
key 49AA62B6, fingerprint 949F EB87 8F06 59C6 D7D3  7D8D 4AC0 1D08 49AA 62B6




Re: [containers-users] Possible additions to Containers and Friends

2018-03-06 Thread Simon Cruanes
I merged and adapted the code from Peter:

https://github.com/c-cube/ocaml-containers/blob/master/src/core/CCUtf8_string.mli
https://github.com/c-cube/ocaml-containers/blob/master/src/core/CCUtf8_string.ml

It's stricter (it only accepts valid UTF8), and the random tests should
ensure that it agrees with Uutf on what is valid UTF8 and on the list
of codepoints of a valid UTF8 string.

The code is not that complicated: encoding is 25 lines, decoding is 67
lines. I had to rewrite part of it to make it strictly UTF8 compliant.

Comments very welcome! And thanks again to Peter, without whom I'd never
have had the courage to do it.

-- 
Simon Cruanes

http://weusepgp.info/
key 49AA62B6, fingerprint 949F EB87 8F06 59C6 D7D3  7D8D 4AC0 1D08 49AA 62B6




Re: [containers-users] Possible additions to Containers and Friends

2018-03-01 Thread peter frey


> I'm not sure I understand, what is the point of supporting "more" than
> utf8?

In the original utf8 standard the encoding is:
The code is encoded as a string of length 1 + additional length.
The additional length is given by a unary prefix in the first byte, from
'110' to '1111110' (i.e. total lengths 2 .. 6); a leading '0' bit marks a
single-byte code.

The first char supplies 1 to 7 bits; the following chars supply 6 bits each.
The maximal # bits is 31 bits (5 * 6 + the low bit from byte 0).
I am using this encoding but it is no longer 'standard'.  Instead, the
current standard caps the total range at 0 .. 0x10FFFF and excludes the
surrogate range 0xD800 .. 0xDFFF (everything strictly between 0xD7FF and
0xE000).
Uutf accepts only this range.
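
To make the byte layout concrete, here is a rough sketch of an encoder for the original 1 to 6 byte scheme described above. It is not Peter's code; `length_of_code` and `encode31` are hypothetical names, and it assumes a 64-bit `int` so that all 31-bit codes fit.

```ocaml
(* Number of bytes needed for [code] under the original 1..6 byte scheme.
   Assumes 64-bit OCaml ints. *)
let length_of_code (code : int) : int =
  if code < 0 || code > 0x7FFFFFFF then invalid_arg "length_of_code"
  else if code < 0x80 then 1
  else if code < 0x800 then 2
  else if code < 0x10000 then 3
  else if code < 0x200000 then 4
  else if code < 0x4000000 then 5
  else 6

let encode31 (code : int) : string =
  let n = length_of_code code in
  if n = 1 then String.make 1 (Char.chr code)
  else begin
    let b = Bytes.create n in
    let c = ref code in
    (* Continuation bytes, filled from the end: 10xxxxxx, 6 bits each. *)
    for i = n - 1 downto 1 do
      Bytes.set b i (Char.chr (0x80 lor (!c land 0x3F)));
      c := !c lsr 6
    done;
    (* Lead byte: n one-bits, a zero bit, then the remaining payload bits. *)
    let prefix = (0xFF lsl (8 - n)) land 0xFF in
    Bytes.set b 0 (Char.chr (prefix lor !c));
    Bytes.to_string b
  end
```

Note that this sketch deliberately does no range checking beyond 31 bits, so it will happily encode surrogates and codes above 0x10FFFF that a standard UTF-8 decoder rejects.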


All I tried to say is: My code does not encode the current standard; in
fact it does little checking. (Encodes more - checks less).

Calling it utf31 would be an informal way of signaling this;
we can call anything what we want to call it.

I will write a filter that does verification; especially:
A code that has length 1 + n must have n bytes following with format
10xxxxxx; if the decoder encounters 0xxxxxxx or 11xxxxxx or end of
string, that is an error.

Uutf replaces such sequences with an error code.
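
The structural check described above could look roughly like this. It is only a sketch under the 1 to 6 byte scheme; the `error` type and the names `following_bytes` and `check` are made up for illustration, and it checks byte structure only (no overlong or range checks).

```ocaml
type error =
  | Bad_lead of int          (* byte at this index cannot start a code *)
  | Bad_continuation of int  (* byte at this index is not 10xxxxxx *)
  | Truncated of int         (* code starting at this index runs past the end *)

(* Number of continuation bytes announced by a lead byte. *)
let following_bytes (c : char) : int option =
  let b = Char.code c in
  if b land 0x80 = 0 then Some 0
  else if b land 0xE0 = 0xC0 then Some 1
  else if b land 0xF0 = 0xE0 then Some 2
  else if b land 0xF8 = 0xF0 then Some 3
  else if b land 0xFC = 0xF8 then Some 4
  else if b land 0xFE = 0xFC then Some 5
  else None (* 10xxxxxx, 0xFE and 0xFF cannot start a code *)

let check (s : string) : (unit, error) result =
  let len = String.length s in
  let rec go i =
    if i >= len then Ok ()
    else
      match following_bytes s.[i] with
      | None -> Error (Bad_lead i)
      | Some n ->
        (* Walk the n announced continuation bytes starting at j. *)
        let rec cont j remaining =
          if remaining = 0 then go j
          else if j >= len then Error (Truncated i)
          else if Char.code s.[j] land 0xC0 = 0x80 then
            cont (j + 1) (remaining - 1)
          else Error (Bad_continuation j)
        in
        cont (i + 1) n
  in
  go 0
```

Returning the offending index leaves it to the caller to decide whether to stop or to substitute a replacement character.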

peter




Re: [containers-users] Possible additions to Containers and Friends

2018-02-26 Thread peter frey
Simon occasionally includes code from some other part of the libraries
to avoid requiring, say, Gen to access Sequence or Containers; I don't
remember offhand.  In the case of some tiny piece of code that's
sensible. (And so far that is all I have provided)


The standard library now has a type Uchar.t, which Uutf uses consistently.
In the case that one is not dealing with the many exigencies that Uutf deals
with, that's a bit overkill?  Alain Frisch used ints in Ulex and I
remember another usage of uchar only in Camomile.


Uutf has a smaller range of codes that it accepts; namely Utf8.
That's 0 to (1024 * 1024) + (64 * 1024) - 1 = 0x10FFFF, where the 64k
portion at the end is also excluded from the 1Mb range before.
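
As a small illustration of the range in question, the standard library's `Uchar.is_valid` accepts exactly the scalar values 0 .. 0xD7FF and 0xE000 .. 0x10FFFF. The predicate below spells that out by hand; `is_scalar_value` is a hypothetical name used only for this sketch.

```ocaml
(* Valid Unicode scalar values: 0 .. 0x10FFFF minus the surrogates. *)
let is_scalar_value (code : int) : bool =
  (0 <= code && code <= 0xD7FF) || (0xE000 <= code && code <= 0x10FFFF)

(* The upper bound is 1M + 64k code points, counting from 0. *)
let () = assert (0x10FFFF + 1 = (1024 * 1024) + (64 * 1024))

(* Agrees with the standard library on a few boundary values. *)
let () =
  List.iter
    (fun c -> assert (is_scalar_value c = Uchar.is_valid c))
    [ 0; 0xD7FF; 0xD800; 0xDFFF; 0xE000; 0x10FFFF; 0x110000 ]
```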


Originally Utf8 encoded all possible codes in the positive int32 range.
I prefer to revert to the old standard (and call it Utf31), since this
allows me to encode alphabets that are larger than, but include, the 
utf8 range.  (This may not work with js_of_ocaml; but not all 
applications involve the web)

When it comes to comparing utf8 chars (codes; ints) consider the following:

utop # ((=));;
- : 'a -> 'a -> bool = <fun>        after loading utop
utop # open Containers;;
utop # ((=));;
- : int -> int -> bool = <fun>      type is overloaded
utop #

(I noticed this also in Jane Street's "Sequence".  Possibly they want
to avoid 'polymorphic' comparisons; i.e.: comparisons that examine the
internal structure)


Did you intend to do that?
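
For anyone following along, the effect seen in the transcript can be reproduced without Containers by shadowing `(=)` with an int-only version. This is just a sketch of the general technique; the module name `Mono` is invented here and is not how Containers itself is organised, and `Stdlib` assumes OCaml 4.07 or later.

```ocaml
(* A module that shadows polymorphic equality with an int-only version.
   [Mono] is a hypothetical name for this sketch. *)
module Mono = struct
  let ( = ) : int -> int -> bool = Stdlib.( = )
end

(* Codes are plain ints, so the specialised operator is all we need. *)
let codes_equal (a : int) (b : int) : bool =
  let open Mono in
  a = b

(* With [Mono] opened, ["a" = "b"] would no longer type-check; other types
   use their own equality function instead. *)
let strings_equal (a : string) (b : string) : bool = String.equal a b
```

The usual motivation is that the specialised operator avoids the generic structural comparison and turns accidental comparisons of complex values into type errors.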

Peter


On 2018-02-25 05:12 AM, SP wrote:

On Sat, Feb 24, 2018 at 12:21:52PM -0600, Simon Cruanes wrote:

We could build on uutf, it's relatively small and doesn't have too many
deps. However, I also don't think utf8 is that complicated that we
couldn't just redo the codepoint<-> byte conversions in a simpler


Make it uutf compatible then, so one can either use uutf for full
functionality or use a few basic converters provided in Containers.





Re: [containers-users] Possible additions to Containers and Friends

2018-02-25 Thread Simon Cruanes
Well, there's the standard uchar type, I think compatibility is achievable :)


Re: [containers-users] Possible additions to Containers and Friends

2018-02-24 Thread Simon Cruanes
Le Sat, 24 Feb 2018, Drup wrote:
> Shouldn't we just standardize on bunzli's libraries (including the new
> https://github.com/dbuenzli/utext) instead of trying to re-write code that
> usually ends up being quite subtle in each standard library ?

We could build on uutf, it's relatively small and doesn't have too many
deps. However, I also don't think utf8 is that complicated that we
couldn't just redo the codepoint<-> byte conversions in a simpler (and
arguably lower overhead) way. In particular,
`Utf8string.to_seq : t -> codepoint sequence` could be faster than
calling uutf with all its poly variants.

For utext, meh. It's not stable yet, and relies on a complicated,
non-standard underlying vec structure. Ustring (or Utf8string, possibly, as
it's the only reasonable string to support) should be a (possibly
private) alias to string. There should be a similar Utf8buffer where you
can push/pop codepoints and append other Utf8buffers.
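
To give an idea of the shape being proposed, here is a rough sketch of a private alias to string with a validating constructor and a lazy codepoint iterator. It is not the eventual Containers API: the signature, `decode_at`, and the use of the standard `Seq` and `Uchar` modules (OCaml 4.07 or later) are all assumptions made for this example.

```ocaml
module Utf8string : sig
  type t = private string
  val of_string : string -> t option    (* None if not valid UTF-8 *)
  val to_seq : t -> Uchar.t Seq.t
end = struct
  type t = string

  (* Decode the code point starting at byte [i] (with [i] in bounds).
     Returns the code and the index of the next one, or None if malformed,
     overlong, a surrogate, or above 0x10FFFF. *)
  let decode_at (s : string) (i : int) : (int * int) option =
    let len = String.length s in
    let b0 = Char.code s.[i] in
    let n, init, min =
      if b0 < 0x80 then (0, b0, 0)
      else if b0 land 0xE0 = 0xC0 then (1, b0 land 0x1F, 0x80)
      else if b0 land 0xF0 = 0xE0 then (2, b0 land 0x0F, 0x800)
      else if b0 land 0xF8 = 0xF0 then (3, b0 land 0x07, 0x10000)
      else (-1, 0, 0)
    in
    if n < 0 || i + n >= len then None
    else begin
      (* Accumulate 6 bits from each 10xxxxxx continuation byte. *)
      let rec go j acc =
        if j > i + n then Some acc
        else
          let b = Char.code s.[j] in
          if b land 0xC0 <> 0x80 then None
          else go (j + 1) ((acc lsl 6) lor (b land 0x3F))
      in
      match go (i + 1) init with
      | Some code when code >= min && Uchar.is_valid code ->
        Some (code, i + n + 1)
      | _ -> None
    end

  let of_string (s : string) : t option =
    let len = String.length s in
    let rec check i =
      if i >= len then Some s
      else
        match decode_at s i with
        | None -> None
        | Some (_, next) -> check next
    in
    check 0

  let to_seq (s : t) : Uchar.t Seq.t =
    let rec from i () =
      if i >= String.length s then Seq.Nil
      else
        match decode_at s i with
        | Some (code, next) -> Seq.Cons (Uchar.of_int code, from next)
        | None -> assert false (* ruled out by validation in [of_string] *)
    in
    from 0
end
```

Because the alias is private, a value `u` can be read back as a plain string with `(u :> string)`, while the only way to build one is through the validating `of_string`.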


-- 
Simon Cruanes

http://weusepgp.info/
key 49AA62B6, fingerprint 949F EB87 8F06 59C6 D7D3  7D8D 4AC0 1D08 49AA 62B6




Re: [containers-users] Possible additions to Containers and Friends

2018-02-22 Thread SP
> Thanks for the suggestions. I'm no expert in unicode, but I do agree
> that such basic functionalities should be more easily available.
> Maybe a `Ustring` module in containers would make sense (as a private
> alias to `string`); most functionalities below would fit there

Is this for facilitation or implementation of UTF?

There is an implementation here 

-- 
SP