Re: [polyml] Re: reading a file on byte level

2012-10-17 Thread Makarius

On Wed, 17 Oct 2012, David Matthews wrote:


Michael,
It's probably a good idea to send questions like this to the Poly/ML mailing 
list and I'm copying this reply to the list.  There are people on the list 
who may have experience of what you're trying to do and also any replies get 
archived and reduce the chances of the same question being asked repeatedly.


To reply to your specific question: there's no difference in efficiency 
between TextIO and BinIO on Unix.  On Windows TextIO is more complicated 
because of the need to convert between CRLF and \n.  I'm not clear how you 
are accessing the file, whether it is random or sequential.  To read 
individual bytes you can always use BinIO.input1 or TextIO.input1.  This 
would save you having to explode a vector.


BinIO deals with values of type Word8.word and the vector is 
Word8Vector.vector.  These are unsigned byte values.


Regards,
David

On 16/10/2012 20:59, Michael Moeller wrote:

Hi,

I've got a problem regarding huge files as input. I need to get access to
the single byteslike with fgetc() in C. TextIO and subsequently exploding
the string does the jobbut performance matters and it seems to me like
BinIOsaves me from the 'explode' step.Unfortunately I can't figure out
how to access theresulting BinIO.vector. I'd be appreciative to seean
exampleand please tell me if this is not the right approach.

Regards,
Michael


David probably intended to send this back to Michael Moeller, and just got 
the M completion wrong.



Nonetheless, I can offer two hints from the Isabelle/ML code base:

 (1) Combinators fold_lines and fold_pages to access large text files
 in a scalable manner:

 
http://isabelle.in.tum.de/repos/isabelle/file/Isabelle2012/src/Pure/General/file.ML#l132

 (2) Some applications of BinIO.vector over sockets, which had to be
 trashed for compatibility with old Poly/ML 5.4.x versions:

 
http://isabelle.in.tum.de/repos/isabelle/file/63144ea111f7/src/Pure/System/system_channel.ML
 
http://isabelle.in.tum.de/repos/isabelle/file/8b20be429cf3/src/Pure/System/system_channel.ML


Makarius
___
polyml mailing list
polyml@inf.ed.ac.uk
http://lists.inf.ed.ac.uk/mailman/listinfo/polyml


Re: [polyml] Re: reading a file on byte level

2012-10-17 Thread Makarius

On Wed, 17 Oct 2012, Michael Moeller wrote:

Thanks for the links, I'll look into it. Talking about large amounts of 
data, do you happen to have an example of using HashArray? This would be 
great.


Maybe someone else on the list has experience with hash tables.


In Isabelle/ML we do the large data management in memory as immutable 
tree structures.  This allows to conserve a lot of memory in Poly/ML 5.5.0 
in particular, where David provides this exceedingly nice online sharing 
of old/life data on the heap.  It also works well with multicore / 
multithread programming to have everything persistent, immutable, sharable 
by default.



The main workhorse for map/set-like structures in Isabelle/ML is 
http://isabelle.in.tum.de/repos/isabelle/file/Isabelle2012/src/Pure/General/table.ML


This is a quite efficient and reasonably simple implementation of 2-3 
trees.  When starting that in 1999, I've first had a look at the red-black 
tree implementation in SML/NJ (ported from C) and was not convinced -- 
later some people found actual problems in it that might be still there.


Another prerequisite of efficiency are reasonable order functions defined 
on the key values for indexing tables.  E.g. for type string we look at 
the length in the descriptor first, before walking through the memory to 
compare the text (you cannot do this in C).  Same for complex types like 
lists, trees, lambda-terms etc.: first compare superficial outline 
structure, then walk down to the content.



Since red-black trees are more present in the public these days (why?), 
several people have worked on this verified implementation using 
Isabelle/HOL: 
http://isabelle.in.tum.de/repos/isabelle/file/Isabelle2012/src/HOL/Library/RBT_Impl.thy 
http://isabelle.in.tum.de/repos/isabelle/file/Isabelle2012/src/HOL/Library/RBT.thy


You can backport that to SML manually, or use the Isabelle/HOL generator 
to produce SML, Haskell, Scala etc. -- the result would not look much 
worse than the C - SML translation performed by hand in SML/NJ some years 
ago.



Makarius
___
polyml mailing list
polyml@inf.ed.ac.uk
http://lists.inf.ed.ac.uk/mailman/listinfo/polyml