Re: [Haskell-cafe] blaze-builder and FlexibleInstances in code that aims to become part of the Haskell platform

2011-05-23 Thread Johan Tibell
On Fri, May 20, 2011 at 11:12 PM, Simon Meier iridc...@gmail.com wrote:
 There, seems to be a historical artefact here. The new Write
 abstraction in system-io-write is different from the one used in
 blaze-builder. It's type is

  data Write a = Write Int (a - Ptr Word8 - IO (Ptr Word8))

 This definition ensures that the bound on the number of bytes written
 is independent of the value being encoded. That's crucial for the
 implementation of `mapWriteByteString`. It also benefits the other
 Write combinators, as the bound can always be computed in a
 data-independent fashion. Inlining, is therefore really sufficient to
 arrive at a constant bound during compile time.

I don't see why this makes a difference, you could still do

myWrite x = Write (length x) (\ _ p - pokePokePoke p x)

 I don't see how this Write type can be emulated using `writeAtMost`, do you?

There's no difference, as I showed above. Both can result in data
dependent lengths. It's up to the programmer to make sure the length
is independent of the value being written, when so desired.

 Hmm, all my Writes are top-level function definitions annotated with
 {-# INLINE #-}. Moreover, all combinators for Writes are also inlined
 and all their calls are saturated. Therefore, I thought GHC is capable
 of optimizing away the pattern matches on the Write constructor.

You also need to make all top-level functions non-recursive but from
what I remember you did so. The case for Writes is the same as for
higher-order arguments, the call site must meet the definition site.
So if you have something like:

myWrite :: Write Word8

writeList :: Write a - [a] - ...

f xs = writeList myWrite xs

we need to make sure both myWrite and writeList are inlined into f.
The case is similar for writeAtMost. The question is what happens if
the user ever fails to get everything to inline optimally. In the
writeAtMost case just have an indirect function call instead of a
direct one. In the Write case we also have extra allocation and
indirection. We've had such problems in e.g. attoparsec. While things
should inline properly in big programs they rarely do. Same problem
exists for fusion where fusion constructors end up in the final
program although they should have been eliminated.

 I'm happy to remove Writes, if there's a superior way of sharing the
 low-level encoding code that they abstract. However, I did peek at
 Core from time to time and found that the Write constructors were
 optimized away. I currently see Writes as an expert domain to be used
 by authors of libraries like bytestring, text, aeson, blaze-html, etc.
 With appropriate documentation and benchmarks I expect them to be able
 to make good choices w.r.t. inlining and partial application.

I agree. Writes (and writeAtMost) would be the domain of experts.

If we expects write to be reused a lot it might make sense to have a
separate Write type. Note that I'd be reluctant to see dependencies
that involve I/O underneath bytestring as it's designed as a pure data
structure library (and is likely to have things involving I/O on top
of it).

Cheers,
Johan

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] blaze-builder and FlexibleInstances in code that aims to become part of the Haskell platform

2011-05-20 Thread Simon Meier
2011/5/19 Antoine Latter aslat...@gmail.com:
 On Thu, May 19, 2011 at 3:06 PM, Simon Meier iridc...@gmail.com wrote:

 The core problem that drove me towards this solution is the abundance
 of different IntX and WordX types. Each of them requiring a separate
 Write for big-endian, little-endian, host-endian, lower-case-hex, and
 uper-case-hex encodings; i.e., currently, there are

 int8BE   :: Write Int8
 int16BE :: Write Int16
 int32BE :: Write Int32
 ...
 hexLowerInt8 :: Write Int8
 ...

 and so on. As you can see
 (http://hackage.haskell.org/packages/archive/blaze-builder/0.3.0.1/doc/html/Blaze-ByteString-Builder-Word.html)
 this approach clutters the public API quite a bit. Hence, I'm thinking
 of using a separate type-class for each encoding; i.e.,


 If Johan's work on Data.Binary and rewrite rules works out, then it
 would cut the exposed API in half, which helps.

 We could then use the module and package system to further keep the
 API clean, with builders which output a specific encoding could live
 in separate modules. This could also keep the names of the functions
 short, as well.

 That would require coming up with logical divisions for the functions
 you're creating, and I don't understand the big picture enough to help
 with that.

  class BigEndian a where
    bigEndian :: Write a

 This collapses the big-endian encodings of all 10 bounded-size (signed
 and unsigned) integer types under a single name with a well-defined
 semantics. Moreover, it's standard Haskell 98. For the hex-encodings,
 I'm thinking about providing type-classes

  class HexLower a where
    hexLower :: Write a

  class HexLowerNoLead a where
    hexLowerNoLead :: Write a

  ...

 for ASCII encoding and each of the standard Unicode encodings in a
 separate module. The user can then select the right ones using
 qualified imports. In most cases, he won't even need qualification, as
 mixing different character encodings is seldomly used.


 I think we may be at cross-purposes here, and might not even be
 discussing the same thing - I would imagine that any sort of 'Builder'
 type included in the bytestring package would only provide the core
 combinators for packing data into low-level binary formats, so
 discussions about text encoding issues, converting to hexidecimal and
 Html escaping are going above my head.

 This seems like what the 'text' package was written for - to separate
 out the construction of textual data from choosing its encoding.

 Are there use-cases where the 'text' package is too slow for this sort
 of approach?

 Take care,
 Antoine

 What do you think about such an interface? Is there another catch
 hidden, I'm not seeing? BTW, note that Writes are a pure compile time
 abstraction and are thought to be completely inlined. In typical, uses
 cases there's no efficiency overhead stemming from these typeclasses.

 best regards,
 Simon



Yes, for example using the current 'text' package is sup-optimal for
dyamically generating UTF-8 encoded HTML pages. The job is simple: the
data which is originally held in standard Haskell types (e.g., String)
needs to be HTML escaped and UTF-8 encoded and sprinkled with tags in
between.

For blaze-html using blaze-builder the cost for a tag is a memcpy of
the corresponding tag and the cost for a single character is one call
to the nested case statement determining if the char needs to be
escaped (one memcpy of its escaped version) or what bytes need to be
written for UTF-8 encoding the char. This solution works with a single
output buffer.

For a solution using the text library the cost of creating the
underlying UTF-16 array is similar to the cost for blaze-builder.
However, you now also need to UTF-8 encode the UTF-16 array. This
costs you more than double, as now you also have to inspect every
character of every tag. For ~50% of your data you suddenly have to
spend a lot more effort!

I agree that the text library is a good choice for representing
Unicode data of an application. However, for high-performance
applications it pays off to think of its output in binary form and
exploit the offered shortcuts. That's where blaze-builder and the like
come in.

thanks for your input,
Simon

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] blaze-builder and FlexibleInstances in code that aims to become part of the Haskell platform

2011-05-20 Thread Simon Meier
2011/5/20 Bas van Dijk v.dijk@gmail.com:
 On 19 May 2011 10:53, Johan Tibell johan.tib...@gmail.com wrote:
 Long term we'd like to switch bytestring over
 from ForeignPtr to ByteArray#, if possible. There are currently some
 technical obstacles to such a switch

 BTW I'm working with Roman Leshchinskiy to create the
 vector-bytestring package which provides:
 type ByteString = Data.Vector.Storable.Vector Word8
 and exports the same API as the bytestring package (no support for
 lazy bytestrings yet)

 A storable vector still uses a ForeignPtr but maybe this will make the
 switch to unboxed Vectors (which use ByteArray#) easier.

 Expect some code to be up somewhere next week.

 Bas


OK, that sounds interesting. I'm looking forward to your code.

One trick, I'm using in the blaze-builder implementation is that the
current buffer is denoted by a region (pf :: Ptr Word8, pe :: Ptr
Word8) of the next free byte `pf` and the first byte after the buffer
`pe`. This only works for pinned ByteArrays/ForeignPtrs. Do you know
what the cost of such an array/foreignptr is? Moreover, after creation
I could unpin the array. Do you perhaps know if thats possible in
principle?

best regards,
Simon

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] blaze-builder and FlexibleInstances in code that aims to become part of the Haskell platform

2011-05-20 Thread Simon Meier
2011/5/20 Henning Thielemann schlepp...@henning-thielemann.de:
 Simon Meier schrieb:

 There are many providers of Writes. Each bounded-length-encoding of a
 standard Haskell value is likely to have a corresponding Write. For
 example, encoding an Int32 as a big-endian, little-endian, and
 host-endian byte-sequence is currently achieved with the following
 three functions.

   writeInt32BE :: Write Int32
   writeInt32LE :: Write Int32
   writeInt32HE :: Write Int32

 I would like to avoid naming all these encodings individually.

 Maybe this one helps:
 http://hackage.haskell.org/packages/archive/storable-endian/0.2.4/doc/html/Data-Storable-Endian.html
 ?


Thanks. I didn't know of that package. It for sure provides some
valuable input. Moreover, it pointed me to Antoine Latter's byteorder
package (http://hackage.haskell.org/package/byteorder). I might use
it, although I would rather have the ByteOrder determined at compile
time. Shouldn't GHC provide some means to determine this, as it's
compiling for a fixed architecture anyway? What experience have you
made Antoine when you implemented this package?

best regards,
Simon

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] blaze-builder and FlexibleInstances in code that aims to become part of the Haskell platform

2011-05-20 Thread Johan Tibell
Hi Simon,

On Thu, May 19, 2011 at 10:46 PM, Simon Meier iridc...@gmail.com wrote:
 Write achieves this separation, but it has some costs which I'm not
 entirely comfortable with.

 First, it leads to lots of API duplication. For every type (e.g. Word)
 we want to be able serialize we have two morally identical functions

    writeWordhost :: Word - Write
    fromWordhost :: Word - Builder

 in the API, where the latter simply calls the former and does some
 additional wrapping.

 Yes, I agree with this duplication. I'll explain below what we gain
 from it. Note that I factored out the whole Write stuff into its own
 library (system-io-write) for the bytestring integration. Therefore,
 an end-user of bytestring will only see the Builder versions except
 he's doing more low-level stuff to gain some extra performance.

There are (at least) two cases where I think the simple Builder API
must perform well for it to be usable on its own: simple loops and
sequential writes. To be specific, if the following two cases don't
compile into near optimal code, there's a compiler bug we should fix.
First, a simple loop:

f :: [Word8] - Builder
f [] = mempty
f (x:xs) = singleton x `mappend` xs

This code is already quite low level, there should be enough
information here for the compiler to emit a simple loop with one
buffer bounds check per iteration. Second, a bunch of sequential
writes:

g :: Word8 - Word8 - Word8 - Word8 - Builder
g a b c d = singleton `mappend` (b `mappend` (c `mappend` d))

This ought to compile to a single bounds check followed by for memory writes.

The user shouldn't have to get more low-level than this in these
simple examples. Today this is currently only true for the second
example, which we can solve using rewrite rules. The first example
doesn't work due to the GHC compiler bug I mentioned.


 Simon, is the reason for this duplication this comment on top of
 Blaze.ByteString.Builder.Word?

    snip

 That's one of the reasons, but not the main one. The core reason is
 that Write's provide
 an interface between implementors of the low-level bit-twiddling
 required to efficiently implement encodings like UTF-8 or Base16 and
 the providers of efficient traversal functions through (streams of)
 Haskell values. For simple traversals like

   fromWrite          :: Write a - a - Builder
   fromWriteList      :: Write a - [a] - Builder
   fromWriteUnfoldr   :: Write b - (a - Maybe (b, a)) - a - Builder

 there might be the option that GHC is clever enough and can find the
 efficient loop. However, for more complicated functions like

   mapWriteByteString :: Write Word8 - S.ByteString - Builder

 That certainly isn't the case. I'm using quite a few tricks there [3]
 to enable a tight inner loop with few live variables.

Right. So this argues for having an escape hatch, and I agree we
should have one. Write at writeAtMost are both such escape hatches and
I believe them to equal in expressiveness. This shouldn't come as a
surprise as Write is writeAtMost with one argument reified into into a
constructor field:

writeAtMost :: Int - (Ptr Word8 - IO Int) - IO ()
data Write = {-# UNPACK #-} !Int (Ptr Word8 - IO (Ptr Word8))

(That the second argument of writeAtMost is an Int instead of a Ptr
Word8 as in Write is an unimportant difference.)

There are some operational differences.

* The argument to Write can be inspected at runtime, while the
argument to writeAtMost can only be inspected at compile time (by a
rewrite rule).

* Write might exist at runtime, if it's allocation site cannot be seen
by its use site, which hard to guarantee in general (it requires
serious staring at Core). This is not the case for writeAtMost, unless
it's partially applied.

* The second field of Write is lazy. I'm not sure what, if any,
implications this might have for how GHC optimizes the code.

 In my opinion, Writes and Builders have different use-cases and
 different semantics. Providing a type modeling Writes makes therefore
 sense to me. Moreover, note that Writes are built as a compile time
 abstraction. All their definitions are intended to be completely
 inlined and care is taken that the inliner also does so. Therefore,
 they incur no runtime cost.

This is up to the user of the Write abstraction to ensure, as any
function that takes a Write as an argument must have the correct
INLINE incantations applied to make this happen.

Cheers,
Johan

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] blaze-builder and FlexibleInstances in code that aims to become part of the Haskell platform

2011-05-20 Thread Simon Meier
2011/5/20 Johan Tibell johan.tib...@gmail.com:
 Hi Simon,

 On Thu, May 19, 2011 at 10:46 PM, Simon Meier iridc...@gmail.com wrote:
 Write achieves this separation, but it has some costs which I'm not
 entirely comfortable with.

 First, it leads to lots of API duplication. For every type (e.g. Word)
 we want to be able serialize we have two morally identical functions

    writeWordhost :: Word - Write
    fromWordhost :: Word - Builder

 in the API, where the latter simply calls the former and does some
 additional wrapping.

 Yes, I agree with this duplication. I'll explain below what we gain
 from it. Note that I factored out the whole Write stuff into its own
 library (system-io-write) for the bytestring integration. Therefore,
 an end-user of bytestring will only see the Builder versions except
 he's doing more low-level stuff to gain some extra performance.

 There are (at least) two cases where I think the simple Builder API
 must perform well for it to be usable on its own: simple loops and
 sequential writes. To be specific, if the following two cases don't
 compile into near optimal code, there's a compiler bug we should fix.
 First, a simple loop:

    f :: [Word8] - Builder
    f [] = mempty
    f (x:xs) = singleton x `mappend` xs

 This code is already quite low level, there should be enough
 information here for the compiler to emit a simple loop with one
 buffer bounds check per iteration. Second, a bunch of sequential
 writes:

    g :: Word8 - Word8 - Word8 - Word8 - Builder
    g a b c d = singleton `mappend` (b `mappend` (c `mappend` d))

 This ought to compile to a single bounds check followed by for memory writes.

 The user shouldn't have to get more low-level than this in these
 simple examples. Today this is currently only true for the second
 example, which we can solve using rewrite rules. The first example
 doesn't work due to the GHC compiler bug I mentioned.

I agree with you that simple uses of the Builder API should be
optimized well. I think we can also guide the user indirectly by
making writing the efficient code even easier than writing the
possibly dangerous one. I'm thinking about providing type-classes for
standard encodings. For example,

class Utf8 a where
utf8 :: a - Builder

instance Utf8 String where
utf8 = fromWriteList writeCharUtf8

-- further instances: Char, Text

This yields another way of navigating around difficult optimization territory.

 Simon, is the reason for this duplication this comment on top of
 Blaze.ByteString.Builder.Word?

    snip

 That's one of the reasons, but not the main one. The core reason is
 that Write's provide
 an interface between implementors of the low-level bit-twiddling
 required to efficiently implement encodings like UTF-8 or Base16 and
 the providers of efficient traversal functions through (streams of)
 Haskell values. For simple traversals like

   fromWrite          :: Write a - a - Builder
   fromWriteList      :: Write a - [a] - Builder
   fromWriteUnfoldr   :: Write b - (a - Maybe (b, a)) - a - Builder

 there might be the option that GHC is clever enough and can find the
 efficient loop. However, for more complicated functions like

   mapWriteByteString :: Write Word8 - S.ByteString - Builder

 That certainly isn't the case. I'm using quite a few tricks there [3]
 to enable a tight inner loop with few live variables.

 Right. So this argues for having an escape hatch, and I agree we
 should have one. Write at writeAtMost are both such escape hatches and
 I believe them to equal in expressiveness. This shouldn't come as a
 surprise as Write is writeAtMost with one argument reified into into a
 constructor field:

    writeAtMost :: Int - (Ptr Word8 - IO Int) - IO ()
    data Write = {-# UNPACK #-} !Int (Ptr Word8 - IO (Ptr Word8))

 (That the second argument of writeAtMost is an Int instead of a Ptr
 Word8 as in Write is an unimportant difference.)

There, seems to be a historical artefact here. The new Write
abstraction in system-io-write is different from the one used in
blaze-builder. It's type is

  data Write a = Write Int (a - Ptr Word8 - IO (Ptr Word8))

This definition ensures that the bound on the number of bytes written
is independent of the value being encoded. That's crucial for the
implementation of `mapWriteByteString`. It also benefits the other
Write combinators, as the bound can always be computed in a
data-independent fashion. Inlining, is therefore really sufficient to
arrive at a constant bound during compile time.

I don't see how this Write type can be emulated using `writeAtMost`, do you?

 There are some operational differences.

 * The argument to Write can be inspected at runtime, while the
 argument to writeAtMost can only be inspected at compile time (by a
 rewrite rule).

 * Write might exist at runtime, if it's allocation site cannot be seen
 by its use site, which hard to guarantee in general (it requires
 serious staring at Core). This is not the case for writeAtMost, unless

Re: [Haskell-cafe] blaze-builder and FlexibleInstances in code that aims to become part of the Haskell platform

2011-05-19 Thread Johan Tibell
Hi Simon,

On Wed, May 18, 2011 at 7:32 PM, Simon Meier iridc...@gmail.com wrote:
 In fact, one of my current goals with this work is to polish it such
 that it can be integrated into the 'bytestring' library.

We should definitely add a builder monoid in the bytestring package.

Since Write mentions IO, I thought I should point out that we need to
separate any code that mentions IO from the the code that doesn't
(i.e. the pure Builder API). The use of IO is an implementation detail
in bytestring. We should follow the existing bytestring pattern and
put any code that mentions IO in e.g.
Data.ByteString.Lazy.Builder.Internal. This allows the few people who
need to access the internals to do so while making it clear that these
are in fact internals. Long term we'd like to switch bytestring over
from ForeignPtr to ByteArray#, if possible. There are currently some
technical obstacles to such a switch, but factoring out the IO code at
least makes it somewhat easier if we ever get around to switching.

Avoiding IO in the main API means that the main builder type must not
mention IO (or things related to IO, such as Storable).

 The core principle used to tackle (1) is avoiding intermediate data
 structures.  The core abstraction used is the one of a Write (see [1]
 for the corresponding library.)

  data Write a = Write Int (a - Ptr Word8 - IO (Ptr Word8))

 A value `Write bound io :: Write a` denotes an encoding scheme for
 values of type `a` that uses at most `bound` bytes space. Given a
 values `x :: a` and a pointer `po` to the next free byte `io x po`
 encodes `x` to memory starting from `po` and returns the pointer to
 the next free byte after the encoding of `x`.

 In most cases Writes are used as an abstract datatype. They serve as
 an interface between implementors of the low-level bit-twiddling
 required to efficiently implement encodings like UTF-8 or Base16 and
 the providers of efficient traversal functions through streams of
 Haskell values. Hence, typical users of Writes are functions like

  fromWrite          :: Write a - a - Builder
  fromWriteList      :: Write a - [a] - Builder
  fromWriteUnfoldr   :: Write b - (a - Maybe (b, a)) - a - Builder
  mapWriteByteString :: Write Word8 - S.ByteString - Builder

We want to allow users to efficiently create new builders, for their
own data type. This is crucial as the bytestring package cannot
provide efficient builders for every possible type, as it would have
to depend on most of Hackage (i.e. on all packages that define types
that we want efficient builders for) to do so. Allowing the user to
get hold of the underlying buffer in a controlled way makes the
builder extensible. This is good.

Write achieves this separation, but it has some costs which I'm not
entirely comfortable with.

First, it leads to lots of API duplication. For every type (e.g. Word)
we want to be able serialize we have two morally identical functions

writeWordhost :: Word - Write
fromWordhost :: Word - Builder

in the API, where the latter simply calls the former and does some
additional wrapping.

See 
http://hackage.haskell.org/packages/archive/blaze-builder/0.3.0.1/doc/html/Blaze-ByteString-Builder-Word.html
for examples.

Simon, is the reason for this duplication this comment on top of
Blaze.ByteString.Builder.Word?

Note that for serializing a three tuple (x,y,z) of bytes (or other
word values) you should use the expression

fromWrite $ writeWord8 x `mappend` writeWord8 y `mappend` writeWord z

instead of

fromWord8 x `mappend` fromWord8 y `mappend` fromWord z

The first expression will result in a single atomic write of three
bytes, while the second expression will check for each byte, if
there is free space left in the output buffer. Coalescing these
checks can improve performance quite a bit, as long as you use it
sensibly.

Coalescing of buffer space checks can be achieved without separating
writes into Write and Builder. I've done so in the binary package [1]
using rewrite rules. The rewrite rules fire reliable so that any
syntactic series of puts i.e.

f = do
putWord8 1
putWord8 2
putWord8 3

result in one bounds check, followed by three pokes into the buffer.
To do so all that is needed is to define all builders in terms of

writeAtMost :: Int - (Ptr Word8 - IO Int) - Builder

and create a rewrite rule for append/writeAtMost. writeAtMost is
essentially the same as your Write [2], except it never leads to any
constructors getting allocated.

At the moment, the addition of Write means that

import Blaze.ByteString.Builder

f :: [Word8] - Builder
f xs = fromWriteList writeWord8 xs

is faster than the Data.Binary equivalent

import Data.Binary.Builder

g :: [Word8] - Builder
g [] = mempty
g (x:xs) = singleton x `mappend` g xs

Fortunately this was due to a bug in GHC [3]. After this bug has been
fixed I expect Data.Binary to perform on par with
Blaze.ByteString.Builder, 

Re: [Haskell-cafe] blaze-builder and FlexibleInstances in code that aims to become part of the Haskell platform

2011-05-19 Thread Henk-Jan van Tuyl
On Wed, 18 May 2011 23:21:27 +0200, Antoine Latter aslat...@gmail.com  
wrote:



I don't have a problem with these extensions being in the Haskell
Platform, as the platform currently only targets GHC, but the
bytestring package itself might have a higher standard of portability.


So you want the packages that use extensions thrown out, as soon as  
another Haskell compiler becomes popular? As one of the targets of the  
Haskell Platform is stability, it is not advisable to target just one  
compiler.


Regards,
Henk-Jan van Tuyl


--
http://Van.Tuyl.eu/
http://members.chello.nl/hjgtuyl/tourdemonad.html
--

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] blaze-builder and FlexibleInstances in code that aims to become part of the Haskell platform

2011-05-19 Thread Antoine Latter
On May 19, 2011 4:57 AM, Henk-Jan van Tuyl hjgt...@chello.nl wrote:

 On Wed, 18 May 2011 23:21:27 +0200, Antoine Latter aslat...@gmail.com
wrote:

 I don't have a problem with these extensions being in the Haskell
 Platform, as the platform currently only targets GHC, but the
 bytestring package itself might have a higher standard of portability.


 So you want the packages that use extensions thrown out, as soon as
another Haskell compiler becomes popular? As one of the targets of the
Haskell Platform is stability, it is not advisable to target just one
compiler.


Portability and adherence to standards is a goal worth striving for, but the
platform policy stated on the wiki is that packages in the platform should
build on all compilier targets.

I think we need to be pragmatic about what we include - for example
functional dependencies are still controversial, but that doesn't mean that
the 'mtl' package should be tossed out of the platform.

But perhaps it does mean that, for example, the 'containers' package should
be subject to a higher level of scutiny for its public API. The 'bytestring'
package might also be such a package where we prioritize portability.

I'm not active in the maintaince of the platform; perhaps I'm mis-stating
the goals and policies.

Antoine
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] blaze-builder and FlexibleInstances in code that aims to become part of the Haskell platform

2011-05-19 Thread Simon Meier
Hi Antoine, thanks for your feedback.

2011/5/18 Antoine Latter aslat...@gmail.com:
 On Wed, May 18, 2011 at 12:32 PM, Simon Meier iridc...@gmail.com wrote:
 Hello Haskell-Cafe,





 There are many providers of Writes. Each bounded-length-encoding of a
 standard Haskell value is likely to have a corresponding Write. For
 example, encoding an Int32 as a big-endian, little-endian, and
 host-endian byte-sequence is currently achieved with the following
 three functions.

  writeInt32BE :: Write Int32
  writeInt32LE :: Write Int32
  writeInt32HE :: Write Int32

 I would like to avoid naming all these encodings individually.
 Especially, as the situation becomes worse for more elaborate
 encodings like hexadecimal encodings. There, we encounter encodings
 like the utf8-encoding of the hexadecimal-encoding with lower-case
 letters of an Int32.

  writeInt32HexLowerUtf8 :: Write Int32

 I really don't like that. Therefore, I'm thinking about the following
 solution based on type-classes. We introduce a single typeclass

  class Writable a where
      write :: Write a

 and use a bunch of newtypes to denote our encodings.

  newtype Ascii7   a = Ascii7   { unAscii7   :: a }
  newtype Utf8     a = Utf8     { unUtf8     :: a }
  newtype HexUpper a = HexUpper { unHexUpper :: a }
  newtype HexLower a = HexLower { unHexLower :: a }
  ...

 Assuming FlexibleInstnaces, we can write encodings like the above
 hex-encoding as instances

  instance Write (Utf8 (HexLower Int32)) where
    write = ...

 This composes rather nicely and allows the implementations to exploit
 special properties of the involved data. For example, if we also had a
 HTML escaping marker

  newtype Html     a = Html     { unHtml     :: a }

 Then, the instance

  instance Write (Utf8 (HTML (HexLower Int32))) where
    write (Utf8 (HTML (HexLower i))) = write (Utf8 (HexLower i))

 If I were authoring the above code, I don't see why that code is any
 easier to write or easier to read than:

 urf8HtmlHexLower i = utf8HexLower i

 And if I were using the encoding functions, I would much prefer to see:

 urf8HtmlHexLower magicNumber

 In my code, instead of:

 write $ Utf8 $ HTML $ HexLower magicNumber

 In addition, this would be difficult for me as a developer using the
 proposed library, because I would have no way to know which
 combinations of newtypes are valid from reading the haddocks.

 Maybe I'm missing something fundamental, but this approach seems more
 cumbersome to me as a library author (more boilerplate) and as the
 user of the library (less clarity in the docs and in the resultant
 code).

Hmm, that's a valid point you raise here. Especially, the
documentation issue bothers me.

The core problem that drove me towards this solution is the abundance
of different IntX and WordX types. Each of them requiring a separate
Write for big-endian, little-endian, host-endian, lower-case-hex, and
uper-case-hex encodings; i.e., currently, there are

int8BE   :: Write Int8
int16BE :: Write Int16
int32BE :: Write Int32
...
hexLowerInt8 :: Write Int8
...

and so on. As you can see
(http://hackage.haskell.org/packages/archive/blaze-builder/0.3.0.1/doc/html/Blaze-ByteString-Builder-Word.html)
this approach clutters the public API quite a bit. Hence, I'm thinking
of using a separate type-class for each encoding; i.e.,

  class BigEndian a where
bigEndian :: Write a

This collapses the big-endian encodings of all 10 bounded-size (signed
and unsigned) integer types under a single name with a well-defined
semantics. Moreover, it's standard Haskell 98. For the hex-encodings,
I'm thinking about providing type-classes

  class HexLower a where
hexLower :: Write a

  class HexLowerNoLead a where
hexLowerNoLead :: Write a

  ...

for ASCII encoding and each of the standard Unicode encodings in a
separate module. The user can then select the right ones using
qualified imports. In most cases, he won't even need qualification, as
mixing different character encodings is seldomly used.

What do you think about such an interface? Is there another catch
hidden, I'm not seeing? BTW, note that Writes are a pure compile time
abstraction and are thought to be completely inlined. In typical, uses
cases there's no efficiency overhead stemming from these typeclasses.

best regards,
Simon

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] blaze-builder and FlexibleInstances in code that aims to become part of the Haskell platform

2011-05-19 Thread Simon Meier
Hi Johan,

thanks for the extensive and motivating feedback.

2011/5/19 Johan Tibell johan.tib...@gmail.com:
 On Wed, May 18, 2011 at 7:32 PM, Simon Meier iridc...@gmail.com wrote:
 In fact, one of my current goals with this work is to polish it such
 that it can be integrated into the 'bytestring' library.

 We should definitely add a builder monoid in the bytestring package.

 Since Write mentions IO, I thought I should point out that we need to
 separate any code that mentions IO from the the code that doesn't
 (i.e. the pure Builder API). The use of IO is an implementation detail
 in bytestring. We should follow the existing bytestring pattern and
 put any code that mentions IO in e.g.
 Data.ByteString.Lazy.Builder.Internal. This allows the few people who
 need to access the internals to do so while making it clear that these
 are in fact internals. Long term we'd like to switch bytestring over
 from ForeignPtr to ByteArray#, if possible. There are currently some
 technical obstacles to such a switch, but factoring out the IO code at
 least makes it somewhat easier if we ever get around to switching.

 Avoiding IO in the main API means that the main builder type must not
 mention IO (or things related to IO, such as Storable).

I completely agree with you. The system-io-write library [1] and the
bytestring fork [2] I'm working on provide separate interfaces for
standard and expert users. The naming of the system-io-write library
is tentative and can be adapted once it's place is clear.

 The core principle used to tackle (1) is avoiding intermediate data
 structures.  The core abstraction used is the one of a Write (see [1]
 for the corresponding library.)

  data Write a = Write Int (a - Ptr Word8 - IO (Ptr Word8))

 A value `Write bound io :: Write a` denotes an encoding scheme for
 values of type `a` that uses at most `bound` bytes space. Given a
 values `x :: a` and a pointer `po` to the next free byte `io x po`
 encodes `x` to memory starting from `po` and returns the pointer to
 the next free byte after the encoding of `x`.

 In most cases Writes are used as an abstract datatype. They serve as
 an interface between implementors of the low-level bit-twiddling
 required to efficiently implement encodings like UTF-8 or Base16 and
 the providers of efficient traversal functions through streams of
 Haskell values. Hence, typical users of Writes are functions like

  fromWrite          :: Write a - a - Builder
  fromWriteList      :: Write a - [a] - Builder
  fromWriteUnfoldr   :: Write b - (a - Maybe (b, a)) - a - Builder
  mapWriteByteString :: Write Word8 - S.ByteString - Builder

 We want to allow users to efficiently create new builders, for their
 own data type. This is crucial as the bytestring package cannot
 provide efficient builders for every possible type, as it would have
 to depend on most of Hackage (i.e. on all packages that define types
 that we want efficient builders for) to do so. Allowing the user to
 get hold of the underlying buffer in a controlled way makes the
 builder extensible. This is good.

 Write achieves this separation, but it has some costs which I'm not
 entirely comfortable with.

 First, it leads to lots of API duplication. For every type (e.g. Word)
 we want to be able serialize we have two morally identical functions

    writeWordhost :: Word - Write
    fromWordhost :: Word - Builder

 in the API, where the latter simply calls the former and does some
 additional wrapping.

Yes, I agree with this duplication. I'll explain below what we gain
from it. Note that I factored out the whole Write stuff into its own
library (system-io-write) for the bytestring integration. Therefore,
an end-user of bytestring will only see the Builder versions except
he's doing more low-level stuff to gain some extra performance.

 See 
 http://hackage.haskell.org/packages/archive/blaze-builder/0.3.0.1/doc/html/Blaze-ByteString-Builder-Word.html
 for examples.

 Simon, is the reason for this duplication this comment on top of
 Blaze.ByteString.Builder.Word?

    Note that for serializing a three tuple (x,y,z) of bytes (or other
    word values) you should use the expression

        fromWrite $ writeWord8 x `mappend` writeWord8 y `mappend` writeWord z

    instead of

        fromWord8 x `mappend` fromWord8 y `mappend` fromWord z

    The first expression will result in a single atomic write of three
    bytes, while the second expression will check for each byte, if
    there is free space left in the output buffer. Coalescing these
    checks can improve performance quite a bit, as long as you use it
    sensibly.

That's one of the reasons, but not the main one. The core reason is
that Write's provide
an interface between implementors of the low-level bit-twiddling
required to efficiently implement encodings like UTF-8 or Base16 and
the providers of efficient traversal functions through (streams of)
Haskell values. For simple traversals like

  fromWrite          :: Write a - a - 

Re: [Haskell-cafe] blaze-builder and FlexibleInstances in code that aims to become part of the Haskell platform

2011-05-19 Thread Antoine Latter
On Thu, May 19, 2011 at 3:06 PM, Simon Meier iridc...@gmail.com wrote:

 The core problem that drove me towards this solution is the abundance
 of different IntX and WordX types. Each of them requiring a separate
 Write for big-endian, little-endian, host-endian, lower-case-hex, and
 uper-case-hex encodings; i.e., currently, there are

 int8BE   :: Write Int8
 int16BE :: Write Int16
 int32BE :: Write Int32
 ...
 hexLowerInt8 :: Write Int8
 ...

 and so on. As you can see
 (http://hackage.haskell.org/packages/archive/blaze-builder/0.3.0.1/doc/html/Blaze-ByteString-Builder-Word.html)
 this approach clutters the public API quite a bit. Hence, I'm thinking
 of using a separate type-class for each encoding; i.e.,


If Johan's work on Data.Binary and rewrite rules works out, then it
would cut the exposed API in half, which helps.

We could then use the module and package system to further keep the
API clean, with builders which output a specific encoding could live
in separate modules. This could also keep the names of the functions
short, as well.

That would require coming up with logical divisions for the functions
you're creating, and I don't understand the big picture enough to help
with that.

  class BigEndian a where
    bigEndian :: Write a

 This collapses the big-endian encodings of all 10 bounded-size (signed
 and unsigned) integer types under a single name with a well-defined
 semantics. Moreover, it's standard Haskell 98. For the hex-encodings,
 I'm thinking about providing type-classes

  class HexLower a where
    hexLower :: Write a

  class HexLowerNoLead a where
    hexLowerNoLead :: Write a

  ...

 for ASCII encoding and each of the standard Unicode encodings in a
 separate module. The user can then select the right ones using
 qualified imports. In most cases, he won't even need qualification, as
 mixing different character encodings is seldomly used.


I think we may be at cross-purposes here, and might not even be
discussing the same thing - I would imagine that any sort of 'Builder'
type included in the bytestring package would only provide the core
combinators for packing data into low-level binary formats, so
discussions about text encoding issues, converting to hexidecimal and
Html escaping are going above my head.

This seems like what the 'text' package was written for - to separate
out the construction of textual data from choosing its encoding.

Are there use-cases where the 'text' package is too slow for this sort
of approach?

Take care,
Antoine

 What do you think about such an interface? Is there another catch
 hidden, I'm not seeing? BTW, note that Writes are a pure compile time
 abstraction and are thought to be completely inlined. In typical, uses
 cases there's no efficiency overhead stemming from these typeclasses.

 best regards,
 Simon


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] blaze-builder and FlexibleInstances in code that aims to become part of the Haskell platform

2011-05-19 Thread Henning Thielemann
Simon Meier schrieb:

 There are many providers of Writes. Each bounded-length-encoding of a
 standard Haskell value is likely to have a corresponding Write. For
 example, encoding an Int32 as a big-endian, little-endian, and
 host-endian byte-sequence is currently achieved with the following
 three functions.
 
   writeInt32BE :: Write Int32
   writeInt32LE :: Write Int32
   writeInt32HE :: Write Int32
 
 I would like to avoid naming all these encodings individually.

Maybe this one helps:
http://hackage.haskell.org/packages/archive/storable-endian/0.2.4/doc/html/Data-Storable-Endian.html
?

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] blaze-builder and FlexibleInstances in code that aims to become part of the Haskell platform

2011-05-19 Thread Bas van Dijk
On 19 May 2011 10:53, Johan Tibell johan.tib...@gmail.com wrote:
 Long term we'd like to switch bytestring over
 from ForeignPtr to ByteArray#, if possible. There are currently some
 technical obstacles to such a switch

BTW I'm working with Roman Leshchinskiy to create the
vector-bytestring package which provides:
type ByteString = Data.Vector.Storable.Vector Word8
and exports the same API as the bytestring package (no support for
lazy bytestrings yet)

A storable vector still uses a ForeignPtr but maybe this will make the
switch to unboxed Vectors (which use ByteArray#) easier.

Expect some code to be up somewhere next week.

Bas

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] blaze-builder and FlexibleInstances in code that aims to become part of the Haskell platform

2011-05-19 Thread wren ng thornton

On 5/19/11 5:51 PM, Antoine Latter wrote:

On Thu, May 19, 2011 at 3:06 PM, Simon Meieriridc...@gmail.com  wrote:


The core problem that drove me towards this solution is the abundance
of different IntX and WordX types. Each of them requiring a separate
Write for big-endian, little-endian, host-endian, lower-case-hex, and
uper-case-hex encodings; i.e., currently, there are

int8BE   :: Write Int8
int16BE :: Write Int16
int32BE :: Write Int32
...
hexLowerInt8 :: Write Int8
...

and so on. As you can see
(http://hackage.haskell.org/packages/archive/blaze-builder/0.3.0.1/doc/html/Blaze-ByteString-Builder-Word.html)
this approach clutters the public API quite a bit. Hence, I'm thinking
of using a separate type-class for each encoding; i.e.,


It seems to me that a better way of handling this would be to explicitly 
define an ADT (or type-level equivalent) for naming the different format 
options. That is, something like this:


data Endianness = BE | LE | HE
data Radix  = Binary | Octal | Decimal | Hexadecimal
...
data WIFormat = WIFormat
{ endianness :: {-# UNPACK #-} !Endianness
, radix  :: {-# UNPACK #-} !Radix
...}
class WriteWI a where
writeWI :: WIFormat - Write a

If you're sure that you can get rid of the typeclass overhead, then you 
should be able to get rid of the case analysis on the ADT as well (by 
making sure to always use writeWI fully saturated). But this way, you 
only need to deal with one class and it's obvious how to extend it (as 
opposed to your newtype solution where it's not clear whether the order 
of newtype wrapping matters, etc).


Of course, I'm not advocating that specific ADT for encoding format 
types. For example, it's only in decimal format where there's any 
difference between Word* and Int* types, since the signedness never 
shows up explicitly in binary, oct, or hex representations. There's also 
the issues you've mentioned about whether hex is upper case or lower 
case, whether there's a leading sigil like 0 or 0o for oct, or 0x, \x, 
U+,... for hex. And so on. So you'll need to figure out what all the 
formats are you want to offer, but it should be straightforward to come 
up with an ADT like the one above, and then you can just case match on 
it to choose the specific format.


As for the class, if you run into too much type ambiguity and want to 
avoid the need for type signatures, then you can add an unused argument 
of type @a@ as is common in other core libraries needing to be H98 
compliant.


--
Live well,
~wren

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


[Haskell-cafe] blaze-builder and FlexibleInstances in code that aims to become part of the Haskell platform

2011-05-18 Thread Simon Meier
Hello Haskell-Cafe,

my main question is whether requiring FlexibleInstances is a problem
for code that aims to become part of the Haskell platform. The
following explanation gives the context for this question.

As some of you may know the blaze-builder library is now used in quite
a few places. That's nice, but it doesn't mean that blaze-builder is a
finished solution to the problem of providing an API for
high-performance buffered output (creation of chunked representations)
of sequences of bytes.

In fact, one of my current goals with this work is to polish it such
that it can be integrated into the 'bytestring' library. This has the
benefit that functions that create lazy bytestrings (e.g., pack, map,
unfoldr, filter) can be implemented such that they create well-sized
chunks even if the argument bytestring is hugely fragmented. Moreover,
this integration also establishes a single builder type as the output
representation. Therefore, other creators of bytestrings (e.g.,
'text', 'base16-bytestring', 'zlib') can provide results of type
Builder, which enables O(1) appends of their results and the
preservation of well-sizedness of the created chunks.

As part of this goal, I'm currently working on the first of the
following three points that are paramount to achieving great encoding
performance:

  1. Ensure that individual Haskell values are encoded with minimal overhead.
  2. Ensure that concatenation of sequences of bytes is efficient.
  3. Ensure that the average chunk size is large.

The core principle used to tackle (1) is avoiding intermediate data
structures.  The core abstraction used is the one of a Write (see [1]
for the corresponding library.)

  data Write a = Write Int (a - Ptr Word8 - IO (Ptr Word8))

A value `Write bound io :: Write a` denotes an encoding scheme for
values of type `a` that uses at most `bound` bytes space. Given a
values `x :: a` and a pointer `po` to the next free byte `io x po`
encodes `x` to memory starting from `po` and returns the pointer to
the next free byte after the encoding of `x`.

In most cases Writes are used as an abstract datatype. They serve as
an interface between implementors of the low-level bit-twiddling
required to efficiently implement encodings like UTF-8 or Base16 and
the providers of efficient traversal functions through streams of
Haskell values. Hence, typical users of Writes are functions like

  fromWrite  :: Write a - a - Builder
  fromWriteList  :: Write a - [a] - Builder
  fromWriteUnfoldr   :: Write b - (a - Maybe (b, a)) - a - Builder
  mapWriteByteString :: Write Word8 - S.ByteString - Builder

They consume the given datastructure efficiently and wrap the Writes
in the bounds checking code to detect when a buffer is full and
request a new one.

There are many providers of Writes. Each bounded-length-encoding of a
standard Haskell value is likely to have a corresponding Write. For
example, encoding an Int32 as a big-endian, little-endian, and
host-endian byte-sequence is currently achieved with the following
three functions.

  writeInt32BE :: Write Int32
  writeInt32LE :: Write Int32
  writeInt32HE :: Write Int32

I would like to avoid naming all these encodings individually.
Especially, as the situation becomes worse for more elaborate
encodings like hexadecimal encodings. There, we encounter encodings
like the utf8-encoding of the hexadecimal-encoding with lower-case
letters of an Int32.

  writeInt32HexLowerUtf8 :: Write Int32

I really don't like that. Therefore, I'm thinking about the following
solution based on type-classes. We introduce a single typeclass

  class Writable a where
  write :: Write a

and use a bunch of newtypes to denote our encodings.

  newtype Ascii7   a = Ascii7   { unAscii7   :: a }
  newtype Utf8 a = Utf8 { unUtf8 :: a }
  newtype HexUpper a = HexUpper { unHexUpper :: a }
  newtype HexLower a = HexLower { unHexLower :: a }
  ...

Assuming FlexibleInstnaces, we can write encodings like the above
hex-encoding as instances

  instance Write (Utf8 (HexLower Int32)) where
write = ...

This composes rather nicely and allows the implementations to exploit
special properties of the involved data. For example, if we also had a
HTML escaping marker

  newtype Html a = Html { unHtml :: a }

Then, the instance

  instance Write (Utf8 (HTML (HexLower Int32))) where
write (Utf8 (HTML (HexLower i))) = write (Utf8 (HexLower i))

exploits that no HTML escaping is required for a hex-number.  Assuming
FlexibleContexts, the user can also build abbreviations for builders
using fixed encodings.

  utf8 :: Writable (Utf8 a) = a - Builder
  utf8 = fromWrite write . Utf8

Note that, on the Builder level, a probably better way would be to
have an analogous 'ToBuilder' typeclass to abstract the various
encodings. Part of these instances then reuse the corresponding
instances from Writable.

I think this type-class based interface to select the correct
efficient implementation of an 

Re: [Haskell-cafe] blaze-builder and FlexibleInstances in code that aims to become part of the Haskell platform

2011-05-18 Thread Antoine Latter
On Wed, May 18, 2011 at 12:32 PM, Simon Meier iridc...@gmail.com wrote:
 Hello Haskell-Cafe,





 There are many providers of Writes. Each bounded-length-encoding of a
 standard Haskell value is likely to have a corresponding Write. For
 example, encoding an Int32 as a big-endian, little-endian, and
 host-endian byte-sequence is currently achieved with the following
 three functions.

  writeInt32BE :: Write Int32
  writeInt32LE :: Write Int32
  writeInt32HE :: Write Int32

 I would like to avoid naming all these encodings individually.
 Especially, as the situation becomes worse for more elaborate
 encodings like hexadecimal encodings. There, we encounter encodings
 like the utf8-encoding of the hexadecimal-encoding with lower-case
 letters of an Int32.

  writeInt32HexLowerUtf8 :: Write Int32

 I really don't like that. Therefore, I'm thinking about the following
 solution based on type-classes. We introduce a single typeclass

  class Writable a where
      write :: Write a

 and use a bunch of newtypes to denote our encodings.

  newtype Ascii7   a = Ascii7   { unAscii7   :: a }
  newtype Utf8     a = Utf8     { unUtf8     :: a }
  newtype HexUpper a = HexUpper { unHexUpper :: a }
  newtype HexLower a = HexLower { unHexLower :: a }
  ...

 Assuming FlexibleInstnaces, we can write encodings like the above
 hex-encoding as instances

  instance Write (Utf8 (HexLower Int32)) where
    write = ...

 This composes rather nicely and allows the implementations to exploit
 special properties of the involved data. For example, if we also had a
 HTML escaping marker

  newtype Html     a = Html     { unHtml     :: a }

 Then, the instance

  instance Write (Utf8 (HTML (HexLower Int32))) where
    write (Utf8 (HTML (HexLower i))) = write (Utf8 (HexLower i))

If I were authoring the above code, I don't see why that code is any
easier to write or easier to read than:

 urf8HtmlHexLower i = utf8HexLower i

And if I were using the encoding functions, I would much prefer to see:

 urf8HtmlHexLower magicNumber

In my code, instead of:

 write $ Utf8 $ HTML $ HexLower magicNumber

In addition, this would be difficult for me as a developer using the
proposed library, because I would have no way to know which
combinations of newtypes are valid from reading the haddocks.

Maybe I'm missing something fundamental, but this approach seems more
cumbersome to me as a library author (more boilerplate) and as the
user of the library (less clarity in the docs and in the resultant
code).

Antoine

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] blaze-builder and FlexibleInstances in code that aims to become part of the Haskell platform

2011-05-18 Thread Antoine Latter
On Wed, May 18, 2011 at 12:32 PM, Simon Meier iridc...@gmail.com wrote:
 I think this type-class based interface to select the correct
 efficient implementation of an encoding is rather nice. However, I
 don't know if 'FlexibleInstances' and 'FlexibleContexts' are fine to
 use in code that aims to become part of the Haskell platform.
 Moreover, I might well overlook some drawbacks of this design.


I forgot to answer your main question :-)

I don't have a problem with these extensions being in the Haskell
Platform, as the platform currently only targets GHC, but the
bytestring package itself might have a higher standard of portability.

Have you heard from the 'bytestring' maintainers?

Antoine

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe