Re: [Haskell-cafe] No copy XML parser (rough idea only)

2010-05-20 Thread Neil Mitchell
Hi Joachim,

I have been playing around with this idea myself in TagSoup
(http://community.haskell.org/~ndm/tagsoup). The largest conceptual
problem I came across was that TagSoup decodes entities (i.e. &gt;
becomes >). However, I think that's a minor issue, and entity
resolution can be turned off in TagSoup to guarantee copy-free
parsing.

The practical problem is that writing an HTML parser is hard, and that
writing it in a way that works on String/ByteString/LBS alike and gets
the copy-free behaviour in the right places is tricky. I am still
working on program optimisation techniques that I think will make this
feasible (based around the idea of supercompilation, to expose the
underlying structure of the parser, without writing it in too painful
a manner). So with any luck your copy-free XML parser will happen
sooner or later.

Thanks, Neil

On Fri, May 14, 2010 at 8:20 PM, Joachim Breitner
m...@joachim-breitner.de wrote:
 Hi,

 On Friday, 14.05.2010, at 15:31 -0300, Felipe Lessa wrote:
 On Fri, May 14, 2010 at 08:57:42AM -0700, John Millikin wrote:
  Additionally, since the original bytestring is shared in your types,
  potentially very large buffers could be locked in memory due to
  references held by only a small portion of the document. Chopping a
  document up into events or nodes creates some overhead due to the
  extra pointers, but allows unneeded portions to be freed.

 However, if your bytestring comes from mmap'ed memory this
 drawback wouldn't apply :D.

 Exactly. Of course such a library would not be a general-purpose tool,
 but in cases where you know that you need most of the document for most
 of the time, e.g. when doing statistics on it, this would be acceptable.

 Also note that even after chopping into nodes, if you don’t make sure
 you drop the reference to root in a timely manner, the same thing would
 happen.

 On Friday, 14.05.2010, at 08:57 -0700, John Millikin wrote:
 The primary problem I see with this is that XML content is
 fundamentally text, not bytes. Using your types, two XML documents
 with identical content but different encodings will have different
 Haskell values (and thus be incorrect regarding Eq, Ord, etc).

 The instances could be adapted... but this will be expensive, of course.

 One could also convert documents that are not UTF-8 encoded to UTF-8 as a
 whole and then work on that.

 If you'd like memory-efficient text storage, using Bryan O'Sullivan's
 text package[1] is probably the best option. It uses packed Word16
 buffers to store text as UTF-16. Probably not as efficient as a type
 backed by UTF-8, but it's much much better than String.

 Right. For arbtt, I tried to switch from String to text, and it actually
 got slower. The reason (I think) was that besides passing strings
 around, it mainly runs pcre-light on them – which wants utf8-encoded
 bytestrings.

 I ended up creating a newtype¹ around UTF-8-encoded ByteStrings, and the
 result was quite satisfying, both memory- and runtime-wise. I wish we
 had a package providing a standard type for this that would become
 similarly popular. There is at least one more package on Hackage that
 defines such a type:
 http://hackage.haskell.org/packages/archive/regex-tdfa-utf8/1.0/doc/html/Text-Regex-TDFA-UTF8.html


 Greetings,
 Joachim

 ¹ http://darcs.nomeata.de/arbtt/src/Data/MyText.hs

 --
 Joachim Breitner
  e-Mail: m...@joachim-breitner.de
  Homepage: http://www.joachim-breitner.de
  ICQ#: 74513189
  Jabber-ID: nome...@joachim-breitner.de



[Haskell-cafe] No copy XML parser (rough idea only)

2010-05-14 Thread Joachim Breitner
Hi,

An idea recently crossed my mind for a (possibly?) useful way of
accessing XML data from Haskell that is very memory efficient – at least
as long as you only read the XML; for XML processing and generation,
other existing libraries are probably well suited.

Given a (possibly very large) XML file as a ByteString, a „normal“ XML
parser would have an algebraic data type for nodes etc., parsing the
ByteString into a tree of Haskell values. Simplified a bit, the data
structure could look like this:
 data Node = Node String [Attrib] [Node] | CData String

The in-memory representation of such a data structure is presumably
very large compared to the XML ByteString. Additionally, all the
content is copied in memory, and once I have evaluated a part of a Node
(e.g. the tag name), it is retained by the constructor for as long as the
Node, and thus the whole document tree, is retained.

My idea is to copy as little as possible, and refer to the existing
ByteString as much as possible. Remember that a substring of a
ByteString points at the same memory region, just with a different
offset and length.
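
For illustration: take and drop on a strict ByteString are O(1) and only
adjust the offset and length, so a slice like the following shares the
buffer of the original document instead of copying it:

 import qualified Data.ByteString.Char8 as B

 -- a view into the same buffer as doc; no bytes are copied
 slice :: Int -> Int -> B.ByteString -> B.ByteString
 slice off len doc = B.take len (B.drop off doc)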

So, in my case, the datatype looks like
 newtype Node = Node ByteString
 newtype CData = CData ByteString
 newtype Attrib = Attrib ByteString
where the constructors would not be exported. Instead, accessor methods
would be provided:

 tagName :: Node -> String
 tagAttribs :: Node -> [Attrib]
 tagContent :: Node -> [Either Node CData]
 cDataContent :: CData -> ByteString
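
For concreteness, here is a naive sketch of how such an accessor might
look (it ignores namespaces, whitespace variants and malformed input; the
point is only that the small String result survives the comparison while
the slice itself need not be retained):

 import qualified Data.ByteString.Char8 as B

 newtype Node = Node B.ByteString  -- the slice spanning <tag ...>...</tag>

 -- sketch only: read the name right after the opening '<'
 tagName :: Node -> String
 tagName (Node bs) = B.unpack (B.takeWhile ok (B.drop 1 bs))
   where ok c = c /= ' ' && c /= '>' && c /= '/'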

So when searching for a certain tag, the library would only have to
generate (and retain) pointers into the ByteString, but the tag names
themselves do not have to be retained after the comparison.

If it turns out to be too expensive to parse the fragment that is
referenced by a Node on every invocation of tagContent, this could be made
a field of the constructor. Due to lazy evaluation, it will only be
evaluated when needed, but then kept in memory:
 data Node = Node ByteString [Either Node CData]

About cDataContent: This method cannot just return the fragment of the
document unaltered, unfortunately, as XML entities will have to be
replaced. But if it returns a lazy ByteString, everything between the
entities can be a chunk referring to the original data, again avoiding
the creation of a copy.
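
A rough sketch of what I mean (entity table abbreviated, no error
handling): the text between entities stays a zero-copy slice of the
original strict ByteString, and only the replacements are fresh chunks.

 import qualified Data.ByteString.Char8 as B
 import qualified Data.ByteString.Lazy as BL

 decodeEntities :: B.ByteString -> BL.ByteString
 decodeEntities = BL.fromChunks . go
   where
     go bs
       | B.null bs = []
       | otherwise = case B.elemIndex '&' bs of
           Nothing -> [bs]                      -- no entity left: share as-is
           Just i  ->
             let (plain, rest) = B.splitAt i bs -- O(1), shares the buffer
             in case B.elemIndex ';' rest of
                  Nothing -> [bs]               -- malformed: share verbatim
                  Just j  -> plain
                           : resolve (B.take (j + 1) rest)
                           : go (B.drop (j + 1) rest)
     -- tiny, incomplete entity table, for illustration only
     resolve e = case B.unpack e of
       "&lt;"  -> B.pack "<"
       "&gt;"  -> B.pack ">"
       "&amp;" -> B.pack "&"
       _       -> e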


Now this is just a quick idea I wanted to share. I have not tried it
yet, so I might be overlooking something which prevents the whole thing
from working. It might also be the case that, due to lazy evaluation,
the “usual” approach actually works well enough (at least if it uses
ByteString instead of String). I’m curious what you think,

Joachim Breitner


-- 
Joachim Breitner
  e-Mail: m...@joachim-breitner.de
  Homepage: http://www.joachim-breitner.de
  ICQ#: 74513189
  Jabber-ID: nome...@joachim-breitner.de




Re: [Haskell-cafe] No copy XML parser (rough idea only)

2010-05-14 Thread John Millikin
The primary problem I see with this is that XML content is
fundamentally text, not bytes. Using your types, two XML documents
with identical content but different encodings will have different
Haskell values (and thus be incorrect regarding Eq, Ord, etc).

Additionally, since the original bytestring is shared in your types,
potentially very large buffers could be locked in memory due to
references held by only a small portion of the document. Chopping a
document up into events or nodes creates some overhead due to the
extra pointers, but allows unneeded portions to be freed.
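
For example, to let the large buffer go while keeping only a small piece,
that piece has to be copied into a fresh buffer (Data.ByteString.copy) --
which is exactly the copy the zero-copy design tries to avoid:

 import qualified Data.ByteString as B

 -- copies the slice into a new, minimal buffer, so the original
 -- (potentially huge) buffer can be garbage collected
 retainSmallPiece :: B.ByteString -> B.ByteString
 retainSmallPiece = B.copy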

If you'd like memory-efficient text storage, using Bryan O'Sullivan's
text package[1] is probably the best option. It uses packed Word16
buffers to store text as UTF-16. Probably not as efficient as a type
backed by UTF-8, but it's much much better than String.

I know the data types you specified are just examples, but they're
leaving out some important XML features -- namespaces, entity
references, etc. Consider either reading the XML spec or using
my package xml-types[2] as a starting point for designing your type
hierarchy.

[1] http://hackage.haskell.org/package/text
[2] http://hackage.haskell.org/package/xml-types


Re: [Haskell-cafe] No copy XML parser (rough idea only)

2010-05-14 Thread Felipe Lessa
On Fri, May 14, 2010 at 08:57:42AM -0700, John Millikin wrote:
 Additionally, since the original bytestring is shared in your types,
 potentially very large buffers could be locked in memory due to
 references held by only a small portion of the document. Chopping a
 document up into events or nodes creates some overhead due to the
 extra pointers, but allows unneeded portions to be freed.

However, if your bytestring comes from mmap'ed memory this
drawback wouldn't apply :D.
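
For instance (just a sketch, assuming the mmap package and its
System.IO.MMap.mmapFileByteString), the whole document could be mapped
instead of read:

 import qualified Data.ByteString as B
 import System.IO.MMap (mmapFileByteString)

 -- Nothing = no (offset, length) restriction, i.e. map the entire file
 loadDocument :: FilePath -> IO B.ByteString
 loadDocument path = mmapFileByteString path Nothing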

--
Felipe.


Re: [Haskell-cafe] No copy XML parser (rough idea only)

2010-05-14 Thread Joachim Breitner
Hi,

On Friday, 14.05.2010, at 15:31 -0300, Felipe Lessa wrote:
 On Fri, May 14, 2010 at 08:57:42AM -0700, John Millikin wrote:
  Additionally, since the original bytestring is shared in your types,
  potentially very large buffers could be locked in memory due to
  references held by only a small portion of the document. Chopping a
  document up into events or nodes creates some overhead due to the
  extra pointers, but allows unneeded portions to be freed.
 
 However, if your bytestring comes from mmap'ed memory this
 drawback wouldn't apply :D.

Exactly. Of course such a library would not be a general-purpose tool,
but in cases where you know that you need most of the document for most
of the time, e.g. when doing statistics on it, this would be acceptable.

Also note that even after chopping into nodes, if you don’t make sure
you drop the reference to root in a timely manner, the same thing would
happen.

On Friday, 14.05.2010, at 08:57 -0700, John Millikin wrote:
 The primary problem I see with this is that XML content is
 fundamentally text, not bytes. Using your types, two XML documents
 with identical content but different encodings will have different
 Haskell values (and thus be incorrect regarding Eq, Ord, etc).

The instances could be adapted... but this will be expensive, of course.

One could also convert documents that are not UTF-8 encoded to UTF-8 as a
whole and then work on that.
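
For example (a sketch, assuming the document is known to be UTF-16LE and
using the text package only for this one-time conversion):

 import qualified Data.ByteString as B
 import qualified Data.Text.Encoding as TE

 -- one-time pre-pass: re-encode the whole document to UTF-8,
 -- then run the zero-copy parser on the result
 toUtf8 :: B.ByteString -> B.ByteString
 toUtf8 = TE.encodeUtf8 . TE.decodeUtf16LE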

 If you'd like memory-efficient text storage, using Bryan O'Sullivan's
 text package[1] is probably the best option. It uses packed Word16
 buffers to store text as UTF-16. Probably not as efficient as a type
 backed by UTF-8, but it's much much better than String.

Right. For arbtt, I tried to switch from String to text, and it actually
got slower. The reason (I think) was that besides passing strings
around, it mainly runs pcre-light on them – which wants utf8-encoded
bytestrings.

I ended up creating a newtype¹ around UTF-8-encoded ByteStrings, and the
result was quite satisfying, both memory- and runtime-wise. I wish we
had a package providing a standard type for this that would become
similarly popular. There is at least one more package on Hackage that
defines such a type:
http://hackage.haskell.org/packages/archive/regex-tdfa-utf8/1.0/doc/html/Text-Regex-TDFA-UTF8.html
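
The type is roughly of this shape (a simplified sketch, not arbtt's actual
Data.MyText; conversions go via the utf8-string package):

 import qualified Data.ByteString as B
 import qualified Data.ByteString.UTF8 as U  -- utf8-string package

 -- a ByteString that is known to hold UTF-8 encoded text
 newtype Utf8Text = Utf8Text { getUtf8 :: B.ByteString }
   deriving (Eq, Ord)

 fromString :: String -> Utf8Text
 fromString = Utf8Text . U.fromString

 toString :: Utf8Text -> String
 toString = U.toString . getUtf8

The nice part is that getUtf8 can be handed to pcre-light directly, without
any re-encoding.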


Greetings,
Joachim

¹ http://darcs.nomeata.de/arbtt/src/Data/MyText.hs

-- 
Joachim Breitner
  e-Mail: m...@joachim-breitner.de
  Homepage: http://www.joachim-breitner.de
  ICQ#: 74513189
  Jabber-ID: nome...@joachim-breitner.de

