Re: Problems with lazy-xml

2011-02-12 Thread Marko Topolnik
  Just guessing, but is it something to do with this (from the docstring
  of parse-seq)?

  it will be run in a separate thread and be allowed to get
   ahead by queue-size items, which defaults to maxint.

As far as I can tell, when XPP is on the classpath and I'm
using it, the code that does the parsing is entirely different and
does not involve a separate thread (see with_pull.clj). The parser
sits and waits for events to be requested from the lazy seq.

 Finally, you might want to be able to walk an XML document larger
 than would fit in memory.  I'm not sure if lazy-xml has ever been
 able to do this as it would need to be vigilant about not
 retaining the root of the returned document tree.

I was under the impression that if the client was careful to lose the
nodes' parents, then both they and their previous siblings would be
free for garbage collection. The point is to first navigate to the
desired :content lazy seq and then lose all other refs. Then you are
in the same position as with any old lazy seq -- you run through it
without retaining the head, and all is well.
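A minimal sketch of that discipline (names hypothetical; parse-fn stands in for a lazy parser such as parse-trim, and the tree shape is assumed): bind the tree only long enough to navigate to the records, then walk the seq with a reduction that never holds its head:

```clojure
;; Hypothetical sketch of head-dropping traversal, not lazy-xml itself.
(defn count-record-tags [parse-fn source]
  ;; `root` would be a local that goes out of scope once we pull out
  ;; :content; Clojure's locals clearing lets the ancestors be collected.
  (let [records (:content (second (:content (parse-fn source))))]
    ;; reduce walks the seq without retaining its head, so each record
    ;; becomes garbage as soon as it has been counted
    (reduce (fn [n rec] (if (= :record (:tag rec)) (inc n) n))
            0
            records)))
```

Called with a genuinely lazy parser, this should count records in bounded memory.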

-- 
You received this message because you are subscribed to the Google
Groups Clojure group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en


Re: Problems with lazy-xml

2011-02-12 Thread Chouser
On Sat, Feb 12, 2011 at 4:16 AM, Marko Topolnik
marko.topol...@gmail.com wrote:
  Just guessing, but is it something to do with this (from the docstring
  of parse-seq)?

  it will be run in a separate thread and be allowed to get
   ahead by queue-size items, which defaults to maxint.

 As I've figured it out, when there's XPP on the classpath, and I'm
 using it, the code that does the parsing is entirely different and
 does not involve a separate thread (see with_pull.clj). The parser
 sits and waits for events to be requested from the lazy seq.

Yes, XPP is superior to SAX, especially for this sort of laziness.

 Finally, you might want to be able to walk an XML document larger
 than would fit in memory.  I'm not sure if lazy-xml has ever been
 able to do this as it would need to be vigilant about not
 retaining the root of the returned document tree.

 I was under the impression that if the client was careful to lose the
 nodes' parents, then both they and their previous siblings would be
 free for garbage collection. The point is to first navigate to the
 desired :content lazy seq and then lose all other refs. Then you are
 in the same position as with any old lazy seq -- you run through it
 without retaining the head, and all is well.

It may work, but wasn't a design goal when I was originally writing
lazy-xml, so it's possible I have stray head- or root-holding in the
code.

The problem you're observing is because of my use of drop-last, which
forces the parsing of one extra sibling at *each* level of the tree.
In your test case, this is a *lot* of extra parsing at exactly the
point where you don't want it.  The drop-last is used because the last
node of each interior sequence holds the non-descendant events and so
must be dropped from the content seq.

I'm currently trying to come up with a different way of passing around
the non-descendant events so that drop-last isn't necessary, but it's
...tricky, at least for my poor fuzzy brain.

--Chouser
http://joyofclojure.com/



Re: Problems with lazy-xml

2011-02-12 Thread Marko Topolnik
How about replacing
  (drop-last sibs)
with
  (remove vector? sibs)
?

remove does not look ahead at the next seq member, and the only
vector in sibs is the last element. I tried this change and it works
for the test code from the original post.
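The difference in lookahead can be demonstrated with an instrumented lazy seq (a self-contained sketch, unrelated to the lazy-xml internals):

```clojure
;; Record which elements actually get realized. A list (not a vector)
;; is used as the source so realization is not chunked.
(def realized (atom []))

(defn spy-seq [xs]
  (map (fn [x] (swap! realized conj x) x) xs))

;; drop-last is (map (fn [x _] x) s (drop 1 s)), so producing element n
;; forces element n+1 as well:
(reset! realized [])
(first (drop-last (spy-seq (list 1 2 3 4))))
@realized  ;; => [1 2]

;; remove emits each passing element as soon as it is realized:
(reset! realized [])
(first (remove vector? (spy-seq (list 1 2 3 [:tail]))))
@realized  ;; => [1]
```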




Re: Problems with lazy-xml

2011-02-12 Thread Marko Topolnik
Also, the XPP-based parser is almost an order of magnitude slower than
the SAX-based one. The only thing missing is a couple of type hints:

(defn- attrs [^XmlPullParser xpp]

(defn- ns-decs [^XmlPullParser xpp]

  (let [step (fn [^XmlPullParser xpp]

These hints take it from 400% slower than SAX to 30% faster.
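The mechanism is reflection elimination; a generic sketch (String here stands in for XmlPullParser):

```clojure
;; With *warn-on-reflection* on, unhinted interop calls are flagged:
(set! *warn-on-reflection* true)

;; No hint: the compiler cannot resolve .length at compile time, so
;; every call goes through java.lang.reflect -- much slower.
(defn slow-len [s] (.length s))

;; Hint: compiles to a direct virtual call on String.
(defn fast-len [^String s] (.length s))
```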



Re: Problems with lazy-xml

2011-02-12 Thread Marko Topolnik
On Feb 12, 7:55 pm, Marko Topolnik marko.topol...@gmail.com wrote:
 How about replacing
   (drop-last sibs)
 with
   (remove vector? sibs)
 ?

This was slightly naive. We also need these changes:

In siblings:

:end-element   [[(rest s)]]

In mktree:

(cons
  (struct element (:name elem) (:attrs elem) (remove vector? sibs))
  (lazy-seq (get (last sibs) 0)))



Re: Problems with lazy-xml

2011-02-12 Thread Marko Topolnik
In fact, it is enough to replace (drop-last sibs) with
(remove seq? sibs).



Re: Problems with lazy-xml

2011-02-11 Thread Marko Topolnik
Right now I'm working with a 300k-record file, but the code must scale
into the millions, and, as I mentioned, it is already spewing
OutOfMemory errors. Also, on a more abstract level, it's just not right
to thrash the memory of a concurrent server-side component for
absolutely no good reason.



Re: Problems with lazy-xml

2011-02-11 Thread Benny Tsai
Can you post a link to a (sanitized, if need be) sample file?

On Feb 11, 1:21 am, Marko Topolnik marko.topol...@gmail.com wrote:
 Right now I'm working with a 300k-record file, but the code must scale
 into the millions, and, as I mentioned, it is already spewing
 OutOfMemory errors. Also, on a more abstract level, it's just not right
 to thrash the memory of a concurrent server-side component for
 absolutely no good reason.



Re: Problems with lazy-xml

2011-02-11 Thread Marko Topolnik
http://db.tt/iqTo1Q4

This is a sample XML file with 1000 records -- enough to notice a
significant delay when evaluating the code from the original post.

Chouser, could you spare a second here? I've been looking and looking
at mktree and siblings for two days now and can't for the life of me
figure out why it would eagerly parse the whole contents of an element
as soon as I access its struct! The code looks perfectly correct.



Re: Problems with lazy-xml

2011-02-11 Thread Benny Tsai
I can confirm that the same thing is happening on my end as well.  The
XML is parsed lazily:

user=> (time (let [root (parse-trim (reader "huge.xml"))]
         (-> root :content type)))
"Elapsed time: 45.57367 msecs"
clojure.lang.LazySeq

...but as soon as I try to do anything with the struct map for the
DataArea element (second element in root's content), the entire
element appears to be parsed eagerly.

user=> (time (let [root (parse-trim (reader "huge.xml"))]
         (-> root :content second type)))
"Elapsed time: 884.905205 msecs"
clojure.lang.PersistentStructMap

I spent some time looking at the source for lazy-xml as well, but
wasn't able to locate where the problem lies :(

On Feb 11, 3:07 am, Marko Topolnik marko.topol...@gmail.com wrote:
 http://db.tt/iqTo1Q4

 This is a sample XML file with 1000 records -- enough to notice a
 significant delay when evaluating the code from the original post.

 Chouser, could you spare a second here? I've been looking and looking
 at mktree and siblings for two days now and can't for the life of me
 figure out why it would eagerly parse the whole contents of an element
 as soon as I access its struct! The code looks perfectly correct.



Re: Problems with lazy-xml

2011-02-11 Thread Chris Perkins
On Feb 11, 5:07 am, Marko Topolnik marko.topol...@gmail.com wrote:
 http://db.tt/iqTo1Q4

 This is a sample XML file with 1000 records -- enough to notice a
 significant delay when evaluating the code from the original post.

 Chouser, could you spare a second here? I've been looking and looking
 at mktree and siblings for two days now and can't for the life of me
 figure out why it would eagerly parse the whole contents of an element
 as soon as I access its struct! The code looks perfectly correct.

Just guessing, but is it something to do with this (from the docstring
of parse-seq)?

it will be run in a separate thread and be allowed to get
  ahead by queue-size items, which defaults to maxint.

Doesn't sound like it's actually lazy unless you explicitly specify a
queue-size.


- Chris



Re: Problems with lazy-xml

2011-02-11 Thread Chouser
On Fri, Feb 11, 2011 at 2:35 PM, Chris Perkins chrisperkin...@gmail.com wrote:
 On Feb 11, 5:07 am, Marko Topolnik marko.topol...@gmail.com wrote:
 http://db.tt/iqTo1Q4

 This is a sample XML file with 1000 records -- enough to notice a
 significant delay when evaluating the code from the original post.

 Chouser, could you spare a second here? I've been looking and looking
 at mktree and siblings for two days now and can't for the life of me
 find out why it would eagerly parse the whole contents of an element
 as soon as I acces its struct! The code looks perfectly correct.

I can reproduce the behavior you describe.  I'll look into it.

 Just guessing, but is it something to do with this (from the docstring
 of parse-seq)?

 it will be run in a separate thread and be allowed to get
  ahead by queue-size items, which defaults to maxint.

 Doesn't sound like it's actually lazy unless you explicitly specify a
 queue-size.

There are a few different reasons someone might want laziness.

You might want to be able to start examining the tree before it's
done being parsed -- this is the default behavior of lazy-xml.
Your code will block if it catches up with the parser, but
shouldn't block earlier than it has to.

Having the parsing thread do no more work than necessary (queue
size 1) is another possibly desired behavior -- this sounds like the
OP's desire, in which case the queue size does need to be
specified.
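The queue-size mechanism can be sketched in isolation (a toy producer/consumer, not the lazy-xml code; with capacity 1 the producer computes at most one unconsumed item ahead):

```clojure
(import '[java.util.concurrent LinkedBlockingQueue])

;; Toy version of the queue-size idea: the producer thread blocks on
;; `put` once it is `capacity` items ahead of the consumer.
(defn bounded-producer-seq [capacity produce-fn n]
  (let [q (LinkedBlockingQueue. (int capacity))]
    (future (dotimes [i n] (.put q (produce-fn i))))
    (repeatedly n #(.take q))))

(doall (bounded-producer-seq 1 identity 5))  ; => (0 1 2 3 4)
```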

Finally, you might want to be able to walk an XML document larger
than would fit in memory.  I'm not sure if lazy-xml has ever been
able to do this as it would need to be vigilant about not
retaining the root of the returned document tree.

--Chouser



Problems with lazy-xml

2011-02-10 Thread Marko Topolnik
I am required to process a huge XML file with 300,000 records. The
structure is like this:

<root>
  <header>

  </header>
  <body>
    <record>...</record>
    <record>...</record>
    ... 299,998 more
  </body>
</root>

Obviously, it is of key importance not to allocate memory for all the
records at once. If I do this:

(use '[clojure.contrib.lazy-xml :only [parse-trim]])
(use '[clojure.java.io :only [reader]])

(-> (parse-trim (reader "huge.xml"))
    :content
    second
    :tag)

This should only parse as far as the start tag of <body>, but it
parses all the way down to </body> -- at least it tries to, failing
with OutOfMemoryError.

Am I wrong in expecting the entire contents of <body> not to be
parsed? :content is supposed to be a lazy seq, so even if I access its
head, it should still not parse more than just the first <record>
element, right?
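That expectation does hold for plain lazy seqs; a self-contained check (independent of lazy-xml):

```clojure
;; Count how many fake records actually get realized. The source is a
;; list, not a range or vector, so realization is not chunked 32 at a
;; time.
(def parsed (atom 0))

(defn fake-records [n]
  (map (fn [i] (swap! parsed inc) {:tag :record :n i})
       (apply list (range n))))

(:tag (first (fake-records 300000)))  ; => :record
@parsed                               ; => 1, not 300000
```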



Re: Problems with lazy-xml

2011-02-10 Thread Mike Meyer
On Thu, 10 Feb 2011 07:22:55 -0800 (PST)
Marko Topolnik marko.topol...@gmail.com wrote:

 I am required to process a huge XML file with 300,000 records. The
 structure is like this:
 
 <root>
   <header>

   </header>
   <body>
     <record>...</record>
     <record>...</record>
     ... 299,998 more
   </body>
 </root>
 
 Obviously, it is of key importance not to allocate memory for all the
 records at once.

I don't think it's obvious. Maybe I'm missing something? Like - how
big are the records? If they're less than 1K, that's at most 300 meg
in core - which is large, but not impossible on modern hardware. I've
been handling .5G data structures in core for the last few years (in
Python, anyway). I've run into at least one stupid garbage collector
that insisted on scanning such structures even though they weren't
changing, which pretty much killed performance. Maybe you have a fast
startup requirement, which building the initial data structure would
kill. Maybe something else?

  Thanks,
  mike
-- 
Mike Meyer m...@mired.org http://www.mired.org/consulting.html
Independent Network/Unix/SCM consultant, email for more information.

O ascii ribbon campaign - stop html mail - www.asciiribbon.org
   
