Re: Problems with lazy-xml
> Just guessing, but is it something to do with this (from the docstring
> of parse-seq)? "it will be run in a separate thread and be allowed to
> get ahead by queue-size items, which defaults to maxint."

As I've figured it out, when there's XPP on the classpath and I'm using it, the code that does the parsing is entirely different and does not involve a separate thread (see with_pull.clj). The parser sits and waits for events to be requested from the lazy seq.

> Finally, you might want to be able to walk an XML document larger than
> would fit in memory. I'm not sure if lazy-xml has ever been able to do
> this as it would need to be vigilant about not retaining the root of the
> returned document tree.

I was under the impression that if the client was careful to lose the nodes' parents, they would be free for garbage collection, as would previous siblings. The point is to first navigate to the desired :content lazy seq and then lose all other refs. Then you are in the same position as with any old lazy seq -- you run through it without retaining the head, and all is well.

--
You received this message because you are subscribed to the Google Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your first post.
To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/clojure?hl=en
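The access pattern described above -- navigate to the :content seq, then drop every other reference -- can be sketched like this (a minimal sketch: the file name and document layout are hypothetical, and it assumes the clojure.contrib.lazy-xml API used elsewhere in this thread):

```clojure
(require '[clojure.contrib.lazy-xml :as lx]
         '[clojure.java.io :as io])

;; Count records without ever binding the root element to a local:
;; the parse result is threaded straight into the traversal, so the
;; already-consumed prefix of the seq is free for garbage collection.
(defn count-records [file]
  (->> (lx/parse-trim (io/reader file))
       :content                          ; lazy seq of top-level children
       second                            ; the records' parent (hypothetical layout)
       :content                          ; lazy seq of record elements
       (reduce (fn [n _] (inc n)) 0)))   ; consume without retaining the head
```

Since `reduce` holds only its accumulator, nothing here retains the root or the seq's head while the records stream past.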
Re: Problems with lazy-xml
On Sat, Feb 12, 2011 at 4:16 AM, Marko Topolnik marko.topol...@gmail.com wrote:

> As I've figured it out, when there's XPP on the classpath and I'm using
> it, the code that does the parsing is entirely different and does not
> involve a separate thread (see with_pull.clj). The parser sits and waits
> for events to be requested from the lazy seq.

Yes, XPP is superior to SAX, especially for this sort of laziness.

> I was under the impression that if the client was careful to lose the
> nodes' parents, they would be free for garbage collection, as would
> previous siblings. The point is to first navigate to the desired
> :content lazy seq and then lose all other refs. Then you are in the same
> position as with any old lazy seq -- you run through it without
> retaining the head, and all is well.

It may work, but it wasn't a design goal when I was originally writing lazy-xml, so it's possible I have stray head- or root-holding in the code.

The problem you're observing is because of my use of drop-last, which forces the parsing of one extra sibling at *each* level of the tree. In your test case, this is a *lot* of extra parsing at exactly the point where you don't want it.

The drop-last is used because the last node of each interior sequence holds the non-descendant events and so must be dropped from the content seq. I'm currently trying to come up with a different way of passing around the non-descendant events so that drop-last isn't necessary, but it's ...tricky, at least for my poor fuzzy brain.
--Chouser
http://joyofclojure.com/
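The extra-sibling effect of drop-last can be seen with a toy seq that records which elements get realized (a sketch, not lazy-xml itself; a list is used because seqs over lists are not chunked):

```clojure
(def realized (atom []))

;; A lazy seq that records each element as it is realized.
(def s (map #(do (swap! realized conj %) %) (list 0 1 2 3 4)))

;; drop-last must peek one element ahead to know whether the current
;; element is the last one, so taking the *first* item realizes *two*:
(first (drop-last s))  ; => 0
@realized              ; => [0 1]
```

Applied at every level of an XML tree, that one-element lookahead means parsing an entire extra sibling subtree per level.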
Re: Problems with lazy-xml
How about replacing (drop-last sibs) with (remove vector? sibs)? remove will not access the next seq member in advance, and the only vector in sibs is the last element. I tried this change and it works for the test code from the original post.

On Feb 12, 4:43 pm, Chouser chou...@gmail.com wrote:

> Yes, XPP is superior to SAX, especially for this sort of laziness.
>
> It may work, but wasn't a design goal when I was originally writing
> lazy-xml, so it's possible I have stray head- or root-holding in the
> code.
>
> The problem you're observing is because of my use of drop-last, which
> forces the parsing of one extra sibling at *each* level of the tree. In
> your test case, this is a *lot* of extra parsing at exactly the point
> where you don't want it.
>
> The drop-last is used because the last node of each interior sequence
> holds the non-descendant events and so must be dropped from the content
> seq. I'm currently trying to come up with a different way of passing
> around the non-descendant events so that drop-last isn't necessary, but
> it's ...tricky, at least for my poor fuzzy brain.
>
> --Chouser
> http://joyofclojure.com/
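The reason (remove vector? sibs) avoids the problem is that remove realizes elements only as they are requested, with no lookahead. A toy comparison (again a sketch with a plain list, not lazy-xml's actual seqs):

```clojure
(def realized (atom []))
(def s (map #(do (swap! realized conj %) %) (list 1 2 3 [:tail])))

;; remove checks elements one at a time and returns each passing element
;; immediately, so taking the first item realizes only one element:
(first (remove vector? s))  ; => 1
@realized                   ; => [1]
```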
Re: Problems with lazy-xml
Also, the xpp-based parser is almost an order of magnitude slower than the sax-based one. The only thing it lacks is a couple of type hints:

    (defn- attrs [^XmlPullParser xpp]

    (defn- ns-decs [^XmlPullParser xpp]

    (let [step (fn [^XmlPullParser xpp]

These hints increase the performance from 400% slower to 30% faster than sax.
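The hints matter because without them every method call on the parser goes through Java reflection. A minimal illustration of the same effect, using a plain java.util.List stand-in instead of XmlPullParser so it runs without XPP on the classpath:

```clojure
(set! *warn-on-reflection* true)

;; Without a hint the compiler cannot resolve .size at compile time,
;; so it emits a (slow) reflective call and prints a reflection warning:
(defn slow-count [coll] (.size coll))

;; With a type hint the call compiles to a direct interface invocation:
(defn fast-count [^java.util.List coll] (.size coll))

(fast-count [1 2 3])  ; => 3
```

Turning on *warn-on-reflection* as above is how such missing hints are usually found in the first place.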
Re: Problems with lazy-xml
On Feb 12, 7:55 pm, Marko Topolnik marko.topol...@gmail.com wrote:

> How about replacing (drop-last sibs) with (remove vector? sibs)?

This was slightly naive. We also need these changes:

In siblings:

    :end-element [[(rest s)]]

In mktree:

    (cons (struct element (:name elem) (:attrs elem) (remove vector? sibs))
          (lazy-seq (get (last sibs) 0)))
Re: Problems with lazy-xml
In fact, it is enough to replace (drop-last sibs) with (remove seq? sibs).

On Feb 12, 9:54 pm, Marko Topolnik marko.topol...@gmail.com wrote:

> This was slightly naive. We also need these changes:
>
> In siblings:
>
>     :end-element [[(rest s)]]
>
> In mktree:
>
>     (cons (struct element (:name elem) (:attrs elem) (remove vector? sibs))
>           (lazy-seq (get (last sibs) 0)))
Re: Problems with lazy-xml
Right now I'm working with a 300k-record file, but the code must scale into the millions, and, as I mentioned, it is already spewing OutOfMemoryError. Also, on a more abstract level, it's just not right to thrash the memory of a concurrent server-side component for absolutely no good reason.
Re: Problems with lazy-xml
Can you post a link to a (sanitized, if need be) sample file?

On Feb 11, 1:21 am, Marko Topolnik marko.topol...@gmail.com wrote:

> Right now I'm working with a 300k-record file, but the code must scale
> into the millions, and, as I mentioned, it is already spewing
> OutOfMemoryError. Also, on a more abstract level, it's just not right to
> thrash the memory of a concurrent server-side component for absolutely
> no good reason.
Re: Problems with lazy-xml
http://db.tt/iqTo1Q4

This is a sample XML file with 1000 records -- enough to notice a significant delay when evaluating the code from the original post.

Chouser, could you spare a second here? I've been looking and looking at mktree and siblings for two days now and can't for the life of me find out why it would eagerly parse the whole contents of an element as soon as I access its struct! The code looks perfectly correct.
Re: Problems with lazy-xml
I can confirm that the same thing is happening on my end as well. The XML is parsed lazily:

    user=> (time (let [root (parse-trim (reader "huge.xml"))]
             (-> root :content type)))
    "Elapsed time: 45.57367 msecs"
    clojure.lang.LazySeq

...but as soon as I try to do anything with the struct map for the DataArea element (the second element in root's content), the entire element appears to be parsed eagerly:

    user=> (time (let [root (parse-trim (reader "huge.xml"))]
             (-> root :content second type)))
    "Elapsed time: 884.905205 msecs"
    clojure.lang.PersistentStructMap

I spent some time looking at the source for lazy-xml as well, but wasn't able to locate where the problem lies :(

On Feb 11, 3:07 am, Marko Topolnik marko.topol...@gmail.com wrote:

> http://db.tt/iqTo1Q4
>
> This is a sample XML file with 1000 records -- enough to notice a
> significant delay when evaluating the code from the original post.
>
> Chouser, could you spare a second here? I've been looking and looking at
> mktree and siblings for two days now and can't for the life of me find
> out why it would eagerly parse the whole contents of an element as soon
> as I access its struct! The code looks perfectly correct.
Re: Problems with lazy-xml
On Feb 11, 5:07 am, Marko Topolnik marko.topol...@gmail.com wrote:

> Chouser, could you spare a second here? I've been looking and looking at
> mktree and siblings for two days now and can't for the life of me find
> out why it would eagerly parse the whole contents of an element as soon
> as I access its struct! The code looks perfectly correct.

Just guessing, but is it something to do with this (from the docstring of parse-seq)?

> it will be run in a separate thread and be allowed to get ahead by
> queue-size items, which defaults to maxint

Doesn't sound like it's actually lazy unless you explicitly specify a queue-size.

- Chris
Re: Problems with lazy-xml
On Fri, Feb 11, 2011 at 2:35 PM, Chris Perkins chrisperkin...@gmail.com wrote:

> On Feb 11, 5:07 am, Marko Topolnik marko.topol...@gmail.com wrote:
>
>> Chouser, could you spare a second here? I've been looking and looking
>> at mktree and siblings for two days now and can't for the life of me
>> find out why it would eagerly parse the whole contents of an element as
>> soon as I access its struct! The code looks perfectly correct.

I can reproduce the behavior you describe. I'll look into it.

> Just guessing, but is it something to do with this (from the docstring
> of parse-seq)? "it will be run in a separate thread and be allowed to
> get ahead by queue-size items, which defaults to maxint." Doesn't sound
> like it's actually lazy unless you explicitly specify a queue-size.

There are a few different reasons someone might want lazy. You might want to be able to start examining the tree before it's done being parsed -- this is the default behavior of lazy-xml. Your code will block if it catches up with the parser, but shouldn't block earlier than it has to.

Having the parsing thread do no more work than necessary (queue size 1) is another possibly desired behavior -- sounds like the OP's desire, in which case the queue size does need to be specified.

Finally, you might want to be able to walk an XML document larger than would fit in memory. I'm not sure if lazy-xml has ever been able to do this as it would need to be vigilant about not retaining the root of the returned document tree.

--Chouser
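Chouser's "queue size 1" case would look roughly like this. Note the hedging: only the queue-size option itself is confirmed by the docstring quoted in this thread -- the argument order, the startparse-sax var, and the file name are all assumptions here:

```clojure
(require '[clojure.contrib.lazy-xml :as lx]
         '[clojure.java.io :as io])

;; With the SAX backend, a queue size of 1 keeps the parsing thread from
;; racing ahead of the consumer. Signature and file name are hypothetical.
(def events (lx/parse-seq (io/reader "huge.xml") lx/startparse-sax 1))

(take 3 events)  ; pulls only the first few parse events
```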
Problems with lazy-xml
I am required to process a huge XML file with 300,000 records. The structure is like this:

    <root>
      <header>...</header>
      <body>
        <record>...</record>
        <record>...</record>
        ... 299,998 more ...
      </body>
    </root>

Obviously, it is of key importance not to allocate memory for all the records at once. If I do this:

    (use '[clojure.contrib.lazy-xml :only [parse-trim]])
    (use '[clojure.java.io :only [reader]])

    (-> (parse-trim (reader "huge.xml"))
        :content second :tag)

This should only parse the start-tag <body>, but it parses all the way down to </body> -- at least it tries to, failing with OutOfMemoryError. Am I wrong in expecting the entire contents of <body> not to be parsed? :content is supposed to be a lazy seq, so even if I access its head, it should still not parse more than just the first record element, right?
Re: Problems with lazy-xml
On Thu, 10 Feb 2011 07:22:55 -0800 (PST) Marko Topolnik marko.topol...@gmail.com wrote:

> I am required to process a huge XML file with 300,000 records.
> Obviously, it is of key importance not to allocate memory for all the
> records at once.

I don't think it's obvious. Maybe I'm missing something? Like - how big are the records? If they're less than 1K, that's at most 300 meg in core - which is large, but not impossible on modern hardware. I've been handling .5G data structures in core for the last few years (in Python, anyway). I've run into at least one stupid garbage collector that insisted on scanning such structures even though they weren't changing, which pretty much killed performance.

Maybe you have a fast startup requirement, which building the initial data structure would kill. Maybe something else?

Thanks,
mike

--
Mike Meyer m...@mired.org  http://www.mired.org/consulting.html
Independent Network/Unix/SCM consultant, email for more information.

O< ascii ribbon campaign - stop html mail - www.asciiribbon.org