Re: [racket-users] Re: note about parsing speed of xml vs sxml?

2020-06-29 Thread Alex Harsanyi


On Tuesday, June 30, 2020 at 7:48:14 AM UTC+8 Neil Van Dyke wrote:

> Is even 2x speedup helpful for your purpose? 


Yes it is, and for my purpose `read-xml` is fine even without any speed 
improvement.  In the sports field, XML (via the TCX format) is a legacy 
technology.  Typical TCX files are about 1MB in size; the 14MB one is 
very large.  Setting `xml-count-bytes` to #t while calling `read-xml` 
gets me a speed improvement at low effort, but it is not worth adding 
another package dependency just to support a legacy technology.

> 3 seconds is one old magic 
> number for user patience in HCI, so I suppose there's still a big 
> difference between 4 seconds and almost 10 seconds? 
>

I am not sure where you got the 3 seconds from, but even 3 seconds is too 
long to wait on a button callback.  For large files, both read-xml and sxml 
would need to have a progress dialog with a cancel button, or some other 
form of user feedback, if one wants to make a "well behaved" GUI.
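
A minimal sketch of one way to keep such a callback responsive, assuming the 
parse is moved onto a separate Racket thread.  The names `start-parse` and 
`on-done` are hypothetical and not part of the `xml` library.  In a real GUI, 
`on-done` would presumably hand the result back to the event-handling thread 
(for example with `queue-callback` from `racket/gui`), and a cancel button 
would call `kill-thread` on the worker.

```racket
#lang racket/base
(require xml)

;; Hypothetical helper: parse `in` on a worker thread and pass the result
;; (a document, or the exception if parsing failed) to `on-done`.
;; Returns the worker thread so the caller can cancel it with `kill-thread`.
(define (start-parse in on-done)
  (thread
   (lambda ()
     (on-done (with-handlers ([exn:fail? values])
                (read-xml in))))))

;; Usage, with a channel standing in for the GUI's completion callback.
(define done (make-channel))
(define worker
  (start-parse (open-input-string "<a><b/></a>")
               (lambda (result) (channel-put done result))))

;; A "Cancel" button would call (kill-thread worker) instead of waiting here.
(define doc (channel-get done))
```

Racket threads are scheduled preemptively, so the thread that started the 
parse remains free to handle events while the worker runs.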
 

> For large (and absolutely massive) XML... SSAX can shine even better 
> than in this comparison, since you can, say, populate a database *while 
> you're parsing, without first constructing the intermediate 
> representation* of xexpr or SXML.  GC-wise, with the database-populating 
> scenario, you'll probably end up with small, little-referencing, local, 
> short-lived allocations.  Besides GC costs, you'll also use less RAM 
> (possibly lower AWS bill), and be less likely to push into swap (which 
> would be bad for performance). 
>

... if you are willing to deal with the complexity of a SAX interface, that 
is.  I have written code for parsing documents (correctly!) using a SAX 
interface, and the resulting code was so complex that I had to use a code 
generator for it, but yes, the resulting code was very fast.   Would I do 
it again? No.

The complexity of SAX parsing is probably why most people use a DOM style 
interface...
 

> In addition to SSAX's current performance characteristics and 
> opportunities... There might also be opportunity to optimize SSAX 
> significantly for Racket. Oleg is a famously capable Scheme programmer, 
> but he was writing SSAX in fairly portable Scheme code, a couple decades 
> ago, when he wrote SSAX.  I did an initial packaging of SSAX for PLT 
> Scheme, Kirill Lisovsky later did many packagings of various SXML-ish 
> tools (including his own), and then John Clements did more work to 
> package Oleg's SXML-ish tools for Racket... But I don't know that anyone 
> has had motivation to try to optimize Racket's SSAX port, using current 
> Racket features, and tuning for current performance characteristics. 
>

> Side note regarding performance comparison... FWIW, SSAX might be doing 
> some things `read-xml` doesn't, such as namespace resolution, entity 
> reference resolution, and some validation. 
>

You used the phrase "might be doing...", does that mean that it might not 
do those things?

Alex.

 

-- 
You received this message because you are subscribed to the Google Groups 
"Racket Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to racket-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/racket-users/affcfe0e-a5a7-43a6-9019-8876dc40ed03n%40googlegroups.com.


Re: [racket-users] Re: note about parsing speed of xml vs sxml?

2020-06-29 Thread Neil Van Dyke
Is even 2x speedup helpful for your purpose?  3 seconds is one old magic 
number for user patience in HCI, so I suppose there's still a big 
difference between 4 seconds and almost 10 seconds?


For large (and absolutely massive) XML... SSAX can shine even better 
than in this comparison, since you can, say, populate a database *while 
you're parsing, without first constructing the intermediate 
representation* of xexpr or SXML.  GC-wise, with the database-populating 
scenario, you'll probably end up with small, little-referencing, local, 
short-lived allocations.  Besides GC costs, you'll also use less RAM 
(possibly lower AWS bill), and be less likely to push into swap (which 
would be bad for performance).


In addition to SSAX's current performance characteristics and 
opportunities... There might also be opportunity to optimize SSAX 
significantly for Racket.  Oleg is a famously capable Scheme programmer, 
but he was writing SSAX in fairly portable Scheme code, a couple decades 
ago, when he wrote SSAX.  I did an initial packaging of SSAX for PLT 
Scheme, Kirill Lisovsky later did many packagings of various SXML-ish 
tools (including his own), and then John Clements did more work to 
package Oleg's SXML-ish tools for Racket... But I don't know that anyone 
has had motivation to try to optimize Racket's SSAX port, using current 
Racket features, and tuning for current performance characteristics.


Side note regarding performance comparison... FWIW, SSAX might be doing 
some things `read-xml` doesn't, such as namespace resolution, entity 
reference resolution, and some validation.




[racket-users] Re: note about parsing speed of xml vs sxml?

2020-06-29 Thread Alex Harsanyi
I installed the sxml package out of curiosity, and while it is faster, it 
is not 4 times as fast as your tests indicate. I used the following test 
program with a 14MB XML file (a bike ride in TCX format):

(require xml sxml)  ; `read-xml` is from xml, `ssax:xml->sxml` from the sxml package

(define file-name
  "../MyPackages/more-df-tests/tcx-data/2015-09-27-0755_Road_Cycling_WF.tcx")
;; Make sure the file is in the cache
(call-with-input-file file-name
  (lambda (in)
    (let loop ([c (read-char in)])
      (unless (eof-object? c)
        (loop (read-char in))))))
(collect-garbage 'major)
(time (void (call-with-input-file file-name
              (lambda (in) (ssax:xml->sxml in null)))))
(collect-garbage 'major)
(time (void (call-with-input-file file-name read-xml)))

On my laptop the times are:

 ssax:xml->sxml : cpu time: 4031 real time: 4128 gc time: 157
 read-xml: cpu time: 9578 real time: 10031 gc time: 3270

The big difference I found so far is that `read-xml` stores the 
location (line number, column and file offset) for each element, and 
enables `port-count-lines!` by default.  If I use:

(parameterize ([xml-count-bytes #t])
  (time (void (call-with-input-file file-name read-xml))))

The results are much closer together, although `read-xml` is still slower 
and spends more time in the garbage collector:

 ssax:xml->sxml :  cpu time: 4187 real time: 4233 gc time: 202
 read-xml: cpu time: 5797 real time: 5824 gc time: 1251

Perhaps a note could be added to the documentation indicating that users 
can speed up `read-xml` significantly if they set `xml-count-bytes` to #t.
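
As a self-contained illustration of the kind of example such a note could 
carry, the sketch below times `read-xml` with and without `xml-count-bytes` 
on a synthetic in-memory document.  The document contents and the names 
`xml-text` and `parse` are assumptions made for the example; the timings it 
prints will differ from the TCX figures above and are not reproduced here.

```racket
#lang racket/base
(require xml)

;; Synthetic test document built in memory (an assumption for this sketch;
;; the measurements above used a real TCX file on disk).
(define xml-text
  (let ([out (open-output-string)])
    (write-string "<root>" out)
    (for ([i (in-range 50000)])
      (fprintf out "<point id=\"~a\"><lat>1.5</lat><lon>2.5</lon></point>\n" i))
    (write-string "</root>" out)
    (get-output-string out)))

;; Parse the document from a fresh string port each time.
(define (parse)
  (read-xml (open-input-string xml-text)))

;; Default settings.
(collect-garbage 'major)
(define doc-default (time (parse)))

;; Same parse with `xml-count-bytes` set to #t.
(collect-garbage 'major)
(define doc-bytes
  (parameterize ([xml-count-bytes #t])
    (time (parse))))
```

Both runs should produce the same element structure; only the recorded 
location information is expected to differ.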

Alex.

On Saturday, June 27, 2020 at 11:05:42 AM UTC+8 'John Clements' via 
users-redirect wrote:

> I’m parsing a large-ish apple plist file, (18 megabytes), and I find that 
> the built-in xml parsing (read-xml) takes about five times as long as the 
> sxml version (11 seconds vs 2.4 seconds on my machine), and that the plist 
> parser is way longer, at 18 seconds.
>
> Would anyone object if I added a margin note to this effect to the xml 
> docs?
>
> John
>
>
>
>



Re: [racket-users] Re: note about parsing speed of xml vs sxml?

2020-06-29 Thread Hendrik Boom
On Sun, Jun 28, 2020 at 06:01:27PM -0700, Alex Harsanyi wrote:
> I tested your string port version and I also wrote a "string-append" 
> version of the xml reader and they are both slower by about 10-15% on my 
> machine, when compared to the current read-xml implementation which uses 
> `list->string`.  It looks like `list->string` is not the bottleneck here.
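
For readers who have not seen the two approaches side by side, here is a 
minimal sketch of accumulating a token with `list->string` versus a string 
port.  The functions `read-name/list` and `read-name/port` are hypothetical 
and are not the code in the `xml` collection or in the PR; `char-alphabetic?` 
stands in for the library's real name-character test.

```racket
#lang racket/base

;; Variant 1: cons characters onto a list, then reverse and convert once.
(define (read-name/list in)
  (let loop ([chars '()])
    (define c (peek-char in))
    (if (and (char? c) (char-alphabetic? c))
        (begin
          (read-char in)
          (loop (cons c chars)))
        (list->string (reverse chars)))))

;; Variant 2: write characters to a string port, which grows as needed.
(define (read-name/port in)
  (define out (open-output-string))
  (let loop ()
    (define c (peek-char in))
    (when (and (char? c) (char-alphabetic? c))
      (write-char (read-char in) out)
      (loop)))
  (get-output-string out))

(read-name/list (open-input-string "hello world")) ; => "hello"
(read-name/port (open-input-string "hello world")) ; => "hello"
```

Both variants stop at the first non-name character and leave it unread in 
the port.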

Odd -- to remove all that storage-allocation overhead and to find it 
gets slower...

Perhaps the overhead lies in the Scheme interpreter?  Does it allocate 
lots of storage?
Would using chez racket help any?

-- hendrik
> 
> There are some small improvements that can be made from micro 
> optimizations.  For example, I changed `name-char?` to not use 
> `name-start?` but instead check for all chars, and I also changed 
> `lex-name` to construct the list in reverse and use `(list->string (reverse 
> chars))`, plus I reordered the cond condition to check the common case 
> first (that the next character is a name-char? and not a 'special one).  
> However, this resulted in only about 5-10% speed improvement, nowhere near 
> the 4x speedup from using `sxml`, as reported by John.
> 
> In the end, it may well be that speeding up `read-xml` can only be done by 
> these types of micro-optimizations.  Another thing I looked into was the 
> "pattern" used for reading: all the `read-xml` code uses the pattern of 
> "peeking" the next character, deciding if it is good, then reading it.  
> This is much slower than just reading the characters directly.  These are 
> the results from just reading in a 14MB XML file:
> 
> read-char only:  cpu time: 312 real time: 307 gc time: 0
> read-char-or-special only:  cpu time: 750 real time: 741 gc time: 0
> peek-char then read-char:  cpu time: 1234 real time: 1210 gc time: 0
> peek-char-or-special then read-char-or-special:  cpu time: 1688 real time: 1690 gc time: 0
> 
> Using this code:
> 
> (define file-name "your-test-file-here.xml")
> 
> (printf "read-char only~%")
> (collect-garbage 'major)
> (time
>  (call-with-input-file file-name
>    (lambda (in)
>      (let loop ([c (read-char in)])
>        (if (eof-object? c)
>            (void)
>            (loop (read-char in)))))))
> 
> (printf "read-char-or-special only~%")
> (collect-garbage 'major)
> (time
>  (call-with-input-file file-name
>    (lambda (in)
>      (let loop ([c (read-char-or-special in)])
>        (if (eof-object? c)
>            (void)
>            (loop (read-char-or-special in)))))))
> 
> (printf "peek-char then read-char~%")
> (collect-garbage 'major)
> (time
>  (call-with-input-file file-name
>    (lambda (in)
>      (let loop ([c (peek-char in)])
>        (if (eof-object? c)
>            (void)
>            (begin
>              (void (read-char in))
>              (loop (peek-char in))))))))
> 
> (printf "peek-char-or-special then read-char-or-special~%")
> (collect-garbage 'major)
> (time
>  (call-with-input-file file-name
>    (lambda (in)
>      (let loop ([c (peek-char-or-special in)])
>        (if (eof-object? c)
>            (void)
>            (begin
>              (void (read-char-or-special in))
>              (loop (peek-char-or-special in))))))))
> 
> Alex.
> 
> On Monday, June 29, 2020 at 5:30:43 AM UTC+8 rmculp...@gmail.com wrote:
> 
> > Thanks Alex for pointing out the use of list->string. I've created a PR (
> > https://github.com/racket/racket/pull/3275) that changes that code to use 
> > string ports instead (similar to Hendrik's suggestion, but the string port 
> > handles resizing automatically). Could someone (John?) with some large XML 
> > files lying around try the changes and see if they help?
> >
> > Ryan
> >
> >
> > > On Sun, Jun 28, 2020 at 9:56 PM Neil Van Dyke wrote:
> >
> >> If anyone wants to optimize `read-xml` for particular classes of use, 
> >> without changing the interface, it might be very helpful to run your 
> >> representative tests using the statistical profiler.
> >>
> >> The profiler text report takes a little while of tracing through 
> >> manually to get a feel for how to read and use it, but it can be 
> >> tremendously useful, and is worth learning to do if you need performance.
> >>
> >> After a first pass with that, you might also want to look at how costly 
> >> allocations/GC are, and maybe do some controlled experiments around 
> >> that.  For example, force a few GC cycles, run your workload under 
> >> profiler, check GC time during, and forced time after.  If you're 
> >> dealing with very large graphs coming out of the parser, I don't know 
> >> whether those are enough to matter with the current GC mechanism, but 
> >> maybe also check GC time while you're holding onto large graphs, when 
> >> you release them, and after they've been collected.  At some point, GC 
> >> gets hard for at least me to reason about, but some things make sense, 
> >> and other things you decide when to stop digging. :)  If you record all 
> >> your measurements, you can compare empir

Re: [racket-users] note about parsing speed of xml vs sxml?

2020-06-29 Thread Bonface M. K.
Neil Van Dyke writes:

> I think anyone using XML or HTML seriously with Racket should probably at
> least be told of the SXML family of tools.  And warned about the
> compatibility problems.
>
> Though there's no need to tell them *everywhere* XML&HTML come up in the
> docs.  For example, I figure a tutorial for Racket Web Server shouldn't
> distract readers with that.
>
> As you know, :) there are some useful tools using SXML, and Oleg's SSAX
> parser has some different properties than core Racket's XML parser.
>
> Complication: The incompatibility between SXML and core Racket's
> representations of XML&HTML is an unfortunate accident of parallel
> invention, and I think will tend to be confusing to new people.  I once
> tried to address the confusion in the `sxml-intro` documentation package,
> "https://www.neilvandyke.org/racket/sxml-intro/", and I'm unhappy with the
> result.  The details in my document say more than perhaps anyone will ever
> want to know, and, "optics"-wise, make the situation look worse than it
> actually is in practice.  I think you could do a more graceful job of this.
>
> (Someday, someone might undertake the large task of SXML-ifying all the many
> non-SXML bits of Racket, and incidentally reunite Racket with the rest of the
> Scheme community in that regard.  I started, with one piece, but got
> interrupted. "https://www.neilvandyke.org/racket/rws-html-template/"  :)
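
To make the incompatibility mentioned above concrete: the most visible 
difference is how attributes are written.  An xexpr puts them in a plain 
list after the tag, while SXML wraps them in an `@` form.  The converter 
below is a minimal sketch under that assumption; `xexpr->sxml` is a 
hypothetical helper, not a function from the `xml` or `sxml` libraries, and 
it handles only elements, attributes and strings (entities, comments, 
processing instructions and namespaces pass through untouched).

```racket
#lang racket/base
(require racket/match)

;; xexpr: (tag ((attr "val") ...) body ...)  or  (tag body ...)
;; SXML:  (tag (@ (attr "val") ...) body ...) or  (tag body ...)
(define (xexpr->sxml x)
  (match x
    ;; Element with an attribute list (possibly empty).
    [(list (? symbol? tag)
           (list (list (? symbol? ks) (? string? vs)) ...)
           body ...)
     (define kids (map xexpr->sxml body))
     (if (null? ks)
         (cons tag kids)
         (list* tag (cons '@ (map list ks vs)) kids))]
    ;; Element without an attribute list.
    [(list (? symbol? tag) body ...)
     (cons tag (map xexpr->sxml body))]
    ;; Strings and anything else this sketch does not handle.
    [_ x]))

(xexpr->sxml '(a ((href "x")) "t" (b "u")))
;; => '(a (@ (href "x")) "t" (b "u"))
```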

Thanks for this! Tbh, I never knew of this.

-- 
Bonface M. K. (https://www.bonfacemunyoki.com)
One Divine Emacs To Rule Them All
GPG key = D4F09EB110177E03C28E2FE1F5BBAE1E0392253F
