Re: [racket-users] requirements for streaming html parser

2019-06-07 Thread Neil Van Dyke

Greg Hendershott wrote on 6/6/19 9:51 PM:

> Although I don't think I currently /need/ a streaming parser for speed or space
> reasons, I can imagine using one.


I'm not certain I immediately need the performance boost either. But a 
Web proxy idea I'm toying with would need at least one great DSL, to 
encourage casual distributed development among large numbers of people, 
and it seems a shame to do all the work to implement the somewhat 
performance-sensitive DSLs atop an inefficient interface. (And I suspect 
that such DSLs would need to be almost totally rewritten if later moving 
to a streaming interface.  This isn't a startup, in which we have at 
least subsistence funding initially, and the expectation of more 
resources for rewriting later.)


For an approximately real-time HTTP proxy application, being efficient 
means keeping typical page loads responsive, and minimizing GC burps.  
For a different application, such as actively crawling the Web, or 
processing billions of pages from a pre-crawled Web corpus, on a tight 
research/non-profit/startup budget, performance of the parser might mean 
you can use smaller or fewer servers (though your "cloud" 
infrastructure traffic and transfer costs would probably remain the 
same), and also reduce your GC time.



> I'd suggest making something where the user supplies an "on-element" "callback", which is
> called with each element -- plus the "path" of ancestor elements. The user's callback can do
> whatever it wants.


That sounds like one good interface.  It would probably be implemented 
atop a more flexible fold interface, somewhat like in Oleg Kiselyov's 
SSAX, or in my `json-parsing` experiment.  And other interfaces would 
also be implemented atop that.  I'd do the underlying fold interface a 
bit differently, and I'd have to analyze the actual performance.

http://okmij.org/ftp/Scheme/xml.html
https://www.neilvandyke.org/racket/json-parsing/
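
To make the fold idea concrete, here is a minimal sketch in Python rather than Racket, using the standard `html.parser` module as a stand-in tokenizer. The names `fold_html`, `title_start`, `title_end`, and `title_text` are hypothetical, and this is not the actual API of SSAX or `json-parsing`; it only shows the shape of the idea: each handler takes the current seed and returns the next one, so no tree of the page is ever built.

```python
from html.parser import HTMLParser


class _FoldParser(HTMLParser):
    """Threads a user-supplied seed through start-tag, end-tag, and text events."""

    def __init__(self, seed, on_start, on_end, on_text):
        super().__init__()
        self.seed = seed
        self._on_start, self._on_end, self._on_text = on_start, on_end, on_text

    def handle_starttag(self, tag, attrs):
        self.seed = self._on_start(tag, attrs, self.seed)

    def handle_endtag(self, tag):
        self.seed = self._on_end(tag, self.seed)

    def handle_data(self, data):
        self.seed = self._on_text(data, self.seed)


def fold_html(chunks, seed, on_start, on_end, on_text):
    """Feed the chunks one at a time and return the final seed."""
    parser = _FoldParser(seed, on_start, on_end, on_text)
    for chunk in chunks:
        parser.feed(chunk)
    parser.close()
    return parser.seed


# Example: pull out the <title> text; the seed is (inside_title, text_so_far).
def title_start(tag, attrs, seed):
    return (True, seed[1]) if tag == "title" else seed


def title_end(tag, seed):
    return (False, seed[1]) if tag == "title" else seed


def title_text(data, seed):
    return (seed[0], seed[1] + data) if seed[0] else seed


chunks = ["<html><head><title>Stream", "ing</title></head><body><p>x</p></body></html>"]
_, title = fold_html(chunks, (False, ""), title_start, title_end, title_text)
print(title)
```

An "on-element" callback interface, or a pattern-based one, could then be written as particular choices of seed and handlers on top of something like `fold_html`.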

Some of the SXML transformation languages are also worth looking at, for 
inspiration.  Oleg did one, and Jim Bender's `sxml-match` is another.  I 
did one like Oleg's, tried to do an improved one (that might've been the 
scariest `syntax-rules` example ever), and did an unfortunately 
closed-source one similar to `sxml-match` but with some additional 
features.  There are also probably places for using modern Racket `match`.

https://planet.racket-lang.org/display.ss?package=sxml-match.plt&owner=jim

There are also XPath and CSS selectors, as you said, and (in my scraping 
experience) you sometimes start with `sxpath`, to find a desired chunk 
of the page, and then switch to other methods to process that chunk.



> I think it would be good if you host Racket packages on GitHub, GitLab, or
> similar other site


I agree, and I'd probably start with new packages.  (And I *almost* 
moved old packages to GitHub, right before the acquisition was 
announced.  And ever since then, there's a question mark over how 
investors will cash out of GitLab, or whether they'll do something like 
signal reassuring intentions by switching to being a social benefit 
corp.; they just got another $100M of investment last September.)  The 
main barrier, though, is time and priorities, when not making 
money/career off this.  And if I had time to disrupt workflow, there are 
some other workflow changes I'd do first.


Thank you for your comments.

--
You received this message because you are subscribed to the Google Groups "Racket 
Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to racket-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/racket-users/a1e753e2-0ab0-9f4f-e575-99b269eccbf5%40neilvandyke.org.
For more options, visit https://groups.google.com/d/optout.


Re: [racket-users] requirements for streaming html parser

2019-06-06 Thread Greg Hendershott
Although I don't think I currently /need/ a streaming parser for speed
or space reasons, I can imagine using one.

I'd suggest making something where the user supplies an "on-element"
"callback", which is called with each element -- plus the "path" of
ancestor elements. The user's callback can do whatever it wants.

That could be its own, focused library. I won't say "simple" because
you're parsing HTML!! :)
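
As a rough illustration of the "on-element" callback idea, here is a minimal sketch in Python rather than Racket, built on the standard `html.parser` module. The class name `ElementStream` and the callback signature `(tag, attrs, path)` are assumptions made for the example, not an existing library's API. The callback fires at each start tag, and the path is the tuple of currently open ancestor tag names, outermost first.

```python
from html.parser import HTMLParser

# HTML elements that never take a closing tag, so they are never pushed on the path.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ElementStream(HTMLParser):
    """Calls on_element(tag, attrs, path) at each start tag, while parsing."""

    def __init__(self, on_element):
        super().__init__()
        self.on_element = on_element
        self.path = []  # open ancestor tags, outermost first

    def handle_starttag(self, tag, attrs):
        self.on_element(tag, dict(attrs), tuple(self.path))
        if tag not in VOID_TAGS:
            self.path.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/>: report it, but do not open it.
        self.on_element(tag, dict(attrs), tuple(self.path))

    def handle_endtag(self, tag):
        # Permissive: close back to the nearest matching open tag, if there is one.
        if tag in self.path:
            while self.path.pop() != tag:
                pass


seen = []
parser = ElementStream(lambda tag, attrs, path: seen.append((tag, path)))
# The input arrives in chunks; note the <a> tag is split across two of them.
for chunk in ["<html><body><p>Hi <a hre", "f='/x'>link</a><br></p></body></html>"]:
    parser.feed(chunk)
parser.close()
```

Passing only tag names in the path keeps the sketch short; a real design would probably want attributes of the ancestors available too.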

I can imagine other libraries built on top of that. One I would want to
use (or write myself, share, and use) would offer something like CSS
selectors. Not their syntax. Just some simple function combinators to
express the equivalent. (Because xml/path is close, and maybe enough for
XML, but not quite enough for real-world HTML.) In fact I already do
this, on the full HTML. I can imagine doing this on top of a streaming
parser, instead.
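
The "function combinators instead of selector syntax" idea might look something like the following minimal sketch, written in Python rather than Racket. All names here (`tag`, `has_class`, `both`, `child_of`, `descendant_of`) are hypothetical, as is the representation: an element is a `(tag, attrs)` pair, and a path is a tuple of ancestor elements, outermost first. A selector is just a function of an element and its path, so it can be applied during streaming, without the full page.

```python
def tag(name):
    """Matches elements with the given tag name (CSS: name)."""
    return lambda el, path: el[0] == name


def has_class(cls):
    """Matches elements whose class attribute contains cls (CSS: .cls)."""
    return lambda el, path: cls in (el[1].get("class") or "").split()


def both(*selectors):
    """Matches when every given selector matches (CSS: a.b)."""
    return lambda el, path: all(s(el, path) for s in selectors)


def child_of(parent_selector):
    """Matches when the immediate parent matches (CSS: parent > child)."""
    return lambda el, path: bool(path) and parent_selector(path[-1], path[:-1])


def descendant_of(ancestor_selector):
    """Matches when any ancestor matches (CSS: ancestor descendant)."""
    return lambda el, path: any(
        ancestor_selector(path[i], path[:i]) for i in range(len(path))
    )


# Rough equivalent of the CSS selector "div.content a".
content_link = both(tag("a"), descendant_of(both(tag("div"), has_class("content"))))

path = (("html", {}), ("div", {"class": "main content"}), ("p", {}))
print(content_link(("a", {"href": "/x"}), path))
```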

So those are my quick thoughts. I hope that's helpful, and also, other
people will have even better feedback for you.


p.s. One suggestion I have, which you might not like: I think it would
be good if you host Racket packages on GitHub, GitLab, or similar other
site you find least objectionable. I respect your rationale for not
doing that, to-date. On the other hand, people these days like to see
the full git commit history, issues, and pull requests. It helps them
evaluate a package, and feel good about future availability. When they
don't, it can be a speed bump to adoption. If you're aware of all this
but still don't want to do that, again I 100% respect that. Just my
opinion and perspective.



[racket-users] requirements for streaming html parser

2019-06-06 Thread Neil Van Dyke
If anyone has a use for a *streaming* permissive HTML parser (i.e., one 
that calls your specific bits of code while it's parsing, rather than it 
constructing some kind of representation of the entire page for your 
code to process afterwards), I'd be interested in what specifically 
you'd like it to do.


(For example, in a simplifying/security Web proxy, in which you want to 
reduce memory requirements on the proxy host, and perhaps also response 
latency, by transforming as you go.  Or in a Web scraper for large query 
results, in which, say, you want to be sending a large amount of 
extracted data as rows to a different database without buffering up a 
huge page and/or the rows.  Or for scraping a small part of a large page 
without allocating a parsed representation of the entire page.  Or maybe 
you'd like the performance properties of streaming, and the convenience 
of a pattern-based scraper or transformation language atop that.)
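
As an illustration of the large-query-results case, here is a minimal sketch in Python rather than Racket, using the standard `html.parser` module. The names `RowExtractor` and `stream_rows` are hypothetical, and the sketch assumes table cells and rows have explicit end tags. Each row is handed to the consumer as soon as its `</tr>` has been parsed, so neither the whole page nor the whole row set is held in memory.

```python
from html.parser import HTMLParser


class RowExtractor(HTMLParser):
    """Collects the cell text of each <tr> as it is completed."""

    def __init__(self):
        super().__init__()
        self.ready = []   # rows completed since the last drain
        self.row = None   # cells of the <tr> being parsed, or None
        self.cell = None  # text pieces of the <td>/<th> being parsed, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
        elif tag in ("td", "th") and self.row is not None:
            self.cell = []

    def handle_data(self, data):
        if self.cell is not None:
            self.cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.cell is not None:
            self.row.append("".join(self.cell).strip())
            self.cell = None
        elif tag == "tr" and self.row is not None:
            self.ready.append(tuple(self.row))
            self.row = None
            self.cell = None


def stream_rows(chunks):
    """Yield each table row as a tuple of strings, as soon as it is complete."""
    parser = RowExtractor()
    for chunk in chunks:
        parser.feed(chunk)
        yield from parser.ready
        parser.ready.clear()
    parser.close()
    yield from parser.ready


chunks = ["<table><tr><td>1</td><td>al", "pha</td></tr><tr><td>2</td>",
          "<td>beta</td></tr></table>"]
for row in stream_rows(chunks):
    print(row)  # in a real scraper, this is where the row would go to the database
```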

