Greg Hendershott wrote on 6/6/19 9:51 PM:
> Although I don't think I currently /need/ a streaming parser for speed
> or space reasons, I can imagine using one.
I'm not certain I immediately need the performance boost either. But a
Web proxy idea I'm toying with would need at least one great DSL, to
encourage casual distributed development among large numbers of people,
and it seems a shame to do all the work of implementing the somewhat
performance-sensitive DSLs atop an inefficient interface. (I also
suspect that such DSLs would need to be almost totally rewritten if they
later moved to a streaming interface. This isn't a startup, where we'd
have at least subsistence funding initially and the expectation of more
resources for rewriting later.)
For an approximately real-time HTTP proxy application, being efficient
means typical responsiveness of page loads, and minimizing GC burps.
For a different application -- actively crawling the Web, or processing
billions of pages from a pre-crawled Web corpus, on a tight
research/non-profit/startup budget -- parser performance might mean you
can use fewer or smaller servers (though your "cloud" infrastructure
traffic and transfer costs would probably remain the same), and also
reduce your GC time.
> I'd suggest making something where the user supplies an "on-element"
> "callback", which is called with each element -- plus the "path" of
> ancestor elements. The user's callback can do whatever it wants.
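To make the suggested shape concrete, here's a minimal sketch of such an "on-element" callback interface, walking an already-parsed SXML tree rather than a stream (a real streaming parser would invoke the callback as each element completes). The names `walk-elements` and `on-element` are illustrative, not from any existing package.

```racket
#lang racket

;; Hypothetical sketch: call `on-element` with each element and the list
;; of its ancestor element names (innermost first).  Skips SXML `(@ ...)`
;; attribute lists; a real implementation would handle more node types.
(define (walk-elements sxml on-element [path '()])
  (when (and (pair? sxml) (symbol? (car sxml)) (not (eq? (car sxml) '@)))
    (on-element sxml path)
    (for ([child (in-list (cdr sxml))])
      (walk-elements child on-element (cons (car sxml) path)))))

;; Example: print each element name with its ancestor path.
(walk-elements '(html (body (p "hi") (p "there")))
               (lambda (elem path)
                 (printf "~a under ~a\n" (car elem) (reverse path))))
```

The callback sees the whole subtree for each element, so it can still do whatever it wants with descendants, while the path argument gives it the context that a pure tree-walk over `car`s would lose.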
That sounds like one good interface. It would probably be implemented
atop a more flexible fold interface, somewhat like the one in Oleg
Kiselyov's SSAX, or in my `json-parsing` experiment, and other
interfaces would also be implemented atop that. I'd do the underlying
fold interface a bit differently, and would have to analyze the actual
performance.
http://okmij.org/ftp/Scheme/xml.html
https://www.neilvandyke.org/racket/json-parsing/
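For a rough idea of the fold style in question: SSAX threads a user "seed" value through per-event handlers instead of building a tree. Here's a toy sketch of that shape over a pre-tokenized event list; the event representation and the `fold-events` name are illustrative only, not SSAX's actual API.

```racket
#lang racket

;; Hypothetical fold-style interface: each handler receives the event
;; payload and the current seed, and returns the next seed.
(define (fold-events events seed
                     #:on-start on-start #:on-end on-end #:on-text on-text)
  (for/fold ([seed seed]) ([ev (in-list events)])
    (match ev
      [`(start ,name) (on-start name seed)]
      [`(end ,name)   (on-end name seed)]
      [`(text ,s)     (on-text s seed)])))

;; Example: thread a (element-count . texts) pair through the events.
(fold-events '((start p) (text "hi") (end p))
             (cons 0 '())
             #:on-start (lambda (n s) (cons (add1 (car s)) (cdr s)))
             #:on-end   (lambda (n s) s)
             #:on-text  (lambda (t s) (cons (car s) (cons t (cdr s)))))
;; → '(1 "hi")
```

The callback interface above falls out naturally: an on-element layer can be one particular set of handlers that accumulates children in the seed until an element closes.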
Some of the SXML transformation languages are also worth looking at for
inspiration. Oleg did one, and Jim Bender's `sxml-match` is another. I
did one like Oleg's, tried to do an improved one (which might've been
the scariest `syntax-rules` example ever), and did an unfortunately
closed-source one similar to `sxml-match` but with some additional
features. There are also probably places for using modern Racket
`match`.
https://planet.racket-lang.org/display.ss?package=sxml-match.plt&owner=jim
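Since SXML is just nested lists, plain Racket `match` with quasiquote patterns already goes a fair way. A small example (sketch only; real-world SXML fragments often vary in attribute order and whitespace, which these rigid patterns don't handle):

```racket
#lang racket

;; Destructure a simple SXML anchor element into a (url . text) pair,
;; using a quasiquote pattern; `@` introduces the SXML attribute list.
(define (link->pair sxml)
  (match sxml
    [`(a (@ (href ,url)) ,text) (cons url text)]
    [_ #f]))

(link->pair '(a (@ (href "https://racket-lang.org")) "Racket"))
;; → '("https://racket-lang.org" . "Racket")
```

Something like `sxml-match` earns its keep exactly where these literal patterns break down -- optional attributes, reordered attributes, and repeated children.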
There are also XPath and CSS selectors, as you said, and (in my scraping
experience) you sometimes start with `sxpath`, to find a desired chunk
of the page, and then switch to other methods to process that chunk.
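That two-phase pattern -- locate a chunk with a selector, then process it by other means -- can be sketched like this. To keep it dependency-free, a hand-rolled `find-elements` stands in for what `((sxpath '(// title)) doc)` from the `sxml` package would select:

```racket
#lang racket

;; Stand-in for a descendant selector: return every element named `name`,
;; anywhere in the SXML tree.
(define (find-elements name sxml)
  (cond
    [(and (pair? sxml) (eq? (car sxml) name)) (list sxml)]
    [(pair? sxml) (append-map (lambda (c) (find-elements name c)) sxml)]
    [else '()]))

;; Phase 1: find the chunk.  Phase 2: process it by another method
;; (here, a `match` pattern pulling out the title text).
(define doc '(html (head (title "Demo")) (body (p "text"))))
(match (find-elements 'title doc)
  [`((title ,text)) text]
  [_ #f])
;; → "Demo"
```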
> I think it would be good if you host Racket packages on GitHub,
> GitLab, or similar other site
I agree, and I'd probably start with new packages. (And I *almost*
moved old packages to GitHub, right before the acquisition was
announced. Ever since then, there's been a question mark over how
investors will cash out of GitLab, or whether they'll do something like
signal reassuring intentions by switching to being a social benefit
corp.; they just got another $100M of investment last September.) The
main barrier, though, is time and priorities, when I'm not making money
or a career off this. And if I had time to disrupt my workflow, there
are some other workflow changes I'd do first.
Thank you for your comments.
--
You received this message because you are subscribed to the Google Groups "Racket
Users" group.
To view this discussion on the web visit
https://groups.google.com/d/msgid/racket-users/a1e753e2-0ab0-9f4f-e575-99b269eccbf5%40neilvandyke.org.