On Sat, Feb 14, 2015 at 12:06 PM, Emmanuel Engelhart <[email protected]> wrote:
> On 14.02.2015 20:52, Nitin Gupta wrote:
>
>> I hope HTML would be made available with the same frequency as the XML
>> (wikitext) dumps; it would save me yet another attempt to write a
>> wikitext parser. Thanks.
>>
>> For any API endpoints you provide, it would be helpful if you could also
>> mention the expected maximum load from a client (req/s), so client
>> writers can throttle accordingly.
>>
>>> Kiwix already publishes full HTML snapshots packed in ZIM files
>>> (snapshots with and without pictures). We publish monthly updates for
>>> most of the Wikimedia projects and are working to do it for all the
>>> projects:
>>> http://www.kiwix.org
>>
>> I somehow missed the Kiwix project, and an HTML dump is all I'm
>> interested in (text only for now, since images can have copyright
>> issues). Surprisingly, I could not find a link to a Kiwix ZIM dump
>> without images; I assume the default offered for download has
>> thumbnails.
>
> Have a look at the "all_nopic" links:
> http://www.kiwix.org/wiki/Wikipedia_in_all_languages

The latest all_nopic dump for English Wikipedia I can see is from 2014-01.
Anyway, as Gabriel mentioned, it looks like Wikimedia is going to generate
and provide regularly updated HTML dumps for the various projects directly
-- hopefully sometime soon -- so maybe that can then be used as the gold
source.

>>> The solution is coded in Node.js and uses the Parsoid API:
>>> https://sourceforge.net/p/kiwix/other/ci/master/tree/mwoffliner/
>>>
>>> We face recurring stability problems (with the 'http' module) which
>>> are impairing the rollout for all projects. If you are a Node.js
>>> expert, your help is really welcome.
>>
>> I'm no http expert, but I see that you are downloading full article
>> content from the Parsoid API. Have you considered the approach of just
>> downloading the entire XML dump and then extracting articles out of
>> that? You would still need to download images and do template expansion
>> over http, but it still saves a lot. I have used this approach here:
>
> Parsing wiki code is a nightmare (if you want to reach MediaWiki's
> quality of output & maintain that code base). It's far easier to write a
> scraper based on the Parsoid API.

Yes, it's a nightmare to parse wikitext markup, but in this case the
frontend (wikipedia.go) is not parsing wikitext at all. The XML dump simply
encapsulates wikitext in well-structured XML. So all the frontend does is
extract the wikitext from the XML and pass it to the backend service
(server.js, running locally), which uses the Parsoid module to parse the
wikitext into HTML (rough sketch of the idea in the P.S. below).

Thanks,
Nitin
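P.S. In case a concrete picture helps, below is a rough Go sketch of the
frontend/backend split I described: stream <page> elements out of a
MediaWiki XML dump, pull out the raw wikitext, and POST it to a local
Parsoid-backed service for rendering. The dump filename, port and "/parse"
endpoint are illustrative assumptions here, not the actual wikipedia.go /
server.js interface.

    package main

    import (
        "bytes"
        "encoding/xml"
        "fmt"
        "io"
        "net/http"
        "os"
    )

    // page mirrors the subset of the dump's <page> element we care about.
    type page struct {
        Title string `xml:"title"`
        Text  string `xml:"revision>text"`
    }

    func main() {
        f, err := os.Open("enwiki-pages-articles.xml") // assumed dump filename
        if err != nil {
            panic(err)
        }
        defer f.Close()

        // Stream the dump token by token; it is far too large to load at once.
        dec := xml.NewDecoder(f)
        for {
            tok, err := dec.Token()
            if err == io.EOF {
                break
            }
            if err != nil {
                panic(err)
            }
            se, ok := tok.(xml.StartElement)
            if !ok || se.Name.Local != "page" {
                continue
            }
            var p page
            if err := dec.DecodeElement(&p, &se); err != nil {
                panic(err)
            }
            // Hand the extracted wikitext to the local backend, assumed to
            // wrap the Parsoid module behind a plain-text /parse endpoint.
            resp, err := http.Post("http://localhost:8000/parse",
                "text/plain", bytes.NewBufferString(p.Text))
            if err != nil {
                continue // backend hiccup: skip this page for now
            }
            html, _ := io.ReadAll(resp.Body)
            resp.Body.Close()
            fmt.Printf("%s: %d bytes of HTML\n", p.Title, len(html))
        }
    }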
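P.P.S. Re the req/s point earlier in the thread: once an expected maximum
load is published, honouring it client-side is only a few lines. A minimal
Go sketch, assuming a made-up limit of 5 req/s:

    package main

    import (
        "fmt"
        "net/http"
        "time"
    )

    func main() {
        // One tick every 200ms caps the loop at 5 requests per second.
        limiter := time.Tick(200 * time.Millisecond)
        urls := []string{ /* article URLs to fetch */ }
        for _, u := range urls {
            <-limiter // block until the next slot opens
            resp, err := http.Get(u)
            if err != nil {
                fmt.Println("fetch failed:", err)
                continue
            }
            resp.Body.Close()
        }
    }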
_______________________________________________
Wikitext-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitext-l
