On Sat, Feb 14, 2015 at 12:06 PM, Emmanuel Engelhart <[email protected]>
wrote:

> On 14.02.2015 20:52, Nitin Gupta wrote:
>
>>         I hope HTML would be made available with the same frequency as XML
>>         (wikitext) dumps; it would save me yet another attempt to make
>>         wikitext
>>         parser. Thanks.
>>
>>         For any API points you provide, it would be helpful if you could
>>         also
>>         mention expected maximum load from a client (req/s), so client
>>         writers
>>         can throttle accordingly.
>>
>>
>>     Kiwix already publishes full HTML snapshots packed in ZIM files
>>     (snapshots with and without pictures). We publish monthly updates
>>     for most of the Wikimedia projects and are working to do it for
>>     all the projects:
>>     http://www.kiwix.org
>>
>>
>> I somehow missed the Kiwix project and HTML dump is all I'm interested
>> in (text only for now since images can have copyright issues).
>> Surprisingly, I could not find link to kiwix ZIM dump without images,
>> assuming default offered for download has thumbnails.
>>
>
> Have a look at the "all_nopic" links:
> http://www.kiwix.org/wiki/Wikipedia_in_all_languages
>
>

The latest all_nopic dump for English Wikipedia I can see is from 2014-01.
Anyway, as Gabriel mentioned, it looks like Wikimedia is going to generate
and provide regularly updated HTML dumps for various projects directly --
hopefully sometime soon, so maybe those can then be used as the gold source.



>>     The solution is coded in Node.js and uses the Parsoid API:
>>     https://sourceforge.net/p/kiwix/other/ci/master/tree/mwoffliner/
>>
>>     We face recurring stability problems (with the 'http' module) which
>>     are impairing the rollout for all projects. If you are a Node.js
>>     expert, your help is really welcome.
>>
>>
>> I'm no http expert, but I see that you are downloading full article
>> content from the Parsoid API. Have you considered the approach of just
>> downloading the entire XML dump and then extracting articles out of
>> that? You would still need to download images and do template expansion
>> over HTTP, but it still saves a lot. I have used this approach here:
>>
>
> Parsing wiki code is a nightmare (if you want to reach MediaWiki quality
> of output & maintain that code base). It's far easier to write a scraper
> based on the Parsoid API.
>
>
Yes, it's a nightmare to parse wikitext markup, but in this case the
frontend (wikipedia.go) is not parsing wikitext at all. The XML dump simply
encapsulates wikitext in well-structured XML. So, all the frontend is
doing is extracting the wikitext from the XML and passing it to the backend
service (server.js -- running locally), which uses the Parsoid module to
parse this wikitext to HTML.

Thanks,
Nitin
_______________________________________________
Wikitext-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitext-l
