On Sat, Feb 14, 2015 at 2:21 AM, Emmanuel Engelhart <[email protected]> wrote:
> On 14.02.2015 08:03, Nitin Gupta wrote:
>> I hope HTML would be made available with the same frequency as the XML
>> (wikitext) dumps; it would save me yet another attempt to write a
>> wikitext parser. Thanks.
>>
>> For any API endpoints you provide, it would be helpful if you could also
>> mention the expected maximum load from a client (req/s), so client
>> writers can throttle accordingly.
>>
> Kiwix already publishes full HTML snapshots packed in ZIM files
> (snapshots with and without pictures). We publish monthly updates for
> most of the Wikimedia projects and are working to do it for all the
> projects: http://www.kiwix.org
>

I somehow missed the Kiwix project, and an HTML dump is all I'm interested in (text only for now, since images can have copyright issues). Surprisingly, I could not find a link to a Kiwix ZIM dump without images; I assume the default offered for download has thumbnails.

> The solution is coded in Node.js and uses the Parsoid API:
> https://sourceforge.net/p/kiwix/other/ci/master/tree/mwoffliner/
>
> We face recurring stability problems (with the 'http' module) which are
> impairing the rollout for all projects. If you are a Node.js expert your
> help is really welcome.
>

I'm no HTTP expert, but I see that you are downloading full article content from the Parsoid API. Have you considered the approach of downloading the entire XML dump and extracting articles out of that instead? You would still need to download images and do template expansion over HTTP, but it still saves a lot. I have used this approach here:

https://github.com/nitingupta910/wikiparser

It only requires HTTP transfers (via the Parsoid Node.js module) for template expansion.

Thanks,
Nitin
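To illustrate the dump-first approach: a pages-articles XML dump contains each article as a <page> element with a <title> and the raw wikitext inside <text>. The following is a minimal sketch in Node.js (not the actual mwoffliner or wikiparser code), using regular expressions on an in-memory string just to show the shape of the data; a real implementation would stream the multi-gigabyte dump through a SAX-style parser instead.

```javascript
// Sketch: pull (title, wikitext) pairs out of a pages-articles XML dump.
// The element names match the standard Wikimedia dump format; everything
// else (function names, the inline sample) is illustrative only.
function extractPages(xml) {
  const pages = [];
  const pageRe = /<page>([\s\S]*?)<\/page>/g;
  let m;
  while ((m = pageRe.exec(xml)) !== null) {
    const body = m[1];
    const title = /<title>([\s\S]*?)<\/title>/.exec(body);
    const text = /<text[^>]*>([\s\S]*?)<\/text>/.exec(body);
    if (title && text) {
      pages.push({ title: title[1], wikitext: text[1] });
    }
  }
  return pages;
}

// Tiny inline sample in the dump's shape:
const sample = `<mediawiki>
  <page>
    <title>Example</title>
    <revision><text xml:space="preserve">'''Example''' article.</text></revision>
  </page>
</mediawiki>`;

console.log(extractPages(sample));
```

Only the templates found in the extracted wikitext would then need expansion over HTTP (e.g. via Parsoid), rather than fetching every full article.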
_______________________________________________
Wikitext-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitext-l
