Hi,

Is there any progress regarding HTML dumps?

I'm not interested in HTML dumps as such, but I believe that HTML is a much
nicer way of getting the raw text of articles out of a wiki dump. See this
proof of concept [1].
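
To illustrate why HTML is convenient here: extracting plain text from rendered
HTML only needs a tag stripper, not a wikitext parser. A minimal sketch with
Python's standard library (the class name and the sample markup are mine, not
from the notebook above):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect character data, skipping <script>/<style> contents."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

# Hypothetical snippet of a rendered article.
html = "<p>Wikipedia is a <b>free</b> encyclopedia.</p>"
extractor = TextExtractor()
extractor.feed(html)
print("".join(extractor.parts))
```

A real pipeline would of course also want to drop navigation boxes,
references, and similar page furniture, but the core idea stays this simple.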

However, what I believe would be very useful for the scientific community
are syntactically parsed dumps of Wikipedia. Right now everyone uses a
different pipeline to parse Wikipedia, and these pipelines are often
undocumented, outdated and unreproducible.

At IWCS we are running a two-day hackathon [2], and I think that one useful
project would be to come up with a documented and easily reproducible way
of getting parsed versions of Wikipedia dumps. I've started some notes as
part of the NLTK corpus readers [3], but this might grow into a separate
project.

So, I see an easily deployable pipeline of:

  enwiki.bz2 -> raw_text.bz2 -> parsed_text.bz2

as a perfect project for the hackathon. Ideally, this should be picked up
by someone to produce regular dumps (but I don't know who would be willing
to invest the computational resources).
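
The first arrow of that pipeline (dump -> raw wikitext) can be done in a few
lines of standard-library Python, since the XML export wraps each article's
wikitext in <page>/<revision>/<text> elements. A minimal streaming sketch
(the sample dump content is made up for the demo; on the real enwiki.bz2 you
would pass the file path, and the second arrow still needs a proper wikitext
parser such as Parsoid):

```python
import bz2
import io
import xml.etree.ElementTree as ET

def iter_pages(dump):
    """Stream (title, wikitext) pairs from a bz2-compressed MediaWiki XML export.

    `dump` is a path or a binary file object; elements are cleared as we go,
    so memory stays flat even on a full enwiki dump.
    """
    with bz2.open(dump, "rb") as f:
        for _, elem in ET.iterparse(f):
            if elem.tag.endswith("page"):  # ignore the export-format namespace
                title = elem.findtext("{*}title", "")
                text = elem.findtext("{*}revision/{*}text", "")
                yield title, text
                elem.clear()

# Tiny hand-made dump standing in for enwiki.bz2 (hypothetical content).
sample = b"""<mediawiki>
  <page><title>Foo</title><revision><text>'''Foo''' is a term.</text></revision></page>
</mediawiki>"""
pages = list(iter_pages(io.BytesIO(bz2.compress(sample))))
print(pages)
```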

Do you have any ideas or suggestions, or anything else I should take care of?

In case you are in London on April 11-12, you are welcome to take part in
the hackathon.

[1]
http://nbviewer.ipython.org/urls/bitbucket.org/dimazest/phd-buildout/raw/tip/notebooks/Wikipedia%20dump.ipynb
[2] http://iwcs2015.github.io/hackathon.html
[3] http://iwcs2015.github.io/hackathon.html#nltk-corpus-readers

--
Dima

On Sun, Feb 15, 2015 at 8:36 AM, Nitin Gupta <[email protected]>
wrote:

>
>
> On Sat, Feb 14, 2015 at 12:06 PM, Emmanuel Engelhart <[email protected]>
> wrote:
>
>> On 14.02.2015 20:52, Nitin Gupta wrote:
>>
>>>         I hope HTML would be made available with the same frequency as
>>> XML
>>>         (wikitext) dumps; it would save me yet another attempt to make
>>>         wikitext
>>>         parser. Thanks.
>>>
>>>         For any API points you provide, it would be helpful if you could
>>>         also
>>>         mention expected maximum load from a client (req/s), so client
>>>         writers
>>>         can throttle accordingly.
>>>
>>>
>>>     Kiwix already publishes full HTML snapshots packed in ZIM files
>>>     (snapshots with and without pictures). We publish monthly updates
>>>     for most of the Wikimedia projects and are working to do it for
>>>     all the projects:
>>>     http://www.kiwix.org
>>>
>>>
>>> I somehow missed the Kiwix project, and an HTML dump is all I'm interested
>>> in (text only for now, since images can have copyright issues).
>>> Surprisingly, I could not find a link to a Kiwix ZIM dump without images;
>>> I assume the default offered for download has thumbnails.
>>>
>>
>> Have a look at the "all_nopic" links:
>> http://www.kiwix.org/wiki/Wikipedia_in_all_languages
>>
>>
>
> The latest all_nopic dump for English Wikipedia I can see is from
> 2014-01.  Anyway, as Gabriel mentioned, it looks like Wikimedia is going
> to generate and provide regularly updated HTML dumps for the various
> projects directly -- hopefully sometime soon, so maybe that can then be
> used as the gold source.
>
>
>
>>>     The solution is coded in Node.js and uses the Parsoid API:
>>>     https://sourceforge.net/p/kiwix/other/ci/master/tree/mwoffliner/
>>>
>>>     We face recurring stability problems (with the 'http' module) which
>>>     are impairing the rollout for all projects. If you are a Node.js
>>>     expert, your help is really welcome.
>>>
>>>
>>> I'm no HTTP expert, but I see that you are downloading full article
>>> content from the Parsoid API. Have you considered the approach of just
>>> downloading the entire XML dump and then extracting articles out of
>>> that? You would still need to download images and do template expansion
>>> over HTTP, but it still saves a lot. I have used this approach here:
>>>
>>
>> Parsing wiki code is a nightmare (if you want to reach MediaWiki's
>> quality of output & maintain that code base). It's far easier to write a
>> scraper based on the Parsoid API.
>>
>>
> Yes, it's a nightmare to parse wikitext markup, but in this case the
> frontend (wikipedia.go) is not parsing wikitext at all. The XML dump simply
> encapsulates wikitext in well-structured XML. So all the frontend is
> doing is extracting the wikitext from the XML and passing it to the backend
> service (server.js, running locally), which uses the Parsoid module to
> parse this wikitext to HTML.
>
> Thanks,
> Nitin
>
>
> _______________________________________________
> Wikitext-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wikitext-l
>
>