Hi Dmitrijs,

we are currently waiting for hardware to be allocated. We hope to have a
first set of dumps 1-2 weeks from now, and intend to provide dumps at
regular intervals after that. See https://phabricator.wikimedia.org/T17017
and its dependencies for progress on this.

We are also considering which distribution format to use for the HTML
dumps. One option is an LZMA-compressed SQLite database. Please weigh in
on this at https://phabricator.wikimedia.org/T93396.
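
To give a feel for what consuming such a dump could look like, here is a
minimal Python sketch. The pages(title, html) schema and the per-row
compression are assumptions for illustration, not a decided format:

    import lzma
    import sqlite3

    # Hypothetical schema: pages(title TEXT, html BLOB), where each blob
    # is an independently lzma-compressed HTML document.
    conn = sqlite3.connect("enwiki-html.sqlite")
    for title, blob in conn.execute(
            "SELECT title, html FROM pages WHERE title = ?", ("Zurich",)):
        html = lzma.decompress(blob).decode("utf-8")
        print(title, len(html))
    conn.close()

Compressing each row independently trades some compression ratio for
random access by title.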

Thanks,

Gabriel

On Mon, Mar 16, 2015 at 3:29 AM, Dmitrijs Milajevs <[email protected]>
wrote:

> Hi,
>
> Is there any progress regarding HTML dumps?
>
> I'm not interested in HTML dumps as such, but I believe that HTML is a
> much nicer way of getting the raw text of articles out of a wiki dump.
> See this proof of concept [1].
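>
> For illustration, here is a minimal sketch of the HTML-to-text step in
> pure Python (standard library only; a real pipeline would also want to
> drop tables, infoboxes, references and so on):
>
>     from html.parser import HTMLParser
>
>     class TextExtractor(HTMLParser):
>         """Collect text content, skipping script/style elements."""
>         SKIP = {"script", "style"}
>
>         def __init__(self):
>             super().__init__()
>             self._skip_depth = 0
>             self.chunks = []
>
>         def handle_starttag(self, tag, attrs):
>             if tag in self.SKIP:
>                 self._skip_depth += 1
>
>         def handle_endtag(self, tag):
>             if tag in self.SKIP and self._skip_depth:
>                 self._skip_depth -= 1
>
>         def handle_data(self, data):
>             if not self._skip_depth:
>                 self.chunks.append(data)
>
>     def html_to_text(html):
>         parser = TextExtractor()
>         parser.feed(html)
>         # Collapse runs of whitespace into single spaces.
>         return " ".join(" ".join(parser.chunks).split())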
>
> However, what I believe would be very useful for the scientific community
> are syntactically parsed dumps of Wikipedia. Right now everyone uses a
> different pipeline to parse Wikipedia, and these pipelines are often
> undocumented, outdated and unreproducible.
>
> At IWCS we are running a two-day hackathon [2], and I think that one
> useful project would be to come up with a documented and easily
> reproducible way of getting parsed versions of Wikipedia dumps. I've
> started some notes as part of the NLTK corpus readers [3], but this might
> grow into a separate project.
>
> So, I see an easily deployable pipeline of:
>
>   enwiki.bz2 -> raw_text.bz2 -> parsed_text.bz2
>
> as a perfect project for the hackathon. Ideally, this would be picked up
> by someone to produce regular dumps (though I don't know who would be
> willing to invest the computational resources).
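>
> As a starting point, the first stage (pulling pages out of the XML dump)
> could be sketched in Python roughly as below; the export namespace
> version and the output layout are placeholders:
>
>     import bz2
>     import xml.etree.ElementTree as ET
>
>     # The namespace version varies between dumps; check the dump header.
>     NS = "{http://www.mediawiki.org/xml/export-0.10/}"
>
>     def iter_pages(dump_path):
>         """Yield (title, wikitext) pairs from a bz2-compressed dump."""
>         with bz2.open(dump_path, "rb") as f:
>             for _, elem in ET.iterparse(f):
>                 if elem.tag == NS + "page":
>                     title = elem.findtext(NS + "title")
>                     text = elem.findtext(NS + "revision/" + NS + "text")
>                     yield title, text or ""
>                     elem.clear()  # keep memory bounded on a full dump
>
>     with bz2.open("raw_text.bz2", "wt", encoding="utf-8") as out:
>         for title, wikitext in iter_pages("enwiki.bz2"):
>             out.write("= %s =\n%s\n" % (title, wikitext))
>
> The wikitext-to-raw-text step after that is exactly where going via HTML
> would help.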
>
> Do you have any ideas or suggestions, or anything I should take into
> account?
>
> In case you are in London on April 11-12, you are welcome to take part
> in the hackathon.
>
> [1]
> http://nbviewer.ipython.org/urls/bitbucket.org/dimazest/phd-buildout/raw/tip/notebooks/Wikipedia%20dump.ipynb
> [2] http://iwcs2015.github.io/hackathon.html
> [3] http://iwcs2015.github.io/hackathon.html#nltk-corpus-readers
>
> --
> Dima
>
> On Sun, Feb 15, 2015 at 8:36 AM, Nitin Gupta <[email protected]>
> wrote:
>
>>
>>
>> On Sat, Feb 14, 2015 at 12:06 PM, Emmanuel Engelhart <[email protected]>
>> wrote:
>>
>>> On 14.02.2015 20:52, Nitin Gupta wrote:
>>>
>>>>         I hope HTML would be made available with the same frequency
>>>>         as the XML (wikitext) dumps; it would save me yet another
>>>>         attempt at writing a wikitext parser. Thanks.
>>>>
>>>>         For any API endpoints you provide, it would be helpful if you
>>>>         could also mention the expected maximum load from a client
>>>>         (req/s), so client writers can throttle accordingly.
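>>>>
>>>> For what it's worth, client-side throttling is cheap to add once the
>>>> limit is known. A minimal single-threaded Python sketch (the 5 req/s
>>>> figure is just an example value):
>>>>
>>>>     import time
>>>>
>>>>     class Throttle:
>>>>         """Cap outgoing requests at `rate` per second."""
>>>>         def __init__(self, rate):
>>>>             self.interval = 1.0 / rate
>>>>             self.last = 0.0
>>>>
>>>>         def wait(self):
>>>>             delay = self.last + self.interval - time.monotonic()
>>>>             if delay > 0:
>>>>                 time.sleep(delay)
>>>>             self.last = time.monotonic()
>>>>
>>>>     throttle = Throttle(rate=5)
>>>>     # Call throttle.wait() before each request to the API.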
>>>>
>>>>
>>>>     Kiwix already publishes full HTML snapshots packed in ZIM files
>>>>     (snapshots with and without pictures). We publish monthly updates
>>>>     for most Wikimedia projects and are working to do it for all of
>>>>     them:
>>>>     http://www.kiwix.org
>>>>
>>>>
>>>> I somehow missed the Kiwix project, and HTML dumps are all I'm
>>>> interested in (text only for now, since images can have copyright
>>>> issues). Surprisingly, I could not find a link to a Kiwix ZIM dump
>>>> without images; I assume the default offered for download has
>>>> thumbnails.
>>>>
>>>
>>> Have a look at the "all_nopic" links:
>>> http://www.kiwix.org/wiki/Wikipedia_in_all_languages
>>>
>>>
>>
>> The latest all_nopic dump for English Wikipedia that I can see is from
>> 2014-01. Anyway, as Gabriel mentioned, it looks like Wikimedia is going
>> to generate and provide regularly updated HTML dumps for the various
>> projects directly -- hopefully sometime soon -- so maybe those can then
>> be used as the gold source.
>>
>>
>>
>>>>     The solution is coded in Node.js and uses the Parsoid API:
>>>>     https://sourceforge.net/p/kiwix/other/ci/master/tree/mwoffliner/
>>>>
>>>>     We face recurring stability problems (with the 'http' module)
>>>>     which are impairing the rollout for all projects. If you are a
>>>>     Node.js expert, your help is really welcome.
>>>>
>>>>
>>>> I'm no HTTP expert, but I see that you are downloading the full
>>>> article content from the Parsoid API. Have you considered the approach
>>>> of just downloading the entire XML dump and then extracting articles
>>>> out of that? You would still need to download images and do template
>>>> expansion over HTTP, but it still saves a lot. I have used this
>>>> approach here:
>>>>
>>>
>>> Parsing wiki code is a nightmare (if you want to reach MediaWiki's
>>> quality of output and maintain that code base). It is far easier to
>>> write a scraper based on the Parsoid API.
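>>>
>>> The core fetch step of such a scraper is small. A Python sketch,
>>> assuming a REST endpoint of the form
>>> https://<domain>/api/rest_v1/page/html/<title> serving Parsoid HTML;
>>> adjust the URL to whatever your Parsoid service actually exposes:
>>>
>>>     import urllib.parse
>>>     import urllib.request
>>>
>>>     def fetch_parsoid_html(domain, title):
>>>         """Fetch Parsoid-generated HTML for a single page."""
>>>         url = ("https://%s/api/rest_v1/page/html/%s"
>>>                % (domain, urllib.parse.quote(title, safe="")))
>>>         req = urllib.request.Request(
>>>             url, headers={"User-Agent": "offline-dump-example/0.1"})
>>>         with urllib.request.urlopen(req) as resp:
>>>             return resp.read().decode("utf-8")
>>>
>>>     html = fetch_parsoid_html("en.wikipedia.org", "Zurich")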
>>>
>>>
>> Yes, it's a nightmare to parse wikitext markup, but in this case the
>> frontend (wikipedia.go) is not parsing wikitext at all. The XML dump
>> simply encapsulates the wikitext in well-structured XML. So all the
>> frontend does is extract the wikitext from the XML and pass it to the
>> backend service (server.js, running locally), which uses the Parsoid
>> module to parse the wikitext into HTML.
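>>
>> In Python terms, the hand-off to the local service looks roughly like
>> this; the /convert route and the JSON shape are hypothetical stand-ins
>> for whatever server.js actually exposes:
>>
>>     import json
>>     import urllib.request
>>
>>     def wikitext_to_html(wikitext,
>>                          endpoint="http://localhost:8000/convert"):
>>         """POST wikitext to a local conversion service, return HTML."""
>>         body = json.dumps({"wikitext": wikitext}).encode("utf-8")
>>         req = urllib.request.Request(
>>             endpoint, data=body,
>>             headers={"Content-Type": "application/json"})
>>         with urllib.request.urlopen(req) as resp:
>>             return resp.read().decode("utf-8")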
>>
>> Thanks,
>> Nitin
>>
>>
_______________________________________________
Wikitext-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitext-l
