Hi Dmitrijs,

we are currently waiting for hardware to be allocated. We hope to have a first set of dumps 1-2 weeks from now, with the intention of providing dumps at regular intervals. See https://phabricator.wikimedia.org/T17017 and its dependencies for progress on this.
We are also considering which distribution format to use for the HTML dumps. One option is an lzma-compressed sqlite database. Please weigh in on this at https://phabricator.wikimedia.org/T93396.

Thanks,

Gabriel

On Mon, Mar 16, 2015 at 3:29 AM, Dmitrijs Milajevs <[email protected]> wrote:
> Hi,
>
> Is there any progress regarding HTML dumps?
>
> I'm not interested in HTML dumps as such, but I believe that HTML is a much
> nicer way of getting the raw text of articles out of a wiki dump. See this
> proof of concept [1].
>
> However, what I believe would be very useful for the scientific community
> are syntactically parsed dumps of Wikipedia. Right now everyone uses
> different pipelines to parse Wikipedia, which are often undocumented,
> outdated and unreproducible.
>
> At IWCS we are running a two-day hackathon [2] and I think that one useful
> project would be to come up with a documented and easily reproducible way
> of getting parsed versions of Wikipedia dumps. I've started some notes as
> part of the NLTK corpus readers [3], but this might grow into a separate
> project.
>
> So, I see an easily deployable pipeline of:
>
> enwiki.bz2 -> raw_text.bz2 -> parsed_text.bz2
>
> as a perfect project for the hackathon. Ideally, this should be picked up
> by someone to produce regular dumps (but I don't know who will be willing
> to invest computational resources).
>
> Do you have any ideas/suggestions that I should take care of?
>
> In case you are in London on April 11-12, you are welcome to take part in
> the hackathon.
>
> [1] http://nbviewer.ipython.org/urls/bitbucket.org/dimazest/phd-buildout/raw/tip/notebooks/Wikipedia%20dump.ipynb
> [2] http://iwcs2015.github.io/hackathon.html
> [3] http://iwcs2015.github.io/hackathon.html#nltk-corpus-readers
>
> --
> Dima
>
> On Sun, Feb 15, 2015 at 8:36 AM, Nitin Gupta <[email protected]> wrote:
>>
>> On Sat, Feb 14, 2015 at 12:06 PM, Emmanuel Engelhart <[email protected]> wrote:
>>>
>>> On 14.02.2015 20:52, Nitin Gupta wrote:
>>>>
>>>> I hope HTML would be made available with the same frequency as the XML
>>>> (wikitext) dumps; it would save me yet another attempt to write a
>>>> wikitext parser. Thanks.
>>>>
>>>> For any API endpoints you provide, it would be helpful if you could
>>>> also mention the expected maximum load from a client (req/s), so client
>>>> writers can throttle accordingly.
>>>>
>>>> Kiwix already publishes full HTML snapshots packed in ZIM files
>>>> (snapshots with and without pictures). We publish monthly updates for
>>>> most Wikimedia projects and are working to do it for all of them:
>>>> http://www.kiwix.org
>>>>
>>>> I somehow missed the Kiwix project, and an HTML dump is all I'm
>>>> interested in (text only for now, since images can have copyright
>>>> issues). Surprisingly, I could not find a link to a Kiwix ZIM dump
>>>> without images; I assume the default offered for download has
>>>> thumbnails.
>>>
>>> Have a look at the "all_nopic" links:
>>> http://www.kiwix.org/wiki/Wikipedia_in_all_languages
>>>
>> The latest all_nopic dump for English Wikipedia I can see is from
>> 2014-01. Anyway, as Gabriel mentioned, it looks like Wikimedia is going
>> to generate and provide regularly updated HTML dumps for various projects
>> directly -- hopefully sometime soon, so maybe that can then be used as a
>> gold source.
>>>> The solution is coded in Node.js and uses the Parsoid API:
>>>> https://sourceforge.net/p/kiwix/other/ci/master/tree/mwoffliner/
>>>>
>>>> We face recurring stability problems (with the 'http' module) which
>>>> are impairing the rollout for all projects. If you are a Node.js
>>>> expert, your help is really welcome.
>>>>
>>>> I'm no http expert, but I see that you are downloading full article
>>>> content from the Parsoid API. Have you considered the approach of just
>>>> downloading the entire XML dump and then extracting articles out of
>>>> that? You would still need to download images and do template expansion
>>>> over HTTP, but it still saves a lot. I have used this approach here:
>>>
>>> Parsing wiki code is a nightmare (if you want to reach MediaWiki quality
>>> of output and maintain that code base). It's far easier to write a
>>> scraper based on the Parsoid API.
>>>
>> Yes, it's a nightmare to parse wikitext markup, but in this case the
>> frontend (wikipedia.go) is not parsing wikitext at all. The XML dump
>> simply encapsulates the wikitext in well-structured XML, so all the
>> frontend does is extract the wikitext from the XML and pass it to the
>> backend service (server.js, running locally), which uses the Parsoid
>> module to convert the wikitext to HTML.
>>
>> Thanks,
>> Nitin
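For the frontend/backend split Nitin describes above, a rough Python sketch of the same idea follows: stream pages out of a pages-articles .xml.bz2 dump and hand each page's wikitext to a locally running Parsoid service. The dump schema namespace, the service URL and the JSON payload below are placeholders -- the exact Parsoid HTTP API depends on the version you run -- so treat this as an outline rather than a working client.

    import bz2
    import json
    import urllib.request
    import xml.etree.ElementTree as ET

    NS = "{http://www.mediawiki.org/xml/export-0.10/}"  # adjust to the dump's schema version
    PARSOID_URL = "http://localhost:8000/"              # placeholder, not the real Parsoid route

    def iter_wikitext(dump_path):
        """Yield (title, wikitext) pairs from a pages-articles .xml.bz2 dump."""
        with bz2.open(dump_path, "rb") as fh:
            for _, elem in ET.iterparse(fh):
                if elem.tag == NS + "page":
                    title = elem.findtext(NS + "title")
                    text = elem.findtext(NS + "revision/" + NS + "text") or ""
                    yield title, text
                    elem.clear()  # free the parsed subtree to keep memory bounded

    def to_html(wikitext):
        """POST wikitext to the local service; the payload shape is illustrative only."""
        req = urllib.request.Request(
            PARSOID_URL,
            data=json.dumps({"wikitext": wikitext}).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return resp.read().decode("utf-8")

    if __name__ == "__main__":
        for title, wikitext in iter_wikitext("enwiki-latest-pages-articles.xml.bz2"):
            print(title, len(to_html(wikitext)))
            break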
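The raw_text step in the enwiki.bz2 -> raw_text.bz2 -> parsed_text.bz2 pipeline Dima proposes is also much easier to prototype against HTML than against wikitext. Below is a minimal, standard-library-only sketch that strips one article's HTML down to plain text; a real pipeline would likely use lxml or the NLTK corpus readers mentioned above, and the script/style handling here is only illustrative.

    from html.parser import HTMLParser

    class TextExtractor(HTMLParser):
        """Collect character data, skipping script and style elements."""
        SKIP = {"script", "style"}

        def __init__(self):
            super().__init__()
            self.chunks = []
            self._skip_depth = 0

        def handle_starttag(self, tag, attrs):
            if tag in self.SKIP:
                self._skip_depth += 1

        def handle_endtag(self, tag):
            if tag in self.SKIP and self._skip_depth:
                self._skip_depth -= 1

        def handle_data(self, data):
            if not self._skip_depth:
                self.chunks.append(data)

    def html_to_text(html):
        """Return whitespace-normalized plain text for one article's HTML."""
        parser = TextExtractor()
        parser.feed(html)
        return " ".join(" ".join(parser.chunks).split())

    if __name__ == "__main__":
        print(html_to_text("<p>Hello, <b>world</b>!</p><script>ignored()</script>"))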
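On the distribution format question at the top of this mail: to make the trade-offs concrete, here is what consuming an lzma-compressed sqlite dump could look like, assuming (purely for illustration) per-article lzma blobs in a single table. The file name, table name and column names (pages, title, html_lzma) are made up, and whether compression would be per article or over the whole file is exactly what is being discussed on T93396.

    import lzma
    import sqlite3

    def iter_pages(path="enwiki-html.sqlite"):
        """Yield (title, html) pairs from a hypothetical per-article-compressed dump."""
        conn = sqlite3.connect(path)
        try:
            for title, blob in conn.execute("SELECT title, html_lzma FROM pages"):
                yield title, lzma.decompress(blob).decode("utf-8")
        finally:
            conn.close()

    if __name__ == "__main__":
        for title, html in iter_pages():
            print(title, len(html))
            break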
