Hi! Thank you for the reply. I have filed the following tasks:
https://phabricator.wikimedia.org/T298436
https://phabricator.wikimedia.org/T298437

Mitar

On Sat, Jan 1, 2022 at 6:07 PM Ariel Glenn WMF <ar...@wikimedia.org> wrote:
>
> Hello Mitar! I'm glad you are finding the Wikimedia Enterprise dumps useful.
>
> For your tar.gz question, this is the format that the Wikimedia Enterprise
> dataset consumers prefer, from what I understand. But I would suggest that if
> you are interested in other formats, you might open a task on Phabricator
> with a feature request and add the Wikimedia Enterprise project tag
> (https://phabricator.wikimedia.org/project/view/4929/).
>
> As to the API, I'm only familiar with the endpoints for bulk download, so
> you'll want to ask the Wikimedia Enterprise folks, or have a look at their
> API documentation here:
> https://www.mediawiki.org/wiki/Wikimedia_Enterprise/Documentation
>
> Ariel
>
> On Sat, Jan 1, 2022 at 4:30 PM Mitar <mmi...@gmail.com> wrote:
>>
>> Hi!
>>
>> Awesome!
>>
>> Is there any reason these are tar.gz files wrapping a single file, rather
>> than simply bzip2 of the file contents? Wikidata dumps are bzip2 of one
>> JSON file, which allows parallel decompression. Having both tar (why tar
>> of a single file at all?) and gzip here means the whole archive has to be
>> decompressed before it can be processed in parallel. Is there some other
>> way I am missing?
>>
>> Wikipedia dumps are produced as multistream bzip2 with an additional
>> index file. That would be nice here too: with an index file, one could
>> jump immediately to the JSON line for a given article.
>>
>> Also, is there an API endpoint or Special page that can return the
>> same JSON for a single Wikipedia page? The JSON structure looks very
>> useful on its own (i.e., not just in bulk).
>>
>> Mitar
>>
>> On Tue, Oct 19, 2021 at 4:57 PM Ariel Glenn WMF <ar...@wikimedia.org> wrote:
>> >
>> > I am pleased to announce that Wikimedia Enterprise's HTML dumps [1] for
>> > October 17-18th are available for public download; see
>> > https://dumps.wikimedia.org/other/enterprise_html/ for more information.
>> > We expect to make updated versions of these files available around the
>> > 1st/2nd and the 20th/21st of each month, following the cadence of the
>> > standard SQL/XML dumps.
>> >
>> > This is still an experimental service, so there may be hiccups from time
>> > to time. Please be patient and report issues as you find them. Thanks!
>> >
>> > Ariel "Dumps Wrangler" Glenn
>> >
>> > [1] See https://www.mediawiki.org/wiki/Wikimedia_Enterprise for much more
>> > about Wikimedia Enterprise and its API.
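For reference, the current tar.gz files can at least be streamed rather than extracted to disk first, although gzip still rules out parallel or random-access decompression. Below is a minimal Python sketch, assuming each archive contains a single NDJSON member with one article object per line, as discussed above; the file name and the "name" field are illustrative, not confirmed parts of the format.

    import json
    import tarfile

    def iter_articles(path):
        # "r|gz" opens the tarball as a non-seekable stream, so the data is
        # decompressed on the fly instead of being extracted to disk first.
        with tarfile.open(path, mode="r|gz") as archive:
            for member in archive:
                if not member.isfile():
                    continue
                fileobj = archive.extractfile(member)
                for line in fileobj:
                    # Each line is assumed to be one article's JSON object.
                    yield json.loads(line)

    # Hypothetical file and field names, for illustration only.
    for article in iter_articles("enwiki_namespace_0.tar.gz"):
        print(article.get("name"))
        break

This avoids the intermediate decompressed copy, but the stream still has to be read sequentially from the start, which is exactly the limitation raised above.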
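And here is roughly how the multistream bzip2 plus index scheme already works for the Wikipedia XML dumps; if the Enterprise JSON dumps adopted the same layout, the same few lines would let a consumer jump straight to the stream containing a given article. The index format shown (one "offset:page_id:title" line per page) is the one used by the existing XML multistream dumps; applying it to NDJSON is my suggestion, not an existing feature.

    import bz2

    def read_stream(dump_path, offset):
        # Decompress only the single bz2 stream that starts at byte `offset`
        # (the offset comes from the accompanying multistream index file).
        decompressor = bz2.BZ2Decompressor()
        chunks = []
        with open(dump_path, "rb") as dump:
            dump.seek(offset)
            while not decompressor.eof:
                chunk = dump.read(64 * 1024)
                if not chunk:
                    break
                chunks.append(decompressor.decompress(chunk))
        return b"".join(chunks)

    # For the XML dumps, the returned bytes hold the ~100 pages of that
    # stream, which are then scanned for the wanted page; with an NDJSON
    # dump the stream would simply be split on newlines.

Mitar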