Hi! Thank you for the reply. I have filed the following tasks:
https://phabricator.wikimedia.org/T298436
https://phabricator.wikimedia.org/T298437

Mitar

On Sat, Jan 1, 2022 at 6:07 PM Ariel Glenn WMF <ar...@wikimedia.org> wrote:
>
> Hello Mitar! I'm glad you are finding the Wikimedia Enterprise dumps useful.
>
> For your tar.gz question, this is the format that the Wikimedia Enterprise
> dataset consumers prefer, from what I understand. But I would suggest that if
> you are interested in other formats, you might open a task on Phabricator
> with a feature request and add the Wikimedia Enterprise project tag
> (https://phabricator.wikimedia.org/project/view/4929/).
>
> As to the API, I'm only familiar with the endpoints for bulk download, so
> you'll want to ask the Wikimedia Enterprise folks, or have a look at their
> API documentation here:
> https://www.mediawiki.org/wiki/Wikimedia_Enterprise/Documentation
>
> Ariel
>
> On Sat, Jan 1, 2022 at 4:30 PM Mitar <mmi...@gmail.com> wrote:
>>
>> Hi!
>>
>> Awesome!
>>
>> Is there any reason these are tar.gz files wrapping a single file, rather
>> than simply bzip2 of the file contents? Wikidata dumps are bzip2 of one
>> JSON file, which allows parallel decompression. Having both tar (why tar
>> of a single file at all?) and gzip here means the whole archive has to be
>> decompressed before it can be processed in parallel. Is there some other
>> way I am missing?
>>
>> Wikipedia dumps are produced as multistream bzip2 with an additional
>> index file. That would be nice here too: with an index file, one could
>> jump immediately to the JSON line for a given article.
>>
>> Also, is there an API endpoint or Special page that can return the
>> same JSON for a single Wikipedia page? The JSON structure looks very
>> useful on its own (i.e., not just in bulk).
>>
>> Mitar
>>
>> On Tue, Oct 19, 2021 at 4:57 PM Ariel Glenn WMF <ar...@wikimedia.org> wrote:
>> >
>> > I am pleased to announce that Wikimedia Enterprise's HTML dumps [1] for
>> > October 17-18th are available for public download; see
>> > https://dumps.wikimedia.org/other/enterprise_html/ for more information.
>> > We expect to make updated versions of these files available around the
>> > 1st/2nd and the 20th/21st of each month, following the cadence of the
>> > standard SQL/XML dumps.
>> >
>> > This is still an experimental service, so there may be hiccups from time
>> > to time. Please be patient and report issues as you find them. Thanks!
>> >
>> > Ariel "Dumps Wrangler" Glenn
>> >
>> > [1] See https://www.mediawiki.org/wiki/Wikimedia_Enterprise for much more
>> > about Wikimedia Enterprise and its API.
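For reference, the current tar.gz files can at least be streamed rather than extracted to disk first, although gzip still rules out parallel or random-access decompression. Below is a minimal Python sketch, assuming each archive contains a single NDJSON member with one article object per line, as discussed above; the file name and the "name" field are illustrative, not confirmed parts of the format.

    import json
    import tarfile

    def iter_articles(path):
        # "r|gz" opens the tarball as a non-seekable stream, so the data is
        # decompressed on the fly instead of being extracted to disk first.
        with tarfile.open(path, mode="r|gz") as archive:
            for member in archive:
                if not member.isfile():
                    continue
                fileobj = archive.extractfile(member)
                for line in fileobj:
                    # Each line is assumed to be one article's JSON object.
                    yield json.loads(line)

    # Hypothetical file and field names, for illustration only.
    for article in iter_articles("enwiki_namespace_0.tar.gz"):
        print(article.get("name"))
        break

This avoids the intermediate decompressed copy, but the stream still has to be read sequentially from the start, which is exactly the limitation raised above.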
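And here is roughly how the multistream bzip2 plus index scheme already works for the Wikipedia XML dumps; if the Enterprise JSON dumps adopted the same layout, the same few lines would let a consumer jump straight to the stream containing a given article. The index format shown (one "offset:page_id:title" line per page) is the one used by the existing XML multistream dumps; applying it to NDJSON is my suggestion, not an existing feature.

    import bz2

    def read_stream(dump_path, offset):
        # Decompress only the single bz2 stream that starts at byte `offset`
        # (the offset comes from the accompanying multistream index file).
        decompressor = bz2.BZ2Decompressor()
        chunks = []
        with open(dump_path, "rb") as dump:
            dump.seek(offset)
            while not decompressor.eof:
                chunk = dump.read(64 * 1024)
                if not chunk:
                    break
                chunks.append(decompressor.decompress(chunk))
        return b"".join(chunks)

    # For the XML dumps, the returned bytes hold the ~100 pages of that
    # stream, which are then scanned for the wanted page; with an NDJSON
    # dump the stream would simply be split on newlines.

Mitar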