[Xmldatadumps-l] Re: Is there a problem with the wikidata-dump?

2024-01-10 Thread Ariel Glenn WMF
I would hazard a guess that your bz2 unzip app does not handle multistream
files in an appropriate way, Wurgl. The multistream files consist of
several bzip2-compressed files concatenated together; see
https://meta.wikimedia.org/wiki/Data_dumps/Dump_format#Multistream_dumps
for details.  Try downloading the entire file via curl, and then look into
the question of the bzip app issues separately. Maybe it will turn out that
you are encountering some other problem. But first, see if you can download
the entire file and get its hash to check out.
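
For what it's worth, here is a minimal Python sketch of that hash check step; it assumes you compare the printed value by hand against the checksum published for the dated run (for example in its sha1sums file), and the filename below is just an example:

import hashlib

# Minimal sketch: hash the downloaded file and compare the result by hand
# against the checksum published for the dated run (filename is an example).
def sha1_of(path):
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

print(sha1_of("wikidatawiki-latest-pages-articles-multistream.xml.bz2"))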

Ariel

On Wed, Jan 10, 2024 at 5:15 PM Xabriel Collazo Mojica <
xcoll...@wikimedia.org> wrote:

> Gerhard: Thanks for the extra checks!
>
> Wolfgang: I can confirm Gerhard's findings. The file appears correct, and
> ends with the right footer.
>
> On Wed, Jan 10, 2024 at 10:50 AM Gerhard Gonter  wrote:
>
>> On Fri, Jan 5, 2024 at 5:03 PM Wurgl  wrote:
>> >
>> > Hello!
>> >
>> > I am having some unexpected messages, so I tried the following:
>> >
>> > curl -s
>> https://dumps.wikimedia.org/wikidatawiki/latest/wikidatawiki-latest-pages-articles-multistream.xml.bz2
>> | bzip2 -d | tail
>> >
>> > and got this:
>> >
>> > bzip2: Compressed file ends unexpectedly;
>> > perhaps it is corrupted?  *Possible* reason follows.
>> > bzip2: Inappropriate ioctl for device
>> > Input file = (stdin), output file = (stdout)
>> >
>> > It is possible that the compressed file(s) have become corrupted.
>>
>> The file I received was fine and the sha1sum matches that of
>> wikidatawiki-20240101-pages-articles-multistream.xml.bz2 mentioned in
>> the posting of Xabriel Collazo Mojica:
>>
>> --- 8< ---
>> $ sha1sum wikidatawiki-latest-pages-articles-multistream.xml.bz2
>> 1be753ba90e0390c8b65f9b80b08015922da12f1
>> wikidatawiki-latest-pages-articles-multistream.xml.bz2
>> --- >8 ---
>>
>> bunzip2 did not report any problem; however, my first attempt to
>> decompress ended with a full disk after more than 2.3 TB of XML.
>>
>> The second attempt
>> --- 8< ---
>> $  bunzip2 -cv wikidatawiki-latest-pages-articles-multistream.xml.bz2
>> | tail -n 1 >
>> wikidatawiki-latest-pages-articles-multistream_tail_-n_1.xml
>>   wikidatawiki-latest-pages-articles-multistream.xml.bz2: done
>> --- >8 ---
>>
>> resulted in a nice XML fragment which ends with
>> --- 8< ---
>>   <page>
>>     <title>Q124069752</title>
>>     <ns>0</ns>
>>     <id>118244259</id>
>>     <revision>
>>       <id>2042727399</id>
>>       <parentid>2042727216</parentid>
>>       <timestamp>2024-01-01T20:37:28Z</timestamp>
>>       <contributor>
>>         <username>Kalepom</username>
>>         <id>1900170</id>
>>       </contributor>
>>       <comment>/* wbsetclaim-create:2||1 */ [[Property:P2789]]:
>> [[Q16506931]]</comment>
>>       <model>wikibase-item</model>
>>       <format>application/json</format>
>>       <text ...>...</text>
>>       <sha1>9gw926vh84k1b5h6wnuvlvnd2zc3a9b</sha1>
>>     </revision>
>>   </page>
>> </mediawiki>
>> --- >8 ---
>>
>> So I assume your curl did not return the full 142 GB of
>> wikidatawiki-latest-pages-articles-multistream.xml.bz2 .
>>
>> P.S.: I'll start a new bunzip2 to a larger scratch disk just to find
>> out how big this XML file really is.
>>
>> regards, Gerhard
>> ___
>> Xmldatadumps-l mailing list -- xmldatadumps-l@lists.wikimedia.org
>> To unsubscribe send an email to xmldatadumps-l-le...@lists.wikimedia.org
>>
>
>
> --
> Xabriel J. Collazo Mojica (he/him)
> Sr Software Engineer
> Wikimedia Foundation
> ___
> Xmldatadumps-l mailing list -- xmldatadumps-l@lists.wikimedia.org
> To unsubscribe send an email to xmldatadumps-l-le...@lists.wikimedia.org
>
___
Xmldatadumps-l mailing list -- xmldatadumps-l@lists.wikimedia.org
To unsubscribe send an email to xmldatadumps-l-le...@lists.wikimedia.org


[Xmldatadumps-l] Wiki content and other dumps new ownership, feedback requested on new version!

2023-09-27 Thread Ariel Glenn WMF
Hello folks!

For some years now, I've been the main or only point of contact for the
Wiki project sql/xml dumps semimonthly, as well as for a number of
miscellaneous weekly datasets.

This work is now passing to Data Platform Engineering (DPE), and your new
points of contact, starting right away, will be Will Doran (email: wdoran)
and Virginia Poundstone (email: vpoundstone). I'll still be lending a hand
in the background for a little while but by the end of the month I'll have
transitioned into a new role at the Wikimedia Foundation, working more
directly on MediaWiki itself.

The Data Products team, a subteam of DPE, will be managing the current
dumps day-to-day, as well as working on a new dumps system intended to
replace and greatly improve the current one. What formats will it produce,
and what content, and in what bundles?  These are all great questions, and
you have a chance to help decide on the answers. The team is gathering
feedback right now; follow this link [
https://docs.google.com/forms/d/e/1FAIpQLScp2KzkcTF7kE8gilCeSogzpeoVN-8yp_SY6Q47eEbuYfXzsw/viewform?usp=sf_link]
to give your input!

If you want to follow along on work being done on the new dumps system, you
can check the phabricator workboard at
https://phabricator.wikimedia.org/project/board/6630/ and look for items
with the "Dumps 2.0" tag.

Members of the Data Products team are already stepping up to manage the
xmldatadumps-l mailing list, so you should not notice any changes as far as
that goes.

And as always, for dumps-related questions people on this list cannot
answer, and which are not covered in the docs at
https://meta.wikimedia.org/wiki/Data_dumps or
https://wikitech.wikimedia.org/wiki/Dumps you can always email ops-dumps
(at) wikimedia.org.

See you on the wikis!

Ariel Glenn
ar...@wikimedia.org
___
Xmldatadumps-l mailing list -- xmldatadumps-l@lists.wikimedia.org
To unsubscribe send an email to xmldatadumps-l-le...@lists.wikimedia.org


[Xmldatadumps-l] Re: Inconsistency of Wikipedia dump exports with content licenses

2023-08-04 Thread Ariel Glenn WMF
Hi Dušan,

The legal team handles all manner of legal issues. You'll need to be
patient. I can't speed up their process for you, nor give you more
information than I already have.

Also, please don't send duplicate messages to the list. That would be
considered spam.  Thanks!

Ariel Glenn
dumps co-maintainer

On Fri, Aug 4, 2023 at 11:56 AM Dušan Kreheľ  wrote:

> @Platonides Thanks for your comment. I analyzed your post and updated
> the documents.
>
> 2023-07-26 3:45 GMT+02:00, Platonides :
> > On Tue, 25 Jul 2023 at 15:14, Dušan Kreheľ 
> wrote:
> >
> >> Hello, Wikipedia export is not right licensed. Could this be brought
> >> into compliance with the licenses? The wording of the violation is:
> >> https://krehel.sk/Oprava_poruseni_licencei_CC_BY-SA_a_GFDL/ (Slovak).
> >>
> >> Dušan Kreheľ
> >
> >
> > Hello Dušan
> >
> > I would encourage you to write in English. I have used an automatic
> > translator to look at your pages, but such machine translation may not
> > convey correctly what you intended.
> >
> > Also note, this is not the right venue for some of the issues you seem to
> > expect.
> >
> > The main point I think you are missing is that *all the GFDL content is
> > also under a CC-BY-SA license*, per the license update performed in 2009
> >  as
> > allowed by GFDL 1.3. All the text is under a CC-BY-SA license (or
> > compatible, e.g. text in Public Domain), *most* of it also under GFDL,
> but
> > not all.
> > It's thus enough to follow the CC-BY-SA terms.
> >
> > The interpretation is that for webpages it is enough to include a link,
> > there's no need to include all extra resources (license text, list of
> > authors, etc.) *on the same HTTP response*. Just like you don't need to
> > include all of that on *every* page of a book under that license, but
> only
> > once, usually placed at the end of the book.
> >
> > Note that the text of the GFDL is included in the dumps by virtue of
> being
> > in pages such as
> >
> https://en.wikipedia.org/wiki/Wikipedia:Text_of_the_GNU_Free_Documentation_License
> > (it may not be the best approach, but it *is* included)
> >
> > Images in the pages are considered an aggregate, and so they are accepted
> > under a different license than the text.
> >
> > That you license the text under the *GFDL unversioned, with no invariant
> > sections, front-cover texts, or back-cover texts* describes how you agree
> > to license the content that you submit to the site. It does not restrict
> > your rights granted by the license. You could edit a GFDL article and
> > publish your version in your blog under a specific GFDL version and
> > including an invariant section. But that would not be accepted in
> > Wikipedia.
> >
> > You may have a point in the difference between CC-BY-SA 3.0 and CC-BY-SA
> > 4.0, though. There could be a more straightforward display of the license
> > for reusers than expecting they determine the exact version by manually
> > checking the date of last publication.
> >
> ___
> Xmldatadumps-l mailing list -- xmldatadumps-l@lists.wikimedia.org
> To unsubscribe send an email to xmldatadumps-l-le...@lists.wikimedia.org
>
___
Xmldatadumps-l mailing list -- xmldatadumps-l@lists.wikimedia.org
To unsubscribe send an email to xmldatadumps-l-le...@lists.wikimedia.org


[Xmldatadumps-l] Re: Inconsistency of Wikipedia dump exports with content licenses

2023-07-26 Thread Ariel Glenn WMF
I was away from work for the past two days and so unable to reply.  My
apologies!  Indeed, Dušan, if you want to sort out exactly what to do
with/about the licenses, the legal team is the way to go. Reach them at
legal (at) wikimedia.org. Hope you get it sorted!

Ariel

On Wed, Jul 26, 2023 at 5:57 AM p858snake  wrote:

> To expand on platonides response,
>
> As pointed out on your other emails relating this subject, the best
> contact would be to email the legal team address.
>
> On Wed, 26 July 2023, 11:46 am Platonides,  wrote:
>
>> On Tue, 25 Jul 2023 at 15:14, Dušan Kreheľ  wrote:
>>
>>> Hello, Wikipedia export is not right licensed. Could this be brought
>>> into compliance with the licenses? The wording of the violation is:
>>> https://krehel.sk/Oprava_poruseni_licencei_CC_BY-SA_a_GFDL/ (Slovak).
>>>
>>> Dušan Kreheľ
>>
>>
>> Hello Dušan
>>
>> I would encourage you to write in English. I have used an automatic
>> translator to look at your pages, but such machine translation may not
>> convey correctly what you intended.
>>
>> Also note, this is not the right venue for some of the issues you seem to
>> expect.
>>
>> The main point I think you are missing is that *all the GFDL content is
>> also under a CC-BY-SA license*, per the license update performed in 2009
>>  as
>> allowed by GFDL 1.3. All the text is under a CC-BY-SA license (or
>> compatible, e.g. text in Public Domain), *most* of it also under GFDL,
>> but not all.
>> It's thus enough to follow the CC-BY-SA terms.
>>
>> The interpretation is that for webpages it is enough to include a link,
>> there's no need to include all extra resources (license text, list of
>> authors, etc.) *on the same HTTP response*. Just like you don't need to
>> include all of that on *every* page of a book under that license, but
>> only once, usually placed at the end of the book.
>>
>> Note that the text of the GFDL is included in the dumps by virtue of
>> being in pages such as
>> https://en.wikipedia.org/wiki/Wikipedia:Text_of_the_GNU_Free_Documentation_License
>> (it may not be the best approach, but it *is* included)
>>
>> Images in the pages are considered an aggregate, and so they are accepted
>> under a different license than the text.
>>
>> That you license the text under the *GFDL unversioned, with no invariant
>> sections, front-cover texts, or back-cover texts* describes how you
>> agree to license the content that you submit to the site. It does not
>> restrict your rights granted by the license. You could edit a GFDL article
>> and publish your version in your blog under a specific GFDL version and
>> including an invariant section. But that would not be accepted in Wikipedia.
>>
>> You may have a point in the difference between CC-BY-SA 3.0 and CC-BY-SA
>> 4.0, though. There could be a more straightforward display of the license
>> for reusers than expecting they determine the exact version by manually
>> checking the date of last publication.
>>
>>
>>
>> ___
>> Xmldatadumps-l mailing list -- xmldatadumps-l@lists.wikimedia.org
>> To unsubscribe send an email to xmldatadumps-l-le...@lists.wikimedia.org
>>
> ___
> Xmldatadumps-l mailing list -- xmldatadumps-l@lists.wikimedia.org
> To unsubscribe send an email to xmldatadumps-l-le...@lists.wikimedia.org
>
___
Xmldatadumps-l mailing list -- xmldatadumps-l@lists.wikimedia.org
To unsubscribe send an email to xmldatadumps-l-le...@lists.wikimedia.org


[Xmldatadumps-l] Re: The license for some files in the dump exports

2023-07-21 Thread Ariel Glenn WMF
I'm not sure which text you are relying on. But the legal information for
the licensing of content in the dumps can be found here:
https://dumps.wikimedia.org/legal.html I hope that helps.

Ariel Glenn
dumps co-maintainer
ar...@wikimedia.org


On Fri, Jul 21, 2023 at 12:10 PM Dušan Kreheľ  wrote:

> Hello.
>
> Could you change the license for some files in the dump exports?
> Because CC BY-SA is not suitable for exports that have content in
> different licenses. More read:
> https://krehel.sk/Oprava_poruseni_licencii_CC_BY-SA_a_GFDL/ (Slovak).
>
> Dušan Kreheľ
> ___
> Xmldatadumps-l mailing list -- xmldatadumps-l@lists.wikimedia.org
> To unsubscribe send an email to xmldatadumps-l-le...@lists.wikimedia.org
>
___
Xmldatadumps-l mailing list -- xmldatadumps-l@lists.wikimedia.org
To unsubscribe send an email to xmldatadumps-l-le...@lists.wikimedia.org


[Xmldatadumps-l] Re: "Experimental" Status of Enterprise HTML Dumps

2023-05-10 Thread Ariel Glenn WMF
Hello Evan,

The Enterprise HTML dumps should be publicly available around the 22nd and
the 3rd of each month, though there can be delays. We don't expect that to
change any time soon. As to their content or the namespaces, I can't speak
to that; someone from Wikimedia Enterprise will have to discuss their
plans. More information about their content is available at
https://meta.wikimedia.org/wiki/Wikimedia_Enterprise and you might be able
to get a question about it answered on the corresponding discussion page.
Hope that helps to clarify things a bit.

Ariel Glenn
ar...@wikimedia.org

On Fri, May 5, 2023 at 11:54 PM Evan Lloyd New-Schmidt 
wrote:

> Hi, I'm starting a project that will involve repeated processing of HTML
> wikipedia articles.
>
> Using the enterprise dumps seems like it would be much simpler than
> converting the XML dumps, but I don't know what the "experimental"
> status really means.
>
> I see in the original announcement post from a year and a half ago that
> there is a warning about bugs and downtime, but the meta wiki page and
> dumps site don't have any more information.
>
> Is there less of a commitment to keep posting the enterprise dumps
> compared to the database XML dumps?
>
>
> Thanks,
> Evan
> ___
> Xmldatadumps-l mailing list -- xmldatadumps-l@lists.wikimedia.org
> To unsubscribe send an email to xmldatadumps-l-le...@lists.wikimedia.org
>
___
Xmldatadumps-l mailing list -- xmldatadumps-l@lists.wikimedia.org
To unsubscribe send an email to xmldatadumps-l-le...@lists.wikimedia.org


[Xmldatadumps-l] Interruption in service of production of wikidata entity dumps and others

2023-04-18 Thread Ariel Glenn WMF
Due to switch maintenance, this week's dumps of wikidata entities, other
weekly datasets, and today's adds-changes dumps may not be produced.

All datasets should be back on a normal production schedule the following
week.

Apologies for the inconvenience!

Ariel Glenn
ar...@wikimedia.org
Community Datasets co-maintainer
___
Xmldatadumps-l mailing list -- xmldatadumps-l@lists.wikimedia.org
To unsubscribe send an email to xmldatadumps-l-le...@lists.wikimedia.org


[Xmldatadumps-l] Too (Two) many FAQs

2023-03-31 Thread Ariel Glenn WMF
My apologies for the duplicate FAQ this month. We recently deployed a new
server and the old one, now retired, still had the FAQ generation job
running on it.  We should be back to the usual number of FAQ emails (one)
next month.  Thanks!

Ariel Glenn
dumps co-maintainer
___
Xmldatadumps-l mailing list -- xmldatadumps-l@lists.wikimedia.org
To unsubscribe send an email to xmldatadumps-l-le...@lists.wikimedia.org


[Xmldatadumps-l] Re: Querying for recently created pages

2023-01-18 Thread Ariel Glenn WMF
Eric,

We don't produce dumps of the revision table in sql format because some of
those revisions may be hidden from public view, and even metadata about
them should not be released. We do, however, publish so-called Adds/Changes
dumps once a day for each wiki, providing stubs and content files in XML of
just new pages and revisions since the last such dump. They lag about 12
hours behind to allow vandalism and such to be filtered out by wiki admins,
but hopefully that's good enough for your needs.  You can find those here:
https://dumps.wikimedia.org/other/incr/
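
As a rough illustration only, a Python sketch like the following could scan one day's stubs file from those adds-changes directories and print pages that look newly created; it assumes a local copy of the stubs file, the usual export schema namespace, and that a page's first revision carries no parentid element (adjust the names to match the actual files):

import gzip
import xml.etree.ElementTree as ET

# Assumptions: a local copy of one day's stubs file from the adds-changes
# directory, and the export schema namespace below (adjust to your dump).
NS = "{http://www.mediawiki.org/xml/export-0.11/}"
STUBS = "enwiki-20230117-stubs-meta-hist-incr.xml.gz"   # example name

with gzip.open(STUBS, "rb") as f:
    for _event, page in ET.iterparse(f):
        if page.tag != NS + "page":
            continue
        for rev in page.findall(NS + "revision"):
            # A revision with no parentid is the first one for its page,
            # so the page was created inside this dump's window.
            if rev.find(NS + "parentid") is None:
                print(page.find(NS + "title").text,
                      rev.find(NS + "timestamp").text)
        page.clear()   # keep memory use bounded while streaming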

Ariel Glenn
ar...@wikimedia.org

On Tue, Jan 17, 2023 at 6:22 AM Eric Andrew Lewis <
eric.andrew.le...@gmail.com> wrote:

> Hi,
>
> I am interested in performing analysis on recently created pages on
> English Wikipedia.
>
> One way to find recently created pages is downloading a meta-history file
> for the English language, and filter through the XML, looking for pages
> where the oldest revision is within the desired timespan.
>
> Since this requires a library to parse through XML string data, I would
> imagine this is much slower than a database query. Is page revision data
> available in one of the SQL dumps which I could query for this use case?
> Looking at the exported tables list
> ,
> it does not look like it is. Maybe this is intentional?
>
> Thanks,
> Eric Andrew Lewis
> ericandrewlewis.com
> +1 610 715 8560
> ___
> Xmldatadumps-l mailing list -- xmldatadumps-l@lists.wikimedia.org
> To unsubscribe send an email to xmldatadumps-l-le...@lists.wikimedia.org
>
___
Xmldatadumps-l mailing list -- xmldatadumps-l@lists.wikimedia.org
To unsubscribe send an email to xmldatadumps-l-le...@lists.wikimedia.org


[Xmldatadumps-l] Re: XML Data Dumps 20220701

2022-07-03 Thread Ariel Glenn WMF
There is an issue with the availability of these dumps for retrieval for
publishing to the public. This is being tracked in
https://phabricator.wikimedia.org/T311441 and updates will be posted there.

Ariel Glenn
ar...@wikimedia.org

On Sun, Jul 3, 2022 at 9:37 PM  wrote:

> The folder
> https://dumps.wikimedia.org/other/enterprise_html/runs/20220701/
> is created, but empty as of 20220703
> ___
> Xmldatadumps-l mailing list -- xmldatadumps-l@lists.wikimedia.org
> To unsubscribe send an email to xmldatadumps-l-le...@lists.wikimedia.org
>
___
Xmldatadumps-l mailing list -- xmldatadumps-l@lists.wikimedia.org
To unsubscribe send an email to xmldatadumps-l-le...@lists.wikimedia.org


[Xmldatadumps-l] Re: Access imageinfo data in a dump

2022-02-05 Thread Ariel Glenn WMF
The text table itself is not dumped, because some entries in it may be
related to hidden revisions or deleted pages, and thus not viewable by
ordinary users.

The text id is given in the content dumps as an xml tag before the wrapped
wikitext content, and you can associate the items that way.

Ariel

On Fri, Feb 4, 2022 at 10:43 PM Mitar  wrote:

> Hi!
>
> Will do. Thanks.
>
> After going through the image table dump, it seems not all data is in
> there. For example, page count for Djvu files is missing. Instead of
> metadata in the image table dump, a reference to text table [1] is
> provided:
>
> {"data":[],"blobs":{"data":"tt:609531648","text":"tt:609531649"}}
>
> But that table itself does not seem to be available as a dump? Or am I
> missing something or misunderstanding something?
>
> [1] https://www.mediawiki.org/wiki/Manual:Text_table
>
>
> Mitar
>
> On Fri, Feb 4, 2022 at 6:54 AM Ariel Glenn WMF 
> wrote:
> >
> > This looks great! If you like, you might add the link and a  brief
> description to this page:
> https://meta.wikimedia.org/wiki/Data_dumps/Other_tools  so that more
> people can find and use the library :-)
> >
> > (Anyone else have tools they wrote and use, that aren't on this list?
> Please add them!)
> >
> > Ariel
> >
> > On Fri, Feb 4, 2022 at 2:31 AM Mitar  wrote:
> >>
> >> Hi!
> >>
> >> If it is useful to anyone else, I have added to my library [1] in Go
> >> for processing dumps support for processing SQL dumps directly,
> >> without having to load them into a database. So one can process them
> >> directly to extract data, like dumps in other formats.
> >>
> >> [1] https://gitlab.com/tozd/go/mediawiki
> >>
> >>
> >> Mitar
> >>
> >> On Thu, Feb 3, 2022 at 9:13 AM Mitar  wrote:
> >> >
> >> > Hi!
> >> >
> >> > I see. Thanks.
> >> >
> >> >
> >> > Mitar
> >> >
> >> > On Thu, Feb 3, 2022 at 7:17 AM Ariel Glenn WMF 
> wrote:
> >> > >
> >> > > The media/file descriptions contained in the dump are the wikitext
> of the revisions of pages with the File: prefix, plus the metadata about
> those pages and revisions (user that made the edit, timestamp of edit, edit
> comment, and so on).
> >> > >
> >> > > Width and height of the image, the media type, the sha1 of the
> image and a few other details can be obtained by looking at the
> image.sql.gz file available for download for the dumps for each wiki. Have
> a look at https://www.mediawiki.org/wiki/Manual:Image_table for more info.
> >> > >
> >> > > Hope that helps!
> >> > >
> >> > > Ariel Glenn
> >> > >
> >> > >
> >> > >
> >> > > On Wed, Feb 2, 2022 at 10:45 PM Mitar  wrote:
> >> > >>
> >> > >> Hi!
> >> > >>
> >> > >> I am trying to find a dump of all imageinfo data [1] for all files
> on
> >> > >> Commons. I thought that "Articles, templates, media/file
> descriptions,
> >> > >> and primary meta-pages" XML dump would contain that, given the
> >> > >> "media/file descriptions" part, but it seems this is not the case.
> Is
> >> > >> there a dump which contains that information? And what is
> "media/file
> >> > >> descriptions" then? Wiki pages of files?
> >> > >>
> >> > >> [1] https://www.mediawiki.org/wiki/API:Imageinfo
> >> > >>
> >> > >>
> >> > >> Mitar
> >> > >>
> >> > >> --
> >> > >> http://mitar.tnode.com/
> >> > >> https://twitter.com/mitar_m
> >> > >> ___
> >> > >> Xmldatadumps-l mailing list -- xmldatadumps-l@lists.wikimedia.org
> >> > >> To unsubscribe send an email to
> xmldatadumps-l-le...@lists.wikimedia.org
> >> >
> >> >
> >> >
> >> > --
> >> > http://mitar.tnode.com/
> >> > https://twitter.com/mitar_m
> >>
> >>
> >>
> >> --
> >> http://mitar.tnode.com/
> >> https://twitter.com/mitar_m
>
>
>
> --
> http://mitar.tnode.com/
> https://twitter.com/mitar_m
>
___
Xmldatadumps-l mailing list -- xmldatadumps-l@lists.wikimedia.org
To unsubscribe send an email to xmldatadumps-l-le...@lists.wikimedia.org


[Xmldatadumps-l] Re: Access imageinfo data in a dump

2022-02-03 Thread Ariel Glenn WMF
This looks great! If you like, you might add the link and a  brief
description to this page:
https://meta.wikimedia.org/wiki/Data_dumps/Other_tools  so that more people
can find and use the library :-)

(Anyone else have tools they wrote and use, that aren't on this list?
Please add them!)

Ariel

On Fri, Feb 4, 2022 at 2:31 AM Mitar  wrote:

> Hi!
>
> If it is useful to anyone else, I have added to my library [1] in Go
> for processing dumps support for processing SQL dumps directly,
> without having to load them into a database. So one can process them
> directly to extract data, like dumps in other formats.
>
> [1] https://gitlab.com/tozd/go/mediawiki
>
>
> Mitar
>
> On Thu, Feb 3, 2022 at 9:13 AM Mitar  wrote:
> >
> > Hi!
> >
> > I see. Thanks.
> >
> >
> > Mitar
> >
> > On Thu, Feb 3, 2022 at 7:17 AM Ariel Glenn WMF 
> wrote:
> > >
> > > The media/file descriptions contained in the dump are the wikitext of
> the revisions of pages with the File: prefix, plus the metadata about those
> pages and revisions (user that made the edit, timestamp of edit, edit
> comment, and so on).
> > >
> > > Width and height of the image, the media type, the sha1 of the image
> and a few other details can be obtained by looking at the image.sql.gz file
> available for download for the dumps for each wiki. Have a look at
> https://www.mediawiki.org/wiki/Manual:Image_table for more info.
> > >
> > > Hope that helps!
> > >
> > > Ariel Glenn
> > >
> > >
> > >
> > > On Wed, Feb 2, 2022 at 10:45 PM Mitar  wrote:
> > >>
> > >> Hi!
> > >>
> > >> I am trying to find a dump of all imageinfo data [1] for all files on
> > >> Commons. I thought that "Articles, templates, media/file descriptions,
> > >> and primary meta-pages" XML dump would contain that, given the
> > >> "media/file descriptions" part, but it seems this is not the case. Is
> > >> there a dump which contains that information? And what is "media/file
> > >> descriptions" then? Wiki pages of files?
> > >>
> > >> [1] https://www.mediawiki.org/wiki/API:Imageinfo
> > >>
> > >>
> > >> Mitar
> > >>
> > >> --
> > >> http://mitar.tnode.com/
> > >> https://twitter.com/mitar_m
> > >> ___
> > >> Xmldatadumps-l mailing list -- xmldatadumps-l@lists.wikimedia.org
> > >> To unsubscribe send an email to
> xmldatadumps-l-le...@lists.wikimedia.org
> >
> >
> >
> > --
> > http://mitar.tnode.com/
> > https://twitter.com/mitar_m
>
>
>
> --
> http://mitar.tnode.com/
> https://twitter.com/mitar_m
>
___
Xmldatadumps-l mailing list -- xmldatadumps-l@lists.wikimedia.org
To unsubscribe send an email to xmldatadumps-l-le...@lists.wikimedia.org


[Xmldatadumps-l] Re: Access imageinfo data in a dump

2022-02-02 Thread Ariel Glenn WMF
The media/file descriptions contained in the dump are the wikitext of the
revisions of pages with the File: prefix, plus the metadata about those
pages and revisions (user that made the edit, timestamp of edit, edit
comment, and so on).

Width and height of the image, the media type, the sha1 of the image and a
few other details can be obtained by looking at the image.sql.gz file
available for download for the dumps for each wiki. Have a look at
https://www.mediawiki.org/wiki/Manual:Image_table for more info.
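
If it is useful, here is a rough Python sketch (not official tooling) that pulls img_name, img_size, img_width and img_height out of the INSERT statements, assuming the column order documented on that manual page; the regex is naive, so for anything serious it is safer to load the dump into MySQL/MariaDB and query it there:

import gzip
import re

# Naive sketch: grab img_name, img_size, img_width, img_height from the
# INSERT statements, assuming that column order. A "('" sequence inside a
# metadata blob could confuse the regex, so this is for quick looks only.
ROW = re.compile(r"\('((?:[^'\\]|\\.)*)',(\d+),(\d+),(\d+),")

with gzip.open("commonswiki-latest-image.sql.gz", "rt",
               encoding="utf-8", errors="replace") as f:
    for line in f:
        if line.startswith("INSERT INTO"):
            for name, size, width, height in ROW.findall(line):
                print(name, width + "x" + height, size + " bytes")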

Hope that helps!

Ariel Glenn



On Wed, Feb 2, 2022 at 10:45 PM Mitar  wrote:

> Hi!
>
> I am trying to find a dump of all imageinfo data [1] for all files on
> Commons. I thought that "Articles, templates, media/file descriptions,
> and primary meta-pages" XML dump would contain that, given the
> "media/file descriptions" part, but it seems this is not the case. Is
> there a dump which contains that information? And what is "media/file
> descriptions" then? Wiki pages of files?
>
> [1] https://www.mediawiki.org/wiki/API:Imageinfo
>
>
> Mitar
>
> --
> http://mitar.tnode.com/
> https://twitter.com/mitar_m
> ___
> Xmldatadumps-l mailing list -- xmldatadumps-l@lists.wikimedia.org
> To unsubscribe send an email to xmldatadumps-l-le...@lists.wikimedia.org
>
___
Xmldatadumps-l mailing list -- xmldatadumps-l@lists.wikimedia.org
To unsubscribe send an email to xmldatadumps-l-le...@lists.wikimedia.org


[Xmldatadumps-l] Re: Directory listing too small/filename too long

2021-11-28 Thread Ariel Glenn WMF
You can get the filename listing a couple of other ways:
Check the directory listing for the specific date, i.e.
https://dumps.wikimedia.org/wikidatawiki/20211120/
Get the status file from that or the "latest" directory, i.e.
https://dumps.wikimedia.org/wikidatawiki/20211120/dumpstatus.json
Get one of the hash file lists from that or the "latest" directory, i.e.
https://dumps.wikimedia.org/wikidatawiki/20211120/wikidatawiki-20211120-md5sums.json
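
For the dumpstatus.json route, a small Python sketch along these lines should work; it assumes the current layout with a top-level "jobs" key mapping each job to its "files", so treat that as an assumption and adjust if the schema changes:

import json
import urllib.request

# Sketch: fetch the machine-readable status for one run and list its files.
# The "jobs" -> job -> "files" layout is an assumption about the current
# dumpstatus.json schema.
url = "https://dumps.wikimedia.org/wikidatawiki/20211120/dumpstatus.json"
with urllib.request.urlopen(url) as resp:
    status = json.load(resp)

for jobname, job in status.get("jobs", {}).items():
    for filename, info in job.get("files", {}).items():
        print(jobname, filename, info.get("size"), info.get("url"))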

The directory listing you mention above is generated automatically by nginx
via the autoindex directive, and the field widths are not configurable that
I can see.

Hope that helps!

Ariel

On Thu, Nov 25, 2021 at 5:09 PM Wurgl  wrote:

> Hello!
>
> Maybe you could change the directory listing a little bit.
>
> In https://dumps.wikimedia.org/wikidatawiki/latest/ I see a lot of
> file names which are too long:
>
> wikidatawiki-latest-pages-articles-multistream-..> 23-Nov-2021 20:53
> 329566492
> wikidatawiki-latest-pages-articles-multistream-..> 24-Nov-2021 16:37
>   877
> wikidatawiki-latest-pages-articles-multistream-..> 23-Nov-2021 07:17
>   1471303
> ...
>
> Thanks!
> Wolfgang
> ___
> Xmldatadumps-l mailing list -- xmldatadumps-l@lists.wikimedia.org
> To unsubscribe send an email to xmldatadumps-l-le...@lists.wikimedia.org
>
___
Xmldatadumps-l mailing list -- xmldatadumps-l@lists.wikimedia.org
To unsubscribe send an email to xmldatadumps-l-le...@lists.wikimedia.org


[Xmldatadumps-l] Wikimedia Enterprise HTML dumps available for public download

2021-10-19 Thread Ariel Glenn WMF
I am pleased to announce that Wikimedia Enterprise's HTML dumps [1] for
October 17-18th are available for public download; see
https://dumps.wikimedia.org/other/enterprise_html/ for more information. We
expect to make updated versions of these files available around the 1st/2nd
of the month and the 20th/21st of the month, following the cadence of the
standard SQL/XML dumps.

This is still an experimental service, so there may be hiccups from time to
time. Please be patient and report issues as you find them. Thanks!

Ariel "Dumps Wrangler" Glenn

[1] See https://www.mediawiki.org/wiki/Wikimedia_Enterprise for much more
about Wikimedia Enterprise and its API.
___
Xmldatadumps-l mailing list -- xmldatadumps-l@lists.wikimedia.org
To unsubscribe send an email to xmldatadumps-l-le...@lists.wikimedia.org


[Xmldatadumps-l] Re: still only a partial dump for 20210801 for a lot of wikis

2021-08-09 Thread Ariel Glenn WMF
Not the script itself, but we have a permissions problem on some status
files that I'm having trouble stamping out. See
https://phabricator.wikimedia.org/T288192 for updates as they come in.

Ariel

On Mon, Aug 9, 2021 at 10:18 AM griffin tucker <
lmxxlmwikwik3...@griffintucker.id.au> wrote:

> strangely the enwiki dump is complete, but not a lot of the other
> wikis (such as enwiktionary) that are a lot smaller
>
> usually they're finished after a few days
>
> something going wrong with the dump script?
> ___
> Xmldatadumps-l mailing list -- xmldatadumps-l@lists.wikimedia.org
> To unsubscribe send an email to xmldatadumps-l-le...@lists.wikimedia.org
>
___
Xmldatadumps-l mailing list -- xmldatadumps-l@lists.wikimedia.org
To unsubscribe send an email to xmldatadumps-l-le...@lists.wikimedia.org


Re: [Xmldatadumps-l] enwiki dump ?

2021-02-03 Thread Ariel Glenn WMF
The enwiki run got a later start this month as we switched hosts around for
migration to a more recent version of the OS. But it's currently moving
along nicely. Thanks for the report though!

Ariel

On Wed, Feb 3, 2021 at 1:27 PM Nicolas Vervelle  wrote:

> Hi,
>
> Is there a problem with the enwiki dump? It seems it still hasn't started for
> the February dump.
>
> Nico
> ___
> Xmldatadumps-l mailing list
> Xmldatadumps-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
>
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


Re: [Xmldatadumps-l] November 2nd dump run delayed half a day, wikidata full page content not ready yet

2020-11-22 Thread Ariel Glenn WMF
The files are now all available, as has been noted on the task. The bz2
files and 7z files are just fine and can be processed as usual.

Ariel

On Fri, Nov 20, 2020 at 2:37 PM Ariel Glenn WMF  wrote:

> Hello folks,
>
> I hope everyone is in good health and staying safe in these troubled times.
>
> Speaking of trouble, in the course of making an improvement to the xml/sql
> dumps, I introduced a bug, and so now I am doing the cleanup from that.
>
> The short version:
>
> There will be a 7z file missing from the wikidata full page content dumps,
> to be made available in a day or two.
> The corresponding bz2 file should become available later today, but it's
> possible that I will instead provide a slightly longer file which has a bad
> bz2 block on the end, and pages at the end larger than are specified in the
> filename. This would mean MANUAL PROCESSING IF YOU USE these full page
> content dumps. If this happens, I'll send an email update.
>
> The long version:
>
> See https://phabricator.wikimedia.org/T268333
>
> IN ALL CASES the xml/dumps run for the 20th of the month (ie. today)
> should start late tonight UTC time, if not earlier.
>
> My apologies for the inconvenience!
>
> Ariel
>
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


[Xmldatadumps-l] November 2nd dump run delayed half a day, wikidata full page content not ready yet

2020-11-20 Thread Ariel Glenn WMF
Hello folks,

I hope everyone is in good health and staying safe in these troubled times.

Speaking of trouble, in the course of making an improvement to the xml/sql
dumps, I introduced a bug, and so now I am doing the cleanup from that.

The short version:

There will be a 7z file missing from the wikidata full page content dumps,
to be made available in a day or two.
The corresponding bz2 file should become available later today, but it's
possible that I will instead provide a slightly longer file which has a bad
bz2 block on the end, and pages at the end larger than are specified in the
filename. This would mean MANUAL PROCESSING IF YOU USE these full page
content dumps. If this happens, I'll send an email update.

The long version:

See https://phabricator.wikimedia.org/T268333

IN ALL CASES the xml/dumps run for the 20th of the month (ie. today) should
start late tonight UTC time, if not earlier.

My apologies for the inconvenience!

Ariel
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


Re: [Xmldatadumps-l] Mirror status

2020-08-03 Thread Ariel Glenn WMF
The page is in our puppet repo; see
https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/dumps/files/web/html/public_mirrors.html

You can submit a patch to gerrit yourself if you like; see
https://www.mediawiki.org/wiki/Gerrit/Tutorial for setting up and working
with gerrit. Alternatively you can say "Um, that's too many hoops, can you
do it?" and I'll be happy to sync it up to the list on meta directly.

Ariel

On Mon, Aug 3, 2020 at 12:07 PM Count Count 
wrote:

> Hi Ariel!
>
> Thanks for this report! Would you be willing to open a task in phabricator
>> about the bytemark mirror, and tag it with dumps-generation so that it gets
>> into the right queue?
>> https://phabricator.wikimedia.org/maniphest/task/edit/form/1/
>>
>
> Sure, https://phabricator.wikimedia.org/T259467
>
> You could update the Umeå University mirror information on the wiki page
>> directly if you like:
>> https://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps#Current_Mirrors
>>
>
> Done, also marked the stalled mirrors. The Free Mirror Project mirror was
> removed from that page on Feb. 19, but that is still not reflected on
> https://dumps.wikimedia.org/mirrors.html. Is that page updated manually
> from the meta page? Can someone update that page?
>
> Best regards,
>
> Count Count
>
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


Re: [Xmldatadumps-l] Mirror status

2020-08-03 Thread Ariel Glenn WMF
Thanks for this report! Would you be willing to open a task in phabricator
about the bytemark mirror, and tag it with dumps-generation so that it gets
into the right queue?
https://phabricator.wikimedia.org/maniphest/task/edit/form/1/

The C3SL mirror has technical issues with DNS that are unresolved, although
we should revisit that.
https://gerrit.wikimedia.org/r/c/operations/puppet/+/556216 has a small bit
of the discussion around this.

You could update the Umeå University mirror information on the wiki page
directly if you like:
https://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps#Current_Mirrors

Again, thanks!

Ariel

On Sun, Aug 2, 2020 at 11:06 PM Count Count 
wrote:

> Just checked the mirrors.
>
>- The Academic Computer Club, Umeå University mirror apparently only
>mirrors the last two good dumps, not the last five:
>https://ftp.acc.umu.se/mirror/wikimedia.org/dumps/enwiki/
>- The Bytemark mirror seems to have stopped mirroring in May 2020:
>https://wikimedia.bytemark.co.uk/enwiki/
>- The C3SL mirror seems to have stopped mirroring in November 2019:
>http://wikipedia.c3sl.ufpr.br/enwiki/
>- The Free Mirror Project mirror seems to work fine.
>- The Your.org mirror seems to work fine.
>-  https://wikimedia.mirror.us.dev/ seems to work fine.
>
> Maybe the hosters of the failing mirrors should be contacted or the list
> updated?
>
> Best regards,
>
> Count Count
> ___
> Xmldatadumps-l mailing list
> Xmldatadumps-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
>
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


Re: [Xmldatadumps-l] List of dumped wikis, discrepancy with Wikidata

2020-08-02 Thread Ariel Glenn WMF
labswiki and labtestwiki are copies of Wikitech, which is maintained and
dumped in a special fashion. You can find those dumps here:
https://dumps.wikimedia.org/other/wikitech/dumps/
uk.wikiversity.org does not exist.
ecwikimedia, as you rightly note, is private.
The remaining wikis have all been deleted. We dump closed wikis but we do
not dump deleted ones.

I hope this addresses your concerns.

Ariel

On Sun, Aug 2, 2020 at 1:04 AM Count Count 
wrote:

> Hi!
>
> I am currently working on a dump search and download tool for all
> Wikimedia wikis. In order to find out which Wikimedia wikis exist I used
> Wikidata. While comparing the list of wikis from Wikidata with the list of
> dumped projects I found out that the following wikis are currently not
> being dumped:
>
>- alswikibooks (last dump 20180101)
>- alswikiquote (last dump 20180101)
>- alswiktionary  (last dump 20180101)
>- ecwikimedia (never dumped, private but not marked private in
>Wikidata?)
>- fixcopyrightwiki (last dump 20200220)
>- labswiki (never dumped?)
>- labtestwiki (never dumped?)
>- mowiki (last dump 20180101)
>- mowiktionary (last dump 20180101)
>- ru_sibwiki (last dump 20071011)
>- ukwikiversity (never dumped?)
>
> Is there an up-to-date machine-readable list of currently dumped wikis
> besides https://dumps.wikimedia.org/backup-index.html?
>
> (Off-topic) Spoiler for dump searching tool on my laptop:
> $ target/release/wdgrep "asdfdefased"
> /c/Users/xyz/wpdumps/dewiki-20200701-pages-articles-multistream.xml -v --ns
> 0
> Searched 21437.064 MiB in 8.467969 seconds (2531.5474 MiB/s).
>
> Best regards,
>
> Count Count
> ___
> Xmldatadumps-l mailing list
> Xmldatadumps-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
>
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


Re: [Xmldatadumps-l] Has anyone had success with data deduplication?

2020-07-29 Thread Ariel Glenn WMF
The basic problem is that the page content dumps are ordered by revision
number within each page, which makes good sense for dumps users but means
that the addition of a single revision to a page will shift all of the
remaining data, resulting in different compressed blocks. That's going to
be true regardless of the compression type.

In the not too distant future we might switch over to multi-stream output
files for all page content, fixing the page id range per stream for bz2
files. This might let a user check the current list of page ids against the
previous one and only get the streams with the pages they want, in the
brave new Hadoop-backed object store of my dreams. 7z files are another
matter altogether and I don't see how we can do better there without
rethinking them altogether.
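
To illustrate what the multistream layout already allows today, here is a rough Python sketch (my own, not supported tooling) that uses the companion index file, whose lines are offset:pageid:title, to pull out just the one bz2 stream containing a given page; the filenames and target page id are examples only:

import bz2

# Sketch: use the multistream index (lines of offset:pageid:title) to fetch
# only the bz2 stream holding one page. Filenames and page id are examples.
DUMP = "enwiki-latest-pages-articles-multistream.xml.bz2"
INDEX = "enwiki-latest-pages-articles-multistream-index.txt.bz2"
TARGET_PAGE_ID = 12

offsets = []            # distinct stream start offsets, in file order
target_offset = None
with bz2.open(INDEX, "rt", encoding="utf-8") as idx:
    for line in idx:
        offset, page_id, _title = line.split(":", 2)
        offset = int(offset)
        if not offsets or offsets[-1] != offset:
            offsets.append(offset)
        if int(page_id) == TARGET_PAGE_ID:
            target_offset = offset

if target_offset is None:
    raise SystemExit("page id not found in the index")

later = [o for o in offsets if o > target_offset]
with open(DUMP, "rb") as dump:
    dump.seek(target_offset)
    # Read up to the start of the next stream (or to EOF for the last one).
    data = dump.read(later[0] - target_offset) if later else dump.read()

print(bz2.decompress(data).decode("utf-8")[:500])   # ~100 <page> elements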

Can you describe which dump files you are keeping and why having them in
sequence is useful? Maybe we can find a workaround that will let you get
what you need without keeping a bunch of older files.

Ariel

On Tue, Jul 28, 2020 at 8:48 AM Count Count 
wrote:

> Hi!
>
> The underlying filesystem (ZFS) uses block-level deduplication, so unique
> chunks of 128KiB (default value) are only stored once. The 128KB chunks
> making up dumps are mostly unique since there is no alignment so
> deduplication will not help as far as I can see.
>
> Best regards,
>
> Count Count
>
> On Tue, Jul 28, 2020 at 3:51 AM griffin tucker 
> wrote:
>
>> I’ve tried using freenas/truenas with a data deduplication volume to
>> store multiple sequential dumps; however, it doesn’t seem to save much space
>> at all – I was hoping someone could point me in the right direction so that
>> I can download multiple dumps and not have it take up so much room
>> (uncompressed).
>>
>>
>>
>> Has anyone tried anything similar and had success with data deduplication?
>>
>>
>>
>> Is there a guide?
>> ___
>> Xmldatadumps-l mailing list
>> Xmldatadumps-l@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
>>
> ___
> Xmldatadumps-l mailing list
> Xmldatadumps-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
>
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


Re: [Xmldatadumps-l] Request for Wikipedia dump of February 2017

2020-07-19 Thread Ariel Glenn WMF
Dear Rajakumaran Archulan,

Older dumps can often be found on the Internet Archive. The February 2017
full dumps for the English language Wikipedia are here:
https://archive.org/details/enwiki-20170201

A reminder for all new and older members of this list: comprehensive
documentation for dumps users is available on MetaWiki:
https://meta.wikimedia.org/wiki/Data_dumps  In the section "Getting the
dumps" there are pointers for locating older dumps that are no longer
available on the Wikimedia dumps download host.

Ariel Glenn
ar...@wikimedia.org

On Mon, Jul 20, 2020 at 6:54 AM Rajakumaran Archulan <
archulan...@cse.mrt.ac.lk> wrote:

> Dear sir/madam,
>
> I am a final year undergrad at the department of computer science &
> engineering at University of Moratuwa, Sri Lanka. We are in the process of
> building an evaluator for word embeddings for our final year project.
>
> We need the *Wikipedia dump of February 2017* for our research purpose.
> We searched across the web for several hours. But we couldn't find it. It
> would be grateful if you grant us access to the above corpus to
> continue our research.
>
> Thank you!
>
> --
> *Best regards,*
> *R.Archulan*
> *Final year undergrad (16' Batch),*
> *Dept. of Computer Science & Engineering,*
> *Faculty of Engineering,*
> *University of Moratuwa, Sri Lanka.*
> *Mobile: (+94) 771761696*
> *Linkedin* 
>
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


[Xmldatadumps-l] sample html dumps available FOR QA ONLY

2020-07-10 Thread Ariel Glenn WMF
NOTE: I did not produce the HTML dumps; they are being managed by another
team.

If you are interested in weighing in on the output format, what's missing,
etc, here is the phabricator task: https://phabricator.wikimedia.org/T257480
Your comments and suggestions would be welcome!

Ariel
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


[Xmldatadumps-l] Commons structured data dumps

2020-07-09 Thread Ariel Glenn WMF
RDF dumps of structured data from commons are now available at
https://dumps.wikimedia.org/other/wikibase/commonswiki/ They are run on a
weekly basis.

See https://lists.wikimedia.org/pipermail/wikidata/2020-July/014125.html
for more information.

Enjoy!
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


Re: [Xmldatadumps-l] Dumps stalled

2020-06-10 Thread Ariel Glenn WMF
They aren't, but the rsync copying files to the web server is behind. See
https://phabricator.wikimedia.org/T254856 for that. They'll catch up in the
next day or so.

Ariel

On Wed, Jun 10, 2020 at 7:36 PM Bruce Myers via Xmldatadumps-l <
xmldatadumps-l@lists.wikimedia.org> wrote:

> The wikidatawiki, commonswiki, enwiki 20200601 dumps appear to be stalled
> with last file writes on June 4 and 5.
> ___
> Xmldatadumps-l mailing list
> Xmldatadumps-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
>
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


Re: [Xmldatadumps-l] Wikitaxi omits certain information from pages and inconvenient strings are found throughout the pages

2020-05-04 Thread Ariel Glenn WMF
The WikiTaxi software is maintained by a group unaffiliated with the
Wikimedia Foundation, if it is maintained at all. I see that the wiki (
www.wikitaxi.org) has not been updated in years. There is a contact email
listed there which you might try: m...@wikitaxi.org

The parts you highlight are references to Lua modules which apparently are
not handled properly by the WikiTaxi application.  Given that source code
is not available for the application, according to the FAQ (
https://www.yunqa.de/delphi/wiki/wikitaxi/index ) your best hope is to see
if the email works or else try an alternative such as Kiwix (
https://www.kiwix.org/en/).


On Mon, May 4, 2020 at 9:13 AM Fluffy Cat  wrote:

> Hi.
>
> *I apologize if this is the wrong place to ask, but I have sent multiple
> messages to the WikiTaxi Facebook page and have got no reply.
>
> To clarify my problem, please see the attached images and compare them to
> "https://en.wikipedia.org/wiki/Cat" (the online version of the wiki). I
> have highlighted some of the respective parts of the page, which are
> causing the problem. Besides being inconvenient regarding visuals, a lot of
> these unidentified strings replace actual information in the Wiki, due to
> which that info becomes inaccessible. I have faced this problem in every
> Wikitaxi page that I have used.
>
> My WikiTaxi version is 1.3.0 and the dump file is called "Offline
> Wiki.taxi", and has a size of 25.59 GBs.
>
> Any help is appreciated.
>
> Cat
>
>
>
>
>
>
> ___
> Xmldatadumps-l mailing list
> Xmldatadumps-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
>
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


Re: [Xmldatadumps-l] Wikipedia xml dumps 2009-2013

2020-04-30 Thread Ariel Glenn WMF
You might check our archives as well as archive.org: see
https://meta.wikimedia.org/wiki/Data_dumps/Finding_older_xml_dumps if you
have not already done so.

Otherwise perhaps someone on the list will have a copy available.

Ariel

On Thu, Apr 30, 2020 at 1:15 PM Katja Schmahl 
wrote:

> Hi all,
>
> I’m doing research to the existence of gender bias in Wikipedia texts over
> time. To do this, I need old pages-articles.xml dumps. I am still looking
> for dumps from 2009 and 2011-2013, does anyone know how I can get one of
> these or does someone have one of these stored themselves?
>
> Thanks in advance,
> Katja Schmahl
>
>
> ___
> Xmldatadumps-l mailing list
> Xmldatadumps-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
>
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


[Xmldatadumps-l] BREAKING CHANGE: feature removal (private table dumps)

2020-04-06 Thread Ariel Glenn WMF
For the past few years we have not dumped private tables at all; they would
not be accessible to the public in any case, and they do not suffice as a
backup in case of catastrophic failure.

We are therefore removing the feature to dump private tables along with
public tables in a dump run. Anyone who wishes to use the dump scripts in
our python repo to dump privat tables in their wiki will need to create a
separate dumps configuration file and tables yaml file describing which
tables to dump and where to put them, as a separate dump run.

This change will be committed by April 20, 2020, in time for the second
dump run of the month.

Note that this does not impact the actual output of the Wikimedia SQL/XML
dumps at all, since we have not been dumping private tables since late 2016.

See T249508 to follow along.

Ariel
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


Re: [Xmldatadumps-l] Duplicate entry in last Spanish dump

2020-04-06 Thread Ariel Glenn WMF
The issue is being tracked here: https://phabricator.wikimedia.org/T249477
A fix has been deployed but may not take effect on all affected wikis
before the end of the run. In that case we will run manual no-op jobs on
these wikis to fix up the symlinks and the index.html files.

On Sun, Apr 5, 2020 at 9:16 AM Ariel Glenn WMF  wrote:

> Thanks for this report!
>
> This bug must have been introduced in my recent updates to file listing
> methods.
>
> The multistream file is produced and available for download by changing
> the file name in the download url.
>
> I'll have a look Monday to see about fixing up the index.html output
> generation.
>
> Ariel
>
> On Sat, Apr 4, 2020 at 11:51 AM Benjamín Valero Espinosa <
> benjaval...@gmail.com> wrote:
>
>> Hi,
>>
>> Sorry if this is not the right place to report this.
>>
>> In the last Spanish Wikipedia dump (still in progress):
>>
>> https://dumps.wikimedia.org/eswiki/20200401/
>>
>> the "pages-articles" dump is duplicated. I guess, based on the dump from
>> March (and also the previous ones) that they are really two different
>> files, so the filename should reflect it as before.
>>
>> I am aware there had been recent modifications in the way the multistream
>> dumps are built, so maybe there is some kind of issue there.
>>
>>
>> Best Regards,
>> ___
>> Xmldatadumps-l mailing list
>> Xmldatadumps-l@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
>>
>
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


Re: [Xmldatadumps-l] Duplicate entry in last Spanish dump

2020-04-05 Thread Ariel Glenn WMF
Thanks for this report!

This bug must have been introduced in my recent updates to file listing
methods.

The multistream file is produced and available for download by changing the
file name in the download url.

I'll have a look Monday to see about fixing up the index.html output
generation.

Ariel

On Sat, Apr 4, 2020 at 11:51 AM Benjamín Valero Espinosa <
benjaval...@gmail.com> wrote:

> Hi,
>
> Sorry if this is not the right place to report this.
>
> In the last Spanish Wikipedia dump (still in progress):
>
> https://dumps.wikimedia.org/eswiki/20200401/
>
> the "pages-articles" dump is duplicated. I guess, based on the dump from
> March (and also the previous ones) that they are really two different
> files, so the filename should reflect it as before.
>
> I am aware there had been recent modifications in the way the multistream
> dumps are built, so maybe there is some kind of issue there.
>
>
> Best Regards,
> ___
> Xmldatadumps-l mailing list
> Xmldatadumps-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
>
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


[Xmldatadumps-l] No second dump run this month

2020-03-19 Thread Ariel Glenn WMF
As mentioned earlier on xmldatadumps-l, the dumps are running very slowly
this month, since the vslow db hosts they use are also serving live traffic
during a tables migration. Even manual runs of partial jobs would not help
the situation any, so there will be NO SECOND DUMP RUN THIS MONTH. The
March 1 Wikidata run is still in process but it should complete in the next
several days.

With any luck everything will be back to normal in April and we'll be able
to conduct two runs as usual from then on.

Ariel
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


[Xmldatadumps-l] Monthly dumps for March; possible no second run

2020-03-16 Thread Ariel Glenn WMF
Hello everybody,

Those of you who follow the dumps closely may have noticed that they are
running slower than usual this month. That is because the db servers on
which they run are also serving live traffic, so that a wikidata-related
migration can complete before the end of the month.

I will try to do extra manual runs to see if I can get the wikidata page
content dumps to complete by the 20th, but if it turns out not to be
feasible, we may delay or skip the March 20th run.

In any case the two April runs should happen on schedule.

Thanks in advance for your understanding.

Ariel

P.S. Please stay safe and look after each other.
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


[Xmldatadumps-l] kowiki joins the ranks of the 'big wikis'

2020-02-28 Thread Ariel Glenn WMF
Happy almost March, everyone!

Kowiki dumps jobs now take long enough to run for certain steps that the
wiki has been moved to the 'big wikis' list. This means that 6 parallel
jobs will produce output for stubs and page content dumps, similarly to
frwiki, dewiki and so on. See [1] for more.

This will take effect with the next dump run, starting tomorrow.
Please adjust your scripts accordingly.

Ariel

[1] https://phabricator.wikimedia.org/T245721
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


Re: [Xmldatadumps-l] Was format change plan postponed?

2020-02-11 Thread Ariel Glenn WMF
Good morning!

We are a bit delayed due to some code changes that need to go in. We hope
to make the switch in March; I'll send an update with the target date when
all patches have been deployed.  My apologies for not updating the list.

You can follow the progress of this changeover on
https://phabricator.wikimedia.org/T238972

Ariel

On Wed, Feb 12, 2020 at 5:35 AM Itsuki Toyota  wrote:

> Hi, Xmldatadumps team
>
> As you know, the format of xml dumps should change after February 1, 2020:
>
> https://lists.wikimedia.org/pipermail/xmldatadumps-l/2019-November/001508.html
>
> However, I cannot find any changes on the Japanese dumps such as
> jawiki-20200201-pages-articles.xml.bz2.
> If the format change plan was postponed, could you tell me the date when
> this change will occur?
>
> Cheers,
>
> ___
> Xmldatadumps-l mailing list
> Xmldatadumps-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
>
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


Re: [Xmldatadumps-l] Ordering of revisions

2020-01-17 Thread Ariel Glenn WMF
The queries to get page and revision metadata are ordered by page id, and
within each page, by revision id. This is guaranteed.
The behavior of rev_parent_id, however, is not guaranteed in certain edge
cases. See e.g. https://phabricator.wikimedia.org/T193211
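
For anyone who would rather not depend on parentid at all, a small Python sketch like the following sorts each page's revisions by revision id while streaming a history file; the schema namespace and filename are assumptions to adjust for your dump:

import bz2
import xml.etree.ElementTree as ET

# Sketch: sort each page's revisions by revision id while streaming, rather
# than trusting parentid chains. Namespace and filename are assumptions.
NS = "{http://www.mediawiki.org/xml/export-0.11/}"
HISTORY = "enwiki-latest-pages-meta-history1.xml-p1p812.bz2"   # example name

with bz2.open(HISTORY, "rb") as f:
    for _event, elem in ET.iterparse(f):
        if elem.tag == NS + "page":
            revs = sorted(elem.findall(NS + "revision"),
                          key=lambda r: int(r.find(NS + "id").text))
            print(elem.find(NS + "title").text,
                  [int(r.find(NS + "id").text) for r in revs[:5]])
            elem.clear()   # free memory as we go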

Anyone who uses this field care to weigh in?

Ariel

On Fri, Jan 17, 2020 at 10:52 AM Christopher Wolfram <
chriscwolf...@gmail.com> wrote:

> Hi,
>
> Perhaps there is documentation about this, but I have looked for the past
> hour and haven’t found anything.
>
> I was wondering if it is guaranteed that all revisions given in the
> enwiki-latest-pages-meta-history files are in order of
> parent->child->grandchild->… In a few examples, it looks like they follow
> this pattern. I ask because I need them in order and it would be nice if I
> didn’t have to do that with the parentid field.
>
> Thank you,
> Christopher
>
>
>
> ___
> Xmldatadumps-l mailing list
> Xmldatadumps-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
>
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


Re: [Xmldatadumps-l] Your help requested (testing decompression)

2020-01-08 Thread Ariel Glenn WMF
I'd like to move ahead with producing multistream files for *all* bz2
compressed output by March 1. So if you have strenuous objections, now is
the time to weigh in; see https://phabricator.wikimedia.org/T239866!

Even files which are not produced by compressing and concatenating other
files will have the mediawiki/siteinfo header as one bz2 stream, the
mediawiki close tag as another bz2 stream, and the body containing all page
and revision content as a third stream. This will allow us to generate
pages-articles and pages-meta-current files from their parts 1-6 files in a
matter of a few minutes, cutting out many hours from the dump runs overall.
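To see the stream layout directly, here is a rough sketch (Python standard
library only, not part of any official tooling; the file name near the end is
just an example) that walks the concatenated bzip2 streams one at a time:

import bz2

def iter_bz2_streams(path, chunk_size=1 << 20):
    # Yield the decompressed bytes of each bzip2 stream in the file, one
    # stream at a time. Each stream is held in memory, which is fine for the
    # 100-pages-per-stream multistream layout but not for a huge
    # single-stream body.
    with open(path, "rb") as f:
        leftover = b""
        while True:
            chunk = leftover or f.read(chunk_size)
            if not chunk:
                return
            decomp = bz2.BZ2Decompressor()
            out = bytearray()
            while True:
                out += decomp.decompress(chunk)
                if decomp.eof:
                    leftover = decomp.unused_data   # start of the next stream
                    break
                chunk = f.read(chunk_size)
                if not chunk:                       # truncated final stream
                    leftover = b""
                    break
            yield bytes(out)

streams = sum(1 for _ in iter_bz2_streams(
    "cewiki-20191201-pages-articles-multistream.xml.bz2"))
print(streams, "bz2 streams")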

Please check your tools using the files linked in the previous emails and
make sure that they work.

Thanks!

Ariel

On Thu, Dec 5, 2019 at 12:01 AM Ariel Glenn WMF  wrote:

> if you use one of the utilities listed here:
> https://phabricator.wikimedia.org/T239866
> I'd like you to download one of the 'multistream' dumps and see if your
> utility decompresses it fully or not (you can compare the md5sum of the
> decompressed content to the regular file's decompressed content and see if
> they are the same). Then note the results and the version of the utility on
> this task.
>
> Alternatively, if you use some other utility to work with the bz2 files,
> please test using that, and add that on the task too.
>
> Here are two files for download and comparison of decompressed content:
>
>
> https://dumps.wikimedia.org/cewiki/20191201/cewiki-20191201-pages-articles.xml.bz2
> and
>
> https://dumps.wikimedia.org/cewiki/20191201/cewiki-20191201-pages-articles-multistream.xml.bz2
>
> Both are around 50 megabytes.
>
> Thank you in advance to whomever participates!
>
> Ariel
>
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


[Xmldatadumps-l] Your help requested (testing decompression)

2019-12-04 Thread Ariel Glenn WMF
if you use one of the utilities listed here:
https://phabricator.wikimedia.org/T239866
I'd like you to download one of the 'multistream' dumps and see if your
utility decompresses it fully or not (you can compare the md5sum of the
decompressed content to the regular file's decompressed content and see if
they are the same). Then note the results and the version of the utility on
this task.

Alternatively, if you use some other utility to work with the bz2 files,
please test using that, and add that on the task too.

Here are two files for download and comparison of decompressed content:

https://dumps.wikimedia.org/cewiki/20191201/cewiki-20191201-pages-articles.xml.bz2
and
https://dumps.wikimedia.org/cewiki/20191201/cewiki-20191201-pages-articles-multistream.xml.bz2

Both are around 50 megabytes.
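For example, a minimal way to do the comparison without writing the
decompressed XML to disk (a rough sketch, Python standard library only;
bz2.open reads concatenated streams on Python 3.3 and later):

import bz2
import hashlib

def md5_of_decompressed(path, chunk_size=1 << 20):
    # Stream-decompress and hash, so nothing uncompressed ever hits the disk.
    h = hashlib.md5()
    with bz2.open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

for name in ("cewiki-20191201-pages-articles.xml.bz2",
             "cewiki-20191201-pages-articles-multistream.xml.bz2"):
    print(md5_of_decompressed(name), name)

If your own decompression tool produces the same checksum for both files, it
handles the multistream layout correctly.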

Thank you in advance to whomever participates!

Ariel
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


[Xmldatadumps-l] Comments requested: produce empty abstract files for Wikidata?

2019-10-21 Thread Ariel Glenn WMF
Currently, the abstracts dump for Wikidata consists of 62 million entries,
all of which contain a placeholder <abstract> element instead of any real
abstract. Instead of this, I am considering producing abstract files that
would contain only the mediawiki header and footer and the usual siteinfo
contents. What do people think about this?

Rationale:

It takes 36 hours of time to produce these useless files.
It places an extra burden on the db servers for no good reason.
It requires more bandwidth to download and process these useless files than
having a file with no entries.
Wikidata will only ever have Q-entities or other entities in the main
namespace that are not text or wikitext and so are not suitable for
abstracts.

Please comment here or on the task:
https://phabricator.wikimedia.org/T236006

If there are no comments or blockers after a week, I'll start implementing
this, and it will likely go into effect for the November 20th run.

Your faithful dumps wrangler,

Ariel Glenn
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


Re: [Xmldatadumps-l] Incremental dumps

2019-09-11 Thread Ariel Glenn WMF
All dumps were interrupted for a period of several days due to a MediaWiki
change. See https://phabricator.wikimedia.org/T232268 for details.

Ariel

On Wed, Sep 11, 2019 at 4:43 PM colin johnston 
wrote:

> Any news on retention time for backups as well :)
>
> Col
>
>
> > On 11 Sep 2019, at 14:38, Dario Montagnini 
> wrote:
> >
> > Hello,
> > I would like to know if there are information about the incremental
> dumps.
> > I noticed that the generation has been suspended for about six days.
> >
> > Thank you!
> > ___
> > Xmldatadumps-l mailing list
> > Xmldatadumps-l@lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
>
>
> ___
> Xmldatadumps-l mailing list
> Xmldatadumps-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
>
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


[Xmldatadumps-l] New dumps mirror in the United States (Colorado)

2019-08-09 Thread Ariel Glenn WMF
Greetings dumps users, remixers and sharers!

I'm happy to announce that we have another mirror of the last 5 XML dumps,
located in the United States, for your downloading pleasure.

All the information you need is here:
https://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps#Current_mirrors
The last entry in the list before archive.org is the new site,
https://wikimedia.mirror.us.dev

Huge thanks to Chip for volunteering space and bandwidth for this. Let's
put it to good use!

Ariel
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


Re: [Xmldatadumps-l] Wikidate Entitites 24/06/2019 dump missing

2019-07-01 Thread Ariel Glenn WMF
This dump was incomplete due to a problem with MediaWiki code. It was
removed so that scripts such as yours would not process a file with half
the entities in it.

This week's run should provide a new and complete file. For more
information, you can follow along on the Phabricator task:
https://phabricator.wikimedia.org/T226601

On Mon, Jul 1, 2019 at 11:49 PM Petra Kubernátová 
wrote:

> Good afternoon,
> I have a question regarding the Wikidata Entities data dump and I was not
> able to find a suitable place where I could ask it.
>
> We have been using the Wikidata Entities data dump for quite a while, but
> the last two weeks we have been having an issue where the data dump archive
> has disappeared from the website, or it has not been there at all.
>
> I mean here: https://dumps.wikimedia.org/other/wikidata/
>
> 20190624.json.gz returns a File Not Found.
>
>
> Could you please tell me where I could find this file or redirect me to
> someone who could give me more information?
>
>
> Thank you very much for your great work.
>
> Kind regards,
>
> Petra K.
> ___
> Xmldatadumps-l mailing list
> Xmldatadumps-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
>
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


[Xmldatadumps-l] svwiki to move to the ranks of the 'big wikis'

2019-06-20 Thread Ariel Glenn WMF
Hello dumps users and re-users!

As you know, some wikis are large enough that we produce dumps of some
files in 6 pieces in parallel. We'll begin doing this for svwiki starting
on July 1. You can follow along on https://phabricator.wikimedia.org/T226200
if interested. If you have not previously worked with a wiki large enough
to have files split up, you can look at e.g. frwiki or itwiki dumps to see
what files are produced.

Thanks to Ebonetti90 for the request!

Ariel
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


Re: [Xmldatadumps-l] Wikipedia Dumps required

2019-05-29 Thread Ariel Glenn WMF
You can find some older dumps at https://dumps.wikimedia.org/archive/ (see
https://meta.wikimedia.org/wiki/Data_dumps/Finding_older_xml_dumps for more
about finding older dumps in general). I didn't see the March 2006 files
but these https://dumps.wikimedia.org/archive/enwiki/20060816/ are later in
that year; perhaps they would do?

While I'm here, https://meta.wikimedia.org/wiki/Data_dumps is a good
starting point for looking for dumps documentation overall; if there's
something missing there, let me know.

Good luck!

Ariel

On Wed, May 29, 2019 at 4:53 PM Muhammad Ali  wrote:

> Hello,
> I am doing my Master thesis in Germany and i want the Wikipedia database
> of March 26,2006.
>
> Could you please tell me where i can find that data ?
>
> Best regards,
> Muhammad Ali
>
>
> ___
> Xmldatadumps-l mailing list
> Xmldatadumps-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
>
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


Re: [Xmldatadumps-l] Approx. number of pages in enwiki-latest-pages-articles.xml

2019-05-27 Thread Ariel Glenn WMF
The number should be around 19414056, the same number of pages in the
stubs-articles file.
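If you want to reproduce the count yourself, a rough sketch like this will do
it (Python standard library only; it leans on each <page> tag sitting on its
own line in the dump, and it takes a while for enwiki):

import bz2

count = 0
# File name is an example; any pages-articles dump works the same way.
with bz2.open("enwiki-latest-pages-articles.xml.bz2", "rt", encoding="utf-8") as f:
    for line in f:
        if "<page>" in line:
            count += 1
print(count, "pages")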

On Tue, May 28, 2019 at 8:35 AM Sigbert Klinke 
wrote:

> Hi,
>
> I would be interested to know how many pages are in
> enwiki-latest-pages-articles.xml. My own count gives 19.4 million pages.
> Can this be, at least roughly, confirmed?
>
> In the internet I just find these numbers:
>
> 5,861,178 - I guess this are all namespace 0 pages
> 47,826,337 - this are all pages in all namespaces
>
> Sigbert
>
> --
> https://hu.berlin/sk
> https://hu.berlin/mmstat3
>
> ___
> Xmldatadumps-l mailing list
> Xmldatadumps-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
>
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


[Xmldatadumps-l] some dump failures today

2019-03-06 Thread Ariel Glenn WMF
Those of you watching the xml/sql dumps run this month may have noticed
some dump failures today. These were caused by depooling of the database
server for maintenance while the dump hosts were querying it. The jobs in
question should be rerun automatically over the next few days, and I'll be
keeping an eye on things.

Ariel
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


Re: [Xmldatadumps-l] wikimedia.bytemark.co.uk mirror is not updated from 2017-11

2019-03-04 Thread Ariel Glenn WMF
Please contact priv...@wikimedia.org about GDPR issues; this is the group
dealing with compliance. Thanks a lot.

On Mon, Mar 4, 2019 at 11:35 AM colin johnston 
wrote:

> The dumps public and/or mirrored need fixed retention policies attached
> and checked for compliance.
> Private information is present in talk page edits which get selectively
> removed/edited.
>
> GDPR issues cannot truly be adhered to if removal of content is actioned
> since dumps/mirrored information is not updated.
> GDPR reports need to be published with all article refs related and the
> dumps/mirrored updated to reflect compliance of removal.
>
> Colin
>
>
> On 4 Mar 2019, at 09:24, Ariel Glenn WMF  wrote:
>
> All of the information in these mirrored dump files is publicly available
> to any user; no private information is provided. For GDPR-specific issues,
> please contact priv...@wikimedia.org
> Thanks!
>
> On Mon, Mar 4, 2019 at 11:03 AM colin johnston 
> wrote:
>
>> How is GDPR issue handled with this mirrored information ?
>> How is retention guidelines followed with this mirrored information ?
>>
>> Colin
>>
>>
>>
>> On 4 Mar 2019, at 08:52, Ariel Glenn WMF  wrote:
>>
>> Excuse this very late reply. The index.html page is out of date but the
>> mirrored directories for various current runs are there. I'm checking with
>> a colleague about making sure the index page gets copied over.
>>
>> Ariel
>>
>> On Wed, Feb 6, 2019 at 1:14 PM Mariusz "Nikow" Klinikowski <
>> mariuszklinikow...@gmail.com> wrote:
>>
>>> Greetings XML Dump users and contributors!
>>>
>>> Looks like https://wikimedia.bytemark.co.uk/ is not updated from
>>> 2017-11-26. I think, maybe somebody should delete it from mirror list or
>>> contact bytemark notify them?
>>>
>>> Best regards,
>>> Mariusz "Nikow" Klinikowski.
>>>
>>> ___
>>> Xmldatadumps-l mailing list
>>> Xmldatadumps-l@lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
>>>
>> ___
>> Xmldatadumps-l mailing list
>> Xmldatadumps-l@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
>>
>>
>>
>
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


Re: [Xmldatadumps-l] wikimedia.bytemark.co.uk mirror is not updated from 2017-11

2019-03-04 Thread Ariel Glenn WMF
All of the information in these mirrored dump files is publicly available
to any user; no private information is provided. For GDPR-specific issues,
please contact priv...@wikimedia.org
Thanks!

On Mon, Mar 4, 2019 at 11:03 AM colin johnston 
wrote:

> How is GDPR issue handled with this mirrored information ?
> How is retention guidelines followed with this mirrored information ?
>
> Colin
>
>
>
> On 4 Mar 2019, at 08:52, Ariel Glenn WMF  wrote:
>
> Excuse this very late reply. The index.html page is out of date but the
> mirrored directories for various current runs are there. I'm checking with
> a colleague about making sure the index page gets copied over.
>
> Ariel
>
> On Wed, Feb 6, 2019 at 1:14 PM Mariusz "Nikow" Klinikowski <
> mariuszklinikow...@gmail.com> wrote:
>
>> Greetings XML Dump users and contributors!
>>
>> Looks like https://wikimedia.bytemark.co.uk/ is not updated from
>> 2017-11-26. I think, maybe somebody should delete it from mirror list or
>> contact bytemark notify them?
>>
>> Best regards,
>> Mariusz "Nikow" Klinikowski.
>>
>> ___
>> Xmldatadumps-l mailing list
>> Xmldatadumps-l@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
>>
> ___
> Xmldatadumps-l mailing list
> Xmldatadumps-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
>
>
>
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


Re: [Xmldatadumps-l] wikimedia.bytemark.co.uk mirror is not updated from 2017-11

2019-03-04 Thread Ariel Glenn WMF
Excuse this very late reply. The index.html page is out of date but the
mirrored directories for various current runs are there. I'm checking with
a colleague about making sure the index page gets copied over.

Ariel

On Wed, Feb 6, 2019 at 1:14 PM Mariusz "Nikow" Klinikowski <
mariuszklinikow...@gmail.com> wrote:

> Greetings XML Dump users and contributors!
>
> Looks like https://wikimedia.bytemark.co.uk/ is not updated from
> 2017-11-26. I think, maybe somebody should delete it from mirror list or
> contact bytemark notify them?
>
> Best regards,
> Mariusz "Nikow" Klinikowski.
>
> ___
> Xmldatadumps-l mailing list
> Xmldatadumps-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
>
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


[Xmldatadumps-l] question about wikidata entity dumps usage (please forward to interested parties)

2019-02-16 Thread Ariel Glenn WMF
Hey folks,

We've had a request to reschedule the way the various wikidata entity dumps
are run. Right now they go once a week on set days of the week; we've been
asked about pegging them to specific days of the month, rather as the
xml/sql dumps are run. See https://phabricator.wikimedia.org/T216160 for
more info.

Is this going to cause problems for anyone? Do you ingest these dumps on a
schedule, and what works for you? Please weigh in here or on the
phabricator task; thanks!

Ariel
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


[Xmldatadumps-l] New dumps mirror: The Free Mirror Project

2019-02-06 Thread Ariel Glenn WMF
I am happy to announce a new mirror site, located in Canada, which is
hosting the last two good dumps of all projects. Please welcome and put to
good use https://dumps.wikimedia.freemirror.org/ !

I want to thank Adam for volunteering bandwidth and space and for getting
everything set up. More information about the project can be found at
http://freemirror.org/   Enjoy!

Ariel
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


[Xmldatadumps-l] incorrect links for pages articles multistream files for big wikis

2019-01-23 Thread Ariel Glenn WMF
Folks may have noticed already that the links presented for download of
pages-articles-multistream dumps are incorrect on the web pages for big
wikis. The files exist for download but the wrong links were created.

I'll be looking into that and fixing it up over the next days, but in the
meantime you can manually download the files by specifying the right name.
Apologies for the inconvenience.

Ariel
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


[Xmldatadumps-l] Change in multistream dump file production

2019-01-19 Thread Ariel Glenn WMF
TL;DR: Don't panic, the single articles multistream bz2 file for big wikis
will be produced shortly after the new smaller fles.

Long version: For big wikis which already have split up article files, we
now produce one multistream file per article file. These are now recombined
into a single file later, with a single index file in the fashion everyone
is used to.

This is part of the speedup work mentioned in the previous email.

Have a good weekend,

Ariel
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


[Xmldatadumps-l] mwbzutils BREAKING CHANGE

2019-01-19 Thread Ariel Glenn WMF
If you use recompressxml in the mwbzutils package, as of version 0.0.9
(just deployed) it no longer writes bz2 compressed data by default to
stdout; instead it relies on the extension of the output file and will
write either gzipped, bz2 or plain text output, accordingly. This means
that if it is directed to write to stdout, this will be uncompressed data.

You can work around this in your scripts by piping the text from stdout to
bzip directly from recompressxml.

This change came as part of some speedup work. I won't discuss that more
until we see how the next couple of runs go.

Thanks for your understanding.

Ariel
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


Re: [Xmldatadumps-l] Dump for enwiki blocked ?

2018-10-22 Thread Ariel Glenn WMF
The dumps are not blocked but a change in the way stubs dumps are processed
has slowed down the queries considerably.  This issue is being tracked here:
https://phabricator.wikimedia.org/T207628

Ariel

On Mon, Oct 22, 2018 at 1:07 PM Nicolas Vervelle 
wrote:

> Hi,
>
> The dump for enwiki seems to be blocked in "First-pass for page XML data
> dumps" :
> 2018-10-20 10:49:29 in-progress First-pass for page XML data dumps
> 2018-10-22 09:32:02: enwiki (ID 298305) 416 pages (0.8|30.1/sec all|curr),
> 36000 revs (70.5|72.4/sec all|curr), ETA 2019-03-13 10:20:01 [max 865183738]
>
> Other dumps like frwiki seem blocked also, maybe waiting for enwiki ?
>
> Nico
> ___
> Xmldatadumps-l mailing list
> Xmldatadumps-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
>
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


[Xmldatadumps-l] some revisions missing from Sept 13 adds-changes dump

2018-10-12 Thread Ariel Glenn WMF
If you are a user of the adds-changes (so-called "incremental") dumps, read
on.

All dumps use database servers in our eqiad data center. For the past
month, the wiki projects have used primary database masters out of our
codfw data center; on one of these days, a number of revisions did not
replicate properly to eqiad for about 50 minutes. This was not discovered
until we switched back to using the database servers in eqiad as primary
masters.

The date that replication was broken was September 13th. Because dumps use
eqiad database servers, the adds-changes dumps published on that date are
also missing that data.

We suggest that if you are using adds-changes dumps for a local mirror, you
import the next regular run of xml/sql dumps, starting on Oct 20th (for
current revisions only) or Nov 1st (for full history), which should contain
the missing revisions.

For more information on this incident, you may follow:
https://phabricator.wikimedia.org/T206743

Ariel
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


Re: [Xmldatadumps-l] flow dumps failures being worked on

2018-10-05 Thread Ariel Glenn WMF
These issues have been cleared up and flow dumps are being produced
properly.

Ariel

On Thu, Sep 6, 2018 at 1:51 PM Ariel Glenn WMF  wrote:

> This is being tracked here: https://phabricator.wikimedia.org/T203647
> You probably won't see much in the way of updates until all the jobs have
> completed; they are in progress now.
>
> Ariel
>
> On Thu, Sep 6, 2018 at 11:02 AM, Ariel Glenn WMF 
> wrote:
>
>> Hello dumps users!
>>
>> You may have noticed that a number of wikis have had dumps failures on
>> the flow dumps step. The cause is known (a cleanup of mediawiki core that
>> didn't carry over to the extension) and these jobs should be fixed up today
>> or tomorrow.
>>
>> Ariel
>>
>
>
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


Re: [Xmldatadumps-l] adds-changes (so-called 'incremental') dumps failed today

2018-10-02 Thread Ariel Glenn WMF
Somehow I committed but did not deploy one of the changes, so local testing
worked great and the production run of course failed. The missing code is
now live (I checked) so everything should be back to normal tomorrow.

Ariel

On Mon, Oct 1, 2018 at 5:26 PM Ariel Glenn WMF  wrote:

> The failure was a side effect of a configuration change that will,
> ironically enough, make it easier to test the 'other' dumps, including
> eventually these ones, in mediawiki-vagrant; see
> https://phabricator.wikimedia.org/T201478 for more information about that.
>
> They should run tomorrow and contain the content for the missing run as
> well.
>
> Apologies for the inconvenience.
>
> Ariel
>
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


[Xmldatadumps-l] adds-changes (so-called 'incremental') dumps failed today

2018-10-01 Thread Ariel Glenn WMF
The failure was a side effect of a configuration change that will,
ironically enough, make it easier to test the 'other' dumps, including
eventually these ones, in mediawiki-vagrant; see
https://phabricator.wikimedia.org/T201478 for more information about that.

They should run tomorrow and contain the content for the missing run as
well.

Apologies for the inconvenience.

Ariel
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


[Xmldatadumps-l] Oct 3 2018: RFC on xml dumps schema update to be discussed at TechCom

2018-10-01 Thread Ariel Glenn WMF
Hey dumps users and contributors!

This Wednesday, Oct 3 at 2pm PST (21:00 UTC, 23:00 CET) in #wikimedia-office
TechCom will have a discussion about the RFC for the upcoming xml schema
update needed for Multi-Content Revision content.

Phabricator task: https://phabricator.wikimedia.org/T199121
TechCom minutes announcing the meeting:
https://www.mediawiki.org/wiki/Wikimedia_Technical_Committee/Minutes/2018-09-26
Draft RFC:
https://www.mediawiki.org/wiki/Requests_for_comment/Schema_update_for_multiple_content_objects_per_revision_(MCR)_in_XML_dumps

If you have comments or suggestions, please show up!

Thanks,

Ariel
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


Re: [Xmldatadumps-l] flow dumps failures being worked on

2018-09-06 Thread Ariel Glenn WMF
This is being tracked here: https://phabricator.wikimedia.org/T203647
You probably won't see much in the way of updates until all the jobs have
completed; they are in progress now.

Ariel

On Thu, Sep 6, 2018 at 11:02 AM, Ariel Glenn WMF 
wrote:

> Hello dumps users!
>
> You may have noticed that a number of wikis have had dumps failures on the
> flow dumps step. The cause is known (a cleanup of mediawiki core that
> didn't carry over to the extension) and these jobs should be fixed up today
> or tomorrow.
>
> Ariel
>
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


[Xmldatadumps-l] flow dumps failures being worked on

2018-09-06 Thread Ariel Glenn WMF
Hello dumps users!

You may have noticed that a number of wikis have had dumps failures on the
flow dumps step. The cause is known (a cleanup of mediawiki core that
didn't carry over to the extension) and these jobs should be fixed up today
or tomorrow.

Ariel
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


[Xmldatadumps-l] huwiki, arwiki to be treated as 'big wikis' and run parallel jobs

2018-08-20 Thread Ariel Glenn WMF
Starting September 1, huwiki and arwiki, which both take several days to
complete the revision history content dumps, will be moved to the 'big
wikis' list, meaning that they will run jobs in parallel as do frwiki,
ptwiki and others now, for a speedup.

Please update your scripts accordingly.  Thanks!

Task for this: https://phabricator.wikimedia.org/T202268

Ariel
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


[Xmldatadumps-l] missing adds-changes dumps, page titles for today

2018-08-08 Thread Ariel Glenn WMF
These jobs did not run today due to a change in how maintenance scripts
handle unknown arguments. The problem has been fixed and the jobs should
run regularly tomorrow.
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


[Xmldatadumps-l] MultiContent Revisions and changes to the XML dumps

2018-08-02 Thread Ariel Glenn WMF
As many of you may know, MultiContent Revisions are coming soon (October?)
to a wiki near you. This means that we need changes to the XML dumps
schema; these changes will likely NOT be backwards compatible.

Initial discussion will take place here:
https://phabricator.wikimedia.org/T199121

For background on MultiContent Revisions and their use on e.g. Commons or
WikiData, see:

https://phabricator.wikimedia.org/T200903 (Commons media medata)
https://phabricator.wikimedia.org/T194729 (Wikidata entites)
https://www.mediawiki.org/wiki/Requests_for_comment/Multi-Content_Revisions
(MCR generally)

There may be other, better tickets/pages for background; feel free to
supplement this list if you have such links.

Ariel
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


[Xmldatadumps-l] hewiki dump to be added to 'big wikis' and run with multiple processes

2018-07-19 Thread Ariel Glenn WMF
Good morning!

The pages-meta-history dumps for hewiki take 70 hours these days, the
longest of any wiki not already running with parallel jobs. I plan to add
it to the list of 'big wikis' starting August 1st, meaning that 6 jobs will
run in parallel producing the usual numbered file output; look at e.g.
frwiki dumps for an example.

Please adjust any download/processing scripts accordingly.

Thanks!

Ariel
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


[Xmldatadumps-l] change to output file numbering of big wikis

2018-05-31 Thread Ariel Glenn WMF
TL;DR:
Scripts that rely on xml files numbered 1 through 4 should be updated to
check for 1 through 6.

Explanation:

A number of wikis have stubs and page content files generated 4 parts at a
time, with the appropriate number added to the filename. I'm going to be
increasing that this month to 6.

The reason for the increase is that near the end of the run there are
usually just a few big wikis taking their time to complete. If they run
with 6 processes at once, they'll finish up a bit sooner.

If you have scripts that rely on the number 4, just increase it to 6 and
you're done.
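As a rough illustration of the change (the wiki, date and stub file name here
are only examples; page content parts also carry page-range suffixes, so take
those names from the index page or dumpstatus.json rather than guessing them):

from urllib.request import Request, urlopen
from urllib.error import HTTPError

NUM_PARTS = 6   # was 4; bump any hard-coded 4 to 6
wiki, date = "frwiki", "20180601"
base = f"https://dumps.wikimedia.org/{wiki}/{date}/"

for n in range(1, NUM_PARTS + 1):
    name = f"{wiki}-{date}-stub-meta-history{n}.xml.gz"
    try:
        urlopen(Request(base + name, method="HEAD"))
        print("found   ", name)
    except HTTPError as err:
        print("missing ", name, f"(HTTP {err.code})")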

This will go into effect for the June 1 run and all runs afterwards.

Thanks!
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


[Xmldatadumps-l] pagecounts-ez missing April files (was Re: [Wikitech-l] changes coming to large dumps)

2018-04-10 Thread Ariel Glenn WMF
If it's gone, that's coincidence. Flagging this to look into, thanks for
the report. Please follow that ticket,
https://phabricator.wikimedia.org/T184258 for more info.

On Tue, Apr 10, 2018 at 5:35 PM, Derk-Jan Hartman <
d.j.hartman+wmf...@gmail.com> wrote:

> It seems that the pagecounts-ez sets disappeared from
> dumps.wikimedia.org starting this date. Is that a coincidence ?
> Is it https://phabricator.wikimedia.org/T189283 perhaps ?
>
> DJ
>
> On Thu, Mar 29, 2018 at 2:42 PM, Ariel Glenn WMF <ar...@wikimedia.org>
> wrote:
> > Here it comes:
> >
> > For the April 1st run and all following runs, the Wikidata dumps of
> > pages-meta-current.bz2 will be produced only as separate downloadable
> > files, no recombined single file will be produced.
> >
> > No other dump jobs will be impacted.
> >
> > A reminder that each of the single downloadable pieces has the siteinfo
> > header and the mediawiki footer so they may all be processed separately
> by
> > whatever tools you use to grab data out of the combined file. If your
> > workflow supports it, they may even be processed in parallel.
> >
> > I am still looking into what the best approach is for the pages-articles
> > dumps.
> >
> > Please forward wherever you deem appropriate. For further updates, don't
> > forget to check the Phab ticket!  https://phabricator.wikimedia.
> org/T179059
> >
> > On Mon, Mar 19, 2018 at 2:00 PM, Ariel Glenn WMF <ar...@wikimedia.org>
> > wrote:
> >
> >> A reprieve!  Code's not ready and I need to do some timing tests, so the
> >> March 20th run will do the standard recombining.
> >>
> >> For updates, don't forget to check the Phab ticket!
> >> https://phabricator.wikimedia.org/T179059
> >>
> >> On Mon, Mar 5, 2018 at 1:10 PM, Ariel Glenn WMF <ar...@wikimedia.org>
> >> wrote:
> >>
> >>> Please forward wherever you think appropriate.
> >>>
> >>> For some time we have provided multiple numbered pages-articles bz2
> file
> >>> for large wikis, as well as a single file with all of the contents
> combined
> >>> into one.  This is consuming enough time for Wikidata that it is no
> longer
> >>> sustainable.  For wikis where the sizes of these files to recombine is
> "too
> >>> large", we will skip this recombine step. This means that downloader
> >>> scripts relying on this file will need to check its existence, and if
> it's
> >>> not there, fall back to downloading the multiple numbered files.
> >>>
> >>> I expect to get this done and deployed by the March 20th dumps run.
> You
> >>> can follow along here: https://phabricator.wikimedia.org/T179059
> >>>
> >>> Thanks!
> >>>
> >>> Ariel
> >>>
> >>
> >>
> > ___
> > Wikitech-l mailing list
> > wikitec...@lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


[Xmldatadumps-l] New web server for dumps/datasets, OLD ONE GOING AWAY

2018-04-04 Thread Ariel Glenn WMF
Folks,

As you'll have seen from previous email, we are now using a new beefier
webserver for your dataset downloading needs. And the old server is going
away on TUESDAY April 10th.

This means that if you are using 'dataset1001.wikimedia.org' or the IP
address itself in your scripts, you MUST change it before Tuesday, or it
will stop working.

There will be no further reminders.

Thanks!

Ariel
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


[Xmldatadumps-l] Change for abstracts dumps, primarily for wikidata

2018-04-04 Thread Ariel Glenn WMF
Those of you that rely on the abstracts dumps will have noticed that the
content for wikidata is pretty much useless.  It doesn't look like a
summary of the page because main namespace articles on wikidata aren't
paragraphs of text. And there's really no useful summary to be generated,
even if we were clever.

We have instead decided to produce abstracts output only for pages in the
main namespace that consist of text. For pages that are of type
wikidata-item, json and so on, the <abstract> tag will contain the
attribute 'not-applicable' set to the empty string. This impacts a very few
pages on other wikis; for the full list and for more information on this
change, see  https://phabricator.wikimedia.org/T178047
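As a rough sketch of how a consumer might handle the new attribute (element
and attribute names follow the description above and the usual abstracts
layout, and the file name is just an example, so double-check against a real
file):

import gzip
import xml.etree.ElementTree as ET

count_real = count_na = 0
# iterparse streams the file, so memory stays low even on the big wikis.
with gzip.open("enwiki-20180420-abstract.xml.gz", "rb") as f:
    for _, elem in ET.iterparse(f):
        if elem.tag == "abstract":
            if "not-applicable" in elem.attrib:
                count_na += 1
            else:
                count_real += 1
        elif elem.tag == "doc":
            elem.clear()   # drop finished entries to keep memory flat
print(count_real, "abstracts with content,", count_na, "marked not applicable")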

We hope this change will be merged in a week or so; it won't take effect
for wikidata until the next dumps run on April 20th, since the wikidata
abstracts are already in progress.

If you have any questions, don't hesitate to ask.

Ariel
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


Re: [Xmldatadumps-l] changes coming to large dumps

2018-03-29 Thread Ariel Glenn WMF
Here it comes:

For the April 1st run and all following runs, the Wikidata dumps of
pages-meta-current.bz2 will be produced only as separate downloadable
files, no recombined single file will be produced.

No other dump jobs will be impacted.

A reminder that each of the single downloadable pieces has the siteinfo
header and the mediawiki footer so they may all be processed separately by
whatever tools you use to grab data out of the combined file. If your
workflow supports it, they may even be processed in parallel.
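As a rough sketch of that kind of parallel processing (Python standard library
only; pass the downloaded pieces on the command line and swap the page count
for whatever per-piece work you actually do):

import bz2
import sys
from concurrent.futures import ProcessPoolExecutor

def count_pages(path):
    # Each numbered piece is a complete XML document (siteinfo header and
    # closing mediawiki tag included), so it can be handled on its own.
    n = 0
    with bz2.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if "<page>" in line:
                n += 1
    return path, n

if __name__ == "__main__":
    # Usage: python count_pieces.py wikidatawiki-20180401-pages-meta-current*.bz2
    with ProcessPoolExecutor() as pool:
        for path, n in pool.map(count_pages, sys.argv[1:]):
            print(f"{n:10d}  {path}")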

I am still looking into what the best approach is for the pages-articles
dumps.

Please forward wherever you deem appropriate. For further updates, don't
forget to check the Phab ticket!  https://phabricator.wikimedia.org/T179059

On Mon, Mar 19, 2018 at 2:00 PM, Ariel Glenn WMF <ar...@wikimedia.org>
wrote:

> A reprieve!  Code's not ready and I need to do some timing tests, so the
> March 20th run will do the standard recombining.
>
> For updates, don't forget to check the Phab ticket!
> https://phabricator.wikimedia.org/T179059
>
> On Mon, Mar 5, 2018 at 1:10 PM, Ariel Glenn WMF <ar...@wikimedia.org>
> wrote:
>
>> Please forward wherever you think appropriate.
>>
>> For some time we have provided multiple numbered pages-articles bz2 file
>> for large wikis, as well as a single file with all of the contents combined
>> into one.  This is consuming enough time for Wikidata that it is no longer
>> sustainable.  For wikis where the sizes of these files to recombine is "too
>> large", we will skip this recombine step. This means that downloader
>> scripts relying on this file will need to check its existence, and if it's
>> not there, fall back to downloading the multiple numbered files.
>>
>> I expect to get this done and deployed by the March 20th dumps run.  You
>> can follow along here: https://phabricator.wikimedia.org/T179059
>>
>> Thanks!
>>
>> Ariel
>>
>
>
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


[Xmldatadumps-l] changes coming to large dumps

2018-03-05 Thread Ariel Glenn WMF
Please forward wherever you think appropriate.

For some time we have provided multiple numbered pages-articles bz2 files
for large wikis, as well as a single file with all of the contents combined
into one.  This is consuming enough time for Wikidata that it is no longer
sustainable.  For wikis where the size of the files to recombine is "too
large", we will skip this recombine step. This means that downloader
scripts relying on this file will need to check its existence, and if it's
not there, fall back to downloading the multiple numbered files.
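A rough sketch of that fallback logic (Python standard library only; the wiki,
date and dumpstatus.json job key are examples and should be checked against a
real run):

import json
from urllib.request import Request, urlopen
from urllib.error import HTTPError

wiki, date = "wikidatawiki", "20180401"
base = f"https://dumps.wikimedia.org/{wiki}/{date}/"
combined = f"{wiki}-{date}-pages-meta-current.xml.bz2"

def url_exists(url):
    try:
        urlopen(Request(url, method="HEAD"))
        return True
    except HTTPError:
        return False

if url_exists(base + combined):
    wanted = [combined]
else:
    # No recombined file: fall back to the numbered pieces. dumpstatus.json
    # lists every file a job produced, so there is no need to guess the
    # page-range suffixes.
    with urlopen(base + "dumpstatus.json") as resp:
        status = json.load(resp)
    wanted = sorted(status["jobs"]["metacurrentdump"]["files"])
print(wanted)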

I expect to get this done and deployed by the March 20th dumps run.  You
can follow along here: https://phabricator.wikimedia.org/T179059

Thanks!

Ariel
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


Re: [Xmldatadumps-l] Missing pages in enwiki pages-articles-multistream dumps

2018-02-27 Thread Ariel Glenn WMF
It turns out that this happens for exactly 27 pages, those at the end of
each enwiki-20180220-stub-articlesXX.xml.gz file.  Tracking here:
https://phabricator.wikimedia.org/T188388

Ariel

On Tue, Feb 27, 2018 at 10:45 AM, Ryan Hitchman  wrote:

> Multiple pages are missing from the enwiki pages-articles-multistream
> dumps from 20180201 and 20180220.
>
> Page id 88444: "Phosphor" doesn't appear in the index or in the data
> stream. This also happens for TARDIS, Psalm 132, and many others
>
> Why would the dump be partial?
>
> ___
> Xmldatadumps-l mailing list
> Xmldatadumps-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
>
>
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


[Xmldatadumps-l] Delaying the second November run by 2 days

2017-11-20 Thread Ariel Glenn WMF
Because the first run of the month was delayed, we need a couple of days' delay
now for the second run to start, so that the last of the wikis (dewiki) can
finish up the first run.  I expect the second monthly run to finish on time
however, once started.

Ariel
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


Re: [Xmldatadumps-l] [Analytics] Missing categorylinks and pages in Wikipedia dumps

2017-11-07 Thread Ariel Glenn WMF
I checked the files directly, both the pages.sql.gz and the
categorylinks.sql.gz files for 20170920.  The page is listed:

$ zcat enwiki-20170920-page.sql.gz | sed -e 's/),/),\n/g;' | grep
Computational_creativity | more
(16300571,0,'Computational_creativity','',0,0,0,0.718037721126,'20170903222622','20170903222623',798803037,59318,'wikitext',NULL),
(16390036,1,'Computational_creativity','',0,0,0,0.20741249006,'20170831064438','20170831084246',786288354,107057,'wikitext',NULL),

The first entry is the page, the second is the talk page.

$ zcat enwiki-20170920-categorylinks.sql.gz  | sed -e 's/),/),\n/g;' | grep
16300571 | cat -vte
(16300571,'All_NPOV_disputes','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2013-01-27
10:43:57','','uca-default-u-kn','page'),$
(16300571,'All_articles_needing_additional_references','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2013-05-19
16:52:06','','uca-default-u-kn','page'),$
(16300571,'All_articles_with_dead_external_links','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2016-11-29
07:32:22','','uca-default-u-kn','page'),$
(16300571,'All_articles_with_unsourced_statements','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2008-11-21
10:36:21','','uca-default-u-kn','page'),$
(16300571,'Areas_of_computer_science','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2016-04-15
15:40:40','','uca-default-u-kn','page'),$
(16300571,'Articles_needing_additional_references_from_May_2013','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2013-05-19
16:52:06','','uca-default-u-kn','page'),$
(16300571,'Articles_with_French-language_external_links','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2013-06-20
04:05:59','','uca-default-u-kn','page'),$
(16300571,'Articles_with_dead_external_links_from_November_2016','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2016-11-29
07:32:22','','uca-default-u-kn','page'),$
(16300571,'Articles_with_permanently_dead_external_links','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2016-11-29
07:32:22','','uca-default-u-kn','page'),$
(16300571,'Articles_with_unsourced_statements_from_April_2015','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2016-04-15
15:40:40','','uca-default-u-kn','page'),$
(16300571,'Articles_with_unsourced_statements_from_April_2016','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2016-04-15
15:40:40','','uca-default-u-kn','page'),$
(16300571,'Articles_with_unsourced_statements_from_December_2015','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2015-12-01
14:40:27','','uca-default-u-kn','page'),$
(16300571,'Articles_with_unsourced_statements_from_January_2010','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2010-01-09
05:50:15','','uca-default-u-kn','page'),$
(16300571,'Articles_with_unsourced_statements_from_October_2016','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2016-10-10
21:27:12','','uca-default-u-kn','page'),$
(16300571,'Artificial_intelligence','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2008-03-19
03:45:58','','uca-default-u-kn','page'),$
(16300571,'Arts','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2016-04-15
15:40:40','','uca-default-u-kn','page'),$
(16300571,'CS1_maint:_Extra_text:_authors_list','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2017-06-04
08:45:09','','uca-default-u-kn','page'),$
(16300571,'Cognitive_psychology','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2016-04-15
15:40:40','','uca-default-u-kn','page'),$
(16300571,'Computational_fields_of_study','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2016-11-10
15:53:12','','uca-default-u-kn','page'),$
(16300571,'Creativity_techniques','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2016-04-15
15:40:40','','uca-default-u-kn','page'),$
(16300571,'NPOV_disputes_from_January_2013','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2013-05-19
15:48:55','','uca-default-u-kn','page'),$
(16300571,'Philosophical_movements','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2017-01-07
20:24:38','','uca-default-u-kn','page'),$
(16300571,'Webarchive_template_wayback_links','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2017-01-27
20:04:18','','uca-default-u-kn','page'),$
(16300571,'Wikipedia_articles_needing_clarification_from_November_2008','+C?EOM\'M7CA\'=^D+I/\'M7Q7MW^A^\^AM-^O^[','2009-02-13
10:49:28','','uca-default-u-kn','page'),$

That list of categorylinks entries matches your results.
Is it possible that your download of the pages.sql file is corrupted?  Do
the md5 sums check out?  Or perhaps it is an issue with the tools.

Ariel

On Wed, Nov 1, 2017 at 7:40 PM, Tilman Bayer  wrote:

> CCing the data dumps mailing list, which is the recommended venue for
> questions like this (https://meta.wikimedia.org/wi
> ki/Data_dumps#Where_to_go_for_help ).
>
> On Wed, Nov 1, 2017 at 8:44 AM, Shubhanshu Mishra <
> shubhanshumis...@gmail.com> wrote:
>
>> Also, important categories like Computer Architechture, Human based
>> computation, Programming language theory, Software Engineering, and Theory
>> of Computation, are missing from the subcategories of Areas of Computer
>> Science.
>>
>>
>> *Regards,*
>> *Shubhanshu Mishra*
>> 

Re: [Xmldatadumps-l] Important news about the November dumps run!

2017-11-06 Thread Ariel Glenn WMF
Rsync of xml/sql dumps to the web server is now running on a rolling basis
via a script, so you should see updates regularly rather than "every
$random hours".  There's more to be done on that front, see
https://phabricator.wikimedia.org/T179857 for what's next.

Ariel
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


[Xmldatadumps-l] IMPORTANT: Changes to abstracts and siteinfo-namespaces jobs

2017-11-06 Thread Ariel Glenn WMF
These jobs are currently written uncompressed.  Starting with the next run,
I plan to write these as gzip compressed files. This means that we'll save
a lot of space for the larger abstracts dumps. Additionally, only status and
html files will be uncompressed, which is convenient for maintenance
reasons.

If anyone has a strong objection to this, please raise it now.  There's a
ticket open for it:   https://phabricator.wikimedia.org/T178046

Thanks!

Ariel
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


Re: [Xmldatadumps-l] Important news about the November dumps run!

2017-11-03 Thread Ariel Glenn WMF
The first set of dumps is running there and looks like it's working ok.
I've done a manual rsync of files produced up to this point, so those are
now available on the web server.

As before, you can follow work on this at
https://phabricator.wikimedia.org/T178893

Note that it is possible that some index.html files may contain links to
files which did not get picked up on the rsync.  They'll be there sometime
tomorrow after the next rsync.

Ariel

On Mon, Oct 30, 2017 at 5:39 PM, Ariel Glenn WMF <ar...@wikimedia.org>
wrote:

> As was previously announced on the xmldatadumps-l list, the sql/xml dumps
> generated twice a month will be written to an internal server, starting
> with the November run.  This is in part to reduce load on the web/rsync/nfs
> server which has been doing this work also until now.  We want separation
> of roles for some other reasons too.
>
> Because I want to get this right, and there are a lot of moving parts, and
> I don't want to rsync all the prefetch data over to these boxes again next
> month after cancelling the move:
>
> 
> If needed, the November full run will be delayed for a few days.
> If the November full run takes too long, the partial run, usually starting
> on the 20th of the month, will not take place.
> *
>
> Additionally, as described in an earlier email on the xmldatadumps-l list:
>
> *
> files will show up on the web server/rsync server with a substantial
> delay.  Initially this may be a day or more.  This includes index.html and
> other status files.
> *
>
> You can keep track of developments here: https://phabricator.wikimedia.
> org/T178893
>
> If you know folks not on the lists in the recipients field for this email,
> please forward it to them and suggest that they subscribe to this list.
>
> Thanks,
>
> Ariel
>
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


[Xmldatadumps-l] Important news about the November dumps run!

2017-10-30 Thread Ariel Glenn WMF
As was previously announced on the xmldatadumps-l list, the sql/xml dumps
generated twice a month will be written to an internal server, starting
with the November run.  This is in part to reduce load on the web/rsync/nfs
server which has been doing this work also until now.  We want separation
of roles for some other reasons too.

Because I want to get this right, and there are a lot of moving parts, and
I don't want to rsync all the prefetch data over to these boxes again next
month after cancelling the move:


If needed, the November full run will be delayed for a few days.
If the November full run takes too long, the partial run, usually starting
on the 20th of the month, will not take place.
*

Additionally, as described in an earlier email on the xmldatadumps-l list:

*
files will show up on the web server/rsync server with a substantial
delay.  Initially this may be a day or more.  This includes index.html and
other status files.
*

You can keep track of developments here:
https://phabricator.wikimedia.org/T178893

If you know folks not on the lists in the recipients field for this email,
please forward it to them and suggest that they subscribe to this list.

Thanks,

Ariel
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


[Xmldatadumps-l] IMPORTANT: Impending move of xml/sql dump generation to another server

2017-10-24 Thread Ariel Glenn WMF
This issue will be tracked here. https://phabricator.wikimedia.org/T178893

As it says on the ticket, I hope to get this done in time for the Nov 1 run.
Here is what it means for folks who download the dumps:
* First off, the host where the dumps are generated will no longer be the
host that serves them to the web (or that serves them internally via NFS).
This means that you won't see automatic minute-to-minute updates of how the
dumps are doing.  I'll be rsyncing over files to the web server, probably
with a several hour delay.
* Second, it's possible that the index.html or other status files that you
check will point to things that aren't rsynced over yet.  If so, try again
in a few hours and the files should have arrived.
* Datasets that arrive via weekly or daily cron jobs, such as the wikidata
dumps or the adds/changes dumps, will not be affected at this stage.  The
plan is to move them later.
* Mirrors and web service will continue to remain where they are.

I may have forgotten some things; if so I'll update as they occur to me.
Questions and/or comments welcome.

Ariel
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


Re: [Xmldatadumps-l] Official .torrent site for dumps files!?

2017-09-18 Thread Ariel Glenn WMF
The Wikimedia Foundation does not have an official site for dumps
torrents.  It would be nice to add them to
https://meta.wikimedia.org/wiki/Data_dump_torrents however.

Ariel

On Mon, Sep 18, 2017 at 10:16 AM, Federico Leva (Nemo) 
wrote:

> Felipe Ewald, 18/09/2017 04:31:
>
>> Is this the official Wikimedia Foundation site for .torrent of the dumps
>> files?
>>
>
> It's not official, but it seems to work ok.
>
> Nemo
>
> ___
> Xmldatadumps-l mailing list
> Xmldatadumps-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
>
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


[Xmldatadumps-l] abstract dumps problem for languages with variants

2017-09-04 Thread Ariel Glenn WMF
Dumps watchers may have noticed that several zh wiki project dumps failed
the abstract dumps step today.  This is probably fixed, tracking here:
https://phabricator.wikimedia.org/T174906

I'll be sure it's fixed when a few more wikis have run without problems.

Ariel
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


Re: [Xmldatadumps-l] Dumps issues this month

2017-07-05 Thread Ariel Glenn WMF
Dumps are running again, though the root cause of the nfs incident is still
undetermined.

Ariel

On Wed, Jul 5, 2017 at 5:08 PM, Ariel Glenn WMF <ar...@wikimedia.org> wrote:

> Our dumps server is having nfs issues; we're debugging it; debugging is
> slow and tedious.  You can follow along here should you wish all the gory
> details: https://phabricator.wikimedia.org/T169680
>
> As soon as service is back to normal I'll send an update here to the list.
>
> Ariel
>
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


[Xmldatadumps-l] Dumps issues this month

2017-07-05 Thread Ariel Glenn WMF
Our dumps server is having nfs issues; we're debugging it; debugging is
slow and tedious.  You can follow along here should you wish all the gory
details: https://phabricator.wikimedia.org/T169680

As soon as service is back to normal I'll send an update here to the list.

Ariel
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


[Xmldatadumps-l] xmlfileutils (mwxml2sql etc) moved to their own repo

2017-04-25 Thread Ariel Glenn WMF
A heads up to anyone who uses these, builds packages for them, etc: after a
bit of tlc they have been moved to their own repo in the 'master' branch:
clone from gerrit:

operations/dumps/import-tools.git

or browse at

https://phabricator.wikimedia.org/diffusion/ODIM/

Patches to gerrit, bug reports to phab as before, except that I'll be a bit
more attentive to them now that these scripts are being treated as
semi-real tools instead of toys.

Ariel
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


[Xmldatadumps-l] another month, another deploy -> another bug

2017-04-03 Thread Ariel Glenn WMF
I needed to clean up a bunch of tech debt before redoing the page content
dump 'divvy up into small pieces and rerun if necessary' mechanism.  I
cleaned up a bit too much and broke stub and article recombine dumps in the
process.

The fix has been deployed, I shot all the dump processes, marked the
relevant jobs (big and huge wikis only) as failed, and tossed the bad files.

The dumps should resume from the failed steps in about an hour.

Follow along at https://phabricator.wikimedia.org/T160507 for all the gory
details.

Ariel
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


[Xmldatadumps-l] this month's news in dump runs

2017-03-20 Thread Ariel Glenn WMF
Those of you following along will notice that dewiki and wikidatawiki have
more files than usual for the page content dumps (pages-meta-history).
We'll have more of this going forward; if I get the work done in time,
starting April we'll split up these jobs ahead of time into small files
that can be rerun right then when they fail, rather than waiting for
MediaWiki to split up the output based on run time and wait for a set of MW
jobs to complete before retrying failures.  This will mean more resiliency
when dbs are pulled out of the pool for various reasons (schema changes,
upgrades, etc).

Later on during the current run, I hope we will see dumps of magic words
and namespaces, provided as json files.  Let me put it this way: the code
is tested and deployed, now we shall see.

At this very moment, status of a given dump can be retrieved via a file in
the current run directory: <wiki>/20170320/dumpstatus.json. These files
are updated frequently during the run.  You can also get the status of all
current runs at https://dumps.wikimedia.org/index.json  Thanks to Hydriz
for the idea on how a status api could be implemented cheaply.  This will
probably need some refinement, but feel free to play.  More information at
https://phabricator.wikimedia.org/T147177
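As a quick illustration (the wiki name and date are just examples from this
run, and the field names come from inspecting the files rather than from any
formal spec, so treat them accordingly):

import json
from urllib.request import urlopen

with urlopen("https://dumps.wikimedia.org/enwiki/20170320/dumpstatus.json") as resp:
    status = json.load(resp)

# Print the status of each job in this run ("done", "in-progress", ...).
for job, info in status["jobs"].items():
    print(f"{job:35s} {info.get('status', '?')}")

index.json, mentioned above, can be fetched the same way for a view across all
wikis.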

The last of the UI updates went live, thanks to Ladsgroup for all of those
fixups.  It's nice to enter the new century at last :-)

And finally, we moved all the default config info out into a yaml file
(Thanks to Adam Wight for the first version of that changeset).  There were
a couple hiccups with that, which resulted in my starting the en wikipedia
run manually for this run, though via the standard script.

Happy trails,

Ariel
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


Re: [Xmldatadumps-l] New data dump torrents for enwiki and ptwiki

2017-03-16 Thread Ariel Glenn WMF
That's great news, thanks for taking the initiative!

Ariel

On Thu, Mar 16, 2017 at 5:57 AM, Felipe Ewald 
wrote:

> Hello everyone!
>
>
>
> For those who like torrent and download dumps files, good news!
>
>
>
> I add the torrent for “enwiki-20170301-pages-meta-current.xml.bz2” and
> “enwiki-20170301-pages-articles.xml.bz2”, also a bonus with all files for
> day 20170220: “enwiki-20170220-all-files”.
>
>
>
>
>
> Check on: https://meta.wikimedia.org/wiki/Data_dump_torrents
>
>
>
>
>
> Not forget to seed, please.
>
>
>
>
>
>
>
> Thank you for your attention,
>
>
>
> Felipe L. Ewald
>
> ___
> Xmldatadumps-l mailing list
> Xmldatadumps-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
>
>
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


[Xmldatadumps-l] Another dumps html update

2017-03-01 Thread Ariel Glenn WMF
Again thanks to Ladsgroup, this is a change to the per-dump index.html
page, and you can see sample screenshots here:
https://phabricator.wikimedia.org/T155697

Please weigh in on the ticket.  I'd like to get any issues resolved and
have this in play by the time the next dump run starts on March 20; this
run is already in process so we're too late for that.

Ariel
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


[Xmldatadumps-l] More dumps html changes

2017-02-06 Thread Ariel Glenn WMF
Hello everybody,

More changes to various html pages have been staged for review. Thanks
again to Amir (Ladsgroup) for those!  Have a look here:
https://gerrit.wikimedia.org/r/#/c/335684/ and comment here:
https://phabricator.wikimedia.org/T155697

Thanks!

Ariel
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


Re: [Xmldatadumps-l] revised index.html for dumps?

2017-01-31 Thread Ariel Glenn WMF
Nemo, thanks for your comments on the ticket.

Last call. If no objections or new changes, this will be merged sometime
Thursday Feb 2nd.

On Mon, Jan 30, 2017 at 10:07 AM, Federico Leva (Nemo) 
wrote:

> Aww, but the monobook background is so *cute*. :(
> A server kitten just died.
>
> Nemo
>
> ___
> Xmldatadumps-l mailing list
> Xmldatadumps-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
>
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


[Xmldatadumps-l] revised index.html for dumps?

2017-01-29 Thread Ariel Glenn WMF
Hey folks,

A kind person submitted a patch to make the index.html page, and
potentially others as well, nicer.  Have a look:
https://phabricator.wikimedia.org/T155697  and please comment there if you
have suggestions.  Feel free to forward this to anyone else who might be
interested.  Thanks!

Ariel
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


[Xmldatadumps-l] new XML/sql dumps mirror

2016-12-19 Thread Ariel Glenn WMF
I'm happy to announce that the Academic Computer Club of Umeå University in
Sweden is now offering for download the last 5 XML/sql dumps, as well as a
mirror of 'other' datasets.  Check the current mirror list [1] for more
information, or go directly to download:

http://ftp.acc.umu.se/mirror/wikimedia.org/dumps/
http://ftp.acc.umu.se/mirror/wikimedia.org/other/

Rsync is also available.

Happy downloading!

Ariel

[1]
https://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps#Current_mirrors
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


[Xmldatadumps-l] changing order of dump steps in status and checksum files

2016-12-08 Thread Ariel Glenn WMF
Before I do this, I want to know if anyone here relies on the specific
order of the contents of the md5 or sha1 sum files for the dumps, or on the
order of the entries in the dumpruninfo file.

The reason I want to fiddle with the order is to have all the table dumps
together, rather than scattered around in these files.  And the reason for
that is convenience; I'm about to update the code that adds these jobs as
steps to be run, and it's more readable/maintainable to add them all in one
group.

Would anyone here be impacted?  Please let me know; I'd like to roll
this out for the second monthly run.
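
For what it's worth, consumers that key on filenames rather than line position
won't notice a reordering.  A minimal sketch, assuming the standard
md5sum/sha1sum "<hash>  <filename>" line format; the example filenames are
hypothetical:

--- 8< ---
# Minimal sketch: parse a dumps md5/sha1 checksum file into a dict keyed by
# filename, so the order of the entries in the file does not matter.

def load_checksums(path):
    sums = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.strip().split(maxsplit=1)
            if len(parts) != 2:
                continue
            digest, filename = parts
            # md5sum/sha1sum mark binary mode with a leading '*' on the name
            sums[filename.lstrip("*")] = digest
    return sums

# Example usage (file and entry names are hypothetical):
# sums = load_checksums("enwiki-20170320-md5sums.txt")
# print(sums.get("enwiki-20170320-pages-articles.xml.bz2"))
--- >8 ---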

Thanks,

Ariel
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


Re: [Xmldatadumps-l] 9 am UTC maintenance for dataset1001 (dumps.wikimedia.org)

2016-11-14 Thread Ariel Glenn WMF
That should be Tuesday, Nov 15. It's been a long week.

A.

On Mon, Nov 14, 2016 at 2:27 PM, Ariel Glenn WMF <ar...@wikimedia.org>
wrote:

> On Tuesday Nov 13, at 9 am UTC, the web server for the dumps and other
> datasets will
> be unavailable due to maintenance.  This should take no longer than 10
> minutes.  Thanks for your understanding.
>
>
> Ariel
>
>
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


[Xmldatadumps-l] 8 am UTC Oct 29, maintenance for dataset1001 (dumps.wikimedia.org)

2016-10-28 Thread Ariel Glenn WMF
On Saturday Oct 29, at 8 am UTC, the web server for the dumps and other
datasets will be unavailable due to maintenance.  This should take no
longer than 10 minutes.  Thanks for your understanding.

Ariel
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


[Xmldatadumps-l] Suggestions wanted: api for monitoring dump runs

2016-10-04 Thread Ariel Glenn WMF
The next Wikimedia Developers Summit will be in January 2017.  I plan to
hold an unconference session on development of an API for monitoring/stats
for dumps of all sorts.  Let's get the discussion going now; what do you
want to see?  Note that this is for the rewrite, so you need not be
restricted by how the dumps run now.

Add your thoughts, questions, comments to
https://phabricator.wikimedia.org/T147177 and feel free to forward this
where you think appropriate.

Thanks!

Ariel
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


Re: [Xmldatadumps-l] New mirror of 'other' datasets

2016-09-27 Thread Ariel Glenn WMF
Thanks, that's great.

Ariel

On Tue, Sep 27, 2016 at 1:13 PM, Federico Leva (Nemo) <nemow...@gmail.com>
wrote:

> Ok.
>
> Ariel Glenn WMF, 27/09/2016 11:47:
>
>> http://dumps.wikimedia.your.org/other/mediacounts/daily/2016/   There
>> are mediacounts here, is the download speed acceptable?
>>
>
> Oh yes, that's around 50 MiB/s. I did not see this directory linked from
> their main page so I thought they had removed it; I'll add the link from
> https://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps
>
> Nemo
>
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l


Re: [Xmldatadumps-l] New mirror of 'other' datasets

2016-09-27 Thread Ariel Glenn WMF
I got nothing back from my email so I assume that means it's not happening.

http://dumps.wikimedia.your.org/other/mediacounts/daily/2016/   There are
mediacounts here, is the download speed acceptable?

Ariel

On Tue, Sep 27, 2016 at 12:34 PM, Federico Leva (Nemo) <nemow...@gmail.com>
wrote:

> Federico Leva (Nemo), 17/06/2016 14:59:
>
>> Ariel Glenn WMF, 17/06/2016 13:21:
>>
>>> For folks from specific institutions that suddenly no longer have
>>> access, I can forward institution names along and hope that helps.
>>>
>>
>> It would be nice to whitelist the wmflabs.org servers, which would
>> benefit from a faster server to download this stuff from.
>>
>
> Did this prove impossible? I need mediacounts data on a Labs server now,
> and it would take days to download from dumps.wikimedia.org.
>
>
> Nemo
>
> ___
> Xmldatadumps-l mailing list
> Xmldatadumps-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
>
___
Xmldatadumps-l mailing list
Xmldatadumps-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l

