[Xmldatadumps-l] Re: Is there a problem with the wikidata-dump?

2024-01-10 Thread Ariel Glenn WMF
I would hazard a guess that your bz2 unzip app does not handle multistream files in an appropriate way, Wurgl. The multistream files consist of several bzip2-compressed files concatenated together; see https://meta.wikimedia.org/wiki/Data_dumps/Dump_format#Multistream_dumps for details. Try
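A multistream file, as described above, is several bzip2 streams concatenated together, so a tool that stops at the end of the first stream returns only part of the data. The following is a minimal Python sketch of reading every stream; the function name `decompress_multistream` is made up for illustration and is not part of any dumps tooling.

```python
import bz2


def decompress_multistream(data: bytes) -> bytes:
    """Decompress every bzip2 stream in `data`, not just the first one."""
    pieces = []
    while data:
        # A BZ2Decompressor object handles exactly one stream.
        decomp = bz2.BZ2Decompressor()
        pieces.append(decomp.decompress(data))
        # Whatever follows the end of that stream is left in unused_data.
        data = decomp.unused_data
    return b"".join(pieces)


if __name__ == "__main__":
    # Two independently compressed chunks glued together, like a multistream dump.
    multi = bz2.compress(b"first ") + bz2.compress(b"second")
    print(decompress_multistream(multi))
```

In recent Python versions (3.3 and later) `bz2.decompress()` and `bz2.open()` already read all streams; the explicit loop is only there to show what "handling multistream" means.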

[Xmldatadumps-l] Wiki content and other dumps new ownership, feedback requested on new version!

2023-09-27 Thread Ariel Glenn WMF
Hello folks! For some years now, I've been the main or only point of contact for the Wiki project sql/xml dumps semimonthly, as well as for a number of miscellaneous weekly datasets. This work is now passing to Data Platform Engineering (DPE), and your new points of contact, starting right away,

[Xmldatadumps-l] Re: Inconsistency of Wikipedia dump exports with content licenses

2023-08-04 Thread Ariel Glenn WMF
Hi Dušan, The legal team handles all manner of legal issues. You'll need to be patient. I can't speed up their process for you, nor give you more information than I already have. Also, please don't send duplicate messages to the list. That would be considered spam. Thanks! Ariel Glenn dumps

[Xmldatadumps-l] Re: Inconsistency of Wikipedia dump exports with content licenses

2023-07-26 Thread Ariel Glenn WMF
I was away from work for the past two days and so unable to reply. My apologies! Indeed, Dušan, if you want to sort out exactly what to do with/about the licenses, the legal team is the way to go. Reach them at legal (at) wikimedia.org. Hope you get it sorted! Ariel On Wed, Jul 26, 2023 at

[Xmldatadumps-l] Re: The license for some files in the dump exports

2023-07-21 Thread Ariel Glenn WMF
I'm not sure which text you are relying on. But the legal information for the licensing of content in the dumps can be found here: https://dumps.wikimedia.org/legal.html I hope that helps. Ariel Glenn dumps co-maintainer ar...@wikimedia.org On Fri, Jul 21, 2023 at 12:10 PM Dušan Kreheľ wrote:

[Xmldatadumps-l] Re: "Experimental" Status of Enterprise HTML Dumps

2023-05-10 Thread Ariel Glenn WMF
Hello Evan, The Enterprise HTML dumps should be publicly available around the 22nd and the 3rd of each month, though there can be delays. We don't expect that to change any time soon. As to their content or the namespaces, I can't answer to that; someone from Wikimedia Enterprise will have to

[Xmldatadumps-l] Interruption in service of production of wikidata entity dumps and others

2023-04-18 Thread Ariel Glenn WMF
Due to switch maintenance, this week's dumps of wikidata entities, other weekly datasets, and today's adds-changes dumps may not be produced. All datasets should be back on a normal production schedule the following week. Apologies for the inconvenience! Ariel Glenn ar...@wikimedia.org

[Xmldatadumps-l] Too (Two) many FAQs

2023-03-31 Thread Ariel Glenn WMF
My apologies for the duplicate FAQ this month. We recently deployed a new server and the old one, now retired, still had the FAQ generation job running on it. We should be back to the usual number of FAQ emails (one) next month. Thanks! Ariel Glenn dumps co-maintainer

[Xmldatadumps-l] Re: Querying for recently created pages

2023-01-18 Thread Ariel Glenn WMF
Eric, We don't produce dumps of the revision table in sql format because some of those revisions may be hidden from public view, and even metadata about them should not be released. We do however publish so-called Adds/Changes dumps once a day for each wiki, providing stubs and content files in

[Xmldatadumps-l] Re: XML Data Dumps 20220701

2022-07-03 Thread Ariel Glenn WMF
There is an issue with the availability of these dumps for retrieval for publishing to the public. This is being tracked in https://phabricator.wikimedia.org/T311441 and updates will be posted there. Ariel Glenn ar...@wikimedia.org On Sun, Jul 3, 2022 at 9:37 PM wrote: > The folder >

[Xmldatadumps-l] Re: Access imageinfo data in a dump

2022-02-05 Thread Ariel Glenn WMF
> > [1] https://www.mediawiki.org/wiki/Manual:Text_table > > > Mitar > > On Fri, Feb 4, 2022 at 6:54 AM Ariel Glenn WMF > wrote: > > > > This looks great! If you like, you might add the link and a brief > description to this page: > https://meta.wikimed

[Xmldatadumps-l] Re: Access imageinfo data in a dump

2022-02-03 Thread Ariel Glenn WMF
extract data, like dumps in other formats. > > [1] https://gitlab.com/tozd/go/mediawiki > > > Mitar > > On Thu, Feb 3, 2022 at 9:13 AM Mitar wrote: > > > > Hi! > > > > I see. Thanks. > > > > > > Mitar > > > > On Thu, Feb 3

[Xmldatadumps-l] Re: Access imageinfo data in a dump

2022-02-02 Thread Ariel Glenn WMF
The media/file descriptions contained in the dump are the wikitext of the revisions of pages with the File: prefix, plus the metadata about those pages and revisions (user that made the edit, timestamp of edit, edit comment, and so on). Width and height of the image, the media type, the sha1 of

[Xmldatadumps-l] Re: Directory listing too small/filename too long

2021-11-28 Thread Ariel Glenn WMF
You can get the filename listing a couple of other ways: Check the directory listing for the specific date, i.e. https://dumps.wikimedia.org/wikidatawiki/20211120/ Get the status file from that or the "latest" directory, i.e. https://dumps.wikimedia.org/wikidatawiki/20211120/dumpstatus.json Get
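As a rough illustration of the status-file route, the sketch below lists file names from the text of a `dumpstatus.json` file. The layout it expects (a top-level `jobs` object whose entries carry a `status` string and a `files` object keyed by file name) is an assumption made for this example, as is the function name; check a real status file before relying on it.

```python
import json


def list_done_files(status_text):
    """Return sorted file names from jobs marked "done".

    Assumes (not confirmed here) a layout like:
    {"jobs": {"<job>": {"status": "done", "files": {"<name>": {...}}}}}
    """
    status = json.loads(status_text)
    names = []
    for job in status.get("jobs", {}).values():
        if job.get("status") == "done":
            # Iterating a dict yields its keys, i.e. the file names.
            names.extend(job.get("files", {}))
    return sorted(names)


if __name__ == "__main__":
    # Hypothetical sample data, shaped according to the assumption above.
    sample = json.dumps({
        "jobs": {
            "jobA": {"status": "done", "files": {"b.xml.gz": {}, "a.xml.gz": {}}},
            "jobB": {"status": "waiting"},
        }
    })
    print(list_done_files(sample))
```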

[Xmldatadumps-l] Wikimedia Enterprise HTML dumps available for public download

2021-10-19 Thread Ariel Glenn WMF
I am pleased to announce that Wikimedia Enterprise's HTML dumps [1] for October 17-18th are available for public download; see https://dumps.wikimedia.org/other/enterprise_html/ for more information. We expect to make updated versions of these files available around the 1st/2nd of the month and

[Xmldatadumps-l] Re: still only a partial dump for 20210801 for a lot of wikis

2021-08-09 Thread Ariel Glenn WMF
Not the script itself but we have a permissions problem on some status files that I'm having trouble stamping out. See https://phabricator.wikimedia.org/T288192 for updates as they come in. Ariel On Mon, Aug 9, 2021 at 10:18 AM griffin tucker < lmxxlmwikwik3...@griffintucker.id.au> wrote: >

Re: [Xmldatadumps-l] enwiki dump ?

2021-02-03 Thread Ariel Glenn WMF
The enwiki run got a later start this month as we switched hosts around for migration to a more recent version of the OS. But it's currently moving along nicely. Thanks for the report though! Ariel On Wed, Feb 3, 2021 at 1:27 PM Nicolas Vervelle wrote: > Hi, > > Is there a problem with enwiki

Re: [Xmldatadumps-l] November 2nd dump run delayed half a day, wikidata full page content not ready yet

2020-11-22 Thread Ariel Glenn WMF
The files are now all available, as has been noted on the task. The bz2 files and 7z files are just fine and can be processed as usual. Ariel On Fri, Nov 20, 2020 at 2:37 PM Ariel Glenn WMF wrote: > Hello folks, > > I hope everyone is in good health and staying safe in these troub

[Xmldatadumps-l] November 2nd dump run delayed half a day, wikidata full page content not ready yet

2020-11-20 Thread Ariel Glenn WMF
Hello folks, I hope everyone is in good health and staying safe in these troubled times. Speaking of trouble, in the course of making an improvement to the xml/sql dumps, I introduced a bug, and so now I am doing the cleanup from that. The short version: There will be a 7z file missing from

Re: [Xmldatadumps-l] Mirror status

2020-08-03 Thread Ariel Glenn WMF
The page is in our puppet repo; see https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/dumps/files/web/html/public_mirrors.html You can submit a patch to gerrit yourself if you like; see https://www.mediawiki.org/wiki/Gerrit/Tutorial for setting up and working

Re: [Xmldatadumps-l] Mirror status

2020-08-03 Thread Ariel Glenn WMF
Thanks for this report! Would you be willing to open a task in phabricator about the bytemark mirror, and tag it with dumps-generation so that it gets into the right queue? https://phabricator.wikimedia.org/maniphest/task/edit/form/1/ The C3SL mirror has technical issues with DNS that are

Re: [Xmldatadumps-l] List of dumped wikis, discrepancy with Wikidata

2020-08-02 Thread Ariel Glenn WMF
labswiki and labtestwiki are copies of Wikitech, which is maintained and dumped in a special fashion. You can find those dumps here: https://dumps.wikimedia.org/other/wikitech/dumps/ uk.wikiversity.org does not exist. ecwikimedia, as you rightly note, is private. The remaining wikis have all been

Re: [Xmldatadumps-l] Has anyone had success with data deduplication?

2020-07-29 Thread Ariel Glenn WMF
The basic problem is that the page content dumps are ordered by revision number within each page, which makes good sense for dumps users but means that the addition of a single revision to a page will shift all of the remaining data, resulting in different compressed blocks. That's going to be

Re: [Xmldatadumps-l] Request for Wikipedia dump of February 2017

2020-07-19 Thread Ariel Glenn WMF
Dear Rajakumaran Archulan, Older dumps can often be found on the Internet Archive. The February 2017 full dumps for the English language Wikipedia are here: https://archive.org/details/enwiki-20170201 A reminder for all new and older members of this list: comprehensive documentation for dumps

[Xmldatadumps-l] sample html dumps available FOR QA ONLY

2020-07-10 Thread Ariel Glenn WMF
NOTE: I did not produce the HTML dumps, they are being managed by another team. If you are interested in weighing in on the output format, what's missing, etc, here is the phabricator task: https://phabricator.wikimedia.org/T257480 Your comments and suggestions would be welcome! Ariel

[Xmldatadumps-l] Commons structured data dumps

2020-07-09 Thread Ariel Glenn WMF
RDF dumps of structured data from commons are now available at https://dumps.wikimedia.org/other/wikibase/commonswiki/ They are run on a weekly basis. See https://lists.wikimedia.org/pipermail/wikidata/2020-July/014125.html for more information. Enjoy!

Re: [Xmldatadumps-l] Dumps stalled

2020-06-10 Thread Ariel Glenn WMF
They aren't, but the rsync copying files to the web server is behind. See https://phabricator.wikimedia.org/T254856 for that. They'll catch up in the next day or so. Ariel On Wed, Jun 10, 2020 at 7:36 PM Bruce Myers via Xmldatadumps-l < xmldatadumps-l@lists.wikimedia.org> wrote: > The

Re: [Xmldatadumps-l] Wikitaxi omits certain information from pages and inconvenient strings are found throughout the pages

2020-05-04 Thread Ariel Glenn WMF
The WikiTaxi software is maintained by a group unaffiliated with the Wikimedia Foundation, if it is maintained at all. I see that the wiki ( www.wikitaxi.org) has not been updated in years. There is a contact email listed there which you might try: m...@wikitaxi.org The parts you highlight are

Re: [Xmldatadumps-l] Wikipedia xml dumps 2009-2013

2020-04-30 Thread Ariel Glenn WMF
You might check our archives as well as archive.org: see https://meta.wikimedia.org/wiki/Data_dumps/Finding_older_xml_dumps if you have not already done so. Otherwise perhaps someone on the list will have a copy available. Ariel On Thu, Apr 30, 2020 at 1:15 PM Katja Schmahl wrote: > Hi all, >

[Xmldatadumps-l] BREAKING CHANGE: feature removal (private table dumps)

2020-04-06 Thread Ariel Glenn WMF
For the past few years we have not dumped private tables at all; they would not be accessible to the public in any case, and they do not suffice as a backup in case of catastrophic failure. We are therefore removing the feature to dump private tables along with public tables in a dump run. Anyone

Re: [Xmldatadumps-l] Duplicate entry in last Spanish dump

2020-04-06 Thread Ariel Glenn WMF
, 2020 at 9:16 AM Ariel Glenn WMF wrote: > Thanks for this report! > > This bug must have been introduced in my recent updates to file listing > methods. > > The multistream file is produced and available for download by changing > the file name in the download url. > > I'l

Re: [Xmldatadumps-l] Duplicate entry in last Spanish dump

2020-04-05 Thread Ariel Glenn WMF
Thanks for this report! This bug must have been introduced in my recent updates to file listing methods. The multistream file is produced and available for download by changing the file name in the download url. I'll have a look Monday to see about fixing up the index.html output generation.

[Xmldatadumps-l] No second dump run this month

2020-03-19 Thread Ariel Glenn WMF
As mentioned earlier on the xmldatadumps-l, the dumps are running very slowly this month, since the vslow db hosts they use are also serving live traffic during a tables migration. Even manual runs of partial jobs would not help the situation any, so there will be NO SECOND DUMP RUN THIS MONTH. The

[Xmldatadumps-l] Monthly dumps for March; possible no second run

2020-03-16 Thread Ariel Glenn WMF
Hello everybody, Those of you who follow the dumps closely may have noticed that they are running slower than usual this month. That is because the db servers on which they run are also serving live traffic, so that a wikidata-related migration can complete before the end of the month. I will try

[Xmldatadumps-l] kowiki joins the ranks of the 'big wikis'

2020-02-28 Thread Ariel Glenn WMF
Happy almost March, everyone! Kowiki dumps jobs now take long enough to run for certain steps that the wiki has been moved to the 'big wikis' list. This means that 6 parallel jobs will produce output for stubs and page content dumps, similarly to frwiki, dewiki and so on. See [1] for more. This

Re: [Xmldatadumps-l] Was format change plan postponed?

2020-02-11 Thread Ariel Glenn WMF
Good morning! We are a bit delayed due to some code changes that need to go in. We hope to make the switch in March; I'll send an update with the target date when all patches have been deployed. My apologies for not updating the list. You can follow the progress of this changeover on

Re: [Xmldatadumps-l] Ordering of revisions

2020-01-17 Thread Ariel Glenn WMF
The queries to get page and revision metadata are ordered by page id, and within each page, by revision id. This is guaranteed. The behavior of rev_parent_id is not guaranteed however, in certain edge cases. See e.g. https://phabricator.wikimedia.org/T193211 Anyone who uses this field care to
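A consumer that depends on this ordering can verify it cheaply while streaming. The sketch below assumes the caller has already extracted `(page_id, rev_id)` pairs from a dump; the function name and the input shape are illustrative only.

```python
def is_dump_ordered(pairs):
    """Return True if (page_id, rev_id) pairs are sorted by page id,
    and by revision id within each page, with no repeats."""
    previous = None
    for pair in pairs:
        # Tuples compare element by element: page id first, then revision id.
        if previous is not None and tuple(pair) <= previous:
            return False
        previous = tuple(pair)
    return True


if __name__ == "__main__":
    print(is_dump_ordered([(1, 10), (1, 12), (2, 5)]))
    print(is_dump_ordered([(1, 12), (1, 10)]))
```

Note that revision ids only need to increase within a page; a later page may well contain a lower revision id than an earlier one.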

Re: [Xmldatadumps-l] Your help requested (testing decompression)

2020-01-08 Thread Ariel Glenn WMF
minutes, cutting out many hours from the dump runs overall. Please check your tools using the files linked in the previous emails and make sure that they work. Thanks! Ariel On Thu, Dec 5, 2019 at 12:01 AM Ariel Glenn WMF wrote: > if you use one of the utilities listed here: >

[Xmldatadumps-l] Your help requested (testing decompression)

2019-12-04 Thread Ariel Glenn WMF
if you use one of the utilities listed here: https://phabricator.wikimedia.org/T239866 I'd like you to download one of the 'multistream' dumps and see if your utility decompresses it fully or not (you can compare the md5sum of the decompressed content to the regular file's decompressed content and
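For anyone without a command-line pipeline handy, the checksum comparison described above can be sketched in Python as follows. The helper name and the file paths in the usage comment are placeholders, not real dump file names.

```python
import bz2
import hashlib


def md5_of_decompressed(path, chunk_size=1 << 20):
    """Return the md5 hex digest of the decompressed content of a bz2 file."""
    digest = hashlib.md5()
    # bz2.open reads all concatenated streams, so multistream files work too.
    with bz2.open(path, "rb") as stream:
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Usage (placeholder paths):
# same = md5_of_decompressed("regular.xml.bz2") == md5_of_decompressed("multistream.xml.bz2")
```

Reading in chunks keeps memory use flat, which matters for files of this size. To test a third-party utility, hash its decompressed output instead and compare against this value.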

[Xmldatadumps-l] Comments requested: produce empty abstract files for Wikidata?

2019-10-21 Thread Ariel Glenn WMF
Currently, the abstracts dump for Wikidata consists of 62 million entries, all of which contain instead of any real abstract. Instead of this, I am considering producing abstract files that would contain only the mediawiki header and footer and the usual siteinfo contents. What do people think

Re: [Xmldatadumps-l] Incremental dumps

2019-09-11 Thread Ariel Glenn WMF
All dumps were interrupted for a period of several days due to a MediaWiki change. See https://phabricator.wikimedia.org/T232268 for details. Ariel On Wed, Sep 11, 2019 at 4:43 PM colin johnston wrote: > Any news on retention time for backups as well :) > > Col > > > > On 11 Sep 2019, at

[Xmldatadumps-l] New dumps mirror in the United States (Colorado)

2019-08-09 Thread Ariel Glenn WMF
Greetings dumps users, remixers and sharers! I'm happy to announce that we have another mirror of the last 5 XML dumps, located in the United States, for your downloading pleasure. All the information you need is here:

Re: [Xmldatadumps-l] Wikidate Entitites 24/06/2019 dump missing

2019-07-01 Thread Ariel Glenn WMF
This dump was incomplete due to a problem with MediaWiki code. It was removed so that scripts such as yours would not process a file with half the entities in it. This week's run should provide a new and complete file. For more information, you can follow along on the Phabricator task:

[Xmldatadumps-l] svwiki to move to the ranks of the 'big wikis'

2019-06-20 Thread Ariel Glenn WMF
Hello dumps users and re-users! As you know, some wikis are large enough that we produce dumps of some files in 6 pieces in parallel. We'll begin doing this for svwiki starting on July 1. You can follow along on https://phabricator.wikimedia.org/T226200 if interested. If you have not previously

Re: [Xmldatadumps-l] Wikipedia Dumps required

2019-05-29 Thread Ariel Glenn WMF
You can find some older dumps at https://dumps.wikimedia.org/archive/ (see https://meta.wikimedia.org/wiki/Data_dumps/Finding_older_xml_dumps for more about finding older dumps in general). I didn't see the March 2006 files but these https://dumps.wikimedia.org/archive/enwiki/20060816/ are later

Re: [Xmldatadumps-l] Approx. number of pages in enwiki-latest-pages-articles.xml

2019-05-27 Thread Ariel Glenn WMF
The number should be around 19414056, the same number of pages in the stubs-articles file. On Tue, May 28, 2019 at 8:35 AM Sigbert Klinke wrote: > Hi, > > I would be interested to know how many pages in > enwiki-latest-pages-articles.xml . My own count gives 19,4 Mio. pages. > Can this be, at

[Xmldatadumps-l] some dump failures today

2019-03-06 Thread Ariel Glenn WMF
Those of you watching the xml/sql dumps run this month may have noticed some dump failures today. These were caused by depooling of the database server for maintenance while the dump hosts were querying it. The jobs in question should be rerun automatically over the next few days, and I'll be

Re: [Xmldatadumps-l] wikimedia.bytemark.co.uk mirror is not updated from 2017-11

2019-03-04 Thread Ariel Glenn WMF
> dumps/mirrored updated to reflect compliance of removal. > > Colin > > > On 4 Mar 2019, at 09:24, Ariel Glenn WMF wrote: > > All of the information in these mirrored dump files is publicly available > to any user; no private information is provided. For GDPR-specific issu

Re: [Xmldatadumps-l] wikimedia.bytemark.co.uk mirror is not updated from 2017-11

2019-03-04 Thread Ariel Glenn WMF
red information ? > How is retention guidelines followed with this mirrored information ? > > Colin > > > > On 4 Mar 2019, at 08:52, Ariel Glenn WMF wrote: > > Excuse this very late reply. The index.html page is out of date but the > mirrored directories for various cur

Re: [Xmldatadumps-l] wikimedia.bytemark.co.uk mirror is not updated from 2017-11

2019-03-04 Thread Ariel Glenn WMF
Excuse this very late reply. The index.html page is out of date but the mirrored directories for various current runs are there. I'm checking with a colleague about making sure the index page gets copied over. Ariel On Wed, Feb 6, 2019 at 1:14 PM Mariusz "Nikow" Klinikowski <

[Xmldatadumps-l] question about wikidata entity dumps usage (please forward to interested parties)

2019-02-16 Thread Ariel Glenn WMF
Hey folks, We've had a request to reschedule the way the various wikidata entity dumps are run. Right now they go once a week on set days of the week; we've been asked about pegging them to specific days of the month, rather as the xml/sql dumps are run. See

[Xmldatadumps-l] New dumps mirror: The Free Mirror Project

2019-02-06 Thread Ariel Glenn WMF
I am happy to announce a new mirror site, located in Canada, which is hosting the last two good dumps of all projects. Please welcome and put to good use https://dumps.wikimedia.freemirror.org/ ! I want to thank Adam for volunteering bandwidth and space and for getting everything set up. More

[Xmldatadumps-l] incorrect links for pages articles multistream files for big wikis

2019-01-23 Thread Ariel Glenn WMF
Folks may have noticed already that the links presented for download of pages-articles-multistream dumps are incorrect on the web pages for big wikis. The files exist for download but the wrong links were created. I'll be looking into that and fixing it up over the next days, but in the meantime

[Xmldatadumps-l] Change in multistream dump file production

2019-01-19 Thread Ariel Glenn WMF
TL;DR: Don't panic, the single articles multistream bz2 file for big wikis will be produced shortly after the new smaller files. Long version: For big wikis which already have split up article files, we now produce one multistream file per article file. These are now recombined into a single file

[Xmldatadumps-l] mwbzutils BREAKING CHANGE

2019-01-19 Thread Ariel Glenn WMF
If you use recompressxml in the mwbzutils package, as of version 0.0.9 (just deployed) it no longer writes bz2 compressed data by default to stdout; instead it relies on the extension of the output file and will write either gzipped, bz2 or plain text output, accordingly. This means that if it is
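The idea of picking the output compression from the file extension can be illustrated with a few lines of Python. This is not recompressxml's code (that tool is part of mwbzutils); `open_output` is a hypothetical helper that only mirrors the behavior described for named output files.

```python
import bz2
import gzip


def open_output(path):
    """Open `path` for binary writing, compressing according to its extension."""
    if path.endswith(".gz"):
        return gzip.open(path, "wb")
    if path.endswith(".bz2"):
        return bz2.open(path, "wb")
    # Any other extension: write the bytes as they are.
    return open(path, "wb")


# Usage:
# with open_output("out.xml.bz2") as out:
#     out.write(b"<mediawiki>...</mediawiki>")
```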

Re: [Xmldatadumps-l] Dump for enwiki blocked ?

2018-10-22 Thread Ariel Glenn WMF
The dumps are not blocked but a change in the way stubs dumps are processed has slowed down the queries considerably. This issue is being tracked here: https://phabricator.wikimedia.org/T207628 Ariel On Mon, Oct 22, 2018 at 1:07 PM Nicolas Vervelle wrote: > Hi, > > The dump for enwiki seems

[Xmldatadumps-l] some revisions missing from Sept 13 adds-changes dump

2018-10-12 Thread Ariel Glenn WMF
If you are a user of the adds-changes (so-called "incremental") dumps, read on. All dumps use database servers in our eqiad data center. For the past month, the wiki projects have used primary database masters out of our codfw data center; on one of these days, a number of revisions did not

Re: [Xmldatadumps-l] flow dumps failures being worked on

2018-10-05 Thread Ariel Glenn WMF
These issues have been cleared up and flow dumps are being produced properly. Ariel On Thu, Sep 6, 2018 at 1:51 PM Ariel Glenn WMF wrote: > This is being tracked here: https://phabricator.wikimedia.org/T203647 > You probably won't see much in the way of updates until all the job

Re: [Xmldatadumps-l] adds-changes (so-called 'incremental') dumps failed today

2018-10-02 Thread Ariel Glenn WMF
Somehow I committed but did not deploy one of the changes, so local testing worked great and the production run of course failed. The missing code is now live (I checked) so everything should be back to normal tomorrow. Ariel On Mon, Oct 1, 2018 at 5:26 PM Ariel Glenn WMF wrote: > The fail

[Xmldatadumps-l] adds-changes (so-called 'incremental') dumps failed today

2018-10-01 Thread Ariel Glenn WMF
The failure was a side effect of a configuration change that will, ironically enough, make it easier to test the 'other' dumps, including eventually these ones, in mediawiki-vagrant; see https://phabricator.wikimedia.org/T201478 for more information about that. They should run tomorrow and

[Xmldatadumps-l] Oct 3 2018: RFC on xml dumps schema update to be discussed at TechCom

2018-10-01 Thread Ariel Glenn WMF
Hey dumps users and contributors! This Wednesday, Oct 3 at 2pm PST (21:00 UTC, 23:00 CET) in #wikimedia-office TechCom will have a discussion about the RFC for the upcoming xml schema update needed for Multi-Content Revision content. Phabricator task: https://phabricator.wikimedia.org/T199121

Re: [Xmldatadumps-l] flow dumps failures being worked on

2018-09-06 Thread Ariel Glenn WMF
This is being tracked here: https://phabricator.wikimedia.org/T203647 You probably won't see much in the way of updates until all the jobs have completed; they are in progress now. Ariel On Thu, Sep 6, 2018 at 11:02 AM, Ariel Glenn WMF wrote: > Hello dumps users! > > You may hav

[Xmldatadumps-l] flow dumps failures being worked on

2018-09-06 Thread Ariel Glenn WMF
Hello dumps users! You may have noticed that a number of wikis have had dumps failures on the flow dumps step. The cause is known (a cleanup of mediawiki core that didn't carry over to the extension) and these jobs should be fixed up today or tomorrow. Ariel

[Xmldatadumps-l] huwiki, arwiki to be treated as 'big wikis' and run parallel jobs

2018-08-20 Thread Ariel Glenn WMF
Starting September 1, huwiki and arwiki, which both take several days to complete the revision history content dumps, will be moved to the 'big wikis' list, meaning that they will run jobs in parallel as do frwiki, ptwiki and others now, for a speedup. Please update your scripts accordingly.

[Xmldatadumps-l] missing adds-changes dumps, page titles for today

2018-08-08 Thread Ariel Glenn WMF
These jobs did not run today due to a change in how maintenance scripts handle unknown arguments. The problem has been fixed and the jobs should run regularly tomorrow.

[Xmldatadumps-l] MultiContent Revisions and changes to the XML dumps

2018-08-02 Thread Ariel Glenn WMF
As many of you may know, MultiContent Revisions are coming soon (October?) to a wiki near you. This means that we need changes to the XML dumps schema; these changes will likely NOT be backwards compatible. Initial discussion will take place here: https://phabricator.wikimedia.org/T199121 For

[Xmldatadumps-l] hewiki dump to be added to 'big wikis' and run with multiple processes

2018-07-19 Thread Ariel Glenn WMF
Good morning! The pages-meta-history dumps for hewiki take 70 hours these days, the longest of any wiki not already running with parallel jobs. I plan to add it to the list of 'big wikis' starting August 1st, meaning that 6 jobs will run in parallel producing the usual numbered file output; look

[Xmldatadumps-l] change to output file numbering of big wikis

2018-05-31 Thread Ariel Glenn WMF
TL;DR: Scripts that rely on xml files numbered 1 through 4 should be updated to check for 1 through 6. Explanation: A number of wikis have stubs and page content files generated 4 parts at a time, with the appropriate number added to the filename. I'm going to be increasing that this month to 6.
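One way to avoid hard-coding the part count at all is to discover the numbered parts from a directory listing. The sketch below assumes file names of the form `<base><N>.xml...`, for example `enwiki-20180220-stub-articles1.xml.gz`; the helper name and the sample listing are made up for illustration.

```python
import re


def numbered_parts(filenames, base):
    """Return the sorted part numbers found for `base` in `filenames`.

    Matches names like "<base>3.xml.gz"; the combined, unnumbered file
    ("<base>.xml.gz") is deliberately not matched.
    """
    pattern = re.compile(re.escape(base) + r"(\d+)\.xml")
    parts = set()
    for name in filenames:
        match = pattern.match(name)
        if match:
            parts.add(int(match.group(1)))
    return sorted(parts)


if __name__ == "__main__":
    base = "enwiki-20180220-stub-articles"
    listing = [f"{base}{n}.xml.gz" for n in range(1, 7)] + [f"{base}.xml.gz"]
    print(numbered_parts(listing, base))
```

A script written this way keeps working if the number of parts changes again later.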

[Xmldatadumps-l] pagecounts-ez missing April files (was Re: [Wikitech-l] changes coming to large dumps)

2018-04-10 Thread Ariel Glenn WMF
s-ez sets disappeared from > dumps.wikimedia.org starting this date. Is that a coincidence ? > Is it https://phabricator.wikimedia.org/T189283 perhaps ? > > DJ > > On Thu, Mar 29, 2018 at 2:42 PM, Ariel Glenn WMF <ar...@wikimedia.org> > wrote: > > Here it co

[Xmldatadumps-l] New web server for dumps/datasets, OLD ONE GOING AWAY

2018-04-04 Thread Ariel Glenn WMF
Folks, As you'll have seen from previous email, we are now using a new beefier webserver for your dataset downloading needs. And the old server is going away on TUESDAY April 10th. This means that if you are using 'dataset1001.wikimedia.org' or the IP address itself in your scripts, you MUST

[Xmldatadumps-l] Change for abstracts dumps, primarily for wikidata

2018-04-04 Thread Ariel Glenn WMF
Those of you that rely on the abstracts dumps will have noticed that the content for wikidata is pretty much useless. It doesn't look like a summary of the page because main namespace articles on wikidata aren't paragraphs of text. And there's really no useful summary to be generated, even if we

Re: [Xmldatadumps-l] changes coming to large dumps

2018-03-29 Thread Ariel Glenn WMF
dumps. Please forward wherever you deem appropriate. For further updates, don't forget to check the Phab ticket! https://phabricator.wikimedia.org/T179059 On Mon, Mar 19, 2018 at 2:00 PM, Ariel Glenn WMF <ar...@wikimedia.org> wrote: > A reprieve! Code's not ready and I need to do so

[Xmldatadumps-l] changes coming to large dumps

2018-03-05 Thread Ariel Glenn WMF
Please forward wherever you think appropriate. For some time we have provided multiple numbered pages-articles bz2 files for large wikis, as well as a single file with all of the contents combined into one. This is consuming enough time for Wikidata that it is no longer sustainable. For wikis

Re: [Xmldatadumps-l] Missing pages in enwiki pages-articles-multistream dumps

2018-02-27 Thread Ariel Glenn WMF
It turns out that this happens for exactly 27 pages, those at the end of each enwiki-20180220-stub-articlesXX.xml.gz file. Tracking here: https://phabricator.wikimedia.org/T188388 Ariel On Tue, Feb 27, 2018 at 10:45 AM, Ryan Hitchman wrote: > Multiple pages are missing

[Xmldatadumps-l] Delaying the second November run by 2 days

2017-11-20 Thread Ariel Glenn WMF
Because the first run of the month was delayed, we need a couple of days' delay now for the second run to start, so that the last of the wikis (dewiki) can finish up the first run. I expect the second monthly run to finish on time however, once started. Ariel

Re: [Xmldatadumps-l] [Analytics] Missing categorylinks and pages in Wikipedia dumps

2017-11-07 Thread Ariel Glenn WMF
I checked the files directly, both the pages.sql.gz and the categorylinks.sql.gz files for 20170920. The page is listed: $ zcat enwiki-20170920-page.sql.gz | sed -e 's/),/),\n/g;' | grep Computational_creativity | more
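The same check can be done without `zcat`, `sed` and `grep`. The Python sketch below mirrors the pipeline above: it splits each line on `),` and keeps the pieces that mention the title. Like the `sed` expression, the split is naive and will also cut inside quoted strings that happen to contain `),`. The function name is illustrative.

```python
import gzip


def find_rows(path, needle):
    """Return the row fragments of a gzipped SQL dump that contain `needle`."""
    hits = []
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as dump:
        for line in dump:
            # Rough equivalent of: sed -e 's/),/),\n/g;' | grep needle
            for row in line.split("),"):
                if needle in row:
                    hits.append(row)
    return hits


# Usage, with the file named in the message above:
# for row in find_rows("enwiki-20170920-page.sql.gz", "Computational_creativity"):
#     print(row)
```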

Re: [Xmldatadumps-l] Important news about the November dumps run!

2017-11-06 Thread Ariel Glenn WMF
Rsync of xml/sql dumps to the web server is now running on a rolling basis via a script, so you should see updates regularly rather than "every $random hours". There's more to be done on that front, see https://phabricator.wikimedia.org/T179857 for what's next. Ariel

[Xmldatadumps-l] IMPORTANT: Changes to abstracts and siteinfo-namespaces jobs

2017-11-06 Thread Ariel Glenn WMF
These jobs are currently written uncompressed. Starting with the next run, I plan to write these as gzip compressed files. This means that we'll save a lot of space for the larger abstracts dumps. Additionally, only status and html files will be uncompressed, which is convenient for maintenance

Re: [Xmldatadumps-l] Important news about the November dumps run!

2017-11-03 Thread Ariel Glenn WMF
that some index.html files may contain links to files which did not get picked up on the rsync. They'll be there sometime tomorrow after the next rsync. Ariel On Mon, Oct 30, 2017 at 5:39 PM, Ariel Glenn WMF <ar...@wikimedia.org> wrote: > As was previously announced on the xmldatadumps-l list

[Xmldatadumps-l] Important news about the November dumps run!

2017-10-30 Thread Ariel Glenn WMF
As was previously announced on the xmldatadumps-l list, the sql/xml dumps generated twice a month will be written to an internal server, starting with the November run. This is in part to reduce load on the web/rsync/nfs server which has been doing this work also until now. We want separation of

[Xmldatadumps-l] IMPORTANT: Impending move of xml/sql dump generation to another server

2017-10-24 Thread Ariel Glenn WMF
This issue will be tracked here. https://phabricator.wikimedia.org/T178893 As it says on the ticket, I hope to get this done in time for the Nov 1 run. Here is what it means for folks who download the dumps: * First off, the host where the dumps are generated will no longer be the host that

Re: [Xmldatadumps-l] Official .torrent site for dumps files!?

2017-09-18 Thread Ariel Glenn WMF
The Wikimedia Foundation does not have an official site for dumps torrents. It would be nice to add them to https://meta.wikimedia.org/wiki/Data_dump_torrents however. Ariel On Mon, Sep 18, 2017 at 10:16 AM, Federico Leva (Nemo) wrote: > Felipe Ewald, 18/09/2017 04:31: >

[Xmldatadumps-l] abstract dumps problem for languages with variants

2017-09-04 Thread Ariel Glenn WMF
Dumps watchers may have noticed that several zh wiki project dumps failed the abstract dumps step today. This is probably fixed, tracking here: https://phabricator.wikimedia.org/T174906 I'll be sure it's fixed when a few more wikis have run without problems. Ariel

Re: [Xmldatadumps-l] Dumps issues this month

2017-07-05 Thread Ariel Glenn WMF
Dumps are running again, though the root cause of the nfs incident is still undetermined. Ariel On Wed, Jul 5, 2017 at 5:08 PM, Ariel Glenn WMF <ar...@wikimedia.org> wrote: > Our dumps server is having nfs issues; we're debugging it; debugging is > slow and tedious. You can follo

[Xmldatadumps-l] Dumps issues this month

2017-07-05 Thread Ariel Glenn WMF
Our dumps server is having nfs issues; we're debugging it; debugging is slow and tedious. You can follow along here should you wish all the gory details: https://phabricator.wikimedia.org/T169680 As soon as service is back to normal I'll send an update here to the list. Ariel

[Xmldatadumps-l] xmlfileutils (mwxml2sql etc) moved to their own repo

2017-04-25 Thread Ariel Glenn WMF
A heads up to anyone who uses these, builds packages for them, etc: after a bit of tlc they have been moved to their own repo in the 'master' branch: clone from gerrit: operations/dumps/import-tools.git or browse at https://phabricator.wikimedia.org/diffusion/ODIM/ Patches to gerrit, bug

[Xmldatadumps-l] another month, another deploy -> another bug

2017-04-03 Thread Ariel Glenn WMF
I needed to clean up a bunch of tech debt before redoing the page content dump 'divvy up into small pieces and rerun if necessary' mechanism. I cleaned up a bit too much and broke stub and article recombine dumps in the process. The fix has been deployed, I shot all the dump processes, marked

[Xmldatadumps-l] this month's news in dump runs

2017-03-20 Thread Ariel Glenn WMF
Those of you following along will notice that dewiki and wikidatawiki have more files than usual for the page content dumps (pages-meta-history). We'll have more of this going forward; if I get the work done in time, starting April we'll split up these jobs ahead of time into small files that can

Re: [Xmldatadumps-l] New data dump torrents for enwiki and ptwiki

2017-03-16 Thread Ariel Glenn WMF
That's great news, thanks for taking the initiative! Ariel On Thu, Mar 16, 2017 at 5:57 AM, Felipe Ewald wrote: > Hello everyone! > > > > For those who like torrent and download dumps files, good news! > > > > I add the torrent for

[Xmldatadumps-l] Another dumps html update

2017-03-01 Thread Ariel Glenn WMF
Again thanks to Ladsgroup, this is a change to the per-dump index.html page, and you can see sample screenshots here: https://phabricator.wikimedia.org/T155697 Please weigh in on the ticket. I'd like to get any issues resolved and have this in play by the time the next dump run starts on March

[Xmldatadumps-l] More dumps html changes

2017-02-06 Thread Ariel Glenn WMF
Hello everybody, More changes to various html pages have been staged for review. Thanks again to Amir (Ladsgroup) for those! Have a look here: https://gerrit.wikimedia.org/r/#/c/335684/ and comment here: https://phabricator.wikimedia.org/T155697 Thanks! Ariel

Re: [Xmldatadumps-l] revised index.html for dumps?

2017-01-31 Thread Ariel Glenn WMF
Nemo, thanks for your comments on the ticket. Last call. If no objections or new changes, this will be merged sometime Thursday Feb 2nd. On Mon, Jan 30, 2017 at 10:07 AM, Federico Leva (Nemo) wrote: > Aww, but the monobook background is so *cute*. :( > A server kitten just

[Xmldatadumps-l] revised index.html for dumps?

2017-01-29 Thread Ariel Glenn WMF
Hey folks, A kind person submitted a patch to make the index.html page, and potentially others as well, nicer. Have a look: https://phabricator.wikimedia.org/T155697 and please comment there if you have suggestions. Feel free to forward this to anyone else who might be interested. Thanks!

[Xmldatadumps-l] new XML/sql dumps mirror

2016-12-19 Thread Ariel Glenn WMF
I'm happy to announce that the Academic Computer Club of Umeå University in Sweden is now offering for download the last 5 XML/sql dumps, as well as a mirror of 'other' datasets. Check the current mirror list [1] for more information, or go directly to download:

[Xmldatadumps-l] changing order of dump steps in status and checksum files

2016-12-08 Thread Ariel Glenn WMF
Before I do this, I want to know if anyone here relies on the specific order of the contents of the md5 or sha1 sum files for the dumps, or on the order of the entries in the dumpruninfo file. The reason I want to fiddle with the order is to have all the table dumps together, rather than
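Scripts that consume the checksum files can avoid depending on entry order by keying on the filename. A minimal sketch, assuming the md5/sha1 sum files use the usual md5sum/sha1sum layout of one "digest  filename" pair per line; the function name is hypothetical:

```python
def parse_checksums(text):
    """Return a dict mapping filename -> hex digest, so lookups do not
    depend on the order of entries in the checksum file."""
    sums = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first run of whitespace: digest, then filename.
        digest, name = line.split(None, 1)
        # A leading "*" marks binary mode in md5sum-style output.
        sums[name.lstrip("*")] = digest.lower()
    return sums
```

Looking up a file by name this way gives the same result whether the table dumps are grouped together or interleaved with other steps.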

Re: [Xmldatadumps-l] 9 am UTC maintenance for dataset1001 (dumps.wikimedia.org)

2016-11-14 Thread Ariel Glenn WMF
That should be Tuesday, Nov 15. It's been a long week. A. On Mon, Nov 14, 2016 at 2:27 PM, Ariel Glenn WMF <ar...@wikimedia.org> wrote: > On Tuesday Nov 13, at 9 am UTC, the web server for the dumps and other > datasets will > be unavailable due to maintenance. This should take

[Xmldatadumps-l] 8 am UTC Oct 29, maintenance for dataset1001 (dumps.wikimedia.org)

2016-10-28 Thread Ariel Glenn WMF
On Saturday Oct 29, at 8 am UTC, the web server for the dumps and other datasets will be unavailable due to maintenance. This should take no longer than 10 minutes. Thanks for your understanding. Ariel

[Xmldatadumps-l] Suggestions wanted: api for monitoring dump runs

2016-10-04 Thread Ariel Glenn WMF
The next Wikimedia Developers Summit will be in January 2017. I plan to hold an unconference session on development of an API for monitoring/stats for dumps of all sorts. Let's get the discussion going now; what do you want to see? Note that this is for the rewrite so you need not be restricted

Re: [Xmldatadumps-l] New mirror of 'other' datasets

2016-09-27 Thread Ariel Glenn WMF
Thanks, that's great. Ariel On Tue, Sep 27, 2016 at 1:13 PM, Federico Leva (Nemo) <nemow...@gmail.com> wrote: > Ok. > > Ariel Glenn WMF, 27/09/2016 11:47: > >> http://dumps.wikimedia.your.org/other/mediacounts/daily/2016/ There >> are mediacounts here, is

Re: [Xmldatadumps-l] New mirror of 'other' datasets

2016-09-27 Thread Ariel Glenn WMF
wrote: > Federico Leva (Nemo), 17/06/2016 14:59: > >> Ariel Glenn WMF, 17/06/2016 13:21: >> >>> For folks from specific institutions that suddenly no longer have >>> access, I can forward institution names along and hope that helps. >>> >> >> It
