Re: [Wikitech-l] Wikipedia dumps

2016-01-11 Thread Ariel Glenn WMF
That would be me; I need to push some changes through for this month but I was either travelling or dev summit/allstaff. I'm pretty jetlagged but I'll likely be doing that tonight, given I woke up at 5 pm :-D A. On Mon, Jan 11, 2016 at 4:20 PM, Bernardo Sulzbach < mafagafogiga...@gmail.com>

Re: [Wikitech-l] New mirror of 'other' datasets

2016-06-17 Thread Ariel Glenn WMF
mited, but is continually growing", as email from our contact at that mirror says. For folks from specific institutions that suddenly no longer have access, I can forward institution names along and hope that helps. Ariel On Wed, May 4, 2016 at 3:33 PM, Ariel Glenn WMF <ar...@wikimedia.org>

Re: [Wikitech-l] Dumps.wm.o access will be https only

2016-04-08 Thread Ariel Glenn WMF
This is now live, if a few days later than expected. Ariel On Fri, Apr 1, 2016 at 6:11 PM, Ariel Glenn WMF <ar...@wikimedia.org> wrote: > This is part of a longstanding general plan to move to https for our > services. You can track (most of) those items h

Re: [Wikitech-l] dataset1001 (dumps.wikimedia.org) maintenance window March 2 1-4pm UTC

2016-03-02 Thread Ariel Glenn WMF
Glenn WMF <ar...@wikimedia.org> wrote: > Extending this downtime window because we ran into unexpected issues with > PXE boot. > > On Tue, Mar 1, 2016 at 3:53 PM, Ariel Glenn WMF <ar...@wikimedia.org> > wrote: > >> Dataset1001, the host which serves dumps

Re: [Wikitech-l] New maintenance window Mar 4 1 - 4 pm UTC (was Re: dataset1001 (dumps.wikimedia.org) maintenance window March 2 1-4pm UTC)

2016-03-04 Thread Ariel Glenn WMF
This upgrade has concluded successfully and all services are again operational. Ariel On Thu, Mar 3, 2016 at 8:15 PM, Ariel Glenn WMF <ar...@wikimedia.org> wrote: > Fallback is: cable up the old 1GB nic (Chris has done this and set up the > port), PXE install on that, move to 1

Re: [Wikitech-l] dataset1001 (dumps.wikimedia.org) maintenance window March 2 1-4pm UTC

2016-03-02 Thread Ariel Glenn WMF
Extending this downtime window because we ran into unexpected issues with PXE boot. On Tue, Mar 1, 2016 at 3:53 PM, Ariel Glenn WMF <ar...@wikimedia.org> wrote: > Dataset1001, the host which serves dumps and other datasets to the public, > as well as providing access to various datas

[Wikitech-l] New maintenance window Mar 4 1 - 4 pm UTC (was Re: dataset1001 (dumps.wikimedia.org) maintenance window March 2 1-4pm UTC)

2016-03-03 Thread Ariel Glenn WMF
2, 2016 at 8:47 PM, Ariel Glenn WMF <ar...@wikimedia.org> wrote: > PXE boot from non-embedded nic failed spectacularly despite our best > efforts. This means we'll have to schedule another window once we have > something new to try. I apologize for the extra inconvenience. All serv

[Wikitech-l] dataset1001 (dumps.wikimedia.org) maintenance window March 2 1-4pm UTC

2016-03-01 Thread Ariel Glenn WMF
Dataset1001, the host which serves dumps and other datasets to the public, as well as providing access to various datasets directly on stats100x, will be unavailable tomorrow for an upgrade to jessie. While I don't expect to need nearly 3 hours for the upgrade, better safe than sorry. In the

Re: [Wikitech-l] Dumps.wm.o access will be https only

2016-04-01 Thread Ariel Glenn WMF
<benap...@gmail.com> wrote: > Can you give us some justification for this change? It's not like when > downloading dumps you would actually leak some sensitive data... > > On Fri, Apr 1, 2016 at 1:03 PM, Ariel Glenn WMF <ar...@wikimedia.org> > wrote: > > We pl

Re: [Wikitech-l] Feelings

2016-04-03 Thread Ariel Glenn WMF
Don't laugh, but I actually looked for the like button after reading this post (too much time on Twitter). I would like to see more of these initiatives, whatever form they might take. We have something that made a difference, let's build on that. Ariel On Sun, Apr 3, 2016 at 7:02 PM, Risker

[Wikitech-l] Dumps.wm.o access will be https only

2016-04-01 Thread Ariel Glenn WMF
We plan to make this change on April 4 (this coming Monday), redirecting plain http access to https. A reminder that our dumps can also be found on our mirror sites, for those who may have restricted https access. Ariel Glenn ___ Wikitech-l mailing
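
For scripts that still carry plain http links to the dumps, a minimal sketch of upgrading the scheme on the client side rather than relying on the redirect. The helper name `force_https` and the example path are hypothetical, for illustration only.

```python
from urllib.parse import urlsplit, urlunsplit


def force_https(url):
    """Return `url` with an http scheme upgraded to https; other URLs pass through."""
    parts = urlsplit(url)
    if parts.scheme == "http":
        # SplitResult is a namedtuple, so _replace returns an updated copy.
        parts = parts._replace(scheme="https")
    return urlunsplit(parts)


# Hypothetical path, used only to show the rewrite.
print(force_https("http://dumps.wikimedia.org/some/path"))
```

Upgrading the URL before the request saves one round trip per download and avoids depending on the client following redirects.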

[Wikitech-l] New mirror of 'other' datasets

2016-05-04 Thread Ariel Glenn WMF
I'm happy to announce a new mirror for datasets other than the XML dumps. This mirror comes to us courtesy of the Center for Research Computing, University of Notre Dame, and covers everything "other" [1] which includes such goodies as Wikidata entity dumps, pageview counts, titles of all files on

Re: [Wikitech-l] Dump frequency

2016-08-03 Thread Ariel Glenn WMF
Hi Binaris, We actually have better hardware than 4 years ago [0]. However, we have more projects with more content than 4 years ago. Wikidata did not exist in 2011; today it has almost 1/2 the revisions of the English language Wikipedia. The English language Wikipedia itself has increased 51%

Re: [Wikitech-l] Gerrit screen size

2016-09-26 Thread Ariel Glenn WMF
(off topic) Paladox, for some reason google seriously disliked your last 2 emails, just so you know. (Big red warning banner, etc.) Ariel On Mon, Sep 26, 2016 at 6:01 PM, Bináris wrote: > 2016-09-26 16:54 GMT+02:00 Paladox : > > > What does

Re: [Wikitech-l] Setting up a new Tomcat servlet in production?

2016-10-18 Thread Ariel Glenn WMF
On Mon, Oct 17, 2016 at 11:02 PM, Chad wrote: > On Mon, Oct 17, 2016 at 5:14 AM Adam Wight wrote: > > > The challenges are first that it's based on a Tomcat backend > > < > > https://github.com/Wikimedia-TW/han3_ji7_tsoo1_kian3_WM/ >

Re: [Wikitech-l] 9 am UTC maintenance for dataset1001 (dumps.wikimedia.org)

2016-11-14 Thread Ariel Glenn WMF
That should be Tuesday, Nov 15. It's been a long week. A. On Mon, Nov 14, 2016 at 2:27 PM, Ariel Glenn WMF <ar...@wikimedia.org> wrote: > On Tuesday Nov 13, at 9 am UTC, the web server for the dumps and other > datasets will > be unavailable due to maintenance. This should take

[Wikitech-l] 8 am UTC Oct 29, maintenance for dataset1001 (dumps.wikimedia.org)

2016-10-28 Thread Ariel Glenn WMF
On Saturday Oct 29, at 8 am UTC, the web server for the dumps and other datasets will be unavailable due to maintenance. This should take no longer than 10 minutes. Thanks for your understanding. Ariel

[Wikitech-l] new XML/sql dumps mirror

2016-12-19 Thread Ariel Glenn WMF
I'm happy to announce that the Academic Computer Club of Umeå University in Sweden is now offering for download the last 5 XML/sql dumps, as well as a mirror of 'other' datasets. Check the current mirror list [1] for more information, or go directly to download:

Re: [Wikitech-l] [Potential Spoof] Question about wikidata dump bz2 file

2017-04-06 Thread Ariel Glenn WMF
Hi Trung, For larger wikis, there will be a collection of partial files such as these, where the pXXXpXXX indicate the first and last page ids in the file. But for pages-articles, there will also be a combined file generated, so you'll be able to download that directly. It's listed on the
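
As an illustration of the pXXXpXXX convention described above, here is a minimal sketch that pulls the first and last page ids out of a partial file name. The sample file names are made up for the example and the exact naming on the download server may differ, so the pattern should be treated as an assumption.

```python
import re

# Assumed shape: "...-p<first>p<last>.<extension>" at the end of the name.
PAGE_RANGE = re.compile(r"-p(\d+)p(\d+)(?:\.|$)")


def page_range(filename):
    """Return (first_page_id, last_page_id), or None if the name has no range."""
    match = PAGE_RANGE.search(filename)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Hypothetical file names, for illustration only.
print(page_range("examplewiki-20170401-pages-articles1.xml-p10p30302.bz2"))
print(page_range("examplewiki-20170401-pages-articles.xml.bz2"))
```

A combined file carries no page range in its name, so the helper returns None for it; callers can use that to tell the combined file apart from the partial ones.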

Re: [Wikitech-l] Important news about the November dumps run!

2017-11-03 Thread Ariel Glenn WMF
that some index.html files may contain links to files which did not get picked up on the rsync. They'll be there sometime tomorrow after the next rsync. Ariel On Mon, Oct 30, 2017 at 5:39 PM, Ariel Glenn WMF <ar...@wikimedia.org> wrote: > As was previously announced on the xmldatadumps-l list

Re: [Wikitech-l] Important news about the November dumps run!

2017-11-06 Thread Ariel Glenn WMF
Rsync of xml/sql dumps to the web server is now running on a rolling basis via a script, so you should see updates regularly rather than "every $random hours". There's more to be done on that front, see https://phabricator.wikimedia.org/T179857 for what's next. Ariel

Re: [Wikitech-l] Important news about the November dumps run!

2017-11-07 Thread Ariel Glenn WMF
like page-articles are still missing > from the 20171103 dump directory, when usually it only takes a day... > > Nico > > On Mon, Nov 6, 2017 at 8:01 PM, Ariel Glenn WMF <ar...@wikimedia.org> > wrote: > > > Rsync of xml/sql dumps to the web server is now running on a

[Wikitech-l] Important news about the November dumps run!

2017-10-30 Thread Ariel Glenn WMF
As was previously announced on the xmldatadumps-l list, the sql/xml dumps generated twice a month will be written to an internal server, starting with the November run. This is in part to reduce load on the web/rsync/nfs server which has been doing this work also until now. We want separation of

[Wikitech-l] change to output file numbering of big wikis

2018-05-31 Thread Ariel Glenn WMF
TL;DR: Scripts that rely on xml files numbered 1 through 4 should be updated to check for 1 through 6. Explanation: A number of wikis have stubs and page content files generated 4 parts at a time, with the appropriate number added to the filename. I'm going to be increasing that this month to 6.
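
One way to avoid hardcoding the number of parts at all is to discover them from a directory listing. The sketch below assumes file names of the form "<wiki>-<date>-<job><N>.xml..." and uses made-up names; both the helper `numbered_parts` and the naming pattern are assumptions for illustration.

```python
import re


def numbered_parts(filenames, job):
    """Return the numbered part files for `job`, ordered by part number."""
    # Requires at least one digit after the job name, so the unnumbered
    # (combined) file is not picked up.
    pattern = re.compile(rf"-{re.escape(job)}(\d+)\.xml")
    parts = []
    for name in filenames:
        match = pattern.search(name)
        if match:
            parts.append((int(match.group(1)), name))
    return [name for _, name in sorted(parts)]


# Hypothetical listing, deliberately out of order.
listing = [
    f"examplewiki-20180601-stubs-meta-history{n}.xml.gz" for n in (3, 1, 6, 2, 5, 4)
]
listing.append("examplewiki-20180601-stubs-meta-history.xml.gz")

for name in numbered_parts(listing, "stubs-meta-history"):
    print(name)
```

A script written this way keeps working whether a wiki produces 4 parts, 6 parts, or some other number in a later run.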

[Wikitech-l] MultiContent Revisions and changes to the XML dumps

2018-08-02 Thread Ariel Glenn WMF
As many of you may know, MultiContent Revisions are coming soon (October?) to a wiki near you. This means that we need changes to the XML dumps schema; these changes will likely NOT be backwards compatible. Initial discussion will take place here: https://phabricator.wikimedia.org/T199121 For

Re: [Wikitech-l] huwiki, arwiki to be treated as 'big wikis' and run parallel jobs

2018-08-20 Thread Ariel Glenn WMF
? > Anyway, I am proud of being part of this. :-) > > 2018-08-20 12:26 GMT+02:00 Ariel Glenn WMF : > > > Starting September 1, huwiki and arwiki, which both take several days to > > complete the revision history content dumps, will be moved to the 'big > > wikis' list, mea

[Wikitech-l] huwiki, arwiki to be treated as 'big wikis' and run parallel jobs

2018-08-20 Thread Ariel Glenn WMF
Starting September 1, huwiki and arwiki, which both take several days to complete the revision history content dumps, will be moved to the 'big wikis' list, meaning that they will run jobs in parallel as do frwiki, ptwiki and others now, for a speedup. Please update your scripts accordingly.

[Wikitech-l] hewiki dump to be added to 'big wikis' and run with multiple processes

2018-07-19 Thread Ariel Glenn WMF
Good morning! The pages-meta-history dumps for hewiki take 70 hours these days, the longest of any wiki not already running with parallel jobs. I plan to add it to the list of 'big wikis' starting August 1st, meaning that 6 jobs will run in parallel producing the usual numbered file output; look

[Wikitech-l] terbium EOL, mw maintenance server MOVED, use mwmaint1001 for all

2018-07-04 Thread Ariel Glenn WMF
Hello folks, Terbium, our former faithful MediaWiki maintenance server, will be up for decommissioning on Monday, July 9th. It is no longer used for anything in production as of a few moments ago. The sole exception to that is cron jobs that were already running and have not yet completed. Please

[Wikitech-l] pagecounts-ez missing April files (was Re: changes coming to large dumps)

2018-04-10 Thread Ariel Glenn WMF
s-ez sets disappeared from > dumps.wikimedia.org starting this date. Is that a coincidence ? > Is it https://phabricator.wikimedia.org/T189283 perhaps ? > > DJ > > On Thu, Mar 29, 2018 at 2:42 PM, Ariel Glenn WMF <ar...@wikimedia.org> > wrote: > > Here it co

[Wikitech-l] New web server for dumps/datasets, OLD ONE GOING AWAY

2018-04-04 Thread Ariel Glenn WMF
Folks, As you'll have seen from previous email, we are now using a new beefier webserver for your dataset downloading needs. And the old server is going away on TUESDAY April 10th. This means that if you are using 'dataset1001.wikimedia.org' or the IP address itself in your scripts, you MUST

[Wikitech-l] Change for abstracts dumps, primarily for wikidata

2018-04-04 Thread Ariel Glenn WMF
Those of you that rely on the abstracts dumps will have noticed that the content for wikidata is pretty much useless. It doesn't look like a summary of the page because main namespace articles on wikidata aren't paragraphs of text. And there's really no useful summary to be generated, even if we

Re: [Wikitech-l] changes coming to large dumps

2018-03-29 Thread Ariel Glenn WMF
dumps. Please forward wherever you deem appropriate. For further updates, don't forget to check the Phab ticket! https://phabricator.wikimedia.org/T179059 On Mon, Mar 19, 2018 at 2:00 PM, Ariel Glenn WMF <ar...@wikimedia.org> wrote: > A reprieve! Code's not ready and I need to do so

Re: [Wikitech-l] changes coming to large dumps

2018-03-19 Thread Ariel Glenn WMF
A reprieve! Code's not ready and I need to do some timing tests, so the March 20th run will do the standard recombining. For updates, don't forget to check the Phab ticket! https://phabricator.wikimedia.org/T179059 On Mon, Mar 5, 2018 at 1:10 PM, Ariel Glenn WMF <ar...@wikimedia.org>

[Wikitech-l] changes coming to large dumps

2018-03-05 Thread Ariel Glenn WMF
Please forward wherever you think appropriate. For some time we have provided multiple numbered pages-articles bz2 files for large wikis, as well as a single file with all of the contents combined into one. This is consuming enough time for Wikidata that it is no longer sustainable. For wikis

Re: [Wikitech-l] changes coming to large dumps

2018-03-05 Thread Ariel Glenn WMF
We'll probably start at 20GB, which means that Wikidata will be the only wiki affected for now. Ariel On Mon, Mar 5, 2018 at 1:40 PM, Bináris <wikipo...@gmail.com> wrote: > Could you please translate "too large" to megabytes? > > 2018-03-05 12:10 GMT+01:00 Ariel Glenn

Re: [Wikitech-l] [Engineering] Gerrit now automatically adds reviewers

2019-01-18 Thread Ariel Glenn WMF
In the meantime, I would encourage those who have not looked at the Git Reviewer Bot page in a while, to do so and to add any updates. Ariel On Fri, Jan 18, 2019 at 4:12 PM Tyler Cipriani wrote: > Hi all, > > Gerrit no longer automatically adds reviewers[0]. Unfortunately, this > plugin

[Wikitech-l] Possible change in schedule of generation of wikidata entity dumps

2019-03-14 Thread Ariel Glenn WMF
If you use these dumps regularly, please read and weigh in here: https://phabricator.wikimedia.org/T216160 Thanks in advance, Ariel Glenn Wikimedia Foundation ar...@wikimedia.org ___ Wikitech-l mailing list Wikitech-l@lists.wikimedia.org

[Wikitech-l] Wikidata now officially has more total edits than English language Wikipedia

2019-03-20 Thread Ariel Glenn WMF
Wikidata surpassed the English language Wikipedia in the number of revisions in the database, about 45 minutes ago today. I was tipped off by a tweet [1] a few days ago and have been watching via a script that displays the largest revision id and its timestamp. Here's the point where Wikidata

[Wikitech-l] New dumps mirror: The Free Mirror Project

2019-02-06 Thread Ariel Glenn WMF
I am happy to announce a new mirror site, located in Canada, which is hosting the last two good dumps of all projects. Please welcome and put to good use https://dumps.wikimedia.freemirror.org/ ! I want to thank Adam for volunteering bandwidth and space and for getting everything set up. More

[Wikitech-l] question about wikidata entity dumps usage (please forward to interested parties)

2019-02-16 Thread Ariel Glenn WMF
Hey folks, We've had a request to reschedule the way the various wikidata entity dumps are run. Right now they go once a week on set days of the week; we've been asked about pegging them to specific days of the month, rather as the xml/sql dumps are run. See

[Wikitech-l] BREAKING CHANGE: schema update, xml dumps

2019-11-27 Thread Ariel Glenn WMF
We plan to move to the new schema for xml dumps for the February 1, 2020 run. Update your scripts and apps accordingly! The new schema contains an entry for each 'slot' of content. This means that, for example, the commonswiki dump will contain MediaInfo information as well as the usual wikitext.

[Wikitech-l] New dumps available: MachineVision extension tables

2020-04-21 Thread Ariel Glenn WMF
Good morning! New weekly dumps are available [1], containing the content of the tables used by the MachineVision extension [2]. For information about these tables, please see [3]. If you decide to use these tables, as with any other dumps, I would be interested to know how you use them; feel

[Wikitech-l] No second dump run this month

2020-03-19 Thread Ariel Glenn WMF
As mentioned earlier on the xmldatadumps-l, the dumps are running very slow this month, since the vslow db hosts they use are also serving live traffic during a tables migration. Even manual runs of partial jobs would not help the situation any, so there will be NO SECOND DUMP RUN THIS MONTH. The

Re: [Wikitech-l] Making breaking changes without deprecation?

2020-08-28 Thread Ariel Glenn WMF
I'd like to see third party users, even those not on the mailing list, get advance notice in one release (say in the release notes) so that when the next release shows up with the deprecated code removed, they have had time to patch up any internal extensions and code they may have. I don't want

[Wikitech-l] New XML/SQL dumps mirror

2021-07-12 Thread Ariel Glenn WMF
Thanks to BringYour, based in California, for volunteering to host the last 5 good xml/sql dumps! To check out the full list of mirrors, see either https://dumps.wikimedia.org/mirrors.html or https://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps#Dumps Interested in hosting dumps

[Wikitech-l] Wikimedia Enterprise HTML dumps available for public download

2021-10-19 Thread Ariel Glenn WMF
I am pleased to announce that Wikimedia Enterprise's HTML dumps [1] for October 17-18th are available for public download; see https://dumps.wikimedia.org/other/enterprise_html/ for more information. We expect to make updated versions of these files available around the 1st/2nd of the month and

[Wikitech-l] Re: [Wiki-research-l] Wikimedia Enterprise HTML dumps available for public download

2022-01-01 Thread Ariel Glenn WMF
for a single Wikipedia page? The JSON structure looks very > useful by itself (e.g., not in bulk). > > > Mitar > > > On Tue, Oct 19, 2021 at 4:57 PM Ariel Glenn WMF > wrote: > > > > I am pleased to announce that Wikimedia Enterprise's HTML dumps [1] for > >

[Wikitech-l] Wiki content and other dumps new ownership, feedback requested on new version!

2023-09-27 Thread Ariel Glenn WMF
Hello folks! For some years now, I've been the main or only point of contact for the Wiki project sql/xml dumps semimonthly, as well as for a number of miscellaneous weekly datasets. This work is now passing to Data Platform Engineering (DPE), and your new points of contact, starting right away,