Re: [Wikitech-l] Wikipedia dumps

2016-01-11 Thread Ariel Glenn WMF
That would be me; I need to push some changes through for this month but I
was either travelling or at the dev summit/allstaff.  I'm pretty jetlagged but
I'll likely be doing that tonight, given I woke up at 5 pm :-D

A.

On Mon, Jan 11, 2016 at 4:20 PM, Bernardo Sulzbach <
mafagafogiga...@gmail.com> wrote:

> On Mon, Jan 11, 2016 at 3:22 AM, Tilman Bayer 
> wrote:
> > CCing the Xmldatadumps mailing list
> > , where
> > someone has already posted
> > <
> https://lists.wikimedia.org/pipermail/xmldatadumps-l/2016-January/001214.html
> >
> > about
> > what might be the same issue.
>
> For some reason, I did not subscribe to that list. Thanks for pointing it
> out.
>
> --
> Bernardo Sulzbach
>
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] New mirror of 'other' datasets

2016-06-17 Thread Ariel Glenn WMF
Dear all,

The server hosting this service has been moved to a different network, and
as such, it is now "only accessible/routable from select (still many)
members of Internet2 (U.S. universities), ESnet (U.S. national labs), and
Geant in Europe. This restricted list of places is currently limited, but
is continually growing", as the email from our contact at that mirror puts it.
For folks from specific institutions that suddenly no longer have access, I
can forward institution names along and hope that helps.

Ariel

On Wed, May 4, 2016 at 3:33 PM, Ariel Glenn WMF <ar...@wikimedia.org> wrote:

> I'm happy to announce a new mirror for datasets other than the XML dumps.
> This mirror comes to us courtesy of the Center for Research Computing,
> University of Notre Dame, and covers everything "other" [1] which includes
> such goodies as Wikidata entity dumps, pageview counts, titles of all files
> on each wiki (daily), titles of all articles of each wiki (daily), and the
> so-called "adds-changes" dumps, among other things. You can access it at
> http://wikimedia.crc.nd.edu/other/ so please do!
>
> Ariel
>
> [1] https://dumps.wikimedia.org/other/
>
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Dumps.wm.o access will be https only

2016-04-08 Thread Ariel Glenn WMF
This is now live, if a few days later than expected.

Ariel

On Fri, Apr 1, 2016 at 6:11 PM, Ariel Glenn WMF <ar...@wikimedia.org> wrote:

> This is part of a longstanding general plan to move to https for our
> services. You can track  (most of) those items here:
> https://phabricator.wikimedia.org/project/board/162/ although the
> specific task https://phabricator.wikimedia.org/T128587 is not listed
> there.
>
> In particular you might look at a couple of the tasks under 'Big Picture',
> i.e.
> https://phabricator.wikimedia.org/T104681 HTTPS Plans (tracking/high
> level info) and
> https://phabricator.wikimedia.org/T75953 RFC: MediaWiki HTTPS policy
> (though that doesn't directly address dumps).
>
> Ariel
>
> On Fri, Apr 1, 2016 at 5:20 PM, Petr Bena <benap...@gmail.com> wrote:
>
>> Can you give us some justification for this change? It's not like when
>> downloading dumps you would actually leak some sensitive data...
>>
>> On Fri, Apr 1, 2016 at 1:03 PM, Ariel Glenn WMF <ar...@wikimedia.org>
>> wrote:
>> > We plan to make this change on April 4 (this coming Monday), redirecting
>> > plain http access to https.
>> >
>> > A reminder that our dumps can also be found on our mirror sites, for
>> those
>> > who may have restricted https access.
>> >
>> > Ariel Glenn
>> > ___
>> > Wikitech-l mailing list
>> > Wikitech-l@lists.wikimedia.org
>> > https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>>
>> ___
>> Wikitech-l mailing list
>> Wikitech-l@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
>
>
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] dataset1001 (dumps.wikimedia.org) maintenance window March 2 1-4pm UTC

2016-03-02 Thread Ariel Glenn WMF
PXE boot from the non-embedded NIC failed spectacularly despite our best
efforts.  This means we'll have to schedule another window once we have
something new to try. I apologize for the extra inconvenience.  All services
are back exactly the way they were.

Ariel

On Wed, Mar 2, 2016 at 6:01 PM, Ariel Glenn WMF <ar...@wikimedia.org> wrote:

> Extending this downtime window because we ran into unexpected issues with
> PXE boot.
>
> On Tue, Mar 1, 2016 at 3:53 PM, Ariel Glenn WMF <ar...@wikimedia.org>
> wrote:
>
>> Dataset1001, the host which serves dumps and other datasets to the
>> public, as well as providing access to various datasets directly on
>> stats100x, will be unavailable tomorrow for an upgrade to jessie.  While I
>> don't expect to need nearly 3 hours for the upgrade, better safe than
>> sorry. In the meantime all files will be accessible via
>> ms1001.wikimedia.org via the web, and all dumps and page view files from
>> our mirrors as well.
>>
>> Thanks for your understanding.
>>
>> Ariel Glenn
>>
>>
>>
>
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] New maintenance window Mar 4 1 - 4 pm UTC (was Re: dataset1001 (dumps.wikimedia.org) maintenance window March 2 1-4pm UTC)

2016-03-04 Thread Ariel Glenn WMF
This upgrade has concluded successfully and all services are again
operational.

Ariel

On Thu, Mar 3, 2016 at 8:15 PM, Ariel Glenn WMF <ar...@wikimedia.org> wrote:

> Fallback is: cable up the old 1GB nic (Chris has done this and set up the
> port), PXE install on that, move to 10gb NIC once we're back up.  Gross but
> it gets the job done.
>
> Set for tomorrow (Friday) 1 to 4 pm UTC, this time should be much
> smoother. Same caveats apply as before.
>
> Ariel
>
> On Wed, Mar 2, 2016 at 8:47 PM, Ariel Glenn WMF <ar...@wikimedia.org>
> wrote:
>
>> PXE boot from non-embedded nic failed spectacularly despite our best
>> efforts.  This means we'll have to schedule another window once we have
>> someting new to try. I apologize for the extra inconvenience.  All services
>> are back exactly the way they were.
>>
>> Ariel
>>
>> On Wed, Mar 2, 2016 at 6:01 PM, Ariel Glenn WMF <ar...@wikimedia.org>
>> wrote:
>>
>>> Extending this downtime window because we ran into unexpected issues
>>> with PXE boot.
>>>
>>> On Tue, Mar 1, 2016 at 3:53 PM, Ariel Glenn WMF <ar...@wikimedia.org>
>>> wrote:
>>>
>>>> Dataset1001, the host which serves dumps and other datasets to the
>>>> public, as well as providing access to various datasets directly on
>>>> stats100x, will be unavailable tomorrow for an upgrade to jessie.  While I
>>>> don't expect to need nearly 3 hours for the upgrade, better safe than
>>>> sorry. In the meantime all files will be accessible via
>>>> ms1001.wikimedia.org via the web, and all dumps and page view files
>>>> from our mirrors as well.
>>>>
>>>> Thanks for your understanding.
>>>>
>>>> Ariel Glenn
>>>>
>>>>
>>>>
>>>
>>
>
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] dataset1001 (dumps.wikimedia.org) maintenance window March 2 1-4pm UTC

2016-03-02 Thread Ariel Glenn WMF
Extending this downtime window because we ran into unexpected issues with
PXE boot.

On Tue, Mar 1, 2016 at 3:53 PM, Ariel Glenn WMF <ar...@wikimedia.org> wrote:

> Dataset1001, the host which serves dumps and other datasets to the public,
> as well as providing access to various datasets directly on stats100x, will
> be unavailable tomorrow for an upgrade to jessie.  While I don't expect to
> need nearly 3 hours for the upgrade, better safe than sorry. In the
> meantime all files will be accessible via ms1001.wikimedia.org via the
> web, and all dumps and page view files from our mirrors as well.
>
> Thanks for your understanding.
>
> Ariel Glenn
>
>
>
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

[Wikitech-l] New maintenance window Mar 4 1 - 4 pm UTC (was Re: dataset1001 (dumps.wikimedia.org) maintenance window March 2 1-4pm UTC)

2016-03-03 Thread Ariel Glenn WMF
Fallback is: cable up the old 1Gb NIC (Chris has done this and set up the
port), PXE install on that, move to the 10Gb NIC once we're back up.  Gross but
it gets the job done.

Set for tomorrow (Friday) 1 to 4 pm UTC, this time should be much smoother.
Same caveats apply as before.

Ariel

On Wed, Mar 2, 2016 at 8:47 PM, Ariel Glenn WMF <ar...@wikimedia.org> wrote:

> PXE boot from non-embedded nic failed spectacularly despite our best
> efforts.  This means we'll have to schedule another window once we have
> someting new to try. I apologize for the extra inconvenience.  All services
> are back exactly the way they were.
>
> Ariel
>
> On Wed, Mar 2, 2016 at 6:01 PM, Ariel Glenn WMF <ar...@wikimedia.org>
> wrote:
>
>> Extending this downtime window because we ran into unexpected issues with
>> PXE boot.
>>
>> On Tue, Mar 1, 2016 at 3:53 PM, Ariel Glenn WMF <ar...@wikimedia.org>
>> wrote:
>>
>>> Dataset1001, the host which serves dumps and other datasets to the
>>> public, as well as providing access to various datasets directly on
>>> stats100x, will be unavailable tomorrow for an upgrade to jessie.  While I
>>> don't expect to need nearly 3 hours for the upgrade, better safe than
>>> sorry. In the meantime all files will be accessible via
>>> ms1001.wikimedia.org via the web, and all dumps and page view files
>>> from our mirrors as well.
>>>
>>> Thanks for your understanding.
>>>
>>> Ariel Glenn
>>>
>>>
>>>
>>
>
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

[Wikitech-l] dataset1001 (dumps.wikimedia.org) maintenance window March 2 1-4pm UTC

2016-03-01 Thread Ariel Glenn WMF
Dataset1001, the host which serves dumps and other datasets to the public,
as well as providing access to various datasets directly on stats100x, will
be unavailable tomorrow for an upgrade to jessie.  While I don't expect to
need nearly 3 hours for the upgrade, better safe than sorry. In the
meantime all files will be accessible via ms1001.wikimedia.org via the web,
and all dumps and page view files from our mirrors as well.

Thanks for your understanding.

Ariel Glenn
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Dumps.wm.o access will be https only

2016-04-01 Thread Ariel Glenn WMF
This is part of a longstanding general plan to move to https for our
services. You can track (most of) those items here:
https://phabricator.wikimedia.org/project/board/162/ although the specific
task https://phabricator.wikimedia.org/T128587 is not listed there.

In particular you might look at a couple of the tasks under 'Big Picture',
i.e.
https://phabricator.wikimedia.org/T104681 HTTPS Plans (tracking/high level
info) and
https://phabricator.wikimedia.org/T75953 RFC: MediaWiki HTTPS policy
(though that doesn't directly address dumps).

Ariel

On Fri, Apr 1, 2016 at 5:20 PM, Petr Bena <benap...@gmail.com> wrote:

> Can you give us some justification for this change? It's not like when
> downloading dumps you would actually leak some sensitive data...
>
> On Fri, Apr 1, 2016 at 1:03 PM, Ariel Glenn WMF <ar...@wikimedia.org>
> wrote:
> > We plan to make this change on April 4 (this coming Monday), redirecting
> > plain http access to https.
> >
> > A reminder that our dumps can also be found on our mirror sites, for
> those
> > who may have restricted https access.
> >
> > Ariel Glenn
> > ___
> > Wikitech-l mailing list
> > Wikitech-l@lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Feelings

2016-04-03 Thread Ariel Glenn WMF
Don't laugh, but I actually looked for the like button after reading this
post (too much time on Twitter).  I would like to see more of these
initiatives, whatever form they might take.  We have something that made a
difference, let's build on that.

Ariel

On Sun, Apr 3, 2016 at 7:02 PM, Risker  wrote:

> I sympathize with your concern, Ori.  I suspect, however, that it shows a
> fundamental misunderstanding of why the Teahouse works when other processes
> (several of which have included cute symbols) have been less effective.
>
> And the reason is: the Teahouse is explicitly designed for having
> conversations.
>
> Teahouse "convenors" were initially selected for their demonstrated
> communication skills and willingness to remain polite when dealing with
> often frustrated people, and their ability to explain often complex
> concepts in straightforward terms.  As their ranks have evolved, they have
> sought out and taught others those skills, and there's an element of
> self-selection that discourages the more curmudgeonly amongst us from
> participating.  (There's not a lot of overlap between those who regularly
> help out at the Teahouse and those who hang out on ANI, for example.)
> We're talking about a relatively small group of people who really excel at
> this type of communication, although it is certainly a skill that others
> can develop if they have the willingness and inclination - but it really
> comes down to being able to identify the right "level" at which to talk to
> people, and then actually talking.
>
> The Teahouse works because it doesn't [obviously] use a lot of fancy
> technology, because it doesn't use a lot of templates and automated
> messaging, because it's made a lot of effort to avoid massive hyperlinking
> to complex and inscrutable policies.  It's people talking to people.  It's
> scaled remarkably well - I suspect because there are more "nice"
> Wikipedians than people realize - where other processes have failed.
> Several of those processes failed because we couldn't link up the right
> people giving the right messages to new users (MoodBar was an example of
> that - on top of the really problematic technical issues it raised), and
> others failed because they were pretty much designed to deprecate direct
> person-to-person communication (AFT-5 would be in that category).
>
> Nonetheless, I think you've raised an important point.  If we can develop
> processes that can better link up new users with people who have the
> interest and skill to communicate with those new users, we should keep
> trying those technologies. But those technologies need to incorporate the
> existing findings that the most effective way of attracting and retaining
> new editors is direct, one-to-one communication. Not templates. Not cute
> emojicons. Not canned text, and certainly not links to complicated
> policies. It's people talking to people in a helpful way that makes the
> difference.  And that's a lot harder than meets the eye.
>
> And now, having written this, I'm going to spend some time trying to figure
> out how to create a message to new users I encounter when I'm oversighting
> their personal information...without templating or linking to complex
> policies, but pointing them to the Teahouse. I'm pretty sure it's not going
> to be very easy, but I'm going to try.
>
> Thank you for saying this, Ori.
>
> Risker/Anne
>
>
>
> On 2 April 2016 at 21:37, Ori Livneh  wrote:
>
> > On Fri, Apr 1, 2016 at 10:24 PM, Legoktm 
> > wrote:
> >
> > > Hi,
> > >
> > > It's well known that Wikipedia is facing threats from other social
> > > networks and losing editors. While many of us spend time trying to make
> > > Wikipedia different, we need to be cognizant that what other social
> > > networks are doing is working. And if we can't beat them, we need to
> > > join them.
> > >
> > > I've written a patch[1] that introduces a new feature to the Thanks
> > > extension called "feelings". When hovering over a "thank" link, five
> > > different emoji icons will pop up[2], representing five different
> > > feelings: happy, love, surprise, anger, and fear. Editors can pick one
> > > of those options instead of just a plain thanks, to indicate how they
> > > really feel, which the recipient will see[3].
> > >
> >
> > Of the many initiatives to improve editor engagement and retention that
> the
> > Wikimedia Foundation has launched over the years, the only one that had a
> > demonstrable and substantial impact (AFAIK) was the Teahouse.
> >
> > The goal of the Teahouse initiative was "learning whether a social
> approach
> > to new editor support could retain more new editors there"; its stated
> > design goal was to create a space for new users which would feature "warm
> > colors, inviting pictorial and thematic elements, simple mechanisms for
> > communicating, and a warm welcome from real people."[0]
> >
> > Several studies were made of 

[Wikitech-l] Dumps.wm.o access will be https only

2016-04-01 Thread Ariel Glenn WMF
We plan to make this change on April 4 (this coming Monday), redirecting
plain http access to https.

A reminder that our dumps can also be found on our mirror sites, for those
who may have restricted https access.

Ariel Glenn
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

[Wikitech-l] New mirror of 'other' datasets

2016-05-04 Thread Ariel Glenn WMF
I'm happy to announce a new mirror for datasets other than the XML dumps.
This mirror comes to us courtesy of the Center for Research Computing,
University of Notre Dame, and covers everything "other" [1] which includes
such goodies as Wikidata entity dumps, pageview counts, titles of all files
on each wiki (daily), titles of all articles of each wiki (daily), and the
so-called "adds-changes" dumps, among other things. You can access it at
http://wikimedia.crc.nd.edu/other/ so please do!

Ariel

[1] https://dumps.wikimedia.org/other/
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Dump frequency

2016-08-03 Thread Ariel Glenn WMF
Hi Binaris,

We actually have better hardware than 4 years ago [0].  However, we have
more projects with more content than 4 years ago.  Wikidata did not exist
in 2011; today it has almost 1/2 the revisions of the English language
Wikipedia.  The English language Wikipedia itself has increased 51% in size
since early 2012. And the Hungarian language wiki has grown by over 50% as
well.

We should be running two dumps a month going forwards [1], where the second
run each month does not contain full history revisions.  This isn't as good
as 2011 but it's not as bad as once a month either.

The main work to improve the dumps situation, however, will be a complete
re-architecting of the dumps.  One big change will be to move to a format
and structure that is truly incremental.

Folks interested in these issues are welcome to subscribe to or watch the
Phabricator projects for the current dumps [2] and/or the future dumps [3].
There is also a dedicated (low-traffic) list for users of and contributors to
the xml dumps [4].

Lastly, your email reminds me that I should update the dumps information at
Meta; the documentation there has fallen a bit behind.  Thanks!
Ariel

[0] For current hardware, see
https://wikitech.wikimedia.org/wiki/Dumps/Snapshot_hosts
[1] https://phabricator.wikimedia.org/T126339
[2] https://phabricator.wikimedia.org/tag/dumps-generation/
[3] https://phabricator.wikimedia.org/tag/dumps-rewrite/
[4] https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l

On Wed, Aug 3, 2016 at 8:31 AM, Bináris  wrote:

> Hi folks,
>
> we in Hungarian Wikipedia have been watching new huwiki dumps by bot since
> 2011, so this page history:
>
> https://hu.wikipedia.org/w/index.php?title=Sablon:A_dump_d%C3%A1tuma==250=history
> clearly shows the frequency. Back in 2012 it took 8-10 days to create the
> new dump. Now it takes one month. Are we less developed or do we have less
> hardware than 4 years ago?
>
>
>
> --
> Bináris
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Gerrit screen size

2016-09-26 Thread Ariel Glenn WMF
(off topic) Paladox, for some reason google seriously disliked your last 2
emails, just so you know. (Big red warning banner, etc.)

Ariel

On Mon, Sep 26, 2016 at 6:01 PM, Bináris  wrote:

> 2016-09-26 16:54 GMT+02:00 Paladox :
>
> > What does everyone think of using this skin
> https://github.com/shellscape/
> > OctoGerrit it's more modern, doint know if it is mobile friendly and
> > large screen friendly.
> >
>
> "Gerrit is a good tool built on a solid framework. But Gerrit, with regard
> > to user experience, is bad. Really bad. Really, really bad. Functional?
> > Yes. Pretty? Good lord no."
> >
> Fine. :-)
>
> "OctoGerrit was written and tested on Gerrit v2.12. If it works with other
> > older versions, that's wonderful! But not something we're going to test.
> > OctoGerrit is not guaranteed to work on newer versions, nor the new
> > 'PolyGerrit' being developed."
> >
> Caution!
>
>
> --
> Bináris
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Setting up a new Tomcat servlet in production?

2016-10-18 Thread Ariel Glenn WMF
On Mon, Oct 17, 2016 at 11:02 PM, Chad  wrote:

> On Mon, Oct 17, 2016 at 5:14 AM Adam Wight  wrote:
>
> > The challenges are first that it's based on a Tomcat backend
> > <
> > https://github.com/Wikimedia-TW/han3_ji7_tsoo1_kian3_WM/
> blob/master/src/idsrend/services/IDSrendServlet.java
> > >,
> > which I'm not sure is precedented in our current ecosystem, and second
> that
> > the code uses Chinese variable and function names, which should
> > unfortunately be Anglicized by convention, AIUI.  Finally, there might be
> > security issues around the rendered text itself, if it were misused to
> mask
> > content.
> >
> > I'm mostly asking this list for help with the question of using Tomcat in
> > production.
> >
> >
> So we don't use Tomcat anywhere right now, so yeah that's unprecedented.
>

Just a note that Tomcat is in use by analytics as part of the
Oozie-YARN-Hadoop infrastructure.  It's bundled as part of the set of
packages, which of course is quite a different thing from setting up and
maintaining an instance for general use by services.

Ariel
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] 9 am UTC maintenance for dataset1001 (dumps.wikimedia.org)

2016-11-14 Thread Ariel Glenn WMF
That should be Tuesday, Nov 15. It's been a long week.

A.

On Mon, Nov 14, 2016 at 2:27 PM, Ariel Glenn WMF <ar...@wikimedia.org>
wrote:

> On Tuesday Nov 13, at 9 am UTC, the web server for the dumps and other
> datasets will
> be unavailable due to maintenance.  This should take no longer than 10
> minutes.  Thanks for your understanding.
>
>
> Ariel
>
>
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

[Wikitech-l] 8 am UTC Oct 29, maintenance for dataset1001 (dumps.wikimedia.org)

2016-10-28 Thread Ariel Glenn WMF
On Saturday Oct 29, at 8 am UTC, the web server for the dumps and other
datasets will be unavailable due to maintenance.  This should take no
longer than 10 minutes.  Thanks for your understanding.

Ariel
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

[Wikitech-l] new XML/sql dumps mirror

2016-12-19 Thread Ariel Glenn WMF
I'm happy to announce that the Academic Computer Club of Umeå University in
Sweden is now offering for download the last 5 XML/sql dumps, as well as a
mirror of 'other' datasets.  Check the current mirror list [1] for more
information, or go directly to download:

http://ftp.acc.umu.se/mirror/wikimedia.org/dumps/
http://ftp.acc.umu.se/mirror/wikimedia.org/other/

Rsync is also available.

Happy downloading!

Ariel

[1]
https://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps#Current_mirrors
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] [Potential Spoof] Question about wikidata dump bz2 file

2017-04-06 Thread Ariel Glenn WMF
Hi Trung,

For larger wikis, there will be a collection of partial files such as
these, where the pXXXpXXX indicate the first and last page ids in the
file.  But for pages-articles, there will also be a combined file
generated, so you'll be able to download that directly.  It's listed on the
download page https://dumps.wikimedia.org/wikidatawiki/20170401/ and the
direct link is as you expect:
https://dumps.wikimedia.org/wikidatawiki/20170401/wikidatawiki-20170401-pages-articles.xml.bz2
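
For scripted downloads, the page-id range can be pulled straight out of the
partial filenames. Here's a minimal sketch (the filename below is made up for
illustration; check the actual directory listing for real names):

import re

# Matches the -pSTARTpEND suffix used on partial dump files.
PART_RE = re.compile(r'-p(?P<first>\d+)p(?P<last>\d+)\.bz2$')

def page_range(filename):
    """Return (first_page_id, last_page_id), or None for a combined file."""
    match = PART_RE.search(filename)
    if match is None:
        return None
    return int(match.group('first')), int(match.group('last'))

# Hypothetical partial filename, for illustration only:
print(page_range('wikidatawiki-20170401-pages-articles1.xml-p1p235321.bz2'))
# -> (1, 235321)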

Please do consider joining the xmldatadumps-l list; changes and updates are
announced there, among other things.

Ariel

On Thu, Apr 6, 2017 at 10:12 AM, Jaime Crespo  wrote:

> Trung,
>
> If you do not get an answer on the developers' forum, there is a
> dumps-focused mailing list at
> https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
>
> Cheers,
>
> On Thu, Apr 6, 2017 at 6:59 AM, Trung Dinh  wrote:
>
> > Sorry, I hit enter early by accident.
> >
> > I realized the dump file for wikidata is no longer in the format
> > wikidatawiki-2017-pages-articles.xml.bz2 anymore.
> > Now, it is split in to different dumps:
> > https://dumps.wikimedia.org/wikidatawiki/latest/
> > wikidatawiki-latest-md5sums.txt
> >
> > I am wondering when did this happen and the rationale behind it. Will it
> > be permanent or we will switch back to the original format soon ?
> >
> > Thank you,
> >
> > Best regards,
> >
> > Trung
> >
> > On 4/5/17, 9:57 PM, "Wikitech-l on behalf of Trung Dinh" <
> > wikitech-l-boun...@lists.wikimedia.org on behalf of t...@fb.com> wrote:
> >
> > Hi everyone,
> >
> > I realized the dump file for wikidata is no longer in the format
> > wikidatawiki-2017-pages-articles.xml.bz2 anymore.
> >
> >
> > ___
> > Wikitech-l mailing list
> > Wikitech-l@lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/wikitech-l
> >
> > ___
> > Wikitech-l mailing list
> > Wikitech-l@lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/wikitech-l
> >
>
>
>
> --
> Jaime Crespo
> 
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Important news about the November dumps run!

2017-11-03 Thread Ariel Glenn WMF
The first set of dumps is running there and looks like it's working ok.
I've done a manual rsync of files produced up to this point, so those are
now available on the web server.

As before, you can follow work on this at
https://phabricator.wikimedia.org/T178893

Note that it is possible that some index.html files may contain links to
files which did not get picked up on the rsync.  They'll be there sometime
tomorrow after the next rsync.

Ariel

On Mon, Oct 30, 2017 at 5:39 PM, Ariel Glenn WMF <ar...@wikimedia.org>
wrote:

> As was previously announced on the xmldatadumps-l list, the sql/xml dumps
> generated twice a month will be written to an internal server, starting
> with the November run.  This is in part to reduce load on the web/rsync/nfs
> server which has been doing this work also until now.  We want separation
> of roles for some other reasons too.
>
> Because I want to get this right, and there are a lot of moving parts, and
> I don't want to rsync all the prefetch data over to these boxes again next
> month after cancelling the move:
>
> 
> If needed, the November full run will be delayed for a few days.
> If the November full run takes too long, the partial run, usually starting
> on the 20th of the month, will not take place.
> *
>
> Additionally, as described in an earlier email on the xmldatadumps-l list:
>
> *
> files will show up on the web server/rsync server with a substantial
> delay.  Initially this may be a day or more.  This includes index.html and
> other status files.
> *
>
> You can keep track of developments here: https://phabricator.wikimedia.
> org/T178893
>
> If you know folks not on the lists in the recipients field for this email,
> please forward it to them and suggest that they subscribe to this list.
>
> Thanks,
>
> Ariel
>
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Important news about the November dumps run!

2017-11-06 Thread Ariel Glenn WMF
Rsync of xml/sql dumps to the web server is now running on a rolling basis
via a script, so you should see updates regularly rather than "every
$random hours".  There's more to be done on that front, see
https://phabricator.wikimedia.org/T179857 for what's next.

Ariel
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Important news about the November dumps run!

2017-11-07 Thread Ariel Glenn WMF
There are no problems that I see.  We did get started a couple of days late
for this run due to the move to an internal server, but I see all jobs
running fine.  The frwiki pages-articles dumps have not yet run; enwiki and
wikidatawiki are in progress; eswiki, itwiki, jawiki, and zhwiki are busy
writing pages-articles right now, etc.  Just give it another couple of days
:-)

Ariel

On Tue, Nov 7, 2017 at 7:28 PM, Nicolas Vervelle <nverve...@gmail.com>
wrote:

> Hi,
>
> Are there problems with some dumps like frwiki with the new system ?
> On your.org mirror, important files like page-articles are still missing
> from the 20171103 dump directory, when usually it only takes a day...
>
> Nico
>
> On Mon, Nov 6, 2017 at 8:01 PM, Ariel Glenn WMF <ar...@wikimedia.org>
> wrote:
>
> > Rsync of xml/sql dumps to the web server is now running on a rolling
> basis
> > via a script, so you should see updates regularly rather than "every
> > $random hours".  There's more to be done on that front, see
> > https://phabricator.wikimedia.org/T179857 for what's next.
> >
> > Ariel
> > ___
> > Wikitech-l mailing list
> > Wikitech-l@lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/wikitech-l
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

[Wikitech-l] Important news about the November dumps run!

2017-10-30 Thread Ariel Glenn WMF
As was previously announced on the xmldatadumps-l list, the sql/xml dumps
generated twice a month will be written to an internal server, starting
with the November run.  This is in part to reduce load on the web/rsync/nfs
server which has been doing this work also until now.  We want separation
of roles for some other reasons too.

Because I want to get this right, and there are a lot of moving parts, and
I don't want to rsync all the prefetch data over to these boxes again next
month after cancelling the move:


If needed, the November full run will be delayed for a few days.
If the November full run takes too long, the partial run, usually starting
on the 20th of the month, will not take place.
*

Additionally, as described in an earlier email on the xmldatadumps-l list:

*
files will show up on the web server/rsync server with a substantial
delay.  Initially this may be a day or more.  This includes index.html and
other status files.
*

You can keep track of developments here:
https://phabricator.wikimedia.org/T178893

If you know folks not on the lists in the recipients field for this email,
please forward it to them and suggest that they subscribe to this list.

Thanks,

Ariel
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

[Wikitech-l] change to output file numbering of big wikis

2018-05-31 Thread Ariel Glenn WMF
TL;DR:
Scripts that rely on xml files numbered 1 through 4 should be updated to
check for 1 through 6.

Explanation:

A number of wikis have stubs and page content files generated 4 parts at a
time, with the appropriate number added to the filename. I'm going to be
increasing that this month to 6.

The reason for the increase is that near the end of the run there are
usually just a few big wikis taking their time at completing. If they run
with 6 processes at once, they'll finish up a bit sooner.

If you have scripts that rely on the number 4, just increase it to 6 and
you're done.
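
For example, something along these lines works for a download script that
builds the filenames itself (the naming pattern here is illustrative; adjust
it to the files you actually fetch):

import os

NUM_PARTS = 6  # was 4 before the June 1 run

def part_filenames(wiki, date, job='pages-articles'):
    """Yield the expected numbered part filenames for one dump job."""
    for part in range(1, NUM_PARTS + 1):
        yield '{}-{}-{}{}.xml.bz2'.format(wiki, date, job, part)

# Report any parts not yet downloaded:
missing = [name for name in part_filenames('enwiki', '20180601')
           if not os.path.exists(name)]
print(missing)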

This will go into effect for the June 1 run and all runs afterwards.

Thanks!
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

[Wikitech-l] MultiContent Revisions and changes to the XML dumps

2018-08-02 Thread Ariel Glenn WMF
As many of you may know, MultiContent Revisions are coming soon (October?)
to a wiki near you. This means that we need changes to the XML dumps
schema; these changes will likely NOT be backwards compatible.

Initial discussion will take place here:
https://phabricator.wikimedia.org/T199121

For background on MultiContent Revisions and their use on e.g. Commons or
WikiData, see:

https://phabricator.wikimedia.org/T200903 (Commons media metadata)
https://phabricator.wikimedia.org/T194729 (Wikidata entities)
https://www.mediawiki.org/wiki/Requests_for_comment/Multi-Content_Revisions
(MCR generally)

There may be other, better tickets/pages for background; feel free to
supplement this list if you have such links.

Ariel
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] huwiki, arwiki to be treated as 'big wikis' and run parallel jobs

2018-08-20 Thread Ariel Glenn WMF
The dumps will run just as often as they ever did, twice a month, and the
change just concerns dumps, nothing else.

Ariel

On Mon, Aug 20, 2018 at 1:45 PM, Bináris  wrote:

> Does this affect the frequency of new dumps?
> Is this a "general" classification or just concerns dumps?
> Anyway, I am proud of being part of this. :-)
>
> 2018-08-20 12:26 GMT+02:00 Ariel Glenn WMF :
>
> > Starting September 1, huwiki and arwiki, which both take several days to
> > complete the revsion history content dumps, will be moved to the 'big
> > wikis' list, meaning that they will run jobs in parallel as do frwiki,
> > ptwiki and others now, for a speedup.
> >
> > Please update your scripts accordingly.  Thanks!
> >
> > Task for this: https://phabricator.wikimedia.org/T202268
> >
> > Ariel
> > ___
> > Wikitech-l mailing list
> > Wikitech-l@lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
>
>
>
> --
> Bináris
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

[Wikitech-l] huwiki, arwiki to be treated as 'big wikis' and run parallel jobs

2018-08-20 Thread Ariel Glenn WMF
Starting September 1, huwiki and arwiki, which both take several days to
complete the revision history content dumps, will be moved to the 'big
wikis' list, meaning that they will run jobs in parallel as do frwiki,
ptwiki and others now, for a speedup.

Please update your scripts accordingly.  Thanks!

Task for this: https://phabricator.wikimedia.org/T202268

Ariel
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

[Wikitech-l] hewiki dump to be added to 'big wikis' and run with multiple processes

2018-07-19 Thread Ariel Glenn WMF
Good morning!

The pages-meta-history dumps for hewiki take 70 hours these days, the
longest of any wiki not already running with parallel jobs. I plan to add
it to the list of 'big wikis' starting August 1st, meaning that 6 jobs will
run in parallel producing the usual numbered file output; look at e.g.
frwiki dumps for an example.

Please adjust any download/processing scripts accordingly.

Thanks!

Ariel
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

[Wikitech-l] terbium EOL, mw maintenance server MOVED, use mwmaint1001 for all

2018-07-04 Thread Ariel Glenn WMF
Hello folks,

Terbium, our former faithful MediaWiki maintenance server, will be up for
decommissioning on Monday, July 9th. It is no longer used for anything in
production as of a few moments ago. The sole exception to that is cron jobs
that were already running and have not yet completed. Please be sure that
you have no processes left running on terbium by Monday July 9th!

The new server for everything including cron jobs is
mwmaint1001.eqiad.wmnet.

Your files from terbium should already be available in a subdirectory
'home-terbium' in your home directory.

Please report any issues; you know where to find us (phab,
#wikimedia-operations irc).

Thanks!

WMF SRE Team

P.S. For full history of this process, see
https://phabricator.wikimedia.org/T192092
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

[Wikitech-l] pagecounts-ez missing April files (was Re: changes coming to large dumps)

2018-04-10 Thread Ariel Glenn WMF
If it's gone, that's coincidence. Flagging this to look into, thanks for
the report. Please follow that ticket,
https://phabricator.wikimedia.org/T184258 for more info.

On Tue, Apr 10, 2018 at 5:35 PM, Derk-Jan Hartman <
d.j.hartman+wmf...@gmail.com> wrote:

> It seems that the pagecounts-ez sets disappeared from
> dumps.wikimedia.org starting this date. Is that a coincidence ?
> Is it https://phabricator.wikimedia.org/T189283 perhaps ?
>
> DJ
>
> On Thu, Mar 29, 2018 at 2:42 PM, Ariel Glenn WMF <ar...@wikimedia.org>
> wrote:
> > Here it comes:
> >
> > For the April 1st run and all following runs, the Wikidata dumps of
> > pages-meta-current.bz2 will be produced only as separate downloadable
> > files, no recombined single file will be produced.
> >
> > No other dump jobs will be impacted.
> >
> > A reminder that each of the single downloadable pieces has the siteinfo
> > header and the mediawiki footer so they may all be processed separately
> by
> > whatever tools you use to grab data out of the combined file. If your
> > workflow supports it, they may even be processed in parallel.
> >
> > I am still looking into what the best approach is for the pags-articles
> > dumps.
> >
> > Please forward wherever you deem appropriate. For further updates, don't
> > forget to check the Phab ticket!  https://phabricator.wikimedia.
> org/T179059
> >
> > On Mon, Mar 19, 2018 at 2:00 PM, Ariel Glenn WMF <ar...@wikimedia.org>
> > wrote:
> >
> >> A reprieve!  Code's not ready and I need to do some timing tests, so the
> >> March 20th run will do the standard recombining.
> >>
> >> For updates, don't forget to check the Phab ticket!
> >> https://phabricator.wikimedia.org/T179059
> >>
> >> On Mon, Mar 5, 2018 at 1:10 PM, Ariel Glenn WMF <ar...@wikimedia.org>
> >> wrote:
> >>
> >>> Please forward wherever you think appropriate.
> >>>
> >>> For some time we have provided multiple numbered pages-articles bz2
> file
> >>> for large wikis, as well as a single file with all of the contents
> combined
> >>> into one.  This is consuming enough time for Wikidata that it is no
> longer
> >>> sustainable.  For wikis where the sizes of these files to recombine is
> "too
> >>> large", we will skip this recombine step. This means that downloader
> >>> scripts relying on this file will need to check its existence, and if
> it's
> >>> not there, fall back to downloading the multiple numbered files.
> >>>
> >>> I expect to get this done and deployed by the March 20th dumps run.
> You
> >>> can follow along here: https://phabricator.wikimedia.org/T179059
> >>>
> >>> Thanks!
> >>>
> >>> Ariel
> >>>
> >>
> >>
> > ___
> > Wikitech-l mailing list
> > Wikitech-l@lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

[Wikitech-l] New web server for dumps/datasets, OLD ONE GOING AWAY

2018-04-04 Thread Ariel Glenn WMF
Folks,

As you'll have seen from previous email, we are now using a new beefier
webserver for your dataset downloading needs. And the old server is going
away on TUESDAY April 10th.

This means that if you are using 'dataset1001.wikimedia.org' or the IP
address itself in your scripts, you MUST change it before Tuesday, or it
will stop working.

There will be no further reminders.

Thanks!

Ariel
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

[Wikitech-l] Change for abstracts dumps, primarily for wikidata

2018-04-04 Thread Ariel Glenn WMF
Those of you that rely on the abstracts dumps will have noticed that the
content for wikidata is pretty much useless.  It doesn't look like a
summary of the page because main namespace articles on wikidata aren't
paragraphs of text. And there's really no useful summary to be generated,
even if we were clever.

We have instead decided to produce abstracts output only for pages in the
main namespace that consist of text. For pages that are of type
wikidata-item, json, and so on, the <abstract> tag will contain the
attribute 'not-applicable' set to the empty string. This impacts very few
pages on other wikis; for the full list and for more information on this
change, see https://phabricator.wikimedia.org/T178047

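If you parse these dumps, skipping the flagged entries is straightforward. A
minimal sketch, assuming the usual layout of <doc> entries each holding an
<abstract> element (verify tag names against the actual files before use):

import xml.etree.ElementTree as ET

def usable_abstracts(path):
    """Yield (title, abstract) pairs, skipping 'not-applicable' entries.

    path points at the decompressed abstracts XML.
    """
    for _, elem in ET.iterparse(path, events=('end',)):
        if elem.tag == 'doc':
            abstract = elem.find('abstract')
            if abstract is not None and 'not-applicable' not in abstract.attrib:
                yield elem.findtext('title'), abstract.text or ''
            elem.clear()  # keep memory use flat on large files
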
We hope this change will be merged in a week or so; it won't take effect
for wikidata until the next dumps run on April 20th, since the wikidata
abstracts are already in progress.

If you have any questions, don't hesitate to ask.

Ariel
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] changes coming to large dumps

2018-03-29 Thread Ariel Glenn WMF
Here it comes:

For the April 1st run and all following runs, the Wikidata dumps of
pages-meta-current.bz2 will be produced only as separate downloadable
files, no recombined single file will be produced.

No other dump jobs will be impacted.

A reminder that each of the single downloadable pieces has the siteinfo
header and the mediawiki footer so they may all be processed separately by
whatever tools you use to grab data out of the combined file. If your
workflow supports it, they may even be processed in parallel.
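
As an illustration, here is a minimal sketch of such parallel processing (the
part filenames are hypothetical; substitute whatever pieces you actually
downloaded):

import bz2
from multiprocessing import Pool

def count_pages(path):
    """Count <page> elements in one bz2 dump piece with a cheap line scan."""
    with bz2.open(path, 'rt', encoding='utf-8') as dump:
        return sum(1 for line in dump if '<page>' in line)

# Hypothetical list of downloaded pieces:
parts = ['wikidatawiki-20180401-pages-meta-current{}.xml.bz2'.format(n)
         for n in range(1, 5)]

if __name__ == '__main__':
    with Pool(len(parts)) as pool:
        print(sum(pool.map(count_pages, parts)))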

I am still looking into what the best approach is for the pages-articles
dumps.

Please forward wherever you deem appropriate. For further updates, don't
forget to check the Phab ticket!  https://phabricator.wikimedia.org/T179059

On Mon, Mar 19, 2018 at 2:00 PM, Ariel Glenn WMF <ar...@wikimedia.org>
wrote:

> A reprieve!  Code's not ready and I need to do some timing tests, so the
> March 20th run will do the standard recombining.
>
> For updates, don't forget to check the Phab ticket!
> https://phabricator.wikimedia.org/T179059
>
> On Mon, Mar 5, 2018 at 1:10 PM, Ariel Glenn WMF <ar...@wikimedia.org>
> wrote:
>
>> Please forward wherever you think appropriate.
>>
>> For some time we have provided multiple numbered pages-articles bz2 file
>> for large wikis, as well as a single file with all of the contents combined
>> into one.  This is consuming enough time for Wikidata that it is no longer
>> sustainable.  For wikis where the sizes of these files to recombine is "too
>> large", we will skip this recombine step. This means that downloader
>> scripts relying on this file will need to check its existence, and if it's
>> not there, fall back to downloading the multiple numbered files.
>>
>> I expect to get this done and deployed by the March 20th dumps run.  You
>> can follow along here: https://phabricator.wikimedia.org/T179059
>>
>> Thanks!
>>
>> Ariel
>>
>
>
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] changes coming to large dumps

2018-03-19 Thread Ariel Glenn WMF
A reprieve!  Code's not ready and I need to do some timing tests, so the
March 20th run will do the standard recombining.

For updates, don't forget to check the Phab ticket!
https://phabricator.wikimedia.org/T179059

On Mon, Mar 5, 2018 at 1:10 PM, Ariel Glenn WMF <ar...@wikimedia.org> wrote:

> Please forward wherever you think appropriate.
>
> For some time we have provided multiple numbered pages-articles bz2 file
> for large wikis, as well as a single file with all of the contents combined
> into one.  This is consuming enough time for Wikidata that it is no longer
> sustainable.  For wikis where the sizes of these files to recombine is "too
> large", we will skip this recombine step. This means that downloader
> scripts relying on this file will need to check its existence, and if it's
> not there, fall back to downloading the multiple numbered files.
>
> I expect to get this done and deployed by the March 20th dumps run.  You
> can follow along here: https://phabricator.wikimedia.org/T179059
>
> Thanks!
>
> Ariel
>
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

[Wikitech-l] changes coming to large dumps

2018-03-05 Thread Ariel Glenn WMF
Please forward wherever you think appropriate.

For some time we have provided multiple numbered pages-articles bz2 files
for large wikis, as well as a single file with all of the contents combined
into one.  This is consuming enough time for Wikidata that it is no longer
sustainable.  For wikis where the size of the files to recombine is "too
large", we will skip this recombine step. This means that downloader
scripts relying on this file will need to check its existence, and if it's
not there, fall back to downloading the multiple numbered files.
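
The fallback can be as simple as this sketch (filenames are illustrative; a
downloader working over HTTP would do the equivalent check against the dump
index or with a HEAD request):

import glob
import os

def files_to_process(wiki, date):
    """Prefer the single combined file; otherwise use the numbered pieces."""
    combined = '{}-{}-pages-articles.xml.bz2'.format(wiki, date)
    if os.path.exists(combined):
        return [combined]
    return sorted(glob.glob('{}-{}-pages-articles[0-9]*.bz2'.format(wiki, date)))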

I expect to get this done and deployed by the March 20th dumps run.  You
can follow along here: https://phabricator.wikimedia.org/T179059

Thanks!

Ariel
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] changes coming to large dumps

2018-03-05 Thread Ariel Glenn WMF
We'll probably start at 20GB, which means that WIkidata will be the only
wiki affected for now.

Ariel

On Mon, Mar 5, 2018 at 1:40 PM, Bináris <wikipo...@gmail.com> wrote:

> Could you please translate "too large" to megabytes?
>
> 2018-03-05 12:10 GMT+01:00 Ariel Glenn WMF <ar...@wikimedia.org>:
>
> > Please forward wherever you think appropriate.
> >
> > For some time we have provided multiple numbered pages-articles bz2 file
> > for large wikis, as well as a single file with all of the contents
> combined
> > into one.  This is consuming enough time for Wikidata that it is no
> longer
> > sustainable.  For wikis where the sizes of these files to recombine is
> "too
> > large", we will skip this recombine step. This means that downloader
> > scripts relying on this file will need to check its existence, and if
> it's
> > not there, fall back to downloading the multiple numbered files.
> >
> > I expect to get this done and deployed by the March 20th dumps run.  You
> > can follow along here: https://phabricator.wikimedia.org/T179059
> >
> > Thanks!
> >
> > Ariel
> > ___
> > Wikitech-l mailing list
> > Wikitech-l@lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
>
>
>
> --
> Bináris
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] [Engineering] Gerrit now automatically adds reviewers

2019-01-18 Thread Ariel Glenn WMF
In the meantime, I would encourage those who have not looked at the Git
Reviewer Bot page in a while to do so and to add any updates.

Ariel

On Fri, Jan 18, 2019 at 4:12 PM Tyler Cipriani 
wrote:

> Hi all,
>
> Gerrit no longer automatically adds reviewers[0]. Unfortunately, this
> plugin appears (given the replies on this thread) to be missing key
> features needed to be useful for us at this time. Apologies to those
> folks whose inboxes were destroyed.
>
> I would like to re-enable this plugin at some point, provided the
> features identified in this thread are added (perhaps also an
> "X-Gerrit-reviewers-by-blame: 1" email header, or subject line to make
> filtering these messages easier).
>
> In the interim, project-owners are able to opt-in to using the
> reviewers-by-blame plugin on a per-project basis on their project admin
> page in Gerrit.
>
> Also, the Git Reviewer Bot[1] provides folks an opt-in method of
> volunteering to review a subset of files in a particular repo.
>
> Getting code review as a new contributor is hard. Thanks for bearing
> with us.
>
> -- Tyler
>
> [0]. 
> [1]. 
>
> On 19-01-17 13:51:58, Greg Grossmeier wrote:
> >Hello,
> >
> >Yesterday we (the Release Engineering team) enabled a Gerrit plugin that
> >will automatically add reviewers to your changes based on who previously
> >has committed changes to the file.
> >
> >For more, please read the blog post at:
> >
> https://phabricator.wikimedia.org/phame/post/view/139/gerrit_now_automatically_adds_reviewers/
> >
> >NOTE: There are a couple requests from us open upstream to improve the
> >plugin[0], we'll incorporate those improvements when they are released.
> >
> >On behalf of the rest of the Release Engineering Team[1],
> >
> >Greg
> >
> >[0] https://phabricator.wikimedia.org/T101131#4890023
> >[1] As well as Paladox, a Wikimedia volunteer with strong ties to
> >upstream Gerrit.
> >
> >--
> >| Greg Grossmeier            GPG: B2FA 27B1 F7EB D327 6B8E |
> >| Release Team Manager        A18D 1138 8E47 FAC8 1C7D |
> >
> >___
> >Engineering mailing list
> >engineer...@lists.wikimedia.org
> >https://lists.wikimedia.org/mailman/listinfo/engineering
> ___
> Engineering mailing list
> engineer...@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/engineering
>
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

[Wikitech-l] Possible change in schedule of generation of wikidata entity dumps

2019-03-14 Thread Ariel Glenn WMF
If you use these dumps regularly, please read and weigh in here:
https://phabricator.wikimedia.org/T216160

Thanks in advance,

Ariel Glenn
Wikimedia Foundation
ar...@wikimedia.org
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

[Wikitech-l] Wikidata now officially has more total edits than English language Wikipedia

2019-03-20 Thread Ariel Glenn WMF
Wikidata surpassed the English language Wikipedia in the number of
revisions in the database, about 45 minutes ago today. I was tipped off by a
tweet [1] a few days ago and have been watching via a script that displays
the largest revision id and its timestamp. Here's the point where Wikidata
overtakes English Wikipedia (times in UTC):

[ariel@bigtrouble wikidata-huge]$ python3 ./get_revid_info.py -d
www.wikidata.org -r 888603998,888603999,888604000
revid 888603998 at 2019-03-20T06:00:59Z
revid 888603999 at 2019-03-20T06:00:59Z
revid 888604000 at 2019-03-20T06:00:59Z
[ariel@bigtrouble wikidata-huge]$ python3 ./get_revid_info.py -d
en.wikipedia.org -r 888603998,888603999,888604000
revid 888603998 at 2019-03-20T06:00:59Z
revid 888603999 at 2019-03-20T06:00:59Z
revid 888604000 at 2019-03-20T06:01:00Z

Only 45 minutes later, the gap is already over 2000 revisions:

[ariel@bigtrouble wikidata-huge]$ python3 ./compare_sizes.py
Last enwiki revid is 888606979 and last wikidata revid is 888629401
2019-03-20 06:46:03: diff is 22422
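
For the curious, the comparison doesn't need anything exotic; here is a
minimal sketch of the same idea using the standard MediaWiki action API (this
is not the actual get_revid_info.py, just an approximation):

import requests

def revision_timestamps(domain, revids):
    """Map revision id -> timestamp via action=query&prop=revisions."""
    resp = requests.get(
        'https://{}/w/api.php'.format(domain),
        params={
            'action': 'query',
            'prop': 'revisions',
            'revids': '|'.join(str(r) for r in revids),
            'rvprop': 'ids|timestamp',
            'format': 'json',
        },
        timeout=30,
    )
    resp.raise_for_status()
    pages = resp.json().get('query', {}).get('pages', {})
    return {rev['revid']: rev['timestamp']
            for page in pages.values()
            for rev in page.get('revisions', [])}

print(revision_timestamps('www.wikidata.org',
                          [888603998, 888603999, 888604000]))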

Have a nice day!

Ariel

[1] https://twitter.com/MonsieurAZ/status/1106565116508729345
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

[Wikitech-l] New dumps mirror: The Free Mirror Project

2019-02-06 Thread Ariel Glenn WMF
I am happy to announce a new mirror site, located in Canada, which is
hosting the last two good dumps of all projects. Please welcome and put to
good use https://dumps.wikimedia.freemirror.org/ !

I want to thank Adam for volunteering bandwidth and space and for getting
everything set up. More information about the project can be found at
http://freemirror.org/   Enjoy!

Ariel
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

[Wikitech-l] question about wikidata entity dumps usage (please forward to interested parties)

2019-02-16 Thread Ariel Glenn WMF
Hey folks,

We've had a request to reschedule the way the various wikidata entity dumps
are run. Right now they go once a week on set days of the week; we've been
asked about pegging them to specific days of the month, much as the
xml/sql dumps are run. See https://phabricator.wikimedia.org/T216160 for
more info.

Is this going to cause problems for anyone? Do you ingest these dumps on a
schedule, and what works for you? Please weigh in here or on the
phabricator task; thanks!

Ariel
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

[Wikitech-l] BREAKING CHANGE: schema update, xml dumps

2019-11-27 Thread Ariel Glenn WMF
We plan to move to the new schema for xml dumps for the February 1, 2020
run. Update your scripts and apps accordingly!

The new schema contains an entry for each 'slot' of content. This means
that, for example, the commonswiki dump will contain MediaInfo information
as well as the usual wikitext. See
https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/core/+/master/docs/export-0.11.xsd
for the schema and
https://www.mediawiki.org/wiki/Requests_for_comment/Schema_update_for_multiple_content_objects_per_revision_(MCR)_in_XML_dumps
for further explanation and example outputs.
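
By way of illustration, here is a minimal sketch of walking the per-revision
slots in the new format. The element names ('content', 'role') and the
namespace URI follow my reading of the linked XSD and RFC; double-check them
against export-0.11.xsd before relying on this:

import bz2
import xml.etree.ElementTree as ET

NS = '{http://www.mediawiki.org/xml/export-0.11/}'  # assumed namespace URI

def slot_roles(dump_path):
    """Yield the set of slot roles present in each revision of a dump file."""
    with bz2.open(dump_path, 'rb') as dump:
        for _, elem in ET.iterparse(dump, events=('end',)):
            if elem.tag == NS + 'revision':
                roles = {content.findtext(NS + 'role')
                         for content in elem.findall(NS + 'content')}
                # Pre-MCR style keeps the main slot's text directly under <revision>.
                yield roles or {'main'}
                elem.clear()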

Phabricator task for the update: https://phabricator.wikimedia.org/T238972

PLEASE FORWARD to other lists as you deem appropriate. Thanks!

Ariel Glenn
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

[Wikitech-l] New dumps available: MachineVision extension tables

2020-04-21 Thread Ariel Glenn WMF
Good morning!

New weekly dumps are available [1], containing the content of the tables
used by the MachineVision extension [2].  For information about these
tables, please see [3].

If you decide to use these tables, as with any other dumps, I would be
interested to know how you use them; feel free to drop me an email.

Wishing everyone good health,

Ariel

[1] https://dumps.wikimedia.org/other/machinevision/
[2] https://www.mediawiki.org/wiki/Extension:MachineVision
[3] https://www.mediawiki.org/wiki/Extension:MachineVision/Schema
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

[Wikitech-l] No second dump run this month

2020-03-19 Thread Ariel Glenn WMF
As mentioned earlier on the xmldatadumps-l list, the dumps are running very
slowly this month, since the vslow db hosts they use are also serving live
traffic during a table migration. Even manual runs of partial jobs would not
help the situation any, so there will be NO SECOND DUMP RUN THIS MONTH. The
March 1 Wikidata run is still in progress but it should complete in the next
several days.

With any luck everything will be back to normal in April and we'll be able
to conduct two runs as usual from then on.

Ariel
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Making breaking changes without deprecation?

2020-08-28 Thread Ariel Glenn WMF
I'd like to see third party users, even those not on the mailing list, get
advance notice in one release (say in the release notes) so that when the
next release shows up with the deprecated code removed, they have had time
to patch up any internal extensions and code they may have.

I don't want to penalize third parties who may not publish their extensions
because they think the code is not good enough for public consumption or
because it is very specific to their company or workflow.

I also don't want to encourage delays in updating, or the common practice
of running very outdated versions of MediaWiki. Of course some folks will
remain on LTS; that's what it's there for. But once a new release is out,
we should want parties to be in a position to update to it immediately, at
least as far as our processes go.

A delay of two releases is nice but not necessary and honestly I'd just
skip that altogether.

Just my .02 €,

Ariel

On Fri, Aug 28, 2020 at 12:19 PM Daniel Kinzler wrote:

> Hi all!
>
> Since the new Stable Interface Policy[1] has come into effect, there has
> been
> some confusion about when and how the deprecation process can be
> accelerated or
> bypassed. I started a discussion about this issue on the talk page[2], and
> now
> I'm writing this email in the hope of gathering more perspectives.
>
> tl;dr: the key question is:
>
> Can we shorten or even entirely skip the deprecation process,
> if we have removed all usages of the obsolete code from public
> extensions?
>
> If you are affected by the answer to this question, or you otherwise have
> opinions about it, please read on (ok ok, this mail is massive - at least
> read
> the proposed new wording of the policy). I'm especially interested in the
> opinions of extension developers.
>
>
> So, let's dive in. On the one hand, the new (and old) policy states:
>
> Code MUST emit hard deprecation notices for at least one major
> MediaWiki version before being removed. It is RECOMMENDED to emit
> hard deprecation notices for at least two major MediaWiki
> versions. EXCEPTIONS to this are listed in the section "Removal
> without deprecation" below.
>
> This means that code that starts to emit a deprecation warning in version
> N can
> only be removed in version N+1, better even N+2. This effectively
> recommends
> that obsolete code be kept around for at least half a year, with a
> preference
> for a full year and more. However, we now have this exception in place:
>
> The deprecation process may be bypassed for code that is unused
> within the MediaWiki ecosystem. The ecosystem is defined to
> consist of all actively maintained code residing in repositories
> owned by the Wikimedia foundation, and can be searched using the
> code search tool.
>
> When TechCom added this section[3][4], we were thinking of the case where a
> method becomes obsolete, but is unused. In that case, why go through all
> the
> hassle of deprecation, if nobody uses it anyway?
>
> However, what does this mean for obsolete code that *is* used? Can we just
> go
> ahead and remove the usages, and then remove the code without deprecation?
> That
> seems to be the logical consequence.
>
> The result is a much tighter timeline from soft deprecation to removal,
> reducing
> the amount of deprecated code we have to drag along and keep functional.
> This is
> would be helpful particularly when code was refactored to remove
> undesirable
> dependencies, since the dependency will not actually go away until the
> deprecated code has been removed.
>
> So, if we put in the work to remove usages, can we skip the deprecation
> process?
> After all, if the code is truly unused, this would not do any harm, right?
> And
> being able to make breaking changes without the need to wait a year for
> them to
> become effective would greatly improve the speed at which we can modernize
> the
> code base.
>
> However, even skipping soft deprecation and going directly to hard
> deprecation
> of the construction of the Revision class raised concerns, see for instance
>  >.
>
> The key concern is that we can only know about usages in repositories in
> our
> "ecosystem", a concept introduced into the policy by the section quoted
> above. I
> will go into the implications of this further below. But first, let me
> propose a
> change to the policy, to clarify when deprecation is or is not needed.
>
> I propose that the policy should read:
>
> Obsolete code MAY be removed without deprecation if it is unused (or
> appropriately gated) by any code in the MediaWiki ecosystem. Such
> removal must be recorded in the release notes as a breaking change
> without deprecation, and must be announced on the appropriate
> mailing lists.
>
> Obsolete code that is still used within the ecosystem MAY be
> removed if it has been emitting deprecation warnings in AT 

[Wikitech-l] New XML/SQL dumps mirror

2021-07-12 Thread Ariel Glenn WMF
Thanks to BringYour, based in California, for volunteering to host the last
5 good xml/sql dumps!

To check out the full list of mirrors, see either
https://dumps.wikimedia.org/mirrors.html or
https://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps#Dumps

Interested in hosting dumps and have some spare space and bandwidth? See
https://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps#Potential_mirrors
for more about that.

Ariel "Dumps wrangler" Glenn
___
Wikitech-l mailing list -- wikitech-l@lists.wikimedia.org
To unsubscribe send an email to wikitech-l-le...@lists.wikimedia.org
https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/

[Wikitech-l] Wikimedia Enterprise HTML dumps available for public download

2021-10-19 Thread Ariel Glenn WMF
I am pleased to announce that Wikimedia Enterprise's HTML dumps [1] for
October 17-18th are available for public download; see
https://dumps.wikimedia.org/other/enterprise_html/ for more information. We
expect to make updated versions of these files available around the 1st/2nd
of the month and the 20th/21st of the month, following the cadence of the
standard SQL/XML dumps.
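
If you want to poke at the contents, a rough sketch along the following
lines should work (Python, untested; it assumes each tarball holds a single
file with one JSON object per line, and the field name printed at the end
is only a guess, so do inspect the real layout first):

import itertools
import json
import sys
import tarfile

def iter_records(path):
    """Yield one parsed JSON object per line from each file in the tarball."""
    # mode='r|gz' streams the archive, so no random access is needed.
    with tarfile.open(path, mode='r|gz') as tar:
        for member in tar:
            handle = tar.extractfile(member)
            if handle is None:  # e.g. directory entries
                continue
            for line in handle:
                line = line.strip()
                if line:
                    yield json.loads(line)

if __name__ == '__main__':
    for record in itertools.islice(iter_records(sys.argv[1]), 3):
        print(record.get('name'))  # 'name' is a guess; print record.keys() to explore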

This is still an experimental service, so there may be hiccups from time to
time. Please be patient and report issues as you find them. Thanks!

Ariel "Dumps Wrangler" Glenn

[1] See https://www.mediawiki.org/wiki/Wikimedia_Enterprise for much more
about Wikimedia Enterprise and its API.
___
Wikitech-l mailing list -- wikitech-l@lists.wikimedia.org
To unsubscribe send an email to wikitech-l-le...@lists.wikimedia.org
https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/

[Wikitech-l] Re: [Wiki-research-l] Wikimedia Enterprise HTML dumps available for public download

2022-01-01 Thread Ariel Glenn WMF
Hello Mitar! I'm glad you are finding the Wikimedia Enterprise dumps useful.

For your tar.gz question, this is the format that the Wikimedia Enterprise
dataset consumers prefer, from what I understand. But I would suggest that
if you are interested in other formats, you might open a task on
phabricator with a feature request, and add the Wikimedia Enterprise
project tag ( https://phabricator.wikimedia.org/project/view/4929/ ).

As to the API, I'm only familiar with the endpoints for bulk download, so
you'll want to ask the Wikimedia Enterprise folks, or have a look at their
API documentation here:
https://www.mediawiki.org/wiki/Wikimedia_Enterprise/Documentation
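
Regarding the multistream suggestion below: for anyone unfamiliar with it,
here is a rough, untested sketch of how the index file shipped with the
existing pages-articles-multistream XML dumps can be used to decompress
only the stream you need; file names and the offset:page_id:title index
format are those of the current XML dumps, so adjust for anything else:

import bz2

def read_stream(multistream_path, offset):
    """Decompress the single bz2 stream starting at `offset` and return its bytes."""
    with open(multistream_path, 'rb') as dumpfile:
        dumpfile.seek(offset)
        decomp = bz2.BZ2Decompressor()
        chunks = []
        while not decomp.eof:
            block = dumpfile.read(256 * 1024)
            if not block:
                break
            chunks.append(decomp.decompress(block))
        return b''.join(chunks)

# The companion index file (itself bz2-compressed, e.g.
# enwiki-latest-pages-articles-multistream-index.txt.bz2) has lines of the
# form offset:page_id:page_title. Look up your title there, then:
#   fragment = read_stream('enwiki-latest-pages-articles-multistream.xml.bz2',
#                          offset)
# and parse the <page> elements out of the returned XML fragment.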

Ariel


On Sat, Jan 1, 2022 at 4:30 PM Mitar  wrote:

> Hi!
>
> Awesome!
>
> Is there any reason they are tar.gz files of one file and not simply
> bzip2 of the file contents? Wikidata dumps are bzip2 of one json and
> that allows parallel decompression. Having both tar (why tar of one
> file at all?) and gz in there really requires one to first decompress
> the whole thing before you can process it in parallel. Is there some
> other way I am missing?
>
> Wikipedia dumps are done with multistream bzip2 with an additional
> index file. That could be nice here too, if one could have an index
> file and then be able to immediately jump to a JSON line for
> corresponding articles.
>
> Also, is there an API endpoint or Special page which can return the
> same JSON for a single Wikipedia page? The JSON structure looks very
> useful by itself (e.g., not in bulk).
>
>
> Mitar
>
>
> On Tue, Oct 19, 2021 at 4:57 PM Ariel Glenn WMF wrote:
> >
> > I am pleased to announce that Wikimedia Enterprise's HTML dumps [1] for
> > October 17-18th are available for public download; see
> > https://dumps.wikimedia.org/other/enterprise_html/ for more
> information. We
> > expect to make updated versions of these files available around the
> 1st/2nd
> > of the month and the 20th/21st of the month, following the cadence of the
> > standard SQL/XML dumps.
> >
> > This is still an experimental service, so there may be hiccups from time
> to
> > time. Please be patient and report issues as you find them. Thanks!
> >
> > Ariel "Dumps Wrangler" Glenn
> >
> > [1] See https://www.mediawiki.org/wiki/Wikimedia_Enterprise for much
> more
> > about Wikimedia Enterprise and its API.
> > ___
> > Wiki-research-l mailing list -- wiki-researc...@lists.wikimedia.org
> > To unsubscribe send an email to
> wiki-research-l-le...@lists.wikimedia.org
>
>
>
> --
> http://mitar.tnode.com/
> https://twitter.com/mitar_m
> ___
> Wikitech-l mailing list -- wikitech-l@lists.wikimedia.org
> To unsubscribe send an email to wikitech-l-le...@lists.wikimedia.org
> https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/
>
___
Wikitech-l mailing list -- wikitech-l@lists.wikimedia.org
To unsubscribe send an email to wikitech-l-le...@lists.wikimedia.org
https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/

[Wikitech-l] Wiki content and other dumps new ownership, feedback requested on new version!

2023-09-27 Thread Ariel Glenn WMF
Hello folks!

For some years now, I've been the main or only point of contact for the
semimonthly Wiki project sql/xml dumps, as well as for a number of
miscellaneous weekly datasets.

This work is now passing to Data Platform Engineering (DPE), and your new
points of contact, starting right away, will be Will Doran (email:wdoran)
and Virginia Poundstone (email:vpoundstone). I'll still be lending a hand
in the background for a little while but by the end of the month I'll have
transitioned into a new role at the Wikimedia Foundation, working more
directly on MediaWiki itself.

The Data Products team, a subteam of DPE, will be managing the current
dumps day-to-day, as well as working on a new dumps system intended to
replace and greatly improve the current one. What formats will it produce,
and what content, and in what bundles?  These are all great questions, and
you have a chance to help decide on the answers. The team is gathering
feedback right now; follow this link [
https://docs.google.com/forms/d/e/1FAIpQLScp2KzkcTF7kE8gilCeSogzpeoVN-8yp_SY6Q47eEbuYfXzsw/viewform?usp=sf_link]
to give your input!

If you want to follow along on work being done on the new dumps system, you
can check the phabricator workboard at
https://phabricator.wikimedia.org/project/board/6630/ and look for items
with the "Dumps 2.0" tag.

Members of the Data Products team are already stepping up to manage the
xmldatadumps-l mailing list, so you should not notice any changes as far as
that goes.

And as always, for dumps-related questions that people on this list cannot
answer and that are not covered in the docs at
https://meta.wikimedia.org/wiki/Data_dumps or
https://wikitech.wikimedia.org/wiki/Dumps you can always email ops-dumps
(at) wikimedia.org.

See you on the wikis!

Ariel Glenn
ar...@wikimedia.org
___
Wikitech-l mailing list -- wikitech-l@lists.wikimedia.org
To unsubscribe send an email to wikitech-l-le...@lists.wikimedia.org
https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/