Re: [Wikitech-l] [Ops] [Engineering] The train will resume tomorrow (was Re: All wikis reverted to wmf.8 last night due to T119736)

2016-07-13 Thread Andre Klapper
On Tue, 2016-07-12 at 20:15 -0400, aude wrote:
> This (unbreak now) bug has been open since November.  I wonder how
> this has been allowed to remain open and not addressed for this long?

FYI, Matt created a task about "Unbreak now" priority, to receive input from 
Team-Practices:
https://phabricator.wikimedia.org/T140207 

andre
-- 

Andre Klapper | Wikimedia Bugwrangler
http://blogs.gnome.org/aklapper/

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] [Ops] [Engineering] The train will resume tomorrow (was Re: All wikis reverted to wmf.8 last night due to T119736)

2016-07-12 Thread Giuseppe Lavagetto
On Wed, Jul 13, 2016 at 2:15 AM, aude wrote:
> On Tue, Jul 12, 2016 at 7:56 PM, Ori Livneh wrote:
>> Our failure to react to this swiftly and comprehensively is appalling and
>> embarrassing. It represents failure of process at multiple levels and a lack
>> of accountability.
>
>
> This (unbreak now) bug has been open since November.  I wonder how this has
> been allowed to remain open and not addressed for this long?
>

I am sure we could've done way better even in our current structure,
but it's pretty clear to me that the absence of a team dedicated to
MediaWiki itself invites this kind of failure.

Which is pretty absurd, when you remember that 99% of our traffic is
still served by it.

Cheers

G.
-- 
Giuseppe Lavagetto, Ph.d.
Senior Technical Operations Engineer, Wikimedia Foundation


Re: [Wikitech-l] [Ops] [Engineering] The train will resume tomorrow (was Re: All wikis reverted to wmf.8 last night due to T119736)

2016-07-12 Thread Matthew Flaschen

On 07/12/2016 08:15 PM, aude wrote:
> This (unbreak now) bug has been open since November.  I wonder how this
> has been allowed to remain open and not addressed for this long?


This has not all been caused by Echo, and it really isn't one bug, just 
one symptom.


There are clearly multiple causes.  The Echo one has been addressed, and
there are multiple fixes and mitigations on the CentralAuth/core auth
side, some merged (e.g. https://gerrit.wikimedia.org/r/#/c/298531/ ,
https://gerrit.wikimedia.org/r/#/c/298416/ ), some still being worked
on/discussed ( https://gerrit.wikimedia.org/r/#/c/297946/ ,
https://gerrit.wikimedia.org/r/#/c/297936/ ), but the work is not done.


Matt


Re: [Wikitech-l] [Ops] [Engineering] The train will resume tomorrow (was Re: All wikis reverted to wmf.8 last night due to T119736)

2016-07-12 Thread aude
On Tue, Jul 12, 2016 at 7:56 PM, Ori Livneh wrote:

> On Tue, Jul 12, 2016 at 4:07 PM, Greg Grossmeier wrote:
>
>> 
>> > https://phabricator.wikimedia.org/T119736 - "Could not find local user
>> > data for {Username}@{wiki}"
>> >
>> > There was an order of magnitude increase in the rate of those errors
>> > that started on July 7th.
>> >
>> > Investigation and remediation is on-going.
>>
>> Investigation and remediation is mostly complete[0] and the vast
>> majority of cases have been addressed. There are still users who will
>> experience this error for the next ~1 day.[1]
>>
>
> Is it actually fixed? It doesn't look like it, from the logs.
>
> Since midnight UTC on July 7, 3,195 distinct users have tried and failed
> to log in a combined total of 25,047 times, or an average of approximately
> eight times per user. The six days that have passed since then were
> business as usual for Wikimedia Engineering.
>
> Our failure to react to this swiftly and comprehensively is appalling and
> embarrassing. It represents failure of process at multiple levels and a
> lack of accountability.
>

This (unbreak now) bug has been open since November.  I wonder how this has
been allowed to remain open and not addressed for this long?

A new user ran into this issue in June at an editathon that I attended. In
his case, I could fix the problem by manually deleting the offending row in
the database, but most of the time, the user likely gives up :(


>
> I think we need to have a serious discussion about what happened, and
> think very hard about the changes we would need to make to our processes
> and organizational structure to prevent a recurrence.
>
> I think we should also reach out to the users that were affected and
> apologize.
>

+1

Cheers,
Katie




-- 
@wikidata