Re: [Wikitech-l] [Engineering] The train will resume tomorrow (was Re: All wikis reverted to wmf.8 last night due to T119736)

2016-07-20 Thread Adam Baso
>
> Hi all, I'm going to schedule some time next week to discuss the incident
> and its response. Good writeup, by the way, Matt.
>

Notes posted:

https://wikitech.wikimedia.org/wiki/Incident_documentation/20160712-EchoCentralAuth/Retrospective
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] [Engineering] The train will resume tomorrow (was Re: All wikis reverted to wmf.8 last night due to T119736)

2016-07-13 Thread Matthew Flaschen



On 07/12/2016 11:35 PM, Legoktm wrote:

> We should not be blocking login anymore. The patch[1] I deployed last
> night catches the exceptions so users are able to log in, but still
> continues to log them.


Does that still apply if they're logging in *to* the wiki where their 
user row is missing?


I know it fixes the issue "I can't log into English Wikipedia because my 
account on randomwiki is messed up".


Matt


Re: [Wikitech-l] [Engineering] The train will resume tomorrow (was Re: All wikis reverted to wmf.8 last night due to T119736)

2016-07-13 Thread Adam Baso
>
> I think we need to have a serious discussion about what happened, and
> think very hard about the changes we would need to make to our
>

Hi all, I'm going to schedule some time next week to discuss the incident
and its response. Good writeup, by the way, Matt.


> I think we should also reach out to the users that were affected and
> apologize.
>

I agree. Can someone please privately provide me a list of affected users
so we can work with a community liaison and engineer to communicate out a
"sorry" message?

-Adam

Re: [Wikitech-l] [Engineering] The train will resume tomorrow (was Re: All wikis reverted to wmf.8 last night due to T119736)

2016-07-12 Thread Matthew Flaschen

On 07/12/2016 09:25 PM, Matthew Flaschen wrote:

> I am already writing an incident report, and I welcome a discussion.


Incident report for the Echo part of this:
https://wikitech.wikimedia.org/wiki/Incident_documentation/20160712-EchoCentralAuth
Please edit and improve.


Thanks,

Matt


Re: [Wikitech-l] [Engineering] The train will resume tomorrow (was Re: All wikis reverted to wmf.8 last night due to T119736)

2016-07-12 Thread Legoktm
Hi,

On 07/12/2016 04:56 PM, Ori Livneh wrote:
> Is it actually fixed? It doesn't look like it, from the logs.
> 
> Since midnight UTC on July 7, 3,195 distinct users have tried and failed to
> log in a combined total of 25,047 times, or an average of approximately
> eight times per user. The six days that have passed since then were
> business as usual for Wikimedia Engineering.

We should not be blocking login anymore. The patch[1] I deployed last
night catches the exceptions so users are able to log in, but still
continues to log them. I'm not sure if there's a way to tell the
difference between an exception that was shown to a user and one that
was just logged.

[1] https://gerrit.wikimedia.org/r/#/c/298416/
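
For readers unfamiliar with the pattern described above (catch the exception so login can proceed, but keep logging it), here is a minimal sketch in Python. It is not the actual patch, which lives in the Gerrit change linked above; `MissingLocalUserError`, `run_post_login_hooks`, and the logger name are hypothetical names chosen for illustration only.

```python
import logging

logger = logging.getLogger("login-sketch")  # hypothetical logger name


class MissingLocalUserError(Exception):
    """Hypothetical error raised when a local user row cannot be found."""


def run_post_login_hooks(username, hooks):
    """Run every hook; log a missing-user failure instead of aborting login.

    Returns the number of hooks that failed.
    """
    failed = 0
    for hook in hooks:
        try:
            hook(username)
        except MissingLocalUserError:
            # Keep the error visible in the logs, but let the login proceed.
            logger.exception("Could not find local user data for %s", username)
            failed += 1
    return failed
```

With this shape, a failing hook no longer blocks the hooks that follow it, which is also why a logged exception alone does not say whether the user ever saw an error.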

-- Legoktm


Re: [Wikitech-l] [Engineering] The train will resume tomorrow (was Re: All wikis reverted to wmf.8 last night due to T119736)

2016-07-12 Thread Matthew Flaschen

On 07/12/2016 07:07 PM, Greg Grossmeier wrote:

> Thanks to Matt Flaschen and Brad Jorsch (and others like Ori Livneh and
> Bryan Davis) for their help.


Also Roan Kattouw, Kunal Mehta, and Stephane Bisson.

Matt



Re: [Wikitech-l] [Engineering] The train will resume tomorrow (was Re: All wikis reverted to wmf.8 last night due to T119736)

2016-07-12 Thread Matthew Flaschen

On 07/12/2016 07:56 PM, Ori Livneh wrote:

> Is it actually fixed? It doesn't look like it, from the logs.


It's beyond unhelpful that you would send this email without pointing to 
the logs you are referring to.  With a statement like that, a paste is 
called for. 	


If you mean the inconsistent state that already exists, there
is a script running, as Greg explicitly noted.



> It represents failure of process at multiple levels
> and a lack of accountability.


"Lack of accountability" is a serious charge, and one that I disagree 
with.  That would imply people did not take responsibility for their 
code's failures, or did not take this seriously, and that is not what I see.
The Collaboration team and other people, such as Bryan Davis, worked
on this promptly as soon as they were made aware, and I take full 
responsibility for causing this issue.


The severity level may not have been evident until last night (thanks to 
Legoktm for helping show this).  Could the severity have been realized 
sooner?  Yes, but I'm not sure this is the way to make that happen.



> I think we need to have a serious discussion about what happened, and
> think very hard about the changes we would need to make to our processes
> and organizational structure to prevent a recurrence.


I am already writing an incident report, and I welcome a discussion.

However, I strongly disagree with the attitude that /there was a serious
bug; therefore no one cared/.


I don't dispute it's a very serious and unfortunate bug, and I agree we 
should work to prevent bugs, and ensure they're remediated more promptly.


But I take my work and the extensions my team is responsible for 
seriously, and I worked on this urgently as soon as I knew about it.


Matt Flaschen


Re: [Wikitech-l] [Engineering] The train will resume tomorrow (was Re: All wikis reverted to wmf.8 last night due to T119736)

2016-07-12 Thread Greg Grossmeier

> On Tue, Jul 12, 2016 at 4:07 PM, Greg Grossmeier wrote:
> 
> > 
> > > https://phabricator.wikimedia.org/T119736 - "Could not find local user
> > > data for {Username}@{wiki}"
> > >
> > > There was an order of magnitude increase in the rate of those errors
> > > that started on July 7th.
> > >
> > > Investigation and remediation is on-going.
> >
> > Investigation and remediation is mostly complete[0] and the vast
> > majority of cases have been addressed. There are still users who will
> > experience this error for the next ~1 day.[1]
> >
> 
> Is it actually fixed? It doesn't look like it, from the logs.

That was the information I was given. If it is not improved after the
fixes and letting the maint script finish then we'll know more
certainly, and with that certainty can modify our plans (as we always
do).

> Our failure to react to this swiftly and comprehensively is appalling and
> embarrassing. It represents failure of process at multiple levels and a
> lack of accountability.

Matt is working on an incident report for this.

> I think we should also reach out to the users that were affected and
> apologize.

That certainly should/could be one of the action items.

Greg

-- 
| Greg Grossmeier            GPG: B2FA 27B1 F7EB D327 6B8E |
| identi.ca: @greg                A18D 1138 8E47 FAC8 1C7D |


Re: [Wikitech-l] [Engineering] The train will resume tomorrow (was Re: All wikis reverted to wmf.8 last night due to T119736)

2016-07-12 Thread Ori Livneh
On Tue, Jul 12, 2016 at 4:07 PM, Greg Grossmeier wrote:

> 
> > https://phabricator.wikimedia.org/T119736 - "Could not find local user
> > data for {Username}@{wiki}"
> >
> > There was an order of magnitude increase in the rate of those errors
> > that started on July 7th.
> >
> > Investigation and remediation is on-going.
>
> Investigation and remediation is mostly complete[0] and the vast
> majority of cases have been addressed. There are still users who will
> experience this error for the next ~1 day.[1]
>

Is it actually fixed? It doesn't look like it, from the logs.

Since midnight UTC on July 7, 3,195 distinct users have tried and failed to
log in a combined total of 25,047 times, or an average of approximately
eight times per user. The six days that have passed since then were
business as usual for Wikimedia Engineering.

Our failure to react to this swiftly and comprehensively is appalling and
embarrassing. It represents failure of process at multiple levels and a
lack of accountability.

I think we need to have a serious discussion about what happened, and think
very hard about the changes we would need to make to our processes and
organizational structure to prevent a recurrence.

I think we should also reach out to the users that were affected and
apologize.