Re: [Wikitech-l] enwiki revision schema migration in progress

2012-03-20 Thread MZMcBride
Asher Feldman wrote:
 I've temporarily commented out db36 from db.php on the cluster.
 
 This is a flaw in how the client-side use of maxlag interacts with our
 schema migration process - we run migrations on slaves one by one in an
 automated fashion, only moving to the next after replication lag catches
 up.  MediaWiki takes care of not sending queries to the lagged slave that
 is under migration.  Meanwhile, maxlag always reports the value of the most
 lagged slave.  Not a new issue, but this particular alter table on enwiki
 is likely the most time-intensive ever run at WMF.  It's slightly
 ridiculous.

 For this one alter, I can stop the migration script and run each statement
 by hand, pulling and re-adding DBs one by one along the way, but this
 isn't a sustainable process.  Perhaps we can add a migration flag to
 MediaWiki which, if enabled, changes the behavior of maxlag and
 wfWaitForSlaves() to ignore one highly lagged slave so long as others are
 available without lag.

Thank you! My scripts are running fine again. :-)

Some people have suggested that the current API behavior is intentional.
That is, that having every server return the same error code is better
than having some servers return an error code and others not. I think this
logic is flawed given the problems it presents (scripts unable to get past
the error code), but it's definitely something that needs investigation
for the future.

MZMcBride





Re: [Wikitech-l] enwiki revision schema migration in progress

2012-03-20 Thread Roan Kattouw
On Tue, Mar 20, 2012 at 7:12 AM, MZMcBride z...@mzmcbride.com wrote:
 Some people have suggested that the current API behavior is intentional.
 That is, that having every server return the same error code is better
 than having some servers return an error code and others not. I think this
 logic is flawed given the problems it presents (scripts unable to get past
 the error code), but it's definitely something that needs investigation
 for the future.

You're confusing Apache servers and database servers. You're hitting
different Apache servers at random; those are the srv numbers. Each
Apache requests the current lag from all the DB servers serving that
wiki, computes the maximum (in this case db36 wins easily) and checks
that against your maxlag parameter. Obviously the highest lag is
greater than your maxlag setting, so it throws an error and bails.

Had it continued, or had you not set a maxlag parameter, you would
have hit a DB server at random (except that it wouldn't have let you
hit db36 because it was lagged to hell and back). So yeah /normally/
you hit DB servers at random and different servers might respond
differently (or be lagged to different degrees), but in this
particular case it was always the same DB server returning the same
lag value. Nothing strange going on here, this is how the maxlag
parameter works.
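
In pseudo-Python, the check each Apache performs looks roughly like this
(an illustrative sketch only - MediaWiki's real implementation is PHP; the
function name is invented, the lagged host's address and lag are taken from
the error output earlier in the thread, and the second, healthy host is
made up):

---
def check_maxlag(slave_lags, client_maxlag):
    # slave_lags maps each DB slave serving this wiki to its current
    # replication lag in seconds; client_maxlag is the maxlag= value
    # the client sent with the API request.
    worst_host = max(slave_lags, key=slave_lags.get)
    worst_lag = slave_lags[worst_host]
    if worst_lag > client_maxlag:
        # Refuse the request, naming the most-lagged host.
        return {'error': {'code': 'maxlag',
                          'info': 'Waiting for %s: %d seconds lagged'
                                  % (worst_host, worst_lag)}}
    return None  # lag is acceptable; serve the request normally

# db36 (10.0.6.46) is ~22000 seconds behind, so every request fails:
print(check_maxlag({'10.0.6.46': 21948, '10.0.6.44': 1}, client_maxlag=5))
---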

Roan



Re: [Wikitech-l] enwiki revision schema migration in progress

2012-03-20 Thread Asher Feldman
On Tuesday, March 20, 2012, Roan Kattouw roan.katt...@gmail.com wrote:
 So yeah /normally/
 you hit DB servers at random and different servers might respond
 differently (or be lagged to different degrees), but in this
 particular case it was always the same DB server returning the same
 lag value. Nothing strange going on here, this is how the maxlag
 parameter works.

How do you feel about a switch to change that behavior (maxlag - 1)? It
would be nice to continue guiding developers towards throttling API
requests around maxlag without complicating schema migrations by requiring
config deployments before and after every db for this reason only.


Re: [Wikitech-l] enwiki revision schema migration in progress

2012-03-20 Thread Roan Kattouw
On Tue, Mar 20, 2012 at 11:35 AM, Asher Feldman afeld...@wikimedia.org wrote:
 How do you feel about a switch to change that behavior (maxlag - 1)? It
 would be nice to continue guiding developers towards throttling API
 requests around maxlag without complicating schema migrations by requiring
 config deployments before and after every db for this reason only.
That sounds reasonable to me. What alternative behavior do you
propose? A flag that, when enabled, causes maxlag to use the 2nd-highest
lag instead of the highest lag?

Roan



Re: [Wikitech-l] enwiki revision schema migration in progress

2012-03-20 Thread Daniel Friesen
On Tue, 20 Mar 2012 11:35:02 -0700, Asher Feldman afeld...@wikimedia.org
wrote:

 On Tuesday, March 20, 2012, Roan Kattouw roan.katt...@gmail.com wrote:
  So yeah /normally/
  you hit DB servers at random and different servers might respond
  differently (or be lagged to different degrees), but in this
  particular case it was always the same DB server returning the same
  lag value. Nothing strange going on here, this is how the maxlag
  parameter works.

 How do you feel about a switch to change that behavior (maxlag - 1)? It
 would be nice to continue guiding developers towards throttling API
 requests around maxlag without complicating schema migrations by requiring
 config deployments before and after every db for this reason only.

Why don't we have a system to allow databases to be automatically taken
out of the LB temporarily?


--
~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://daniel.friesen.name]



Re: [Wikitech-l] enwiki revision schema migration in progress

2012-03-20 Thread Jeremy Baron
On Tue, Mar 20, 2012 at 14:35, Asher Feldman afeld...@wikimedia.org wrote:
 On Tuesday, March 20, 2012, Roan Kattouw roan.katt...@gmail.com wrote:
 So yeah /normally/
 you hit DB servers at random and different servers might respond
 differently (or be lagged to different degrees), but in this
 particular case it was always the same DB server returning the same
 lag value. Nothing strange going on here, this is how the maxlag
 parameter works.

 How do you feel about a switch to change that behavior (maxlag - 1)? It
 would be nice to continue guiding developers towards throttling API
 requests around maxlag without complicating schema migrations by requiring
 config deployments before and after every db for this reason only.

How about just having a new file in the config that lists the servers
currently migrating, which the migration script can update and deploy as
it goes? Then Apaches can check whether a highly lagged server is on the
expected list and ignore its lag if it is. This should be fairly safe as
long as it's a file that no human will edit and the deploy pushes only that
one file. But I don't actually know whether it's possible (with the current
scripts) to push just one file of the config.
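
Something like this, schematically (a hypothetical Python sketch - the
file name, format, and function names are all invented, and the real
config is PHP):

---
def expected_lagged_hosts(path='migrating-slaves.txt'):
    # One host per line, maintained and deployed only by the migration
    # script; returns the empty set if the file is absent.
    try:
        with open(path) as f:
            return set(line.strip() for line in f if line.strip())
    except IOError:
        return set()

def effective_lag(slave_lags):
    # Ignore slaves whose lag is expected because they are mid-migration.
    skip = expected_lagged_hosts()
    rest = [lag for host, lag in slave_lags.items() if host not in skip]
    return max(rest) if rest else 0

# If 10.0.6.46 (db36) is listed in migrating-slaves.txt this prints 2;
# otherwise its 21948 seconds of lag still dominates.
print(effective_lag({'10.0.6.46': 21948, '10.0.6.44': 2}))
---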

-Jeremy



Re: [Wikitech-l] enwiki revision schema migration in progress

2012-03-20 Thread Asher Feldman
On Tuesday, March 20, 2012, Roan Kattouw roan.katt...@gmail.com wrote:
 On Tue, Mar 20, 2012 at 11:35 AM, Asher Feldman afeld...@wikimedia.org
 wrote:
  How do you feel about a switch to change that behavior (maxlag - 1)? It
  would be nice to continue guiding developers towards throttling API
  requests around maxlag without complicating schema migrations by requiring
  config deployments before and after every db for this reason only.
 That sounds reasonable to me. What alternative behavior do you
 propose? A flag that, when enabled, causes maxlag to use the 2nd-highest
 lag instead of the highest lag?

That was my original thought. Jeremy's idea is good too, though I wonder if
we could do something similar without depending on deployments.


Re: [Wikitech-l] enwiki revision schema migration in progress

2012-03-20 Thread MZMcBride
Roan Kattouw wrote:
 On Tue, Mar 20, 2012 at 7:12 AM, MZMcBride z...@mzmcbride.com wrote:
  Some people have suggested that the current API behavior is intentional.
  That is, that having every server return the same error code is better
  than having some servers return an error code and others not. I think this
  logic is flawed given the problems it presents (scripts unable to get past
  the error code), but it's definitely something that needs investigation
  for the future.

 You're confusing Apache servers and database servers. You're hitting
 different Apache servers at random; those are the srv numbers. Each
 Apache requests the current lag from all the DB servers serving that
 wiki, computes the maximum (in this case db36 wins easily) and checks
 that against your maxlag parameter. Obviously the highest lag is
 greater than your maxlag setting, so it throws an error and bails.
 
 Had it continued, or had you not set a maxlag parameter, you would
 have hit a DB server at random (except that it wouldn't have let you
 hit db36 because it was lagged to hell and back). So yeah /normally/
 you hit DB servers at random and different servers might respond
 differently (or be lagged to different degrees), but in this
 particular case it was always the same DB server returning the same
 lag value. Nothing strange going on here, this is how the maxlag
 parameter works.

Thank you for this explanation. It was very helpful.

I had some follow-up questions about what to do to mitigate this issue going
forward (particularly server-side), but you and others are already
brainstorming in this thread, so I'll slink back to my quiet world of
client-side scripting. :-)

(And thanks again to Asher for depooling the lagged server, even as a
temporary fix. I was able to do about 2500 talk page deliveries this morning
without any issue.)

MZMcBride





[Wikitech-l] enwiki revision schema migration in progress

2012-03-19 Thread Asher Feldman
Just a heads up that the last of the 1.19 migrations, adding the sha1
column to enwiki.revision, is going to be running throughout this week.
Don't be alarmed by replication lag messages for s1 DBs in IRC.  I'm going
to juggle which DB watchlist queries go to during the migration, so nothing
should be noticeable on the site.

-Asher


Re: [Wikitech-l] enwiki revision schema migration in progress

2012-03-19 Thread MZMcBride
Asher Feldman wrote:
 Just a heads up that the last of the 1.19 migrations, adding the sha1
 column to enwiki.revision, is going to be running throughout this week.
 Don't be alarmed by replication lag messages for s1 DBs in IRC.  I'm going
 to juggle which DB watchlist queries go to during the migration, so nothing
 should be noticeable on the site.

Hi.

This is causing significant problems for the scripts that I run that rely on
the MediaWiki API. The scripts were designed to be respectful of database
lag, but with db36 reporting 20879 seconds of lag (and rising), they are all
hanging as they patiently wait for the lag to go back down.

I'm not sure of the exact configuration, but it seems like nearly every API
request is being handled by the lagged server (db36)? Or perhaps my scripts
just have terrible luck.

In any case, is it possible to depool the server or do something to mitigate
this? I don't think we want to encourage script writers to write scripts
that simply ignore the lag, but there aren't many other options when it's
outrageously high. Alternatively, could the other servers be assigned to API
duty?

MZMcBride





Re: [Wikitech-l] enwiki revision schema migration in progress

2012-03-19 Thread MZMcBride
MZMcBride wrote:
 I'm not sure of the exact configuration, but it seems like nearly every API
 request is being handled by the lagged server (db36)? Or perhaps my scripts
 just have terrible luck.

I added some prints to the code. Different servers are responding, but
they're all unable to get past the lag, apparently:

{u'servedby': u'srv234', u'error': {u'info': u'Waiting for 10.0.6.46: 21948
seconds lagged', u'code': u'maxlag'}}
{u'servedby': u'srv242', u'error': {u'info': u'Waiting for 10.0.6.46: 21982
seconds lagged', u'code': u'maxlag'}}
{u'servedby': u'mw20', u'error': {u'info': u'Waiting for 10.0.6.46: 21984
seconds lagged', u'code': u'maxlag'}}
{u'servedby': u'mw45', u'error': {u'info': u'Waiting for 10.0.6.46: 21986
seconds lagged', u'code': u'maxlag'}}
{u'servedby': u'mw14', u'error': {u'info': u'Waiting for 10.0.6.46: 21988
seconds lagged', u'code': u'maxlag'}}
{u'servedby': u'mw42', u'error': {u'info': u'Waiting for 10.0.6.46: 21989
seconds lagged', u'code': u'maxlag'}}
{u'servedby': u'mw3', u'error': {u'info': u'Waiting for 10.0.6.46: 21991
seconds lagged', u'code': u'maxlag'}}
{u'servedby': u'srv230', u'error': {u'info': u'Waiting for 10.0.6.46: 22005
seconds lagged', u'code': u'maxlag'}}
{u'servedby': u'srv259', u'error': {u'info': u'Waiting for 10.0.6.46: 22006
seconds lagged', u'code': u'maxlag'}}
{u'servedby': u'srv274', u'error': {u'info': u'Waiting for 10.0.6.46: 22008
seconds lagged', u'code': u'maxlag'}}
{u'servedby': u'srv280', u'error': {u'info': u'Waiting for 10.0.6.46: 22009
seconds lagged', u'code': u'maxlag'}}
{u'servedby': u'srv236', u'error': {u'info': u'Waiting for 10.0.6.46: 22010
seconds lagged', u'code': u'maxlag'}}
{u'servedby': u'srv230', u'error': {u'info': u'Waiting for 10.0.6.46: 22011
seconds lagged', u'code': u'maxlag'}}

And it goes on and on.

The relevant branch of code is:

---
# (method excerpt; json, re, and time are imported at module level)
def __parseJSON(self, data):
  maxlag = True
  while maxlag:
    try:
      maxlag = False
      parsed = json.loads(data.read())
      content = None
      if isinstance(parsed, dict):
        content = APIResult(parsed)
        content.response = self.response.items()
      elif isinstance(parsed, list):
        content = APIListResult(parsed)
        content.response = self.response.items()
      else:
        content = parsed
      if 'error' in content:
        error = content['error']['code']
        if error == 'maxlag':
          lagtime = int(re.search("(\d+) seconds",
            content['error']['info']).group(1))
          if lagtime > self.wiki.maxwaittime:
            lagtime = self.wiki.maxwaittime
          print("Server lag, sleeping for "+str(lagtime)+" seconds")
          maxlag = True
          time.sleep(int(lagtime)+0.5)
    except ValueError:
      # (the except clause was elided in this excerpt; restored
      # minimally here so the snippet parses)
      raise
  return False
---
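
(For context, the request side of this loop looks roughly like the
following - a minimal, hypothetical example using Python 3's standard
library, not my script's actual request code; the maxlag parameter and
error shape match the output above:)

---
import json
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({
    'action': 'query', 'meta': 'siteinfo', 'format': 'json',
    'maxlag': 5,  # ask the API to refuse work if replication lag > 5s
})
response = urllib.request.urlopen('https://en.wikipedia.org/w/api.php?' + params)
result = json.loads(response.read())
if 'error' in result and result['error']['code'] == 'maxlag':
    print('Lagged, should sleep and retry: ' + result['error']['info'])
---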

MZMcBride





Re: [Wikitech-l] enwiki revision schema migration in progress

2012-03-19 Thread Asher Feldman
I've temporarily commented out db36 from db.php on the cluster.

This is a flaw in how the client-side use of maxlag interacts with our
schema migration process - we run migrations on slaves one by one in an
automated fashion, only moving to the next after replication lag catches
up.  MediaWiki takes care of not sending queries to the lagged slave that
is under migration.  Meanwhile, maxlag always reports the value of the most
lagged slave.  Not a new issue, but this particular alter table on enwiki
is likely the most time-intensive ever run at WMF.  It's slightly
ridiculous.
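
(Schematically, the automated process is something like the following - an
illustrative Python sketch; the function names, host list, and the exact
ALTER statement are stand-ins, not the actual WMF tooling:)

---
import time

def replication_lag(slave):
    # Stand-in: in reality this comes from SHOW SLAVE STATUS.
    return 0

def run_alter(slave, sql):
    print("running on %s: %s" % (slave, sql))

def migrate_one_by_one(slaves, sql, lag_ok=1):
    for slave in slaves:
        run_alter(slave, sql)
        # Only move on once this slave's replication lag catches up.
        while replication_lag(slave) > lag_ok:
            time.sleep(30)

migrate_one_by_one(['db32', 'db36', 'db52'],
                   "ALTER TABLE revision ADD rev_sha1 varbinary(32)")
---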

For this one alter, I can stop the migration script and run each statement
by hand, pulling and re-adding DBs one by one along the way, but this
isn't a sustainable process.  Perhaps we can add a migration flag to
MediaWiki which, if enabled, changes the behavior of maxlag and
wfWaitForSlaves() to ignore one highly lagged slave so long as others are
available without lag.

-A

On Mon, Mar 19, 2012 at 9:28 PM, MZMcBride z...@mzmcbride.com wrote:

 MZMcBride wrote:
  I'm not sure of the exact configuration, but it seems like nearly every API
  request is being handled by the lagged server (db36)? Or perhaps my scripts
  just have terrible luck.

 I added some prints to the code. Different servers are responding, but
 they're all unable to get past the lag, apparently:

 {u'servedby': u'srv234', u'error': {u'info': u'Waiting for 10.0.6.46: 21948
 seconds lagged', u'code': u'maxlag'}}
 {u'servedby': u'srv242', u'error': {u'info': u'Waiting for 10.0.6.46: 21982
 seconds lagged', u'code': u'maxlag'}}
 {u'servedby': u'mw20', u'error': {u'info': u'Waiting for 10.0.6.46: 21984
 seconds lagged', u'code': u'maxlag'}}
 {u'servedby': u'mw45', u'error': {u'info': u'Waiting for 10.0.6.46: 21986
 seconds lagged', u'code': u'maxlag'}}
 {u'servedby': u'mw14', u'error': {u'info': u'Waiting for 10.0.6.46: 21988
 seconds lagged', u'code': u'maxlag'}}
 {u'servedby': u'mw42', u'error': {u'info': u'Waiting for 10.0.6.46: 21989
 seconds lagged', u'code': u'maxlag'}}
 {u'servedby': u'mw3', u'error': {u'info': u'Waiting for 10.0.6.46: 21991
 seconds lagged', u'code': u'maxlag'}}
 {u'servedby': u'srv230', u'error': {u'info': u'Waiting for 10.0.6.46: 22005
 seconds lagged', u'code': u'maxlag'}}
 {u'servedby': u'srv259', u'error': {u'info': u'Waiting for 10.0.6.46: 22006
 seconds lagged', u'code': u'maxlag'}}
 {u'servedby': u'srv274', u'error': {u'info': u'Waiting for 10.0.6.46: 22008
 seconds lagged', u'code': u'maxlag'}}
 {u'servedby': u'srv280', u'error': {u'info': u'Waiting for 10.0.6.46: 22009
 seconds lagged', u'code': u'maxlag'}}
 {u'servedby': u'srv236', u'error': {u'info': u'Waiting for 10.0.6.46: 22010
 seconds lagged', u'code': u'maxlag'}}
 {u'servedby': u'srv230', u'error': {u'info': u'Waiting for 10.0.6.46: 22011
 seconds lagged', u'code': u'maxlag'}}

 And it goes on and on.

 The relevant branch of code is:

 ---
 # (method excerpt; json, re, and time are imported at module level)
 def __parseJSON(self, data):
   maxlag = True
   while maxlag:
     try:
       maxlag = False
       parsed = json.loads(data.read())
       content = None
       if isinstance(parsed, dict):
         content = APIResult(parsed)
         content.response = self.response.items()
       elif isinstance(parsed, list):
         content = APIListResult(parsed)
         content.response = self.response.items()
       else:
         content = parsed
       if 'error' in content:
         error = content['error']['code']
         if error == 'maxlag':
           lagtime = int(re.search("(\d+) seconds",
             content['error']['info']).group(1))
           if lagtime > self.wiki.maxwaittime:
             lagtime = self.wiki.maxwaittime
           print("Server lag, sleeping for "+str(lagtime)+" seconds")
           maxlag = True
           time.sleep(int(lagtime)+0.5)
     except ValueError:
       # (the except clause was elided in this excerpt; restored
       # minimally here so the snippet parses)
       raise
   return False
 ---

 MZMcBride



