jcrespo created this task.
jcrespo added projects: Wikimedia-log-errors, Wikidata, DBA, MediaWiki-JobRunner.
Herald added a subscriber: Aklapper.

TASK DESCRIPTION

There is a high number of connection errors to 10.64.16.144 (db1049, or s5-master), probably caused by a high number of connections, such as:

{
  "_index": "logstash-2016.09.19",
  "_type": "mediawiki",
  "_id": "AVdD7nZU2B4w3SKhWGu0",
  "_score": null,
  "_source": {
    "message": "Error connecting to 10.64.16.144: Can't connect to MySQL server on '10.64.16.144' (4)",
    "@version": 1,
    "@timestamp": "2016-09-19T19:31:23.000Z",
    "type": "mediawiki",
    "host": "mw1299",
    "level": "ERROR",
    "tags": [
      "syslog",
      "es",
      "es"
    ],
    "channel": "wfLogDBError",
    "normalized_message": "Error connecting to {db_server}: {error}",
    "url": "/rpc/RunJobs.php?wiki=dewiki&type=wikibase-addUsagesForPage&maxtime=60&maxmem=300M",
    "ip": "127.0.0.1",
    "http_method": "POST",
    "server": "127.0.0.1",
    "referrer": null,
    "wiki": "dewiki",
    "mwversion": "1.28.0-wmf.18",
    "reqId": "a7a0b44622b72ad31d160f12",
    "db_server": "10.64.16.144",
    "db_name": "dewiki",
    "db_user": "wikiuser",
    "method": "DatabaseMysqlBase::open",
    "error": "Can't connect to MySQL server on '10.64.16.144' (4)"
  },
  "fields": {
    "@timestamp": [
      1474313483000
    ]
  },
  "highlight": {
    "channel.raw": [
      "@kibana-highlighted-field@wfLogDBError@/kibana-highlighted-field@"
    ]
  },
  "sort": [
    1474313483000
  ]
}

This job could be the cause, or it could just be a consequence: the job is very common, so it would show up in the errors either way.
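One way to check whether this job type is over-represented would be to tally the `type` query parameter of the `url` field across exported error records. The following is only a minimal sketch: it assumes the records have the same shape as the one above, and the helper names `job_type` and `count_job_types` are made up for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qs


def job_type(record):
    """Return the `type` query parameter of the record's url, or None if absent."""
    url = record.get("_source", {}).get("url") or ""
    values = parse_qs(urlsplit(url).query).get("type")
    return values[0] if values else None


def count_job_types(records):
    """Count how many error records belong to each job type."""
    return Counter(job_type(r) for r in records)


# Single record trimmed from the log entry above; a real check would need
# the full export behind the logstash link.
sample = {
    "_source": {
        "url": "/rpc/RunJobs.php?wiki=dewiki&type=wikibase-addUsagesForPage&maxtime=60&maxmem=300M",
        "db_server": "10.64.16.144",
    }
}

if __name__ == "__main__":
    for name, count in count_job_types([sample]).most_common():
        print(name, count)
```

Comparing these counts with the overall share of each job type in the job queue would be needed to tell cause from consequence; the error counts alone cannot.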

Here is a sample of errors: https://logstash.wikimedia.org/goto/11cd5759017b61d371035de09a41531c

The number was already high before, but at 17:00 UTC today there was a spike of 2000 errors in 5 minutes, followed by a continuous tail of ~100 errors per 5 minutes. This could be just a spike in activity that will disappear, or it could be something substantial (a code pattern change).
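The per-5-minute figures can be reproduced from an export of the error records by flooring each `@timestamp` to its 5-minute window and counting. This is a rough sketch, assuming timestamps in the format shown in the record above; `bucket_start` and `errors_per_bucket` are illustrative names, not part of any existing tool.

```python
from collections import Counter
from datetime import datetime, timezone

BUCKET_SECONDS = 5 * 60


def bucket_start(timestamp):
    """Floor a timestamp like 2016-09-19T19:31:23.000Z to the start of its 5-minute window."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S.%fZ")
    epoch = int(parsed.replace(tzinfo=timezone.utc).timestamp())
    return datetime.fromtimestamp(epoch - epoch % BUCKET_SECONDS, tz=timezone.utc)


def errors_per_bucket(timestamps):
    """Count error records per 5-minute window."""
    return Counter(bucket_start(t) for t in timestamps)


if __name__ == "__main__":
    # Only the first timestamp comes from the record above; the other two
    # are invented to show the grouping.
    example = [
        "2016-09-19T19:31:23.000Z",
        "2016-09-19T19:33:05.000Z",
        "2016-09-19T19:36:40.000Z",
    ]
    for start, count in sorted(errors_per_bucket(example).items()):
        print(start.isoformat(), count)
```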

75% of current database errors are connection errors to this server, which is not normal. Looking at https://grafana-admin.wikimedia.org/dashboard/db/mysql?from=1473709162759&to=1474313962759&var-dc=eqiad%20prometheus%2Fops&var-server=db1049 I can see a pattern change at 6-8 UTC today and a sharp increase at 17 UTC, but I see nothing strange on the infrastructure side.


TASK DETAIL
https://phabricator.wikimedia.org/T146079

To: jcrespo
Cc: hoo, Aklapper, jcrespo, Marostegui, Minhnv-2809, D3r1ck01, Izno, Luke081515, Wikidata-bugs, aude, Mbch331, Jay8g, Krenair