https://bugzilla.wikimedia.org/show_bug.cgi?id=63693
Bug ID: 63693
Summary: Attempts of Hadoop tasks randomly fail "Bad connect
ack with firstBadLink as $SOME_CLUSTER_IP"
Product: Analytics
Version: unspecified
Hardware: All
OS: All
Status: NEW
Severity: normal
Priority: Unprioritized
Component: General/Unknown
Assignee: [email protected]
Reporter: [email protected]
CC: [email protected], [email protected],
[email protected]
Web browser: ---
Mobile Platform: ---
Created attachment 15055
--> https://bugzilla.wikimedia.org/attachment.cgi?id=15055&action=edit
diagram showing failed connection attempts of some jobs around 2014-04-08
Sporadically, an attempt of a Hadoop task fails with an error message
like
Error: java.io.IOException: Bad connect ack with firstBadLink as
10.64.36.116:50010
See for example
http://analytics1010.eqiad.wmnet:19888/jobhistory/attempts/job_1387838787660_2971/m/FAILED
The failed attempts are correctly restarted by Hadoop and eventually
succeed. But as the cluster is now pretty clean and not under heavy
load from different jobs, I would not expect to see the above failures
at all.
I cannot recall having seen this error message for Hive queries; so
far, I have only seen such failed attempts in tasks of Camus webrequest
importer jobs. However, it does not matter whether the job is a full
run importing the whole seven days' worth of mobile request traffic
(e.g. job_1387838787660_2971 above) or just an import of the last
hour (e.g.
http://analytics1010.eqiad.wmnet:19888/jobhistory/attempts/job_1387838787660_2965/m/FAILED
).
I briefly scanned the attempts of recently run applications, and
connections to analytics10{11,16,17} seem to fail more often than
connections to other machines [1]. I am not sure whether this is a
misinterpretation, as it may depend on timing/scheduling, but it looks
strange. (See attachment failures.png for dot output of the failed
connection attempts.)
[1]
+---------------------------------------+---------------+---------------+
| Attempt | Source | Destination |
+---------------------------------------+---------------+---------------+
| attempt_1387838787660_2856_m_000006_0 | analytics1013 | analytics1016 |
| attempt_1387838787660_2859_m_000002_0 | analytics1013 | analytics1017 |
| attempt_1387838787660_2955_m_000003_1 | analytics1019 | analytics1011 |
| attempt_1387838787660_2955_m_000009_1 | analytics1011 | analytics1017 |
| attempt_1387838787660_2956_m_000003_1 | analytics1011 | analytics1016 |
| attempt_1387838787660_2956_m_000005_1 | analytics1013 | analytics1017 |
| attempt_1387838787660_2956_m_000006_1 | analytics1015 | analytics1011 |
| attempt_1387838787660_2956_m_000007_1 | analytics1017 | analytics1011 |
| attempt_1387838787660_2956_m_000008_0 | analytics1011 | analytics1018 |
| attempt_1387838787660_2971_m_000001_0 | analytics1012 | analytics1016 |
| attempt_1387838787660_2971_m_000003_1 | analytics1020 | analytics1011 |
| attempt_1387838787660_2971_m_000005_0 | analytics1018 | analytics1011 |
| attempt_1387838787660_2971_m_000007_1 | analytics1013 | analytics1016 |
| attempt_1387838787660_2971_m_000008_1 | analytics1015 | analytics1017 |
| attempt_1387838787660_2972_m_000003_0 | analytics1015 | analytics1011 |
+---------------------------------------+---------------+---------------+
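
To check the impression that a few hosts dominate, the rows of [1] can be tallied per host. The following is only a minimal sketch: the (source, destination) pairs are copied by hand from the table above, and the names FAILED_ATTEMPTS and tally are made up for illustration, not part of any Hadoop tooling.

```python
from collections import Counter

# (source, destination) pairs copied by hand from table [1], in row order.
# The short form "1013" stands for host analytics1013, and so on.
FAILED_ATTEMPTS = [
    ("1013", "1016"), ("1013", "1017"), ("1019", "1011"),
    ("1011", "1017"), ("1011", "1016"), ("1013", "1017"),
    ("1015", "1011"), ("1017", "1011"), ("1011", "1018"),
    ("1012", "1016"), ("1020", "1011"), ("1018", "1011"),
    ("1013", "1016"), ("1015", "1017"), ("1015", "1011"),
]


def tally(pairs):
    """Count failed attempts per source host and per destination host."""
    by_source = Counter("analytics" + src for src, _ in pairs)
    by_destination = Counter("analytics" + dst for _, dst in pairs)
    return by_source, by_destination


if __name__ == "__main__":
    by_source, by_destination = tally(FAILED_ATTEMPTS)
    print("Failures by destination:")
    for host, count in by_destination.most_common():
        print(f"  {host}: {count}")
    print("Failures by source:")
    for host, count in by_source.most_common():
        print(f"  {host}: {count}")
```

With only 15 rows, the tally can at most hint at a pattern; it does not account for how many connections each host handled overall.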