https://bugzilla.wikimedia.org/show_bug.cgi?id=63693

            Bug ID: 63693
           Summary: Attempts of Hadoop tasks randomly fail "Bad connect
                    ack with firstBadLink as $SOME_CLUSTER_IP"
           Product: Analytics
           Version: unspecified
          Hardware: All
                OS: All
            Status: NEW
          Severity: normal
          Priority: Unprioritized
         Component: General/Unknown
          Assignee: [email protected]
          Reporter: [email protected]
                CC: [email protected], [email protected],
                    [email protected]
       Web browser: ---
   Mobile Platform: ---

Created attachment 15055
  --> https://bugzilla.wikimedia.org/attachment.cgi?id=15055&action=edit
diagram showing failed connection attempts of some jobs around 2014-04-08

Sporadically, an attempt of a Hadoop task fails with an error message
like

  Error: java.io.IOException: Bad connect ack with firstBadLink as 10.64.36.116:50010

See for example

  http://analytics1010.eqiad.wmnet:19888/jobhistory/attempts/job_1387838787660_2971/m/FAILED

The failed attempts are correctly restarted by Hadoop and eventually
succeed. But since the cluster is currently pretty clean and not under
heavy load from other jobs, I would not expect to see such failures at
all.

I cannot recall having seen this error message for Hive queries; so
far, I have only seen such failed attempts in tasks of Camus
webrequest importer jobs. However, it does not matter whether the job
is a full run importing the whole seven days' worth of mobile request
traffic (e.g. job_1387838787660_2971 above) or just an import of the
last hour (e.g.

  http://analytics1010.eqiad.wmnet:19888/jobhistory/attempts/job_1387838787660_2965/m/FAILED

).

I briefly scanned the attempts of recently run applications, and
connections to analytics10{11,16,17} seem to fail more often than
connections to other machines [1]. I am not sure whether this is a
real pattern, as it may depend on timing or scheduling, but it looks
strange. (See attachment failures.png for dot output of the failed
connection attempts.)





[1]
+---------------------------------------+---------------+---------------+
| Attempt | Source | Destination |
+---------------------------------------+---------------+---------------+
| attempt_1387838787660_2856_m_000006_0 | analytics1013 | analytics1016 |
| attempt_1387838787660_2859_m_000002_0 | analytics1013 | analytics1017 |
| attempt_1387838787660_2955_m_000003_1 | analytics1019 | analytics1011 |
| attempt_1387838787660_2955_m_000009_1 | analytics1011 | analytics1017 |
| attempt_1387838787660_2956_m_000003_1 | analytics1011 | analytics1016 |
| attempt_1387838787660_2956_m_000005_1 | analytics1013 | analytics1017 |
| attempt_1387838787660_2956_m_000006_1 | analytics1015 | analytics1011 |
| attempt_1387838787660_2956_m_000007_1 | analytics1017 | analytics1011 |
| attempt_1387838787660_2956_m_000008_0 | analytics1011 | analytics1018 |
| attempt_1387838787660_2971_m_000001_0 | analytics1012 | analytics1016 |
| attempt_1387838787660_2971_m_000003_1 | analytics1020 | analytics1011 |
| attempt_1387838787660_2971_m_000005_0 | analytics1018 | analytics1011 |
| attempt_1387838787660_2971_m_000007_1 | analytics1013 | analytics1016 |
| attempt_1387838787660_2971_m_000008_1 | analytics1015 | analytics1017 |
| attempt_1387838787660_2972_m_000003_0 | analytics1015 | analytics1011 |
+---------------------------------------+---------------+---------------+
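To make the suspected pattern easier to check, the rows of table [1] can be tallied per host. The following is only a minimal sketch: the names `ROWS` and `tally` are made up for illustration, and the data is copied by hand from the table above rather than pulled from the job history server.

```python
from collections import Counter

# Rows copied from table [1]: attempt id, source host, destination host.
ROWS = """\
attempt_1387838787660_2856_m_000006_0 analytics1013 analytics1016
attempt_1387838787660_2859_m_000002_0 analytics1013 analytics1017
attempt_1387838787660_2955_m_000003_1 analytics1019 analytics1011
attempt_1387838787660_2955_m_000009_1 analytics1011 analytics1017
attempt_1387838787660_2956_m_000003_1 analytics1011 analytics1016
attempt_1387838787660_2956_m_000005_1 analytics1013 analytics1017
attempt_1387838787660_2956_m_000006_1 analytics1015 analytics1011
attempt_1387838787660_2956_m_000007_1 analytics1017 analytics1011
attempt_1387838787660_2956_m_000008_0 analytics1011 analytics1018
attempt_1387838787660_2971_m_000001_0 analytics1012 analytics1016
attempt_1387838787660_2971_m_000003_1 analytics1020 analytics1011
attempt_1387838787660_2971_m_000005_0 analytics1018 analytics1011
attempt_1387838787660_2971_m_000007_1 analytics1013 analytics1016
attempt_1387838787660_2971_m_000008_1 analytics1015 analytics1017
attempt_1387838787660_2972_m_000003_0 analytics1015 analytics1011
"""


def tally(rows):
    """Count how often each host appears as source and as destination."""
    sources, destinations = Counter(), Counter()
    for line in rows.splitlines():
        if not line.strip():
            continue  # skip blank lines
        _attempt, src, dst = line.split()
        sources[src] += 1
        destinations[dst] += 1
    return sources, destinations


sources, destinations = tally(ROWS)

# Print destinations, most frequently failing first.
for host, count in destinations.most_common():
    print(host, count)
```

For the 15 attempts listed, analytics1011 is the destination 6 times, analytics1016 and analytics1017 4 times each, and analytics1018 once. This is a small sample, so it does not rule out a timing or scheduling effect.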
