https://bugzilla.wikimedia.org/show_bug.cgi?id=63470
Bug ID: 63470
Summary: analytics1012 fails Hadoop applications and jobs
Product: Analytics
Version: unspecified
Hardware: All
OS: All
Status: NEW
Severity: normal
Priority: Unprioritized
Component: General/Unknown
Assignee: [email protected]
Reporter: [email protected]
CC: [email protected], [email protected],
[email protected]
Web browser: ---
Mobile Platform: ---
While walking through the Hadoop applications from early April 2014
(until 2014-04-03 09:00) on [1], I noticed that applications seem to
have failed if and only if they were started on analytics1012:8042 [2].
I also checked about a dozen successful applications (hence started on
nodes other than analytics1012:8042), and their subordinate
MapReduce jobs again failed if and only if they were run on
analytics1012:8042 [3].
Is there something wrong with analytics1012:8042?
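
The "if and only if" observation above can be checked mechanically once
each run is recorded as a (node, status) pair. The following is only a
minimal sketch: the function names are hypothetical, the status strings
"FAILED" and "SUCCEEDED" are assumed labels, and the sample records are
limited to three applications named in this report.

```python
from collections import Counter


def failures_by_node(records):
    """Return {node: (failed_count, total_count)} for (node, status) pairs."""
    totals = Counter()
    failed = Counter()
    for node, status in records:
        totals[node] += 1
        if status == "FAILED":
            failed[node] += 1
    return {node: (failed[node], totals[node]) for node in totals}


def fails_iff_on_node(records, suspect):
    """True if every run failed exactly when it ran on the suspect node."""
    return all((status == "FAILED") == (node == suspect)
               for node, status in records)


# Sample records taken from this report (assumed status labels):
# applications _2843 and _2837 failed on analytics1012:8042,
# application _2796 succeeded on analytics1015:8042.
records = [
    ("analytics1012:8042", "FAILED"),
    ("analytics1012:8042", "FAILED"),
    ("analytics1015:8042", "SUCCEEDED"),
]

if __name__ == "__main__":
    print(failures_by_node(records))
    print(fails_iff_on_node(records, "analytics1012:8042"))
```

A result of True from fails_iff_on_node only says the sampled runs are
consistent with the node being the common factor; it does not explain
why the node fails.
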
[1] http://analytics1010.eqiad.wmnet:8088/cluster
[2] The URLs for the corresponding failed applications are:
http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2843
http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2837
http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2836
http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2820
http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2798
http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2790
http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2788
http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2787
http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2786
[3] For example, application 1387838787660_2796 [4] was started on
analytics1015:8042 and hence succeeded. However, it had one failed map
attempt, which again ran on analytics1012:8042 [5].
Such failed subordinate MapReduce jobs on analytics1012:8042 fail
with timeout messages, for example:
AttemptID:attempt_1387838787660_2796_m_000001_0 Timed out after 600 secs
[4]
http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2796
[5]
http://analytics1010.eqiad.wmnet:19888/jobhistory/attempts/job_1387838787660_2796/m/FAILED
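
To collect such timeout notes across many attempts, the diagnostic line
can be split into its parts. This is a rough sketch, assuming that other
messages follow the same shape as the single example quoted above; the
function name and the returned keys are hypothetical. The 600 seconds
matches the default mapreduce.task.timeout of 600000 ms, so the attempt
was most likely killed for not reporting progress.

```python
import re

# Pattern modeled on the one message quoted in this report; other
# diagnostics may be worded differently (assumption).
ATTEMPT_TIMEOUT = re.compile(
    r"AttemptID:(attempt_(\d+)_(\d+)_([mr])_(\d+)_(\d+)) "
    r"Timed out after (\d+) secs"
)


def parse_timeout(line):
    """Return the parts of a timeout diagnostic, or None if it does not match."""
    match = ATTEMPT_TIMEOUT.search(line)
    if match is None:
        return None
    return {
        "attempt_id": match.group(1),
        # Job ID rebuilt from the cluster timestamp and job sequence number.
        "job_id": "job_{}_{}".format(match.group(2), match.group(3)),
        "task_type": "map" if match.group(4) == "m" else "reduce",
        "task_index": int(match.group(5)),
        "attempt_number": int(match.group(6)),
        "timeout_secs": int(match.group(7)),
    }


if __name__ == "__main__":
    line = "AttemptID:attempt_1387838787660_2796_m_000001_0 Timed out after 600 secs"
    print(parse_timeout(line))
```

The rebuilt job ID for the quoted line is job_1387838787660_2796, which
is the job that appears in the job history URL [5].
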