https://bugzilla.wikimedia.org/show_bug.cgi?id=63470
Bug ID: 63470
Summary: analytics1012 fails Hadoop applications and jobs
Product: Analytics
Version: unspecified
Hardware: All
OS: All
Status: NEW
Severity: normal
Priority: Unprioritized
Component: General/Unknown
Assignee: wikibugs-l@lists.wikimedia.org
Reporter: christ...@quelltextlich.at
CC: christ...@quelltextlich.at, oke...@wikimedia.org, tneg...@wikimedia.org
Web browser: ---
Mobile Platform: ---

When walking through the Hadoop applications from early April 2014
(until 2014-04-03 09:00) on [1], it seems that applications failed if and
only if they were started on analytics1012:8042 [2].

I also checked about a dozen applications that succeeded (and hence were
started on nodes other than analytics1012:8042), and their subordinate
MapReduce jobs again failed if and only if they were run on
analytics1012:8042 [3].

Is there something wrong with analytics1012:8042?

[1] http://analytics1010.eqiad.wmnet:8088/cluster

[2] The URLs for the corresponding failed applications are:
    http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2843
    http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2837
    http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2836
    http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2820
    http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2798
    http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2790
    http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2788
    http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2787
    http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2786

[3] For example, application 1387838787660_2796 [4] was started on
    analytics1015:8042 and hence succeeded. But it had one failed map
    attempt, which was again on analytics1012:8042 [5]. Such failed
    subordinate MapReduce jobs on analytics1012:8042 fail with notes
    about timeouts.
    For example:

      AttemptID:attempt_1387838787660_2796_m_000001_0 Timed out after 600 secs

[4] http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2796

[5] http://analytics1010.eqiad.wmnet:19888/jobhistory/attempts/job_1387838787660_2796/m/FAILED
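
The "failed if and only if started on analytics1012:8042" observation could be re-checked with a small script. What follows is only a minimal sketch: the record layout, the state strings and the names `RECORDS`, `fails_iff_on_node` and `failures_per_node` are assumptions made for illustration, and the sample records are limited to the applications named in this report (the nine failed ones from [2] and the succeeded one from [4]).

```python
from collections import Counter

SUSPECT_NODE = "analytics1012:8042"

# Hypothetical record layout: (application id, node it was started on, final state).
# Only applications mentioned in this report are listed.
RECORDS = [
    ("application_1387838787660_2843", "analytics1012:8042", "FAILED"),
    ("application_1387838787660_2837", "analytics1012:8042", "FAILED"),
    ("application_1387838787660_2836", "analytics1012:8042", "FAILED"),
    ("application_1387838787660_2820", "analytics1012:8042", "FAILED"),
    ("application_1387838787660_2798", "analytics1012:8042", "FAILED"),
    ("application_1387838787660_2790", "analytics1012:8042", "FAILED"),
    ("application_1387838787660_2788", "analytics1012:8042", "FAILED"),
    ("application_1387838787660_2787", "analytics1012:8042", "FAILED"),
    ("application_1387838787660_2786", "analytics1012:8042", "FAILED"),
    ("application_1387838787660_2796", "analytics1015:8042", "SUCCEEDED"),
]


def fails_iff_on_node(records, node):
    """Return True if each record failed exactly when it ran on `node`."""
    return all(
        (rec_node == node) == (state == "FAILED")
        for _, rec_node, state in records
    )


def failures_per_node(records):
    """Count failed records per node."""
    return Counter(
        rec_node for _, rec_node, state in records if state == "FAILED"
    )


if __name__ == "__main__":
    print(fails_iff_on_node(RECORDS, SUSPECT_NODE))
    print(failures_per_node(RECORDS))
```

A single failure on another node, or a single success on analytics1012:8042, would make `fails_iff_on_node` return False, which is what makes it a stricter check than just counting failures per node.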