https://bugzilla.wikimedia.org/show_bug.cgi?id=63470

            Bug ID: 63470
           Summary: analytics1012 fails Hadoop applications and jobs
           Product: Analytics
           Version: unspecified
          Hardware: All
                OS: All
            Status: NEW
          Severity: normal
          Priority: Unprioritized
         Component: General/Unknown
          Assignee: wikibugs-l@lists.wikimedia.org
          Reporter: christ...@quelltextlich.at
                CC: christ...@quelltextlich.at, oke...@wikimedia.org,
                    tneg...@wikimedia.org
       Web browser: ---
   Mobile Platform: ---

When walking through the Hadoop applications from early April 2014
(until 2014-04-03 09:00) on [1], it seems applications failed if and
only if they were started on analytics1012:8042 [2].

I also checked about a dozen succeeded applications (which were hence
started on nodes other than analytics1012:8042), and their subordinate
MapReduce jobs again failed if and only if they were run on
analytics1012:8042 [3].

Is there something wrong with analytics1012:8042?
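
The "if and only if" observation above can be checked mechanically
once the runs are written down. The following is only a minimal
sketch: the AppRun record and the failed_iff_on_node helper are
hypothetical names, and the sample rows are a hand-copied subset of
the applications listed in [2] and [3], not output from any Hadoop
API.

```python
from typing import Iterable, NamedTuple


class AppRun(NamedTuple):
    """One application as read off the cluster web UI (hypothetical layout)."""
    app_id: str
    node: str   # node the application was started on
    state: str  # "FAILED" or "SUCCEEDED"


def failed_iff_on_node(runs: Iterable[AppRun], node: str) -> bool:
    """True when every run failed exactly when it was started on `node`."""
    return all((r.state == "FAILED") == (r.node == node) for r in runs)


# Subset of the applications mentioned in this report, entered by hand.
runs = [
    AppRun("application_1387838787660_2843", "analytics1012:8042", "FAILED"),
    AppRun("application_1387838787660_2837", "analytics1012:8042", "FAILED"),
    AppRun("application_1387838787660_2796", "analytics1015:8042", "SUCCEEDED"),
]

print(failed_iff_on_node(runs, "analytics1012:8042"))
```

A single succeeded run on analytics1012:8042, or a single failed run
elsewhere, would make the check return False.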



[1] http://analytics1010.eqiad.wmnet:8088/cluster


[2] The URLs for the corresponding failed applications are
http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2843
http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2837
http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2836
http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2820
http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2798
http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2790
http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2788
http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2787
http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2786


[3] For example, application 1387838787660_2796 [4] was started on
analytics1015:8042 and hence succeeded. But it had one failed map
attempt, which was again on analytics1012:8042 [5].

Such failed subordinate MapReduce jobs on analytics1012:8042 fail
with timeout messages, for example:
  AttemptID:attempt_1387838787660_2796_m_000001_0 Timed out after 600 secs
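
The attempt ID in that message encodes which job and task the attempt
belonged to, which helps when matching it against the job history
page in [5]. Below is a small sketch that splits the diagnostic line
into its parts. The regular expression and the parse_timeout_line
helper are illustrative assumptions, not part of Hadoop. The 600
seconds would match Hadoop's default mapreduce.task.timeout of
600000 ms, assuming the cluster does not override it.

```python
import re
from typing import NamedTuple


class TimedOutAttempt(NamedTuple):
    cluster_timestamp: int  # shared with the application and job IDs
    job_number: int         # 2796 in job_1387838787660_2796
    task_type: str          # "m" for map, "r" for reduce
    task_number: int
    attempt_number: int
    timeout_secs: int

    @property
    def job_id(self) -> str:
        return f"job_{self.cluster_timestamp}_{self.job_number:04d}"


# Hypothetical pattern for the diagnostic line quoted in this report.
_LINE_RE = re.compile(
    r"AttemptID:attempt_(\d+)_(\d+)_([mr])_(\d+)_(\d+) "
    r"Timed out after (\d+) secs"
)


def parse_timeout_line(line: str) -> TimedOutAttempt:
    """Extract the attempt ID fields and the timeout from a diagnostic line."""
    match = _LINE_RE.search(line)
    if match is None:
        raise ValueError(f"not a timeout diagnostic: {line!r}")
    ts, job, kind, task, attempt, secs = match.groups()
    return TimedOutAttempt(int(ts), int(job), kind, int(task),
                           int(attempt), int(secs))


attempt = parse_timeout_line(
    "AttemptID:attempt_1387838787660_2796_m_000001_0 Timed out after 600 secs"
)
print(attempt.job_id, attempt.task_type, attempt.task_number,
      attempt.attempt_number, attempt.timeout_secs)
```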


[4]
http://analytics1010.eqiad.wmnet:8088/cluster/app/application_1387838787660_2796


[5]
http://analytics1010.eqiad.wmnet:19888/jobhistory/attempts/job_1387838787660_2796/m/FAILED
