Thanks, it’s 0.6.0 and it looks like it is in that state forever. Job got killed after 7 hours.
Johannes > On 11 Aug 2015, at 11:21, Rajesh Balamohan <[email protected]> wrote: > > Is it 0.6.0 or 0.6.1? Does that fetcher get locked up in Status.WAIT forever? > > This could be related to https://issues.apache.org/jira/browse/TEZ-1923 > <https://issues.apache.org/jira/browse/TEZ-1923> (fix available in 0.6.1). > > ~Rajesh.B > > On Tue, Aug 11, 2015 at 1:48 PM, Johannes Zillmann <[email protected] > <mailto:[email protected]>> wrote: > Hey Tez Community, > > a customer of ours has following situation. He sets up a computation pipeline > in a way that 2 independent graphs are running in an Tez DAG. > Result: the job is still running after a half day with only alive task > running, seems to running forever. > > If he splits the pipeline that each of the independent graph are executed in > their own DAG, then each of the DAG’s are completed in one or two hours. > > Here is some initial logging which might be of interest of that one > never-ending task: > 2015-08-03 13:20:12,352 INFO [Initializer 1] orderedgrouped.Shuffle: Shuffle > assigned with 55 inputs, codec: NoneifileReadAhead: true > 2015-08-03 13:20:12,353 INFO [Initializer 1] orderedgrouped.ShuffleScheduler: > ShuffleScheduler running for sourceVertex: MapChainVertex:Workbook job (184): > ClickPairs#ClikDataWithPlayerSession(Disconnected record stream) > (4ae490d9-9d88-4ccd-a236-1d4dddb56060) with configuration: > maxFetchFailuresBeforeReporting=5, reportReadErrorImmediately=true, > maxFailedUniqueFetches=5, abortFailureLimit=30, maxMapRuntime=0 > 2015-08-03 13:20:12,353 INFO [Initializer 1] orderedgrouped.MergeManager: > InitialRequest: ShuffleMem=1932735232 <tel:1932735232>, postMergeMem=0, > RuntimeTotalAvailable=1932735232 <tel:1932735232>. Updated to: > ShuffleMem=1932735232 <tel:1932735232>, postMergeMem: 0 > 2015-08-03 13:20:12,353 INFO [Initializer 1] orderedgrouped.MergeManager: > MergerManager: memoryLimit=1932735232 <tel:1932735232>, > maxSingleShuffleLimit=483183808, mergeThreshold=1739461632 <tel:1739461632>, > ioSortFactor=100, memToMemMergeOutputsThreshold=100 > > > And then there are hours full of those things: > 2015-08-03 13:33:59,911 INFO [fetcher > [MapChainVertex_Workbook_job__184___Click] #36] > orderedgrouped.FetcherOrderedGrouped: fetcher#36 - MergerManager returned > Status.WAIT ... > 2015-08-03 13:33:59,911 INFO [fetcher > [MapChainVertex_Workbook_job__184___Click] #36] > orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by > fetcher [MapChainVertex_Workbook_job__184___Click] #36 in 3ms > 2015-08-03 13:33:59,913 INFO [fetcher > [MapChainVertex_Workbook_job__184___Click] #36] shuffle.HttpConnection: for > url=http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000038_0_10015,attempt_1438596625066_0001_2_00_000039_0_10006,attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006 > > <http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000038_0_10015,attempt_1438596625066_0001_2_00_000039_0_10006,attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006> > sent hash and receievd reply 0 ms > 2015-08-03 13:33:59,913 INFO [fetcher > [MapChainVertex_Workbook_job__184___Click] #36] > orderedgrouped.FetcherOrderedGrouped: fetcher#36 - MergerManager returned > Status.WAIT ... > 2015-08-03 13:33:59,913 INFO [fetcher > [MapChainVertex_Workbook_job__184___Click] #36] > orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by > fetcher [MapChainVertex_Workbook_job__184___Click] #36 in 2ms > 2015-08-03 13:33:59,914 INFO [fetcher > [MapChainVertex_Workbook_job__184___Click] #36] shuffle.HttpConnection: for > url=http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000039_0_10006,attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006,attempt_1438596625066_0001_2_00_000038_0_10015 > > <http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000039_0_10006,attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006,attempt_1438596625066_0001_2_00_000038_0_10015> > sent hash and receievd reply 0 ms > 2015-08-03 13:33:59,914 INFO [fetcher > [MapChainVertex_Workbook_job__184___Click] #36] > orderedgrouped.FetcherOrderedGrouped: fetcher#36 - MergerManager returned > Status.WAIT ... > 2015-08-03 13:33:59,914 INFO [fetcher > [MapChainVertex_Workbook_job__184___Click] #36] > orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by > fetcher [MapChainVertex_Workbook_job__184___Click] #36 in 1ms > 2015-08-03 13:33:59,916 INFO [fetcher > [MapChainVertex_Workbook_job__184___Click] #36] shuffle.HttpConnection: for > url=http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006,attempt_1438596625066_0001_2_00_000038_0_10015,attempt_1438596625066_0001_2_00_000039_0_10006 > > <http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006,attempt_1438596625066_0001_2_00_000038_0_10015,attempt_1438596625066_0001_2_00_000039_0_10006> > sent hash and receievd reply 0 ms > 2015-08-03 13:33:59,916 INFO [fetcher > [MapChainVertex_Workbook_job__184___Click] #36] > orderedgrouped.FetcherOrderedGrouped: fetcher#36 - MergerManager returned > Status.WAIT ... > 2015-08-03 13:33:59,916 INFO [fetcher > [MapChainVertex_Workbook_job__184___Click] #36] > orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by > fetcher [MapChainVertex_Workbook_job__184___Click] #36 in 2ms > 2015-08-03 13:33:59,917 INFO [fetcher > [MapChainVertex_Workbook_job__184___Click] #36] shuffle.HttpConnection: for > url=http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000036_0_10006,attempt_1438596625066_0001_2_00_000038_0_10015,attempt_1438596625066_0001_2_00_000039_0_10006,attempt_1438596625066_0001_2_00_000028_0_10012 > > <http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000036_0_10006,attempt_1438596625066_0001_2_00_000038_0_10015,attempt_1438596625066_0001_2_00_000039_0_10006,attempt_1438596625066_0001_2_00_000028_0_10012> > sent hash and receievd reply 0 ms > 2015-08-03 13:33:59,917 INFO [fetcher > [MapChainVertex_Workbook_job__184___Click] #36] > orderedgrouped.FetcherOrderedGrouped: fetcher#36 - MergerManager returned > Status.WAIT ... > 2015-08-03 13:33:59,917 INFO [fetcher > [MapChainVertex_Workbook_job__184___Click] #36] > orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by > fetcher [MapChainVertex_Workbook_job__184___Click] #36 in 1ms > 2015-08-03 13:33:59,919 INFO [fetcher > [MapChainVertex_Workbook_job__184___Click] #36] shuffle.HttpConnection: for > url=http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000038_0_10015,attempt_1438596625066_0001_2_00_000039_0_10006,attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006 > > <http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000038_0_10015,attempt_1438596625066_0001_2_00_000039_0_10006,attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006> > sent hash and receievd reply 0 ms > 2015-08-03 13:33:59,919 INFO [fetcher > [MapChainVertex_Workbook_job__184___Click] #36] > orderedgrouped.FetcherOrderedGrouped: fetcher#36 - MergerManager returned > Status.WAIT ... > 2015-08-03 13:33:59,919 INFO [fetcher > [MapChainVertex_Workbook_job__184___Click] #36] > orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by > fetcher [MapChainVertex_Workbook_job__184___Click] #36 in 2ms > 2015-08-03 13:33:59,920 INFO [fetcher > [MapChainVertex_Workbook_job__184___Click] #36] shuffle.HttpConnection: for > url=http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000039_0_10006,attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006,attempt_1438596625066_0001_2_00_000038_0_10015 > > <http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000039_0_10006,attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006,attempt_1438596625066_0001_2_00_000038_0_10015> > sent hash and receievd reply 0 ms > 2015-08-03 13:33:59,920 INFO [fetcher > [MapChainVertex_Workbook_job__184___Click] #36] > orderedgrouped.FetcherOrderedGrouped: fetcher#36 - MergerManager returned > Status.WAIT ... > 2015-08-03 13:33:59,920 INFO [fetcher > [MapChainVertex_Workbook_job__184___Click] #36] > orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by > fetcher [MapChainVertex_Workbook_job__184___Click] #36 in 1ms > 2015-08-03 13:33:59,922 INFO [fetcher > [MapChainVertex_Workbook_job__184___Click] #25] shuffle.HttpConnection: for > url=http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006,attempt_1438596625066_0001_2_00_000038_0_10015,attempt_1438596625066_0001_2_00_000039_0_10006 > > <http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006,attempt_1438596625066_0001_2_00_000038_0_10015,attempt_1438596625066_0001_2_00_000039_0_10006> > sent hash and receievd reply 0 ms > 2015-08-03 13:33:59,922 INFO [fetcher > [MapChainVertex_Workbook_job__184___Click] #25] > orderedgrouped.FetcherOrderedGrouped: fetcher#25 - MergerManager returned > Status.WAIT ... > 2015-08-03 13:33:59,922 INFO [fetcher > [MapChainVertex_Workbook_job__184___Click] #25] > orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by > fetcher [MapChainVertex_Workbook_job__184___Click] #25 in 2ms > 2015-08-03 13:34:06,125 INFO [fetcher > [MapChainVertex_Workbook_job__184___Click] #33] shuffle.HttpConnection: for > url=http://instance-8865.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000029_0_10039,attempt_1438596625066_0001_2_00_000052_0_10051,attempt_1438596625066_0001_2_00_000045_0_10015,attempt_1438596625066_0001_2_00_000049_0_10045,attempt_1438596625066_0001_2_00_000017_0_10009 > > <http://instance-8865.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000029_0_10039,attempt_1438596625066_0001_2_00_000052_0_10051,attempt_1438596625066_0001_2_00_000045_0_10015,attempt_1438596625066_0001_2_00_000049_0_10045,attempt_1438596625066_0001_2_00_000017_0_10009> > sent hash and receievd reply 0 ms > > Focusing on an individual fetcher it looks like this over time: > 2015-08-03 13:46:45,977 INFO [fetcher > [MapChainVertex_Workbook_job__184___Click] #31] > orderedgrouped.FetcherOrderedGrouped: fetcher#31 - MergerManager returned > Status.WAIT … > … > 2015-08-03 14:39:44,974 INFO [fetcher > [MapChainVertex_Workbook_job__184___Click] #31] > orderedgrouped.FetcherOrderedGrouped: fetcher#31 - MergerManager returned > Status.WAIT … > … > 2015-08-03 15:27:32,739 INFO [fetcher > [MapChainVertex_Workbook_job__184___Click] #31] > orderedgrouped.FetcherOrderedGrouped: fetcher#31 - MergerManager returned > Status.WAIT … > … > 2015-08-03 17:42:28,354 INFO [fetcher > [MapChainVertex_Workbook_job__184___Click] #31] > orderedgrouped.FetcherOrderedGrouped: fetcher#31 - MergerManager returned > Status.WAIT … > > > That behaviour is consistent for the customer. > Using Tez-0.6… > Any ideas what could cause this ? > > Johannes >
