Hey Tez Community,

a customer of ours has following situation. He sets up a computation pipeline 
in a way that 2 independent graphs are running in an Tez DAG.
Result: the job is still running after a half day with only alive task running, 
seems to running forever. 

If he splits the pipeline that each of the independent graph are executed in 
their own DAG, then each of the DAG’s are completed in one or two hours.

Here is some initial logging which might be of interest of that one 
never-ending task:
2015-08-03 13:20:12,352 INFO [Initializer 1] orderedgrouped.Shuffle: Shuffle 
assigned with 55 inputs, codec: NoneifileReadAhead: true
2015-08-03 13:20:12,353 INFO [Initializer 1] orderedgrouped.ShuffleScheduler: 
ShuffleScheduler running for sourceVertex: MapChainVertex:Workbook job (184): 
ClickPairs#ClikDataWithPlayerSession(Disconnected record stream) 
(4ae490d9-9d88-4ccd-a236-1d4dddb56060) with configuration: 
maxFetchFailuresBeforeReporting=5, reportReadErrorImmediately=true, 
maxFailedUniqueFetches=5, abortFailureLimit=30, maxMapRuntime=0
2015-08-03 13:20:12,353 INFO [Initializer 1] orderedgrouped.MergeManager: 
InitialRequest: ShuffleMem=1932735232, postMergeMem=0, 
RuntimeTotalAvailable=1932735232. Updated to: ShuffleMem=1932735232, 
postMergeMem: 0
2015-08-03 13:20:12,353 INFO [Initializer 1] orderedgrouped.MergeManager: 
MergerManager: memoryLimit=1932735232, maxSingleShuffleLimit=483183808, 
mergeThreshold=1739461632, ioSortFactor=100, memToMemMergeOutputsThreshold=100


And then there are hours full of those things:
2015-08-03 13:33:59,911 INFO [fetcher 
[MapChainVertex_Workbook_job__184___Click] #36] 
orderedgrouped.FetcherOrderedGrouped: fetcher#36 - MergerManager returned 
Status.WAIT ...
2015-08-03 13:33:59,911 INFO [fetcher 
[MapChainVertex_Workbook_job__184___Click] #36] 
orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by 
fetcher [MapChainVertex_Workbook_job__184___Click] #36 in 3ms
2015-08-03 13:33:59,913 INFO [fetcher 
[MapChainVertex_Workbook_job__184___Click] #36] shuffle.HttpConnection: for 
url=http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000038_0_10015,attempt_1438596625066_0001_2_00_000039_0_10006,attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006
 sent hash and receievd reply 0 ms
2015-08-03 13:33:59,913 INFO [fetcher 
[MapChainVertex_Workbook_job__184___Click] #36] 
orderedgrouped.FetcherOrderedGrouped: fetcher#36 - MergerManager returned 
Status.WAIT ...
2015-08-03 13:33:59,913 INFO [fetcher 
[MapChainVertex_Workbook_job__184___Click] #36] 
orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by 
fetcher [MapChainVertex_Workbook_job__184___Click] #36 in 2ms
2015-08-03 13:33:59,914 INFO [fetcher 
[MapChainVertex_Workbook_job__184___Click] #36] shuffle.HttpConnection: for 
url=http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000039_0_10006,attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006,attempt_1438596625066_0001_2_00_000038_0_10015
 sent hash and receievd reply 0 ms
2015-08-03 13:33:59,914 INFO [fetcher 
[MapChainVertex_Workbook_job__184___Click] #36] 
orderedgrouped.FetcherOrderedGrouped: fetcher#36 - MergerManager returned 
Status.WAIT ...
2015-08-03 13:33:59,914 INFO [fetcher 
[MapChainVertex_Workbook_job__184___Click] #36] 
orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by 
fetcher [MapChainVertex_Workbook_job__184___Click] #36 in 1ms
2015-08-03 13:33:59,916 INFO [fetcher 
[MapChainVertex_Workbook_job__184___Click] #36] shuffle.HttpConnection: for 
url=http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006,attempt_1438596625066_0001_2_00_000038_0_10015,attempt_1438596625066_0001_2_00_000039_0_10006
 sent hash and receievd reply 0 ms
2015-08-03 13:33:59,916 INFO [fetcher 
[MapChainVertex_Workbook_job__184___Click] #36] 
orderedgrouped.FetcherOrderedGrouped: fetcher#36 - MergerManager returned 
Status.WAIT ...
2015-08-03 13:33:59,916 INFO [fetcher 
[MapChainVertex_Workbook_job__184___Click] #36] 
orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by 
fetcher [MapChainVertex_Workbook_job__184___Click] #36 in 2ms
2015-08-03 13:33:59,917 INFO [fetcher 
[MapChainVertex_Workbook_job__184___Click] #36] shuffle.HttpConnection: for 
url=http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000036_0_10006,attempt_1438596625066_0001_2_00_000038_0_10015,attempt_1438596625066_0001_2_00_000039_0_10006,attempt_1438596625066_0001_2_00_000028_0_10012
 sent hash and receievd reply 0 ms
2015-08-03 13:33:59,917 INFO [fetcher 
[MapChainVertex_Workbook_job__184___Click] #36] 
orderedgrouped.FetcherOrderedGrouped: fetcher#36 - MergerManager returned 
Status.WAIT ...
2015-08-03 13:33:59,917 INFO [fetcher 
[MapChainVertex_Workbook_job__184___Click] #36] 
orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by 
fetcher [MapChainVertex_Workbook_job__184___Click] #36 in 1ms
2015-08-03 13:33:59,919 INFO [fetcher 
[MapChainVertex_Workbook_job__184___Click] #36] shuffle.HttpConnection: for 
url=http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000038_0_10015,attempt_1438596625066_0001_2_00_000039_0_10006,attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006
 sent hash and receievd reply 0 ms
2015-08-03 13:33:59,919 INFO [fetcher 
[MapChainVertex_Workbook_job__184___Click] #36] 
orderedgrouped.FetcherOrderedGrouped: fetcher#36 - MergerManager returned 
Status.WAIT ...
2015-08-03 13:33:59,919 INFO [fetcher 
[MapChainVertex_Workbook_job__184___Click] #36] 
orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by 
fetcher [MapChainVertex_Workbook_job__184___Click] #36 in 2ms
2015-08-03 13:33:59,920 INFO [fetcher 
[MapChainVertex_Workbook_job__184___Click] #36] shuffle.HttpConnection: for 
url=http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000039_0_10006,attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006,attempt_1438596625066_0001_2_00_000038_0_10015
 sent hash and receievd reply 0 ms
2015-08-03 13:33:59,920 INFO [fetcher 
[MapChainVertex_Workbook_job__184___Click] #36] 
orderedgrouped.FetcherOrderedGrouped: fetcher#36 - MergerManager returned 
Status.WAIT ...
2015-08-03 13:33:59,920 INFO [fetcher 
[MapChainVertex_Workbook_job__184___Click] #36] 
orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by 
fetcher [MapChainVertex_Workbook_job__184___Click] #36 in 1ms
2015-08-03 13:33:59,922 INFO [fetcher 
[MapChainVertex_Workbook_job__184___Click] #25] shuffle.HttpConnection: for 
url=http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006,attempt_1438596625066_0001_2_00_000038_0_10015,attempt_1438596625066_0001_2_00_000039_0_10006
 sent hash and receievd reply 0 ms
2015-08-03 13:33:59,922 INFO [fetcher 
[MapChainVertex_Workbook_job__184___Click] #25] 
orderedgrouped.FetcherOrderedGrouped: fetcher#25 - MergerManager returned 
Status.WAIT ...
2015-08-03 13:33:59,922 INFO [fetcher 
[MapChainVertex_Workbook_job__184___Click] #25] 
orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by 
fetcher [MapChainVertex_Workbook_job__184___Click] #25 in 2ms
2015-08-03 13:34:06,125 INFO [fetcher 
[MapChainVertex_Workbook_job__184___Click] #33] shuffle.HttpConnection: for 
url=http://instance-8865.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000029_0_10039,attempt_1438596625066_0001_2_00_000052_0_10051,attempt_1438596625066_0001_2_00_000045_0_10015,attempt_1438596625066_0001_2_00_000049_0_10045,attempt_1438596625066_0001_2_00_000017_0_10009
 sent hash and receievd reply 0 ms

Focusing on an individual fetcher it looks like this over time:
2015-08-03 13:46:45,977 INFO [fetcher 
[MapChainVertex_Workbook_job__184___Click] #31] 
orderedgrouped.FetcherOrderedGrouped: fetcher#31 - MergerManager returned 
Status.WAIT …
…
2015-08-03 14:39:44,974 INFO [fetcher 
[MapChainVertex_Workbook_job__184___Click] #31] 
orderedgrouped.FetcherOrderedGrouped: fetcher#31 - MergerManager returned 
Status.WAIT …
…
2015-08-03 15:27:32,739 INFO [fetcher 
[MapChainVertex_Workbook_job__184___Click] #31] 
orderedgrouped.FetcherOrderedGrouped: fetcher#31 - MergerManager returned 
Status.WAIT …
…
2015-08-03 17:42:28,354 INFO [fetcher 
[MapChainVertex_Workbook_job__184___Click] #31] 
orderedgrouped.FetcherOrderedGrouped: fetcher#31 - MergerManager returned 
Status.WAIT …


That behaviour is consistent for the customer.
Using Tez-0.6…
Any ideas what could cause this ?

Johannes

Reply via email to