Is it 0.6.0 or 0.6.1? Does that fetcher get locked up in Status.WAIT
forever?

This could be related to https://issues.apache.org/jira/browse/TEZ-1923
(fix available in 0.6.1).

~Rajesh.B

On Tue, Aug 11, 2015 at 1:48 PM, Johannes Zillmann <[email protected]
> wrote:

> Hey Tez Community,
>
> a customer of ours has following situation. He sets up a computation
> pipeline in a way that 2 independent graphs are running in an Tez DAG.
> Result: the job is still running after a half day with only alive task
> running, seems to running forever.
>
> If he splits the pipeline that each of the independent graph are executed
> in their own DAG, then each of the DAG’s are completed in one or two hours.
>
> Here is some initial logging which might be of interest of that one
> never-ending task:
> 2015-08-03 13:20:12,352 INFO [Initializer 1] orderedgrouped.Shuffle:
> Shuffle assigned with 55 inputs, codec: NoneifileReadAhead: true
> 2015-08-03 13:20:12,353 INFO [Initializer 1]
> orderedgrouped.ShuffleScheduler: ShuffleScheduler running for sourceVertex:
> MapChainVertex:Workbook job (184):
> ClickPairs#ClikDataWithPlayerSession(Disconnected record stream)
> (4ae490d9-9d88-4ccd-a236-1d4dddb56060) with configuration:
> maxFetchFailuresBeforeReporting=5, reportReadErrorImmediately=true,
> maxFailedUniqueFetches=5, abortFailureLimit=30, maxMapRuntime=0
> 2015-08-03 13:20:12,353 INFO [Initializer 1] orderedgrouped.MergeManager:
> InitialRequest: ShuffleMem=1932735232, postMergeMem=0,
> RuntimeTotalAvailable=1932735232. Updated to: ShuffleMem=1932735232,
> postMergeMem: 0
> 2015-08-03 13:20:12,353 INFO [Initializer 1] orderedgrouped.MergeManager:
> MergerManager: memoryLimit=1932735232, maxSingleShuffleLimit=483183808,
> mergeThreshold=1739461632, ioSortFactor=100,
> memToMemMergeOutputsThreshold=100
>
>
> And then there are hours full of those things:
> 2015-08-03 13:33:59,911 INFO [fetcher
> [MapChainVertex_Workbook_job__184___Click] #36]
> orderedgrouped.FetcherOrderedGrouped: fetcher#36 - MergerManager returned
> Status.WAIT ...
> 2015-08-03 13:33:59,911 INFO [fetcher
> [MapChainVertex_Workbook_job__184___Click] #36]
> orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by
> fetcher [MapChainVertex_Workbook_job__184___Click] #36 in 3ms
> 2015-08-03 13:33:59,913 INFO [fetcher
> [MapChainVertex_Workbook_job__184___Click] #36] shuffle.HttpConnection: for
> url=
> http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000038_0_10015,attempt_1438596625066_0001_2_00_000039_0_10006,attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006
> sent hash and receievd reply 0 ms
> 2015-08-03 13:33:59,913 INFO [fetcher
> [MapChainVertex_Workbook_job__184___Click] #36]
> orderedgrouped.FetcherOrderedGrouped: fetcher#36 - MergerManager returned
> Status.WAIT ...
> 2015-08-03 13:33:59,913 INFO [fetcher
> [MapChainVertex_Workbook_job__184___Click] #36]
> orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by
> fetcher [MapChainVertex_Workbook_job__184___Click] #36 in 2ms
> 2015-08-03 13:33:59,914 INFO [fetcher
> [MapChainVertex_Workbook_job__184___Click] #36] shuffle.HttpConnection: for
> url=
> http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000039_0_10006,attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006,attempt_1438596625066_0001_2_00_000038_0_10015
> sent hash and receievd reply 0 ms
> 2015-08-03 13:33:59,914 INFO [fetcher
> [MapChainVertex_Workbook_job__184___Click] #36]
> orderedgrouped.FetcherOrderedGrouped: fetcher#36 - MergerManager returned
> Status.WAIT ...
> 2015-08-03 13:33:59,914 INFO [fetcher
> [MapChainVertex_Workbook_job__184___Click] #36]
> orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by
> fetcher [MapChainVertex_Workbook_job__184___Click] #36 in 1ms
> 2015-08-03 13:33:59,916 INFO [fetcher
> [MapChainVertex_Workbook_job__184___Click] #36] shuffle.HttpConnection: for
> url=
> http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006,attempt_1438596625066_0001_2_00_000038_0_10015,attempt_1438596625066_0001_2_00_000039_0_10006
> sent hash and receievd reply 0 ms
> 2015-08-03 13:33:59,916 INFO [fetcher
> [MapChainVertex_Workbook_job__184___Click] #36]
> orderedgrouped.FetcherOrderedGrouped: fetcher#36 - MergerManager returned
> Status.WAIT ...
> 2015-08-03 13:33:59,916 INFO [fetcher
> [MapChainVertex_Workbook_job__184___Click] #36]
> orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by
> fetcher [MapChainVertex_Workbook_job__184___Click] #36 in 2ms
> 2015-08-03 13:33:59,917 INFO [fetcher
> [MapChainVertex_Workbook_job__184___Click] #36] shuffle.HttpConnection: for
> url=
> http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000036_0_10006,attempt_1438596625066_0001_2_00_000038_0_10015,attempt_1438596625066_0001_2_00_000039_0_10006,attempt_1438596625066_0001_2_00_000028_0_10012
> sent hash and receievd reply 0 ms
> 2015-08-03 13:33:59,917 INFO [fetcher
> [MapChainVertex_Workbook_job__184___Click] #36]
> orderedgrouped.FetcherOrderedGrouped: fetcher#36 - MergerManager returned
> Status.WAIT ...
> 2015-08-03 13:33:59,917 INFO [fetcher
> [MapChainVertex_Workbook_job__184___Click] #36]
> orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by
> fetcher [MapChainVertex_Workbook_job__184___Click] #36 in 1ms
> 2015-08-03 13:33:59,919 INFO [fetcher
> [MapChainVertex_Workbook_job__184___Click] #36] shuffle.HttpConnection: for
> url=
> http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000038_0_10015,attempt_1438596625066_0001_2_00_000039_0_10006,attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006
> sent hash and receievd reply 0 ms
> 2015-08-03 13:33:59,919 INFO [fetcher
> [MapChainVertex_Workbook_job__184___Click] #36]
> orderedgrouped.FetcherOrderedGrouped: fetcher#36 - MergerManager returned
> Status.WAIT ...
> 2015-08-03 13:33:59,919 INFO [fetcher
> [MapChainVertex_Workbook_job__184___Click] #36]
> orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by
> fetcher [MapChainVertex_Workbook_job__184___Click] #36 in 2ms
> 2015-08-03 13:33:59,920 INFO [fetcher
> [MapChainVertex_Workbook_job__184___Click] #36] shuffle.HttpConnection: for
> url=
> http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000039_0_10006,attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006,attempt_1438596625066_0001_2_00_000038_0_10015
> sent hash and receievd reply 0 ms
> 2015-08-03 13:33:59,920 INFO [fetcher
> [MapChainVertex_Workbook_job__184___Click] #36]
> orderedgrouped.FetcherOrderedGrouped: fetcher#36 - MergerManager returned
> Status.WAIT ...
> 2015-08-03 13:33:59,920 INFO [fetcher
> [MapChainVertex_Workbook_job__184___Click] #36]
> orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by
> fetcher [MapChainVertex_Workbook_job__184___Click] #36 in 1ms
> 2015-08-03 13:33:59,922 INFO [fetcher
> [MapChainVertex_Workbook_job__184___Click] #25] shuffle.HttpConnection: for
> url=
> http://instance-8863.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000028_0_10012,attempt_1438596625066_0001_2_00_000036_0_10006,attempt_1438596625066_0001_2_00_000038_0_10015,attempt_1438596625066_0001_2_00_000039_0_10006
> sent hash and receievd reply 0 ms
> 2015-08-03 13:33:59,922 INFO [fetcher
> [MapChainVertex_Workbook_job__184___Click] #25]
> orderedgrouped.FetcherOrderedGrouped: fetcher#25 - MergerManager returned
> Status.WAIT ...
> 2015-08-03 13:33:59,922 INFO [fetcher
> [MapChainVertex_Workbook_job__184___Click] #25]
> orderedgrouped.ShuffleScheduler: instance-8863.bigfoot.ko:13562 freed by
> fetcher [MapChainVertex_Workbook_job__184___Click] #25 in 2ms
> 2015-08-03 13:34:06,125 INFO [fetcher
> [MapChainVertex_Workbook_job__184___Click] #33] shuffle.HttpConnection: for
> url=
> http://instance-8865.bigfoot.ko:13562/mapOutput?job=job_1438596625066_0001&reduce=5&map=attempt_1438596625066_0001_2_00_000029_0_10039,attempt_1438596625066_0001_2_00_000052_0_10051,attempt_1438596625066_0001_2_00_000045_0_10015,attempt_1438596625066_0001_2_00_000049_0_10045,attempt_1438596625066_0001_2_00_000017_0_10009
> sent hash and receievd reply 0 ms
>
> Focusing on an individual fetcher it looks like this over time:
> 2015-08-03 13:46:45,977 INFO [fetcher
> [MapChainVertex_Workbook_job__184___Click] #31]
> orderedgrouped.FetcherOrderedGrouped: fetcher#31 - MergerManager returned
> Status.WAIT …
> …
> 2015-08-03 14:39:44,974 INFO [fetcher
> [MapChainVertex_Workbook_job__184___Click] #31]
> orderedgrouped.FetcherOrderedGrouped: fetcher#31 - MergerManager returned
> Status.WAIT …
> …
> 2015-08-03 15:27:32,739 INFO [fetcher
> [MapChainVertex_Workbook_job__184___Click] #31]
> orderedgrouped.FetcherOrderedGrouped: fetcher#31 - MergerManager returned
> Status.WAIT …
> …
> 2015-08-03 17:42:28,354 INFO [fetcher
> [MapChainVertex_Workbook_job__184___Click] #31]
> orderedgrouped.FetcherOrderedGrouped: fetcher#31 - MergerManager returned
> Status.WAIT …
>
>
> That behaviour is consistent for the customer.
> Using Tez-0.6…
> Any ideas what could cause this ?
>
> Johannes
>

Reply via email to