Rajesh Balamohan created TEZ-4194:
-------------------------------------

             Summary: NPE in FetcherOrderedGrouped
                 Key: TEZ-4194
                 URL: https://issues.apache.org/jira/browse/TEZ-4194
             Project: Apache Tez
          Issue Type: Bug
    Affects Versions: 0.9.1
            Reporter: Rajesh Balamohan

This was observed when running store_sales data generation (at 1 TB scale) in a cloud environment. It causes the task to fail and re-execute. Roughly one in every 2 to 4 runs hits it. Oddly, it is not clear why a fetcher is still being invoked when, according to the log, all inputs have already been downloaded.

{noformat}
2020-06-23 08:49:34,546 [INFO] [Fetcher_O {Map_1} #1] |orderedgrouped.ShuffleScheduler|: All inputs fetched for input vertex : Map 1
2020-06-23 08:49:34,546 [INFO] [Fetcher_O {Map_1} #1] |orderedgrouped.ShuffleScheduler|: copy(2385 (spillsFetched=2385) of 2385. Transfer rate (CumulativeDataFetched/TimeSinceInputStarted)) 0.01 MB/s)
2020-06-23 08:49:34,546 [INFO] [ShuffleAndMergeRunner {Map_1}] |orderedgrouped.ShuffleScheduler|: Shutting down FetchScheduler for input: Map_1, wasInterrupted=false
2020-06-23 08:49:34,547 [INFO] [ShuffleAndMergeRunner {Map_1}] |orderedgrouped.ShuffleScheduler|: copy(2385 (spillsFetched=2385) of 2385. Transfer rate (CumulativeDataFetched/TimeSinceInputStarted)) 0.01 MB/s)
2020-06-23 08:49:34,548 [INFO] [ShuffleAndMergeRunner {Map_1}] |orderedgrouped.ShuffleScheduler|: Shutting down fetchers for input: Map_1, shutdown timetaken: 0 ms, hasFetcherExecutorStopped: true
2020-06-23 08:49:34,549 [INFO] [ShuffleAndMergeRunner {Map_1}] |orderedgrouped.MergeManager|: TotalInMemFetchStats: count=2115, totalSize=28250798, min=2011, max=17268, avg=1.0
2020-06-23 08:49:34,549 [INFO] [ShuffleAndMergeRunner {Map_1}] |orderedgrouped.MergeManager|: finalMerge with #inMemoryOutputs=2115, size=28250798 and #onDiskOutputs=144, size=3487553
2020-06-23 08:49:34,686 [INFO] [Fetcher_O {Map_1} #3] |orderedgrouped.Shuffle|: Map_1: Setting throwable in reportException with message [null] from thread [Fetcher_O {Map_1} #3
2020-06-23 08:49:34,687 [INFO] [Fetcher_O {Map_1} #3] |orderedgrouped.ShuffleScheduler|: Map_1: Already shutdown. Ignoring fetch complete
2020-06-23 08:49:35,044 [ERROR] [ShuffleAndMergeRunner {Map_1}] |orderedgrouped.Shuffle|: Map_1: ShuffleRunner failed with error
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$ShuffleError: error in shuffle in Fetcher_O {Map_1} #3
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:332)
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:288)
	at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
	at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
	at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:69)
	at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.setupConnection(FetcherOrderedGrouped.java:354)
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:263)
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:182)
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:194)
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:57)
	... 7 more
2020-06-23 08:49:35,046 [INFO] [ShuffleAndMergeRunner {Map_1}] |task.TezTaskRunner2|: Received notification of a failure which will cause the task to die
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$ShuffleError: error in shuffle in Fetcher_O {Map_1} #3
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:332)
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:288)
	at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
	at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
	at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:69)
	at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.setupConnection(FetcherOrderedGrouped.java:354)
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:263)
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:182)
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:194)
	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:57)
	... 7 more
{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
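
The ordering in the log (scheduler and fetchers reported as shut down at 08:49:34,546 to 08:49:34,548, then an exception with a null message reported from Fetcher #3 at 08:49:34,686) would be consistent with a fetcher thread still running after shutdown and dereferencing state that is no longer there. That reading is not confirmed by the report. As a minimal sketch of that kind of race and one defensive guard against it, assuming nothing about the actual Tez implementation: every name below (GuardedFetcherSketch, Connection, open, shutDown, fetchUnguarded, fetchGuarded) is hypothetical and only illustrates the pattern.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch; none of these names come from the Tez code base.
class GuardedFetcherSketch {

    /** Stand-in for whatever per-host state a fetcher needs to connect. */
    static final class Connection {
        final String host;
        Connection(String host) { this.host = host; }
    }

    private final AtomicBoolean stopped = new AtomicBoolean(false);
    // Shared state that a shutdown on another thread may clear.
    private final AtomicReference<Connection> connection = new AtomicReference<>();

    void open(String host) {
        connection.set(new Connection(host));
    }

    /** Assumed to be called by a scheduler thread once all inputs are fetched. */
    void shutDown() {
        stopped.set(true);
        connection.set(null);
    }

    /** Unguarded variant: throws NullPointerException if shutDown() ran first. */
    String fetchUnguarded() {
        return connection.get().host;
    }

    /**
     * Guarded variant: reads the shared reference once into a local, then
     * treats "stopped" or "no connection" as nothing left to do.
     */
    String fetchGuarded() {
        Connection c = connection.get();
        if (stopped.get() || c == null) {
            return null; // skip quietly instead of failing the task
        }
        return c.host;
    }

    public static void main(String[] args) {
        GuardedFetcherSketch fetcher = new GuardedFetcherSketch();
        fetcher.open("host-1");
        System.out.println("before shutdown: " + fetcher.fetchGuarded());
        fetcher.shutDown();
        System.out.println("after shutdown (guarded): " + fetcher.fetchGuarded());
        try {
            fetcher.fetchUnguarded();
        } catch (NullPointerException e) {
            System.out.println("after shutdown (unguarded): NullPointerException");
        }
    }
}
```

The guard copies the shared reference into a local before checking it, so a concurrent shutDown() cannot null it out between the check and the use. Whether a late fetch should be skipped silently or reported is a design decision the report does not address.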