[jira] [Commented] (TEZ-3976) ShuffleManager reporting too many errors

2018-10-23 Thread Gopal V (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16661392#comment-16661392
 ] 

Gopal V commented on TEZ-3976:
--

LGTM - +1

> ShuffleManager reporting too many errors
> 
>
> Key: TEZ-3976
> URL: https://issues.apache.org/jira/browse/TEZ-3976
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jaume M
>Assignee: Jaume M
>Priority: Major
> Attachments: TEZ-3976.1.patch, TEZ-3976.2.patch, TEZ-3976.3.patch, 
> TEZ-3976.4.patch, TEZ-3976.5.patch, TEZ-3976.6.patch, TEZ-3976.7.patch, 
> TEZ-3976.8.patch, TEZ-3976.9.patch
>
>
> The symptoms are a lot of these logs are being shown:
> {code:java}
> 2018-06-15T18:09:35,811 INFO  [Fetcher_B {Reducer_5} #0 ()] 
> org.apache.tez.runtime.library.common.shuffle.impl.ShuffleManager: Reducer_5: 
> Fetch failed for src: InputAttemptIdentifier [inputIdentifier=701, 
> attemptNumber=0, 
> pathComponent=attempt_152901963_0021_34_01_000701_0_12541_0, spillType=2, 
> spillId=0]InputIdentifier: InputAttemptIdentifier [inputIdentifier=701, 
> attemptNumber=0, 
> pathComponent=attempt_152901963_0021_34_01_000701_0_12541_0, spillType=2, 
> spillId=0], connectFailed: true
> 2018-06-15T18:09:35,811 WARN  [Fetcher_B {Reducer_5} #1 ()] 
> org.apache.tez.runtime.library.common.shuffle.Fetcher: copyInputs failed for 
> tasks [InputAttemptIdentifier [inputIdentifier=589, attemptNumber=0, 
> pathComponent=attempt_152901963_0021_34_01_000589_0_12445_0, spillType=2, 
> spillId=0]]
> 2018-06-15T18:09:35,811 INFO  [Fetcher_B {Reducer_5} #1 ()] 
> org.apache.tez.runtime.library.common.shuffle.impl.ShuffleManager: Reducer_5: 
> Fetch failed for src: InputAttemptIdentifier [inputIdentifier=589, 
> attemptNumber=0, 
> pathComponent=attempt_152901963_0021_34_01_000589_0_12445_0, spillType=2, 
> spillId=0]InputIdentifier: InputAttemptIdentifier [inputIdentifier=589, 
> attemptNumber=0, 
> pathComponent=attempt_152901963_0021_34_01_000589_0_12445_0, spillType=2, 
> spillId=0], connectFailed: true
> {code}
> Each of those translate into an event in the AM which finally crashes due to 
> OOM after around 30 minutes and around 10 million shuffle input errors (and 
> 10 million lines like the previous ones). When the ShufflerManager is closed 
> and the counters reported there are many shuffle input errors, some of those 
> logs are:
> {code:java}
> 2018-06-15T17:46:30,988  INFO [TezTR-441963_21_34_4_0_4 
> (152901963_0021_34_04_00_4)] runtime.LogicalIOProcessorRuntimeTask: 
> Final Counters for attempt_152901963_0021_34_04_00_4: Counters: 43 
> [[org.apache.tez.common.counters.TaskCounter SPILLED_RECORDS=0, 
> NUM_SHUFFLED_INPUTS=26, NUM_FAILED_SHUFFLE_INPUTS=858965, 
> INPUT_RECORDS_PROCESSED=26, OUTPUT_RECORDS=1, OUTPUT_LARGE_RECORDS=0, 
> OUTPUT_BYTES=779472, OUTPUT_BYTES_WITH_OVERHEAD=779483, 
> OUTPUT_BYTES_PHYSICAL=780146, ADDITIONAL_SPILLS_BYTES_WRITTEN=0, 
> ADDITIONAL_SPILLS_BYTES_READ=0, ADDITIONAL_SPILL_COUNT=0, 
> SHUFFLE_BYTES=4207563, SHUFFLE_BYTES_DECOMPRESSED=20266603, 
> SHUFFLE_BYTES_TO_MEM=3380616, SHUFFLE_BYTES_TO_DISK=0, 
> SHUFFLE_BYTES_DISK_DIRECT=826947, SHUFFLE_PHASE_TIME=52516, 
> FIRST_EVENT_RECEIVED=1, LAST_EVENT_RECEIVED=1185][HIVE 
> RECORDS_OUT_INTERMEDIATE_^[[1;35;40m^[[KReducer_12^[[m^[[K=1, 
> RECORDS_OUT_OPERATOR_GBY_159=1, 
> RECORDS_OUT_OPERATOR_RS_160=1][TaskCounter_^[[1;35;40m^[[KReducer_12^[[m^[[K_INPUT_Map_11
>  FIRST_EVENT_RECEIVED=1, INPUT_RECORDS_PROCESSED=26, 
> LAST_EVENT_RECEIVED=1185, NUM_FAILED_SHUFFLE_INPUTS=858965, 
> NUM_SHUFFLED_INPUTS=26, SHUFFLE_BYTES=4207563, 
> SHUFFLE_BYTES_DECOMPRESSED=20266603, SHUFFLE_BYTES_DISK_DIRECT=826947, 
> SHUFFLE_BYTES_TO_DISK=0, SHUFFLE_BYTES_TO_MEM=3380616, 
> SHUFFLE_PHASE_TIME=52516][TaskCounter_^[[1;35;40m^[[KReducer_12^[[m^[[K_OUTPUT_Map_1
>  ADDITIONAL_SPILLS_BYTES_READ=0, ADDITIONAL_SPILLS_BYTES_WRITTEN=0, 
> ADDITIONAL_SPILL_COUNT=0, OUTPUT_BYTES=779472, OUTPUT_BYTES_PHYSICAL=780146, 
> OUTPUT_BYTES_WITH_OVERHEAD=779483, OUTPUT_LARGE_RECORDS=0, OUTPUT_RECORDS=1, 
> SPILLED_RECORDS=0]]
> 2018-06-15T17:46:32,271 INFO  [TezTR-441963_21_34_3_15_1 ()] 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask: Final Counters for 
> attempt_152901963_0021_34_03_15_1: Counters: 87 [[File System 
> Counters FILE_BYTES_READ=0, FILE_BYTES_WRITTEN=0, FILE_READ_OPS=0, 
> FILE_LARGE_READ_OPS=0, FILE_WRITE_OPS=0, HDFS_BYTES_READ=2344929, 
> HDFS_BYTES_WRITTEN=0, HDFS_READ_OPS=5, HDFS_LARGE_READ_OPS=0, 
> HDFS_WRITE_OPS=0][org.apache.tez.common.counters.TaskCounter 
> SPILLED_RECORDS=0, NUM_SHUFFLED_INPUTS=1, NUM_FAILED_SHUFFLE_INPUTS=105195, 
> INPUT_RECORDS_PROCESSED=397, INPUT_SPLIT_LENGTH_BYTES=21563271, 
> OUTPUT_RECORDS=15737, OUTPUT_LARGE_RECORDS=0, OUTPUT_BYTES=1235818, 
> OUTPUT_BYTES_WITH_OVERHEAD=1267307, OUTPUT_BYTES_PHYSICAL=357520, 

[jira] [Commented] (TEZ-3976) ShuffleManager reporting too many errors

2018-10-19 Thread TezQA (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16657659#comment-16657659
 ] 

TezQA commented on TEZ-3976:


| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  2m 
21s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  5m 
29s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
42s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
15s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
36s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
52s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
14s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
42s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
40s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red}  0m 20s{color} 
| {color:red} tez-runtime-library generated 2 new + 6 unchanged - 0 fixed = 8 
total (was 6) {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 21s{color} | {color:orange} tez-api: The patch generated 11 new + 8 
unchanged - 7 fixed = 19 total (was 15) {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 34s{color} | {color:orange} tez-runtime-library: The patch generated 5 new + 
587 unchanged - 3 fixed = 592 total (was 590) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
48s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
52s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  1m 
43s{color} | {color:green} tez-api in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  4m  
2s{color} | {color:green} tez-runtime-library in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
28s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 24m 36s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | TEZ-3976 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12944834/TEZ-3976.9.patch |
| Optional Tests |  dupname  asflicense  javac  javadoc  unit  findbugs  
checkstyle  compile  |
| uname | Linux asf902.gq1.ygridcore.net 3.13.0-143-generic #192-Ubuntu SMP Tue 
Feb 27 10:45:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-TEZ-Build/yetus/precommit/personality/tez.sh
 |
| git revision | master / 9277815 |
| maven | version: Apache Maven 3.5.4 
(1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z) |
| Default Java | 1.8.0_172 |
| findbugs | v3.0.1 |
| javac | 
https://builds.apache.org/job/PreCommit-TEZ-Build/42/artifact/out/diff-compile-javac-tez-runtime-library.txt
 |
| checkstyle | 
https://builds.apache.org/job/PreCommit-TEZ-Build/42/artifact/out/diff-checkstyle-tez-api.txt
 |
| checkstyle | 
https://builds.apache.org/job/PreCommit-TEZ-Build/42/artifact/out/diff-checkstyle-tez-runtime-library.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-TEZ-Build/42/testReport/ |
| modules | C: tez-api tez-runtime-library U: . |
| Console output | https://builds.apache.org/job/PreCommit-TEZ-Build/42/console 
|
| Powered by | Apache Yetus 0.8.0  

[jira] [Commented] (TEZ-3976) ShuffleManager reporting too many errors

2018-10-19 Thread TezQA (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16657629#comment-16657629
 ] 

TezQA commented on TEZ-3976:


| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | {color:red} patch {color} | {color:red}  0m  3s{color} 
| {color:red} TEZ-3976 does not apply to master. Rebase required? Wrong Branch? 
See https://cwiki.apache.org/confluence/display/TEZ/How+to+Contribute+to+Tez 
for help. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | TEZ-3976 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12944832/TEZ-3976.8.patch |
| Console output | https://builds.apache.org/job/PreCommit-TEZ-Build/41/console 
|
| Powered by | Apache Yetus 0.8.0   http://yetus.apache.org |


This message was automatically generated.



> ShuffleManager reporting too many errors
> 
>
> Key: TEZ-3976
> URL: https://issues.apache.org/jira/browse/TEZ-3976
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jaume M
>Assignee: Jaume M
>Priority: Major
> Attachments: TEZ-3976.1.patch, TEZ-3976.2.patch, TEZ-3976.3.patch, 
> TEZ-3976.4.patch, TEZ-3976.5.patch, TEZ-3976.6.patch, TEZ-3976.7.patch, 
> TEZ-3976.8.patch
>
>
> The symptoms are a lot of these logs are being shown:
> {code:java}
> 2018-06-15T18:09:35,811 INFO  [Fetcher_B {Reducer_5} #0 ()] 
> org.apache.tez.runtime.library.common.shuffle.impl.ShuffleManager: Reducer_5: 
> Fetch failed for src: InputAttemptIdentifier [inputIdentifier=701, 
> attemptNumber=0, 
> pathComponent=attempt_152901963_0021_34_01_000701_0_12541_0, spillType=2, 
> spillId=0]InputIdentifier: InputAttemptIdentifier [inputIdentifier=701, 
> attemptNumber=0, 
> pathComponent=attempt_152901963_0021_34_01_000701_0_12541_0, spillType=2, 
> spillId=0], connectFailed: true
> 2018-06-15T18:09:35,811 WARN  [Fetcher_B {Reducer_5} #1 ()] 
> org.apache.tez.runtime.library.common.shuffle.Fetcher: copyInputs failed for 
> tasks [InputAttemptIdentifier [inputIdentifier=589, attemptNumber=0, 
> pathComponent=attempt_152901963_0021_34_01_000589_0_12445_0, spillType=2, 
> spillId=0]]
> 2018-06-15T18:09:35,811 INFO  [Fetcher_B {Reducer_5} #1 ()] 
> org.apache.tez.runtime.library.common.shuffle.impl.ShuffleManager: Reducer_5: 
> Fetch failed for src: InputAttemptIdentifier [inputIdentifier=589, 
> attemptNumber=0, 
> pathComponent=attempt_152901963_0021_34_01_000589_0_12445_0, spillType=2, 
> spillId=0]InputIdentifier: InputAttemptIdentifier [inputIdentifier=589, 
> attemptNumber=0, 
> pathComponent=attempt_152901963_0021_34_01_000589_0_12445_0, spillType=2, 
> spillId=0], connectFailed: true
> {code}
> Each of those translate into an event in the AM which finally crashes due to 
> OOM after around 30 minutes and around 10 million shuffle input errors (and 
> 10 million lines like the previous ones). When the ShufflerManager is closed 
> and the counters reported there are many shuffle input errors, some of those 
> logs are:
> {code:java}
> 2018-06-15T17:46:30,988  INFO [TezTR-441963_21_34_4_0_4 
> (152901963_0021_34_04_00_4)] runtime.LogicalIOProcessorRuntimeTask: 
> Final Counters for attempt_152901963_0021_34_04_00_4: Counters: 43 
> [[org.apache.tez.common.counters.TaskCounter SPILLED_RECORDS=0, 
> NUM_SHUFFLED_INPUTS=26, NUM_FAILED_SHUFFLE_INPUTS=858965, 
> INPUT_RECORDS_PROCESSED=26, OUTPUT_RECORDS=1, OUTPUT_LARGE_RECORDS=0, 
> OUTPUT_BYTES=779472, OUTPUT_BYTES_WITH_OVERHEAD=779483, 
> OUTPUT_BYTES_PHYSICAL=780146, ADDITIONAL_SPILLS_BYTES_WRITTEN=0, 
> ADDITIONAL_SPILLS_BYTES_READ=0, ADDITIONAL_SPILL_COUNT=0, 
> SHUFFLE_BYTES=4207563, SHUFFLE_BYTES_DECOMPRESSED=20266603, 
> SHUFFLE_BYTES_TO_MEM=3380616, SHUFFLE_BYTES_TO_DISK=0, 
> SHUFFLE_BYTES_DISK_DIRECT=826947, SHUFFLE_PHASE_TIME=52516, 
> FIRST_EVENT_RECEIVED=1, LAST_EVENT_RECEIVED=1185][HIVE 
> RECORDS_OUT_INTERMEDIATE_^[[1;35;40m^[[KReducer_12^[[m^[[K=1, 
> RECORDS_OUT_OPERATOR_GBY_159=1, 
> RECORDS_OUT_OPERATOR_RS_160=1][TaskCounter_^[[1;35;40m^[[KReducer_12^[[m^[[K_INPUT_Map_11
>  FIRST_EVENT_RECEIVED=1, INPUT_RECORDS_PROCESSED=26, 
> LAST_EVENT_RECEIVED=1185, NUM_FAILED_SHUFFLE_INPUTS=858965, 
> NUM_SHUFFLED_INPUTS=26, SHUFFLE_BYTES=4207563, 
> SHUFFLE_BYTES_DECOMPRESSED=20266603, SHUFFLE_BYTES_DISK_DIRECT=826947, 
> SHUFFLE_BYTES_TO_DISK=0, SHUFFLE_BYTES_TO_MEM=3380616, 
> SHUFFLE_PHASE_TIME=52516][TaskCounter_^[[1;35;40m^[[KReducer_12^[[m^[[K_OUTPUT_Map_1
>  ADDITIONAL_SPILLS_BYTES_READ=0, ADDITIONAL_SPILLS_BYTES_WRITTEN=0, 
> ADDITIONAL_SPILL_COUNT=0, OUTPUT_BYTES=779472, OUTPUT_BYTES_PHYSICAL=780146, 
> OUTPUT_BYTES_WITH_OVERHEAD=779483, OUTPUT_LARGE_RECORDS=0, OUTPUT_RECORDS=1, 
> SPILLED_RECORDS=0]]
> 2018-06-15T17:46:32,271 INFO  [TezTR-441963_21_34_3_15_1 ()] 
> 

[jira] [Commented] (TEZ-3976) ShuffleManager reporting too many errors

2018-10-19 Thread Gopal V (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16657617#comment-16657617
 ] 

Gopal V commented on TEZ-3976:
--

[~jmarhuen]: I'm fine with this patch (+1), except for one detail - the 
InputReadErrorEvent equals() & hashCode should not use the numFailures to 
reference it.

{code}
+  private final HashMap failedEvents
+  = new HashMap<>()
{code}

I know you cannot modify numFailures there, but it is just adding noise to the 
hashtable because it is always the same value (& go into the lower bits of the 
hash function).

The lookup should match whether you have 1 failure or two - currently it 
assumes you're always merging +1, which is true, but I've debugged too many 
hash function skews this week to let this one go.

> ShuffleManager reporting too many errors
> 
>
> Key: TEZ-3976
> URL: https://issues.apache.org/jira/browse/TEZ-3976
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jaume M
>Assignee: Jaume M
>Priority: Major
> Attachments: TEZ-3976.1.patch, TEZ-3976.2.patch, TEZ-3976.3.patch, 
> TEZ-3976.4.patch, TEZ-3976.5.patch, TEZ-3976.6.patch, TEZ-3976.7.patch
>
>
> The symptoms are a lot of these logs are being shown:
> {code:java}
> 2018-06-15T18:09:35,811 INFO  [Fetcher_B {Reducer_5} #0 ()] 
> org.apache.tez.runtime.library.common.shuffle.impl.ShuffleManager: Reducer_5: 
> Fetch failed for src: InputAttemptIdentifier [inputIdentifier=701, 
> attemptNumber=0, 
> pathComponent=attempt_152901963_0021_34_01_000701_0_12541_0, spillType=2, 
> spillId=0]InputIdentifier: InputAttemptIdentifier [inputIdentifier=701, 
> attemptNumber=0, 
> pathComponent=attempt_152901963_0021_34_01_000701_0_12541_0, spillType=2, 
> spillId=0], connectFailed: true
> 2018-06-15T18:09:35,811 WARN  [Fetcher_B {Reducer_5} #1 ()] 
> org.apache.tez.runtime.library.common.shuffle.Fetcher: copyInputs failed for 
> tasks [InputAttemptIdentifier [inputIdentifier=589, attemptNumber=0, 
> pathComponent=attempt_152901963_0021_34_01_000589_0_12445_0, spillType=2, 
> spillId=0]]
> 2018-06-15T18:09:35,811 INFO  [Fetcher_B {Reducer_5} #1 ()] 
> org.apache.tez.runtime.library.common.shuffle.impl.ShuffleManager: Reducer_5: 
> Fetch failed for src: InputAttemptIdentifier [inputIdentifier=589, 
> attemptNumber=0, 
> pathComponent=attempt_152901963_0021_34_01_000589_0_12445_0, spillType=2, 
> spillId=0]InputIdentifier: InputAttemptIdentifier [inputIdentifier=589, 
> attemptNumber=0, 
> pathComponent=attempt_152901963_0021_34_01_000589_0_12445_0, spillType=2, 
> spillId=0], connectFailed: true
> {code}
> Each of those translate into an event in the AM which finally crashes due to 
> OOM after around 30 minutes and around 10 million shuffle input errors (and 
> 10 million lines like the previous ones). When the ShufflerManager is closed 
> and the counters reported there are many shuffle input errors, some of those 
> logs are:
> {code:java}
> 2018-06-15T17:46:30,988  INFO [TezTR-441963_21_34_4_0_4 
> (152901963_0021_34_04_00_4)] runtime.LogicalIOProcessorRuntimeTask: 
> Final Counters for attempt_152901963_0021_34_04_00_4: Counters: 43 
> [[org.apache.tez.common.counters.TaskCounter SPILLED_RECORDS=0, 
> NUM_SHUFFLED_INPUTS=26, NUM_FAILED_SHUFFLE_INPUTS=858965, 
> INPUT_RECORDS_PROCESSED=26, OUTPUT_RECORDS=1, OUTPUT_LARGE_RECORDS=0, 
> OUTPUT_BYTES=779472, OUTPUT_BYTES_WITH_OVERHEAD=779483, 
> OUTPUT_BYTES_PHYSICAL=780146, ADDITIONAL_SPILLS_BYTES_WRITTEN=0, 
> ADDITIONAL_SPILLS_BYTES_READ=0, ADDITIONAL_SPILL_COUNT=0, 
> SHUFFLE_BYTES=4207563, SHUFFLE_BYTES_DECOMPRESSED=20266603, 
> SHUFFLE_BYTES_TO_MEM=3380616, SHUFFLE_BYTES_TO_DISK=0, 
> SHUFFLE_BYTES_DISK_DIRECT=826947, SHUFFLE_PHASE_TIME=52516, 
> FIRST_EVENT_RECEIVED=1, LAST_EVENT_RECEIVED=1185][HIVE 
> RECORDS_OUT_INTERMEDIATE_^[[1;35;40m^[[KReducer_12^[[m^[[K=1, 
> RECORDS_OUT_OPERATOR_GBY_159=1, 
> RECORDS_OUT_OPERATOR_RS_160=1][TaskCounter_^[[1;35;40m^[[KReducer_12^[[m^[[K_INPUT_Map_11
>  FIRST_EVENT_RECEIVED=1, INPUT_RECORDS_PROCESSED=26, 
> LAST_EVENT_RECEIVED=1185, NUM_FAILED_SHUFFLE_INPUTS=858965, 
> NUM_SHUFFLED_INPUTS=26, SHUFFLE_BYTES=4207563, 
> SHUFFLE_BYTES_DECOMPRESSED=20266603, SHUFFLE_BYTES_DISK_DIRECT=826947, 
> SHUFFLE_BYTES_TO_DISK=0, SHUFFLE_BYTES_TO_MEM=3380616, 
> SHUFFLE_PHASE_TIME=52516][TaskCounter_^[[1;35;40m^[[KReducer_12^[[m^[[K_OUTPUT_Map_1
>  ADDITIONAL_SPILLS_BYTES_READ=0, ADDITIONAL_SPILLS_BYTES_WRITTEN=0, 
> ADDITIONAL_SPILL_COUNT=0, OUTPUT_BYTES=779472, OUTPUT_BYTES_PHYSICAL=780146, 
> OUTPUT_BYTES_WITH_OVERHEAD=779483, OUTPUT_LARGE_RECORDS=0, OUTPUT_RECORDS=1, 
> SPILLED_RECORDS=0]]
> 2018-06-15T17:46:32,271 INFO  [TezTR-441963_21_34_3_15_1 ()] 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask: Final Counters for 
> attempt_152901963_0021_34_03_15_1: Counters: 87 

[jira] [Commented] (TEZ-3976) ShuffleManager reporting too many errors

2018-10-18 Thread Kuhu Shukla (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16655924#comment-16655924
 ] 

Kuhu Shukla commented on TEZ-3976:
--

Thank you [~jmarhuen] for the patch.

I took a cursory look at it and the new config waits for a configurable amount 
of time before reporting all the failures thus far. This approach looks good 
and it is different from the ordered case though and I wonder if we should make 
them consistent?

I will look at the patch closely in the mean time.

> ShuffleManager reporting too many errors
> 
>
> Key: TEZ-3976
> URL: https://issues.apache.org/jira/browse/TEZ-3976
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jaume M
>Assignee: Jaume M
>Priority: Major
> Attachments: TEZ-3976.1.patch, TEZ-3976.2.patch, TEZ-3976.3.patch, 
> TEZ-3976.4.patch, TEZ-3976.5.patch, TEZ-3976.6.patch, TEZ-3976.7.patch
>
>
> The symptoms are a lot of these logs are being shown:
> {code:java}
> 2018-06-15T18:09:35,811 INFO  [Fetcher_B {Reducer_5} #0 ()] 
> org.apache.tez.runtime.library.common.shuffle.impl.ShuffleManager: Reducer_5: 
> Fetch failed for src: InputAttemptIdentifier [inputIdentifier=701, 
> attemptNumber=0, 
> pathComponent=attempt_152901963_0021_34_01_000701_0_12541_0, spillType=2, 
> spillId=0]InputIdentifier: InputAttemptIdentifier [inputIdentifier=701, 
> attemptNumber=0, 
> pathComponent=attempt_152901963_0021_34_01_000701_0_12541_0, spillType=2, 
> spillId=0], connectFailed: true
> 2018-06-15T18:09:35,811 WARN  [Fetcher_B {Reducer_5} #1 ()] 
> org.apache.tez.runtime.library.common.shuffle.Fetcher: copyInputs failed for 
> tasks [InputAttemptIdentifier [inputIdentifier=589, attemptNumber=0, 
> pathComponent=attempt_152901963_0021_34_01_000589_0_12445_0, spillType=2, 
> spillId=0]]
> 2018-06-15T18:09:35,811 INFO  [Fetcher_B {Reducer_5} #1 ()] 
> org.apache.tez.runtime.library.common.shuffle.impl.ShuffleManager: Reducer_5: 
> Fetch failed for src: InputAttemptIdentifier [inputIdentifier=589, 
> attemptNumber=0, 
> pathComponent=attempt_152901963_0021_34_01_000589_0_12445_0, spillType=2, 
> spillId=0]InputIdentifier: InputAttemptIdentifier [inputIdentifier=589, 
> attemptNumber=0, 
> pathComponent=attempt_152901963_0021_34_01_000589_0_12445_0, spillType=2, 
> spillId=0], connectFailed: true
> {code}
> Each of those translate into an event in the AM which finally crashes due to 
> OOM after around 30 minutes and around 10 million shuffle input errors (and 
> 10 million lines like the previous ones). When the ShufflerManager is closed 
> and the counters reported there are many shuffle input errors, some of those 
> logs are:
> {code:java}
> 2018-06-15T17:46:30,988  INFO [TezTR-441963_21_34_4_0_4 
> (152901963_0021_34_04_00_4)] runtime.LogicalIOProcessorRuntimeTask: 
> Final Counters for attempt_152901963_0021_34_04_00_4: Counters: 43 
> [[org.apache.tez.common.counters.TaskCounter SPILLED_RECORDS=0, 
> NUM_SHUFFLED_INPUTS=26, NUM_FAILED_SHUFFLE_INPUTS=858965, 
> INPUT_RECORDS_PROCESSED=26, OUTPUT_RECORDS=1, OUTPUT_LARGE_RECORDS=0, 
> OUTPUT_BYTES=779472, OUTPUT_BYTES_WITH_OVERHEAD=779483, 
> OUTPUT_BYTES_PHYSICAL=780146, ADDITIONAL_SPILLS_BYTES_WRITTEN=0, 
> ADDITIONAL_SPILLS_BYTES_READ=0, ADDITIONAL_SPILL_COUNT=0, 
> SHUFFLE_BYTES=4207563, SHUFFLE_BYTES_DECOMPRESSED=20266603, 
> SHUFFLE_BYTES_TO_MEM=3380616, SHUFFLE_BYTES_TO_DISK=0, 
> SHUFFLE_BYTES_DISK_DIRECT=826947, SHUFFLE_PHASE_TIME=52516, 
> FIRST_EVENT_RECEIVED=1, LAST_EVENT_RECEIVED=1185][HIVE 
> RECORDS_OUT_INTERMEDIATE_^[[1;35;40m^[[KReducer_12^[[m^[[K=1, 
> RECORDS_OUT_OPERATOR_GBY_159=1, 
> RECORDS_OUT_OPERATOR_RS_160=1][TaskCounter_^[[1;35;40m^[[KReducer_12^[[m^[[K_INPUT_Map_11
>  FIRST_EVENT_RECEIVED=1, INPUT_RECORDS_PROCESSED=26, 
> LAST_EVENT_RECEIVED=1185, NUM_FAILED_SHUFFLE_INPUTS=858965, 
> NUM_SHUFFLED_INPUTS=26, SHUFFLE_BYTES=4207563, 
> SHUFFLE_BYTES_DECOMPRESSED=20266603, SHUFFLE_BYTES_DISK_DIRECT=826947, 
> SHUFFLE_BYTES_TO_DISK=0, SHUFFLE_BYTES_TO_MEM=3380616, 
> SHUFFLE_PHASE_TIME=52516][TaskCounter_^[[1;35;40m^[[KReducer_12^[[m^[[K_OUTPUT_Map_1
>  ADDITIONAL_SPILLS_BYTES_READ=0, ADDITIONAL_SPILLS_BYTES_WRITTEN=0, 
> ADDITIONAL_SPILL_COUNT=0, OUTPUT_BYTES=779472, OUTPUT_BYTES_PHYSICAL=780146, 
> OUTPUT_BYTES_WITH_OVERHEAD=779483, OUTPUT_LARGE_RECORDS=0, OUTPUT_RECORDS=1, 
> SPILLED_RECORDS=0]]
> 2018-06-15T17:46:32,271 INFO  [TezTR-441963_21_34_3_15_1 ()] 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask: Final Counters for 
> attempt_152901963_0021_34_03_15_1: Counters: 87 [[File System 
> Counters FILE_BYTES_READ=0, FILE_BYTES_WRITTEN=0, FILE_READ_OPS=0, 
> FILE_LARGE_READ_OPS=0, FILE_WRITE_OPS=0, HDFS_BYTES_READ=2344929, 
> HDFS_BYTES_WRITTEN=0, HDFS_READ_OPS=5, HDFS_LARGE_READ_OPS=0, 
> 

[jira] [Commented] (TEZ-3976) ShuffleManager reporting too many errors

2018-10-18 Thread TezQA (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16655823#comment-16655823
 ] 

TezQA commented on TEZ-3976:


| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m  
8s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  3m 
58s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
45s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
53s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
33s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
54s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
15s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
41s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
44s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red}  0m 23s{color} 
| {color:red} tez-runtime-library generated 2 new + 6 unchanged - 0 fixed = 8 
total (was 6) {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 21s{color} | {color:orange} tez-api: The patch generated 11 new + 7 
unchanged - 7 fixed = 18 total (was 14) {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 31s{color} | {color:orange} tez-runtime-library: The patch generated 6 new + 
587 unchanged - 3 fixed = 593 total (was 590) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
25s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  
5s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  2m  
0s{color} | {color:green} tez-api in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  4m 
15s{color} | {color:green} tez-runtime-library in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
24s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 21m 59s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | TEZ-3976 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12944611/TEZ-3976.7.patch |
| Optional Tests |  dupname  asflicense  javac  javadoc  unit  findbugs  
checkstyle  compile  |
| uname | Linux asf904.gq1.ygridcore.net 3.13.0-153-generic #203-Ubuntu SMP Thu 
Jun 14 08:52:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-TEZ-Build/yetus/precommit/personality/tez.sh
 |
| git revision | master / 9277815 |
| maven | version: Apache Maven 3.5.4 
(1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z) |
| Default Java | 1.8.0_172 |
| findbugs | v3.0.1 |
| javac | 
https://builds.apache.org/job/PreCommit-TEZ-Build/35/artifact/out/diff-compile-javac-tez-runtime-library.txt
 |
| checkstyle | 
https://builds.apache.org/job/PreCommit-TEZ-Build/35/artifact/out/diff-checkstyle-tez-api.txt
 |
| checkstyle | 
https://builds.apache.org/job/PreCommit-TEZ-Build/35/artifact/out/diff-checkstyle-tez-runtime-library.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-TEZ-Build/35/testReport/ |
| modules | C: tez-api tez-runtime-library U: . |
| Console output | https://builds.apache.org/job/PreCommit-TEZ-Build/35/console 
|
| Powered by | Apache Yetus 0.8.0  

[jira] [Commented] (TEZ-3976) ShuffleManager reporting too many errors

2018-10-18 Thread TezQA (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16655741#comment-16655741
 ] 

TezQA commented on TEZ-3976:


| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  2m 
23s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  5m 
18s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
40s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
16s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
37s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
51s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
14s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
42s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
41s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red}  0m 20s{color} 
| {color:red} tez-runtime-library generated 2 new + 6 unchanged - 0 fixed = 8 
total (was 6) {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 22s{color} | {color:orange} tez-api: The patch generated 11 new + 8 
unchanged - 7 fixed = 19 total (was 15) {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 33s{color} | {color:orange} tez-runtime-library: The patch generated 20 new 
+ 587 unchanged - 3 fixed = 607 total (was 590) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red}  0m 
55s{color} | {color:red} tez-runtime-library generated 1 new + 0 unchanged - 0 
fixed = 1 total (was 0) {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
47s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  1m 
47s{color} | {color:green} tez-api in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  4m 
22s{color} | {color:green} tez-runtime-library in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
31s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 24m 51s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| FindBugs | module:tez-runtime-library |
|  |  Exceptional return value of 
java.util.concurrent.locks.Condition.await(long, TimeUnit) ignored in 
org.apache.tez.runtime.library.common.shuffle.impl.ShuffleManager$ReporterCallable.callInternal()
  At ShuffleManager.java:TimeUnit) ignored in 
org.apache.tez.runtime.library.common.shuffle.impl.ShuffleManager$ReporterCallable.callInternal()
  At ShuffleManager.java:[line 349] |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | TEZ-3976 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12944593/TEZ-3976.6.patch |
| Optional Tests |  dupname  asflicense  javac  javadoc  unit  findbugs  
checkstyle  compile  |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-153-generic #203-Ubuntu SMP Thu 
Jun 14 08:52:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-TEZ-Build/yetus/precommit/personality/tez.sh
 |
| git revision | master / 9277815 |
| maven | version: Apache Maven 3.5.4 
(1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z) |
| Default Java | 1.8.0_172 |
| findbugs | v3.0.1 |
| javac | 

[jira] [Commented] (TEZ-3976) ShuffleManager reporting too many errors

2018-10-17 Thread TezQA (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16654616#comment-16654616
 ] 

TezQA commented on TEZ-3976:


| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
21s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:red}-1{color} | {color:red} mvninstall {color} | {color:red}  1m 
33s{color} | {color:red} root in master failed. {color} |
| {color:red}-1{color} | {color:red} compile {color} | {color:red}  0m  
9s{color} | {color:red} tez-api in master failed with JDK v1.8.0_181. {color} |
| {color:red}-1{color} | {color:red} compile {color} | {color:red}  0m  
9s{color} | {color:red} tez-runtime-library in master failed with JDK 
v1.8.0_181. {color} |
| {color:red}-1{color} | {color:red} compile {color} | {color:red}  0m 
10s{color} | {color:red} tez-api in master failed with JDK v1.8.0_172. {color} |
| {color:red}-1{color} | {color:red} compile {color} | {color:red}  0m  
9s{color} | {color:red} tez-runtime-library in master failed with JDK 
v1.8.0_172. {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
48s{color} | {color:green} master passed {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red}  0m  
9s{color} | {color:red} tez-api in master failed. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red}  0m  
9s{color} | {color:red} tez-runtime-library in master failed. {color} |
| {color:red}-1{color} | {color:red} javadoc {color} | {color:red}  0m  
9s{color} | {color:red} tez-api in master failed with JDK v1.8.0_181. {color} |
| {color:red}-1{color} | {color:red} javadoc {color} | {color:red}  0m  
9s{color} | {color:red} tez-runtime-library in master failed with JDK 
v1.8.0_181. {color} |
| {color:red}-1{color} | {color:red} javadoc {color} | {color:red}  0m 
10s{color} | {color:red} tez-api in master failed with JDK v1.8.0_172. {color} |
| {color:red}-1{color} | {color:red} javadoc {color} | {color:red}  0m 
10s{color} | {color:red} tez-runtime-library in master failed with JDK 
v1.8.0_172. {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
13s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:red}-1{color} | {color:red} mvninstall {color} | {color:red}  0m 
11s{color} | {color:red} tez-api in the patch failed. {color} |
| {color:red}-1{color} | {color:red} mvninstall {color} | {color:red}  0m  
9s{color} | {color:red} tez-runtime-library in the patch failed. {color} |
| {color:red}-1{color} | {color:red} compile {color} | {color:red}  0m  
9s{color} | {color:red} tez-api in the patch failed with JDK v1.8.0_181. 
{color} |
| {color:red}-1{color} | {color:red} compile {color} | {color:red}  0m  
8s{color} | {color:red} tez-runtime-library in the patch failed with JDK 
v1.8.0_181. {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red}  0m  9s{color} 
| {color:red} tez-api in the patch failed with JDK v1.8.0_181. {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red}  0m  8s{color} 
| {color:red} tez-runtime-library in the patch failed with JDK v1.8.0_181. 
{color} |
| {color:red}-1{color} | {color:red} compile {color} | {color:red}  0m  
9s{color} | {color:red} tez-api in the patch failed with JDK v1.8.0_172. 
{color} |
| {color:red}-1{color} | {color:red} compile {color} | {color:red}  0m  
9s{color} | {color:red} tez-runtime-library in the patch failed with JDK 
v1.8.0_172. {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red}  0m  9s{color} 
| {color:red} tez-api in the patch failed with JDK v1.8.0_172. {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red}  0m  9s{color} 
| {color:red} tez-runtime-library in the patch failed with JDK v1.8.0_172. 
{color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 17s{color} | {color:orange} tez-api: The patch generated 14 new + 12 
unchanged - 2 fixed = 26 total (was 14) {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 29s{color} | {color:orange} tez-runtime-library: The patch generated 18 new 
+ 588 unchanged - 3 fixed = 606 total (was 591) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | 

[jira] [Commented] (TEZ-3976) ShuffleManager reporting too many errors

2018-10-17 Thread TezQA (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16654577#comment-16654577
 ] 

TezQA commented on TEZ-3976:


| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  2m 
28s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:red}-1{color} | {color:red} mvninstall {color} | {color:red}  3m 
52s{color} | {color:red} root in master failed. {color} |
| {color:red}-1{color} | {color:red} compile {color} | {color:red}  0m 
17s{color} | {color:red} tez-api in master failed with JDK v1.8.0_181. {color} |
| {color:red}-1{color} | {color:red} compile {color} | {color:red}  0m 
12s{color} | {color:red} tez-runtime-library in master failed with JDK 
v1.8.0_181. {color} |
| {color:red}-1{color} | {color:red} compile {color} | {color:red}  0m 
10s{color} | {color:red} tez-api in master failed with JDK v1.8.0_172. {color} |
| {color:red}-1{color} | {color:red} compile {color} | {color:red}  0m 
11s{color} | {color:red} tez-runtime-library in master failed with JDK 
v1.8.0_172. {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
15s{color} | {color:green} master passed {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red}  0m 
11s{color} | {color:red} tez-api in master failed. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red}  0m 
10s{color} | {color:red} tez-runtime-library in master failed. {color} |
| {color:red}-1{color} | {color:red} javadoc {color} | {color:red}  0m 
10s{color} | {color:red} tez-api in master failed with JDK v1.8.0_181. {color} |
| {color:red}-1{color} | {color:red} javadoc {color} | {color:red}  0m 
11s{color} | {color:red} tez-runtime-library in master failed with JDK 
v1.8.0_181. {color} |
| {color:red}-1{color} | {color:red} javadoc {color} | {color:red}  0m 
11s{color} | {color:red} tez-api in master failed with JDK v1.8.0_172. {color} |
| {color:red}-1{color} | {color:red} javadoc {color} | {color:red}  0m 
11s{color} | {color:red} tez-runtime-library in master failed with JDK 
v1.8.0_172. {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
14s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:red}-1{color} | {color:red} mvninstall {color} | {color:red}  0m 
10s{color} | {color:red} tez-api in the patch failed. {color} |
| {color:red}-1{color} | {color:red} mvninstall {color} | {color:red}  0m 
10s{color} | {color:red} tez-runtime-library in the patch failed. {color} |
| {color:red}-1{color} | {color:red} compile {color} | {color:red}  0m 
10s{color} | {color:red} tez-api in the patch failed with JDK v1.8.0_181. 
{color} |
| {color:red}-1{color} | {color:red} compile {color} | {color:red}  0m  
9s{color} | {color:red} tez-runtime-library in the patch failed with JDK 
v1.8.0_181. {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red}  0m 10s{color} 
| {color:red} tez-api in the patch failed with JDK v1.8.0_181. {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red}  0m  9s{color} 
| {color:red} tez-runtime-library in the patch failed with JDK v1.8.0_181. 
{color} |
| {color:red}-1{color} | {color:red} compile {color} | {color:red}  0m 
10s{color} | {color:red} tez-api in the patch failed with JDK v1.8.0_172. 
{color} |
| {color:red}-1{color} | {color:red} compile {color} | {color:red}  0m 
10s{color} | {color:red} tez-runtime-library in the patch failed with JDK 
v1.8.0_172. {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red}  0m 10s{color} 
| {color:red} tez-api in the patch failed with JDK v1.8.0_172. {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red}  0m 10s{color} 
| {color:red} tez-runtime-library in the patch failed with JDK v1.8.0_172. 
{color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 19s{color} | {color:orange} tez-api: The patch generated 14 new + 12 
unchanged - 2 fixed = 26 total (was 14) {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 30s{color} | {color:orange} tez-runtime-library: The patch generated 18 new 
+ 588 unchanged - 3 fixed = 606 total (was 591) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | 

[jira] [Commented] (TEZ-3976) ShuffleManager reporting too many errors

2018-10-17 Thread TezQA (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16654533#comment-16654533
 ] 

TezQA commented on TEZ-3976:


| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
21s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 
41s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
43s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
 3s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
39s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
53s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
16s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:red}-1{color} | {color:red} mvninstall {color} | {color:red}  0m 
18s{color} | {color:red} tez-runtime-library in the patch failed. {color} |
| {color:red}-1{color} | {color:red} compile {color} | {color:red}  0m 
18s{color} | {color:red} tez-runtime-library in the patch failed. {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red}  0m 18s{color} 
| {color:red} tez-runtime-library in the patch failed. {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 23s{color} | {color:orange} tez-api: The patch generated 14 new + 12 
unchanged - 2 fixed = 26 total (was 14) {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 38s{color} | {color:orange} tez-runtime-library: The patch generated 20 new 
+ 587 unchanged - 3 fixed = 607 total (was 590) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red}  0m 
20s{color} | {color:red} tez-runtime-library in the patch failed. {color} |
| {color:red}-1{color} | {color:red} javadoc {color} | {color:red}  0m 
22s{color} | {color:red} tez-runtime-library generated 1 new + 0 unchanged - 0 
fixed = 1 total (was 0) {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  1m 
53s{color} | {color:green} tez-api in the patch passed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red}  0m 18s{color} 
| {color:red} tez-runtime-library in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
22s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 17m 41s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | TEZ-3976 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12944455/TEZ-3976.3.patch |
| Optional Tests |  dupname  asflicense  javac  javadoc  unit  findbugs  
checkstyle  compile  |
| uname | Linux asf903.gq1.ygridcore.net 3.13.0-143-generic #192-Ubuntu SMP Tue 
Feb 27 10:45:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-TEZ-Build/yetus/precommit/personality/tez.sh
 |
| git revision | master / 9277815 |
| maven | version: Apache Maven 3.5.4 
(1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z) |
| Default Java | 1.8.0_172 |
| findbugs | v3.0.1 |
| mvninstall | 
https://builds.apache.org/job/PreCommit-TEZ-Build/30/artifact/out/patch-mvninstall-tez-runtime-library.txt
 |
| compile | 
https://builds.apache.org/job/PreCommit-TEZ-Build/30/artifact/out/patch-compile-tez-runtime-library.txt
 |
| javac | 
https://builds.apache.org/job/PreCommit-TEZ-Build/30/artifact/out/patch-compile-tez-runtime-library.txt
 |
| checkstyle | 
https://builds.apache.org/job/PreCommit-TEZ-Build/30/artifact/out/diff-checkstyle-tez-api.txt
 |
| checkstyle | 

[jira] [Commented] (TEZ-3976) ShuffleManager reporting too many errors

2018-08-20 Thread TezQA (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16586399#comment-16586399
 ] 

TezQA commented on TEZ-3976:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12936304/TEZ-3976.2.patch
  against master revision 4b9a7be.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 3.0.1) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: 
https://builds.apache.org/job/PreCommit-TEZ-Build/2895//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/2895//console

This message is automatically generated.


> ShuffleManager reporting too many errors
> 
>
> Key: TEZ-3976
> URL: https://issues.apache.org/jira/browse/TEZ-3976
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jaume M
>Assignee: Jaume M
>Priority: Major
> Attachments: TEZ-3976.1.patch, TEZ-3976.2.patch
>
>
> The symptoms are a lot of these logs are being shown:
> {code:java}
> 2018-06-15T18:09:35,811 INFO  [Fetcher_B {Reducer_5} #0 ()] 
> org.apache.tez.runtime.library.common.shuffle.impl.ShuffleManager: Reducer_5: 
> Fetch failed for src: InputAttemptIdentifier [inputIdentifier=701, 
> attemptNumber=0, 
> pathComponent=attempt_152901963_0021_34_01_000701_0_12541_0, spillType=2, 
> spillId=0]InputIdentifier: InputAttemptIdentifier [inputIdentifier=701, 
> attemptNumber=0, 
> pathComponent=attempt_152901963_0021_34_01_000701_0_12541_0, spillType=2, 
> spillId=0], connectFailed: true
> 2018-06-15T18:09:35,811 WARN  [Fetcher_B {Reducer_5} #1 ()] 
> org.apache.tez.runtime.library.common.shuffle.Fetcher: copyInputs failed for 
> tasks [InputAttemptIdentifier [inputIdentifier=589, attemptNumber=0, 
> pathComponent=attempt_152901963_0021_34_01_000589_0_12445_0, spillType=2, 
> spillId=0]]
> 2018-06-15T18:09:35,811 INFO  [Fetcher_B {Reducer_5} #1 ()] 
> org.apache.tez.runtime.library.common.shuffle.impl.ShuffleManager: Reducer_5: 
> Fetch failed for src: InputAttemptIdentifier [inputIdentifier=589, 
> attemptNumber=0, 
> pathComponent=attempt_152901963_0021_34_01_000589_0_12445_0, spillType=2, 
> spillId=0]InputIdentifier: InputAttemptIdentifier [inputIdentifier=589, 
> attemptNumber=0, 
> pathComponent=attempt_152901963_0021_34_01_000589_0_12445_0, spillType=2, 
> spillId=0], connectFailed: true
> {code}
> Each of those translate into an event in the AM which finally crashes due to 
> OOM after around 30 minutes and around 10 million shuffle input errors (and 
> 10 million lines like the previous ones). When the ShufflerManager is closed 
> and the counters reported there are many shuffle input errors, some of those 
> logs are:
> {code:java}
> 2018-06-15T17:46:30,988  INFO [TezTR-441963_21_34_4_0_4 
> (152901963_0021_34_04_00_4)] runtime.LogicalIOProcessorRuntimeTask: 
> Final Counters for attempt_152901963_0021_34_04_00_4: Counters: 43 
> [[org.apache.tez.common.counters.TaskCounter SPILLED_RECORDS=0, 
> NUM_SHUFFLED_INPUTS=26, NUM_FAILED_SHUFFLE_INPUTS=858965, 
> INPUT_RECORDS_PROCESSED=26, OUTPUT_RECORDS=1, OUTPUT_LARGE_RECORDS=0, 
> OUTPUT_BYTES=779472, OUTPUT_BYTES_WITH_OVERHEAD=779483, 
> OUTPUT_BYTES_PHYSICAL=780146, ADDITIONAL_SPILLS_BYTES_WRITTEN=0, 
> ADDITIONAL_SPILLS_BYTES_READ=0, ADDITIONAL_SPILL_COUNT=0, 
> SHUFFLE_BYTES=4207563, SHUFFLE_BYTES_DECOMPRESSED=20266603, 
> SHUFFLE_BYTES_TO_MEM=3380616, SHUFFLE_BYTES_TO_DISK=0, 
> SHUFFLE_BYTES_DISK_DIRECT=826947, SHUFFLE_PHASE_TIME=52516, 
> FIRST_EVENT_RECEIVED=1, LAST_EVENT_RECEIVED=1185][HIVE 
> RECORDS_OUT_INTERMEDIATE_^[[1;35;40m^[[KReducer_12^[[m^[[K=1, 
> RECORDS_OUT_OPERATOR_GBY_159=1, 
> RECORDS_OUT_OPERATOR_RS_160=1][TaskCounter_^[[1;35;40m^[[KReducer_12^[[m^[[K_INPUT_Map_11
>  FIRST_EVENT_RECEIVED=1, INPUT_RECORDS_PROCESSED=26, 
> LAST_EVENT_RECEIVED=1185, NUM_FAILED_SHUFFLE_INPUTS=858965, 
> NUM_SHUFFLED_INPUTS=26, SHUFFLE_BYTES=4207563, 
> SHUFFLE_BYTES_DECOMPRESSED=20266603, SHUFFLE_BYTES_DISK_DIRECT=826947, 
> SHUFFLE_BYTES_TO_DISK=0, SHUFFLE_BYTES_TO_MEM=3380616, 
> SHUFFLE_PHASE_TIME=52516][TaskCounter_^[[1;35;40m^[[KReducer_12^[[m^[[K_OUTPUT_Map_1
>  ADDITIONAL_SPILLS_BYTES_READ=0, 

[jira] [Commented] (TEZ-3976) ShuffleManager reporting too many errors

2018-08-07 Thread Jaume M (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16572043#comment-16572043
 ] 

Jaume M commented on TEZ-3976:
--

[~jlowe] for clarification the OOM is in the AM.

Some tasks are killed by the AM and others seem to finish fine (despite a huge 
number of \{{NUM_FAILED_SHUFFLE_INPUTS}}). I see the AM killing some tasks but 
I think it just can't keep up with all the incoming evens. 

For example task {{152901963_0021_34_04_00_4}} runs for around one 
minute and reports {{NUM_FAILED_SHUFFLE_INPUTS=858965}}. I think we should to 
make you second point:
bq. The downstream task decides that it is getting far too many errors and 
insufficient overall shuffle progress, so it declares itself unhealthy and 
fails.
happen as well for the {{ShuffleManager}}.

> ShuffleManager reporting too many errors
> 
>
> Key: TEZ-3976
> URL: https://issues.apache.org/jira/browse/TEZ-3976
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jaume M
>Priority: Major
>
> The symptoms are a lot of these logs are being shown:
> {code:java}
> 2018-06-15T18:09:35,811 INFO  [Fetcher_B {Reducer_5} #0 ()] 
> org.apache.tez.runtime.library.common.shuffle.impl.ShuffleManager: Reducer_5: 
> Fetch failed for src: InputAttemptIdentifier [inputIdentifier=701, 
> attemptNumber=0, 
> pathComponent=attempt_152901963_0021_34_01_000701_0_12541_0, spillType=2, 
> spillId=0]InputIdentifier: InputAttemptIdentifier [inputIdentifier=701, 
> attemptNumber=0, 
> pathComponent=attempt_152901963_0021_34_01_000701_0_12541_0, spillType=2, 
> spillId=0], connectFailed: true
> 2018-06-15T18:09:35,811 WARN  [Fetcher_B {Reducer_5} #1 ()] 
> org.apache.tez.runtime.library.common.shuffle.Fetcher: copyInputs failed for 
> tasks [InputAttemptIdentifier [inputIdentifier=589, attemptNumber=0, 
> pathComponent=attempt_152901963_0021_34_01_000589_0_12445_0, spillType=2, 
> spillId=0]]
> 2018-06-15T18:09:35,811 INFO  [Fetcher_B {Reducer_5} #1 ()] 
> org.apache.tez.runtime.library.common.shuffle.impl.ShuffleManager: Reducer_5: 
> Fetch failed for src: InputAttemptIdentifier [inputIdentifier=589, 
> attemptNumber=0, 
> pathComponent=attempt_152901963_0021_34_01_000589_0_12445_0, spillType=2, 
> spillId=0]InputIdentifier: InputAttemptIdentifier [inputIdentifier=589, 
> attemptNumber=0, 
> pathComponent=attempt_152901963_0021_34_01_000589_0_12445_0, spillType=2, 
> spillId=0], connectFailed: true
> {code}
> Each of those translate into an event in the AM which finally crashes due to 
> OOM after around 30 minutes and around 10 million shuffle input errors (and 
> 10 million lines like the previous ones). When the ShufflerManager is closed 
> and the counters reported there are many shuffle input errors, some of those 
> logs are:
> {code:java}
> 2018-06-15T17:46:30,988  INFO [TezTR-441963_21_34_4_0_4 
> (152901963_0021_34_04_00_4)] runtime.LogicalIOProcessorRuntimeTask: 
> Final Counters for attempt_152901963_0021_34_04_00_4: Counters: 43 
> [[org.apache.tez.common.counters.TaskCounter SPILLED_RECORDS=0, 
> NUM_SHUFFLED_INPUTS=26, NUM_FAILED_SHUFFLE_INPUTS=858965, 
> INPUT_RECORDS_PROCESSED=26, OUTPUT_RECORDS=1, OUTPUT_LARGE_RECORDS=0, 
> OUTPUT_BYTES=779472, OUTPUT_BYTES_WITH_OVERHEAD=779483, 
> OUTPUT_BYTES_PHYSICAL=780146, ADDITIONAL_SPILLS_BYTES_WRITTEN=0, 
> ADDITIONAL_SPILLS_BYTES_READ=0, ADDITIONAL_SPILL_COUNT=0, 
> SHUFFLE_BYTES=4207563, SHUFFLE_BYTES_DECOMPRESSED=20266603, 
> SHUFFLE_BYTES_TO_MEM=3380616, SHUFFLE_BYTES_TO_DISK=0, 
> SHUFFLE_BYTES_DISK_DIRECT=826947, SHUFFLE_PHASE_TIME=52516, 
> FIRST_EVENT_RECEIVED=1, LAST_EVENT_RECEIVED=1185][HIVE 
> RECORDS_OUT_INTERMEDIATE_^[[1;35;40m^[[KReducer_12^[[m^[[K=1, 
> RECORDS_OUT_OPERATOR_GBY_159=1, 
> RECORDS_OUT_OPERATOR_RS_160=1][TaskCounter_^[[1;35;40m^[[KReducer_12^[[m^[[K_INPUT_Map_11
>  FIRST_EVENT_RECEIVED=1, INPUT_RECORDS_PROCESSED=26, 
> LAST_EVENT_RECEIVED=1185, NUM_FAILED_SHUFFLE_INPUTS=858965, 
> NUM_SHUFFLED_INPUTS=26, SHUFFLE_BYTES=4207563, 
> SHUFFLE_BYTES_DECOMPRESSED=20266603, SHUFFLE_BYTES_DISK_DIRECT=826947, 
> SHUFFLE_BYTES_TO_DISK=0, SHUFFLE_BYTES_TO_MEM=3380616, 
> SHUFFLE_PHASE_TIME=52516][TaskCounter_^[[1;35;40m^[[KReducer_12^[[m^[[K_OUTPUT_Map_1
>  ADDITIONAL_SPILLS_BYTES_READ=0, ADDITIONAL_SPILLS_BYTES_WRITTEN=0, 
> ADDITIONAL_SPILL_COUNT=0, OUTPUT_BYTES=779472, OUTPUT_BYTES_PHYSICAL=780146, 
> OUTPUT_BYTES_WITH_OVERHEAD=779483, OUTPUT_LARGE_RECORDS=0, OUTPUT_RECORDS=1, 
> SPILLED_RECORDS=0]]
> 2018-06-15T17:46:32,271 INFO  [TezTR-441963_21_34_3_15_1 ()] 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask: Final Counters for 
> attempt_152901963_0021_34_03_15_1: Counters: 87 [[File System 
> Counters FILE_BYTES_READ=0, FILE_BYTES_WRITTEN=0, FILE_READ_OPS=0, 
> FILE_LARGE_READ_OPS=0, FILE_WRITE_OPS=0, 

[jira] [Commented] (TEZ-3976) ShuffleManager reporting too many errors

2018-08-07 Thread Gopal V (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16572034#comment-16572034
 ] 

Gopal V commented on TEZ-3976:
--

bq. Normally this scenario does not occur because one of the following triggers 
first:

That is what happens with ShuffleScheduler (the one that matches the MapReduce 
shuffle).

ShuffleManager is the other codepath which is used for unsorted shuffle joins.

> ShuffleManager reporting too many errors
> 
>
> Key: TEZ-3976
> URL: https://issues.apache.org/jira/browse/TEZ-3976
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jaume M
>Priority: Major
>
> The symptoms are a lot of these logs are being shown:
> {code:java}
> 2018-06-15T18:09:35,811 INFO  [Fetcher_B {Reducer_5} #0 ()] 
> org.apache.tez.runtime.library.common.shuffle.impl.ShuffleManager: Reducer_5: 
> Fetch failed for src: InputAttemptIdentifier [inputIdentifier=701, 
> attemptNumber=0, 
> pathComponent=attempt_152901963_0021_34_01_000701_0_12541_0, spillType=2, 
> spillId=0]InputIdentifier: InputAttemptIdentifier [inputIdentifier=701, 
> attemptNumber=0, 
> pathComponent=attempt_152901963_0021_34_01_000701_0_12541_0, spillType=2, 
> spillId=0], connectFailed: true
> 2018-06-15T18:09:35,811 WARN  [Fetcher_B {Reducer_5} #1 ()] 
> org.apache.tez.runtime.library.common.shuffle.Fetcher: copyInputs failed for 
> tasks [InputAttemptIdentifier [inputIdentifier=589, attemptNumber=0, 
> pathComponent=attempt_152901963_0021_34_01_000589_0_12445_0, spillType=2, 
> spillId=0]]
> 2018-06-15T18:09:35,811 INFO  [Fetcher_B {Reducer_5} #1 ()] 
> org.apache.tez.runtime.library.common.shuffle.impl.ShuffleManager: Reducer_5: 
> Fetch failed for src: InputAttemptIdentifier [inputIdentifier=589, 
> attemptNumber=0, 
> pathComponent=attempt_152901963_0021_34_01_000589_0_12445_0, spillType=2, 
> spillId=0]InputIdentifier: InputAttemptIdentifier [inputIdentifier=589, 
> attemptNumber=0, 
> pathComponent=attempt_152901963_0021_34_01_000589_0_12445_0, spillType=2, 
> spillId=0], connectFailed: true
> {code}
> Each of those translate into an event in the AM which finally crashes due to 
> OOM after around 30 minutes and around 10 million shuffle input errors (and 
> 10 million lines like the previous ones). When the ShufflerManager is closed 
> and the counters reported there are many shuffle input errors, some of those 
> logs are:
> {code:java}
> 2018-06-15T17:46:30,988  INFO [TezTR-441963_21_34_4_0_4 
> (152901963_0021_34_04_00_4)] runtime.LogicalIOProcessorRuntimeTask: 
> Final Counters for attempt_152901963_0021_34_04_00_4: Counters: 43 
> [[org.apache.tez.common.counters.TaskCounter SPILLED_RECORDS=0, 
> NUM_SHUFFLED_INPUTS=26, NUM_FAILED_SHUFFLE_INPUTS=858965, 
> INPUT_RECORDS_PROCESSED=26, OUTPUT_RECORDS=1, OUTPUT_LARGE_RECORDS=0, 
> OUTPUT_BYTES=779472, OUTPUT_BYTES_WITH_OVERHEAD=779483, 
> OUTPUT_BYTES_PHYSICAL=780146, ADDITIONAL_SPILLS_BYTES_WRITTEN=0, 
> ADDITIONAL_SPILLS_BYTES_READ=0, ADDITIONAL_SPILL_COUNT=0, 
> SHUFFLE_BYTES=4207563, SHUFFLE_BYTES_DECOMPRESSED=20266603, 
> SHUFFLE_BYTES_TO_MEM=3380616, SHUFFLE_BYTES_TO_DISK=0, 
> SHUFFLE_BYTES_DISK_DIRECT=826947, SHUFFLE_PHASE_TIME=52516, 
> FIRST_EVENT_RECEIVED=1, LAST_EVENT_RECEIVED=1185][HIVE 
> RECORDS_OUT_INTERMEDIATE_^[[1;35;40m^[[KReducer_12^[[m^[[K=1, 
> RECORDS_OUT_OPERATOR_GBY_159=1, 
> RECORDS_OUT_OPERATOR_RS_160=1][TaskCounter_^[[1;35;40m^[[KReducer_12^[[m^[[K_INPUT_Map_11
>  FIRST_EVENT_RECEIVED=1, INPUT_RECORDS_PROCESSED=26, 
> LAST_EVENT_RECEIVED=1185, NUM_FAILED_SHUFFLE_INPUTS=858965, 
> NUM_SHUFFLED_INPUTS=26, SHUFFLE_BYTES=4207563, 
> SHUFFLE_BYTES_DECOMPRESSED=20266603, SHUFFLE_BYTES_DISK_DIRECT=826947, 
> SHUFFLE_BYTES_TO_DISK=0, SHUFFLE_BYTES_TO_MEM=3380616, 
> SHUFFLE_PHASE_TIME=52516][TaskCounter_^[[1;35;40m^[[KReducer_12^[[m^[[K_OUTPUT_Map_1
>  ADDITIONAL_SPILLS_BYTES_READ=0, ADDITIONAL_SPILLS_BYTES_WRITTEN=0, 
> ADDITIONAL_SPILL_COUNT=0, OUTPUT_BYTES=779472, OUTPUT_BYTES_PHYSICAL=780146, 
> OUTPUT_BYTES_WITH_OVERHEAD=779483, OUTPUT_LARGE_RECORDS=0, OUTPUT_RECORDS=1, 
> SPILLED_RECORDS=0]]
> 2018-06-15T17:46:32,271 INFO  [TezTR-441963_21_34_3_15_1 ()] 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask: Final Counters for 
> attempt_152901963_0021_34_03_15_1: Counters: 87 [[File System 
> Counters FILE_BYTES_READ=0, FILE_BYTES_WRITTEN=0, FILE_READ_OPS=0, 
> FILE_LARGE_READ_OPS=0, FILE_WRITE_OPS=0, HDFS_BYTES_READ=2344929, 
> HDFS_BYTES_WRITTEN=0, HDFS_READ_OPS=5, HDFS_LARGE_READ_OPS=0, 
> HDFS_WRITE_OPS=0][org.apache.tez.common.counters.TaskCounter 
> SPILLED_RECORDS=0, NUM_SHUFFLED_INPUTS=1, NUM_FAILED_SHUFFLE_INPUTS=105195, 
> INPUT_RECORDS_PROCESSED=397, INPUT_SPLIT_LENGTH_BYTES=21563271, 
> OUTPUT_RECORDS=15737, OUTPUT_LARGE_RECORDS=0, OUTPUT_BYTES=1235818, 
> OUTPUT_BYTES_WITH_OVERHEAD=1267307, 

[jira] [Commented] (TEZ-3976) ShuffleManager reporting too many errors

2018-08-07 Thread Jason Lowe (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16571755#comment-16571755
 ] 

Jason Lowe commented on TEZ-3976:
-

bq. Each of those translate into an event in the AM which finally crashes due 
to OOM after around 30 minutes and around 10 million shuffle input errors (and 
10 million lines like the previous ones).

To clarify, this is the task crashing with an OOM and not the AM?  If so it 
looks like yet another flow control issue with the RPC layer where the task is 
generating events faster than the network can push them out.

Normally this scenario does not occur because one of the following triggers 
first:
# The AM sees enough complaints about the inability to fetch an upstream task's 
inputs that it decides to retroactively fail the upstream task and re-run it.  
When that is decided the AM informs the downstream tasks about the now obsolete 
shuffle inputs, and the downstream tasks will stop trying to fetch it.
# The downstream task decides that it is getting far too many errors and 
insufficient overall shuffle progress, so it declares itself unhealthy and 
fails.

It would be useful to understand why neither of those is happening in this case.


> ShuffleManager reporting too many errors
> 
>
> Key: TEZ-3976
> URL: https://issues.apache.org/jira/browse/TEZ-3976
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jaume M
>Priority: Major
>
> The symptoms are a lot of these logs are being shown:
> {code:java}
> 2018-06-15T18:09:35,811 INFO  [Fetcher_B {Reducer_5} #0 ()] 
> org.apache.tez.runtime.library.common.shuffle.impl.ShuffleManager: Reducer_5: 
> Fetch failed for src: InputAttemptIdentifier [inputIdentifier=701, 
> attemptNumber=0, 
> pathComponent=attempt_152901963_0021_34_01_000701_0_12541_0, spillType=2, 
> spillId=0]InputIdentifier: InputAttemptIdentifier [inputIdentifier=701, 
> attemptNumber=0, 
> pathComponent=attempt_152901963_0021_34_01_000701_0_12541_0, spillType=2, 
> spillId=0], connectFailed: true
> 2018-06-15T18:09:35,811 WARN  [Fetcher_B {Reducer_5} #1 ()] 
> org.apache.tez.runtime.library.common.shuffle.Fetcher: copyInputs failed for 
> tasks [InputAttemptIdentifier [inputIdentifier=589, attemptNumber=0, 
> pathComponent=attempt_152901963_0021_34_01_000589_0_12445_0, spillType=2, 
> spillId=0]]
> 2018-06-15T18:09:35,811 INFO  [Fetcher_B {Reducer_5} #1 ()] 
> org.apache.tez.runtime.library.common.shuffle.impl.ShuffleManager: Reducer_5: 
> Fetch failed for src: InputAttemptIdentifier [inputIdentifier=589, 
> attemptNumber=0, 
> pathComponent=attempt_152901963_0021_34_01_000589_0_12445_0, spillType=2, 
> spillId=0]InputIdentifier: InputAttemptIdentifier [inputIdentifier=589, 
> attemptNumber=0, 
> pathComponent=attempt_152901963_0021_34_01_000589_0_12445_0, spillType=2, 
> spillId=0], connectFailed: true
> {code}
> Each of those translate into an event in the AM which finally crashes due to 
> OOM after around 30 minutes and around 10 million shuffle input errors (and 
> 10 million lines like the previous ones). When the ShufflerManager is closed 
> and the counters reported there are many shuffle input errors, some of those 
> logs are:
> {code:java}
> 2018-06-15T17:46:30,988  INFO [TezTR-441963_21_34_4_0_4 
> (152901963_0021_34_04_00_4)] runtime.LogicalIOProcessorRuntimeTask: 
> Final Counters for attempt_152901963_0021_34_04_00_4: Counters: 43 
> [[org.apache.tez.common.counters.TaskCounter SPILLED_RECORDS=0, 
> NUM_SHUFFLED_INPUTS=26, NUM_FAILED_SHUFFLE_INPUTS=858965, 
> INPUT_RECORDS_PROCESSED=26, OUTPUT_RECORDS=1, OUTPUT_LARGE_RECORDS=0, 
> OUTPUT_BYTES=779472, OUTPUT_BYTES_WITH_OVERHEAD=779483, 
> OUTPUT_BYTES_PHYSICAL=780146, ADDITIONAL_SPILLS_BYTES_WRITTEN=0, 
> ADDITIONAL_SPILLS_BYTES_READ=0, ADDITIONAL_SPILL_COUNT=0, 
> SHUFFLE_BYTES=4207563, SHUFFLE_BYTES_DECOMPRESSED=20266603, 
> SHUFFLE_BYTES_TO_MEM=3380616, SHUFFLE_BYTES_TO_DISK=0, 
> SHUFFLE_BYTES_DISK_DIRECT=826947, SHUFFLE_PHASE_TIME=52516, 
> FIRST_EVENT_RECEIVED=1, LAST_EVENT_RECEIVED=1185][HIVE 
> RECORDS_OUT_INTERMEDIATE_^[[1;35;40m^[[KReducer_12^[[m^[[K=1, 
> RECORDS_OUT_OPERATOR_GBY_159=1, 
> RECORDS_OUT_OPERATOR_RS_160=1][TaskCounter_^[[1;35;40m^[[KReducer_12^[[m^[[K_INPUT_Map_11
>  FIRST_EVENT_RECEIVED=1, INPUT_RECORDS_PROCESSED=26, 
> LAST_EVENT_RECEIVED=1185, NUM_FAILED_SHUFFLE_INPUTS=858965, 
> NUM_SHUFFLED_INPUTS=26, SHUFFLE_BYTES=4207563, 
> SHUFFLE_BYTES_DECOMPRESSED=20266603, SHUFFLE_BYTES_DISK_DIRECT=826947, 
> SHUFFLE_BYTES_TO_DISK=0, SHUFFLE_BYTES_TO_MEM=3380616, 
> SHUFFLE_PHASE_TIME=52516][TaskCounter_^[[1;35;40m^[[KReducer_12^[[m^[[K_OUTPUT_Map_1
>  ADDITIONAL_SPILLS_BYTES_READ=0, ADDITIONAL_SPILLS_BYTES_WRITTEN=0, 
> ADDITIONAL_SPILL_COUNT=0, OUTPUT_BYTES=779472, OUTPUT_BYTES_PHYSICAL=780146, 
> OUTPUT_BYTES_WITH_OVERHEAD=779483,