[jira] [Updated] (TEZ-2484) Tez vertex for Hive fails but Resource Manager reports job succeeded

2015-05-26 Thread Hari Sekhon (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated TEZ-2484:
-
Attachment: Tez_RM_misreporting_succeeded.png

Attaching a screenshot of the YARN Resource Manager line showing this Tez job
being incorrectly reported as succeeded despite the failure output in the user
session.

 Tez vertex for Hive fails but Resource Manager reports job succeeded
 

 Key: TEZ-2484
 URL: https://issues.apache.org/jira/browse/TEZ-2484
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.5.2
 Environment: HDP 2.2.4.2
Reporter: Hari Sekhon
 Attachments: Tez_RM_misreporting_succeeded.png


 When running a Hive on Tez job via Hive CLI the job fails and I get the error 
 shown below but in the Resource Manager the job is shown as Succeeded, even 
 though it's clearly failed:
 {code}
 Status: Running (Executing on YARN cluster with App id 
 application_1432310690008_0103)
 
 VERTICES  STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
 
 Map 1     FAILED   1478          0        0     1478       1    1477
 
 VERTICES: 00/01  [--] 0%  ELAPSED TIME: 1589.41 s
 
 Status: Failed
 Vertex failed, vertexName=Map 1, vertexId=vertex_1432310690008_0103_1_00, 
 diagnostics=[Task failed, taskId=task_1432310690008_0103_1_00_00, 
 diagnostics=[TaskAttempt 0 failed, info=[ 
 Container container_e122_1432310690008_0103_01_94 received a 
 STOP_REQUEST]], Vertex failed as one or more tasks failed. failedTasks:1, 
 Vertex vertex_1432310690008_0103_1_00 [Map 1] killed/failed due to:null]
 DAG failed due to vertex failure. failedVertices:1 killedVertices:0
 FAILED: Execution Error, return code 2 from 
 org.apache.hadoop.hive.ql.exec.tez.TezTask
 {code}
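The failure markers in the console output above can be checked mechanically. The following is a minimal Python sketch, not part of Hive or Tez: the function name `hive_tez_run_failed` and both regular expressions are assumptions, based only on the `Status: Failed` line and the `failedVertices:N killedVertices:N` summary line shown in this report.

```python
import re

# Assumed format, copied from the console output quoted in this report:
#   "DAG failed due to vertex failure. failedVertices:1 killedVertices:0"
DAG_SUMMARY_RE = re.compile(r"failedVertices:(\d+)\s+killedVertices:(\d+)")
STATUS_FAILED_RE = re.compile(r"^\s*Status: Failed\s*$", re.MULTILINE)


def hive_tez_run_failed(console_output: str) -> bool:
    """Return True if the captured Hive CLI output reports a failed Tez DAG."""
    # A standalone "Status: Failed" line is the clearest marker.
    if STATUS_FAILED_RE.search(console_output):
        return True
    # Otherwise fall back to the DAG summary line and its failed-vertex count.
    match = DAG_SUMMARY_RE.search(console_output)
    return bool(match and int(match.group(1)) > 0)
```

A wrapper script could call this on the captured session output and treat a `True` result as authoritative, regardless of what the Resource Manager later displays.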



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TEZ-2484) Tez vertex for Hive fails but Resource Manager reports job succeeded

2015-05-26 Thread Hari Sekhon (JIRA)
Hari Sekhon created TEZ-2484:


 Summary: Tez vertex for Hive fails but Resource Manager reports 
job succeeded
 Key: TEZ-2484
 URL: https://issues.apache.org/jira/browse/TEZ-2484
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.5.2
 Environment: HDP 2.2.4.2
Reporter: Hari Sekhon


When running a Hive on Tez job via Hive CLI the job fails and I get the error 
shown below but in the Resource Manager the job is shown as Succeeded, even 
though it's clearly failed:
{code}
Status: Running (Executing on YARN cluster with App id 
application_1432310690008_0103)


VERTICES  STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED

Map 1     FAILED   1478          0        0     1478       1    1477

VERTICES: 00/01  [--] 0%  ELAPSED TIME: 1589.41 s

Status: Failed
Vertex failed, vertexName=Map 1, vertexId=vertex_1432310690008_0103_1_00, 
diagnostics=[Task failed, taskId=task_1432310690008_0103_1_00_00, 
diagnostics=[TaskAttempt 0 failed, info=[ 
Container container_e122_1432310690008_0103_01_94 received a STOP_REQUEST]], 
Vertex failed as one or more tasks failed. failedTasks:1, Vertex 
vertex_1432310690008_0103_1_00 [Map 1] killed/failed due to:null]
DAG failed due to vertex failure. failedVertices:1 killedVertices:0
FAILED: Execution Error, return code 2 from 
org.apache.hadoop.hive.ql.exec.tez.TezTask
{code}
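One way to demonstrate the mismatch is to compare what the client saw with the `finalStatus` field that the YARN Resource Manager REST API returns for an application (`GET http://<rm-host>:<port>/ws/v1/cluster/apps/<application-id>`). The sketch below is a hedged illustration: the helper name `rm_disagrees_with_client` is hypothetical, the sample payload is illustrative rather than captured from this cluster, and the host and port are left as placeholders.

```python
import json


def rm_disagrees_with_client(rm_app_json: str, client_saw_failure: bool) -> bool:
    """True when the RM reports SUCCEEDED although the client saw the DAG fail."""
    # The Cluster Application API wraps the application fields in an "app" object.
    app = json.loads(rm_app_json)["app"]
    return bool(client_saw_failure and app.get("finalStatus") == "SUCCEEDED")


if __name__ == "__main__":
    # Illustrative payload only; a real response carries many more fields.
    sample = json.dumps({
        "app": {
            "id": "application_1432310690008_0103",
            "state": "FINISHED",
            "finalStatus": "SUCCEEDED",
        }
    })
    print(rm_disagrees_with_client(sample, client_saw_failure=True))
```

In practice the JSON would be fetched from the Resource Manager web service, and `client_saw_failure` would come from the Hive CLI exit status or its console output.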





[jira] [Resolved] (TEZ-2370) Add stages information to RM UI for debugging / visibility on job progress

2015-05-25 Thread Hari Sekhon (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon resolved TEZ-2370.
--
   Resolution: Fixed
Fix Version/s: 0.6.0

OK, great, thanks. I'll look forward to upgrading to that. I also saw
Hortonworks' recent announcement of a Tez job view for Ambari, which I'm
looking forward to trying once it's GA.

 Add stages information to RM UI for debugging / visibility on job progress
 --

 Key: TEZ-2370
 URL: https://issues.apache.org/jira/browse/TEZ-2370
 Project: Apache Tez
  Issue Type: Improvement
  Components: UI
Affects Versions: 0.5.2
 Environment: HDP 2.2.0
Reporter: Hari Sekhon
Priority: Minor
 Fix For: 0.6.0


 Something that has been bugging me since last year is the difficulty of 
 debugging Tez jobs compared to MapReduce jobs.
 This is because the Resource Manager / Application Master does not display the 
 job stats and stages that we are used to seeing in MapReduce, e.g. Map and 
 Reduce task counts and progress. I appreciate that Tez is a more flexible 
 framework with a DAG, but it would be nice if it could surface the information 
 on the different stages and the number of tasks running, completed, failed, 
 killed, successful etc., similar to how Spark does. The stage breakdown would 
 be useful in understanding what the job is doing at different times, which 
 stage is getting stuck or failing, etc.
 At the moment the only options are to trawl the logs or hope to have 
 console output where some of that information is available, both of which are 
 non-ideal when debugging other people's jobs after the fact.
 Hari Sekhon
 http://www.linkedin.com/in/harisekhon





[jira] [Created] (TEZ-2457) Improve Documentation to explicitly list all valid Tez configuration variables

2015-05-18 Thread Hari Sekhon (JIRA)
Hari Sekhon created TEZ-2457:


 Summary: Improve Documentation to explicitly list all valid Tez 
configuration variables
 Key: TEZ-2457
 URL: https://issues.apache.org/jira/browse/TEZ-2457
 Project: Apache Tez
  Issue Type: Improvement
Affects Versions: 0.5.2
 Environment: HDP 2.2
Reporter: Hari Sekhon


Request to improve the Tez documentation by adding a page listing all valid Tez 
configuration variables with their defaults and descriptions, as well as which 
MapReduce variables Tez respects.





[jira] [Created] (TEZ-2370) Add stages information to RM UI for debugging / visibility on job progress

2015-04-27 Thread Hari Sekhon (JIRA)
Hari Sekhon created TEZ-2370:


 Summary: Add stages information to RM UI for debugging / 
visibility on job progress
 Key: TEZ-2370
 URL: https://issues.apache.org/jira/browse/TEZ-2370
 Project: Apache Tez
  Issue Type: Improvement
  Components: UI
Affects Versions: 0.5.2
 Environment: HDP 2.2.0
Reporter: Hari Sekhon
Priority: Minor


Something that has been bugging me since last year is the difficulty of 
debugging Tez jobs compared to MapReduce jobs.

This is because the Resource Manager / Application Master does not display the 
job stats and stages that we are used to seeing in MapReduce, e.g. Map and 
Reduce task counts and progress. I appreciate that Tez is a more flexible 
framework with a DAG, but it would be nice if it could surface the information 
on the different stages and the number of tasks running, completed, failed, 
killed, successful etc., similar to how Spark does. The stage breakdown would 
be useful in understanding what the job is doing at different times, which 
stage is getting stuck or failing, etc.

At the moment the only options are to trawl the logs or hope to have 
console output where some of that information is available, both of which are 
non-ideal when debugging other people's jobs after the fact.

Hari Sekhon
http://www.linkedin.com/in/harisekhon





[jira] [Commented] (TEZ-2322) Succeeded count wrong for Pig on Tez job, decreased 380 => 181

2015-04-21 Thread Hari Sekhon (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504694#comment-14504694
 ] 

Hari Sekhon commented on TEZ-2322:
--

Hitesh Shah, the yarn logs command failed originally, otherwise I would have 
supplied that output.

Jeff Zhang, I did note that the job succeeded in the end. This is just a jira 
to record that the counts were wrong, hence I've labelled it as minor priority 
to fix.

 Succeeded count wrong for Pig on Tez job, decreased 380 => 181
 --

 Key: TEZ-2322
 URL: https://issues.apache.org/jira/browse/TEZ-2322
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.5.2
 Environment: HDP 2.2
Reporter: Hari Sekhon
Priority: Minor
 Attachments: attempt1_syslog_dag_1427546104095_0146_1, 
 attempt2_syslog, attempt2_syslog_dag_1427546104095_0146_1, 
 attempt2_syslog_dag_1427546104095_0146_1_post


 During a Pig on Tez job the number of succeeded tasks dropped from 380 => 181 
 as shown below:
 {code}
 2015-04-15 15:09:56,992 [Timer-0] INFO  
 org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
 status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
 2015-04-15 15:10:16,992 [Timer-0] INFO  
 org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
 status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
 2015-04-15 15:10:36,992 [Timer-0] INFO  
 org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
 status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
 2015-04-15 15:10:56,992 [Timer-0] INFO  
 org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
 status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 
 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
 2015-04-15 15:11:16,992 [Timer-0] INFO  
 org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
 status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 
 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
 2015-04-15 15:11:36,992 [Timer-0] INFO  
 org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
 status=RUNNING, progress=TotalTasks: 905 Succeeded: 182 Running: 723 Failed: 
 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
 2015-04-15 15:11:56,993 [Timer-0] INFO  
 org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
 status=RUNNING, progress=TotalTasks: 905 Succeeded: 184 Running: 721 Failed: 
 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
 2015-04-15 15:12:16,992 [Timer-0] INFO  
 org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
 status=RUNNING, progress=TotalTasks: 905 Succeeded: 186 Running: 719 Failed: 
 0 
 {code}
 Now this may be because the tasks failed (some certainly did due to space 
 exceptions, having checked the logs), but surely once a task has finished 
 successfully and is marked as succeeded it cannot later be removed from 
 the succeeded count? Perhaps the succeeded counter is incremented too early, 
 before the task results are really saved?
 KilledTaskAttempts jumped from 16 => 89 at the same time, but even this 
 doesn't account for the large drop in the number of succeeded tasks.
 There was also a noticeable jump in Running tasks from 58 => 724 at the same 
 time, which is suspicious. I'm pretty sure there was no contending job to 
 finish and release so much more resource to this Tez job, so it's also 
 unclear how the running count could have jumped up so significantly given the 
 cluster hardware resources have been the same throughout.
 Hari Sekhon
 http://www.linkedin.com/in/harisekhon



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TEZ-2322) Succeeded count wrong for Pig on Tez job, decreased 380 => 181

2015-04-15 Thread Hari Sekhon (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated TEZ-2322:
-
Description: 
During a Pig on Tez job the number of succeeded tasks dropped from 380 => 181 
as shown below:
{code}
2015-04-15 15:09:56,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:16,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:36,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:56,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:16,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:36,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 182 Running: 723 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:56,993 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 184 Running: 721 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:12:16,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 186 Running: 719 Failed: 0 
{code}
Now this may be because the tasks failed (some certainly did due to space 
exceptions, having checked the logs), but surely once a task has finished 
successfully and is marked as succeeded it cannot later be removed from 
the succeeded count? Perhaps the succeeded counter is incremented too early, 
before the task results are really saved?

KilledTaskAttempts jumped from 16 => 89 at the same time, but even this doesn't 
account for the large drop in the number of succeeded tasks.

There was also a noticeable jump in Running tasks from 58 => 724 at the same 
time, which is suspicious. I'm pretty sure there was no contending job to finish 
and release so much more resource to this Tez job.
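A drop like this can be detected automatically from the client log. The sketch below is an illustration only, assuming the `Succeeded: N` field appears in each `DAG Status` line exactly as in the log excerpt in this description; the function name `succeeded_regressions` is hypothetical.

```python
import re

# Matches the "Succeeded: <n>" field of a Pig-on-Tez "DAG Status" log line.
# \s* tolerates the line wrapping seen in the pasted log.
SUCCEEDED_RE = re.compile(r"Succeeded:\s*(\d+)")


def succeeded_regressions(log_text):
    """Return (previous, current) pairs where the Succeeded count went down."""
    counts = [int(m.group(1)) for m in SUCCEEDED_RE.finditer(log_text)]
    # Compare each status report with the one that follows it.
    return [(prev, cur) for prev, cur in zip(counts, counts[1:]) if cur < prev]
```

Run against the log above, this should flag the single transition from 380 to 181 and nothing else, since the later counts (181, 182, 184, 186) only increase.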


[jira] [Updated] (TEZ-2322) Succeeded count wrong for Pig on Tez job, decreased 380 => 181

2015-04-15 Thread Hari Sekhon (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated TEZ-2322:
-
Description: 
During a Pig on Tez job the number of succeeded tasks dropped from 380 => 181 
as shown below:
{code}
2015-04-15 15:09:56,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:16,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:36,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:56,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:16,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:36,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 182 Running: 723 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:56,993 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 184 Running: 721 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:12:16,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 186 Running: 719 Failed: 0 
{code}
Now this may be because the tasks failed (some certainly did due to space 
exceptions, having checked the logs), but surely once a task has finished 
successfully and is marked as succeeded it cannot later be removed from 
the succeeded count? Perhaps the succeeded counter is incremented too early, 
before the task results are really saved?

KilledTaskAttempts jumped from 16 => 89 at the same time, but even this doesn't 
account for the large drop in the number of succeeded tasks.

There was also a noticeable jump in Running tasks from 58 => 724 at the same 
time, which is suspicious. I'm pretty sure there was no contending job to finish 
and release so much more resource to this Tez job, so it's also unclear how the 
running count could have jumped up so significantly.


[jira] [Updated] (TEZ-2322) Succeeded count wrong for Pig on Tez job, decreased 380 => 181

2015-04-15 Thread Hari Sekhon (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated TEZ-2322:
-
Description: 
During a Pig on Tez job the number of succeeded tasks dropped from 380 => 181 
as shown below:
{code}
2015-04-15 15:09:56,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:16,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:36,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:56,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:16,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:36,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 182 Running: 723 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:56,993 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 184 Running: 721 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:12:16,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 186 Running: 719 Failed: 0 
{code}
Now this may be because the tasks failed (some certainly did due to space 
exceptions, having checked the logs), but surely once a task has finished 
successfully and is marked as succeeded it cannot later be removed from 
the succeeded count? Perhaps the succeeded counter is incremented too early, 
before the task results are really saved?

KilledTaskAttempts jumped from 16 => 89 at the same time, but even this doesn't 
account for the large drop in the number of succeeded tasks.

There was also a noticeable jump in Running tasks from 58 => 724 at the same 
time, which is suspicious. I'm pretty sure there was no contending job to finish 
and release so much more resource to this Tez job, so it's also unclear how the 
running count could have jumped up so significantly given the cluster hardware 
resources have been the same throughout.
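To make the size of these jumps concrete, the deltas can be worked out from the two status lines logged at 15:10:36 and 15:10:56. The snippet below is plain arithmetic on those logged values; the variable names are illustrative, and "not yet started" is an assumed label for tasks that are neither succeeded nor running (Failed and Killed are both 0 in the log).

```python
TOTAL_TASKS = 905

# Values copied from the DAG Status lines at 15:10:36 and 15:10:56.
before = {"succeeded": 380, "running": 58, "killed_attempts": 16}
after = {"succeeded": 181, "running": 724, "killed_attempts": 89}

succeeded_drop = before["succeeded"] - after["succeeded"]
killed_attempts_rise = after["killed_attempts"] - before["killed_attempts"]
running_rise = after["running"] - before["running"]

# Tasks that were neither succeeded nor running (assumed: not yet started).
not_started_before = TOTAL_TASKS - before["succeeded"] - before["running"]
not_started_after = TOTAL_TASKS - after["succeeded"] - after["running"]

print(succeeded_drop, killed_attempts_rise, running_rise)  # 199 73 666
print(not_started_before, not_started_after)               # 467 0
```

The succeeded count fell by 199 while KilledTaskAttempts rose by only 73, so killed attempts alone do not cover the drop. The rise of 666 in Running equals the 199 tasks that left the succeeded count plus the 467 tasks that were previously neither succeeded nor running. This only shows that the counters balance; it does not establish why the tasks changed state.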


[jira] [Updated] (TEZ-2322) Succeeded count wrong for Pig on Tez job, decreased 380 => 181

2015-04-15 Thread Hari Sekhon (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated TEZ-2322:
-
Description: 
During a Pig on Tez job the number of succeeded tasks dropped from 380 => 181 
as shown below:
{code}
2015-04-15 15:09:56,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:16,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:36,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:56,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:16,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:36,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 182 Running: 723 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:56,993 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 184 Running: 721 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:12:16,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 186 Running: 719 Failed: 0 
{code}
Now this may be because the tasks failed (some certainly did due to space 
exceptions, having checked the logs), but surely once a task has finished 
successfully and is marked as succeeded it cannot later be removed from 
the succeeded count? Perhaps the succeeded counter is incremented too early, 
before the task results are really saved?

KilledTaskAttempts jumped from 16 => 89 at the same time, but even this doesn't 
account for the large drop in the number of succeeded tasks.

There was also a noticeable jump in Running tasks from 58 => 724 at the same 
time, which is suspicious. I'm pretty sure there was no contending job to finish 
and release so much more resource to this Tez job, so it's also unclear how the 
running count could have jumped up so significantly given the cluster hardware 
resources have been the same throughout.

Hari Sekhon
http://www.linkedin.com/in/harisekhon


[jira] [Created] (TEZ-2322) Succeeded count wrong for Pig on Tez job, decreased 380 => 181

2015-04-15 Thread Hari Sekhon (JIRA)
Hari Sekhon created TEZ-2322:


 Summary: Succeeded count wrong for Pig on Tez job, decreased 380 
=> 181
 Key: TEZ-2322
 URL: https://issues.apache.org/jira/browse/TEZ-2322
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.5.2
 Environment: HDP 2.2
Reporter: Hari Sekhon
Priority: Minor


During a Pig on Tez job the number of succeeded tasks dropped from 380 => 181 
as shown below:
{code}
2015-04-15 15:09:56,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:16,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:36,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
2015-04-15 15:10:56,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:16,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:36,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 182 Running: 723 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:11:56,993 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 184 Running: 721 Failed: 0 
Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
2015-04-15 15:12:16,992 [Timer-0] INFO  
org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
status=RUNNING, progress=TotalTasks: 905 Succeeded: 186 Running: 719 Failed: 0 
{code}
Now this may be because tasks failed (some certainly did, due to space 
exceptions), but surely once a task has finished successfully and is marked as 
succeeded it cannot then be removed from the succeeded count? Perhaps the 
succeeded counter is incremented too early, before the task results are really 
saved?

KilledTaskAttempts jumped from 16 => 89 at the same time, but even this doesn't 
account for the large drop in the number of succeeded tasks.

There was also a noticeable jump in Running tasks from 58 => 724 at the same 
time, which is suspicious: I'm pretty sure there was no contending job to finish 
and release so much more resource to this Tez job.
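
The comparison above can be checked mechanically. The following is a minimal 
Python sketch, assuming each wrapped status entry has first been rejoined into 
a single line; the regular expression and the helper name find_succeeded_drops 
are hypothetical and not part of Tez or Pig. It flags any sample where the 
Succeeded counter goes down and prints how much KilledTaskAttempts rose over 
the same interval.

```python
import re

# Pulls the counters of interest out of one rejoined "DAG Status" line.
# The field names follow the log excerpt in this report.
STATUS_RE = re.compile(
    r"Succeeded: (?P<succeeded>\d+) Running: (?P<running>\d+)"
    r".*?KilledTaskAttempts: (?P<killed_attempts>\d+)"
)


def find_succeeded_drops(samples):
    """Return (previous, current) counter pairs where Succeeded went down."""
    drops = []
    previous = None
    for line in samples:
        match = STATUS_RE.search(line)
        if match is None:
            continue  # skip lines that are not complete status entries
        current = {name: int(value) for name, value in match.groupdict().items()}
        if previous is not None and current["succeeded"] < previous["succeeded"]:
            drops.append((previous, current))
        previous = current
    return drops


# The 15:10:36 and 15:10:56 samples from the report, with mail wrapping removed.
samples = [
    "status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 "
    "Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=",
    "status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 0 "
    "Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=",
]

drops = find_succeeded_drops(samples)
for before, after in drops:
    lost = before["succeeded"] - after["succeeded"]
    newly_killed = after["killed_attempts"] - before["killed_attempts"]
    print(f"Succeeded fell by {lost}; KilledTaskAttempts rose by {newly_killed}")
```

On these two samples the Succeeded count falls by 199 while KilledTaskAttempts 
rises by only 73, which is the mismatch described above. The same two samples 
also show that after the drop Succeeded plus Running (181 + 724) equals 
TotalTasks (905), whereas before it the sum (380 + 58) was 438.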



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (TEZ-2322) Succeeded count wrong for Pig on Tez job, decreased 380 => 181

2015-04-15 Thread Hari Sekhon (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496553#comment-14496553
 ] 

Hari Sekhon edited comment on TEZ-2322 at 4/15/15 5:25 PM:
---

Iirc Ambari still doesn't support Job History server so that command fails, but 
I've copied the logs out via RM and attached to this ticket for you.


was (Author: harisekhon):
Iirc Ambari still doesn't support Job History server so that command fails, but 
I've copied the logs out via RM.

 Succeeded count wrong for Pig on Tez job, decreased 380 => 181
 --

 Key: TEZ-2322
 URL: https://issues.apache.org/jira/browse/TEZ-2322
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.5.2
 Environment: HDP 2.2
Reporter: Hari Sekhon
Priority: Minor
 Attachments: attempt1_syslog_dag_1427546104095_0146_1, 
 attempt2_syslog, attempt2_syslog_dag_1427546104095_0146_1, 
 attempt2_syslog_dag_1427546104095_0146_1_post


 During a Pig on Tez job the number of succeeded tasks dropped from 380 => 181 
 as shown below:
 {code}
 2015-04-15 15:09:56,992 [Timer-0] INFO  
 org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
 status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
 2015-04-15 15:10:16,992 [Timer-0] INFO  
 org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
 status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
 2015-04-15 15:10:36,992 [Timer-0] INFO  
 org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
 status=RUNNING, progress=TotalTasks: 905 Succeeded: 380 Running: 58 Failed: 0 
 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 16, diagnostics=
 2015-04-15 15:10:56,992 [Timer-0] INFO  
 org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
 status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 
 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
 2015-04-15 15:11:16,992 [Timer-0] INFO  
 org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
 status=RUNNING, progress=TotalTasks: 905 Succeeded: 181 Running: 724 Failed: 
 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
 2015-04-15 15:11:36,992 [Timer-0] INFO  
 org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
 status=RUNNING, progress=TotalTasks: 905 Succeeded: 182 Running: 723 Failed: 
 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
 2015-04-15 15:11:56,993 [Timer-0] INFO  
 org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
 status=RUNNING, progress=TotalTasks: 905 Succeeded: 184 Running: 721 Failed: 
 0 Killed: 0 FailedTaskAttempts: 10 KilledTaskAttempts: 89, diagnostics=
 2015-04-15 15:12:16,992 [Timer-0] INFO  
 org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: 
 status=RUNNING, progress=TotalTasks: 905 Succeeded: 186 Running: 719 Failed: 
 0 
 {code}
 Now this may be because the tasks failed, some certainly did due to space 
 exceptions having checked the logs, but surely once a task has finished 
 successfully and is marked as succeeded it cannot then later be removed from 
 the succeeded count? Perhaps the succeeded counter is incremented too early 
 before the task results are really saved?
 KilledTaskAttempts jumped from 16 => 89 at the same time, but even this 
 doesn't account for the large drop in the number of succeeded tasks.
 There was also a noticeable jump in Running tasks from 58 => 724 at the same 
 time, which is suspicious: I'm pretty sure there was no contending job to 
 finish and release so much more resource to this Tez job, so it's also 
 unclear how the running count could have jumped up so significantly given the 
 cluster hardware resources have been the same throughout.
 Hari Sekhon
 http://www.linkedin.com/in/harisekhon





[jira] [Updated] (TEZ-2322) Succeeded count wrong for Pig on Tez job, decreased 380 => 181

2015-04-15 Thread Hari Sekhon (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated TEZ-2322:
-
Attachment: attempt2_syslog_dag_1427546104095_0146_1_post
attempt2_syslog_dag_1427546104095_0146_1
attempt2_syslog
attempt1_syslog_dag_1427546104095_0146_1

Iirc Ambari still doesn't support Job History server so that command fails, but 
I've copied the logs out via RM.






[jira] [Commented] (TEZ-2322) Succeeded count wrong for Pig on Tez job, decreased 380 => 181

2015-04-15 Thread Hari Sekhon (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496555#comment-14496555
 ] 

Hari Sekhon commented on TEZ-2322:
--

There was a point at which space ran out and Kerberos also broke as a result, 
but I fixed it and the job continued and eventually succeeded.



