[jira] [Commented] (TEZ-3479) DAG AM does not schedule any more containers in corner cases

2016-10-18 Thread Rajesh Balamohan (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587501#comment-15587501
 ] 

Rajesh Balamohan commented on TEZ-3479:
---

That is correct. Haven't observed this in other cases.

> DAG AM does not schedule any more containers in corner cases
> 
>
> Key: TEZ-3479
> URL: https://issues.apache.org/jira/browse/TEZ-3479
> Project: Apache Tez
>  Issue Type: Improvement
>Affects Versions: 0.7.1
>Reporter: Rajesh Balamohan
> Attachments: application_1476667862449_0031_not_complete.1.log.tar.gz
>
>
> Env: 3-node AWS cluster with data residing in S3. Tez version is 0.7.
> Some workloads generate so much data that tasks start throwing "No space 
> available" errors on local disks (e.g. Q29 in TPC-DS). The DAG should fail 
> after enough retries, which happens most of the time. Once in a while (~once 
> in 20-30 runs), the DAG AM gets into a hung state and does not schedule any 
> more containers for the failed task attempts. Will attach the logs shortly. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-3479) DAG AM does not schedule any more containers in corner cases

2016-10-18 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587378#comment-15587378
 ] 

Hitesh Shah commented on TEZ-3479:
--

bq. I haven't disabled recovery in my runs.

To clarify, my question was whether this reproduces only in the cases where the 
AM crashes and restarts? 

> DAG AM does not schedule any more containers in corner cases
> 
>
> Key: TEZ-3479
> URL: https://issues.apache.org/jira/browse/TEZ-3479
> Project: Apache Tez
>  Issue Type: Improvement
>Affects Versions: 0.7.1
>Reporter: Rajesh Balamohan
> Attachments: application_1476667862449_0031_not_complete.1.log.tar.gz
>
>
> Env: 3-node AWS cluster with data residing in S3. Tez version is 0.7.
> Some workloads generate so much data that tasks start throwing "No space 
> available" errors on local disks (e.g. Q29 in TPC-DS). The DAG should fail 
> after enough retries, which happens most of the time. Once in a while (~once 
> in 20-30 runs), the DAG AM gets into a hung state and does not schedule any 
> more containers for the failed task attempts. Will attach the logs shortly. 





[jira] [Commented] (TEZ-3465) Support broadcast edge into cartesian product vertex and forbid other edges

2016-10-18 Thread TezQA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587305#comment-15587305
 ] 

TezQA commented on TEZ-3465:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12834068/TEZ-3465.3.patch
  against master revision 67243a0.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 5 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 3.0.1) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: 
https://builds.apache.org/job/PreCommit-TEZ-Build/2045//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/2045//console

This message is automatically generated.

> Support broadcast edge into cartesian product vertex and forbid other edges
> ---
>
> Key: TEZ-3465
> URL: https://issues.apache.org/jira/browse/TEZ-3465
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Zhiyuan Yang
>Assignee: Zhiyuan Yang
> Attachments: TEZ-3465.1.patch, TEZ-3465.2.patch, TEZ-3465.3.patch
>
>
> The cartesian product vertex manager should support other incoming edge 
> types. Currently only the broadcast edge is necessary, although potentially 
> more edge types could be as well. A custom edge needs its own vertex 
> manager, which can't work with the cartesian product VM, so it has to be 
> forbidden.
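The validation described above can be sketched as a small check. This is an illustrative stand-in, not Tez's actual `EdgeProperty`/`VertexManagerPlugin` API: the enum values and method names below are hypothetical, and the cartesian product edges are represented by a `CARTESIAN_PRODUCT` placeholder for the custom edge the vertex manager owns.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: a cartesian product vertex accepts its own product
// edges plus broadcast side inputs, and rejects every other edge type.
public class CartesianEdgeCheckSketch {
  enum EdgeType { CARTESIAN_PRODUCT, BROADCAST, SCATTER_GATHER, ONE_TO_ONE }

  static void validateIncomingEdges(List<EdgeType> edges) {
    for (EdgeType e : edges) {
      if (e != EdgeType.CARTESIAN_PRODUCT && e != EdgeType.BROADCAST) {
        throw new IllegalArgumentException(
            "cartesian product vertex cannot accept a " + e + " edge");
      }
    }
  }

  public static void main(String[] args) {
    // Two product sources plus one broadcast side input: accepted.
    validateIncomingEdges(Arrays.asList(
        EdgeType.CARTESIAN_PRODUCT, EdgeType.CARTESIAN_PRODUCT,
        EdgeType.BROADCAST));
    System.out.println("broadcast + product edges accepted");
  }
}
```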





Success: TEZ-3465 PreCommit Build #2045

2016-10-18 Thread Apache Jenkins Server
Jira: https://issues.apache.org/jira/browse/TEZ-3465
Build: https://builds.apache.org/job/PreCommit-TEZ-Build/2045/

###
## LAST 60 LINES OF THE CONSOLE 
###
[...truncated 4819 lines...]
[INFO] Tez  SUCCESS [  0.026 s]
[INFO] 
[INFO] BUILD SUCCESS
[INFO] 
[INFO] Total time: 57:00 min
[INFO] Finished at: 2016-10-19T01:26:37+00:00
[INFO] Final Memory: 80M/1495M
[INFO] 






==
==
Adding comment to Jira.
==
==


Comment added.
2a03768c8bb4cc7e131e0a076013b2152eef909c logged out


==
==
Finished build.
==
==


Archiving artifacts
[description-setter] Description set: TEZ-3465
Recording test results
Email was triggered for: Success
Sending email for trigger: Success



###
## FAILED TESTS (if any) 
##
All tests passed

[jira] [Commented] (TEZ-3405) Support ability for AM to kill itself if there is no client heartbeating to it

2016-10-18 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587218#comment-15587218
 ] 

Siddharth Seth commented on TEZ-3405:
-

+1. 

> Support ability for AM to kill itself if there is no client heartbeating to it
> --
>
> Key: TEZ-3405
> URL: https://issues.apache.org/jira/browse/TEZ-3405
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Gunther Hagleitner
>Assignee: Hitesh Shah
>Priority: Critical
> Attachments: TEZ-3405.1.patch, TEZ-3405.2.patch, TEZ-3405.3.patch, 
> TEZ-3405.4.patch, TEZ-3405.5.patch
>
>
> HiveServer2 optionally maintains a pool of AMs in either Tez or LLAP mode. 
> This is done to amortize the cost of launching a Tez session.
> We also try in a shutdown hook to kill all these AMs when HS2 goes down. 
> However, there are cases where HS2 doesn't get the chance to kill these AMs 
> before it goes away. As a result these zombie AMs hang around until the 
> timeout kicks in.
> The trouble with the timeout is that we have to set it fairly high; 
> otherwise the benefit of having pre-launched AMs obviously goes away (in a 
> lightly loaded cluster).
> So, if people kill/restart HS2, they often run into situations where the 
> cluster/queue doesn't have any more capacity for AMs. They either have to 
> manually kill the zombies or wait.
> The request is therefore for Tez to maintain a heartbeat to the client. If 
> the client goes away the AM should exit. That way we can keep the AMs alive 
> for a long time regardless of activity and at the same time don't have to 
> worry about them if HS2 goes down.
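The liveness check the ticket requests can be sketched in a few lines. Everything here is a hypothetical illustration, not a real Tez API: `ClientHeartbeatMonitor`, `onClientHeartbeat`, and `clientLost` are invented names for the mechanism the AM would poll before shutting itself down.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of client-liveness tracking in the AM. The client
// (HS2 in this scenario) pings onClientHeartbeat(); once it has been silent
// longer than the timeout, clientLost() turns true and the AM should exit.
public class ClientHeartbeatMonitor {
  private final AtomicLong lastHeartbeatMillis =
      new AtomicLong(System.currentTimeMillis());
  private final long timeoutMillis;

  public ClientHeartbeatMonitor(long timeoutMillis) {
    this.timeoutMillis = timeoutMillis;
  }

  // Invoked on every heartbeat received from the client.
  public void onClientHeartbeat() {
    lastHeartbeatMillis.set(System.currentTimeMillis());
  }

  // True once the client has been silent longer than the timeout.
  public boolean clientLost() {
    return System.currentTimeMillis() - lastHeartbeatMillis.get() > timeoutMillis;
  }

  public static void main(String[] args) throws InterruptedException {
    ClientHeartbeatMonitor monitor = new ClientHeartbeatMonitor(50);
    System.out.println(monitor.clientLost()); // false: client just connected
    Thread.sleep(120);                        // client goes silent
    System.out.println(monitor.clientLost()); // true: AM should exit now
  }
}
```

With this shape, the AM-side timeout can stay short, because a live HS2 keeps the heartbeat fresh regardless of query activity.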





[jira] [Commented] (TEZ-3477) MRInputHelpers generateInputSplitsToMem public API modified

2016-10-18 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587215#comment-15587215
 ] 

Hitesh Shah commented on TEZ-3477:
--

[~jeagles] The change seems straightforward. Do we want to change APIs to 
limited private (hive/pig) so future commits to this are looked at a bit more 
carefully for compatibility? 
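Hadoop's `org.apache.hadoop.classification.InterfaceAudience.LimitedPrivate` annotation is the usual way to mark such APIs. The sketch below uses a self-contained stand-in for that annotation (so it compiles without hadoop-common), and the simplified `generateInputSplitsToMem` signature is illustrative, not the real method:

```java
import java.lang.annotation.Documented;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Self-contained stand-in for Hadoop's InterfaceAudience.LimitedPrivate;
// the real annotation lives in org.apache.hadoop.classification.
@Documented
@Retention(RetentionPolicy.RUNTIME)
@Target({ElementType.TYPE, ElementType.METHOD})
@interface LimitedPrivate {
  String[] value(); // components allowed to depend on the API
}

public class MRInputHelpersSketch {
  // Marking the method documents that only Hive and Pig are expected to
  // call it, so reviewers treat any signature change as incompatible.
  // The String-based signature here is a hypothetical simplification.
  @LimitedPrivate({"Hive", "Pig"})
  public static String generateInputSplitsToMem(String conf) {
    return "InputSplitInfoMem(" + conf + ")";
  }

  public static void main(String[] args) throws Exception {
    LimitedPrivate ann = MRInputHelpersSketch.class
        .getMethod("generateInputSplitsToMem", String.class)
        .getAnnotation(LimitedPrivate.class);
    System.out.println(String.join(",", ann.value())); // Hive,Pig
  }
}
```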

> MRInputHelpers generateInputSplitsToMem public API modified
> ---
>
> Key: TEZ-3477
> URL: https://issues.apache.org/jira/browse/TEZ-3477
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jonathan Eagles
>Assignee: Jonathan Eagles
> Attachments: TEZ-3477.1.patch
>
>
> Pig and Hive directly rely on specific APIs in MRInputHelpers. I would like 
> to ensure these signatures are prevented from being modified.
> - MRInputHelpers.generateInputSplitsToMem
> - MRInputHelpers.parseMRInputPayload
> - MRInputHelpers.createSplitProto
> - MRInputHelpers.createOldFormatSplitFromUserPayload
> - MRInputHelpers.configureMRInputWithLegacySplitGeneration
> A recently fixed jira, TEZ-3430, modified generateInputSplitsToMem:
> {code}
> java.lang.NoSuchMethodError: 
> org.apache.tez.mapreduce.hadoop.MRInputHelpers.generateInputSplitsToMem(Lorg/apache/hadoop/conf/Configuration;ZI)Lorg/apache/tez/mapreduce/hadoop/InputSplitInfoMem;
> {code}





[jira] [Updated] (TEZ-3458) Auto grouping for cartesian product edge(unpartitioned case)

2016-10-18 Thread Zhiyuan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhiyuan Yang updated TEZ-3458:
--
Attachment: TEZ-3458.3.patch

> Auto grouping for cartesian product edge(unpartitioned case)
> 
>
> Key: TEZ-3458
> URL: https://issues.apache.org/jira/browse/TEZ-3458
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Zhiyuan Yang
>Assignee: Zhiyuan Yang
> Attachments: TEZ-3458.1.patch, TEZ-3458.2.patch, TEZ-3458.3.patch
>
>
> The original CartesianProductVertexManagerUnpartitioned sets parallelism to 
> the product of all source vertices' parallelism, which may explode to an 
> unmanageable number. We should auto-reduce as in ShuffleVertexManager to 
> avoid this.
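The explosion and the proposed remedy can be sketched with simple arithmetic. This is illustrative only: the doubling heuristic below is a hypothetical grouping strategy, not ShuffleVertexManager's actual algorithm, and the names are invented.

```java
// Hypothetical sketch: naive parallelism of an unpartitioned cartesian
// product is the product of the source parallelisms; grouping source tasks
// shrinks that product until it fits under a cap.
public class CartesianGroupingSketch {
  static long naiveParallelism(int[] sourceParallelism) {
    long product = 1;
    for (int p : sourceParallelism) product *= p;
    return product;
  }

  // Double the per-source group size until the product fits the cap.
  static long groupedParallelism(int[] sourceParallelism, long maxParallelism) {
    int groupSize = 1;
    long product = naiveParallelism(sourceParallelism);
    while (product > maxParallelism) {
      groupSize *= 2;
      product = 1;
      for (int p : sourceParallelism) {
        product *= Math.max(1, p / groupSize); // grouped task count per source
      }
    }
    return product;
  }

  public static void main(String[] args) {
    int[] sources = {1000, 1000};
    System.out.println(naiveParallelism(sources));          // 1000000: explodes
    System.out.println(groupedParallelism(sources, 10000)); // capped under 10000
  }
}
```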





[jira] [Updated] (TEZ-3465) Support broadcast edge into cartesian product vertex and forbid other edges

2016-10-18 Thread Zhiyuan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhiyuan Yang updated TEZ-3465:
--
Attachment: TEZ-3465.3.patch

> Support broadcast edge into cartesian product vertex and forbid other edges
> ---
>
> Key: TEZ-3465
> URL: https://issues.apache.org/jira/browse/TEZ-3465
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Zhiyuan Yang
>Assignee: Zhiyuan Yang
> Attachments: TEZ-3465.1.patch, TEZ-3465.2.patch, TEZ-3465.3.patch
>
>
> The cartesian product vertex manager should support other incoming edge 
> types. Currently only the broadcast edge is necessary, although potentially 
> more edge types could be as well. A custom edge needs its own vertex 
> manager, which can't work with the cartesian product VM, so it has to be 
> forbidden.





[jira] [Comment Edited] (TEZ-3479) DAG AM does not schedule any more containers in corner cases

2016-10-18 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587023#comment-15587023
 ] 

Hitesh Shah edited comment on TEZ-3479 at 10/18/16 11:31 PM:
-

At least for this scenario, I think we did not recover 
task_1476667862449_0031_1_07_04 properly to a failed state, which ends up 
leading to a hang because the vertex cannot complete.

{code}
2016-10-18 07:06:24,837 [INFO] [Dispatcher thread {Central}] |impl.VertexImpl|: 
Task Completion: vertex_1476667862449_0031_1_07 [Map 3], tasks=29, failed=1, 
killed=24, success=3, completed=28, commits=0, err=OWN_TASK_FAILURE 
{code}

The task failure tracked is for task_1476667862449_0031_1_07_00 and not for 
0004.


was (Author: hitesh):
At least for this scenario, I think we did not recover 
task_1476667862449_0031_1_07_04 properly to a failed state, which ends up 
leading to a hang because the vertex cannot complete.

{code}
2016-10-18 07:06:24,837 [INFO] [Dispatcher thread {Central}] |impl.VertexImpl|: 
Task Completion: vertex_1476667862449_0031_1_07 [Map 3], tasks=29, failed=1, 
killed=24, success=3, completed=28, commits=0, err=OWN_TASK_FAILURE 
{code}


> DAG AM does not schedule any more containers in corner cases
> 
>
> Key: TEZ-3479
> URL: https://issues.apache.org/jira/browse/TEZ-3479
> Project: Apache Tez
>  Issue Type: Improvement
>Affects Versions: 0.7.1
>Reporter: Rajesh Balamohan
> Attachments: application_1476667862449_0031_not_complete.1.log.tar.gz
>
>
> Env: 3-node AWS cluster with data residing in S3. Tez version is 0.7.
> Some workloads generate so much data that tasks start throwing "No space 
> available" errors on local disks (e.g. Q29 in TPC-DS). The DAG should fail 
> after enough retries, which happens most of the time. Once in a while (~once 
> in 20-30 runs), the DAG AM gets into a hung state and does not schedule any 
> more containers for the failed task attempts. Will attach the logs shortly. 





[jira] [Commented] (TEZ-3479) DAG AM does not schedule any more containers in corner cases

2016-10-18 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587023#comment-15587023
 ] 

Hitesh Shah commented on TEZ-3479:
--

At least for this scenario, I think we did not recover 
task_1476667862449_0031_1_07_04 properly to a failed state, which ends up 
leading to a hang because the vertex cannot complete.

{code}
2016-10-18 07:06:24,837 [INFO] [Dispatcher thread {Central}] |impl.VertexImpl|: 
Task Completion: vertex_1476667862449_0031_1_07 [Map 3], tasks=29, failed=1, 
killed=24, success=3, completed=28, commits=0, err=OWN_TASK_FAILURE 
{code}
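To make the hang concrete: a vertex can only finish when every task reaches a terminal state, and the log line above accounts for only 28 of 29 tasks. The arithmetic can be sketched as follows; this is an illustration of the completion invariant, not actual VertexImpl code.

```java
// Illustrative sketch, not Tez VertexImpl code: a vertex completes only when
// succeeded + failed + killed covers every task. With the counts from the
// log (tasks=29, success=3, failed=1, killed=24), one task is unaccounted
// for, so the check never passes and the DAG waits forever.
public class VertexCompletionSketch {
  static boolean allTasksTerminal(int tasks, int succeeded, int failed, int killed) {
    return succeeded + failed + killed == tasks;
  }

  public static void main(String[] args) {
    System.out.println(allTasksTerminal(29, 3, 1, 24)); // false: 28 of 29, hangs
    // Had the missing attempt been recovered to FAILED, failed would be 2:
    System.out.println(allTasksTerminal(29, 3, 2, 24)); // true: vertex can finish
  }
}
```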


> DAG AM does not schedule any more containers in corner cases
> 
>
> Key: TEZ-3479
> URL: https://issues.apache.org/jira/browse/TEZ-3479
> Project: Apache Tez
>  Issue Type: Improvement
>Affects Versions: 0.7.1
>Reporter: Rajesh Balamohan
> Attachments: application_1476667862449_0031_not_complete.1.log.tar.gz
>
>
> Env: 3-node AWS cluster with data residing in S3. Tez version is 0.7.
> Some workloads generate so much data that tasks start throwing "No space 
> available" errors on local disks (e.g. Q29 in TPC-DS). The DAG should fail 
> after enough retries, which happens most of the time. Once in a while (~once 
> in 20-30 runs), the DAG AM gets into a hung state and does not schedule any 
> more containers for the failed task attempts. Will attach the logs shortly. 





[jira] [Commented] (TEZ-3479) DAG AM does not schedule any more containers in corner cases

2016-10-18 Thread Rajesh Balamohan (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586998#comment-15586998
 ] 

Rajesh Balamohan commented on TEZ-3479:
---

[~hitesh] - I haven't disabled recovery in my runs. Will check that.

> DAG AM does not schedule any more containers in corner cases
> 
>
> Key: TEZ-3479
> URL: https://issues.apache.org/jira/browse/TEZ-3479
> Project: Apache Tez
>  Issue Type: Improvement
>Affects Versions: 0.7.1
>Reporter: Rajesh Balamohan
> Attachments: application_1476667862449_0031_not_complete.1.log.tar.gz
>
>
> Env: 3-node AWS cluster with data residing in S3. Tez version is 0.7.
> Some workloads generate so much data that tasks start throwing "No space 
> available" errors on local disks (e.g. Q29 in TPC-DS). The DAG should fail 
> after enough retries, which happens most of the time. Once in a while (~once 
> in 20-30 runs), the DAG AM gets into a hung state and does not schedule any 
> more containers for the failed task attempts. Will attach the logs shortly. 





[jira] [Commented] (TEZ-3478) Cleanup fetcher data for failing task attempts (Unordered fetcher)

2016-10-18 Thread Rajesh Balamohan (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586993#comment-15586993
 ] 

Rajesh Balamohan commented on TEZ-3478:
---

Haven't checked the ordered case yet, but the issue should be present there as 
well. Created this ticket to handle cleanup of unordered data; will create a 
subsequent jira for the ordered case.

> Cleanup fetcher data for failing task attempts (Unordered fetcher)
> --
>
> Key: TEZ-3478
> URL: https://issues.apache.org/jira/browse/TEZ-3478
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Rajesh Balamohan
>Assignee: Rajesh Balamohan
>Priority: Minor
>
> Env: 3-node AWS cluster with the entire dataset in S3. Since data is in S3, 
> it does not have additional storage for HDFS (it uses the existing space 
> available in the VMs). Tez version is 0.7.
> With some workloads (e.g. q29 in TPC-DS), unordered fetchers download data 
> in parallel for different vertices and run out of disk space. However, 
> downloaded data related to these failed task attempts is not cleared, so 
> subsequent task attempts encounter a similar situation and fail with a "No 
> space" exception. e.g. stack trace:
> {noformat}
> , errorMessage=Fetch failed:org.apache.hadoop.fs.FSError: 
> java.io.IOException: No space left on device
> at 
> org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:261)
> at 
> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
> at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
> at 
> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58)
> at java.io.DataOutputStream.write(DataOutputStream.java:107)
> at 
> org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.writeChunk(ChecksumFileSystem.java:426)
> at 
> org.apache.hadoop.fs.FSOutputSummer.writeChecksumChunks(FSOutputSummer.java:206)
> at org.apache.hadoop.fs.FSOutputSummer.write1(FSOutputSummer.java:124)
> at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:110)
> at 
> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58)
> at java.io.DataOutputStream.write(DataOutputStream.java:107)
> at 
> org.apache.tez.runtime.library.common.shuffle.ShuffleUtils.shuffleToDisk(ShuffleUtils.java:146)
> at 
> org.apache.tez.runtime.library.common.shuffle.Fetcher.fetchInputs(Fetcher.java:771)
> at 
> org.apache.tez.runtime.library.common.shuffle.Fetcher.doHttpFetch(Fetcher.java:497)
> at 
> org.apache.tez.runtime.library.common.shuffle.Fetcher.doHttpFetch(Fetcher.java:396)
> at 
> org.apache.tez.runtime.library.common.shuffle.Fetcher.callInternal(Fetcher.java:195)
> at 
> org.apache.tez.runtime.library.common.shuffle.Fetcher.callInternal(Fetcher.java:70)
> at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: No space left on device
> at java.io.FileOutputStream.writeBytes(Native Method)
> at java.io.FileOutputStream.write(FileOutputStream.java:345)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSy
> {noformat}
> This would also affect any other job running in the cluster at the same 
> time. It would be helpful to clean up the data downloaded for the failed 
> task attempts.
> Creating this ticket mainly for the unordered fetcher case, though it could 
> be a similar case for the ordered shuffle as well.
> e.g files
> {noformat}
> 17M   
> /hadoopfs/fs1/yarn/nodemanager/usercache/cloudbreak/appcache/application_1476667862449_0043/attempt_1476667862449_0043_1_07_28_0_10023_src_62_spill_-1.out
> 18M   
> /hadoopfs/fs1/yarn/nodemanager/usercache/cloudbreak/appcache/application_1476667862449_0043/attempt_1476667862449_0043_1_07_28_0_10023_src_63_spill_-1.out
> 16M   
> /hadoopfs/fs1/yarn/nodemanager/usercache/cloudbreak/appcache/application_1476667862449_0043/attempt_1476667862449_0043_1_07_28_0_10023_src_64_spill_-1.out
> ..
> ..
> 18M   
> /hadoopfs/fs1/yarn/nodemanager/usercache/cloudbreak/appcache/application_1476667862449_0043/attempt_1476667862449_0043_1_07_28_2_10003_src_0_spill_-1.out
> 17M   
> /hadoopfs/fs1/yarn/nodemanager/usercache/cloudbreak/appcache/application_1476667862449_0043/attempt_1476667862449_0043_1_07_28_2_10003_src_13_spill_-1.out

[jira] [Commented] (TEZ-3479) DAG AM does not schedule any more containers in corner cases

2016-10-18 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586992#comment-15586992
 ] 

Hitesh Shah commented on TEZ-3479:
--

[~rajesh.balamohan] Is this happening only in the cases where the AM crashes 
and tries to recover? 

> DAG AM does not schedule any more containers in corner cases
> 
>
> Key: TEZ-3479
> URL: https://issues.apache.org/jira/browse/TEZ-3479
> Project: Apache Tez
>  Issue Type: Improvement
>Affects Versions: 0.7.1
>Reporter: Rajesh Balamohan
> Attachments: application_1476667862449_0031_not_complete.1.log.tar.gz
>
>
> Env: 3-node AWS cluster with data residing in S3. Tez version is 0.7.
> Some workloads generate so much data that tasks start throwing "No space 
> available" errors on local disks (e.g. Q29 in TPC-DS). The DAG should fail 
> after enough retries, which happens most of the time. Once in a while (~once 
> in 20-30 runs), the DAG AM gets into a hung state and does not schedule any 
> more containers for the failed task attempts. Will attach the logs shortly. 





[jira] [Commented] (TEZ-3478) Cleanup fetcher data for failing task attempts (Unordered fetcher)

2016-10-18 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586962#comment-15586962
 ] 

Hitesh Shah commented on TEZ-3478:
--

Is this only an issue with unordered data? 
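The cleanup the ticket asks for amounts to deleting the spill files a failed attempt left in the app cache, keyed off the attempt-id prefix visible in the file listing quoted below. The sketch is a hypothetical illustration: the class, method, and the `<attemptId>*_spill_*.out` matching rule are assumptions modeled on those paths, not the Tez fetcher's actual cleanup hook.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical sketch: on attempt failure, remove every spill file that
// attempt wrote into the application cache directory, freeing local disk
// for subsequent attempts.
public class FailedAttemptCleanup {
  // Deletes files named "<attemptId>..._spill_....out" under appCacheDir;
  // returns how many files were removed.
  static long cleanup(Path appCacheDir, String attemptId) throws IOException {
    long removed = 0;
    try (Stream<Path> files = Files.list(appCacheDir)) {
      for (Path p : (Iterable<Path>) files::iterator) {
        String name = p.getFileName().toString();
        if (name.startsWith(attemptId) && name.contains("_spill_")) {
          Files.delete(p);
          removed++;
        }
      }
    }
    return removed;
  }
}
```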

> Cleanup fetcher data for failing task attempts (Unordered fetcher)
> --
>
> Key: TEZ-3478
> URL: https://issues.apache.org/jira/browse/TEZ-3478
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Rajesh Balamohan
>Assignee: Rajesh Balamohan
>Priority: Minor
>
> Env: 3-node AWS cluster with the entire dataset in S3. Since data is in S3, 
> it does not have additional storage for HDFS (it uses the existing space 
> available in the VMs). Tez version is 0.7.
> With some workloads (e.g. q29 in TPC-DS), unordered fetchers download data 
> in parallel for different vertices and run out of disk space. However, 
> downloaded data related to these failed task attempts is not cleared, so 
> subsequent task attempts encounter a similar situation and fail with a "No 
> space" exception. e.g. stack trace:
> {noformat}
> , errorMessage=Fetch failed:org.apache.hadoop.fs.FSError: 
> java.io.IOException: No space left on device
> at 
> org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:261)
> at 
> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
> at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
> at 
> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58)
> at java.io.DataOutputStream.write(DataOutputStream.java:107)
> at 
> org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.writeChunk(ChecksumFileSystem.java:426)
> at 
> org.apache.hadoop.fs.FSOutputSummer.writeChecksumChunks(FSOutputSummer.java:206)
> at org.apache.hadoop.fs.FSOutputSummer.write1(FSOutputSummer.java:124)
> at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:110)
> at 
> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58)
> at java.io.DataOutputStream.write(DataOutputStream.java:107)
> at 
> org.apache.tez.runtime.library.common.shuffle.ShuffleUtils.shuffleToDisk(ShuffleUtils.java:146)
> at 
> org.apache.tez.runtime.library.common.shuffle.Fetcher.fetchInputs(Fetcher.java:771)
> at 
> org.apache.tez.runtime.library.common.shuffle.Fetcher.doHttpFetch(Fetcher.java:497)
> at 
> org.apache.tez.runtime.library.common.shuffle.Fetcher.doHttpFetch(Fetcher.java:396)
> at 
> org.apache.tez.runtime.library.common.shuffle.Fetcher.callInternal(Fetcher.java:195)
> at 
> org.apache.tez.runtime.library.common.shuffle.Fetcher.callInternal(Fetcher.java:70)
> at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: No space left on device
> at java.io.FileOutputStream.writeBytes(Native Method)
> at java.io.FileOutputStream.write(FileOutputStream.java:345)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSy
> {noformat}
> This would also affect any other job running in the cluster at the same 
> time. It would be helpful to clean up the data downloaded for the failed 
> task attempts.
> Creating this ticket mainly for the unordered fetcher case, though it could 
> be a similar case for the ordered shuffle as well.
> e.g files
> {noformat}
> 17M   
> /hadoopfs/fs1/yarn/nodemanager/usercache/cloudbreak/appcache/application_1476667862449_0043/attempt_1476667862449_0043_1_07_28_0_10023_src_62_spill_-1.out
> 18M   
> /hadoopfs/fs1/yarn/nodemanager/usercache/cloudbreak/appcache/application_1476667862449_0043/attempt_1476667862449_0043_1_07_28_0_10023_src_63_spill_-1.out
> 16M   
> /hadoopfs/fs1/yarn/nodemanager/usercache/cloudbreak/appcache/application_1476667862449_0043/attempt_1476667862449_0043_1_07_28_0_10023_src_64_spill_-1.out
> ..
> ..
> 18M   
> /hadoopfs/fs1/yarn/nodemanager/usercache/cloudbreak/appcache/application_1476667862449_0043/attempt_1476667862449_0043_1_07_28_2_10003_src_0_spill_-1.out
> 17M   
> /hadoopfs/fs1/yarn/nodemanager/usercache/cloudbreak/appcache/application_1476667862449_0043/attempt_1476667862449_0043_1_07_28_2_10003_src_13_spill_-1.out
> 16M   
> 

[jira] [Updated] (TEZ-3479) DAG AM does not schedule any more containers in corner cases

2016-10-18 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated TEZ-3479:
--
Attachment: application_1476667862449_0031_not_complete.1.log.tar.gz

> DAG AM does not schedule any more containers in corner cases
> 
>
> Key: TEZ-3479
> URL: https://issues.apache.org/jira/browse/TEZ-3479
> Project: Apache Tez
>  Issue Type: Improvement
>Affects Versions: 0.7.1
>Reporter: Rajesh Balamohan
> Attachments: application_1476667862449_0031_not_complete.1.log.tar.gz
>
>
> Env: 3-node AWS cluster with data residing in S3. Tez version is 0.7.
> Some workloads generate so much data that tasks start throwing "No space 
> available" errors on local disks (e.g. Q29 in TPC-DS). The DAG should fail 
> after enough retries, which happens most of the time. Once in a while (~once 
> in 20-30 runs), the DAG AM gets into a hung state and does not schedule any 
> more containers for the failed task attempts. Will attach the logs shortly. 





[jira] [Updated] (TEZ-3479) DAG AM does not schedule any more containers in corner cases

2016-10-18 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated TEZ-3479:
--
Affects Version/s: 0.7.1

> DAG AM does not schedule any more containers in corner cases
> 
>
> Key: TEZ-3479
> URL: https://issues.apache.org/jira/browse/TEZ-3479
> Project: Apache Tez
>  Issue Type: Improvement
>Affects Versions: 0.7.1
>Reporter: Rajesh Balamohan
>
> Env: 3-node AWS cluster with data residing in S3. Tez version is 0.7.
> Some workloads generate so much data that tasks start throwing "No space 
> available" errors on local disks (e.g. Q29 in TPC-DS). The DAG should fail 
> after enough retries, which happens most of the time. Once in a while (~once 
> in 20-30 runs), the DAG AM gets into a hung state and does not schedule any 
> more containers for the failed task attempts. Will attach the logs shortly. 





[jira] [Created] (TEZ-3479) DAG AM does not schedule any more containers in corner cases

2016-10-18 Thread Rajesh Balamohan (JIRA)
Rajesh Balamohan created TEZ-3479:
-

 Summary: DAG AM does not schedule any more containers in corner 
cases
 Key: TEZ-3479
 URL: https://issues.apache.org/jira/browse/TEZ-3479
 Project: Apache Tez
  Issue Type: Improvement
Reporter: Rajesh Balamohan



Env: 3-node AWS cluster with data residing in S3. Tez version is 0.7.

Some workloads generate so much data that tasks start throwing "No space 
available" errors on local disks (e.g. Q29 in TPC-DS). The DAG should fail 
after enough retries, which happens most of the time. Once in a while (~once 
in 20-30 runs), the DAG AM gets into a hung state and does not schedule any 
more containers for the failed task attempts. Will attach the logs shortly. 





[jira] [Created] (TEZ-3478) Cleanup fetcher data for failing task attempts (Unordered fetcher)

2016-10-18 Thread Rajesh Balamohan (JIRA)
Rajesh Balamohan created TEZ-3478:
-

 Summary: Cleanup fetcher data for failing task attempts (Unordered 
fetcher)
 Key: TEZ-3478
 URL: https://issues.apache.org/jira/browse/TEZ-3478
 Project: Apache Tez
  Issue Type: Improvement
Reporter: Rajesh Balamohan
Assignee: Rajesh Balamohan
Priority: Minor



Env: 3-node AWS cluster with the entire dataset in S3. Since data is in S3, it 
does not have additional storage for HDFS (it uses the existing space 
available in the VMs). Tez version is 0.7.

With some workloads (e.g. q29 in TPC-DS), unordered fetchers download data in 
parallel for different vertices and run out of disk space. However, downloaded 
data related to these failed task attempts is not cleared, so subsequent task 
attempts encounter a similar situation and fail with a "No space" exception. 
e.g. stack trace:
{noformat}
, errorMessage=Fetch failed:org.apache.hadoop.fs.FSError: java.io.IOException: No space left on device
    at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:261)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
    at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58)
    at java.io.DataOutputStream.write(DataOutputStream.java:107)
    at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.writeChunk(ChecksumFileSystem.java:426)
    at org.apache.hadoop.fs.FSOutputSummer.writeChecksumChunks(FSOutputSummer.java:206)
    at org.apache.hadoop.fs.FSOutputSummer.write1(FSOutputSummer.java:124)
    at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:110)
    at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58)
    at java.io.DataOutputStream.write(DataOutputStream.java:107)
    at org.apache.tez.runtime.library.common.shuffle.ShuffleUtils.shuffleToDisk(ShuffleUtils.java:146)
    at org.apache.tez.runtime.library.common.shuffle.Fetcher.fetchInputs(Fetcher.java:771)
    at org.apache.tez.runtime.library.common.shuffle.Fetcher.doHttpFetch(Fetcher.java:497)
    at org.apache.tez.runtime.library.common.shuffle.Fetcher.doHttpFetch(Fetcher.java:396)
    at org.apache.tez.runtime.library.common.shuffle.Fetcher.callInternal(Fetcher.java:195)
    at org.apache.tez.runtime.library.common.shuffle.Fetcher.callInternal(Fetcher.java:70)
    at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: No space left on device
    at java.io.FileOutputStream.writeBytes(Native Method)
    at java.io.FileOutputStream.write(FileOutputStream.java:345)
    at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSy
{noformat}

This would also affect any other job running in the cluster at the same time. 
It would be helpful to clean up the data downloaded for the failed task 
attempts.
Creating this ticket mainly for the unordered fetcher case, though the ordered 
shuffle case could be similar.

Example files:
{noformat}
17M /hadoopfs/fs1/yarn/nodemanager/usercache/cloudbreak/appcache/application_1476667862449_0043/attempt_1476667862449_0043_1_07_28_0_10023_src_62_spill_-1.out
18M /hadoopfs/fs1/yarn/nodemanager/usercache/cloudbreak/appcache/application_1476667862449_0043/attempt_1476667862449_0043_1_07_28_0_10023_src_63_spill_-1.out
16M /hadoopfs/fs1/yarn/nodemanager/usercache/cloudbreak/appcache/application_1476667862449_0043/attempt_1476667862449_0043_1_07_28_0_10023_src_64_spill_-1.out
..
..

18M /hadoopfs/fs1/yarn/nodemanager/usercache/cloudbreak/appcache/application_1476667862449_0043/attempt_1476667862449_0043_1_07_28_2_10003_src_0_spill_-1.out
17M /hadoopfs/fs1/yarn/nodemanager/usercache/cloudbreak/appcache/application_1476667862449_0043/attempt_1476667862449_0043_1_07_28_2_10003_src_13_spill_-1.out
16M /hadoopfs/fs1/yarn/nodemanager/usercache/cloudbreak/appcache/application_1476667862449_0043/attempt_1476667862449_0043_1_07_28_2_10003_src_15_spill_-1.out
16M /hadoopfs/fs1/yarn/nodemanager/usercache/cloudbreak/appcache/application_1476667862449_0043/attempt_1476667862449_0043_1_07_28_2_10003_src_17_spill_-1.ou
{noformat}
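
A possible shape for the requested cleanup, sketched as a standalone helper 
(hypothetical names and layout; this is not the actual Tez fetcher code): after 
a task attempt fails, delete the spill files it downloaded so later attempts on 
the same node are not starved of disk.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical sketch of the cleanup this ticket asks for: remove the
// "*_spill_*.out" files a failed attempt left behind in its local dir.
public class SpillCleanup {

    // Deletes every file in localDir whose name starts with the failed
    // attempt's id and ends with ".out"; returns how many were removed.
    static long deleteSpills(Path localDir, String attemptId) throws IOException {
        long removed = 0;
        try (Stream<Path> files = Files.list(localDir)) {
            for (Path p : (Iterable<Path>) files::iterator) {
                String name = p.getFileName().toString();
                if (name.startsWith(attemptId) && name.endsWith(".out")) {
                    Files.delete(p);
                    removed++;
                }
            }
        }
        return removed;
    }
}
```

In the real fetcher this would be driven from the attempt-failure path, and 
would also need to handle files still held open by in-flight fetches.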






[jira] [Updated] (TEZ-3439) Tez joinvalidate fails when first input argument size is bigger than the second

2016-10-18 Thread Hitesh Shah (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hitesh Shah updated TEZ-3439:
-
Summary: Tez joinvalidate fails when first input argument size is bigger 
than the second  (was: Tez joinvalidate example failed when first input 
argument size is bigger than the second)

> Tez joinvalidate fails when first input argument size is bigger than the 
> second
> ---
>
> Key: TEZ-3439
> URL: https://issues.apache.org/jira/browse/TEZ-3439
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Hui Cao
>Assignee: Hui Cao
> Attachments: TEZ-3439.1.patch, TEZ-3439.2.patch
>
>
> When using joinvalidate from the Tez examples jar, with the command
> {{"hadoop jar tez-examples-.jar joinvalidate  "}}, if the size of the first 
> input is bigger than the second, an IOException is thrown.
> {noformat}
> 16/09/21 00:07:53 INFO examples.JoinValidate: DAG diagnostics: [Vertex failed, vertexName=joinvalidate, vertexId=vertex_1473073428528_0031_1_02, diagnostics=[Task failed, taskId=task_1473073428528_0031_1_02_00, diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( failure ) : attempt_1473073428528_0031_1_02_00_0:java.io.IOException: Please check if you are invoking moveToNext() even after it returned false.
>   at org.apache.tez.runtime.library.common.ValuesIterator.hasCompletedProcessing(ValuesIterator.java:221)
>   at org.apache.tez.runtime.library.common.ValuesIterator.moveToNext(ValuesIterator.java:103)
>   at org.apache.tez.runtime.library.input.OrderedGroupedKVInput$OrderedGroupedKeyValuesReader.next(OrderedGroupedKVInput.java:321)
>   at org.apache.tez.examples.JoinValidate$JoinValidateProcessor.run(JoinValidate.java:254)
>   at org.apache.tez.runtime.library.processor.SimpleProcessor.run(SimpleProcessor.java:53)
>   at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:370)
>   at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
>   at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>   at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
>   at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
>   at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
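
The IOException above is the reader's guard against client code calling 
moveToNext()/next() again after it has already returned false. A minimal sketch 
of that contract, using a hypothetical mock rather than the real 
org.apache.tez.runtime.library.api.KeyValuesReader:

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.List;

// Simplified stand-in for the reader contract (hypothetical mock, not the
// real Tez KeyValuesReader): once next() has returned false, any further
// call is a usage error and throws.
public class ReaderContract {

    static class MockReader {
        private final Iterator<String> keys;
        private boolean done;

        MockReader(List<String> data) { this.keys = data.iterator(); }

        boolean next() throws IOException {
            if (done) {
                throw new IOException(
                    "Please check if you are invoking moveToNext() even after it returned false.");
            }
            if (keys.hasNext()) { keys.next(); return true; }
            done = true;
            return false;
        }
    }

    // Correct consumption pattern: one loop, and the reader is never
    // touched again after next() returns false.
    static int consume(MockReader reader) throws IOException {
        int count = 0;
        while (reader.next()) { count++; }
        return count;
    }
}
```

Whatever the fix looks like, iteration has to stay within this pattern for 
both input orderings.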





[jira] [Commented] (TEZ-3439) Tez joinvalidate example failed when first input argument size is bigger than the second

2016-10-18 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586823#comment-15586823
 ] 

Hitesh Shah commented on TEZ-3439:
--

+1. Committing shortly. 

> Tez joinvalidate example failed when first input argument size is bigger than 
> the second
> 
>
> Key: TEZ-3439
> URL: https://issues.apache.org/jira/browse/TEZ-3439
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Hui Cao
>Assignee: Hui Cao
> Attachments: TEZ-3439.1.patch, TEZ-3439.2.patch
>
>
> When using joinvalidate from the Tez examples jar, with the command
> {{"hadoop jar tez-examples-.jar joinvalidate  "}}, if the size of the first 
> input is bigger than the second, an IOException is thrown.
> {noformat}
> 16/09/21 00:07:53 INFO examples.JoinValidate: DAG diagnostics: [Vertex failed, vertexName=joinvalidate, vertexId=vertex_1473073428528_0031_1_02, diagnostics=[Task failed, taskId=task_1473073428528_0031_1_02_00, diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( failure ) : attempt_1473073428528_0031_1_02_00_0:java.io.IOException: Please check if you are invoking moveToNext() even after it returned false.
>   at org.apache.tez.runtime.library.common.ValuesIterator.hasCompletedProcessing(ValuesIterator.java:221)
>   at org.apache.tez.runtime.library.common.ValuesIterator.moveToNext(ValuesIterator.java:103)
>   at org.apache.tez.runtime.library.input.OrderedGroupedKVInput$OrderedGroupedKeyValuesReader.next(OrderedGroupedKVInput.java:321)
>   at org.apache.tez.examples.JoinValidate$JoinValidateProcessor.run(JoinValidate.java:254)
>   at org.apache.tez.runtime.library.processor.SimpleProcessor.run(SimpleProcessor.java:53)
>   at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:370)
>   at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
>   at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>   at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
>   at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
>   at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}





[jira] [Commented] (TEZ-3462) Task attempt failure during container shutdown loses useful container diagnostics

2016-10-18 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586783#comment-15586783
 ] 

Hitesh Shah commented on TEZ-3462:
--

bq. complicated to handle ... since the ATS publish would already have happened

This could be doable via a separate history event if needed and diagnostics 
could be updated into ATS. 


> Task attempt failure during container shutdown loses useful container 
> diagnostics
> -
>
> Key: TEZ-3462
> URL: https://issues.apache.org/jira/browse/TEZ-3462
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.7.1
>Reporter: Jason Lowe
>Assignee: Eric Badger
> Attachments: TEZ-3462.001.patch
>
>
> When a nodemanager kills a task attempt due to excessive memory usage it will 
> send a SIGTERM followed by a SIGKILL.  It also sends a useful diagnostic 
> message with the container completion event to the RM which will eventually 
> make it to the AM on a subsequent heartbeat.
> However if the JVM shutdown processing causes an error in the task (e.g.: 
> filesystem being closed by shutdown hook) then the task attempt can report a 
> failure before the useful NM diagnostic makes it to the AM.  The AM then 
> records some other error as the task failure reason, and by the time the 
> container completion status makes it to the AM it does not associate that 
> error with the task attempt and the useful information is lost.





[jira] [Commented] (TEZ-3405) Support ability for AM to kill itself if there is no client heartbeating to it

2016-10-18 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586778#comment-15586778
 ] 

Hitesh Shah commented on TEZ-3405:
--

ping [~sseth]  - please help with hopefully a final review whenever you get a 
chance. 

> Support ability for AM to kill itself if there is no client heartbeating to it
> --
>
> Key: TEZ-3405
> URL: https://issues.apache.org/jira/browse/TEZ-3405
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Gunther Hagleitner
>Assignee: Hitesh Shah
>Priority: Critical
> Attachments: TEZ-3405.1.patch, TEZ-3405.2.patch, TEZ-3405.3.patch, 
> TEZ-3405.4.patch, TEZ-3405.5.patch
>
>
> HiveServer2 optionally maintains a pool of AMs in either Tez or LLAP mode. 
> This is done to amortize the cost of launching a Tez session.
> We also try in a shutdown hook to kill all these AMs when HS2 goes down. 
> However, there are cases where HS2 doesn't get the chance to kill these AMs 
> before it goes away. As a result these zombie AMs hang around until the 
> timeout kicks in.
> The trouble with the timeout is that we have to set it fairly high. Otherwise 
> the benefit of having pre-launched AMs obviously goes away (in a lightly 
> loaded cluster).
> So, if people kill/restart HS2 they often times run into situations where the 
> cluster/queue doesn't have any more capacity for AMs. They either have to 
> manually kill the zombies or wait.
> The request is therefore for Tez to maintain a heartbeat to the client. If 
> the client goes away the AM should exit. That way we can keep the AMs alive 
> for a long time regardless of activity and at the same time don't have to 
> worry about them if HS2 goes down.
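
The requested behaviour can be sketched roughly as follows (an assumed design, 
not the actual TEZ-3405 patch): the AM timestamps every client heartbeat, and a 
periodic check shuts the AM down once the timeout elapses without one.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch of an AM-side client liveness monitor (assumed
// design): record the last heartbeat time and report expiry once no ping
// has arrived within the configured timeout.
public class ClientLivenessMonitor {
    private final AtomicLong lastHeartbeat = new AtomicLong(System.nanoTime());
    private final long timeoutNanos;

    ClientLivenessMonitor(long timeout, TimeUnit unit) {
        this.timeoutNanos = unit.toNanos(timeout);
    }

    // Called from the client RPC handler on every heartbeat.
    void ping() { lastHeartbeat.set(System.nanoTime()); }

    // Polled periodically by the AM; true means the client is gone and
    // the session should shut itself down.
    boolean expired() {
        return System.nanoTime() - lastHeartbeat.get() > timeoutNanos;
    }
}
```

The AM would call ping() from the client-facing RPC handler and poll expired() 
from a scheduled task that triggers session shutdown.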





[jira] [Updated] (TEZ-3419) Tez UI: Applications page shows error, for users with only DAG level ACL permission.

2016-10-18 Thread Hitesh Shah (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hitesh Shah updated TEZ-3419:
-
Target Version/s: 0.8.5  (was: 0.9.0)

> Tez UI: Applications page shows error, for users with only DAG level ACL 
> permission.
> 
>
> Key: TEZ-3419
> URL: https://issues.apache.org/jira/browse/TEZ-3419
> Project: Apache Tez
>  Issue Type: Sub-task
>Affects Versions: 0.7.0
>Reporter: Sreenath Somarajapuram
>Assignee: Sreenath Somarajapuram
> Attachments: Screen Shot 2016-10-13 at 4.25.31 PM.png, Screen Shot 
> 2016-10-13 at 4.37.09 PM.png, Screen Shot 2016-10-17 at 4.11.29 PM.png, 
> Screen Shot 2016-10-17 at 4.11.59 PM.png, Screen Shot 2016-10-17 at 4.12.23 
> PM.png, TEZ-3419.1.patch, TEZ-3419.2.patch, TEZ-3419.3.patch, 
> TEZ-3419.4.patch, TEZ-3419.5.patch, TEZ-3419.6.patch, TEZ-3419.wip.1.patch, 
> Tez data missing.png, YARN & Tez data missing.png, YARN data missing.png
>
>
> Follow this logic and display a better message:
> On loading the app details page, send a request to 
> /ws/v1/timeline/TEZ_APPLICATION/tez_
> - If it succeeds, display the details page as we do now.
> - If it fails, send a request to 
> /ws/v1/timeline/TEZ_DAG_ID?primaryFilter=applicationId%3A
> -- If it succeeds, then we know that DAGs under the app are available, and 
> assume that the user doesn't have permission to access app-level data.
> --- If AHS is accessible, display application data from there in the details 
> page.
> --- Else, if AHS is not accessible, display a message in the app details tab, 
> something like "Data is not available. Check if you are authorized to access 
> application data!".
> --- Also display the DAGs tab, for the user to see DAGs under that app.
> -- If it fails, display the error message as we do now.





[jira] [Commented] (TEZ-3419) Tez UI: Applications page shows error, for users with only DAG level ACL permission.

2016-10-18 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586674#comment-15586674
 ] 

Hitesh Shah commented on TEZ-3419:
--

+1

> Tez UI: Applications page shows error, for users with only DAG level ACL 
> permission.
> 
>
> Key: TEZ-3419
> URL: https://issues.apache.org/jira/browse/TEZ-3419
> Project: Apache Tez
>  Issue Type: Sub-task
>Affects Versions: 0.7.0
>Reporter: Sreenath Somarajapuram
>Assignee: Sreenath Somarajapuram
> Attachments: Screen Shot 2016-10-13 at 4.25.31 PM.png, Screen Shot 
> 2016-10-13 at 4.37.09 PM.png, Screen Shot 2016-10-17 at 4.11.29 PM.png, 
> Screen Shot 2016-10-17 at 4.11.59 PM.png, Screen Shot 2016-10-17 at 4.12.23 
> PM.png, TEZ-3419.1.patch, TEZ-3419.2.patch, TEZ-3419.3.patch, 
> TEZ-3419.4.patch, TEZ-3419.5.patch, TEZ-3419.6.patch, TEZ-3419.wip.1.patch, 
> Tez data missing.png, YARN & Tez data missing.png, YARN data missing.png
>
>
> Follow this logic and display a better message:
> On loading the app details page, send a request to 
> /ws/v1/timeline/TEZ_APPLICATION/tez_
> - If it succeeds, display the details page as we do now.
> - If it fails, send a request to 
> /ws/v1/timeline/TEZ_DAG_ID?primaryFilter=applicationId%3A
> -- If it succeeds, then we know that DAGs under the app are available, and 
> assume that the user doesn't have permission to access app-level data.
> --- If AHS is accessible, display application data from there in the details 
> page.
> --- Else, if AHS is not accessible, display a message in the app details tab, 
> something like "Data is not available. Check if you are authorized to access 
> application data!".
> --- Also display the DAGs tab, for the user to see DAGs under that app.
> -- If it fails, display the error message as we do now.





[jira] [Updated] (TEZ-3477) MRInputHelpers generateInputSplitsToMem public API modified

2016-10-18 Thread Jonathan Eagles (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Eagles updated TEZ-3477:
-
Attachment: TEZ-3477.1.patch

> MRInputHelpers generateInputSplitsToMem public API modified
> ---
>
> Key: TEZ-3477
> URL: https://issues.apache.org/jira/browse/TEZ-3477
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jonathan Eagles
>Assignee: Jonathan Eagles
> Attachments: TEZ-3477.1.patch
>
>
> Pig and Hive directly rely on specific APIs in MRInputHelpers. I would like 
> to ensure these signatures are prevented from being modified.
> - MRInputHelpers.generateInputSplitsToMem
> - MRInputHelpers.parseMRInputPayload
> - MRInputHelpers.createSplitProto
> - MRInputHelpers.createOldFormatSplitFromUserPayload
> - MRInputHelpers.configureMRInputWithLegacySplitGeneration
> A recently fixed jira, TEZ-3430, modified generateInputSplitsToMem:
> {code}
> java.lang.NoSuchMethodError: 
> org.apache.tez.mapreduce.hadoop.MRInputHelpers.generateInputSplitsToMem(Lorg/apache/hadoop/conf/Configuration;ZI)Lorg/apache/tez/mapreduce/hadoop/InputSplitInfoMem;
> {code}





[jira] [Commented] (TEZ-3477) MRInputHelpers generateInputSplitsToMem public API modified

2016-10-18 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586551#comment-15586551
 ] 

Hitesh Shah commented on TEZ-3477:
--

Might be good to add backward compatible functions to account for the changes 
brought in by TEZ-3430 as part of this jira too. 

> MRInputHelpers generateInputSplitsToMem public API modified
> ---
>
> Key: TEZ-3477
> URL: https://issues.apache.org/jira/browse/TEZ-3477
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jonathan Eagles
>Assignee: Jonathan Eagles
>
> Pig and Hive directly rely on specific APIs in MRInputHelpers. I would like 
> to ensure these signatures are prevented from being modified.
> - MRInputHelpers.generateInputSplitsToMem
> - MRInputHelpers.parseMRInputPayload
> - MRInputHelpers.createSplitProto
> - MRInputHelpers.createOldFormatSplitFromUserPayload
> - MRInputHelpers.configureMRInputWithLegacySplitGeneration
> A recently fixed jira, TEZ-3430, modified generateInputSplitsToMem:
> {code}
> java.lang.NoSuchMethodError: 
> org.apache.tez.mapreduce.hadoop.MRInputHelpers.generateInputSplitsToMem(Lorg/apache/hadoop/conf/Configuration;ZI)Lorg/apache/tez/mapreduce/hadoop/InputSplitInfoMem;
> {code}





[jira] [Created] (TEZ-3477) MRInputHelpers generateInputSplitsToMem public API modified

2016-10-18 Thread Jonathan Eagles (JIRA)
Jonathan Eagles created TEZ-3477:


 Summary: MRInputHelpers generateInputSplitsToMem public API 
modified
 Key: TEZ-3477
 URL: https://issues.apache.org/jira/browse/TEZ-3477
 Project: Apache Tez
  Issue Type: Bug
Reporter: Jonathan Eagles
Assignee: Jonathan Eagles


Pig and Hive directly rely on specific APIs in MRInputHelpers. I would like to 
ensure these signatures are prevented from being modified.

- MRInputHelpers.generateInputSplitsToMem
- MRInputHelpers.parseMRInputPayload
- MRInputHelpers.createSplitProto
- MRInputHelpers.createOldFormatSplitFromUserPayload
- MRInputHelpers.configureMRInputWithLegacySplitGeneration

A recently fixed jira, TEZ-3430, modified generateInputSplitsToMem:

{code}
java.lang.NoSuchMethodError: 
org.apache.tez.mapreduce.hadoop.MRInputHelpers.generateInputSplitsToMem(Lorg/apache/hadoop/conf/Configuration;ZI)Lorg/apache/tez/mapreduce/hadoop/InputSplitInfoMem;
{code}
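
The backward-compatible functions Hitesh suggests would look roughly like this 
(hypothetical simplified signatures using plain strings; the real method is 
MRInputHelpers.generateInputSplitsToMem and takes Hadoop types): retain the 
previous public overload alongside the new one and delegate with a default for 
the changed parameter.

```java
// Sketch of a backward-compatibility shim (hypothetical simplified
// signatures, not the real MRInputHelpers API): keep the old overload so
// callers compiled against the previous release do not hit
// NoSuchMethodError, and delegate to the new one.
public class SplitHelper {

    // Stand-in for the post-TEZ-3430 signature.
    public static String generateInputSplitsToMem(String conf, boolean groupSplits,
                                                  int targetTasks) {
        return "splits(" + conf + "," + groupSplits + "," + targetTasks + ")";
    }

    // Previous signature retained for binary compatibility; delegates
    // with a sentinel meaning "use the default number of tasks".
    public static String generateInputSplitsToMem(String conf, boolean groupSplits) {
        return generateInputSplitsToMem(conf, groupSplits, -1);
    }
}
```

Because the shim is a separate overload rather than a changed signature, jars 
compiled against either release resolve a matching method at runtime.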





[jira] [Created] (TEZ-3476) Need a way to account for container localization.

2016-10-18 Thread Eric Payne (JIRA)
Eric Payne created TEZ-3476:
---

 Summary: Need a way to account for container localization.
 Key: TEZ-3476
 URL: https://issues.apache.org/jira/browse/TEZ-3476
 Project: Apache Tez
  Issue Type: Bug
Reporter: Eric Payne


Tez task attempt start times don't reflect time spent in localization.

In the MapReduce framework, the time spent in localization was included in the 
total runtime of each task attempt. But since Tez reuses containers, the time 
spent localizing for a container is not captured. The start time of the first 
attempt in that container will only be set after the localization has completed.

The result is that attempts can appear as if they are not being run even though 
there are resources available in the queue. An attempt can be assigned to a 
container, but if the container is on a slow node and it takes a long time to 
localize, the attempt state will remain pending until localization completes.

The risk is that tasks will not speculate during localization, since they have 
not yet started.





[jira] [Commented] (TEZ-3458) Auto grouping for cartesian product edge(unpartitioned case)

2016-10-18 Thread TezQA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586045#comment-15586045
 ] 

TezQA commented on TEZ-3458:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12833984/TEZ-3458.2.patch
  against master revision 04d609e.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 10 new 
or modified test files.

{color:red}-1 javac{color}.  The patch appears to cause the build to 
fail.

Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/2044//console

This message is automatically generated.

> Auto grouping for cartesian product edge(unpartitioned case)
> 
>
> Key: TEZ-3458
> URL: https://issues.apache.org/jira/browse/TEZ-3458
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Zhiyuan Yang
>Assignee: Zhiyuan Yang
> Attachments: TEZ-3458.1.patch, TEZ-3458.2.patch
>
>
> The original CartesianProductVertexManagerUnpartitioned sets parallelism to 
> the product of all source vertices' parallelism, which may explode to an 
> insanely large number. We should auto-reduce as in ShuffleVertexManager to 
> avoid this.





Failed: TEZ-3458 PreCommit Build #2044

2016-10-18 Thread Apache Jenkins Server
Jira: https://issues.apache.org/jira/browse/TEZ-3458
Build: https://builds.apache.org/job/PreCommit-TEZ-Build/2044/

###
## LAST 60 LINES OF THE CONSOLE 
###
[...truncated 102 lines...]
patching file 
tez-runtime-library/src/test/java/org/apache/tez/runtime/library/cartesianproduct/TestCartesianProductVertexManagerConfig.java
patching file 
tez-runtime-library/src/test/java/org/apache/tez/runtime/library/cartesianproduct/TestCartesianProductVertexManagerPartitioned.java
patching file 
tez-runtime-library/src/test/java/org/apache/tez/runtime/library/cartesianproduct/TestCartesianProductVertexManagerUnpartitioned.java
patching file 
tez-runtime-library/src/test/java/org/apache/tez/runtime/library/cartesianproduct/TestGrouper.java


==
==
Determining number of patched javac warnings.
==
==


/home/jenkins/tools/maven/latest/bin/mvn clean test -DskipTests -Ptest-patch > 
/home/jenkins/jenkins-slave/workspace/PreCommit-TEZ-Build/../patchprocess/patchJavacWarnings.txt
 2>&1




{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12833984/TEZ-3458.2.patch
  against master revision 04d609e.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 10 new 
or modified test files.

{color:red}-1 javac{color}.  The patch appears to cause the build to 
fail.

Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/2044//console

This message is automatically generated.


==
==
Adding comment to Jira.
==
==


Comment added.
293210c5cd443cd3bb1bbff53167dec689e34bbb logged out


==
==
Finished build.
==
==


Build step 'Execute shell' marked build as failure
Archiving artifacts
Compressed 786.92 KB of artifacts by 44.7% relative to #2043
[description-setter] Could not determine description.
Recording test results
ERROR: Step ‘Publish JUnit test result report’ failed: No test report files 
were found. Configuration error?
Email was triggered for: Failure - Any
Sending email for trigger: Failure - Any



###
## FAILED TESTS (if any) 
##
No tests ran.

[jira] [Updated] (TEZ-3458) Auto grouping for cartesian product edge(unpartitioned case)

2016-10-18 Thread Zhiyuan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhiyuan Yang updated TEZ-3458:
--
Attachment: TEZ-3458.2.patch

> Auto grouping for cartesian product edge(unpartitioned case)
> 
>
> Key: TEZ-3458
> URL: https://issues.apache.org/jira/browse/TEZ-3458
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Zhiyuan Yang
>Assignee: Zhiyuan Yang
> Attachments: TEZ-3458.1.patch, TEZ-3458.2.patch
>
>
> The original CartesianProductVertexManagerUnpartitioned sets parallelism to 
> the product of all source vertices' parallelism, which may explode to an 
> insanely large number. We should auto-reduce as in ShuffleVertexManager to 
> avoid this.





[jira] [Commented] (TEZ-3452) Auto-reduce parallelism calculation can overflow with large inputs

2016-10-18 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15585906#comment-15585906
 ] 

Ming Ma commented on TEZ-3452:
--

+1. Thanks [~jeagles].

> Auto-reduce parallelism calculation can overflow with large inputs
> --
>
> Key: TEZ-3452
> URL: https://issues.apache.org/jira/browse/TEZ-3452
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Jonathan Eagles
>Assignee: Jonathan Eagles
> Attachments: TEZ-3452.1.patch, TEZ-3452.2.patch, TEZ-3452.3.patch
>
>
> Overflow can occur when the numTasks is high (say 45000) and outputSize is 
> high (say 311TB) and slow start is set to 1.0. 
> {code:title=ShuffleVertexManager}
> for (Map.Entry vInfo : getBipartiteInfo()) {
>   SourceVertexInfo srcInfo = vInfo.getValue();
>   if (srcInfo.numTasks > 0 && srcInfo.numVMEventsReceived > 0) {
>     // this assumes that 1 vmEvent is received per completed task - TEZ-2961
>     expectedTotalSourceTasksOutputSize +=
>         (srcInfo.numTasks * srcInfo.outputSize) / srcInfo.numVMEventsReceived;
>   }
> }
> {code}
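
The overflow is easy to reproduce in isolation: with the quoted numbers the 
long multiplication wraps negative before the division happens. A small sketch 
of the failure and of one possible mitigation, doing the intermediate 
arithmetic in double (an illustration, not necessarily the committed fix):

```java
public class OverflowDemo {
    // Mirrors the quoted expression: multiply first, then divide. With
    // numTasks = 45000 and outputSize = 311 TiB the product (~1.5e19)
    // exceeds Long.MAX_VALUE (~9.2e18) and wraps negative.
    static long unsafe(long numTasks, long outputSize, long numVMEvents) {
        return (numTasks * outputSize) / numVMEvents;
    }

    // One possible mitigation: compute the estimate in double precision so
    // the intermediate value cannot overflow; a small precision loss is
    // acceptable for a size estimate.
    static long safeEstimate(long numTasks, long outputSize, long numVMEvents) {
        return (long) (((double) numTasks * outputSize) / numVMEvents);
    }
}
```

With the quoted inputs the multiply-first form goes negative while the 
double-based estimate stays close to the true value.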





[jira] [Commented] (TEZ-3475) Merge duplicated method into base class

2016-10-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15585388#comment-15585388
 ] 

ASF GitHub Bot commented on TEZ-3475:
-

GitHub user darionyaphet opened a pull request:

https://github.com/apache/tez/pull/17

TEZ-3475 Merge duplicated method into base class

Merge duplicated method (handleEvents and close) into MRTask.class

[Merge duplicated method into base 
class](https://issues.apache.org/jira/browse/TEZ-3475)

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/darionyaphet/tez TEZ-3475

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/tez/pull/17.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17


commit a729ac669bfb7815ec43805025e5b9f0d7217608
Author: darionyaphet 
Date:   2016-10-18T12:52:31Z

TEZ-3475 Merge duplicated method into base class




> Merge duplicated method into base class
> ---
>
> Key: TEZ-3475
> URL: https://issues.apache.org/jira/browse/TEZ-3475
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.8.4
>Reporter: darion yaphet
>Assignee: darion yaphet
> Fix For: 0.9.0, 0.8.5
>
>






[jira] [Created] (TEZ-3475) Merge duplicated method into base class

2016-10-18 Thread darion yaphet (JIRA)
darion yaphet created TEZ-3475:
--

 Summary: Merge duplicated method into base class
 Key: TEZ-3475
 URL: https://issues.apache.org/jira/browse/TEZ-3475
 Project: Apache Tez
  Issue Type: Bug
Affects Versions: 0.8.4
Reporter: darion yaphet
Assignee: darion yaphet
 Fix For: 0.9.0, 0.8.5








[jira] [Commented] (TEZ-3419) Tez UI: Applications page shows error, for users with only DAG level ACL permission.

2016-10-18 Thread TezQA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15585304#comment-15585304
 ] 

TezQA commented on TEZ-3419:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12833926/TEZ-3419.6.patch
  against master revision 48208dc.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 5 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 3.0.1) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: 
https://builds.apache.org/job/PreCommit-TEZ-Build/2043//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/2043//console

This message is automatically generated.

> Tez UI: Applications page shows error, for users with only DAG level ACL 
> permission.
> 
>
> Key: TEZ-3419
> URL: https://issues.apache.org/jira/browse/TEZ-3419
> Project: Apache Tez
>  Issue Type: Sub-task
>Affects Versions: 0.7.0
>Reporter: Sreenath Somarajapuram
>Assignee: Sreenath Somarajapuram
> Attachments: Screen Shot 2016-10-13 at 4.25.31 PM.png, Screen Shot 
> 2016-10-13 at 4.37.09 PM.png, Screen Shot 2016-10-17 at 4.11.29 PM.png, 
> Screen Shot 2016-10-17 at 4.11.59 PM.png, Screen Shot 2016-10-17 at 4.12.23 
> PM.png, TEZ-3419.1.patch, TEZ-3419.2.patch, TEZ-3419.3.patch, 
> TEZ-3419.4.patch, TEZ-3419.5.patch, TEZ-3419.6.patch, TEZ-3419.wip.1.patch, 
> Tez data missing.png, YARN & Tez data missing.png, YARN data missing.png
>
>
> Follow this logic and display a better message:
> On loading the app details page, send a request to 
> /ws/v1/timeline/TEZ_APPLICATION/tez_
> - If it succeeds, display the details page as we do now.
> - If it fails, send a request to 
> /ws/v1/timeline/TEZ_DAG_ID?primaryFilter=applicationId%3A
> -- If it succeeds, then we know that DAGs under the app are available, and 
> assume that the user doesn't have permission to access app-level data.
> --- If AHS is accessible, display application data from there in the details 
> page.
> --- Else, if AHS is not accessible, display a message in the app details tab, 
> something like "Data is not available. Check if you are authorized to access 
> application data!".
> --- Also display the DAGs tab, for the user to see DAGs under that app.
> -- If it fails, display the error message as we do now.
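The fallback flow above can be sketched as a small decision helper. This is a hypothetical illustration only: the function and field names are assumptions, not actual Tez UI code.

```python
def decide_app_details_view(app_entity_loaded, dags_under_app_found, ahs_accessible):
    """Sketch of the TEZ-3419 fallback logic; all names are illustrative."""
    if app_entity_loaded:
        # TEZ_APPLICATION entity fetched fine: render details as today.
        return {"page": "details", "source": "timeline"}
    if dags_under_app_found:
        # DAGs under the app are visible, so assume the user lacks
        # app-level permission rather than showing a hard error.
        view = {"page": "details", "show_dags_tab": True}
        if ahs_accessible:
            view["source"] = "ahs"
        else:
            view["message"] = ("Data is not available. Check if you are "
                               "authorized to access application data!")
        return view
    # Neither request succeeded: show the error page as before.
    return {"page": "error"}
```

With logic like this, a user holding only DAG-level ACLs would land on the details page with the DAGs tab visible instead of a hard error.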



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Success: TEZ-3419 PreCommit Build #2043

2016-10-18 Thread Apache Jenkins Server
Jira: https://issues.apache.org/jira/browse/TEZ-3419
Build: https://builds.apache.org/job/PreCommit-TEZ-Build/2043/

###
## LAST 60 LINES OF THE CONSOLE 
###
[...truncated 4824 lines...]
[INFO] Tez  SUCCESS [  0.037 s]
[INFO] 
[INFO] BUILD SUCCESS
[INFO] 
[INFO] Total time: 58:36 min
[INFO] Finished at: 2016-10-18T12:11:05+00:00
[INFO] Final Memory: 82M/1431M
[INFO] 




{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12833926/TEZ-3419.6.patch
  against master revision 48208dc.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 5 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 3.0.1) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: 
https://builds.apache.org/job/PreCommit-TEZ-Build/2043//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/2043//console

This message is automatically generated.


==
==
Adding comment to Jira.
==
==


Comment added.
e2aadc29188148ed096682451b909c7b6a5188dd logged out


==
==
Finished build.
==
==


Archiving artifacts
[description-setter] Description set: TEZ-3419
Recording test results
Email was triggered for: Success
Sending email for trigger: Success



###
## FAILED TESTS (if any) 
##
All tests passed

[jira] [Updated] (TEZ-3419) Tez UI: Applications page shows error, for users with only DAG level ACL permission.

2016-10-18 Thread Sreenath Somarajapuram (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sreenath Somarajapuram updated TEZ-3419:

Attachment: TEZ-3419.6.patch

Thanks [~hitesh]
bq. 2 screenshots show spurious data being shown in the UI.
Attaching a fresh patch with the correction. 

bq. 3rd screenshot is for the configs. Configs are not accessible due to 
permission issues but UI says no records found. I think this is a reasonable 
approach for now ( as compared to an error message indicating no data or 
permission issue ) but just wanted to make sure that this was the intention of 
the patch and not an accidental change.
That's true. The behavior is as expected. As of now, we are just bypassing a 
failure condition.

> Tez UI: Applications page shows error, for users with only DAG level ACL 
> permission.
> 
>
> Key: TEZ-3419
> URL: https://issues.apache.org/jira/browse/TEZ-3419
> Project: Apache Tez
>  Issue Type: Sub-task
>Affects Versions: 0.7.0
>Reporter: Sreenath Somarajapuram
>Assignee: Sreenath Somarajapuram
> Attachments: Screen Shot 2016-10-13 at 4.25.31 PM.png, Screen Shot 
> 2016-10-13 at 4.37.09 PM.png, Screen Shot 2016-10-17 at 4.11.29 PM.png, 
> Screen Shot 2016-10-17 at 4.11.59 PM.png, Screen Shot 2016-10-17 at 4.12.23 
> PM.png, TEZ-3419.1.patch, TEZ-3419.2.patch, TEZ-3419.3.patch, 
> TEZ-3419.4.patch, TEZ-3419.5.patch, TEZ-3419.6.patch, TEZ-3419.wip.1.patch, 
> Tez data missing.png, YARN & Tez data missing.png, YARN data missing.png
>
>
> Follow this logic and display a better message:
> On loading the app details page, send a request to 
> /ws/v1/timeline/TEZ_APPLICATION/tez_
> - If it succeeds, display the details page as we do now.
> - If it fails, send a request to 
> /ws/v1/timeline/TEZ_DAG_ID?primaryFilter=applicationId%3A
> -- If it succeeds, then we know that DAGs under the app are available, and 
> assume that the user doesn't have permission to access app-level data.
> --- If AHS is accessible, display application data from there in the details 
> page.
> --- Else, if AHS is not accessible, display a message in the app details tab, 
> something like "Data is not available. Check if you are authorized to access 
> application data!".
> --- Also display the DAGs tab, for the user to see DAGs under that app.
> -- If it fails, display the error message as we do now.





[jira] [Commented] (TEZ-3458) Auto grouping for cartesian product edge(unpartitioned case)

2016-10-18 Thread TezQA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15584716#comment-15584716
 ] 

TezQA commented on TEZ-3458:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12833895/TEZ-3458.1.patch
  against master revision 48208dc.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 10 new 
or modified test files.

{color:red}-1 javac{color}.  The patch appears to cause the build to 
fail.

Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/2042//console

This message is automatically generated.

> Auto grouping for cartesian product edge(unpartitioned case)
> 
>
> Key: TEZ-3458
> URL: https://issues.apache.org/jira/browse/TEZ-3458
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Zhiyuan Yang
>Assignee: Zhiyuan Yang
> Attachments: TEZ-3458.1.patch
>
>
> The original CartesianProductVertexManagerUnpartitioned sets the parallelism 
> to the product of all source vertices' parallelisms, which may explode to an 
> insane number. We should auto-reduce as in ShuffleVertexManager to avoid this.
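The grouping idea can be illustrated with a rough sketch: shrink every source dimension by a common factor so the product of the grouped dimensions stays near a cap, much like ShuffleVertexManager's auto-reduce. The names and the heuristic below are assumptions for illustration, not the actual vertex-manager code.

```python
import math


def grouped_parallelism(source_parallelisms, max_parallelism=1000):
    """Cap cartesian-product parallelism by grouping source tasks.

    Illustrative sketch only; the real CartesianProductVertexManager
    logic and configuration names may differ.
    """
    product = math.prod(source_parallelisms)
    if product <= max_parallelism:
        return source_parallelisms, product
    # Shrink every dimension by the same factor so the product of the
    # grouped dimensions stays near (at or below) the cap.
    factor = (product / max_parallelism) ** (1.0 / len(source_parallelisms))
    grouped = [max(1, math.floor(p / factor)) for p in source_parallelisms]
    return grouped, math.prod(grouped)
```

For two sources of 100 tasks each, the naive product is 10,000; with a cap of 1,000 this groups each side down to 31, for 961 product tasks.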





Failed: TEZ-3458 PreCommit Build #2042

2016-10-18 Thread Apache Jenkins Server
Jira: https://issues.apache.org/jira/browse/TEZ-3458
Build: https://builds.apache.org/job/PreCommit-TEZ-Build/2042/

###
## LAST 60 LINES OF THE CONSOLE 
###
[...truncated 99 lines...]
patching file 
tez-runtime-library/src/test/java/org/apache/tez/runtime/library/cartesianproduct/TestCartesianProductVertexManagerConfig.java
patching file 
tez-runtime-library/src/test/java/org/apache/tez/runtime/library/cartesianproduct/TestCartesianProductVertexManagerPartitioned.java
patching file 
tez-runtime-library/src/test/java/org/apache/tez/runtime/library/cartesianproduct/TestCartesianProductVertexManagerUnpartitioned.java
patching file 
tez-runtime-library/src/test/java/org/apache/tez/runtime/library/cartesianproduct/TestGrouper.java


==
==
Determining number of patched javac warnings.
==
==


/home/jenkins/tools/maven/latest/bin/mvn clean test -DskipTests -Ptest-patch > 
/home/jenkins/jenkins-slave/workspace/PreCommit-TEZ-Build/../patchprocess/patchJavacWarnings.txt
 2>&1




{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  http://issues.apache.org/jira/secure/attachment/12833895/TEZ-3458.1.patch
  against master revision 48208dc.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 10 new 
or modified test files.

{color:red}-1 javac{color}.  The patch appears to cause the build to 
fail.

Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/2042//console

This message is automatically generated.


==
==
Adding comment to Jira.
==
==


Comment added.
4fafd91b10829a1b3e96f213eff30793966f2287 logged out


==
==
Finished build.
==
==


Build step 'Execute shell' marked build as failure
Archiving artifacts
Compressed 788.43 KB of artifacts by 40.6% relative to #2041
[description-setter] Could not determine description.
Recording test results
ERROR: Step 'Publish JUnit test result report' failed: No test report files 
were found. Configuration error?
Email was triggered for: Failure - Any
Sending email for trigger: Failure - Any



###
## FAILED TESTS (if any) 
##
No tests ran.

[jira] [Updated] (TEZ-3458) Auto grouping for cartesian product edge(unpartitioned case)

2016-10-18 Thread Zhiyuan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhiyuan Yang updated TEZ-3458:
--
Attachment: TEZ-3458.1.patch

> Auto grouping for cartesian product edge(unpartitioned case)
> 
>
> Key: TEZ-3458
> URL: https://issues.apache.org/jira/browse/TEZ-3458
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Zhiyuan Yang
>Assignee: Zhiyuan Yang
> Attachments: TEZ-3458.1.patch
>
>
> The original CartesianProductVertexManagerUnpartitioned sets the parallelism 
> to the product of all source vertices' parallelisms, which may explode to an 
> insane number. We should auto-reduce as in ShuffleVertexManager to avoid this.





[jira] [Updated] (TEZ-3458) Auto grouping for cartesian product edge(unpartitioned case)

2016-10-18 Thread Zhiyuan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhiyuan Yang updated TEZ-3458:
--
Summary: Auto grouping for cartesian product edge(unpartitioned case)  
(was: Auto reduce for cartesian product edge(unpartitioned case))

> Auto grouping for cartesian product edge(unpartitioned case)
> 
>
> Key: TEZ-3458
> URL: https://issues.apache.org/jira/browse/TEZ-3458
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Zhiyuan Yang
>Assignee: Zhiyuan Yang
> Attachments: TEZ-3458.1.patch
>
>
> The original CartesianProductVertexManagerUnpartitioned sets the parallelism 
> to the product of all source vertices' parallelisms, which may explode to an 
> insane number. We should auto-reduce as in ShuffleVertexManager to avoid this.


