[jira] [Commented] (TEZ-2300) TezClient.stop() takes a lot of time or does not work sometimes
[ https://issues.apache.org/jira/browse/TEZ-2300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700278#comment-14700278 ]

Bikas Saha commented on TEZ-2300:

Then perhaps the patch could killDAG() and send a (new) message to the scheduler to release resources. Then proceed with normal stop (like we do today). From what I see, AM shutdown today does not kill the DAG.

TezClient.stop() takes a lot of time or does not work sometimes
Key: TEZ-2300
URL: https://issues.apache.org/jira/browse/TEZ-2300
Project: Apache Tez
Issue Type: Bug
Reporter: Rohini Palaniswamy
Assignee: Jonathan Eagles
Attachments: TEZ-2300.1.patch, TEZ-2300.2.patch, TEZ-2300.3.patch, TEZ-2300.4.patch, syslog_dag_1428329756093_325099_1_post

Noticed this with a couple of pig scripts which were not behaving well (AM close to OOM, etc) and even with some that were running fine. Pig calls TezClient.stop() in a shutdown hook. Ctrl+C to the pig script either exits immediately or hangs. In both cases it takes a long time for the yarn application to go to the KILLED state. Many times I just end up calling yarn application -kill separately after waiting for 5 mins or more for it to get killed.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2670) Remove TaskAttempt holder used within TezTaskCommunicator
[ https://issues.apache.org/jira/browse/TEZ-2670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Seth updated TEZ-2670:

Attachment: TEZ-2670.1.txt

Simple patch to replace TaskAttempt with TaskAttemptId, and reduce unnecessary object creation.

Remove TaskAttempt holder used within TezTaskCommunicator
Key: TEZ-2670
URL: https://issues.apache.org/jira/browse/TEZ-2670
Project: Apache Tez
Issue Type: Sub-task
Affects Versions: TEZ-2003
Reporter: Siddharth Seth
Assignee: Siddharth Seth
Attachments: TEZ-2670.1.txt

This will rely on using IDs or the equivalent construct exposed by Tez.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TEZ-2670) Remove TaskAttempt holder used within TezTaskCommunicator
[ https://issues.apache.org/jira/browse/TEZ-2670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Seth resolved TEZ-2670.

Resolution: Fixed
Fix Version/s: TEZ-2003
[jira] [Commented] (TEZ-2300) TezClient.stop() takes a lot of time or does not work sometimes
[ https://issues.apache.org/jira/browse/TEZ-2300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700300#comment-14700300 ]

Hitesh Shah commented on TEZ-2300:

bq. Currently there are no APIs to cancel a DAG

DAGClient::tryKillDAG()

bq. From what I see, AM shutdown today does not kill the DAG.

DAGAppMaster::shutdownTezAM() tries to kill the dag first.
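The shutdown ordering implied by the two APIs named above (kill the in-flight DAG first so the AM releases resources, then stop the session) can be sketched generically. This is an illustrative, self-contained sketch with hypothetical stand-in interfaces, not the real Tez DAGClient/TezClient classes:

```java
// Sketch of "kill the running DAG, then stop the session" — the ordering
// discussed in the comments above. DagClient and SessionClient are
// hypothetical stand-ins, not the actual Tez API.
import java.util.concurrent.atomic.AtomicBoolean;

interface DagClient {
    // Models DAGClient.tryKillDAG(): best-effort kill of the running DAG.
    void tryKillDag() throws Exception;
}

interface SessionClient {
    // Models TezClient.stop().
    void stop() throws Exception;
}

final class OrderedShutdown {
    /** Kill the in-flight DAG first so resources are released, then stop. */
    static void shutdown(DagClient dag, SessionClient session) {
        try {
            dag.tryKillDag();
        } catch (Exception e) {
            // Best effort: a kill failure should not block session shutdown.
            System.err.println("tryKillDag failed: " + e);
        }
        try {
            session.stop();
        } catch (Exception e) {
            System.err.println("stop failed: " + e);
        }
    }
}

public class Main {
    public static void main(String[] args) {
        AtomicBoolean dagKilled = new AtomicBoolean();
        AtomicBoolean stopped = new AtomicBoolean();
        OrderedShutdown.shutdown(
            () -> dagKilled.set(true),
            () -> stopped.set(true));
        System.out.println(dagKilled.get() && stopped.get()); // true
    }
}
```

The point of the try/catch around the kill is that a failed or unreachable DAG kill degrades to today's behavior (plain stop) rather than aborting shutdown.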
[jira] [Commented] (TEZ-2726) Handle invalid number of partitions for SCATTER-GATHER edge
[ https://issues.apache.org/jira/browse/TEZ-2726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700410#comment-14700410 ]

Rohini Palaniswamy commented on TEZ-2726:

There was some bug in Pig planning (yet to debug and create a jira) which was setting incorrect edge types.

Handle invalid number of partitions for SCATTER-GATHER edge
Key: TEZ-2726
URL: https://issues.apache.org/jira/browse/TEZ-2726
Project: Apache Tez
Issue Type: Improvement
Reporter: Saikat
Assignee: Saikat

Encountered an issue where the source vertex has M tasks and the sink vertex has N tasks (N > M) [e.g. M = 1, N = 3], and the edge is of type SCATTER-GATHER. This resulted in the sink vertex receiving DMEs with non-existent targetIds. The fetchers for the sink vertex tasks then try to retrieve the map outputs and get invalid headers due to an exception in the ShuffleHandler. Possible fixes: 1. raise a proper Tez Exception to indicate this invalid scenario, or 2. write appropriate empty partition bits for the missing partitions before sending out the DMEs to the sink vertex.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
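The two fixes proposed in the report above can be illustrated with a small generic sketch. This is not Tez's actual DataMovementEvent/ShuffleHandler code; it is an assumed model where a source wrote `numWritten` partitions for an edge with `numConsumers` downstream tasks, and we either fail fast (option 1) or mark the missing partitions as empty in a BitSet (option 2):

```java
import java.util.BitSet;

// Illustrative sketch only (not Tez's real event code): reconcile the number
// of partitions a source task actually wrote with the number of downstream
// consumer tasks on a scatter-gather edge.
public class Main {
    /**
     * Option 1 from the report: fail fast on an invalid plan.
     * Option 2: return a BitSet marking the missing partitions as empty, so
     * consumers skip fetching them instead of hitting the shuffle handler
     * with non-existent partition ids.
     */
    static BitSet emptyPartitionBits(int numWritten, int numConsumers, boolean failFast) {
        if (numWritten > numConsumers) {
            throw new IllegalStateException(
                "source wrote " + numWritten + " partitions but edge has only "
                + numConsumers + " consumers");
        }
        if (numWritten < numConsumers && failFast) {
            throw new IllegalStateException(
                "missing partitions: wrote " + numWritten + ", expected " + numConsumers);
        }
        BitSet empty = new BitSet(numConsumers);
        empty.set(numWritten, numConsumers); // partitions [numWritten, N) hold no data
        return empty;
    }

    public static void main(String[] args) {
        // The M = 1, N = 3 case from the report: partitions 1 and 2 are empty.
        BitSet bits = emptyPartitionBits(1, 3, false);
        System.out.println(bits); // {1, 2}
    }
}
```

With the empty-partition bits in hand, a fetcher consulting the bitset would never issue a request for a partition the source never produced, which is the failure mode described in the report.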
[jira] [Commented] (TEZ-2300) TezClient.stop() takes a lot of time or does not work sometimes
[ https://issues.apache.org/jira/browse/TEZ-2300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700319#comment-14700319 ]

Bikas Saha commented on TEZ-2300:

I mean directly in the shutdown handler, which would happen when the AM was killed by the RM. Not sure if Pig is using shutdownTezAM() or just calling killApplication on YARN.
[jira] [Updated] (TEZ-2728) Wrap IPC connection Exception as SessionNotRunning - RM crash
[ https://issues.apache.org/jira/browse/TEZ-2728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gopal V updated TEZ-2728: - Attachment: hive.log.gz Wrap IPC connection Exception as SessionNotRunning - RM crash - Key: TEZ-2728 URL: https://issues.apache.org/jira/browse/TEZ-2728 Project: Apache Tez Issue Type: Bug Affects Versions: 0.7.0, 0.5.4, 0.6.2, 0.8.0 Reporter: Gopal V Assignee: Hitesh Shah Attachments: hive.log.gz Crashing the RM when a query session is open and restarting it does not result in a recoverable state for a Hive session. {code} 2015-08-17T22:34:21,981 INFO [main]: ipc.Client (Client.java:handleConnectionFailure(885)) - Retrying connect to server: cn042-10.sandbox.hortonworks.com/172.19.128.42:10200. Already tried 48 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS) 2015-08-17T22:34:22,982 INFO [main]: ipc.Client (Client.java:handleConnectionFailure(885)) - Retrying connect to server: cn042-10.sandbox.hortonworks.com/172.19.128.42:10200. Already tried 49 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS) 2015-08-17T22:34:22,987 ERROR [main]: exec.Task (TezTask.java:execute(195)) - Failed to execute tez graph. 
java.net.ConnectException: Call From cn041.sandbox.hortonworks.com/172.19.128.41 to cn042.sandbox.hortonworks.com:10200 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[?:1.8.0_51]
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) ~[?:1.8.0_51]
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) ~[?:1.8.0_51]
at java.lang.reflect.Constructor.newInstance(Constructor.java:422) ~[?:1.8.0_51]
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792) ~[hadoop-common-2.8.0-20150722.003145-873.jar:?]
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:732) ~[hadoop-common-2.8.0-20150722.003145-873.jar:?]
at org.apache.hadoop.ipc.Client.call(Client.java:1444) ~[hadoop-common-2.8.0-20150722.003145-873.jar:?]
at org.apache.hadoop.ipc.Client.call(Client.java:1371) ~[hadoop-common-2.8.0-20150722.003145-873.jar:?]
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229) ~[hadoop-common-2.8.0-20150722.003145-873.jar:?]
at com.sun.proxy.$Proxy41.getApplicationReport(Unknown Source) ~[?:?]
at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationHistoryProtocolPBClientImpl.getApplicationReport(ApplicationHistoryProtocolPBClientImpl.java:108) ~[hadoop-yarn-common-2.8.0-20150721.221214-843.jar:?]
at org.apache.hadoop.yarn.client.api.impl.AHSClientImpl.getApplicationReport(AHSClientImpl.java:101) ~[hadoop-yarn-client-2.8.0-20150721.221233-841.jar:?]
at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationReport(YarnClientImpl.java:442) ~[hadoop-yarn-client-2.8.0-20150721.221233-841.jar:?]
at org.apache.tez.client.TezYarnClient.getApplicationReport(TezYarnClient.java:89) ~[tez-api-0.8.0-SNAPSHOT.jar:0.8.0-SNAPSHOT]
at org.apache.tez.client.TezClientUtils.getAMProxy(TezClientUtils.java:835) ~[tez-api-0.8.0-SNAPSHOT.jar:0.8.0-SNAPSHOT]
at org.apache.tez.client.TezClient.getAMProxy(TezClient.java:713) ~[tez-api-0.8.0-SNAPSHOT.jar:0.8.0-SNAPSHOT]
at org.apache.tez.client.TezClient.waitForProxy(TezClient.java:723) ~[tez-api-0.8.0-SNAPSHOT.jar:0.8.0-SNAPSHOT]
at org.apache.tez.client.TezClient.submitDAGSession(TezClient.java:453) ~[tez-api-0.8.0-SNAPSHOT.jar:0.8.0-SNAPSHOT]
at org.apache.tez.client.TezClient.submitDAG(TezClient.java:391) ~[tez-api-0.8.0-SNAPSHOT.jar:0.8.0-SNAPSHOT]
at org.apache.hadoop.hive.ql.exec.tez.TezTask.submit(TezTask.java:409) ~[hive-exec-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
{code}

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2300) TezClient.stop() takes a lot of time or does not work sometimes
[ https://issues.apache.org/jira/browse/TEZ-2300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700329#comment-14700329 ]

Hitesh Shah commented on TEZ-2300:

[~bikassaha] As the jira title states, I believe they are invoking TezClient::stop()
[jira] [Commented] (TEZ-2726) Handle invalid number of partitions for SCATTER-GATHER edge
[ https://issues.apache.org/jira/browse/TEZ-2726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700555#comment-14700555 ]

Bikas Saha commented on TEZ-2726:

Still not sure what the exact sequence of events was for the error. A planning bug caused empty partitions, and somehow Tez handled the empty partitions erroneously? It would really help if we had logs or some sequence of events that produced the error. Tez does have some handling for empty partitions, but that's an optimization to not fetch them (since they are empty).
Success: TEZ-2727 PreCommit Build #1000
Jira: https://issues.apache.org/jira/browse/TEZ-2727
Build: https://builds.apache.org/job/PreCommit-TEZ-Build/1000/

### LAST 60 LINES OF THE CONSOLE ###

[...truncated 3620 lines...]
[INFO] Final Memory: 93M/1163M
[INFO]

{color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12750879/2003_20150817.1.txt against master revision 6cb8206.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 51 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 3.0.1) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/1000//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/1000//console
This message is automatically generated.

Adding comment to Jira.
Comment added.
9789e853f14a49bfc68c5ee4acc461d5551d0d68 logged out

Finished build.

Archiving artifacts
Sending artifact delta relative to PreCommit-TEZ-Build #988
Archived 53 artifacts
Archive block size is 32768
Received 0 blocks and 3205963 bytes
Compression is 0.0%
Took 1.3 sec
Description set: TEZ-2727
Recording test results
Email was triggered for: Success
Sending email for trigger: Success

### FAILED TESTS (if any) ###
All tests passed
[jira] [Commented] (TEZ-2728) Wrap IPC connection Exception as SessionNotRunning - RM crash
[ https://issues.apache.org/jira/browse/TEZ-2728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700383#comment-14700383 ]

Hitesh Shah commented on TEZ-2728:

Could you attach the full stack trace/log? This looks like the ipc call eventually timed out. I am not sure whether we can safely assume that the session is not running if the RM is down but could later come back up and recover the yarn application. Instead, should Hive consider treating any exception from submitDAG as an excuse to try killing the session and re-trying with a new one?
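The suggestion in the comment above — treat any submitDAG failure as a cue to discard the session and retry with a fresh one — amounts to a retry loop over session creation. A self-contained sketch under that assumption; the `Session` interface is a hypothetical stand-in, not the real Hive/Tez API:

```java
import java.util.function.Supplier;

// Generic sketch of "kill the session and retry with a new one" on submit
// failure, as suggested in the comment above. Session is a hypothetical
// stand-in for a Tez session handle.
public class Main {
    interface Session {
        void submitDag() throws Exception;
        void close();
    }

    static void submitWithFreshSessionRetry(Supplier<Session> factory, int maxAttempts)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Session s = factory.get();       // fresh session each attempt
            try {
                s.submitDag();
                return;                      // success
            } catch (Exception e) {
                last = e;                    // any failure: assume the session
                s.close();                   // is unusable, tear it down, retry
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Fails once (e.g. RM connection refused), succeeds on a new session.
        submitWithFreshSessionRetry(() -> new Session() {
            public void submitDag() throws Exception {
                if (++calls[0] == 1) throw new java.net.ConnectException("refused");
            }
            public void close() {}
        }, 3);
        System.out.println("submitted after " + calls[0] + " attempts");
    }
}
```

The design choice here mirrors the comment: rather than classifying which exceptions mean "session dead" (hard to do when the RM may come back), any failure pessimistically discards the session, which is cheap relative to a wedged client.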
[jira] [Commented] (TEZ-2300) TezClient.stop() takes a lot of time or does not work sometimes
[ https://issues.apache.org/jira/browse/TEZ-2300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700334#comment-14700334 ]

Rohini Palaniswamy commented on TEZ-2300:

bq. DAGClient::tryKillDAG()

Sorry, missed the DAGClient API as I was only looking at the TezClient API.

bq. Not sure if Pig is using shutdownTezAM() or just calling killApplication on YARN.

We do not call killApplication on YARN. We call TezClient.stop(), which calls proxy.shutdownSession. TezClient.stop() tries to kill via YARN, but only if it was not able to connect and send the shutdown request to the Tez AM. I don't think I have seen cases which have gone into that condition. The problem is that in bad cases, like a big event queue backlog, the shutdown happens after 10-15 mins. It should kill via YARN if shutdown does not happen within a reasonable amount of time, in addition to when it is not able to connect.

{code}
if (!sessionShutdownSuccessful) {
  LOG.info("Could not connect to AM, killing session via YARN"
      + ", sessionName=" + clientName
      + ", applicationId=" + sessionAppId);
  try {
    frameworkClient.killApplication(sessionAppId);
  } catch (ApplicationNotFoundException e) {
    LOG.info("Failed to kill nonexistent application " + sessionAppId, e);
  } catch (YarnException e) {
    throw new TezException(e);
  }
}
{code}
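The fix Rohini proposes — kill via YARN not only when the AM is unreachable, but also when a clean shutdown does not complete within a bounded time — is the classic graceful-stop-with-deadline pattern. A self-contained plain-Java sketch, not the actual TezClient code; the timeout value is illustrative:

```java
import java.util.concurrent.*;

// Sketch of "attempt graceful shutdown, hard-kill on timeout" — the behavior
// proposed above for TezClient.stop(). The Runnables stand in for the real
// shutdownSession RPC and frameworkClient.killApplication() calls.
public class Main {
    static boolean stopWithDeadline(Runnable gracefulStop, Runnable hardKill,
                                    long timeout, TimeUnit unit) {
        ExecutorService ex = Executors.newSingleThreadExecutor();
        try {
            Future<?> f = ex.submit(gracefulStop);
            f.get(timeout, unit);    // wait, but only up to the deadline
            return true;             // clean shutdown finished in time
        } catch (TimeoutException | InterruptedException | ExecutionException e) {
            hardKill.run();          // fall back, e.g. yarn application -kill
            return false;
        } finally {
            ex.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Graceful path never finishes (simulating a backed-up AM event
        // queue), so the deadline fires and the hard kill runs.
        boolean clean = stopWithDeadline(
            () -> { try { Thread.sleep(60_000); } catch (InterruptedException ignored) {} },
            () -> System.out.println("killed via YARN"),
            200, TimeUnit.MILLISECONDS);
        System.out.println("clean=" + clean);
    }
}
```

This keeps today's fast path (clean shutdown returns true immediately) while bounding the worst case at the deadline instead of the 10-15 minute waits described above.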
[jira] [Commented] (TEZ-2727) Fix findbugs warnings
[ https://issues.apache.org/jira/browse/TEZ-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700355#comment-14700355 ]

TezQA commented on TEZ-2727:

{color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12750879/2003_20150817.1.txt against master revision 6cb8206.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 51 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 3.0.1) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/1000//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/1000//console
This message is automatically generated.

Fix findbugs warnings
Key: TEZ-2727
URL: https://issues.apache.org/jira/browse/TEZ-2727
Project: Apache Tez
Issue Type: Sub-task
Affects Versions: TEZ-2003
Reporter: Siddharth Seth
Assignee: Siddharth Seth
Fix For: TEZ-2003
Attachments: 2003_20150817.1.txt, TEZ-2727.1.txt

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2726) Handle invalid number of partitions for SCATTER-GATHER edge
[ https://issues.apache.org/jira/browse/TEZ-2726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700369#comment-14700369 ]

Rajesh Balamohan commented on TEZ-2726:

[~saikatr] - Is there any repro for this? When you say invalid headers, is it something like the following? Can you please provide more info?

{noformat}
org.apache.tez.runtime.library.common.shuffle.impl.Fetcher: Invalid map id
java.lang.IllegalArgumentException: Invalid header received: W^s??.attempt_1399351577718_4169_1_ partition: 95
{noformat}

If so, are you using tez.runtime.intermediate-output.compress.codec = org.apache.hadoop.io.compress.DefaultCodec ?
[jira] [Commented] (TEZ-2294) Add tez-site-template.xml with description of config properties
[ https://issues.apache.org/jira/browse/TEZ-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700645#comment-14700645 ]

TezQA commented on TEZ-2294:

{color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12750911/TEZ-2294.7.patch against master revision 6cb8206.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 3.0.1) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/1001//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/1001//console
This message is automatically generated.

Add tez-site-template.xml with description of config properties
Key: TEZ-2294
URL: https://issues.apache.org/jira/browse/TEZ-2294
Project: Apache Tez
Issue Type: Improvement
Reporter: Rajesh Balamohan
Assignee: Hitesh Shah
Attachments: TEZ-2294.4.patch, TEZ-2294.5.patch, TEZ-2294.6.patch, TEZ-2294.7.patch, TEZ-2294.wip.2.patch, TEZ-2294.wip.3.patch, TEZ-2294.wip.patch, TezConfiguration.html, TezRuntimeConfiguration.html, tez-default-template.xml, tez-runtime-default-template.xml

Document all tez configs with descriptions and default values. Also, document MR configs that can be easily translated to Tez configs via Tez helpers.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-2728) Wrap IPC connection Exception as SessionNotRunning - RM crash
Gopal V created TEZ-2728: Summary: Wrap IPC connection Exception as SessionNotRunning - RM crash Key: TEZ-2728 URL: https://issues.apache.org/jira/browse/TEZ-2728 Project: Apache Tez Issue Type: Bug Affects Versions: 0.6.2, 0.5.4, 0.7.0, 0.8.0 Reporter: Gopal V Assignee: Hitesh Shah Crashing the RM when a query session is open and restarting it does not result in a recoverable state for a Hive session. {code} 2015-08-17T22:34:21,981 INFO [main]: ipc.Client (Client.java:handleConnectionFailure(885)) - Retrying connect to server: cn042-10.l42scl.hortonworks.com/172.19.128.42:10200. Already tried 48 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS) 2015-08-17T22:34:22,982 INFO [main]: ipc.Client (Client.java:handleConnectionFailure(885)) - Retrying connect to server: cn042-10.l42scl.hortonworks.com/172.19.128.42:10200. Already tried 49 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS) 2015-08-17T22:34:22,987 ERROR [main]: exec.Task (TezTask.java:execute(195)) - Failed to execute tez graph. java.net.ConnectException: Call From cn041-10.l42scl.hortonworks.com/172.19.128.41 to cn042-10.l42scl.hortonworks.com:10200 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[?:1.8.0_51] at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) ~[?:1.8.0_51] at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) ~[?:1.8.0_51] at java.lang.reflect.Constructor.newInstance(Constructor.java:422) ~[?:1.8.0_51] at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792) ~[hadoop-common-2.8.0-20150722.003145-873.jar:?] 
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:732) ~[hadoop-common-2.8.0-20150722.003145-873.jar:?]
at org.apache.hadoop.ipc.Client.call(Client.java:1444) ~[hadoop-common-2.8.0-20150722.003145-873.jar:?]
at org.apache.hadoop.ipc.Client.call(Client.java:1371) ~[hadoop-common-2.8.0-20150722.003145-873.jar:?]
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229) ~[hadoop-common-2.8.0-20150722.003145-873.jar:?]
at com.sun.proxy.$Proxy41.getApplicationReport(Unknown Source) ~[?:?]
at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationHistoryProtocolPBClientImpl.getApplicationReport(ApplicationHistoryProtocolPBClientImpl.java:108) ~[hadoop-yarn-common-2.8.0-20150721.221214-843.jar:?]
at org.apache.hadoop.yarn.client.api.impl.AHSClientImpl.getApplicationReport(AHSClientImpl.java:101) ~[hadoop-yarn-client-2.8.0-20150721.221233-841.jar:?]
at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationReport(YarnClientImpl.java:442) ~[hadoop-yarn-client-2.8.0-20150721.221233-841.jar:?]
at org.apache.tez.client.TezYarnClient.getApplicationReport(TezYarnClient.java:89) ~[tez-api-0.8.0-SNAPSHOT.jar:0.8.0-SNAPSHOT]
at org.apache.tez.client.TezClientUtils.getAMProxy(TezClientUtils.java:835) ~[tez-api-0.8.0-SNAPSHOT.jar:0.8.0-SNAPSHOT]
at org.apache.tez.client.TezClient.getAMProxy(TezClient.java:713) ~[tez-api-0.8.0-SNAPSHOT.jar:0.8.0-SNAPSHOT]
at org.apache.tez.client.TezClient.waitForProxy(TezClient.java:723) ~[tez-api-0.8.0-SNAPSHOT.jar:0.8.0-SNAPSHOT]
at org.apache.tez.client.TezClient.submitDAGSession(TezClient.java:453) ~[tez-api-0.8.0-SNAPSHOT.jar:0.8.0-SNAPSHOT]
at org.apache.tez.client.TezClient.submitDAG(TezClient.java:391) ~[tez-api-0.8.0-SNAPSHOT.jar:0.8.0-SNAPSHOT]
at org.apache.hadoop.hive.ql.exec.tez.TezTask.submit(TezTask.java:409) ~[hive-exec-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
{code}

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (TEZ-2164) Shade the guava version used by Tez
[ https://issues.apache.org/jira/browse/TEZ-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah reassigned TEZ-2164: Assignee: Hitesh Shah Shade the guava version used by Tez --- Key: TEZ-2164 URL: https://issues.apache.org/jira/browse/TEZ-2164 Project: Apache Tez Issue Type: Improvement Reporter: Siddharth Seth Assignee: Hitesh Shah Priority: Critical Attachments: TEZ-2164.3.patch, TEZ-2164.wip.2.patch, allow-guava-16.0.1.patch Should allow us to upgrade to a newer version without shipping a guava dependency. Would be good to do this in 0.7 so that we stop shipping guava as early as possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2164) Shade the guava version used by Tez
[ https://issues.apache.org/jira/browse/TEZ-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700588#comment-14700588 ] TezQA commented on TEZ-2164: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12750920/TEZ-2164.3.patch against master revision 6cb8206. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 71 new or modified test files. {color:red}-1 javac{color:red}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/1002//console This message is automatically generated. Shade the guava version used by Tez --- Key: TEZ-2164 URL: https://issues.apache.org/jira/browse/TEZ-2164 Project: Apache Tez Issue Type: Improvement Reporter: Siddharth Seth Assignee: Hitesh Shah Priority: Critical Attachments: TEZ-2164.3.patch, TEZ-2164.wip.2.patch, allow-guava-16.0.1.patch Should allow us to upgrade to a newer version without shipping a guava dependency. Would be good to do this in 0.7 so that we stop shipping guava as early as possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
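The shading work in TEZ-2164 amounts to relocating guava's classes inside the Tez jars so that Tez's copy cannot clash with whatever guava version a downstream application ships. As an illustration only — the actual coordinates and relocation pattern used by TEZ-2164.3.patch are not shown in this thread — a maven-shade-plugin relocation typically looks like:

```xml
<!-- Hypothetical sketch: relocate guava inside the Tez artifact.
     The shaded package name below is illustrative, not from the patch. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <artifactSet>
          <includes>
            <include>com.google.guava:guava</include>
          </includes>
        </artifactSet>
        <relocations>
          <relocation>
            <pattern>com.google.common</pattern>
            <shadedPattern>org.apache.tez.shaded.com.google.common</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```

After relocation, Tez bytecode references the shaded package, so users can upgrade or downgrade their own guava freely — the goal stated in the issue description.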
[jira] [Updated] (TEZ-2728) Wrap IPC connection Exception as SessionNotRunning - RM crash
[ https://issues.apache.org/jira/browse/TEZ-2728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gopal V updated TEZ-2728: - Description: Crashing the RM when a query session is open and restarting it does not result in a recoverable state for a Hive session. {code} 2015-08-17T22:34:21,981 INFO [main]: ipc.Client (Client.java:handleConnectionFailure(885)) - Retrying connect to server: cn042-10.sandbox.hortonworks.com/172.19.128.42:10200. Already tried 48 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS) 2015-08-17T22:34:22,982 INFO [main]: ipc.Client (Client.java:handleConnectionFailure(885)) - Retrying connect to server: cn042-10.sandbox.hortonworks.com/172.19.128.42:10200. Already tried 49 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS) 2015-08-17T22:34:22,987 ERROR [main]: exec.Task (TezTask.java:execute(195)) - Failed to execute tez graph. java.net.ConnectException: Call From cn041.sandbox.hortonworks.com/172.19.128.41 to cn042.sandbox.hortonworks.com:10200 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[?:1.8.0_51] at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) ~[?:1.8.0_51] at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) ~[?:1.8.0_51] at java.lang.reflect.Constructor.newInstance(Constructor.java:422) ~[?:1.8.0_51] at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792) ~[hadoop-common-2.8.0-20150722.003145-873.jar:?] at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:732) ~[hadoop-common-2.8.0-20150722.003145-873.jar:?] at org.apache.hadoop.ipc.Client.call(Client.java:1444) ~[hadoop-common-2.8.0-20150722.003145-873.jar:?] 
at org.apache.hadoop.ipc.Client.call(Client.java:1371) ~[hadoop-common-2.8.0-20150722.003145-873.jar:?] at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229) ~[hadoop-common-2.8.0-20150722.003145-873.jar:?] at com.sun.proxy.$Proxy41.getApplicationReport(Unknown Source) ~[?:?] at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationHistoryProtocolPBClientImpl.getApplicationReport(ApplicationHistoryProtocolPBClientImpl.java:108) ~[hadoop-yarn-common-2.8.0-20150721.221214-843.jar:?] at org.apache.hadoop.yarn.client.api.impl.AHSClientImpl.getApplicationReport(AHSClientImpl.java:101) ~[hadoop-yarn-client-2.8.0-20150721.221233-841.jar:?] at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationReport(YarnClientImpl.java:442) ~[hadoop-yarn-client-2.8.0-20150721.221233-841. jar:?] at org.apache.tez.client.TezYarnClient.getApplicationReport(TezYarnClient.java:89) ~[tez-api-0.8.0-SNAPSHOT.jar:0.8.0-SNAPSHOT] at org.apache.tez.client.TezClientUtils.getAMProxy(TezClientUtils.java:835) ~[tez-api-0.8.0-SNAPSHOT.jar:0.8.0-SNAPSHOT] at org.apache.tez.client.TezClient.getAMProxy(TezClient.java:713) ~[tez-api-0.8.0-SNAPSHOT.jar:0.8.0-SNAPSHOT] at org.apache.tez.client.TezClient.waitForProxy(TezClient.java:723) ~[tez-api-0.8.0-SNAPSHOT.jar:0.8.0-SNAPSHOT] at org.apache.tez.client.TezClient.submitDAGSession(TezClient.java:453) ~[tez-api-0.8.0-SNAPSHOT.jar:0.8.0-SNAPSHOT] at org.apache.tez.client.TezClient.submitDAG(TezClient.java:391) ~[tez-api-0.8.0-SNAPSHOT.jar:0.8.0-SNAPSHOT] at org.apache.hadoop.hive.ql.exec.tez.TezTask.submit(TezTask.java:409) ~[hive-exec-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT] {code} was: Crashing the RM when a query session is open and restarting it does not result in a recoverable state for a Hive session. {code} 2015-08-17T22:34:21,981 INFO [main]: ipc.Client (Client.java:handleConnectionFailure(885)) - Retrying connect to server: cn042-10.l42scl.hortonworks.com/172.19.128.42:10200. 
Already tried 48 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS) 2015-08-17T22:34:22,982 INFO [main]: ipc.Client (Client.java:handleConnectionFailure(885)) - Retrying connect to server: cn042-10.l42scl.hortonworks.com/172.19.128.42:10200. Already tried 49 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS) 2015-08-17T22:34:22,987 ERROR [main]: exec.Task (TezTask.java:execute(195)) - Failed to execute tez graph. java.net.ConnectException: Call From cn041-10.l42scl.hortonworks.com/172.19.128.41 to cn042-10.l42scl.hortonworks.com:10200 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:
Failed: TEZ-2164 PreCommit Build #1002
Jira: https://issues.apache.org/jira/browse/TEZ-2164 Build: https://builds.apache.org/job/PreCommit-TEZ-Build/1002/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 419 lines...] == == Determining number of patched javac warnings. == == /home/jenkins/tools/maven/latest/bin/mvn clean test -DskipTests -Ptest-patch /home/jenkins/jenkins-slave/workspace/PreCommit-TEZ-Build/../patchprocess/patchJavacWarnings.txt 21 {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12750920/TEZ-2164.3.patch against master revision 6cb8206. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 71 new or modified test files. {color:red}-1 javac{color:red}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/1002//console This message is automatically generated. == == Adding comment to Jira. == == Comment added. a321b6a92ad051370e1d5133a95d47e33309a98f logged out == == Finished build. == == Build step 'Execute shell' marked build as failure Archiving artifacts Sending artifact delta relative to PreCommit-TEZ-Build #1000 Archived 3 artifacts Archive block size is 32768 Received 0 blocks and 810886 bytes Compression is 0.0% Took 6.7 sec [description-setter] Could not determine description. Recording test results Email was triggered for: Failure Sending email for trigger: Failure ### ## FAILED TESTS (if any) ## No tests ran.
Failed: TEZ-2294 PreCommit Build #1001
Jira: https://issues.apache.org/jira/browse/TEZ-2294 Build: https://builds.apache.org/job/PreCommit-TEZ-Build/1001/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 3423 lines...] {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12750911/TEZ-2294.7.patch against master revision 6cb8206. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 3.0.1) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/1001//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/1001//console This message is automatically generated. == == Adding comment to Jira. == == Comment added. 04570d164c2b71de8569f0b8d96cc64cfd12c6d1 logged out == == Finished build. == == Build step 'Execute shell' marked build as failure Archiving artifacts Sending artifact delta relative to PreCommit-TEZ-Build #1000 Archived 53 artifacts Archive block size is 32768 Received 16 blocks and 2612241 bytes Compression is 16.7% Took 0.87 sec [description-setter] Could not determine description. Recording test results Email was triggered for: Failure Sending email for trigger: Failure ### ## FAILED TESTS (if any) ## All tests passed
[jira] [Comment Edited] (TEZ-2164) Shade the guava version used by Tez
[ https://issues.apache.org/jira/browse/TEZ-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700309#comment-14700309 ] Hitesh Shah edited comment on TEZ-2164 at 8/17/15 10:06 PM: [~rajesh.balamohan] [~sseth] [~cchepelov] Mind trying this patch out? Check BUILDING.txt for more details. was (Author: hitesh): [~rajesh.balamohan] [~sseth] [~cchepelov] Mind trying this patch out? Shade the guava version used by Tez --- Key: TEZ-2164 URL: https://issues.apache.org/jira/browse/TEZ-2164 Project: Apache Tez Issue Type: Improvement Reporter: Siddharth Seth Priority: Critical Attachments: TEZ-2164.3.patch, TEZ-2164.wip.2.patch, allow-guava-16.0.1.patch Should allow us to upgrade to a newer version without shipping a guava dependency. Would be good to do this in 0.7 so that we stop shipping guava as early as possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2164) Shade the guava version used by Tez
[ https://issues.apache.org/jira/browse/TEZ-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated TEZ-2164: - Attachment: TEZ-2164.3.patch [~rajesh.balamohan] [~sseth] [~cchepelov] Mind trying this patch out? Shade the guava version used by Tez --- Key: TEZ-2164 URL: https://issues.apache.org/jira/browse/TEZ-2164 Project: Apache Tez Issue Type: Improvement Reporter: Siddharth Seth Priority: Critical Attachments: TEZ-2164.3.patch, TEZ-2164.wip.2.patch, allow-guava-16.0.1.patch Should allow us to upgrade to a newer version without shipping a guava dependency. Would be good to do this in 0.7 so that we stop shipping guava as early as possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2724) Tez Client keeps on showing old status when application is finished but RM is shutdown
[ https://issues.apache.org/jira/browse/TEZ-2724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14699873#comment-14699873 ] Prakash Ramachandran commented on TEZ-2724: --- * if realClient.getApplicationReportInternal returns null (say, a temporary network issue) and we switch to the ATS client, should we switch back to getting status via the AM once the app report is available and the app has not completed? * minor - the switchToTimelineClient debug log can be changed. Tez Client keeps on showing old status when application is finished but RM is shutdown -- Key: TEZ-2724 URL: https://issues.apache.org/jira/browse/TEZ-2724 Project: Apache Tez Issue Type: Bug Affects Versions: 0.5.4 Reporter: Jeff Zhang Assignee: Jeff Zhang Attachments: TEZ-2724-1.patch, amrecovery_mutlipleamrestart.txt From the logs, it seems the ipc retry interval is set to 20 seconds and ipc max retries to 45. This means that the client will retry the RPC connection for a total of 900 (20*45) seconds. And in this period, the application may already have completed and RM restarting may be triggered, as said in the jira description. And I think RM recovery is not enabled, so even if the new RM is restarted, the original application info is lost, which means the client can never get the correct application report, which makes it show the old status forever. {code} 15/05/07 19:13:43 INFO ipc.Client: Retrying connect to server: maint22-tez12/100.79.80.19:52822. Already tried 26 time(s); maxRetries=45 Deleted /user/hadoopqa/Input1 RUNNING: call D:\hdp\hadoop-2.6.0.2.2.6.0-2782\bin\hdfs.cmd dfs -ls /user/hadoopqa/Input2 RUNNING: call D:\hdp\hadoop-2.6.0.2.2.6.0-2782\bin\hdfs.cmd dfs -rm -r -skipTrash /user/hadoopqa/Input2 15/05/07 19:14:03 INFO ipc.Client: Retrying connect to server: maint22-tez12/100.79.80.19:52822. 
Already tried 27 time(s); maxRetries=45 {code} Configuration to reproduce this issue * disable generic application history (yarn.timeline-service.generic-application-history.enabled) * disable rm recovery (yarn.resourcemanager.recovery.enabled) * increase the ipc retry interval and max retries (ipc.client.connect.retry.interval, ipc.client.connect.max.retries) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
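The reproduction steps listed above can be expressed as plain Hadoop/YARN configuration. The property names come straight from the comment; the retry values (20-second interval, 45 retries) are inferred from the logs quoted in the issue, so treat the exact values as an assumption:

```xml
<!-- Sketch of a repro configuration for TEZ-2724 (values assumed from the logs) -->
<property>
  <name>yarn.timeline-service.generic-application-history.enabled</name>
  <value>false</value>
</property>
<property>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>false</value>
</property>
<property>
  <name>ipc.client.connect.retry.interval</name>
  <value>20000</value> <!-- milliseconds: 20 s between attempts, per the logs -->
</property>
<property>
  <name>ipc.client.connect.max.retries</name>
  <value>45</value> <!-- 45 attempts x 20 s = 900 s of retrying -->
</property>
```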
[jira] [Commented] (TEZ-2726) Handle invalid number of partitions for SCATTER-GATHER edge
[ https://issues.apache.org/jira/browse/TEZ-2726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14699945#comment-14699945 ] Rohini Palaniswamy commented on TEZ-2726: - We should raise a proper exception in Tez and not write empty partition bits, which would mask an issue that is most likely due to some DAG misconfiguration. Handle invalid number of partitions for SCATTER-GATHER edge --- Key: TEZ-2726 URL: https://issues.apache.org/jira/browse/TEZ-2726 Project: Apache Tez Issue Type: Improvement Reporter: Saikat Assignee: Saikat Encountered an issue where the source vertex has M tasks and the sink vertex has N tasks (N > M) [e.g. M = 1, N = 3], and the edge is of type SCATTER-GATHER. This resulted in the sink vertex receiving DMEs with non-existent targetIds. The fetchers for the sink vertex tasks then try to retrieve the map outputs and receive invalid headers due to an exception in the ShuffleHandler. Possible fixes: 1. raise a proper Tez Exception to indicate this invalid scenario. 2. or write appropriate empty partition bits for the missing partitions before sending out the DMEs to the sink vertex. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2725) Tez UI: Unit tests
[ https://issues.apache.org/jira/browse/TEZ-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14699876#comment-14699876 ] Hitesh Shah commented on TEZ-2725: -- Is this single jira meant to create unit tests for the full existing UI code base? Tez UI: Unit tests -- Key: TEZ-2725 URL: https://issues.apache.org/jira/browse/TEZ-2725 Project: Apache Tez Issue Type: Bug Reporter: Sreenath Somarajapuram Assignee: Sreenath Somarajapuram -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2726) Handle invalid number of partitions for SCATTER-GATHER edge
[ https://issues.apache.org/jira/browse/TEZ-2726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14699942#comment-14699942 ] Saikat commented on TEZ-2726: - Adding [~jlowe] [~rohini] [~jeagles] for watch and comments. Handle invalid number of partitions for SCATTER-GATHER edge --- Key: TEZ-2726 URL: https://issues.apache.org/jira/browse/TEZ-2726 Project: Apache Tez Issue Type: Improvement Reporter: Saikat Assignee: Saikat Encountered an issue where the source vertex has M tasks and the sink vertex has N tasks (N > M) [e.g. M = 1, N = 3], and the edge is of type SCATTER-GATHER. This resulted in the sink vertex receiving DMEs with non-existent targetIds. The fetchers for the sink vertex tasks then try to retrieve the map outputs and receive invalid headers due to an exception in the ShuffleHandler. Possible fixes: 1. raise a proper Tez Exception to indicate this invalid scenario. 2. or write appropriate empty partition bits for the missing partitions before sending out the DMEs to the sink vertex. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-2726) Handle invalid number of partitions for SCATTER-GATHER edge
Saikat created TEZ-2726: --- Summary: Handle invalid number of partitions for SCATTER-GATHER edge Key: TEZ-2726 URL: https://issues.apache.org/jira/browse/TEZ-2726 Project: Apache Tez Issue Type: Improvement Reporter: Saikat Encountered an issue where the source vertex has M tasks and the sink vertex has N tasks (N > M) [e.g. M = 1, N = 3], and the edge is of type SCATTER-GATHER. This resulted in the sink vertex receiving DMEs with non-existent targetIds. The fetchers for the sink vertex tasks then try to retrieve the map outputs and receive invalid headers due to an exception in the ShuffleHandler. Possible fixes: 1. raise a proper Tez Exception to indicate this invalid scenario. 2. or write appropriate empty partition bits for the missing partitions before sending out the DMEs to the sink vertex. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TEZ-2726) Handle invalid number of partitions for SCATTER-GATHER edge
[ https://issues.apache.org/jira/browse/TEZ-2726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14699942#comment-14699942 ] Saikat edited comment on TEZ-2726 at 8/17/15 6:00 PM: -- Adding [~jlowe] [~rohini] [~jeagles] [~rajesh.balamohan] for watch and comments. was (Author: saikatr): Adding [~jlowe] [~rohini] [~jeagles] for watch and comments. Handle invalid number of partitions for SCATTER-GATHER edge --- Key: TEZ-2726 URL: https://issues.apache.org/jira/browse/TEZ-2726 Project: Apache Tez Issue Type: Improvement Reporter: Saikat Assignee: Saikat Encountered an issue where the source vertex has M tasks and the sink vertex has N tasks (N > M) [e.g. M = 1, N = 3], and the edge is of type SCATTER-GATHER. This resulted in the sink vertex receiving DMEs with non-existent targetIds. The fetchers for the sink vertex tasks then try to retrieve the map outputs and receive invalid headers due to an exception in the ShuffleHandler. Possible fixes: 1. raise a proper Tez Exception to indicate this invalid scenario. 2. or write appropriate empty partition bits for the missing partitions before sending out the DMEs to the sink vertex. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2670) Remove TaskAttempt holder used within TezTaskCommunicator
[ https://issues.apache.org/jira/browse/TEZ-2670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14699966#comment-14699966 ] Siddharth Seth commented on TEZ-2670: - To be replaced with changes post TEZ-2697. For now, moving this back to TaskAttemptId to remove unnecessary object creation. Remove TaskAttempt holder used within TezTaskCommunicator - Key: TEZ-2670 URL: https://issues.apache.org/jira/browse/TEZ-2670 Project: Apache Tez Issue Type: Sub-task Affects Versions: TEZ-2003 Reporter: Siddharth Seth Assignee: Siddharth Seth This will rely on using IDs or the equivalent construct exposed by Tez. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2727) Fix findbugs warnings
[ https://issues.apache.org/jira/browse/TEZ-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated TEZ-2727: Attachment: TEZ-2727.1.txt Actual patch to fix findbugs. Fix findbugs warnings - Key: TEZ-2727 URL: https://issues.apache.org/jira/browse/TEZ-2727 Project: Apache Tez Issue Type: Sub-task Affects Versions: TEZ-2003 Reporter: Siddharth Seth Assignee: Siddharth Seth Attachments: TEZ-2727.1.txt -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-2727) Fix findbugs warnings
Siddharth Seth created TEZ-2727: --- Summary: Fix findbugs warnings Key: TEZ-2727 URL: https://issues.apache.org/jira/browse/TEZ-2727 Project: Apache Tez Issue Type: Sub-task Affects Versions: TEZ-2003 Reporter: Siddharth Seth Assignee: Siddharth Seth -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2726) Handle invalid number of partitions for SCATTER-GATHER edge
[ https://issues.apache.org/jira/browse/TEZ-2726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1475#comment-1475 ] Hitesh Shah commented on TEZ-2726: -- \cc [~bikassaha] Handle invalid number of partitions for SCATTER-GATHER edge --- Key: TEZ-2726 URL: https://issues.apache.org/jira/browse/TEZ-2726 Project: Apache Tez Issue Type: Improvement Reporter: Saikat Assignee: Saikat Encountered an issue where the source vertex has M tasks and the sink vertex has N tasks (N > M) [e.g. M = 1, N = 3], and the edge is of type SCATTER-GATHER. This resulted in the sink vertex receiving DMEs with non-existent targetIds. The fetchers for the sink vertex tasks then try to retrieve the map outputs and receive invalid headers due to an exception in the ShuffleHandler. Possible fixes: 1. raise a proper Tez Exception to indicate this invalid scenario. 2. or write appropriate empty partition bits for the missing partitions before sending out the DMEs to the sink vertex. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2727) Fix findbugs warnings
[ https://issues.apache.org/jira/browse/TEZ-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated TEZ-2727: Attachment: 2003_20150817.1.txt Patch for jenkins. Fix findbugs warnings - Key: TEZ-2727 URL: https://issues.apache.org/jira/browse/TEZ-2727 Project: Apache Tez Issue Type: Sub-task Affects Versions: TEZ-2003 Reporter: Siddharth Seth Assignee: Siddharth Seth Attachments: 2003_20150817.1.txt, TEZ-2727.1.txt -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2726) Handle invalid number of partitions for SCATTER-GATHER edge
[ https://issues.apache.org/jira/browse/TEZ-2726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14699963#comment-14699963 ] Jason Lowe commented on TEZ-2726: - +1 for throwing an exception. I think it could be dangerous to assume that putting in empty bits for missing partitions is the correct action to take. If that approach is mistaken we could end up with missing or corrupted outputs for a successful job. Handle invalid number of partitions for SCATTER-GATHER edge --- Key: TEZ-2726 URL: https://issues.apache.org/jira/browse/TEZ-2726 Project: Apache Tez Issue Type: Improvement Reporter: Saikat Assignee: Saikat Encountered an issue where the source vertex has M tasks and the sink vertex has N tasks (N > M) [e.g. M = 1, N = 3], and the edge is of type SCATTER-GATHER. This resulted in the sink vertex receiving DMEs with non-existent targetIds. The fetchers for the sink vertex tasks then try to retrieve the map outputs and receive invalid headers due to an exception in the ShuffleHandler. Possible fixes: 1. raise a proper Tez Exception to indicate this invalid scenario. 2. or write appropriate empty partition bits for the missing partitions before sending out the DMEs to the sink vertex. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
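The fail-fast option favored in the comments above amounts to a guard comparing the declared partition count against the number of downstream tasks before any DataMovementEvents are sent. The class and method names below are hypothetical illustrations, not the actual Tez patch:

```java
// Hypothetical sketch (not the actual TEZ-2726 fix): validate that the number
// of physical output partitions on a SCATTER-GATHER edge matches the number of
// downstream tasks, and fail fast instead of emitting DMEs with non-existent
// targetIds that the ShuffleHandler would later reject.
public class PartitionCheck {
    public static int validatePartitions(int numPartitions, int numDownstreamTasks) {
        if (numPartitions != numDownstreamTasks) {
            // In real Tez this would be a TezUncheckedException or similar.
            throw new IllegalStateException(
                "SCATTER-GATHER edge misconfigured: output declares " + numPartitions
                + " partitions but downstream vertex has " + numDownstreamTasks + " tasks");
        }
        return numPartitions;
    }

    public static void main(String[] args) {
        System.out.println(validatePartitions(3, 3)); // matching counts pass
        try {
            validatePartitions(1, 3); // the M = 1, N = 3 case from the report
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Failing at configuration time surfaces the DAG misconfiguration directly, rather than letting fetchers discover it later as invalid shuffle headers.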
[jira] [Updated] (TEZ-2294) Add tez-site-template.xml with description of config properties
[ https://issues.apache.org/jira/browse/TEZ-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated TEZ-2294: - Attachment: TEZ-2294.6.patch Add findbugs-exclude file. Add tez-site-template.xml with description of config properties --- Key: TEZ-2294 URL: https://issues.apache.org/jira/browse/TEZ-2294 Project: Apache Tez Issue Type: Improvement Reporter: Rajesh Balamohan Assignee: Hitesh Shah Attachments: TEZ-2294.4.patch, TEZ-2294.5.patch, TEZ-2294.6.patch, TEZ-2294.wip.2.patch, TEZ-2294.wip.3.patch, TEZ-2294.wip.patch, TezConfiguration.html, TezRuntimeConfiguration.html, tez-default-template.xml, tez-runtime-default-template.xml Document all tez configs with descriptions and default values. Also, document MR configs that can be easily translated to Tez configs via Tez helpers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2724) Tez Client keeps on showing old status when application is finished but RM is shutdown
[ https://issues.apache.org/jira/browse/TEZ-2724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14699116#comment-14699116 ] Jeff Zhang commented on TEZ-2724: - Uploaded patch to fix it. Verified it manually. [~pramachandran] [~hitesh] Please help review. Tez Client keeps on showing old status when application is finished but RM is shutdown -- Key: TEZ-2724 URL: https://issues.apache.org/jira/browse/TEZ-2724 Project: Apache Tez Issue Type: Bug Affects Versions: 0.5.4 Reporter: Jeff Zhang Assignee: Jeff Zhang Attachments: TEZ-2724-1.patch, amrecovery_mutlipleamrestart.txt From the logs, it seems the ipc retry interval is set to 20 seconds and ipc max retries to 45. This means that the client will retry the RPC connection for a total of 900 (20*45) seconds. And in this period, the application may already have completed and RM restarting may be triggered, as said in the jira description. And I think RM recovery is not enabled, so even if the new RM is restarted, the original application info is lost, which means the client can never get the correct application report, which makes it show the old status forever. {code} 15/05/07 19:13:43 INFO ipc.Client: Retrying connect to server: maint22-tez12/100.79.80.19:52822. Already tried 26 time(s); maxRetries=45 Deleted /user/hadoopqa/Input1 RUNNING: call D:\hdp\hadoop-2.6.0.2.2.6.0-2782\bin\hdfs.cmd dfs -ls /user/hadoopqa/Input2 RUNNING: call D:\hdp\hadoop-2.6.0.2.2.6.0-2782\bin\hdfs.cmd dfs -rm -r -skipTrash /user/hadoopqa/Input2 15/05/07 19:14:03 INFO ipc.Client: Retrying connect to server: maint22-tez12/100.79.80.19:52822. 
Already tried 27 time(s); maxRetries=45 {code} Configuration to reproduce this issue * disable generic application history (yarn.timeline-service.generic-application-history.enabled) * disable rm recovery (yarn.resourcemanager.recovery.enabled) * increase the ipc retry interval and max retries (ipc.client.connect.retry.interval, ipc.client.connect.max.retries) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TEZ-2724) Tez Client keeps on showing old status when application is finished but RM is shutdown
[ https://issues.apache.org/jira/browse/TEZ-2724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14699116#comment-14699116 ] Jeff Zhang edited comment on TEZ-2724 at 8/17/15 7:27 AM: -- Uploaded patch to fix it. This patch cannot solve the problem completely: when ATS is not enabled the issue remains; the patch only fixes it when ATS is enabled. Verified it manually. [~pramachandran] [~hitesh] Please help review. was (Author: zjffdu): Upload patch to fix it. Verified it manually. [~pramachandran] [~hitesh] Please help review. Tez Client keeps on showing old status when application is finished but RM is shutdown -- Key: TEZ-2724 URL: https://issues.apache.org/jira/browse/TEZ-2724 Project: Apache Tez Issue Type: Bug Affects Versions: 0.5.4 Reporter: Jeff Zhang Assignee: Jeff Zhang Attachments: TEZ-2724-1.patch, amrecovery_mutlipleamrestart.txt From the logs, it seems the ipc retry interval is set to 20 seconds and ipc max retries to 45. This means that the client will retry the RPC connection for a total of 900 (20*45) seconds. And in this period, the application may already have completed and RM restarting may be triggered, as said in the jira description. And I think RM recovery is not enabled, so even if the new RM is restarted, the original application info is lost, which means the client can never get the correct application report, which makes it show the old status forever. {code} 15/05/07 19:13:43 INFO ipc.Client: Retrying connect to server: maint22-tez12/100.79.80.19:52822. Already tried 26 time(s); maxRetries=45 Deleted /user/hadoopqa/Input1 RUNNING: call D:\hdp\hadoop-2.6.0.2.2.6.0-2782\bin\hdfs.cmd dfs -ls /user/hadoopqa/Input2 RUNNING: call D:\hdp\hadoop-2.6.0.2.2.6.0-2782\bin\hdfs.cmd dfs -rm -r -skipTrash /user/hadoopqa/Input2 15/05/07 19:14:03 INFO ipc.Client: Retrying connect to server: maint22-tez12/100.79.80.19:52822. 
Already tried 27 time(s); maxRetries=45 {code} Configuration to reproduce this issue * disable generic application history (yarn.timeline-service.generic-application-history.enabled) * disable rm recovery (yarn.resourcemanager.recovery.enabled) * increase the ipc retry interval and max retries (ipc.client.connect.retry.interval, ipc.client.connect.max.retries) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2724) Tez Client keeps on showing old status when application is finished but RM is shutdown
[ https://issues.apache.org/jira/browse/TEZ-2724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14700228#comment-14700228 ] Hitesh Shah commented on TEZ-2724: -- I think this is an edge case where RM HA is not enabled or RM recovery is not enabled. I think the switch to using TimelineClient should only happen in the following condition: RM either says app finished or throws an AppNotFound exception. If the RM is down, we should just wait or throw an error if it is being done today. Switching to the TimelineClient while the RM is down is probably going to be problematic as it will not switch back to the AM after the RM comes back up ( if recovery is enabled ). Tez Client keeps on showing old status when application is finished but RM is shutdown -- Key: TEZ-2724 URL: https://issues.apache.org/jira/browse/TEZ-2724 Project: Apache Tez Issue Type: Bug Affects Versions: 0.5.4 Reporter: Jeff Zhang Assignee: Jeff Zhang Attachments: TEZ-2724-1.patch, amrecovery_mutlipleamrestart.txt From the logs, it seems the ipc retry interval is set to 20 seconds and ipc max retries to 45. This means that the client will retry the RPC connection for a total of 900 (20*45) seconds. And in this period, the application may already have completed and RM restarting may be triggered, as said in the jira description. And I think RM recovery is not enabled, so even if the new RM is restarted, the original application info is lost, which means the client can never get the correct application report, which makes it show the old status forever. {code} 15/05/07 19:13:43 INFO ipc.Client: Retrying connect to server: maint22-tez12/100.79.80.19:52822. 
Already tried 26 time(s); maxRetries=45 Deleted /user/hadoopqa/Input1 RUNNING: call D:\hdp\hadoop-2.6.0.2.2.6.0-2782\bin\hdfs.cmd dfs -ls /user/hadoopqa/Input2 RUNNING: call D:\hdp\hadoop-2.6.0.2.2.6.0-2782\bin\hdfs.cmd dfs -rm -r -skipTrash /user/hadoopqa/Input2 15/05/07 19:14:03 INFO ipc.Client: Retrying connect to server: maint22-tez12/100.79.80.19:52822. Already tried 27 time(s); maxRetries=45 {code} Configuration to reproduce this issue * disable generic application history (yarn.timeline-service.generic-application-history.enabled) * disable rm recovery (yarn.resourcemanager.recovery.enabled) * increase the ipc retry interval and max retries (ipc.client.connect.retry.interval, ipc.client.connect.max.retries) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
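Hitesh's proposed rule — fall back to the TimelineClient only on a definitive answer from the RM (app finished, or AppNotFound), never on a mere connection failure — can be sketched as a small decision function. All names here are illustrative, not the actual TezClient code:

```java
// Hypothetical sketch of the status-source decision Hitesh describes for
// TEZ-2724. A connection failure (RM down) must NOT trigger the fallback,
// because the client could never switch back to the AM once the RM recovers.
public class StatusSourceDecision {
    public enum Source { RM, TIMELINE }

    public static Source pickSource(boolean rmReachable,
                                    boolean appFinishedPerRm,
                                    boolean appNotFoundPerRm) {
        if (!rmReachable) {
            // RM down: keep waiting/retrying against the RM rather than
            // switching, so recovery (if enabled) can restore the AM path.
            return Source.RM;
        }
        // Definitive RM answers: app finished, or AppNotFound (which would
        // imply recovery is disabled) -> safe to read history from ATS.
        return (appFinishedPerRm || appNotFoundPerRm) ? Source.TIMELINE : Source.RM;
    }
}
```

This makes the fallback depend on what the RM *says*, not on whether the RM *answers*, which is the distinction the comment draws.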
[jira] [Comment Edited] (TEZ-2724) Tez Client keeps on showing old status when application is finished but RM is shutdown
[ https://issues.apache.org/jira/browse/TEZ-2724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700228#comment-14700228 ] Hitesh Shah edited comment on TEZ-2724 at 8/17/15 9:05 PM: --- I think this is an edge case where RM HA is not enabled or RM recovery is not enabled. I think the switch to using TimelineClient should only happen in the following condition: the RM either says the app finished or throws an AppNotFound exception (AppNotFound would imply recovery is disabled). If the RM is down, we should just wait, or throw an error if that is what is done today. Switching to the TimelineClient while the RM is down is probably going to be problematic, as it will not switch back to the AM after the RM comes back up (if recovery is enabled). was (Author: hitesh): I think this is an edge case where RM HA is not enabled or RM recovery is not enabled. I think the switch to using TimelineClient should only happen in the following condition: the RM either says the app finished or throws an AppNotFound exception. If the RM is down, we should just wait, or throw an error if that is what is done today. Switching to the TimelineClient while the RM is down is probably going to be problematic, as it will not switch back to the AM after the RM comes back up (if recovery is enabled). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2294) Add tez-site-template.xml with description of config properties
[ https://issues.apache.org/jira/browse/TEZ-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated TEZ-2294: - Attachment: TEZ-2294.7.patch Re-upload to trigger pre-commit Add tez-site-template.xml with description of config properties --- Key: TEZ-2294 URL: https://issues.apache.org/jira/browse/TEZ-2294 Project: Apache Tez Issue Type: Improvement Reporter: Rajesh Balamohan Assignee: Hitesh Shah Attachments: TEZ-2294.4.patch, TEZ-2294.5.patch, TEZ-2294.6.patch, TEZ-2294.7.patch, TEZ-2294.wip.2.patch, TEZ-2294.wip.3.patch, TEZ-2294.wip.patch, TezConfiguration.html, TezRuntimeConfiguration.html, tez-default-template.xml, tez-runtime-default-template.xml Document all tez configs with descriptions and default values. Also, document MR configs that can be easily translated to Tez configs via Tez helpers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TEZ-2726) Handle invalid number of partitions for SCATTER-GATHER edge
[ https://issues.apache.org/jira/browse/TEZ-2726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700237#comment-14700237 ] Saikat edited comment on TEZ-2726 at 8/17/15 9:10 PM: -- One possible place to raise a proper exception is in sendTezEventToDestinationTasks() in Edge.java, before sending out the DME (for a scatter-gather edge manager). We can raise AMUserCodeException with the source set to the edge manager and an appropriate message. was (Author: saikatr): One possible place to raise a proper exception is in sendTezEventToDestinationTasks() in Edge.java, before sending out the DME. We can raise AMUserCodeException with the source set to the edge manager and an appropriate message. Handle invalid number of partitions for SCATTER-GATHER edge --- Key: TEZ-2726 URL: https://issues.apache.org/jira/browse/TEZ-2726 Project: Apache Tez Issue Type: Improvement Reporter: Saikat Assignee: Saikat Encountered an issue where the source vertex has M tasks and the sink vertex has N tasks (N > M, e.g. M = 1, N = 3) and the edge is of type SCATTER-GATHER. This resulted in the sink vertex receiving DMEs with non-existent targetIds. The fetchers for the sink vertex tasks then try to retrieve the map outputs and receive invalid headers due to an exception in the ShuffleHandler. Possible fixes: 1. raise a proper Tez exception to indicate this invalid scenario, or 2. write appropriate empty partition bits for the missing partitions before sending out the DMEs to the sink vertex. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2726) Handle invalid number of partitions for SCATTER-GATHER edge
[ https://issues.apache.org/jira/browse/TEZ-2726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700237#comment-14700237 ] Saikat commented on TEZ-2726: - One possible place to raise a proper exception is in sendTezEventToDestinationTasks() in Edge.java, before sending out the DME. We can raise AMUserCodeException with the source set to the edge manager and an appropriate message. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2726) Handle invalid number of partitions for SCATTER-GATHER edge
[ https://issues.apache.org/jira/browse/TEZ-2726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700262#comment-14700262 ] Bikas Saha commented on TEZ-2726: - Are there any details as to what exactly happened? I am not clear on that. It seems to be some issue where user misconfiguration caused empty partitions that were not handled correctly? //cc [~rajesh.balamohan] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
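Fix option 1 from the issue (raising AMUserCodeException from Edge.sendTezEventToDestinationTasks() before the DME goes out) amounts to a bounds check on the partition count versus the destination task count. Below is a hypothetical Python sketch of that check; the function and exception names are invented stand-ins for the Java types named in the comments.

```python
# Hypothetical sketch of validating DME targets on a SCATTER-GATHER edge.
# The real fix would be Java code in Edge.java raising AMUserCodeException.

class AMUserCodeError(Exception):
    """Stand-in for Tez's AMUserCodeException (source: the edge manager)."""

def validate_dme_targets(num_partitions, num_destination_tasks):
    # On a scatter-gather edge, partition i of each source task's output is
    # fetched by destination task i, so a partition index >= N can never be
    # consumed and would surface downstream as a DME with a non-existent
    # targetId. Failing fast here gives a clear AM-side error instead of a
    # fetch failure in the ShuffleHandler.
    if num_partitions > num_destination_tasks:
        raise AMUserCodeError(
            "Edge produced %d partitions but destination vertex has only %d tasks"
            % (num_partitions, num_destination_tasks))
```

The same check could equally drive option 2 (emitting empty-partition bits) instead of raising.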
[jira] [Commented] (TEZ-2723) Tez UI: Breadcrumb changes
[ https://issues.apache.org/jira/browse/TEZ-2723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699064#comment-14699064 ] Sreenath Somarajapuram commented on TEZ-2723: - Sorry, my bad. Adding the framework as part of TEZ-2725. Tez UI: Breadcrumb changes -- Key: TEZ-2723 URL: https://issues.apache.org/jira/browse/TEZ-2723 Project: Apache Tez Issue Type: Bug Reporter: Sreenath Somarajapuram Assignee: Sreenath Somarajapuram Priority: Minor Attachments: TEZ-2723.1.patch - Update breadcrumb on tab change - Tune breadcrumb font-size -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2607) SIMD-based bitonic merge sorting
[ https://issues.apache.org/jira/browse/TEZ-2607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699070#comment-14699070 ] Tsuyoshi Ozawa commented on TEZ-2607: - Implemented the bitonic algorithm with [~maropu]: https://github.com/oza/bitonic_sort A flash report of the micro benchmark is as follows:
||algorithm||speed (million sorts per sec)||
|qsort(C)|5.9883126432|
|bitonic_sort(C)|29.1652639347|
I've started to work on integrating this code with Tez. SIMD-based bitonic merge sorting Key: TEZ-2607 URL: https://issues.apache.org/jira/browse/TEZ-2607 Project: Apache Tez Issue Type: Sub-task Reporter: Tsuyoshi Ozawa Assignee: Tsuyoshi Ozawa Attachments: map_phase.png -- This message was sent by Atlassian JIRA (v6.3.4#6332)
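For reference, the algorithm being benchmarked above can be sketched (without SIMD) as a recursive bitonic merge sort. The linked repository implements it in C with SIMD intrinsics; this Python version only illustrates the structure of the sorting network, whose fixed compare-exchange pattern is what makes it amenable to SIMD.

```python
# Minimal, non-SIMD bitonic sort sketch for illustration only.
# Bitonic networks require a power-of-two input length.

def bitonic_sort(values, ascending=True):
    n = len(values)
    if n <= 1:
        return list(values)
    assert n & (n - 1) == 0, "bitonic sort needs a power-of-two length"
    # Build a bitonic sequence: first half ascending, second half descending.
    first = bitonic_sort(values[: n // 2], True)
    second = bitonic_sort(values[n // 2 :], False)
    return _bitonic_merge(first + second, ascending)

def _bitonic_merge(values, ascending):
    # Compare-exchange elements a fixed stride apart, then recurse on halves.
    # This data-independent pattern is what SIMD implementations vectorize.
    n = len(values)
    if n == 1:
        return values
    half = n // 2
    for i in range(half):
        if (values[i] > values[i + half]) == ascending:
            values[i], values[i + half] = values[i + half], values[i]
    return (_bitonic_merge(values[:half], ascending)
            + _bitonic_merge(values[half:], ascending))
```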
Failed: TEZ-2724 PreCommit Build #998
Jira: https://issues.apache.org/jira/browse/TEZ-2724 Build: https://builds.apache.org/job/PreCommit-TEZ-Build/998/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 3288 lines...] {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12750751/TEZ-2724-1.patch against master revision 6cb8206. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 3.0.1) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/998//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/998//console This message is automatically generated. == == Adding comment to Jira. == == Comment added. e8a16e5069ee061dadad7d571ce396b1548bc200 logged out == == Finished build. == == Build step 'Execute shell' marked build as failure Archiving artifacts Sending artifact delta relative to PreCommit-TEZ-Build #988 Archived 50 artifacts Archive block size is 32768 Received 0 blocks and 3094207 bytes Compression is 0.0% Took 0.92 sec [description-setter] Could not determine description. Recording test results Email was triggered for: Failure Sending email for trigger: Failure ### ## FAILED TESTS (if any) ## All tests passed
[jira] [Commented] (TEZ-2724) Tez Client keeps on showing old status when application is finished but RM is shutdown
[ https://issues.apache.org/jira/browse/TEZ-2724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699107#comment-14699107 ] TezQA commented on TEZ-2724: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12750751/TEZ-2724-1.patch against master revision 6cb8206. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 3.0.1) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/998//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/998//console This message is automatically generated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2300) TezClient.stop() takes a lot of time or does not work sometimes
[ https://issues.apache.org/jira/browse/TEZ-2300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700194#comment-14700194 ] Rohini Palaniswamy commented on TEZ-2300: - When a user aborts a Pig script, Pig kills the jobs it launched in the shutdown hook. What I am looking for is the same behaviour as killing a mapreduce job: the job should stop whatever it is doing and the AM should exit in less than half a minute. bq. Are we waiting for the DAG to be finished? No. We are trying to kill it. It should be interrupted and processing stopped. bq. Are we waiting until the AM is closed as well? Currently the call is not blocking. It should block and exit after the kill succeeds. bq. Or is the most important aspect to reduce the amount of time it takes to shutdown an AM with a DAG running? That as well. The AM should be terminated after a timeout period if graceful kill/shutdown does not work, similar to mapreduce. bq. With the pig interactive command line, will pig want to cancel a DAG and run another in the same AM? Currently there are no APIs to cancel a DAG and I don't see the need at this point to cancel a DAG and reuse that AM. TezClient.stop() takes a lot of time or does not work sometimes --- Key: TEZ-2300 URL: https://issues.apache.org/jira/browse/TEZ-2300 Project: Apache Tez Issue Type: Bug Reporter: Rohini Palaniswamy Assignee: Jonathan Eagles Attachments: TEZ-2300.1.patch, TEZ-2300.2.patch, TEZ-2300.3.patch, TEZ-2300.4.patch, syslog_dag_1428329756093_325099_1_post Noticed this with a couple of pig scripts which were not behaving well (AM close to OOM, etc.) and even with some that were running fine. Pig calls TezClient.stop() in a shutdown hook. Ctrl+C to the pig script either exits immediately or hangs. In both cases it takes a long time for the yarn application to go to the KILLED state. Many times I just end up calling yarn application -kill separately after waiting for 5 mins or more for it to get killed.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
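The stop semantics requested above (a blocking kill that interrupts the DAG, with a hard-kill fallback after a timeout, as MapReduce does) can be sketched roughly as follows. This is not TezClient's actual implementation; the `am` object and its methods are hypothetical stand-ins for the client-to-AM protocol.

```python
# Illustrative sketch of a blocking stop() with a hard-kill timeout.
# The am object and its kill_dag/is_terminated/force_kill methods are
# hypothetical; they stand in for the real client-AM RPCs.

import time

def stop_blocking(am, timeout_seconds=30.0, poll_interval=0.1):
    am.kill_dag()  # interrupt DAG processing, like killing an MR job
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        if am.is_terminated():
            return "KILLED"  # graceful kill succeeded; safe to return
        time.sleep(poll_interval)
    # Graceful kill/shutdown did not complete within the timeout:
    # fall back to a hard termination (the yarn application -kill analogue).
    am.force_kill()
    return "FORCE_KILLED"
```

The key point from the discussion is that the call only returns once DAG processing has verifiably stopped, one way or the other.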
Failed: TEZ-2294 PreCommit Build #999
Jira: https://issues.apache.org/jira/browse/TEZ-2294 Build: https://builds.apache.org/job/PreCommit-TEZ-Build/999/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 38 lines...] TEZ-2294 patch is being downloaded at Mon Aug 17 20:59:59 UTC 2015 from http://issues.apache.org/jira/secure/attachment/12750871/TEZ-2294.6.patch == == Pre-build master to verify master stability and javac warnings == == Compiling /home/jenkins/jenkins-slave/workspace/PreCommit-TEZ-Build /home/jenkins/tools/maven/latest/bin/mvn clean test -DskipTests -Ptest-patch /home/jenkins/jenkins-slave/workspace/PreCommit-TEZ-Build/../patchprocess/masterJavacWarnings.txt 21 master compilation is broken? {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12750871/TEZ-2294.6.patch against master revision 6cb8206. {color:red}-1 patch{color}. master compilation may be broken. Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/999//console This message is automatically generated. == == Adding comment to Jira. == == Comment added. 876db1de0274684c51f7e0136549c906f0661902 logged out == == Finished build. == == Build step 'Execute shell' marked build as failure Archiving artifacts Sending artifact delta relative to PreCommit-TEZ-Build #988 Archived 1 artifacts Archive block size is 32768 Received 0 blocks and 180377 bytes Compression is 0.0% Took 0.43 sec [description-setter] Could not determine description. Recording test results Email was triggered for: Failure Sending email for trigger: Failure ### ## FAILED TESTS (if any) ## No tests ran.
[jira] [Commented] (TEZ-2294) Add tez-site-template.xml with description of config properties
[ https://issues.apache.org/jira/browse/TEZ-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700226#comment-14700226 ] TezQA commented on TEZ-2294: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12750871/TEZ-2294.6.patch against master revision 6cb8206. {color:red}-1 patch{color}. master compilation may be broken. Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/999//console This message is automatically generated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (TEZ-2629) LimitExceededException in Tez client when DAG exceeds the default max
[ https://issues.apache.org/jira/browse/TEZ-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth reassigned TEZ-2629: --- Assignee: Siddharth Seth LimitExceededException in Tez client when DAG exceeds the default max - Key: TEZ-2629 URL: https://issues.apache.org/jira/browse/TEZ-2629 Project: Apache Tez Issue Type: Bug Affects Versions: 0.5.0 Reporter: Jason Dere Assignee: Siddharth Seth Attachments: TEZ-2629.1.txt Original issue was HIVE-11303: seeing LimitExceededException when the client tries to get the counters for a completed job:
{noformat}
2015-07-17 18:18:11,830 INFO [main]: counters.Limits (Limits.java:ensureInitialized(59)) - Counter limits initialized with parameters: GROUP_NAME_MAX=256, MAX_GROUPS=500, COUNTER_NAME_MAX=64, MAX_COUNTERS=1200
2015-07-17 18:18:11,841 ERROR [main]: exec.Task (TezTask.java:execute(189)) - Failed to execute tez graph.
org.apache.tez.common.counters.LimitExceededException: Too many counters: 1201 max=1200
    at org.apache.tez.common.counters.Limits.checkCounters(Limits.java:87)
    at org.apache.tez.common.counters.Limits.incrCounters(Limits.java:94)
    at org.apache.tez.common.counters.AbstractCounterGroup.addCounter(AbstractCounterGroup.java:76)
    at org.apache.tez.common.counters.AbstractCounterGroup.addCounterImpl(AbstractCounterGroup.java:93)
    at org.apache.tez.common.counters.AbstractCounterGroup.findCounter(AbstractCounterGroup.java:104)
    at org.apache.tez.dag.api.DagTypeConverters.convertTezCountersFromProto(DagTypeConverters.java:567)
    at org.apache.tez.dag.api.client.DAGStatus.getDAGCounters(DAGStatus.java:148)
    at org.apache.hadoop.hive.ql.exec.tez.TezTask.execute(TezTask.java:175)
    at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160)
    at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:89)
    at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1673)
    at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1432)
    at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1213)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1064)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1054)
    at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:213)
    at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:165)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:311)
    at org.apache.hadoop.hive.cli.CliDriver.processReader(CliDriver.java:409)
    at org.apache.hadoop.hive.cli.CliDriver.processFile(CliDriver.java:425)
    at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:714)
    at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:681)
    at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:621)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
{noformat}
It looks like Limits.ensureInitialized() is defaulting to an empty configuration, resulting in COUNTERS_MAX being set to the default of 1200 (even though Hive's configuration specified tez.counters.max=16000). Per [~sseth]:
{quote}
I think the Tez client does need to make this call to set up the Configuration correctly. We do this for the AM and the executing task, which is why it works. Could you please open a Tez jira for this? Also, Limits is making use of Configuration instead of TezConfiguration for default initialization, which implies changes to tez-site on the local node won't be picked up.
{quote}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
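The failing check in the stack trace is a simple size-versus-limit comparison; the bug is that the client-side Limits was initialized from an empty configuration, so tez.counters.max was ignored and the 1200 default applied. A Python sketch of that behaviour follows; the names mirror the Java ones from the trace, but the code is illustrative, not the actual Limits implementation.

```python
# Illustrative sketch of the counter-limit check from Limits.checkCounters.

DEFAULT_MAX_COUNTERS = 1200  # the default that applied because conf was empty

class LimitExceededError(Exception):
    """Stand-in for org.apache.tez.common.counters.LimitExceededException."""

class Limits:
    def __init__(self, conf=None):
        conf = conf or {}
        # The reported bug in a nutshell: initializing from an empty conf
        # means tez.counters.max (e.g. 16000 in tez-site) is never seen.
        self.max_counters = int(conf.get("tez.counters.max", DEFAULT_MAX_COUNTERS))

    def check_counters(self, size):
        if size > self.max_counters:
            raise LimitExceededError(
                "Too many counters: %d max=%d" % (size, self.max_counters))
```

With an empty configuration, adding the 1201st counter raises exactly the "Too many counters: 1201 max=1200" error seen above; passing the real configuration through, as the fix proposes, lifts the limit.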
[jira] [Commented] (TEZ-2300) TezClient.stop() takes a lot of time or does not work sometimes
[ https://issues.apache.org/jira/browse/TEZ-2300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700203#comment-14700203 ] Rohini Palaniswamy commented on TEZ-2300: - bq. Are we waiting until the AM is closed as well? Actually, I have no concerns about the AM lingering to write history, if you can completely ensure that processing of the DAG has been terminated when the stop() call returns. The problem here is that the stop() call returns with no clue as to whether DAG processing is still happening or has terminated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)