[jira] [Commented] (TEZ-2368) Make the dag number available in Context classes
[ https://issues.apache.org/jira/browse/TEZ-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14515150#comment-14515150 ]

Hitesh Shah commented on TEZ-2368:

Comments: typo in "Get a numeric identifier for the dto which the task belongs"

+1 once the typo is fixed.

Make the dag number available in Context classes

Key: TEZ-2368
URL: https://issues.apache.org/jira/browse/TEZ-2368
Project: Apache Tez
Issue Type: Improvement
Reporter: Siddharth Seth
Assignee: Siddharth Seth
Attachments: TEZ-2368.1.txt, TEZ-2368.2.txt

Provide the dag number, which is a unique number, for each dag running within an application in the TezInputContext, TezOutputContext, TezProcessorContext. When containers are re-used, or for external services, this can be used to generate intermediate data to a dag specific directory instead of an application specific directory, where it becomes difficult to differentiate between different dags. The DAG name does provide this - but is not suitable for use in a directory name. Hashing the name is an option, but can lead to collisions. Generating data into a dag specific directory will eventually only be usable when we move away from the default MR handler, or enhance it to support an additional parameter.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2368) Make the dag number available in Context classes
[ https://issues.apache.org/jira/browse/TEZ-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14512779#comment-14512779 ]

Hitesh Shah commented on TEZ-2368:

Probably overthinking this one too much. dagIdentifier api sounds good. Using the dagId/index should be sufficient.
[jira] [Commented] (TEZ-2368) Make the dag number available in Context classes
[ https://issues.apache.org/jira/browse/TEZ-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14511993#comment-14511993 ]

Hitesh Shah commented on TEZ-2368:

dag names are meant to be unique within a session.
[jira] [Commented] (TEZ-2368) Make the dag number available in Context classes
[ https://issues.apache.org/jira/browse/TEZ-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14512091#comment-14512091 ]

TezQA commented on TEZ-2368:

+1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12728081/TEZ-2368.1.txt against master revision 2935ef4.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 1 new or modified test files.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 javadoc. There were no new javadoc warning messages.
+1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/535//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/535//console

This message is automatically generated.
[jira] [Commented] (TEZ-2368) Make the dag number available in Context classes
[ https://issues.apache.org/jira/browse/TEZ-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14512018#comment-14512018 ]

Siddharth Seth commented on TEZ-2368:

Yes, but they're not suitable for use in directory names. Similarly for vertex names.
[jira] [Commented] (TEZ-2368) Make the dag number available in Context classes
[ https://issues.apache.org/jira/browse/TEZ-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14512127#comment-14512127 ]

Hitesh Shah commented on TEZ-2368:

What I meant was that a hashcode would work given that the name is unique. In any case, why does a dag number need to be exposed to user code? Isn't the unique id sufficient? If the end-goal is per-dag data for the framework to be able to clean up code then the framework should be creating dag specific dirs before passing them to user land code and cleaning up these dirs when the dag has completed. External services are not meant to be using these context classes in any case. Or am I missing something? We can add this api but I am not sure I see a need for it currently.
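For readers following the hashing discussion, here is a minimal sketch of the collision concern from the issue description: unique names do not guarantee unique hash codes. The DAG names and the `dag_<hex>` directory naming are made up for illustration, and nothing in it is Tez API.

```java
// Hypothetical example: the DAG names and the "dag_" prefix are illustrative only.
final class DagNameHashDemo {
    /** Derives a directory-safe token from a DAG name by hashing it. */
    static String hashedDirName(String dagName) {
        return "dag_" + Integer.toHexString(dagName.hashCode());
    }

    public static void main(String[] args) {
        String first = "Aa";
        String second = "BB";
        // The names differ ...
        System.out.println(first.equals(second)); // prints false
        // ... but String.hashCode() is 2112 for both, so the directory names match.
        System.out.println(hashedDirName(first).equals(hashedDirName(second))); // prints true
    }
}
```

A small integer assigned per dag by the framework avoids this problem, because it is unique by construction rather than by chance.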
[jira] [Commented] (TEZ-2368) Make the dag number available in Context classes
[ https://issues.apache.org/jira/browse/TEZ-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14512249#comment-14512249 ]

Siddharth Seth commented on TEZ-2368:

This could be renamed to something like getDagIdentifier (short unique integer identifying a dag within an app) instead of getDagNumber - which makes it cleaner.
[jira] [Commented] (TEZ-2368) Make the dag number available in Context classes
[ https://issues.apache.org/jira/browse/TEZ-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14512245#comment-14512245 ]

Siddharth Seth commented on TEZ-2368:

The hashcode will not be sufficient since it can, and likely will be, different across two different JVMs.

We currently end up generating data in ${appId}/constant/${uniqueId}, which makes cleanup very difficult. The main limitation in changing this path is the way we are tied to the MR ShuffleHandler, which only knows how to process this path. Creating dag specific dirs is an option, but only after the ShuffleHandler changes. When an external ShuffleHandler is used, this API provides the relevant information to create dag specific dirs instead of using the app dir directly.

The API isn't exposing the dagId directly. What it does expose is a small unique identifier for each dag running in an application, which can be useful. Caching would be an alternate use for something like this. It's similar to the vertexIndex API which exists on the context impls, which is present for exactly the same reason: to generate names.

> External services are not meant to be using these context classes in any case. Or am I missing something?

External services can use the components in the RuntimeLibrary, all of which depend on the Context classes. What that does mean is that construction/usage of these classes will eventually need to be exposed as a limited public API, likely tied to specific Tez versions as it evolves.
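The dag specific directory idea from this comment could look roughly like the sketch below. Everything in it is an assumption for illustration: `DagScopedContext` is a hypothetical stand-in rather than a real Tez context class, `getApplicationId()` is an invented accessor, `getDagIdentifier()` only borrows the name proposed in this thread, and the `<appId>/dag_<n>/<uniqueId>` layout is one possible choice, not the layout Tez uses.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

final class DagDirs {
    /** Builds <appId>/dag_<dagIdentifier>/<uniqueId>; the layout is an assumption. */
    static Path intermediateDir(DagScopedContext ctx, String uniqueId) {
        return Paths.get(ctx.getApplicationId(), "dag_" + ctx.getDagIdentifier(), uniqueId);
    }

    public static void main(String[] args) {
        // Made-up values standing in for what a framework would supply.
        DagScopedContext ctx = new DagScopedContext() {
            public String getApplicationId() { return "application_0001"; }
            public int getDagIdentifier() { return 2; }
        };
        System.out.println(intermediateDir(ctx, "attempt_0"));
    }
}

// Hypothetical stand-in for a context class; NOT the real Tez interface.
interface DagScopedContext {
    String getApplicationId(); // assumed accessor, illustration only
    int getDagIdentifier();    // name proposed in this thread; signature assumed
}
```

With a layout like this, cleaning up after a dag completes reduces to deleting a single `dag_<n>` directory under the application directory, instead of picking individual entries out of a shared one.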