[jira] [Commented] (TEZ-3957) Report TASK_DURATION_MILLIS as a Counter for completed tasks

2018-11-06 Thread Sergey Shelukhin (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16677629#comment-16677629
 ] 

Sergey Shelukhin commented on TEZ-3957:
---

Hmm... all the tests failed to fork the VM, and they don't appear to repro locally. I'm 
running all the tests now.

> Report TASK_DURATION_MILLIS as a Counter for completed tasks
> 
>
> Key: TEZ-3957
> URL: https://issues.apache.org/jira/browse/TEZ-3957
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Eric Wohlstadter
>Assignee: Sergey Shelukhin
>Priority: Major
> Attachments: TEZ-3957.01.patch, TEZ-3957.02.patch, TEZ-3957.patch
>
>
> timeTaken is already being reported by {{TaskAttemptFinishedEvent}}, but not 
> as a Counter.
> Combined with TEZ-3911, this provides min(timeTaken), max(timeTaken), 
> avg(timeTaken).
> The value will be: {{finishTime - launchTime}}
>  
>  
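A minimal sketch of what recording such a counter could look like against the 
TezCounters API; the group and counter names below are illustrative assumptions, 
not necessarily what the attached patches define:

{code}
import org.apache.tez.common.counters.TezCounter;
import org.apache.tez.common.counters.TezCounters;

public final class TaskDurationCounterSketch {
  // Illustrative names; the patch decides the counter's actual group and name.
  private static final String GROUP = "TaskDuration";
  private static final String NAME = "TASK_DURATION_MILLIS";

  /** Record the wall-clock duration of a finished task attempt. */
  public static void report(TezCounters counters, long launchTime, long finishTime) {
    TezCounter duration = counters.findCounter(GROUP, NAME);
    // Wall-clock time (finishTime - launchTime), unlike CPU_MILLISECONDS,
    // which tracks CPU time.
    duration.setValue(finishTime - launchTime);
  }
}
{code}

With the per-task aggregation from TEZ-3911, min/max/avg of this value then fall 
out at the vertex/DAG level.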



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3957) Report TASK_DURATION_MILLIS as a Counter for completed tasks

2018-11-06 Thread Sergey Shelukhin (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16677567#comment-16677567
 ] 

Sergey Shelukhin commented on TEZ-3957:
---

Updated

> Report TASK_DURATION_MILLIS as a Counter for completed tasks
> 
>
> Key: TEZ-3957
> URL: https://issues.apache.org/jira/browse/TEZ-3957
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Eric Wohlstadter
>Assignee: Sergey Shelukhin
>Priority: Major
> Attachments: TEZ-3957.01.patch, TEZ-3957.02.patch, TEZ-3957.patch
>
>
> timeTaken is already being reported by {{TaskAttemptFinishedEvent}}, but not 
> as a Counter.
> Combined with TEZ-3911, this provides min(timeTaken), max(timeTaken), 
> avg(timeTaken).
> The value will be: {{finishTime - launchTime}}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (TEZ-3957) Report TASK_DURATION_MILLIS as a Counter for completed tasks

2018-11-06 Thread Sergey Shelukhin (JIRA)


 [ 
https://issues.apache.org/jira/browse/TEZ-3957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated TEZ-3957:
--
Attachment: TEZ-3957.02.patch

> Report TASK_DURATION_MILLIS as a Counter for completed tasks
> 
>
> Key: TEZ-3957
> URL: https://issues.apache.org/jira/browse/TEZ-3957
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Eric Wohlstadter
>Assignee: Sergey Shelukhin
>Priority: Major
> Attachments: TEZ-3957.01.patch, TEZ-3957.02.patch, TEZ-3957.patch
>
>
> timeTaken is already being reported by {{TaskAttemptFinishedEvent}}, but not 
> as a Counter.
> Combined with TEZ-3911, this provides min(timeTaken), max(timeTaken), 
> avg(timeTaken).
> The value will be: {{finishTime - launchTime}}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3957) Report TASK_DURATION_MILLIS as a Counter for completed tasks

2018-10-29 Thread Sergey Shelukhin (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16667578#comment-16667578
 ] 

Sergey Shelukhin commented on TEZ-3957:
---

CPU milliseconds is CPU time; this reports wall-clock time.
No idea about MR.

> Report TASK_DURATION_MILLIS as a Counter for completed tasks
> 
>
> Key: TEZ-3957
> URL: https://issues.apache.org/jira/browse/TEZ-3957
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Eric Wohlstadter
>Assignee: Sergey Shelukhin
>Priority: Major
> Attachments: TEZ-3957.01.patch, TEZ-3957.patch
>
>
> timeTaken is already being reported by {{TaskAttemptFinishedEvent}}, but not 
> as a Counter.
> Combined with TEZ-3911, this provides min(timeTaken), max(timeTaken), 
> avg(timeTaken).
> The value will be: {{finishTime - launchTime}}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (TEZ-3957) Report TASK_DURATION_MILLIS as a Counter for completed tasks

2018-10-26 Thread Sergey Shelukhin (JIRA)


 [ 
https://issues.apache.org/jira/browse/TEZ-3957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated TEZ-3957:
--
Attachment: TEZ-3957.01.patch

> Report TASK_DURATION_MILLIS as a Counter for completed tasks
> 
>
> Key: TEZ-3957
> URL: https://issues.apache.org/jira/browse/TEZ-3957
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Eric Wohlstadter
>Assignee: Sergey Shelukhin
>Priority: Major
> Attachments: TEZ-3957.01.patch, TEZ-3957.patch
>
>
> timeTaken is already being reported by {{TaskAttemptFinishedEvent}}, but not 
> as a Counter.
> Combined with TEZ-3911, this provides min(timeTaken), max(timeTaken), 
> avg(timeTaken).
> The value will be: {{finishTime - launchTime}}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3957) Report TASK_DURATION_MILLIS as a Counter for completed tasks

2018-10-26 Thread Sergey Shelukhin (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16665699#comment-16665699
 ] 

Sergey Shelukhin commented on TEZ-3957:
---

I'm going to fix some but not all of the checkstyle warnings. Some of them don't 
make sense (e.g. DesignForExtension: 
http://checkstyle.sourceforge.net/config_design.html#DesignForExtension 
notes that it only makes sense for library projects, since as-is it prevents 
normal method overrides; perhaps that check should be disabled in a separate 
patch), and some of the diffing seems buggy (it complains about missing javadoc 
on a field that already existed without javadoc). 

> Report TASK_DURATION_MILLIS as a Counter for completed tasks
> 
>
> Key: TEZ-3957
> URL: https://issues.apache.org/jira/browse/TEZ-3957
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Eric Wohlstadter
>Assignee: Sergey Shelukhin
>Priority: Major
> Attachments: TEZ-3957.patch
>
>
> timeTaken is already being reported by {{TaskAttemptFinishedEvent}}, but not 
> as a Counter.
> Combined with TEZ-3911, this provides min(timeTaken), max(timeTaken), 
> avg(timeTaken).
> The value will be: {{finishTime - launchTime}}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (TEZ-3957) Report TASK_DURATION_MILLIS as a Counter for completed tasks

2018-10-25 Thread Sergey Shelukhin (JIRA)


 [ 
https://issues.apache.org/jira/browse/TEZ-3957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated TEZ-3957:
--
Attachment: TEZ-3957.patch

> Report TASK_DURATION_MILLIS as a Counter for completed tasks
> 
>
> Key: TEZ-3957
> URL: https://issues.apache.org/jira/browse/TEZ-3957
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Eric Wohlstadter
>Assignee: Sergey Shelukhin
>Priority: Major
> Attachments: TEZ-3957.patch
>
>
> timeTaken is already being reported by {{TaskAttemptFinishedEvent}}, but not 
> as a Counter.
> Combined with TEZ-3911, this provides min(timeTaken), max(timeTaken), 
> avg(timeTaken).
> The value will be: {{finishTime - launchTime}}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (TEZ-3957) Report TASK_DURATION_MILLIS as a Counter for completed tasks

2018-10-25 Thread Sergey Shelukhin (JIRA)


 [ 
https://issues.apache.org/jira/browse/TEZ-3957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin reassigned TEZ-3957:
-

Assignee: Sergey Shelukhin  (was: Eric Wohlstadter)

> Report TASK_DURATION_MILLIS as a Counter for completed tasks
> 
>
> Key: TEZ-3957
> URL: https://issues.apache.org/jira/browse/TEZ-3957
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Eric Wohlstadter
>Assignee: Sergey Shelukhin
>Priority: Major
> Attachments: TEZ-3957.patch
>
>
> timeTaken is already being reported by {{TaskAttemptFinishedEvent}}, but not 
> as a Counter.
> Combined with TEZ-3911, this provides min(timeTaken), max(timeTaken), 
> avg(timeTaken).
> The value will be: {{finishTime - launchTime}}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3980) ShuffleRunner: the wake loop needs to check for shutdown

2018-08-22 Thread Sergey Shelukhin (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16589304#comment-16589304
 ] 

Sergey Shelukhin commented on TEZ-3980:
---

+1 non-binding

> ShuffleRunner: the wake loop needs to check for shutdown
> 
>
> Key: TEZ-3980
> URL: https://issues.apache.org/jira/browse/TEZ-3980
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Gopal V
>Assignee: Gopal V
>Priority: Major
> Attachments: TEZ-3980.1.patch
>
>
> In the ShuffleRunner threads, there's a loop which does not terminate if the 
> task threads get killed.
> {code}
>   while ((runningFetchers.size() >= numFetchers || pendingHosts.isEmpty())
>       && numCompletedInputs.get() < numInputs) {
>     inputContext.notifyProgress();
>     boolean ret = wakeLoop.await(1000, TimeUnit.MILLISECONDS);
>   }
> {code}
> The wakeLoop signal alone does not exit the loop; it is missing a break for 
> shut-down.
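For illustration, a self-contained sketch of a shutdown-aware version of that 
loop; the field names mirror the quoted snippet, and the {{isShutdown}} flag is 
an assumption about how the fix could be wired, not the committed patch:

{code}
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

public class ShuffleWaitLoopSketch {
  private final ReentrantLock lock = new ReentrantLock();
  private final Condition wakeLoop = lock.newCondition();
  private final AtomicBoolean isShutdown = new AtomicBoolean(false);
  private final AtomicInteger numCompletedInputs = new AtomicInteger(0);
  private final Set<Object> runningFetchers = ConcurrentHashMap.newKeySet();
  private final Queue<String> pendingHosts = new ConcurrentLinkedQueue<>();
  private final int numFetchers = 10;
  private final int numInputs = 100;

  void waitForFetcherSlot() throws InterruptedException {
    lock.lock();
    try {
      while ((runningFetchers.size() >= numFetchers || pendingHosts.isEmpty())
          && numCompletedInputs.get() < numInputs) {
        // inputContext.notifyProgress() omitted for brevity.
        // The missing break: without this, a killed task spins here forever,
        // because numCompletedInputs never reaches numInputs.
        if (isShutdown.get()) {
          break;
        }
        wakeLoop.await(1000, TimeUnit.MILLISECONDS);
      }
    } finally {
      lock.unlock();
    }
  }

  void shutdown() {
    isShutdown.set(true);
    lock.lock();
    try {
      // Wake the waiting loop so it observes the flag promptly.
      wakeLoop.signalAll();
    } finally {
      lock.unlock();
    }
  }
}
{code}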



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3953) Restore ABI-compat for DAGClient for TEZ-3951

2018-07-05 Thread Sergey Shelukhin (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16534116#comment-16534116
 ] 

Sergey Shelukhin commented on TEZ-3953:
---

[~jlowe] yes, can you push it there? Thanks

> Restore ABI-compat for DAGClient for TEZ-3951
> -
>
> Key: TEZ-3953
> URL: https://issues.apache.org/jira/browse/TEZ-3953
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.10.0
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Major
> Fix For: 0.10.0
>
> Attachments: TEZ-3953.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (TEZ-3953) make interface change from TEZ-3951 non-breaking

2018-06-11 Thread Sergey Shelukhin (JIRA)


 [ 
https://issues.apache.org/jira/browse/TEZ-3953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated TEZ-3953:
--
Attachment: TEZ-3953.patch

> make interface change from TEZ-3951 non-breaking
> 
>
> Key: TEZ-3953
> URL: https://issues.apache.org/jira/browse/TEZ-3953
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Major
> Attachments: TEZ-3953.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Moved] (TEZ-3953) make interface change from TEZ-3951 non-breaking

2018-06-11 Thread Sergey Shelukhin (JIRA)


 [ 
https://issues.apache.org/jira/browse/TEZ-3953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin moved HIVE-19858 to TEZ-3953:
--

Key: TEZ-3953  (was: HIVE-19858)
Project: Apache Tez  (was: Hive)

> make interface change from TEZ-3951 non-breaking
> 
>
> Key: TEZ-3953
> URL: https://issues.apache.org/jira/browse/TEZ-3953
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3904) an API to update tokens for Tez AM and the DAG

2018-06-08 Thread Sergey Shelukhin (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16506317#comment-16506317
 ] 

Sergey Shelukhin commented on TEZ-3904:
---

Yeah, that's the idea, although in the case of Tez the actual renewer may not 
live fully in Tez (since some tokens, like the ones for HBase etc., are 
originally obtained by Hive, and some are obtained by Tez based on paths). 
It might make sense to allow the users of Tez (i.e. Hive) to supply a function 
that obtains tokens, instead of the tokens themselves, so that Tez could get 
new tokens at any time.
The containers would also need to get the new tokens.

> an API to update tokens for Tez AM and the DAG
> --
>
> Key: TEZ-3904
> URL: https://issues.apache.org/jira/browse/TEZ-3904
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Priority: Major
>
> Nothing is permanent in this world, least of all delegation tokens.
> The current way around token expiration (the one where you cannot keep 
> renewing anymore) in Hive when Tez AM is used in session mode is to cycle Tez 
> AM. It may happen though that a query is running at that time, and so the AM 
> cannot be restarted with new tokens. We let the query run its course and it 
> usually dies because it tries to do something with an expired token.
> To get around that, we cycle AMs a few hours before tokens are going to 
> expire.
> However, that is still not ideal because it puts an upper bound on safe Hive 
> query runtime (a query longer than 3 hours with current config may fail due 
> to an expired token if its timing is unlucky), and also precludes setting 
> tokens to expire much faster than the standard 7-day time frame.
> There should be a mechanism to replace tokens in the AM, including for a 
> running DAG.
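To illustrate the supplier idea from the comment above, a hypothetical sketch; 
the interface name and shape are invented for illustration and are not an 
actual Tez API:

{code}
import java.io.IOException;
import org.apache.hadoop.security.Credentials;

/**
 * Hypothetical: a callback the client (e.g. Hive) registers with the Tez AM,
 * letting Tez pull fresh delegation tokens whenever the current ones near
 * expiry, including while a DAG is running. Containers would need the
 * refreshed tokens propagated to them as well.
 */
public interface DelegationTokenRefresher {
  /** Obtain a fresh set of tokens (HBase and friends, path-based HDFS tokens, ...). */
  Credentials fetchFreshCredentials() throws IOException;
}
{code}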



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (TEZ-3951) TezClient wait too long for the DAGClient for prewarm; tries to shut down the wrong DAG

2018-06-07 Thread Sergey Shelukhin (JIRA)


 [ 
https://issues.apache.org/jira/browse/TEZ-3951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated TEZ-3951:
--
Attachment: TEZ-3951.01.patch

> TezClient wait too long for the DAGClient for prewarm; tries to shut down the 
> wrong DAG
> ---
>
> Key: TEZ-3951
> URL: https://issues.apache.org/jira/browse/TEZ-3951
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Major
> Attachments: TEZ-3951.01.patch, TEZ-3951.patch
>
>
> Follow-up from TEZ-3943



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3951) TezClient wait too long for the DAGClient for prewarm; tries to shut down the wrong DAG

2018-06-07 Thread Sergey Shelukhin (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16505223#comment-16505223
 ] 

Sergey Shelukhin commented on TEZ-3951:
---

Added a small test case to test the timeout.

> TezClient wait too long for the DAGClient for prewarm; tries to shut down the 
> wrong DAG
> ---
>
> Key: TEZ-3951
> URL: https://issues.apache.org/jira/browse/TEZ-3951
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Major
> Attachments: TEZ-3951.01.patch, TEZ-3951.patch
>
>
> Follow-up from TEZ-3943



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3951) TezClient wait too long for the DAGClient for prewarm; tries to shut down the wrong DAG

2018-06-07 Thread Sergey Shelukhin (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16505072#comment-16505072
 ] 

Sergey Shelukhin commented on TEZ-3951:
---

[~ewohlstadter] [~ashutoshc] ping?

> TezClient wait too long for the DAGClient for prewarm; tries to shut down the 
> wrong DAG
> ---
>
> Key: TEZ-3951
> URL: https://issues.apache.org/jira/browse/TEZ-3951
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Major
> Attachments: TEZ-3951.patch
>
>
> Follow-up from TEZ-3943



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-2218) Turn on speculation by default

2018-06-06 Thread Sergey Shelukhin (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503803#comment-16503803
 ] 

Sergey Shelukhin commented on TEZ-2218:
---

Thanks for the info... as far as I know, Hive doesn't turn on speculative 
execution on Tez. But it could potentially be something useful :)

> Turn on speculation by default
> --
>
> Key: TEZ-2218
> URL: https://issues.apache.org/jira/browse/TEZ-2218
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Bikas Saha
>Assignee: Bikas Saha
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3951) TezClient wait too long for the DAGClient for prewarm; tries to shut down the wrong DAG

2018-06-05 Thread Sergey Shelukhin (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16502606#comment-16502606
 ] 

Sergey Shelukhin commented on TEZ-3951:
---

Prewarm itself is a pretty obscure feature, and the time to wait before shutting 
down the prewarm DAG seems too esoteric to be a config setting. Any reason people 
would want to change it?

> TezClient wait too long for the DAGClient for prewarm; tries to shut down the 
> wrong DAG
> ---
>
> Key: TEZ-3951
> URL: https://issues.apache.org/jira/browse/TEZ-3951
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Major
> Attachments: TEZ-3951.patch
>
>
> Follow-up from TEZ-3943



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3951) TezClient wait too long for the DAGClient for prewarm; tries to shut down the wrong DAG

2018-06-05 Thread Sergey Shelukhin (JIRA)


[ 
https://issues.apache.org/jira/browse/TEZ-3951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16502587#comment-16502587
 ] 

Sergey Shelukhin commented on TEZ-3951:
---

[~ewohlstadter] can you take a look?



> TezClient wait too long for the DAGClient for prewarm; tries to shut down the 
> wrong DAG
> ---
>
> Key: TEZ-3951
> URL: https://issues.apache.org/jira/browse/TEZ-3951
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Major
> Attachments: TEZ-3951.patch
>
>
> Follow-up from TEZ-3943



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (TEZ-3951) TezClient wait too long for the DAGClient for prewarm; tries to shut down the wrong DAG

2018-06-05 Thread Sergey Shelukhin (JIRA)


 [ 
https://issues.apache.org/jira/browse/TEZ-3951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated TEZ-3951:
--
Attachment: TEZ-3951.patch

> TezClient wait too long for the DAGClient for prewarm; tries to shut down the 
> wrong DAG
> ---
>
> Key: TEZ-3951
> URL: https://issues.apache.org/jira/browse/TEZ-3951
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Major
> Attachments: TEZ-3951.patch
>
>
> Follow-up from TEZ-3943



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (TEZ-3951) TezClient wait too long for the DAGClient for prewarm; tries to shut down the wrong DAG

2018-06-05 Thread Sergey Shelukhin (JIRA)


 [ 
https://issues.apache.org/jira/browse/TEZ-3951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated TEZ-3951:
--
Attachment: (was: TEZ-3951.patch)

> TezClient wait too long for the DAGClient for prewarm; tries to shut down the 
> wrong DAG
> ---
>
> Key: TEZ-3951
> URL: https://issues.apache.org/jira/browse/TEZ-3951
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Major
>
> Follow-up from TEZ-3943



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (TEZ-3951) TezClient wait too long for the DAGClient for prewarm; tries to shut down the wrong DAG

2018-06-05 Thread Sergey Shelukhin (JIRA)


 [ 
https://issues.apache.org/jira/browse/TEZ-3951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated TEZ-3951:
--
Attachment: TEZ-3951.patch

> TezClient wait too long for the DAGClient for prewarm; tries to shut down the 
> wrong DAG
> ---
>
> Key: TEZ-3951
> URL: https://issues.apache.org/jira/browse/TEZ-3951
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Major
> Attachments: TEZ-3951.patch
>
>
> Follow-up from TEZ-3943



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (TEZ-3951) TezClient wait too long for the DAGClient for prewarm; tries to shut down the wrong DAG

2018-06-05 Thread Sergey Shelukhin (JIRA)


 [ 
https://issues.apache.org/jira/browse/TEZ-3951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated TEZ-3951:
--
Summary: TezClient wait too long for the DAGClient for prewarm; tries to 
shut down the wrong DAG  (was: TezClient wait too long for the DAGClient for 
prewarm)

> TezClient wait too long for the DAGClient for prewarm; tries to shut down the 
> wrong DAG
> ---
>
> Key: TEZ-3951
> URL: https://issues.apache.org/jira/browse/TEZ-3951
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Major
> Attachments: TEZ-3951.patch
>
>
> Follow-up from TEZ-3943



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (TEZ-3951) TezClient wait too long for the DAGClient for prewarm

2018-06-05 Thread Sergey Shelukhin (JIRA)
Sergey Shelukhin created TEZ-3951:
-

 Summary: TezClient wait too long for the DAGClient for prewarm
 Key: TEZ-3951
 URL: https://issues.apache.org/jira/browse/TEZ-3951
 Project: Apache Tez
  Issue Type: Bug
Reporter: Sergey Shelukhin
Assignee: Sergey Shelukhin


Follow-up from TEZ-3943



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-2218) Turn on speculation by default

2018-05-25 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16491428#comment-16491428
 ] 

Sergey Shelukhin commented on TEZ-2218:
---

Hmm...

> Turn on speculation by default
> --
>
> Key: TEZ-2218
> URL: https://issues.apache.org/jira/browse/TEZ-2218
> Project: Apache Tez
>  Issue Type: Sub-task
>Reporter: Bikas Saha
>Assignee: Bikas Saha
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-2132) Support fault tolerance & speculation in pipelined data transfer for ordered output

2018-05-25 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16491425#comment-16491425
 ] 

Sergey Shelukhin commented on TEZ-2132:
---

hmm...

> Support fault tolerance & speculation in pipelined data transfer for ordered 
> output
> ---
>
> Key: TEZ-2132
> URL: https://issues.apache.org/jira/browse/TEZ-2132
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Rajesh Balamohan
>Priority: Major
>
> Follow up of TEZ-2001.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3943) TezClient leaks DAGClient for prewarm

2018-05-25 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16491395#comment-16491395
 ] 

Sergey Shelukhin commented on TEZ-3943:
---

Looks like TestSecureShuffle is pretty unstable. 
https://builds.apache.org/job/PreCommit-TEZ-Build/2815/testReport/org.apache.tez.test/TestSecureShuffle/testSecureShuffle_test_sslInCluster_true__resultWithTezSSL_0__resultWithoutTezSSL_1__asyncHttp_false__/history/

The other test also fails often: 
https://builds.apache.org/job/PreCommit-TEZ-Build/2815/testReport/org.apache.tez.test/TestAMRecovery/testVertexCompletelyFinished_Broadcast/history/

[~ashutoshc] [~hagleitn] can you commit? [~ewohlstadter] reviewed above, but 
neither of us is a committer.

> TezClient leaks DAGClient for prewarm
> -
>
> Key: TEZ-3943
> URL: https://issues.apache.org/jira/browse/TEZ-3943
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Major
> Attachments: TEZ-3943.01.patch, TEZ-3943.02.patch, TEZ-3943.patch
>
>
> This may in turn leak some security related threads.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (TEZ-3943) TezClient leaks DAGClient for prewarm

2018-05-25 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated TEZ-3943:
--
Attachment: TEZ-3943.02.patch

> TezClient leaks DAGClient for prewarm
> -
>
> Key: TEZ-3943
> URL: https://issues.apache.org/jira/browse/TEZ-3943
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Major
> Attachments: TEZ-3943.01.patch, TEZ-3943.02.patch, TEZ-3943.patch
>
>
> This may in turn leak some security related threads.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3943) TezClient leaks DAGClient for prewarm

2018-05-25 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16491243#comment-16491243
 ] 

Sergey Shelukhin commented on TEZ-3943:
---

Fixed one of the client tests. The other ones pass for me.

> TezClient leaks DAGClient for prewarm
> -
>
> Key: TEZ-3943
> URL: https://issues.apache.org/jira/browse/TEZ-3943
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Major
> Attachments: TEZ-3943.01.patch, TEZ-3943.patch
>
>
> This may in turn leak some security related threads.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (TEZ-3943) TezClient leaks DAGClient for prewarm

2018-05-25 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated TEZ-3943:
--
Attachment: TEZ-3943.01.patch

> TezClient leaks DAGClient for prewarm
> -
>
> Key: TEZ-3943
> URL: https://issues.apache.org/jira/browse/TEZ-3943
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Major
> Attachments: TEZ-3943.01.patch, TEZ-3943.patch
>
>
> This may in turn leak some security related threads.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3943) TezClient leaks DAGClient for prewarm

2018-05-24 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16489808#comment-16489808
 ] 

Sergey Shelukhin commented on TEZ-3943:
---

[~ewohlstadter] can you take a look? 

> TezClient leaks DAGClient for prewarm
> -
>
> Key: TEZ-3943
> URL: https://issues.apache.org/jira/browse/TEZ-3943
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Major
> Attachments: TEZ-3943.patch
>
>
> This may in turn leak some security related threads.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (TEZ-3943) TezClient leaks DAGClient for prewarm

2018-05-24 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated TEZ-3943:
--
Attachment: TEZ-3943.patch

> TezClient leaks DAGClient for prewarm
> -
>
> Key: TEZ-3943
> URL: https://issues.apache.org/jira/browse/TEZ-3943
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Major
> Attachments: TEZ-3943.patch
>
>
> This may in turn leak some security related threads.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (TEZ-3943) TezClient leaks DAGClient for prewarm

2018-05-24 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated TEZ-3943:
--
Description: This may in turn leak some security related threads.

> TezClient leaks DAGClient for prewarm
> -
>
> Key: TEZ-3943
> URL: https://issues.apache.org/jira/browse/TEZ-3943
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Major
>
> This may in turn leak some security related threads.
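As an aside, a minimal sketch of the client-side pattern that avoids leaking a 
DAGClient, assuming the caller owns the instance; DAGClient is Closeable, so 
try-with-resources releases it:

{code}
import org.apache.tez.client.TezClient;
import org.apache.tez.dag.api.DAG;
import org.apache.tez.dag.api.client.DAGClient;
import org.apache.tez.dag.api.client.DAGStatus;

public final class DagClientCleanupSketch {
  /** Sketch: close every DAGClient, even ones for internal DAGs like prewarm. */
  static DAGStatus runAndClose(TezClient tezClient, DAG dag) throws Exception {
    try (DAGClient dagClient = tezClient.submitDAG(dag)) {
      return dagClient.waitForCompletion();
    }
  }
}
{code}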



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (TEZ-3943) TezClient leaks DAGClient for prewarm

2018-05-24 Thread Sergey Shelukhin (JIRA)
Sergey Shelukhin created TEZ-3943:
-

 Summary: TezClient leaks DAGClient for prewarm
 Key: TEZ-3943
 URL: https://issues.apache.org/jira/browse/TEZ-3943
 Project: Apache Tez
  Issue Type: Bug
Reporter: Sergey Shelukhin
Assignee: Sergey Shelukhin






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3904) an API to update tokens for Tez AM and the DAG

2018-05-22 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16484426#comment-16484426
 ] 

Sergey Shelukhin commented on TEZ-3904:
---

Not that I know of. However, MR is deprecated in Hive, so we basically only 
care about Tez and Spark. Not sure what Spark does for that, if it uses 
delegation tokens at all.

> an API to update tokens for Tez AM and the DAG
> --
>
> Key: TEZ-3904
> URL: https://issues.apache.org/jira/browse/TEZ-3904
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Priority: Major
>
> Nothing is permanent in this world, least of all delegation tokens.
> The current way around token expiration (the one where you cannot keep 
> renewing anymore) in Hive when Tez AM is used in session mode is to cycle Tez 
> AM. It may happen though that a query is running at that time, and so the AM 
> cannot be restarted with new tokens. We let the query run its course and it 
> usually dies because it tries to do something with an expired token.
> To get around that, we cycle AMs a few hours before tokens are going to 
> expire.
> However, that is still not ideal because it puts an upper bound on safe Hive 
> query runtime (a query longer than 3 hours with current config may fail due 
> to an expired token if its timing is unlucky), and also precludes setting 
> tokens to expire much faster than the standard 7-day time frame.
> There should be a mechanism to replace tokens in the AM, including for a 
> running DAG.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3906) general purpose plugin interface

2018-03-19 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16405697#comment-16405697
 ] 

Sergey Shelukhin commented on TEZ-3906:
---

cc [~hagleitn] [~ewohlstadter]: this will be needed to expand AM recovery in 
Hive beyond LLAP-using AMs. Currently there's no other way (that I know of) to 
host the AM registry in the AM.

> general purpose plugin interface
> 
>
> Key: TEZ-3906
> URL: https://issues.apache.org/jira/browse/TEZ-3906
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Eric Wohlstadter
>Priority: Major
>
> Tez has plugin interfaces for the communicator, the scheduler, and so on.
> It would be nice to be able to host general-purpose code in the Tez AM for 
> particular purposes.
> E.g. currently the Hive AM registry (which may contain Hive-specific code that 
> is neither appropriate for the Tez codebase nor convenient w.r.t. compatibility 
> and flexibility when it changes) is hosted via the LLAP plugin, which limits 
> its applicability.
> It would be nice to be able to add one or several general-purpose plugins 
> with a start/stop interface.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (TEZ-3906) general purpose plugin interface

2018-03-19 Thread Sergey Shelukhin (JIRA)
Sergey Shelukhin created TEZ-3906:
-

 Summary: general purpose plugin interface
 Key: TEZ-3906
 URL: https://issues.apache.org/jira/browse/TEZ-3906
 Project: Apache Tez
  Issue Type: Bug
Reporter: Sergey Shelukhin
Assignee: Eric Wohlstadter


Tez has plugin interfaces for the communicator, the scheduler, and so on.
It would be nice to be able to host general-purpose code in the Tez AM for 
particular purposes.
E.g. currently the Hive AM registry (which may contain Hive-specific code that 
is neither appropriate for the Tez codebase nor convenient w.r.t. compatibility 
and flexibility when it changes) is hosted via the LLAP plugin, which limits its 
applicability.

It would be nice to be able to add one or several general-purpose plugins with a 
start/stop interface.
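As a strawman, what such a plugin contract could look like; the interface below 
is invented for illustration and is not an actual Tez API:

{code}
/**
 * Hypothetical: general-purpose code hosted in the Tez AM, instantiated from
 * configuration and driven through a simple lifecycle, so things like a Hive
 * AM registry would not need to piggyback on the LLAP plugin.
 */
public interface AMServicePlugin {
  /** Called once when the AM starts. */
  void start() throws Exception;

  /** Called when the AM shuts down; release threads and connections here. */
  void stop() throws Exception;
}
{code}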



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3900) upgrade to a recent guava version

2018-03-09 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16393675#comment-16393675
 ] 

Sergey Shelukhin commented on TEZ-3900:
---

Never mind, we had actually just broken our own shading, so the tez jar came 
first on the classpath after that. I guess this can wait for TEZ-2164.

> upgrade to a recent guava version
> -
>
> Key: TEZ-3900
> URL: https://issues.apache.org/jira/browse/TEZ-3900
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Major
> Attachments: TEZ-3900.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3900) upgrade to a recent guava version

2018-03-08 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391955#comment-16391955
 ] 

Sergey Shelukhin commented on TEZ-3900:
---

Where can one view the best practices? :)
We were deploying Guava 11 because of the dependency version in Tez, and are 
hitting issues in Hive due to the lack of API compat (Hive is on 19). 
We are going to try deploying 19 for everything and see if it works.

> upgrade to a recent guava version
> -
>
> Key: TEZ-3900
> URL: https://issues.apache.org/jira/browse/TEZ-3900
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Major
> Attachments: TEZ-3900.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (TEZ-3904) an API to update tokens for Tez AM and the DAG

2018-03-08 Thread Sergey Shelukhin (JIRA)
Sergey Shelukhin created TEZ-3904:
-

 Summary: an API to update tokens for Tez AM and the DAG
 Key: TEZ-3904
 URL: https://issues.apache.org/jira/browse/TEZ-3904
 Project: Apache Tez
  Issue Type: Bug
Reporter: Sergey Shelukhin


Nothing is permanent in this world, least of all delegation tokens.
The current way around token expiration (the one where you cannot keep renewing 
anymore) in Hive when Tez AM is used in session mode is to cycle Tez AM. It may 
happen though that a query is running at that time, and so the AM cannot be 
restarted with new tokens. We let the query run its course and it usually dies 
because it tries to do something with an expired token.
To get around that, we cycle AMs a few hours before tokens are going to expire.
However, that is still not ideal because it puts an upper bound on safe Hive 
query runtime (a query longer than 3 hours with current config may fail due to 
an expired token if its timing is unlucky), and also precludes setting tokens 
to expire much faster than the standard 7-day time frame.

There should be a mechanism to replace tokens in the AM, including for a 
running DAG.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3900) upgrade to a recent guava version

2018-03-07 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16390121#comment-16390121
 ] 

Sergey Shelukhin commented on TEZ-3900:
---

Hmm... the problem we actually have is that Hive has recently upgraded Guava, 
and we are hitting issues due to the old version coming from Tez.
The referenced JIRA hasn't had any progress in almost two years; I wonder if 
it's time to just upgrade Guava and force Hive/Pig to upgrade too (Hive has 
already done that, in fact).

> upgrade to a recent guava version
> -
>
> Key: TEZ-3900
> URL: https://issues.apache.org/jira/browse/TEZ-3900
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Major
> Attachments: TEZ-3900.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3900) upgrade to a recent guava version

2018-03-06 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16388901#comment-16388901
 ] 

Sergey Shelukhin commented on TEZ-3900:
---

[~ewohlstadter] [~hagleitn] can you take a look?

> upgrade to a recent guava version
> -
>
> Key: TEZ-3900
> URL: https://issues.apache.org/jira/browse/TEZ-3900
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Major
> Attachments: TEZ-3900.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (TEZ-3900) upgrade to a recent guava version

2018-03-06 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated TEZ-3900:
--
Attachment: TEZ-3900.patch

> upgrade to a recent guava version
> -
>
> Key: TEZ-3900
> URL: https://issues.apache.org/jira/browse/TEZ-3900
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>Priority: Major
> Attachments: TEZ-3900.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (TEZ-3900) upgrade to a recent guava version

2018-03-06 Thread Sergey Shelukhin (JIRA)
Sergey Shelukhin created TEZ-3900:
-

 Summary: upgrade to a recent guava version
 Key: TEZ-3900
 URL: https://issues.apache.org/jira/browse/TEZ-3900
 Project: Apache Tez
  Issue Type: Bug
Reporter: Sergey Shelukhin
Assignee: Sergey Shelukhin






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3892) getClient API for TezClient

2018-03-06 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16388447#comment-16388447
 ] 

Sergey Shelukhin commented on TEZ-3892:
---

+1, but I'm also not a Tez committer. cc [~hagleitn] can you +1?

> getClient API for TezClient
> ---
>
> Key: TEZ-3892
> URL: https://issues.apache.org/jira/browse/TEZ-3892
> Project: Apache Tez
>  Issue Type: New Feature
>Reporter: Eric Wohlstadter
>Assignee: Eric Wohlstadter
>Priority: Major
> Attachments: TEZ-3892.1.patch, TEZ-3892.2.patch
>
>
> This is a proposed opt-in feature.
> Tez AM already supports long-lived sessions; if desired, an AM session can 
> live indefinitely.
> However, new clients cannot connect to a long-lived AM session through the 
> standard TezClient API. 
> TezClient API only provides a "start" method to initiate a connection, which 
> always allocates a new AM from YARN.
>  # For interactive BI use-cases, this startup time can be significant.
>  # Hive is implementing a HiveServer2 High Availability feature.
>  ** When the singleton HS2 master server fails, the HS2 client is quickly 
> redirected to a pre-warmed HS2 backup. 
>  # For the failover to complete quickly end-to-end, a Tez AM must also be 
> pre-warmed and ready to accept connections.
> For more information, see design for: 
> https://issues.apache.org/jira/browse/HIVE-18281.
> 
> Anticipated changes:
>  # A {{getClient(ApplicationId)}} method is added to TezClient. The 
> functionality is similar to {{start}}.
>  ** Code related to launching a new AM from the RM is factored out.
>  ** Since {{start}} and {{getClient}} will share some code, this code is 
> refactored into reusable helper methods.
>  ** A usage example is added to {{org/apache/tez/examples}}
>  # It is not a goal of this JIRA to ensure that running Tez DAGs can be 
> recovered by a client using the getClient API. The goal is only for 
> maintaining a pool of warm Tez AMs to skip RM/container/JVM startup.
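A rough usage sketch of the proposed flow, based only on the description above; 
the exact method signature is whatever the attached patches define, and 
{{appId}} is assumed to identify a pre-warmed AM obtained out of band:

{code}
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.tez.client.TezClient;
import org.apache.tez.dag.api.TezConfiguration;

public final class WarmAmAttachSketch {
  static TezClient attach(TezConfiguration conf, ApplicationId appId) throws Exception {
    TezClient client = TezClient.newBuilder("hs2-session", conf).build();
    // Proposed in this JIRA: connect to an existing, pre-warmed AM instead of
    // client.start(), which always allocates a new AM from YARN.
    client.getClient(appId);
    return client;
  }
}
{code}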



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3892) getClient API for TezClient

2018-03-05 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16387144#comment-16387144
 ] 

Sergey Shelukhin commented on TEZ-3892:
---

Can you post on RB? thnx

> getClient API for TezClient
> ---
>
> Key: TEZ-3892
> URL: https://issues.apache.org/jira/browse/TEZ-3892
> Project: Apache Tez
>  Issue Type: New Feature
>Reporter: Eric Wohlstadter
>Assignee: Eric Wohlstadter
>Priority: Major
> Attachments: TEZ-3892.1.patch, TEZ-3892.2.patch
>
>
> This is a proposed opt-in feature.
> Tez AM already supports long-lived sessions; if desired, an AM session can 
> live indefinitely.
> However, new clients cannot connect to a long-lived AM session through the 
> standard TezClient API. 
> TezClient API only provides a "start" method to initiate a connection, which 
> always allocates a new AM from YARN.
>  # For interactive BI use-cases, this startup time can be significant.
>  # Hive is implementing a HiveServer2 High Availability feature.
>  ** When the singleton HS2 master server fails, the HS2 client is quickly 
> redirected to a pre-warmed HS2 backup. 
>  # For the failover to complete quickly end-to-end, a Tez AM must also be 
> pre-warmed and ready to accept connections.
> For more information, see design for: 
> https://issues.apache.org/jira/browse/HIVE-18281.
> 
> Anticipated changes:
>  # A {{getClient(ApplicationId)}} method is added to TezClient. The 
> functionality is similar to {{start}}.
>  ** Code related to launching a new AM from the RM is factored out.
>  ** Since {{start}} and {{getClient}} will share some code, this code is 
> refactored into reusable helper methods.
>  ** A usage example is added to {{org/apache/tez/examples}}
>  # It is not a goal of this JIRA to ensure that running Tez DAGs can be 
> recovered by a client using the getClient API. The goal is only for 
> maintaining a pool of warm Tez AMs to skip RM/container/JVM startup.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (TEZ-3892) getClient API for TezClient

2018-02-23 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin reassigned TEZ-3892:
-

Assignee: Eric Wohlstadter

> getClient API for TezClient
> ---
>
> Key: TEZ-3892
> URL: https://issues.apache.org/jira/browse/TEZ-3892
> Project: Apache Tez
>  Issue Type: New Feature
>Reporter: Eric Wohlstadter
>Assignee: Eric Wohlstadter
>Priority: Major
> Attachments: TEZ-3892.1.patch
>
>
> This is a proposed opt-in feature.
> Tez AM already supports long-lived sessions; if desired, an AM session can 
> live indefinitely.
> However, new clients cannot connect to a long-lived AM session through the 
> standard TezClient API. 
> TezClient API only provides a "start" method to initiate a connection, which 
> always allocates a new AM from YARN.
>  # For interactive BI use-cases, this startup time can be significant.
>  # Hive is implementing a HiveServer2 High Availability feature.
>  ** When the singleton HS2 master server fails, the HS2 client is quickly 
> redirected to a pre-warmed HS2 backup. 
>  # For the failover to complete quickly end-to-end, a Tez AM must also be 
> pre-warmed and ready to accept connections.
> For more information, see design for: 
> https://issues.apache.org/jira/browse/HIVE-18281.
> 
> Anticipated changes:
>  # A {{reconnect(ApplicationId)}} method is added to TezClient. The 
> functionality is similar to {{start}}
>  ** Code related to launching a new AM from the RM is factored out.
>  ** Since {{start}} and {{reconnect}} will share some code, this code is 
> refactored into reusable helper methods.
>  ** A usage example is added to {{org/apache/tez/examples}}
>  # It is not a goal of this JIRA to ensure that running Tez DAGs can be 
> recovered by a client using the {{reconnect}} API. The goal is only for 
> maintaining a pool of warm Tez AMs to skip RM/container/JVM startup.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TEZ-3880) do not count rejected tasks as killed in vertex progress

2018-01-10 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16321209#comment-16321209
 ] 

Sergey Shelukhin commented on TEZ-3880:
---

[~hagleitn] can you please commit it? I'm not a Tez committer :)

> do not count rejected tasks as killed in vertex progress
> 
>
> Key: TEZ-3880
> URL: https://issues.apache.org/jira/browse/TEZ-3880
> Project: Apache Tez
>  Issue Type: Task
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Attachments: TEZ-3880.01.patch, TEZ-3880.02.patch, TEZ-3880.patch
>
>
> Tasks rejected from LLAP because the cluster is full are shown as killed 
> tasks in the command-line query UI (CLI and beeline). This shouldn't really 
> happen; a killed task in the container case means something else, and this 
> scenario doesn't exist there because the AM doesn't continuously try to queue 
> tasks. We could change the LLAP queue to use a sort of pull model (which would 
> also allow for better duplicate scheduling), but for now we should fix the UI.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (TEZ-3880) do not count rejected tasks as killed in vertex progress

2018-01-09 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16319389#comment-16319389
 ] 

Sergey Shelukhin commented on TEZ-3880:
---

[~hagleitn] can you take a look at the updated patch? thanks

> do not count rejected tasks as killed in vertex progress
> 
>
> Key: TEZ-3880
> URL: https://issues.apache.org/jira/browse/TEZ-3880
> Project: Apache Tez
>  Issue Type: Task
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Attachments: TEZ-3880.01.patch, TEZ-3880.02.patch, TEZ-3880.patch
>
>
> Tasks rejected from LLAP because the cluster is full are shown as killed 
> tasks in the command-line query UI (CLI and beeline). This shouldn't really 
> happen; a killed task in the container case means something else, and this 
> scenario doesn't exist there because the AM doesn't continuously try to queue 
> tasks. We could change the LLAP queue to use a sort of pull model (which would 
> also allow for better duplicate scheduling), but for now we should fix the UI.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (TEZ-3880) do not count rejected tasks as killed in vertex progress

2018-01-08 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16316923#comment-16316923
 ] 

Sergey Shelukhin commented on TEZ-3880:
---

[~sseth] perfect timing ;) Fixed the test.
The follow-up JIRA is supposed to address that. Instead of (or in addition to) 
classifying killed vs. failed, I'd like to have tasks grouped by error type. 
Phase 4 ;)

> do not count rejected tasks as killed in vertex progress
> 
>
> Key: TEZ-3880
> URL: https://issues.apache.org/jira/browse/TEZ-3880
> Project: Apache Tez
>  Issue Type: Task
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Attachments: TEZ-3880.01.patch, TEZ-3880.02.patch, TEZ-3880.patch
>
>
> Tasks rejected from LLAP because the cluster is full are shown as killed 
> tasks in the command-line query UI (CLI and beeline). This shouldn't really 
> happen; a killed task in the container case means something else, and this 
> scenario doesn't exist there because the AM doesn't continuously try to queue 
> tasks. We could change the LLAP queue to use a sort of pull model (which would 
> also allow for better duplicate scheduling), but for now we should fix the UI.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (TEZ-3880) do not count rejected tasks as killed in vertex progress

2018-01-08 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated TEZ-3880:
--
Attachment: TEZ-3880.02.patch

Fixed the test... This requires adding a field to Progress, but the field is 
optional, so the change is backward compatible.
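To illustrate why an optional field is backward compatible: protobuf generates 
a has-method for optional fields, so a new client can tolerate an old AM that 
never sets it. The accessor names below are hypothetical stand-ins for the 
generated Progress code:

{code}
public final class ProgressCompatSketch {
  /** Hypothetical stand-in for the generated proto surface of Progress. */
  interface ProgressProtoView {
    boolean hasRejectedTaskAttemptCount();
    long getRejectedTaskAttemptCount();
  }

  static long rejectedOrZero(ProgressProtoView progress) {
    // Old AMs never set the new field; treat absence as zero instead of failing.
    return progress.hasRejectedTaskAttemptCount()
        ? progress.getRejectedTaskAttemptCount() : 0L;
  }
}
{code}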

> do not count rejected tasks as killed in vertex progress
> 
>
> Key: TEZ-3880
> URL: https://issues.apache.org/jira/browse/TEZ-3880
> Project: Apache Tez
>  Issue Type: Task
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Attachments: TEZ-3880.01.patch, TEZ-3880.02.patch, TEZ-3880.patch
>
>
> Tasks rejected from LLAP because the cluster is full are shown as killed 
> tasks in the command-line query UI (CLI and beeline). This shouldn't really 
> happen; a killed task in the container case means something else, and this 
> scenario doesn't exist there because the AM doesn't continuously try to queue 
> tasks. We could change the LLAP queue to use a sort of pull model (which would 
> also allow for better duplicate scheduling), but for now we should fix the UI.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (TEZ-3880) do not count rejected tasks as killed in vertex progress

2018-01-05 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated TEZ-3880:
--
Attachment: TEZ-3880.01.patch

> do not count rejected tasks as killed in vertex progress
> 
>
> Key: TEZ-3880
> URL: https://issues.apache.org/jira/browse/TEZ-3880
> Project: Apache Tez
>  Issue Type: Task
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Attachments: TEZ-3880.01.patch, TEZ-3880.patch
>
>
> Tasks rejected from LLAP because the cluster is full are shown as killed 
> tasks in the command-line query UI (CLI and beeline). This shouldn't really 
> happen; a killed task in the container case means something else, and this 
> scenario doesn't exist there because the AM doesn't continuously try to queue 
> tasks. We could change the LLAP queue to use a sort of pull model (which would 
> also allow for better duplicate scheduling), but for now we should fix the UI.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (TEZ-3880) do not count rejected tasks as killed in vertex progress

2018-01-05 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated TEZ-3880:
--
Attachment: (was: TEZ-3880.01.patch)

> do not count rejected tasks as killed in vertex progress
> 
>
> Key: TEZ-3880
> URL: https://issues.apache.org/jira/browse/TEZ-3880
> Project: Apache Tez
>  Issue Type: Task
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Attachments: TEZ-3880.01.patch, TEZ-3880.patch
>
>
> Tasks rejected from LLAP because the cluster is full are shown as killed 
> tasks in the command-line query UI (CLI and beeline). This shouldn't really 
> happen; a killed task in the container case means something else, and this 
> scenario doesn't exist there because the AM doesn't continuously try to queue 
> tasks. We could change the LLAP queue to use a sort of pull model (which would 
> also allow for better duplicate scheduling), but for now we should fix the UI.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (TEZ-3880) do not count rejected tasks as killed in vertex progress

2018-01-05 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated TEZ-3880:
--
Attachment: TEZ-3880.01.patch

Removed the TODOs, and added a test

> do not count rejected tasks as killed in vertex progress
> 
>
> Key: TEZ-3880
> URL: https://issues.apache.org/jira/browse/TEZ-3880
> Project: Apache Tez
>  Issue Type: Task
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Attachments: TEZ-3880.01.patch, TEZ-3880.patch
>
>
> Tasks rejected from LLAP because the cluster is full are shown as killed 
> tasks in the command-line query UI (CLI and beeline). This shouldn't really 
> happen; a killed task in the container case means something else, and this 
> scenario doesn't exist there because the AM doesn't continuously try to queue 
> tasks. We could change the LLAP queue to use a sort of pull model (which would 
> also allow for better duplicate scheduling), but for now we should fix the UI.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (TEZ-3880) do not count rejected tasks as killed in vertex progress

2018-01-05 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16314066#comment-16314066
 ] 

Sergey Shelukhin commented on TEZ-3880:
---

I don't see it used anywhere in the codebase, so I'm assuming it's unused. I 
can remove the TODOs.


> do not count rejected tasks as killed in vertex progress
> 
>
> Key: TEZ-3880
> URL: https://issues.apache.org/jira/browse/TEZ-3880
> Project: Apache Tez
>  Issue Type: Task
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Attachments: TEZ-3880.patch
>
>
> Tasks rejected from LLAP because the cluster is full are shown as killed 
> tasks in the command-line query UI (CLI and beeline). This shouldn't really 
> happen; a killed task in the container case means something else, and this 
> scenario doesn't exist there because the AM doesn't continuously try to queue 
> tasks. We could change the LLAP queue to use a sort of pull model (which would 
> also allow for better duplicate scheduling), but for now we should fix the UI.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (TEZ-3880) do not count rejected tasks as killed in vertex progress

2018-01-04 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312260#comment-16312260
 ] 

Sergey Shelukhin commented on TEZ-3880:
---

When the AM tries to schedule on LLAP and there's no capacity, it treats the 
task attempt as killed with a SERVICE_BUSY error.
This is not really a killed task, just an artifact of fitting LLAP, which works 
differently, into a model based on how the RM gives out containers (similarly, 
queueing in LLAP is not accounted for in the current Tez model, because YARN 
handles queueing differently, through the RM).
On a full cluster, this inflates the killed task attempt counter in the UI.

> do not count rejected tasks as killed in vertex progress
> 
>
> Key: TEZ-3880
> URL: https://issues.apache.org/jira/browse/TEZ-3880
> Project: Apache Tez
>  Issue Type: Task
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Attachments: TEZ-3880.patch
>
>
> Tasks rejected from LLAP because the cluster is full are shown as killed 
> tasks in the command-line query UI (CLI and beeline). This shouldn't really 
> happen; a killed task in the container case means something else, and this 
> scenario doesn't exist there because the AM doesn't continuously try to queue 
> tasks. We could change the LLAP queue to use a sort of pull model (which would 
> also allow for better duplicate scheduling), but for now we should fix the UI.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (TEZ-3880) do not count rejected tasks as killed in vertex progress

2018-01-04 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated TEZ-3880:
--
Attachment: TEZ-3880.patch

[~ewohlstadter] [~sseth] can you take a look? thnx

> do not count rejected tasks as killed in vertex progress
> 
>
> Key: TEZ-3880
> URL: https://issues.apache.org/jira/browse/TEZ-3880
> Project: Apache Tez
>  Issue Type: Task
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Attachments: TEZ-3880.patch
>
>
> Tasks rejected from LLAP because the cluster is full are shown as killed 
> tasks in the commandline query UI (CLI and beeline). This shouldn't really 
> happen; killed tasks in the container case means something else, and this 
> scenario doesn't exist because AM doesn't continuously try to queue tasks. We 
> could change LLAP queue to use sort of a pull model (would also allow for 
> better duplicate scheduling), but for now we should fix the UI



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (TEZ-3881) expose kill reason, as opposed to just a single kill count, in vertex progress

2018-01-04 Thread Sergey Shelukhin (JIRA)
Sergey Shelukhin created TEZ-3881:
-

 Summary: expose kill reason, as opposed to just a single kill 
count, in vertex progress
 Key: TEZ-3881
 URL: https://issues.apache.org/jira/browse/TEZ-3881
 Project: Apache Tez
  Issue Type: Bug
Reporter: Sergey Shelukhin


A followup from TEZ-3880 that would provide more information to the callers to 
decide what to display and how. An API change, so we can do it in phase 2.

cc [~EricWohlstadter] 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (TEZ-3880) do not count rejected tasks as killed in vertex progress

2018-01-04 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated TEZ-3880:
--
Summary: do not count rejected tasks as killed in vertex progress  (was: do 
not show rejected tasks as killed in query UI)

> do not count rejected tasks as killed in vertex progress
> 
>
> Key: TEZ-3880
> URL: https://issues.apache.org/jira/browse/TEZ-3880
> Project: Apache Tez
>  Issue Type: Task
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>
> Tasks rejected from LLAP because the cluster is full are shown as killed 
> tasks in the commandline query UI (CLI and beeline). This shouldn't really 
> happen; killed tasks in the container case means something else, and this 
> scenario doesn't exist because AM doesn't continuously try to queue tasks. We 
> could change LLAP queue to use sort of a pull model (would also allow for 
> better duplicate scheduling), but for now we should fix the UI



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Moved] (TEZ-3880) do not show rejected tasks as killed in query UI

2018-01-04 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin moved HIVE-18074 to TEZ-3880:
--

Key: TEZ-3880  (was: HIVE-18074)
Project: Apache Tez  (was: Hive)

> do not show rejected tasks as killed in query UI
> 
>
> Key: TEZ-3880
> URL: https://issues.apache.org/jira/browse/TEZ-3880
> Project: Apache Tez
>  Issue Type: Task
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>
> Tasks rejected from LLAP because the cluster is full are shown as killed 
> tasks in the commandline query UI (CLI and beeline). This shouldn't really 
> happen; killed tasks in the container case means something else, and this 
> scenario doesn't exist because AM doesn't continuously try to queue tasks. We 
> could change LLAP queue to use sort of a pull model (would also allow for 
> better duplicate scheduling), but for now we should fix the UI



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (TEZ-3879) potential abort propagation issue (race?)

2017-12-19 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16297650#comment-16297650
 ] 

Sergey Shelukhin commented on TEZ-3879:
---

The whole log is very large... is there something in particular you're 
interested in? There aren't really any more log statements for these tasks at 
that point in time.
The task finished successfully, and only at the end was it found to have been 
killed.

> potential abort propagation issue (race?)
> -
>
> Key: TEZ-3879
> URL: https://issues.apache.org/jira/browse/TEZ-3879
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>
> I'm looking at a Hive LLAP query where AM aborts some tasks for whatever 
> reason (AM preemption). 
> On the nodes, the abort is handled by TezTaskRunner2 and it looks like 
> there's some race there for some cases.
> Most tasks receive abort normally, like so (the first thing Hive TezProcessor 
> does on any abort is log "Received abort").
> {noformat}
> 2017-12-18T14:44:26,616 INFO  [TaskHeartbeatThread ()] 
> org.apache.tez.runtime.task.TezTaskRunner2: Attempting to abort 
> attempt_1513367667720_3619_1_02_12_0 due to an invocation of 
> shutdownRequested
> 2017-12-18T14:44:26,621 INFO  [TaskHeartbeatThread ()] 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor: Received abort
> 2017-12-18T14:44:26,622 INFO  [TaskHeartbeatThread ()] 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor: Forwarding abort to 
> RecordProcessor
> 2017-12-18T14:44:26,622 INFO  [TaskHeartbeatThread ()] 
> org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor: Forwarding abort to 
> mapOp: {} MAP
> {noformat}
> However on some tasks that are terminated shortly after init, TezProcessor is 
> never called. Moreover, when AM tries to kill the task again (when it's 
> already running, having ignored the abort) Tez says the task is already 
> aborted and doesn't propagate this either.
> {noformat}
> 2017-12-18T14:47:22,995  INFO [TezTR-667720_3619_3_2_12_0 
> (1513367667720_3619_3_02_12_0)] 
> reducesink.VectorReduceSinkCommonOperator: Using tag = -1
> (this is the end of Hive init)
> ...
> 2017-12-18T14:47:23,133 INFO  [TaskHeartbeatThread ()] 
> org.apache.tez.runtime.task.TezTaskRunner2: Attempting to abort 
> attempt_1513367667720_3619_3_02_12_0 due to an invocation of 
> shutdownRequested
> (no TezProcessor log statements)
> {noformat}
> The task keeps running and the next kill is ignored
> {noformat}
> 2017-12-18T14:47:23,575 INFO  [IPC Server handler 2 on 40617 ()] 
> org.apache.hadoop.hive.llap.daemon.impl.ContainerRunnerImpl: DBG: Received 
> terminateFragment request for attempt_1513367667720_3619_3_02_12_0
> ...
> 2017-12-18T14:47:23,575 INFO  [IPC Server handler 2 on 40617 ()] 
> org.apache.tez.runtime.task.TezTaskRunner2: Ignoring killTask request since 
> the task with id attempt_1513367667720_3619_3_02_12_0 has ended for 
> reason: CONTAINER_STOP_REQUESTED. IgnoredError:  
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (TEZ-3879) potential abort propagation issue (race?)

2017-12-19 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated TEZ-3879:
--
Description: 
I'm looking at a Hive LLAP query where AM aborts some tasks for whatever reason 
(AM preemption). 
On the nodes, the abort is handled by TezTaskRunner2 and it looks like there's 
some race there for some cases.
Most tasks receive abort normally, like so (the first thing Hive TezProcessor 
does on any abort is log "Received abort").
{noformat}
2017-12-18T14:44:26,616 INFO  [TaskHeartbeatThread ()] 
org.apache.tez.runtime.task.TezTaskRunner2: Attempting to abort 
attempt_1513367667720_3619_1_02_12_0 due to an invocation of 
shutdownRequested
2017-12-18T14:44:26,621 INFO  [TaskHeartbeatThread ()] 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor: Received abort
2017-12-18T14:44:26,622 INFO  [TaskHeartbeatThread ()] 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor: Forwarding abort to 
RecordProcessor
2017-12-18T14:44:26,622 INFO  [TaskHeartbeatThread ()] 
org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor: Forwarding abort to 
mapOp: {} MAP
{noformat}

However on some tasks that are terminated shortly after init, TezProcessor is 
never called. Moreover, when AM tries to kill the task again (when it's already 
running, having ignored the abort) Tez says the task is already aborted and 
doesn't propagate this either.

{noformat}
2017-12-18T14:47:22,995  INFO [TezTR-667720_3619_3_2_12_0 
(1513367667720_3619_3_02_12_0)] reducesink.VectorReduceSinkCommonOperator: 
Using tag = -1
(this is the end of Hive init)
...
2017-12-18T14:47:23,133 INFO  [TaskHeartbeatThread ()] 
org.apache.tez.runtime.task.TezTaskRunner2: Attempting to abort 
attempt_1513367667720_3619_3_02_12_0 due to an invocation of 
shutdownRequested
(no TezProcessor log statements)
{noformat}
The task keeps running and the next kill is ignored
{noformat}
2017-12-18T14:47:23,575 INFO  [IPC Server handler 2 on 40617 ()] 
org.apache.hadoop.hive.llap.daemon.impl.ContainerRunnerImpl: DBG: Received 
terminateFragment request for attempt_1513367667720_3619_3_02_12_0
...
2017-12-18T14:47:23,575 INFO  [IPC Server handler 2 on 40617 ()] 
org.apache.tez.runtime.task.TezTaskRunner2: Ignoring killTask request since the 
task with id attempt_1513367667720_3619_3_02_12_0 has ended for reason: 
CONTAINER_STOP_REQUESTED. IgnoredError:  
{noformat}

  was:
I'm looking at a Hive LLAP query where AM aborts some tasks for whatever reason 
(AM preemption).
On the nodes, most tasks receive abort normally, like so (note that 
TezProcessor is part of Hive and the main class that Tez code calls; the first 
thing it does on any abort is log "Received abort").
{noformat}
2017-12-18T14:44:26,616 INFO  [TaskHeartbeatThread ()] 
org.apache.tez.runtime.task.TezTaskRunner2: Attempting to abort 
attempt_1513367667720_3619_1_02_12_0 due to an invocation of 
shutdownRequested
2017-12-18T14:44:26,621 INFO  [TaskHeartbeatThread ()] 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor: Received abort
2017-12-18T14:44:26,622 INFO  [TaskHeartbeatThread ()] 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor: Forwarding abort to 
RecordProcessor
2017-12-18T14:44:26,622 INFO  [TaskHeartbeatThread ()] 
org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor: Forwarding abort to 
mapOp: {} MAP
{noformat}

However on some tasks that are terminated shortly after init, TezProcessor is 
never called. Moreover, when AM tries to kill the task again (when it's already 
running, having ignored the abort) Tez says the task is already aborted and 
doesn't propagate this either.

{noformat}
2017-12-18T14:47:22,995  INFO [TezTR-667720_3619_3_2_12_0 
(1513367667720_3619_3_02_12_0)] reducesink.VectorReduceSinkCommonOperator: 
Using tag = -1
(this is the end of Hive init)
...
2017-12-18T14:47:23,133 INFO  [TaskHeartbeatThread ()] 
org.apache.hadoop.hive.llap.daemon.impl.LlapTaskReporter: Asked to die via task 
heartbeat: attempt_1513367667720_3619_3_02_12_0
2017-12-18T14:47:23,133 INFO  [TaskHeartbeatThread ()] 
org.apache.tez.runtime.task.TezTaskRunner2: Attempting to abort 
attempt_1513367667720_3619_3_02_12_0 due to an invocation of 
shutdownRequested
(no TezProcessor log statements)
{noformat}
The task keeps running and the next kill is ignored
{noformat}
2017-12-18T14:47:23,575 INFO  [IPC Server handler 2 on 40617 ()] 
org.apache.hadoop.hive.llap.daemon.impl.ContainerRunnerImpl: DBG: Received 
terminateFragment request for attempt_1513367667720_3619_3_02_12_0
...
2017-12-18T14:47:23,575 INFO  [IPC Server handler 2 on 40617 ()] 
org.apache.tez.runtime.task.TezTaskRunner2: Ignoring killTask request since the 
task with id attempt_1513367667720_3619_3_02_12_0 has ended for reason: 
CONTAINER_STOP_REQUESTED. IgnoredError:  
{noformat}


> potential abort propagation issue (race?)
> -
>
> Key: TEZ-3879
>   

[jira] [Commented] (TEZ-3879) potential abort propagation issue (race?)

2017-12-19 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16297577#comment-16297577
 ] 

Sergey Shelukhin commented on TEZ-3879:
---

[~sseth] [~ewohlstadter] can you take a look?

> potential abort propagation issue (race?)
> -
>
> Key: TEZ-3879
> URL: https://issues.apache.org/jira/browse/TEZ-3879
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>
> I'm looking at a Hive LLAP query where AM aborts some tasks for whatever 
> reason (AM preemption). 
> On the nodes, the abort is handled by TezTaskRunner2 and it looks like 
> there's some race there for some cases.
> Most tasks receive abort normally, like so (the first thing Hive TezProcessor 
> does on any abort is log "Received abort").
> {noformat}
> 2017-12-18T14:44:26,616 INFO  [TaskHeartbeatThread ()] 
> org.apache.tez.runtime.task.TezTaskRunner2: Attempting to abort 
> attempt_1513367667720_3619_1_02_12_0 due to an invocation of 
> shutdownRequested
> 2017-12-18T14:44:26,621 INFO  [TaskHeartbeatThread ()] 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor: Received abort
> 2017-12-18T14:44:26,622 INFO  [TaskHeartbeatThread ()] 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor: Forwarding abort to 
> RecordProcessor
> 2017-12-18T14:44:26,622 INFO  [TaskHeartbeatThread ()] 
> org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor: Forwarding abort to 
> mapOp: {} MAP
> {noformat}
> However on some tasks that are terminated shortly after init, TezProcessor is 
> never called. Moreover, when AM tries to kill the task again (when it's 
> already running, having ignored the abort) Tez says the task is already 
> aborted and doesn't propagate this either.
> {noformat}
> 2017-12-18T14:47:22,995  INFO [TezTR-667720_3619_3_2_12_0 
> (1513367667720_3619_3_02_12_0)] 
> reducesink.VectorReduceSinkCommonOperator: Using tag = -1
> (this is the end of Hive init)
> ...
> 2017-12-18T14:47:23,133 INFO  [TaskHeartbeatThread ()] 
> org.apache.tez.runtime.task.TezTaskRunner2: Attempting to abort 
> attempt_1513367667720_3619_3_02_12_0 due to an invocation of 
> shutdownRequested
> (no TezProcessor log statements)
> {noformat}
> The task keeps running and the next kill is ignored
> {noformat}
> 2017-12-18T14:47:23,575 INFO  [IPC Server handler 2 on 40617 ()] 
> org.apache.hadoop.hive.llap.daemon.impl.ContainerRunnerImpl: DBG: Received 
> terminateFragment request for attempt_1513367667720_3619_3_02_12_0
> ...
> 2017-12-18T14:47:23,575 INFO  [IPC Server handler 2 on 40617 ()] 
> org.apache.tez.runtime.task.TezTaskRunner2: Ignoring killTask request since 
> the task with id attempt_1513367667720_3619_3_02_12_0 has ended for 
> reason: CONTAINER_STOP_REQUESTED. IgnoredError:  
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (TEZ-3879) potential abort propagation issue (race?)

2017-12-19 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated TEZ-3879:
--
Description: 
I'm looking at a Hive LLAP query where AM aborts some tasks for whatever reason 
(AM preemption).
On the nodes, most tasks receive abort normally, like so (note that 
TezProcessor is part of Hive and the main class that Tez code calls; the first 
thing it does on any abort is log "Received abort").
{noformat}
2017-12-18T14:44:26,616 INFO  [TaskHeartbeatThread ()] 
org.apache.tez.runtime.task.TezTaskRunner2: Attempting to abort 
attempt_1513367667720_3619_1_02_12_0 due to an invocation of 
shutdownRequested
2017-12-18T14:44:26,621 INFO  [TaskHeartbeatThread ()] 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor: Received abort
2017-12-18T14:44:26,622 INFO  [TaskHeartbeatThread ()] 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor: Forwarding abort to 
RecordProcessor
2017-12-18T14:44:26,622 INFO  [TaskHeartbeatThread ()] 
org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor: Forwarding abort to 
mapOp: {} MAP
{noformat}

However on some tasks that are terminated shortly after init, TezProcessor is 
never called. Moreover, when AM tries to kill the task again (when it's already 
running, having ignored the abort) Tez says the task is already aborted and 
doesn't propagate this either.

{noformat}
2017-12-18T14:47:22,995  INFO [TezTR-667720_3619_3_2_12_0 
(1513367667720_3619_3_02_12_0)] reducesink.VectorReduceSinkCommonOperator: 
Using tag = -1
(this is the end of Hive init)
...
2017-12-18T14:47:23,133 INFO  [TaskHeartbeatThread ()] 
org.apache.hadoop.hive.llap.daemon.impl.LlapTaskReporter: Asked to die via task 
heartbeat: attempt_1513367667720_3619_3_02_12_0
2017-12-18T14:47:23,133 INFO  [TaskHeartbeatThread ()] 
org.apache.tez.runtime.task.TezTaskRunner2: Attempting to abort 
attempt_1513367667720_3619_3_02_12_0 due to an invocation of 
shutdownRequested
(no TezProcessor log statements)
{noformat}
The task keeps running and the next kill is ignored
{noformat}
2017-12-18T14:47:23,575 INFO  [IPC Server handler 2 on 40617 ()] 
org.apache.hadoop.hive.llap.daemon.impl.ContainerRunnerImpl: DBG: Received 
terminateFragment request for attempt_1513367667720_3619_3_02_12_0
...
2017-12-18T14:47:23,575 INFO  [IPC Server handler 2 on 40617 ()] 
org.apache.tez.runtime.task.TezTaskRunner2: Ignoring killTask request since the 
task with id attempt_1513367667720_3619_3_02_12_0 has ended for reason: 
CONTAINER_STOP_REQUESTED. IgnoredError:  
{noformat}

  was:
I'm looking at a Hive query where AM aborts some tasks for whatever reason (AM 
preemption).
On the nodes, most tasks receive abort normally, like so (note that 
TezProcessor is part of Hive and the main class that Tez code calls; the first 
thing it does on any abort is log "Received abort").
{noformat}
2017-12-18T14:44:26,616 INFO  [TaskHeartbeatThread ()] 
org.apache.tez.runtime.task.TezTaskRunner2: Attempting to abort 
attempt_1513367667720_3619_1_02_12_0 due to an invocation of 
shutdownRequested
2017-12-18T14:44:26,621 INFO  [TaskHeartbeatThread ()] 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor: Received abort
2017-12-18T14:44:26,622 INFO  [TaskHeartbeatThread ()] 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor: Forwarding abort to 
RecordProcessor
2017-12-18T14:44:26,622 INFO  [TaskHeartbeatThread ()] 
org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor: Forwarding abort to 
mapOp: {} MAP
{noformat}

However on some tasks that are terminated shortly after init, TezProcessor is 
never called. Moreover, when AM tries to kill the task again (when it's already 
running, having ignored the abort) Tez says the task is already aborted and 
doesn't propagate this either.

{noformat}
2017-12-18T14:47:22,995  INFO [TezTR-667720_3619_3_2_12_0 
(1513367667720_3619_3_02_12_0)] reducesink.VectorReduceSinkCommonOperator: 
Using tag = -1
(this is the end of Hive init)
...
2017-12-18T14:47:23,133 INFO  [TaskHeartbeatThread ()] 
org.apache.hadoop.hive.llap.daemon.impl.LlapTaskReporter: Asked to die via task 
heartbeat: attempt_1513367667720_3619_3_02_12_0
2017-12-18T14:47:23,133 INFO  [TaskHeartbeatThread ()] 
org.apache.tez.runtime.task.TezTaskRunner2: Attempting to abort 
attempt_1513367667720_3619_3_02_12_0 due to an invocation of 
shutdownRequested
(no TezProcessor log statements)
{noformat}
The task keeps running and the next kill is ignored
{noformat}
2017-12-18T14:47:23,575 INFO  [IPC Server handler 2 on 40617 ()] 
org.apache.hadoop.hive.llap.daemon.impl.ContainerRunnerImpl: DBG: Received 
terminateFragment request for attempt_1513367667720_3619_3_02_12_0
...
2017-12-18T14:47:23,575 INFO  [IPC Server handler 2 on 40617 ()] 
org.apache.tez.runtime.task.TezTaskRunner2: Ignoring killTask request since the 
task with id attempt_1513367667720_3619_3_02_12_0 has ended for reason: 
CONTAINER_STOP_REQUESTED. 

[jira] [Created] (TEZ-3879) potential abort propagation issue (race?)

2017-12-19 Thread Sergey Shelukhin (JIRA)
Sergey Shelukhin created TEZ-3879:
-

 Summary: potential abort propagation issue (race?)
 Key: TEZ-3879
 URL: https://issues.apache.org/jira/browse/TEZ-3879
 Project: Apache Tez
  Issue Type: Bug
Reporter: Sergey Shelukhin


I'm looking at a Hive query where AM aborts some tasks for whatever reason (AM 
preemption).
On the nodes, most tasks receive abort normally, like so (note that 
TezProcessor is part of Hive and the main class that Tez code calls; the first 
thing it does on any abort is log "Received abort").
{noformat}
2017-12-18T14:44:26,616 INFO  [TaskHeartbeatThread ()] 
org.apache.tez.runtime.task.TezTaskRunner2: Attempting to abort 
attempt_1513367667720_3619_1_02_12_0 due to an invocation of 
shutdownRequested
2017-12-18T14:44:26,621 INFO  [TaskHeartbeatThread ()] 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor: Received abort
2017-12-18T14:44:26,622 INFO  [TaskHeartbeatThread ()] 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor: Forwarding abort to 
RecordProcessor
2017-12-18T14:44:26,622 INFO  [TaskHeartbeatThread ()] 
org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor: Forwarding abort to 
mapOp: {} MAP
{noformat}

However on some tasks that are terminated shortly after init, TezProcessor is 
never called. Moreover, when AM tries to kill the task again (when it's already 
running, having ignored the abort) Tez says the task is already aborted and 
doesn't propagate this either.

{noformat}
2017-12-18T14:47:22,995  INFO [TezTR-667720_3619_3_2_12_0 
(1513367667720_3619_3_02_12_0)] reducesink.VectorReduceSinkCommonOperator: 
Using tag = -1
(this is the end of Hive init)
...
2017-12-18T14:47:23,133 INFO  [TaskHeartbeatThread ()] 
org.apache.hadoop.hive.llap.daemon.impl.LlapTaskReporter: Asked to die via task 
heartbeat: attempt_1513367667720_3619_3_02_12_0
2017-12-18T14:47:23,133 INFO  [TaskHeartbeatThread ()] 
org.apache.tez.runtime.task.TezTaskRunner2: Attempting to abort 
attempt_1513367667720_3619_3_02_12_0 due to an invocation of 
shutdownRequested
(no TezProcessor log statements)
{noformat}
The task keeps running and the next kill is ignored
{noformat}
2017-12-18T14:47:23,575 INFO  [IPC Server handler 2 on 40617 ()] 
org.apache.hadoop.hive.llap.daemon.impl.ContainerRunnerImpl: DBG: Received 
terminateFragment request for attempt_1513367667720_3619_3_02_12_0
...
2017-12-18T14:47:23,575 INFO  [IPC Server handler 2 on 40617 ()] 
org.apache.tez.runtime.task.TezTaskRunner2: Ignoring killTask request since the 
task with id attempt_1513367667720_3619_3_02_12_0 has ended for reason: 
CONTAINER_STOP_REQUESTED. IgnoredError:  
{noformat}
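
To make the suspected race concrete, here is a minimal sketch of one way to 
latch the abort so a processor that finishes init after the abort arrives 
still observes it; this is illustrative only, not Tez code, and the handler is 
assumed to be idempotent:
{noformat}
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

// Illustrative sketch: remember an abort that arrives before the processor
// has registered its handler, and replay it on registration.
final class AbortLatch {
  private final AtomicBoolean aborted = new AtomicBoolean(false);
  private final AtomicReference<Runnable> handler = new AtomicReference<>();

  // Called by the heartbeat thread on shutdownRequested.
  void requestAbort() {
    if (aborted.compareAndSet(false, true)) {
      Runnable h = handler.get();
      if (h != null) {
        h.run(); // processor already registered: forward the abort
      }
    }
  }

  // Called by the processor once it is ready to receive aborts.
  void register(Runnable abortHandler) {
    handler.set(abortHandler);
    if (aborted.get()) {
      abortHandler.run(); // abort arrived first: replay it (must be idempotent)
    }
  }
}
{noformat}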



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (TEZ-3866) add ability to pass information between plugins

2017-11-16 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated TEZ-3866:
--
Description: 
Distinct from TEZ-3815, since even with a hack that makes plugins aware of each 
other, they cannot store task-specific info without calling into each others' 
per-task structures.

In particular, I need a piece of info that is generated when the custom 
scheduler calls getContext().taskAllocated, to be propagated to communicator 
plugin registerRunningTaskAttempt, so that the custom communicator could 
include additional info in task submission (e.g. a scheduler-specific priority 
or other instructions). There doesn't seem to be any means to do it now.

  was:
Distinct from TEZ-3815, since even with a hack that makes plugins aware of each 
other, they cannot store task-specific info without calling into each others' 
per-task structures.

In particular, I need a piece of info that is generated when the custom 
scheduler calls getContext().taskAllocated, to be propagated to communicator 
plugin registerRunningTaskAttempt, so that the custom communicator could 
include additional info in task submission. There doesn't seem to be any means 
to do it now.


> add ability to pass information between plugins
> ---
>
> Key: TEZ-3866
> URL: https://issues.apache.org/jira/browse/TEZ-3866
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>
> Distinct from TEZ-3815, since even with a hack that makes plugins aware of 
> each other, they cannot store task-specific info without calling into each 
> others' per-task structures.
> In particular, I need a piece of info that is generated when the custom 
> scheduler calls getContext().taskAllocated, to be propagated to communicator 
> plugin registerRunningTaskAttempt, so that the custom communicator could 
> include additional info in task submission (e.g. a scheduler-specific 
> priority or other instructions). There doesn't seem to be any means to do it 
> now.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (TEZ-3866) add ability to pass information between plugins

2017-11-15 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated TEZ-3866:
--
Description: 
Distinct from TEZ-3815, since even with a hack that makes plugins aware of each 
other, they cannot store task-specific info without calling into each others' 
per-task structures.

In particular, I need a piece of info that is generated when the custom 
scheduler calls getContext().taskAllocated, to be propagated to communicator 
plugin registerRunningTaskAttempt, so that the custom communicator could 
include additional info in task submission. There doesn't seem to be any means 
to do it now.

  was:
Distinct from TEZ-3815, since even with the hack that makes plugins aware of 
each other, they cannot store task-specific info without calling into each 
others' per-task structures.

In particular, I need a piece of info that is generated when the custom 
scheduler calls getContext().taskAllocated, to be propagated to communicator 
plugin registerRunningTaskAttempt, so that the custom communicator could 
include additional info in task submission. There doesn't seem to be any means 
to do it now.


> add ability to pass information between plugins
> ---
>
> Key: TEZ-3866
> URL: https://issues.apache.org/jira/browse/TEZ-3866
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>
> Distinct from TEZ-3815, since even with a hack that makes plugins aware of 
> each other, they cannot store task-specific info without calling into each 
> others' per-task structures.
> In particular, I need a piece of info that is generated when the custom 
> scheduler calls getContext().taskAllocated, to be propagated to communicator 
> plugin registerRunningTaskAttempt, so that the custom communicator could 
> include additional info in task submission. There doesn't seem to be any 
> means to do it now.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (TEZ-3866) add ability to pass information between plugins

2017-11-15 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16254677#comment-16254677
 ] 

Sergey Shelukhin commented on TEZ-3866:
---

cc [~hagleitn] [~aplusplus] this is needed for Hive workload management, to 
avoid some seriously ugly code that exists right now :)

> add ability to pass information between plugins
> ---
>
> Key: TEZ-3866
> URL: https://issues.apache.org/jira/browse/TEZ-3866
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>
> Distinct from TEZ-3815, since even with the hack that makes plugins aware of 
> each other, they cannot store task-specific info without calling into each 
> others' per-task structures.
> In particular, I need a piece of info that is generated when the custom 
> scheduler calls getContext().taskAllocated, to be propagated to communicator 
> plugin registerRunningTaskAttempt, so that the custom communicator could 
> include additional info in task submission. There doesn't seem to be any 
> means to do it now.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (TEZ-3866) add ability to pass information between plugins

2017-11-15 Thread Sergey Shelukhin (JIRA)
Sergey Shelukhin created TEZ-3866:
-

 Summary: add ability to pass information between plugins
 Key: TEZ-3866
 URL: https://issues.apache.org/jira/browse/TEZ-3866
 Project: Apache Tez
  Issue Type: Bug
Reporter: Sergey Shelukhin


Distinct from TEZ-3815, since even with the hack that makes plugins aware of 
each other, they cannot store task-specific info without calling into each 
others' per-task structures.

In particular, I need a piece of info that is generated when the custom 
scheduler calls getContext().taskAllocated, to be propagated to communicator 
plugin registerRunningTaskAttempt, so that the custom communicator could 
include additional info in task submission. There doesn't seem to be any means 
to do it now.
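
For illustration, a hedged sketch of what such a channel could look like: a 
per-DAG map owned by the framework, keyed by task attempt ID, written by the 
scheduler around taskAllocated and read by the communicator in 
registerRunningTaskAttempt. None of these names are existing Tez APIs.
{noformat}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical per-DAG scratchpad shared between plugins.
final class PluginScratchpad {
  private final Map<String, Object> perAttemptInfo = new ConcurrentHashMap<>();

  // Scheduler side, e.g. around the taskAllocated callback.
  void put(String attemptId, Object info) {
    perAttemptInfo.put(attemptId, info);
  }

  // Communicator side, e.g. in registerRunningTaskAttempt.
  Object take(String attemptId) {
    return perAttemptInfo.remove(attemptId); // remove to avoid leaking entries
  }
}
{noformat}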



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (TEZ-3846) Tez AM may not clean up properly on an internal error

2017-09-29 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16186309#comment-16186309
 ] 

Sergey Shelukhin commented on TEZ-3846:
---

Tez version was 0.9.0 (the one Hive is using on master). Unfortunately I don't 
have vertex logs.

> Tez AM may not clean up properly on an internal error
> -
>
> Key: TEZ-3846
> URL: https://issues.apache.org/jira/browse/TEZ-3846
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Zhiyuan Yang
>
> Normally, in Hive we blindly reopen the session on any submit error; however 
> I accidentally broke that, and while investigating noticed a new error before 
> reopen that claims that session where a DAG has failed is still running a 
> DAG. Looks like it should either clean up, or if we assume OOM is not 
> clean-up-able, die completely.
> {noformat}
> 2017-09-28T01:07:12,352  INFO [3d4e3f44-40c5-431a-b3de-801d60c1c579 main] 
> client.TezClient: Submitted dag to TezSession, 
> sessionName=HIVE-35a0e5c9-ce27-4b27-824c-ce9bc0fe104d, 
> applicationId=application_1506585924598_0001, 
> dagId=dag_1506585924598_0001_53, dagName=SELECT count(1) FROM (
> ...
> 2017-09-28T01:07:25,787 ERROR [3d4e3f44-40c5-431a-b3de-801d60c1c579 main] 
> SessionState: Status: Failed
> 2017-09-28T01:07:25,787 ERROR [3d4e3f44-40c5-431a-b3de-801d60c1c579 main] 
> SessionState: Vertex failed, vertexName=Map 61, 
> vertexId=vertex_1506585924598_0001_53_01, diagnostics=[Vertex 
> vertex_1506585924598_0001_53_01 [Map 61] killed/failed due 
> to:ROOT_INPUT_INIT_FAILURE, Vertex Input: src initializer failed, 
> vertex=vertex_1506585924598_0001_53_01 [Map 61], java.lang.OutOfMemoryError: 
> GC overhead limit exceeded
> 2017-09-28T01:07:25,787 ERROR [3d4e3f44-40c5-431a-b3de-801d60c1c579 main] 
> SessionState: Invalid event V_INTERNAL_ERROR on Vertex 
> vertex_1506585924598_0001_53_00 [Map 60]
> 2017-09-28T01:07:25,787 DEBUG [3d4e3f44-40c5-431a-b3de-801d60c1c579 main] 
> log.PerfLogger:  end=1506586045787 duration=13435 
> from=org.apache.hadoop.hive.ql.exec.tez.monitoring.TezJobMonitor>
> ... [reuse]
> 2017-09-28T01:07:28,459  INFO [11108166-069e-43d7-9e21-25b9214d01a4 main] 
> client.TezClient: Submitting dag to TezSession, 
> sessionName=HIVE-35a0e5c9-ce27-4b27-824c-ce9bc0fe104d, 
> applicationId=application_1506585924598_0001, dagName=insert overwrite table 
> orc_ppd_staging s...s(Stage-1), callerContext={ context=HIVE, 
> callerType=HIVE_QUERY_ID, 
> callerId=hiveptest_20170928010728_58f19d98-85da-4fad-83a7-7bf3aa0252a7 }
> 2017-09-28T01:07:35,259  INFO [11108166-069e-43d7-9e21-25b9214d01a4 main] 
> exec.Task: Dag submit failed due to App master already running a DAG
> {noformat}
> Session continues living and failing like that multiple times.
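
For context, a sketch of the blind reopen-on-submit-error pattern described 
above, assuming the "already running a DAG" condition surfaces as a 
TezException from submitDAG; TezClient's create/start/submitDAG/stop are real 
client APIs, everything else here is illustrative:
{noformat}
import org.apache.tez.client.TezClient;
import org.apache.tez.dag.api.DAG;
import org.apache.tez.dag.api.TezConfiguration;
import org.apache.tez.dag.api.TezException;
import org.apache.tez.dag.api.client.DAGClient;

// Sketch only: if submission fails, tear the session down and reopen it.
final class ReopeningSubmitter {
  static DAGClient submitWithReopen(TezClient session, String sessionName,
      DAG dag, TezConfiguration conf) throws Exception {
    try {
      return session.submitDAG(dag);
    } catch (TezException e) {
      // e.g. "App master already running a DAG": the AM did not clean up.
      session.stop();
      TezClient fresh = TezClient.create(sessionName, conf);
      fresh.start();
      return fresh.submitDAG(dag);
    }
  }
}
{noformat}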



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (TEZ-3846) Tez AM may not clean up properly on an internal error

2017-09-28 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated TEZ-3846:
--
Summary: Tez AM may not clean up properly on an internal error  (was: Tez 
session may not clean up on internal error)

> Tez AM may not clean up properly on an internal error
> -
>
> Key: TEZ-3846
> URL: https://issues.apache.org/jira/browse/TEZ-3846
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>
> Normally, in Hive we blindly reopen the session on any error; however I 
> accidentally broke that, and while investigating noticed a new error before 
> reopen that claims that session where a DAG has failed is still running a 
> DAG. Looks like it should either clean up, or if we assume OOM is not 
> clean-up-able, die completely.
> {noformat}
> 2017-09-28T01:07:12,352  INFO [3d4e3f44-40c5-431a-b3de-801d60c1c579 main] 
> client.TezClient: Submitted dag to TezSession, 
> sessionName=HIVE-35a0e5c9-ce27-4b27-824c-ce9bc0fe104d, 
> applicationId=application_1506585924598_0001, 
> dagId=dag_1506585924598_0001_53, dagName=SELECT count(1) FROM (
> ...
> 2017-09-28T01:07:25,787 ERROR [3d4e3f44-40c5-431a-b3de-801d60c1c579 main] 
> SessionState: Status: Failed
> 2017-09-28T01:07:25,787 ERROR [3d4e3f44-40c5-431a-b3de-801d60c1c579 main] 
> SessionState: Vertex failed, vertexName=Map 61, 
> vertexId=vertex_1506585924598_0001_53_01, diagnostics=[Vertex 
> vertex_1506585924598_0001_53_01 [Map 61] killed/failed due 
> to:ROOT_INPUT_INIT_FAILURE, Vertex Input: src initializer failed, 
> vertex=vertex_1506585924598_0001_53_01 [Map 61], java.lang.OutOfMemoryError: 
> GC overhead limit exceeded
> 2017-09-28T01:07:25,787 ERROR [3d4e3f44-40c5-431a-b3de-801d60c1c579 main] 
> SessionState: Invalid event V_INTERNAL_ERROR on Vertex 
> vertex_1506585924598_0001_53_00 [Map 60]
> 2017-09-28T01:07:25,787 DEBUG [3d4e3f44-40c5-431a-b3de-801d60c1c579 main] 
> log.PerfLogger:  end=1506586045787 duration=13435 
> from=org.apache.hadoop.hive.ql.exec.tez.monitoring.TezJobMonitor>
> ... [reuse]
> 2017-09-28T01:07:28,459  INFO [11108166-069e-43d7-9e21-25b9214d01a4 main] 
> client.TezClient: Submitting dag to TezSession, 
> sessionName=HIVE-35a0e5c9-ce27-4b27-824c-ce9bc0fe104d, 
> applicationId=application_1506585924598_0001, dagName=insert overwrite table 
> orc_ppd_staging s...s(Stage-1), callerContext={ context=HIVE, 
> callerType=HIVE_QUERY_ID, 
> callerId=hiveptest_20170928010728_58f19d98-85da-4fad-83a7-7bf3aa0252a7 }
> 2017-09-28T01:07:35,259  INFO [11108166-069e-43d7-9e21-25b9214d01a4 main] 
> exec.Task: Dag submit failed due to App master already running a DAG
> {noformat}
> Session continues living and failing like that multiple times.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (TEZ-3846) Tez session may not clean up on internal error

2017-09-28 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16185046#comment-16185046
 ] 

Sergey Shelukhin commented on TEZ-3846:
---

cc [~aplusplus] [~sseth]

> Tez session may not clean up on internal error
> --
>
> Key: TEZ-3846
> URL: https://issues.apache.org/jira/browse/TEZ-3846
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>
> Normally, in Hive we blindly reopen the session on any error; however I 
> accidentally broke that, and while investigating noticed a new error before 
> reopen that claims that session where a DAG has failed is still running a 
> DAG. Looks like it should either clean up, or if we assume OOM is not 
> clean-up-able, die completely.
> {noformat}
> 2017-09-28T01:07:12,352  INFO [3d4e3f44-40c5-431a-b3de-801d60c1c579 main] 
> client.TezClient: Submitted dag to TezSession, 
> sessionName=HIVE-35a0e5c9-ce27-4b27-824c-ce9bc0fe104d, 
> applicationId=application_1506585924598_0001, 
> dagId=dag_1506585924598_0001_53, dagName=SELECT count(1) FROM (
> ...
> 2017-09-28T01:07:25,787 ERROR [3d4e3f44-40c5-431a-b3de-801d60c1c579 main] 
> SessionState: Status: Failed
> 2017-09-28T01:07:25,787 ERROR [3d4e3f44-40c5-431a-b3de-801d60c1c579 main] 
> SessionState: Vertex failed, vertexName=Map 61, 
> vertexId=vertex_1506585924598_0001_53_01, diagnostics=[Vertex 
> vertex_1506585924598_0001_53_01 [Map 61] killed/failed due 
> to:ROOT_INPUT_INIT_FAILURE, Vertex Input: src initializer failed, 
> vertex=vertex_1506585924598_0001_53_01 [Map 61], java.lang.OutOfMemoryError: 
> GC overhead limit exceeded
> 2017-09-28T01:07:25,787 ERROR [3d4e3f44-40c5-431a-b3de-801d60c1c579 main] 
> SessionState: Invalid event V_INTERNAL_ERROR on Vertex 
> vertex_1506585924598_0001_53_00 [Map 60]
> 2017-09-28T01:07:25,787 DEBUG [3d4e3f44-40c5-431a-b3de-801d60c1c579 main] 
> log.PerfLogger:  end=1506586045787 duration=13435 
> from=org.apache.hadoop.hive.ql.exec.tez.monitoring.TezJobMonitor>
> ... [reuse]
> 2017-09-28T01:07:28,459  INFO [11108166-069e-43d7-9e21-25b9214d01a4 main] 
> client.TezClient: Submitting dag to TezSession, 
> sessionName=HIVE-35a0e5c9-ce27-4b27-824c-ce9bc0fe104d, 
> applicationId=application_1506585924598_0001, dagName=insert overwrite table 
> orc_ppd_staging s...s(Stage-1), callerContext={ context=HIVE, 
> callerType=HIVE_QUERY_ID, 
> callerId=hiveptest_20170928010728_58f19d98-85da-4fad-83a7-7bf3aa0252a7 }
> 2017-09-28T01:07:35,259  INFO [11108166-069e-43d7-9e21-25b9214d01a4 main] 
> exec.Task: Dag submit failed due to App master already running a DAG
> {noformat}
> Session continues living and failing like that multiple times.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (TEZ-3846) Tez AM may not clean up properly on an internal error

2017-09-28 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated TEZ-3846:
--
Description: 
Normally, in Hive we blindly reopen the session on any submit error; however I 
accidentally broke that, and while investigating noticed a new error before 
reopen that claims that session where a DAG has failed is still running a DAG. 
Looks like it should either clean up, or if we assume OOM is not clean-up-able, 
die completely.
{noformat}
2017-09-28T01:07:12,352  INFO [3d4e3f44-40c5-431a-b3de-801d60c1c579 main] 
client.TezClient: Submitted dag to TezSession, 
sessionName=HIVE-35a0e5c9-ce27-4b27-824c-ce9bc0fe104d, 
applicationId=application_1506585924598_0001, dagId=dag_1506585924598_0001_53, 
dagName=SELECT count(1) FROM (
...
2017-09-28T01:07:25,787 ERROR [3d4e3f44-40c5-431a-b3de-801d60c1c579 main] 
SessionState: Status: Failed
2017-09-28T01:07:25,787 ERROR [3d4e3f44-40c5-431a-b3de-801d60c1c579 main] 
SessionState: Vertex failed, vertexName=Map 61, 
vertexId=vertex_1506585924598_0001_53_01, diagnostics=[Vertex 
vertex_1506585924598_0001_53_01 [Map 61] killed/failed due 
to:ROOT_INPUT_INIT_FAILURE, Vertex Input: src initializer failed, 
vertex=vertex_1506585924598_0001_53_01 [Map 61], java.lang.OutOfMemoryError: GC 
overhead limit exceeded
2017-09-28T01:07:25,787 ERROR [3d4e3f44-40c5-431a-b3de-801d60c1c579 main] 
SessionState: Invalid event V_INTERNAL_ERROR on Vertex 
vertex_1506585924598_0001_53_00 [Map 60]
2017-09-28T01:07:25,787 DEBUG [3d4e3f44-40c5-431a-b3de-801d60c1c579 main] 
log.PerfLogger: 
... [reuse]
2017-09-28T01:07:28,459  INFO [11108166-069e-43d7-9e21-25b9214d01a4 main] 
client.TezClient: Submitting dag to TezSession, 
sessionName=HIVE-35a0e5c9-ce27-4b27-824c-ce9bc0fe104d, 
applicationId=application_1506585924598_0001, dagName=insert overwrite table 
orc_ppd_staging s...s(Stage-1), callerContext={ context=HIVE, 
callerType=HIVE_QUERY_ID, 
callerId=hiveptest_20170928010728_58f19d98-85da-4fad-83a7-7bf3aa0252a7 }
2017-09-28T01:07:35,259  INFO [11108166-069e-43d7-9e21-25b9214d01a4 main] 
exec.Task: Dag submit failed due to App master already running a DAG
{noformat}
Session continues living and failing like that multiple times.

  was:
Normally, in Hive we blindly reopen the session on any error; however I 
accidentally broke that, and while investigating noticed a new error before 
reopen that claims that session where a DAG has failed is still running a DAG. 
Looks like it should either clean up, or if we assume OOM is not clean-up-able, 
die completely.
{noformat}
2017-09-28T01:07:12,352  INFO [3d4e3f44-40c5-431a-b3de-801d60c1c579 main] 
client.TezClient: Submitted dag to TezSession, 
sessionName=HIVE-35a0e5c9-ce27-4b27-824c-ce9bc0fe104d, 
applicationId=application_1506585924598_0001, dagId=dag_1506585924598_0001_53, 
dagName=SELECT count(1) FROM (
...
2017-09-28T01:07:25,787 ERROR [3d4e3f44-40c5-431a-b3de-801d60c1c579 main] 
SessionState: Status: Failed
2017-09-28T01:07:25,787 ERROR [3d4e3f44-40c5-431a-b3de-801d60c1c579 main] 
SessionState: Vertex failed, vertexName=Map 61, 
vertexId=vertex_1506585924598_0001_53_01, diagnostics=[Vertex 
vertex_1506585924598_0001_53_01 [Map 61] killed/failed due 
to:ROOT_INPUT_INIT_FAILURE, Vertex Input: src initializer failed, 
vertex=vertex_1506585924598_0001_53_01 [Map 61], java.lang.OutOfMemoryError: GC 
overhead limit exceeded
2017-09-28T01:07:25,787 ERROR [3d4e3f44-40c5-431a-b3de-801d60c1c579 main] 
SessionState: Invalid event V_INTERNAL_ERROR on Vertex 
vertex_1506585924598_0001_53_00 [Map 60]
2017-09-28T01:07:25,787 DEBUG [3d4e3f44-40c5-431a-b3de-801d60c1c579 main] 
log.PerfLogger: 
... [reuse]
2017-09-28T01:07:28,459  INFO [11108166-069e-43d7-9e21-25b9214d01a4 main] 
client.TezClient: Submitting dag to TezSession, 
sessionName=HIVE-35a0e5c9-ce27-4b27-824c-ce9bc0fe104d, 
applicationId=application_1506585924598_0001, dagName=insert overwrite table 
orc_ppd_staging s...s(Stage-1), callerContext={ context=HIVE, 
callerType=HIVE_QUERY_ID, 
callerId=hiveptest_20170928010728_58f19d98-85da-4fad-83a7-7bf3aa0252a7 }
2017-09-28T01:07:35,259  INFO [11108166-069e-43d7-9e21-25b9214d01a4 main] 
exec.Task: Dag submit failed due to App master already running a DAG
{noformat}
Session continues living and failing like that multiple times.


> Tez AM may not clean up properly on an internal error
> -
>
> Key: TEZ-3846
> URL: https://issues.apache.org/jira/browse/TEZ-3846
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>
> Normally, in Hive we blindly reopen the session on any submit error; however 
> I accidentally broke that, and while investigating noticed a new error before 
> reopen that claims that session where a DAG has failed is still running a 
> DAG. Looks like it should either clean up, or if we assume OOM is 

[jira] [Commented] (TEZ-3385) DAGClient API should be accessible outside of DAG submission

2017-08-31 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16149692#comment-16149692
 ] 

Sergey Shelukhin commented on TEZ-3385:
---

Is this only for DAGClient, or for TezClient only? Hive might be interested in 
the latter for HA/multi-HS2 work (transferring Tez sessions between HS2s).
cc [~hagleitn]

> DAGClient API should be accessible outside of DAG submission
> 
>
> Key: TEZ-3385
> URL: https://issues.apache.org/jira/browse/TEZ-3385
> Project: Apache Tez
>  Issue Type: New Feature
>Reporter: Rohini Palaniswamy
>
>   In PIG-4958, I had to resort to  
> DAGClient client = new DAGClientImpl(appId, dagID, new 
> TezConfiguration(conf), null);
> This is not good as DAGClientImpl is a internal class and not something users 
> should be referring to. Tez needs to have an API to give DAGClient given the 
> appId, dagID and configuration. This is something basic like 
> JobClient.getJob(String jobID). 
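
A sketch of the kind of factory being asked for, analogous to 
JobClient.getJob(String jobID); the getDAGClient method and its wrapper class 
are hypothetical, and today it could only delegate to the internal constructor 
quoted above:
{noformat}
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.tez.dag.api.TezConfiguration;
import org.apache.tez.dag.api.client.DAGClient;
import org.apache.tez.dag.api.client.DAGClientImpl;

// Hypothetical API shape, not an existing Tez method.
public final class DAGClients {
  private DAGClients() {}

  public static DAGClient getDAGClient(ApplicationId appId, String dagId,
      TezConfiguration conf) throws Exception {
    // Today only possible via the internal DAGClientImpl, as in PIG-4958.
    return new DAGClientImpl(appId, dagId, conf, null);
  }
}
{noformat}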



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (TEZ-3823) expose AM location and application ID from TezClient

2017-08-28 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated TEZ-3823:
--
Attachment: TEZ-3823.patch

[~sseth] can you take a look? Small patch


> expose AM location and application ID from TezClient
> 
>
> Key: TEZ-3823
> URL: https://issues.apache.org/jira/browse/TEZ-3823
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Attachments: TEZ-3823.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (TEZ-3823) expose AM location and application ID from TezClient

2017-08-28 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated TEZ-3823:
--
Summary: expose AM location and application ID from TezClient  (was: expose 
AM location from TezClient)

> expose AM location and application ID from TezClient
> 
>
> Key: TEZ-3823
> URL: https://issues.apache.org/jira/browse/TEZ-3823
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (TEZ-3815) allow plugins to be aware of each other

2017-08-09 Thread Sergey Shelukhin (JIRA)
Sergey Shelukhin created TEZ-3815:
-

 Summary: allow plugins to be aware of each other
 Key: TEZ-3815
 URL: https://issues.apache.org/jira/browse/TEZ-3815
 Project: Apache Tez
  Issue Type: Bug
Reporter: Sergey Shelukhin


Given that many sets of plugins (e.g. LLAP) come as a package deal and do not 
work without each other, it would make sense for them to be aware of each 
other. 
Not sure yet of the best way for this to work, we probably don't want too much 
complexity, dependency systems, etc. Perhaps after all the plugins are 
initialized fully, an optional-to-implement call could be made to each of them 
passing a map from plugin type (communicator, scheduler, etc.) to the instance 
that is going to be used for this DAG.
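
A minimal sketch of the shape this could take; none of these types exist in 
Tez today, and the callback is optional to implement via a default method:
{noformat}
import java.util.Map;

// Hypothetical: handed to each plugin after all plugins are fully initialized.
enum PluginType { TASK_SCHEDULER, TASK_COMMUNICATOR, CONTAINER_LAUNCHER }

interface DagPluginAware {
  // Optional to implement; the default is to ignore the information.
  default void onPluginsInitialized(Map<PluginType, Object> pluginsForThisDag) {}
}
{noformat}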



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (TEZ-3815) allow plugins to be aware of each other

2017-08-09 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16120487#comment-16120487
 ] 

Sergey Shelukhin commented on TEZ-3815:
---

cc [~sseth]

> allow plugins to be aware of each other
> ---
>
> Key: TEZ-3815
> URL: https://issues.apache.org/jira/browse/TEZ-3815
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>
> Given that many sets of plugins (e.g. LLAP) come as a package deal and do not 
> work without each other, it would make sense for them to be aware of each 
> other. 
> Not sure yet of the best way for this to work, we probably don't want too 
> much complexity, dependency systems, etc. Perhaps after all the plugins are 
> initialized fully, an optional-to-implement call could be made to each of 
> them passing a map from plugin type (communicator, scheduler, etc.) to the 
> instance that is going to be used for this DAG.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (TEZ-3812) race condition in ssl shuffle

2017-08-03 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113350#comment-16113350
 ] 

Sergey Shelukhin commented on TEZ-3812:
---

cc [~sseth] [~visakh_nair]

> race condition in ssl shuffle
> -
>
> Key: TEZ-3812
> URL: https://issues.apache.org/jira/browse/TEZ-3812
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>
> ShuffleUtils does the following:{noformat}
> if (sslFactory == null) {
> synchronized (HttpConnectionParams.class) {
>   //Create sslFactory if it is null or if it was destroyed earlier
>   if (sslFactory == null || 
> sslFactory.getKeystoresFactory().getTrustManagers() == null) {
> sslFactory =
> new 
> SSLFactory(org.apache.hadoop.security.ssl.SSLFactory.Mode.CLIENT, conf);
> try {
>   sslFactory.init();
> {noformat}
> It is possible for a thread to get sslFactory that has been assigned but not 
> initialized. It could result in e.g. the hostnameVerifier being null:
> {noformat}
> Caused by: java.lang.IllegalArgumentException: no HostnameVerifier specified
> at 
> javax.net.ssl.HttpsURLConnection.setHostnameVerifier(HttpsURLConnection.java:265)
> at org.apache.tez.http.SSLFactory.configure(SSLFactory.java:219)
> at org.apache.tez.http.HttpConnection.setupConnection(HttpConnection.java:98)
> at org.apache.tez.http.HttpConnection.connect(HttpConnection.java:137)
> at org.apache.tez.http.HttpConnection.connect(HttpConnection.java:123)
> at 
> org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.setupConnection(FetcherOrderedGrouped.java:340)
> at 
> org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:260)
> at 
> org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:178)
> at 
> org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:191)
> at 
> org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:54)
>  
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (TEZ-3812) race condition in ssl shuffle

2017-08-03 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated TEZ-3812:
--
Summary: race condition in ssl shuffle  (was: race condition in ssl 
shuffle?)

> race condition in ssl shuffle
> -
>
> Key: TEZ-3812
> URL: https://issues.apache.org/jira/browse/TEZ-3812
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>
> ShuffleUtils does the following:{noformat}
> if (sslFactory == null) {
> synchronized (HttpConnectionParams.class) {
>   //Create sslFactory if it is null or if it was destroyed earlier
>   if (sslFactory == null || 
> sslFactory.getKeystoresFactory().getTrustManagers() == null) {
> sslFactory =
> new 
> SSLFactory(org.apache.hadoop.security.ssl.SSLFactory.Mode.CLIENT, conf);
> try {
>   sslFactory.init();
> {noformat}
> It is possible for a thread to get sslFactory that has been assigned but not 
> initialized. It could result in e.g. the hostnameVerifier being null:
> {noformat}
> Caused by: java.lang.IllegalArgumentException: no HostnameVerifier specified
> at 
> javax.net.ssl.HttpsURLConnection.setHostnameVerifier(HttpsURLConnection.java:265)
> at org.apache.tez.http.SSLFactory.configure(SSLFactory.java:219)
> at org.apache.tez.http.HttpConnection.setupConnection(HttpConnection.java:98)
> at org.apache.tez.http.HttpConnection.connect(HttpConnection.java:137)
> at org.apache.tez.http.HttpConnection.connect(HttpConnection.java:123)
> at 
> org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.setupConnection(FetcherOrderedGrouped.java:340)
> at 
> org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:260)
> at 
> org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:178)
> at 
> org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:191)
> at 
> org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:54)
>  
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (TEZ-3812) race condition in ssl shuffle?

2017-08-03 Thread Sergey Shelukhin (JIRA)
Sergey Shelukhin created TEZ-3812:
-

 Summary: race condition in ssl shuffle?
 Key: TEZ-3812
 URL: https://issues.apache.org/jira/browse/TEZ-3812
 Project: Apache Tez
  Issue Type: Bug
Reporter: Sergey Shelukhin


ShuffleUtils does the following:{noformat}
if (sslFactory == null) {
synchronized (HttpConnectionParams.class) {
  //Create sslFactory if it is null or if it was destroyed earlier
  if (sslFactory == null || 
sslFactory.getKeystoresFactory().getTrustManagers() == null) {
sslFactory =
new 
SSLFactory(org.apache.hadoop.security.ssl.SSLFactory.Mode.CLIENT, conf);
try {
  sslFactory.init();
{noformat}

It is possible for a thread to get sslFactory that has been assigned but not 
initialized. It could result in e.g. the hostnameVerifier being null:
{noformat}
Caused by: java.lang.IllegalArgumentException: no HostnameVerifier specified
at 
javax.net.ssl.HttpsURLConnection.setHostnameVerifier(HttpsURLConnection.java:265)
at org.apache.tez.http.SSLFactory.configure(SSLFactory.java:219)
at org.apache.tez.http.HttpConnection.setupConnection(HttpConnection.java:98)
at org.apache.tez.http.HttpConnection.connect(HttpConnection.java:137)
at org.apache.tez.http.HttpConnection.connect(HttpConnection.java:123)
at 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.setupConnection(FetcherOrderedGrouped.java:340)
at 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:260)
at 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:178)
at 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:191)
at 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:54)
 
{noformat}
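
For comparison, a minimal sketch of the standard fix for this pattern (not the 
actual Tez change): fully initialize the factory into a local variable and 
only then publish it through a volatile field, so no thread can observe a 
constructed-but-uninitialized instance. The destroyed-factory recheck from the 
snippet above is omitted for brevity, and the fragment is assumed to live in 
ShuffleUtils with org.apache.hadoop.conf.Configuration imported:
{noformat}
private static volatile SSLFactory sslFactory;

static SSLFactory getSslFactory(Configuration conf) throws Exception {
  SSLFactory local = sslFactory;
  if (local == null) {
    synchronized (HttpConnectionParams.class) {
      local = sslFactory;
      if (local == null) {
        SSLFactory created = new SSLFactory(
            org.apache.hadoop.security.ssl.SSLFactory.Mode.CLIENT, conf);
        created.init();       // finish initialization first...
        sslFactory = created; // ...then publish the fully-built instance
        local = created;
      }
    }
  }
  return local;
}
{noformat}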



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (TEZ-3706) add option to skip Tez UI build

2017-04-28 Thread Sergey Shelukhin (JIRA)
Sergey Shelukhin created TEZ-3706:
-

 Summary: add option to skip Tez UI build
 Key: TEZ-3706
 URL: https://issues.apache.org/jira/browse/TEZ-3706
 Project: Apache Tez
  Issue Type: Bug
Reporter: Sergey Shelukhin


The UI build takes forever downloading some files and messing around. It should 
be possible to skip it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (TEZ-3637) TezMerger logs too much

2017-02-23 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated TEZ-3637:
--
Fix Version/s: 0.9.0

> TezMerger logs too much
> ---
>
> Key: TEZ-3637
> URL: https://issues.apache.org/jira/browse/TEZ-3637
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Siddharth Seth
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (TEZ-3637) TezMerger logs too much on INFO level

2017-02-23 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated TEZ-3637:
--
Summary: TezMerger logs too much on INFO level  (was: TezMerger logs too 
much)

> TezMerger logs too much on INFO level
> -
>
> Key: TEZ-3637
> URL: https://issues.apache.org/jira/browse/TEZ-3637
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Siddharth Seth
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (TEZ-3637) TezMerger logs too much

2017-02-23 Thread Sergey Shelukhin (JIRA)
Sergey Shelukhin created TEZ-3637:
-

 Summary: TezMerger logs too much
 Key: TEZ-3637
 URL: https://issues.apache.org/jira/browse/TEZ-3637
 Project: Apache Tez
  Issue Type: Bug
Reporter: Sergey Shelukhin
Assignee: Siddharth Seth






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (TEZ-3186) teztask event problem when running repeated queries on LLAP

2016-03-28 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin resolved TEZ-3186.
---
Resolution: Cannot Reproduce

I'll reopen if I see it after updating Tez to master, the next time I run 
something.

>  teztask event problem when running repeated queries on LLAP
> 
>
> Key: TEZ-3186
> URL: https://issues.apache.org/jira/browse/TEZ-3186
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Siddharth Seth
>
> I am running multiple queries in a row against LLAP from CLI.
> I was running them by copy-pasting multiple lines of "source this.sql" and 
> "source that.sql" into CLI.
> When I switched to running via hive -f all-queries.sql (could be a 
> coincidence), one of the queries now fails towards the end with an error like 
> this:
> {noformat}
> 2016-03-23 21:57:35,531 [INFO] [TaskSchedulerEventHandlerThread] 
> |tezplugins.LlapTaskSchedulerService|: Ignoring deallocate request for task 
> attempt_1455662455106_3046_5_00_000526_0 which hasn't been assigned to a 
> container
> 2016-03-23 21:57:35,531 [INFO] [TaskSchedulerEventHandlerThread] 
> |rm.TaskSchedulerManager|: Task: attempt_1455662455106_3046_5_00_000526_0 has 
> no container assignment in the scheduler
> 2016-03-23 21:57:35,533 [ERROR] [Dispatcher thread {Central}] 
> |impl.TaskAttemptImpl|: Can't handle this event at current state for 
> attempt_1455662455106_3046_5_00_06_1
> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: 
> TA_TEZ_EVENT_UPDATE at KILL_IN_PROGRESS
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> at 
> org.apache.tez.dag.app.dag.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:795)
> at 
> org.apache.tez.dag.app.dag.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:120)
> at 
> org.apache.tez.dag.app.DAGAppMaster$TaskAttemptEventDispatcher.handle(DAGAppMaster.java:2202)
> at 
> org.apache.tez.dag.app.DAGAppMaster$TaskAttemptEventDispatcher.handle(DAGAppMaster.java:2187)
> at 
> org.apache.tez.common.AsyncDispatcher.dispatch(AsyncDispatcher.java:183)
> at 
> org.apache.tez.common.AsyncDispatcher$1.run(AsyncDispatcher.java:114)
> at java.lang.Thread.run(Thread.java:745)
> 2016-03-23 21:57:35,537 [INFO] [Dispatcher thread {Central}] 
> |history.HistoryEventHandler|: 
> [HISTORY][DAG:dag_1455662455106_3046_5][Event:TASK_FINISHED]: vertexName=Map 
> 1, taskId=task_1455662455106_3046_5_00_000527, startTime=1458784644802, 
> finishTime=1458784655537, timeTaken=10735, status=KILLED, 
> successfulAttemptID=null, diagnostics=Killing tasks in vertex: 
> vertex_1455662455106_3046_5_00 [Map 1] due to trigger: OWN_TASK_FAILURE, 
> counters=Counters: 0
> {noformat}
> This is on master.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TEZ-3153) build uses /tmp, which is problematic on a shared machine

2016-03-01 Thread Sergey Shelukhin (JIRA)
Sergey Shelukhin created TEZ-3153:
-

 Summary: build uses /tmp, which is problematic on a shared machine
 Key: TEZ-3153
 URL: https://issues.apache.org/jira/browse/TEZ-3153
 Project: Apache Tez
  Issue Type: Bug
Reporter: Sergey Shelukhin


{noformat}
INFO] bower@1.7.7 node_modules/bower
[INFO] 
[INFO] --- exec-maven-plugin:1.3.2:exec (ember build) @ tez-ui2 ---
version: 1.13.13
Could not find watchman, falling back to NodeWatcher for file system events.
Visit http://www.ember-cli.com/user-guide/#watchman for more info.
BuildingBuilding.Build failed.
File: modules/ember-wormhole/components/ember-wormhole.js
EACCES, mkdir '/tmp/async-disk-cache/3b95d6f55686e4f2ba8e38923a59b8dd'
Error: EACCES, mkdir '/tmp/async-disk-cache/3b95d6f55686e4f2ba8e38923a59b8dd'
{noformat}
Looks like something during the build writes to /tmp; another user has already 
created /tmp/async-disk-cache, so I get an access error.
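
One possible workaround until the build is sandboxed: the node tooling's disk 
caches generally resolve the temp directory from the environment (an assumption 
about the cache libraries involved), so pointing {{TMPDIR}} at a per-user 
directory before running the build should avoid the shared 
{{/tmp/async-disk-cache}} collision.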



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-3151) expose DAG credentials to plugins

2016-02-29 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15172858#comment-15172858
 ] 

Sergey Shelukhin commented on TEZ-3151:
---

Yeah, I put the updated context in the description; there was no description 
before that.

> expose DAG credentials to plugins
> -
>
> Key: TEZ-3151
> URL: https://issues.apache.org/jira/browse/TEZ-3151
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Attachments: TEZ-3151.01.patch, TEZ-3151.patch
>
>
> Tez plugins need to pass credentials (e.g. HBase tokens, etc.) to tasks. 
> Right now they only have access to AM credentials.
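
A hedged sketch of the kind of plumbing this asks for, using Hadoop's 
Credentials API; the dagCredentials parameter below is hypothetical and stands 
in for whatever accessor the patch ultimately exposes to plugins:

{noformat}
import org.apache.hadoop.io.Text;
import org.apache.hadoop.security.Credentials;
import org.apache.hadoop.security.token.Token;

public final class PluginCredentialsSketch {

  // Merge DAG-level tokens (e.g. HBase tokens) into what a task will see.
  static Credentials buildTaskCredentials(Credentials amCredentials,
      Credentials dagCredentials /* hypothetical: supplied by the plugin context */) {
    Credentials merged = new Credentials(amCredentials); // copy AM tokens/secrets
    if (dagCredentials != null) {
      merged.addAll(dagCredentials); // DAG tokens become visible to the task
    }
    return merged;
  }

  // How a token might land in the DAG credentials in the first place.
  static void addHBaseToken(Credentials dagCredentials, Token<?> hbaseToken) {
    dagCredentials.addToken(new Text("hbase"), hbaseToken);
  }
}
{noformat}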



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TEZ-3151) expose DAG credentials to plugins

2016-02-29 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated TEZ-3151:
--
Attachment: TEZ-3151.01.patch

Updated.

> expose DAG credentials to plugins
> -
>
> Key: TEZ-3151
> URL: https://issues.apache.org/jira/browse/TEZ-3151
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Attachments: TEZ-3151.01.patch, TEZ-3151.patch
>
>
> Tez plugins need to pass credentials (e.g. HBase tokens, etc.) to tasks. 
> Right now they only have access to AM credentials.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (TEZ-3151) expose DAG credentials to plugins

2016-02-29 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15172641#comment-15172641
 ] 

Sergey Shelukhin edited comment on TEZ-3151 at 2/29/16 9:56 PM:


[~sseth] does this make sense?


was (Author: sershe):
[~sseth] does it make sense?

> expose DAG credentials to plugins
> -
>
> Key: TEZ-3151
> URL: https://issues.apache.org/jira/browse/TEZ-3151
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Attachments: TEZ-3151.patch
>
>
> Tez plugins need to pass credentials (e.g. HBase tokens, etc.) to tasks. 
> Right now they only have access to AM credentials.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TEZ-3151) expose DAG credentials to plugins

2016-02-29 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated TEZ-3151:
--
Attachment: TEZ-3151.patch

[~sseth] does it make sense?

> expose DAG credentials to plugins
> -
>
> Key: TEZ-3151
> URL: https://issues.apache.org/jira/browse/TEZ-3151
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Attachments: TEZ-3151.patch
>
>
> Tez plugins need to pass credentials (e.g. HBase tokens, etc.) to tasks. 
> Right now they only have access to AM credentials.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TEZ-3151) expose DAG credentials to plugins

2016-02-29 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated TEZ-3151:
--
Description: Tez plugins need to pass credentials (e.g. HBase tokens, etc.) 
to tasks. Right now they only have access to AM credentials.

> expose DAG credentials to plugins
> -
>
> Key: TEZ-3151
> URL: https://issues.apache.org/jira/browse/TEZ-3151
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
>
> Tez plugins need to pass credentials (e.g. HBase tokens, etc.) to tasks. 
> Right now they only have access to AM credentials.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TEZ-3151) expose DAG credentials to plugins

2016-02-29 Thread Sergey Shelukhin (JIRA)
Sergey Shelukhin created TEZ-3151:
-

 Summary: expose DAG credentials to plugins
 Key: TEZ-3151
 URL: https://issues.apache.org/jira/browse/TEZ-3151
 Project: Apache Tez
  Issue Type: Bug
Reporter: Sergey Shelukhin
Assignee: Sergey Shelukhin






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TEZ-3077) TezClient.waitTillReady should support timeout

2016-01-26 Thread Sergey Shelukhin (JIRA)
Sergey Shelukhin created TEZ-3077:
-

 Summary: TezClient.waitTillReady should support timeout
 Key: TEZ-3077
 URL: https://issues.apache.org/jira/browse/TEZ-3077
 Project: Apache Tez
  Issue Type: Bug
Reporter: Sergey Shelukhin






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TEZ-3077) TezClient.waitTillReady should support timeout

2016-01-26 Thread Sergey Shelukhin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated TEZ-3077:
--
Description: Also preWarm.

> TezClient.waitTillReady should support timeout
> --
>
> Key: TEZ-3077
> URL: https://issues.apache.org/jira/browse/TEZ-3077
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>
> Also preWarm.
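
Until a timeout overload exists, a caller can impose one externally by running 
the blocking wait on another thread. A sketch, assuming the wrapped call blocks 
until the session is ready:

{noformat}
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class WaitWithTimeout {
  // blockingWait would wrap tezClient.waitTillReady() (or a preWarm call).
  static void await(Runnable blockingWait, long timeout, TimeUnit unit)
      throws InterruptedException, TimeoutException {
    ExecutorService ex = Executors.newSingleThreadExecutor();
    try {
      Future<?> f = ex.submit(blockingWait);
      try {
        f.get(timeout, unit); // TimeoutException if not ready in time
      } catch (ExecutionException e) {
        throw new RuntimeException(e.getCause()); // propagate wait failure
      } finally {
        f.cancel(true); // interrupt the waiter if we timed out
      }
    } finally {
      ex.shutdownNow();
    }
  }
}
{noformat}

A native overload like waitTillReady(long, TimeUnit) would avoid the extra 
thread, which is presumably what this issue is after.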



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-2480) TEZ-2003: exception when closing output (ignored)

2015-11-18 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15011768#comment-15011768
 ] 

Sergey Shelukhin commented on TEZ-2480:
---

+1

> TEZ-2003: exception when closing output (ignored)
> -
>
> Key: TEZ-2480
> URL: https://issues.apache.org/jira/browse/TEZ-2480
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: TEZ-2003
>Reporter: Sergey Shelukhin
>Assignee: Siddharth Seth
> Attachments: TEZ-2480.1.txt
>
>
> Happens a lot in some queries:
> {noformat}
> sershe_20150522112029_d0863b33-8d2f-4b4c-b013-9ef70a2bc586:1_Map 1_8_0)] WARN 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask: Ignoring exception when 
> closing output Reducer 2(cleanup). Exception 
> class=java.lang.NullPointerException, message=null
> java.lang.NullPointerException
> at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:618)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:81)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:613)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:831)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:608)
> at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1425)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.open(RawLocalFileSystem.java:198)
> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:768)
> at 
> org.apache.tez.runtime.library.common.sort.impl.TezSpillRecord.(TezSpillRecord.java:64)
> at 
> org.apache.tez.runtime.library.common.sort.impl.TezSpillRecord.(TezSpillRecord.java:56)
> at 
> org.apache.tez.runtime.library.common.sort.impl.TezSpillRecord.(TezSpillRecord.java:51)
> at 
> org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput.generateEvents(OrderedPartitionedKVOutput.java:209)
> at 
> org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput.close(OrderedPartitionedKVOutput.java:186)
> at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.cleanup(LogicalIOProcessorRuntimeTask.java:849)
> at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:104)
> at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:35)
> at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Can this be fixed or not logged?
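
The usual defensive pattern for a cleanup path like this is to null-check 
lazily initialized state and keep cleanup failures quiet; a sketch with 
illustrative names, not the actual patch:

{noformat}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class QuietCleanup {
  private static final Logger LOG = LoggerFactory.getLogger(QuietCleanup.class);

  static void closeQuietly(String name, AutoCloseable output) {
    if (output == null) {
      return; // the output never finished initializing; nothing to close
    }
    try {
      output.close();
    } catch (Exception e) {
      // Cleanup failures after the task has already failed are non-fatal;
      // keep them at DEBUG instead of WARN to avoid log noise.
      LOG.debug("Ignoring exception when closing output {} (cleanup)", name, e);
    }
  }
}
{noformat}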



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (TEZ-2917) change some logs from info to debug

2015-10-30 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983531#comment-14983531
 ] 

Sergey Shelukhin edited comment on TEZ-2917 at 10/30/15 10:45 PM:
--

MRInputLegacy deferring initialization - how is this useful? Another thing, btw: 
whatever a developer can figure out just by looking at the config file, 
metadata, etc., and the code should not be at INFO level, i.e. logging configs 
and other such things.
getContext().getDestinationVertexName() + ": "
 + "outputFormat=" + outputFormatClassName
 + ", using newmapreduce API=" + useNewApi - that looks like it could 
just be figured out from the job and config.

Using oldApi, MRpartitionerClass= - same.

Waiting for N... is just logged many times, all the time, and it seems 
extremely situational - only relevant when some small piece of code gets stuck. 
The start of the process is already logged, and so is the end (I can restore 
these back to INFO), so it doesn't give much information even if something does 
hang; in 99.99% of cases it's just noise.

Cleaning up task and Initializing task are extremely situational given the 
surrounding log lines (i.e. running... is logged at INFO).

I dunno, I don't really care either way; this is based on users' complaints 
that too much obscure stuff that doesn't matter is getting logged. [~hagleitn] 
any opinion?


was (Author: sershe):
MRInputLegacy deferring initialization - how is this useful? Another thing, btw: 
whatever a developer can figure out just by looking at the config file, 
metadata, etc., and the code should not be at INFO level, i.e. logging configs 
and other such things.
getContext().getDestinationVertexName() + ": "
 + "outputFormat=" + outputFormatClassName
 + ", using newmapreduce API=" + useNewApi - that looks like it could 
just be figured out from the job and config.

Using oldApi, MRpartitionerClass= - same.

Waiting for N... is just logged many times, all the time, and it seems 
extremely situational - only relevant when some small piece of code gets stuck. 
Maybe it can log the wait once.

Cleaning up task and Initializing task are extremely situational given the 
surrounding log lines (i.e. running... is logged at INFO).

I dunno, I don't really care either way; this is based on users' complaints 
that too much obscure stuff that doesn't matter is getting logged. [~hagleitn] 
any opinion?

> change some logs from info to debug
> ---
>
> Key: TEZ-2917
> URL: https://issues.apache.org/jira/browse/TEZ-2917
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Attachments: TEZ-2917.patch
>
>
> I've done a highly unscientific summarization of the logs from some random 
> queries, and will now change some log statements that are the most prevalent 
> and not extremely useful from info to debug.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-2917) change some logs from info to debug

2015-10-30 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983531#comment-14983531
 ] 

Sergey Shelukhin commented on TEZ-2917:
---

MRInputLegacy deferring initialization - how is this useful? Another thing, btw: 
whatever a developer can figure out just by looking at the config file, 
metadata, etc., and the code should not be at INFO level, i.e. logging configs 
and other such things.
getContext().getDestinationVertexName() + ": "
 + "outputFormat=" + outputFormatClassName
 + ", using newmapreduce API=" + useNewApi - that looks like it could 
just be figured out from the job and config.

Using oldApi, MRpartitionerClass= - same.

Waiting for N... - same. It is just logged all the time and it seems extremely 
situational - only relevant when some small piece of code gets stuck. Maybe it 
can log the wait once.

Cleaning up task and Initializing task are extremely situational given the 
surrounding log lines (i.e. running... is logged at INFO).

I dunno, I don't really care either way; this is based on users' complaints 
that too much obscure stuff that doesn't matter is getting logged. [~hagleitn] 
any opinion?

> change some logs from info to debug
> ---
>
> Key: TEZ-2917
> URL: https://issues.apache.org/jira/browse/TEZ-2917
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Attachments: TEZ-2917.patch
>
>
> I've done a highly unscientific summarization of the logs from some random 
> queries, and will now change some log statements that are the most prevalent 
> and not extremely useful from info to debug.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-2917) change some logs from info to debug

2015-10-30 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14983419#comment-14983419
 ] 

Sergey Shelukhin commented on TEZ-2917:
---

I dunno if it matters how we disable the lines, as long as we do - to debug, 
you'd still have to re-enable them. Many of these lines look like they would 
only be useful in very narrow debugging scenarios (e.g. ratio-calculation 
details), so I think they belong at DEBUG level. The lines that are useful for 
a general understanding of what is going on are retained: task transitions 
(running and finished) plus unexpected conditions stay at INFO, but stuff like 
initializing and cleaning up goes to DEBUG.
Do you have suggestions for which lines to keep at INFO?
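
For concreteness, the kind of demotion being discussed looks roughly like this 
(illustrative names, not the actual patch; slf4j parameterized logging keeps 
disabled statements cheap):

{noformat}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class LogLevelSketch {
  private static final Logger LOG = LoggerFactory.getLogger(LogLevelSketch.class);

  void runTask(String taskId, Object effectiveConfig) {
    LOG.debug("Initializing task {}", taskId);   // was INFO: situational
    LOG.info("Running task {}", taskId);         // kept: a real transition
    if (LOG.isDebugEnabled()) {
      // Anything derivable from the job/config stays off INFO entirely.
      LOG.debug("Config for {}: {}", taskId, effectiveConfig);
    }
    LOG.info("Task {} finished", taskId);        // kept: a real transition
    LOG.debug("Cleaning up task {}", taskId);    // was INFO: situational
  }
}
{noformat}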


> change some logs from info to debug
> ---
>
> Key: TEZ-2917
> URL: https://issues.apache.org/jira/browse/TEZ-2917
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Attachments: TEZ-2917.patch
>
>
> I've done a highly unscientific summarization of the logs from some random 
> queries, and will now change some log statements that are the most prevalent 
> and not extremely useful from info to debug.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-2917) change some logs from info to debug

2015-10-29 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14981456#comment-14981456
 ] 

Sergey Shelukhin commented on TEZ-2917:
---

[~sseth] can you take a look?

> change some logs from info to debug
> ---
>
> Key: TEZ-2917
> URL: https://issues.apache.org/jira/browse/TEZ-2917
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sergey Shelukhin
>Assignee: Sergey Shelukhin
> Attachments: TEZ-2917.patch
>
>
> I've done a highly unscientific summarization of the logs from some random 
> queries, and will now change some log statements that are the most prevalent 
> and not extremely useful from info to debug.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

