[jira] [Work logged] (HIVE-24201) WorkloadManager can support delayed move if destination pool does not have enough sessions

2021-04-29 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24201?focusedWorklogId=591125=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-591125
 ]

ASF GitHub Bot logged work on HIVE-24201:
-

Author: ASF GitHub Bot
Created on: 29/Apr/21 16:52
Start Date: 29/Apr/21 16:52
Worklog Time Spent: 10m 
  Work Description: sankarh merged pull request #2065:
URL: https://github.com/apache/hive/pull/2065


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 591125)
Time Spent: 2.5h  (was: 2h 20m)

> WorkloadManager can support delayed move if destination pool does not have 
> enough sessions
> --
>
> Key: HIVE-24201
> URL: https://issues.apache.org/jira/browse/HIVE-24201
> Project: Hive
>  Issue Type: Improvement
>  Components: HiveServer2, llap
>Affects Versions: 4.0.0
>Reporter: Adesh Kumar Rao
>Assignee: Pritha Dawn
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> To reproduce, create a resource plan with a move trigger, like the one below:
> {code:java}
> +----------------------------------------------------------------------+
> |                                 line                                 |
> +----------------------------------------------------------------------+
> | experiment[status=DISABLED,parallelism=null,defaultPool=default]     |
> |  +  default[allocFraction=0.888,schedulingPolicy=null,parallelism=1] |
> |  |  mapped for default                                               |
> |  +  pool2[allocFraction=0.1,schedulingPolicy=fair,parallelism=1]     |
> |  |  trigger t1: if (ELAPSED_TIME > 20) { MOVE TO pool1 }             |
> |  |  mapped for users: abcd                                           |
> |  +  pool1[allocFraction=0.012,schedulingPolicy=null,parallelism=1]   |
> |  |  mapped for users: efgh                                           |
> +----------------------------------------------------------------------+
> {code}
> Now run two queries, one in pool1 and one in pool2, as different users. The
> query running in pool2 will try to move to pool1 and will be killed, because
> pool1 has no free session to handle it.
> Currently, the workload management move trigger kills the query being moved
> to a different pool if the destination pool does not have enough capacity. We
> could add a "delayed move" configuration that lets the query keep running in
> the source pool for as long as possible when the destination pool is full.
> The move to the destination pool is then attempted only when there is a claim
> on the source pool. If the destination pool is not full, a delayed move
> behaves like a normal move, i.e., the move happens immediately.
>  
>  
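The decision the description asks for can be sketched as below. This is a minimal illustration under stated assumptions, not Hive's implementation: the class `DelayedMoveSketch`, the method `decide`, and passing the destination pool's free-session count as a plain `int` are all hypothetical. Only the outcome names are borrowed from the `MoveSessionResult` enum that appears in PR #2065, and the later retry when the source pool is claimed is not modelled.

```java
/** Hypothetical sketch of the MOVE trigger decision; not Hive's actual code. */
final class DelayedMoveSketch {

  // Outcome names mirror the MoveSessionResult values proposed in PR #2065.
  enum MoveSessionResult { OK, KILLED, CONVERTED_TO_DELAYED_MOVE }

  /**
   * Decides what happens when a MOVE trigger fires for a running query.
   *
   * @param freeSessionsInDest hypothetical count of idle sessions in the destination pool
   * @param delayedMove        value of hive.server2.wm.delayed.move
   */
  static MoveSessionResult decide(int freeSessionsInDest, boolean delayedMove) {
    if (freeSessionsInDest > 0) {
      // Destination has room: a delayed move behaves exactly like a normal move.
      return MoveSessionResult.OK;
    }
    // Destination is full: keep the query in the source pool if delayed move is on,
    // otherwise fall back to the old behavior and kill it.
    return delayedMove ? MoveSessionResult.CONVERTED_TO_DELAYED_MOVE : MoveSessionResult.KILLED;
  }

  public static void main(String[] args) {
    System.out.println(decide(0, false)); // KILLED
    System.out.println(decide(0, true));  // CONVERTED_TO_DELAYED_MOVE
    System.out.println(decide(1, true));  // OK
  }
}
```

With the resource plan above, pool1 has `parallelism=1` and is already busy, so the trigger on pool2 sees zero free sessions; the current behavior corresponds to `decide(0, false)`.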



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-24201) WorkloadManager can support delayed move if destination pool does not have enough sessions

2021-04-29 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24201?focusedWorklogId=591123=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-591123
 ]

ASF GitHub Bot logged work on HIVE-24201:
-

Author: ASF GitHub Bot
Created on: 29/Apr/21 16:50
Start Date: 29/Apr/21 16:50
Worklog Time Spent: 10m 
  Work Description: sankarh commented on pull request #2065:
URL: https://github.com/apache/hive/pull/2065#issuecomment-829425475


   +1, Patch looks good.

Issue Time Tracking
---

Worklog Id: (was: 591123)
Time Spent: 2h 20m  (was: 2h 10m)

[jira] [Work logged] (HIVE-24201) WorkloadManager can support delayed move if destination pool does not have enough sessions

2021-04-29 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24201?focusedWorklogId=590863=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-590863
 ]

ASF GitHub Bot logged work on HIVE-24201:
-

Author: ASF GitHub Bot
Created on: 29/Apr/21 07:01
Start Date: 29/Apr/21 07:01
Worklog Time Spent: 10m 
  Work Description: sankarh commented on a change in pull request #2065:
URL: https://github.com/apache/hive/pull/2065#discussion_r622782514



##
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TriggerValidatorRunnable.java
##
@@ -95,7 +95,7 @@ public void run() {
 
   Trigger chosenTrigger = violatedSessions.get(sessionState);
   if (chosenTrigger != null) {
-LOG.info("Query: {}. {}. Applying action.", sessionState.getWmContext().getQueryId(),
+LOG.debug("Query: {}. {}. Applying action.", sessionState.getWmContext().getQueryId(),

Review comment:
   Do we have some other log that indicates the trigger event was generated?


Issue Time Tracking
---

Worklog Id: (was: 590863)
Time Spent: 2h  (was: 1h 50m)

[jira] [Work logged] (HIVE-24201) WorkloadManager can support delayed move if destination pool does not have enough sessions

2021-04-29 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24201?focusedWorklogId=590864=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-590864
 ]

ASF GitHub Bot logged work on HIVE-24201:
-

Author: ASF GitHub Bot
Created on: 29/Apr/21 07:01
Start Date: 29/Apr/21 07:01
Worklog Time Spent: 10m 
  Work Description: sankarh commented on a change in pull request #2065:
URL: https://github.com/apache/hive/pull/2065#discussion_r622782514



##
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TriggerValidatorRunnable.java
##
@@ -95,7 +95,7 @@ public void run() {
 
   Trigger chosenTrigger = violatedSessions.get(sessionState);
   if (chosenTrigger != null) {
-LOG.info("Query: {}. {}. Applying action.", sessionState.getWmContext().getQueryId(),
+LOG.debug("Query: {}. {}. Applying action.", sessionState.getWmContext().getQueryId(),

Review comment:
   Do we have some other log that indicates the trigger event was generated? 
Info logs help us trace such events.


Issue Time Tracking
---

Worklog Id: (was: 590864)
Time Spent: 2h 10m  (was: 2h)

[jira] [Work logged] (HIVE-24201) WorkloadManager can support delayed move if destination pool does not have enough sessions

2021-04-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24201?focusedWorklogId=590493=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-590493
 ]

ASF GitHub Bot logged work on HIVE-24201:
-

Author: ASF GitHub Bot
Created on: 28/Apr/21 16:16
Start Date: 28/Apr/21 16:16
Worklog Time Spent: 10m 
  Work Description: sankarh commented on a change in pull request #2065:
URL: https://github.com/apache/hive/pull/2065#discussion_r622334824



##
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/tez/WorkloadManager.java
##
@@ -682,11 +685,8 @@ private void processCurrentEvents(EventState e, WmThreadSyncWork syncWork) throw
 // as possible
 Map recordMoveEvents = new HashMap<>();
 for (MoveSession moveSession : e.moveSessions) {
-  if (HiveConf.getBoolVar(conf, ConfVars.HIVE_SERVER2_WM_DELAYED_MOVE)) {
-handleMoveSessionOnMasterThread(moveSession, syncWork, poolsToRedistribute, e.toReuse, recordMoveEvents, true);
-  } else {
-handleMoveSessionOnMasterThread(moveSession, syncWork, poolsToRedistribute, e.toReuse, recordMoveEvents, false);
-  }
+  handleMoveSessionOnMasterThread(moveSession, syncWork, poolsToRedistribute, e.toReuse,

Review comment:
   Could we store the result of HiveConf.getBoolVar(conf, 
ConfVars.HIVE_SERVER2_WM_DELAYED_MOVE) in a boolean variable outside the "for" 
loop and use it here?
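The reviewer's suggestion amounts to reading the flag once before the loop and passing it straight through, which also removes the duplicated `if`/`else` call. The sketch below shows only that loop shape; `MoveSession`, `readDelayedMoveFlag`, `handle`, and the `Map`-based configuration are simplified hypothetical stand-ins for Hive's `MoveSession`, `HiveConf.getBoolVar`, and `handleMoveSessionOnMasterThread`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of hoisting a config lookup out of a loop. */
final class HoistedFlagSketch {

  // Hypothetical stand-in for Hive's MoveSession; only the fields used here.
  static final class MoveSession {
    final String queryId;
    final String destPool;

    MoveSession(String queryId, String destPool) {
      this.queryId = queryId;
      this.destPool = destPool;
    }
  }

  // Counts configuration lookups so the effect of hoisting is observable.
  static int confReads = 0;

  // Hypothetical stand-in for HiveConf.getBoolVar(conf, ConfVars.HIVE_SERVER2_WM_DELAYED_MOVE).
  static boolean readDelayedMoveFlag(Map<String, String> conf) {
    confReads++;
    return Boolean.parseBoolean(conf.getOrDefault("hive.server2.wm.delayed.move", "false"));
  }

  // Hypothetical stand-in for handleMoveSessionOnMasterThread(..., convertToDelayedMove).
  static String handle(MoveSession s, boolean convertToDelayedMove) {
    return s.queryId + "->" + s.destPool + (convertToDelayedMove ? " (delayed)" : "");
  }

  static List<String> processMoves(List<MoveSession> moves, Map<String, String> conf) {
    // Read the flag once, outside the loop, instead of once per move event.
    final boolean delayedMove = readDelayedMoveFlag(conf);
    List<String> results = new ArrayList<>();
    for (MoveSession s : moves) {
      results.add(handle(s, delayedMove));
    }
    return results;
  }
}
```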

##
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##
@@ -3753,11 +3753,12 @@ private static void populateLlapDaemonVarsSet(Set llapDaemonVarsSetLocal
 "Determines behavior of the wm move trigger when destination pool is full.\n" +
 "If true, the query will run in source pool as long as possible if destination pool is full;\n" +
 "if false, the query will be killed if destination pool is full."),
-HIVE_SERVER2_WM_DELAYED_MOVE_TIMEOUT("hive.server2.wm.delayed.move.timeout", "600",
+HIVE_SERVER2_WM_DELAYED_MOVE_TIMEOUT("hive.server2.wm.delayed.move.timeout", "3600",
 new TimeValidator(TimeUnit.SECONDS),
 "The amount of time a delayed move is allowed to run in the source pool,\n" +
-"when a delayed move session times out, the session is moved to the destination pool.\n"),
-HIVE_SERVER2_WM_DELAYED_MOVE_VALIDATOR_INTERVAL("hive.server2.wm.delayed.move.validator.interval", "10",
+"when a delayed move session times out, the session is moved to the destination pool.\n" +
+"A value of 0 indicates no timeout"),
+HIVE_SERVER2_WM_DELAYED_MOVE_VALIDATOR_INTERVAL("hive.server2.wm.delayed.move.validator.interval", "60",
 new TimeValidator(TimeUnit.SECONDS),
 "Interval for checking for expired delayed moves and retries. Value of 0 indicates no checks."),

Review comment:
   I think we shouldn't allow 0 for the interval. It creates confusion: a 
timeout is set, but it never takes effect.
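A minimal sketch of the rule the reviewer is asking for, assuming the two settings are checked together at startup: a timeout of 0 is accepted and means "no timeout", while an interval of 0 (or less) is rejected outright, so a configured timeout can never be silently ignored. `DelayedMoveConfigCheck` and `validate` are hypothetical names; this is not how `HiveConf` itself enforces the range.

```java
/** Hypothetical startup check for the two delayed-move settings (not Hive code). */
final class DelayedMoveConfigCheck {

  /**
   * @param timeoutSec  hive.server2.wm.delayed.move.timeout in seconds; 0 means "no timeout"
   * @param intervalSec hive.server2.wm.delayed.move.validator.interval in seconds; must be > 0
   * @throws IllegalArgumentException if either value is out of range
   */
  static void validate(long timeoutSec, long intervalSec) {
    if (timeoutSec < 0) {
      throw new IllegalArgumentException(
          "delayed move timeout must be >= 0 (0 = no timeout): " + timeoutSec);
    }
    if (intervalSec <= 0) {
      // Rejecting 0 avoids a configured timeout that would never be checked.
      throw new IllegalArgumentException(
          "delayed move validator interval must be > 0: " + intervalSec);
    }
  }
}
```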


Issue Time Tracking
---

Worklog Id: (was: 590493)
Time Spent: 1h 50m  (was: 1h 40m)

[jira] [Work logged] (HIVE-24201) WorkloadManager can support delayed move if destination pool does not have enough sessions

2021-04-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24201?focusedWorklogId=589120=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-589120
 ]

ASF GitHub Bot logged work on HIVE-24201:
-

Author: ASF GitHub Bot
Created on: 26/Apr/21 10:26
Start Date: 26/Apr/21 10:26
Worklog Time Spent: 10m 
  Work Description: sankarh commented on a change in pull request #2065:
URL: https://github.com/apache/hive/pull/2065#discussion_r620083627



##
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##
@@ -3749,6 +3749,17 @@ private static void populateLlapDaemonVarsSet(Set llapDaemonVarsSetLocal
 new TimeValidator(TimeUnit.SECONDS),
 "The timeout for AM registry registration, after which (on attempting to use the\n" +
 "session), we kill it and try to get another one."),
+HIVE_SERVER2_WM_DELAYED_MOVE("hive.server2.wm.delayed.move", false,
+"Determines behavior of the wm move trigger when destination pool is full.\n" +
+"If true, the query will run in source pool as long as possible if destination pool is full;\n" +
+"if false, the query will be killed if destination pool is full."),
+HIVE_SERVER2_WM_DELAYED_MOVE_TIMEOUT("hive.server2.wm.delayed.move.timeout", "600",
+new TimeValidator(TimeUnit.SECONDS),
+"The amount of time a delayed move is allowed to run in the source pool,\n" +
+"when a delayed move session times out, the session is moved to the destination pool.\n"),
+HIVE_SERVER2_WM_DELAYED_MOVE_VALIDATOR_INTERVAL("hive.server2.wm.delayed.move.validator.interval", "10",
+new TimeValidator(TimeUnit.SECONDS),
+"Interval for checking for expired delayed moves and retries. Value of 0 indicates no checks."),

Review comment:
   Does "0" mean no timeout check, or no support for delayed move at all? I 
think it creates confusion either way. We shouldn't allow 0; this 
config should be > 0.
   hive.server2.wm.delayed.move.timeout=0 can be used for the no-timeout case.
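Under the semantics proposed in this comment (a timeout of 0 means the delayed move never expires, and the validator always runs with a positive interval), the per-session check the validator would perform could look like the sketch below. `DelayedMoveExpiry` and `isExpired` are hypothetical names, and using `System.nanoTime()` readings is an assumption chosen so that wall-clock adjustments cannot trigger or postpone expiry.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical expiry check for a delayed move (not Hive's implementation). */
final class DelayedMoveExpiry {

  /**
   * Returns true once a delayed move has waited at least the configured timeout.
   *
   * @param startNanos System.nanoTime() reading when the move became a delayed move
   * @param nowNanos   current System.nanoTime() reading
   * @param timeoutSec hive.server2.wm.delayed.move.timeout in seconds; 0 = never expires
   */
  static boolean isExpired(long startNanos, long nowNanos, long timeoutSec) {
    if (timeoutSec == 0) {
      // 0 = no timeout: the query may stay in the source pool indefinitely.
      return false;
    }
    // Compare the elapsed difference, which stays correct if nanoTime wraps around.
    return nowNanos - startNanos >= TimeUnit.SECONDS.toNanos(timeoutSec);
  }
}
```

An expired session would then be moved to the destination pool, as the description of `hive.server2.wm.delayed.move.timeout` states.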

##
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##
@@ -3749,6 +3749,17 @@ private static void populateLlapDaemonVarsSet(Set llapDaemonVarsSetLocal
 new TimeValidator(TimeUnit.SECONDS),
 "The timeout for AM registry registration, after which (on attempting to use the\n" +
 "session), we kill it and try to get another one."),
+HIVE_SERVER2_WM_DELAYED_MOVE("hive.server2.wm.delayed.move", false,
+"Determines behavior of the wm move trigger when destination pool is full.\n" +
+"If true, the query will run in source pool as long as possible if destination pool is full;\n" +
+"if false, the query will be killed if destination pool is full."),
+HIVE_SERVER2_WM_DELAYED_MOVE_TIMEOUT("hive.server2.wm.delayed.move.timeout", "600",
+new TimeValidator(TimeUnit.SECONDS),
+"The amount of time a delayed move is allowed to run in the source pool,\n" +
+"when a delayed move session times out, the session is moved to the destination pool.\n"),

Review comment:
   If the value 0 has a special meaning, such as "doesn't expire", then it needs 
to be captured here.

##
File path: ql/src/java/org/apache/hadoop/hive/ql/exec/tez/WorkloadManager.java
##
@@ -790,45 +842,72 @@ private void dumpPoolState(PoolState ps, List set) {
 }
   }
 
-  private void handleMoveSessionOnMasterThread(final MoveSession moveSession,
-final WmThreadSyncWork syncWork,
-final HashSet poolsToRedistribute,
-final Map toReuse,
-final Map recordMoveEvents) {
+  private static enum MoveSessionResult {
+OK, // Normal case - the session was moved.
+KILLED, // Killed because destination pool was full and delayed move is false.
+CONVERTED_TO_DELAYED_MOVE, // the move session was added to the pool's delayed moves as the dest. pool was full
+// and delayed move is true.
+ERROR
+  }
+
+  private MoveSessionResult handleMoveSessionOnMasterThread(final MoveSession moveSession,
+  final WmThreadSyncWork syncWork,
+  final HashSet poolsToRedistribute,
+  final Map toReuse,
+  final Map recordMoveEvents,
+  final boolean convertToDelayedMove) {
 String destPoolName = moveSession.destPool;
-LOG.info("Handling move session event: {}", moveSession);
+LOG.info("Handling move session event: {}, Convert to Delayed Move: {}", moveSession, convertToDelayedMove);
 if (validMove(moveSession.srcSession, destPoolName)) {
+  String srcPoolName = moveSession.srcSession.getPoolName();
+  PoolState srcPool = pools.get(srcPoolName);
+  boolean capacityAvailableInDest = capacityAvailable(destPoolName);
+  // If delayed move is set to true and if destination pool doesn't have enough capacity, don't kill the query.
+  // Let the query run in source