[jira] [Commented] (DRILL-6143) Make Fragment Runner's RPC Timeout a SystemOption

2018-02-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16368903#comment-16368903
 ] 

ASF GitHub Bot commented on DRILL-6143:
---

Github user asfgit closed the pull request at:

https://github.com/apache/drill/pull/1119


> Make Fragment Runner's RPC Timeout a SystemOption
> -
>
> Key: DRILL-6143
> URL: https://issues.apache.org/jira/browse/DRILL-6143
> Project: Apache Drill
> Issue Type: Bug
> Affects Versions: 1.13.0
> Reporter: Timothy Farkas
> Assignee: Timothy Farkas
> Priority: Major
> Labels: ready-to-commit
> Fix For: 1.13.0
>
>
> Queries fail sporadically on some clusters with the following error:
> {code}
> oadd.org.apache.drill.common.exceptions.UserRemoteException: CONNECTION 
> ERROR: Exceeded timeout (25000) while waiting send intermediate work 
> fragments to remote nodes. Sent 5 and only heard response back from 4 nodes.
> {code}
> This error happens because the FragmentsRunner has a hardcoded timeout, 
> RPC_WAIT_IN_MSECS_PER_FRAGMENT, which is set to 5 seconds. Increasing the 
> timeout to 10 seconds resolved the sporadic failures that were observed. The 
> default should be changed to 10 seconds, and the timeout should also be 
> configurable via the SystemOptionManager.
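
The figure in the error message (25000) is consistent with a per-fragment timeout of 5 seconds multiplied by the 5 fragments that were sent. The following is a minimal sketch of that arithmetic and of waiting for acknowledgements with a configurable per-fragment value. It assumes the total wait scales linearly with the fragment count; the class and method names are hypothetical and are not taken from Drill's FragmentsRunner.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; not Drill's actual FragmentsRunner code.
final class FragmentTimeoutSketch {

  // Assumption: the total wait is the per-fragment timeout times the number of fragments sent.
  static long totalTimeoutMs(long perFragmentTimeoutMs, int fragmentCount) {
    return perFragmentTimeoutMs * fragmentCount;
  }

  // Waits for all acknowledgements; returns true only if every one arrived before the deadline.
  static boolean awaitAcks(CountDownLatch acks, long perFragmentTimeoutMs, int fragmentCount) {
    try {
      return acks.await(totalTimeoutMs(perFragmentTimeoutMs, fragmentCount), TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt for the caller
      return false;
    }
  }

  public static void main(String[] args) {
    // 5 fragments at 5000 ms each matches the 25000 reported in the error.
    System.out.println(totalTimeoutMs(5_000L, 5));

    // Simulate 5 fragments sent with only 4 responses; a tiny timeout keeps the demo fast.
    CountDownLatch acks = new CountDownLatch(5);
    for (int i = 0; i < 4; i++) {
      acks.countDown();
    }
    System.out.println(awaitAcks(acks, 10L, 5));
  }
}
```

Passing the per-fragment value in as a parameter, rather than reading a constant, is what would let an option manager supply it at run time.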



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6143) Make Fragment Runner's RPC Timeout a SystemOption

2018-02-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359170#comment-16359170
 ] 

ASF GitHub Bot commented on DRILL-6143:
---

Github user vrozov commented on the issue:

https://github.com/apache/drill/pull/1119
  
LGTM




[jira] [Commented] (DRILL-6143) Make Fragment Runner's RPC Timeout a SystemOption

2018-02-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358955#comment-16358955
 ] 

ASF GitHub Bot commented on DRILL-6143:
---

Github user ilooner commented on a diff in the pull request:

https://github.com/apache/drill/pull/1119#discussion_r167346417
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/server/options/SystemOptionManager.java ---
@@ -212,7 +212,8 @@
   new OptionDefinition(ExecConstants.CPU_LOAD_AVERAGE),
   new OptionDefinition(ExecConstants.ENABLE_VECTOR_VALIDATOR),
   new OptionDefinition(ExecConstants.ENABLE_ITERATOR_VALIDATOR),
-  new OptionDefinition(ExecConstants.OUTPUT_BATCH_SIZE_VALIDATOR, new OptionMetaData(OptionValue.AccessibleScopes.SYSTEM, true, false))
+  new OptionDefinition(ExecConstants.OUTPUT_BATCH_SIZE_VALIDATOR, new OptionMetaData(OptionValue.AccessibleScopes.SYSTEM, true, false)),
+  new OptionDefinition(ExecConstants.FRAG_RUNNER_RPC_TIMEOUT_VALIDATOR, new OptionMetaData(OptionValue.AccessibleScopes.SYSTEM, false, true)),
--- End diff --

internal should be true since we want this to show up in the internal 
options table and not the standard system options table. Changing to adminOnly 
= true seems reasonable.
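
To illustrate the distinction discussed here, the following is a minimal sketch of how an `internal` flag could route an option to a separate table while `adminOnly` is tracked independently. The `Meta` class, the `table` helper, and the option name `some.public.option` are hypothetical stand-ins, not Drill's OptionMetaData or its options tables; the flag order (adminOnly, then internal) follows the parameter order implied by the review question in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of option metadata; not Drill's OptionMetaData class.
final class OptionVisibilitySketch {

  static final class Meta {
    final String name;
    final boolean adminOnly; // only administrators may change the option
    final boolean internal;  // listed in the internal table instead of the standard one

    Meta(String name, boolean adminOnly, boolean internal) {
      this.name = name;
      this.adminOnly = adminOnly;
      this.internal = internal;
    }
  }

  // Returns the option names for the internal table (internalTable = true)
  // or for the standard table (internalTable = false).
  static List<String> table(List<Meta> options, boolean internalTable) {
    List<String> names = new ArrayList<>();
    for (Meta m : options) {
      if (m.internal == internalTable) {
        names.add(m.name);
      }
    }
    return names;
  }

  public static void main(String[] args) {
    List<Meta> options = Arrays.asList(
        new Meta("drill.exec.rpc.fragrunner.timeout", true, true),
        new Meta("some.public.option", false, false)); // hypothetical option name
    System.out.println(table(options, true));
    System.out.println(table(options, false));
  }
}
```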




[jira] [Commented] (DRILL-6143) Make Fragment Runner's RPC Timeout a SystemOption

2018-02-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358942#comment-16358942
 ] 

ASF GitHub Bot commented on DRILL-6143:
---

Github user Ben-Zvi commented on a diff in the pull request:

https://github.com/apache/drill/pull/1119#discussion_r167343642
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/server/options/SystemOptionManager.java ---
@@ -212,7 +212,8 @@
   new OptionDefinition(ExecConstants.CPU_LOAD_AVERAGE),
   new OptionDefinition(ExecConstants.ENABLE_VECTOR_VALIDATOR),
   new OptionDefinition(ExecConstants.ENABLE_ITERATOR_VALIDATOR),
-  new OptionDefinition(ExecConstants.OUTPUT_BATCH_SIZE_VALIDATOR, new OptionMetaData(OptionValue.AccessibleScopes.SYSTEM, true, false))
+  new OptionDefinition(ExecConstants.OUTPUT_BATCH_SIZE_VALIDATOR, new OptionMetaData(OptionValue.AccessibleScopes.SYSTEM, true, false)),
+  new OptionDefinition(ExecConstants.FRAG_RUNNER_RPC_TIMEOUT_VALIDATOR, new OptionMetaData(OptionValue.AccessibleScopes.SYSTEM, false, true)),
--- End diff --

Question: should the last two parameters instead be:
   adminOnly = true
   internal = false 





[jira] [Commented] (DRILL-6143) Make Fragment Runner's RPC Timeout a SystemOption

2018-02-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358913#comment-16358913
 ] 

ASF GitHub Bot commented on DRILL-6143:
---

Github user ilooner commented on a diff in the pull request:

https://github.com/apache/drill/pull/1119#discussion_r167338469
  
--- Diff: exec/java-exec/src/main/resources/drill-module.conf ---
@@ -413,6 +413,7 @@ drill.exec.options: {
 # to start at least 2 partitions then HashAgg fallbacks to this case. It can be
 # enabled by setting this flag to true. By default it's set to false such that
 # query will fail if there is not enough memory
+drill.exec.rpc.fragrunner.timeout: 3,
--- End diff --

Thanks for catching the ordering. I reduced the default to 1. 




[jira] [Commented] (DRILL-6143) Make Fragment Runner's RPC Timeout a SystemOption

2018-02-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358802#comment-16358802
 ] 

ASF GitHub Bot commented on DRILL-6143:
---

Github user vrozov commented on a diff in the pull request:

https://github.com/apache/drill/pull/1119#discussion_r167312527
  
--- Diff: exec/java-exec/src/main/resources/drill-module.conf ---
@@ -413,6 +413,7 @@ drill.exec.options: {
 # to start at least 2 partitions then HashAgg fallbacks to this case. It can be
 # enabled by setting this flag to true. By default it's set to false such that
 # query will fail if there is not enough memory
+drill.exec.rpc.fragrunner.timeout: 3,
--- End diff --

The value looks a little high, and the new line needs to be moved either below 
`drill.exec.hashagg.fallback.enabled` or above the preceding comment; otherwise 
LGTM.




[jira] [Commented] (DRILL-6143) Make Fragment Runner's RPC Timeout a SystemOption

2018-02-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358752#comment-16358752
 ] 

ASF GitHub Bot commented on DRILL-6143:
---

Github user priteshm commented on the issue:

https://github.com/apache/drill/pull/1119
  
@arina-ielchiieva is on vacation. @vrozov, @Ben-Zvi  can you take a look?




[jira] [Commented] (DRILL-6143) Make Fragment Runner's RPC Timeout a SystemOption

2018-02-08 Thread Timothy Farkas (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357932#comment-16357932
 ] 

Timothy Farkas commented on DRILL-6143:
---

There seem to be cases where a drillbit does not send a response to the 
FragmentsRunner even with a larger timeout. There is likely another issue that 
can cause a response never to be sent in some cases. I will create a separate 
ticket for that case.
