[jira] [Commented] (IMPALA-8183) TestRPCTimeout.test_reportexecstatus_retry times out

2019-02-12 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16766741#comment-16766741
 ] 

ASF subversion and git services commented on IMPALA-8183:
---------------------------------------------------------

Commit 9492d451d5d5a82bfc6f4c93c3a0c6e6d0cc4981 in impala's branch 
refs/heads/master from Thomas Tauber-Marshall
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=9492d45 ]

IMPALA-8183: fix test_reportexecstatus_retry flakiness

The test is designed to cause ReportExecStatus() rpcs to fail by
backing up the control service queue. Previously, after a failed
ReportExecStatus() we would wait 'report_status_retry_interval_ms'
between retries, which was 100ms by default and wasn't touched by the
test. That 100ms was right on the edge of being enough time for the
coordinator to keep up with processing the reports, so that some would
fail but most would succeed. It was always possible that we could hit
IMPALA-2990 in this setup, but it was unlikely.

Now, with IMPALA-4555, 'report_status_retry_interval_ms' was removed
and we instead wait 'status_report_interval_ms' between retries. By
default, this is 5000ms, so it should give the coordinator even more
time and make these issues less likely. However, the test sets
'status_report_interval_ms' to 10ms, which isn't nearly enough time
for the coordinator to do its processing, causing lots of the
ReportExecStatus() rpcs to fail and making us hit IMPALA-2990 pretty
often.

The solution is to set 'status_report_interval_ms' to 100ms in the
test, which roughly achieves the same retry frequency as before. The
same change is made to a similar test, test_reportexecstatus_timeout.
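
To make the reasoning above concrete, here is a toy discrete-time model of a
bounded service queue. It is not Impala code. The function name
simulate_reports and every default (3 senders, 30ms of processing per report,
a queue of 3 slots, a 3000ms run) are assumptions made up for illustration;
only the 10ms, 100ms and 5000ms intervals come from the text above. A dropped
report stands in for a ReportExecStatus() rpc rejected due to backpressure.

```python
def simulate_reports(interval_ms, senders=3, service_ms=30,
                     queue_capacity=3, duration_ms=3000):
    """Return (sent, dropped) for a toy bounded-queue model.

    Every `interval_ms`, each sender tries to enqueue one report. A single
    consumer needs `service_ms` to process each report. A report that
    arrives while the queue is full is dropped.
    """
    queued = 0       # reports currently waiting in the queue
    busy_until = 0   # time at which the consumer is free again
    sent = dropped = 0
    for now in range(duration_ms):
        # Senders fire on every multiple of interval_ms.
        if now % interval_ms == 0:
            for _ in range(senders):
                sent += 1
                if queued < queue_capacity:
                    queued += 1
                else:
                    dropped += 1
        # The consumer picks up the next report as soon as it is free.
        if queued > 0 and now >= busy_until:
            queued -= 1
            busy_until = now + service_ms
    return sent, dropped


for interval_ms in (10, 100, 5000):
    sent, dropped = simulate_reports(interval_ms)
    print(f"interval={interval_ms}ms sent={sent} dropped={dropped}")
```

With these made-up parameters the 10ms interval drops most reports, while
100ms and 5000ms drop none. The real thresholds depend on coordinator load,
so the numbers are illustrative only.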

Testing:
- Ran test_reportexecstatus_retry in a loop 400 times without seeing a
  failure. It previously repro-ed for me about once per 50 runs.
- Manually verified that both tests are still hitting the error paths
  that they are supposed to be testing.
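
As a rough sanity check on the first bullet, the sketch below computes how
likely 400 clean runs in a row would be if the old failure rate of about one
in 50 still applied. It assumes runs are independent with a constant failure
rate, which is a simplification; prob_all_pass is a hypothetical helper name.

```python
def prob_all_pass(failure_rate, runs):
    """Probability that `runs` independent runs all pass."""
    return (1.0 - failure_rate) ** runs


p = prob_all_pass(failure_rate=1 / 50, runs=400)
print(f"P(400 passes in a row at a 1-in-50 failure rate) = {p:.5f}")
```

Under that assumption the probability is about 0.0003, so 400 passes in a row
would be very unlikely if the flakiness were unchanged.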

Change-Id: I7027a6e099c543705e5845ee0e5268f1f9a3fb05
Reviewed-on: http://gerrit.cloudera.org:8080/12461
Reviewed-by: Impala Public Jenkins 
Tested-by: Impala Public Jenkins 


> TestRPCTimeout.test_reportexecstatus_retry times out
> ----------------------------------------------------
>
>                 Key: IMPALA-8183
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8183
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Distributed Exec
>    Affects Versions: Impala 3.2.0
>            Reporter: Andrew Sherman
>            Assignee: Thomas Tauber-Marshall
>            Priority: Blocker
>
> There are 2 forms of failure: one where the test itself times out, and one
> where the whole test run times out, suspiciously just after running
> test_reportexecstatus_retry.
> {quote}
> Error Message
> Failed: Timeout >7200s
> Stacktrace
> custom_cluster/test_rpc_timeout.py:143: in test_reportexecstatus_retry
>     self.execute_query_verify_metrics(self.TEST_QUERY, None, 10)
> custom_cluster/test_rpc_timeout.py:45: in execute_query_verify_metrics
>     self.execute_query(query, query_options)
> common/impala_test_suite.py:601: in wrapper
>     return function(*args, **kwargs)
> common/impala_test_suite.py:632: in execute_query
>     return self.__execute_query(self.client, query, query_options)
> common/impala_test_suite.py:699: in __execute_query
>     return impalad_client.execute(query, user=user)
> common/impala_connection.py:174: in execute
>     return self.__beeswax_client.execute(sql_stmt, user=user)
> beeswax/impala_beeswax.py:183: in execute
>     handle = self.__execute_query(query_string.strip(), user=user)
> beeswax/impala_beeswax.py:360: in __execute_query
>     self.wait_for_finished(handle)
> beeswax/impala_beeswax.py:384: in wait_for_finished
>     time.sleep(0.05)
> E   Failed: Timeout >7200s
> {quote}
> {quote}
> Test run timed out. This probably happened due to a hung thread which can be 
> confirmed by looking at the stacktrace of running impalad processes at 
> /data/jenkins/workspace/xxx/repos/Impala/logs/timeout_stacktrace
> {quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Commented] (IMPALA-8183) TestRPCTimeout.test_reportexecstatus_retry times out

2019-02-12 Thread Thomas Tauber-Marshall (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16766371#comment-16766371
 ] 

Thomas Tauber-Marshall commented on IMPALA-8183:


Alright - figured out what's going on.

The test is designed to cause ReportExecStatus() rpcs to fail by backing up the 
control service queue. Prior to IMPALA-4555, after a failed ReportExecStatus() 
we would wait 'report_status_retry_interval_ms' between retries, which was 
100ms by default and wasn't touched by the test. That 100ms was right on the 
edge of being enough time for the coordinator to keep up with processing the 
reports, so that some would fail but most would succeed. It was always possible 
that we could hit IMPALA-2990 in this setup, but it was unlikely.

Now, we wait 'status_report_interval_ms' between retries. By default, this is
5000ms, so it should give the coordinator even more time and make these issues
less likely. However, the test sets 'status_report_interval_ms' to 10ms, which
isn't nearly enough time for the coordinator to do its processing, causing lots
of the ReportExecStatus() rpcs to fail and making us hit IMPALA-2990 pretty
often.

Not sure what the solution is yet; the test will need to be reworked, but at
least this isn't a bug in IMPALA-4555.




[jira] [Commented] (IMPALA-8183) TestRPCTimeout.test_reportexecstatus_retry times out

2019-02-12 Thread Thomas Tauber-Marshall (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16766304#comment-16766304
 ] 

Thomas Tauber-Marshall commented on IMPALA-8183:


Yeah, this may just be a poorly written test. If this is happening very often,
we can xfail it; otherwise, it'll probably be easier to write a good test for
this case once my rpc debugging patch goes in.

> TestRPCTimeout.test_reportexecstatus_retry times out
> ----------------------------------------------------
>
>                 Key: IMPALA-8183
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8183
>             Project: IMPALA
>          Issue Type: Bug
>            Reporter: Andrew Sherman
>            Assignee: Thomas Tauber-Marshall
>            Priority: Major



[jira] [Commented] (IMPALA-8183) TestRPCTimeout.test_reportexecstatus_retry times out

2019-02-11 Thread Andrew Sherman (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16765520#comment-16765520
 ] 

Andrew Sherman commented on IMPALA-8183:


Log says:

E0210 11:23:32.714677 21460 query-state.cc:424]
5a4a60d227d5669e:cfc4ed21] Cancelling fragment instances due to failure
to reach the coordinator. (ReportExecStatus() RPC failed: Remote error: Service
unavailable: ReportExecStatus request on impala.ControlService from
127.0.0.1:49775 dropped due to backpressure. The service queue is full; it has
2147483647 items.
). Query 5a4a60d227d5669e:cfc4ed21 may hang. See IMPALA-2990.

So is this just an instance of IMPALA-2990?




[jira] [Commented] (IMPALA-8183) TestRPCTimeout.test_reportexecstatus_retry times out

2019-02-11 Thread Andrew Sherman (JIRA)


[ 
https://issues.apache.org/jira/browse/IMPALA-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16765502#comment-16765502
 ] 

Andrew Sherman commented on IMPALA-8183:


The code that hangs is
https://github.com/apache/impala/blob/master/tests/beeswax/impala_beeswax.py#L372-L384
which suggests the query is in some state other than FINISHED or EXCEPTION.
Meanwhile, the log files contain no trace of the query ID
3d486150ff28c15c:118403c2
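
For illustration, a polling loop of this kind can be bounded with a deadline so
that a query stuck in a non-terminal state fails fast instead of running into
the 7200s test timeout. This is a minimal sketch, not the actual
impala_beeswax.py code: wait_for_terminal_state, QueryTimeoutError, the
get_state callback and the "RUNNING" state name are all hypothetical. Only the
FINISHED and EXCEPTION states and the 0.05s sleep come from this thread.

```python
import time


class QueryTimeoutError(Exception):
    """Raised when the query does not reach a terminal state in time."""


def wait_for_terminal_state(get_state, timeout_s, poll_interval_s=0.05,
                            terminal_states=("FINISHED", "EXCEPTION")):
    """Poll get_state() until it returns a terminal state or time runs out."""
    deadline = time.monotonic() + timeout_s
    while True:
        state = get_state()
        if state in terminal_states:
            return state
        if time.monotonic() >= deadline:
            raise QueryTimeoutError(
                f"query still in state {state!r} after {timeout_s}s")
        time.sleep(poll_interval_s)


# Usage with a fake state source standing in for a real client.
states = iter(["RUNNING", "RUNNING", "FINISHED"])
print(wait_for_terminal_state(lambda: next(states), timeout_s=1.0,
                              poll_interval_s=0.001))
```

The sketch uses time.monotonic() for the deadline so that wall-clock
adjustments cannot extend or shorten the wait.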





> TestRPCTimeout.test_reportexecstatus_retry times out
> ----------------------------------------------------
>
>                 Key: IMPALA-8183
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8183
>             Project: IMPALA
>          Issue Type: Bug
>            Reporter: Andrew Sherman
>            Priority: Major