[jira] [Commented] (HIVE-18263) Ptest execution are multiple times slower sometimes due to dying executor slaves

2017-12-15 Thread Adam Szita (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292354#comment-16292354
 ] 

Adam Szita commented on HIVE-18263:
---

Thanks for reviewing [~zsombor.klara], and committing [~pvary]!

This morning after the change was committed I've restarted ptest2, and I've 
been monitoring its log:
It started by killing off the existing 12 slaves it was using before, and 
simultaneously creating 12 new ones, all according to plan:
{code}
2017-12-15 09:44:41 INFO  [localhost-startStop-1] ExecutionController:85 - 
Reading configuration from file: 
/opt/apache-tomcat-7.0.72/conf/cloudhost.properties
2017-12-15 09:44:46 INFO  [localhost-startStop-1] 
CloudExecutionContextProvider:130 - CloudExecutionContextProvider 
maxHostsPerCreateRequest = 2
2017-12-15 09:44:46 INFO  [localhost-startStop-1] 
CloudExecutionContextProvider:421 - Requesting termination of [35.225.33.208, 
104.198.248.189, 35.184.205.41, 35.184.14.247, 35.192.52.184, 104.154.229.171, 
35.192.216.79, 35.224.189.104, 35.184.147.31, 35.225.218.206, 104.198.217.87, 
35.224.37.167]
{code}
One hour later the caretaker thread saw the following IP's in 
{{mTerminatedHosts}}:
{code}
2017-12-15 10:44:51 INFO  [CloudExecutionContextProvider-BackgroundWorker] 
CloudExecutionContextProvider:340 - Performing background work
2017-12-15 10:44:51 INFO  [CloudExecutionContextProvider-BackgroundWorker] 
CloudExecutionContextProvider:345 - Currently tracked terminated hosts: 
[35.225.33.208, 35.184.205.41, 104.154.229.171, 35.192.216.79, 35.184.147.31, 
35.225.218.206, 35.224.37.167]
{code}
..and it did nothing after, killed no slaves.

All of these 7 IPs can be found in the list of 12 old slaves above (they were 
killed at startup). So without this change it is very likely that 5 slaves 
would've been killed during test execution unnecessarily.
(e.g. host with IP 104.198.217.87 is not part of mTerminatedHosts, but it is 
running right now, and it reuses an IP from the old group)

All-in-all I believe this is working now as it should.

> Ptest execution are multiple times slower sometimes due to dying executor 
> slaves
> 
>
> Key: HIVE-18263
> URL: https://issues.apache.org/jira/browse/HIVE-18263
> Project: Hive
>  Issue Type: Bug
>  Components: Testing Infrastructure
>Reporter: Adam Szita
>Assignee: Adam Szita
> Fix For: 3.0.0
>
> Attachments: HIVE-18263.0.patch, HIVE-18263.1.patch
>
>
> PreCommit-HIVE-Build job has been seen running very long from time to time. 
> Usually it should take about 1.5 hours, but in some cases it took over 4-5 
> hours.
> Looking in the logs of one such execution I've seen that some commands that 
> were sent to test executing slaves returned 255. Here this typically means 
> that there is unknown return code for the remote call since hiveptest-server 
> can't reach these slaves anymore.
> In the hiveptest-server logs it is seen that some slaves were killed while 
> running the job normally, and here is why:
> * Hive's ptest-server checks periodically in every 60 minutes the status of 
> slaves. It also keeps track of slaves that were terminated.
> ** If upon such check it is found that a slave that was already killed 
> ([mTerminatedHosts 
> map|https://github.com/apache/hive/blob/master/testutils/ptest2/src/main/java/org/apache/hive/ptest/execution/context/CloudExecutionContextProvider.java#L93]
>  contains its IP) is still running, it will try and terminate it again.
> * The server also maintains a file on its local FS that contains the IP of 
> hosts that were used before. (This probably for resilience reasons)
> ** This file is read when tomcat server starts and if any of the IPs in the 
> file are seen as running slaves, ptest will terminate these first so it can 
> begin with a fresh start
> ** The IPs of these terminated instances already make their way into 
> {{mTerminatedHosts}} upon initialization...
> * The cloud provider may reuse some older IPs, so it is not too rare that the 
> same IP that belonged to a terminated host is assigned to a new one.
> This is problematic: Hive ptest's slave caretaker thread kicks in every 60 
> minutes and might see a running host that has the same IP as an old slave had 
> which was terminated at startup. It will think that this host should be 
> terminated since it already tried 60 minutes ago as its IP is in 
> {{mTerminatedHosts}}
> We have to fix this by making sure that if a new slave is created, we check 
> the contents of {{mTerminatedHosts}} and remove this IP from it if it is 
> there.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18263) Ptest execution are multiple times slower sometimes due to dying executor slaves

2017-12-14 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16291987#comment-16291987
 ] 

Hive QA commented on HIVE-18263:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12902084/HIVE-18263.1.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 24 failed/errored test(s), 11497 tests 
executed
*Failed tests:*
{noformat}
TestMiniLlapLocalCliDriver - did not produce a TEST-*.xml file (likely timed 
out) (batchId=168)

[smb_mapjoin_15.q,vector_windowing_expressions.q,auto_join1.q,insert_values_partitioned.q,selectDistinctStar.q,vector_windowing_gby.q,vectorized_timestamp.q,bucket4.q,vector_groupby_mapjoin.q,cbo_subq_exists.q,insert_values_dynamic_partitioned.q,schema_evol_orc_vec_part_all_primitive.q,infer_bucket_sort_bucketed_table.q,cbo_union.q,filter_join_breaktask2.q,udaf_all_keyword.q,schema_evol_orc_vec_part.q,tez_self_join.q,vector_data_types.q,mm_conversions.q,vector_partitioned_date_time.q,schema_evol_orc_nonvec_part.q,dynamic_semijoin_reduction_2.q,vector_mapjoin_reduce.q,vectorization_3.q,auto_sortmerge_join_8.q,disable_merge_for_bucketing.q,vectorized_date_funcs.q,vector_varchar_simple.q,vector_adaptor_usage_mode.q]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[ppd_join5] (batchId=35)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[bucketsortoptimize_insert_2]
 (batchId=152)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[hybridgrace_hashjoin_2]
 (batchId=157)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[insert_values_orig_table_use_metadata]
 (batchId=165)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[llap_acid] 
(batchId=169)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[llap_acid_fast]
 (batchId=160)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[quotedid_smb]
 (batchId=157)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[sysdb] 
(batchId=160)
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[archive_partspec2]
 (batchId=93)
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[authorization_alter_table_exchange_partition_fail]
 (batchId=93)
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[authorization_part]
 (batchId=93)
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[ctas_noemptyfolder]
 (batchId=93)
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[materialized_view_authorization_create_no_grant]
 (batchId=93)
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[materialized_view_authorization_drop_other]
 (batchId=93)
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[stats_aggregator_error_1]
 (batchId=93)
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[stats_publisher_error_1]
 (batchId=93)
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[subquery_notin_implicit_gby]
 (batchId=93)
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[truncate_bucketed_column]
 (batchId=93)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[auto_sortmerge_join_10]
 (batchId=138)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[bucketsortoptimize_insert_7]
 (batchId=128)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[ppd_join5] 
(batchId=120)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[subquery_multi] 
(batchId=113)
org.apache.hadoop.hive.ql.parse.TestReplicationScenarios.testConstraints 
(batchId=226)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/8251/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/8251/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-8251/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 24 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12902084 - PreCommit-HIVE-Build

> Ptest execution are multiple times slower sometimes due to dying executor 
> slaves
> 
>
> Key: HIVE-18263
> URL: https://issues.apache.org/jira/browse/HIVE-18263
> Project: Hive
>  Issue Type: Bug
>  Components: Testing Infrastructure
>Reporter: Adam Szita
>Assignee: Adam Szita
> Attachments: HIVE-18263.0.patch, HIVE-18263.1.patch
>
>
> PreCommit-HIVE-Build job has been seen running very long from time to 

[jira] [Commented] (HIVE-18263) Ptest execution are multiple times slower sometimes due to dying executor slaves

2017-12-14 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16291945#comment-16291945
 ] 

Hive QA commented on HIVE-18263:


| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m  
1s{color} | {color:blue} Findbugs executables are not available. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  7m 
 2s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
12s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
10s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
12s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
13s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m  
9s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m  
9s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
 9s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
10s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} asflicense {color} | {color:red}  0m 
12s{color} | {color:red} The patch generated 3 ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}  8m 46s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Optional Tests |  asflicense  javac  javadoc  findbugs  checkstyle  compile  |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 
3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /data/hiveptest/working/yetus/dev-support/hive-personality.sh |
| git revision | master / e120bd8 |
| Default Java | 1.8.0_111 |
| asflicense | 
http://104.198.109.242/logs//PreCommit-HIVE-Build-8251/yetus/patch-asflicense-problems.txt
 |
| modules | C: testutils/ptest2 U: testutils/ptest2 |
| Console output | 
http://104.198.109.242/logs//PreCommit-HIVE-Build-8251/yetus.txt |
| Powered by | Apache Yetushttp://yetus.apache.org |


This message was automatically generated.



> Ptest execution are multiple times slower sometimes due to dying executor 
> slaves
> 
>
> Key: HIVE-18263
> URL: https://issues.apache.org/jira/browse/HIVE-18263
> Project: Hive
>  Issue Type: Bug
>  Components: Testing Infrastructure
>Reporter: Adam Szita
>Assignee: Adam Szita
> Attachments: HIVE-18263.0.patch, HIVE-18263.1.patch
>
>
> PreCommit-HIVE-Build job has been seen running very long from time to time. 
> Usually it should take about 1.5 hours, but in some cases it took over 4-5 
> hours.
> Looking in the logs of one such execution I've seen that some commands that 
> were sent to test executing slaves returned 255. Here this typically means 
> that there is unknown return code for the remote call since hiveptest-server 
> can't reach these slaves anymore.
> In the hiveptest-server logs it is seen that some slaves were killed while 
> running the job normally, and here is why:
> * Hive's ptest-server checks periodically in every 60 minutes the status of 
> slaves. It also keeps track of slaves that were terminated.
> ** If upon such check it is found that a slave that was already killed 
> ([mTerminatedHosts 
> map|https://github.com/apache/hive/blob/master/testutils/ptest2/src/main/java/org/apache/hive/ptest/execution/context/CloudExecutionContextProvider.java#L93]
>  contains its IP) is still running, it will try and terminate it again.
> * The server also maintains a file on its local FS that contains the IP of 
> hosts that were used before. (This probably for resilience reasons)
> ** This file is read when tomcat server starts and if any of the IPs in the 
> file are seen as running 

[jira] [Commented] (HIVE-18263) Ptest execution are multiple times slower sometimes due to dying executor slaves

2017-12-14 Thread Adam Szita (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16290922#comment-16290922
 ] 

Adam Szita commented on HIVE-18263:
---

Attached [^HIVE-18263.1.patch] with the additional logging included.

> Ptest execution are multiple times slower sometimes due to dying executor 
> slaves
> 
>
> Key: HIVE-18263
> URL: https://issues.apache.org/jira/browse/HIVE-18263
> Project: Hive
>  Issue Type: Bug
>  Components: Testing Infrastructure
>Reporter: Adam Szita
>Assignee: Adam Szita
> Attachments: HIVE-18263.0.patch, HIVE-18263.1.patch
>
>
> PreCommit-HIVE-Build job has been seen running very long from time to time. 
> Usually it should take about 1.5 hours, but in some cases it took over 4-5 
> hours.
> Looking in the logs of one such execution I've seen that some commands that 
> were sent to test executing slaves returned 255. Here this typically means 
> that there is unknown return code for the remote call since hiveptest-server 
> can't reach these slaves anymore.
> In the hiveptest-server logs it is seen that some slaves were killed while 
> running the job normally, and here is why:
> * Hive's ptest-server checks periodically in every 60 minutes the status of 
> slaves. It also keeps track of slaves that were terminated.
> ** If upon such check it is found that a slave that was already killed 
> ([mTerminatedHosts 
> map|https://github.com/apache/hive/blob/master/testutils/ptest2/src/main/java/org/apache/hive/ptest/execution/context/CloudExecutionContextProvider.java#L93]
>  contains its IP) is still running, it will try and terminate it again.
> * The server also maintains a file on its local FS that contains the IP of 
> hosts that were used before. (This probably for resilience reasons)
> ** This file is read when tomcat server starts and if any of the IPs in the 
> file are seen as running slaves, ptest will terminate these first so it can 
> begin with a fresh start
> ** The IPs of these terminated instances already make their way into 
> {{mTerminatedHosts}} upon initialization...
> * The cloud provider may reuse some older IPs, so it is not too rare that the 
> same IP that belonged to a terminated host is assigned to a new one.
> This is problematic: Hive ptest's slave caretaker thread kicks in every 60 
> minutes and might see a running host that has the same IP as an old slave had 
> which was terminated at startup. It will think that this host should be 
> terminated since it already tried 60 minutes ago as its IP is in 
> {{mTerminatedHosts}}
> We have to fix this by making sure that if a new slave is created, we check 
> the contents of {{mTerminatedHosts}} and remove this IP from it if it is 
> there.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18263) Ptest execution are multiple times slower sometimes due to dying executor slaves

2017-12-14 Thread Peter Vary (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16290901#comment-16290901
 ] 

Peter Vary commented on HIVE-18263:
---

We would like to deploy this proposed solution tomorrow morning CET.
With an additional log line showing the IP addresses of the hosts in the 
{{mTerminatedHosts}} map.

Please feel free to voice any concerns until then.

Thanks,
Peter


> Ptest execution are multiple times slower sometimes due to dying executor 
> slaves
> 
>
> Key: HIVE-18263
> URL: https://issues.apache.org/jira/browse/HIVE-18263
> Project: Hive
>  Issue Type: Bug
>  Components: Testing Infrastructure
>Reporter: Adam Szita
>Assignee: Adam Szita
> Attachments: HIVE-18263.0.patch
>
>
> PreCommit-HIVE-Build job has been seen running very long from time to time. 
> Usually it should take about 1.5 hours, but in some cases it took over 4-5 
> hours.
> Looking in the logs of one such execution I've seen that some commands that 
> were sent to test executing slaves returned 255. Here this typically means 
> that there is unknown return code for the remote call since hiveptest-server 
> can't reach these slaves anymore.
> In the hiveptest-server logs it is seen that some slaves were killed while 
> running the job normally, and here is why:
> * Hive's ptest-server checks periodically in every 60 minutes the status of 
> slaves. It also keeps track of slaves that were terminated.
> ** If upon such check it is found that a slave that was already killed 
> ([mTerminatedHosts 
> map|https://github.com/apache/hive/blob/master/testutils/ptest2/src/main/java/org/apache/hive/ptest/execution/context/CloudExecutionContextProvider.java#L93]
>  contains its IP) is still running, it will try and terminate it again.
> * The server also maintains a file on its local FS that contains the IP of 
> hosts that were used before. (This probably for resilience reasons)
> ** This file is read when tomcat server starts and if any of the IPs in the 
> file are seen as running slaves, ptest will terminate these first so it can 
> begin with a fresh start
> ** The IPs of these terminated instances already make their way into 
> {{mTerminatedHosts}} upon initialization...
> * The cloud provider may reuse some older IPs, so it is not too rare that the 
> same IP that belonged to a terminated host is assigned to a new one.
> This is problematic: Hive ptest's slave caretaker thread kicks in every 60 
> minutes and might see a running host that has the same IP as an old slave had 
> which was terminated at startup. It will think that this host should be 
> terminated since it already tried 60 minutes ago as its IP is in 
> {{mTerminatedHosts}}
> We have to fix this by making sure that if a new slave is created, we check 
> the contents of {{mTerminatedHosts}} and remove this IP from it if it is 
> there.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18263) Ptest execution are multiple times slower sometimes due to dying executor slaves

2017-12-13 Thread Barna Zsombor Klara (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16289286#comment-16289286
 ] 

Barna Zsombor Klara commented on HIVE-18263:


Based on the logs I tend to agree with [~szita] and the proposed patch is small 
with a very limited impact. Even if we misunderstood something here the 
possibility of regression is small.

> Ptest execution are multiple times slower sometimes due to dying executor 
> slaves
> 
>
> Key: HIVE-18263
> URL: https://issues.apache.org/jira/browse/HIVE-18263
> Project: Hive
>  Issue Type: Bug
>  Components: Testing Infrastructure
>Reporter: Adam Szita
>Assignee: Adam Szita
> Attachments: HIVE-18263.0.patch
>
>
> PreCommit-HIVE-Build job has been seen running very long from time to time. 
> Usually it should take about 1.5 hours, but in some cases it took over 4-5 
> hours.
> Looking in the logs of one such execution I've seen that some commands that 
> were sent to test executing slaves returned 255. Here this typically means 
> that there is unknown return code for the remote call since hiveptest-server 
> can't reach these slaves anymore.
> In the hiveptest-server logs it is seen that some slaves were killed while 
> running the job normally, and here is why:
> * Hive's ptest-server checks periodically in every 60 minutes the status of 
> slaves. It also keeps track of slaves that were terminated.
> ** If upon such check it is found that a slave that was already killed 
> ([mTerminatedHosts 
> map|https://github.com/apache/hive/blob/master/testutils/ptest2/src/main/java/org/apache/hive/ptest/execution/context/CloudExecutionContextProvider.java#L93]
>  contains its IP) is still running, it will try and terminate it again.
> * The server also maintains a file on its local FS that contains the IP of 
> hosts that were used before. (This probably for resilience reasons)
> ** This file is read when tomcat server starts and if any of the IPs in the 
> file are seen as running slaves, ptest will terminate these first so it can 
> begin with a fresh start
> ** The IPs of these terminated instances already make their way into 
> {{mTerminatedHosts}} upon initialization...
> * The cloud provider may reuse some older IPs, so it is not too rare that the 
> same IP that belonged to a terminated host is assigned to a new one.
> This is problematic: Hive ptest's slave caretaker thread kicks in every 60 
> minutes and might see a running host that has the same IP as an old slave had 
> which was terminated at startup. It will think that this host should be 
> terminated since it already tried 60 minutes ago as its IP is in 
> {{mTerminatedHosts}}
> We have to fix this by making sure that if a new slave is created, we check 
> the contents of {{mTerminatedHosts}} and remove this IP from it if it is 
> there.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18263) Ptest execution are multiple times slower sometimes due to dying executor slaves

2017-12-13 Thread Adam Szita (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16289154#comment-16289154
 ] 

Adam Szita commented on HIVE-18263:
---

I believe the description of HIVE-15102 says exactly what happens and matches 
what I have been observing, but I disagree with the proposed solution and the 
conclusion as well.

I'm showing here an example of this issue through slave with IP 35.225.218.206, 
that was killed during a job unnecessarily. I'm also providing some snippets of 
logs of 
[PreCommit-HIVE-Build/8176|https://builds.apache.org/job/PreCommit-HIVE-Build/8176/consoleFull]


Tomcat was just started and the first job assigned to it was 
PreCommit-HIVE-Build/8176/.
On hiveptest-server-upstream, /data/hiveptest/hosts file contains 
35.225.218.206 even right now, and was contained on 2017-12-11 morning as well.
Although I don't have all the server logs from that time now, we will see that 
upon initalization this host was killed, and then a new one was created with 
same IP.

Jenkins has started running the job at 2017-12-11T09:31:55Z as seen at the 
beginning of the logs:
{code}

[INFO] Copying spring-webmvc-3.2.16.RELEASE.jar to 
/home/jenkins/jenkins-slave/workspace/PreCommit-HIVE-Build/hive/build/hive/testutils/ptest2/target/lib/spring-webmvc-3.2.16.RELEASE.jar
[INFO] Copying velocity-1.7.jar to 
/home/jenkins/jenkins-slave/workspace/PreCommit-HIVE-Build/hive/build/hive/testutils/ptest2/target/lib/velocity-1.7.jar
[INFO] 
[INFO] BUILD SUCCESS
[INFO] 
[INFO] Total time: 9.239 s
[INFO] Finished at: 2017-12-11T09:31:55Z
[INFO] Final Memory: 37M/913M
[INFO] 
+ local 
'PTEST_CLASSPATH=/home/jenkins/jenkins-slave/workspace/PreCommit-HIVE-Build/hive/build/hive/testutils/ptest2/target/hive-ptest-3.0-classes.jar:/home/jenkins/jenkins-slave/workspace/PreCommit-HIVE-Build/hive/build/hive/testutils/ptest2/target/lib/*'
+ java -cp 
'/home/jenkins/jenkins-slave/workspace/PreCommit-HIVE-Build/hive/build/hive/testutils/ptest2/target/hive-ptest-3.0-classes.jar:/home/jenkins/jenkins-slave/workspace/PreCommit-HIVE-Build/hive/build/hive/testutils/ptest2/target/lib/*'
 org.apache
{code}

Here unfortunately we don't have logs from the server, what happens is 
basically the server is reading the hosts file, terminating any running nodes 
that are found in this file and then creating 12 new executors.
This takes time because only 2 executors are created simultaneously, and there 
is a wait time in between them, hence the log of jenkins shows what happens 
when ptest already created the slaves: (you can see 35.225.218.206 in the 
middle)

{code}
2017-12-11 09:49:35,101  INFO [TestExecutor] 
LocalCommand.awaitProcessCompletion:67 Finished LocalCommandId=46. 
ElapsedTime(ms)=296543
2017-12-11 09:49:35,101 DEBUG [TestExecutor] PrepPhase.execute:68 Deleting 
/data/hiveptest/working/scratch/source-prep.sh: true
2017-12-11 09:49:35,110  INFO [HostExecutor 3] LocalCommand.:45 Starting 
LocalCommandId=48: ssh -v -i /home/hiveptest/.ssh/hive-ptest-user-key  -l 
hiveptest 104.198.217.87 'pkill -f java'
2017-12-11 09:49:35,111  INFO [HostExecutor 0] LocalCommand.:45 Starting 
LocalCommandId=50: ssh -v -i /home/hiveptest/.ssh/hive-ptest-user-key  -l 
hiveptest 35.184.147.31 'pkill -f java'
2017-12-11 09:49:35,111  INFO [HostExecutor 1] LocalCommand.:45 Starting 
LocalCommandId=49: ssh -v -i /home/hiveptest/.ssh/hive-ptest-user-key  -l 
hiveptest 35.224.37.167 'pkill -f java'
2017-12-11 09:49:35,111  INFO [HostExecutor 2] LocalCommand.:45 Starting 
LocalCommandId=47: ssh -v -i /home/hiveptest/.ssh/hive-ptest-user-key  -l 
hiveptest 35.184.14.247 'pkill -f java'
2017-12-11 09:49:35,112  INFO [HostExecutor 6] LocalCommand.:45 Starting 
LocalCommandId=55: ssh -v -i /home/hiveptest/.ssh/hive-ptest-user-key  -l 
hiveptest 35.184.205.41 'pkill -f java'
2017-12-11 09:49:35,112  INFO [HostExecutor 8] LocalCommand.:45 Starting 
LocalCommandId=54: ssh -v -i /home/hiveptest/.ssh/hive-ptest-user-key  -l 
hiveptest 35.192.223.60 'pkill -f java'
2017-12-11 09:49:35,112  INFO [HostExecutor 7] LocalCommand.:45 Starting 
LocalCommandId=53: ssh -v -i /home/hiveptest/.ssh/hive-ptest-user-key  -l 
hiveptest 35.192.216.79 'pkill -f java'
2017-12-11 09:49:35,113  INFO [HostExecutor 4] LocalCommand.:45 Starting 
LocalCommandId=51: ssh -v -i /home/hiveptest/.ssh/hive-ptest-user-key  -l 
hiveptest 35.225.218.206 'pkill -f java'
2017-12-11 09:49:35,113  INFO [HostExecutor 5] LocalCommand.:45 Starting 
LocalCommandId=52: ssh -v -i /home/hiveptest/.ssh/hive-ptest-user-key  -l 
hiveptest 35.188.119.125 'pkill -f java'
2017-12-11 09:49:35,113  INFO [HostExecutor 9] LocalCommand.:45 Starting 
LocalCommandId=56: ssh -v -i 

[jira] [Commented] (HIVE-18263) Ptest execution are multiple times slower sometimes due to dying executor slaves

2017-12-12 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288151#comment-16288151
 ] 

Hive QA commented on HIVE-18263:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12901711/HIVE-18263.0.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 14 failed/errored test(s), 11499 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.org.apache.hadoop.hive.cli.TestCliDriver
 (batchId=85)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[auto_join25] (batchId=72)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[auto_sortmerge_join_2] 
(batchId=48)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[mapjoin_hook] 
(batchId=12)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[llap_smb] 
(batchId=151)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[hybridgrace_hashjoin_2]
 (batchId=157)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[insert_values_orig_table_use_metadata]
 (batchId=165)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[llap_acid] 
(batchId=169)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[llap_acid_fast]
 (batchId=160)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[sysdb] 
(batchId=160)
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[authorization_part]
 (batchId=93)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[subquery_multi] 
(batchId=113)
org.apache.hadoop.hive.ql.parse.TestReplicationScenarios.testConstraints 
(batchId=226)
org.apache.hive.service.cli.TestEmbeddedThriftBinaryCLIService.testExecuteStatementParallel
 (batchId=230)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/8198/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/8198/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-8198/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 14 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12901711 - PreCommit-HIVE-Build

> Ptest execution are multiple times slower sometimes due to dying executor 
> slaves
> 
>
> Key: HIVE-18263
> URL: https://issues.apache.org/jira/browse/HIVE-18263
> Project: Hive
>  Issue Type: Bug
>  Components: Testing Infrastructure
>Reporter: Adam Szita
>Assignee: Adam Szita
> Attachments: HIVE-18263.0.patch
>
>
> PreCommit-HIVE-Build job has been seen running very long from time to time. 
> Usually it should take about 1.5 hours, but in some cases it took over 4-5 
> hours.
> Looking in the logs of one such execution I've seen that some commands that 
> were sent to test executing slaves returned 255. Here this typically means 
> that there is unknown return code for the remote call since hiveptest-server 
> can't reach these slaves anymore.
> In the hiveptest-server logs it is seen that some slaves were killed while 
> running the job normally, and here is why:
> * Hive's ptest-server checks periodically in every 60 minutes the status of 
> slaves. It also keeps track of slaves that were terminated.
> ** If upon such check it is found that a slave that was already killed 
> ([mTerminatedHosts 
> map|https://github.com/apache/hive/blob/master/testutils/ptest2/src/main/java/org/apache/hive/ptest/execution/context/CloudExecutionContextProvider.java#L93]
>  contains its IP) is still running, it will try and terminate it again.
> * The server also maintains a file on its local FS that contains the IP of 
> hosts that were used before. (This probably for resilience reasons)
> ** This file is read when tomcat server starts and if any of the IPs in the 
> file are seen as running slaves, ptest will terminate these first so it can 
> begin with a fresh start
> ** The IPs of these terminated instances already make their way into 
> {{mTerminatedHosts}} upon initialization...
> * The cloud provider may reuse some older IPs, so it is not too rare that the 
> same IP that belonged to a terminated host is assigned to a new one.
> This is problematic: Hive ptest's slave caretaker thread kicks in every 60 
> minutes and might see a running host that has the same IP as an old slave had 
> which was terminated at startup. It will think that this host should be 
> terminated since it already tried 60 minutes ago as its IP 

[jira] [Commented] (HIVE-18263) Ptest execution are multiple times slower sometimes due to dying executor slaves

2017-12-12 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16288034#comment-16288034
 ] 

Hive QA commented on HIVE-18263:


| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m  
0s{color} | {color:blue} Findbugs executables are not available. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
1s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  6m 
51s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
10s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
10s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
12s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
13s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m  
9s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m  
9s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
 9s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m  
9s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} asflicense {color} | {color:red}  0m 
11s{color} | {color:red} The patch generated 3 ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}  8m 29s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Optional Tests |  asflicense  javac  javadoc  findbugs  checkstyle  compile  |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 
3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /data/hiveptest/working/yetus/dev-support/hive-personality.sh |
| git revision | master / d6ce23d |
| Default Java | 1.8.0_111 |
| asflicense | 
http://104.198.109.242/logs//PreCommit-HIVE-Build-8198/yetus/patch-asflicense-problems.txt
 |
| modules | C: testutils/ptest2 U: testutils/ptest2 |
| Console output | 
http://104.198.109.242/logs//PreCommit-HIVE-Build-8198/yetus.txt |
| Powered by | Apache Yetushttp://yetus.apache.org |


This message was automatically generated.



> Ptest execution are multiple times slower sometimes due to dying executor 
> slaves
> 
>
> Key: HIVE-18263
> URL: https://issues.apache.org/jira/browse/HIVE-18263
> Project: Hive
>  Issue Type: Bug
>  Components: Testing Infrastructure
>Reporter: Adam Szita
>Assignee: Adam Szita
> Attachments: HIVE-18263.0.patch
>
>
> PreCommit-HIVE-Build job has been seen running very long from time to time. 
> Usually it should take about 1.5 hours, but in some cases it took over 4-5 
> hours.
> Looking in the logs of one such execution I've seen that some commands that 
> were sent to test executing slaves returned 255. Here this typically means 
> that there is unknown return code for the remote call since hiveptest-server 
> can't reach these slaves anymore.
> In the hiveptest-server logs it is seen that some slaves were killed while 
> running the job normally, and here is why:
> * Hive's ptest-server checks periodically in every 60 minutes the status of 
> slaves. It also keeps track of slaves that were terminated.
> ** If upon such check it is found that a slave that was already killed 
> ([mTerminatedHosts 
> map|https://github.com/apache/hive/blob/master/testutils/ptest2/src/main/java/org/apache/hive/ptest/execution/context/CloudExecutionContextProvider.java#L93]
>  contains its IP) is still running, it will try and terminate it again.
> * The server also maintains a file on its local FS that contains the IP of 
> hosts that were used before. (This probably for resilience reasons)
> ** This file is read when tomcat server starts and if any of the IPs in the 
> file are seen as running slaves, ptest will 

[jira] [Commented] (HIVE-18263) Ptest execution are multiple times slower sometimes due to dying executor slaves

2017-12-12 Thread Barna Zsombor Klara (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16287976#comment-16287976
 ] 

Barna Zsombor Klara commented on HIVE-18263:


Thank you for the patch Adam and for the detailed analysis. Just one minor 
question:
Any reason we gather all the addresses into one set then iterate over the set 
instead of iterating over the nodes and iterating in a nested manner over the 
addresses to remove them from the failed hosts collection?
Not a big issue, I'm just curious.
Otherwise +1.

> Ptest execution are multiple times slower sometimes due to dying executor 
> slaves
> 
>
> Key: HIVE-18263
> URL: https://issues.apache.org/jira/browse/HIVE-18263
> Project: Hive
>  Issue Type: Bug
>  Components: Testing Infrastructure
>Reporter: Adam Szita
>Assignee: Adam Szita
> Attachments: HIVE-18263.0.patch
>
>
> PreCommit-HIVE-Build job has been seen running very long from time to time. 
> Usually it should take about 1.5 hours, but in some cases it took over 4-5 
> hours.
> Looking in the logs of one such execution I've seen that some commands that 
> were sent to test executing slaves returned 255. Here this typically means 
> that there is unknown return code for the remote call since hiveptest-server 
> can't reach these slaves anymore.
> In the hiveptest-server logs it is seen that some slaves were killed while 
> running the job normally, and here is why:
> * Hive's ptest-server checks periodically in every 60 minutes the status of 
> slaves. It also keeps track of slaves that were terminated.
> ** If upon such check it is found that a slave that was already killed 
> ([mTerminatedHosts 
> map|https://github.com/apache/hive/blob/master/testutils/ptest2/src/main/java/org/apache/hive/ptest/execution/context/CloudExecutionContextProvider.java#L93]
>  contains its IP) is still running, it will try and terminate it again.
> * The server also maintains a file on its local FS that contains the IP of 
> hosts that were used before. (This probably for resilience reasons)
> ** This file is read when tomcat server starts and if any of the IPs in the 
> file are seen as running slaves, ptest will terminate these first so it can 
> begin with a fresh start
> ** The IPs of these terminated instances already make their way into 
> {{mTerminatedHosts}} upon initialization...
> * The cloud provider may reuse some older IPs, so it is not too rare that the 
> same IP that belonged to a terminated host is assigned to a new one.
> This is problematic: Hive ptest's slave caretaker thread kicks in every 60 
> minutes and might see a running host that has the same IP as an old slave had 
> which was terminated at startup. It will think that this host should be 
> terminated since it already tried 60 minutes ago as its IP is in 
> {{mTerminatedHosts}}
> We have to fix this by making sure that if a new slave is created, we check 
> the contents of {{mTerminatedHosts}} and remove this IP from it if it is 
> there.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18263) Ptest execution are multiple times slower sometimes due to dying executor slaves

2017-12-12 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16287929#comment-16287929
 ] 

Ashutosh Chauhan commented on HIVE-18263:
-

HIVE-15102 looks related and had some discussion on the topic.

> Ptest execution are multiple times slower sometimes due to dying executor 
> slaves
> 
>
> Key: HIVE-18263
> URL: https://issues.apache.org/jira/browse/HIVE-18263
> Project: Hive
>  Issue Type: Bug
>  Components: Testing Infrastructure
>Reporter: Adam Szita
>Assignee: Adam Szita
> Attachments: HIVE-18263.0.patch
>
>
> PreCommit-HIVE-Build job has been seen running very long from time to time. 
> Usually it should take about 1.5 hours, but in some cases it took over 4-5 
> hours.
> Looking in the logs of one such execution I've seen that some commands that 
> were sent to test executing slaves returned 255. Here this typically means 
> that there is unknown return code for the remote call since hiveptest-server 
> can't reach these slaves anymore.
> In the hiveptest-server logs it is seen that some slaves were killed while 
> running the job normally, and here is why:
> * Hive's ptest-server checks periodically in every 60 minutes the status of 
> slaves. It also keeps track of slaves that were terminated.
> ** If upon such check it is found that a slave that was already killed 
> ([mTerminatedHosts 
> map|https://github.com/apache/hive/blob/master/testutils/ptest2/src/main/java/org/apache/hive/ptest/execution/context/CloudExecutionContextProvider.java#L93]
>  contains its IP) is still running, it will try and terminate it again.
> * The server also maintains a file on its local FS that contains the IP of 
> hosts that were used before. (This probably for resilience reasons)
> ** This file is read when tomcat server starts and if any of the IPs in the 
> file are seen as running slaves, ptest will terminate these first so it can 
> begin with a fresh start
> ** The IPs of these terminated instances already make their way into 
> {{mTerminatedHosts}} upon initialization...
> * The cloud provider may reuse some older IPs, so it is not too rare that the 
> same IP that belonged to a terminated host is assigned to a new one.
> This is problematic: Hive ptest's slave caretaker thread kicks in every 60 
> minutes and might see a running host that has the same IP as an old slave had 
> which was terminated at startup. It will think that this host should be 
> terminated since it already tried 60 minutes ago as its IP is in 
> {{mTerminatedHosts}}
> We have to fix this by making sure that if a new slave is created, we check 
> the contents of {{mTerminatedHosts}} and remove this IP from it if it is 
> there.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18263) Ptest execution are multiple times slower sometimes due to dying executor slaves

2017-12-12 Thread Andrew Sherman (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16287840#comment-16287840
 ] 

Andrew Sherman commented on HIVE-18263:
---

Thanks [~szita] for working on this important but unglamourous issue

> Ptest execution are multiple times slower sometimes due to dying executor 
> slaves
> 
>
> Key: HIVE-18263
> URL: https://issues.apache.org/jira/browse/HIVE-18263
> Project: Hive
>  Issue Type: Bug
>  Components: Testing Infrastructure
>Reporter: Adam Szita
>Assignee: Adam Szita
> Attachments: HIVE-18263.0.patch
>
>
> PreCommit-HIVE-Build job has been seen running very long from time to time. 
> Usually it should take about 1.5 hours, but in some cases it took over 4-5 
> hours.
> Looking in the logs of one such execution I've seen that some commands that 
> were sent to test executing slaves returned 255. Here this typically means 
> that there is unknown return code for the remote call since hiveptest-server 
> can't reach these slaves anymore.
> In the hiveptest-server logs it is seen that some slaves were killed while 
> running the job normally, and here is why:
> * Hive's ptest-server checks periodically in every 60 minutes the status of 
> slaves. It also keeps track of slaves that were terminated.
> ** If upon such check it is found that a slave that was already killed 
> ([mTerminatedHosts 
> map|https://github.com/apache/hive/blob/master/testutils/ptest2/src/main/java/org/apache/hive/ptest/execution/context/CloudExecutionContextProvider.java#L93]
>  contains its IP) is still running, it will try and terminate it again.
> * The server also maintains a file on its local FS that contains the IP of 
> hosts that were used before. (This probably for resilience reasons)
> ** This file is read when tomcat server starts and if any of the IPs in the 
> file are seen as running slaves, ptest will terminate these first so it can 
> begin with a fresh start
> ** The IPs of these terminated instances already make their way into 
> {{mTerminatedHosts}} upon initialization...
> * The cloud provider may reuse some older IPs, so it is not too rare that the 
> same IP that belonged to a terminated host is assigned to a new one.
> This is problematic: Hive ptest's slave caretaker thread kicks in every 60 
> minutes and might see a running host that has the same IP as an old slave had 
> which was terminated at startup. It will think that this host should be 
> terminated since it already tried 60 minutes ago as its IP is in 
> {{mTerminatedHosts}}
> We have to fix this by making sure that if a new slave is created, we check 
> the contents of {{mTerminatedHosts}} and remove this IP from it if it is 
> there.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18263) Ptest execution are multiple times slower sometimes due to dying executor slaves

2017-12-12 Thread Adam Szita (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16287816#comment-16287816
 ] 

Adam Szita commented on HIVE-18263:
---

[~pvary], [~zsombor.klara] can you take a look please?

> Ptest execution are multiple times slower sometimes due to dying executor 
> slaves
> 
>
> Key: HIVE-18263
> URL: https://issues.apache.org/jira/browse/HIVE-18263
> Project: Hive
>  Issue Type: Bug
>  Components: Testing Infrastructure
>Reporter: Adam Szita
>Assignee: Adam Szita
> Attachments: HIVE-18263.0.patch
>
>
> PreCommit-HIVE-Build job has been seen running very long from time to time. 
> Usually it should take about 1.5 hours, but in some cases it took over 4-5 
> hours.
> Looking in the logs of one such execution I've seen that some commands that 
> were sent to test executing slaves returned 255. Here this typically means 
> that there is unknown return code for the remote call since hiveptest-server 
> can't reach these slaves anymore.
> In the hiveptest-server logs it is seen that some slaves were killed while 
> running the job normally, and here is why:
> * Hive's ptest-server checks periodically in every 60 minutes the status of 
> slaves. It also keeps track of slaves that were terminated.
> ** If upon such check it is found that a slave that was already killed 
> ([mTerminatedHosts 
> map|https://github.com/apache/hive/blob/master/testutils/ptest2/src/main/java/org/apache/hive/ptest/execution/context/CloudExecutionContextProvider.java#L93]
>  contains its IP) is still running, it will try and terminate it again.
> * The server also maintains a file on its local FS that contains the IP of 
> hosts that were used before. (This probably for resilience reasons)
> ** This file is read when tomcat server starts and if any of the IPs in the 
> file are seen as running slaves, ptest will terminate these first so it can 
> begin with a fresh start
> ** The IPs of these terminated instances already make their way into 
> {{mTerminatedHosts}} upon initialization...
> * The cloud provider may reuse some older IPs, so it is not too rare that the 
> same IP that belonged to a terminated host is assigned to a new one.
> This is problematic: Hive ptest's slave caretaker thread kicks in every 60 
> minutes and might see a running host that has the same IP as an old slave had 
> which was terminated at startup. It will think that this host should be 
> terminated since it already tried 60 minutes ago as its IP is in 
> {{mTerminatedHosts}}
> We have to fix this by making sure that if a new slave is created, we check 
> the contents of {{mTerminatedHosts}} and remove this IP from it if it is 
> there.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)