[jira] [Updated] (HIVE-15947) Enhance Templeton service job operations reliability

2017-09-22 Thread Subramanyam Pattipaka (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subramanyam Pattipaka updated HIVE-15947:
-
Description: 
Currently the Templeton service does not restrict the number of job operation 
requests; it simply accepts and tries to run all of them. If many concurrent 
job submit requests arrive, the time to submit a job can increase 
significantly. Templeton uses HDFS to store the staging files for a job, and 
if HDFS cannot keep up with a large number of requests and throttles, job 
submission can take a very long time, on the order of minutes.

This behavior may not be suitable for all applications. Client applications 
may expect a predictable, low response time for successful requests, or a 
throttle response that tells them to wait for some time before re-requesting 
the job operation.

In this JIRA, I am trying to address the following job operations:
1) Submit a new job
2) Get job status
3) List jobs

These three operations have different complexity because they use cluster 
resources such as YARN and HDFS differently.

The idea is to introduce a new config, templeton.parallellism.job.submit, which 
controls the maximum number of concurrent active job submissions within 
Templeton, and to use this config to achieve better response times. If a new 
job submission request arrives while templeton.parallellism.job.submit jobs are 
already being submitted concurrently, the request fails with HTTP error 503 and 
the reason

   “Too many concurrent job submission requests received. Please wait for some 
time before retrying.”
 
The client is expected to catch this response and retry after waiting for some 
time. The default value of templeton.parallellism.job.submit is ‘0’, which 
means job submission requests are always accepted by default; the throttling 
behavior must be enabled explicitly based on requirements. A minimal 
client-side retry sketch is shown below.
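
A minimal client-side retry sketch in Java, assuming a hypothetical submit 
endpoint URL and request body (both placeholders here); it simply backs off 
and retries whenever the server answers with HTTP 503:

  import java.net.URI;
  import java.net.http.HttpClient;
  import java.net.http.HttpRequest;
  import java.net.http.HttpResponse;

  public class TempletonSubmitClient {
      // Sends the submit request, retrying with exponential backoff on HTTP 503.
      public static HttpResponse<String> submitWithRetry(String url, String body,
                                                         int maxRetries) throws Exception {
          HttpClient client = HttpClient.newHttpClient();
          HttpRequest request = HttpRequest.newBuilder(URI.create(url))
              .header("Content-Type", "application/x-www-form-urlencoded")
              .POST(HttpRequest.BodyPublishers.ofString(body))
              .build();
          long backoffMs = 1000;
          for (int attempt = 0; attempt <= maxRetries; attempt++) {
              HttpResponse<String> response =
                  client.send(request, HttpResponse.BodyHandlers.ofString());
              if (response.statusCode() != 503) {
                  return response;              // accepted, or a non-throttle error the caller handles
              }
              Thread.sleep(backoffMs);          // throttled: wait before retrying
              backoffMs = Math.min(backoffMs * 2, 30000);
          }
          throw new IllegalStateException("Gave up after repeated 503 throttle responses");
      }
  }

Any equivalent backoff loop in the client application would do; the important 
part is treating 503 as a retriable throttle signal rather than a failure.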

We can have similar behavior for the status and list operations with the 
configs templeton.parallellism.job.status and templeton.parallellism.job.list 
respectively. An example configuration snippet is shown below.
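
As a rough illustration, the limits might be set in webhcat-site.xml along 
these lines (the values are made-up examples, not recommendations):

  <!-- Example values only; 0 (the default) disables the throttling. -->
  <property>
    <name>templeton.parallellism.job.submit</name>
    <value>20</value>
  </property>
  <property>
    <name>templeton.parallellism.job.status</name>
    <value>50</value>
  </property>
  <property>
    <name>templeton.parallellism.job.list</name>
    <value>50</value>
  </property>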

Once a job operation has started, it can still take a long time to complete, 
and the client that requested it may not be willing to wait indefinitely. This 
work introduces the configurations

templeton.job.submit.timeout
templeton.job.status.timeout
templeton.job.list.timeout

to specify the maximum amount of time a job operation may execute. If a 
timeout occurs, the list and status requests return to the client with the 
message

"List job request got timed out. Please retry the operation after waiting for 
some time."

If a submit job request times out, then 
  i) the job submit request thread that receives the timeout checks whether a 
valid job id has already been generated for the request, and
  ii) if one has been generated, it issues a kill job request on the cancel 
thread pool, does not wait for that operation to complete, and returns to the 
client with the timeout message. A sketch of this flow follows. 
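
A minimal sketch of this flow, using plain java.util.concurrent types and 
hypothetical helper names (JobSubmitOperation, killJob) rather than the actual 
WebHCat classes: a bounded pool with no queue rejects excess submissions 
(which the service would map to HTTP 503), a timed Future.get() enforces the 
submit timeout, and a separate cancel pool issues the best-effort kill.

  import java.util.concurrent.*;
  import java.util.concurrent.atomic.AtomicReference;
  import java.util.function.Consumer;

  public class BoundedJobSubmitter {

      /** The submit operation reports its job id via the callback as soon as one is generated. */
      public interface JobSubmitOperation {
          String run(Consumer<String> jobIdCallback) throws Exception;
      }

      private final ThreadPoolExecutor submitPool;                          // capped concurrent submissions
      private final ExecutorService cancelPool = Executors.newCachedThreadPool();
      private final long submitTimeoutMs;

      public BoundedJobSubmitter(int maxConcurrentSubmits, long submitTimeoutMs) {
          // No queueing: once maxConcurrentSubmits threads are busy, further submissions
          // are rejected, which the service layer would map to an HTTP 503 response.
          this.submitPool = new ThreadPoolExecutor(
              maxConcurrentSubmits, maxConcurrentSubmits, 0L, TimeUnit.MILLISECONDS,
              new SynchronousQueue<>(), new ThreadPoolExecutor.AbortPolicy());
          this.submitTimeoutMs = submitTimeoutMs;
      }

      public String submit(JobSubmitOperation op) throws Exception {
          AtomicReference<String> jobId = new AtomicReference<>();
          Callable<String> task = () -> op.run(jobId::set);
          // Throws RejectedExecutionException when the concurrency cap is reached.
          Future<String> future = submitPool.submit(task);
          try {
              return future.get(submitTimeoutMs, TimeUnit.MILLISECONDS);    // wait up to the timeout
          } catch (TimeoutException e) {
              future.cancel(true);                                          // interrupt the stuck submit thread
              String id = jobId.get();
              if (id != null) {
                  cancelPool.submit(() -> killJob(id));                     // best-effort kill; do not wait
              }
              throw new RuntimeException("Job submit request timed out. Please retry later.", e);
          }
      }

      private void killJob(String jobId) {
          // Placeholder: would issue a kill request for the generated job id.
          System.out.println("Issuing kill for " + jobId);
      }
  }

In the real service the rejection and the timeout would of course be surfaced 
through the WebHCat request-handling path; the sketch only shows the control 
flow.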

Side effects of enabling the timeout for submit operations:
1) The job may remain active for some time after the client receives the 
timeout response, so a list operation from the client could potentially show 
the newly created job before it gets killed.
2) Killing the job is best effort, with no guarantees, so a duplicate job may 
be created. One possible cause is a case where the job is created, the 
operation times out, and the kill request then fails because the resource 
manager is unavailable; when the resource manager restarts, it restarts the 
job that was created.

Fixing this scenario is not in the scope of this JIRA. The job operation 
timeout functionality should be enabled only if the above side effects are 
acceptable.


  was:
Currently Templeton service doesn't restrict number of job operation requests. 
It simply accepts and tries to run all operations. If more number of concurrent 
job submit requests comes then the time to submit job operations can increase 
significantly. Templetonused hdfs to store staging file for job. If HDFS 
storage can't respond to large number of requests and throttles then the job 
submission can take very large times in order of minutes.

This behavior may not be suitable for all applications and client applications  
may be looking for predictable and low response for successful request or send 
throttle response to client to wait for some time before re-requesting job 
operation.

In this JIRA, I am trying to address following job operations 
1) Submit new Job
2) Get Job Status
3) List jobs

These three operations has different complexity due to variance in use of 
cluster resources like YARN/HDFS.

The idea is to introduce a new config templeton.job.submit.exec.max-procs which 
controls maximum number of concurrent active job submissions within Templeton 
and use this config to control better response times. If a new job submission 
request sees that there are already templeton.job.submit.exec.max-procs jobs 
getting 

[jira] [Assigned] (HIVE-16225) Memory leak in Templeton service

2017-03-15 Thread Subramanyam Pattipaka (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-16225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subramanyam Pattipaka reassigned HIVE-16225:


Assignee: Daniel Dai  (was: Subramanyam Pattipaka)

> Memory leak in Templeton service
> 
>
> Key: HIVE-16225
> URL: https://issues.apache.org/jira/browse/HIVE-16225
> Project: Hive
>  Issue Type: Bug
>Reporter: Subramanyam Pattipaka
>Assignee: Daniel Dai
> Attachments: screenshot-1.png
>
>
> This is a known beast; here are the details.
> The problem seems to be similar to the one discussed in HIVE-13749. If we 
> submit a very large number of jobs, say 1000 to 2000, we can see an increase 
> in the Configuration object count.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (HIVE-16226) Make messages in AzureFileSystemThreadPoolExecutor debug

2017-03-15 Thread Subramanyam Pattipaka (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-16226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subramanyam Pattipaka resolved HIVE-16226.
--
Resolution: Invalid

Opened in the wrong project by mistake.

> Make messages in AzureFileSystemThreadPoolExecutor debug
> 
>
> Key: HIVE-16226
> URL: https://issues.apache.org/jira/browse/HIVE-16226
> Project: Hive
>  Issue Type: Bug
>Reporter: Subramanyam Pattipaka
>
> Some of the warn messages are confusing to end users if they disable 
> parallelism. Move them to debug.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (HIVE-16225) Memory leak in Templeton service

2017-03-15 Thread Subramanyam Pattipaka (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-16225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subramanyam Pattipaka reassigned HIVE-16225:


Assignee: Subramanyam Pattipaka

> Memory leak in Templeton service
> 
>
> Key: HIVE-16225
> URL: https://issues.apache.org/jira/browse/HIVE-16225
> Project: Hive
>  Issue Type: Bug
>Reporter: Subramanyam Pattipaka
>Assignee: Subramanyam Pattipaka
> Attachments: screenshot-1.png
>
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HIVE-16225) Memory leak in Templeton service

2017-03-15 Thread Subramanyam Pattipaka (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-16225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subramanyam Pattipaka updated HIVE-16225:
-
Attachment: screenshot-1.png

> Memory leak in Templeton service
> 
>
> Key: HIVE-16225
> URL: https://issues.apache.org/jira/browse/HIVE-16225
> Project: Hive
>  Issue Type: Bug
>Reporter: Subramanyam Pattipaka
> Attachments: screenshot-1.png
>
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HIVE-15947) Enhance Templeton service job operations reliability

2017-03-15 Thread Subramanyam Pattipaka (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subramanyam Pattipaka updated HIVE-15947:
-
Attachment: HIVE-15947.10.patch

Fixed minor review comments.

> Enhance Templeton service job operations reliability
> 
>
> Key: HIVE-15947
> URL: https://issues.apache.org/jira/browse/HIVE-15947
> Project: Hive
>  Issue Type: Bug
>Reporter: Subramanyam Pattipaka
>Assignee: Subramanyam Pattipaka
> Attachments: HIVE-15947.10.patch, HIVE-15947.2.patch, 
> HIVE-15947.3.patch, HIVE-15947.4.patch, HIVE-15947.6.patch, 
> HIVE-15947.7.patch, HIVE-15947.8.patch, HIVE-15947.9.patch, HIVE-15947.patch
>
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HIVE-15947) Enhance Templeton service job operations reliability

2017-03-14 Thread Subramanyam Pattipaka (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subramanyam Pattipaka updated HIVE-15947:
-
Attachment: HIVE-15947.9.patch

Minor code comment fixes.

> Enhance Templeton service job operations reliability
> 
>
> Key: HIVE-15947
> URL: https://issues.apache.org/jira/browse/HIVE-15947
> Project: Hive
>  Issue Type: Bug
>Reporter: Subramanyam Pattipaka
>Assignee: Subramanyam Pattipaka
> Attachments: HIVE-15947.2.patch, HIVE-15947.3.patch, 
> HIVE-15947.4.patch, HIVE-15947.6.patch, HIVE-15947.7.patch, 
> HIVE-15947.8.patch, HIVE-15947.9.patch, HIVE-15947.patch
>
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HIVE-15947) Enhance Templeton service job operations reliability

2017-03-13 Thread Subramanyam Pattipaka (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subramanyam Pattipaka updated HIVE-15947:
-
Attachment: HIVE-15947.8.patch

Made changes to return the "too many requests" status from WebHCat.

> Enhance Templeton service job operations reliability
> 
>
> Key: HIVE-15947
> URL: https://issues.apache.org/jira/browse/HIVE-15947
> Project: Hive
>  Issue Type: Bug
>Reporter: Subramanyam Pattipaka
>Assignee: Subramanyam Pattipaka
> Attachments: HIVE-15947.2.patch, HIVE-15947.3.patch, 
> HIVE-15947.4.patch, HIVE-15947.6.patch, HIVE-15947.7.patch, 
> HIVE-15947.8.patch, HIVE-15947.patch
>
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HIVE-15947) Enhance Templeton service job operations reliability

2017-03-10 Thread Subramanyam Pattipaka (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subramanyam Pattipaka updated HIVE-15947:
-
Attachment: HIVE-15947.7.patch

New patch with minor changes.

> Enhance Templeton service job operations reliability
> 
>
> Key: HIVE-15947
> URL: https://issues.apache.org/jira/browse/HIVE-15947
> Project: Hive
>  Issue Type: Bug
>Reporter: Subramanyam Pattipaka
>Assignee: Subramanyam Pattipaka
> Attachments: HIVE-15947.2.patch, HIVE-15947.3.patch, 
> HIVE-15947.4.patch, HIVE-15947.6.patch, HIVE-15947.7.patch, HIVE-15947.patch
>
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HIVE-15947) Enhance Templeton service job operations reliability

2017-03-10 Thread Subramanyam Pattipaka (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subramanyam Pattipaka updated HIVE-15947:
-
Attachment: HIVE-15947.6.patch

Latest patch after fixing all review comments.

> Enhance Templeton service job operations reliability
> 
>
> Key: HIVE-15947
> URL: https://issues.apache.org/jira/browse/HIVE-15947
> Project: Hive
>  Issue Type: Bug
>Reporter: Subramanyam Pattipaka
>Assignee: Subramanyam Pattipaka
> Attachments: HIVE-15947.2.patch, HIVE-15947.3.patch, 
> HIVE-15947.4.patch, HIVE-15947.6.patch, HIVE-15947.patch
>
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HIVE-15947) Enhance Templeton service job operations reliability

2017-03-03 Thread Subramanyam Pattipaka (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subramanyam Pattipaka updated HIVE-15947:
-
Attachment: HIVE-15947.4.patch

Latest patch with failure scenarios handled gracefully.

> Enhance Templeton service job operations reliability
> 
>
> Key: HIVE-15947
> URL: https://issues.apache.org/jira/browse/HIVE-15947
> Project: Hive
>  Issue Type: Bug
>Reporter: Subramanyam Pattipaka
>Assignee: Subramanyam Pattipaka
> Attachments: HIVE-15947.2.patch, HIVE-15947.3.patch, 
> HIVE-15947.4.patch, HIVE-15947.patch
>
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HIVE-15947) Enhance Templeton service job operations reliability

2017-02-28 Thread Subramanyam Pattipaka (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subramanyam Pattipaka updated HIVE-15947:
-
Description: 
Currently Templeton service doesn't restrict number of job operation requests. 
It simply accepts and tries to run all operations. If more number of concurrent 
job submit requests comes then the time to submit job operations can increase 
significantly. Templetonused hdfs to store staging file for job. If HDFS 
storage can't respond to large number of requests and throttles then the job 
submission can take very large times in order of minutes.

This behavior may not be suitable for all applications and client applications  
may be looking for predictable and low response for successful request or send 
throttle response to client to wait for some time before re-requesting job 
operation.

In this JIRA, I am trying to address following job operations 
1) Submit new Job
2) Get Job Status
3) List jobs

These three operations has different complexity due to variance in use of 
cluster resources like YARN/HDFS.

The idea is to introduce a new config templeton.job.submit.exec.max-procs which 
controls maximum number of concurrent active job submissions within Templeton 
and use this config to control better response times. If a new job submission 
request sees that there are already templeton.job.submit.exec.max-procs jobs 
getting submitted concurrently then the request will fail with Http error 503 
with reason 

   “Too many concurrent job submission requests received. Please wait for some 
time before retrying.”
 
The client is expected to catch this response and retry after waiting for some 
time. The default value for the config templeton.job.submit.exec.max-procs is 
set to ‘0’. This means by default job submission requests are always accepted. 
The behavior needs to be enabled based on requirements.

We can have similar behavior for Status and List operations with configs 
templeton.job.status.exec.max-procs and templeton.list.job.exec.max-procs 
respectively.

Once the job operation is started, the operation can take longer time. The 
client which has requested for job operation may not be waiting for indefinite 
amount of time. This work introduces configurations

templeton.exec.job.submit.timeout
templeton.exec.job.status.timeout
templeton.exec.job.list.timeout

to specify maximum amount of time job operation can execute. If time out 
happens then list and status job requests returns to client with message

"List job request got timed out. Please retry the operation after waiting for 
some time."

If submit job request gets timed out then 
  i) The job submit request thread which receives time out will check if 
valid job id is generated in job request.
  ii) If it is generated then issue kill job request on cancel thread pool. 
Don't wait for operation to complete and returns to client with time out 
message. 

Side effects of enabling time out for submit operations
1) This has a possibility for having active job for some time by the client 
gets response and a list operation from client could potential show the newly 
created job before it gets killed.
2) We do best effort to kill the job and no guarantees. This means there is a 
possibility of duplicate job created. One possible reason for this could be a 
case where job is created and then operation timed out but kill request failed 
due to resource manager unavailability. When resource manager restarts, it will 
restarts the job which got created.

Fixing this scenario is not part of the scope of this JIRA. The job operation 
functionality can be enabled only if above side effects are acceptable.


  was:
Currently Templeton service doesn't restrict number of job operation requests. 
It simply accepts and tries to run all operations. If more number of concurrent 
job submit requests comes then the time to submit job operations can increase 
significantly. Templetonused hdfs to store staging file for job. If HDFS 
storage can't respond to large number of requests and throttles then the job 
submission can take very large times in order of minutes.

This behavior may not be suitable for all applications and client applications  
may be looking for predictable and low response for successful request or send 
throttle response to client to wait for some time before re-requesting job 
operation.

In this JIRA, I am trying to address following job operations 
1) Submit new Job
2) Get Job Status
3) List jobs

These three operations has different complexity due to variance in use of 
cluster resources like YARN/HDFS.

The idea is to introduce a new config templeton.job.submit.exec.max-procs which 
controls maximum number of concurrent active job submissions within Templeton 
and use this config to control better response times. If a new job submission 
request sees that there are already 

[jira] [Commented] (HIVE-15947) Enhance Templeton service job operations reliability

2017-02-28 Thread Subramanyam Pattipaka (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15888794#comment-15888794
 ] 

Subramanyam Pattipaka commented on HIVE-15947:
--

[~thejas], [~daijy], [~kiran.kolli], [~ashitg], you can find the review board 
request at https://reviews.apache.org/r/57159/

Please let me know if you have any comments.

> Enhance Templeton service job operations reliability
> 
>
> Key: HIVE-15947
> URL: https://issues.apache.org/jira/browse/HIVE-15947
> Project: Hive
>  Issue Type: Bug
>Reporter: Subramanyam Pattipaka
>Assignee: Subramanyam Pattipaka
> Attachments: HIVE-15947.2.patch, HIVE-15947.3.patch, HIVE-15947.patch
>
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HIVE-15947) Enhance Templeton service job operations reliability

2017-02-27 Thread Subramanyam Pattipaka (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subramanyam Pattipaka updated HIVE-15947:
-
Attachment: HIVE-15947.3.patch

Changed the implementation to use thread pools instead of a semaphore. Also 
implemented the thread pool timeout functionality.
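
For contrast, a semaphore-based throttle (roughly the approach this comment 
says the earlier patches used) can be sketched as follows, with assumed names; 
it caps concurrency in the same way, but unlike a thread pool with a timed 
Future.get() it offers no natural place to enforce the per-operation timeout:

  import java.util.concurrent.Callable;
  import java.util.concurrent.Semaphore;

  public class SemaphoreThrottle {
      private final Semaphore slots;

      public SemaphoreThrottle(int maxConcurrent) {
          this.slots = new Semaphore(maxConcurrent);
      }

      public String runThrottled(Callable<String> op) throws Exception {
          // Non-blocking acquire: if no slot is free, reject immediately instead of queueing.
          if (!slots.tryAcquire()) {
              throw new IllegalStateException("Too many concurrent requests"); // mapped to HTTP 503
          }
          try {
              return op.call();
          } finally {
              slots.release();
          }
      }
  }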

> Enhance Templeton service job operations reliability
> 
>
> Key: HIVE-15947
> URL: https://issues.apache.org/jira/browse/HIVE-15947
> Project: Hive
>  Issue Type: Bug
>Reporter: Subramanyam Pattipaka
>Assignee: Subramanyam Pattipaka
> Attachments: HIVE-15947.2.patch, HIVE-15947.3.patch, HIVE-15947.patch
>
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-15947) Enhance Templeton service job operations reliability

2017-02-21 Thread Subramanyam Pattipaka (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15877500#comment-15877500
 ] 

Subramanyam Pattipaka commented on HIVE-15947:
--

[~kiran.kolli], I have fixed the comments you provided. 

[~thejas], can you please provide comments if you have any? I have added unit 
tests for threads getting killed and interrupted, and tried to simulate 
threads getting killed using shutdownNow(). Is there a better way to simulate 
kill-thread behavior for a WebHCat request and verify this behavior?

> Enhance Templeton service job operations reliability
> 
>
> Key: HIVE-15947
> URL: https://issues.apache.org/jira/browse/HIVE-15947
> Project: Hive
>  Issue Type: Bug
>Reporter: Subramanyam Pattipaka
>Assignee: Subramanyam Pattipaka
> Attachments: HIVE-15947.2.patch, HIVE-15947.patch
>
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HIVE-15947) Enhance Templeton service job operations reliability

2017-02-21 Thread Subramanyam Pattipaka (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subramanyam Pattipaka updated HIVE-15947:
-
Attachment: HIVE-15947.2.patch

Incorporated review comments.

> Enhance Templeton service job operations reliability
> 
>
> Key: HIVE-15947
> URL: https://issues.apache.org/jira/browse/HIVE-15947
> Project: Hive
>  Issue Type: Bug
>Reporter: Subramanyam Pattipaka
>Assignee: Subramanyam Pattipaka
> Attachments: HIVE-15947.2.patch, HIVE-15947.patch
>
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-15947) Enhance Templeton service job operations reliability

2017-02-16 Thread Subramanyam Pattipaka (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870954#comment-15870954
 ] 

Subramanyam Pattipaka commented on HIVE-15947:
--

[~thejas], please review these changes and let me know if you have any comments.

cc: [~ashitg], [~kiran.kolli]

> Enhance Templeton service job operations reliability
> 
>
> Key: HIVE-15947
> URL: https://issues.apache.org/jira/browse/HIVE-15947
> Project: Hive
>  Issue Type: Bug
>Reporter: Subramanyam Pattipaka
>Assignee: Subramanyam Pattipaka
> Attachments: HIVE-15947.patch
>
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HIVE-15947) Enhance Templeton service job operations reliability

2017-02-16 Thread Subramanyam Pattipaka (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subramanyam Pattipaka updated HIVE-15947:
-
Status: Patch Available  (was: Open)

> Enhance Templeton service job operations reliability
> 
>
> Key: HIVE-15947
> URL: https://issues.apache.org/jira/browse/HIVE-15947
> Project: Hive
>  Issue Type: Bug
>Reporter: Subramanyam Pattipaka
>Assignee: Subramanyam Pattipaka
> Attachments: HIVE-15947.patch
>
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HIVE-15947) Enhance Templeton service job operations reliability

2017-02-16 Thread Subramanyam Pattipaka (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subramanyam Pattipaka updated HIVE-15947:
-
Attachment: HIVE-15947.patch

Attaching a patch with the changes. Introduced the configs and verified that 
the changes work fine on a real cluster with 400 job submit requests, where the 
clients also keep issuing requests until those jobs are completed. Also added 
unit tests to verify the behavior of concurrent job requests.
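For reference, a sketch of the retry loop a client of this behavior might use 
when it receives the 503 throttle response; the endpoint URL, HTTP method, and 
backoff values below are placeholders, not something specified by the patch:

{noformat}
// Hypothetical client-side retry loop for a 503 (throttled) response.
// The Templeton endpoint URL, HTTP method, and backoff schedule are placeholders.
import java.net.HttpURLConnection;
import java.net.URL;

public class SubmitWithRetry {
  public static int submit(String templetonUrl, int maxRetries) throws Exception {
    for (int attempt = 0; attempt <= maxRetries; attempt++) {
      HttpURLConnection conn =
          (HttpURLConnection) new URL(templetonUrl).openConnection();
      conn.setRequestMethod("POST");
      int code = conn.getResponseCode();
      if (code != 503) {
        return code;                       // accepted (or failed for another reason)
      }
      conn.disconnect();
      Thread.sleep(1000L * (attempt + 1)); // back off before retrying the submission
    }
    return 503;                            // still throttled after all retries
  }
}
{noformat}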

> Enhance Templeton service job operations reliability
> 
>
> Key: HIVE-15947
> URL: https://issues.apache.org/jira/browse/HIVE-15947
> Project: Hive
>  Issue Type: Bug
>Reporter: Subramanyam Pattipaka
>Assignee: Subramanyam Pattipaka
> Attachments: HIVE-15947.patch
>
>
> Currently the Templeton service doesn't restrict the number of job operation 
> requests. It simply accepts and tries to run all operations. If a large 
> number of concurrent job submit requests comes in, the time to submit job 
> operations can increase significantly. Templeton uses HDFS to store the 
> staging files for jobs. If HDFS can't keep up with a large number of requests 
> and throttles, then job submission can take a very long time, on the order of 
> minutes.
> This behavior may not be suitable for all applications; client applications 
> may be looking for a predictable, low response time for successful requests, 
> or for a throttle response telling the client to wait for some time before 
> re-requesting the job operation.
> In this JIRA, I am trying to address the following job operations: 
> 1) Submit new Job
> 2) Get Job Status
> 3) List jobs
> These three operations have different complexity due to the variance in their 
> use of cluster resources like YARN and HDFS.
> The idea is to introduce a new config, templeton.job.submit.exec.max-procs, 
> which controls the maximum number of concurrent active job submissions within 
> Templeton, and to use this config to keep response times bounded. If a new 
> job submission request sees that templeton.job.submit.exec.max-procs jobs are 
> already being submitted concurrently, then the request fails with HTTP error 
> 503 and the reason 
> “Too many concurrent job submission requests received. Please wait for 
> some time before retrying.”
>  
> The client is expected to catch this response and retry after waiting for 
> some time. The default value for the config 
> templeton.job.submit.exec.max-procs is ‘0’, which means job submission 
> requests are always accepted by default; the throttling behavior has to be 
> enabled explicitly where required.
> We can have similar behavior for the Status and List operations with the 
> configs templeton.job.status.exec.max-procs and 
> templeton.list.job.exec.max-procs respectively.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (HIVE-15947) Enhance Templeton service job operations reliability

2017-02-16 Thread Subramanyam Pattipaka (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subramanyam Pattipaka reassigned HIVE-15947:



> Enhance Templeton service job operations reliability
> 
>
> Key: HIVE-15947
> URL: https://issues.apache.org/jira/browse/HIVE-15947
> Project: Hive
>  Issue Type: Bug
>Reporter: Subramanyam Pattipaka
>Assignee: Subramanyam Pattipaka
>
> Currently the Templeton service doesn't restrict the number of job operation 
> requests. It simply accepts and tries to run all operations. If a large 
> number of concurrent job submit requests comes in, the time to submit job 
> operations can increase significantly. Templeton uses HDFS to store the 
> staging files for jobs. If HDFS can't keep up with a large number of requests 
> and throttles, then job submission can take a very long time, on the order of 
> minutes.
> This behavior may not be suitable for all applications; client applications 
> may be looking for a predictable, low response time for successful requests, 
> or for a throttle response telling the client to wait for some time before 
> re-requesting the job operation.
> In this JIRA, I am trying to address the following job operations: 
> 1) Submit new Job
> 2) Get Job Status
> 3) List jobs
> These three operations have different complexity due to the variance in their 
> use of cluster resources like YARN and HDFS.
> The idea is to introduce a new config, templeton.job.submit.exec.max-procs, 
> which controls the maximum number of concurrent active job submissions within 
> Templeton, and to use this config to keep response times bounded. If a new 
> job submission request sees that templeton.job.submit.exec.max-procs jobs are 
> already being submitted concurrently, then the request fails with HTTP error 
> 503 and the reason 
> “Too many concurrent job submission requests received. Please wait for 
> some time before retrying.”
>  
> The client is expected to catch this response and retry after waiting for 
> some time. The default value for the config 
> templeton.job.submit.exec.max-procs is ‘0’, which means job submission 
> requests are always accepted by default; the throttling behavior has to be 
> enabled explicitly where required.
> We can have similar behavior for the Status and List operations with the 
> configs templeton.job.status.exec.max-procs and 
> templeton.list.job.exec.max-procs respectively.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-15803) msck can hang when nested partitions are present

2017-02-06 Thread Subramanyam Pattipaka (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-15803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15855321#comment-15855321
 ] 

Subramanyam Pattipaka commented on HIVE-15803:
--

[~rajesh.balamohan] and [~ashutoshc], instead of simply passing null to the 
recursive call, can we maintain an atomic counter and pass null only once we 
see that the number of in-flight tasks exceeds the number of threads in the 
pool? That way we make the best use of the threads.
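Roughly what that suggestion could look like (a sketch only, with hypothetical 
class and method names; this is not the HIVE-15803 patch):

{noformat}
// Rough sketch of the suggestion above (not the actual patch): keep an atomic
// count of in-flight pool tasks and fall back to inline recursion (pool == null)
// once the pool is saturated, instead of always recursing without the pool.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.atomic.AtomicInteger;

public class DirChecker {
  private final AtomicInteger inFlight = new AtomicInteger(0);
  private final int poolSize;

  public DirChecker(int poolSize) {
    this.poolSize = poolSize;
  }

  void checkDir(final String path, final ExecutorService pool) {
    for (final String child : listChildren(path)) {
      if (pool != null && inFlight.incrementAndGet() <= poolSize) {
        pool.submit(new Runnable() {
          @Override public void run() {
            try {
              checkDir(child, pool);       // still room in the pool: go parallel
            } finally {
              inFlight.decrementAndGet();
            }
          }
        });
      } else {
        if (pool != null) {
          inFlight.decrementAndGet();      // undo the optimistic increment
        }
        checkDir(child, null);             // pool saturated: recurse on this thread
      }
    }
  }

  private String[] listChildren(String path) {
    return new String[0];                  // placeholder for a FileSystem listing
  }
}
{noformat}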

> msck can hang when nested partitions are present
> 
>
> Key: HIVE-15803
> URL: https://issues.apache.org/jira/browse/HIVE-15803
> Project: Hive
>  Issue Type: Bug
>  Components: Metastore
>Reporter: Rajesh Balamohan
>Assignee: Rajesh Balamohan
>Priority: Minor
>
> Steps to reproduce. 
> {noformat}
> CREATE TABLE `repairtable`( `col` string) PARTITIONED BY (  `p1` string,  
> `p2` string);
> hive> dfs -mkdir -p /apps/hive/warehouse/test.db/repairtable/p1=c/p2=a/p3=b;
> hive> dfs -touchz 
> /apps/hive/warehouse/test.db/repairtable/p1=c/p2=a/p3=b/datafile;
> hive> set hive.mv.files.thread;
> hive.mv.files.thread=15
> hive> set hive.mv.files.thread=1;
> hive> MSCK TABLE repairtable;
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HIVE-15807) MSCK operations hangs in HiveMetaStoreChecker.checkPartitionDirs

2017-02-03 Thread Subramanyam Pattipaka (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subramanyam Pattipaka updated HIVE-15807:
-
Attachment: msck-jstack.txt

> MSCK operations hangs in HiveMetaStoreChecker.checkPartitionDirs
> 
>
> Key: HIVE-15807
> URL: https://issues.apache.org/jira/browse/HIVE-15807
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 2.2.0
>Reporter: Subramanyam Pattipaka
>Assignee: Pengcheng Xiong
> Fix For: 2.2.0
>
> Attachments: msck-jstack.txt
>
>
> This seems to be a regression from HIVE-14511. The operation hung in 
> checkPartitionDirs. The data has 3 levels of partitions (month, date, id) 
> with a total of 800 partitions.
> An example path would look like month=9/day=30/id=12.
> The default value for the hive config hive.mv.files.thread was set to 128. I 
> have attached the jstack of the hive process used to run the msck command.
> checkPartitionDirs is implemented as a recursive function that uses the same 
> pool to submit worker threads. It seems the thread pool ran out of threads to 
> do the actual work and all threads appear to be waiting, hung. Please take a 
> look at the stack and confirm whether this is the case here.
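A minimal, self-contained illustration of that hang pattern (not Hive code): 
recursive tasks that submit their children to the same fixed-size pool and then 
block on the results stall as soon as every pool thread is itself waiting:

{noformat}
// Minimal illustration (not Hive code) of the hang described above: every pool
// thread submits child work to the same fixed pool and then blocks waiting for
// it, so once all threads are waiting nothing can make progress.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PoolDeadlockDemo {
  static ExecutorService pool = Executors.newFixedThreadPool(2);

  static int walk(final int depth) throws Exception {
    if (depth == 0) {
      return 1;
    }
    List<Future<Integer>> children = new ArrayList<Future<Integer>>();
    for (int i = 0; i < 4; i++) {
      children.add(pool.submit(new Callable<Integer>() {
        @Override public Integer call() throws Exception {
          return walk(depth - 1);          // child task on the same pool
        }
      }));
    }
    int sum = 0;
    for (Future<Integer> f : children) {
      sum += f.get();                      // parent blocks here holding a pool thread
    }
    return sum;
  }

  public static void main(String[] args) throws Exception {
    System.out.println(walk(3));           // with a small pool this never returns
  }
}
{noformat}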



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (HIVE-15807) MSCK operation hangs in HiveMetaStoreChecker.checkPartitionDirs

2017-02-03 Thread Subramanyam Pattipaka (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subramanyam Pattipaka updated HIVE-15807:
-
Summary: MSCK operation hangs in HiveMetaStoreChecker.checkPartitionDirs  
(was: MSCK operations hangs in HiveMetaStoreChecker.checkPartitionDirs)

> MSCK operation hangs in HiveMetaStoreChecker.checkPartitionDirs
> ---
>
> Key: HIVE-15807
> URL: https://issues.apache.org/jira/browse/HIVE-15807
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 2.2.0
>Reporter: Subramanyam Pattipaka
>Assignee: Pengcheng Xiong
> Fix For: 2.2.0
>
> Attachments: msck-jstack.txt
>
>
> This seems to be a regression from HIVE-14511. The operation hung in 
> checkPartitionDirs. The data has 3 levels of partitions (month, date, id) 
> with a total of 800 partitions.
> An example path would look like month=9/day=30/id=12.
> The default value for the hive config hive.mv.files.thread was set to 128. I 
> have attached the jstack of the hive process used to run the msck command.
> checkPartitionDirs is implemented as a recursive function that uses the same 
> pool to submit worker threads. It seems the thread pool ran out of threads to 
> do the actual work and all threads appear to be waiting, hung. Please take a 
> look at the stack and confirm whether this is the case here.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (HIVE-15807) MSCK operations hangs in HiveMetaStoreChecker.checkPartitionDirs

2017-02-03 Thread Subramanyam Pattipaka (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-15807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Subramanyam Pattipaka reassigned HIVE-15807:



> MSCK operations hangs in HiveMetaStoreChecker.checkPartitionDirs
> 
>
> Key: HIVE-15807
> URL: https://issues.apache.org/jira/browse/HIVE-15807
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 2.2.0
>Reporter: Subramanyam Pattipaka
>Assignee: Pengcheng Xiong
> Fix For: 2.2.0
>
>
> This seems to be a regression from HIVE-14511. The operation hung in 
> checkPartitionDirs. The data has 3 levels of partitions (month, date, id) 
> with a total of 800 partitions.
> An example path would look like month=9/day=30/id=12.
> The default value for the hive config hive.mv.files.thread was set to 128. I 
> have attached the jstack of the hive process used to run the msck command.
> checkPartitionDirs is implemented as a recursive function that uses the same 
> pool to submit worker threads. It seems the thread pool ran out of threads to 
> do the actual work and all threads appear to be waiting, hung. Please take a 
> look at the stack and confirm whether this is the case here.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-14511) Improve MSCK for partitioned table to deal with special cases

2016-08-15 Thread Subramanyam Pattipaka (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421970#comment-15421970
 ] 

Subramanyam Pattipaka commented on HIVE-14511:
--

[~pxiong], can you please make the following extra changes (a rough sketch of 
both checks follows below):

1. Check whether the configs mapred.input.dir.recursive and 
hive.mapred.supports.subdirectories are enabled, and if so ignore directories 
found after you reach a depth equal to the number of partition columns.
2. If you find any files at unexpected locations, then please check for a 
config (I can't remember the config name) and produce an error for each such 
path and move ahead. Otherwise, fail the operation.
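A rough sketch of both checks, with hypothetical class and flag names (the two 
recursive-input config names come from the comment above; the 
ignore-unexpected-files config name is a placeholder, since the exact name 
isn't given):

{noformat}
// Hypothetical sketch of the two checks asked for above. The recursive-input
// config names come from the comment; "msck.ignore.unexpected.files" is only a
// placeholder for the ignore config whose real name isn't recalled.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;

public class PartitionDepthPolicy {
  private final boolean recursiveOk;
  private final boolean ignoreUnexpected;

  public PartitionDepthPolicy(Configuration conf) {
    this.recursiveOk =
        conf.getBoolean("mapred.input.dir.recursive", false)
            && conf.getBoolean("hive.mapred.supports.subdirectories", false);
    this.ignoreUnexpected =
        conf.getBoolean("msck.ignore.unexpected.files", false);  // placeholder name
  }

  /** Returns true if the entry can be skipped, false if it is a normal entry. */
  public boolean isIgnorable(FileStatus status, int depthFromTableRoot, int numPartCols) {
    if (status.isDirectory() && depthFromTableRoot > numPartCols) {
      // Directory below the last partition level: only acceptable when the
      // recursive-input configs are enabled.
      if (recursiveOk) {
        return true;
      }
      throw new IllegalStateException("Unexpected directory " + status.getPath());
    }
    if (status.isFile() && depthFromTableRoot <= numPartCols) {
      // A data file above the last partition level is at an unexpected location.
      if (ignoreUnexpected) {
        System.err.println("Ignoring unexpected file " + status.getPath());
        return true;
      }
      throw new IllegalStateException("Unexpected file " + status.getPath());
    }
    return false;  // a partition directory, or a data file at the expected depth
  }
}
{noformat}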

> Improve MSCK for partitioned table to deal with special cases
> -
>
> Key: HIVE-14511
> URL: https://issues.apache.org/jira/browse/HIVE-14511
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Pengcheng Xiong
>Assignee: Pengcheng Xiong
> Attachments: HIVE-14511.01.patch
>
>
> Some users will have a folder rather than a file under the last partition 
> folder. However, msck is going to search for the leaf folder rather than the 
> last partition folder. We need to improve that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (HIVE-14511) Improve MSCK for partitioned table to deal with special cases

2016-08-15 Thread Subramanyam Pattipaka (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421763#comment-15421763
 ] 

Subramanyam Pattipaka edited comment on HIVE-14511 at 8/15/16 10:07 PM:


[~sershe], even if we introduce another command to be flexible enough to cater 
to this scenario, what if the user's data changes in terms of directory 
structure? Why does the user have to recreate all tables again? Why not make 
repair table itself flexible (with this patch), so that when the configs 
mapred.input.dir.recursive and hive.mapred.supports.subdirectories are enabled 
it adds the relevant partitions? Further, having two commands may be confusing. 

I don't mean to add a file like a=1/00_0 here. I mean only to ignore such files 
and list them in the error log if a config is enabled, so that users can act on 
them. Error is better than debug; this way, all configurations would surface 
these details. For example, if we have the following files

tbldir/a=1/file1.txt
tbldir/a=2/b=1/file2.txt
tbldir/a=2/b=1/c=1/file3.txt

and we are trying to create a partitioned table with partitions on a and b with 
root directory tbldir, 

the ERROR log would say it is ignoring file tbldir/a=1/file1.txt due to 
incorrect structure if the ignore config is set. Otherwise, the operation fails.

We add only one partition, with values (2, 1).

msck is still strict, and the ask here is to support the configs 
mapred.input.dir.recursive and hive.mapred.supports.subdirectories.



was (Author: pattipaka):
[~sershe], even if we introduce another command to be flexible enough to cater 
to this scenario, what if the user's data changes in terms of directory 
structure? Why does the user have to recreate all tables again? Why not make 
repair table itself flexible (with this patch), so that when the configs 
mapred.input.dir.recursive and hive.mapred.supports.subdirectories are enabled 
it adds the relevant partitions? Further, having two commands may be confusing. 

I don't mean to add a file like a=1/00_0 here. I mean only to ignore such files 
and list them in the error log if a config is enabled, so that users can act on 
them. Error is better than debug; this way, all configurations would surface 
these details. For example, if we have the following files

tbldir/a=1/file1.txt
tbldir/a=2/b=1/file2.txt

and we are trying to create a partitioned table with partitions on a and b with 
root directory tbldir, 

the ERROR log would say it is ignoring file tbldir/a=1/file1.txt due to 
incorrect structure if the ignore config is set. Otherwise, the operation fails.

We add only one partition, with values (2, 1).

msck is still strict, and the ask here is to support the configs 
mapred.input.dir.recursive and hive.mapred.supports.subdirectories.


> Improve MSCK for partitioned table to deal with special cases
> -
>
> Key: HIVE-14511
> URL: https://issues.apache.org/jira/browse/HIVE-14511
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Pengcheng Xiong
>Assignee: Pengcheng Xiong
> Attachments: HIVE-14511.01.patch
>
>
> Some users will have a folder rather than a file under the last partition 
> folder. However, msck is going to search for the leaf folder rather than the 
> last partition folder. We need to improve that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-14511) Improve MSCK for partitioned table to deal with special cases

2016-08-15 Thread Subramanyam Pattipaka (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421763#comment-15421763
 ] 

Subramanyam Pattipaka commented on HIVE-14511:
--

[~sershe], even if we introduce another command to be flexible enough to cater 
to this scenario, what if the user's data changes in terms of directory 
structure? Why does the user have to recreate all tables again? Why not make 
repair table itself flexible (with this patch), so that when the configs 
mapred.input.dir.recursive and hive.mapred.supports.subdirectories are enabled 
it adds the relevant partitions? Further, having two commands may be confusing. 

I don't mean to add a file like a=1/00_0 here. I mean only to ignore such files 
and list them in the error log if a config is enabled, so that users can act on 
them. Error is better than debug; this way, all configurations would surface 
these details. For example, if we have the following files

tbldir/a=1/file1.txt
tbldir/a=2/b=1/file2.txt

and we are trying to create a partitioned table with partitions on a and b with 
root directory tbldir, 

the ERROR log would say it is ignoring file tbldir/a=1/file1.txt due to 
incorrect structure if the ignore config is set. Otherwise, the operation fails.

We add only one partition, with values (2, 1).

msck is still strict, and the ask here is to support the configs 
mapred.input.dir.recursive and hive.mapred.supports.subdirectories.


> Improve MSCK for partitioned table to deal with special cases
> -
>
> Key: HIVE-14511
> URL: https://issues.apache.org/jira/browse/HIVE-14511
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Pengcheng Xiong
>Assignee: Pengcheng Xiong
> Attachments: HIVE-14511.01.patch
>
>
> Some users will have a folder rather than a file under the last partition 
> folder. However, msck is going to search for the leaf folder rather than the 
> last partition folder. We need to improve that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-14511) Improve MSCK for partitioned table to deal with special cases

2016-08-15 Thread Subramanyam Pattipaka (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421694#comment-15421694
 ] 

Subramanyam Pattipaka commented on HIVE-14511:
--

Yes, that's correct. We should also check that no files exist above the 
required depth. Maybe that's what you wanted here? For example, if files like

tbldir/file1
tbldir/p1=1/file2

exist, then partition creation should fail. If the ignore config option is set, 
then we should probably move ahead, ignoring these files. But please log them 
(at least under debug mode) so that they can be collected; the user may want to 
act on the whole list at once instead of deleting one file at a time and 
rerunning msck.

> Improve MSCK for partitioned table to deal with special cases
> -
>
> Key: HIVE-14511
> URL: https://issues.apache.org/jira/browse/HIVE-14511
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Pengcheng Xiong
>Assignee: Pengcheng Xiong
> Attachments: HIVE-14511.01.patch
>
>
> Some users will have a folder rather than a file under the last partition 
> folder. However, msck is going to search for the leaf folder rather than the 
> last partition folder. We need to improve that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-14511) Improve MSCK for partitioned table to deal with special cases

2016-08-15 Thread Subramanyam Pattipaka (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15421268#comment-15421268
 ] 

Subramanyam Pattipaka commented on HIVE-14511:
--

I mean to add partitions only at the p1/p2 level (e.g. p1=1/p2=1). For example, 
if you have the following structure

data/p1=1/p2=1/p3=1
data/p1=1/p2=1/p3=2
data/p1=1/p2=1/p3=3
data/p1=1/p2=2/p3=1
data/p1=1/p2=2/p3=2
data/p1=2/p2=1/p3=1

then I want to add only (1,1), (1,2) and (2,1) as partitions. If you remove the 
above check then this is possible.

In the first iteration you would list 

p1=1
p1=2

and in the next iteration you would list 

/p1=1/p2=1
/p1=1/p2=2
/p1=2/p2=1

As the remaining depth is 0 we stop here, and these are the partition paths if 
the user wants to partition on p1 and p2. If you want, you can additionally 
check whether the configs mapred.input.dir.recursive and 
hive.mapred.supports.subdirectories are in use.
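A sketch of that level-by-level listing as a hypothetical helper (not the 
HIVE-14511 patch):

{noformat}
// Sketch of the level-by-level listing described above (hypothetical helper,
// not the HIVE-14511 patch): list one level per partition column and treat the
// directories found at the final level as the partition paths.
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PartitionLister {
  public static List<Path> listPartitionDirs(FileSystem fs, Path tableRoot, int numPartCols)
      throws Exception {
    List<Path> current = new ArrayList<Path>();
    current.add(tableRoot);
    for (int level = 0; level < numPartCols; level++) {
      List<Path> next = new ArrayList<Path>();
      for (Path dir : current) {
        for (FileStatus child : fs.listStatus(dir)) {
          if (child.isDirectory()) {
            next.add(child.getPath());     // e.g. p1=1, then p1=1/p2=1, ...
          }
        }
      }
      current = next;                      // descend one partition level
    }
    return current;                        // directories at depth == numPartCols
  }
}
{noformat}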

> Improve MSCK for partitioned table to deal with special cases
> -
>
> Key: HIVE-14511
> URL: https://issues.apache.org/jira/browse/HIVE-14511
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Pengcheng Xiong
>Assignee: Pengcheng Xiong
> Attachments: HIVE-14511.01.patch
>
>
> Some users will have a folder rather than a file under the last partition 
> folder. However, msck is going to search for the leaf folder rather than the 
> last partition folder. We need to improve that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-14511) Improve MSCK for partitioned table to deal with special cases

2016-08-11 Thread Subramanyam Pattipaka (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417738#comment-15417738
 ] 

Subramanyam Pattipaka commented on HIVE-14511:
--

[~pxiong], as your change stops at a depth equal to the number of partition 
columns, your current code has a bug at

if (!directoryFound && maxDepth == 0) {

This again assumes that you don't have a directory at maxDepth. You are 
terminating your search here anyway, so any path you find at this level 
qualifies as a partition. I think you should remove the !directoryFound check. 
Same at the other similar locations.
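In other words, a simplified, hypothetical version of the terminating-depth 
check (not the actual HiveMetaStoreChecker code):

{noformat}
// Simplified, hypothetical version of the check discussed above (not the actual
// HiveMetaStoreChecker code). Since the search terminates once maxDepth reaches 0,
// every path seen at that depth qualifies, so the !directoryFound guard is dropped.
import java.util.Set;

public class DepthCheck {
  static void maybeAddPartition(String path, int maxDepth, Set<String> candidates) {
    if (maxDepth == 0) {           // was: if (!directoryFound && maxDepth == 0)
      candidates.add(path);        // terminating depth: treat the path as a partition
    }
  }
}
{noformat}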

> Improve MSCK for partitioned table to deal with special cases
> -
>
> Key: HIVE-14511
> URL: https://issues.apache.org/jira/browse/HIVE-14511
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Pengcheng Xiong
>Assignee: Pengcheng Xiong
> Attachments: HIVE-14511.01.patch
>
>
> Some users will have a folder rather than a file under the last partition 
> folder. However, msck is going to search for the leaf folder rather than the 
> last partition folder. We need to improve that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-14511) Improve MSCK for partitioned table to deal with special cases

2016-08-11 Thread Subramanyam Pattipaka (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417732#comment-15417732
 ] 

Subramanyam Pattipaka commented on HIVE-14511:
--

[~sershe], some users have their large data in a structure of the form 
data/partlevel1=0/partlevel2=0/partlevel3=0/partlevel4=0/.../partleveln=0/file1

Given this structure, with the configs mapred.input.dir.recursive and 
hive.mapred.supports.subdirectories set to true, the expectation is that we can 
create partitions at any level and query the data. 

Users may generate data with various tools in mind. Asking them to reorganize 
the data and create a copy just for Hive puts a hurdle in the way of trying out 
Hive, since the data can be very large and copying it may not always be 
possible.

This fix ensures that we add the appropriate partitions for the above case when 
the user tries to create partitions at any number of levels.

> Improve MSCK for partitioned table to deal with special cases
> -
>
> Key: HIVE-14511
> URL: https://issues.apache.org/jira/browse/HIVE-14511
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Pengcheng Xiong
>Assignee: Pengcheng Xiong
> Attachments: HIVE-14511.01.patch
>
>
> Some users will have a folder rather than a file under the last partition 
> folder. However, msck is going to search for the leaf folder rather than the 
> last partition folder. We need to improve that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)