[jira] [Created] (YARN-4299) Distcp fails even if ignoreFailures option is set

2015-10-26 Thread Prabhu Joseph (JIRA)
Prabhu Joseph created YARN-4299:
---

 Summary: Distcp fails even if ignoreFailures option is set
 Key: YARN-4299
 URL: https://issues.apache.org/jira/browse/YARN-4299
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.7.0
Reporter: Prabhu Joseph


hadoop distcp fails even when the ignoreFailures option is set with -i.

When an IOException is thrown from RetriableFileCopyCommand, the handleFailures 
method in CopyMapper does not honor ignoreFailures. The current check is:

if (ignoreFailures && exception.getCause() instanceof 
RetriableFileCopyCommand.CopyReadException)

An OR should be used above, checking the exception itself as well as its cause.

There is also one more bug: when an IOException is wrapped in a 
CopyReadException, exception.getCause() is still the IOException, so the 
instanceof check on the cause never matches.
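A minimal sketch of the suggested condition, using the names from the existing 
CopyMapper code but simplified and not taken from an actual patch:

{code}
// Honor -i when either the thrown exception or its cause is a CopyReadException.
boolean readFailure =
    exception instanceof RetriableFileCopyCommand.CopyReadException
    || exception.getCause() instanceof RetriableFileCopyCommand.CopyReadException;
if (ignoreFailures && readFailure) {
  // count the failure and continue with the remaining files
} else {
  throw exception;
}
{code}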







[jira] [Commented] (YARN-4256) YARN fair scheduler vcores with decimal values

2015-10-15 Thread Prabhu Joseph (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14960049#comment-14960049
 ] 

Prabhu Joseph commented on YARN-4256:
-

Thanks Jun Gong. 

> YARN fair scheduler vcores with decimal values
> --
>
> Key: YARN-4256
> URL: https://issues.apache.org/jira/browse/YARN-4256
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.7.1
>Reporter: Prabhu Joseph
>Assignee: Jun Gong
>Priority: Minor
> Fix For: 2.7.2
>
> Attachments: YARN-4256.001.patch
>
>
> When the queue with vcores is in decimal value, the value after the decimal 
> point is taken as vcores by FairScheduler.
> For the below queue,
> 2 mb,20 vcores,20.25 disks
> 3 mb,40.2 vcores,30.25 disks
> When many applications submitted  parallely into queue, all were in PENDING 
> state as the vcores is taken as 2 skipping the value 40.
> The code FairSchedulerConfiguration.java to Pattern match the vcores has to 
> be improved in such a way either throw 
> AllocationConfigurationException("Missing resource") or consider the value 
> before decimal.





[jira] [Created] (YARN-4256) YARN fair scheduler vcores with decimal values

2015-10-12 Thread Prabhu Joseph (JIRA)
Prabhu Joseph created YARN-4256:
---

 Summary: YARN fair scheduler vcores with decimal values
 Key: YARN-4256
 URL: https://issues.apache.org/jira/browse/YARN-4256
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 2.7.1
Reporter: Prabhu Joseph
Priority: Critical
 Fix For: 2.7.2


When a queue's vcores is configured with a decimal value, FairScheduler takes 
only the digits after the decimal point as the vcores.

For the below queues,

2 mb,20 vcores,20.25 disks
3 mb,40.2 vcores,30.25 disks

when many applications were submitted in parallel to the queue, all of them 
stayed in PENDING state because the vcores was parsed as 2 instead of 40.

The vcores pattern matching in FairSchedulerConfiguration.java has to be 
improved to either throw AllocationConfigurationException("Missing resource") 
or take the value before the decimal point.





[jira] [Updated] (YARN-4256) YARN fair scheduler vcores with decimal values

2015-10-12 Thread Prabhu Joseph (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-4256:

Priority: Minor  (was: Critical)

> YARN fair scheduler vcores with decimal values
> --
>
> Key: YARN-4256
> URL: https://issues.apache.org/jira/browse/YARN-4256
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 2.7.1
>Reporter: Prabhu Joseph
>Priority: Minor
> Fix For: 2.7.2
>
>
> When the queue with vcores is in decimal value, the value after the decimal 
> point is taken as vcores by FairScheduler.
> For the below queue,
> 2 mb,20 vcores,20.25 disks
> 3 mb,40.2 vcores,30.25 disks
> When many applications submitted  parallely into queue, all were in PENDING 
> state as the vcores is taken as 2 skipping the value 40.
> The code FairSchedulerConfiguration.java to Pattern match the vcores has to 
> be improved in such a way either throw 
> AllocationConfigurationException("Missing resource") or consider the value 
> before decimal.





[jira] [Created] (YARN-4437) JobEndNotification info logs are missing in AM container syslog

2015-12-09 Thread Prabhu Joseph (JIRA)
Prabhu Joseph created YARN-4437:
---

 Summary: JobEndNotification info logs are missing in AM container 
syslog
 Key: YARN-4437
 URL: https://issues.apache.org/jira/browse/YARN-4437
 Project: Hadoop YARN
  Issue Type: Bug
  Components: api
Affects Versions: 2.7.0
Reporter: Prabhu Joseph
Priority: Minor


The JobEndNotification info logs from the MRAppMaster and JobEndNotifier 
classes are not written even though the Log.info calls are present. The reason 
is that MRAppMaster.this.stop() is called before the job-end notification is 
sent, and somewhere during the stop the log appenders are set to null.

The AM container syslog is therefore missing the below log lines from 
JobEndNotifier:

   Job end notification trying + urlToNotify
   Job end notification to + urlToNotify + succeeded / failed
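A rough sketch of the intended ordering (simplified, not the actual MRAppMaster 
code; it only illustrates that the notification must run while the log 
appenders are still alive):

{code}
// Send the job-end notification first, then stop the AM services.
JobEndNotifier notifier = new JobEndNotifier();
notifier.setConf(getConfig());
notifier.notify(job.getReport());   // emits the "Job end notification ..." info lines
MRAppMaster.this.stop();            // stopping first is what loses those lines
{code}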





[jira] [Commented] (YARN-4469) yarn application -status should not show a stack trace for an unknown application ID

2015-12-21 Thread Prabhu Joseph (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15066359#comment-15066359
 ] 

Prabhu Joseph commented on YARN-4469:
-

[~templedf]  This issue was already fixed in 2.7.0 as part of YARN-2356.

> yarn application -status should not show a stack trace for an unknown 
> application ID
> 
>
> Key: YARN-4469
> URL: https://issues.apache.org/jira/browse/YARN-4469
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: client
>Affects Versions: 2.7.1
>Reporter: Daniel Templeton
>Assignee: Daniel Templeton
>
> For example:
> {noformat}
> # yarn application -status application_1234567890_12345
> Exception in thread "main" 
> org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application 
> with id 'application_1234567890_12345' doesn't exist in RM.
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:324)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:170)
>   at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:401)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1060)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
>   at 
> org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:101)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplicationReport(ApplicationClientProtocolPBClientImpl.java:190)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
>   at com.sun.proxy.$Proxy12.getApplicationReport(Unknown Source)
>   at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationReport(YarnClientImpl.java:399)
>   at 
> org.apache.hadoop.yarn.client.cli.ApplicationCLI.printApplicationReport(ApplicationCLI.java:429)
>   at 
> org.apache.hadoop.yarn.client.cli.ApplicationCLI.run(ApplicationCLI.java:154)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
>   at 
> org.apache.hadoop.yarn.client.cli.ApplicationCLI.main(ApplicationCLI.java:77)
> Caused by: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException):
>  Application with id 'application_1234567890_12345' doesn't exist in RM.
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:324)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:170)
>   at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:401)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1060)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
>   at 

[jira] [Comment Edited] (YARN-5295) YARN queue-mappings to check Queue is present before submitting job

2016-06-27 Thread Prabhu Joseph (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15351347#comment-15351347
 ] 

Prabhu Joseph edited comment on YARN-5295 at 6/27/16 4:23 PM:
--

Hi [~sunilg], yes, if the test queue is present, the application submitted by 
the test user is placed into the test queue. But if the test queue is not 
present, or it is not a leaf queue, or the test user has neither the 
Submit_Applications nor the Administer_Queue ACL, then the application is 
rejected. Instead, getMappedQueue in CapacityScheduler could do the three 
sanity checks and return a valid queue, that is, platform instead of test 
(assuming the test user passes the sanity checks on the platform queue).

Currently the sanity checks are done separately, after the target queue has 
already been decided. Instead they can be included in the getMappedQueue logic: 
once a queue mapping is chosen from the list, the sanity checks are run, and if 
they fail, the next queue mapping in the list is tried.




was (Author: prabhu joseph):
[~sunilg] Yes, if test queue is present, the application submitted by test user 
placed into test queue. But if test queue is not present or if test queue is 
not a leaf queue or if test user does not have either Submit_Applications or 
Administer_Queue ACL, then the application is rejected. Instead, the 
getMappedQueue in CapacityScheduler can do the three sanity checks and return a 
valid queue that is platform instead of test. (Assuming test user passes the 
sanity checks on platform Queue)

Currently the sanity checks are done separately after deciding the queue to be 
placed, instead sanity checks can be included in getMappedQueue logic, where 
once queue mapping is chosen from the list, the sanity checks can be done and 
if it fails, then move to the next queue mapping in the list.



> YARN queue-mappings to check Queue is present before submitting job
> ---
>
> Key: YARN-5295
> URL: https://issues.apache.org/jira/browse/YARN-5295
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.7.2
>Reporter: Prabhu Joseph
>
> In yarn Queue-Mappings, Yarn should check if the queue is present before 
> submitting the job. If not present it should go to next mapping available.
> For example if we have
> yarn.scheduler.capacity.queue-mappings=u:%user:%user,g:edw:platform
> and I submit job with user "test" and if there is no "test" queue then it 
> should check the second mapping (g:edw:platform) in the list and if test is 
> part of edw group it should submit job in platform queue.
> Below Sanity checks has to be done for the mapped queue in the list and if it 
> fails then the the next queue mapping has to be chosen, when there is no 
> queue mapping passing the sanity check, only then the application has to be 
> Rejected.
> 1. is queue present
> 2. is queue not a leaf queue
> 3. is user either have ACL Submit_Applications or Administer_Queue of the 
> queue.






[jira] [Commented] (YARN-5295) YARN queue-mappings to check Queue is present before submitting job

2016-06-27 Thread Prabhu Joseph (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15351450#comment-15351450
 ] 

Prabhu Joseph commented on YARN-5295:
-

Yes, doing sanity checks 1 and 2 earlier, inside getMappedQueue, is sufficient 
to let administrators configure a default queue for any user or group when no 
valid queue mapping matches. For example, with this fix, any newly added user 
who does not have a queue created with the same user name can still be placed 
in the default queue through the queue-mapping list 
u:%user:%user,u:%user:default

> YARN queue-mappings to check Queue is present before submitting job
> ---
>
> Key: YARN-5295
> URL: https://issues.apache.org/jira/browse/YARN-5295
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.7.2
>Reporter: Prabhu Joseph
>
> In yarn Queue-Mappings, Yarn should check if the queue is present before 
> submitting the job. If not present it should go to next mapping available.
> For example if we have
> yarn.scheduler.capacity.queue-mappings=u:%user:%user,g:edw:platform
> and I submit job with user "test" and if there is no "test" queue then it 
> should check the second mapping (g:edw:platform) in the list and if test is 
> part of edw group it should submit job in platform queue.
> Below Sanity checks has to be done for the mapped queue in the list and if it 
> fails then the the next queue mapping has to be chosen, when there is no 
> queue mapping passing the sanity check, only then the application has to be 
> Rejected.
> 1. is queue present
> 2. is queue not a leaf queue
> 3. is user either have ACL Submit_Applications or Administer_Queue of the 
> queue.






[jira] [Commented] (YARN-5295) YARN queue-mappings to check Queue is present before submitting job

2016-06-27 Thread Prabhu Joseph (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15351347#comment-15351347
 ] 

Prabhu Joseph commented on YARN-5295:
-

[~sunilg] Yes, if the test queue is present, the application submitted by the 
test user is placed into the test queue. But if the test queue is not present, 
or it is not a leaf queue, or the test user has neither the Submit_Applications 
nor the Administer_Queue ACL, then the application is rejected. Instead, 
getMappedQueue in CapacityScheduler could do the three sanity checks and return 
a valid queue, that is, platform instead of test (assuming the test user passes 
the sanity checks on the platform queue).

Currently the sanity checks are done separately, after the target queue has 
already been decided. Instead they can be included in the getMappedQueue logic: 
once a queue mapping is chosen from the list, the sanity checks are run, and if 
they fail, the next queue mapping in the list is tried.
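A rough sketch of that loop (helper names such as getMappedQueueName and 
userUgi are hypothetical; this is not the actual CapacityScheduler code):

{code}
// Walk the mapping list and return the first mapped queue that passes the sanity checks.
for (QueueMapping mapping : mappings) {
  String mappedQueueName = getMappedQueueName(mapping, user);   // resolve %user etc.
  CSQueue queue = getQueue(mappedQueueName);
  if (queue == null || !(queue instanceof LeafQueue)) {
    continue;                                   // checks 1 and 2: present and a leaf queue
  }
  if (!queue.hasAccess(QueueACL.SUBMIT_APPLICATIONS, userUgi)
      && !queue.hasAccess(QueueACL.ADMINISTER_QUEUE, userUgi)) {
    continue;                                   // check 3: submit/admin ACL on the queue
  }
  return mappedQueueName;
}
// only when no mapping passes the checks is the application rejected
{code}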



> YARN queue-mappings to check Queue is present before submitting job
> ---
>
> Key: YARN-5295
> URL: https://issues.apache.org/jira/browse/YARN-5295
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.7.2
>Reporter: Prabhu Joseph
>
> In yarn Queue-Mappings, Yarn should check if the queue is present before 
> submitting the job. If not present it should go to next mapping available.
> For example if we have
> yarn.scheduler.capacity.queue-mappings=u:%user:%user,g:edw:platform
> and I submit job with user "test" and if there is no "test" queue then it 
> should check the second mapping (g:edw:platform) in the list and if test is 
> part of edw group it should submit job in platform queue.
> Below Sanity checks has to be done for the mapped queue in the list and if it 
> fails then the the next queue mapping has to be chosen, when there is no 
> queue mapping passing the sanity check, only then the application has to be 
> Rejected.
> 1. is queue present
> 2. is queue not a leaf queue
> 3. is user either have ACL Submit_Applications or Administer_Queue of the 
> queue.






[jira] [Comment Edited] (YARN-5295) YARN queue-mappings to check Queue is present before submitting job

2016-06-27 Thread Prabhu Joseph (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15351347#comment-15351347
 ] 

Prabhu Joseph edited comment on YARN-5295 at 6/27/16 4:25 PM:
--

Hi [~sunilg], yes, if the test queue is present, the application submitted by 
the test user is placed into the test queue. But if the test queue is not 
present, or it is not a leaf queue, or the test user has neither the 
Submit_Applications nor the Administer_Queue ACL, then the application is 
rejected. Instead, getMappedQueue in CapacityScheduler could do the three 
sanity checks up front and return a valid queue, that is, platform instead of 
test (assuming the test user passes the sanity checks on the platform queue).

Currently the sanity checks are done separately, after the target queue has 
already been decided. Instead they can be included in the getMappedQueue logic: 
once a queue mapping is chosen from the list, the sanity checks are run, and if 
they fail, the next queue mapping in the list is tried.




was (Author: prabhu joseph):
Hi [~sunilg], Yes, if test queue is present, the application submitted by test 
user placed into test queue. But if test queue is not present or if test queue 
is not a leaf queue or if test user does not have either Submit_Applications or 
Administer_Queue ACL, then the application is rejected. Instead, the 
getMappedQueue in CapacityScheduler can do the three sanity checks and return a 
valid queue that is platform instead of test. (Assuming test user passes the 
sanity checks on platform Queue)

Currently the sanity checks are done separately after deciding the queue to be 
placed, instead sanity checks can be included in getMappedQueue logic, where 
once queue mapping is chosen from the list, the sanity checks can be done and 
if it fails, then move to the next queue mapping in the list.



> YARN queue-mappings to check Queue is present before submitting job
> ---
>
> Key: YARN-5295
> URL: https://issues.apache.org/jira/browse/YARN-5295
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.7.2
>Reporter: Prabhu Joseph
>
> In yarn Queue-Mappings, Yarn should check if the queue is present before 
> submitting the job. If not present it should go to next mapping available.
> For example if we have
> yarn.scheduler.capacity.queue-mappings=u:%user:%user,g:edw:platform
> and I submit job with user "test" and if there is no "test" queue then it 
> should check the second mapping (g:edw:platform) in the list and if test is 
> part of edw group it should submit job in platform queue.
> Below Sanity checks has to be done for the mapped queue in the list and if it 
> fails then the the next queue mapping has to be chosen, when there is no 
> queue mapping passing the sanity check, only then the application has to be 
> Rejected.
> 1. is queue present
> 2. is queue not a leaf queue
> 3. is user either have ACL Submit_Applications or Administer_Queue of the 
> queue.






[jira] [Created] (YARN-5295) YARN queue-mappings to check Queue is present before submitting job

2016-06-25 Thread Prabhu Joseph (JIRA)
Prabhu Joseph created YARN-5295:
---

 Summary: YARN queue-mappings to check Queue is present before 
submitting job
 Key: YARN-5295
 URL: https://issues.apache.org/jira/browse/YARN-5295
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacity scheduler
Affects Versions: 2.7.2
Reporter: Prabhu Joseph




In YARN queue mappings, YARN should check whether the mapped queue is present 
before submitting the job. If it is not present, it should fall through to the 
next available mapping.

For example, with
yarn.scheduler.capacity.queue-mappings=u:%user:%user,g:edw:platform
if I submit a job as user "test" and there is no "test" queue, the second 
mapping in the list (g:edw:platform) should be checked, and if test is part of 
the edw group the job should be submitted to the platform queue.
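For reference, the mapping list above as it would appear in 
capacity-scheduler.xml (values taken from this example):

{code}
<property>
  <name>yarn.scheduler.capacity.queue-mappings</name>
  <value>u:%user:%user,g:edw:platform</value>
</property>
{code}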







[jira] [Updated] (YARN-5295) YARN queue-mappings to check Queue is present before submitting job

2016-06-25 Thread Prabhu Joseph (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-5295:

Description: 
In YARN queue mappings, YARN should check whether the mapped queue is present 
before submitting the job. If it is not present, it should fall through to the 
next available mapping.

For example, with
yarn.scheduler.capacity.queue-mappings=u:%user:%user,g:edw:platform
if I submit a job as user "test" and there is no "test" queue, the second 
mapping in the list (g:edw:platform) should be checked, and if test is part of 
the edw group the job should be submitted to the platform queue.

The below sanity checks have to be done for the mapped queue in the list; if a 
check fails, the next queue mapping has to be chosen. Only when no queue 
mapping passes the sanity checks should the application be rejected.

1. Is the queue present?
2. Is the queue a leaf queue?
3. Does the user have either the Submit_Applications or Administer_Queue ACL on the queue?



  was:
In yarn Queue-Mappings, Yarn should check if the queue is present before 
submitting the job. If not present it should go to next mapping available.

For example if we have
yarn.scheduler.capacity.queue-mappings=u:%user:%user,g:edw:platform
and I submit job with user "test" and if there is no "test" queue then it 
should check the second mapping (g:edw:platform) in the list and if test is 
part of edw group it should submit job in platform queue.

Below Sanity Checks has to be done for the mapped queue in the list and if it 
fails then the the next queue mapping has to be chosen, when there is no queue 
mapping passing the sanity check, only then the application has to be Rejected.

1. is queue present
2. is queue not a leaf queue
3. is user either have ACL Submit_Applications or Administer_Queue of the queue.




> YARN queue-mappings to check Queue is present before submitting job
> ---
>
> Key: YARN-5295
> URL: https://issues.apache.org/jira/browse/YARN-5295
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.7.2
>Reporter: Prabhu Joseph
>
> In yarn Queue-Mappings, Yarn should check if the queue is present before 
> submitting the job. If not present it should go to next mapping available.
> For example if we have
> yarn.scheduler.capacity.queue-mappings=u:%user:%user,g:edw:platform
> and I submit job with user "test" and if there is no "test" queue then it 
> should check the second mapping (g:edw:platform) in the list and if test is 
> part of edw group it should submit job in platform queue.
> Below Sanity checks has to be done for the mapped queue in the list and if it 
> fails then the the next queue mapping has to be chosen, when there is no 
> queue mapping passing the sanity check, only then the application has to be 
> Rejected.
> 1. is queue present
> 2. is queue not a leaf queue
> 3. is user either have ACL Submit_Applications or Administer_Queue of the 
> queue.






[jira] [Updated] (YARN-5295) YARN queue-mappings to check Queue is present before submitting job

2016-06-25 Thread Prabhu Joseph (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-5295:

Description: 
In YARN queue mappings, YARN should check whether the mapped queue is present 
before submitting the job. If it is not present, it should fall through to the 
next available mapping.

For example, with
yarn.scheduler.capacity.queue-mappings=u:%user:%user,g:edw:platform
if I submit a job as user "test" and there is no "test" queue, the second 
mapping in the list (g:edw:platform) should be checked, and if test is part of 
the edw group the job should be submitted to the platform queue.

The below sanity checks have to be done for the mapped queue in the list; if a 
check fails, the next queue mapping has to be chosen. Only when no queue 
mapping passes the sanity checks should the application be rejected.

1. Is the queue present?
2. Is the queue a leaf queue?
3. Does the user have either the Submit_Applications or Administer_Queue ACL on the queue.



  was:


In yarn Queue-Mappings, Yarn should check if the queue is present before 
submitting the job. If not present it should go to next mapping available.

For example if we have
yarn.scheduler.capacity.queue-mappings=u:%user:%user,g:edw:platform
and I submit job with user "test" and if there is no "test" queue then it 
should check the second mapping (g:edw:platform) in the list and if test is 
part of edw group it should submit job in platform queue.



> YARN queue-mappings to check Queue is present before submitting job
> ---
>
> Key: YARN-5295
> URL: https://issues.apache.org/jira/browse/YARN-5295
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.7.2
>Reporter: Prabhu Joseph
>
> In yarn Queue-Mappings, Yarn should check if the queue is present before 
> submitting the job. If not present it should go to next mapping available.
> For example if we have
> yarn.scheduler.capacity.queue-mappings=u:%user:%user,g:edw:platform
> and I submit job with user "test" and if there is no "test" queue then it 
> should check the second mapping (g:edw:platform) in the list and if test is 
> part of edw group it should submit job in platform queue.
> Below Sanity Checks has to be done for the mapped queue in the list and if it 
> fails then the the next queue mapping has to be chosen, when there is no 
> queue mapping passing the sanity check, only then the application has to be 
> Rejected.
> 1. is queue present
> 2. is queue not a leaf queue
> 3. is user either have ACL Submit_Applications or Administer_Queue of the 
> queue.






[jira] [Commented] (YARN-4682) AMRM client to log when AMRM token updated

2016-02-10 Thread Prabhu Joseph (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15141162#comment-15141162
 ] 

Prabhu Joseph commented on YARN-4682:
-

[~ste...@apache.org]

Steve, do I need to check out branch-2?

The "No AMRMToken" issue happened on hadoop-2.4.1, so as you mentioned, the 
fixes for YARN-3103 and YARN-2212 are missing there. I am testing with the 
YARN-3103 fix: the AMRMToken gets updated every 
yarn.resourcemanager.am-rm-tokens.master-key-rolling-interval-secs. How do I 
decrease the lifetime of a token? I am trying to simulate the issue again.
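One way to make the token roll faster in a test cluster is to shrink the 
master-key rolling interval (illustrative yarn-site.xml snippet; the value of 
3600 is an assumption for testing, the default is 86400 seconds):

{code}
<property>
  <name>yarn.resourcemanager.am-rm-tokens.master-key-rolling-interval-secs</name>
  <value>3600</value>
</property>
{code}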

> AMRM client to log when AMRM token updated
> --
>
> Key: YARN-4682
> URL: https://issues.apache.org/jira/browse/YARN-4682
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: client
>Affects Versions: 2.7.2
>Reporter: Steve Loughran
> Attachments: YARN-4682.patch
>
>   Original Estimate: 0.25h
>  Remaining Estimate: 0.25h
>
> There's no information right now as to when the AMRM token gets updated; if 
> something has gone wrong with the update, you can't tell when it last when 
> through.
> fix: add a log statement.





[jira] [Updated] (YARN-4682) AMRM client to log when AMRM token updated

2016-02-10 Thread Prabhu Joseph (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-4682:

Attachment: YARN-4682.patch.1

> AMRM client to log when AMRM token updated
> --
>
> Key: YARN-4682
> URL: https://issues.apache.org/jira/browse/YARN-4682
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: client
>Affects Versions: 2.7.2
>Reporter: Steve Loughran
> Attachments: YARN-4682.patch, YARN-4682.patch.1
>
>   Original Estimate: 0.25h
>  Remaining Estimate: 0.25h
>
> There's no information right now as to when the AMRM token gets updated; if 
> something has gone wrong with the update, you can't tell when it last when 
> through.
> fix: add a log statement.





[jira] [Commented] (YARN-4682) AMRM client to log when AMRM token updated

2016-02-10 Thread Prabhu Joseph (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15142301#comment-15142301
 ] 

Prabhu Joseph commented on YARN-4682:
-

git checkout branch-2
git diff > YARN-4682.patch.1
But I am not seeing any difference from the previous patch.

> AMRM client to log when AMRM token updated
> --
>
> Key: YARN-4682
> URL: https://issues.apache.org/jira/browse/YARN-4682
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: client
>Affects Versions: 2.7.2
>Reporter: Steve Loughran
> Attachments: YARN-4682.patch, YARN-4682.patch.1
>
>   Original Estimate: 0.25h
>  Remaining Estimate: 0.25h
>
> There's no information right now as to when the AMRM token gets updated; if 
> something has gone wrong with the update, you can't tell when it last when 
> through.
> fix: add a log statement.





[jira] [Commented] (YARN-4682) AMRM client to log when AMRM token updated

2016-02-12 Thread Prabhu Joseph (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15144490#comment-15144490
 ] 

Prabhu Joseph commented on YARN-4682:
-

Thanks Steve

> AMRM client to log when AMRM token updated
> --
>
> Key: YARN-4682
> URL: https://issues.apache.org/jira/browse/YARN-4682
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: client
>Affects Versions: 2.7.2
>Reporter: Steve Loughran
> Attachments: YARN-4682-002.patch, YARN-4682.patch, YARN-4682.patch.1
>
>   Original Estimate: 0.25h
>  Remaining Estimate: 0.25h
>
> There's no information right now as to when the AMRM token gets updated; if 
> something has gone wrong with the update, you can't tell when it last when 
> through.
> fix: add a log statement.





[jira] [Updated] (YARN-4682) AMRM client to log when AMRM token updated

2016-02-10 Thread Prabhu Joseph (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-4682:

Attachment: YARN-4682.patch

> AMRM client to log when AMRM token updated
> --
>
> Key: YARN-4682
> URL: https://issues.apache.org/jira/browse/YARN-4682
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: client
>Affects Versions: 2.7.2
>Reporter: Steve Loughran
> Attachments: YARN-4682.patch
>
>   Original Estimate: 0.25h
>  Remaining Estimate: 0.25h
>
> There's no information right now as to when the AMRM token gets updated; if 
> something has gone wrong with the update, you can't tell when it last when 
> through.
> fix: add a log statement.





[jira] [Commented] (YARN-4682) AMRM client to log when AMRM token updated

2016-02-10 Thread Prabhu Joseph (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15140720#comment-15140720
 ] 

Prabhu Joseph commented on YARN-4682:
-

[~ste...@apache.org]  Added an info log.
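Illustrative only (the real change is in the attached patch): the idea is a 
single info log in AMRMClientImpl when the token from the allocate response is 
applied, so the time of the last update is visible in the AM log.

{code}
if (allocateResponse.getAMRMToken() != null) {
  LOG.info("Received new AMRMToken from ResourceManager, updating credentials");
  updateAMRMToken(allocateResponse.getAMRMToken());
}
{code}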

> AMRM client to log when AMRM token updated
> --
>
> Key: YARN-4682
> URL: https://issues.apache.org/jira/browse/YARN-4682
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: client
>Affects Versions: 2.7.2
>Reporter: Steve Loughran
> Attachments: YARN-4682.patch
>
>   Original Estimate: 0.25h
>  Remaining Estimate: 0.25h
>
> There's no information right now as to when the AMRM token gets updated; if 
> something has gone wrong with the update, you can't tell when it last when 
> through.
> fix: add a log statement.





[jira] [Created] (YARN-4730) YARN preemption based on instantaneous fair share

2016-02-23 Thread Prabhu Joseph (JIRA)
Prabhu Joseph created YARN-4730:
---

 Summary: YARN preemption based on instantaneous fair share
 Key: YARN-4730
 URL: https://issues.apache.org/jira/browse/YARN-4730
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Prabhu Joseph


On a big cluster with a total cluster resource of 10 TB and 3000 cores, a Fair 
Scheduler with 230 queues, and a total of 6 jobs run a day [all 230 queues are 
very critical, hence minResources is the same for all], when a Spark job runs 
in queue A, occupies the entire cluster resource and does not release any of 
it, and another job is submitted into queue B, preemption reclaims only the 
fair share, which is <10TB, 3000> / 230 = <45 GB, 13 cores>. That is a very 
small fair share for a queue shared by many applications.

Preemption should instead reclaim the instantaneous fair share, that is 
<10TB, 3000> / 2 (active queues) = 5 TB and 1500 cores, so that the first job 
cannot hog the entire cluster resource and the subsequent jobs also run fine.

This issue arises only when the number of queues is very high. With fewer 
queues, preempting up to the fair share would suffice because the fair share 
itself is high. But with too many queues, preemption should try to reclaim the 
instantaneous fair share.

Note: configuring optimal maxResources for 230 queues is difficult, and 
constraining the queues with maxResources would leave cluster resources idle 
most of the time. There are thousands of Spark jobs, so asking each user to 
restrict the number of executors is also difficult.

Preempting up to the instantaneous fair share would overcome the above issues.

  










[jira] [Resolved] (YARN-4730) YARN preemption based on instantaneous fair share

2016-02-24 Thread Prabhu Joseph (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph resolved YARN-4730.
-
Resolution: Duplicate

YARN-2026

> YARN preemption based on instantaneous fair share
> -
>
> Key: YARN-4730
> URL: https://issues.apache.org/jira/browse/YARN-4730
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Reporter: Prabhu Joseph
>
> On a big cluster with Total Cluster Resource of 10TB, 3000 cores and Fair 
> Sheduler having 230 queues and total 6 jobs run a day. [ all 230 queues 
> are very critical and hence the minResource is same for all]. On this case, 
> when a Spark Job is run on queue A and which occupies the entire cluster 
> resource and does not release any resource, another job submitted into queue 
> B and preemption is getting only the Fair Share which is <10TB , 3000> / 230 
> = <45 GB , 13 cores> which is very less fair share for a queue.shared by many 
> applications. 
> The Preemption should get the instantaneous fair Share, that is <10TB, 3000> 
> / 2 (active queues) = 5TB and 1500 cores, so that the first job won't hog the 
> entire cluster resource and also the subsequent jobs run fine.
> This issue is only when the number of queues are very high. In case of less 
> number of queues, Preemption getting Fair Share would be suffice as the fair 
> share will be high. But in case of too many number of queues, Preemption 
> should try to get the instantaneous Fair Share.
> Note: Configuring optimal maxResources to 230 queues is difficult and also 
> putting constraint for the queues using maxResource will leave  cluster 
> resource idle most of the time.
> There are 1000s of Spark Jobs, so asking each user to restrict the 
> number of executors is also difficult.
> Preempting Instantaneous Fair Share will help to overcome the above issues.
>   





[jira] [Commented] (YARN-5295) YARN queue-mappings to check Queue is present before submitting job

2016-07-01 Thread Prabhu Joseph (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15358899#comment-15358899
 ] 

Prabhu Joseph commented on YARN-5295:
-

Hi [~sunilg], in addition, we also need to include a snippet like the one 
below in UserGroupMappingPlacementRule#getMappedQueue before returning the 
mapped queue, so that it returns a valid queue that is an existing leaf queue 
(rough sketch; the queue lookup is simplified):

{code}
for (QueueMapping mapping : mappings) {
  // resolve the queue for this mapping first (simplified lookup for the sketch)
  CSQueue queue = getQueue(getMappedQueueName(mapping, user));
  if (queue == null || !(queue instanceof LeafQueue)) {
    continue;   // not present or not a leaf queue: try the next mapping
  }
  return queue.getQueueName();
}
{code}

> YARN queue-mappings to check Queue is present before submitting job
> ---
>
> Key: YARN-5295
> URL: https://issues.apache.org/jira/browse/YARN-5295
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.7.2
>Reporter: Prabhu Joseph
>
> In yarn Queue-Mappings, Yarn should check if the queue is present before 
> submitting the job. If not present it should go to next mapping available.
> For example if we have
> yarn.scheduler.capacity.queue-mappings=u:%user:%user,g:edw:platform
> and I submit job with user "test" and if there is no "test" queue then it 
> should check the second mapping (g:edw:platform) in the list and if test is 
> part of edw group it should submit job in platform queue.
> Below Sanity checks has to be done for the mapped queue in the list and if it 
> fails then the the next queue mapping has to be chosen, when there is no 
> queue mapping passing the sanity check, only then the application has to be 
> Rejected.
> 1. is queue present
> 2. is queue not a leaf queue
> 3. is user either have ACL Submit_Applications or Administer_Queue of the 
> queue.






[jira] [Created] (YARN-5933) ATS stale entries in active directory causes ApplicationNotFoundException in RM

2016-11-23 Thread Prabhu Joseph (JIRA)
Prabhu Joseph created YARN-5933:
---

 Summary: ATS stale entries in active directory causes 
ApplicationNotFoundException in RM
 Key: YARN-5933
 URL: https://issues.apache.org/jira/browse/YARN-5933
 Project: Hadoop YARN
  Issue Type: Bug
  Components: ATSv2
Affects Versions: 2.7.3
Reporter: Prabhu Joseph


On a secure cluster where ATS is down, a submitted Tez job fails while getting 
the TIMELINE_DELEGATION_TOKEN with the below exception

{code}
0: jdbc:hive2://kerberos-2.openstacklocal:100> select csmallint from 
alltypesorc group by csmallint;
INFO  : Session is already open
INFO  : Dag name: select csmallint from alltypesor...csmallint(Stage-1)
INFO  : Tez session was closed. Reopening...
ERROR : Failed to execute tez graph.
java.lang.RuntimeException: Failed to connect to timeline server. Connection 
retries limit exceeded. The posted timeline event may be missing
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:266)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.operateDelegationToken(TimelineClientImpl.java:590)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.getDelegationToken(TimelineClientImpl.java:506)
at 
org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getTimelineDelegationToken(YarnClientImpl.java:349)
at 
org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.addTimelineDelegationToken(YarnClientImpl.java:330)
at 
org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.submitApplication(YarnClientImpl.java:250)
at 
org.apache.tez.client.TezYarnClient.submitApplication(TezYarnClient.java:72)
at org.apache.tez.client.TezClient.start(TezClient.java:409)
at 
org.apache.hadoop.hive.ql.exec.tez.TezSessionState.open(TezSessionState.java:196)
at 
org.apache.hadoop.hive.ql.exec.tez.TezSessionPoolManager.closeAndOpen(TezSessionPoolManager.java:311)
at org.apache.hadoop.hive.ql.exec.tez.TezTask.submit(TezTask.java:453)
at org.apache.hadoop.hive.ql.exec.tez.TezTask.execute(TezTask.java:180)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160)
at 
org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:89)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1728)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1485)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1262)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1126)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1121)
at 
org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:154)
at 
org.apache.hive.service.cli.operation.SQLOperation.access$100(SQLOperation.java:71)
at 
org.apache.hive.service.cli.operation.SQLOperation$1$1.run(SQLOperation.java:206)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
at 
org.apache.hive.service.cli.operation.SQLOperation$1.run(SQLOperation.java:218)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}

The Tez YarnClient has already received an application ID from the RM. When 
ATS is restarted, it tries to get the application report for that ID from the 
RM, so the RM throws ApplicationNotFoundException. ATS keeps requesting it, 
which floods the RM.

{code}
RM logs:
2016-11-23 13:53:57,345 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Allocated new 
applicationId: 5
2016-11-23 14:05:04,936 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 
on 8050, call 
org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
from 172.26.71.120:37699 Call#26 Retry#0
org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application 
with id 'application_1479897867169_0005' doesn't exist in RM.
at 
org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:328)
at 
org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:175)
at 
org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:417)
at 

[jira] [Assigned] (YARN-5933) ATS stale entries in active directory causes ApplicationNotFoundException in RM

2016-11-23 Thread Prabhu Joseph (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph reassigned YARN-5933:
---

Assignee: Prabhu Joseph

> ATS stale entries in active directory causes ApplicationNotFoundException in 
> RM
> ---
>
> Key: YARN-5933
> URL: https://issues.apache.org/jira/browse/YARN-5933
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: ATSv2
>Affects Versions: 2.7.3
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>
> On Secure cluster where ATS is down, Tez job submitted will fail while 
> getting TIMELINE_DELEGATION_TOKEN with below exception
> {code}
> 0: jdbc:hive2://kerberos-2.openstacklocal:100> select csmallint from 
> alltypesorc group by csmallint;
> INFO  : Session is already open
> INFO  : Dag name: select csmallint from alltypesor...csmallint(Stage-1)
> INFO  : Tez session was closed. Reopening...
> ERROR : Failed to execute tez graph.
> java.lang.RuntimeException: Failed to connect to timeline server. Connection 
> retries limit exceeded. The posted timeline event may be missing
>   at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:266)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.operateDelegationToken(TimelineClientImpl.java:590)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.getDelegationToken(TimelineClientImpl.java:506)
>   at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getTimelineDelegationToken(YarnClientImpl.java:349)
>   at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.addTimelineDelegationToken(YarnClientImpl.java:330)
>   at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.submitApplication(YarnClientImpl.java:250)
>   at 
> org.apache.tez.client.TezYarnClient.submitApplication(TezYarnClient.java:72)
>   at org.apache.tez.client.TezClient.start(TezClient.java:409)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezSessionState.open(TezSessionState.java:196)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezSessionPoolManager.closeAndOpen(TezSessionPoolManager.java:311)
>   at org.apache.hadoop.hive.ql.exec.tez.TezTask.submit(TezTask.java:453)
>   at org.apache.hadoop.hive.ql.exec.tez.TezTask.execute(TezTask.java:180)
>   at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160)
>   at 
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:89)
>   at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1728)
>   at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1485)
>   at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1262)
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1126)
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1121)
>   at 
> org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:154)
>   at 
> org.apache.hive.service.cli.operation.SQLOperation.access$100(SQLOperation.java:71)
>   at 
> org.apache.hive.service.cli.operation.SQLOperation$1$1.run(SQLOperation.java:206)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
>   at 
> org.apache.hive.service.cli.operation.SQLOperation$1.run(SQLOperation.java:218)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Tez YarnClient has received an applicationID from RM. On Restarting ATS now, 
> ATS tries to get the application report from RM and so RM will throw 
> ApplicationNotFoundException. ATS will keep on requesting and which floods RM.
> {code}
> RM logs:
> 2016-11-23 13:53:57,345 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Allocated new 
> applicationId: 5
> 2016-11-23 14:05:04,936 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
> 9 on 8050, call 
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
> from 172.26.71.120:37699 Call#26 Retry#0
> org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application 
> with id 'application_1479897867169_0005' doesn't exist in RM.
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:328)
>   at 
> 

[jira] [Commented] (YARN-5933) ATS stale entries in active directory causes ApplicationNotFoundException in RM

2016-11-26 Thread Prabhu Joseph (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15698124#comment-15698124
 ] 

Prabhu Joseph commented on YARN-5933:
-

Hi [~sunilg] [~gtCarrera9], below are some ways to fix this issue, assuming 
that an application not found in the RM at the first getApplicationReport call 
will never be in one of the APP_FINAL_STATES at a subsequent 
getApplicationReport call.

1. Once the AppState is Unknown, the appDir can be removed from the ActivePath 
immediately. I am not sure why there is a wait of unknownActiveMillis before 
the app is marked as completed. If we remove the appDir immediately, there is 
no need for the unknownActiveMillis handling code.
2. If unknown-state apps also need to be moved to the done directory, the 
appDir can be moved immediately instead of waiting for unknownActiveMillis.

Please share your comments.

> ATS stale entries in active directory causes ApplicationNotFoundException in 
> RM
> ---
>
> Key: YARN-5933
> URL: https://issues.apache.org/jira/browse/YARN-5933
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.3
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>
> On Secure cluster where ATS is down, Tez job submitted will fail while 
> getting TIMELINE_DELEGATION_TOKEN with below exception
> {code}
> 0: jdbc:hive2://kerberos-2.openstacklocal:100> select csmallint from 
> alltypesorc group by csmallint;
> INFO  : Session is already open
> INFO  : Dag name: select csmallint from alltypesor...csmallint(Stage-1)
> INFO  : Tez session was closed. Reopening...
> ERROR : Failed to execute tez graph.
> java.lang.RuntimeException: Failed to connect to timeline server. Connection 
> retries limit exceeded. The posted timeline event may be missing
>   at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:266)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.operateDelegationToken(TimelineClientImpl.java:590)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.getDelegationToken(TimelineClientImpl.java:506)
>   at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getTimelineDelegationToken(YarnClientImpl.java:349)
>   at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.addTimelineDelegationToken(YarnClientImpl.java:330)
>   at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.submitApplication(YarnClientImpl.java:250)
>   at 
> org.apache.tez.client.TezYarnClient.submitApplication(TezYarnClient.java:72)
>   at org.apache.tez.client.TezClient.start(TezClient.java:409)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezSessionState.open(TezSessionState.java:196)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezSessionPoolManager.closeAndOpen(TezSessionPoolManager.java:311)
>   at org.apache.hadoop.hive.ql.exec.tez.TezTask.submit(TezTask.java:453)
>   at org.apache.hadoop.hive.ql.exec.tez.TezTask.execute(TezTask.java:180)
>   at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160)
>   at 
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:89)
>   at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1728)
>   at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1485)
>   at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1262)
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1126)
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1121)
>   at 
> org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:154)
>   at 
> org.apache.hive.service.cli.operation.SQLOperation.access$100(SQLOperation.java:71)
>   at 
> org.apache.hive.service.cli.operation.SQLOperation$1$1.run(SQLOperation.java:206)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
>   at 
> org.apache.hive.service.cli.operation.SQLOperation$1.run(SQLOperation.java:218)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Tez YarnClient has received an applicationID from RM. On Restarting ATS now, 
> ATS tries to get the application report from RM and so RM will throw 
> ApplicationNotFoundException. ATS will keep on requesting and which floods 

[jira] [Commented] (YARN-5933) ATS stale entries in active directory causes ApplicationNotFoundException in RM

2016-11-29 Thread Prabhu Joseph (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15704980#comment-15704980
 ] 

Prabhu Joseph commented on YARN-5933:
-

Thanks [~gtCarrera9], it looks like directly removing an unknown appDir is not 
that simple. Suppose 10 Tez jobs fail while ATS is down: there will be 
10 * unknownActiveSecs / scanIntervalSecs = 14400 ApplicationNotFoundException 
stack traces in the RM logs over that entire day. If there is no impact other 
than flooding the RM logs, is it better to change the 
ApplicationNotFoundException stack trace into a single WARN message?
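One possible way to do that (illustrative only, not a reviewed patch): register 
the exception as terse on the ClientRMService RPC server, so the IPC layer logs 
a one-line message instead of the full stack trace.

{code}
// in ClientRMService, after the RPC server is created
this.server.addTerseExceptions(ApplicationNotFoundException.class);
{code}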

> ATS stale entries in active directory causes ApplicationNotFoundException in 
> RM
> ---
>
> Key: YARN-5933
> URL: https://issues.apache.org/jira/browse/YARN-5933
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.3
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>
> On Secure cluster where ATS is down, Tez job submitted will fail while 
> getting TIMELINE_DELEGATION_TOKEN with below exception
> {code}
> 0: jdbc:hive2://kerberos-2.openstacklocal:100> select csmallint from 
> alltypesorc group by csmallint;
> INFO  : Session is already open
> INFO  : Dag name: select csmallint from alltypesor...csmallint(Stage-1)
> INFO  : Tez session was closed. Reopening...
> ERROR : Failed to execute tez graph.
> java.lang.RuntimeException: Failed to connect to timeline server. Connection 
> retries limit exceeded. The posted timeline event may be missing
>   at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:266)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.operateDelegationToken(TimelineClientImpl.java:590)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.getDelegationToken(TimelineClientImpl.java:506)
>   at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getTimelineDelegationToken(YarnClientImpl.java:349)
>   at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.addTimelineDelegationToken(YarnClientImpl.java:330)
>   at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.submitApplication(YarnClientImpl.java:250)
>   at 
> org.apache.tez.client.TezYarnClient.submitApplication(TezYarnClient.java:72)
>   at org.apache.tez.client.TezClient.start(TezClient.java:409)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezSessionState.open(TezSessionState.java:196)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezSessionPoolManager.closeAndOpen(TezSessionPoolManager.java:311)
>   at org.apache.hadoop.hive.ql.exec.tez.TezTask.submit(TezTask.java:453)
>   at org.apache.hadoop.hive.ql.exec.tez.TezTask.execute(TezTask.java:180)
>   at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160)
>   at 
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:89)
>   at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1728)
>   at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1485)
>   at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1262)
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1126)
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1121)
>   at 
> org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:154)
>   at 
> org.apache.hive.service.cli.operation.SQLOperation.access$100(SQLOperation.java:71)
>   at 
> org.apache.hive.service.cli.operation.SQLOperation$1$1.run(SQLOperation.java:206)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
>   at 
> org.apache.hive.service.cli.operation.SQLOperation$1.run(SQLOperation.java:218)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Tez YarnClient has received an applicationID from RM. On restarting ATS now, 
> ATS tries to get the application report from RM, and so RM will throw 
> ApplicationNotFoundException. ATS will keep on requesting, which floods RM.
> {code}
> RM logs:
> 2016-11-23 13:53:57,345 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Allocated new 
> applicationId: 5
> 2016-11-23 14:05:04,936 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
> 9 on 8050, call 
> 

[jira] [Commented] (YARN-5933) ATS stale entries in active directory causes ApplicationNotFoundException in RM

2016-11-30 Thread Prabhu Joseph (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15708237#comment-15708237
 ] 

Prabhu Joseph commented on YARN-5933:
-

Hi [~gtCarrera9], okay, I think AppLogs#parseSummaryLogs() can skip subsequent 
getAppState calls for Unknown apps and move them to complete after unknownActiveSecs.
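
A rough sketch of that bookkeeping, as a standalone helper (the class, field, and 
method names below are hypothetical illustrations, not the actual 
EntityGroupFSTimelineStore internals):

{code}
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: remember when an app was first reported UNKNOWN and
// stop polling the RM for it; after unknownActiveSecs treat it as completed
// so its stale active directory can be moved to done.
public class UnknownAppTracker {
  enum AppState { ACTIVE, COMPLETED, UNKNOWN }

  private final Map<String, Long> unknownFirstSeenMs = new HashMap<>();
  private final long unknownActiveMs;

  public UnknownAppTracker(long unknownActiveSecs) {
    this.unknownActiveMs = unknownActiveSecs * 1000L;
  }

  /** Returns true when the caller should move the app dir to "done". */
  public boolean shouldMoveToDone(String appId, AppState stateFromRM) {
    if (stateFromRM != AppState.UNKNOWN) {
      unknownFirstSeenMs.remove(appId);
      return stateFromRM == AppState.COMPLETED;
    }
    long firstSeen = unknownFirstSeenMs.computeIfAbsent(
        appId, id -> System.currentTimeMillis());
    return System.currentTimeMillis() - firstSeen > unknownActiveMs;
  }
}
{code}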

> ATS stale entries in active directory causes ApplicationNotFoundException in 
> RM
> ---
>
> Key: YARN-5933
> URL: https://issues.apache.org/jira/browse/YARN-5933
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.3
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>
> On a secure cluster where ATS is down, a submitted Tez job will fail while 
> getting the TIMELINE_DELEGATION_TOKEN with the below exception
> {code}
> 0: jdbc:hive2://kerberos-2.openstacklocal:100> select csmallint from 
> alltypesorc group by csmallint;
> INFO  : Session is already open
> INFO  : Dag name: select csmallint from alltypesor...csmallint(Stage-1)
> INFO  : Tez session was closed. Reopening...
> ERROR : Failed to execute tez graph.
> java.lang.RuntimeException: Failed to connect to timeline server. Connection 
> retries limit exceeded. The posted timeline event may be missing
>   at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:266)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.operateDelegationToken(TimelineClientImpl.java:590)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.getDelegationToken(TimelineClientImpl.java:506)
>   at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getTimelineDelegationToken(YarnClientImpl.java:349)
>   at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.addTimelineDelegationToken(YarnClientImpl.java:330)
>   at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.submitApplication(YarnClientImpl.java:250)
>   at 
> org.apache.tez.client.TezYarnClient.submitApplication(TezYarnClient.java:72)
>   at org.apache.tez.client.TezClient.start(TezClient.java:409)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezSessionState.open(TezSessionState.java:196)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.TezSessionPoolManager.closeAndOpen(TezSessionPoolManager.java:311)
>   at org.apache.hadoop.hive.ql.exec.tez.TezTask.submit(TezTask.java:453)
>   at org.apache.hadoop.hive.ql.exec.tez.TezTask.execute(TezTask.java:180)
>   at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160)
>   at 
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:89)
>   at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1728)
>   at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1485)
>   at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1262)
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1126)
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1121)
>   at 
> org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:154)
>   at 
> org.apache.hive.service.cli.operation.SQLOperation.access$100(SQLOperation.java:71)
>   at 
> org.apache.hive.service.cli.operation.SQLOperation$1$1.run(SQLOperation.java:206)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
>   at 
> org.apache.hive.service.cli.operation.SQLOperation$1.run(SQLOperation.java:218)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Tez YarnClient has received an applicationID from RM. On restarting ATS now, 
> ATS tries to get the application report from RM, and so RM will throw 
> ApplicationNotFoundException. ATS will keep on requesting, which floods RM.
> {code}
> RM logs:
> 2016-11-23 13:53:57,345 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Allocated new 
> applicationId: 5
> 2016-11-23 14:05:04,936 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
> 9 on 8050, call 
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
> from 172.26.71.120:37699 Call#26 Retry#0
> org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application 
> with id 'application_1479897867169_0005' doesn't exist in RM.
>   at 
> 

[jira] [Updated] (YARN-6052) Yarn RM UI % of Queue at application level is wrong

2017-01-04 Thread Prabhu Joseph (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-6052:

Attachment: RM_UI.png

> Yarn RM UI % of Queue at application level is wrong
> ---
>
> Key: YARN-6052
> URL: https://issues.apache.org/jira/browse/YARN-6052
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.3
>Reporter: Prabhu Joseph
>Priority: Minor
> Attachments: RM_UI.png
>
>
> Test Case:
> yarn.scheduler.capacity.root.capacity=100
> yarn.scheduler.capacity.root.queues=default,dummy
> yarn.scheduler.capacity.root.default.capacity=20
> yarn.scheduler.capacity.root.dummy.capacity=80
> yarn.scheduler.capacity.root.dummy.child.capacity=50
> yarn.scheduler.capacity.root.dummy.child2.capacity=50
> Memory Total is 20GB, default queue share is 4GB and dummy queue share is 
> 16GB. The child and child2 queues get an 8GB share each.
> A map reduce job is submitted  to child2 queue which asks 2 containers of 512 
> MB. Now cluster Memory Used is 1GB.
> Root queue usage = 100 / (total memory / used memory)  = 100 / (20 / 1) =  5%
> Dummy queue usage = 100 / (16 /1) = 6.3%
> Dummy.Child2 queue usage = 100 / (8/1) = 12.5%
> At application level, % of queue is calculated as 100 / (50% of root queue 
> capacity) = 100 / (50% of 20GB) = 10.0 instead of 
> 100 / (50% of dummy queue capacity) = 100 / (50% of 16GB) = 100 / 8 = 12.5
> Where 50% is dummy.child2 capacity
> Attached RM UI screenshot.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)




[jira] [Assigned] (YARN-6052) Yarn RM UI % of Queue at application level is wrong

2017-01-04 Thread Prabhu Joseph (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph reassigned YARN-6052:
---

Assignee: Prabhu Joseph

> Yarn RM UI % of Queue at application level is wrong
> ---
>
> Key: YARN-6052
> URL: https://issues.apache.org/jira/browse/YARN-6052
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.3
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Minor
> Attachments: RM_UI.png
>
>
> Test Case:
> yarn.scheduler.capacity.root.capacity=100
> yarn.scheduler.capacity.root.queues=default,dummy
> yarn.scheduler.capacity.root.default.capacity=20
> yarn.scheduler.capacity.root.dummy.capacity=80
> yarn.scheduler.capacity.root.dummy.child.capacity=50
> yarn.scheduler.capacity.root.dummy.child2.capacity=50
> Memory Total is 20GB, default queue share is 4GB and dummy queue share is 
> 16GB. The child and child2 queues get an 8GB share each.
> A map reduce job is submitted  to child2 queue which asks 2 containers of 512 
> MB. Now cluster Memory Used is 1GB.
> Root queue usage = 100 / (total memory / used memory)  = 100 / (20 / 1) =  5%
> Dummy queue usage = 100 / (16 /1) = 6.3%
> Dummy.Child2 queue usage = 100 / (8/1) = 12.5%
> At application level, % of queue is calculated as 100 / (50% of root queue 
> capacity) = 100 / (50% of 20GB) = 10.0 instead of 
> 100 / (50% of dummy queue capacity) = 100 / (50% of 16GB) = 100 / 8 = 12.5
> Where 50% is dummy.child2 capacity
> Attached RM UI screenshot.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)




[jira] [Created] (YARN-6053) RM Web Service shows startedTime, finishedTime as zero when RM is kerberized and ACL is setup

2017-01-04 Thread Prabhu Joseph (JIRA)
Prabhu Joseph created YARN-6053:
---

 Summary: RM Web Service shows startedTime, finishedTime as zero 
when RM is kerberized and ACL is setup
 Key: YARN-6053
 URL: https://issues.apache.org/jira/browse/YARN-6053
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.7.3
Reporter: Prabhu Joseph
Priority: Minor


RM UI is Kerberized and ACLs are set up. A user pjoseph has logged into the RM UI 
and is able to see another user prabhu's job startTime and finishTime, but is not 
able to read the attempts of the application, which is expected as ACLs are set 
up. But when using the RM Web Services, 
http://kerberos-3.openstacklocal:8088/ws/v1/cluster/apps/application_1482325548661_0002
 the startedTime,
finishedTime and elapsedTime are 0 [AppInfo.java sets these to zero if the user 
does not have access]. We can display the correct values, as the RM UI shows them 
anyway.

Attached output of RM UI and RM WebService.
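
A minimal sketch of the proposed behaviour, assuming the constructor currently 
guards the timestamps behind the same ACL check as the restricted fields (class 
and field names here are placeholders, not the real AppInfo.java source):

{code}
// Placeholder class, not org.apache.hadoop.yarn.server.webapp.dao.AppInfo:
// the timestamps the RM UI already exposes are filled in unconditionally,
// and only ACL-protected details stay behind the hasAccess check.
public class AppInfoSketch {
  long startedTime;
  long finishedTime;
  long elapsedTime;
  String diagnostics = "";

  public AppInfoSketch(long start, long finish, String diag, boolean hasAccess) {
    this.startedTime = start;
    this.finishedTime = finish;
    this.elapsedTime = (finish > 0 ? finish : System.currentTimeMillis()) - start;
    if (hasAccess) {
      // Restricted information remains conditional on the ACL check.
      this.diagnostics = diag;
    }
  }
}
{code}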



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)




[jira] [Assigned] (YARN-6053) RM Web Service shows startedTime, finishedTime as zero when RM is kerberized and ACL is setup

2017-01-04 Thread Prabhu Joseph (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph reassigned YARN-6053:
---

Assignee: Prabhu Joseph

> RM Web Service shows startedTime, finishedTime as zero when RM is kerberized 
> and ACL is setup
> --
>
> Key: YARN-6053
> URL: https://issues.apache.org/jira/browse/YARN-6053
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.3
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Minor
> Attachments: RM_UI_ACL.png, RM_UI_start_stop.png, 
> RM_WEB_SERVICE_start_stop.png
>
>
> RM UI is Kerberized and ACLs are set up. A user pjoseph has logged into the RM 
> UI and is able to see another user prabhu's job startTime and finishTime, but 
> is not able to read the attempts of the application, which is expected as ACLs 
> are set up. But when using the RM Web Services, 
> http://kerberos-3.openstacklocal:8088/ws/v1/cluster/apps/application_1482325548661_0002
>  the startedTime,
> finishedTime and elapsedTime are 0 [AppInfo.java sets these to zero if the user 
> does not have access]. We can display the correct values, as the RM UI shows 
> them anyway.
> Attached output of RM UI and RM WebService.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)




[jira] [Updated] (YARN-6053) RM Web Service shows startedTime, finishedTime as zero when RM is kerberized and ACL is setup

2017-01-04 Thread Prabhu Joseph (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-6053:

Attachment: RM_UI_ACL.png
RM_WEB_SERVICE_start_stop.png
RM_UI_start_stop.png

> RM Web Service shows startedTime, finishedTime as zero when RM is kerberized 
> and ACL is setup
> --
>
> Key: YARN-6053
> URL: https://issues.apache.org/jira/browse/YARN-6053
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.3
>Reporter: Prabhu Joseph
>Priority: Minor
> Attachments: RM_UI_ACL.png, RM_UI_start_stop.png, 
> RM_WEB_SERVICE_start_stop.png
>
>
> RM UI is Kerberized and ACLs are set up. A user pjoseph has logged into the RM 
> UI and is able to see another user prabhu's job startTime and finishTime, but 
> is not able to read the attempts of the application, which is expected as ACLs 
> are set up. But when using the RM Web Services, 
> http://kerberos-3.openstacklocal:8088/ws/v1/cluster/apps/application_1482325548661_0002
>  the startedTime,
> finishedTime and elapsedTime are 0 [AppInfo.java sets these to zero if the user 
> does not have access]. We can display the correct values, as the RM UI shows 
> them anyway.
> Attached output of RM UI and RM WebService.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)




[jira] [Created] (YARN-6052) Yarn RM UI % of Queue at application level is wrong

2017-01-04 Thread Prabhu Joseph (JIRA)
Prabhu Joseph created YARN-6052:
---

 Summary: Yarn RM UI % of Queue at application level is wrong
 Key: YARN-6052
 URL: https://issues.apache.org/jira/browse/YARN-6052
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.7.3
Reporter: Prabhu Joseph
Priority: Minor


Test Case:

yarn.scheduler.capacity.root.capacity=100
yarn.scheduler.capacity.root.queues=default,dummy
yarn.scheduler.capacity.root.default.capacity=20
yarn.scheduler.capacity.root.dummy.capacity=80
yarn.scheduler.capacity.root.dummy.child.capacity=50
yarn.scheduler.capacity.root.dummy.child2.capacity=50

Memory Total is 20GB, default queue share is 4GB and dummy queue share is 16GB. 
The child and child2 queues get an 8GB share each.

A map reduce job is submitted  to child2 queue which asks 2 containers of 512 
MB. Now cluster Memory Used is 1GB.

Root queue usage = 100 / (total memory / used memory)  = 100 / (20 / 1) =  5%
Dummy queue usage = 100 / (16 /1) = 6.3%
Dummy.Child2 queue usage = 100 / (8/1) = 12.5%

At application level, % of queue is calculated as 100 / (50% of root queue 
capacity) = 100 / (50% of 20GB) = 10.0 instead of 
100 / (50% of dummy queue capacity) = 100 / (50% of 16GB) = 100 / 8 = 12.5

Where 50% is dummy.child2 capacity

Attached RM UI screenshot.
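
For reference, the same arithmetic as a small self-contained snippet (plain 
numbers only, no YARN APIs), contrasting the value currently shown with the 
expected one:

{code}
public class QueuePercentExample {
  public static void main(String[] args) {
    double clusterGb = 20.0;        // total cluster memory
    double usedGb = 1.0;            // memory used by the job
    double dummyAbsCapacity = 0.80; // root.dummy = 80% of the cluster
    double child2Capacity = 0.50;   // root.dummy.child2 = 50% of dummy

    // Shown today: % of queue computed against 50% of the *root* capacity.
    double shown = 100.0 * usedGb / (child2Capacity * clusterGb);           // 10.0
    // Expected: % of queue computed against 50% of the *dummy* capacity.
    double expected = 100.0 * usedGb
        / (child2Capacity * dummyAbsCapacity * clusterGb);                  // 12.5
    System.out.printf("shown=%.1f expected=%.1f%n", shown, expected);
  }
}
{code}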




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)




[jira] [Updated] (YARN-6075) Yarn top for FairScheduler

2017-01-09 Thread Prabhu Joseph (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-6075:

Attachment: Yarn_Top_FairScheduler.png

> Yarn top for FairScheduler
> --
>
> Key: YARN-6075
> URL: https://issues.apache.org/jira/browse/YARN-6075
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler, resourcemanager
>Reporter: Prabhu Joseph
> Attachments: Yarn_Top_FairScheduler.png
>
>
> Yarn top output for FairScheduler shows empty values. (attached output) We 
> need to handle yarn top with FairScheduler. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)




[jira] [Created] (YARN-6075) Yarn top for FairScheduler

2017-01-09 Thread Prabhu Joseph (JIRA)
Prabhu Joseph created YARN-6075:
---

 Summary: Yarn top for FairScheduler
 Key: YARN-6075
 URL: https://issues.apache.org/jira/browse/YARN-6075
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: fairscheduler, resourcemanager
Reporter: Prabhu Joseph


Yarn top output for FairScheduler shows empty values. (attached output) We need 
to handle yarn top with FairScheduler. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)




[jira] [Commented] (YARN-6052) Yarn RM UI % of Queue at application level is wrong

2017-01-05 Thread Prabhu Joseph (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15803853#comment-15803853
 ] 

Prabhu Joseph commented on YARN-6052:
-

Sorry for the spam, the issue is already fixed by YARN-. Closing this as a 
Duplicate.

> Yarn RM UI % of Queue at application level is wrong
> ---
>
> Key: YARN-6052
> URL: https://issues.apache.org/jira/browse/YARN-6052
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.3
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Minor
> Attachments: RM_UI.png
>
>
> Test Case:
> yarn.scheduler.capacity.root.capacity=100
> yarn.scheduler.capacity.root.queues=default,dummy
> yarn.scheduler.capacity.root.default.capacity=20
> yarn.scheduler.capacity.root.dummy.capacity=80
> yarn.scheduler.capacity.root.dummy.child.capacity=50
> yarn.scheduler.capacity.root.dummy.child2.capacity=50
> Memory Total is 20GB, default queue share is 4GB and dummy queue share is 
> 16GB. The child and child2 queues get an 8GB share each.
> A map reduce job is submitted  to child2 queue which asks 2 containers of 512 
> MB. Now cluster Memory Used is 1GB.
> Root queue usage = 100 / (total memory / used memory)  = 100 / (20 / 1) =  5%
> Dummy queue usage = 100 / (16 /1) = 6.3%
> Dummy.Child2 queue usage = 100 / (8/1) = 12.5%
> At application level, % of queue is calculated as 100 / (50% of root queue 
> capacity) = 100 / (50% of 20GB) = 10.0 instead of 
> 100 / (50% of dummy queue capacity) = 100 / (50% of 16GB) = 100 / 8 = 12.5
> Where 50% is dummy.child2 capacity
> Attached RM UI screenshot.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)




[jira] [Resolved] (YARN-6052) Yarn RM UI % of Queue at application level is wrong

2017-01-05 Thread Prabhu Joseph (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph resolved YARN-6052.
-
Resolution: Duplicate

> Yarn RM UI % of Queue at application level is wrong
> ---
>
> Key: YARN-6052
> URL: https://issues.apache.org/jira/browse/YARN-6052
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.3
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Minor
> Attachments: RM_UI.png
>
>
> Test Case:
> yarn.scheduler.capacity.root.capacity=100
> yarn.scheduler.capacity.root.queues=default,dummy
> yarn.scheduler.capacity.root.default.capacity=20
> yarn.scheduler.capacity.root.dummy.capacity=80
> yarn.scheduler.capacity.root.dummy.child.capacity=50
> yarn.scheduler.capacity.root.dummy.child2.capacity=50
> Memory Total is 20GB, default queue share is 4GB and dummy queue share is 
> 16GB. The child and child2 queues get an 8GB share each.
> A map reduce job is submitted  to child2 queue which asks 2 containers of 512 
> MB. Now cluster Memory Used is 1GB.
> Root queue usage = 100 / (total memory / used memory)  = 100 / (20 / 1) =  5%
> Dummy queue usage = 100 / (16 /1) = 6.3%
> Dummy.Child2 queue usage = 100 / (8/1) = 12.5%
> At application level, % of queue is calculated as 100 / (50% of root queue 
> capacity) = 100 / (50% of 20GB) = 10.0 instead of 
> 100 / (50% of dummy queue capacity) = 100 / (50% of 16GB) = 100 / 8 = 12.5
> Where 50% is dummy.child2 capacity
> Attached RM UI screenshot.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)




[jira] [Commented] (YARN-6929) yarn.nodemanager.remote-app-log-dir structure is not scalable

2017-08-02 Thread Prabhu Joseph (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16112071#comment-16112071
 ] 

Prabhu Joseph commented on YARN-6929:
-

The date can be retrieved from the timestamp present in the application id while 
creating the date subdirectory. So while scanning we will know directly which 
date subdirectory to check. The URL can remain the same.
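
A small sketch of that lookup, assuming the date bucket is derived from the 
cluster timestamp embedded in application_<timestamp>_<id> (string parsing keeps 
the example self-contained; the date format shown is only illustrative):

{code}
import java.text.SimpleDateFormat;
import java.util.Date;

// Illustrative only: derive a date subdirectory from the timestamp embedded
// in the application id, so a reader can jump straight to the right bucket
// without listing every date directory.
public class AppLogDateDir {
  private static final SimpleDateFormat FMT = new SimpleDateFormat("yyyyMMdd");

  public static String dateDirFor(String appId) {
    // e.g. application_1479897867169_0005 -> 1479897867169 -> 20161123
    String[] parts = appId.split("_");
    long clusterTimestamp = Long.parseLong(parts[1]);
    return FMT.format(new Date(clusterTimestamp));
  }

  public static void main(String[] args) {
    System.out.println(dateDirFor("application_1479897867169_0005"));
  }
}
{code}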

> yarn.nodemanager.remote-app-log-dir structure is not scalable
> -
>
> Key: YARN-6929
> URL: https://issues.apache.org/jira/browse/YARN-6929
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation
>Affects Versions: 2.7.3
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>
> The current directory structure for yarn.nodemanager.remote-app-log-dir is 
> not scalable. Maximum Subdirectory limit by default is 1048576 (HDFS-6102). 
> With retention yarn.nodemanager.log.retain-second of 7days, there are more 
> chances LogAggregationService fails to create a new directory with 
> FSLimitException$MaxDirectoryItemsExceededException.
> The current structure is 
> //logs/. This can be 
> improved with adding date as a subdirectory like 
> //logs// 
> {code}
> WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService:
>  Application failed to init aggregation 
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.FSLimitException$MaxDirectoryItemsExceededException):
>  The directory item limit of /app-logs/yarn/logs is exceeded: limit=1048576 
> items=1048576 
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyMaxDirItems(FSDirectory.java:2021)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:2072)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedMkdir(FSDirectory.java:1841)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsRecursively(FSNamesystem.java:4351)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInternal(FSNamesystem.java:4262)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:4221)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:4194)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:813)
>  
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:600)
>  
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>  
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
>  
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962) 
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039) 
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035) 
> at java.security.AccessController.doPrivileged(Native Method) 
> at javax.security.auth.Subject.doAs(Subject.java:415) 
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>  
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033) 
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.createAppDir(LogAggregationService.java:308)
>  
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initAppAggregator(LogAggregationService.java:366)
>  
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initApp(LogAggregationService.java:320)
>  
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:443)
>  
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:67)
>  
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
>  
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) 
> at java.lang.Thread.run(Thread.java:745) 
> Caused by: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.FSLimitException$MaxDirectoryItemsExceededException):
>  The directory item limit of /app-logs/yarn/logs is exceeded: limit=1048576 
> items=1048576 
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyMaxDirItems(FSDirectory.java:2021)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:2072)
>  
> at 
> 

[jira] [Updated] (YARN-6929) yarn.nodemanager.remote-app-log-dir structure is not scalable

2017-08-03 Thread Prabhu Joseph (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-6929:

Description: 
The current directory structure for yarn.nodemanager.remote-app-log-dir is not 
scalable. Maximum Subdirectory limit by default is 1048576 (HDFS-6102). With 
retention yarn.log-aggregation.retain-seconds of 7days, there are more chances 
LogAggregationService fails to create a new directory with 
FSLimitException$MaxDirectoryItemsExceededException.

The current structure is 
//logs/. This can be 
improved with adding date as a subdirectory like 
//logs// 


{code}
WARN 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService:
 Application failed to init aggregation 
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.FSLimitException$MaxDirectoryItemsExceededException):
 The directory item limit of /app-logs/yarn/logs is exceeded: limit=1048576 
items=1048576 
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyMaxDirItems(FSDirectory.java:2021)
 
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:2072)
 
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedMkdir(FSDirectory.java:1841)
 
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsRecursively(FSNamesystem.java:4351)
 
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInternal(FSNamesystem.java:4262)
 
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:4221)
 
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:4194)
 
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:813)
 
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:600)
 
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
 
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
 
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962) 
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039) 
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035) 
at java.security.AccessController.doPrivileged(Native Method) 
at javax.security.auth.Subject.doAs(Subject.java:415) 
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
 
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033) 
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.createAppDir(LogAggregationService.java:308)
 
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initAppAggregator(LogAggregationService.java:366)
 
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initApp(LogAggregationService.java:320)
 
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:443)
 
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:67)
 
at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) 
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) 
at java.lang.Thread.run(Thread.java:745) 
Caused by: 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.FSLimitException$MaxDirectoryItemsExceededException):
 The directory item limit of /app-logs/yarn/logs is exceeded: limit=1048576 
items=1048576 
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyMaxDirItems(FSDirectory.java:2021)
 
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:2072)
 
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedMkdir(FSDirectory.java:1841)
 
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsRecursively(FSNamesystem.java:4351)
 
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInternal(FSNamesystem.java:4262)
 
{code}

Thanks to Robert Mancuso for finding this issue.

  was:
The current directory structure for yarn.nodemanager.remote-app-log-dir is not 
scalable. Maximum Subdirectory limit by default is 1048576 (HDFS-6102). With 
retention yarn.nodemanager.log.retain-second of 7days, there are more chances 
LogAggregationService fails to create a new directory with 
FSLimitException$MaxDirectoryItemsExceededException.

The current structure is 
//logs/. This can be 
improved with adding date as a subdirectory like 
//logs// 


{code}
WARN 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService:
 

[jira] [Commented] (YARN-6929) yarn.nodemanager.remote-app-log-dir structure is not scalable

2017-08-03 Thread Prabhu Joseph (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113232#comment-16113232
 ] 

Prabhu Joseph commented on YARN-6929:
-

Yes, got it. I think the max bucket size can be derived from 
yarn.log-aggregation.retain-seconds (in days), say 
yarn.log-aggregation.retain-seconds (in days) * 24, so it will scale with any 
configured retention period. Otherwise a max bucket size that is optimal for a 
7-day retention won't be optimal for 30 days.

And why do we need the two sub directories (app_id / bucket_size) and 
(app_id % bucket_size)? I think the below by itself should solve it.

{code}
aggregation_log_root / user / cluster_timestamp / (app_id%bucket_size) 
 where bucket_size determined from yarn.log-aggregation.retain-seconds 
(in days) * 24
{code}

> yarn.nodemanager.remote-app-log-dir structure is not scalable
> -
>
> Key: YARN-6929
> URL: https://issues.apache.org/jira/browse/YARN-6929
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation
>Affects Versions: 2.7.3
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>
> The current directory structure for yarn.nodemanager.remote-app-log-dir is 
> not scalable. Maximum Subdirectory limit by default is 1048576 (HDFS-6102). 
> With retention yarn.log-aggregation.retain-seconds of 7days, there are more 
> chances LogAggregationService fails to create a new directory with 
> FSLimitException$MaxDirectoryItemsExceededException.
> The current structure is 
> //logs/. This can be 
> improved with adding date as a subdirectory like 
> //logs// 
> {code}
> WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService:
>  Application failed to init aggregation 
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.FSLimitException$MaxDirectoryItemsExceededException):
>  The directory item limit of /app-logs/yarn/logs is exceeded: limit=1048576 
> items=1048576 
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyMaxDirItems(FSDirectory.java:2021)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:2072)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedMkdir(FSDirectory.java:1841)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsRecursively(FSNamesystem.java:4351)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInternal(FSNamesystem.java:4262)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:4221)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:4194)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:813)
>  
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:600)
>  
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>  
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
>  
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962) 
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039) 
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035) 
> at java.security.AccessController.doPrivileged(Native Method) 
> at javax.security.auth.Subject.doAs(Subject.java:415) 
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>  
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033) 
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.createAppDir(LogAggregationService.java:308)
>  
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initAppAggregator(LogAggregationService.java:366)
>  
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initApp(LogAggregationService.java:320)
>  
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:443)
>  
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:67)
>  
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
>  
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) 
> at java.lang.Thread.run(Thread.java:745) 
> Caused by: 
> 

[jira] [Commented] (YARN-6929) yarn.nodemanager.remote-app-log-dir structure is not scalable

2017-08-03 Thread Prabhu Joseph (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113176#comment-16113176
 ] 

Prabhu Joseph commented on YARN-6929:
-

Thanks, missed it. A hash can be generated from ApplicationId#getId() with 
yarn.log-aggregation.retain-seconds (in days) * 24 buckets (hoping one hour will 
not have millions of apps). This way random read and write of the appDir is 
possible. The deletion service will traverse these hashDirs for every userDir.

> yarn.nodemanager.remote-app-log-dir structure is not scalable
> -
>
> Key: YARN-6929
> URL: https://issues.apache.org/jira/browse/YARN-6929
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation
>Affects Versions: 2.7.3
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>
> The current directory structure for yarn.nodemanager.remote-app-log-dir is 
> not scalable. Maximum Subdirectory limit by default is 1048576 (HDFS-6102). 
> With retention yarn.log-aggregation.retain-seconds of 7days, there are more 
> chances LogAggregationService fails to create a new directory with 
> FSLimitException$MaxDirectoryItemsExceededException.
> The current structure is 
> //logs/. This can be 
> improved with adding date as a subdirectory like 
> //logs// 
> {code}
> WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService:
>  Application failed to init aggregation 
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.FSLimitException$MaxDirectoryItemsExceededException):
>  The directory item limit of /app-logs/yarn/logs is exceeded: limit=1048576 
> items=1048576 
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyMaxDirItems(FSDirectory.java:2021)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:2072)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedMkdir(FSDirectory.java:1841)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsRecursively(FSNamesystem.java:4351)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInternal(FSNamesystem.java:4262)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:4221)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:4194)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:813)
>  
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:600)
>  
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>  
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
>  
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962) 
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039) 
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035) 
> at java.security.AccessController.doPrivileged(Native Method) 
> at javax.security.auth.Subject.doAs(Subject.java:415) 
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>  
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033) 
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.createAppDir(LogAggregationService.java:308)
>  
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initAppAggregator(LogAggregationService.java:366)
>  
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initApp(LogAggregationService.java:320)
>  
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:443)
>  
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:67)
>  
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
>  
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) 
> at java.lang.Thread.run(Thread.java:745) 
> Caused by: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.FSLimitException$MaxDirectoryItemsExceededException):
>  The directory item limit of /app-logs/yarn/logs is exceeded: limit=1048576 
> items=1048576 
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyMaxDirItems(FSDirectory.java:2021)
>  
> at 
> 

[jira] [Commented] (YARN-6929) yarn.nodemanager.remote-app-log-dir structure is not scalable

2017-08-03 Thread Prabhu Joseph (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113858#comment-16113858
 ] 

Prabhu Joseph commented on YARN-6929:
-

Yes clear now. 
{code}
aggregation_log_root / user / cluster_timestamp / (app_id/ bucket_size)
where bucket_size = DFS_NAMENODE_MAX_DIRECTORY_ITEMS_KEY
{code}
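
For illustration, one way the agreed layout could be built (the path pieces and 
formatting below are assumptions for the sketch, not the exact strings a patch 
would use):

{code}
// Sketch: bucket app directories by app_id / bucket_size so that no single
// parent directory exceeds the HDFS max-directory-items limit.
public class AggregatedLogPath {
  public static String appLogDir(String root, String user,
      long clusterTimestamp, int appId, int bucketSize) {
    int bucket = appId / bucketSize;  // integer division groups sequential ids
    return String.format("%s/%s/%d/%d/application_%d_%04d",
        root, user, clusterTimestamp, bucket, clusterTimestamp, appId);
  }

  public static void main(String[] args) {
    // bucketSize would come from dfs.namenode.fs-limits.max-directory-items
    // (default 1048576); a small value is used here only to show the bucketing.
    System.out.println(appLogDir("/app-logs", "yarn", 1479897867169L, 5, 10000));
  }
}
{code}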

> yarn.nodemanager.remote-app-log-dir structure is not scalable
> -
>
> Key: YARN-6929
> URL: https://issues.apache.org/jira/browse/YARN-6929
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation
>Affects Versions: 2.7.3
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>
> The current directory structure for yarn.nodemanager.remote-app-log-dir is 
> not scalable. Maximum Subdirectory limit by default is 1048576 (HDFS-6102). 
> With retention yarn.log-aggregation.retain-seconds of 7days, there are more 
> chances LogAggregationService fails to create a new directory with 
> FSLimitException$MaxDirectoryItemsExceededException.
> The current structure is 
> //logs/. This can be 
> improved with adding date as a subdirectory like 
> //logs// 
> {code}
> WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService:
>  Application failed to init aggregation 
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.FSLimitException$MaxDirectoryItemsExceededException):
>  The directory item limit of /app-logs/yarn/logs is exceeded: limit=1048576 
> items=1048576 
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyMaxDirItems(FSDirectory.java:2021)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:2072)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedMkdir(FSDirectory.java:1841)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsRecursively(FSNamesystem.java:4351)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInternal(FSNamesystem.java:4262)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:4221)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:4194)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:813)
>  
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:600)
>  
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>  
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
>  
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962) 
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039) 
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035) 
> at java.security.AccessController.doPrivileged(Native Method) 
> at javax.security.auth.Subject.doAs(Subject.java:415) 
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>  
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033) 
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.createAppDir(LogAggregationService.java:308)
>  
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initAppAggregator(LogAggregationService.java:366)
>  
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initApp(LogAggregationService.java:320)
>  
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:443)
>  
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:67)
>  
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
>  
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) 
> at java.lang.Thread.run(Thread.java:745) 
> Caused by: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.FSLimitException$MaxDirectoryItemsExceededException):
>  The directory item limit of /app-logs/yarn/logs is exceeded: limit=1048576 
> items=1048576 
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyMaxDirItems(FSDirectory.java:2021)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:2072)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedMkdir(FSDirectory.java:1841)
>  
> at 
> 

[jira] [Created] (YARN-6810) YARN localizer has to validate the mapreduce.tar.gz present in cache before using it

2017-07-12 Thread Prabhu Joseph (JIRA)
Prabhu Joseph created YARN-6810:
---

 Summary: YARN localizer has to validate the mapreduce.tar.gz 
present in cache before using it
 Key: YARN-6810
 URL: https://issues.apache.org/jira/browse/YARN-6810
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.7.3
Reporter: Prabhu Joseph


When a localized mapreduce.tar.gz is corrupt and zero bytes, all MapReduce jobs 
fail on the cluster with "Error: Could not find or load main class 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster " as they use the corrupt 
mapreduce.tar.gz. The YARN localizer has to check that the existing 
mapreduce.tar.gz is a valid file before using it.
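
For illustration, a minimal validity check of the kind described above (an 
assumption, not the localizer's actual logic): reject the cached archive if it is 
empty or does not start with the gzip magic bytes, so it gets re-localized 
instead of being reused.

{code}
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

// Illustrative check only, not the real NodeManager code path.
public class CachedArchiveCheck {
  public static boolean looksValidGzip(File f) throws IOException {
    if (!f.isFile() || f.length() == 0) {
      return false;               // zero-byte or missing archive is unusable
    }
    try (DataInputStream in = new DataInputStream(new FileInputStream(f))) {
      int b1 = in.read();
      int b2 = in.read();
      return b1 == 0x1f && b2 == 0x8b;   // gzip magic number
    }
  }
}
{code}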

 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)




[jira] [Commented] (YARN-6810) YARN localizer has to validate the mapreduce.tar.gz present in cache before using it

2017-07-12 Thread Prabhu Joseph (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16084191#comment-16084191
 ] 

Prabhu Joseph commented on YARN-6810:
-

[~jlowe] Missed it while searching for existing jira. Will close this as a 
Duplicate.

> YARN localizer has to validate the mapreduce.tar.gz present in cache before 
> using it
> 
>
> Key: YARN-6810
> URL: https://issues.apache.org/jira/browse/YARN-6810
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.7.3
>Reporter: Prabhu Joseph
>
> When a localized mapreduce.tar.gz is corrupt and zero bytes, all MapReduce 
> jobs fail on the cluster with "Error: Could not find or load main class 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster " as they use the corrupt 
> mapreduce.tar.gz. The YARN localizer has to check that the existing 
> mapreduce.tar.gz is a valid file before using it.
>  



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)




[jira] [Created] (YARN-6929) yarn.nodemanager.remote-app-log-dir structure is not scalable

2017-08-02 Thread Prabhu Joseph (JIRA)
Prabhu Joseph created YARN-6929:
---

 Summary: yarn.nodemanager.remote-app-log-dir structure is not 
scalable
 Key: YARN-6929
 URL: https://issues.apache.org/jira/browse/YARN-6929
 Project: Hadoop YARN
  Issue Type: Bug
  Components: log-aggregation
Affects Versions: 2.7.3
Reporter: Prabhu Joseph
Assignee: Prabhu Joseph


The current directory structure for yarn.nodemanager.remote-app-log-dir is not 
scalable. Maximum Subdirectory limit by default is 1048576 (HDFS-6102). With 
retention yarn.nodemanager.log.retain-second of 7days, there are more chances 
LogAggregationService fails to create a new directory with 
FSLimitException$MaxDirectoryItemsExceededException.

The current structure is 
//logs/. This can be 
improved with adding date as a subdirectory like 
//logs// 


{code}
WARN 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService:
 Application failed to init aggregation 
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.FSLimitException$MaxDirectoryItemsExceededException):
 The directory item limit of /app-logs/yarn/logs is exceeded: limit=1048576 
items=1048576 
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyMaxDirItems(FSDirectory.java:2021)
 
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:2072)
 
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedMkdir(FSDirectory.java:1841)
 
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsRecursively(FSNamesystem.java:4351)
 
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInternal(FSNamesystem.java:4262)
 
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:4221)
 
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:4194)
 
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:813)
 
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:600)
 
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
 
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
 
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962) 
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039) 
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035) 
at java.security.AccessController.doPrivileged(Native Method) 
at javax.security.auth.Subject.doAs(Subject.java:415) 
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
 
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033) 
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.createAppDir(LogAggregationService.java:308)
 
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initAppAggregator(LogAggregationService.java:366)
 
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initApp(LogAggregationService.java:320)
 
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:443)
 
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:67)
 
at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) 
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) 
at java.lang.Thread.run(Thread.java:745) 
Caused by: 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.FSLimitException$MaxDirectoryItemsExceededException):
 The directory item limit of /app-logs/yarn/logs is exceeded: limit=1048576 
items=1048576 
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyMaxDirItems(FSDirectory.java:2021)
 
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:2072)
 
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedMkdir(FSDirectory.java:1841)
 
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsRecursively(FSNamesystem.java:4351)
 
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInternal(FSNamesystem.java:4262)
 
{code}

Thanks to Robert Mancuso for finding this issue.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)




[jira] [Created] (YARN-6616) YARN AHS shows submitTime for jobs same as startTime

2017-05-17 Thread Prabhu Joseph (JIRA)
Prabhu Joseph created YARN-6616:
---

 Summary: YARN AHS shows submitTime for jobs same as startTime
 Key: YARN-6616
 URL: https://issues.apache.org/jira/browse/YARN-6616
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.7.3
Reporter: Prabhu Joseph
Assignee: Prabhu Joseph
Priority: Minor


YARN AHS returns the startTime value for both submitTime and startTime for jobs. 
It looks like the code sets submitTime to the startTime value.

https://github.com/apache/hadoop/blob/branch-2.7.3/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/dao/AppInfo.java#L80

{code}
curl --negotiate -u: 
http://prabhuzeppelin3.openstacklocal:8188/ws/v1/applicationhistory/apps
1495015537574 1495015537574 1495016384084
{code}
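
A tiny before/after sketch of the suspected issue (types and names below are 
placeholders, not the actual AppInfo.java source): the submit time should be 
carried through from the application report instead of reusing the start time.

{code}
// Placeholder types, not real YARN classes.
public class AppTimes {
  final long submittedTime;
  final long startedTime;

  // Buggy pattern: submittedTime silently mirrors startedTime.
  static AppTimes fromStartOnly(long startTime) {
    return new AppTimes(startTime, startTime);
  }

  // Intended pattern: carry the real submit time through separately.
  static AppTimes fromReport(long submitTime, long startTime) {
    return new AppTimes(submitTime, startTime);
  }

  private AppTimes(long submittedTime, long startedTime) {
    this.submittedTime = submittedTime;
    this.startedTime = startedTime;
  }
}
{code}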





--
This message was sent by Atlassian JIRA
(v6.3.15#6346)




[jira] [Resolved] (YARN-6557) YARN ContainerLocalizer logs are missing

2017-05-04 Thread Prabhu Joseph (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph resolved YARN-6557.
-
Resolution: Fixed

Duplicate of YARN-5422

> YARN ContainerLocalizer logs are missing
> 
>
> Key: YARN-6557
> URL: https://issues.apache.org/jira/browse/YARN-6557
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.7.1
>Reporter: Prabhu Joseph
>
> YARN LCE ContainerLocalizer runs as a separate process and its logs / error 
> messages are not captured. We need to redirect them to stdout or to a separate 
> log file, which would help debug localization issues.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)




[jira] [Commented] (YARN-6557) YARN ContainerLocalizer logs are missing

2017-05-04 Thread Prabhu Joseph (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15996267#comment-15996267
 ] 

Prabhu Joseph commented on YARN-6557:
-

[~Naganarasimha] Yes, missed it. Will close this one as duplicate.

> YARN ContainerLocalizer logs are missing
> 
>
> Key: YARN-6557
> URL: https://issues.apache.org/jira/browse/YARN-6557
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.7.1
>Reporter: Prabhu Joseph
>
> YARN LCE ContainerLocalizer runs as a separate process and its logs / error 
> messages are not captured. We need to redirect them to stdout or a separate 
> log file, which helps to debug localization issues.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-6557) YARN ContainerLocalizer logs are missing

2017-05-04 Thread Prabhu Joseph (JIRA)
Prabhu Joseph created YARN-6557:
---

 Summary: YARN ContainerLocalizer logs are missing
 Key: YARN-6557
 URL: https://issues.apache.org/jira/browse/YARN-6557
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.7.1
Reporter: Prabhu Joseph


YARN LCE ContainerLocalizer runs as a separate process and its logs / error 
messages are not captured. We need to redirect them to stdout or a separate log 
file, which helps to debug localization issues.




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-7235) RMWebServices SSL renegotiate denied

2017-09-20 Thread Prabhu Joseph (JIRA)
Prabhu Joseph created YARN-7235:
---

 Summary: RMWebServices SSL renegotiate denied
 Key: YARN-7235
 URL: https://issues.apache.org/jira/browse/YARN-7235
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Affects Versions: 2.7.3
Reporter: Prabhu Joseph


We see a lot of SSL renegotiate denied WARN messages in the RM logs:

{code}
2017-08-29 08:14:15,821 WARN  mortbay.log (Slf4jLog.java:warn(76)) - SSL 
renegotiate denied: java.nio.channels.SocketChannel[connected 
local=/10.136.19.134:8078 remote=/10.136.19.103:59994]
{code}

It looks like we need a fix similar to YARN-6797 for RMWebServices.




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-7288) ContainerLocalizer with multiple JVM Options

2017-10-03 Thread Prabhu Joseph (JIRA)
Prabhu Joseph created YARN-7288:
---

 Summary: ContainerLocalizer with multiple JVM Options
 Key: YARN-7288
 URL: https://issues.apache.org/jira/browse/YARN-7288
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Affects Versions: 2.7.3
Reporter: Prabhu Joseph
Assignee: Prabhu Joseph


Currently ContainerLocalizer can be configured with only a single JVM option 
through yarn.nodemanager.container-localizer.java.opts. There are cases where 
we need more than one, such as adding -Dlog4j.debug or -verbose to debug issues.





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7288) ContainerLocalizer with multiple JVM Options

2017-10-04 Thread Prabhu Joseph (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16191240#comment-16191240
 ] 

Prabhu Joseph commented on YARN-7288:
-

It works fine now; I had configured it wrongly with double quotes, which it 
does not expect. Thanks [~jlowe]
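
For reference, a yarn-site.xml sketch with multiple options; the option values 
here are illustrative assumptions, and the value is a plain space-separated 
string with no surrounding double quotes:

{code}
<property>
  <name>yarn.nodemanager.container-localizer.java.opts</name>
  <!-- illustrative values: heap size plus two debug flags -->
  <value>-Xmx256m -Dlog4j.debug -verbose:class</value>
</property>
{code}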

> ContainerLocalizer with multiple JVM Options
> 
>
> Key: YARN-7288
> URL: https://issues.apache.org/jira/browse/YARN-7288
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 2.7.3
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>
> Currently ContainerLocalizer can be configured with only a single JVM option 
> through yarn.nodemanager.container-localizer.java.opts. There are cases where 
> we need more than one, such as adding -Dlog4j.debug or -verbose to debug issues.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7111) ApplicationHistoryServer webpage startTime and state are not readable

2017-08-30 Thread Prabhu Joseph (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16146923#comment-16146923
 ] 

Prabhu Joseph commented on YARN-7111:
-

It looks like the problem does not exist in 2.7.4 (attached image). Closing this 
as Not a Problem.

> ApplicationHistoryServer webpage startTime and state are not readable
> -
>
> Key: YARN-7111
> URL: https://issues.apache.org/jira/browse/YARN-7111
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver
>Affects Versions: 2.7.3
>Reporter: Prabhu Joseph
> Attachments: Screen Shot 2017-08-28 at 5.24.01 PM.png
>
>
> The ApplicationHistoryServer webpage displays startTime and state for FINISHED 
> applications in an unreadable format. (attached image)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7111) ApplicationHistoryServer webpage startTime and state are not readable

2017-08-30 Thread Prabhu Joseph (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-7111:

Attachment: working.png

> ApplicationHistoryServer webpage startTime and state are not readable
> -
>
> Key: YARN-7111
> URL: https://issues.apache.org/jira/browse/YARN-7111
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver
>Affects Versions: 2.7.3
>Reporter: Prabhu Joseph
> Attachments: Screen Shot 2017-08-28 at 5.24.01 PM.png, working.png
>
>
> The ApplicationHistoryServer webpage displays startTime and state for FINISHED 
> applications in an unreadable format. (attached image)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-7111) ApplicationHistoryServer webpage startTime and state are not readable

2017-08-30 Thread Prabhu Joseph (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph resolved YARN-7111.
-
Resolution: Not A Problem

> ApplicationHistoryServer webpage startTime and state are not readable
> -
>
> Key: YARN-7111
> URL: https://issues.apache.org/jira/browse/YARN-7111
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver
>Affects Versions: 2.7.3
>Reporter: Prabhu Joseph
> Attachments: Screen Shot 2017-08-28 at 5.24.01 PM.png, working.png
>
>
> The ApplicationHistoryServer webpage displays startTime and state for FINISHED 
> applications in an unreadable format. (attached image)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-7118) AHS REST API can return NullPointerException

2017-09-07 Thread Prabhu Joseph (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph reassigned YARN-7118:
---

Assignee: Prabhu Joseph

> AHS REST API can return NullPointerException
> 
>
> Key: YARN-7118
> URL: https://issues.apache.org/jira/browse/YARN-7118
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.1
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>
> The ApplicationHistoryService REST API returns a NullPointerException:
> {code}
> [prabhu@prabhu2 root]$ curl --negotiate -u: 'http:// IP>:8188/ws/v1/applicationhistory/apps?queue=test'
> {"exception":"NullPointerException","javaClassName":"java.lang.NullPointerException"}
> {code}
> The TimelineServer log shows the following:
> {code}
> 2017-08-17 17:54:54,128 WARN  webapp.GenericExceptionHandler 
> (GenericExceptionHandler.java:toResponse(98)) - INTERNAL_SERVER_ERROR
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.webapp.WebServices.getApps(WebServices.java:191)
> at 
> org.apache.hadoop.yarn.server.applicationhistoryservice.webapp.AHSWebServices.getApps(AHSWebServices.java:96)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at 
> com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
> at 
> com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7284) NodeManager crashes with OOM when Debug log enabled for ContainerLocalizer

2017-10-03 Thread Prabhu Joseph (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-7284:

Attachment: Screen Shot 2017-10-03 at 1.29.35 PM.png
Screen Shot 2017-10-03 at 1.29.48 PM.png

> NodeManager crashes with OOM when Debug log enabled for ContainerLocalizer 
> ---
>
> Key: YARN-7284
> URL: https://issues.apache.org/jira/browse/YARN-7284
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.3
>Reporter: Prabhu Joseph
> Attachments: Screen Shot 2017-10-03 at 1.29.35 PM.png, Screen Shot 
> 2017-10-03 at 1.29.48 PM.png
>
>
> NodeManager crashes with OOM when DEBUG log enabled for ContainerLocalizer. 
> {code}
> 2017-10-03 07:25:20,066 FATAL yarn.YarnUncaughtExceptionHandler 
> (YarnUncaughtExceptionHandler.java:uncaughtException(51)) - Thread 
> Thread[Thread-2114,5,main] threw an Error.  Shutting down now...
> java.lang.OutOfMemoryError: Java heap space
> at java.util.Arrays.copyOf(Arrays.java:3332)
> at 
> java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
> at 
> java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
> at 
> java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:421)
> at java.lang.StringBuffer.append(StringBuffer.java:272)
> at org.apache.hadoop.util.Shell$1.run(Shell.java:900)
> {code}
> The errThread in Hadoop Common's Shell reads all the DEBUG log lines and 
> appends them to the StringBuffer errMsg. As per the heap dump, errMsg holds 
> more than 1 GB of content. (attached image)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-7284) NodeManager crashes with OOM when Debug log enabled for ContainerLocalizer

2017-10-03 Thread Prabhu Joseph (JIRA)
Prabhu Joseph created YARN-7284:
---

 Summary: NodeManager crashes with OOM when Debug log enabled for 
ContainerLocalizer 
 Key: YARN-7284
 URL: https://issues.apache.org/jira/browse/YARN-7284
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.7.3
Reporter: Prabhu Joseph


NodeManager crashes with OOM when DEBUG log enabled for ContainerLocalizer. 

{code}
2017-10-03 07:25:20,066 FATAL yarn.YarnUncaughtExceptionHandler 
(YarnUncaughtExceptionHandler.java:uncaughtException(51)) - Thread 
Thread[Thread-2114,5,main] threw an Error.  Shutting down now...
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3332)
at 
java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
at 
java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
at 
java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:421)
at java.lang.StringBuffer.append(StringBuffer.java:272)
at org.apache.hadoop.util.Shell$1.run(Shell.java:900)
{code}

The errThread in Hadoop Common's Shell reads all the DEBUG log lines and appends 
them to the StringBuffer errMsg. As per the heap dump, errMsg holds more than 
1 GB of content. (attached image)
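
One possible mitigation, sketched here only as an assumption about how the 
stderr reader could be bounded (this is not the actual org.apache.hadoop.util.Shell 
code, and the 64 KB cap is an arbitrary illustrative value):

{code}
// Cap how much of the child's stderr is retained instead of appending every
// DEBUG line to an unbounded StringBuffer.
public final class BoundedErrBuffer {
  private static final int MAX_ERR_CHARS = 64 * 1024;
  private final StringBuffer errMsg = new StringBuffer();

  public void appendLine(String line) {
    if (errMsg.length() >= MAX_ERR_CHARS) {
      return; // drop further output once the cap is reached
    }
    errMsg.append(line).append(System.lineSeparator());
  }

  public String contents() {
    return errMsg.toString();
  }
}
{code}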






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-7108) Refreshing Default Node Label Expression of a queue does not reflect for running apps

2017-08-28 Thread Prabhu Joseph (JIRA)
Prabhu Joseph created YARN-7108:
---

 Summary: Refreshing Default Node Label Expression of a queue does 
not reflect for running apps
 Key: YARN-7108
 URL: https://issues.apache.org/jira/browse/YARN-7108
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacity scheduler
Affects Versions: 2.7.3
Reporter: Prabhu Joseph


Refreshing a queue's default node label expression is not reflected for 
running applications. 

Repro steps:

4-node cluster with two node labels, label1 and label2. label1 is an exclusive 
partition with Node1 and Node2; label2 is an exclusive partition with Node3 and 
Node4. The default queue's default node label expression is label1. 

1. Shut down the NodeManagers on the label1 nodes Node1 and Node2.
2. Submit a sample MapReduce job to the default queue; it stays in ACCEPTED state.
3. Change the default node label expression for the default queue to label2 in 
capacity-scheduler.xml and run yarn rmadmin -refreshQueues. The queue's config 
is shown as label2 in the RM UI queue section, but the job still stays in 
ACCEPTED state.
4. Submitting a new job to the default queue moves it into RUNNING state.






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7108) Refreshing Default Node Label Expression of a queue does not reflect for running apps

2017-08-28 Thread Prabhu Joseph (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16143458#comment-16143458
 ] 

Prabhu Joseph commented on YARN-7108:
-

I have submitted a MapReduce job to the default queue with no label settings.

{code}
hadoop jar hadoop-mapreduce-examples-2.7.3.jar wordcount /Input /Output
{code}

> Refreshing Default Node Label Expression of a queue does not reflect for 
> running apps
> -
>
> Key: YARN-7108
> URL: https://issues.apache.org/jira/browse/YARN-7108
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.7.3
>Reporter: Prabhu Joseph
>
> Refreshing a queue's default node label expression is not reflected for 
> running applications. 
> Repro Steps:
> 4 node cluster, two node labels label1 and label2. label1 is Exclusive 
> Partition with Node1 and Node2, label2 is Exclusive Partition with Node3 and 
> Node4. A default queue whose default node label expression is label1. 
> 1.Shutdown NodeManagers on label1 nodes Node1 and Node2
> 2.Submit a sample mapreduce on default queue which will stay in ACCEPTED 
> state 
> 3.Change default node label expression for default queue to label2 in 
> capacity-scheduler.xml
> yarn rmadmin -refreshQueues
> queue's config gets reflected to label2 as shown on RM UI queue section but 
> job still stays at ACCEPTED state
> 4. Submitting a new job into default queue moves into RUNNING state



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-7111) ApplicationHistoryServer webpage startTime and state are not readable

2017-08-28 Thread Prabhu Joseph (JIRA)
Prabhu Joseph created YARN-7111:
---

 Summary: ApplicationHistoryServer webpage startTime and state are 
not readable
 Key: YARN-7111
 URL: https://issues.apache.org/jira/browse/YARN-7111
 Project: Hadoop YARN
  Issue Type: Bug
  Components: timelineserver
Affects Versions: 2.7.3
Reporter: Prabhu Joseph


The ApplicationHistoryServer webpage displays startTime and state for FINISHED 
applications in an unreadable format. (attached image)




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7111) ApplicationHistoryServer webpage startTime and state are not readable

2017-08-28 Thread Prabhu Joseph (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-7111:

Attachment: Screen Shot 2017-08-28 at 5.24.01 PM.png

> ApplicationHistoryServer webpage startTime and state are not readable
> -
>
> Key: YARN-7111
> URL: https://issues.apache.org/jira/browse/YARN-7111
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver
>Affects Versions: 2.7.3
>Reporter: Prabhu Joseph
> Attachments: Screen Shot 2017-08-28 at 5.24.01 PM.png
>
>
> The ApplicationHistoryServer webpage displays startTime and state for FINISHED 
> applications in an unreadable format. (attached image)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7108) Refreshing Default Node Label Expression of a queue does not reflect for running apps

2017-08-28 Thread Prabhu Joseph (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16143812#comment-16143812
 ] 

Prabhu Joseph commented on YARN-7108:
-

I thought the application would implicitly run on the queue's configured default 
node label. It moves into RUNNING state when the default node label has nodes 
running; if not, it stays in ACCEPTED state, which is expected. But refreshing 
the queue's default label to a new label which has running nodes does not 
refresh the app state. Is this expected behavior? 



> Refreshing Default Node Label Expression of a queue does not reflect for 
> running apps
> -
>
> Key: YARN-7108
> URL: https://issues.apache.org/jira/browse/YARN-7108
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.7.3
>Reporter: Prabhu Joseph
>
> Refreshing a queue's default node label expression is not reflected for 
> running applications. 
> Repro Steps:
> 4 node cluster, two node labels label1 and label2. label1 is Exclusive 
> Partition with Node1 and Node2, label2 is Exclusive Partition with Node3 and 
> Node4. A default queue whose default node label expression is label1. 
> 1.Shutdown NodeManagers on label1 nodes Node1 and Node2
> 2.Submit a sample mapreduce on default queue which will stay in ACCEPTED 
> state 
> 3.Change default node label expression for default queue to label2 in 
> capacity-scheduler.xml
> yarn rmadmin -refreshQueues
> queue's config gets reflected to label2 as shown on RM UI queue section but 
> job still stays at ACCEPTED state
> 4. Submitting a new job into default queue moves into RUNNING state



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-7118) AHS REST API can return NullPointerException

2017-08-29 Thread Prabhu Joseph (JIRA)
Prabhu Joseph created YARN-7118:
---

 Summary: AHS REST API can return NullPointerException
 Key: YARN-7118
 URL: https://issues.apache.org/jira/browse/YARN-7118
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.7.1
Reporter: Prabhu Joseph


The ApplicationHistoryService REST API returns a NullPointerException:
{code}
[prabhu@prabhu2 root]$ curl --negotiate -u: 'http://:8188/ws/v1/applicationhistory/apps?queue=test'
{"exception":"NullPointerException","javaClassName":"java.lang.NullPointerException"}
{code}

The TimelineServer log shows the following:

{code}
2017-08-17 17:54:54,128 WARN  webapp.GenericExceptionHandler 
(GenericExceptionHandler.java:toResponse(98)) - INTERNAL_SERVER_ERROR
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.webapp.WebServices.getApps(WebServices.java:191)
at 
org.apache.hadoop.yarn.server.applicationhistoryservice.webapp.AHSWebServices.getApps(AHSWebServices.java:96)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at 
com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
at 
com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185)
{code}
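
A hypothetical illustration of the kind of defensive check involved when 
filtering apps by a request parameter such as queue: skip the comparison when 
either side is null instead of dereferencing it. This is not the actual 
WebServices.getApps() code, just a sketch of the guard.

{code}
public final class QueueFilterSketch {
  // Returns true when the app should pass the queue filter.
  static boolean matchesQueue(String appQueue, String queueQuery) {
    if (queueQuery == null || queueQuery.isEmpty()) {
      return true;               // no queue filter requested
    }
    // guard against a null queue on the report instead of throwing NPE
    return appQueue != null && appQueue.equals(queueQuery);
  }
}
{code}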






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7284) NodeManager crashes with OOM when Debug log enabled for ContainerLocalizer

2017-10-03 Thread Prabhu Joseph (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-7284:

Component/s: nodemanager

> NodeManager crashes with OOM when Debug log enabled for ContainerLocalizer 
> ---
>
> Key: YARN-7284
> URL: https://issues.apache.org/jira/browse/YARN-7284
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.7.3
>Reporter: Prabhu Joseph
> Attachments: Screen Shot 2017-10-03 at 1.29.35 PM.png, Screen Shot 
> 2017-10-03 at 1.29.48 PM.png
>
>
> NodeManager crashes with OOM when DEBUG log enabled for ContainerLocalizer. 
> {code}
> 2017-10-03 07:25:20,066 FATAL yarn.YarnUncaughtExceptionHandler 
> (YarnUncaughtExceptionHandler.java:uncaughtException(51)) - Thread 
> Thread[Thread-2114,5,main] threw an Error.  Shutting down now...
> java.lang.OutOfMemoryError: Java heap space
> at java.util.Arrays.copyOf(Arrays.java:3332)
> at 
> java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
> at 
> java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
> at 
> java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:421)
> at java.lang.StringBuffer.append(StringBuffer.java:272)
> at org.apache.hadoop.util.Shell$1.run(Shell.java:900)
> {code}
> The errThread in Hadoop Common's Shell reads all the DEBUG log lines and 
> appends them to the StringBuffer errMsg. As per the heap dump, errMsg holds 
> more than 1 GB of content. (attached image)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7463) Using getLocalPathForWrite for Container related debug information

2017-11-15 Thread Prabhu Joseph (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-7463:

Attachment: YARN-7463.1.patch

> Using getLocalPathForWrite for Container related debug information
> --
>
> Key: YARN-7463
> URL: https://issues.apache.org/jira/browse/YARN-7463
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.7.3
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Minor
> Attachments: YARN-7463.1.patch
>
>
> The containers' debug information (launch_container.sh and directory.info) is 
> always logged into the first directory of NM_LOG_DIRS instead of using the log 
> directory returned from getLogPathForWrite.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7500) LogAggregation DeletionService should consider completedTime for long running jobs

2017-11-16 Thread Prabhu Joseph (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16254905#comment-16254905
 ] 

Prabhu Joseph commented on YARN-7500:
-

[~jlowe] Oh yes, missed it. 

The issue is that our customer has a long-running custom YARN application which 
started before yarn.log-aggregation.retain-seconds and was still running yesterday 
as per the RM UI, yet today we didn't see any logs under app-logs. The logs were 
there while the job was running. It looks like the only possibility is that the 
custom app has not updated the logs for many days.

Will check the RM logs and hdfs-audit logs to validate and give more information.

> LogAggregation DeletionService should consider completedTime for long running 
> jobs
> --
>
> Key: YARN-7500
> URL: https://issues.apache.org/jira/browse/YARN-7500
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation
>Affects Versions: 2.7.3
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>
> Currently log aggregation deletes the application logs based on the start time 
> of the job. For a long-running job (started more than 
> yarn.log-aggregation.retain-seconds ago), say it failed yesterday for some 
> reason; then we won't have the job logs today for debugging.
> It would be better to consider the completedTime of the job as part of the 
> deletion condition.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-7500) LogAggregation DeletionService should consider completedTime for long running jobs

2017-11-15 Thread Prabhu Joseph (JIRA)
Prabhu Joseph created YARN-7500:
---

 Summary: LogAggregation DeletionService should consider 
completedTime for long running jobs
 Key: YARN-7500
 URL: https://issues.apache.org/jira/browse/YARN-7500
 Project: Hadoop YARN
  Issue Type: Bug
  Components: log-aggregation
Affects Versions: 2.7.3
Reporter: Prabhu Joseph
Assignee: Prabhu Joseph


Currently log aggregation deletes the application logs based on the start time of 
the job. For a long-running job (started more than 
yarn.log-aggregation.retain-seconds ago), say it failed yesterday for some reason; 
then we won't have the job logs today for debugging.

It would be better to consider the completedTime of the job as part of the 
deletion condition.
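
A minimal sketch of the proposed condition, assuming the retention check can be 
driven from an ApplicationReport; the surrounding class and wiring are 
hypothetical, only the configuration keys and report accessors are existing API:

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public final class LogRetentionSketch {
  // Eligible for deletion only when retain-seconds have elapsed since the
  // application *finished*, not since it started.
  public static boolean eligibleForDeletion(ApplicationReport report,
      Configuration conf, long nowMs) {
    long retainMs = conf.getLong(
        YarnConfiguration.LOG_AGGREGATION_RETAIN_SECONDS,
        YarnConfiguration.DEFAULT_LOG_AGGREGATION_RETAIN_SECONDS) * 1000L;
    if (retainMs < 0) {
      return false;              // retention disabled
    }
    long finishTime = report.getFinishTime();
    // a still-running app reports finishTime == 0, so it is never eligible
    return finishTime > 0 && finishTime + retainMs < nowMs;
  }
}
{code}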



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7428) Localizer Failed does not log containerId

2017-11-03 Thread Prabhu Joseph (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-7428:

Attachment: YARN-7428.1.patch

> Localizer Failed does not log containerId
> 
>
> Key: YARN-7428
> URL: https://issues.apache.org/jira/browse/YARN-7428
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.7.3
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-7428.1.patch
>
>
> When a Localizer fails for some reason, the error message does not have the 
> containerId to correlate.
> {code}
> 2017-10-31 00:03:11,046 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor:
>  IOException executing command:
> java.io.InterruptedIOException: java.lang.InterruptedException
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:947)
> at org.apache.hadoop.util.Shell.run(Shell.java:848)
> at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1142)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:151)
> at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:264)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1114)
> Caused by: java.lang.InterruptedException
> at java.lang.Object.wait(Native Method)
> at java.lang.Object.wait(Object.java:502)
> at java.lang.UNIXProcess.waitFor(UNIXProcess.java:396)
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:937)
> ... 5 more
> 2017-10-31 00:03:11,047 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
>  Localizer failed
> java.lang.NullPointerException
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-7426) Add a finite shell command timeout to ContainerLocalizer

2017-11-01 Thread Prabhu Joseph (JIRA)
Prabhu Joseph created YARN-7426:
---

 Summary: Add a finite shell command timeout to ContainerLocalizer
 Key: YARN-7426
 URL: https://issues.apache.org/jira/browse/YARN-7426
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Affects Versions: 2.7.3
Reporter: Prabhu Joseph
Priority: Critical


When the NodeManager is overloaded and ContainerLocalizer processes are hanging, 
the containers will time out and be cleaned up. The LocalizerRunner thread is 
interrupted during cleanup, but the interrupt does not work while it is reading 
from a FileInputStream. LocalizerRunner threads and ContainerLocalizer processes 
keep accumulating, which makes the node completely unresponsive. We can add a 
timeout for the shell command to avoid this, similar to HADOOP-13817.
The timeout value can be set by the AM, the same as the container timeout.
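
A rough sketch of the idea, assuming the Shell.ShellCommandExecutor constructor 
that accepts a timeout (the mechanism HADOOP-13817 relies on); the command, 
working directory, environment and the 10-minute value are illustrative:

{code}
import java.io.File;
import java.io.IOException;
import java.util.Map;
import org.apache.hadoop.util.Shell.ShellCommandExecutor;

public final class LocalizerShellTimeoutSketch {
  // Run the localizer command but give up after a finite timeout instead of
  // blocking forever on the child's output.
  public static void runWithTimeout(String[] command, File workDir,
      Map<String, String> env) throws IOException {
    long timeoutMs = 10 * 60 * 1000L;   // illustrative 10 minutes
    ShellCommandExecutor shexec =
        new ShellCommandExecutor(command, workDir, env, timeoutMs);
    try {
      shexec.execute();
    } finally {
      if (shexec.isTimedOut()) {
        // Shell destroys the child once the timeout elapses
        System.err.println("Localizer shell command timed out");
      }
    }
  }
}
{code}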

ContainerLocalizer JVM stacktrace:

{code}
"main" #1 prio=5 os_prio=0 tid=0x7fd8ec019000 nid=0xc295 runnable 
[0x7fd8f3956000]
   java.lang.Thread.State: RUNNABLE
at java.util.zip.ZipFile.open(Native Method)
at java.util.zip.ZipFile.(ZipFile.java:219)
at java.util.zip.ZipFile.(ZipFile.java:149)
at java.util.jar.JarFile.(JarFile.java:166)
at java.util.jar.JarFile.(JarFile.java:103)
at sun.misc.URLClassPath$JarLoader.getJarFile(URLClassPath.java:893)
at sun.misc.URLClassPath$JarLoader.access$700(URLClassPath.java:756)
at sun.misc.URLClassPath$JarLoader$1.run(URLClassPath.java:838)
at sun.misc.URLClassPath$JarLoader$1.run(URLClassPath.java:831)
at java.security.AccessController.doPrivileged(Native Method)
at sun.misc.URLClassPath$JarLoader.ensureOpen(URLClassPath.java:830)
at sun.misc.URLClassPath$JarLoader.(URLClassPath.java:803)
at sun.misc.URLClassPath$3.run(URLClassPath.java:530)
at sun.misc.URLClassPath$3.run(URLClassPath.java:520)
at java.security.AccessController.doPrivileged(Native Method)
at sun.misc.URLClassPath.getLoader(URLClassPath.java:519)
at sun.misc.URLClassPath.getLoader(URLClassPath.java:492)
- locked <0x00076ac75058> (a sun.misc.URLClassPath)
at sun.misc.URLClassPath.getNextLoader(URLClassPath.java:457)
- locked <0x00076ac75058> (a sun.misc.URLClassPath)
at sun.misc.URLClassPath.getResource(URLClassPath.java:211)
at java.net.URLClassLoader$1.run(URLClassLoader.java:365)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
- locked <0x00076ac7f960> (a java.lang.Object)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:495)
{code}

NodeManager LocalizerRunner thread which is not interrupted:

{code}
"LocalizerRunner for container_e746_1508665985104_601806_01_05" #3932753 
prio=5 os_prio=0 tid=0x7fb258d5f800 nid=0x11091 runnable 
[0x7fb153946000]
   java.lang.Thread.State: RUNNABLE
at java.io.FileInputStream.readBytes(Native Method)
at java.io.FileInputStream.read(FileInputStream.java:255)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
- locked <0x000718502b80> (a 
java.lang.UNIXProcess$ProcessPipeInputStream)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
- locked <0x000718502bd8> (a java.io.InputStreamReader)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.fill(BufferedReader.java:161)
at java.io.BufferedReader.read1(BufferedReader.java:212)
at java.io.BufferedReader.read(BufferedReader.java:286)
- locked <0x000718502bd8> (a java.io.InputStreamReader)
at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.parseExecResult(Shell.java:1155)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:930)
at org.apache.hadoop.util.Shell.run(Shell.java:848)
at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1142)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:151)
at 
org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:264)
at 

[jira] [Commented] (YARN-7428) Localizer Failed does not log containerId

2017-11-05 Thread Prabhu Joseph (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1624#comment-1624
 ] 

Prabhu Joseph commented on YARN-7428:
-

Thanks [~bibinchundatt] for the review.

> Localizer Failed does not log containerId
> 
>
> Key: YARN-7428
> URL: https://issues.apache.org/jira/browse/YARN-7428
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.7.3
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
> Attachments: YARN-7428.1.patch
>
>
> When a Localizer fails for some reason, the error message does not have the 
> containerId to correlate.
> {code}
> 2017-10-31 00:03:11,046 WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor:
>  IOException executing command:
> java.io.InterruptedIOException: java.lang.InterruptedException
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:947)
> at org.apache.hadoop.util.Shell.run(Shell.java:848)
> at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1142)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:151)
> at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:264)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1114)
> Caused by: java.lang.InterruptedException
> at java.lang.Object.wait(Native Method)
> at java.lang.Object.wait(Object.java:502)
> at java.lang.UNIXProcess.waitFor(UNIXProcess.java:396)
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:937)
> ... 5 more
> 2017-10-31 00:03:11,047 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
>  Localizer failed
> java.lang.NullPointerException
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7426) Interrupt does not work when LocalizerRunner is reading from InputStream

2017-11-01 Thread Prabhu Joseph (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-7426:

Summary: Interrupt does not work when LocalizerRunner is reading from 
InputStream  (was: Add a finite shell command timeout to ContainerLocalizer)

> Interrupt does not work when LocalizerRunner is reading from InputStream
> 
>
> Key: YARN-7426
> URL: https://issues.apache.org/jira/browse/YARN-7426
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.7.3
>Reporter: Prabhu Joseph
>Priority: Critical
>
> When the NodeManager is overloaded and ContainerLocalizer processes are 
> hanging, the containers will time out and be cleaned up. The LocalizerRunner 
> thread is interrupted during cleanup, but the interrupt does not work while 
> it is reading from a FileInputStream. LocalizerRunner threads and 
> ContainerLocalizer processes keep accumulating, which makes the node 
> completely unresponsive. We can add a timeout for the shell command to avoid 
> this, similar to HADOOP-13817.
> The timeout value can be set by the AM, the same as the container timeout.
> ContainerLocalizer JVM stacktrace:
> {code}
> "main" #1 prio=5 os_prio=0 tid=0x7fd8ec019000 nid=0xc295 runnable 
> [0x7fd8f3956000]
>java.lang.Thread.State: RUNNABLE
>   at java.util.zip.ZipFile.open(Native Method)
>   at java.util.zip.ZipFile.(ZipFile.java:219)
>   at java.util.zip.ZipFile.(ZipFile.java:149)
>   at java.util.jar.JarFile.(JarFile.java:166)
>   at java.util.jar.JarFile.(JarFile.java:103)
>   at sun.misc.URLClassPath$JarLoader.getJarFile(URLClassPath.java:893)
>   at sun.misc.URLClassPath$JarLoader.access$700(URLClassPath.java:756)
>   at sun.misc.URLClassPath$JarLoader$1.run(URLClassPath.java:838)
>   at sun.misc.URLClassPath$JarLoader$1.run(URLClassPath.java:831)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at sun.misc.URLClassPath$JarLoader.ensureOpen(URLClassPath.java:830)
>   at sun.misc.URLClassPath$JarLoader.(URLClassPath.java:803)
>   at sun.misc.URLClassPath$3.run(URLClassPath.java:530)
>   at sun.misc.URLClassPath$3.run(URLClassPath.java:520)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at sun.misc.URLClassPath.getLoader(URLClassPath.java:519)
>   at sun.misc.URLClassPath.getLoader(URLClassPath.java:492)
>   - locked <0x00076ac75058> (a sun.misc.URLClassPath)
>   at sun.misc.URLClassPath.getNextLoader(URLClassPath.java:457)
>   - locked <0x00076ac75058> (a sun.misc.URLClassPath)
>   at sun.misc.URLClassPath.getResource(URLClassPath.java:211)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:365)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>   - locked <0x00076ac7f960> (a java.lang.Object)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:495)
> {code}
> NodeManager LocalizerRunner thread which is not interrupted:
> {code}
> "LocalizerRunner for container_e746_1508665985104_601806_01_05" #3932753 
> prio=5 os_prio=0 tid=0x7fb258d5f800 nid=0x11091 runnable 
> [0x7fb153946000]
>java.lang.Thread.State: RUNNABLE
> at java.io.FileInputStream.readBytes(Native Method)
> at java.io.FileInputStream.read(FileInputStream.java:255)
> at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
> - locked <0x000718502b80> (a 
> java.lang.UNIXProcess$ProcessPipeInputStream)
> at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
> at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
> at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
> - locked <0x000718502bd8> (a java.io.InputStreamReader)
> at java.io.InputStreamReader.read(InputStreamReader.java:184)
> at java.io.BufferedReader.fill(BufferedReader.java:161)
> at java.io.BufferedReader.read1(BufferedReader.java:212)
> at java.io.BufferedReader.read(BufferedReader.java:286)
> - locked <0x000718502bd8> (a java.io.InputStreamReader)
> at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.parseExecResult(Shell.java:1155)
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:930)
> at 

[jira] [Created] (YARN-7428) Localizer Failed does not log containerId

2017-11-02 Thread Prabhu Joseph (JIRA)
Prabhu Joseph created YARN-7428:
---

 Summary: Localizer Failed does not log containerId
 Key: YARN-7428
 URL: https://issues.apache.org/jira/browse/YARN-7428
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.7.3
Reporter: Prabhu Joseph
Assignee: Prabhu Joseph
Priority: Major


When a Localizer fails for some reason, the error message does not have the 
containerId to correlate.

{code}
2017-10-31 00:03:11,046 WARN 
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor:
 IOException executing command:
java.io.InterruptedIOException: java.lang.InterruptedException
at org.apache.hadoop.util.Shell.runCommand(Shell.java:947)
at org.apache.hadoop.util.Shell.run(Shell.java:848)
at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1142)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:151)
at 
org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:264)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1114)
Caused by: java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:502)
at java.lang.UNIXProcess.waitFor(UNIXProcess.java:396)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:937)
... 5 more
2017-10-31 00:03:11,047 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
 Localizer failed
java.lang.NullPointerException
{code}
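
A minimal sketch of the kind of change, with a hypothetical class and variable 
names: log the failing localizer's container id together with the exception so 
the "Localizer failed" message can be correlated.

{code}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public final class LocalizerLoggingSketch {
  private static final Log LOG =
      LogFactory.getLog(LocalizerLoggingSketch.class);

  // localizerId is the container id the LocalizerRunner was started for
  public static void logLocalizerFailure(String localizerId, Throwable t) {
    LOG.info("Localizer failed for " + localizerId, t);
  }
}
{code}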



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6078) Containers stuck in Localizing state

2017-11-02 Thread Prabhu Joseph (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16235308#comment-16235308
 ] 

Prabhu Joseph commented on YARN-6078:
-

We have hit this issue recently. Below is the analysis.

When the NodeManager is overloaded and ContainerLocalizer processes are hanging, 
the containers will time out and be cleaned up. The LocalizerRunner thread is 
interrupted during cleanup, but the interrupt does not work while it is reading 
from a FileInputStream. LocalizerRunner threads and ContainerLocalizer processes 
keep accumulating, which makes the node completely unresponsive. 

There are two options below which would help to avoid this:

1. ShellCommandExecutor's parseExecResult currently uses a blocking read(), which 
can be changed as below to use a non-blocking available() plus a short sleep.

{code}
// Sketch: poll with available() so the reader can notice an interrupt,
// instead of blocking indefinitely in read(). `in`, `buffer` and `running`
// come from the surrounding reader context.
byte[] buffer = new byte[8192];
while (running && !Thread.currentThread().isInterrupted()) {
  if (in.available() > 0) {
    int n = in.read(buffer);
    if (n == -1) {
      break;               // stream closed
    }
    // do stuff with buffer[0..n)
  } else {
    Thread.sleep(500);     // throws InterruptedException when interrupted
  }
}
{code}

2. Add a timeout for the shell command similar to HADOOP-13817; the timeout 
value can be set by the AM, the same as the container timeout.


ContainerLocalizer JVM stacktrace:

{code}
"main" #1 prio=5 os_prio=0 tid=0x7fd8ec019000 nid=0xc295 runnable 
[0x7fd8f3956000]
   java.lang.Thread.State: RUNNABLE
at java.util.zip.ZipFile.open(Native Method)
at java.util.zip.ZipFile.(ZipFile.java:219)
at java.util.zip.ZipFile.(ZipFile.java:149)
at java.util.jar.JarFile.(JarFile.java:166)
at java.util.jar.JarFile.(JarFile.java:103)
at sun.misc.URLClassPath$JarLoader.getJarFile(URLClassPath.java:893)
at sun.misc.URLClassPath$JarLoader.access$700(URLClassPath.java:756)
at sun.misc.URLClassPath$JarLoader$1.run(URLClassPath.java:838)
at sun.misc.URLClassPath$JarLoader$1.run(URLClassPath.java:831)
at java.security.AccessController.doPrivileged(Native Method)
at sun.misc.URLClassPath$JarLoader.ensureOpen(URLClassPath.java:830)
at sun.misc.URLClassPath$JarLoader.(URLClassPath.java:803)
at sun.misc.URLClassPath$3.run(URLClassPath.java:530)
at sun.misc.URLClassPath$3.run(URLClassPath.java:520)
at java.security.AccessController.doPrivileged(Native Method)
at sun.misc.URLClassPath.getLoader(URLClassPath.java:519)
at sun.misc.URLClassPath.getLoader(URLClassPath.java:492)
- locked <0x00076ac75058> (a sun.misc.URLClassPath)
at sun.misc.URLClassPath.getNextLoader(URLClassPath.java:457)
- locked <0x00076ac75058> (a sun.misc.URLClassPath)
at sun.misc.URLClassPath.getResource(URLClassPath.java:211)
at java.net.URLClassLoader$1.run(URLClassLoader.java:365)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
- locked <0x00076ac7f960> (a java.lang.Object)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:495)
{code}

NodeManager LocalizerRunner thread which is not interrupted:

{code}
"LocalizerRunner for container_e746_1508665985104_601806_01_05" #3932753 
prio=5 os_prio=0 tid=0x7fb258d5f800 nid=0x11091 runnable 
[0x7fb153946000]
   java.lang.Thread.State: RUNNABLE
at java.io.FileInputStream.readBytes(Native Method)
at java.io.FileInputStream.read(FileInputStream.java:255)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
- locked <0x000718502b80> (a 
java.lang.UNIXProcess$ProcessPipeInputStream)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
- locked <0x000718502bd8> (a java.io.InputStreamReader)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.fill(BufferedReader.java:161)
at java.io.BufferedReader.read1(BufferedReader.java:212)
at java.io.BufferedReader.read(BufferedReader.java:286)
- locked <0x000718502bd8> (a java.io.InputStreamReader)
at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.parseExecResult(Shell.java:1155)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:930)
at org.apache.hadoop.util.Shell.run(Shell.java:848)
at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1142)
at 

[jira] [Created] (YARN-7429) Auxiliary Service status on NodeManager UI / CLI

2017-11-02 Thread Prabhu Joseph (JIRA)
Prabhu Joseph created YARN-7429:
---

 Summary: Auxiliary Service status on NodeManager UI / CLI
 Key: YARN-7429
 URL: https://issues.apache.org/jira/browse/YARN-7429
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.7.3
Reporter: Prabhu Joseph
Priority: Major


When an auxiliary service like the Spark Shuffle or MapReduce Shuffle service 
fails for some reason, the running jobs will have issues when their remote 
containers try to fetch data from the node where the service failed to 
initialize.

The reason the shuffle service failed to start will be in the NodeManager logs 
at startup and will likely be lost after a few days, by the time we notice the 
jobs failing. It would be useful if the NodeManager UI / CLI showed the list of 
auxiliary services and their status, and captured any error message if a 
service has failed.





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-7463) Using getLocalPathForWrite for Container related debug information

2017-11-08 Thread Prabhu Joseph (JIRA)
Prabhu Joseph created YARN-7463:
---

 Summary: Using getLocalPathForWrite for Container related debug 
information
 Key: YARN-7463
 URL: https://issues.apache.org/jira/browse/YARN-7463
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.7.3
Reporter: Prabhu Joseph
Assignee: Prabhu Joseph
Priority: Minor


The containers' debug information (launch_container.sh and directory.info) is 
always logged into the first directory of NM_LOG_DIRS instead of using the log 
directory returned from getLogPathForWrite.
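
A sketch of the intent using the generic LocalDirAllocator API; the relative 
path layout and the class wrapper below are illustrative assumptions (the 
NodeManager has its own helper for picking a log directory to write to):

{code}
import java.io.IOException;
import org.apache.hadoop.fs.LocalDirAllocator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public final class DebugInfoDirSketch {
  // Let the allocator pick a usable log dir for the debug artifacts instead
  // of always writing them under the first entry of NM_LOG_DIRS.
  public static Path debugFileDir(String appId, String containerId,
      YarnConfiguration conf) throws IOException {
    LocalDirAllocator logDirsAllocator =
        new LocalDirAllocator(YarnConfiguration.NM_LOG_DIRS);
    String relative = appId + Path.SEPARATOR + containerId;
    return logDirsAllocator.getLocalPathForWrite(relative, conf);
  }
}
{code}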



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-6929) yarn.nodemanager.remote-app-log-dir structure is not scalable

2017-10-25 Thread Prabhu Joseph (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-6929:

Attachment: YARN-6929.1.patch

> yarn.nodemanager.remote-app-log-dir structure is not scalable
> -
>
> Key: YARN-6929
> URL: https://issues.apache.org/jira/browse/YARN-6929
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation
>Affects Versions: 2.7.3
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
> Attachments: YARN-6929.1.patch, YARN-6929.patch
>
>
> The current directory structure for yarn.nodemanager.remote-app-log-dir is 
> not scalable. The maximum subdirectory limit by default is 1048576 (HDFS-6102). 
> With a retention (yarn.log-aggregation.retain-seconds) of 7 days, there is a 
> good chance that the LogAggregationService fails to create a new directory with 
> FSLimitException$MaxDirectoryItemsExceededException.
> The current structure is 
> //logs/. This can be 
> improved by adding the date as a subdirectory, like 
> //logs// 
> {code}
> WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService:
>  Application failed to init aggregation 
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.FSLimitException$MaxDirectoryItemsExceededException):
>  The directory item limit of /app-logs/yarn/logs is exceeded: limit=1048576 
> items=1048576 
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyMaxDirItems(FSDirectory.java:2021)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:2072)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedMkdir(FSDirectory.java:1841)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsRecursively(FSNamesystem.java:4351)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInternal(FSNamesystem.java:4262)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:4221)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:4194)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:813)
>  
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:600)
>  
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>  
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
>  
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962) 
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039) 
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035) 
> at java.security.AccessController.doPrivileged(Native Method) 
> at javax.security.auth.Subject.doAs(Subject.java:415) 
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>  
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033) 
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.createAppDir(LogAggregationService.java:308)
>  
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initAppAggregator(LogAggregationService.java:366)
>  
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initApp(LogAggregationService.java:320)
>  
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:443)
>  
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:67)
>  
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
>  
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) 
> at java.lang.Thread.run(Thread.java:745) 
> Caused by: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.FSLimitException$MaxDirectoryItemsExceededException):
>  The directory item limit of /app-logs/yarn/logs is exceeded: limit=1048576 
> items=1048576 
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyMaxDirItems(FSDirectory.java:2021)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:2072)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedMkdir(FSDirectory.java:1841)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsRecursively(FSNamesystem.java:4351)
>  
> at 
> 

[jira] [Updated] (YARN-6929) yarn.nodemanager.remote-app-log-dir structure is not scalable

2017-10-25 Thread Prabhu Joseph (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-6929:

Attachment: YARN-6929.2.patch

> yarn.nodemanager.remote-app-log-dir structure is not scalable
> -
>
> Key: YARN-6929
> URL: https://issues.apache.org/jira/browse/YARN-6929
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation
>Affects Versions: 2.7.3
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
> Attachments: YARN-6929.1.patch, YARN-6929.2.patch, YARN-6929.patch
>
>
> The current directory structure for yarn.nodemanager.remote-app-log-dir is 
> not scalable. The maximum subdirectory limit by default is 1048576 (HDFS-6102). 
> With a retention (yarn.log-aggregation.retain-seconds) of 7 days, there is a 
> good chance that the LogAggregationService fails to create a new directory with 
> FSLimitException$MaxDirectoryItemsExceededException.
> The current structure is 
> //logs/. This can be 
> improved by adding the date as a subdirectory, like 
> //logs// 
> {code}
> WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService:
>  Application failed to init aggregation 
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.FSLimitException$MaxDirectoryItemsExceededException):
>  The directory item limit of /app-logs/yarn/logs is exceeded: limit=1048576 
> items=1048576 
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyMaxDirItems(FSDirectory.java:2021)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:2072)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedMkdir(FSDirectory.java:1841)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsRecursively(FSNamesystem.java:4351)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInternal(FSNamesystem.java:4262)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:4221)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:4194)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:813)
>  
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:600)
>  
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>  
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
>  
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962) 
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039) 
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035) 
> at java.security.AccessController.doPrivileged(Native Method) 
> at javax.security.auth.Subject.doAs(Subject.java:415) 
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>  
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033) 
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.createAppDir(LogAggregationService.java:308)
>  
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initAppAggregator(LogAggregationService.java:366)
>  
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initApp(LogAggregationService.java:320)
>  
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:443)
>  
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:67)
>  
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
>  
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) 
> at java.lang.Thread.run(Thread.java:745) 
> Caused by: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.FSLimitException$MaxDirectoryItemsExceededException):
>  The directory item limit of /app-logs/yarn/logs is exceeded: limit=1048576 
> items=1048576 
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyMaxDirItems(FSDirectory.java:2021)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:2072)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedMkdir(FSDirectory.java:1841)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsRecursively(FSNamesystem.java:4351)
>  
> at 
> 

[jira] [Updated] (YARN-6929) yarn.nodemanager.remote-app-log-dir structure is not scalable

2017-10-24 Thread Prabhu Joseph (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-6929:

Attachment: YARN-6929.patch

> yarn.nodemanager.remote-app-log-dir structure is not scalable
> -
>
> Key: YARN-6929
> URL: https://issues.apache.org/jira/browse/YARN-6929
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation
>Affects Versions: 2.7.3
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
> Attachments: YARN-6929.patch
>
>
> The current directory structure for yarn.nodemanager.remote-app-log-dir is 
> not scalable. Maximum Subdirectory limit by default is 1048576 (HDFS-6102). 
> With retention yarn.log-aggregation.retain-seconds of 7days, there are more 
> chances LogAggregationService fails to create a new directory with 
> FSLimitException$MaxDirectoryItemsExceededException.
> The current structure is 
> <remote-app-log-dir>/<user>/logs/<applicationId>. This can be 
> improved by adding the date as a subdirectory, like 
> <remote-app-log-dir>/<user>/logs/<date>/<applicationId> 
> {code}
> WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService:
>  Application failed to init aggregation 
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.FSLimitException$MaxDirectoryItemsExceededException):
>  The directory item limit of /app-logs/yarn/logs is exceeded: limit=1048576 
> items=1048576 
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyMaxDirItems(FSDirectory.java:2021)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:2072)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedMkdir(FSDirectory.java:1841)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsRecursively(FSNamesystem.java:4351)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInternal(FSNamesystem.java:4262)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:4221)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:4194)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:813)
>  
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:600)
>  
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>  
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
>  
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962) 
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039) 
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035) 
> at java.security.AccessController.doPrivileged(Native Method) 
> at javax.security.auth.Subject.doAs(Subject.java:415) 
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>  
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033) 
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.createAppDir(LogAggregationService.java:308)
>  
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initAppAggregator(LogAggregationService.java:366)
>  
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initApp(LogAggregationService.java:320)
>  
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:443)
>  
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:67)
>  
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
>  
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) 
> at java.lang.Thread.run(Thread.java:745) 
> Caused by: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.FSLimitException$MaxDirectoryItemsExceededException):
>  The directory item limit of /app-logs/yarn/logs is exceeded: limit=1048576 
> items=1048576 
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyMaxDirItems(FSDirectory.java:2021)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:2072)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedMkdir(FSDirectory.java:1841)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsRecursively(FSNamesystem.java:4351)
>  
> at 
> 

[jira] [Assigned] (YARN-5295) YARN queue-mappings to check Queue is present before submitting job

2017-10-24 Thread Prabhu Joseph (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph reassigned YARN-5295:
---

Assignee: Prabhu Joseph

> YARN queue-mappings to check Queue is present before submitting job
> ---
>
> Key: YARN-5295
> URL: https://issues.apache.org/jira/browse/YARN-5295
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 2.7.2
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>
> In YARN queue-mappings, YARN should check whether the mapped queue is present 
> before submitting the job. If it is not present, it should fall through to the 
> next available mapping.
> For example, if we have
> yarn.scheduler.capacity.queue-mappings=u:%user:%user,g:edw:platform
> and I submit a job with user "test" and there is no "test" queue, then it 
> should check the second mapping (g:edw:platform) in the list, and if "test" is 
> part of the edw group it should submit the job to the platform queue.
> The sanity checks below have to be done for the mapped queue in the list; if a 
> check fails, the next queue mapping has to be chosen. Only when no queue 
> mapping passes the sanity checks should the application be rejected (see the 
> sketch after this list).
> 1. is the queue present
> 2. is the queue not a leaf queue (a non-leaf queue cannot accept the application)
> 3. does the user have either the Submit_Applications or Administer_Queue ACL 
> on the queue.
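
For illustration only, a minimal sketch of the fallback described above, written 
against hypothetical QueueMapping and SchedulerView helpers (these are not the 
real CapacityScheduler classes):

{code}
import java.util.List;

// Illustrative-only sketch of the proposed queue-mapping fallback.
// QueueMapping and SchedulerView are hypothetical helpers, not real YARN types.
public class QueueMappingFallback {

  interface QueueMapping {
    // Resolves this mapping for the user, e.g. u:%user:%user -> "test",
    // or g:edw:platform -> "platform" when the user belongs to group edw.
    String resolveFor(String user);
  }

  interface SchedulerView {
    boolean queueExists(String queue);
    boolean isLeafQueue(String queue);
    boolean hasSubmitOrAdminAcl(String user, String queue);
  }

  // Returns the first mapped queue that passes all sanity checks; the
  // application is rejected only when no mapping passes.
  public static String chooseQueue(String user, List<QueueMapping> mappings,
                                   SchedulerView scheduler) {
    for (QueueMapping mapping : mappings) {
      String queue = mapping.resolveFor(user);
      if (queue == null) {
        continue;                                   // mapping does not apply
      }
      if (!scheduler.queueExists(queue)) {
        continue;                                   // check 1: queue present
      }
      if (!scheduler.isLeafQueue(queue)) {
        continue;                                   // check 2: skip non-leaf queues
      }
      if (!scheduler.hasSubmitOrAdminAcl(user, queue)) {
        continue;                                   // check 3: submit/admin ACL
      }
      return queue;
    }
    throw new IllegalArgumentException(
        "Application rejected: no queue mapping passed the sanity checks for "
            + user);
  }
}
{code}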



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8291) RMRegistryOperationService don't have limit on AsyncPurge threads

2018-05-14 Thread Prabhu Joseph (JIRA)
Prabhu Joseph created YARN-8291:
---

 Summary: RMRegistryOperationService don't have limit on AsyncPurge 
threads
 Key: YARN-8291
 URL: https://issues.apache.org/jira/browse/YARN-8291
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Affects Versions: 2.7.3
Reporter: Prabhu Joseph


When more than 1+ containers have finished, 
RMRegistryOperationService will create 1+ threads for performing the AsyncPurge, 
which can slow down the ResourceManager process. There should be a limit on the 
number of threads.

{code}
"RegistryAdminService 554485" #824351 prio=5 os_prio=0 tid=0x7fe4b2bc9800 
nid=0xf8ed in Object.wait() [0x7fe31a5e4000]
   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:502)
at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1386)
- locked <0x0007902ec7d8> (a org.apache.zookeeper.ClientCnxn$Packet)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1040)
at 
org.apache.curator.framework.imps.ExistsBuilderImpl$2.call(ExistsBuilderImpl.java:172)
at 
org.apache.curator.framework.imps.ExistsBuilderImpl$2.call(ExistsBuilderImpl.java:161)
at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107)
at 
org.apache.curator.framework.imps.ExistsBuilderImpl.pathInForeground(ExistsBuilderImpl.java:158)
at 
org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:148)
at 
org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:36)
at 
org.apache.hadoop.registry.client.impl.zk.CuratorService.zkStat(CuratorService.java:455)
at 
org.apache.hadoop.registry.client.impl.zk.RegistryOperationsService.stat(RegistryOperationsService.java:137)
at 
org.apache.hadoop.registry.client.binding.RegistryUtils.statChildren(RegistryUtils.java:210)
at 
org.apache.hadoop.registry.server.services.RegistryAdminService.purge(RegistryAdminService.java:450)
at 
org.apache.hadoop.registry.server.services.RegistryAdminService.purge(RegistryAdminService.java:520)
at 
org.apache.hadoop.registry.server.services.RegistryAdminService$AsyncPurge.call(RegistryAdminService.java:570)
at 
org.apache.hadoop.registry.server.services.RegistryAdminService$AsyncPurge.call(RegistryAdminService.java:543)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}
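
For illustration, a minimal sketch of the kind of limit suggested here: a 
fixed-size pool for the purge tasks instead of one thread per finished 
container. The pool size of 10 and the class names are assumptions, not the 
actual RegistryAdminService code.

{code}
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative sketch: bound the number of concurrent async purge tasks with a
// fixed-size pool so a burst of finished containers cannot spawn an unbounded
// number of threads.
public class BoundedAsyncPurge {

  private static final int MAX_PURGE_THREADS = 10;   // hypothetical limit

  private final ExecutorService purgePool =
      Executors.newFixedThreadPool(MAX_PURGE_THREADS);

  // Queues a purge task; at most MAX_PURGE_THREADS purges run at any time and
  // the rest wait in the executor's queue instead of each holding a thread.
  public Future<Integer> submitPurge(Callable<Integer> purgeTask) {
    return purgePool.submit(purgeTask);
  }

  public void shutdown() {
    purgePool.shutdown();
  }
}
{code}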



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8254) dynamically change log levels for YARN Jobs

2018-05-07 Thread Prabhu Joseph (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16466853#comment-16466853
 ] 

Prabhu Joseph commented on YARN-8254:
-

A YarnClient can request setLogLevel for an application using a new api ("yarn 
application -setLogLevel   ") to the RM. The ResourceManager will pass it to the 
ApplicationMaster through the AllocateResponse.

The ApplicationMaster will process the logLevel and pass it to all the task 
containers as part of the response to statusUpdate. Each application needs a 
change to support this, or it can simply ignore the request; a minimal sketch of 
the container-side handling follows below.
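
As a rough sketch of that container-side step, assuming the level string has 
already reached the container through the AM's status-update response (that 
transport is the proposal above, not existing API); only the log4j 1.x calls 
shown are existing API:

{code}
import org.apache.log4j.Level;
import org.apache.log4j.LogManager;
import org.apache.log4j.Logger;

// Illustrative sketch of what a task container could do once the AM forwards a
// requested log level. How the level string reaches the container is the
// assumed new protocol described above; only the log4j calls are existing API.
public class ContainerLogLevelHandler {

  // Applies the requested level to the root logger, or to a specific logger
  // when a logger name is supplied.
  public static void applyLogLevel(String loggerName, String levelName) {
    Level level = Level.toLevel(levelName, Level.INFO);   // falls back to INFO
    Logger target = (loggerName == null || loggerName.isEmpty())
        ? LogManager.getRootLogger()
        : LogManager.getLogger(loggerName);
    target.setLevel(level);
    target.info("Log level changed to " + level + " on request from the AM");
  }
}
{code}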

> dynamically change log levels for YARN Jobs
> ---
>
> Key: YARN-8254
> URL: https://issues.apache.org/jira/browse/YARN-8254
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.7.3
>Reporter: Prabhu Joseph
>Priority: Major
>  Labels: supportability
>
> Currently the Log Levels for Daemons can be dynamically changed. It will be 
> easier while debugging to have same for YARN Jobs. Client can setLogLevel to 
> ApplicationMaster which can set it for all the containers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8254) dynamically change log levels for YARN Jobs

2018-05-07 Thread Prabhu Joseph (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-8254:

Component/s: yarn

> dynamically change log levels for YARN Jobs
> ---
>
> Key: YARN-8254
> URL: https://issues.apache.org/jira/browse/YARN-8254
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.7.3
>Reporter: Prabhu Joseph
>Priority: Major
>  Labels: supportability
>
> Currently the Log Levels for Daemons can be dynamically changed. It will be 
> easier while debugging to have same for YARN Jobs. Client can setLogLevel to 
> ApplicationMaster which can set it for all the containers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8254) dynamically change log levels for YARN Jobs

2018-05-07 Thread Prabhu Joseph (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16466767#comment-16466767
 ] 

Prabhu Joseph commented on YARN-8254:
-

[~Naganarasimha] Just realized this is application specific. The AM has to 
provide support for changing the log level to the client: the JobClient can 
request setLogLevel to the AM, and the AM will internally setLogLevel for all 
running containers. Will move this Jira to MapReduce.

> dynamically change log levels for YARN Jobs
> ---
>
> Key: YARN-8254
> URL: https://issues.apache.org/jira/browse/YARN-8254
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.7.3
>Reporter: Prabhu Joseph
>Priority: Major
>  Labels: supportability
>
> Currently the Log Levels for Daemons can be dynamically changed. It will be 
> easier while debugging to have same for YARN Jobs. Client can setLogLevel to 
> ApplicationMaster which can set it for all the containers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8254) dynamically change log levels for YARN Jobs

2018-05-07 Thread Prabhu Joseph (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-8254:

Component/s: (was: yarn)

> dynamically change log levels for YARN Jobs
> ---
>
> Key: YARN-8254
> URL: https://issues.apache.org/jira/browse/YARN-8254
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.7.3
>Reporter: Prabhu Joseph
>Priority: Major
>  Labels: supportability
>
> Currently the Log Levels for Daemons can be dynamically changed. It will be 
> easier while debugging to have same for YARN Jobs. Client can setLogLevel to 
> ApplicationMaster which can set it for all the containers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8224) LogAggregation status TIME_OUT for absent container misleading

2018-04-27 Thread Prabhu Joseph (JIRA)
Prabhu Joseph created YARN-8224:
---

 Summary: LogAggregation status TIME_OUT for absent container 
misleading
 Key: YARN-8224
 URL: https://issues.apache.org/jira/browse/YARN-8224
 Project: Hadoop YARN
  Issue Type: Bug
  Components: log-aggregation
Affects Versions: 2.7.3
Reporter: Prabhu Joseph


When a container is not launched on NM and it is absent, RM still tries to get 
the Log Aggregation Status and reports the status as TIME_OUT. (attached 
screenshot)

{code}
2018-04-26 12:47:38,403 WARN  containermanager.ContainerManagerImpl 
(ContainerManagerImpl.java:handle(1070)) - Event EventType: KILL_CONTAINER sent 
to absent container container_e361_1524687599273_2110_01_000770

2018-04-26 12:49:31,743 WARN  containermanager.ContainerManagerImpl 
(ContainerManagerImpl.java:handle(1086)) - Event EventType: FINISH_APPLICATION 
sent to absent application application_1524687599273_2110

{code}






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8224) LogAggregation status TIME_OUT for absent container misleading

2018-04-27 Thread Prabhu Joseph (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-8224:

Description: 
When a container is not launched on NM and it is absent, RM still tries to get 
the Log Aggregation Status and reports the status as TIME_OUT. 

{code}
2018-04-26 12:47:38,403 WARN  containermanager.ContainerManagerImpl 
(ContainerManagerImpl.java:handle(1070)) - Event EventType: KILL_CONTAINER sent 
to absent container container_e361_1524687599273_2110_01_000770

2018-04-26 12:49:31,743 WARN  containermanager.ContainerManagerImpl 
(ContainerManagerImpl.java:handle(1086)) - Event EventType: FINISH_APPLICATION 
sent to absent application application_1524687599273_2110

{code}




  was:
When a container is not launched on NM and it is absent, RM still tries to get 
the Log Aggregation Status and reports the status as TIME_OUT. (attached 
screenshot)

{code}
2018-04-26 12:47:38,403 WARN  containermanager.ContainerManagerImpl 
(ContainerManagerImpl.java:handle(1070)) - Event EventType: KILL_CONTAINER sent 
to absent container container_e361_1524687599273_2110_01_000770

2018-04-26 12:49:31,743 WARN  containermanager.ContainerManagerImpl 
(ContainerManagerImpl.java:handle(1086)) - Event EventType: FINISH_APPLICATION 
sent to absent application application_1524687599273_2110

{code}





> LogAggregation status TIME_OUT for absent container misleading
> --
>
> Key: YARN-8224
> URL: https://issues.apache.org/jira/browse/YARN-8224
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation
>Affects Versions: 2.7.3
>Reporter: Prabhu Joseph
>Priority: Major
>
> When a container is not launched on NM and it is absent, RM still tries to 
> get the Log Aggregation Status and reports the status as TIME_OUT. 
> {code}
> 2018-04-26 12:47:38,403 WARN  containermanager.ContainerManagerImpl 
> (ContainerManagerImpl.java:handle(1070)) - Event EventType: KILL_CONTAINER 
> sent to absent container container_e361_1524687599273_2110_01_000770
> 2018-04-26 12:49:31,743 WARN  containermanager.ContainerManagerImpl 
> (ContainerManagerImpl.java:handle(1086)) - Event EventType: 
> FINISH_APPLICATION sent to absent application application_1524687599273_2110
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8224) LogAggregation status TIME_OUT for absent container misleading

2018-04-27 Thread Prabhu Joseph (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-8224:

Description: 
When a container is not launched on NM and it is absent, RM still tries to get 
the Log Aggregation Status and reports the status as TIME_OUT in RM UI. 

{code}
2018-04-26 12:47:38,403 WARN  containermanager.ContainerManagerImpl 
(ContainerManagerImpl.java:handle(1070)) - Event EventType: KILL_CONTAINER sent 
to absent container container_e361_1524687599273_2110_01_000770

2018-04-26 12:49:31,743 WARN  containermanager.ContainerManagerImpl 
(ContainerManagerImpl.java:handle(1086)) - Event EventType: FINISH_APPLICATION 
sent to absent application application_1524687599273_2110

{code}




  was:
When a container is not launched on NM and it is absent, RM still tries to get 
the Log Aggregation Status and reports the status as TIME_OUT. 

{code}
2018-04-26 12:47:38,403 WARN  containermanager.ContainerManagerImpl 
(ContainerManagerImpl.java:handle(1070)) - Event EventType: KILL_CONTAINER sent 
to absent container container_e361_1524687599273_2110_01_000770

2018-04-26 12:49:31,743 WARN  containermanager.ContainerManagerImpl 
(ContainerManagerImpl.java:handle(1086)) - Event EventType: FINISH_APPLICATION 
sent to absent application application_1524687599273_2110

{code}





> LogAggregation status TIME_OUT for absent container misleading
> --
>
> Key: YARN-8224
> URL: https://issues.apache.org/jira/browse/YARN-8224
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation
>Affects Versions: 2.7.3
>Reporter: Prabhu Joseph
>Priority: Major
>
> When a container is not launched on NM and it is absent, RM still tries to 
> get the Log Aggregation Status and reports the status as TIME_OUT in RM UI. 
> {code}
> 2018-04-26 12:47:38,403 WARN  containermanager.ContainerManagerImpl 
> (ContainerManagerImpl.java:handle(1070)) - Event EventType: KILL_CONTAINER 
> sent to absent container container_e361_1524687599273_2110_01_000770
> 2018-04-26 12:49:31,743 WARN  containermanager.ContainerManagerImpl 
> (ContainerManagerImpl.java:handle(1086)) - Event EventType: 
> FINISH_APPLICATION sent to absent application application_1524687599273_2110
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8279) AggregationLogDeletionService does not honor yarn.log-aggregation.IndexedFormat.remote-app-log-dir-suffix

2018-05-11 Thread Prabhu Joseph (JIRA)
Prabhu Joseph created YARN-8279:
---

 Summary: AggregationLogDeletionService does not honor 
yarn.log-aggregation.IndexedFormat.remote-app-log-dir-suffix
 Key: YARN-8279
 URL: https://issues.apache.org/jira/browse/YARN-8279
 Project: Hadoop YARN
  Issue Type: Bug
  Components: log-aggregation
Affects Versions: 2.7.3
Reporter: Prabhu Joseph


AggregationLogDeletionService does not honor 
yarn.log-aggregation.IndexedFormat.remote-app-log-dir-suffix. 
AggregationLogService writes the logs into /app-logs/<user>/logs-ifile, 
whereas AggregationLogDeletion tries to delete from /app-logs/<user>/logs.

The workaround is to set 
yarn.log-aggregation.IndexedFormat.remote-app-log-dir-suffix and 
yarn.nodemanager.remote-app-log-dir-suffix to the same value, "logs-ifile".

AggregationLogDeletionService has to check the format and choose the suffix 
based upon it. Currently it only checks the older suffix property, 
yarn.nodemanager.remote-app-log-dir-suffix.

AggregatedLogDeletionService tries to delete the older-suffix directory; a 
sketch of the suffix selection follows after the stack trace below.

{code}
2018-05-11 08:48:19,989 ERROR logaggregation.AggregatedLogDeletionService 
(AggregatedLogDeletionService.java:logIOException(182)) - Could not read the 
contents of hdfs://prabhucluster:8020/app-logs/hive/logs
java.io.FileNotFoundException: File 
hdfs://prabhucluster:8020/app-logs/hive/logs does not exist.
at 
org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:923)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:114)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:985)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:981)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:992)
at 
org.apache.hadoop.yarn.logaggregation.AggregatedLogDeletionService$LogDeletionTask.deleteOldLogDirsFrom(AggregatedLogDeletionService.java:98)
at 
org.apache.hadoop.yarn.logaggregation.AggregatedLogDeletionService$LogDeletionTask.run(AggregatedLogDeletionService.java:85)
at java.util.TimerThread.mainLoop(Timer.java:555)
at java.util.TimerThread.run(Timer.java:505)
{code}
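
A minimal sketch of the fix direction, assuming a flag that tells whether the 
indexed (IFile) format is enabled; the helper names and default values are 
assumptions, not the actual AggregatedLogDeletionService code:

{code}
import org.apache.hadoop.conf.Configuration;

// Illustrative sketch: pick the remote-app-log-dir suffix based on the
// configured aggregation file format before building the deletion path, so the
// deletion service scans the same directory the writer used.
public class LogDirSuffixChooser {

  static final String TFILE_SUFFIX_KEY =
      "yarn.nodemanager.remote-app-log-dir-suffix";
  static final String IFILE_SUFFIX_KEY =
      "yarn.log-aggregation.IndexedFormat.remote-app-log-dir-suffix";

  // Returns the suffix the deletion service should scan. The "-ifile" fallback
  // mirrors the workaround value mentioned above and is an assumption here.
  public static String chooseSuffix(Configuration conf,
                                    boolean indexedFormatEnabled) {
    String tfileSuffix = conf.get(TFILE_SUFFIX_KEY, "logs");
    if (indexedFormatEnabled) {
      return conf.get(IFILE_SUFFIX_KEY, tfileSuffix + "-ifile");
    }
    return tfileSuffix;
  }
}
{code}

With the workaround in the description, both properties point at the same 
suffix, so the writer and the deletion service agree even without this check.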



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-8279) AggregationLogDeletionService does not honor yarn.log-aggregation.IndexedFormat.remote-app-log-dir-suffix

2018-05-11 Thread Prabhu Joseph (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph reassigned YARN-8279:
---

Assignee: Tarun Parimi

> AggregationLogDeletionService does not honor 
> yarn.log-aggregation.IndexedFormat.remote-app-log-dir-suffix
> -
>
> Key: YARN-8279
> URL: https://issues.apache.org/jira/browse/YARN-8279
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation
>Affects Versions: 2.7.3
>Reporter: Prabhu Joseph
>Assignee: Tarun Parimi
>Priority: Major
>
> AggregationLogDeletionService does not honor 
> yarn.log-aggregation.IndexedFormat.remote-app-log-dir-suffix. 
> AggregationLogService writes the logs into /app-logs/<user>/logs-ifile, 
> whereas AggregationLogDeletion tries to delete from 
> /app-logs/<user>/logs.
> The workaround is to set 
> yarn.log-aggregation.IndexedFormat.remote-app-log-dir-suffix and 
> yarn.nodemanager.remote-app-log-dir-suffix to the same value, "logs-ifile".
> AggregationLogDeletionService has to check the format and based upon that 
> choose the suffix. Currently it only checks the older suffix 
> yarn.nodemanager.remote-app-log-dir-suffix.
> AggregatedLogDeletionService tries to delete older suffix directory.
> {code}
> 2018-05-11 08:48:19,989 ERROR logaggregation.AggregatedLogDeletionService 
> (AggregatedLogDeletionService.java:logIOException(182)) - Could not read the 
> contents of hdfs://prabhucluster:8020/app-logs/hive/logs
> java.io.FileNotFoundException: File 
> hdfs://prabhucluster:8020/app-logs/hive/logs does not exist.
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:923)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:114)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:985)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:981)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:992)
> at 
> org.apache.hadoop.yarn.logaggregation.AggregatedLogDeletionService$LogDeletionTask.deleteOldLogDirsFrom(AggregatedLogDeletionService.java:98)
> at 
> org.apache.hadoop.yarn.logaggregation.AggregatedLogDeletionService$LogDeletionTask.run(AggregatedLogDeletionService.java:85)
> at java.util.TimerThread.mainLoop(Timer.java:555)
> at java.util.TimerThread.run(Timer.java:505)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8254) dynamically change log levels for YARN Jobs

2018-05-07 Thread Prabhu Joseph (JIRA)
Prabhu Joseph created YARN-8254:
---

 Summary: dynamically change log levels for YARN Jobs
 Key: YARN-8254
 URL: https://issues.apache.org/jira/browse/YARN-8254
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: yarn
Affects Versions: 2.7.3
Reporter: Prabhu Joseph


Currently the Log Levels for Daemons can be dynamically changed. It will be 
easier while debugging to have same for YARN Jobs. Client can setLogLevel to 
ApplicationMaster which can set it for all the containers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8254) dynamically change log levels for YARN Jobs

2018-05-07 Thread Prabhu Joseph (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-8254:

Labels: supportability  (was: )

> dynamically change log levels for YARN Jobs
> ---
>
> Key: YARN-8254
> URL: https://issues.apache.org/jira/browse/YARN-8254
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: yarn
>Affects Versions: 2.7.3
>Reporter: Prabhu Joseph
>Priority: Major
>  Labels: supportability
>
> Currently the Log Levels for Daemons can be dynamically changed. It will be 
> easier while debugging to have same for YARN Jobs. Client can setLogLevel to 
> ApplicationMaster which can set it for all the containers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8291) RMRegistryOperationService don't have limit on AsyncPurge threads

2018-05-18 Thread Prabhu Joseph (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-8291:

Affects Version/s: (was: 2.7.3)
   3.0.0

> RMRegistryOperationService don't have limit on AsyncPurge threads
> -
>
> Key: YARN-8291
> URL: https://issues.apache.org/jira/browse/YARN-8291
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.0.0
>Reporter: Prabhu Joseph
>Priority: Major
>
> When more than 1+ containers have finished, 
> RMRegistryOperationService will create 1+ threads for performing the 
> AsyncPurge, which can slow down the ResourceManager process. There should be a 
> limit on the number of threads.
> {code}
> "RegistryAdminService 554485" #824351 prio=5 os_prio=0 tid=0x7fe4b2bc9800 
> nid=0xf8ed in Object.wait() [0x7fe31a5e4000]
>java.lang.Thread.State: WAITING (on object monitor)
> at java.lang.Object.wait(Native Method)
> at java.lang.Object.wait(Object.java:502)
> at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1386)
> - locked <0x0007902ec7d8> (a 
> org.apache.zookeeper.ClientCnxn$Packet)
> at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1040)
> at 
> org.apache.curator.framework.imps.ExistsBuilderImpl$2.call(ExistsBuilderImpl.java:172)
> at 
> org.apache.curator.framework.imps.ExistsBuilderImpl$2.call(ExistsBuilderImpl.java:161)
> at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107)
> at 
> org.apache.curator.framework.imps.ExistsBuilderImpl.pathInForeground(ExistsBuilderImpl.java:158)
> at 
> org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:148)
> at 
> org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:36)
> at 
> org.apache.hadoop.registry.client.impl.zk.CuratorService.zkStat(CuratorService.java:455)
> at 
> org.apache.hadoop.registry.client.impl.zk.RegistryOperationsService.stat(RegistryOperationsService.java:137)
> at 
> org.apache.hadoop.registry.client.binding.RegistryUtils.statChildren(RegistryUtils.java:210)
> at 
> org.apache.hadoop.registry.server.services.RegistryAdminService.purge(RegistryAdminService.java:450)
> at 
> org.apache.hadoop.registry.server.services.RegistryAdminService.purge(RegistryAdminService.java:520)
> at 
> org.apache.hadoop.registry.server.services.RegistryAdminService$AsyncPurge.call(RegistryAdminService.java:570)
> at 
> org.apache.hadoop.registry.server.services.RegistryAdminService$AsyncPurge.call(RegistryAdminService.java:543)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8291) RMRegistryOperationService don't have limit on AsyncPurge threads

2018-05-18 Thread Prabhu Joseph (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480424#comment-16480424
 ] 

Prabhu Joseph commented on YARN-8291:
-

The trunk code also has this issue.

> RMRegistryOperationService don't have limit on AsyncPurge threads
> -
>
> Key: YARN-8291
> URL: https://issues.apache.org/jira/browse/YARN-8291
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.0.0
>Reporter: Prabhu Joseph
>Priority: Major
>
> When more than 1+ containers have finished, 
> RMRegistryOperationService will create 1+ threads for performing the 
> AsyncPurge, which can slow down the ResourceManager process. There should be a 
> limit on the number of threads.
> {code}
> "RegistryAdminService 554485" #824351 prio=5 os_prio=0 tid=0x7fe4b2bc9800 
> nid=0xf8ed in Object.wait() [0x7fe31a5e4000]
>java.lang.Thread.State: WAITING (on object monitor)
> at java.lang.Object.wait(Native Method)
> at java.lang.Object.wait(Object.java:502)
> at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1386)
> - locked <0x0007902ec7d8> (a 
> org.apache.zookeeper.ClientCnxn$Packet)
> at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1040)
> at 
> org.apache.curator.framework.imps.ExistsBuilderImpl$2.call(ExistsBuilderImpl.java:172)
> at 
> org.apache.curator.framework.imps.ExistsBuilderImpl$2.call(ExistsBuilderImpl.java:161)
> at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107)
> at 
> org.apache.curator.framework.imps.ExistsBuilderImpl.pathInForeground(ExistsBuilderImpl.java:158)
> at 
> org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:148)
> at 
> org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:36)
> at 
> org.apache.hadoop.registry.client.impl.zk.CuratorService.zkStat(CuratorService.java:455)
> at 
> org.apache.hadoop.registry.client.impl.zk.RegistryOperationsService.stat(RegistryOperationsService.java:137)
> at 
> org.apache.hadoop.registry.client.binding.RegistryUtils.statChildren(RegistryUtils.java:210)
> at 
> org.apache.hadoop.registry.server.services.RegistryAdminService.purge(RegistryAdminService.java:450)
> at 
> org.apache.hadoop.registry.server.services.RegistryAdminService.purge(RegistryAdminService.java:520)
> at 
> org.apache.hadoop.registry.server.services.RegistryAdminService$AsyncPurge.call(RegistryAdminService.java:570)
> at 
> org.apache.hadoop.registry.server.services.RegistryAdminService$AsyncPurge.call(RegistryAdminService.java:543)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8279) AggregationLogDeletionService does not honor yarn.log-aggregation.IndexedFormat.remote-app-log-dir-suffix

2018-05-24 Thread Prabhu Joseph (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-8279:

Affects Version/s: (was: 2.7.3)
   2.9.1

> AggregationLogDeletionService does not honor 
> yarn.log-aggregation.IndexedFormat.remote-app-log-dir-suffix
> -
>
> Key: YARN-8279
> URL: https://issues.apache.org/jira/browse/YARN-8279
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation
>Affects Versions: 2.9.1
>Reporter: Prabhu Joseph
>Assignee: Tarun Parimi
>Priority: Major
>
> AggregationLogDeletionService does not honor 
> yarn.log-aggregation.IndexedFormat.remote-app-log-dir-suffix. 
> AggregationLogService writes the logs into /app-logs/<user>/logs-ifile, 
> whereas AggregationLogDeletion tries to delete from 
> /app-logs/<user>/logs.
> The workaround is to set 
> yarn.log-aggregation.IndexedFormat.remote-app-log-dir-suffix and 
> yarn.nodemanager.remote-app-log-dir-suffix to the same value, "logs-ifile".
> AggregationLogDeletionService has to check the format and based upon that 
> choose the suffix. Currently it only checks the older suffix 
> yarn.nodemanager.remote-app-log-dir-suffix.
> AggregatedLogDeletionService tries to delete older suffix directory.
> {code}
> 2018-05-11 08:48:19,989 ERROR logaggregation.AggregatedLogDeletionService 
> (AggregatedLogDeletionService.java:logIOException(182)) - Could not read the 
> contents of hdfs://prabhucluster:8020/app-logs/hive/logs
> java.io.FileNotFoundException: File 
> hdfs://prabhucluster:8020/app-logs/hive/logs does not exist.
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:923)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:114)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:985)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:981)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:992)
> at 
> org.apache.hadoop.yarn.logaggregation.AggregatedLogDeletionService$LogDeletionTask.deleteOldLogDirsFrom(AggregatedLogDeletionService.java:98)
> at 
> org.apache.hadoop.yarn.logaggregation.AggregatedLogDeletionService$LogDeletionTask.run(AggregatedLogDeletionService.java:85)
> at java.util.TimerThread.mainLoop(Timer.java:555)
> at java.util.TimerThread.run(Timer.java:505)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8279) AggregationLogDeletionService does not honor yarn.log-aggregation.IndexedFormat.remote-app-log-dir-suffix

2018-05-24 Thread Prabhu Joseph (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489071#comment-16489071
 ] 

Prabhu Joseph commented on YARN-8279:
-

[~jlowe] We have faced this on the HDP distribution with Hadoop version 2.7.3, 
which has most of the latest code from Apache. The issue will also occur in the 
trunk version.

> AggregationLogDeletionService does not honor 
> yarn.log-aggregation.IndexedFormat.remote-app-log-dir-suffix
> -
>
> Key: YARN-8279
> URL: https://issues.apache.org/jira/browse/YARN-8279
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation
>Affects Versions: 2.9.1
>Reporter: Prabhu Joseph
>Assignee: Tarun Parimi
>Priority: Major
>
> AggregationLogDeletionService does not honor 
> yarn.log-aggregation.IndexedFormat.remote-app-log-dir-suffix. 
> AggregationLogService writes the logs into /app-logs/<user>/logs-ifile, 
> whereas AggregationLogDeletion tries to delete from 
> /app-logs/<user>/logs.
> The workaround is to set 
> yarn.log-aggregation.IndexedFormat.remote-app-log-dir-suffix and 
> yarn.nodemanager.remote-app-log-dir-suffix to the same value, "logs-ifile".
> AggregationLogDeletionService has to check the format and based upon that 
> choose the suffix. Currently it only checks the older suffix 
> yarn.nodemanager.remote-app-log-dir-suffix.
> AggregatedLogDeletionService tries to delete older suffix directory.
> {code}
> 2018-05-11 08:48:19,989 ERROR logaggregation.AggregatedLogDeletionService 
> (AggregatedLogDeletionService.java:logIOException(182)) - Could not read the 
> contents of hdfs://prabhucluster:8020/app-logs/hive/logs
> java.io.FileNotFoundException: File 
> hdfs://prabhucluster:8020/app-logs/hive/logs does not exist.
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:923)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:114)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:985)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:981)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:992)
> at 
> org.apache.hadoop.yarn.logaggregation.AggregatedLogDeletionService$LogDeletionTask.deleteOldLogDirsFrom(AggregatedLogDeletionService.java:98)
> at 
> org.apache.hadoop.yarn.logaggregation.AggregatedLogDeletionService$LogDeletionTask.run(AggregatedLogDeletionService.java:85)
> at java.util.TimerThread.mainLoop(Timer.java:555)
> at java.util.TimerThread.run(Timer.java:505)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8279) AggregationLogDeletionService does not honor yarn.log-aggregation.IndexedFormat.remote-app-log-dir-suffix

2018-06-27 Thread Prabhu Joseph (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-8279:

Description: 
AggregationLogDeletionService does not honor 
yarn.log-aggregation.IndexedFormat.remote-app-log-dir-suffix. 
AggregationLogService writes the logs into /app-logs/<user>/logs-ifile, 
whereas AggregationLogDeletion tries to delete from /app-logs/<user>/logs.

The workaround is to set 
yarn.log-aggregation.IndexedFormat.remote-app-log-dir-suffix and 
yarn.nodemanager.remote-app-log-dir-suffix to the same value, "logs-ifile", and 
restart the HistoryServer, which runs the AggregationLogDeletionService.

AggregationLogDeletionService has to check the format and based upon that 
choose the suffix. Currently it only checks the older suffix 
yarn.nodemanager.remote-app-log-dir-suffix.

AggregatedLogDeletionService tries to delete older suffix directory.

{code}
2018-05-11 08:48:19,989 ERROR logaggregation.AggregatedLogDeletionService 
(AggregatedLogDeletionService.java:logIOException(182)) - Could not read the 
contents of hdfs://prabhucluster:8020/app-logs/hive/logs
java.io.FileNotFoundException: File 
hdfs://prabhucluster:8020/app-logs/hive/logs does not exist.
at 
org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:923)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:114)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:985)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:981)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:992)
at 
org.apache.hadoop.yarn.logaggregation.AggregatedLogDeletionService$LogDeletionTask.deleteOldLogDirsFrom(AggregatedLogDeletionService.java:98)
at 
org.apache.hadoop.yarn.logaggregation.AggregatedLogDeletionService$LogDeletionTask.run(AggregatedLogDeletionService.java:85)
at java.util.TimerThread.mainLoop(Timer.java:555)
at java.util.TimerThread.run(Timer.java:505)
{code}

  was:
AggregationLogDeletionService does not honor 
yarn.log-aggregation.IndexedFormat.remote-app-log-dir-suffix. 
AggregationLogService writes the logs into /app-logs/<user>/logs-ifile, 
whereas AggregationLogDeletion tries to delete from /app-logs/<user>/logs.

The workaround is to set 
yarn.log-aggregation.IndexedFormat.remote-app-log-dir-suffix and 
yarn.nodemanager.remote-app-log-dir-suffix to the same value, "logs-ifile".

AggregationLogDeletionService has to check the format and based upon that 
choose the suffix. Currently it only checks the older suffix 
yarn.nodemanager.remote-app-log-dir-suffix.

AggregatedLogDeletionService tries to delete older suffix directory.

{code}
2018-05-11 08:48:19,989 ERROR logaggregation.AggregatedLogDeletionService 
(AggregatedLogDeletionService.java:logIOException(182)) - Could not read the 
contents of hdfs://prabhucluster:8020/app-logs/hive/logs
java.io.FileNotFoundException: File 
hdfs://prabhucluster:8020/app-logs/hive/logs does not exist.
at 
org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:923)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:114)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:985)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:981)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:992)
at 
org.apache.hadoop.yarn.logaggregation.AggregatedLogDeletionService$LogDeletionTask.deleteOldLogDirsFrom(AggregatedLogDeletionService.java:98)
at 
org.apache.hadoop.yarn.logaggregation.AggregatedLogDeletionService$LogDeletionTask.run(AggregatedLogDeletionService.java:85)
at java.util.TimerThread.mainLoop(Timer.java:555)
at java.util.TimerThread.run(Timer.java:505)
{code}


> AggregationLogDeletionService does not honor 
> yarn.log-aggregation.IndexedFormat.remote-app-log-dir-suffix
> -
>
> Key: YARN-8279
> URL: https://issues.apache.org/jira/browse/YARN-8279
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation
>Affects Versions: 2.9.1
>Reporter: Prabhu Joseph
>Assignee: Tarun Parimi
>Priority: Major
>
> AggregationLogDeletionService does not honor 
> yarn.log-aggregation.IndexedFormat.remote-app-log-dir-suffix. 
> AggregationLogService writes the logs into /app-logs/<user>/logs-ifile 

[jira] [Updated] (YARN-6929) yarn.nodemanager.remote-app-log-dir structure is not scalable

2017-10-26 Thread Prabhu Joseph (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-6929:

Attachment: YARN-6929.2.patch

> yarn.nodemanager.remote-app-log-dir structure is not scalable
> -
>
> Key: YARN-6929
> URL: https://issues.apache.org/jira/browse/YARN-6929
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation
>Affects Versions: 2.7.3
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
> Attachments: YARN-6929.1.patch, YARN-6929.2.patch, YARN-6929.2.patch, 
> YARN-6929.patch
>
>
> The current directory structure for yarn.nodemanager.remote-app-log-dir is 
> not scalable. Maximum Subdirectory limit by default is 1048576 (HDFS-6102). 
> With retention yarn.log-aggregation.retain-seconds of 7days, there are more 
> chances LogAggregationService fails to create a new directory with 
> FSLimitException$MaxDirectoryItemsExceededException.
> The current structure is 
> <remote-app-log-dir>/<user>/logs/<applicationId>. This can be 
> improved by adding the date as a subdirectory, like 
> <remote-app-log-dir>/<user>/logs/<date>/<applicationId> 
> {code}
> WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService:
>  Application failed to init aggregation 
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.FSLimitException$MaxDirectoryItemsExceededException):
>  The directory item limit of /app-logs/yarn/logs is exceeded: limit=1048576 
> items=1048576 
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyMaxDirItems(FSDirectory.java:2021)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:2072)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedMkdir(FSDirectory.java:1841)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsRecursively(FSNamesystem.java:4351)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInternal(FSNamesystem.java:4262)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:4221)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:4194)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:813)
>  
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:600)
>  
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>  
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
>  
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962) 
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039) 
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035) 
> at java.security.AccessController.doPrivileged(Native Method) 
> at javax.security.auth.Subject.doAs(Subject.java:415) 
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>  
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033) 
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.createAppDir(LogAggregationService.java:308)
>  
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initAppAggregator(LogAggregationService.java:366)
>  
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initApp(LogAggregationService.java:320)
>  
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:443)
>  
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:67)
>  
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
>  
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) 
> at java.lang.Thread.run(Thread.java:745) 
> Caused by: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.FSLimitException$MaxDirectoryItemsExceededException):
>  The directory item limit of /app-logs/yarn/logs is exceeded: limit=1048576 
> items=1048576 
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyMaxDirItems(FSDirectory.java:2021)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:2072)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedMkdir(FSDirectory.java:1841)
>  
> at 
> 

[jira] [Updated] (YARN-6929) yarn.nodemanager.remote-app-log-dir structure is not scalable

2017-10-27 Thread Prabhu Joseph (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-6929:

Attachment: YARN-6929.3.patch

> yarn.nodemanager.remote-app-log-dir structure is not scalable
> -
>
> Key: YARN-6929
> URL: https://issues.apache.org/jira/browse/YARN-6929
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation
>Affects Versions: 2.7.3
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
> Attachments: YARN-6929.1.patch, YARN-6929.2.patch, YARN-6929.2.patch, 
> YARN-6929.3.patch, YARN-6929.patch
>
>
> The current directory structure for yarn.nodemanager.remote-app-log-dir is 
> not scalable. Maximum Subdirectory limit by default is 1048576 (HDFS-6102). 
> With retention yarn.log-aggregation.retain-seconds of 7days, there are more 
> chances LogAggregationService fails to create a new directory with 
> FSLimitException$MaxDirectoryItemsExceededException.
> The current structure is 
> <remote-app-log-dir>/<user>/logs/<applicationId>. This can be 
> improved by adding the date as a subdirectory, like 
> <remote-app-log-dir>/<user>/logs/<date>/<applicationId> 
> {code}
> WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService:
>  Application failed to init aggregation 
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.FSLimitException$MaxDirectoryItemsExceededException):
>  The directory item limit of /app-logs/yarn/logs is exceeded: limit=1048576 
> items=1048576 
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyMaxDirItems(FSDirectory.java:2021)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:2072)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedMkdir(FSDirectory.java:1841)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsRecursively(FSNamesystem.java:4351)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInternal(FSNamesystem.java:4262)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:4221)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:4194)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:813)
>  
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:600)
>  
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>  
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
>  
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962) 
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039) 
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035) 
> at java.security.AccessController.doPrivileged(Native Method) 
> at javax.security.auth.Subject.doAs(Subject.java:415) 
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>  
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033) 
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.createAppDir(LogAggregationService.java:308)
>  
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initAppAggregator(LogAggregationService.java:366)
>  
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initApp(LogAggregationService.java:320)
>  
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:443)
>  
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:67)
>  
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
>  
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) 
> at java.lang.Thread.run(Thread.java:745) 
> Caused by: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.FSLimitException$MaxDirectoryItemsExceededException):
>  The directory item limit of /app-logs/yarn/logs is exceeded: limit=1048576 
> items=1048576 
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyMaxDirItems(FSDirectory.java:2021)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:2072)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedMkdir(FSDirectory.java:1841)
>  
> at 
> 

[jira] [Commented] (YARN-6929) yarn.nodemanager.remote-app-log-dir structure is not scalable

2017-10-27 Thread Prabhu Joseph (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16222056#comment-16222056
 ] 

Prabhu Joseph commented on YARN-6929:
-

[~jlowe] [~rohithsharma] Need your help in reviewing this patch. The failing test 
case is an existing one, tracked in YARN-7299. I have done functional testing with 
the test cases below; a minimal sketch of the proposed layout follows the list.

{code}
1. New application logs get written into the correct folder structure inside 
yarn.nodemanager.remote-app-log-dir.
2. The yarn logs CLI works fine.
3. Accessing logs from the RM UI / HistoryServer UI works fine both while the job 
is running and after it completes.
{code}
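
For reference, here is a minimal sketch of the layout change under discussion, 
assuming the default yarn.nodemanager.remote-app-log-dir-suffix of "logs"; the 
class and method names are hypothetical and are not taken from the attached patch.

{code}
import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.hadoop.fs.Path;

// Illustrative sketch only: hypothetical helper, not part of the YARN-6929 patch.
// Bucketing applications by date keeps any single directory well below the
// dfs.namenode.fs-limits.max-directory-items limit (1048576 by default).
public class RemoteAppLogDirSketch {

  // Current layout: <remote-app-log-dir>/<user>/logs/<applicationId>
  static Path currentLayout(Path remoteRootLogDir, String user, String appId) {
    return new Path(new Path(new Path(remoteRootLogDir, user), "logs"), appId);
  }

  // Proposed layout: <remote-app-log-dir>/<user>/logs/<date>/<applicationId>
  static Path datedLayout(Path remoteRootLogDir, String user, String appId) {
    String date = new SimpleDateFormat("yyyyMMdd").format(new Date());
    return new Path(
        new Path(new Path(new Path(remoteRootLogDir, user), "logs"), date),
        appId);
  }
}
{code}

With the dated layout, the number of children in any single directory is bounded by 
the applications aggregated per day rather than by the whole retention window, so a 
7-day yarn.log-aggregation.retain-seconds no longer pushes one directory toward the 
item limit.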



> yarn.nodemanager.remote-app-log-dir structure is not scalable
> -
>
> Key: YARN-6929
> URL: https://issues.apache.org/jira/browse/YARN-6929
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation
>Affects Versions: 2.7.3
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
> Attachments: YARN-6929.1.patch, YARN-6929.2.patch, YARN-6929.2.patch, 
> YARN-6929.3.patch, YARN-6929.patch
>
>
> The current directory structure for yarn.nodemanager.remote-app-log-dir is 
> not scalable. The maximum number of items per directory is 1048576 by default 
> (HDFS-6102). With a yarn.log-aggregation.retain-seconds retention of 7 days, 
> there is a high chance that LogAggregationService fails to create a new 
> directory with FSLimitException$MaxDirectoryItemsExceededException.
> The current structure is 
> <remote-app-log-dir>/<user>/logs/<applicationId>. This can be 
> improved by adding the date as a subdirectory, like 
> <remote-app-log-dir>/<user>/logs/<date>/<applicationId> 
> {code}
> WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService:
>  Application failed to init aggregation 
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.FSLimitException$MaxDirectoryItemsExceededException):
>  The directory item limit of /app-logs/yarn/logs is exceeded: limit=1048576 
> items=1048576 
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyMaxDirItems(FSDirectory.java:2021)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:2072)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedMkdir(FSDirectory.java:1841)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsRecursively(FSNamesystem.java:4351)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInternal(FSNamesystem.java:4262)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:4221)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:4194)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:813)
>  
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:600)
>  
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>  
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
>  
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962) 
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039) 
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035) 
> at java.security.AccessController.doPrivileged(Native Method) 
> at javax.security.auth.Subject.doAs(Subject.java:415) 
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>  
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033) 
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.createAppDir(LogAggregationService.java:308)
>  
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initAppAggregator(LogAggregationService.java:366)
>  
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initApp(LogAggregationService.java:320)
>  
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:443)
>  
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:67)
>  
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
>  
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) 
> at java.lang.Thread.run(Thread.java:745) 
> Caused by: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.FSLimitException$MaxDirectoryItemsExceededException):
>  The directory item limit of /app-logs/yarn/logs 
