[jira] [Commented] (YARN-7451) Add missing tests to verify the presence of custom resources of RM apps and scheduler webservice endpoints

2018-06-19 Thread Gergo Repas (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16516800#comment-16516800
 ] 

Gergo Repas commented on YARN-7451:
---

[~snemeth] Thanks for the updated patch!

+1 (non-binding)

> Add missing tests to verify the presence of custom resources of RM apps and 
> scheduler webservice endpoints
> --
>
> Key: YARN-7451
> URL: https://issues.apache.org/jira/browse/YARN-7451
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, restapi
>Affects Versions: 3.0.0
>Reporter: Grant Sohn
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-7451.001.patch, YARN-7451.002.patch, 
> YARN-7451.003.patch, YARN-7451.004.patch, YARN-7451.005.patch, 
> YARN-7451.006.patch, YARN-7451.007.patch, YARN-7451.008.patch, 
> YARN-7451.009.patch, YARN-7451.010.patch, YARN-7451.011.patch, 
> YARN-7451.012.patch, YARN-7451.013.patch, YARN-7451.014.patch, 
> YARN-7451__Expose_custom_resource_types_on_RM_scheduler_API_as_flattened_map01_02.patch
>
>
>  
> Originally, this issue was about serializing custom resources along with 
> normal resources in the RM apps and scheduler webservice endpoints.
> However, since YARN-7817 implemented this first, this issue complements 
> YARN-7817 by adding several unit tests to verify the correctness of the 
> responses of these webservice endpoints.






[jira] [Commented] (YARN-7451) Add missing tests to verify the presence of custom resources of RM apps and scheduler webservice endpoints

2018-06-14 Thread Gergo Repas (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16512395#comment-16512395
 ] 

Gergo Repas commented on YARN-7451:
---

[~snemeth] Thanks for the updated patch. I'm wondering why the AppInfo class 
has been changed; I can't find any usages of the new methods.

> Add missing tests to verify the presence of custom resources of RM apps and 
> scheduler webservice endpoints
> --
>
> Key: YARN-7451
> URL: https://issues.apache.org/jira/browse/YARN-7451
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, restapi
>Affects Versions: 3.0.0
>Reporter: Grant Sohn
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-7451.001.patch, YARN-7451.002.patch, 
> YARN-7451.003.patch, YARN-7451.004.patch, YARN-7451.005.patch, 
> YARN-7451.006.patch, YARN-7451.007.patch, YARN-7451.008.patch, 
> YARN-7451.009.patch, YARN-7451.010.patch, YARN-7451.011.patch, 
> YARN-7451.012.patch, YARN-7451.013.patch, 
> YARN-7451__Expose_custom_resource_types_on_RM_scheduler_API_as_flattened_map01_02.patch
>
>
>  
> Originally, this issue was about serializing custom resources along with 
> normal resources in the RM apps and scheduler webservice endpoints.
> However, since YARN-7817 implemented this first, this issue complements 
> YARN-7817 by adding several unit tests to verify the correctness of the 
> responses of these webservice endpoints.






[jira] [Commented] (YARN-8390) Fix API incompatible changes in FairScheduler's AllocationFileLoaderService

2018-06-05 Thread Gergo Repas (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16501468#comment-16501468
 ] 

Gergo Repas commented on YARN-8390:
---

Thanks [~haibochen]!

> Fix API incompatible changes in FairScheduler's AllocationFileLoaderService
> ---
>
> Key: YARN-8390
> URL: https://issues.apache.org/jira/browse/YARN-8390
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 3.2.0
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: YARN-8390.000.patch
>
>
> YARN-8191 introduced API-incompatible changes, which would break some classes 
> in Hive (e.g. 
> https://github.com/apache/hive/blob/master/shims/scheduler/src/main/java/org/apache/hadoop/hive/schshim/FairSchedulerShim.java);
> this ticket's goal is to fix the incompatible changes.






[jira] [Updated] (YARN-8390) Fix API incompatible changes in FairScheduler's AllocationFileLoaderService

2018-06-04 Thread Gergo Repas (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gergo Repas updated YARN-8390:
--
Attachment: YARN-8390.000.patch

> Fix API incompatible changes in FairScheduler's AllocationFileLoaderService
> ---
>
> Key: YARN-8390
> URL: https://issues.apache.org/jira/browse/YARN-8390
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 3.2.0
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Attachments: YARN-8390.000.patch
>
>
> YARN-8191 introduced API-incompatible changes, which would break some classes 
> in Hive (e.g. 
> https://github.com/apache/hive/blob/master/shims/scheduler/src/main/java/org/apache/hadoop/hive/schshim/FairSchedulerShim.java);
> this ticket's goal is to fix the incompatible changes.






[jira] [Created] (YARN-8390) Fix API incompatible changes in FairScheduler's AllocationFileLoaderService

2018-06-04 Thread Gergo Repas (JIRA)
Gergo Repas created YARN-8390:
-

 Summary: Fix API incompatible changes in FairScheduler's 
AllocationFileLoaderService
 Key: YARN-8390
 URL: https://issues.apache.org/jira/browse/YARN-8390
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 3.2.0
Reporter: Gergo Repas
Assignee: Gergo Repas


YARN-8191 introduced API-incompatible changes, which would break some classes 
in Hive (e.g. 
https://github.com/apache/hive/blob/master/shims/scheduler/src/main/java/org/apache/hadoop/hive/schshim/FairSchedulerShim.java);
this ticket's goal is to fix the incompatible changes.






[jira] [Commented] (YARN-7451) Resources Types should be visible in the Cluster Apps API "resourceRequests" section

2018-05-25 Thread Gergo Repas (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16490739#comment-16490739
 ] 

Gergo Repas commented on YARN-7451:
---

[~snemeth] Thanks for the updated patch. A couple of comments:
 * {{ResourceInfoWithCustomResourceTypes.customResources}} does not have the 
unit information
 * {{ResourceInfoWithCustomResourceTypes.customResources()}}
 ** its javadoc is empty
 ** the memory and CPU resources are filtered out based on their names. The rest of 
the codebase already relies on memory/CPU occupying the first two slots of 
the array, so I think we could do the same in this method by iterating over the 
resourceInformations array starting at index 2 (see the sketch below)
 * There are checkstyle warnings
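To make the iteration concrete, here is a minimal sketch - it assumes the usual 
{{Resource#getResources()}}/{{ResourceInformation}} API with memory and vcores 
in slots 0 and 1; the class and method names are made up for the example, not 
taken from the patch:
{code:java}
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.ResourceInformation;

public final class CustomResourceExtractor {

  private CustomResourceExtractor() {
  }

  /**
   * Collects every resource except memory and vcores, keeping the unit so
   * the REST response does not lose the unit information.
   */
  public static Map<String, String> getCustomResourcesWithUnits(Resource resource) {
    Map<String, String> customResources = new HashMap<>();
    ResourceInformation[] infos = resource.getResources();
    // Memory and vcores are assumed to sit in slots 0 and 1, so start at index 2.
    for (int i = 2; i < infos.length; i++) {
      ResourceInformation info = infos[i];
      customResources.put(info.getName(), info.getValue() + " " + info.getUnits());
    }
    return customResources;
  }
}
{code}
That way the filtering does not depend on resource names and the units are kept 
alongside the values.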

> Resources Types should be visible in the Cluster Apps API "resourceRequests" 
> section
> 
>
> Key: YARN-7451
> URL: https://issues.apache.org/jira/browse/YARN-7451
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager, restapi
>Affects Versions: 3.0.0
>Reporter: Grant Sohn
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-7451.001.patch, YARN-7451.002.patch, 
> YARN-7451.003.patch, YARN-7451.004.patch, YARN-7451.005.patch, 
> YARN-7451.006.patch, YARN-7451.007.patch, YARN-7451.008.patch, 
> YARN-7451.009.patch, YARN-7451.010.patch, YARN-7451.011.patch, 
> YARN-7451.012.patch, 
> YARN-7451__Expose_custom_resource_types_on_RM_scheduler_API_as_flattened_map01_02.patch
>
>
> When running jobs that request resource types, the RM Cluster Apps API should 
> include them in the "resourceRequests" object.
> Additionally, when calling the RM scheduler API, it returns:
> {noformat}
> "childQueues": {
>     "queue": [
>         {
>             "allocatedContainers": 101,
>             "amMaxResources": {
>                 "memory": 320390,
>                 "vCores": 192
>             },
>             "amUsedResources": {
>                 "memory": 1024,
>                 "vCores": 1
>             },
>             "clusterResources": {
>                 "memory": 640779,
>                 "vCores": 384
>             },
>             "demandResources": {
>                 "memory": 103424,
>                 "vCores": 101
>             },
>             "fairResources": {
>                 "memory": 640779,
>                 "vCores": 384
>             },
>             "maxApps": 2147483647,
>             "maxResources": {
>                 "memory": 640779,
>                 "vCores": 384
>             },
>             "minResources": {
>                 "memory": 0,
>                 "vCores": 0
>             },
>             "numActiveApps": 1,
>             "numPendingApps": 0,
>             "preemptable": true,
>             "queueName": "root.users.systest",
>             "reservedContainers": 0,
>             "reservedResources": {
>                 "memory": 0,
>                 "vCores": 0
>             },
>             "schedulingPolicy": "fair",
>             "steadyFairResources": {
>                 "memory": 320390,
>                 "vCores": 192
>             },
>             "type": "fairSchedulerLeafQueueInfo",
>             "usedResources": {
>                 "memory": 103424,
>                 "vCores": 101
>             }
>         }
>     ]
> 

[jira] [Commented] (YARN-8191) Fair scheduler: queue deletion without RM restart

2018-05-25 Thread Gergo Repas (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16490382#comment-16490382
 ] 

Gergo Repas commented on YARN-8191:
---

[~haibochen] - Thanks for the review and committing the change!
[~wilfreds] [~snemeth] - Thanks for the reviews!

> Fair scheduler: queue deletion without RM restart
> -
>
> Key: YARN-8191
> URL: https://issues.apache.org/jira/browse/YARN-8191
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.0.1
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: Queue Deletion in Fair Scheduler.pdf, 
> YARN-8191.000.patch, YARN-8191.001.patch, YARN-8191.002.patch, 
> YARN-8191.003.patch, YARN-8191.004.patch, YARN-8191.005.patch, 
> YARN-8191.006.patch, YARN-8191.007.patch, YARN-8191.008.patch, 
> YARN-8191.009.patch, YARN-8191.010.patch, YARN-8191.011.patch, 
> YARN-8191.012.patch, YARN-8191.013.patch, YARN-8191.014.patch, 
> YARN-8191.015.patch, YARN-8191.016.patch, YARN-8191.017.patch
>
>
> The Fair Scheduler never cleans up queues even if they are deleted in the 
> allocation file, or were dynamically created and are never going to be used 
> again. Queues always remain in memory, which leads to the following two issues.
>  # Steady fairshares aren’t calculated correctly due to remaining queues
>  # WebUI shows deleted queues, which is confusing for users (YARN-4022).
> We want to support proper queue deletion without restarting the Resource 
> Manager:
>  # Static queues without any entries that are removed from fair-scheduler.xml 
> should be deleted from memory.
>  # Dynamic queues without any entries should be deleted.
>  # RM Web UI should only show the queues defined in the scheduler at that 
> point in time.






[jira] [Commented] (YARN-8191) Fair scheduler: queue deletion without RM restart

2018-05-24 Thread Gergo Repas (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489481#comment-16489481
 ] 

Gergo Repas commented on YARN-8191:
---

[~haibochen] - that's a nice catch; I fixed it.

> Fair scheduler: queue deletion without RM restart
> -
>
> Key: YARN-8191
> URL: https://issues.apache.org/jira/browse/YARN-8191
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.0.1
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Attachments: Queue Deletion in Fair Scheduler.pdf, 
> YARN-8191.000.patch, YARN-8191.001.patch, YARN-8191.002.patch, 
> YARN-8191.003.patch, YARN-8191.004.patch, YARN-8191.005.patch, 
> YARN-8191.006.patch, YARN-8191.007.patch, YARN-8191.008.patch, 
> YARN-8191.009.patch, YARN-8191.010.patch, YARN-8191.011.patch, 
> YARN-8191.012.patch, YARN-8191.013.patch, YARN-8191.014.patch, 
> YARN-8191.015.patch, YARN-8191.016.patch, YARN-8191.017.patch
>
>
> The Fair Scheduler never cleans up queues even if they are deleted in the 
> allocation file, or were dynamically created and are never going to be used 
> again. Queues always remain in memory, which leads to the following two issues.
>  # Steady fairshares aren’t calculated correctly due to remaining queues
>  # WebUI shows deleted queues, which is confusing for users (YARN-4022).
> We want to support proper queue deletion without restarting the Resource 
> Manager:
>  # Static queues without any entries that are removed from fair-scheduler.xml 
> should be deleted from memory.
>  # Dynamic queues without any entries should be deleted.
>  # RM Web UI should only show the queues defined in the scheduler at that 
> point in time.






[jira] [Updated] (YARN-8191) Fair scheduler: queue deletion without RM restart

2018-05-24 Thread Gergo Repas (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gergo Repas updated YARN-8191:
--
Attachment: YARN-8191.017.patch

> Fair scheduler: queue deletion without RM restart
> -
>
> Key: YARN-8191
> URL: https://issues.apache.org/jira/browse/YARN-8191
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.0.1
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Attachments: Queue Deletion in Fair Scheduler.pdf, 
> YARN-8191.000.patch, YARN-8191.001.patch, YARN-8191.002.patch, 
> YARN-8191.003.patch, YARN-8191.004.patch, YARN-8191.005.patch, 
> YARN-8191.006.patch, YARN-8191.007.patch, YARN-8191.008.patch, 
> YARN-8191.009.patch, YARN-8191.010.patch, YARN-8191.011.patch, 
> YARN-8191.012.patch, YARN-8191.013.patch, YARN-8191.014.patch, 
> YARN-8191.015.patch, YARN-8191.016.patch, YARN-8191.017.patch
>
>
> The Fair Scheduler never cleans up queues even if they are deleted in the 
> allocation file, or were dynamically created and are never going to be used 
> again. Queues always remain in memory, which leads to the following two issues.
>  # Steady fairshares aren’t calculated correctly due to remaining queues
>  # WebUI shows deleted queues, which is confusing for users (YARN-4022).
> We want to support proper queue deletion without restarting the Resource 
> Manager:
>  # Static queues without any entries that are removed from fair-scheduler.xml 
> should be deleted from memory.
>  # Dynamic queues without any entries should be deleted.
>  # RM Web UI should only show the queues defined in the scheduler at that 
> point in time.






[jira] [Commented] (YARN-8191) Fair scheduler: queue deletion without RM restart

2018-05-24 Thread Gergo Repas (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489132#comment-16489132
 ] 

Gergo Repas commented on YARN-8191:
---

Thanks [~haibochen]. I changed the return value of removeEmptyIncompatibleQueues() 
to Optional in the updated patch. Regarding the test failure - it's a 
test timeout. This test passes locally on my dev machine (I tried quite a few 
times and have not seen any failures); however, its runtime on my dev machine 
(~55 sec) is very close to the timeout (60 sec), so I think it can be prone to 
timeouts on slower machines.
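
Regarding the Optional return value, a tiny self-contained sketch of how a 
caller can treat it - assuming an {{Optional<Boolean>}} where an empty value 
means there was no incompatible queue to remove at all; the method below is 
only a stand-in for the real QueueManager code, not a copy of the patch:
{code:java}
import java.util.Optional;

public class OptionalRemovalDemo {

  // Stand-in for QueueManager#removeEmptyIncompatibleQueues: an empty Optional
  // means no incompatible queue with the given name existed in the first place.
  static Optional<Boolean> removeEmptyIncompatibleQueues(String queueName) {
    return Optional.empty();
  }

  public static void main(String[] args) {
    // Treat "nothing to remove" the same as a successful removal.
    boolean canProceed = removeEmptyIncompatibleQueues("root.users").orElse(true);
    System.out.println("can proceed with queue creation: " + canProceed);
  }
}
{code}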

> Fair scheduler: queue deletion without RM restart
> -
>
> Key: YARN-8191
> URL: https://issues.apache.org/jira/browse/YARN-8191
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.0.1
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Attachments: Queue Deletion in Fair Scheduler.pdf, 
> YARN-8191.000.patch, YARN-8191.001.patch, YARN-8191.002.patch, 
> YARN-8191.003.patch, YARN-8191.004.patch, YARN-8191.005.patch, 
> YARN-8191.006.patch, YARN-8191.007.patch, YARN-8191.008.patch, 
> YARN-8191.009.patch, YARN-8191.010.patch, YARN-8191.011.patch, 
> YARN-8191.012.patch, YARN-8191.013.patch, YARN-8191.014.patch, 
> YARN-8191.015.patch, YARN-8191.016.patch
>
>
> The Fair Scheduler never cleans up queues even if they are deleted in the 
> allocation file, or were dynamically created and are never going to be used 
> again. Queues always remain in memory, which leads to the following two issues.
>  # Steady fairshares aren’t calculated correctly due to remaining queues
>  # WebUI shows deleted queues, which is confusing for users (YARN-4022).
> We want to support proper queue deletion without restarting the Resource 
> Manager:
>  # Static queues without any entries that are removed from fair-scheduler.xml 
> should be deleted from memory.
>  # Dynamic queues without any entries should be deleted.
>  # RM Web UI should only show the queues defined in the scheduler at that 
> point in time.






[jira] [Updated] (YARN-8191) Fair scheduler: queue deletion without RM restart

2018-05-24 Thread Gergo Repas (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gergo Repas updated YARN-8191:
--
Attachment: YARN-8191.016.patch

> Fair scheduler: queue deletion without RM restart
> -
>
> Key: YARN-8191
> URL: https://issues.apache.org/jira/browse/YARN-8191
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.0.1
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Attachments: Queue Deletion in Fair Scheduler.pdf, 
> YARN-8191.000.patch, YARN-8191.001.patch, YARN-8191.002.patch, 
> YARN-8191.003.patch, YARN-8191.004.patch, YARN-8191.005.patch, 
> YARN-8191.006.patch, YARN-8191.007.patch, YARN-8191.008.patch, 
> YARN-8191.009.patch, YARN-8191.010.patch, YARN-8191.011.patch, 
> YARN-8191.012.patch, YARN-8191.013.patch, YARN-8191.014.patch, 
> YARN-8191.015.patch, YARN-8191.016.patch
>
>
> The Fair Scheduler never cleans up queues even if they are deleted in the 
> allocation file, or were dynamically created and are never going to be used 
> again. Queues always remain in memory, which leads to the following two issues.
>  # Steady fairshares aren’t calculated correctly due to remaining queues
>  # WebUI shows deleted queues, which is confusing for users (YARN-4022).
> We want to support proper queue deletion without restarting the Resource 
> Manager:
>  # Static queues without any entries that are removed from fair-scheduler.xml 
> should be deleted from memory.
>  # Dynamic queues without any entries should be deleted.
>  # RM Web UI should only show the queues defined in the scheduler at that 
> point in time.






[jira] [Commented] (YARN-8191) Fair scheduler: queue deletion without RM restart

2018-05-23 Thread Gergo Repas (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487573#comment-16487573
 ] 

Gergo Repas commented on YARN-8191:
---

Thanks [~haibochen] for the review!

{quote}
In this patch,  getRemovedStaticQueues() and QueueManager.setQueuesToDynamic() 
together identify the queues that were added in the previous allocation file, 
but are now removed in the new allocation file, and then mark them as dynamic. 
What I mean is: is it possible that some queues that were created 
dynamically are now included in the new allocation file? If so, we need to 
mark them as static.
{quote}

This is done in the {{QueueManager.ensureQueueExistsAndIsCompatibleAndIsStatic()}} 
method; it's taken care of by the {{queue.setDynamic(false);}} line.

{quote}
The behavior of  QueueManager.removeLeafQueue() is still changed with the 
refactoring.  Previously it would return true if there is no incompatible queue 
found, but it now returns false.  We should also return true if 
removeEmptyIncompatibleQueues(name, FSQueueType.PARENT) returns null.  
Similarly, in IncompatibleQueueRemovalTask.execute(), the task shall be removed 
if `removed == null`.
{quote}

Thanks, I've corrected these conditions.

{quote}
Let's add some javadoc to newly added QueueManager public methods.
{quote}

Sure, I added them.

{quote}
`reloadListener.onCheck();` makes me worried: what if the listener is not set? 
Looking closely at the code, setReloadListener() is always called right after 
the AllocationFileLoaderService constructor, so I think we can make 
reloadListener a constructor argument, so that we never have to worry about the 
listener being null.{quote}

The Listener is now a constructor argument in the latest patch.
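Just to illustrate the shape of it, a toy sketch - this is not the actual 
AllocationFileLoaderService code, the class and member names are made up for 
the example:
{code:java}
// Toy illustration of constructor injection for the reload listener.
public class ReloadingService {

  public interface Listener {
    void onReload();
    void onCheck();
  }

  private final Listener reloadListener;

  // Requiring the listener at construction time means onCheck()/onReload()
  // can never run against a null listener.
  public ReloadingService(Listener reloadListener) {
    if (reloadListener == null) {
      throw new IllegalArgumentException("reloadListener must not be null");
    }
    this.reloadListener = reloadListener;
  }

  void checkOnce() {
    reloadListener.onCheck();
  }
}
{code}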

> Fair scheduler: queue deletion without RM restart
> -
>
> Key: YARN-8191
> URL: https://issues.apache.org/jira/browse/YARN-8191
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.0.1
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Attachments: Queue Deletion in Fair Scheduler.pdf, 
> YARN-8191.000.patch, YARN-8191.001.patch, YARN-8191.002.patch, 
> YARN-8191.003.patch, YARN-8191.004.patch, YARN-8191.005.patch, 
> YARN-8191.006.patch, YARN-8191.007.patch, YARN-8191.008.patch, 
> YARN-8191.009.patch, YARN-8191.010.patch, YARN-8191.011.patch, 
> YARN-8191.012.patch, YARN-8191.013.patch, YARN-8191.014.patch, 
> YARN-8191.015.patch
>
>
> The Fair Scheduler never cleans up queues even if they are deleted in the 
> allocation file, or were dynamically created and are never going to be used 
> again. Queues always remain in memory, which leads to the following two issues.
>  # Steady fairshares aren’t calculated correctly due to remaining queues
>  # WebUI shows deleted queues, which is confusing for users (YARN-4022).
> We want to support proper queue deletion without restarting the Resource 
> Manager:
>  # Static queues without any entries that are removed from fair-scheduler.xml 
> should be deleted from memory.
>  # Dynamic queues without any entries should be deleted.
>  # RM Web UI should only show the queues defined in the scheduler at that 
> point in time.






[jira] [Updated] (YARN-8191) Fair scheduler: queue deletion without RM restart

2018-05-23 Thread Gergo Repas (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gergo Repas updated YARN-8191:
--
Attachment: YARN-8191.015.patch

> Fair scheduler: queue deletion without RM restart
> -
>
> Key: YARN-8191
> URL: https://issues.apache.org/jira/browse/YARN-8191
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.0.1
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Attachments: Queue Deletion in Fair Scheduler.pdf, 
> YARN-8191.000.patch, YARN-8191.001.patch, YARN-8191.002.patch, 
> YARN-8191.003.patch, YARN-8191.004.patch, YARN-8191.005.patch, 
> YARN-8191.006.patch, YARN-8191.007.patch, YARN-8191.008.patch, 
> YARN-8191.009.patch, YARN-8191.010.patch, YARN-8191.011.patch, 
> YARN-8191.012.patch, YARN-8191.013.patch, YARN-8191.014.patch, 
> YARN-8191.015.patch
>
>
> The Fair Scheduler never cleans up queues even if they are deleted in the 
> allocation file, or were dynamically created and are never going to be used 
> again. Queues always remain in memory, which leads to the following two issues.
>  # Steady fairshares aren’t calculated correctly due to remaining queues
>  # WebUI shows deleted queues, which is confusing for users (YARN-4022).
> We want to support proper queue deletion without restarting the Resource 
> Manager:
>  # Static queues without any entries that are removed from fair-scheduler.xml 
> should be deleted from memory.
>  # Dynamic queues without any entries should be deleted.
>  # RM Web UI should only show the queues defined in the scheduler at that 
> point in time.






[jira] [Commented] (YARN-8273) Log aggregation does not warn if HDFS quota in target directory is exceeded

2018-05-23 Thread Gergo Repas (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16486937#comment-16486937
 ] 

Gergo Repas commented on YARN-8273:
---

[~rkanter] - Thanks for the review and committing the change!
[~snemeth] - Thanks for the review!

> Log aggregation does not warn if HDFS quota in target directory is exceeded
> ---
>
> Key: YARN-8273
> URL: https://issues.apache.org/jira/browse/YARN-8273
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation
>Affects Versions: 3.1.0
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: YARN-8273.000.patch, YARN-8273.001.patch, 
> YARN-8273.002.patch, YARN-8273.003.patch, YARN-8273.004.patch, 
> YARN-8273.005.patch, YARN-8273.006.patch
>
>
> It appears that if an HDFS space quota is set on a target directory for log 
> aggregation and the quota is already exceeded when log aggregation is 
> attempted, zero-byte log files will be written to the HDFS directory; however, 
> NodeManager logs do not reflect a failure to write the files successfully 
> (i.e. there are no ERROR or WARN messages to this effect).
> It may be worth investigating an improvement that alerts users to this scenario, as 
> otherwise logs for a YARN application may be missing both on HDFS and locally 
> (after local log cleanup is done) and the user may not otherwise be informed.
> Steps to reproduce:
> * Set a small HDFS space quota on /tmp/logs/username/logs (e.g. 2MB)
> * Write files to HDFS such that /tmp/logs/username/logs is almost 2MB full
> * Run a Spark or MR job in the cluster
> * Observe that zero byte files are written to HDFS after job completion
> * Observe that YARN container logs are also not present on the NM hosts (or 
> are deleted after yarn.nodemanager.delete.debug-delay-sec)
> * Observe that no ERROR or WARN messages appear to be logged in the NM role 
> log






[jira] [Commented] (YARN-8273) Log aggregation does not warn if HDFS quota in target directory is exceeded

2018-05-22 Thread Gergo Repas (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16483871#comment-16483871
 ] 

Gergo Repas commented on YARN-8273:
---

[~rkanter] Thanks for the review! Yes, indeed LogAggregationDFSException can be 
a checked exception (and a subclass of YarnException); I've updated the patch.
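For reference, the rough shape of such a checked exception - a sketch only; it 
assumes {{org.apache.hadoop.yarn.exceptions.YarnException}}, and the 
constructors here are my own choice for the example, so the class in the patch 
may differ:
{code:java}
import org.apache.hadoop.yarn.exceptions.YarnException;

// Sketch of a checked exception wrapping DFS errors hit during log aggregation.
public class LogAggregationDFSException extends YarnException {

  private static final long serialVersionUID = 1L;

  public LogAggregationDFSException(String message, Throwable cause) {
    super(message, cause);
  }

  public LogAggregationDFSException(Throwable cause) {
    super(cause);
  }
}
{code}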

> Log aggregation does not warn if HDFS quota in target directory is exceeded
> ---
>
> Key: YARN-8273
> URL: https://issues.apache.org/jira/browse/YARN-8273
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation
>Affects Versions: 3.1.0
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Attachments: YARN-8273.000.patch, YARN-8273.001.patch, 
> YARN-8273.002.patch, YARN-8273.003.patch, YARN-8273.004.patch, 
> YARN-8273.005.patch, YARN-8273.006.patch
>
>
> It appears that if an HDFS space quota is set on a target directory for log 
> aggregation and the quota is already exceeded when log aggregation is 
> attempted, zero-byte log files will be written to the HDFS directory; however, 
> NodeManager logs do not reflect a failure to write the files successfully 
> (i.e. there are no ERROR or WARN messages to this effect).
> It may be worth investigating an improvement that alerts users to this scenario, as 
> otherwise logs for a YARN application may be missing both on HDFS and locally 
> (after local log cleanup is done) and the user may not otherwise be informed.
> Steps to reproduce:
> * Set a small HDFS space quota on /tmp/logs/username/logs (e.g. 2MB)
> * Write files to HDFS such that /tmp/logs/username/logs is almost 2MB full
> * Run a Spark or MR job in the cluster
> * Observe that zero byte files are written to HDFS after job completion
> * Observe that YARN container logs are also not present on the NM hosts (or 
> are deleted after yarn.nodemanager.delete.debug-delay-sec)
> * Observe that no ERROR or WARN messages appear to be logged in the NM role 
> log






[jira] [Updated] (YARN-8273) Log aggregation does not warn if HDFS quota in target directory is exceeded

2018-05-22 Thread Gergo Repas (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gergo Repas updated YARN-8273:
--
Attachment: YARN-8273.006.patch

> Log aggregation does not warn if HDFS quota in target directory is exceeded
> ---
>
> Key: YARN-8273
> URL: https://issues.apache.org/jira/browse/YARN-8273
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation
>Affects Versions: 3.1.0
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Attachments: YARN-8273.000.patch, YARN-8273.001.patch, 
> YARN-8273.002.patch, YARN-8273.003.patch, YARN-8273.004.patch, 
> YARN-8273.005.patch, YARN-8273.006.patch
>
>
> It appears that if an HDFS space quota is set on a target directory for log 
> aggregation and the quota is already exceeded when log aggregation is 
> attempted, zero-byte log files will be written to the HDFS directory; however, 
> NodeManager logs do not reflect a failure to write the files successfully 
> (i.e. there are no ERROR or WARN messages to this effect).
> It may be worth investigating an improvement that alerts users to this scenario, as 
> otherwise logs for a YARN application may be missing both on HDFS and locally 
> (after local log cleanup is done) and the user may not otherwise be informed.
> Steps to reproduce:
> * Set a small HDFS space quota on /tmp/logs/username/logs (e.g. 2MB)
> * Write files to HDFS such that /tmp/logs/username/logs is almost 2MB full
> * Run a Spark or MR job in the cluster
> * Observe that zero byte files are written to HDFS after job completion
> * Observe that YARN container logs are also not present on the NM hosts (or 
> are deleted after yarn.nodemanager.delete.debug-delay-sec)
> * Observe that no ERROR or WARN messages appear to be logged in the NM role 
> log






[jira] [Updated] (YARN-8191) Fair scheduler: queue deletion without RM restart

2018-05-22 Thread Gergo Repas (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gergo Repas updated YARN-8191:
--
Attachment: YARN-8191.014.patch

> Fair scheduler: queue deletion without RM restart
> -
>
> Key: YARN-8191
> URL: https://issues.apache.org/jira/browse/YARN-8191
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.0.1
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Attachments: Queue Deletion in Fair Scheduler.pdf, 
> YARN-8191.000.patch, YARN-8191.001.patch, YARN-8191.002.patch, 
> YARN-8191.003.patch, YARN-8191.004.patch, YARN-8191.005.patch, 
> YARN-8191.006.patch, YARN-8191.007.patch, YARN-8191.008.patch, 
> YARN-8191.009.patch, YARN-8191.010.patch, YARN-8191.011.patch, 
> YARN-8191.012.patch, YARN-8191.013.patch, YARN-8191.014.patch
>
>
> The Fair Scheduler never cleans up queues even if they are deleted in the 
> allocation file, or were dynamically created and are never going to be used 
> again. Queues always remain in memory, which leads to the following two issues.
>  # Steady fairshares aren’t calculated correctly due to remaining queues
>  # WebUI shows deleted queues, which is confusing for users (YARN-4022).
> We want to support proper queue deletion without restarting the Resource 
> Manager:
>  # Static queues without any entries that are removed from fair-scheduler.xml 
> should be deleted from memory.
>  # Dynamic queues without any entries should be deleted.
>  # RM Web UI should only show the queues defined in the scheduler at that 
> point in time.






[jira] [Commented] (YARN-8191) Fair scheduler: queue deletion without RM restart

2018-05-22 Thread Gergo Repas (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16483864#comment-16483864
 ] 

Gergo Repas commented on YARN-8191:
---

[~haibochen] Thanks for the review!
1) - Good point; I fixed it.
2) - This logic originates from a suggestion by [~wilfreds] (Wilfred - please 
correct me if I'm wrong about the intentions behind {{getRemovedStaticQueues(), 
setQueuesToDynamic()}}). The point here is that the set of removed queues can 
be gathered in {{AllocationReloadListener.onReload()}} outside of the 
writeLock. It's safe to do so because onReload() is only called from the 
synchronized {{AllocationFileLoaderService.reloadAllocations()}} method. This 
way the {{AllocationReloadListener.getRemovedStaticQueues()}} logic is subject 
to the least amount of locking. Thread safety was indeed missing for 
{{QueueManager.setQueuesToDynamic()}}; I've added the missing synchronized 
block (see the sketch after this list).
3) Sorry, what do you mean by "What about the other case where some dynamic 
queues are not added as static in the new allocation file?". If you mean 
dynamic queue creation via application submission, the test case for this (plus 
the removal) is {{TestQueueManager.testRemovalOfDynamicLeafQueue()}}.
4-5) I have refactored this part of the code: I removed 
getIncompatibleQueueName() and changed only the return type of 
removeEmptyIncompatibleQueues() to indicate whether there was no queue to 
remove in the first place.
6) {{updateAllocationConfiguration()}} is only called when the configuration 
file has been modified, so if, for example, there's only one configuration 
modification during the lifetime of the RM, incompatible queues would not be 
cleaned up until a restart.
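
To make the locking in 2) concrete, a small self-contained toy version of the 
kind of synchronized bookkeeping I mean - the class and field names are 
invented for the example, this is not the actual QueueManager code:
{code:java}
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Toy stand-in for the QueueManager bookkeeping: marking queues as dynamic
// under a lock so a concurrent reload cannot race with the update.
public class DynamicQueueFlags {

  private final Map<String, Boolean> dynamicByQueueName = new HashMap<>();

  // Called from the reload path with the set of removed static queues;
  // marking them dynamic lets the periodic cleanup delete them once empty.
  public void setQueuesToDynamic(Set<String> queueNames) {
    synchronized (dynamicByQueueName) {
      for (String name : queueNames) {
        dynamicByQueueName.put(name, Boolean.TRUE);
      }
    }
  }

  public boolean isDynamic(String queueName) {
    synchronized (dynamicByQueueName) {
      return Boolean.TRUE.equals(dynamicByQueueName.get(queueName));
    }
  }
}
{code}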

> Fair scheduler: queue deletion without RM restart
> -
>
> Key: YARN-8191
> URL: https://issues.apache.org/jira/browse/YARN-8191
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.0.1
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Attachments: Queue Deletion in Fair Scheduler.pdf, 
> YARN-8191.000.patch, YARN-8191.001.patch, YARN-8191.002.patch, 
> YARN-8191.003.patch, YARN-8191.004.patch, YARN-8191.005.patch, 
> YARN-8191.006.patch, YARN-8191.007.patch, YARN-8191.008.patch, 
> YARN-8191.009.patch, YARN-8191.010.patch, YARN-8191.011.patch, 
> YARN-8191.012.patch, YARN-8191.013.patch
>
>
> The Fair Scheduler never cleans up queues even if they are deleted in the 
> allocation file, or were dynamically created and are never going to be used 
> again. Queues always remain in memory, which leads to the following two issues.
>  # Steady fairshares aren’t calculated correctly due to remaining queues
>  # WebUI shows deleted queues, which is confusing for users (YARN-4022).
> We want to support proper queue deletion without restarting the Resource 
> Manager:
>  # Static queues without any entries that are removed from fair-scheduler.xml 
> should be deleted from memory.
>  # Dynamic queues without any entries should be deleted.
>  # RM Web UI should only show the queues defined in the scheduler at that 
> point in time.






[jira] [Updated] (YARN-8273) Log aggregation does not warn if HDFS quota in target directory is exceeded

2018-05-22 Thread Gergo Repas (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gergo Repas updated YARN-8273:
--
Attachment: YARN-8273.005.patch

> Log aggregation does not warn if HDFS quota in target directory is exceeded
> ---
>
> Key: YARN-8273
> URL: https://issues.apache.org/jira/browse/YARN-8273
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation
>Affects Versions: 3.1.0
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Attachments: YARN-8273.000.patch, YARN-8273.001.patch, 
> YARN-8273.002.patch, YARN-8273.003.patch, YARN-8273.004.patch, 
> YARN-8273.005.patch
>
>
> It appears that if an HDFS space quota is set on a target directory for log 
> aggregation and the quota is already exceeded when log aggregation is 
> attempted, zero-byte log files will be written to the HDFS directory; however, 
> NodeManager logs do not reflect a failure to write the files successfully 
> (i.e. there are no ERROR or WARN messages to this effect).
> It may be worth investigating an improvement that alerts users to this scenario, as 
> otherwise logs for a YARN application may be missing both on HDFS and locally 
> (after local log cleanup is done) and the user may not otherwise be informed.
> Steps to reproduce:
> * Set a small HDFS space quota on /tmp/logs/username/logs (e.g. 2MB)
> * Write files to HDFS such that /tmp/logs/username/logs is almost 2MB full
> * Run a Spark or MR job in the cluster
> * Observe that zero byte files are written to HDFS after job completion
> * Observe that YARN container logs are also not present on the NM hosts (or 
> are deleted after yarn.nodemanager.delete.debug-delay-sec)
> * Observe that no ERROR or WARN messages appear to be logged in the NM role 
> log






[jira] [Commented] (YARN-8191) Fair scheduler: queue deletion without RM restart

2018-05-18 Thread Gergo Repas (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480478#comment-16480478
 ] 

Gergo Repas commented on YARN-8191:
---

[~wilfreds] Thanks for the review! All your suggestions make sense; I uploaded 
a new version. I also extended the testRemovalOfIncompatibleNonEmptyQueue test 
case to check that we don't lose the IncompatibleQueueRemovalTask during 
queueManager.removePendingIncompatibleQueues().

> Fair scheduler: queue deletion without RM restart
> -
>
> Key: YARN-8191
> URL: https://issues.apache.org/jira/browse/YARN-8191
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.0.1
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Attachments: Queue Deletion in Fair Scheduler.pdf, 
> YARN-8191.000.patch, YARN-8191.001.patch, YARN-8191.002.patch, 
> YARN-8191.003.patch, YARN-8191.004.patch, YARN-8191.005.patch, 
> YARN-8191.006.patch, YARN-8191.007.patch, YARN-8191.008.patch, 
> YARN-8191.009.patch, YARN-8191.010.patch, YARN-8191.011.patch, 
> YARN-8191.012.patch, YARN-8191.013.patch
>
>
> The Fair Scheduler never cleans up queues even if they are deleted in the 
> allocation file, or were dynamically created and are never going to be used 
> again. Queues always remain in memory, which leads to the following two issues.
>  # Steady fairshares aren’t calculated correctly due to remaining queues
>  # WebUI shows deleted queues, which is confusing for users (YARN-4022).
> We want to support proper queue deletion without restarting the Resource 
> Manager:
>  # Static queues without any entries that are removed from fair-scheduler.xml 
> should be deleted from memory.
>  # Dynamic queues without any entries should be deleted.
>  # RM Web UI should only show the queues defined in the scheduler at that 
> point in time.






[jira] [Updated] (YARN-8191) Fair scheduler: queue deletion without RM restart

2018-05-18 Thread Gergo Repas (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gergo Repas updated YARN-8191:
--
Attachment: YARN-8191.013.patch

> Fair scheduler: queue deletion without RM restart
> -
>
> Key: YARN-8191
> URL: https://issues.apache.org/jira/browse/YARN-8191
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.0.1
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Attachments: Queue Deletion in Fair Scheduler.pdf, 
> YARN-8191.000.patch, YARN-8191.001.patch, YARN-8191.002.patch, 
> YARN-8191.003.patch, YARN-8191.004.patch, YARN-8191.005.patch, 
> YARN-8191.006.patch, YARN-8191.007.patch, YARN-8191.008.patch, 
> YARN-8191.009.patch, YARN-8191.010.patch, YARN-8191.011.patch, 
> YARN-8191.012.patch, YARN-8191.013.patch
>
>
> The Fair Scheduler never cleans up queues even if they are deleted in the 
> allocation file, or were dynamically created and are never going to be used 
> again. Queues always remain in memory, which leads to the following two issues.
>  # Steady fairshares aren’t calculated correctly due to remaining queues
>  # WebUI shows deleted queues, which is confusing for users (YARN-4022).
> We want to support proper queue deletion without restarting the Resource 
> Manager:
>  # Static queues without any entries that are removed from fair-scheduler.xml 
> should be deleted from memory.
>  # Dynamic queues without any entries should be deleted.
>  # RM Web UI should only show the queues defined in the scheduler at that 
> point in time.






[jira] [Updated] (YARN-8273) Log aggregation does not warn if HDFS quota in target directory is exceeded

2018-05-18 Thread Gergo Repas (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gergo Repas updated YARN-8273:
--
Attachment: YARN-8273.004.patch

> Log aggregation does not warn if HDFS quota in target directory is exceeded
> ---
>
> Key: YARN-8273
> URL: https://issues.apache.org/jira/browse/YARN-8273
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation
>Affects Versions: 3.1.0
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Attachments: YARN-8273.000.patch, YARN-8273.001.patch, 
> YARN-8273.002.patch, YARN-8273.003.patch, YARN-8273.004.patch
>
>
> It appears that if an HDFS space quota is set on a target directory for log 
> aggregation and the quota is already exceeded when log aggregation is 
> attempted, zero-byte log files will be written to the HDFS directory; however, 
> NodeManager logs do not reflect a failure to write the files successfully 
> (i.e. there are no ERROR or WARN messages to this effect).
> It may be worth investigating an improvement that alerts users to this scenario, as 
> otherwise logs for a YARN application may be missing both on HDFS and locally 
> (after local log cleanup is done) and the user may not otherwise be informed.
> Steps to reproduce:
> * Set a small HDFS space quota on /tmp/logs/username/logs (e.g. 2MB)
> * Write files to HDFS such that /tmp/logs/username/logs is almost 2MB full
> * Run a Spark or MR job in the cluster
> * Observe that zero byte files are written to HDFS after job completion
> * Observe that YARN container logs are also not present on the NM hosts (or 
> are deleted after yarn.nodemanager.delete.debug-delay-sec)
> * Observe that no ERROR or WARN messages appear to be logged in the NM role 
> log






[jira] [Updated] (YARN-8273) Log aggregation does not warn if HDFS quota in target directory is exceeded

2018-05-18 Thread Gergo Repas (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gergo Repas updated YARN-8273:
--
Attachment: (was: YARN-8273.003.patch)

> Log aggregation does not warn if HDFS quota in target directory is exceeded
> ---
>
> Key: YARN-8273
> URL: https://issues.apache.org/jira/browse/YARN-8273
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation
>Affects Versions: 3.1.0
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Attachments: YARN-8273.000.patch, YARN-8273.001.patch, 
> YARN-8273.002.patch, YARN-8273.003.patch, YARN-8273.004.patch
>
>
> It appears that if an HDFS space quota is set on a target directory for log 
> aggregation and the quota is already exceeded when log aggregation is 
> attempted, zero-byte log files will be written to the HDFS directory; however, 
> NodeManager logs do not reflect a failure to write the files successfully 
> (i.e. there are no ERROR or WARN messages to this effect).
> It may be worth investigating an improvement that alerts users to this scenario, as 
> otherwise logs for a YARN application may be missing both on HDFS and locally 
> (after local log cleanup is done) and the user may not otherwise be informed.
> Steps to reproduce:
> * Set a small HDFS space quota on /tmp/logs/username/logs (e.g. 2MB)
> * Write files to HDFS such that /tmp/logs/username/logs is almost 2MB full
> * Run a Spark or MR job in the cluster
> * Observe that zero byte files are written to HDFS after job completion
> * Observe that YARN container logs are also not present on the NM hosts (or 
> are deleted after yarn.nodemanager.delete.debug-delay-sec)
> * Observe that no ERROR or WARN messages appear to be logged in the NM role 
> log






[jira] [Updated] (YARN-8273) Log aggregation does not warn if HDFS quota in target directory is exceeded

2018-05-18 Thread Gergo Repas (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gergo Repas updated YARN-8273:
--
Attachment: YARN-8273.003.patch

> Log aggregation does not warn if HDFS quota in target directory is exceeded
> ---
>
> Key: YARN-8273
> URL: https://issues.apache.org/jira/browse/YARN-8273
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation
>Affects Versions: 3.1.0
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Attachments: YARN-8273.000.patch, YARN-8273.001.patch, 
> YARN-8273.002.patch, YARN-8273.003.patch, YARN-8273.003.patch
>
>
> It appears that if an HDFS space quota is set on a target directory for log 
> aggregation and the quota is already exceeded when log aggregation is 
> attempted, zero-byte log files will be written to the HDFS directory; however, 
> NodeManager logs do not reflect a failure to write the files successfully 
> (i.e. there are no ERROR or WARN messages to this effect).
> It may be worth investigating an improvement that alerts users to this scenario, as 
> otherwise logs for a YARN application may be missing both on HDFS and locally 
> (after local log cleanup is done) and the user may not otherwise be informed.
> Steps to reproduce:
> * Set a small HDFS space quota on /tmp/logs/username/logs (e.g. 2MB)
> * Write files to HDFS such that /tmp/logs/username/logs is almost 2MB full
> * Run a Spark or MR job in the cluster
> * Observe that zero byte files are written to HDFS after job completion
> * Observe that YARN container logs are also not present on the NM hosts (or 
> are deleted after yarn.nodemanager.delete.debug-delay-sec)
> * Observe that no ERROR or WARN messages appear to be logged in the NM role 
> log






[jira] [Updated] (YARN-8273) Log aggregation does not warn if HDFS quota in target directory is exceeded

2018-05-17 Thread Gergo Repas (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gergo Repas updated YARN-8273:
--
Attachment: YARN-8273.003.patch

> Log aggregation does not warn if HDFS quota in target directory is exceeded
> ---
>
> Key: YARN-8273
> URL: https://issues.apache.org/jira/browse/YARN-8273
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation
>Affects Versions: 3.1.0
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Attachments: YARN-8273.000.patch, YARN-8273.001.patch, 
> YARN-8273.002.patch, YARN-8273.003.patch
>
>
> It appears that if an HDFS space quota is set on a target directory for log 
> aggregation and the quota is already exceeded when log aggregation is 
> attempted, zero-byte log files will be written to the HDFS directory; however, 
> NodeManager logs do not reflect a failure to write the files successfully 
> (i.e. there are no ERROR or WARN messages to this effect).
> It may be worth investigating an improvement that alerts users to this scenario, as 
> otherwise logs for a YARN application may be missing both on HDFS and locally 
> (after local log cleanup is done) and the user may not otherwise be informed.
> Steps to reproduce:
> * Set a small HDFS space quota on /tmp/logs/username/logs (e.g. 2MB)
> * Write files to HDFS such that /tmp/logs/username/logs is almost 2MB full
> * Run a Spark or MR job in the cluster
> * Observe that zero byte files are written to HDFS after job completion
> * Observe that YARN container logs are also not present on the NM hosts (or 
> are deleted after yarn.nodemanager.delete.debug-delay-sec)
> * Observe that no ERROR or WARN messages appear to be logged in the NM role 
> log






[jira] [Commented] (YARN-8310) Handle old NMTokenIdentifier, AMRMTokenIdentifier, and ContainerTokenIdentifier formats

2018-05-17 Thread Gergo Repas (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16479160#comment-16479160
 ] 

Gergo Repas commented on YARN-8310:
---

[~rkanter] Thanks for working on this patch. Other than the checkstyle and 
asflicense warnings, LGTM.

> Handle old NMTokenIdentifier, AMRMTokenIdentifier, and 
> ContainerTokenIdentifier formats
> ---
>
> Key: YARN-8310
> URL: https://issues.apache.org/jira/browse/YARN-8310
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Robert Kanter
>Assignee: Robert Kanter
>Priority: Major
> Attachments: YARN-8310.001.patch, YARN-8310.branch-2.001.patch
>
>
> In some recent upgrade testing, we saw this error causing the NodeManager to 
> fail to startup afterwards:
> {noformat}
> org.apache.hadoop.service.ServiceStateException: 
> com.google.protobuf.InvalidProtocolBufferException: Protocol message 
> contained an invalid tag (zero).
>   at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:173)
>   at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:441)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:834)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:895)
> Caused by: com.google.protobuf.InvalidProtocolBufferException: Protocol 
> message contained an invalid tag (zero).
>   at 
> com.google.protobuf.InvalidProtocolBufferException.invalidTag(InvalidProtocolBufferException.java:89)
>   at 
> com.google.protobuf.CodedInputStream.readTag(CodedInputStream.java:108)
>   at 
> org.apache.hadoop.yarn.proto.YarnSecurityTokenProtos$ContainerTokenIdentifierProto.<init>(YarnSecurityTokenProtos.java:1860)
>   at 
> org.apache.hadoop.yarn.proto.YarnSecurityTokenProtos$ContainerTokenIdentifierProto.<init>(YarnSecurityTokenProtos.java:1824)
>   at 
> org.apache.hadoop.yarn.proto.YarnSecurityTokenProtos$ContainerTokenIdentifierProto$1.parsePartialFrom(YarnSecurityTokenProtos.java:2016)
>   at 
> org.apache.hadoop.yarn.proto.YarnSecurityTokenProtos$ContainerTokenIdentifierProto$1.parsePartialFrom(YarnSecurityTokenProtos.java:2011)
>   at 
> com.google.protobuf.AbstractParser.parsePartialFrom(AbstractParser.java:200)
>   at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:217)
>   at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:223)
>   at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
>   at 
> org.apache.hadoop.yarn.proto.YarnSecurityTokenProtos$ContainerTokenIdentifierProto.parseFrom(YarnSecurityTokenProtos.java:2686)
>   at 
> org.apache.hadoop.yarn.security.ContainerTokenIdentifier.readFields(ContainerTokenIdentifier.java:254)
>   at 
> org.apache.hadoop.security.token.Token.decodeIdentifier(Token.java:177)
>   at 
> org.apache.hadoop.yarn.server.utils.BuilderUtils.newContainerTokenIdentifier(BuilderUtils.java:322)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recoverContainer(ContainerManagerImpl.java:455)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover(ContainerManagerImpl.java:373)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:316)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>   ... 5 more
> {noformat}
> The NodeManager fails because it's trying to read a 
> {{ContainerTokenIdentifier}} in the "old" format from before we changed them to 
> protobufs (YARN-668).  This is very similar to YARN-5594, where we ran into the 
> same kind of problem with the ResourceManager and RM Delegation Tokens.
> To provide a better experience, we should make the code fall back to reading the 
> old format whenever it's unable to read a token using the new format.  We didn't 
> run into any errors with the other two types of tokens that YARN-668 incompatibly 
> changed (NMTokenIdentifier and AMRMTokenIdentifier), but we may as well fix 
> those while we're at it.
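
A minimal sketch of the proposed fallback pattern (illustrative only: the Decoder 
interface and decodeWithFallback name are invented for this example and are not 
existing Hadoop APIs):

{code:java}
import com.google.protobuf.InvalidProtocolBufferException;
import java.io.IOException;

// Minimal sketch of the proposed fallback; Decoder and decodeWithFallback are
// names invented for this example, not existing Hadoop APIs.
final class TokenFormatFallback {

  interface Decoder<T> {
    T decode(byte[] bytes) throws IOException;
  }

  // Try the current protobuf-based format first. Pre-YARN-668 tokens are not
  // protobuf messages, so parsing them fails with
  // InvalidProtocolBufferException ("invalid tag (zero)"), and we retry with
  // the legacy decoder instead of failing NodeManager recovery.
  static <T> T decodeWithFallback(byte[] tokenBytes,
      Decoder<T> protobufDecoder, Decoder<T> legacyDecoder) throws IOException {
    try {
      return protobufDecoder.decode(tokenBytes);
    } catch (InvalidProtocolBufferException e) {
      return legacyDecoder.decode(tokenBytes);
    }
  }
}
{code}

The legacy decoder would wrap the pre-YARN-668 Writable-based parsing, so tokens 
recovered from the NM state store keep working after an upgrade.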



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8191) Fair scheduler: queue deletion without RM restart

2018-05-17 Thread Gergo Repas (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gergo Repas updated YARN-8191:
--
Attachment: YARN-8191.012.patch

> Fair scheduler: queue deletion without RM restart
> -
>
> Key: YARN-8191
> URL: https://issues.apache.org/jira/browse/YARN-8191
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.0.1
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Attachments: Queue Deletion in Fair Scheduler.pdf, 
> YARN-8191.000.patch, YARN-8191.001.patch, YARN-8191.002.patch, 
> YARN-8191.003.patch, YARN-8191.004.patch, YARN-8191.005.patch, 
> YARN-8191.006.patch, YARN-8191.007.patch, YARN-8191.008.patch, 
> YARN-8191.009.patch, YARN-8191.010.patch, YARN-8191.011.patch, 
> YARN-8191.012.patch
>
>
> The Fair Scheduler never cleans up queues even if they are deleted in the 
> allocation file, or were dynamically created and are never going to be used 
> again. Queues always remain in memory which leads to two following issues.
>  # Steady fairshares aren’t calculated correctly due to remaining queues
>  # WebUI shows deleted queues, which is confusing for users (YARN-4022).
> We want to support proper queue deletion without restarting the Resource 
> Manager:
>  # Static queues without any entries that are removed from fair-scheduler.xml 
> should be deleted from memory.
>  # Dynamic queues without any entries should be deleted.
>  # RM Web UI should only show the queues defined in the scheduler at that 
> point in time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8191) Fair scheduler: queue deletion without RM restart

2018-05-16 Thread Gergo Repas (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16477218#comment-16477218
 ] 

Gergo Repas commented on YARN-8191:
---

[~wilfreds] Actually, please ignore my last patch. I realized that bug 2 (the 
lack of handling a failed removeEmptyIncompatibleQueues removal) does indeed 
cause a significant issue, so I'll provide another update.

> Fair scheduler: queue deletion without RM restart
> -
>
> Key: YARN-8191
> URL: https://issues.apache.org/jira/browse/YARN-8191
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.0.1
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Attachments: Queue Deletion in Fair Scheduler.pdf, 
> YARN-8191.000.patch, YARN-8191.001.patch, YARN-8191.002.patch, 
> YARN-8191.003.patch, YARN-8191.004.patch, YARN-8191.005.patch, 
> YARN-8191.006.patch, YARN-8191.007.patch, YARN-8191.008.patch, 
> YARN-8191.009.patch, YARN-8191.010.patch, YARN-8191.011.patch
>
>
> The Fair Scheduler never cleans up queues even if they are deleted in the 
> allocation file, or were dynamically created and are never going to be used 
> again. Queues always remain in memory which leads to two following issues.
>  # Steady fairshares aren’t calculated correctly due to remaining queues
>  # WebUI shows deleted queues, which is confusing for users (YARN-4022).
> We want to support proper queue deletion without restarting the Resource 
> Manager:
>  # Static queues without any entries that are removed from fair-scheduler.xml 
> should be deleted from memory.
>  # Dynamic queues without any entries should be deleted.
>  # RM Web UI should only show the queues defined in the scheduler at that 
> point in time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8273) Log aggregation does not warn if HDFS quota in target directory is exceeded

2018-05-15 Thread Gergo Repas (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gergo Repas updated YARN-8273:
--
Attachment: YARN-8273.002.patch

> Log aggregation does not warn if HDFS quota in target directory is exceeded
> ---
>
> Key: YARN-8273
> URL: https://issues.apache.org/jira/browse/YARN-8273
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation
>Affects Versions: 3.1.0
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Attachments: YARN-8273.000.patch, YARN-8273.001.patch, 
> YARN-8273.002.patch
>
>
> It appears that if an HDFS space quota is set on a target directory for log 
> aggregation and the quota is already exceeded when log aggregation is 
> attempted, zero-byte log files will be written to the HDFS directory, however 
> NodeManager logs do not reflect a failure to write the files successfully 
> (i.e. there are no ERROR or WARN messages to this effect).
> An improvement may be worth investigating to alert users to this scenario, as 
> otherwise logs for a YARN application may be missing both on HDFS and locally 
> (after local log cleanup is done) and the user may not otherwise be informed.
> Steps to reproduce:
> * Set a small HDFS space quota on /tmp/logs/username/logs (e.g. 2MB)
> * Write files to HDFS such that /tmp/logs/username/logs is almost 2MB full
> * Run a Spark or MR job in the cluster
> * Observe that zero byte files are written to HDFS after job completion
> * Observe that YARN container logs are also not present on the NM hosts (or 
> are deleted after yarn.nodemanager.delete.debug-delay-sec)
> * Observe that no ERROR or WARN messages appear to be logged in the NM role 
> log



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8191) Fair scheduler: queue deletion without RM restart

2018-05-15 Thread Gergo Repas (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16475740#comment-16475740
 ] 

Gergo Repas commented on YARN-8191:
---

[~wilfreds] I have uploaded an updated patch version. It changes the return 
value of removeEmptyIncompatibleQueues() so that it reports whether the queue or 
any of its parents was removed. I have also refactored 
updateAllocationConfiguration() and extracted a new method, 
ensureQueueExistsAndIsCompatibleAndIsStatic(), in which we remove incompatible 
queues, create the queue if it did not exist or has just been removed, and set 
the queue to static (regardless of whether it already existed or has just been 
created), since both new and existing dynamic queues need to be set to static.
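
A minimal sketch of that flow, assuming helper names (removeEmptyIncompatibleQueues, 
getQueue, createQueue, setDynamic) as described above; this is not the exact code 
in the patch:

{code:java}
// Sketch of the extracted method's flow; removeEmptyIncompatibleQueues(),
// getQueue(), createQueue() and setDynamic() are assumed helpers/signatures.
private void ensureQueueExistsAndIsCompatibleAndIsStatic(String queueName,
    FSQueueType queueType) {
  // Remove empty queues of an incompatible type; the return value now says
  // whether the queue itself (or any of its parents) was removed.
  boolean removed = removeEmptyIncompatibleQueues(queueName, queueType);

  FSQueue queue = getQueue(queueName);
  if (removed || queue == null) {
    // The queue never existed or has just been removed: (re-)create it.
    queue = createQueue(queueName, queueType);
  }
  if (queue != null) {
    // Both newly created and previously dynamic queues become static, since
    // the queue is now explicitly listed in the allocation file.
    queue.setDynamic(false);
  }
}
{code}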

(The test failure is unrelated).

> Fair scheduler: queue deletion without RM restart
> -
>
> Key: YARN-8191
> URL: https://issues.apache.org/jira/browse/YARN-8191
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.0.1
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Attachments: Queue Deletion in Fair Scheduler.pdf, 
> YARN-8191.000.patch, YARN-8191.001.patch, YARN-8191.002.patch, 
> YARN-8191.003.patch, YARN-8191.004.patch, YARN-8191.005.patch, 
> YARN-8191.006.patch, YARN-8191.007.patch, YARN-8191.008.patch, 
> YARN-8191.009.patch, YARN-8191.010.patch, YARN-8191.011.patch
>
>
> The Fair Scheduler never cleans up queues even if they are deleted in the 
> allocation file, or were dynamically created and are never going to be used 
> again. Queues always remain in memory which leads to two following issues.
>  # Steady fairshares aren’t calculated correctly due to remaining queues
>  # WebUI shows deleted queues, which is confusing for users (YARN-4022).
> We want to support proper queue deletion without restarting the Resource 
> Manager:
>  # Static queues without any entries that are removed from fair-scheduler.xml 
> should be deleted from memory.
>  # Dynamic queues without any entries should be deleted.
>  # RM Web UI should only show the queues defined in the scheduler at that 
> point in time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8273) Log aggregation does not warn if HDFS quota in target directory is exceeded

2018-05-15 Thread Gergo Repas (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gergo Repas updated YARN-8273:
--
Attachment: YARN-8273.001.patch

> Log aggregation does not warn if HDFS quota in target directory is exceeded
> ---
>
> Key: YARN-8273
> URL: https://issues.apache.org/jira/browse/YARN-8273
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation
>Affects Versions: 3.1.0
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Attachments: YARN-8273.000.patch, YARN-8273.001.patch
>
>
> It appears that if an HDFS space quota is set on a target directory for log 
> aggregation and the quota is already exceeded when log aggregation is 
> attempted, zero-byte log files will be written to the HDFS directory, however 
> NodeManager logs do not reflect a failure to write the files successfully 
> (i.e. there are no ERROR or WARN messages to this effect).
> An improvement may be worth investigating to alert users to this scenario, as 
> otherwise logs for a YARN application may be missing both on HDFS and locally 
> (after local log cleanup is done) and the user may not otherwise be informed.
> Steps to reproduce:
> * Set a small HDFS space quota on /tmp/logs/username/logs (e.g. 2MB)
> * Write files to HDFS such that /tmp/logs/username/logs is almost 2MB full
> * Run a Spark or MR job in the cluster
> * Observe that zero byte files are written to HDFS after job completion
> * Observe that YARN container logs are also not present on the NM hosts (or 
> are deleted after yarn.nodemanager.delete.debug-delay-sec)
> * Observe that no ERROR or WARN messages appear to be logged in the NM role 
> log



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8191) Fair scheduler: queue deletion without RM restart

2018-05-15 Thread Gergo Repas (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gergo Repas updated YARN-8191:
--
Attachment: YARN-8191.011.patch

> Fair scheduler: queue deletion without RM restart
> -
>
> Key: YARN-8191
> URL: https://issues.apache.org/jira/browse/YARN-8191
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.0.1
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Attachments: Queue Deletion in Fair Scheduler.pdf, 
> YARN-8191.000.patch, YARN-8191.001.patch, YARN-8191.002.patch, 
> YARN-8191.003.patch, YARN-8191.004.patch, YARN-8191.005.patch, 
> YARN-8191.006.patch, YARN-8191.007.patch, YARN-8191.008.patch, 
> YARN-8191.009.patch, YARN-8191.010.patch, YARN-8191.011.patch
>
>
> The Fair Scheduler never cleans up queues even if they are deleted in the 
> allocation file, or were dynamically created and are never going to be used 
> again. Queues always remain in memory which leads to two following issues.
>  # Steady fairshares aren’t calculated correctly due to remaining queues
>  # WebUI shows deleted queues, which is confusing for users (YARN-4022).
> We want to support proper queue deletion without restarting the Resource 
> Manager:
>  # Static queues without any entries that are removed from fair-scheduler.xml 
> should be deleted from memory.
>  # Dynamic queues without any entries should be deleted.
>  # RM Web UI should only show the queues defined in the scheduler at that 
> point in time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8191) Fair scheduler: queue deletion without RM restart

2018-05-14 Thread Gergo Repas (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16473961#comment-16473961
 ] 

Gergo Repas commented on YARN-8191:
---

[~wilfreds] Thanks for the explanation. I think returning *true* when 
removeEmptyIncompatibleQueues() does not remove the queue is intentional, as the 
javadoc of removeEmptyIncompatibleQueues() says: "@return true if we can create 
queueToCreate or it already exists". But this is rather confusing, so I'm 
thinking about a way to refactor it (possibly by splitting up 
removeEmptyIncompatibleQueues()).

> Fair scheduler: queue deletion without RM restart
> -
>
> Key: YARN-8191
> URL: https://issues.apache.org/jira/browse/YARN-8191
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.0.1
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Attachments: Queue Deletion in Fair Scheduler.pdf, 
> YARN-8191.000.patch, YARN-8191.001.patch, YARN-8191.002.patch, 
> YARN-8191.003.patch, YARN-8191.004.patch, YARN-8191.005.patch, 
> YARN-8191.006.patch, YARN-8191.007.patch, YARN-8191.008.patch, 
> YARN-8191.009.patch, YARN-8191.010.patch
>
>
> The Fair Scheduler never cleans up queues even if they are deleted in the 
> allocation file, or were dynamically created and are never going to be used 
> again. Queues always remain in memory which leads to two following issues.
>  # Steady fairshares aren’t calculated correctly due to remaining queues
>  # WebUI shows deleted queues, which is confusing for users (YARN-4022).
> We want to support proper queue deletion without restarting the Resource 
> Manager:
>  # Static queues without any entries that are removed from fair-scheduler.xml 
> should be deleted from memory.
>  # Dynamic queues without any entries should be deleted.
>  # RM Web UI should only show the queues defined in the scheduler at that 
> point in time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8268) Fair scheduler: reservable queue is configured both as parent and leaf queue

2018-05-14 Thread Gergo Repas (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16473906#comment-16473906
 ] 

Gergo Repas commented on YARN-8268:
---

[~haibochen] - thanks for the review and committing the change!
[~wilfreds] - thanks for the review!

> Fair scheduler: reservable queue is configured both as parent and leaf queue
> 
>
> Key: YARN-8268
> URL: https://issues.apache.org/jira/browse/YARN-8268
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 3.1.0
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: YARN-8268.000.patch, YARN-8268.001.patch
>
>
> The following allocation file
> {code:java}
> 
> 
> 
>   someuser 
>   
> someuser 
>   
>   
> 
> 
> someuser 
>   
> 
> drf
> 
> {code}
> is being parsed as: {{PARENT=[root, root.dedicated], LEAF=[root.default, 
> root.dedicated]}} (AllocationConfiguration.configuredQueues).
> The root.dedicated should only appear as a PARENT queue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8191) Fair scheduler: queue deletion without RM restart

2018-05-11 Thread Gergo Repas (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16471704#comment-16471704
 ] 

Gergo Repas commented on YARN-8191:
---

[~wilfreds] Thanks. What would be the motivation to put the setDynamic(false) 
call into removeEmptyIncompatibleQueues()? To me it seems counterintuitive to 
put such a side effect into a removal-type method, and 
updateAllocationConfiguration() seems to me the natural place to do this.

> Fair scheduler: queue deletion without RM restart
> -
>
> Key: YARN-8191
> URL: https://issues.apache.org/jira/browse/YARN-8191
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.0.1
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Attachments: Queue Deletion in Fair Scheduler.pdf, 
> YARN-8191.000.patch, YARN-8191.001.patch, YARN-8191.002.patch, 
> YARN-8191.003.patch, YARN-8191.004.patch, YARN-8191.005.patch, 
> YARN-8191.006.patch, YARN-8191.007.patch, YARN-8191.008.patch, 
> YARN-8191.009.patch, YARN-8191.010.patch
>
>
> The Fair Scheduler never cleans up queues even if they are deleted in the 
> allocation file, or were dynamically created and are never going to be used 
> again. Queues always remain in memory which leads to two following issues.
>  # Steady fairshares aren’t calculated correctly due to remaining queues
>  # WebUI shows deleted queues, which is confusing for users (YARN-4022).
> We want to support proper queue deletion without restarting the Resource 
> Manager:
>  # Static queues without any entries that are removed from fair-scheduler.xml 
> should be deleted from memory.
>  # Dynamic queues without any entries should be deleted.
>  # RM Web UI should only show the queues defined in the scheduler at that 
> point in time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8191) Fair scheduler: queue deletion without RM restart

2018-05-10 Thread Gergo Repas (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gergo Repas updated YARN-8191:
--
Attachment: YARN-8191.010.patch

> Fair scheduler: queue deletion without RM restart
> -
>
> Key: YARN-8191
> URL: https://issues.apache.org/jira/browse/YARN-8191
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.0.1
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Attachments: Queue Deletion in Fair Scheduler.pdf, 
> YARN-8191.000.patch, YARN-8191.001.patch, YARN-8191.002.patch, 
> YARN-8191.003.patch, YARN-8191.004.patch, YARN-8191.005.patch, 
> YARN-8191.006.patch, YARN-8191.007.patch, YARN-8191.008.patch, 
> YARN-8191.009.patch, YARN-8191.010.patch
>
>
> The Fair Scheduler never cleans up queues even if they are deleted in the 
> allocation file, or were dynamically created and are never going to be used 
> again. Queues always remain in memory which leads to two following issues.
>  # Steady fairshares aren’t calculated correctly due to remaining queues
>  # WebUI shows deleted queues, which is confusing for users (YARN-4022).
> We want to support proper queue deletion without restarting the Resource 
> Manager:
>  # Static queues without any entries that are removed from fair-scheduler.xml 
> should be deleted from memory.
>  # Dynamic queues without any entries should be deleted.
>  # RM Web UI should only show the queues defined in the scheduler at that 
> point in time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8191) Fair scheduler: queue deletion without RM restart

2018-05-10 Thread Gergo Repas (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470865#comment-16470865
 ] 

Gergo Repas commented on YARN-8191:
---

The unit test failure is unrelated; the last patch addresses the checkstyle issues.

> Fair scheduler: queue deletion without RM restart
> -
>
> Key: YARN-8191
> URL: https://issues.apache.org/jira/browse/YARN-8191
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.0.1
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Attachments: Queue Deletion in Fair Scheduler.pdf, 
> YARN-8191.000.patch, YARN-8191.001.patch, YARN-8191.002.patch, 
> YARN-8191.003.patch, YARN-8191.004.patch, YARN-8191.005.patch, 
> YARN-8191.006.patch, YARN-8191.007.patch, YARN-8191.008.patch, 
> YARN-8191.009.patch, YARN-8191.010.patch
>
>
> The Fair Scheduler never cleans up queues even if they are deleted in the 
> allocation file, or were dynamically created and are never going to be used 
> again. Queues always remain in memory which leads to two following issues.
>  # Steady fairshares aren’t calculated correctly due to remaining queues
>  # WebUI shows deleted queues, which is confusing for users (YARN-4022).
> We want to support proper queue deletion without restarting the Resource 
> Manager:
>  # Static queues without any entries that are removed from fair-scheduler.xml 
> should be deleted from memory.
>  # Dynamic queues without any entries should be deleted.
>  # RM Web UI should only show the queues defined in the scheduler at that 
> point in time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8273) Log aggregation does not warn if HDFS quota in target directory is exceeded

2018-05-10 Thread Gergo Repas (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gergo Repas updated YARN-8273:
--
Attachment: YARN-8273.000.patch

> Log aggregation does not warn if HDFS quota in target directory is exceeded
> ---
>
> Key: YARN-8273
> URL: https://issues.apache.org/jira/browse/YARN-8273
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation
>Affects Versions: 3.1.0
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Attachments: YARN-8273.000.patch
>
>
> It appears that if an HDFS space quota is set on a target directory for log 
> aggregation and the quota is already exceeded when log aggregation is 
> attempted, zero-byte log files will be written to the HDFS directory, however 
> NodeManager logs do not reflect a failure to write the files successfully 
> (i.e. there are no ERROR or WARN messages to this effect).
> An improvement may be worth investigating to alert users to this scenario, as 
> otherwise logs for a YARN application may be missing both on HDFS and locally 
> (after local log cleanup is done) and the user may not otherwise be informed.
> Steps to reproduce:
> * Set a small HDFS space quota on /tmp/logs/username/logs (e.g. 2MB)
> * Write files to HDFS such that /tmp/logs/username/logs is almost 2MB full
> * Run a Spark or MR job in the cluster
> * Observe that zero byte files are written to HDFS after job completion
> * Observe that YARN container logs are also not present on the NM hosts (or 
> are deleted after yarn.nodemanager.delete.debug-delay-sec)
> * Observe that no ERROR or WARN messages appear to be logged in the NM role 
> log



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8273) Log aggregation does not warn if HDFS quota in target directory is exceeded

2018-05-10 Thread Gergo Repas (JIRA)
Gergo Repas created YARN-8273:
-

 Summary: Log aggregation does not warn if HDFS quota in target 
directory is exceeded
 Key: YARN-8273
 URL: https://issues.apache.org/jira/browse/YARN-8273
 Project: Hadoop YARN
  Issue Type: Bug
  Components: log-aggregation
Affects Versions: 3.1.0
Reporter: Gergo Repas
Assignee: Gergo Repas


It appears that if an HDFS space quota is set on a target directory for log 
aggregation and the quota is already exceeded when log aggregation is 
attempted, zero-byte log files will be written to the HDFS directory; however, 
the NodeManager logs do not reflect the failure to write the files (i.e. there 
are no ERROR or WARN messages to this effect).

An improvement may be worth investigating to alert users to this scenario, as 
otherwise the logs for a YARN application may be missing both on HDFS and 
locally (after local log cleanup is done) without the user ever being informed.

Steps to reproduce:
* Set a small HDFS space quota on /tmp/logs/username/logs (e.g. 2MB)
* Write files to HDFS such that /tmp/logs/username/logs is almost 2MB full
* Run a Spark or MR job in the cluster
* Observe that zero byte files are written to HDFS after job completion
* Observe that YARN container logs are also not present on the NM hosts (or are 
deleted after yarn.nodemanager.delete.debug-delay-sec)
* Observe that no ERROR or WARN messages appear to be logged in the NM role log
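
As an illustration of the kind of warning being proposed, a minimal sketch (the 
method, parameters and hook point below are assumptions for this example, not the 
actual NodeManager log aggregation code):

{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.protocol.DSQuotaExceededException;
import org.slf4j.Logger;

// Illustrative helper only: surface an HDFS space-quota failure as a WARN in
// the NodeManager log instead of letting it pass silently.
final class AggregatedLogWriter {
  static void writeAggregatedLog(FileSystem remoteFS, Path targetFile,
      byte[] logData, Logger log) throws IOException {
    try (FSDataOutputStream out = remoteFS.create(targetFile)) {
      out.write(logData);
    } catch (DSQuotaExceededException e) {
      log.warn("Log aggregation could not write {}: the HDFS space quota on "
          + "the target directory is exceeded", targetFile, e);
      throw e;
    }
  }
}
{code}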



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8191) Fair scheduler: queue deletion without RM restart

2018-05-10 Thread Gergo Repas (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470570#comment-16470570
 ] 

Gergo Repas commented on YARN-8191:
---

[~wilfreds] Thank you for the review! I have a question regarding two points, 
and I have addressed the rest of the points:
* Can we add a test for this case please? I do not see this specific case in 
the tests covered, testNonEmptyStaticQueueBecomingDynamicQueue does not cover 
this use case 
** I have added testRemovalOfChildlessParentQueue() to test this scenario.
* We already check all queues defined in the configuration on each reload for 
existence in the updateAllocationConfiguration via the call to 
removeEmptyIncompatibleQueues. If the queue of the correct type exists we 
currently just return. The only thing we should do now before we return is 
unset the isDynamic flag there and there is no need for a separate loop. 
removeEmptyIncompatibleQueues is called for each configured queue with each 
reload.
** Sorry, I do not understand what you're suggesting here. Could you please 
elaborate a bit more?
* Keeping the same list in two places does not make sense. You now need to keep 
them in sync and it complicates the code in updateAllocationConfiguration which 
is not needed. We have also seen clusters with 100's of dynamic queues when 
user based queues are used so the loops can become long. Please move the logic 
for generating the diff to the onReload method where we have the 2 copies as 
described above.
** I moved the logic into onReload.
* The assertNotNull calls should use the q1 object that is returned (applies to 
multiple tests).
** Fixed.
* testRemovalOfDynamicLeafQueue needs to cover a dynamic leaf queue removal 
which has a static parent, and make sure the parent is still there. It should 
also check if a non empty queue is not removed.
** Good point, fixed.
* testRemovalOfDynamicParentQueue needs to cover the removal of a dynamic parent 
queue without a leaf.
** How can I create a dynamic parent queue without a leaf? I thought the only 
way to have a parent queue without a leaf is to add it to the allocation config 
with parent="true", but in this case it'd be a static queue.
* testNonEmptyDynamicQueueBecomingStaticQueue is missing a check after the 
configuration update for isDynamic or not. The test name does not really cover 
the test you do unless you do that. The queue being empty or not does not 
really matter.
** Good point, fixed.
* testNonEmptyStaticQueueBecomingDynamicQueue: "root.test.childA" is shown as 
"root.queue1 is not a static queue" in the assert message. Check the order and 
the comments in the method; they seem a bit out of order.
** Thanks, fixed.
* We need to cover a parent queue in the tests: dynamic parent + dynamic leaf 
(no apps) changing to static. Leaf and parent must be static after the change 
and the other way around (testQueueTypeChange ?)
** Good idea, I've added testQueueTypeChange().
* We need a test for the updating of the assignedApp in the FSLeafQueue and 
make sure isEmpty is working OK.
** Right, I've added testApplicationAssignmentPreventsRemovalOfDynamicQueue() 
(I'm open to suggestions regarding the naming).


> Fair scheduler: queue deletion without RM restart
> -
>
> Key: YARN-8191
> URL: https://issues.apache.org/jira/browse/YARN-8191
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.0.1
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Attachments: Queue Deletion in Fair Scheduler.pdf, 
> YARN-8191.000.patch, YARN-8191.001.patch, YARN-8191.002.patch, 
> YARN-8191.003.patch, YARN-8191.004.patch, YARN-8191.005.patch, 
> YARN-8191.006.patch, YARN-8191.007.patch, YARN-8191.008.patch, 
> YARN-8191.009.patch
>
>
> The Fair Scheduler never cleans up queues even if they are deleted in the 
> allocation file, or were dynamically created and are never going to be used 
> again. Queues always remain in memory which leads to two following issues.
>  # Steady fairshares aren’t calculated correctly due to remaining queues
>  # WebUI shows deleted queues, which is confusing for users (YARN-4022).
> We want to support proper queue deletion without restarting the Resource 
> Manager:
>  # Static queues without any entries that are removed from fair-scheduler.xml 
> should be deleted from memory.
>  # Dynamic queues without any entries should be deleted.
>  # RM Web UI should only show the queues defined in the scheduler at that 
> point in time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8191) Fair scheduler: queue deletion without RM restart

2018-05-10 Thread Gergo Repas (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gergo Repas updated YARN-8191:
--
Attachment: YARN-8191.009.patch

> Fair scheduler: queue deletion without RM restart
> -
>
> Key: YARN-8191
> URL: https://issues.apache.org/jira/browse/YARN-8191
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.0.1
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Attachments: Queue Deletion in Fair Scheduler.pdf, 
> YARN-8191.000.patch, YARN-8191.001.patch, YARN-8191.002.patch, 
> YARN-8191.003.patch, YARN-8191.004.patch, YARN-8191.005.patch, 
> YARN-8191.006.patch, YARN-8191.007.patch, YARN-8191.008.patch, 
> YARN-8191.009.patch
>
>
> The Fair Scheduler never cleans up queues even if they are deleted in the 
> allocation file, or were dynamically created and are never going to be used 
> again. Queues always remain in memory which leads to two following issues.
>  # Steady fairshares aren’t calculated correctly due to remaining queues
>  # WebUI shows deleted queues, which is confusing for users (YARN-4022).
> We want to support proper queue deletion without restarting the Resource 
> Manager:
>  # Static queues without any entries that are removed from fair-scheduler.xml 
> should be deleted from memory.
>  # Dynamic queues without any entries should be deleted.
>  # RM Web UI should only show the queues defined in the scheduler at that 
> point in time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8268) Fair scheduler: reservable queue is configured both as parent and leaf queue

2018-05-10 Thread Gergo Repas (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gergo Repas updated YARN-8268:
--
Description: 
The following allocation file
{code:java}



  someuser 
  
someuser 
  
  


someuser 
  

drf

{code}
is being parsed as: {{PARENT=[root, root.dedicated], LEAF=[root.default, 
root.dedicated]}} (AllocationConfiguration.configuredQueues).

The root.dedicated should only appear as a PARENT queue.

  was:
The following allocation file
{code}



  someuser 
  
someuser 
  
  


someuser 
  

drf

{code}
is being parsed as: {{PARENT=[root, root.dedicated], LEAF=[root.default, 
root.dedicated]}} (AllocationConfiguration.configuredQueues).

The root.dedicated should only appear as a LEAF queue.


> Fair scheduler: reservable queue is configured both as parent and leaf queue
> 
>
> Key: YARN-8268
> URL: https://issues.apache.org/jira/browse/YARN-8268
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 3.1.0
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: YARN-8268.000.patch, YARN-8268.001.patch
>
>
> The following allocation file
> {code:java}
> 
> 
> 
>   someuser 
>   
> someuser 
>   
>   
> 
> 
> someuser 
>   
> 
> drf
> 
> {code}
> is being parsed as: {{PARENT=[root, root.dedicated], LEAF=[root.default, 
> root.dedicated]}} (AllocationConfiguration.configuredQueues).
> The root.dedicated should only appear as a PARENT queue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8248) Job hangs when queue is specified and that queue has 0 capability of a resource

2018-05-10 Thread Gergo Repas (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470210#comment-16470210
 ] 

Gergo Repas commented on YARN-8248:
---

[~snemeth] Thanks for the updated patch, LGTM.

+1 (non-binding)

> Job hangs when queue is specified and that queue has 0 capability of a 
> resource
> ---
>
> Key: YARN-8248
> URL: https://issues.apache.org/jira/browse/YARN-8248
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler, yarn
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-8248-001.patch, YARN-8248-002.patch, 
> YARN-8248-003.patch, YARN-8248-004.patch, YARN-8248-005.patch
>
>
> Job hangs when mapreduce.job.queuename is specified and the queue has 0 of 
> any resource (vcores / memory / other)
> In this scenario, the job should be immediately rejected upon submission 
> since the specified queue cannot serve the resource needs of the submitted 
> job.
>  
> Command to run:
> {code:java}
> bin/yarn jar 
> "./share/hadoop/mapreduce/hadoop-mapreduce-examples-$MY_HADOOP_VERSION.jar" 
> pi -Dmapreduce.job.queuename=sample_queue 1 1000;{code}
> fair-scheduler.xml queue config (excerpt):
>  
> {code:java}
>  
> 1 mb,0vcores
> 9 mb,0vcores
> 50
> -1.0f
> 2.0
> fair
>   
> {code}
> Diagnostic message from the web UI: 
> {code:java}
> Wed May 02 06:35:57 -0700 2018] Application is added to the scheduler and is 
> not yet activated. (Resource request:  exceeds current 
> queue or its parents maximum resource allowed).{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8248) Job hangs when queue is specified and that queue has 0 capability of a resource

2018-05-09 Thread Gergo Repas (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16468961#comment-16468961
 ] 

Gergo Repas commented on YARN-8248:
---

[~snemeth] Thanks for the new patch! I can see you introduced 
FairScheduler.rejectApplicationWithMessage(), which logs the rejection message 
at ERROR level. However, I don't think rejections should be logged at ERROR 
level, since a rejection is fairly normal behaviour and doesn't indicate an 
error in the scheduler's operation. Also, in some cases you removed the logging 
before the rejectApplicationWithMessage() call and in other cases you kept it; 
this should be consistent.
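
For illustration, a minimal sketch of the suggested behaviour (the field names and 
event wiring below are assumptions, not the code in the patch): keep a single log 
statement inside the helper, at a non-error level, so every caller is consistent.

{code:java}
// Illustrative sketch only (LOG, rmContext and the event wiring are assumptions,
// not the code in the patch): a rejection is expected behaviour, so log it once,
// at INFO rather than ERROR, inside the helper so every caller is consistent.
private void rejectApplicationWithMessage(ApplicationId applicationId,
    String message) {
  LOG.info("Rejecting application " + applicationId + ": " + message);
  rmContext.getDispatcher().getEventHandler().handle(
      new RMAppEvent(applicationId, RMAppEventType.APP_REJECTED, message));
}
{code}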

> Job hangs when queue is specified and that queue has 0 capability of a 
> resource
> ---
>
> Key: YARN-8248
> URL: https://issues.apache.org/jira/browse/YARN-8248
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler, yarn
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-8248-001.patch, YARN-8248-002.patch, 
> YARN-8248-003.patch
>
>
> Job hangs when mapreduce.job.queuename is specified and the queue has 0 of 
> any resource (vcores / memory / other)
> In this scenario, the job should be immediately rejected upon submission 
> since the specified queue cannot serve the resource needs of the submitted 
> job.
>  
> Command to run:
> {code:java}
> bin/yarn jar 
> "./share/hadoop/mapreduce/hadoop-mapreduce-examples-$MY_HADOOP_VERSION.jar" 
> pi -Dmapreduce.job.queuename=sample_queue 1 1000;{code}
> fair-scheduler.xml queue config (excerpt):
>  
> {code:java}
>  
> 1 mb,0vcores
> 9 mb,0vcores
> 50
> -1.0f
> 2.0
> fair
>   
> {code}
> Diagnostic message from the web UI: 
> {code:java}
> Wed May 02 06:35:57 -0700 2018] Application is added to the scheduler and is 
> not yet activated. (Resource request:  exceeds current 
> queue or its parents maximum resource allowed).{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8268) Fair scheduler: reservable queue is configured both as parent and leaf queue

2018-05-09 Thread Gergo Repas (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16468931#comment-16468931
 ] 

Gergo Repas commented on YARN-8268:
---

Looking at the test failures, the reservation queues should have been dropped 
from the LEAF queue set; I'm uploading a new patch.
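
In line with the updated issue description (a reservable queue should only appear 
as a PARENT queue), a minimal sketch of what dropping it from the LEAF set amounts 
to (illustrative names only, not the exact patch):

{code:java}
import java.util.Map;
import java.util.Set;

// Illustrative fragment (not the exact patch): when the allocation file parser
// sees a reservation element for a queue, register the queue as a parent-type
// queue and make sure it is not also kept in the LEAF set. configuredQueues
// stands for the Map exposed by AllocationConfiguration.configuredQueues.
static void markQueueAsReservable(String queueName,
    Map<FSQueueType, Set<String>> configuredQueues) {
  configuredQueues.get(FSQueueType.PARENT).add(queueName);
  configuredQueues.get(FSQueueType.LEAF).remove(queueName);
}
{code}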

> Fair scheduler: reservable queue is configured both as parent and leaf queue
> 
>
> Key: YARN-8268
> URL: https://issues.apache.org/jira/browse/YARN-8268
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 3.1.0
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: YARN-8268.000.patch, YARN-8268.001.patch
>
>
> The following allocation file
> {code}
> 
> 
> 
>   someuser 
>   
> someuser 
>   
>   
> 
> 
> someuser 
>   
> 
> drf
> 
> {code}
> is being parsed as: {{PARENT=[root, root.dedicated], LEAF=[root.default, 
> root.dedicated]}} (AllocationConfiguration.configuredQueues).
> The root.dedicated should only appear as a LEAF queue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8268) Fair scheduler: reservable queue is configured both as parent and leaf queue

2018-05-09 Thread Gergo Repas (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gergo Repas updated YARN-8268:
--
Attachment: YARN-8268.001.patch

> Fair scheduler: reservable queue is configured both as parent and leaf queue
> 
>
> Key: YARN-8268
> URL: https://issues.apache.org/jira/browse/YARN-8268
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 3.1.0
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: YARN-8268.000.patch, YARN-8268.001.patch
>
>
> The following allocation file
> {code}
> 
> 
> 
>   someuser 
>   
> someuser 
>   
>   
> 
> 
> someuser 
>   
> 
> drf
> 
> {code}
> is being parsed as: {{PARENT=[root, root.dedicated], LEAF=[root.default, 
> root.dedicated]}} (AllocationConfiguration.configuredQueues).
> The root.dedicated should only appear as a LEAF queue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8268) Fair scheduler: reservable queue is configured both as parent and leaf queue

2018-05-09 Thread Gergo Repas (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16468886#comment-16468886
 ] 

Gergo Repas commented on YARN-8268:
---

I'm checking the unit test failures.

> Fair scheduler: reservable queue is configured both as parent and leaf queue
> 
>
> Key: YARN-8268
> URL: https://issues.apache.org/jira/browse/YARN-8268
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 3.1.0
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: YARN-8268.000.patch
>
>
> The following allocation file
> {code}
> 
> 
> 
>   someuser 
>   
> someuser 
>   
>   
> 
> 
> someuser 
>   
> 
> drf
> 
> {code}
> is being parsed as: {{PARENT=[root, root.dedicated], LEAF=[root.default, 
> root.dedicated]}} (AllocationConfiguration.configuredQueues).
> The root.dedicated should only appear as a LEAF queue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8191) Fair scheduler: queue deletion without RM restart

2018-05-09 Thread Gergo Repas (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gergo Repas updated YARN-8191:
--
Target Version/s: 3.2.0  (was: 3.1.1)

> Fair scheduler: queue deletion without RM restart
> -
>
> Key: YARN-8191
> URL: https://issues.apache.org/jira/browse/YARN-8191
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.0.1
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Attachments: Queue Deletion in Fair Scheduler.pdf, 
> YARN-8191.000.patch, YARN-8191.001.patch, YARN-8191.002.patch, 
> YARN-8191.003.patch, YARN-8191.004.patch, YARN-8191.005.patch, 
> YARN-8191.006.patch, YARN-8191.007.patch, YARN-8191.008.patch
>
>
> The Fair Scheduler never cleans up queues even if they are deleted in the 
> allocation file, or were dynamically created and are never going to be used 
> again. Queues always remain in memory which leads to two following issues.
>  # Steady fairshares aren’t calculated correctly due to remaining queues
>  # WebUI shows deleted queues, which is confusing for users (YARN-4022).
> We want to support proper queue deletion without restarting the Resource 
> Manager:
>  # Static queues without any entries that are removed from fair-scheduler.xml 
> should be deleted from memory.
>  # Dynamic queues without any entries should be deleted.
>  # RM Web UI should only show the queues defined in the scheduler at that 
> point in time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-8191) Fair scheduler: queue deletion without RM restart

2018-05-09 Thread Gergo Repas (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16468755#comment-16468755
 ] 

Gergo Repas edited comment on YARN-8191 at 5/9/18 12:08 PM:


Regarding the TestRMWebServicesReservation failures: it will be resolved by 
YARN-8268.

[~wilfreds] - Thanks for the review. Please find my replies to your comments 
below.
 # In removeEmptyDynamicQueues why are we not using getLeafQueues instead of 
getQueues? The queue cleanup should start with the leafs and it removes the 
need to check if the queue has children.
 ** This is to address the following scenario: there is an empty parent=“true” 
childless queue in the allocation config, which gets removed from the config 
(and hence converted to a dynamic queue in updateAllocationConfiguration). The 
next removeEmptyDynamicQueues() should clean this queue up.
 # In removeEmptyDynamicQueues instead of adding to parentQueuesToCheck without 
checks should we not only add a parent queue if that queue is dynamic? That 
also removes the need for the root queue check while iterating and the only 
check for child queues is needed.
 ** Good point, thanks, I addressed it.
 # In updateAllocationConfiguration there should be no need to go over all the 
queues and set the isDynamic flag. The flag should be set on creation and never 
updated after that unless it was an incompatible queue. We should add a new 
getLeafQueue and getParentQueue method which takes the isDynamic flag and sets 
it. It then can call the existing method. The only place that we would call the 
new method would be in updateAllocationConfiguration. The old version of the 
method would stay as is as the isDynamic flag set to true by default. That 
should handle all placement rules etc.
 ** The dynamic flag can change: I can add a previously dynamic queue (e.g. 
root.username) to the config, and it then becomes a static queue. Because of 
this, it's easier just to set all queues present in the config to static 
(instead of branching: if the queue exists, call setDynamic, else call the new 
getLeafQueue and getParentQueue methods).
 # To clean up any removed configured, non dynamic, queues we need to be smart 
and we do not want to walk over all queues. In onReload we have the old and new 
configured queues and we can easily build a list of .. The changeToDynamic just 
needs to walk over the set and set the flag. The standard cleanup code for the 
dynamic queues will then take care of the cleanup without needing lots of 
iterations. Building this diff of queues can happen outside of locking in the 
FS.
 ** Good point, I got rid of that loop. However, instead of relying on the diff 
to be passed in to the QueueManager, the QueueManager keeps track of the set of 
static queues and only sets to dynamic the ones that are no longer present in 
the allocation config (see the sketch after this list).
 # The logic behind getDynamicQueueNames is not correct: the configuredQueues 
set as stored in the AllocationConfiguration only has queues that are defined 
in the configuration. There can never be a dynamic queue in that list. Dynamic 
queues can only be created via the placement rules. (see previous comments)
 ** This method got removed as part of the previous change.
 # As a side note: getLeafQueues in the QueueManager should return an 
ImmutableList of the leaf queues like we do in getQueues.
 ** Yes, that would be clearer; however, returning a CopyOnWriteArrayList should 
also be fine in terms of thread-safety. Since it's not related to my change, I 
will not touch this piece of code.
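
A minimal sketch of that bookkeeping (illustrative names only, not the committed 
QueueManager code):

{code:java}
// Illustrative sketch (not the committed QueueManager code): remember which
// queues are static and, when the allocation file is reloaded, mark the ones
// that are no longer configured as dynamic so the regular
// empty-dynamic-queue cleanup can remove them once they are empty.
private final Set<String> staticQueueNames = new HashSet<>();

void setQueuesToDynamic(Set<String> queuesStillInAllocationFile) {
  Set<String> removedFromConfig = new HashSet<>(staticQueueNames);
  removedFromConfig.removeAll(queuesStillInAllocationFile);
  for (String name : removedFromConfig) {
    FSQueue queue = getQueue(name);
    if (queue != null) {
      queue.setDynamic(true); // removable once it has no apps or children
    }
  }
  staticQueueNames.retainAll(queuesStillInAllocationFile);
}
{code}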

I have uploaded an updated patch (008).


was (Author: grepas):
Regarding the TestRMWebServicesReservation failures: it will be resolved by 
YARN-8268.

[~wilfreds] - Thanks for the review. Please find my replies to your comments 
below.
# In removeEmptyDynamicQueues why are we not using getLeafQueues instead of 
getQueues? The queue cleanup should start with the leafs and it removes the 
need to check if the queue has children.
** This is to address the following scenario: there is an empty parent=“true” 
childless queue in the allocation config, which gets removed from the config 
(and hence converted to a dynamic queue in updateAllocationConfiguration). The 
next removeEmptyDynamicQueues() should clean this queue up.
# In removeEmptyDynamicQueues instead of adding to parentQueuesToCheck without 
checks should we not only add a parent queue if that queue is dynamic? That 
also removes the need for the root queue check while iterating and the only 
check for child queues is needed.
** Good point, thanks, I addressed it.
# In updateAllocationConfiguration there should be no need to go over all the 
queues and set the isDynamic flag. The flag should be set on creation and never 
updated after that unless it was an incompatible queue. We should add a new 
getLeafQueue and getParentQueue method which takes the isDynamic flag and 

[jira] [Commented] (YARN-8191) Fair scheduler: queue deletion without RM restart

2018-05-09 Thread Gergo Repas (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16468755#comment-16468755
 ] 

Gergo Repas commented on YARN-8191:
---

Regarding the TestRMWebServicesReservation failures: it will be resolved by 
YARN-8268.

[~wilfreds] - Thanks for the review. Please find my replies to your comments 
below.
# In removeEmptyDynamicQueues why are we not using getLeafQueues instead of 
getQueues? The queue cleanup should start with the leafs and it removes the 
need to check if the queue has children.
** This is to address the following scenario: there is an empty parent=“true” 
childless queue in the allocation config, which gets removed from the config 
(and hence converted to a dynamic queue in updateAllocationConfiguration). The 
next removeEmptyDynamicQueues() should clean this queue up.
# In removeEmptyDynamicQueues instead of adding to parentQueuesToCheck without 
checks should we not only add a parent queue if that queue is dynamic? That 
also removes the need for the root queue check while iterating and the only 
check for child queues is needed.
** Good point, thanks, I addressed it.
# In updateAllocationConfiguration there should be no need to go over all the 
queues and set the isDynamic flag. The flag should be set on creation and never 
updated after that unless it was an incompatible queue. We should add a new 
getLeafQueue and getParentQueue method which takes the isDynamic flag and sets 
it. It then can call the existing method. The only place that we would call the 
new method would be in updateAllocationConfiguration. The old version of the 
method would stay as is as the isDynamic flag set to true by default. That 
should handle all placement rules etc.
** The dynamic flag can change: I can add a previously dynamic queue (e.g. 
root.username) to the config, and it then becomes a static queue. Because of 
this, it's easier just to set all queues present in the config to static 
(instead of branching: if the queue exists, call setDynamic, else call the new 
getLeafQueue and getParentQueue methods).
# To clean up any removed configured, non dynamic, queues we need to be smart 
and we do not want to walk over all queues. In onReload we have the old and new 
configured queues and we can easily build a list of .. The changeToDynamic just 
needs to walk over the set and set the flag. The standard cleanup code for the 
dynamic queues will then take care of the cleanup without needing lots of 
iterations. Building this diff of queues can happen outside of locking in the 
FS.
** Good point, I got rid of that loop. However, instead of relying on the diff 
to be passed in to the QueueManager, the QueueManager keeps track of the set of 
static queues and only sets to dynamic the ones that are no longer present in 
the allocation config.
# The logic behind getDynamicQueueNames is not correct: the configuredQueues 
set as stored in the AllocationConfiguration only has queues that are defined 
in the configuration. There can never be a dynamic queue in that list. Dynamic 
queues can only be created via the placement rules. (see previous comments)
** This method got removed as part of the previous change.
# As a side note: getLeafQueues in the QueueManager should return an 
ImmutableList of the leaf queues like we do in getQueues.
** Yes, that would be clearer; however, returning a CopyOnWriteArrayList should 
also be fine in terms of thread-safety. Since it's not related to my change, I 
will not touch this piece of code.


> Fair scheduler: queue deletion without RM restart
> -
>
> Key: YARN-8191
> URL: https://issues.apache.org/jira/browse/YARN-8191
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.0.1
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Attachments: Queue Deletion in Fair Scheduler.pdf, 
> YARN-8191.000.patch, YARN-8191.001.patch, YARN-8191.002.patch, 
> YARN-8191.003.patch, YARN-8191.004.patch, YARN-8191.005.patch, 
> YARN-8191.006.patch, YARN-8191.007.patch, YARN-8191.008.patch
>
>
> The Fair Scheduler never cleans up queues even if they are deleted in the 
> allocation file, or were dynamically created and are never going to be used 
> again. Queues always remain in memory which leads to two following issues.
>  # Steady fairshares aren’t calculated correctly due to remaining queues
>  # WebUI shows deleted queues, which is confusing for users (YARN-4022).
> We want to support proper queue deletion without restarting the Resource 
> Manager:
>  # Static queues without any entries that are removed from fair-scheduler.xml 
> should be deleted from memory.
>  # Dynamic queues without any entries should be deleted.
>  # RM Web UI should only show the queues defined in the 

[jira] [Updated] (YARN-8191) Fair scheduler: queue deletion without RM restart

2018-05-09 Thread Gergo Repas (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gergo Repas updated YARN-8191:
--
Attachment: YARN-8191.008.patch

> Fair scheduler: queue deletion without RM restart
> -
>
> Key: YARN-8191
> URL: https://issues.apache.org/jira/browse/YARN-8191
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.0.1
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Attachments: Queue Deletion in Fair Scheduler.pdf, 
> YARN-8191.000.patch, YARN-8191.001.patch, YARN-8191.002.patch, 
> YARN-8191.003.patch, YARN-8191.004.patch, YARN-8191.005.patch, 
> YARN-8191.006.patch, YARN-8191.007.patch, YARN-8191.008.patch
>
>
> The Fair Scheduler never cleans up queues even if they are deleted in the 
> allocation file, or were dynamically created and are never going to be used 
> again. Queues always remain in memory which leads to two following issues.
>  # Steady fairshares aren’t calculated correctly due to remaining queues
>  # WebUI shows deleted queues, which is confusing for users (YARN-4022).
> We want to support proper queue deletion without restarting the Resource 
> Manager:
>  # Static queues without any entries that are removed from fair-scheduler.xml 
> should be deleted from memory.
>  # Dynamic queues without any entries should be deleted.
>  # RM Web UI should only show the queues defined in the scheduler at that 
> point in time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8268) Fair scheduler: reservable queue is configured both as parent and leaf queue

2018-05-09 Thread Gergo Repas (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16468701#comment-16468701
 ] 

Gergo Repas commented on YARN-8268:
---

When only applying the test changes (in TestAllocationFileLoaderService) to 
current trunk and running this test, the following assertion fails:
{code:java}
java.lang.AssertionError: reservable queue should not be a parent queue
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.assertTrue(Assert.java:41)
at org.junit.Assert.assertFalse(Assert.java:64)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestAllocationFileLoaderService.testReservableQueue(TestAllocationFileLoaderService.java:811)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
...
{code}
This demonstrates the problem with the current allocation file processing. The 
rest of the changes in the attached patch address this issue.
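
For reference, the failing check in {{testReservableQueue}} presumably looks something like the following (a sketch based on the stack trace above; the exact field and accessor names used in the test are assumptions):
{code:java}
Map<FSQueueType, Set<String>> configuredQueues = allocConf.getConfiguredQueues();
// the reservable queue must be registered as a leaf...
assertTrue("reservable queue should be a leaf queue",
    configuredQueues.get(FSQueueType.LEAF).contains("root.dedicated"));
// ...and must not also show up as a parent
assertFalse("reservable queue should not be a parent queue",
    configuredQueues.get(FSQueueType.PARENT).contains("root.dedicated"));
{code}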

> Fair scheduler: reservable queue is configured both as parent and leaf queue
> 
>
> Key: YARN-8268
> URL: https://issues.apache.org/jira/browse/YARN-8268
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 3.1.0
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: YARN-8268.000.patch
>
>
> The following allocation file
> {code}
> 
> 
> 
>   someuser 
>   
> someuser 
>   
>   
> 
> 
> someuser 
>   
> 
> drf
> 
> {code}
> is being parsed as: {{PARENT=[root, root.dedicated], LEAF=[root.default, 
> root.dedicated]}} (AllocationConfiguration.configuredQueues).
> The root.dedicated should only appear as a LEAF queue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8268) Fair scheduler: reservable queue is configured both as parent and leaf queue

2018-05-09 Thread Gergo Repas (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gergo Repas updated YARN-8268:
--
Attachment: YARN-8268.000.patch

> Fair scheduler: reservable queue is configured both as parent and leaf queue
> 
>
> Key: YARN-8268
> URL: https://issues.apache.org/jira/browse/YARN-8268
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 3.1.0
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: YARN-8268.000.patch
>
>
> The following allocation file
> {code}
> 
> 
> 
>   someuser 
>   
> someuser 
>   
>   
> 
> 
> someuser 
>   
> 
> drf
> 
> {code}
> is being parsed as: {{PARENT=[root, root.dedicated], LEAF=[root.default, 
> root.dedicated]}} (AllocationConfiguration.configuredQueues).
> The root.dedicated should only appear as a LEAF queue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8268) Fair scheduler: reservable queue is configured both as parent and leaf queue

2018-05-09 Thread Gergo Repas (JIRA)
Gergo Repas created YARN-8268:
-

 Summary: Fair scheduler: reservable queue is configured both as 
parent and leaf queue
 Key: YARN-8268
 URL: https://issues.apache.org/jira/browse/YARN-8268
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 3.1.0
Reporter: Gergo Repas
Assignee: Gergo Repas
 Fix For: 3.2.0


The following allocation file
{code}



  someuser 
  
someuser 
  
  


someuser 
  

drf

{code}
is being parsed as: {{PARENT=[root, root.dedicated], LEAF=[root.default, 
root.dedicated]}} (AllocationConfiguration.configuredQueues).

The root.dedicated should only appear as a LEAF queue.
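
The XML element names in the snippet above were stripped by the mail archiver. A plausible reconstruction, assuming the reservable queue is named {{dedicated}} and that {{someuser}} appears in the reservation ACL elements (the element names and their placement are assumptions; only the values shown above are original), is:
{code:xml}
<?xml version="1.0"?>
<allocations>
  <queue name="dedicated">
    <!-- the empty reservation element marks the queue as reservable -->
    <reservation/>
    <aclAdministerReservations>someuser</aclAdministerReservations>
    <aclListReservations>someuser</aclListReservations>
    <aclSubmitReservations>someuser</aclSubmitReservations>
  </queue>
  <defaultQueueSchedulingPolicy>drf</defaultQueueSchedulingPolicy>
</allocations>
{code}
With such a file, the reservable queue ends up in both the PARENT and the LEAF set of AllocationConfiguration.configuredQueues, which is the bug described above.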



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8191) Fair scheduler: queue deletion without RM restart

2018-05-09 Thread Gergo Repas (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16468634#comment-16468634
 ] 

Gergo Repas commented on YARN-8191:
---

Thanks [~wilfreds], I'll address those issues. Yes, I'm still working on the 
patch; once it's ready, I'll reply to your comments.

For the record, the reason behind the reservation test failures is the fact 
that the
{code}



  someuser 
  
someuser 
  
  


someuser 
  

drf

{code}
test allocation file is being parsed as: {{PARENT=[root, root.dedicated], 
LEAF=[root.default, root.dedicated]}}

> Fair scheduler: queue deletion without RM restart
> -
>
> Key: YARN-8191
> URL: https://issues.apache.org/jira/browse/YARN-8191
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.0.1
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Attachments: Queue Deletion in Fair Scheduler.pdf, 
> YARN-8191.000.patch, YARN-8191.001.patch, YARN-8191.002.patch, 
> YARN-8191.003.patch, YARN-8191.004.patch, YARN-8191.005.patch, 
> YARN-8191.006.patch, YARN-8191.007.patch
>
>
> The Fair Scheduler never cleans up queues even if they are deleted in the 
> allocation file, or were dynamically created and are never going to be used 
> again. Queues always remain in memory which leads to two following issues.
>  # Steady fairshares aren’t calculated correctly due to remaining queues
>  # WebUI shows deleted queues, which is confusing for users (YARN-4022).
> We want to support proper queue deletion without restarting the Resource 
> Manager:
>  # Static queues without any entries that are removed from fair-scheduler.xml 
> should be deleted from memory.
>  # Dynamic queues without any entries should be deleted.
>  # RM Web UI should only show the queues defined in the scheduler at that 
> point in time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8191) Fair scheduler: queue deletion without RM restart

2018-05-08 Thread Gergo Repas (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16467621#comment-16467621
 ] 

Gergo Repas commented on YARN-8191:
---

TestRMWebServicesReservation is broken by my changes; I'm fixing it. 
(TestAMRestart seems unrelated - I cannot reproduce the failure locally.)

> Fair scheduler: queue deletion without RM restart
> -
>
> Key: YARN-8191
> URL: https://issues.apache.org/jira/browse/YARN-8191
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.0.1
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Attachments: Queue Deletion in Fair Scheduler.pdf, 
> YARN-8191.000.patch, YARN-8191.001.patch, YARN-8191.002.patch, 
> YARN-8191.003.patch, YARN-8191.004.patch, YARN-8191.005.patch, 
> YARN-8191.006.patch, YARN-8191.007.patch
>
>
> The Fair Scheduler never cleans up queues even if they are deleted in the 
> allocation file, or were dynamically created and are never going to be used 
> again. Queues always remain in memory which leads to two following issues.
>  # Steady fairshares aren’t calculated correctly due to remaining queues
>  # WebUI shows deleted queues, which is confusing for users (YARN-4022).
> We want to support proper queue deletion without restarting the Resource 
> Manager:
>  # Static queues without any entries that are removed from fair-scheduler.xml 
> should be deleted from memory.
>  # Dynamic queues without any entries should be deleted.
>  # RM Web UI should only show the queues defined in the scheduler at that 
> point in time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8191) Fair scheduler: queue deletion without RM restart

2018-05-08 Thread Gergo Repas (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gergo Repas updated YARN-8191:
--
Attachment: YARN-8191.007.patch

> Fair scheduler: queue deletion without RM restart
> -
>
> Key: YARN-8191
> URL: https://issues.apache.org/jira/browse/YARN-8191
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.0.1
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Attachments: Queue Deletion in Fair Scheduler.pdf, 
> YARN-8191.000.patch, YARN-8191.001.patch, YARN-8191.002.patch, 
> YARN-8191.003.patch, YARN-8191.004.patch, YARN-8191.005.patch, 
> YARN-8191.006.patch, YARN-8191.007.patch
>
>
> The Fair Scheduler never cleans up queues even if they are deleted in the 
> allocation file, or were dynamically created and are never going to be used 
> again. Queues always remain in memory which leads to two following issues.
>  # Steady fairshares aren’t calculated correctly due to remaining queues
>  # WebUI shows deleted queues, which is confusing for users (YARN-4022).
> We want to support proper queue deletion without restarting the Resource 
> Manager:
>  # Static queues without any entries that are removed from fair-scheduler.xml 
> should be deleted from memory.
>  # Dynamic queues without any entries should be deleted.
>  # RM Web UI should only show the queues defined in the scheduler at that 
> point in time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8248) Job hangs when queue is specified and that queue has 0 capability of a resource

2018-05-08 Thread Gergo Repas (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16467091#comment-16467091
 ] 

Gergo Repas commented on YARN-8248:
---

[~snemeth] Thanks for working on this. In the Resources class the patch 
introduces a new method ({{hasAnyZeroRequestedResource()}}), which seems to be 
very specific to this use case. It may be worth checking whether you can achieve 
the same logic with existing methods of this class (e.g. {{fitsIn()}}, 
{{isAnyMajorResourceZero()}}).
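
For illustration only, the kind of check being discussed could look roughly like this (the method and parameter names are made up for the sketch; this is not the attached patch, and unit conversion is omitted):
{code:java}
// Sketch: reject at submission time when the queue's configured maximum has 0
// of a resource that the request actually asks for.
private static boolean requestCannotBeServed(Resource request, Resource queueMax) {
  for (ResourceInformation info : request.getResources()) {
    if (info.getValue() > 0
        && queueMax.getResourceValue(info.getName()) == 0) {
      return true;
    }
  }
  return false;
}
{code}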

> Job hangs when queue is specified and that queue has 0 capability of a 
> resource
> ---
>
> Key: YARN-8248
> URL: https://issues.apache.org/jira/browse/YARN-8248
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler, yarn
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-8248-001.patch, YARN-8248-002.patch
>
>
> Job hangs when mapreduce.job.queuename is specified and the queue has 0 of 
> any resource (vcores / memory / other).
> In this scenario, the job should be immediately rejected upon submission 
> since the specified queue cannot serve the resource needs of the submitted 
> job.
>  
> Command to run:
> {code:java}
> bin/yarn jar 
> "./share/hadoop/mapreduce/hadoop-mapreduce-examples-$MY_HADOOP_VERSION.jar" 
> pi -Dmapreduce.job.queuename=sample_queue 1 1000;{code}
> fair-scheduler.xml queue config (excerpt):
>  
> {code:java}
>  
> 1 mb,0vcores
> 9 mb,0vcores
> 50
> -1.0f
> 2.0
> fair
>   
> {code}
> Diagnostic message from the web UI: 
> {code:java}
> [Wed May 02 06:35:57 -0700 2018] Application is added to the scheduler and is 
> not yet activated. (Resource request:  exceeds current 
> queue or its parents maximum resource allowed).{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8191) Fair scheduler: queue deletion without RM restart

2018-05-07 Thread Gergo Repas (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gergo Repas updated YARN-8191:
--
Attachment: YARN-8191.006.patch

> Fair scheduler: queue deletion without RM restart
> -
>
> Key: YARN-8191
> URL: https://issues.apache.org/jira/browse/YARN-8191
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.0.1
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Attachments: Queue Deletion in Fair Scheduler.pdf, 
> YARN-8191.000.patch, YARN-8191.001.patch, YARN-8191.002.patch, 
> YARN-8191.003.patch, YARN-8191.004.patch, YARN-8191.005.patch, 
> YARN-8191.006.patch
>
>
> The Fair Scheduler never cleans up queues even if they are deleted in the 
> allocation file, or were dynamically created and are never going to be used 
> again. Queues always remain in memory which leads to two following issues.
>  # Steady fairshares aren’t calculated correctly due to remaining queues
>  # WebUI shows deleted queues, which is confusing for users (YARN-4022).
> We want to support proper queue deletion without restarting the Resource 
> Manager:
>  # Static queues without any entries that are removed from fair-scheduler.xml 
> should be deleted from memory.
>  # Dynamic queues without any entries should be deleted.
>  # RM Web UI should only show the queues defined in the scheduler at that 
> point in time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-8191) Fair scheduler: queue deletion without RM restart

2018-05-04 Thread Gergo Repas (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16463701#comment-16463701
 ] 

Gergo Repas edited comment on YARN-8191 at 5/4/18 11:09 AM:


(I'm checking if the unit test failure is related.)


was (Author: grepas):
(The build server is down; I cannot access the unit test log to check whether the 
unit test failure is related.)

> Fair scheduler: queue deletion without RM restart
> -
>
> Key: YARN-8191
> URL: https://issues.apache.org/jira/browse/YARN-8191
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.0.1
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Attachments: Queue Deletion in Fair Scheduler.pdf, 
> YARN-8191.000.patch, YARN-8191.001.patch, YARN-8191.002.patch, 
> YARN-8191.003.patch, YARN-8191.004.patch, YARN-8191.005.patch
>
>
> The Fair Scheduler never cleans up queues even if they are deleted in the 
> allocation file, or were dynamically created and are never going to be used 
> again. Queues always remain in memory which leads to two following issues.
>  # Steady fairshares aren’t calculated correctly due to remaining queues
>  # WebUI shows deleted queues, which is confusing for users (YARN-4022).
> We want to support proper queue deletion without restarting the Resource 
> Manager:
>  # Static queues without any entries that are removed from fair-scheduler.xml 
> should be deleted from memory.
>  # Dynamic queues without any entries should be deleted.
>  # RM Web UI should only show the queues defined in the scheduler at that 
> point in time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8191) Fair scheduler: queue deletion without RM restart

2018-05-04 Thread Gergo Repas (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16463701#comment-16463701
 ] 

Gergo Repas commented on YARN-8191:
---

(The build server is down; I cannot access the unit test log to check whether the 
unit test failure is related.)

> Fair scheduler: queue deletion without RM restart
> -
>
> Key: YARN-8191
> URL: https://issues.apache.org/jira/browse/YARN-8191
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.0.1
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Attachments: Queue Deletion in Fair Scheduler.pdf, 
> YARN-8191.000.patch, YARN-8191.001.patch, YARN-8191.002.patch, 
> YARN-8191.003.patch, YARN-8191.004.patch, YARN-8191.005.patch
>
>
> The Fair Scheduler never cleans up queues even if they are deleted in the 
> allocation file, or were dynamically created and are never going to be used 
> again. Queues always remain in memory which leads to two following issues.
>  # Steady fairshares aren’t calculated correctly due to remaining queues
>  # WebUI shows deleted queues, which is confusing for users (YARN-4022).
> We want to support proper queue deletion without restarting the Resource 
> Manager:
>  # Static queues without any entries that are removed from fair-scheduler.xml 
> should be deleted from memory.
>  # Dynamic queues without any entries should be deleted.
>  # RM Web UI should only show the queues defined in the scheduler at that 
> point in time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8191) Fair scheduler: queue deletion without RM restart

2018-05-03 Thread Gergo Repas (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16462364#comment-16462364
 ] 

Gergo Repas commented on YARN-8191:
---

[~snemeth] Thanks for the review! I've added these comments to the updated 
(v005) patch.

> Fair scheduler: queue deletion without RM restart
> -
>
> Key: YARN-8191
> URL: https://issues.apache.org/jira/browse/YARN-8191
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.0.1
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Attachments: Queue Deletion in Fair Scheduler.pdf, 
> YARN-8191.000.patch, YARN-8191.001.patch, YARN-8191.002.patch, 
> YARN-8191.003.patch, YARN-8191.004.patch, YARN-8191.005.patch
>
>
> The Fair Scheduler never cleans up queues even if they are deleted in the 
> allocation file, or were dynamically created and are never going to be used 
> again. Queues always remain in memory which leads to two following issues.
>  # Steady fairshares aren’t calculated correctly due to remaining queues
>  # WebUI shows deleted queues, which is confusing for users (YARN-4022).
> We want to support proper queue deletion without restarting the Resource 
> Manager:
>  # Static queues without any entries that are removed from fair-scheduler.xml 
> should be deleted from memory.
>  # Dynamic queues without any entries should be deleted.
>  # RM Web UI should only show the queues defined in the scheduler at that 
> point in time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8191) Fair scheduler: queue deletion without RM restart

2018-05-03 Thread Gergo Repas (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gergo Repas updated YARN-8191:
--
Attachment: YARN-8191.005.patch

> Fair scheduler: queue deletion without RM restart
> -
>
> Key: YARN-8191
> URL: https://issues.apache.org/jira/browse/YARN-8191
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.0.1
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Attachments: Queue Deletion in Fair Scheduler.pdf, 
> YARN-8191.000.patch, YARN-8191.001.patch, YARN-8191.002.patch, 
> YARN-8191.003.patch, YARN-8191.004.patch, YARN-8191.005.patch
>
>
> The Fair Scheduler never cleans up queues even if they are deleted in the 
> allocation file, or were dynamically created and are never going to be used 
> again. Queues always remain in memory which leads to two following issues.
>  # Steady fairshares aren’t calculated correctly due to remaining queues
>  # WebUI shows deleted queues, which is confusing for users (YARN-4022).
> We want to support proper queue deletion without restarting the Resource 
> Manager:
>  # Static queues without any entries that are removed from fair-scheduler.xml 
> should be deleted from memory.
>  # Dynamic queues without any entries should be deleted.
>  # RM Web UI should only show the queues defined in the scheduler at that 
> point in time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8191) Fair scheduler: queue deletion without RM restart

2018-05-02 Thread Gergo Repas (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461238#comment-16461238
 ] 

Gergo Repas commented on YARN-8191:
---

[~wilfreds] Thank you for the review! I uploaded an updated patch. My replies 
to your comments:
* > The queueMgr can never be null, it is created when the FS gets created; the 
null check is not needed in the AllocationFileLoaderService
** I removed this check
* > You are adding a new method to the FS (hasApplicationAssignedToQueue) to 
check if an application is submitted to the queue but the application attempt 
has not been created yet. If we have a large number of queues and applications 
this count loop could become really expensive. I would suggest that we track 
submitted applications in the FSLeafQueue. We can then easily redefine empty as 
all 3 types (submit/run/non-run) being 0.
** Good point, I modified the code according to your comment.
* > The root and root.default queue should not be marked dynamic when they get 
created by the queue manager.
** They are actually marked as dynamic=false
* > I do not understand the change around the isEmpty / shutDownIfEmpty: some 
code is commented out and shutDownIfEmpty is never used
** I submitted this block comment/method by mistake; I removed it.
* > I think we should redefine isEmpty() in the QueueManager to take into 
account the submitted apps instead of creating something new.
** Good point, I updated this method (a rough sketch of the idea follows after 
this list).
* > getDynamicQueueNames should rely on the isDynamic flag that is part of the 
queue and not calculated.
** getDynamicQueueNames() is being used while we're setting the isDynamic 
flags; that's why I could not rely on the flag itself.
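
To make the emptiness check concrete, here is a rough sketch of the updated logic based on tracking submitted applications in FSLeafQueue (field names such as {{numSubmittedApps}} are assumptions, not the committed patch):
{code:java}
// Sketch: a leaf queue is empty only when it has no submitted, runnable or
// non-runnable applications; the counter avoids scanning every app in the scheduler.
public boolean isEmpty() {
  readLock.lock();
  try {
    return numSubmittedApps == 0
        && runnableApps.isEmpty()
        && nonRunnableApps.isEmpty();
  } finally {
    readLock.unlock();
  }
}
{code}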

> Fair scheduler: queue deletion without RM restart
> -
>
> Key: YARN-8191
> URL: https://issues.apache.org/jira/browse/YARN-8191
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.0.1
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Attachments: Queue Deletion in Fair Scheduler.pdf, 
> YARN-8191.000.patch, YARN-8191.001.patch, YARN-8191.002.patch, 
> YARN-8191.003.patch, YARN-8191.004.patch
>
>
> The Fair Scheduler never cleans up queues even if they are deleted in the 
> allocation file, or were dynamically created and are never going to be used 
> again. Queues always remain in memory which leads to two following issues.
>  # Steady fairshares aren’t calculated correctly due to remaining queues
>  # WebUI shows deleted queues, which is confusing for users (YARN-4022).
> We want to support proper queue deletion without restarting the Resource 
> Manager:
>  # Static queues without any entries that are removed from fair-scheduler.xml 
> should be deleted from memory.
>  # Dynamic queues without any entries should be deleted.
>  # RM Web UI should only show the queues defined in the scheduler at that 
> point in time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8191) Fair scheduler: queue deletion without RM restart

2018-05-02 Thread Gergo Repas (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gergo Repas updated YARN-8191:
--
Attachment: YARN-8191.004.patch

> Fair scheduler: queue deletion without RM restart
> -
>
> Key: YARN-8191
> URL: https://issues.apache.org/jira/browse/YARN-8191
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.0.1
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Attachments: Queue Deletion in Fair Scheduler.pdf, 
> YARN-8191.000.patch, YARN-8191.001.patch, YARN-8191.002.patch, 
> YARN-8191.003.patch, YARN-8191.004.patch
>
>
> The Fair Scheduler never cleans up queues even if they are deleted in the 
> allocation file, or were dynamically created and are never going to be used 
> again. Queues always remain in memory which leads to two following issues.
>  # Steady fairshares aren’t calculated correctly due to remaining queues
>  # WebUI shows deleted queues, which is confusing for users (YARN-4022).
> We want to support proper queue deletion without restarting the Resource 
> Manager:
>  # Static queues without any entries that are removed from fair-scheduler.xml 
> should be deleted from memory.
>  # Dynamic queues without any entries should be deleted.
>  # RM Web UI should only show the queues defined in the scheduler at that 
> point in time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-4022) queue not remove from webpage(/cluster/scheduler) when delete queue in xxx-scheduler.xml

2018-05-02 Thread Gergo Repas (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gergo Repas resolved YARN-4022.
---
Resolution: Duplicate

[~snemeth] As per our discussion, this issue is covered by YARN-8191. Let's 
close this one.

> queue not remove from webpage(/cluster/scheduler) when delete queue in 
> xxx-scheduler.xml
> 
>
> Key: YARN-4022
> URL: https://issues.apache.org/jira/browse/YARN-4022
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler, resourcemanager
>Affects Versions: 2.7.1
>Reporter: forrestchen
>Assignee: Szilard Nemeth
>Priority: Major
>  Labels: oct16-medium, scheduler
> Attachments: Queue Deletion in Fair Scheduler.pdf, 
> YARN-4022.001.patch, YARN-4022.002.patch, YARN-4022.003.patch, 
> YARN-4022.004.patch
>
>
> When I delete an existing queue by modifying the xxx-scheduler.xml, I can still 
> see the queue's information block on the webpage (/cluster/scheduler), although 
> the 'Min Resources' items all become zero and there is no 'Max Running 
> Applications' item.
> I can still submit an application to the deleted queue and the application 
> will run in the 'root.default' queue instead, but submitting to a non-existent 
> queue will cause an exception.
> My expectation is that the deleted queue will not be displayed on the webpage, 
> and submitting an application to the deleted queue will behave as if the queue 
> doesn't exist.
> PS: There's no application running in the queue I delete.
> Some related config in yarn-site.xml:
> {code}
> <property>
>   <name>yarn.scheduler.fair.user-as-default-queue</name>
>   <value>false</value>
> </property>
> <property>
>   <name>yarn.scheduler.fair.allow-undeclared-pools</name>
>   <value>false</value>
> </property>
> {code}
> a related question is here: 
> http://stackoverflow.com/questions/26488564/hadoop-yarn-why-the-queue-cannot-be-deleted-after-i-revise-my-fair-scheduler-xm



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8191) Fair scheduler: queue deletion without RM restart

2018-05-02 Thread Gergo Repas (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gergo Repas updated YARN-8191:
--
Attachment: Queue Deletion in Fair Scheduler.pdf

> Fair scheduler: queue deletion without RM restart
> -
>
> Key: YARN-8191
> URL: https://issues.apache.org/jira/browse/YARN-8191
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.0.1
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Attachments: Queue Deletion in Fair Scheduler.pdf, 
> YARN-8191.000.patch, YARN-8191.001.patch, YARN-8191.002.patch, 
> YARN-8191.003.patch
>
>
> The Fair Scheduler never cleans up queues even if they are deleted in the 
> allocation file, or were dynamically created and are never going to be used 
> again. Queues always remain in memory which leads to two following issues.
>  # Steady fairshares aren’t calculated correctly due to remaining queues
>  # WebUI shows deleted queues, which is confusing for users (YARN-4022).
> We want to support proper queue deletion without restarting the Resource 
> Manager:
>  # Static queues without any entries that are removed from fair-scheduler.xml 
> should be deleted from memory.
>  # Dynamic queues without any entries should be deleted.
>  # RM Web UI should only show the queues defined in the scheduler at that 
> point in time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8202) DefaultAMSProcessor should properly check units of requested custom resource types against minimum/maximum allocation

2018-05-02 Thread Gergo Repas (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461027#comment-16461027
 ] 

Gergo Repas commented on YARN-8202:
---

[~snemeth] Thanks for the updated patch, LGTM.

+1 (non-binding)

> DefaultAMSProcessor should properly check units of requested custom resource 
> types against minimum/maximum allocation
> -
>
> Key: YARN-8202
> URL: https://issues.apache.org/jira/browse/YARN-8202
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Blocker
> Attachments: YARN-8202-001.patch, YARN-8202-002.patch, 
> YARN-8202-003.patch, YARN-8202-004.patch, YARN-8202-005.patch, 
> YARN-8202-006.patch, YARN-8202-007.patch, YARN-8202-008.patch
>
>
>  
> When I execute a pi job with arguments: 
> {code:java}
> -Dmapreduce.map.resource.memory-mb=200 
> -Dmapreduce.map.resource.resource1=500M 1 1000{code}
> and I have one node with 5GB of resource1, I get the following exception on 
> every second and the job hangs:
> {code:java}
> 2018-04-24 08:42:03,694 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
> 20 on 8030, call Call#386 Retry#0 
> org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB.allocate from 
> 172.31.119.172:58138
> org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
> resource request, requested resource type=[resource1] < 0 or greater than 
> maximum allowed allocation. Requested resource= resource1: 500M>, maximum allowed allocation= resource1: 5G>, please note that maximum allowed allocation is calculated by 
> scheduler based on maximum resource of registered NodeManagers, which might 
> be less than configured maximum allocation= resource1: 9223372036854775807G>
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:286)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:242)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndvalidateRequest(SchedulerUtils.java:258)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.normalizeAndValidateRequests(RMServerUtils.java:249)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.allocate(DefaultAMSProcessor.java:230)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.processor.DisabledPlacementProcessor.allocate(DisabledPlacementProcessor.java:75)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:433)
>         at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
>         at 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1682)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678)
> {code}
> *This is because 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils#validateResourceRequest
>  does not take resource units into account.*
>  
> However, if I start a job with arguments: 
> {code:java}
> -Dmapreduce.map.resource.memory-mb=200 -Dmapreduce.map.resource.resource1=1G 
> 1 1000{code}
> and I still have 5GB of resource1 on one node then the job runs successfully.
>  
> I also tried a third job run, when I request 1GB of resource1 and I have no 
> nodes with any amount of resource1, then I restart the node with 5GBs of 
> resource1, the job ultimately completes, but just after the node with enough 
> resources registered in RM, which is the desired behaviour.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

[jira] [Commented] (YARN-8202) DefaultAMSProcessor should properly check units of requested custom resource types against minimum/maximum allocation

2018-04-26 Thread Gergo Repas (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16454263#comment-16454263
 ] 

Gergo Repas commented on YARN-8202:
---

[~snemeth] Thanks for the patch! A couple of comments:
 # {{UnitsConversionUtil.checkUnitArguments()}} - it may be more reusable if it 
only checked a single unit and was called twice (see the sketch after these 
comments)
 # {{TestApplicationMasterService.testValidateRequestCapacityAgainstMinMaxAllocationWithDifferentUnits()}}
 - the config class is CapacitySchedulerConfiguration; however, it is using the 
FairScheduler
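
To illustrate point 1, the single-unit variant could look roughly like this (a sketch that assumes the existing {{KNOWN_UNITS}} set in {{UnitsConversionUtil}}; the exact validation in the patch may differ):
{code:java}
// Sketch: validate one unit at a time and call the helper once per argument.
private static void checkUnitArgument(String unit) {
  if (unit == null) {
    throw new IllegalArgumentException("Unit cannot be null");
  }
  if (!KNOWN_UNITS.contains(unit)) {
    throw new IllegalArgumentException("Unknown unit '" + unit + "'");
  }
}

// caller side:
// checkUnitArgument(fromUnit);
// checkUnitArgument(toUnit);
{code}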

> DefaultAMSProcessor should properly check units of requested custom resource 
> types against minimum/maximum allocation
> -
>
> Key: YARN-8202
> URL: https://issues.apache.org/jira/browse/YARN-8202
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Blocker
> Attachments: YARN-8202-001.patch, YARN-8202-002.patch, 
> YARN-8202-003.patch
>
>
>  
> When I execute a pi job with arguments: 
> {code:java}
> -Dmapreduce.map.resource.memory-mb=200 
> -Dmapreduce.map.resource.resource1=500M 1 1000{code}
> and I have one node with 5GB of resource1, I get the following exception on 
> every second and the job hangs:
> {code:java}
> 2018-04-24 08:42:03,694 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
> 20 on 8030, call Call#386 Retry#0 
> org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB.allocate from 
> 172.31.119.172:58138
> org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
> resource request, requested resource type=[resource1] < 0 or greater than 
> maximum allowed allocation. Requested resource= resource1: 500M>, maximum allowed allocation= resource1: 5G>, please note that maximum allowed allocation is calculated by 
> scheduler based on maximum resource of registered NodeManagers, which might 
> be less than configured maximum allocation= resource1: 9223372036854775807G>
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:286)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:242)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndvalidateRequest(SchedulerUtils.java:258)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.normalizeAndValidateRequests(RMServerUtils.java:249)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.allocate(DefaultAMSProcessor.java:230)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.processor.DisabledPlacementProcessor.allocate(DisabledPlacementProcessor.java:75)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:433)
>         at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
>         at 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1682)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678)
> {code}
> *This is because 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils#validateResourceRequest
>  does not take resource units into account.*
>  
> However, if I start a job with arguments: 
> {code:java}
> -Dmapreduce.map.resource.memory-mb=200 -Dmapreduce.map.resource.resource1=1G 
> 1 1000{code}
> and I still have 5GB of resource1 on one node then the job runs successfully.
>  
> I also tried a third job run, when I request 1GB of resource1 and I have no 
> nodes with any amount of resource1, then I restart the node with 5GBs of 
> resource1, the job ultimately completes, but just after the node with enough 
> resources registered in RM, which 

[jira] [Updated] (YARN-8191) Fair scheduler: queue deletion without RM restart

2018-04-26 Thread Gergo Repas (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gergo Repas updated YARN-8191:
--
Attachment: YARN-8191.003.patch

> Fair scheduler: queue deletion without RM restart
> -
>
> Key: YARN-8191
> URL: https://issues.apache.org/jira/browse/YARN-8191
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.0.1
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Attachments: YARN-8191.000.patch, YARN-8191.001.patch, 
> YARN-8191.002.patch, YARN-8191.003.patch
>
>
> The Fair Scheduler never cleans up queues even if they are deleted in the 
> allocation file, or were dynamically created and are never going to be used 
> again. Queues always remain in memory which leads to two following issues.
>  # Steady fairshares aren’t calculated correctly due to remaining queues
>  # WebUI shows deleted queues, which is confusing for users (YARN-4022).
> We want to support proper queue deletion without restarting the Resource 
> Manager:
>  # Static queues without any entries that are removed from fair-scheduler.xml 
> should be deleted from memory.
>  # Dynamic queues without any entries should be deleted.
>  # RM Web UI should only show the queues defined in the scheduler at that 
> point in time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8191) Fair scheduler: queue deletion without RM restart

2018-04-24 Thread Gergo Repas (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gergo Repas updated YARN-8191:
--
Attachment: (was: YARN-8191.002.patch)

> Fair scheduler: queue deletion without RM restart
> -
>
> Key: YARN-8191
> URL: https://issues.apache.org/jira/browse/YARN-8191
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.0.1
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Attachments: YARN-8191.000.patch, YARN-8191.001.patch, 
> YARN-8191.002.patch
>
>
> The Fair Scheduler never cleans up queues even if they are deleted in the 
> allocation file, or were dynamically created and are never going to be used 
> again. Queues always remain in memory which leads to two following issues.
>  # Steady fairshares aren’t calculated correctly due to remaining queues
>  # WebUI shows deleted queues, which is confusing for users (YARN-4022).
> We want to support proper queue deletion without restarting the Resource 
> Manager:
>  # Static queues without any entries that are removed from fair-scheduler.xml 
> should be deleted from memory.
>  # Dynamic queues without any entries should be deleted.
>  # RM Web UI should only show the queues defined in the scheduler at that 
> point in time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8191) Fair scheduler: queue deletion without RM restart

2018-04-24 Thread Gergo Repas (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gergo Repas updated YARN-8191:
--
Attachment: YARN-8191.002.patch

> Fair scheduler: queue deletion without RM restart
> -
>
> Key: YARN-8191
> URL: https://issues.apache.org/jira/browse/YARN-8191
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.0.1
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Attachments: YARN-8191.000.patch, YARN-8191.001.patch, 
> YARN-8191.002.patch
>
>
> The Fair Scheduler never cleans up queues even if they are deleted in the 
> allocation file, or were dynamically created and are never going to be used 
> again. Queues always remain in memory which leads to two following issues.
>  # Steady fairshares aren’t calculated correctly due to remaining queues
>  # WebUI shows deleted queues, which is confusing for users (YARN-4022).
> We want to support proper queue deletion without restarting the Resource 
> Manager:
>  # Static queues without any entries that are removed from fair-scheduler.xml 
> should be deleted from memory.
>  # Dynamic queues without any entries should be deleted.
>  # RM Web UI should only show the queues defined in the scheduler at that 
> point in time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8191) Fair scheduler: queue deletion without RM restart

2018-04-24 Thread Gergo Repas (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gergo Repas updated YARN-8191:
--
Attachment: YARN-8191.002.patch

> Fair scheduler: queue deletion without RM restart
> -
>
> Key: YARN-8191
> URL: https://issues.apache.org/jira/browse/YARN-8191
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.0.1
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Attachments: YARN-8191.000.patch, YARN-8191.001.patch, 
> YARN-8191.002.patch
>
>
> The Fair Scheduler never cleans up queues even if they are deleted in the 
> allocation file, or were dynamically created and are never going to be used 
> again. Queues always remain in memory which leads to two following issues.
>  # Steady fairshares aren’t calculated correctly due to remaining queues
>  # WebUI shows deleted queues, which is confusing for users (YARN-4022).
> We want to support proper queue deletion without restarting the Resource 
> Manager:
>  # Static queues without any entries that are removed from fair-scheduler.xml 
> should be deleted from memory.
>  # Dynamic queues without any entries should be deleted.
>  # RM Web UI should only show the queues defined in the scheduler at that 
> point in time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8191) Fair scheduler: queue deletion without RM restart

2018-04-23 Thread Gergo Repas (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gergo Repas updated YARN-8191:
--
Attachment: YARN-8191.001.patch

> Fair scheduler: queue deletion without RM restart
> -
>
> Key: YARN-8191
> URL: https://issues.apache.org/jira/browse/YARN-8191
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.0.1
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Attachments: YARN-8191.000.patch, YARN-8191.001.patch
>
>
> The Fair Scheduler never cleans up queues even if they are deleted in the 
> allocation file, or were dynamically created and are never going to be used 
> again. Queues always remain in memory which leads to two following issues.
>  # Steady fairshares aren’t calculated correctly due to remaining queues
>  # WebUI shows deleted queues, which is confusing for users (YARN-4022).
> We want to support proper queue deletion without restarting the Resource 
> Manager:
>  # Static queues without any entries that are removed from fair-scheduler.xml 
> should be deleted from memory.
>  # Dynamic queues without any entries should be deleted.
>  # RM Web UI should only show the queues defined in the scheduler at that 
> point in time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8090) Race conditions in FadvisedChunkedFile

2018-04-20 Thread Gergo Repas (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16445927#comment-16445927
 ] 

Gergo Repas commented on YARN-8090:
---

[~miklos.szeg...@cloudera.com] Thanks for the updated patch.

+1 (non-binding)

> Race conditions in FadvisedChunkedFile
> --
>
> Key: YARN-8090
> URL: https://issues.apache.org/jira/browse/YARN-8090
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: Miklos Szegedi
>Assignee: Miklos Szegedi
>Priority: Major
> Attachments: YARN-8090.000.patch, YARN-8090.001.patch
>
>
> {code:java}
> 11:04:33.605 AM   WARNFadvisedChunkedFile 
> Failed to manage OS cache for 
> /var/run/100/yarn/nm/usercache/systest/appcache/application_1521665017379_0062/output/attempt_1521665017379_0062_m_012797_0/file.out
> EBADF: Bad file descriptor
>   at org.apache.hadoop.io.nativeio.NativeIO$POSIX.posix_fadvise(Native 
> Method)
>   at 
> org.apache.hadoop.io.nativeio.NativeIO$POSIX.posixFadviseIfPossible(NativeIO.java:267)
>   at 
> org.apache.hadoop.io.nativeio.NativeIO$POSIX$CacheManipulator.posixFadviseIfPossible(NativeIO.java:146)
>   at 
> org.apache.hadoop.mapred.FadvisedChunkedFile.close(FadvisedChunkedFile.java:76)
>   at 
> org.jboss.netty.handler.stream.ChunkedWriteHandler.closeInput(ChunkedWriteHandler.java:303)
>   at 
> org.jboss.netty.handler.stream.ChunkedWriteHandler.discard(ChunkedWriteHandler.java:163)
>   at 
> org.jboss.netty.handler.stream.ChunkedWriteHandler.flush(ChunkedWriteHandler.java:192)
>   at 
> org.jboss.netty.handler.stream.ChunkedWriteHandler.handleUpstream(ChunkedWriteHandler.java:137)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
>   at 
> org.jboss.netty.channel.SimpleChannelUpstreamHandler.channelClosed(SimpleChannelUpstreamHandler.java:225)
>   at 
> org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:88)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
>   at 
> org.jboss.netty.handler.codec.replay.ReplayingDecoder.cleanup(ReplayingDecoder.java:570)
>   at 
> org.jboss.netty.handler.codec.frame.FrameDecoder.channelClosed(FrameDecoder.java:371)
>   at 
> org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:88)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
>   at 
> org.jboss.netty.handler.codec.frame.FrameDecoder.cleanup(FrameDecoder.java:493)
>   at 
> org.jboss.netty.handler.codec.frame.FrameDecoder.channelClosed(FrameDecoder.java:371)
>   at 
> org.jboss.netty.handler.ssl.SslHandler.channelClosed(SslHandler.java:1667)
>   at 
> org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:88)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
>   at org.jboss.netty.channel.Channels.fireChannelClosed(Channels.java:468)
>   at 
> org.jboss.netty.channel.socket.nio.AbstractNioWorker.close(AbstractNioWorker.java:375)
>   at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:93)
>   at 
> org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
>   at 
> org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
>   at 
> org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
>   at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
>   at 
> org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
>   at 
> org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

[jira] [Updated] (YARN-8191) Fair scheduler: queue deletion without RM restart

2018-04-20 Thread Gergo Repas (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gergo Repas updated YARN-8191:
--
Attachment: YARN-8191.000.patch

> Fair scheduler: queue deletion without RM restart
> -
>
> Key: YARN-8191
> URL: https://issues.apache.org/jira/browse/YARN-8191
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.0.1
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Attachments: YARN-8191.000.patch
>
>
> The Fair Scheduler never cleans up queues even if they are deleted in the 
> allocation file, or were dynamically created and are never going to be used 
> again. Queues always remain in memory which leads to two following issues.
>  # Steady fairshares aren’t calculated correctly due to remaining queues
>  # WebUI shows deleted queues, which is confusing for users (YARN-4022).
> We want to support proper queue deletion without restarting the Resource 
> Manager:
>  # Static queues without any entries that are removed from fair-scheduler.xml 
> should be deleted from memory.
>  # Dynamic queues without any entries should be deleted.
>  # RM Web UI should only show the queues defined in the scheduler at that 
> point in time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8191) Fair scheduler: queue deletion without RM restart

2018-04-20 Thread Gergo Repas (JIRA)
Gergo Repas created YARN-8191:
-

 Summary: Fair scheduler: queue deletion without RM restart
 Key: YARN-8191
 URL: https://issues.apache.org/jira/browse/YARN-8191
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: fairscheduler
Affects Versions: 3.0.1
Reporter: Gergo Repas
Assignee: Gergo Repas


The Fair Scheduler never cleans up queues even if they are deleted in the 
allocation file, or were dynamically created and are never going to be used 
again. Queues always remain in memory, which leads to the following two issues:
 # Steady fairshares aren’t calculated correctly due to the remaining queues
 # WebUI shows deleted queues, which is confusing for users (YARN-4022).

We want to support proper queue deletion without restarting the Resource 
Manager:
 # Static queues without any entries that are removed from fair-scheduler.xml 
should be deleted from memory.
 # Dynamic queues without any entries should be deleted.
 # RM Web UI should only show the queues defined in the scheduler at that point 
in time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8090) Race conditions in FadvisedChunkedFile

2018-04-20 Thread Gergo Repas (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16445490#comment-16445490
 ] 

Gergo Repas commented on YARN-8090:
---

[~miklos.szeg...@cloudera.com] Thanks for the patch. A couple of comments:
# if such a race condition can occur, then I think readaheadRequest should be 
marked as volatile
# closeLock - it's a common practice to make such fields final (a sketch of both 
field declarations follows below)
# one of the new lines is long and would trigger a checkstyle warning 
({{.readaheadStream(identifier, fd, getCurrentOffset(), readaheadLength,}})
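
To illustrate points 1 and 2, the field declarations could look like this (a sketch assuming the existing field names in FadvisedChunkedFile, not the final patch):
{code:java}
// The readahead request is set and cleared from different threads, so mark it
// volatile; the lock object is never reassigned, so it can be final.
private volatile ReadaheadPool.ReadaheadRequest readaheadRequest;
private final Object closeLock = new Object();
{code}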

> Race conditions in FadvisedChunkedFile
> --
>
> Key: YARN-8090
> URL: https://issues.apache.org/jira/browse/YARN-8090
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: Miklos Szegedi
>Assignee: Miklos Szegedi
>Priority: Major
> Attachments: YARN-8090.000.patch
>
>
> {code:java}
> 11:04:33.605 AM   WARNFadvisedChunkedFile 
> Failed to manage OS cache for 
> /var/run/100/yarn/nm/usercache/systest/appcache/application_1521665017379_0062/output/attempt_1521665017379_0062_m_012797_0/file.out
> EBADF: Bad file descriptor
>   at org.apache.hadoop.io.nativeio.NativeIO$POSIX.posix_fadvise(Native 
> Method)
>   at 
> org.apache.hadoop.io.nativeio.NativeIO$POSIX.posixFadviseIfPossible(NativeIO.java:267)
>   at 
> org.apache.hadoop.io.nativeio.NativeIO$POSIX$CacheManipulator.posixFadviseIfPossible(NativeIO.java:146)
>   at 
> org.apache.hadoop.mapred.FadvisedChunkedFile.close(FadvisedChunkedFile.java:76)
>   at 
> org.jboss.netty.handler.stream.ChunkedWriteHandler.closeInput(ChunkedWriteHandler.java:303)
>   at 
> org.jboss.netty.handler.stream.ChunkedWriteHandler.discard(ChunkedWriteHandler.java:163)
>   at 
> org.jboss.netty.handler.stream.ChunkedWriteHandler.flush(ChunkedWriteHandler.java:192)
>   at 
> org.jboss.netty.handler.stream.ChunkedWriteHandler.handleUpstream(ChunkedWriteHandler.java:137)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
>   at 
> org.jboss.netty.channel.SimpleChannelUpstreamHandler.channelClosed(SimpleChannelUpstreamHandler.java:225)
>   at 
> org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:88)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
>   at 
> org.jboss.netty.handler.codec.replay.ReplayingDecoder.cleanup(ReplayingDecoder.java:570)
>   at 
> org.jboss.netty.handler.codec.frame.FrameDecoder.channelClosed(FrameDecoder.java:371)
>   at 
> org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:88)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
>   at 
> org.jboss.netty.handler.codec.frame.FrameDecoder.cleanup(FrameDecoder.java:493)
>   at 
> org.jboss.netty.handler.codec.frame.FrameDecoder.channelClosed(FrameDecoder.java:371)
>   at 
> org.jboss.netty.handler.ssl.SslHandler.channelClosed(SslHandler.java:1667)
>   at 
> org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:88)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
>   at 
> org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
>   at org.jboss.netty.channel.Channels.fireChannelClosed(Channels.java:468)
>   at 
> org.jboss.netty.channel.socket.nio.AbstractNioWorker.close(AbstractNioWorker.java:375)
>   at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:93)
>   at 
> org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
>   at 
> org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
>   at 
> org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
>   at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
>   at 
> org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
>   at 
> org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> 

[jira] [Commented] (YARN-7913) Improve error handling when application recovery fails with exception

2018-04-04 Thread Gergo Repas (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16425543#comment-16425543
 ] 

Gergo Repas commented on YARN-7913:
---

[~oshevchenko] I see, that's a good point that ACL changes may cause similar 
issues.

In this ticket my proposal would be to improve the error handling in general. 
I'd propose trying to recover all applications in a passive-to-active RM 
transition attempt even in the presence of exceptions (logging every exception 
and rethrowing the last one), and letting the async event processing take care 
of moving the application attempt to the rejected state (see the sketch below). 
This way we would have only a couple of passive-to-active RM transition attempts 
(depending on how long the async event processing takes), instead of N+1 
passive-to-active attempts, where N is the number of rejected applications. This 
is meant to improve the handling of any unforeseen exceptions.
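
For illustration, a rough sketch of the error-handling pattern I have in mind - 
the class, method and logger here are made up for the example and are not the 
actual RM recovery code:
{code:java}
import java.util.List;
import java.util.function.Consumer;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Illustrative sketch of the proposed error handling; not actual RM code. */
final class RecoverAllSketch {
  private static final Logger LOG =
      Logger.getLogger(RecoverAllSketch.class.getName());

  /**
   * Try to recover every application, log each failure, and rethrow only the
   * last exception after all apps have been attempted, so a single bad app
   * no longer aborts the whole passive-to-active transition.
   */
  static <T> void recoverAll(List<T> apps, Consumer<T> recoverOne) {
    RuntimeException lastFailure = null;
    for (T app : apps) {
      try {
        recoverOne.accept(app);
      } catch (RuntimeException e) {
        LOG.log(Level.SEVERE, "Failed to recover application " + app, e);
        lastFailure = e;
      }
    }
    if (lastFailure != null) {
      throw lastFailure;
    }
  }
}
{code}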

What do you think?

> Improve error handling when application recovery fails with exception
> -
>
> Key: YARN-7913
> URL: https://issues.apache.org/jira/browse/YARN-7913
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 3.0.0
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Attachments: YARN-7913.000.poc.patch
>
>
> There are edge cases when the application recovery fails with an exception.
> Example failure scenario:
>  * setup: a queue is a leaf queue in the primary RM's config and the same 
> queue is a parent queue in the secondary RM's config.
>  * When failover happens with this setup, the recovery will fail for 
> applications on this queue, and an APP_REJECTED event will be dispatched to 
> the async dispatcher. On the same thread (that handles the recovery), a 
> NullPointerException is thrown when the applicationAttempt is tried to be 
> recovered 
> (https://github.com/apache/hadoop/blob/55066cc53dc22b68f9ca55a0029741d6c846be0a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L494).
>  I don't see a good way to avoid the NPE in this scenario, because when the 
> NPE occurs the APP_REJECTED has not been processed yet, and we don't know 
> that the application recovery failed.
> Currently the first exception will abort the recovery, and if there are X 
> applications, there will be ~X passive -> active RM transition attempts - the 
> passive -> active RM transition will only succeed when the last APP_REJECTED 
> event is processed on the async dispatcher thread.
> _The point of this ticket is to improve the error handling and reduce the 
> number of passive -> active RM transition attempts (solving the above 
> described failure scenario isn't in scope)._



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5028) RMStateStore should trim down app state for completed applications

2018-03-05 Thread Gergo Repas (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16385855#comment-16385855
 ] 

Gergo Repas commented on YARN-5028:
---

Thanks [~yufeigu], [~rohithsharma]!

> RMStateStore should trim down app state for completed applications
> --
>
> Key: YARN-5028
> URL: https://issues.apache.org/jira/browse/YARN-5028
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Karthik Kambatla
>Assignee: Gergo Repas
>Priority: Major
> Fix For: 3.1.0
>
> Attachments: YARN-5028.000.patch, YARN-5028.001.patch, 
> YARN-5028.002.patch, YARN-5028.003.patch, YARN-5028.004.patch, 
> YARN-5028.005.patch, YARN-5028.006.patch, YARN-5028.007-addendum.patch, 
> YARN-5028.007-addendum.patch, YARN-5028.007.patch
>
>
> RMStateStore stores enough information to recover applications in case of a 
> restart. The store also retains this information for completed applications 
> to serve their status to REST, WebUI, Java and CLI clients. We don't need all 
> the information we store today to serve application status; for instance, we 
> don't need the {{ApplicationSubmissionContext}}. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5028) RMStateStore should trim down app state for completed applications

2018-03-01 Thread Gergo Repas (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382113#comment-16382113
 ] 

Gergo Repas commented on YARN-5028:
---

[~rohithsharma] Apologies for causing the break, and thanks for the addendum 
patch! The addendum patch looks good to me.

(I'm also fine with having my change reverted and adding it back later, once I 
have tested it with ATSv2, even if that means a later release.)

> RMStateStore should trim down app state for completed applications
> --
>
> Key: YARN-5028
> URL: https://issues.apache.org/jira/browse/YARN-5028
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Karthik Kambatla
>Assignee: Gergo Repas
>Priority: Major
> Fix For: 3.1.0
>
> Attachments: YARN-5028.000.patch, YARN-5028.001.patch, 
> YARN-5028.002.patch, YARN-5028.003.patch, YARN-5028.004.patch, 
> YARN-5028.005.patch, YARN-5028.006.patch, YARN-5028.007-addendum.patch, 
> YARN-5028.007-addendum.patch, YARN-5028.007.patch
>
>
> RMStateStore stores enough information to recover applications in case of a 
> restart. The store also retains this information for completed applications 
> to serve their status to REST, WebUI, Java and CLI clients. We don't need all 
> the information we store today to serve application status; for instance, we 
> don't need the {{ApplicationSubmissionContext}}. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5028) RMStateStore should trim down app state for completed applications

2018-02-22 Thread Gergo Repas (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16372608#comment-16372608
 ] 

Gergo Repas commented on YARN-5028:
---

[~yufeigu] Thanks for the commit and the reviews - I don't need this change in 
branch-2.
Thanks [~snemeth] and [~rohithsharma] for the reviews as well.

> RMStateStore should trim down app state for completed applications
> --
>
> Key: YARN-5028
> URL: https://issues.apache.org/jira/browse/YARN-5028
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Karthik Kambatla
>Assignee: Gergo Repas
>Priority: Major
> Fix For: 3.1.0
>
> Attachments: YARN-5028.000.patch, YARN-5028.001.patch, 
> YARN-5028.002.patch, YARN-5028.003.patch, YARN-5028.004.patch, 
> YARN-5028.005.patch, YARN-5028.006.patch, YARN-5028.007.patch
>
>
> RMStateStore stores enough information to recover applications in case of a 
> restart. The store also retains this information for completed applications 
> to serve their status to REST, WebUI, Java and CLI clients. We don't need all 
> the information we store today to serve application status; for instance, we 
> don't need the {{ApplicationSubmissionContext}}. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5028) RMStateStore should trim down app state for completed applications

2018-02-20 Thread Gergo Repas (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16370093#comment-16370093
 ] 

Gergo Repas commented on YARN-5028:
---

007.patch: I addressed the checkstyle issue and locally ran all unit tests in 
the RM (Tests run: 2209, Failures: 0, Errors: 0, Skipped: 7). I also tested on a 
real cluster, and recovery worked with the new trimmed-down application state.

> RMStateStore should trim down app state for completed applications
> --
>
> Key: YARN-5028
> URL: https://issues.apache.org/jira/browse/YARN-5028
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Karthik Kambatla
>Assignee: Gergo Repas
>Priority: Major
> Attachments: YARN-5028.000.patch, YARN-5028.001.patch, 
> YARN-5028.002.patch, YARN-5028.003.patch, YARN-5028.004.patch, 
> YARN-5028.005.patch, YARN-5028.006.patch, YARN-5028.007.patch
>
>
> RMStateStore stores enough information to recover applications in case of a 
> restart. The store also retains this information for completed applications 
> to serve their status to REST, WebUI, Java and CLI clients. We don't need all 
> the information we store today to serve application status; for instance, we 
> don't need the {{ApplicationSubmissionContext}}. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-5028) RMStateStore should trim down app state for completed applications

2018-02-20 Thread Gergo Repas (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gergo Repas updated YARN-5028:
--
Attachment: YARN-5028.007.patch

> RMStateStore should trim down app state for completed applications
> --
>
> Key: YARN-5028
> URL: https://issues.apache.org/jira/browse/YARN-5028
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Karthik Kambatla
>Assignee: Gergo Repas
>Priority: Major
> Attachments: YARN-5028.000.patch, YARN-5028.001.patch, 
> YARN-5028.002.patch, YARN-5028.003.patch, YARN-5028.004.patch, 
> YARN-5028.005.patch, YARN-5028.006.patch, YARN-5028.007.patch
>
>
> RMStateStore stores enough information to recover applications in case of a 
> restart. The store also retains this information for completed applications 
> to serve their status to REST, WebUI, Java and CLI clients. We don't need all 
> the information we store today to serve application status; for instance, we 
> don't need the {{ApplicationSubmissionContext}}. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-5028) RMStateStore should trim down app state for completed applications

2018-02-19 Thread Gergo Repas (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gergo Repas updated YARN-5028:
--
Attachment: YARN-5028.006.patch

> RMStateStore should trim down app state for completed applications
> --
>
> Key: YARN-5028
> URL: https://issues.apache.org/jira/browse/YARN-5028
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Karthik Kambatla
>Assignee: Gergo Repas
>Priority: Major
> Attachments: YARN-5028.000.patch, YARN-5028.001.patch, 
> YARN-5028.002.patch, YARN-5028.003.patch, YARN-5028.004.patch, 
> YARN-5028.005.patch, YARN-5028.006.patch
>
>
> RMStateStore stores enough information to recover applications in case of a 
> restart. The store also retains this information for completed applications 
> to serve their status to REST, WebUI, Java and CLI clients. We don't need all 
> the information we store today to serve application status; for instance, we 
> don't need the {{ApplicationSubmissionContext}}. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5028) RMStateStore should trim down app state for completed applications

2018-02-19 Thread Gergo Repas (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369202#comment-16369202
 ] 

Gergo Repas commented on YARN-5028:
---

I'm looking into the test failures.

> RMStateStore should trim down app state for completed applications
> --
>
> Key: YARN-5028
> URL: https://issues.apache.org/jira/browse/YARN-5028
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Karthik Kambatla
>Assignee: Gergo Repas
>Priority: Major
> Attachments: YARN-5028.000.patch, YARN-5028.001.patch, 
> YARN-5028.002.patch, YARN-5028.003.patch, YARN-5028.004.patch, 
> YARN-5028.005.patch
>
>
> RMStateStore stores enough information to recover applications in case of a 
> restart. The store also retains this information for completed applications 
> to serve their status to REST, WebUI, Java and CLI clients. We don't need all 
> the information we store today to serve application status; for instance, we 
> don't need the {{ApplicationSubmissionContext}}. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5028) RMStateStore should trim down app state for completed applications

2018-02-19 Thread Gergo Repas (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369095#comment-16369095
 ] 

Gergo Repas commented on YARN-5028:
---

[~sunilg] - I think we can keep the ACLs in the same field they are currently 
stored in. I'm clearing the rest of the fields from the AMContainerSpec - that's 
the main change (a rough sketch is below).
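
For illustration, a rough sketch of that kind of trimming, assuming the standard 
ContainerLaunchContext.newInstance factory method - this is not the exact patch 
code:
{code:java}
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;

/** Rough illustration of the trimming idea; not the exact patch code. */
final class TrimAmContainerSpecSketch {
  static void trimAmContainerSpec(ApplicationSubmissionContext context) {
    ContainerLaunchContext original = context.getAMContainerSpec();
    if (original == null) {
      return;
    }
    // Keep only the application ACLs; drop local resources, environment,
    // commands, service data and tokens from the stored state.
    ContainerLaunchContext trimmed = ContainerLaunchContext.newInstance(
        null, null, null, null, null, original.getApplicationACLs());
    context.setAMContainerSpec(trimmed);
  }
}
{code}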

> RMStateStore should trim down app state for completed applications
> --
>
> Key: YARN-5028
> URL: https://issues.apache.org/jira/browse/YARN-5028
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Karthik Kambatla
>Assignee: Gergo Repas
>Priority: Major
> Attachments: YARN-5028.000.patch, YARN-5028.001.patch, 
> YARN-5028.002.patch, YARN-5028.003.patch, YARN-5028.004.patch, 
> YARN-5028.005.patch
>
>
> RMStateStore stores enough information to recover applications in case of a 
> restart. The store also retains this information for completed applications 
> to serve their status to REST, WebUI, Java and CLI clients. We don't need all 
> the information we store today to serve application status; for instance, we 
> don't need the {{ApplicationSubmissionContext}}. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5028) RMStateStore should trim down app state for completed applications

2018-02-19 Thread Gergo Repas (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369088#comment-16369088
 ] 

Gergo Repas commented on YARN-5028:
---

[~rohithsharma] [~yufeigu] - A pi job produced a znode of 1891 bytes with the 
trunk code and 1866 bytes with trunk+YARN-5028.004.patch.

Because of these results I created YARN-5028.005.patch, which keeps only the ACL 
information from the AMContainerSpec; this patch produced a 200-byte znode for 
the same pi job.

> RMStateStore should trim down app state for completed applications
> --
>
> Key: YARN-5028
> URL: https://issues.apache.org/jira/browse/YARN-5028
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Karthik Kambatla
>Assignee: Gergo Repas
>Priority: Major
> Attachments: YARN-5028.000.patch, YARN-5028.001.patch, 
> YARN-5028.002.patch, YARN-5028.003.patch, YARN-5028.004.patch, 
> YARN-5028.005.patch
>
>
> RMStateStore stores enough information to recover applications in case of a 
> restart. The store also retains this information for completed applications 
> to serve their status to REST, WebUI, Java and CLI clients. We don't need all 
> the information we store today to serve application status; for instance, we 
> don't need the {{ApplicationSubmissionContext}}. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-5028) RMStateStore should trim down app state for completed applications

2018-02-19 Thread Gergo Repas (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gergo Repas updated YARN-5028:
--
Attachment: YARN-5028.005.patch

> RMStateStore should trim down app state for completed applications
> --
>
> Key: YARN-5028
> URL: https://issues.apache.org/jira/browse/YARN-5028
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Karthik Kambatla
>Assignee: Gergo Repas
>Priority: Major
> Attachments: YARN-5028.000.patch, YARN-5028.001.patch, 
> YARN-5028.002.patch, YARN-5028.003.patch, YARN-5028.004.patch, 
> YARN-5028.005.patch
>
>
> RMStateStore stores enough information to recover applications in case of a 
> restart. The store also retains this information for completed applications 
> to serve their status to REST, WebUI, Java and CLI clients. We don't need all 
> the information we store today to serve application status; for instance, we 
> don't need the {{ApplicationSubmissionContext}}. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5028) RMStateStore should trim down app state for completed applications

2018-02-16 Thread Gergo Repas (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16367565#comment-16367565
 ] 

Gergo Repas commented on YARN-5028:
---

[~rohithsharma] - Sure, I'm going to put together statistics on znode data sizes 
with and without the patch.

In the earlier patch versions the AMContainerSpec was also removed from the 
submission context, but it was put back because the ACLs for the applications 
come from the AMContainerSpec - maybe a trimmed-down AMContainerSpec containing 
only the ACL data could help. I'll also include such a version in the 
comparison.

> RMStateStore should trim down app state for completed applications
> --
>
> Key: YARN-5028
> URL: https://issues.apache.org/jira/browse/YARN-5028
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Karthik Kambatla
>Assignee: Gergo Repas
>Priority: Major
> Attachments: YARN-5028.000.patch, YARN-5028.001.patch, 
> YARN-5028.002.patch, YARN-5028.003.patch, YARN-5028.004.patch
>
>
> RMStateStore stores enough information to recover applications in case of a 
> restart. The store also retains this information for completed applications 
> to serve their status to REST, WebUI, Java and CLI clients. We don't need all 
> the information we store today to serve application status; for instance, we 
> don't need the {{ApplicationSubmissionContext}}. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5028) RMStateStore should trim down app state for completed applications

2018-02-16 Thread Gergo Repas (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16367317#comment-16367317
 ] 

Gergo Repas commented on YARN-5028:
---

I fixed the bug in 003.patch. The latest run was marked as failed too (there was 
a timeout or other error in the fork), but 2208 tests ran with 0 failures, 0 
errors and 7 skipped - [~yufeigu] I don't think this is a real unit test 
failure; could you please rerun this build?

> RMStateStore should trim down app state for completed applications
> --
>
> Key: YARN-5028
> URL: https://issues.apache.org/jira/browse/YARN-5028
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Karthik Kambatla
>Assignee: Gergo Repas
>Priority: Major
> Attachments: YARN-5028.000.patch, YARN-5028.001.patch, 
> YARN-5028.002.patch, YARN-5028.003.patch, YARN-5028.004.patch
>
>
> RMStateStore stores enough information to recover applications in case of a 
> restart. The store also retains this information for completed applications 
> to serve their status to REST, WebUI, Java and CLI clients. We don't need all 
> the information we store today to serve application status; for instance, we 
> don't need the {{ApplicationSubmissionContext}}. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-5028) RMStateStore should trim down app state for completed applications

2018-02-16 Thread Gergo Repas (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gergo Repas updated YARN-5028:
--
Attachment: YARN-5028.004.patch

> RMStateStore should trim down app state for completed applications
> --
>
> Key: YARN-5028
> URL: https://issues.apache.org/jira/browse/YARN-5028
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Karthik Kambatla
>Assignee: Gergo Repas
>Priority: Major
> Attachments: YARN-5028.000.patch, YARN-5028.001.patch, 
> YARN-5028.002.patch, YARN-5028.003.patch, YARN-5028.004.patch
>
>
> RMStateStore stores enough information to recover applications in case of a 
> restart. The store also retains this information for completed applications 
> to serve their status to REST, WebUI, Java and CLI clients. We don't need all 
> the information we store today to serve application status; for instance, we 
> don't need the {{ApplicationSubmissionContext}}. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5028) RMStateStore should trim down app state for completed applications

2018-02-15 Thread Gergo Repas (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16365788#comment-16365788
 ] 

Gergo Repas commented on YARN-5028:
---

[~yufeigu] Thanks for the suggestions; I addressed all of them in v003 (the 
above two points in the test, removing the null check around setting the 
AMContainerSpec, and copying the whole AMContainerSpec over into the new 
application submission context).

> RMStateStore should trim down app state for completed applications
> --
>
> Key: YARN-5028
> URL: https://issues.apache.org/jira/browse/YARN-5028
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Karthik Kambatla
>Assignee: Gergo Repas
>Priority: Major
> Attachments: YARN-5028.000.patch, YARN-5028.001.patch, 
> YARN-5028.002.patch, YARN-5028.003.patch
>
>
> RMStateStore stores enough information to recover applications in case of a 
> restart. The store also retains this information for completed applications 
> to serve their status to REST, WebUI, Java and CLI clients. We don't need all 
> the information we store today to serve application status; for instance, we 
> don't need the {{ApplicationSubmissionContext}}. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-5028) RMStateStore should trim down app state for completed applications

2018-02-15 Thread Gergo Repas (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gergo Repas updated YARN-5028:
--
Attachment: YARN-5028.003.patch

> RMStateStore should trim down app state for completed applications
> --
>
> Key: YARN-5028
> URL: https://issues.apache.org/jira/browse/YARN-5028
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Karthik Kambatla
>Assignee: Gergo Repas
>Priority: Major
> Attachments: YARN-5028.000.patch, YARN-5028.001.patch, 
> YARN-5028.002.patch, YARN-5028.003.patch
>
>
> RMStateStore stores enough information to recover applications in case of a 
> restart. The store also retains this information for completed applications 
> to serve their status to REST, WebUI, Java and CLI clients. We don't need all 
> the information we store today to serve application status; for instance, we 
> don't need the {{ApplicationSubmissionContext}}. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5028) RMStateStore should trim down app state for completed applications

2018-02-14 Thread Gergo Repas (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16364266#comment-16364266
 ] 

Gergo Repas commented on YARN-5028:
---

[~yufeigu] - Right, the null check can be dropped (I'll upload a new patch once 
the remaining questions are resolved).

Regarding your second question: I'm using a new instance to trim down the state 
as much as possible (also, I had to provide a non-null reference, otherwise an 
exception is thrown at the line you quoted from createAndPopulateNewRMApp()).

> RMStateStore should trim down app state for completed applications
> --
>
> Key: YARN-5028
> URL: https://issues.apache.org/jira/browse/YARN-5028
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Karthik Kambatla
>Assignee: Gergo Repas
>Priority: Major
> Attachments: YARN-5028.000.patch, YARN-5028.001.patch, 
> YARN-5028.002.patch
>
>
> RMStateStore stores enough information to recover applications in case of a 
> restart. The store also retains this information for completed applications 
> to serve their status to REST, WebUI, Java and CLI clients. We don't need all 
> the information we store today to serve application status; for instance, we 
> don't need the {{ApplicationSubmissionContext}}. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5028) RMStateStore should trim down app state for completed applications

2018-02-13 Thread Gergo Repas (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362390#comment-16362390
 ] 

Gergo Repas commented on YARN-5028:
---

[~yufeigu] The two fields that are needed for the unit tests to pass are:
 # AMContainerResourceRequests
 ** needed for multiple TestWorkPreservingRMRestart tests
 ** if this field is not present, the following exception is thrown: 
[https://github.com/apache/hadoop/blob/0c5d7d71a80bccd4ad7eab269d0727b999606a7e/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMAppManager.java#L494]
# ApplicationType
** needed for TestRMRestart
** the exception when this field is not present: 
{code}java.lang.NullPointerException
at org.apache.hadoop.util.StringUtils.toLowerCase(StringUtils.java:1127)
at 
org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplications(ClientRMService.java:869)
at 
org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartGetApplicationList(TestRMRestart.java:1017)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at 
org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
 {code}

> RMStateStore should trim down app state for completed applications
> --
>
> Key: YARN-5028
> URL: https://issues.apache.org/jira/browse/YARN-5028
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Karthik Kambatla
>Assignee: Gergo Repas
>Priority: Major
> Attachments: YARN-5028.000.patch, YARN-5028.001.patch, 
> YARN-5028.002.patch
>
>
> RMStateStore stores enough information to recover applications in case of a 
> restart. The store also retains this information for completed applications 
> to serve their status to REST, WebUI, Java and CLI clients. We don't need all 
> the information we store today to serve application status; for instance, we 
> don't need the {{ApplicationSubmissionContext}}. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5028) RMStateStore should trim down app state for completed applications

2018-02-12 Thread Gergo Repas (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16360528#comment-16360528
 ] 

Gergo Repas commented on YARN-5028:
---

[~yufeigu] Yes, sure - I started out with an empty submission context, but the 
recovery failed at certain points (when running on a real cluster). From the 
stack traces it was pretty clear which fields were missing for the recovery, and 
I added those to the context until recovery succeeded without any issue - this 
covers the first four fields of the quoted section. The last two fields were 
needed for all unit tests to pass.

Please let me know if you need the exact stack traces where these fields are 
needed; it takes some time to produce them, so I'll wait for your reply.

> RMStateStore should trim down app state for completed applications
> --
>
> Key: YARN-5028
> URL: https://issues.apache.org/jira/browse/YARN-5028
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Karthik Kambatla
>Assignee: Gergo Repas
>Priority: Major
> Attachments: YARN-5028.000.patch, YARN-5028.001.patch, 
> YARN-5028.002.patch
>
>
> RMStateStore stores enough information to recover applications in case of a 
> restart. The store also retains this information for completed applications 
> to serve their status to REST, WebUI, Java and CLI clients. We don't need all 
> the information we store today to serve application status; for instance, we 
> don't need the {{ApplicationSubmissionContext}}. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7913) Improve error handling when application recovery fails with exception

2018-02-09 Thread Gergo Repas (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358263#comment-16358263
 ] 

Gergo Repas commented on YARN-7913:
---

I'm attaching a proof-of-concept patch; I ran the following test both on trunk 
and with this patch:

I started 4 long-running applications with the failure-scenario setup (where the 
primary ResourceManager accepts the applications, but the secondary 
ResourceManager fails to place them on the queue, since it is a parent queue in 
the secondary RM's config).

On trunk there were 5 "RM could not transition to Active" error messages; with 
the patch there was only 1 such error message.

Any comments are appreciated.

 

> Improve error handling when application recovery fails with exception
> -
>
> Key: YARN-7913
> URL: https://issues.apache.org/jira/browse/YARN-7913
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 3.0.0
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Attachments: YARN-7913.000.poc.patch
>
>
> There are edge cases when the application recovery fails with an exception.
> Example failure scenario:
>  * setup: a queue is a leaf queue in the primary RM's config and the same 
> queue is a parent queue in the secondary RM's config.
>  * When failover happens with this setup, the recovery will fail for 
> applications on this queue, and an APP_REJECTED event will be dispatched to 
> the async dispatcher. On the same thread (that handles the recovery), a 
> NullPointerException is thrown when the applicationAttempt is tried to be 
> recovered 
> (https://github.com/apache/hadoop/blob/55066cc53dc22b68f9ca55a0029741d6c846be0a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L494).
>  I don't see a good way to avoid the NPE in this scenario, because when the 
> NPE occurs the APP_REJECTED has not been processed yet, and we don't know 
> that the application recovery failed.
> Currently the first exception will abort the recovery, and if there are X 
> applications, there will be ~X passive -> active RM transition attempts - the 
> passive -> active RM transition will only succeed when the last APP_REJECTED 
> event is processed on the async dispatcher thread.
> _The point of this ticket is to improve the error handling and reduce the 
> number of passive -> active RM transition attempts (solving the above 
> described failure scenario isn't in scope)._



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7913) Improve error handling when application recovery fails with exception

2018-02-09 Thread Gergo Repas (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gergo Repas updated YARN-7913:
--
Attachment: YARN-7913.000.poc.patch

> Improve error handling when application recovery fails with exception
> -
>
> Key: YARN-7913
> URL: https://issues.apache.org/jira/browse/YARN-7913
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 3.0.0
>Reporter: Gergo Repas
>Assignee: Gergo Repas
>Priority: Major
> Attachments: YARN-7913.000.poc.patch
>
>
> There are edge cases when the application recovery fails with an exception.
> Example failure scenario:
>  * setup: a queue is a leaf queue in the primary RM's config and the same 
> queue is a parent queue in the secondary RM's config.
>  * When failover happens with this setup, the recovery will fail for 
> applications on this queue, and an APP_REJECTED event will be dispatched to 
> the async dispatcher. On the same thread (that handles the recovery), a 
> NullPointerException is thrown when the applicationAttempt is tried to be 
> recovered 
> (https://github.com/apache/hadoop/blob/55066cc53dc22b68f9ca55a0029741d6c846be0a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L494).
>  I don't see a good way to avoid the NPE in this scenario, because when the 
> NPE occurs the APP_REJECTED has not been processed yet, and we don't know 
> that the application recovery failed.
> Currently the first exception will abort the recovery, and if there are X 
> applications, there will be ~X passive -> active RM transition attempts - the 
> passive -> active RM transition will only succeed when the last APP_REJECTED 
> event is processed on the async dispatcher thread.
> _The point of this ticket is to improve the error handling and reduce the 
> number of passive -> active RM transition attempts (solving the above 
> described failure scenario isn't in scope)._



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-7913) Improve error handling when application recovery fails with exception

2018-02-09 Thread Gergo Repas (JIRA)
Gergo Repas created YARN-7913:
-

 Summary: Improve error handling when application recovery fails 
with exception
 Key: YARN-7913
 URL: https://issues.apache.org/jira/browse/YARN-7913
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Affects Versions: 3.0.0
Reporter: Gergo Repas
Assignee: Gergo Repas


There are edge cases when the application recovery fails with an exception.

Example failure scenario:
 * setup: a queue is a leaf queue in the primary RM's config and the same queue 
is a parent queue in the secondary RM's config.
 * When failover happens with this setup, the recovery will fail for 
applications on this queue, and an APP_REJECTED event will be dispatched to the 
async dispatcher. On the same thread (that handles the recovery), a 
NullPointerException is thrown when an attempt is made to recover the 
applicationAttempt 
(https://github.com/apache/hadoop/blob/55066cc53dc22b68f9ca55a0029741d6c846be0a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L494).
 I don't see a good way to avoid the NPE in this scenario, because when the NPE 
occurs the APP_REJECTED event has not been processed yet, and we don't know that 
the application recovery failed.

Currently the first exception will abort the recovery, and if there are X 
applications, there will be ~X passive -> active RM transition attempts - the 
passive -> active RM transition will only succeed when the last APP_REJECTED 
event is processed on the async dispatcher thread.

_The point of this ticket is to improve the error handling and reduce the 
number of passive -> active RM transition attempts (solving the above described 
failure scenario isn't in scope)._



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org


