[jira] [Commented] (YARN-5103) With NM recovery enabled, restarting NM multiple times results in AM restart

2016-05-23 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-5103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15296484#comment-15296484 ] Jason Lowe commented on YARN-5103: -- +1 latest patch lgtm. I'll fix the checkstyle indentation nit as part

[jira] [Commented] (YARN-5103) With NM recovery enabled, restarting NM multiple times results in AM restart

2016-05-20 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-5103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15294140#comment-15294140 ] Jason Lowe commented on YARN-5103: -- Thanks for the patch! I'm OK skipping the unit test for this case.

[jira] [Commented] (YARN-4882) Change the log level to DEBUG for recovering completed applications

2016-05-19 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15291155#comment-15291155 ] Jason Lowe commented on YARN-4882: -- bq. let us start with debug logging success cases. If we ever run into

[jira] [Commented] (YARN-1902) Allocation of too many containers when a second request is done with the same resource capability

2016-05-19 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15291103#comment-15291103 ] Jason Lowe commented on YARN-1902: -- Sorry for jumping in late, but I'd like to keep moving this forward.

[jira] [Commented] (YARN-5098) Yarn Application log Aggreagation fails due to NM can not get correct HDFS delegation token

2016-05-17 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-5098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15286696#comment-15286696 ] Jason Lowe commented on YARN-5098: -- The original description of this JIRA showed that the HDFS token

[jira] [Commented] (YARN-5041) application master log can not be available when clicking jobhistory's am logs link

2016-05-16 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-5041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284985#comment-15284985 ] Jason Lowe commented on YARN-5041: -- +1 lgtm. Will commit this tomorrow if there are no objections. >

[jira] [Updated] (YARN-4325) Nodemanager log handlers fail to send finished/failed events in some cases

2016-05-16 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-4325: - Summary: Nodemanager log handlers fail to send finished/failed events in some cases (was: Purge app state

[jira] [Commented] (YARN-4325) Purge app state from NM state-store should cover more LOG_HANDLING cases

2016-05-16 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284734#comment-15284734 ] Jason Lowe commented on YARN-4325: -- +1 lgtm. Committing this. > Purge app state from NM state-store

[jira] [Commented] (YARN-4325) Purge app state from NM state-store should cover more LOG_HANDLING cases

2016-05-13 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15282726#comment-15282726 ] Jason Lowe commented on YARN-4325: -- Appears Jenkins is having difficulty posting to JIRA. Overall was +1

[jira] [Updated] (YARN-5041) application master log can not be available when clicking jobhistory's am logs link

2016-05-12 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-5041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-5041: - Target Version/s: 2.9.0 Fix Version/s: (was: 3.0.0) It would be a significant regression if

[jira] [Commented] (YARN-5053) More informative diagnostics when applications killed by a user

2016-05-11 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-5053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15280835#comment-15280835 ] Jason Lowe commented on YARN-5053: -- +1 lgtm. Will commit this tomorrow if there are no objections. >

[jira] [Commented] (YARN-4325) Purge app state from NM state-store should cover more LOG_HANDLING cases

2016-05-11 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15280731#comment-15280731 ] Jason Lowe commented on YARN-4325: -- Thanks, Junping! The test failure is related. In addition to the

[jira] [Commented] (YARN-4882) Change the log level to DEBUG for recovering completed applications

2016-05-11 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15280214#comment-15280214 ] Jason Lowe commented on YARN-4882: -- The main motivation of proposing a separate logger is to allow finer

[jira] [Commented] (YARN-4325) Purge app state from NM state-store should cover more LOG_HANDLING cases

2016-05-11 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15280176#comment-15280176 ] Jason Lowe commented on YARN-4325: -- Yes, what I'm proposing is to have the log handlers always respond to

[jira] [Comment Edited] (YARN-4325) Purge app state from NM state-store should cover more LOG_HANDLING cases

2016-05-10 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15278751#comment-15278751 ] Jason Lowe edited comment on YARN-4325 at 5/10/16 7:51 PM: --- I'm just thinking the

[jira] [Commented] (YARN-4325) Purge app state from NM state-store should cover more LOG_HANDLING cases

2016-05-10 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15278751#comment-15278751 ] Jason Lowe commented on YARN-4325: -- I'm just thinking the explicit boolean check and special-case logic is

[jira] [Commented] (YARN-4747) AHS error 500 due to NPE when container start event is missing

2016-05-06 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15274873#comment-15274873 ] Jason Lowe commented on YARN-4747: -- +1 lgtm. Committing this. > AHS error 500 due to NPE when container

[jira] [Created] (YARN-5055) max per user can be larger than max per queue

2016-05-06 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-5055: Summary: max per user can be larger than max per queue Key: YARN-5055 URL: https://issues.apache.org/jira/browse/YARN-5055 Project: Hadoop YARN Issue Type: Bug

[jira] [Created] (YARN-5053) More informative diagnostics when applications killed by a user

2016-05-06 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-5053: Summary: More informative diagnostics when applications killed by a user Key: YARN-5053 URL: https://issues.apache.org/jira/browse/YARN-5053 Project: Hadoop YARN

[jira] [Commented] (YARN-4325) Purge app state from NM state-store should cover more LOG_HANDLING cases

2016-05-06 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15274198#comment-15274198 ] Jason Lowe commented on YARN-4325: -- Thanks for the patch! For AppCompletelyDoneTransition it seems a

[jira] [Commented] (YARN-5039) Applications ACCEPTED but not starting

2016-05-06 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-5039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273984#comment-15273984 ] Jason Lowe commented on YARN-5039: -- bq. scheduler will not assign containers to decommissioning nodes,

[jira] [Commented] (YARN-5039) Applications ACCEPTED but not starting

2016-05-05 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-5039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15273240#comment-15273240 ] Jason Lowe commented on YARN-5039: -- Can you also double-check that the startup messages in the RM log show

[jira] [Commented] (YARN-4311) Removing nodes from include and exclude lists will not remove them from decommissioned nodes list

2016-05-05 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15272341#comment-15272341 ] Jason Lowe commented on YARN-4311: -- +1 lgtm. Committing this. > Removing nodes from include and exclude

[jira] [Commented] (YARN-5039) Applications ACCEPTED but not starting

2016-05-05 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-5039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15272333#comment-15272333 ] Jason Lowe commented on YARN-5039: -- Yes, if the problem is indeed the same as that reported in YARN-4610

[jira] [Commented] (YARN-5039) Applications ACCEPTED but not starting

2016-05-04 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-5039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15271537#comment-15271537 ] Jason Lowe commented on YARN-5039: -- OK, so I see one of the empty nodes coming in but it's not using it.

[jira] [Commented] (YARN-5039) Applications ACCEPTED but not starting

2016-05-04 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-5039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15271480#comment-15271480 ] Jason Lowe commented on YARN-5039: -- Apps are pending until they are activated. Apps can be pending

[jira] [Commented] (YARN-5039) Applications ACCEPTED but not starting

2016-05-04 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-5039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15271453#comment-15271453 ] Jason Lowe commented on YARN-5039: -- The screenshot also shows one app pending and one running -- I assume

[jira] [Commented] (YARN-4280) CapacityScheduler reservations may not prevent indefinite postponement on a busy cluster

2016-05-03 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15268908#comment-15268908 ] Jason Lowe commented on YARN-4280: -- The proposed algorithm does not change how reserved containers work --

[jira] [Commented] (YARN-4311) Removing nodes from include and exclude lists will not remove them from decommissioned nodes list

2016-05-02 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15267463#comment-15267463 ] Jason Lowe commented on YARN-4311: -- Sorry for the delay in getting back to this. I think the changes to

[jira] [Updated] (YARN-4834) ProcfsBasedProcessTree doesn't track daemonized processes

2016-05-02 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-4834: - Target Version/s: 2.7.4 +1 to using the session ID to track the processes for a container. Ideally if

[jira] [Commented] (YARN-4280) CapacityScheduler reservations may not prevent indefinite postponement on a busy cluster

2016-05-02 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15266711#comment-15266711 ] Jason Lowe commented on YARN-4280: -- Clarification: I meant to say "absolute max capacity" above whenever I

[jira] [Commented] (YARN-4280) CapacityScheduler reservations may not prevent indefinite postponement on a busy cluster

2016-05-02 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15266673#comment-15266673 ] Jason Lowe commented on YARN-4280: -- bq. The problem of allowing one container reserved exceed queue's max

[jira] [Commented] (YARN-5009) NMLeveldbStateStoreService database can grow substantially leading to longer recovery times

2016-04-29 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-5009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15264169#comment-15264169 ] Jason Lowe commented on YARN-5009: -- Thanks for the review and commit, Jian! Note that the commit to

[jira] [Commented] (YARN-4280) CapacityScheduler reservations may not prevent indefinite postponement on a busy cluster

2016-04-29 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15264079#comment-15264079 ] Jason Lowe commented on YARN-4280: -- I'm not thrilled with the idea of preemption to solve this issue.

[jira] [Updated] (YARN-5009) NMLeveldbStateStoreService database can grow substantially leading to longer recovery times

2016-04-28 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-5009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-5009: - Attachment: YARN-5009.002.patch Nice catch! Yes, it should be using a long to avoid overflow in case

[jira] [Created] (YARN-5010) maxActiveApplications and maxActiveApplicationsPerUser are missing from REST API

2016-04-28 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-5010: Summary: maxActiveApplications and maxActiveApplicationsPerUser are missing from REST API Key: YARN-5010 URL: https://issues.apache.org/jira/browse/YARN-5010 Project: Hadoop

[jira] [Updated] (YARN-5009) NMLeveldbStateStoreService database can grow substantially leading to longer recovery times

2016-04-28 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-5009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-5009: - Attachment: YARN-5009.001.patch Patch to add periodic manual compactions every hour by default.

[jira] [Created] (YARN-5009) NMLeveldbStateStoreService database can grow substantially leading to longer recovery times

2016-04-28 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-5009: Summary: NMLeveldbStateStoreService database can grow substantially leading to longer recovery times Key: YARN-5009 URL: https://issues.apache.org/jira/browse/YARN-5009

[jira] [Updated] (YARN-5008) LeveldbRMStateStore database can grow substantially leading to long recovery times

2016-04-28 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-5008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-5008: - Attachment: YARN-5008.001.patch I noticed that in the cases where the database was quite large a manual

[jira] [Created] (YARN-5008) LeveldbRMStateStore database can grow substantially leading to long recovery times

2016-04-28 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-5008: Summary: LeveldbRMStateStore database can grow substantially leading to long recovery times Key: YARN-5008 URL: https://issues.apache.org/jira/browse/YARN-5008 Project:

[jira] [Commented] (YARN-4940) yarn node -list -all failed if RM start with decommissioned node

2016-04-15 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15243500#comment-15243500 ] Jason Lowe commented on YARN-4940: -- +1 lgtm. The test failures appear to be unrelated. Committing this.

[jira] [Commented] (YARN-4924) NM recovery race can lead to container not cleaned up

2016-04-14 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15241810#comment-15241810 ] Jason Lowe commented on YARN-4924: -- Filed YARN-4960 for the LeveldbIterator constructor issue and

[jira] [Commented] (YARN-4961) Wrapper for leveldb DB to aid in handling database exceptions

2016-04-14 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15241807#comment-15241807 ] Jason Lowe commented on YARN-4961: -- This would be akin to what LeveldbIterator does for the leveldb

[jira] [Created] (YARN-4961) Wrapper for leveldb DB to aid in handling database exceptions

2016-04-14 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-4961: Summary: Wrapper for leveldb DB to aid in handling database exceptions Key: YARN-4961 URL: https://issues.apache.org/jira/browse/YARN-4961 Project: Hadoop YARN

[jira] [Created] (YARN-4960) Runtime DBException can escape LeveldbIterator constructor

2016-04-14 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-4960: Summary: Runtime DBException can escape LeveldbIterator constructor Key: YARN-4960 URL: https://issues.apache.org/jira/browse/YARN-4960 Project: Hadoop YARN Issue

[jira] [Commented] (YARN-4924) NM recovery race can lead to container not cleaned up

2016-04-14 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15241734#comment-15241734 ] Jason Lowe commented on YARN-4924: -- bq. leveldbIterator may also throws DBException, yes? Yes, if the

[jira] [Commented] (YARN-4924) NM recovery race can lead to container not cleaned up

2016-04-14 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15241431#comment-15241431 ] Jason Lowe commented on YARN-4924: -- Thanks for updating the patch! If createWriteBatch does ever throw

[jira] [Commented] (YARN-2567) Add a percentage-node threshold for RM to wait for new allocations after restart/failover

2016-04-12 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-2567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15237129#comment-15237129 ] Jason Lowe commented on YARN-2567: -- The problem with delaying or otherwise making the state store

[jira] [Commented] (YARN-4924) NM recovery race can lead to container not cleaned up

2016-04-11 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15236117#comment-15236117 ] Jason Lowe commented on YARN-4924: -- org.iq80.leveldb.DBException (the one we're interested in catching) is

[jira] [Commented] (YARN-4924) NM recovery race can lead to container not cleaned up

2016-04-11 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15236015#comment-15236015 ] Jason Lowe commented on YARN-4924: -- Thanks for updating the patch! cleanupKeysWithPrefix can now let the

[jira] [Commented] (YARN-4311) Removing nodes from include and exclude lists will not remove them from decommissioned nodes list

2016-04-11 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15235358#comment-15235358 ] Jason Lowe commented on YARN-4311: -- If we only remove truly untracked nodes then option 1 should be OK.

[jira] [Updated] (YARN-4311) Removing nodes from include and exclude lists will not remove them from decommissioned nodes list

2016-04-11 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-4311: - Target Version/s: 2.8.0, 2.7.4 Fix Version/s: (was: 2.8.0) I reverted this from trunk,

[jira] [Commented] (YARN-4311) Removing nodes from include and exclude lists will not remove them from decommissioned nodes list

2016-04-11 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15235331#comment-15235331 ] Jason Lowe commented on YARN-4311: -- No need for a followup JIRA, I'll revert the one in trunk until we

[jira] [Commented] (YARN-4924) NM recovery race can lead to container not cleaned up

2016-04-11 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15235212#comment-15235212 ] Jason Lowe commented on YARN-4924: -- Thanks for updating the patch! It may not be clear to others reading

[jira] [Commented] (YARN-4311) Removing nodes from include and exclude lists will not remove them from decommissioned nodes list

2016-04-11 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15235166#comment-15235166 ] Jason Lowe commented on YARN-4311: -- Thanks for updating the patch! Couple of comments: The patch has a

[jira] [Commented] (YARN-4924) NM recovery race can lead to container not cleaned up

2016-04-08 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15232430#comment-15232430 ] Jason Lowe commented on YARN-4924: -- Thanks for the patch! I don't think removeDeprecatedKeys is an

[jira] [Commented] (YARN-4924) NM recovery race can lead to container not cleaned up

2016-04-07 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15230248#comment-15230248 ] Jason Lowe commented on YARN-4924: -- Yeah, now that the NM registers with the list of apps it thinks are

[jira] [Created] (YARN-4930) Convenience command to perform a work-preserving stop of the nodemanager

2016-04-06 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-4930: Summary: Convenience command to perform a work-preserving stop of the nodemanager Key: YARN-4930 URL: https://issues.apache.org/jira/browse/YARN-4930 Project: Hadoop YARN

[jira] [Commented] (YARN-4930) Convenience command to perform a work-preserving stop of the nodemanager

2016-04-06 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15229060#comment-15229060 ] Jason Lowe commented on YARN-4930: -- Currently admins can accomplish the task via a kill -9 of the

[jira] [Commented] (YARN-4924) NM recovery race can lead to container not cleaned up

2016-04-06 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15228198#comment-15228198 ] Jason Lowe commented on YARN-4924: -- I agree with [~sandflee] that postponing the finish app event dispatch

[jira] [Updated] (YARN-4773) Log aggregation performs extraneous filesystem operations when rolling log aggregation is disabled

2016-04-05 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-4773: - Fix Version/s: 2.6.5 2.7.3 Apologies for the long delay. +1 for the branch-2.6 patch.

[jira] [Commented] (YARN-4311) Removing nodes from include and exclude lists will not remove them from decommissioned nodes list

2016-04-05 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15226259#comment-15226259 ] Jason Lowe commented on YARN-4311: -- +1 latest patch lgtm. Committing this. > Removing nodes from include

[jira] [Commented] (YARN-4311) Removing nodes from include and exclude lists will not remove them from decommissioned nodes list

2016-03-31 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220279#comment-15220279 ] Jason Lowe commented on YARN-4311: -- Thanks for updating the patch! Everything looks great except nodes

[jira] [Commented] (YARN-4882) Change the log level to DEBUG for recovering completed applications

2016-03-30 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15218340#comment-15218340 ] Jason Lowe commented on YARN-4882: -- The problem with saying we can up the log level to DEBUG is that

[jira] [Commented] (YARN-4773) Log aggregation performs extraneous filesystem operations when rolling log aggregation is disabled

2016-03-28 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15215046#comment-15215046 ] Jason Lowe commented on YARN-4773: -- +1 committing this. > Log aggregation performs extraneous filesystem

[jira] [Commented] (YARN-4814) ATS 1.5 timelineclient impl call flush after every event write

2016-03-25 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15212571#comment-15212571 ] Jason Lowe commented on YARN-4814: -- bq. May be I can revert that patch and commit with the right log

[jira] [Commented] (YARN-4311) Removing nodes from include and exclude lists will not remove them from decommissioned nodes list

2016-03-25 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15212400#comment-15212400 ] Jason Lowe commented on YARN-4311: -- Thanks for updating the patch! Sorry for the delay in getting back to

[jira] [Resolved] (YARN-4814) ATS 1.5 timelineclient impl call flush after every event write

2016-03-25 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe resolved YARN-4814. -- Resolution: Fixed Committed to trunk, branch-2, and branch-2.8. Thanks [~xgong]! > ATS 1.5

[jira] [Commented] (YARN-4814) ATS 1.5 timelineclient impl call flush after every event write

2016-03-25 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15212344#comment-15212344 ] Jason Lowe commented on YARN-4814: -- Despite the Hudson comment above, this was still needing a commit.

[jira] [Commented] (YARN-4773) Log aggregation performs extraneous filesystem operations when rolling log aggregation is disabled

2016-03-25 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15212314#comment-15212314 ] Jason Lowe commented on YARN-4773: -- Thanks for updating the patch! I see the Private annotation was

[jira] [Resolved] (YARN-4839) ResourceManager deadlock between RMAppAttemptImpl and SchedulerApplicationAttempt

2016-03-25 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe resolved YARN-4839. -- Resolution: Duplicate Resolving since the fix was incorporated into 2.8 as part of YARN-3361. >

[jira] [Commented] (YARN-4773) Log aggregation performs extraneous filesystem operations when rolling log aggregation is disabled

2016-03-24 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210763#comment-15210763 ] Jason Lowe commented on YARN-4773: -- Patch looks good overall, just a few nits: - getCleanupOldLogTimes

[jira] [Created] (YARN-4839) ResourceManager deadlock between RMAppAttemptImpl and SchedulerApplicationAttempt

2016-03-20 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-4839: Summary: ResourceManager deadlock between RMAppAttemptImpl and SchedulerApplicationAttempt Key: YARN-4839 URL: https://issues.apache.org/jira/browse/YARN-4839 Project:

[jira] [Commented] (YARN-4839) ResourceManager deadlock between RMAppAttemptImpl and SchedulerApplicationAttempt

2016-03-19 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15201687#comment-15201687 ] Jason Lowe commented on YARN-4839: -- bq. Could this be the same issue as pointed out by YARN-4247? It is

[jira] [Commented] (YARN-4686) MiniYARNCluster.start() returns before cluster is completely started

2016-03-19 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15200203#comment-15200203 ] Jason Lowe commented on YARN-4686: -- bq. Still interested in if Jason Lowe or Karthik Kambatla have

[jira] [Commented] (YARN-4839) ResourceManager deadlock between RMAppAttemptImpl and SchedulerApplicationAttempt

2016-03-19 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15201578#comment-15201578 ] Jason Lowe commented on YARN-4839: -- Stack trace of the relevant threads: {noformat} "IPC Server handler 32

[jira] [Commented] (YARN-4839) ResourceManager deadlock between RMAppAttemptImpl and SchedulerApplicationAttempt

2016-03-19 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15201680#comment-15201680 ] Jason Lowe commented on YARN-4839: -- This appears to have been fixed as a side-effect of YARN-3361 which

[jira] [Resolved] (YARN-4783) Log aggregation failure for application when Nodemanager is restarted

2016-03-19 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe resolved YARN-4783. -- Resolution: Won't Fix Resolving as Won't Fix per the above discussion since we don't want to keep an

[jira] [Commented] (YARN-4814) ATS 1.5 timelineclient impl call flush after every event write

2016-03-18 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15197348#comment-15197348 ] Jason Lowe commented on YARN-4814: -- +1 lgtm, holding off on committing to allow others to comment. I

[jira] [Commented] (YARN-4818) AggregatedLogFormat.LogValue.write() incorrectly truncates files

2016-03-15 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15196297#comment-15196297 ] Jason Lowe commented on YARN-4818: -- Is there a recent change in 2.8 that broke this? I haven't seen it in

[jira] [Commented] (YARN-4814) ATS 1.5 timelineclient impl call flush after every event write

2016-03-15 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15196259#comment-15196259 ] Jason Lowe commented on YARN-4814: -- Sorry I missed this before it went in. Don't we have an issue where

[jira] [Commented] (YARN-4789) Provide helpful exception for non-PATH-like conflict with admin.user.env

2016-03-11 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15191371#comment-15191371 ] Jason Lowe commented on YARN-4789: -- This looks related to MAPREDUCE-6491. > Provide helpful exception for

[jira] [Commented] (YARN-4783) Log aggregation failure for application when Nodemanager is restarted

2016-03-10 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15189434#comment-15189434 ] Jason Lowe commented on YARN-4783: -- Thanks for posting the details from the logs! The problem is as I

[jira] [Assigned] (YARN-4773) Log aggregation performs extraneous filesystem operations when rolling log aggregation is disabled

2016-03-09 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe reassigned YARN-4773: Assignee: Jun Gong Feel free, as I'm currently busy with other tasks. I filed it and left it

[jira] [Commented] (YARN-4773) Log aggregation performs extraneous filesystem operations when rolling log aggregation is disabled

2016-03-09 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15187270#comment-15187270 ] Jason Lowe commented on YARN-4773: -- Yes, [~hex108] that is the scenario. > Log aggregation performs

[jira] [Commented] (YARN-4783) Log aggregation failure for application when Nodemanager is restarted

2016-03-09 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15187137#comment-15187137 ] Jason Lowe commented on YARN-4783: -- From the exception it appears the HDFS token is being cancelled

[jira] [Commented] (YARN-4773) Log aggregation performs extraneous filesystem operations when rolling log aggregation is disabled

2016-03-08 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15185757#comment-15185757 ] Jason Lowe commented on YARN-4773: -- It is not fixed by YARN-4720 since this is a listStatus call not a

[jira] [Updated] (YARN-4771) Some containers can be skipped during log aggregation after NM restart

2016-03-08 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-4771: - Attachment: YARN-4771.002.patch Updated patch to fix the unit test in case it's useful. > Some containers

[jira] [Created] (YARN-4775) Nodemanager checks for old aggregated logs as the NM user instead of the app user

2016-03-08 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-4775: Summary: Nodemanager checks for old aggregated logs as the NM user instead of the app user Key: YARN-4775 URL: https://issues.apache.org/jira/browse/YARN-4775 Project:

[jira] [Updated] (YARN-4773) Log aggregation performs extraneous filesystem operations when rolling log aggregation is disabled

2016-03-08 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-4773: - Affects Version/s: (was: 2.7.2) 2.6.0 Target Version/s: (was: 2.7.3) >

[jira] [Created] (YARN-4773) Log aggregation performs extraneous filesystem operations when rolling log aggregation is disabled

2016-03-08 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-4773: Summary: Log aggregation performs extraneous filesystem operations when rolling log aggregation is disabled Key: YARN-4773 URL: https://issues.apache.org/jira/browse/YARN-4773

[jira] [Updated] (YARN-4771) Some containers can be skipped during log aggregation after NM restart

2016-03-08 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-4771: - Attachment: YARN-4771.001.patch Simple patch which preserves the container as long as the application is

[jira] [Commented] (YARN-4771) Some containers can be skipped during log aggregation after NM restart

2016-03-08 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184991#comment-15184991 ] Jason Lowe commented on YARN-4771: -- The problem occurs because removeVeryOldStoppedContainersFromCache

[jira] [Created] (YARN-4771) Some containers can be skipped during log aggregation after NM restart

2016-03-08 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-4771: Summary: Some containers can be skipped during log aggregation after NM restart Key: YARN-4771 URL: https://issues.apache.org/jira/browse/YARN-4771 Project: Hadoop YARN

[jira] [Commented] (YARN-4760) proxy redirect to history server uses wrong URL

2016-03-07 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15183169#comment-15183169 ] Jason Lowe commented on YARN-4760: -- +1 lgtm. Committing this. > proxy redirect to history server uses

[jira] [Created] (YARN-4760) proxy redirect to history server uses wrong URL

2016-03-03 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-4760: Summary: proxy redirect to history server uses wrong URL Key: YARN-4760 URL: https://issues.apache.org/jira/browse/YARN-4760 Project: Hadoop YARN Issue Type: Bug

[jira] [Commented] (YARN-4744) Too many signal to container failure in case of LCE

2016-03-03 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178722#comment-15178722 ] Jason Lowe commented on YARN-4744: -- Thanks for updating the patch! +1, pending Jenkins. > Too many

[jira] [Commented] (YARN-4744) Too many signal to container failure in case of LCE

2016-03-03 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178661#comment-15178661 ] Jason Lowe commented on YARN-4744: -- Ah, ignore my previous comment -- I see now that we don't have the

[jira] [Commented] (YARN-4744) Too many signal to container failure in case of LCE

2016-03-03 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178658#comment-15178658 ] Jason Lowe commented on YARN-4744: -- Even if the Docker stuff doesn't work totally, it has the same logic

[jira] [Commented] (YARN-4744) Too many signal to container failure in case of LCE

2016-03-03 Thread Jason Lowe (JIRA)
[ https://issues.apache.org/jira/browse/YARN-4744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15177872#comment-15177872 ] Jason Lowe commented on YARN-4744: -- Thanks for the patch! bq. In addition, logging in
