[jira] [Commented] (YARN-5951) Changes to allow CapacityScheduler to use configuration store
[ https://issues.apache.org/jira/browse/YARN-5951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16110501#comment-16110501 ]

Jiandan Yang commented on YARN-5951:
------------------------------------

[~jhung] [~leftnoteasy] I found two problems with this patch:

1. MutableCSConfigurationProvider#recoverConf iterates over pendingMutations, and confirmMutation calls removeFirst on the same list during the iteration, which will lead to a ConcurrentModificationException:

{code:java}
List<LogMutation> uncommittedLogs = confStore.getPendingMutations();
Configuration oldConf = new Configuration(schedConf);
for (LogMutation mutation : uncommittedLogs) {
  ...
  confStore.confirmMutation(mutation.getId(), true);
  ...
}
{code}

2. LeveldbConfigurationStore#initialize should update txnId after pendingMutations.add:

{code:java}
while (itr.hasNext()) {
  Map.Entry<byte[], byte[]> entry = itr.next();
  if (!new String(entry.getKey(), StandardCharsets.UTF_8)
      .startsWith(LOG_PREFIX)) {
    break;
  }
  pendingMutations.add(deserLogMutation(entry.getValue()));
  txnId = deserLogMutation(entry.getValue()).getId(); // update txnId
}
{code}

> Changes to allow CapacityScheduler to use configuration store
> --------------------------------------------------------------
>
> Key: YARN-5951
> URL: https://issues.apache.org/jira/browse/YARN-5951
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Jonathan Hung
> Assignee: Jonathan Hung
> Fix For: YARN-5734
>
> Attachments: YARN-5951-YARN-5734.001.patch, YARN-5951-YARN-5734.002.patch,
> YARN-5951-YARN-5734.003.patch, YARN-5951-YARN-5734.004.patch
>
> EDIT: changing this ticket. Found that the CapacityStoreConfigurationProvider
> is not necessary, since we can just grab a Configuration object from
> StoreConfigurationProvider with type "SCHEDULER" and create a
> CapacitySchedulerConfiguration from it.
> This ticket will track changes needed for integrating other components to be
> used by the capacity scheduler.
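A minimal sketch of one possible fix for the first problem, reusing the names from the snippet above and assuming getPendingMutations() returns the live list that confirmMutation() mutates: iterate over a defensive copy so removeFirst() cannot invalidate the iterator.

{code:java}
import java.util.ArrayList;
import java.util.List;

// Sketch only, not the committed fix: snapshot the pending mutations before
// iterating, so confirmMutation() may freely call removeFirst() on the
// underlying list without triggering a ConcurrentModificationException.
List<LogMutation> uncommittedLogs =
    new ArrayList<>(confStore.getPendingMutations());
Configuration oldConf = new Configuration(schedConf);
for (LogMutation mutation : uncommittedLogs) {
  // ... replay the logged mutation onto schedConf ...
  confStore.confirmMutation(mutation.getId(), true);
}
{code}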
[jira] [Commented] (YARN-5951) Changes to allow CapacityScheduler to use configuration store
[ https://issues.apache.org/jira/browse/YARN-5951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1611#comment-1611 ]

Jiandan Yang commented on YARN-5951:
------------------------------------

[~jhung] Sorry, my mistake, it's [YARN-5947|https://issues.apache.org/jira/browse/YARN-5947].

> Changes to allow CapacityScheduler to use configuration store
> --------------------------------------------------------------
>
> Key: YARN-5951
> URL: https://issues.apache.org/jira/browse/YARN-5951
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Jonathan Hung
> Assignee: Jonathan Hung
> Fix For: YARN-5734
>
> Attachments: YARN-5951-YARN-5734.001.patch, YARN-5951-YARN-5734.002.patch,
> YARN-5951-YARN-5734.003.patch, YARN-5951-YARN-5734.004.patch
>
> EDIT: changing this ticket. Found that the CapacityStoreConfigurationProvider
> is not necessary, since we can just grab a Configuration object from
> StoreConfigurationProvider with type "SCHEDULER" and create a
> CapacitySchedulerConfiguration from it.
> This ticket will track changes needed for integrating other components to be
> used by the capacity scheduler.
[jira] [Updated] (YARN-7168) The size of dataQueue and ackQueue in DataStreamer has no limit when writer thread is interrupted
[ https://issues.apache.org/jira/browse/YARN-7168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jiandan Yang updated YARN-7168:
-------------------------------
    Description: 
In our cluster, we found that the NodeManager did frequent full GCs while being decommissioned, and the biggest object was the dataQueue of DataStreamer: it held almost 60,000 DFSPackets, each about 64 KB, as shown below.
The root reason is that the size of dataQueue and ackQueue in DataStreamer has no limit when the writer thread is interrupted. I know the NodeManager may stop writing when interrupted, but DFSOutputStream could also do something to avoid the full GCs.
!mat.jpg|memory_analysis!

  was:
In our cluster, we found that the NodeManager did frequent full GCs while being decommissioned, and the biggest object was the dataQueue of DataStreamer: it held almost 60,000 DFSPackets, each about 64 KB.
!mat.jpg|memory_analysis!
The root reason is that the size of dataQueue and ackQueue in DataStreamer has no limit when the writer thread is interrupted. I know the NodeManager may stop writing when interrupted, but DFSOutputStream could also do something to avoid the full GCs.

> The size of dataQueue and ackQueue in DataStreamer has no limit when writer
> thread is interrupted
> ------------------------------------------------------------------------------
>
> Key: YARN-7168
> URL: https://issues.apache.org/jira/browse/YARN-7168
> Project: Hadoop YARN
> Issue Type: Bug
> Components: client
> Reporter: Jiandan Yang
> Attachments: mat.jpg
>
> In our cluster, we found that the NodeManager did frequent full GCs while
> being decommissioned, and the biggest object was the dataQueue of
> DataStreamer: it held almost 60,000 DFSPackets, each about 64 KB, as shown
> below.
> The root reason is that the size of dataQueue and ackQueue in DataStreamer
> has no limit when the writer thread is interrupted. I know the NodeManager
> may stop writing when interrupted, but DFSOutputStream could also do
> something to avoid the full GCs.
> !mat.jpg|memory_analysis!
[jira] [Updated] (YARN-7168) The size of dataQueue and ackQueue in DataStreamer has no limit when writer thread is interrupted
[ https://issues.apache.org/jira/browse/YARN-7168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jiandan Yang updated YARN-7168:
-------------------------------
    Description: 
In our cluster, we found that the NodeManager did frequent full GCs while being decommissioned, and the biggest object was the dataQueue of DataStreamer: it held almost 60,000 DFSPackets, each about 64 KB, as shown below.
The root reason is that the size of dataQueue and ackQueue in DataStreamer has no limit when the writer thread is interrupted. DFSOutputStream#waitAndQueuePacket does not wait when the writer thread is interrupted. I know the NodeManager may stop writing when interrupted, but DFSOutputStream could also do something to avoid the full GCs.

{code:java}
while (!streamerClosed && dataQueue.size() + ackQueue.size()
    > dfsClient.getConf().getWriteMaxPackets()) {
  if (firstWait) {
    Span span = Tracer.getCurrentSpan();
    if (span != null) {
      span.addTimelineAnnotation("dataQueue.wait");
    }
    firstWait = false;
  }
  try {
    dataQueue.wait();
  } catch (InterruptedException e) {
    // If we get interrupted while waiting to queue data, we still need to
    // get rid of the current packet. This is because we have an invariant
    // that if currentPacket gets full, it will get queued before the next
    // writeChunk.
    //
    // Rather than wait around for space in the queue, we should instead
    // try to return to the caller as soon as possible, even though we
    // slightly overrun the MAX_PACKETS length.
    Thread.currentThread().interrupt();
    break;
  }
}
} finally { // the opening try of this finally lies above the excerpt
  Span span = Tracer.getCurrentSpan();
  if ((span != null) && (!firstWait)) {
    span.addTimelineAnnotation("end.wait");
  }
}
{code}

!mat.jpg|memory_analysis!

  was:
In our cluster, we found that the NodeManager did frequent full GCs while being decommissioned, and the biggest object was the dataQueue of DataStreamer: it held almost 60,000 DFSPackets, each about 64 KB, as shown below.
The root reason is that the size of dataQueue and ackQueue in DataStreamer has no limit when the writer thread is interrupted. DFSOutputStream#waitAndQueuePacket does not wait when the writer thread is interrupted.

{code:java}
while (!streamerClosed && dataQueue.size() + ackQueue.size()
    > dfsClient.getConf().getWriteMaxPackets()) {
  if (firstWait) {
    Span span = Tracer.getCurrentSpan();
    if (span != null) {
      span.addTimelineAnnotation("dataQueue.wait");
    }
    firstWait = false;
  }
  try {
    dataQueue.wait();
  } catch (InterruptedException e) {
    // If we get interrupted while waiting to queue data, we still need to
    // get rid of the current packet. This is because we have an invariant
    // that if currentPacket gets full, it will get queued before the next
    // writeChunk.
    //
    // Rather than wait around for space in the queue, we should instead
    // try to return to the caller as soon as possible, even though we
    // slightly overrun the MAX_PACKETS length.
    Thread.currentThread().interrupt();
    break;
  }
}
} finally { // the opening try of this finally lies above the excerpt
  Span span = Tracer.getCurrentSpan();
  if ((span != null) && (!firstWait)) {
    span.addTimelineAnnotation("end.wait");
  }
}
{code}

I know the NodeManager may stop writing when interrupted, but DFSOutputStream could also do something to avoid the full GCs.
!mat.jpg|memory_analysis!

> The size of dataQueue and ackQueue in DataStreamer has no limit when writer
> thread is interrupted
> ------------------------------------------------------------------------------
>
> Key: YARN-7168
> URL: https://issues.apache.org/jira/browse/YARN-7168
> Project: Hadoop YARN
> Issue Type: Bug
> Components: client
> Reporter: Jiandan Yang
> Attachments: mat.jpg
>
> In our cluster, we found that the NodeManager did frequent full GCs while
> being decommissioned, and the biggest object was the dataQueue of
> DataStreamer: it held almost 60,000 DFSPackets, each about 64 KB, as shown
> below.
> The root reason is that the size of dataQueue and ackQueue in DataStreamer
> has no limit when the writer thread is interrupted.
> DFSOutputStream#waitAndQueuePacket does not wait when the writer thread is
> interrupted. I know the NodeManager may stop writing when interrupted, but
> DFSOutputStream could also do something to avoid the full GCs.
>
[jira] [Updated] (YARN-7168) The size of dataQueue and ackQueue in DataStreamer has no limit when writer thread is interrupted
[ https://issues.apache.org/jira/browse/YARN-7168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jiandan Yang updated YARN-7168:
-------------------------------
    Description: 
In our cluster, we found that the NodeManager did frequent full GCs while being decommissioned, and the biggest object was the dataQueue of DataStreamer: it held almost 60,000 DFSPackets, each about 64 KB, as shown below.
The root reason is that the size of dataQueue and ackQueue in DataStreamer has no limit when the writer thread is interrupted. DFSOutputStream#waitAndQueuePacket does not wait when the writer thread is interrupted.

{code:java}
while (!streamerClosed && dataQueue.size() + ackQueue.size()
    > dfsClient.getConf().getWriteMaxPackets()) {
  if (firstWait) {
    Span span = Tracer.getCurrentSpan();
    if (span != null) {
      span.addTimelineAnnotation("dataQueue.wait");
    }
    firstWait = false;
  }
  try {
    dataQueue.wait();
  } catch (InterruptedException e) {
    // If we get interrupted while waiting to queue data, we still need to
    // get rid of the current packet. This is because we have an invariant
    // that if currentPacket gets full, it will get queued before the next
    // writeChunk.
    //
    // Rather than wait around for space in the queue, we should instead
    // try to return to the caller as soon as possible, even though we
    // slightly overrun the MAX_PACKETS length.
    Thread.currentThread().interrupt();
    break;
  }
}
} finally { // the opening try of this finally lies above the excerpt
  Span span = Tracer.getCurrentSpan();
  if ((span != null) && (!firstWait)) {
    span.addTimelineAnnotation("end.wait");
  }
}
{code}

I know the NodeManager may stop writing when interrupted, but DFSOutputStream could also do something to avoid the full GCs.
!mat.jpg|memory_analysis!

  was:
In our cluster, we found that the NodeManager did frequent full GCs while being decommissioned, and the biggest object was the dataQueue of DataStreamer: it held almost 60,000 DFSPackets, each about 64 KB, as shown below.
The root reason is that the size of dataQueue and ackQueue in DataStreamer has no limit when the writer thread is interrupted. I know the NodeManager may stop writing when interrupted, but DFSOutputStream could also do something to avoid the full GCs.
!mat.jpg|memory_analysis!

> The size of dataQueue and ackQueue in DataStreamer has no limit when writer
> thread is interrupted
> ------------------------------------------------------------------------------
>
> Key: YARN-7168
> URL: https://issues.apache.org/jira/browse/YARN-7168
> Project: Hadoop YARN
> Issue Type: Bug
> Components: client
> Reporter: Jiandan Yang
> Attachments: mat.jpg
>
> In our cluster, we found that the NodeManager did frequent full GCs while
> being decommissioned, and the biggest object was the dataQueue of
> DataStreamer: it held almost 60,000 DFSPackets, each about 64 KB, as shown
> below.
> The root reason is that the size of dataQueue and ackQueue in DataStreamer
> has no limit when the writer thread is interrupted.
> DFSOutputStream#waitAndQueuePacket does not wait when the writer thread is
> interrupted.
> {code:java}
> while (!streamerClosed && dataQueue.size() + ackQueue.size()
>     > dfsClient.getConf().getWriteMaxPackets()) {
>   if (firstWait) {
>     Span span = Tracer.getCurrentSpan();
>     if (span != null) {
>       span.addTimelineAnnotation("dataQueue.wait");
>     }
>     firstWait = false;
>   }
>   try {
>     dataQueue.wait();
>   } catch (InterruptedException e) {
>     // If we get interrupted while waiting to queue data, we still need to
>     // get rid of the current packet. This is because we have an invariant
>     // that if currentPacket gets full, it will get queued before the next
>     // writeChunk.
>     //
>     // Rather than wait around for space in the queue, we should instead
>     // try to return to the caller as soon as possible, even though we
>     // slightly overrun the MAX_PACKETS length.
>     Thread.currentThread().interrupt();
>     break;
>   }
> }
> } finally { // the opening try of this finally lies above the excerpt
>   Span span = Tracer.getCurrentSpan();
>   if ((span != null) && (!firstWait)) {
>     span.addTimelineAnnotation("end.wait");
>   }
> }
> {code}
> I know the NodeManager may stop writing when interrupted, but DFSOutputStream
> could also do something to avoid the full GCs.
>
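To illustrate the reporter's point, here is a minimal sketch of one possible mitigation (not the actual Hadoop code, and assuming the caller holds the dataQueue monitor, as waitAndQueuePacket does): keep enforcing the queue bound across an interrupt and merely defer the interrupt instead of breaking out.

{code:java}
// Sketch under stated assumptions: keep waiting until the queues drop below
// the configured limit, even if interrupted; restore the interrupt status
// afterwards so the caller still observes it.
boolean interrupted = false;
try {
  while (!streamerClosed && dataQueue.size() + ackQueue.size()
      > dfsClient.getConf().getWriteMaxPackets()) {
    try {
      dataQueue.wait(1000); // bounded wait, then re-check the predicate
    } catch (InterruptedException e) {
      interrupted = true;   // remember the interrupt, keep enforcing the cap
    }
  }
} finally {
  if (interrupted) {
    Thread.currentThread().interrupt();
  }
}
{code}

Whether blocking an interrupted writer is acceptable is exactly the trade-off the comment in the Hadoop source describes, so this sketch shows only one side of that trade-off.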
[jira] [Commented] (YARN-7168) The size of dataQueue and ackQueue in DataStreamer has no limit when writer thread is interrupted
[ https://issues.apache.org/jira/browse/YARN-7168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16156350#comment-16156350 ]

Jiandan Yang commented on YARN-7168:
------------------------------------

Sorry, I should have created this issue in Hadoop HDFS. Can anyone help me move it to the Hadoop HDFS project?

> The size of dataQueue and ackQueue in DataStreamer has no limit when writer
> thread is interrupted
> ------------------------------------------------------------------------------
>
> Key: YARN-7168
> URL: https://issues.apache.org/jira/browse/YARN-7168
> Project: Hadoop YARN
> Issue Type: Bug
> Components: client
> Reporter: Jiandan Yang
> Attachments: mat.jpg
>
> In our cluster, we found that the NodeManager did frequent full GCs while
> being decommissioned, and the biggest object was the dataQueue of
> DataStreamer: it held almost 60,000 DFSPackets, each about 64 KB, as shown
> below.
> The root reason is that the size of dataQueue and ackQueue in DataStreamer
> has no limit when the writer thread is interrupted.
> DFSOutputStream#waitAndQueuePacket does not wait when the writer thread is
> interrupted. I know the NodeManager may stop writing when interrupted, but
> DFSOutputStream could also do something to avoid infinite growth of the
> dataQueue.
> {code:java}
> while (!streamerClosed && dataQueue.size() + ackQueue.size()
>     > dfsClient.getConf().getWriteMaxPackets()) {
>   if (firstWait) {
>     Span span = Tracer.getCurrentSpan();
>     if (span != null) {
>       span.addTimelineAnnotation("dataQueue.wait");
>     }
>     firstWait = false;
>   }
>   try {
>     dataQueue.wait();
>   } catch (InterruptedException e) {
>     // If we get interrupted while waiting to queue data, we still need to
>     // get rid of the current packet. This is because we have an invariant
>     // that if currentPacket gets full, it will get queued before the next
>     // writeChunk.
>     //
>     // Rather than wait around for space in the queue, we should instead
>     // try to return to the caller as soon as possible, even though we
>     // slightly overrun the MAX_PACKETS length.
>     Thread.currentThread().interrupt();
>     break;
>   }
> }
> } finally { // the opening try of this finally lies above the excerpt
>   Span span = Tracer.getCurrentSpan();
>   if ((span != null) && (!firstWait)) {
>     span.addTimelineAnnotation("end.wait");
>   }
> }
> {code}
> !mat.jpg|memory_analysis!
[jira] [Updated] (YARN-7168) The size of dataQueue and ackQueue in DataStreamer has no limit when writer thread is interrupted
[ https://issues.apache.org/jira/browse/YARN-7168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jiandan Yang updated YARN-7168:
-------------------------------
    Attachment: mat.jpg

> The size of dataQueue and ackQueue in DataStreamer has no limit when writer
> thread is interrupted
> ------------------------------------------------------------------------------
>
> Key: YARN-7168
> URL: https://issues.apache.org/jira/browse/YARN-7168
> Project: Hadoop YARN
> Issue Type: Bug
> Components: client
> Reporter: Jiandan Yang
> Attachments: mat.jpg
>
> In our cluster, we found that the NodeManager did frequent full GCs while
> being decommissioned, and the biggest object was the dataQueue of
> DataStreamer: it held almost 60,000 DFSPackets, each about 64 KB.
> !mat.jpg|memory_analysis!
> The root reason is that the size of dataQueue and ackQueue in DataStreamer
> has no limit when the writer thread is interrupted. I know the NodeManager
> may stop writing when interrupted, but DFSOutputStream could also do
> something to avoid the full GCs.
[jira] [Created] (YARN-7168) The size of dataQueue and ackQueue in DataStreamer has no limit when writer thread is interrupted
Jiandan Yang created YARN-7168:
-----------------------------------

Summary: The size of dataQueue and ackQueue in DataStreamer has no limit when writer thread is interrupted
Key: YARN-7168
URL: https://issues.apache.org/jira/browse/YARN-7168
Project: Hadoop YARN
Issue Type: Bug
Components: client
Reporter: Jiandan Yang

In our cluster, we found that the NodeManager did frequent full GCs while being decommissioned, and the biggest object was the dataQueue of DataStreamer: it held almost 60,000 DFSPackets, each about 64 KB.
!mat.jpg|memory_analysis!
The root reason is that the size of dataQueue and ackQueue in DataStreamer has no limit when the writer thread is interrupted. I know the NodeManager may stop writing when interrupted, but DFSOutputStream could also do something to avoid the full GCs.
[jira] [Updated] (YARN-7168) The size of dataQueue and ackQueue in DataStreamer has no limit when writer thread is interrupted
[ https://issues.apache.org/jira/browse/YARN-7168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jiandan Yang updated YARN-7168:
-------------------------------
    Description: 
In our cluster, we found that the NodeManager did frequent full GCs while being decommissioned, and the biggest object was the dataQueue of DataStreamer: it held almost 60,000 DFSPackets, each about 64 KB, as shown below.
The root reason is that the size of dataQueue and ackQueue in DataStreamer has no limit when the writer thread is interrupted. DFSOutputStream#waitAndQueuePacket does not wait when the writer thread is interrupted. I know the NodeManager may stop writing when interrupted, but DFSOutputStream could also do something to avoid infinite growth of the dataQueue.

{code:java}
while (!streamerClosed && dataQueue.size() + ackQueue.size()
    > dfsClient.getConf().getWriteMaxPackets()) {
  if (firstWait) {
    Span span = Tracer.getCurrentSpan();
    if (span != null) {
      span.addTimelineAnnotation("dataQueue.wait");
    }
    firstWait = false;
  }
  try {
    dataQueue.wait();
  } catch (InterruptedException e) {
    // If we get interrupted while waiting to queue data, we still need to
    // get rid of the current packet. This is because we have an invariant
    // that if currentPacket gets full, it will get queued before the next
    // writeChunk.
    //
    // Rather than wait around for space in the queue, we should instead
    // try to return to the caller as soon as possible, even though we
    // slightly overrun the MAX_PACKETS length.
    Thread.currentThread().interrupt();
    break;
  }
}
} finally { // the opening try of this finally lies above the excerpt
  Span span = Tracer.getCurrentSpan();
  if ((span != null) && (!firstWait)) {
    span.addTimelineAnnotation("end.wait");
  }
}
{code}

!mat.jpg|memory_analysis!

  was:
In our cluster, we found that the NodeManager did frequent full GCs while being decommissioned, and the biggest object was the dataQueue of DataStreamer: it held almost 60,000 DFSPackets, each about 64 KB, as shown below.
The root reason is that the size of dataQueue and ackQueue in DataStreamer has no limit when the writer thread is interrupted. DFSOutputStream#waitAndQueuePacket does not wait when the writer thread is interrupted. I know the NodeManager may stop writing when interrupted, but DFSOutputStream could also do something to avoid the full GCs.

{code:java}
while (!streamerClosed && dataQueue.size() + ackQueue.size()
    > dfsClient.getConf().getWriteMaxPackets()) {
  if (firstWait) {
    Span span = Tracer.getCurrentSpan();
    if (span != null) {
      span.addTimelineAnnotation("dataQueue.wait");
    }
    firstWait = false;
  }
  try {
    dataQueue.wait();
  } catch (InterruptedException e) {
    // If we get interrupted while waiting to queue data, we still need to
    // get rid of the current packet. This is because we have an invariant
    // that if currentPacket gets full, it will get queued before the next
    // writeChunk.
    //
    // Rather than wait around for space in the queue, we should instead
    // try to return to the caller as soon as possible, even though we
    // slightly overrun the MAX_PACKETS length.
    Thread.currentThread().interrupt();
    break;
  }
}
} finally { // the opening try of this finally lies above the excerpt
  Span span = Tracer.getCurrentSpan();
  if ((span != null) && (!firstWait)) {
    span.addTimelineAnnotation("end.wait");
  }
}
{code}

!mat.jpg|memory_analysis!

> The size of dataQueue and ackQueue in DataStreamer has no limit when writer
> thread is interrupted
> ------------------------------------------------------------------------------
>
> Key: YARN-7168
> URL: https://issues.apache.org/jira/browse/YARN-7168
> Project: Hadoop YARN
> Issue Type: Bug
> Components: client
> Reporter: Jiandan Yang
> Attachments: mat.jpg
>
> In our cluster, we found that the NodeManager did frequent full GCs while
> being decommissioned, and the biggest object was the dataQueue of
> DataStreamer: it held almost 60,000 DFSPackets, each about 64 KB, as shown
> below.
> The root reason is that the size of dataQueue and ackQueue in DataStreamer
> has no limit when the writer thread is interrupted.
> DFSOutputStream#waitAndQueuePacket does not wait when the writer thread is
> interrupted. I know the NodeManager may stop writing when interrupted, but
> DFSOutputStream could also do something to avoid infinite growth of the
> dataQueue.
[jira] [Updated] (YARN-7497) Add HDFSSchedulerConfigurationStore for RM HA
[ https://issues.apache.org/jira/browse/YARN-7497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jiandan Yang updated YARN-7497:
-------------------------------
    Description: 
YARN-5947 added LeveldbConfigurationStore, which uses LevelDB as the backing store, but it does not support YARN RM HA.
YARN-6840 supports RM HA, but a large scheduler configuration (for example, ten thousand queues) may exceed the znode size limit.
HDFSSchedulerConfigurationStore stores the conf file in HDFS; on RM failover, the new active RM can load the scheduler configuration from HDFS.

  was:
YARN-5947 added LeveldbConfigurationStore, which uses LevelDB as the backing store, but it does not support YARN RM HA. HDFSSchedulerConfigurationStore stores the conf file in HDFS; on RM failover, the new active RM can load the scheduler configuration from HDFS.

> Add HDFSSchedulerConfigurationStore for RM HA
> ---------------------------------------------
>
> Key: YARN-7497
> URL: https://issues.apache.org/jira/browse/YARN-7497
> Project: Hadoop YARN
> Issue Type: New Feature
> Components: yarn
> Reporter: Jiandan Yang
> Attachments: YARN-7497.001.patch
>
> YARN-5947 added LeveldbConfigurationStore, which uses LevelDB as the backing
> store, but it does not support YARN RM HA.
> YARN-6840 supports RM HA, but a large scheduler configuration (for example,
> ten thousand queues) may exceed the znode size limit.
> HDFSSchedulerConfigurationStore stores the conf file in HDFS; on RM failover,
> the new active RM can load the scheduler configuration from HDFS.
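As a rough illustration of the recovery path described above (a sketch with a hypothetical conf-file path and nameservice URI, not the patch itself), the newly active RM could re-read the persisted conf file like this:

{code:java}
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: on failover, load the scheduler configuration that the previous
// active RM persisted to HDFS. The URI and path below are made up for the
// example; the real patch would define them via configuration keys.
public class HdfsSchedulerConfRecoverySketch {
  public static Configuration loadSchedulerConf(Configuration base)
      throws IOException {
    FileSystem fs = FileSystem.get(URI.create("hdfs://nameservice1"), base);
    Path confFile = new Path("/yarn/scheduler-conf/scheduler-conf.xml");
    Configuration schedConf = new Configuration(false);
    schedConf.addResource(fs.open(confFile)); // parse the persisted XML conf
    return schedConf;
  }
}
{code}

Because the file lives in HDFS rather than a local LevelDB directory or a znode, it is visible to whichever RM becomes active and is not subject to the znode size limit.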
[jira] [Updated] (YARN-7497) Add HDFSSchedulerConfigurationStore for RM HA
[ https://issues.apache.org/jira/browse/YARN-7497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jiandan Yang updated YARN-7497:
-------------------------------
    Attachment: YARN-7497.001.patch

> Add HDFSSchedulerConfigurationStore for RM HA
> ---------------------------------------------
>
> Key: YARN-7497
> URL: https://issues.apache.org/jira/browse/YARN-7497
> Project: Hadoop YARN
> Issue Type: New Feature
> Components: yarn
> Reporter: Jiandan Yang
> Attachments: YARN-7497.001.patch
>
> YARN-5947 added LeveldbConfigurationStore, which uses LevelDB as the backing
> store, but it does not support YARN RM HA. HDFSSchedulerConfigurationStore
> stores the conf file in HDFS; on RM failover, the new active RM can load the
> scheduler configuration from HDFS.
[jira] [Created] (YARN-7497) Add HDFSSchedulerConfigurationStore for RM HA
Jiandan Yang created YARN-7497:
-----------------------------------

Summary: Add HDFSSchedulerConfigurationStore for RM HA
Key: YARN-7497
URL: https://issues.apache.org/jira/browse/YARN-7497
Project: Hadoop YARN
Issue Type: New Feature
Components: yarn
Reporter: Jiandan Yang

YARN-5947 added LeveldbConfigurationStore, which uses LevelDB as the backing store, but it does not support YARN RM HA. HDFSSchedulerConfigurationStore stores the conf file in HDFS; on RM failover, the new active RM can load the scheduler configuration from HDFS.
[jira] [Updated] (YARN-7497) Add HDFSSchedulerConfigurationStore for RM HA
[ https://issues.apache.org/jira/browse/YARN-7497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jiandan Yang updated YARN-7497:
-------------------------------
    Attachment: YARN-7497.005.patch

> Add HDFSSchedulerConfigurationStore for RM HA
> ---------------------------------------------
>
> Key: YARN-7497
> URL: https://issues.apache.org/jira/browse/YARN-7497
> Project: Hadoop YARN
> Issue Type: New Feature
> Components: yarn
> Reporter: Jiandan Yang
> Attachments: YARN-7497.001.patch, YARN-7497.002.patch, YARN-7497.003.patch,
> YARN-7497.004.patch, YARN-7497.005.patch
>
> YARN-5947 added LeveldbConfigurationStore, which uses LevelDB as the backing
> store, but it does not support YARN RM HA.
> YARN-6840 supports RM HA, but a large scheduler configuration (for example,
> ten thousand queues) may exceed the znode size limit.
> HDFSSchedulerConfigurationStore stores the conf file in HDFS; on RM failover,
> the new active RM can load the scheduler configuration from HDFS.
[jira] [Commented] (YARN-7497) Add HDFSSchedulerConfigurationStore for RM HA
[ https://issues.apache.org/jira/browse/YARN-7497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16270525#comment-16270525 ]

Jiandan Yang commented on YARN-7497:
------------------------------------

[~gphillips] I have moved those two static constants into YarnConfiguration.

> Add HDFSSchedulerConfigurationStore for RM HA
> ---------------------------------------------
>
> Key: YARN-7497
> URL: https://issues.apache.org/jira/browse/YARN-7497
> Project: Hadoop YARN
> Issue Type: New Feature
> Components: yarn
> Reporter: Jiandan Yang
> Attachments: YARN-7497.001.patch, YARN-7497.002.patch, YARN-7497.003.patch,
> YARN-7497.004.patch, YARN-7497.005.patch
>
> YARN-5947 added LeveldbConfigurationStore, which uses LevelDB as the backing
> store, but it does not support YARN RM HA.
> YARN-6840 supports RM HA, but a large scheduler configuration (for example,
> ten thousand queues) may exceed the znode size limit.
> HDFSSchedulerConfigurationStore stores the conf file in HDFS; on RM failover,
> the new active RM can load the scheduler configuration from HDFS.
[jira] [Commented] (YARN-7497) Add HDFSSchedulerConfigurationStore for RM HA
[ https://issues.apache.org/jira/browse/YARN-7497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16270497#comment-16270497 ]

Jiandan Yang commented on YARN-7497:
------------------------------------

[~jhung] Could you please review this patch and give me some comments? Thank you.

> Add HDFSSchedulerConfigurationStore for RM HA
> ---------------------------------------------
>
> Key: YARN-7497
> URL: https://issues.apache.org/jira/browse/YARN-7497
> Project: Hadoop YARN
> Issue Type: New Feature
> Components: yarn
> Reporter: Jiandan Yang
> Attachments: YARN-7497.001.patch, YARN-7497.002.patch, YARN-7497.003.patch,
> YARN-7497.004.patch
>
> YARN-5947 added LeveldbConfigurationStore, which uses LevelDB as the backing
> store, but it does not support YARN RM HA.
> YARN-6840 supports RM HA, but a large scheduler configuration (for example,
> ten thousand queues) may exceed the znode size limit.
> HDFSSchedulerConfigurationStore stores the conf file in HDFS; on RM failover,
> the new active RM can load the scheduler configuration from HDFS.
[jira] [Updated] (YARN-7497) Add HDFSSchedulerConfigurationStore for RM HA
[ https://issues.apache.org/jira/browse/YARN-7497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jiandan Yang updated YARN-7497:
-------------------------------
    Attachment: YARN-7497.006.patch

Fixes the TestYarnConfigurationFields failure.

> Add HDFSSchedulerConfigurationStore for RM HA
> ---------------------------------------------
>
> Key: YARN-7497
> URL: https://issues.apache.org/jira/browse/YARN-7497
> Project: Hadoop YARN
> Issue Type: New Feature
> Components: yarn
> Reporter: Jiandan Yang
> Attachments: YARN-7497.001.patch, YARN-7497.002.patch, YARN-7497.003.patch,
> YARN-7497.004.patch, YARN-7497.005.patch, YARN-7497.006.patch
>
> YARN-5947 added LeveldbConfigurationStore, which uses LevelDB as the backing
> store, but it does not support YARN RM HA.
> YARN-6840 supports RM HA, but a large scheduler configuration (for example,
> ten thousand queues) may exceed the znode size limit.
> HDFSSchedulerConfigurationStore stores the conf file in HDFS; on RM failover,
> the new active RM can load the scheduler configuration from HDFS.
[jira] [Updated] (YARN-7497) Add HDFSSchedulerConfigurationStore for RM HA
[ https://issues.apache.org/jira/browse/YARN-7497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jiandan Yang updated YARN-7497:
-------------------------------
    Attachment: YARN-7497.004.patch

Fixes the FindBugs error.

> Add HDFSSchedulerConfigurationStore for RM HA
> ---------------------------------------------
>
> Key: YARN-7497
> URL: https://issues.apache.org/jira/browse/YARN-7497
> Project: Hadoop YARN
> Issue Type: New Feature
> Components: yarn
> Reporter: Jiandan Yang
> Attachments: YARN-7497.001.patch, YARN-7497.002.patch, YARN-7497.003.patch,
> YARN-7497.004.patch
>
> YARN-5947 added LeveldbConfigurationStore, which uses LevelDB as the backing
> store, but it does not support YARN RM HA.
> YARN-6840 supports RM HA, but a large scheduler configuration (for example,
> ten thousand queues) may exceed the znode size limit.
> HDFSSchedulerConfigurationStore stores the conf file in HDFS; on RM failover,
> the new active RM can load the scheduler configuration from HDFS.
[jira] [Updated] (YARN-7497) Add HDFSSchedulerConfigurationStore for RM HA
[ https://issues.apache.org/jira/browse/YARN-7497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jiandan Yang updated YARN-7497:
-------------------------------
    Attachment: YARN-7497.003.patch

Fixes the UT and whitespace errors.

> Add HDFSSchedulerConfigurationStore for RM HA
> ---------------------------------------------
>
> Key: YARN-7497
> URL: https://issues.apache.org/jira/browse/YARN-7497
> Project: Hadoop YARN
> Issue Type: New Feature
> Components: yarn
> Reporter: Jiandan Yang
> Attachments: YARN-7497.001.patch, YARN-7497.002.patch, YARN-7497.003.patch
>
> YARN-5947 added LeveldbConfigurationStore, which uses LevelDB as the backing
> store, but it does not support YARN RM HA.
> YARN-6840 supports RM HA, but a large scheduler configuration (for example,
> ten thousand queues) may exceed the znode size limit.
> HDFSSchedulerConfigurationStore stores the conf file in HDFS; on RM failover,
> the new active RM can load the scheduler configuration from HDFS.
[jira] [Updated] (YARN-7497) Add HDFSSchedulerConfigurationStore for RM HA
[ https://issues.apache.org/jira/browse/YARN-7497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jiandan Yang updated YARN-7497:
-------------------------------
    Attachment: YARN-7497.002.patch

Uploaded v2 patch.

> Add HDFSSchedulerConfigurationStore for RM HA
> ---------------------------------------------
>
> Key: YARN-7497
> URL: https://issues.apache.org/jira/browse/YARN-7497
> Project: Hadoop YARN
> Issue Type: New Feature
> Components: yarn
> Reporter: Jiandan Yang
> Attachments: YARN-7497.001.patch, YARN-7497.002.patch
>
> YARN-5947 added LeveldbConfigurationStore, which uses LevelDB as the backing
> store, but it does not support YARN RM HA.
> YARN-6840 supports RM HA, but a large scheduler configuration (for example,
> ten thousand queues) may exceed the znode size limit.
> HDFSSchedulerConfigurationStore stores the conf file in HDFS; on RM failover,
> the new active RM can load the scheduler configuration from HDFS.
[jira] [Commented] (YARN-5636) Support reserving resources on certain nodes for certain applications
[ https://issues.apache.org/jira/browse/YARN-5636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16287229#comment-16287229 ]

Jiandan Yang commented on YARN-5636:
------------------------------------

[~Tao Jie] I think your solution is good. Could you provide a patch for review?

> Support reserving resources on certain nodes for certain applications
> ----------------------------------------------------------------------
>
> Key: YARN-5636
> URL: https://issues.apache.org/jira/browse/YARN-5636
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: scheduler
> Reporter: Tao Jie
>
> We have met the following circumstance: we are trying to run Storm on YARN
> via Slider, and Storm writes data to the local disk on each node. If some
> containers or the application fail, we expect those containers to restart on
> the same nodes they ran on before; otherwise the data written locally would
> be lost.
> Slider tries to ensure that restarted containers land on the same nodes as
> before. However, in YARN the resources may be assigned to other applications
> once the former long-running application is down.
> As a result, we had better have a mechanism that reserves some resources for
> certain long-running applications on certain nodes for a period of time. Does
> it make sense?
[jira] [Commented] (YARN-7715) Support NM promotion/demotion of running containers.
[ https://issues.apache.org/jira/browse/YARN-7715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16475703#comment-16475703 ]

Jiandan Yang commented on YARN-7715:
------------------------------------

[~miklos.szeg...@cloudera.com] How is the AM informed if updating the cgroup resources fails?

> Support NM promotion/demotion of running containers.
> -----------------------------------------------------
>
> Key: YARN-7715
> URL: https://issues.apache.org/jira/browse/YARN-7715
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Arun Suresh
> Assignee: Miklos Szegedi
> Priority: Major
> Fix For: 3.2.0
>
> Attachments: YARN-7715.000.patch, YARN-7715.001.patch, YARN-7715.002.patch,
> YARN-7715.003.patch, YARN-7715.004.patch
>
> In YARN-6673 and YARN-6674, the cgroups resource handlers update the cgroups
> params for the containers, based on opportunistic or guaranteed, in the
> *preStart* method.
> Now that YARN-5085 is in, container executionType (as well as the CPU, memory
> and any other resources) can be updated after the container has started. This
> means we need the ability to change cgroups params after container start.
[jira] [Commented] (YARN-7715) Support NM promotion/demotion of running containers.
[ https://issues.apache.org/jira/browse/YARN-7715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16476961#comment-16476961 ]

Jiandan Yang commented on YARN-7715:
------------------------------------

Thanks [~miklos.szeg...@cloudera.com]. Updating the execution type also requires updating the cgroup (cfs_period_us, cfs_quota_us, shares), but the AM is not notified when the cgroup update fails. Also, container recovery will fail when the NM restarts if the updated execution type is not stored.

> Support NM promotion/demotion of running containers.
> -----------------------------------------------------
>
> Key: YARN-7715
> URL: https://issues.apache.org/jira/browse/YARN-7715
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Arun Suresh
> Assignee: Miklos Szegedi
> Priority: Major
> Fix For: 3.2.0
>
> Attachments: YARN-7715.000.patch, YARN-7715.001.patch, YARN-7715.002.patch,
> YARN-7715.003.patch, YARN-7715.004.patch
>
> In YARN-6673 and YARN-6674, the cgroups resource handlers update the cgroups
> params for the containers, based on opportunistic or guaranteed, in the
> *preStart* method.
> Now that YARN-5085 is in, container executionType (as well as the CPU, memory
> and any other resources) can be updated after the container has started. This
> means we need the ability to change cgroups params after container start.
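For context, a promotion or demotion ultimately boils down to rewriting a handful of cgroup files for the container. A minimal sketch (hypothetical paths and values, not the YARN-7715 implementation) of the failure mode being discussed:

{code:java}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Sketch: promote an OPPORTUNISTIC container to GUARANTEED by giving it
// full CPU weight and lifting the hard CFS cap. A failed write surfaces as
// an IOException, which is exactly the failure the comment above asks
// about: today it is not propagated back to the AM.
public class CgroupPromotionSketch {
  private static void writeParam(String file, String value)
      throws IOException {
    Files.write(Paths.get(file), value.getBytes(StandardCharsets.UTF_8));
  }

  public static void promote(String containerCgroupDir) throws IOException {
    writeParam(containerCgroupDir + "/cpu.shares", "1024");     // full weight
    writeParam(containerCgroupDir + "/cpu.cfs_quota_us", "-1"); // no hard cap
  }
}
{code}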
[jira] [Updated] (YARN-8320) Add support CPU isolation for latency-sensitive (LS) tasks
[ https://issues.apache.org/jira/browse/YARN-8320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jiandan Yang updated YARN-8320:
-------------------------------
    Description: 
Currently the NodeManager uses "cpu.cfs_period_us", "cpu.cfs_quota_us" and "cpu.shares" to isolate CPU resources. However,
* Linux Completely Fair Scheduling (CFS) is a throughput-oriented scheduler with no support for differentiated latency.
* Request latency of services running in containers may fluctuate frequently when all containers share CPUs, which latency-sensitive services cannot afford in our production environment.

So we need finer-grained CPU isolation.
My co-workers and I propose a solution that uses the cgroup cpuset controller to bind containers to different processors, following a [Google PPT|http://schd.ws/hosted_files/lcccna2016/a7/CAT%20@%20Scale.pdf].
Later I will upload a detailed design doc.

  was:
Currently the NodeManager uses "cpu.cfs_period_us", "cpu.cfs_quota_us" and "cpu.shares" to isolate CPU resources. However,
* Linux Completely Fair Scheduling (CFS) is a throughput-oriented scheduler with no support for differentiated latency.
* Request latency of services running in containers may fluctuate frequently when all containers share CPUs, which latency-sensitive services cannot afford in our production environment.

So we need finer-grained CPU isolation.
My co-workers and I propose a solution that uses the cgroup cpuset controller to bind containers to different processors, following a [Google PPT|http://schd.ws/hosted_files/lcccna2016/a7/CAT%20@%20Scale.pdf].
Later I will upload a detailed design doc.

> Add support CPU isolation for latency-sensitive (LS) tasks
> -----------------------------------------------------------
>
> Key: YARN-8320
> URL: https://issues.apache.org/jira/browse/YARN-8320
> Project: Hadoop YARN
> Issue Type: New Feature
> Components: nodemanager
> Reporter: Jiandan Yang
> Priority: Major
>
> Currently the NodeManager uses "cpu.cfs_period_us", "cpu.cfs_quota_us" and
> "cpu.shares" to isolate CPU resources. However,
> * Linux Completely Fair Scheduling (CFS) is a throughput-oriented scheduler
> with no support for differentiated latency.
> * Request latency of services running in containers may fluctuate frequently
> when all containers share CPUs, which latency-sensitive services cannot
> afford in our production environment.
> So we need finer-grained CPU isolation.
> My co-workers and I propose a solution that uses the cgroup cpuset controller
> to bind containers to different processors, following a [Google
> PPT|http://schd.ws/hosted_files/lcccna2016/a7/CAT%20@%20Scale.pdf].
> Later I will upload a detailed design doc.
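The core of the cpuset approach can be sketched in a few lines (illustrative only; the directory layout and core ranges are assumptions, and the real allocation logic is what the design doc covers):

{code:java}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Sketch: pin a latency-sensitive container to dedicated cores via the
// cpuset cgroup controller. Note that cpuset.mems must also be populated,
// or the kernel rejects attaching tasks to the cgroup.
public class CpusetBindingSketch {
  public static void bindToCores(String containerCgroupDir, String cpus)
      throws IOException {
    // e.g. cpus = "4-7" restricts the container to cores 4 through 7
    Files.write(Paths.get(containerCgroupDir + "/cpuset.cpus"),
        cpus.getBytes(StandardCharsets.UTF_8));
    Files.write(Paths.get(containerCgroupDir + "/cpuset.mems"),
        "0".getBytes(StandardCharsets.UTF_8)); // single NUMA node assumed
  }
}
{code}

Unlike cfs_quota_us, which throttles a container after it has consumed its quota, cpuset gives each latency-sensitive container exclusive cores, so its tail latency no longer depends on what the other containers on the node are doing.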
[jira] [Updated] (YARN-8320) Add support CPU isolation for latency-sensitive (LS) tasks
[ https://issues.apache.org/jira/browse/YARN-8320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jiandan Yang updated YARN-8320:
-------------------------------
    Attachment: CPU-isolation-for-latency-sensitive-services-v1.pdf

> Add support CPU isolation for latency-sensitive (LS) tasks
> -----------------------------------------------------------
>
> Key: YARN-8320
> URL: https://issues.apache.org/jira/browse/YARN-8320
> Project: Hadoop YARN
> Issue Type: New Feature
> Components: nodemanager
> Reporter: Jiandan Yang
> Priority: Major
> Attachments: CPU-isolation-for-latency-sensitive-services-v1.pdf
>
> Currently the NodeManager uses "cpu.cfs_period_us", "cpu.cfs_quota_us" and
> "cpu.shares" to isolate CPU resources. However,
> * Linux Completely Fair Scheduling (CFS) is a throughput-oriented scheduler
> with no support for differentiated latency.
> * Request latency of services running in containers may fluctuate frequently
> when all containers share CPUs, which latency-sensitive services cannot
> afford in our production environment.
> So we need finer-grained CPU isolation.
> My co-workers and I propose a solution that uses the cgroup cpuset controller
> to bind containers to different processors, following a [Google
> PPT|http://schd.ws/hosted_files/lcccna2016/a7/CAT%20@%20Scale.pdf].
> Later I will upload a detailed design doc.
[jira] [Updated] (YARN-8320) Add support CPU isolation for latency-sensitive (LS) tasks
[ https://issues.apache.org/jira/browse/YARN-8320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jiandan Yang updated YARN-8320:
-------------------------------
    Attachment: (was: CPU-isolation-for-latency-sensitive-services-v1.pdf)

> Add support CPU isolation for latency-sensitive (LS) tasks
> -----------------------------------------------------------
>
> Key: YARN-8320
> URL: https://issues.apache.org/jira/browse/YARN-8320
> Project: Hadoop YARN
> Issue Type: New Feature
> Components: nodemanager
> Reporter: Jiandan Yang
> Priority: Major
> Attachments: CPU-isolation-for-latency-sensitive-services-v1.pdf
>
> Currently the NodeManager uses "cpu.cfs_period_us", "cpu.cfs_quota_us" and
> "cpu.shares" to isolate CPU resources. However,
> * Linux Completely Fair Scheduling (CFS) is a throughput-oriented scheduler
> with no support for differentiated latency.
> * Request latency of services running in containers may fluctuate frequently
> when all containers share CPUs, which latency-sensitive services cannot
> afford in our production environment.
> So we need finer-grained CPU isolation.
> My co-workers and I propose a solution that uses the cgroup cpuset controller
> to bind containers to different processors, following a [Google
> PPT|http://schd.ws/hosted_files/lcccna2016/a7/CAT%20@%20Scale.pdf].
> Later I will upload a detailed design doc.
[jira] [Updated] (YARN-8320) Add support CPU isolation for latency-sensitive (LS) tasks
[ https://issues.apache.org/jira/browse/YARN-8320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jiandan Yang updated YARN-8320:
-------------------------------
    Attachment: CPU-isolation-for-latency-sensitive-services-v1.pdf

> Add support CPU isolation for latency-sensitive (LS) tasks
> -----------------------------------------------------------
>
> Key: YARN-8320
> URL: https://issues.apache.org/jira/browse/YARN-8320
> Project: Hadoop YARN
> Issue Type: New Feature
> Components: nodemanager
> Reporter: Jiandan Yang
> Priority: Major
> Attachments: CPU-isolation-for-latency-sensitive-services-v1.pdf
>
> Currently the NodeManager uses "cpu.cfs_period_us", "cpu.cfs_quota_us" and
> "cpu.shares" to isolate CPU resources. However,
> * Linux Completely Fair Scheduling (CFS) is a throughput-oriented scheduler
> with no support for differentiated latency.
> * Request latency of services running in containers may fluctuate frequently
> when all containers share CPUs, which latency-sensitive services cannot
> afford in our production environment.
> So we need finer-grained CPU isolation.
> My co-workers and I propose a solution that uses the cgroup cpuset controller
> to bind containers to different processors, following a [Google
> PPT|http://schd.ws/hosted_files/lcccna2016/a7/CAT%20@%20Scale.pdf].
> Later I will upload a detailed design doc.
[jira] [Commented] (YARN-8320) Add support CPU isolation for latency-sensitive (LS) tasks
[ https://issues.apache.org/jira/browse/YARN-8320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16480529#comment-16480529 ]

Jiandan Yang commented on YARN-8320:
------------------------------------

Uploaded design doc v1. Please feel free to let me know your questions / comments. If everyone agrees with the general approach, I will go ahead and create a patch.

> Add support CPU isolation for latency-sensitive (LS) tasks
> -----------------------------------------------------------
>
> Key: YARN-8320
> URL: https://issues.apache.org/jira/browse/YARN-8320
> Project: Hadoop YARN
> Issue Type: New Feature
> Components: nodemanager
> Reporter: Jiandan Yang
> Priority: Major
> Attachments: CPU-isolation-for-latency-sensitive-services-v1.pdf
>
> Currently the NodeManager uses "cpu.cfs_period_us", "cpu.cfs_quota_us" and
> "cpu.shares" to isolate CPU resources. However,
> * Linux Completely Fair Scheduling (CFS) is a throughput-oriented scheduler
> with no support for differentiated latency.
> * Request latency of services running in containers may fluctuate frequently
> when all containers share CPUs, which latency-sensitive services cannot
> afford in our production environment.
> So we need finer-grained CPU isolation.
> My co-workers and I propose a solution that uses the cgroup cpuset controller
> to bind containers to different processors, following a [Google
> PPT|http://schd.ws/hosted_files/lcccna2016/a7/CAT%20@%20Scale.pdf].
> Later I will upload a detailed design doc.
[jira] [Updated] (YARN-8320) Add support CPU isolation for latency-sensitive (LS) service
[ https://issues.apache.org/jira/browse/YARN-8320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jiandan Yang updated YARN-8320:
-------------------------------
    Summary: Add support CPU isolation for latency-sensitive (LS) service  (was: Add support CPU isolation for latency-sensitive (LS) tasks)

> Add support CPU isolation for latency-sensitive (LS) service
> -------------------------------------------------------------
>
> Key: YARN-8320
> URL: https://issues.apache.org/jira/browse/YARN-8320
> Project: Hadoop YARN
> Issue Type: New Feature
> Components: nodemanager
> Reporter: Jiandan Yang
> Priority: Major
> Attachments: CPU-isolation-for-latency-sensitive-services-v1.pdf
>
> Currently the NodeManager uses "cpu.cfs_period_us", "cpu.cfs_quota_us" and
> "cpu.shares" to isolate CPU resources. However,
> * Linux Completely Fair Scheduling (CFS) is a throughput-oriented scheduler
> with no support for differentiated latency.
> * Request latency of services running in containers may fluctuate frequently
> when all containers share CPUs, which latency-sensitive services cannot
> afford in our production environment.
> So we need finer-grained CPU isolation.
> My co-workers and I propose a solution that uses the cgroup cpuset controller
> to bind containers to different processors, following a [Google
> PPT|http://schd.ws/hosted_files/lcccna2016/a7/CAT%20@%20Scale.pdf].
> Later I will upload a detailed design doc.
[jira] [Updated] (YARN-8320) Add support CPU isolation for latency-sensitive (LS) tasks
[ https://issues.apache.org/jira/browse/YARN-8320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jiandan Yang updated YARN-8320:
-------------------------------
    Attachment: CPU-isolation-for-latency-sensitive-services-v1.pdf

> Add support CPU isolation for latency-sensitive (LS) tasks
> -----------------------------------------------------------
>
> Key: YARN-8320
> URL: https://issues.apache.org/jira/browse/YARN-8320
> Project: Hadoop YARN
> Issue Type: New Feature
> Components: nodemanager
> Reporter: Jiandan Yang
> Priority: Major
> Attachments: CPU-isolation-for-latency-sensitive-services-v1.pdf
>
> Currently the NodeManager uses "cpu.cfs_period_us", "cpu.cfs_quota_us" and
> "cpu.shares" to isolate CPU resources. However,
> * Linux Completely Fair Scheduling (CFS) is a throughput-oriented scheduler
> with no support for differentiated latency.
> * Request latency of services running in containers may fluctuate frequently
> when all containers share CPUs, which latency-sensitive services cannot
> afford in our production environment.
> So we need finer-grained CPU isolation.
> My co-workers and I propose a solution that uses the cgroup cpuset controller
> to bind containers to different processors, following a [Google
> PPT|http://schd.ws/hosted_files/lcccna2016/a7/CAT%20@%20Scale.pdf].
> Later I will upload a detailed design doc.
[jira] [Created] (YARN-8320) Add support CPU isolation for latency-sensitive (LS) tasks
Jiandan Yang created YARN-8320:
-----------------------------------

Summary: Add support CPU isolation for latency-sensitive (LS) tasks
Key: YARN-8320
URL: https://issues.apache.org/jira/browse/YARN-8320
Project: Hadoop YARN
Issue Type: New Feature
Components: nodemanager
Reporter: Jiandan Yang

Currently the NodeManager uses "cpu.cfs_period_us", "cpu.cfs_quota_us" and "cpu.shares" to isolate CPU resources. However,
* Linux Completely Fair Scheduling (CFS) is a throughput-oriented scheduler with no support for differentiated latency.
* Request latency of services running in containers may fluctuate frequently when all containers share CPUs, which latency-sensitive services cannot afford in our production environment.

So we need finer-grained CPU isolation.
My co-workers and I propose a solution that uses the cgroup cpuset controller to bind containers to different processors, following a [Google PPT|http://schd.ws/hosted_files/lcccna2016/a7/CAT%20@%20Scale.pdf].
Later I will upload a detailed design doc.
[jira] [Updated] (YARN-8320) Add support CPU isolation for latency-sensitive (LS) tasks
[ https://issues.apache.org/jira/browse/YARN-8320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jiandan Yang updated YARN-8320:
-------------------------------
    Attachment: (was: CPU-isolation-for-latency-sensitive-services-v1.pdf)

> Add support CPU isolation for latency-sensitive (LS) tasks
> -----------------------------------------------------------
>
> Key: YARN-8320
> URL: https://issues.apache.org/jira/browse/YARN-8320
> Project: Hadoop YARN
> Issue Type: New Feature
> Components: nodemanager
> Reporter: Jiandan Yang
> Priority: Major
>
> Currently the NodeManager uses "cpu.cfs_period_us", "cpu.cfs_quota_us" and
> "cpu.shares" to isolate CPU resources. However,
> * Linux Completely Fair Scheduling (CFS) is a throughput-oriented scheduler
> with no support for differentiated latency.
> * Request latency of services running in containers may fluctuate frequently
> when all containers share CPUs, which latency-sensitive services cannot
> afford in our production environment.
> So we need finer-grained CPU isolation.
> My co-workers and I propose a solution that uses the cgroup cpuset controller
> to bind containers to different processors, following a [Google
> PPT|http://schd.ws/hosted_files/lcccna2016/a7/CAT%20@%20Scale.pdf].
> Later I will upload a detailed design doc.
[jira] [Commented] (YARN-8320) Add support CPU isolation for latency-sensitive (LS) service
[ https://issues.apache.org/jira/browse/YARN-8320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16482336#comment-16482336 ]

Jiandan Yang commented on YARN-8320:
------------------------------------

Uploaded v1 patch to initiate discussion.

> Add support CPU isolation for latency-sensitive (LS) service
> -------------------------------------------------------------
>
> Key: YARN-8320
> URL: https://issues.apache.org/jira/browse/YARN-8320
> Project: Hadoop YARN
> Issue Type: New Feature
> Components: nodemanager
> Reporter: Jiandan Yang
> Priority: Major
> Attachments: CPU-isolation-for-latency-sensitive-services-v1.pdf,
> YARN-8320.001.patch
>
> Currently the NodeManager uses "cpu.cfs_period_us", "cpu.cfs_quota_us" and
> "cpu.shares" to isolate CPU resources. However,
> * Linux Completely Fair Scheduling (CFS) is a throughput-oriented scheduler
> with no support for differentiated latency.
> * Request latency of services running in containers may fluctuate frequently
> when all containers share CPUs, which latency-sensitive services cannot
> afford in our production environment.
> So we need finer-grained CPU isolation.
> My co-workers and I propose a solution that uses the cgroup cpuset controller
> to bind containers to different processors; this is inspired by the isolation
> technique in the [Borg
> system|http://schd.ws/hosted_files/lcccna2016/a7/CAT%20@%20Scale.pdf].
> Later I will upload a detailed design doc.
[jira] [Updated] (YARN-8320) Add support CPU isolation for latency-sensitive (LS) service
[ https://issues.apache.org/jira/browse/YARN-8320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jiandan Yang updated YARN-8320:
-------------------------------
    Attachment: YARN-8320.001.patch

> Add support CPU isolation for latency-sensitive (LS) service
> -------------------------------------------------------------
>
> Key: YARN-8320
> URL: https://issues.apache.org/jira/browse/YARN-8320
> Project: Hadoop YARN
> Issue Type: New Feature
> Components: nodemanager
> Reporter: Jiandan Yang
> Priority: Major
> Attachments: CPU-isolation-for-latency-sensitive-services-v1.pdf,
> YARN-8320.001.patch
>
> Currently the NodeManager uses "cpu.cfs_period_us", "cpu.cfs_quota_us" and
> "cpu.shares" to isolate CPU resources. However,
> * Linux Completely Fair Scheduling (CFS) is a throughput-oriented scheduler
> with no support for differentiated latency.
> * Request latency of services running in containers may fluctuate frequently
> when all containers share CPUs, which latency-sensitive services cannot
> afford in our production environment.
> So we need finer-grained CPU isolation.
> My co-workers and I propose a solution that uses the cgroup cpuset controller
> to bind containers to different processors; this is inspired by the isolation
> technique in the [Borg
> system|http://schd.ws/hosted_files/lcccna2016/a7/CAT%20@%20Scale.pdf].
> Later I will upload a detailed design doc.
[jira] [Updated] (YARN-8320) Add support CPU isolation for latency-sensitive (LS) service
[ https://issues.apache.org/jira/browse/YARN-8320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated YARN-8320: Attachment: (was: YARN-8320.001.patch) > Add support CPU isolation for latency-sensitive (LS) service > - > > Key: YARN-8320 > URL: https://issues.apache.org/jira/browse/YARN-8320 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager >Reporter: Jiandan Yang >Priority: Major > Attachments: CPU-isolation-for-latency-sensitive-services-v1.pdf, > YARN-8320.001.patch > > > Currently NodeManager uses “cpu.cfs_period_us”, “cpu.cfs_quota_us” and > “cpu.shares” to isolate cpu resource. However, > * Linux Completely Fair Scheduling (CFS) is a throughput-oriented scheduler > with no support for differentiated latency; > * Request latency of services running in containers may fluctuate sharply > when all containers share cpus, which latency-sensitive services cannot > tolerate in our production environment. > So we need finer-grained cpu isolation. > My co-workers and I propose a solution that uses cgroup cpuset to bind > containers to different processors; this is inspired by the isolation > technique in the [Borg > system|http://schd.ws/hosted_files/lcccna2016/a7/CAT%20@%20Scale.pdf]. > Later I will upload a detailed design doc. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8320) Add support CPU isolation for latency-sensitive (LS) service
[ https://issues.apache.org/jira/browse/YARN-8320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated YARN-8320: Attachment: (was: CPU-isolation-for-latency-sensitive-services-v1.pdf) > Add support CPU isolation for latency-sensitive (LS) service > - > > Key: YARN-8320 > URL: https://issues.apache.org/jira/browse/YARN-8320 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager >Reporter: Jiandan Yang >Priority: Major > Attachments: CPU-isolation-for-latency-sensitive-services-v1.pdf, > YARN-8320.001.patch > > > Currently NodeManager uses “cpu.cfs_period_us”, “cpu.cfs_quota_us” and > “cpu.shares” to isolate cpu resource. However, > * Linux Completely Fair Scheduling (CFS) is a throughput-oriented scheduler > with no support for differentiated latency; > * Request latency of services running in containers may fluctuate sharply > when all containers share cpus, which latency-sensitive services cannot > tolerate in our production environment. > So we need finer-grained cpu isolation. > My co-workers and I propose a solution that uses cgroup cpuset to bind > containers to different processors; this is inspired by the isolation > technique in the [Borg > system|http://schd.ws/hosted_files/lcccna2016/a7/CAT%20@%20Scale.pdf]. > Later I will upload a detailed design doc. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8320) Add support CPU isolation for latency-sensitive (LS) service
[ https://issues.apache.org/jira/browse/YARN-8320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated YARN-8320: Attachment: CPU-isolation-for-latency-sensitive-services-v1.pdf > Add support CPU isolation for latency-sensitive (LS) service > - > > Key: YARN-8320 > URL: https://issues.apache.org/jira/browse/YARN-8320 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager >Reporter: Jiandan Yang >Priority: Major > Attachments: CPU-isolation-for-latency-sensitive-services-v1.pdf, > YARN-8320.001.patch > > > Currently NodeManager uses “cpu.cfs_period_us”, “cpu.cfs_quota_us” and > “cpu.shares” to isolate cpu resource. However, > * Linux Completely Fair Scheduling (CFS) is a throughput-oriented scheduler > with no support for differentiated latency; > * Request latency of services running in containers may fluctuate sharply > when all containers share cpus, which latency-sensitive services cannot > tolerate in our production environment. > So we need finer-grained cpu isolation. > My co-workers and I propose a solution that uses cgroup cpuset to bind > containers to different processors; this is inspired by the isolation > technique in the [Borg > system|http://schd.ws/hosted_files/lcccna2016/a7/CAT%20@%20Scale.pdf]. > Later I will upload a detailed design doc. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8320) Support CPU isolation for latency-sensitive (LS) service
[ https://issues.apache.org/jira/browse/YARN-8320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated YARN-8320: Attachment: CPU-isolation-for-latency-sensitive-services-v2.pdf > Support CPU isolation for latency-sensitive (LS) service > > > Key: YARN-8320 > URL: https://issues.apache.org/jira/browse/YARN-8320 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager >Reporter: Jiandan Yang >Priority: Major > Attachments: CPU-isolation-for-latency-sensitive-services-v1.pdf, > CPU-isolation-for-latency-sensitive-services-v2.pdf, YARN-8320.001.patch > > > Currently NodeManager uses “cpu.cfs_period_us”, “cpu.cfs_quota_us” and > “cpu.shares” to isolate cpu resource. However, > * Linux Completely Fair Scheduling (CFS) is a throughput-oriented scheduler > with no support for differentiated latency; > * Request latency of services running in containers may fluctuate sharply > when all containers share cpus, which latency-sensitive services cannot > tolerate in our production environment. > So we need more fine-grained cpu isolation. > Here we propose a solution that uses cgroup cpuset to bind containers to > different processors; this is inspired by the isolation technique in the > [Borg system|http://schd.ws/hosted_files/lcccna2016/a7/CAT%20@%20Scale.pdf]. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8320) Support CPU isolation for latency-sensitive (LS) service
[ https://issues.apache.org/jira/browse/YARN-8320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487066#comment-16487066 ] Jiandan Yang commented on YARN-8320: - [~cheersyang] and I discussed the design offline together. I have added more details in the v2 design doc. > Support CPU isolation for latency-sensitive (LS) service > > > Key: YARN-8320 > URL: https://issues.apache.org/jira/browse/YARN-8320 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager >Reporter: Jiandan Yang >Priority: Major > Attachments: CPU-isolation-for-latency-sensitive-services-v1.pdf, > YARN-8320.001.patch > > > Currently NodeManager uses “cpu.cfs_period_us”, “cpu.cfs_quota_us” and > “cpu.shares” to isolate cpu resource. However, > * Linux Completely Fair Scheduling (CFS) is a throughput-oriented scheduler > with no support for differentiated latency; > * Request latency of services running in containers may fluctuate sharply > when all containers share cpus, which latency-sensitive services cannot > tolerate in our production environment. > So we need more fine-grained cpu isolation. > Here we propose a solution that uses cgroup cpuset to bind containers to > different processors; this is inspired by the isolation technique in the > [Borg system|http://schd.ws/hosted_files/lcccna2016/a7/CAT%20@%20Scale.pdf]. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7715) Support NM promotion/demotion of running containers.
[ https://issues.apache.org/jira/browse/YARN-7715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16476776#comment-16476776 ] Jiandan Yang commented on YARN-7715: - Hi, [~miklos.szeg...@cloudera.com] Thanks for your reply. I mean the AM does not know when the NM fails to update the resource. Consider the following case: 1. The AM increases vcores via updateContainer. 2. The NM fails to update the cgroup when executing CGroupsCpuResourceHandlerImpl#updateContainer. And another question: the updated containers need to be persisted, but I did not find the related code in your patch. > Support NM promotion/demotion of running containers. > > > Key: YARN-7715 > URL: https://issues.apache.org/jira/browse/YARN-7715 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Arun Suresh >Assignee: Miklos Szegedi >Priority: Major > Fix For: 3.2.0 > > Attachments: YARN-7715.000.patch, YARN-7715.001.patch, > YARN-7715.002.patch, YARN-7715.003.patch, YARN-7715.004.patch > > > In YARN-6673 and YARN-6674, the cgroups resource handlers update the cgroups > params for the containers, based on opportunistic or guaranteed, in the > *preStart* method. > Now that YARN-5085 is in, Container executionType (as well as the cpu, memory > and any other resources) can be updated after the container has started. This > means we need the ability to change cgroups params after container start. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
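As a sketch of the concern raised in this comment: if the cgroup write in updateContainer only logs on failure, the AM keeps assuming the new vcore value took effect. A minimal, hypothetical way to surface the failure is to let the write propagate an exception that the NM can turn into a failed update response; class and method names below are assumptions, not code from the patch.
{code:java}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Hypothetical sketch: fail loudly when a cpu quota update cannot be applied. */
public class CpuQuotaUpdater {
  public void updateQuota(String containerCgroupPath, long quotaUs)
      throws IOException {
    try {
      Files.write(Paths.get(containerCgroupPath, "cpu.cfs_quota_us"),
          String.valueOf(quotaUs).getBytes(StandardCharsets.UTF_8));
    } catch (IOException e) {
      // Propagate instead of only logging, so the caller can mark the
      // container update as failed and keep the previous resource value.
      throw new IOException(
          "Updating cpu.cfs_quota_us failed for " + containerCgroupPath, e);
    }
  }
}
{code}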
[jira] [Updated] (YARN-8320) Add support CPU isolation for latency-sensitive (LS) service
[ https://issues.apache.org/jira/browse/YARN-8320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated YARN-8320: Attachment: CPU-isolation-for-latency-sensitive-services-v1.pdf > Add support CPU isolation for latency-sensitive (LS) service > - > > Key: YARN-8320 > URL: https://issues.apache.org/jira/browse/YARN-8320 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager >Reporter: Jiandan Yang >Priority: Major > Attachments: CPU-isolation-for-latency-sensitive-services-v1.pdf > > > Currently NodeManager uses “cpu.cfs_period_us”, “cpu.cfs_quota_us” and > “cpu.shares” to isolate cpu resource. However, > * Linux Completely Fair Scheduling (CFS) is a throughput-oriented scheduler; > no support for differentiated latency > * Request latency of services running on container may be frequent shake > when all containers share cpus, and latency-sensitive services can not afford > in our production environment. > So we need more finer cpu isolation. > My co-workers and I propose a solution using cgroup cpuset to binds > containers to different processors, this is inspired by the isolation > technique in [Borg > system|http://schd.ws/hosted_files/lcccna2016/a7/CAT%20@%20Scale.pdf]. > Later I will upload a detailed design doc. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8320) Add support CPU isolation for latency-sensitive (LS) service
[ https://issues.apache.org/jira/browse/YARN-8320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated YARN-8320: Attachment: (was: CPU-isolation-for-latency-sensitive-services-v1.pdf) > Add support CPU isolation for latency-sensitive (LS) service > - > > Key: YARN-8320 > URL: https://issues.apache.org/jira/browse/YARN-8320 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager >Reporter: Jiandan Yang >Priority: Major > Attachments: CPU-isolation-for-latency-sensitive-services-v1.pdf > > > Currently NodeManager uses “cpu.cfs_period_us”, “cpu.cfs_quota_us” and > “cpu.shares” to isolate cpu resource. However, > * Linux Completely Fair Scheduling (CFS) is a throughput-oriented scheduler; > no support for differentiated latency > * Request latency of services running on container may be frequent shake > when all containers share cpus, and latency-sensitive services can not afford > in our production environment. > So we need more finer cpu isolation. > My co-workers and I propose a solution using cgroup cpuset to binds > containers to different processors, this is inspired by the isolation > technique in [Borg > system|http://schd.ws/hosted_files/lcccna2016/a7/CAT%20@%20Scale.pdf]. > Later I will upload a detailed design doc. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-6673) Add cpu cgroup configurations for opportunistic containers
[ https://issues.apache.org/jira/browse/YARN-6673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16312473#comment-16312473 ] Jiandan Yang edited comment on YARN-6673 at 1/5/18 4:49 AM: - [~miklos.szeg...@cloudera.com] How about setting the cpu share for an Opportunistic container to *CPU_DEFAULT_WEIGHT_OPPORTUNISTIC * containerVCores*? was (Author: yangjiandan): [~miklos.szeg...@cloudera.com] How about setting Cpu share for Opportunistic container * CPU_DEFAULT_WEIGHT_OPPORTUNISTIC * containerVCores* > Add cpu cgroup configurations for opportunistic containers > -- > > Key: YARN-6673 > URL: https://issues.apache.org/jira/browse/YARN-6673 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Haibo Chen >Assignee: Miklos Szegedi > Fix For: 3.0.0-beta1 > > Attachments: YARN-6673.000.patch > > > In addition to setting cpu.cfs_period_us on a per-container basis, we could > also set cpu.shares to 2 for opportunistic containers so they are run on a > best-effort basis -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6673) Add cpu cgroup configurations for opportunistic containers
[ https://issues.apache.org/jira/browse/YARN-6673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16312473#comment-16312473 ] Jiandan Yang commented on YARN-6673: - [~miklos.szeg...@cloudera.com] How about setting the cpu share for an Opportunistic container to *CPU_DEFAULT_WEIGHT_OPPORTUNISTIC * containerVCores*? > Add cpu cgroup configurations for opportunistic containers > -- > > Key: YARN-6673 > URL: https://issues.apache.org/jira/browse/YARN-6673 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Haibo Chen >Assignee: Miklos Szegedi > Fix For: 3.0.0-beta1 > > Attachments: YARN-6673.000.patch > > > In addition to setting cpu.cfs_period_us on a per-container basis, we could > also set cpu.shares to 2 for opportunistic containers so they are run on a > best-effort basis -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
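For clarity, the formula proposed in the two comments above amounts to the following one-liner. CPU_DEFAULT_WEIGHT_OPPORTUNISTIC is the constant referenced in the comment; its value below is an assumption for illustration only (the issue description itself suggests a flat cpu.shares of 2 per opportunistic container).
{code:java}
public final class OpportunisticShares {
  // Assumed example value; the issue description suggests a flat cpu.shares of 2.
  static final int CPU_DEFAULT_WEIGHT_OPPORTUNISTIC = 2;

  /**
   * cpu.shares scaled by vcores, so larger opportunistic containers still
   * get proportionally more cpu when the node is contended, while all of
   * them remain best-effort relative to guaranteed containers.
   */
  static int cpuShares(int containerVCores) {
    return CPU_DEFAULT_WEIGHT_OPPORTUNISTIC * containerVCores;
  }
}
{code}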
[jira] [Commented] (YARN-7693) ContainersMonitor support configurable
[ https://issues.apache.org/jira/browse/YARN-7693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16313097#comment-16313097 ] Jiandan Yang commented on YARN-7693: - [~miklos.szeg...@cloudera.com] Opportunistic Containers may impact Guaranteed Containers when they are under the same cgroup, because memory.soft_limit_in_bytes is best-effort and not guaranteed. Consider the following steps: 1. Memory utilization of the Guaranteed Containers on a NodeManager is very low; real memory usage is below the allocation due to little traffic. 2. The scheduler places some Opportunistic Containers on that NodeManager due to oversubscription. 3. Memory utilization of the Guaranteed Containers increases due to heavy traffic, but stays within their hard limit. 4. *hadoop-yarn* exceeds its hard limit. 5. If the oom-killer is enabled, a Guaranteed Container may be killed, which violates the guarantee. 6. If the oom-killer is not enabled, a Guaranteed Container may hang. So Opportunistic Containers may impact Guaranteed Containers when they are under the same cgroup. If they are under different cgroups, where Guaranteed and Opportunistic each have their own hard limit, Opportunistic Containers never impact Guaranteed Containers: monitor the resource utilization of the Guaranteed Containers, and if there is a gap between allocation and actual demand, grant part of that gap to the Opportunistic group; if the gap falls below a given value, decrease the hard limit of the Opportunistic group. Kill containers when adjusting the hard limit fails a given number of times, in order to protect the resources of the Guaranteed Containers. > ContainersMonitor support configurable > -- > > Key: YARN-7693 > URL: https://issues.apache.org/jira/browse/YARN-7693 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Minor > Attachments: YARN-7693.001.patch, YARN-7693.002.patch > > > Currently ContainersMonitor has only one default implementation, > ContainersMonitorImpl. > After introducing Opportunistic Containers, ContainersMonitor needs to monitor > system metrics and even dynamically adjust Opportunistic and Guaranteed > resources in the cgroup, so another ContainersMonitor may need to be > implemented. > ContainerManagerImpl currently instantiates ContainersMonitorImpl directly > via new, so ContainersMonitor needs to be configurable. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
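The two-cgroup scheme sketched in this comment can be summarized with the rough Java sketch below, assuming two sibling memory cgroups ("guaranteed" and "opportunistic"), each with its own hard limit. All class, method, and threshold names are illustrative assumptions, not code from any attached patch.
{code:java}
/** Rough sketch of the periodic adjustment loop described in the comment. */
public class OpportunisticLimitAdjuster {
  private final long nodeMemoryBytes; // total memory managed by the NM
  private final long minGapBytes;     // threshold below which we reclaim
  private int failures = 0;
  private static final int MAX_FAILURES = 3; // assumed retry budget

  public OpportunisticLimitAdjuster(long nodeMemoryBytes, long minGapBytes) {
    this.nodeMemoryBytes = nodeMemoryBytes;
    this.minGapBytes = minGapBytes;
  }

  /** Called periodically by the containers monitor. */
  public void adjust(long guaranteedAllocated, long guaranteedUsed) {
    long gap = guaranteedAllocated - guaranteedUsed;
    long base = nodeMemoryBytes - guaranteedAllocated;
    // Lend part of the unused guaranteed memory while the gap is healthy;
    // shrink back toward the base limit once the gap gets too small.
    long opportunisticLimit = (gap > minGapBytes) ? base + gap / 2 : base;
    if (setOpportunisticHardLimit(opportunisticLimit)) {
      failures = 0;
    } else if (++failures >= MAX_FAILURES) {
      killAnOpportunisticContainer(); // protect the guaranteed containers
      failures = 0;
    }
  }

  private boolean setOpportunisticHardLimit(long bytes) {
    // Would write memory.limit_in_bytes of the opportunistic cgroup; the
    // kernel rejects a limit below current usage, hence the failure path.
    return true; // stubbed for the sketch
  }

  private void killAnOpportunisticContainer() {
    // Would ask the NM to preempt the lowest-priority opportunistic container.
  }
}
{code}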
[jira] [Updated] (YARN-7880) FiCaSchedulerApp.commonCheckContainerAllocation throws NPE when running sls
[ https://issues.apache.org/jira/browse/YARN-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated YARN-7880: Description: {code} 18/02/02 20:54:28 INFO rmcontainer.RMContainerImpl: container_1517575125794_5707_01_86 Container Transitioned from ACQUIRED to RUNNING java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.commonCheckContainerAllocation(FiCaSchedulerApp.java:324) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.accept(FiCaSchedulerApp.java:420) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2506) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:541) {code} was: {code} 18/02/02 20:54:28 INFO rmcontainer.RMContainerImpl: container_1517575125794_5707_01_86 Container Transitioned from ACQUIRED to RUNNING java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.commonCheckContainerAllocation(FiCaSchedulerApp.java:324) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.accept(FiCaSchedulerApp.java:420) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2506) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:541) {code} > FiCaSchedulerApp.commonCheckContainerAllocation throws NPE when running sls > --- > > Key: YARN-7880 > URL: https://issues.apache.org/jira/browse/YARN-7880 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jiandan Yang >Priority: Major > > {code} > 18/02/02 20:54:28 INFO rmcontainer.RMContainerImpl: > container_1517575125794_5707_01_86 Container Transitioned from ACQUIRED > to RUNNING > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.commonCheckContainerAllocation(FiCaSchedulerApp.java:324) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.accept(FiCaSchedulerApp.java:420) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2506) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:541) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-7880) FiCaSchedulerApp.commonCheckContainerAllocation throws NPE when running sls
Jiandan Yang created YARN-7880: --- Summary: FiCaSchedulerApp.commonCheckContainerAllocation throws NPE when running sls Key: YARN-7880 URL: https://issues.apache.org/jira/browse/YARN-7880 Project: Hadoop YARN Issue Type: Bug Reporter: Jiandan Yang 18/02/02 20:54:28 INFO rmcontainer.RMContainerImpl: container_1517575125794_5707_01_86 Container Transitioned from ACQUIRED to RUNNING java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.commonCheckContainerAllocation(FiCaSchedulerApp.java:324) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.accept(FiCaSchedulerApp.java:420) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2506) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:541) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7880) FiCaSchedulerApp.commonCheckContainerAllocation throws NPE when running sls
[ https://issues.apache.org/jira/browse/YARN-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated YARN-7880: Description: {code} 18/02/02 20:54:28 INFO rmcontainer.RMContainerImpl: container_1517575125794_5707_01_86 Container Transitioned from ACQUIRED to RUNNING java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.commonCheckContainerAllocation(FiCaSchedulerApp.java:324) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.accept(FiCaSchedulerApp.java:420) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2506) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:541) {code} was: 18/02/02 20:54:28 INFO rmcontainer.RMContainerImpl: container_1517575125794_5707_01_86 Container Transitioned from ACQUIRED to RUNNING java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.commonCheckContainerAllocation(FiCaSchedulerApp.java:324) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.accept(FiCaSchedulerApp.java:420) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2506) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:541) > FiCaSchedulerApp.commonCheckContainerAllocation throws NPE when running sls > --- > > Key: YARN-7880 > URL: https://issues.apache.org/jira/browse/YARN-7880 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jiandan Yang >Priority: Major > > {code} > 18/02/02 20:54:28 INFO rmcontainer.RMContainerImpl: > container_1517575125794_5707_01_86 Container Transitioned from ACQUIRED > to RUNNING > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.commonCheckContainerAllocation(FiCaSchedulerApp.java:324) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.accept(FiCaSchedulerApp.java:420) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2506) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:541) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7880) CapacityScheduler$ResourceCommitterService throws NPE when running sls
[ https://issues.apache.org/jira/browse/YARN-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated YARN-7880: Summary: CapacityScheduler$ResourceCommitterService throws NPE when running sls (was: FiCaSchedulerApp.commonCheckContainerAllocation throws NPE when running sls) > CapacityScheduler$ResourceCommitterService throws NPE when running sls > -- > > Key: YARN-7880 > URL: https://issues.apache.org/jira/browse/YARN-7880 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jiandan Yang >Priority: Major > > {code} > 18/02/02 20:54:28 INFO rmcontainer.RMContainerImpl: > container_1517575125794_5707_01_86 Container Transitioned from ACQUIRED > to RUNNING > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.commonCheckContainerAllocation(FiCaSchedulerApp.java:324) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.accept(FiCaSchedulerApp.java:420) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2506) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:541) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7880) CapacityScheduler$ResourceCommitterService throws NPE when running sls
[ https://issues.apache.org/jira/browse/YARN-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated YARN-7880: Description: sls test case: node count = 9000, job count=10k,task num of job = 500, task run time = 100s {code} 18/02/02 20:54:28 INFO rmcontainer.RMContainerImpl: container_1517575125794_5707_01_86 Container Transitioned from ACQUIRED to RUNNING java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.commonCheckContainerAllocation(FiCaSchedulerApp.java:324) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.accept(FiCaSchedulerApp.java:420) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2506) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:541) {code} was: {code} 18/02/02 20:54:28 INFO rmcontainer.RMContainerImpl: container_1517575125794_5707_01_86 Container Transitioned from ACQUIRED to RUNNING java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.commonCheckContainerAllocation(FiCaSchedulerApp.java:324) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.accept(FiCaSchedulerApp.java:420) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2506) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:541) {code} > CapacityScheduler$ResourceCommitterService throws NPE when running sls > -- > > Key: YARN-7880 > URL: https://issues.apache.org/jira/browse/YARN-7880 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jiandan Yang >Priority: Major > > sls test case: node count = 9000, job count=10k,task num of job = 500, task > run time = 100s > {code} > 18/02/02 20:54:28 INFO rmcontainer.RMContainerImpl: > container_1517575125794_5707_01_86 Container Transitioned from ACQUIRED > to RUNNING > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.commonCheckContainerAllocation(FiCaSchedulerApp.java:324) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.accept(FiCaSchedulerApp.java:420) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2506) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:541) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7880) CapacityScheduler$ResourceCommitterService throws NPE when running sls
[ https://issues.apache.org/jira/browse/YARN-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated YARN-7880: Affects Version/s: 3.0.0 > CapacityScheduler$ResourceCommitterService throws NPE when running sls > -- > > Key: YARN-7880 > URL: https://issues.apache.org/jira/browse/YARN-7880 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.0.0 >Reporter: Jiandan Yang >Priority: Major > > sls test case: node count = 9000, job count=10k,task num of job = 500, task > run time = 100s > {code} > 18/02/02 20:54:28 INFO rmcontainer.RMContainerImpl: > container_1517575125794_5707_01_86 Container Transitioned from ACQUIRED > to RUNNING > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.commonCheckContainerAllocation(FiCaSchedulerApp.java:324) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.accept(FiCaSchedulerApp.java:420) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2506) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:541) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7880) CapacityScheduler$ResourceCommitterService throws NPE when running sls
[ https://issues.apache.org/jira/browse/YARN-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated YARN-7880: Component/s: yarn > CapacityScheduler$ResourceCommitterService throws NPE when running sls > -- > > Key: YARN-7880 > URL: https://issues.apache.org/jira/browse/YARN-7880 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.0.0 >Reporter: Jiandan Yang >Priority: Major > > sls test case: node count = 9000, job count=10k,task num of job = 500, task > run time = 100s > {code} > 18/02/02 20:54:28 INFO rmcontainer.RMContainerImpl: > container_1517575125794_5707_01_86 Container Transitioned from ACQUIRED > to RUNNING > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.commonCheckContainerAllocation(FiCaSchedulerApp.java:324) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.accept(FiCaSchedulerApp.java:420) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2506) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:541) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7880) CapacityScheduler$ResourceCommitterService throws NPE when running sls
[ https://issues.apache.org/jira/browse/YARN-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated YARN-7880: Description: sls test case: node count = 9000, job count=10k,task num of job = 500, task run time = 100s {code} 18/02/02 20:54:28 INFO rmcontainer.RMContainerImpl: container_1517575125794_5707_01_86 Container Transitioned from ACQUIRED to RUNNING java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.commonCheckContainerAllocation(FiCaSchedulerApp.java:324) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.accept(FiCaSchedulerApp.java:420) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2506) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:541) {code} some CapacityScheduler$AsyncScheduleThread also throws NPE {code} java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.getResourceRequests(SchedulerApplicationAttempt.java:1341) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.canAssign(RegularContainerAllocator.java:302) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignOffSwitchContainers(RegularContainerAllocator.java:389) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainersOnNode(RegularContainerAllocator.java:470) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.tryAllocateOnNode(RegularContainerAllocator.java:252) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.allocate(RegularContainerAllocator.java:816) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainers(RegularContainerAllocator.java:854) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.ContainerAllocator.assignContainers(ContainerAllocator.java:54) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.assignContainers(FiCaSchedulerApp.java:856) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:735) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:559) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1343) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1337) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1434) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1199) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:474) at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:501) {code} was: sls test case: node count = 9000, job count=10k,task num of job = 500, task run time = 100s {code} 18/02/02 20:54:28 INFO rmcontainer.RMContainerImpl: container_1517575125794_5707_01_86 Container Transitioned from ACQUIRED to RUNNING java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.commonCheckContainerAllocation(FiCaSchedulerApp.java:324) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.accept(FiCaSchedulerApp.java:420) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2506) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:541) {code} > CapacityScheduler$ResourceCommitterService throws NPE when running sls > -- > > Key: YARN-7880 > URL:
[jira] [Updated] (YARN-7880) CapacityScheduler$ResourceCommitterService throws NPE when running sls
[ https://issues.apache.org/jira/browse/YARN-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated YARN-7880: Description: sls test case: node count = 9000, job count=10k,task num of job = 500, task run time = 100s, but it does not occur when node count = 500 and 2000. {code} 18/02/02 20:54:28 INFO rmcontainer.RMContainerImpl: container_1517575125794_5707_01_86 Container Transitioned from ACQUIRED to RUNNING java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.commonCheckContainerAllocation(FiCaSchedulerApp.java:324) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.accept(FiCaSchedulerApp.java:420) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2506) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:541) {code} some CapacityScheduler$AsyncScheduleThread also throws NPE {code} java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.getResourceRequests(SchedulerApplicationAttempt.java:1341) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.canAssign(RegularContainerAllocator.java:302) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignOffSwitchContainers(RegularContainerAllocator.java:389) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainersOnNode(RegularContainerAllocator.java:470) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.tryAllocateOnNode(RegularContainerAllocator.java:252) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.allocate(RegularContainerAllocator.java:816) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainers(RegularContainerAllocator.java:854) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.ContainerAllocator.assignContainers(ContainerAllocator.java:54) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.assignContainers(FiCaSchedulerApp.java:856) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:735) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:559) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1343) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1337) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1434) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1199) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:474) at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:501) {code} was: sls test case: node count = 9000, job count=10k,task num of job = 500, task run time = 100s {code} 18/02/02 20:54:28 INFO rmcontainer.RMContainerImpl: container_1517575125794_5707_01_86 Container Transitioned from ACQUIRED to RUNNING java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.commonCheckContainerAllocation(FiCaSchedulerApp.java:324) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.accept(FiCaSchedulerApp.java:420) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2506) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:541) {code} some CapacityScheduler$AsyncScheduleThread also throws NPE {code} java.lang.NullPointerException at
[jira] [Updated] (YARN-7880) CapacityScheduler$ResourceCommitterService throws NPE when running sls
[ https://issues.apache.org/jira/browse/YARN-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated YARN-7880: Description: sls test case: node count = 9000, job count=10k,task num of job = 500, task run time = 100s, but it does not occur when node count = 500 and 2000. {code} 18/02/02 20:54:28 INFO rmcontainer.RMContainerImpl: container_1517575125794_5707_01_86 Container Transitioned from ACQUIRED to RUNNING java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.commonCheckContainerAllocation(FiCaSchedulerApp.java:324) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.accept(FiCaSchedulerApp.java:420) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2506) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:541) {code} some CapacityScheduler$AsyncScheduleThread also throws NPE {code} 18/02/02 20:40:34 INFO resourcemanager.DefaultAMSProcessor: AM registration appattempt_1517575125794_4564_01 18/02/02 20:40:34 INFO resourcemanager.RMAuditLogger: USER=default OPERATION=Register App Master TARGET=ApplicationMasterService RESULT=SUCCESS APPID=application_1517575125794_4564 APPATTEMPTID=appattempt_1517575125794_4564_01 Exception in thread "Thread-43" 18/02/02 20:40:34 INFO appmaster.AMSimulator: Register the application master for application application_1517575125794_4564 18/02/02 20:40:34 INFO resourcemanager.MockAMLauncher: Notify AM launcher launched:container_1517575125794_4564_01_01 18/02/02 20:40:34 INFO rmcontainer.RMContainerImpl: container_1517575125794_2703_01_01 Container Transitioned from ACQUIRED to RUNNING 18/02/02 20:40:34 INFO attempt.RMAppAttemptImpl: appattempt_1517575125794_4564_01 State change from ALLOCATED to LAUNCHED on event = LAUNCHED 18/02/02 20:40:34 INFO attempt.RMAppAttemptImpl: appattempt_1517575125794_4564_01 State change from LAUNCHED to RUNNING on event = REGISTERED 18/02/02 20:40:34 INFO rmapp.RMAppImpl: application_1517575125794_4564 State change from ACCEPTED to RUNNING on event = ATTEMPT_REGISTERED java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.getResourceRequests(SchedulerApplicationAttempt.java:1341) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.canAssign(RegularContainerAllocator.java:302) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignOffSwitchContainers(RegularContainerAllocator.java:389) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainersOnNode(RegularContainerAllocator.java:470) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.tryAllocateOnNode(RegularContainerAllocator.java:252) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.allocate(RegularContainerAllocator.java:816) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainers(RegularContainerAllocator.java:854) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.ContainerAllocator.assignContainers(ContainerAllocator.java:54) at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.assignContainers(FiCaSchedulerApp.java:856) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:735) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:559) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1343) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1337) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1434) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1199) at
[jira] [Commented] (YARN-7880) CapacityScheduler$ResourceCommitterService throws NPE when running sls
[ https://issues.apache.org/jira/browse/YARN-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16351990#comment-16351990 ] Jiandan Yang commented on YARN-7880: - duplicated with YARN-7591 > CapacityScheduler$ResourceCommitterService throws NPE when running sls > -- > > Key: YARN-7880 > URL: https://issues.apache.org/jira/browse/YARN-7880 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.0.0 >Reporter: Jiandan Yang >Priority: Major > > sls test case: node count = 9000, job count=10k,task num of job = 500, task > run time = 100s, but it does not occur when node count = 500 and 2000. > {code} > 18/02/02 20:54:28 INFO rmcontainer.RMContainerImpl: > container_1517575125794_5707_01_86 Container Transitioned from ACQUIRED > to RUNNING > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.commonCheckContainerAllocation(FiCaSchedulerApp.java:324) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.accept(FiCaSchedulerApp.java:420) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2506) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:541) > {code} > some CapacityScheduler$AsyncScheduleThread also throws NPE > {code} > 18/02/02 20:40:34 INFO resourcemanager.DefaultAMSProcessor: AM registration > appattempt_1517575125794_4564_01 > 18/02/02 20:40:34 INFO resourcemanager.RMAuditLogger: USER=default > OPERATION=Register App Master TARGET=ApplicationMasterService > RESULT=SUCCESS APPID=application_1517575125794_4564 > APPATTEMPTID=appattempt_1517575125794_4564_01 > Exception in thread "Thread-43" 18/02/02 20:40:34 INFO appmaster.AMSimulator: > Register the application master for application application_1517575125794_4564 > 18/02/02 20:40:34 INFO resourcemanager.MockAMLauncher: Notify AM launcher > launched:container_1517575125794_4564_01_01 > 18/02/02 20:40:34 INFO rmcontainer.RMContainerImpl: > container_1517575125794_2703_01_01 Container Transitioned from ACQUIRED > to RUNNING > 18/02/02 20:40:34 INFO attempt.RMAppAttemptImpl: > appattempt_1517575125794_4564_01 State change from ALLOCATED to LAUNCHED > on event = LAUNCHED > 18/02/02 20:40:34 INFO attempt.RMAppAttemptImpl: > appattempt_1517575125794_4564_01 State change from LAUNCHED to RUNNING on > event = REGISTERED > 18/02/02 20:40:34 INFO rmapp.RMAppImpl: application_1517575125794_4564 State > change from ACCEPTED to RUNNING on event = ATTEMPT_REGISTERED > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.getResourceRequests(SchedulerApplicationAttempt.java:1341) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.canAssign(RegularContainerAllocator.java:302) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignOffSwitchContainers(RegularContainerAllocator.java:389) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainersOnNode(RegularContainerAllocator.java:470) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.tryAllocateOnNode(RegularContainerAllocator.java:252) > at > 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.allocate(RegularContainerAllocator.java:816) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainers(RegularContainerAllocator.java:854) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.ContainerAllocator.assignContainers(ContainerAllocator.java:54) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.assignContainers(FiCaSchedulerApp.java:856) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:735) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:559) > at >
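The comment above flags this issue as a duplicate of YARN-7591. The sketch below only illustrates the general shape of the race behind such NPEs in async-scheduling mode, assuming the proposal is built against a node map by one thread while another thread removes the node; it is a generic illustration, not the actual YARN-7591 fix.
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Generic illustration of an allocation commit racing with node removal. */
public class CommitRaceSketch {
  private final Map<String, Object> nodes = new ConcurrentHashMap<>();

  boolean tryCommit(String nodeId) {
    Object node = nodes.get(nodeId);
    if (node == null) {
      // The node vanished between proposal creation (AsyncScheduleThread)
      // and commit (ResourceCommitterService): reject the proposal instead
      // of dereferencing a stale reference and throwing an NPE.
      return false;
    }
    // ... apply the allocation against the node ...
    return true;
  }
}
{code}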
[jira] [Resolved] (YARN-7880) CapacityScheduler$ResourceCommitterService throws NPE when running sls
[ https://issues.apache.org/jira/browse/YARN-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang resolved YARN-7880. - Resolution: Duplicate Assignee: Jiandan Yang Fix Version/s: 3.0.0 > CapacityScheduler$ResourceCommitterService throws NPE when running sls > -- > > Key: YARN-7880 > URL: https://issues.apache.org/jira/browse/YARN-7880 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.0.0 >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Fix For: 3.0.0 > > > sls test case: node count = 9000, job count=10k,task num of job = 500, task > run time = 100s, but it does not occur when node count = 500 and 2000. > {code} > 18/02/02 20:54:28 INFO rmcontainer.RMContainerImpl: > container_1517575125794_5707_01_86 Container Transitioned from ACQUIRED > to RUNNING > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.commonCheckContainerAllocation(FiCaSchedulerApp.java:324) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.accept(FiCaSchedulerApp.java:420) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2506) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:541) > {code} > some CapacityScheduler$AsyncScheduleThread also throws NPE > {code} > 18/02/02 20:40:34 INFO resourcemanager.DefaultAMSProcessor: AM registration > appattempt_1517575125794_4564_01 > 18/02/02 20:40:34 INFO resourcemanager.RMAuditLogger: USER=default > OPERATION=Register App Master TARGET=ApplicationMasterService > RESULT=SUCCESS APPID=application_1517575125794_4564 > APPATTEMPTID=appattempt_1517575125794_4564_01 > Exception in thread "Thread-43" 18/02/02 20:40:34 INFO appmaster.AMSimulator: > Register the application master for application application_1517575125794_4564 > 18/02/02 20:40:34 INFO resourcemanager.MockAMLauncher: Notify AM launcher > launched:container_1517575125794_4564_01_01 > 18/02/02 20:40:34 INFO rmcontainer.RMContainerImpl: > container_1517575125794_2703_01_01 Container Transitioned from ACQUIRED > to RUNNING > 18/02/02 20:40:34 INFO attempt.RMAppAttemptImpl: > appattempt_1517575125794_4564_01 State change from ALLOCATED to LAUNCHED > on event = LAUNCHED > 18/02/02 20:40:34 INFO attempt.RMAppAttemptImpl: > appattempt_1517575125794_4564_01 State change from LAUNCHED to RUNNING on > event = REGISTERED > 18/02/02 20:40:34 INFO rmapp.RMAppImpl: application_1517575125794_4564 State > change from ACCEPTED to RUNNING on event = ATTEMPT_REGISTERED > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.getResourceRequests(SchedulerApplicationAttempt.java:1341) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.canAssign(RegularContainerAllocator.java:302) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignOffSwitchContainers(RegularContainerAllocator.java:389) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainersOnNode(RegularContainerAllocator.java:470) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.tryAllocateOnNode(RegularContainerAllocator.java:252) 
> at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.allocate(RegularContainerAllocator.java:816) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainers(RegularContainerAllocator.java:854) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.ContainerAllocator.assignContainers(ContainerAllocator.java:54) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.assignContainers(FiCaSchedulerApp.java:856) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:735) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:559) > at >
[jira] [Created] (YARN-7929) SLS supports setting container execution
Jiandan Yang created YARN-7929: --- Summary: SLS supports setting container execution Key: YARN-7929 URL: https://issues.apache.org/jira/browse/YARN-7929 Project: Hadoop YARN Issue Type: New Feature Components: scheduler-load-simulator Reporter: Jiandan Yang Assignee: Jiandan Yang SLS currently supports three trace types, SYNTH, SLS and RUMEN, but the trace file cannot set the execution type of a container. This jira will introduce execution type in SLS to enable better simulation; see the trace snippet after this message. RUMEN has the default execution type GUARANTEED. SYNTH sets the execution type via the fields map_execution_type and reduce_execution_type. SLS sets the execution type via the field container.execution_type. For compatibility, GUARANTEED is used as the default value when the above fields are not set in the trace file. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
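For illustration, a trace entry of trace type SLS with the new field might look like the snippet below. The surrounding field names follow the documented SLS trace format; only container.execution_type is the field added by this jira, and all values here are made up.
{code}
{
  "am.type" : "mapreduce",
  "job.start.ms" : 0,
  "job.end.ms" : 95375,
  "job.queue.name" : "sls_queue_1",
  "job.id" : "job_1",
  "job.user" : "default",
  "job.tasks" : [ {
    "container.host" : "/default-rack/node1",
    "container.start.ms" : 6664,
    "container.end.ms" : 23707,
    "container.priority" : 20,
    "container.type" : "map",
    "container.execution_type" : "OPPORTUNISTIC"
  } ]
}
{code}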
[jira] [Updated] (YARN-7929) SLS supports setting container execution
[ https://issues.apache.org/jira/browse/YARN-7929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated YARN-7929: Issue Type: Sub-task (was: New Feature) Parent: YARN-5065 > SLS supports setting container execution > > > Key: YARN-7929 > URL: https://issues.apache.org/jira/browse/YARN-7929 > Project: Hadoop YARN > Issue Type: Sub-task > Components: scheduler-load-simulator >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Minor > > SLS currently supports three trace types, SYNTH, SLS and RUMEN, but the > trace file cannot set the execution type of a container. > This jira will introduce execution type in SLS to enable better simulation. > RUMEN has the default execution type GUARANTEED. > SYNTH sets the execution type via the fields map_execution_type and > reduce_execution_type. > SLS sets the execution type via the field container.execution_type. > For compatibility, GUARANTEED is used as the default value when the above > fields are not set in the trace file. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7929) SLS supports setting container execution
[ https://issues.apache.org/jira/browse/YARN-7929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated YARN-7929: Attachment: YARN-7929.001.patch > SLS supports setting container execution > > > Key: YARN-7929 > URL: https://issues.apache.org/jira/browse/YARN-7929 > Project: Hadoop YARN > Issue Type: Sub-task > Components: scheduler-load-simulator >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Minor > Attachments: YARN-7929.001.patch > > > SLS currently support three tracetype, SYNTH, SLS and RUMEN, but trace file > can not set execution type of container. > This jira will introduce execution type in SLS to help better simulation. > RUMEN has default execution type GUARANTEED > SYNTH set execution type by field map_execution_type and reduce_execution_type > SLS set execution type by field container.execution_type > For compatibility set GUARANTEED as default value when not setting above > fields in trace file -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-7929) SLS supports setting container execution
[ https://issues.apache.org/jira/browse/YARN-7929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16374021#comment-16374021 ] Jiandan Yang edited comment on YARN-7929 at 2/24/18 1:50 AM: -- Hi [~youchen], thanks for your attention. I did encounter a merge failure when I pulled the latest code in my local development environment; I will upload a new patch based on the latest code. Adding a "water level" to the NMSimulator simulates actual resource utilization; scheduling OPPORTUNISTIC containers through the central RM needs actual node utilization, according to the design doc in YARN-1011. was (Author: yangjiandan): Hi [~yochen], thanks for your attention. I did encounter the issue of merging failed when I pull latest code in my local develop environment. I will upload a new patch based on latest code. "water level" to the NMSimulator simulates actual resource utilization, the scheduling of OPPORTUNISTIC containers through the central RM need actual node utilization according to design doc in YARN-1011. > SLS supports setting container execution > > > Key: YARN-7929 > URL: https://issues.apache.org/jira/browse/YARN-7929 > Project: Hadoop YARN > Issue Type: Sub-task > Components: scheduler-load-simulator >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Attachments: YARN-7929.001.patch, YARN-7929.002.patch > > > SLS currently support three tracetype, SYNTH, SLS and RUMEN, but trace file > can not set execution type of container. > This jira will introduce execution type in SLS to help better simulation. > This will help the perf testing with regarding to the Opportunistic > Containers. > RUMEN has default execution type GUARANTEED > SYNTH set execution type by field map_execution_type and > reduce_execution_type > SLS set execution type by field container.execution_type > For compatibility set GUARANTEED as default value when not setting above > fields in trace file -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-7929) SLS supports setting container execution
[ https://issues.apache.org/jira/browse/YARN-7929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16374021#comment-16374021 ] Jiandan Yang edited comment on YARN-7929 at 2/23/18 7:09 AM: -- Hi [~yochen], thanks for your attention. I did encounter the issue of merging failed when I pull latest code in my local develop environment. I will upload a new patch based on latest code. "water level" to the NMSimulator simulates actual resource utilization, the scheduling of OPPORTUNISTIC containers through the central RM need actual node utilization according to design doc in YARN-1011. was (Author: yangjiandan): Hi [~yochen], thanks for your attention. I did encounter the issue of merging failed when I pull latest code in my local develop environment. I will upload a new patch based latest code. "water level" to the NMSimulator simulates actual resource utilization, the scheduling of OPPORTUNISTIC containers through the central RM need actual node utilization according to design doc in YARN-1011. > SLS supports setting container execution > > > Key: YARN-7929 > URL: https://issues.apache.org/jira/browse/YARN-7929 > Project: Hadoop YARN > Issue Type: Sub-task > Components: scheduler-load-simulator >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Attachments: YARN-7929.001.patch > > > SLS currently support three tracetype, SYNTH, SLS and RUMEN, but trace file > can not set execution type of container. > This jira will introduce execution type in SLS to help better simulation. > This will help the perf testing with regarding to the Opportunistic > Containers. > RUMEN has default execution type GUARANTEED > SYNTH set execution type by field map_execution_type and > reduce_execution_type > SLS set execution type by field container.execution_type > For compatibility set GUARANTEED as default value when not setting above > fields in trace file -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7929) SLS supports setting container execution
[ https://issues.apache.org/jira/browse/YARN-7929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16374021#comment-16374021 ] Jiandan Yang commented on YARN-7929: - Hi [~yochen], thanks for your attention. I did encounter the issue of merging failed when I pull latest code in my local develop environment. I will upload a new patch based latest code. "water level" to the NMSimulator simulates actual resource utilization, the scheduling of OPPORTUNISTIC containers through the central RM need actual node utilization according to design doc in YARN-1011. > SLS supports setting container execution > > > Key: YARN-7929 > URL: https://issues.apache.org/jira/browse/YARN-7929 > Project: Hadoop YARN > Issue Type: Sub-task > Components: scheduler-load-simulator >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Attachments: YARN-7929.001.patch > > > SLS currently support three tracetype, SYNTH, SLS and RUMEN, but trace file > can not set execution type of container. > This jira will introduce execution type in SLS to help better simulation. > This will help the perf testing with regarding to the Opportunistic > Containers. > RUMEN has default execution type GUARANTEED > SYNTH set execution type by field map_execution_type and > reduce_execution_type > SLS set execution type by field container.execution_type > For compatibility set GUARANTEED as default value when not setting above > fields in trace file -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7929) SLS supports setting container execution
[ https://issues.apache.org/jira/browse/YARN-7929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated YARN-7929: Attachment: YARN-7929.002.patch > SLS supports setting container execution > > > Key: YARN-7929 > URL: https://issues.apache.org/jira/browse/YARN-7929 > Project: Hadoop YARN > Issue Type: Sub-task > Components: scheduler-load-simulator >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Attachments: YARN-7929.001.patch, YARN-7929.002.patch > > > SLS currently support three tracetype, SYNTH, SLS and RUMEN, but trace file > can not set execution type of container. > This jira will introduce execution type in SLS to help better simulation. > This will help the perf testing with regarding to the Opportunistic > Containers. > RUMEN has default execution type GUARANTEED > SYNTH set execution type by field map_execution_type and > reduce_execution_type > SLS set execution type by field container.execution_type > For compatibility set GUARANTEED as default value when not setting above > fields in trace file -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-7970) Compatibility issue: RpcNoSuchMethodException thrown when running a MapReduce job
Jiandan Yang created YARN-7970: --- Summary: Compatibility issue: RpcNoSuchMethodException thrown when running a MapReduce job Key: YARN-7970 URL: https://issues.apache.org/jira/browse/YARN-7970 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 3.0.0 Reporter: Jiandan Yang Running teragen fails with a hadoop-3.1 client when the HDFS server is 2.8. The failure occurs because 2.8 HDFS does not implement setErasureCodingPolicy. The detailed exception trace is: {code:java} 2018-02-26 11:22:53,178 INFO mapreduce.JobSubmitter: Cleaning up the staging area /tmp/hadoop-yarn/staging/hadoop/.staging/job_1518615699369_0006 org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RpcNoSuchMethodException): Unknown method setErasureCodingPolicy called on org.apache.hadoop.hdfs.protocol.ClientProtocol protocol. at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:436) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:846) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:789) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1804) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2457) at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1491) at org.apache.hadoop.ipc.Client.call(Client.java:1437) at org.apache.hadoop.ipc.Client.call(Client.java:1347) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116) at com.sun.proxy.$Proxy11.setErasureCodingPolicy(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.setErasureCodingPolicy(ClientNamenodeProtocolTranslatorPB.java:1583) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359) at com.sun.proxy.$Proxy12.setErasureCodingPolicy(Unknown Source) at org.apache.hadoop.hdfs.DFSClient.setErasureCodingPolicy(DFSClient.java:2678) at org.apache.hadoop.hdfs.DistributedFileSystem$63.doCall(DistributedFileSystem.java:2665) at org.apache.hadoop.hdfs.DistributedFileSystem$63.doCall(DistributedFileSystem.java:2662) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.setErasureCodingPolicy(DistributedFileSystem.java:2680) at org.apache.hadoop.mapreduce.JobResourceUploader.disableErasureCodingForPath(JobResourceUploader.java:882) at org.apache.hadoop.mapreduce.JobResourceUploader.uploadResourcesInternal(JobResourceUploader.java:174) at org.apache.hadoop.mapreduce.JobResourceUploader.uploadResources(JobResourceUploader.java:131) at
org.apache.hadoop.mapreduce.JobSubmitter.copyAndConfigureFiles(JobSubmitter.java:102) at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:197) at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1570) at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1567) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1965) at org.apache.hadoop.mapreduce.Job.submit(Job.java:1567) at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1588) at org.apache.hadoop.examples.terasort.TeraGen.run(TeraGen.java:304) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) at org.apache.hadoop.examples.terasort.TeraGen.main(TeraGen.java:308) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
[jira] [Updated] (YARN-7970) Compatibility issue: RpcNoSuchMethodException thrown when running a MapReduce job
[ https://issues.apache.org/jira/browse/YARN-7970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated YARN-7970: Description: Running teragen failed in the version of hadoop-3.1, and hdfs server is 2.8. The reason of failing is 2.8 HDFS does not have setErasureCodingPolicy. The detailed exception trace is: {code:java} 2018-02-26 11:22:53,178 INFO mapreduce.JobSubmitter: Cleaning up the staging area /tmp/hadoop-yarn/staging/hadoop/.staging/job_1518615699369_0006 org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RpcNoSuchMethodException): Unknown method setErasureCodingPolicy called on org.apache.hadoop.hdfs.protocol.ClientProtocol protocol. at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:436) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:846) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:789) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1804) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2457) at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1491) at org.apache.hadoop.ipc.Client.call(Client.java:1437) at org.apache.hadoop.ipc.Client.call(Client.java:1347) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116) at com.sun.proxy.$Proxy11.setErasureCodingPolicy(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.setErasureCodingPolicy(ClientNamenodeProtocolTranslatorPB.java:1583) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359) at com.sun.proxy.$Proxy12.setErasureCodingPolicy(Unknown Source) at org.apache.hadoop.hdfs.DFSClient.setErasureCodingPolicy(DFSClient.java:2678) at org.apache.hadoop.hdfs.DistributedFileSystem$63.doCall(DistributedFileSystem.java:2665) at org.apache.hadoop.hdfs.DistributedFileSystem$63.doCall(DistributedFileSystem.java:2662) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.setErasureCodingPolicy(DistributedFileSystem.java:2680) at org.apache.hadoop.mapreduce.JobResourceUploader.disableErasureCodingForPath(JobResourceUploader.java:882) at org.apache.hadoop.mapreduce.JobResourceUploader.uploadResourcesInternal(JobResourceUploader.java:174) at org.apache.hadoop.mapreduce.JobResourceUploader.uploadResources(JobResourceUploader.java:131) at org.apache.hadoop.mapreduce.JobSubmitter.copyAndConfigureFiles(JobSubmitter.java:102) at 
org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:197) at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1570) at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1567) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1965) at org.apache.hadoop.mapreduce.Job.submit(Job.java:1567) at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1588) at org.apache.hadoop.examples.terasort.TeraGen.run(TeraGen.java:304) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) at org.apache.hadoop.examples.terasort.TeraGen.main(TeraGen.java:308) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at
[jira] [Updated] (YARN-7970) Compatibility issue: RpcNoSuchMethodException thrown when running a MapReduce job
[ https://issues.apache.org/jira/browse/YARN-7970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated YARN-7970: Description: Running teragen failed in the version of hadoop-3.1, and hdfs server is 2.8. The reason of failing is 2.8 HDFS does not have setErasureCodingPolicy. The detailed exception trace is: 2018-02-26 11:22:53,178 INFO mapreduce.JobSubmitter: Cleaning up the staging area /tmp/hadoop-yarn/staging/hadoop/.staging/job_1518615699369_0006 org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RpcNoSuchMethodException): Unknown method setErasureCodingPolicy called on org.apache.hadoop.hdfs.protocol.ClientProtocol protocol. at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:436) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:846) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:789) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1804) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2457) at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1491) at org.apache.hadoop.ipc.Client.call(Client.java:1437) at org.apache.hadoop.ipc.Client.call(Client.java:1347) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116) at com.sun.proxy.$Proxy11.setErasureCodingPolicy(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.setErasureCodingPolicy(ClientNamenodeProtocolTranslatorPB.java:1583) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422) at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165) at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157) at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359) at com.sun.proxy.$Proxy12.setErasureCodingPolicy(Unknown Source) at org.apache.hadoop.hdfs.DFSClient.setErasureCodingPolicy(DFSClient.java:2678) at org.apache.hadoop.hdfs.DistributedFileSystem$63.doCall(DistributedFileSystem.java:2665) at org.apache.hadoop.hdfs.DistributedFileSystem$63.doCall(DistributedFileSystem.java:2662) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.setErasureCodingPolicy(DistributedFileSystem.java:2680) at org.apache.hadoop.mapreduce.JobResourceUploader.disableErasureCodingForPath(JobResourceUploader.java:882) at org.apache.hadoop.mapreduce.JobResourceUploader.uploadResourcesInternal(JobResourceUploader.java:174) at org.apache.hadoop.mapreduce.JobResourceUploader.uploadResources(JobResourceUploader.java:131) at org.apache.hadoop.mapreduce.JobSubmitter.copyAndConfigureFiles(JobSubmitter.java:102) at 
org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:197) at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1570) at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1567) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1965) at org.apache.hadoop.mapreduce.Job.submit(Job.java:1567) at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1588) at org.apache.hadoop.examples.terasort.TeraGen.run(TeraGen.java:304) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) at org.apache.hadoop.examples.terasort.TeraGen.main(TeraGen.java:308) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at
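Sketched below is one way a client could tolerate the old NameNode described above. It mirrors the JobResourceUploader#disableErasureCodingForPath frame from the trace, but it is a hedged illustration, not the committed fix; the guard class and method are hypothetical, while RemoteException#getClassName, RpcNoSuchMethodException and SystemErasureCodingPolicies are real Hadoop APIs.

{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.SystemErasureCodingPolicies;
import org.apache.hadoop.ipc.RemoteException;
import org.apache.hadoop.ipc.RpcNoSuchMethodException;

// Hypothetical guard: skip erasure-coding setup when the NameNode predates
// the setErasureCodingPolicy RPC (e.g. a 2.8 server, as in this report).
public final class ErasureCodingCompat {
  public static void disableErasureCodingIfSupported(
      DistributedFileSystem dfs, Path stagingDir) throws IOException {
    try {
      // Force plain replication for the staging dir, as the 3.x uploader does.
      dfs.setErasureCodingPolicy(stagingDir,
          SystemErasureCodingPolicies.getReplicationPolicy().getName());
    } catch (RemoteException e) {
      if (!RpcNoSuchMethodException.class.getName().equals(e.getClassName())) {
        throw e; // a different server-side failure: propagate it
      }
      // Pre-3.x NameNode: no erasure coding support, so nothing to disable.
    }
  }
}
{code}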
[jira] [Updated] (YARN-7693) Make ContainersMonitor configurable
[ https://issues.apache.org/jira/browse/YARN-7693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated YARN-7693: Attachment: YARN-7693.001.patch > ContainersMonitor support configurable > -- > > Key: YARN-7693 > URL: https://issues.apache.org/jira/browse/YARN-7693 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Minor > Attachments: YARN-7693.001.patch > > > Currently ContainersMonitor has only one default implementation > ContainersMonitorImpl, > After introducing Opportunistic Container, ContainersMonitor needs to monitor > system metrics and even dynamically adjust Opportunistic and Guaranteed > resources in the cgroup, so another ContainersMonitor may need to be > implemented. > The current ContainerManagerImpl ContainersMonitorImpl direct new > ContainerManagerImpl, so ContainersMonitor need to be configurable. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7693) Make ContainersMonitor configurable
[ https://issues.apache.org/jira/browse/YARN-7693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated YARN-7693: Attachment: YARN-7693.002.patch fix TestYarnConfigurationFields error > ContainersMonitor support configurable > -- > > Key: YARN-7693 > URL: https://issues.apache.org/jira/browse/YARN-7693 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Minor > Attachments: YARN-7693.001.patch, YARN-7693.002.patch > > > Currently ContainersMonitor has only one default implementation > ContainersMonitorImpl, > After introducing Opportunistic Container, ContainersMonitor needs to monitor > system metrics and even dynamically adjust Opportunistic and Guaranteed > resources in the cgroup, so another ContainersMonitor may need to be > implemented. > The current ContainerManagerImpl ContainersMonitorImpl direct new > ContainerManagerImpl, so ContainersMonitor need to be configurable. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7693) Make ContainersMonitor configurable
[ https://issues.apache.org/jira/browse/YARN-7693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16309129#comment-16309129 ] Jiandan Yang commented on YARN-7693: - [~miklos.szeg...@cloudera.com] Thanks for your attention. This JIRA does not conflict with YARN-7064. I filed this JIRA because the current ContainersMonitorImpl has some problems: 1. An online service may crash due to high system resource utilization. ContainersMonitorImpl only checks the pmem and vmem of every container and does not check overall system utilization. This can impact an online service when offline tasks and online services run on YARN at the same time. For example, even if no single container's memory exceeds its limit, the system's total memory utilization may reach 100% because of oversubscription, and the RM's decision to kill a container may not be timely enough, so the online service is affected. 2. Directly killing an Opportunistic container is too drastic; dynamically adjusting Opportunistic container resources may be a better choice. So I propose to: 1) Separate containers into two different groups, Opportunistic_Group and Guaranteed_Group, under *hadoop-yarn* 2) Monitor system resource utilization and dynamically adjust the resources of Opportunistic_Group 3) Kill a container only when resource adjustment has failed a given number of times (a sketch of such a loop follows below) > ContainersMonitor support configurable > -- > > Key: YARN-7693 > URL: https://issues.apache.org/jira/browse/YARN-7693 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Minor > Attachments: YARN-7693.001.patch, YARN-7693.002.patch > > > Currently ContainersMonitor has only one default implementation > ContainersMonitorImpl, > After introducing Opportunistic Container, ContainersMonitor needs to monitor > system metrics and even dynamically adjust Opportunistic and Guaranteed > resources in the cgroup, so another ContainersMonitor may need to be > implemented. > The current ContainerManagerImpl ContainersMonitorImpl direct new > ContainerManagerImpl, so ContainersMonitor need to be configurable. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
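A self-contained sketch of the loop proposed in points 2) and 3) above; every name, threshold and interval here is an assumption for illustration, and the probe and actions are left as stubs rather than a real cgroup implementation:

{code:java}
// Illustrative pressure loop: shrink the (hypothetical) Opportunistic_Group
// cgroup first, and kill an opportunistic container only after the
// adjustment has failed a configured number of times.
public class OpportunisticPressureLoop implements Runnable {
  private static final float HIGH_WATERMARK = 0.90f;      // assumed threshold
  private static final int MAX_ADJUST_FAILURES = 3;       // assumed retry budget
  private static final long MONITOR_INTERVAL_MS = 3000L;  // assumed interval
  private volatile boolean running = true;

  @Override
  public void run() {
    int failures = 0;
    while (running) {
      if (systemMemoryUtilization() > HIGH_WATERMARK) {
        if (shrinkOpportunisticQuota()) {
          failures = 0; // adjustment succeeded, no kill needed
        } else if (++failures >= MAX_ADJUST_FAILURES) {
          killOneOpportunisticContainer(); // last resort, per point 3)
          failures = 0;
        }
      }
      try {
        Thread.sleep(MONITOR_INTERVAL_MS);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
    }
  }

  public void stop() { running = false; }

  // Stubs: a real monitor would read /proc/meminfo and write cgroup limits.
  private float systemMemoryUtilization() { return 0f; }
  private boolean shrinkOpportunisticQuota() { return true; }
  private void killOneOpportunisticContainer() { }
}
{code}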
[jira] [Updated] (YARN-7693) Make ContainersMonitor configurable
[ https://issues.apache.org/jira/browse/YARN-7693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated YARN-7693: Description: Currently ContainersMonitor has only one default implementation ContainersMonitorImpl, After introducing Opportunistic Container, ContainersMonitor needs to monitor system metrics and even dynamically adjust Opportunistic and Guaranteed resources in the cgroup, so another ContainersMonitor may need to be implemented. The current ContainerManagerImpl ContainersMonitorImpl direct new ContainerManagerImpl, so ContainersMonitor need to be configurable. was: Currently ContainersMonitor has only one default implementation ContainersMonitorImpl, After introducing Opportunistic Container, ContainersMonitor needs to monitor system metrics and even dynamically adjust Opportunistic and Guaranteed resources in the cgroup, so another ContainersMonitor may need to be implemented. The current ContainerManagerImpl ContainersMonitorImpl direct new ContainerManagerImpl, so ContainersMonitor need to be configurable. > ContainersMonitor support configurable > -- > > Key: YARN-7693 > URL: https://issues.apache.org/jira/browse/YARN-7693 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Minor > > Currently ContainersMonitor has only one default implementation > ContainersMonitorImpl, > After introducing Opportunistic Container, ContainersMonitor needs to monitor > system metrics and even dynamically adjust Opportunistic and Guaranteed > resources in the cgroup, so another ContainersMonitor may need to be > implemented. > The current ContainerManagerImpl ContainersMonitorImpl direct new > ContainerManagerImpl, so ContainersMonitor need to be configurable. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-7693) Make ContainersMonitor configurable
Jiandan Yang created YARN-7693: --- Summary: Make ContainersMonitor configurable Key: YARN-7693 URL: https://issues.apache.org/jira/browse/YARN-7693 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager Reporter: Jiandan Yang Assignee: Jiandan Yang Priority: Minor Currently ContainersMonitor has only one default implementation, ContainersMonitorImpl. After introducing Opportunistic Containers, a ContainersMonitor needs to monitor system metrics and even dynamically adjust Opportunistic and Guaranteed resources in the cgroup, so another ContainersMonitor implementation may be needed. ContainerManagerImpl currently instantiates ContainersMonitorImpl directly with new, so the ContainersMonitor implementation needs to be configurable. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
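What "configurable" usually looks like in Hadoop is sketched below: resolve the implementation class from configuration instead of hard-coding new ContainersMonitorImpl(...). The configuration key is a hypothetical name chosen for illustration, and the assumed constructor shape is ContainersMonitorImpl's existing (exec, dispatcher, context) signature.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.event.AsyncDispatcher;
import org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor;
import org.apache.hadoop.yarn.server.nodemanager.Context;
import org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitor;
import org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl;

// Standard Hadoop plugin pattern; the config key below is hypothetical.
public final class ContainersMonitorFactory {
  public static ContainersMonitor create(Configuration conf,
      ContainerExecutor exec, AsyncDispatcher dispatcher, Context context)
      throws ReflectiveOperationException {
    Class<? extends ContainersMonitor> clazz = conf.getClass(
        "yarn.nodemanager.containers-monitor.class", // assumed key name
        ContainersMonitorImpl.class, ContainersMonitor.class);
    // Instantiate reflectively so an alternative implementation with the
    // same constructor shape can be dropped in via configuration.
    return clazz.getConstructor(ContainerExecutor.class,
        AsyncDispatcher.class, Context.class)
        .newInstance(exec, dispatcher, context);
  }
}
{code}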
[jira] [Updated] (YARN-7929) SLS supports setting container execution
[ https://issues.apache.org/jira/browse/YARN-7929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated YARN-7929: Attachment: YARN-7929.004.patch > SLS supports setting container execution > > > Key: YARN-7929 > URL: https://issues.apache.org/jira/browse/YARN-7929 > Project: Hadoop YARN > Issue Type: Sub-task > Components: scheduler-load-simulator >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Attachments: YARN-7929.001.patch, YARN-7929.002.patch, > YARN-7929.003.patch, YARN-7929.004.patch > > > SLS currently support three tracetype, SYNTH, SLS and RUMEN, but trace file > can not set execution type of container. > This jira will introduce execution type in SLS to help better simulation. > This will help the perf testing with regarding to the Opportunistic > Containers. > RUMEN has default execution type GUARANTEED > SYNTH set execution type by field map_execution_type and > reduce_execution_type > SLS set execution type by field container.execution_type > For compatibility set GUARANTEED as default value when not setting above > fields in trace file -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7929) SLS supports setting container execution
[ https://issues.apache.org/jira/browse/YARN-7929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16378110#comment-16378110 ] Jiandan Yang commented on YARN-7929: - fix checkstyle issues and upload YARN-7929.004.patch > SLS supports setting container execution > > > Key: YARN-7929 > URL: https://issues.apache.org/jira/browse/YARN-7929 > Project: Hadoop YARN > Issue Type: Sub-task > Components: scheduler-load-simulator >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Attachments: YARN-7929.001.patch, YARN-7929.002.patch, > YARN-7929.003.patch, YARN-7929.004.patch > > > SLS currently support three tracetype, SYNTH, SLS and RUMEN, but trace file > can not set execution type of container. > This jira will introduce execution type in SLS to help better simulation. > This will help the perf testing with regarding to the Opportunistic > Containers. > RUMEN has default execution type GUARANTEED > SYNTH set execution type by field map_execution_type and > reduce_execution_type > SLS set execution type by field container.execution_type > For compatibility set GUARANTEED as default value when not setting above > fields in trace file -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7929) SLS supports setting container execution
[ https://issues.apache.org/jira/browse/YARN-7929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16378179#comment-16378179 ] Jiandan Yang commented on YARN-7929: - fix checkstyle HiddenField and upload 005.patch > SLS supports setting container execution > > > Key: YARN-7929 > URL: https://issues.apache.org/jira/browse/YARN-7929 > Project: Hadoop YARN > Issue Type: Sub-task > Components: scheduler-load-simulator >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Attachments: YARN-7929.001.patch, YARN-7929.002.patch, > YARN-7929.003.patch, YARN-7929.004.patch, YARN-7929.005.patch > > > SLS currently support three tracetype, SYNTH, SLS and RUMEN, but trace file > can not set execution type of container. > This jira will introduce execution type in SLS to help better simulation. > This will help the perf testing with regarding to the Opportunistic > Containers. > RUMEN has default execution type GUARANTEED > SYNTH set execution type by field map_execution_type and > reduce_execution_type > SLS set execution type by field container.execution_type > For compatibility set GUARANTEED as default value when not setting above > fields in trace file -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7929) SLS supports setting container execution
[ https://issues.apache.org/jira/browse/YARN-7929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated YARN-7929: Attachment: YARN-7929.005.patch > SLS supports setting container execution > > > Key: YARN-7929 > URL: https://issues.apache.org/jira/browse/YARN-7929 > Project: Hadoop YARN > Issue Type: Sub-task > Components: scheduler-load-simulator >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Attachments: YARN-7929.001.patch, YARN-7929.002.patch, > YARN-7929.003.patch, YARN-7929.004.patch, YARN-7929.005.patch > > > SLS currently support three tracetype, SYNTH, SLS and RUMEN, but trace file > can not set execution type of container. > This jira will introduce execution type in SLS to help better simulation. > This will help the perf testing with regarding to the Opportunistic > Containers. > RUMEN has default execution type GUARANTEED > SYNTH set execution type by field map_execution_type and > reduce_execution_type > SLS set execution type by field container.execution_type > For compatibility set GUARANTEED as default value when not setting above > fields in trace file -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8664) ApplicationMasterProtocolPBServiceImpl#allocate throws NPE when an NM is lost
[ https://issues.apache.org/jira/browse/YARN-8664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated YARN-8664: Description: The ResourceManager log of the exception is: {code:java} 2018-08-09 00:52:30,746 WARN [IPC Server handler 5 on 8030] org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8030, call Call#305638 Retry#0 org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB.allocate from 11.13.73.101:51083 java.lang.NullPointerException at org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto.isInitialized(YarnProtos.java:6402) at org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto$Builder.build(YarnProtos.java:6642) at org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.mergeLocalToProto(ResourcePBImpl.java:254) at org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.getProto(ResourcePBImpl.java:61) at org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.convertToProtoFormat(NodeReportPBImpl.java:313) at org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToBuilder(NodeReportPBImpl.java:264) at org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToProto(NodeReportPBImpl.java:287) at org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.getProto(NodeReportPBImpl.java:224) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.convertToProtoFormat(AllocateResponsePBImpl.java:714) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.access$400(AllocateResponsePBImpl.java:69) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:680) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:669) at com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336) at com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323) at org.apache.hadoop.yarn.proto.YarnServiceProtos$AllocateResponseProto$Builder.addAllUpdatedNodes(YarnServiceProtos.java:12846) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToBuilder(AllocateResponsePBImpl.java:145) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToProto(AllocateResponsePBImpl.java:176) at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.getProto(AllocateResponsePBImpl.java:97) at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:61) at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:846) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:789) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1804) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2457) {code} ApplicationMasterService#allocate calls AllocateResponse#setUpdatedNodes when an NM is lost, and AllocateResponse#getProto then calls ResourcePBImpl#getProto to transform NodeReportPBImpl#capacity into PB format.
Because ResourcePBImpl is not thread safe and multiple AMs call allocate at the same time, ResourcePBImpl#getProto may throw a NullPointerException or an UnsupportedOperationException. I wrote test code that reproduces the exception. {code:java}
@Test
public void testResource1() throws InterruptedException {
  ResourcePBImpl resource = (ResourcePBImpl) Resource.newInstance(1, 1);
  for (int i = 0; i < 10; i++) {
    Thread thread = new PBThread(resource);
    thread.setName("t" + i);
    thread.start();
  }
  Thread.sleep(1);
}

class PBThread extends Thread {
  ResourcePBImpl resourcePB;

  public PBThread(ResourcePBImpl resourcePB) {
    this.resourcePB = resourcePB;
  }

  @Override
  public void run() {
    while (true) {
      this.resourcePB.getProto();
    }
  }
}
{code} was: ResourceManager logs about exception is: {code:java} 2018-08-09 00:52:30,746 WARN [IPC Server handler 5 on 8030] org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8030, call Call#305638 Retry#0 org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB.allocate from 11.13.73.101:51083 java.lang.NullPointerException
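A hedged sketch of one way to remove the race described above (an illustration, not necessarily the patch attached here): hand each AM response its own copy of the node capability so concurrent getProto() calls never serialize a shared ResourcePBImpl. Resources#clone and NodeReport#setCapability are real YARN APIs; the wrapper class is hypothetical.

{code:java}
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

// Hypothetical fix-up when building the updated-nodes list for an AM
// response: clone the shared capability so each response owns the
// Resource it serializes.
public final class NodeReportSafety {
  public static void setCapabilitySafely(NodeReport report, Resource shared) {
    Resource copy = Resources.clone(shared); // fresh Resource instance
    report.setCapability(copy);
  }
}
{code}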
[jira] [Commented] (YARN-8664) ApplicationMasterProtocolPBServiceImpl#allocate throws NPE when an NM is lost
[ https://issues.apache.org/jira/browse/YARN-8664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580881#comment-16580881 ] Jiandan Yang commented on YARN-8664: - [~cheersyang] Jenkins is probably not OK. Would you please fix it? > ApplicationMasterProtocolPBServiceImpl#allocate throw NPE when NM losting > - > > Key: YARN-8664 > URL: https://issues.apache.org/jira/browse/YARN-8664 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.8.2 > Environment: >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Attachments: YARN-8664-branch-2.8.001.pathch, > YARN-8664-branch-2.8.2.001.patch, YARN-8664-branch-2.8.2.002.patch > > > ResourceManager logs about exception is: > {code:java} > 2018-08-09 00:52:30,746 WARN [IPC Server handler 5 on 8030] > org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8030, call Call#305638 > Retry#0 org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB.allocate from > 11.13.73.101:51083 > java.lang.NullPointerException > at > org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto.isInitialized(YarnProtos.java:6402) > at > org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto$Builder.build(YarnProtos.java:6642) > at > org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.mergeLocalToProto(ResourcePBImpl.java:254) > at > org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.getProto(ResourcePBImpl.java:61) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.convertToProtoFormat(NodeReportPBImpl.java:313) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToBuilder(NodeReportPBImpl.java:264) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToProto(NodeReportPBImpl.java:287) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.getProto(NodeReportPBImpl.java:224) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.convertToProtoFormat(AllocateResponsePBImpl.java:714) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.access$400(AllocateResponsePBImpl.java:69) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:680) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:669) > at > com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336) > at > com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323) > at > org.apache.hadoop.yarn.proto.YarnServiceProtos$AllocateResponseProto$Builder.addAllUpdatedNodes(YarnServiceProtos.java:12846) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToBuilder(AllocateResponsePBImpl.java:145) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToProto(AllocateResponsePBImpl.java:176) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.getProto(AllocateResponsePBImpl.java:97) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:61) > at > org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447) > at 
org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:846) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:789) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1804) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2457) > {code} > ApplicationMasterService#allocate will call AllocateResponse#setUpdatedNodes > when NM losting, and AllocateResponse#getProto will call > ResourceBPImpl#getProto to transform NodeReportPBImpl#capacity into format of > PB . Because ResourcePBImpl is not thread safe and > multiple AM will call allocate at the same time, ResourcePBImpl#getProto may > throw NullPointerException or UnsupportedOperationException. > I wrote a test code which can reproduce exception. > {code:java} > @Test > public void testResource1() throws
[jira] [Updated] (YARN-8664) ApplicationMasterProtocolPBServiceImpl#allocate throws NPE when an NM is lost
[ https://issues.apache.org/jira/browse/YARN-8664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated YARN-8664: Attachment: YARN-8664-branch-2.8.2.002.patch > ApplicationMasterProtocolPBServiceImpl#allocate throw NPE when NM losting > - > > Key: YARN-8664 > URL: https://issues.apache.org/jira/browse/YARN-8664 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.8.2 > Environment: >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Attachments: YARN-8664-branch-2.8.2.001.patch, > YARN-8664-branch-2.8.2.002.patch > > > ResourceManager logs about exception is: > {code:java} > 2018-08-09 00:52:30,746 WARN [IPC Server handler 5 on 8030] > org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8030, call Call#305638 > Retry#0 org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB.allocate from > 11.13.73.101:51083 > java.lang.NullPointerException > at > org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto.isInitialized(YarnProtos.java:6402) > at > org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto$Builder.build(YarnProtos.java:6642) > at > org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.mergeLocalToProto(ResourcePBImpl.java:254) > at > org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.getProto(ResourcePBImpl.java:61) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.convertToProtoFormat(NodeReportPBImpl.java:313) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToBuilder(NodeReportPBImpl.java:264) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToProto(NodeReportPBImpl.java:287) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.getProto(NodeReportPBImpl.java:224) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.convertToProtoFormat(AllocateResponsePBImpl.java:714) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.access$400(AllocateResponsePBImpl.java:69) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:680) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:669) > at > com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336) > at > com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323) > at > org.apache.hadoop.yarn.proto.YarnServiceProtos$AllocateResponseProto$Builder.addAllUpdatedNodes(YarnServiceProtos.java:12846) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToBuilder(AllocateResponsePBImpl.java:145) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToProto(AllocateResponsePBImpl.java:176) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.getProto(AllocateResponsePBImpl.java:97) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:61) > at > org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:846) > at 
org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:789) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1804) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2457) > {code} > ApplicationMasterService#allocate will call AllocateResponse#setUpdatedNodes > when NM losting, and AllocateResponse#getProto will call > ResourceBPImpl#getProto to transform NodeReportPBImpl#capacity into format of > PB . Because ResourcePBImpl is not thread safe and > multiple AM will call allocate at the same time, ResourcePBImpl#getProto may > throw NullPointerException or UnsupportedOperationException. > I wrote a test code which can reproduce exception. > {code:java} > @Test > public void testResource1() throws InterruptedException { > ResourcePBImpl resource = (ResourcePBImpl)
[jira] [Commented] (YARN-8664) ApplicationMasterProtocolPBServiceImpl#allocate throws NPE when an NM is lost
[ https://issues.apache.org/jira/browse/YARN-8664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580626#comment-16580626 ] Jiandan Yang commented on YARN-8664: - Jenkins report ERROR: Docker failed to build image, which is not related to patch. upload patch again to trigger Jenkins. > ApplicationMasterProtocolPBServiceImpl#allocate throw NPE when NM losting > - > > Key: YARN-8664 > URL: https://issues.apache.org/jira/browse/YARN-8664 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.8.2 > Environment: >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Attachments: YARN-8664-branch-2.8.2.001.patch > > > ResourceManager logs about exception is: > {code:java} > 2018-08-09 00:52:30,746 WARN [IPC Server handler 5 on 8030] > org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8030, call Call#305638 > Retry#0 org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB.allocate from > 11.13.73.101:51083 > java.lang.NullPointerException > at > org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto.isInitialized(YarnProtos.java:6402) > at > org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto$Builder.build(YarnProtos.java:6642) > at > org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.mergeLocalToProto(ResourcePBImpl.java:254) > at > org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.getProto(ResourcePBImpl.java:61) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.convertToProtoFormat(NodeReportPBImpl.java:313) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToBuilder(NodeReportPBImpl.java:264) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToProto(NodeReportPBImpl.java:287) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.getProto(NodeReportPBImpl.java:224) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.convertToProtoFormat(AllocateResponsePBImpl.java:714) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.access$400(AllocateResponsePBImpl.java:69) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:680) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:669) > at > com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336) > at > com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323) > at > org.apache.hadoop.yarn.proto.YarnServiceProtos$AllocateResponseProto$Builder.addAllUpdatedNodes(YarnServiceProtos.java:12846) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToBuilder(AllocateResponsePBImpl.java:145) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToProto(AllocateResponsePBImpl.java:176) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.getProto(AllocateResponsePBImpl.java:97) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:61) > at > org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989) > at 
org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:846) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:789) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1804) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2457) > {code} > ApplicationMasterService#allocate will call AllocateResponse#setUpdatedNodes > when NM losting, and AllocateResponse#getProto will call > ResourceBPImpl#getProto to transform NodeReportPBImpl#capacity into format of > PB . Because ResourcePBImpl is not thread safe and > multiple AM will call allocate at the same time, ResourcePBImpl#getProto may > throw NullPointerException or UnsupportedOperationException. > I wrote a test code which can reproduce exception. > {code:java} > @Test > public void testResource1() throws
[jira] [Commented] (YARN-8664) ApplicationMasterProtocolPBServiceImpl#allocate throws NPE when an NM is lost
[ https://issues.apache.org/jira/browse/YARN-8664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580688#comment-16580688 ] Jiandan Yang commented on YARN-8664: - Thank [~cheersyang] for quick response. There is no problem in trunk, because it replace ResourcePBImpl with LightWeightResource introduced by YARN-6909. I will update a patch for branch-2.8 > ApplicationMasterProtocolPBServiceImpl#allocate throw NPE when NM losting > - > > Key: YARN-8664 > URL: https://issues.apache.org/jira/browse/YARN-8664 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.8.2 > Environment: >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Attachments: YARN-8664-branch-2.8.2.001.patch, > YARN-8664-branch-2.8.2.002.patch > > > ResourceManager logs about exception is: > {code:java} > 2018-08-09 00:52:30,746 WARN [IPC Server handler 5 on 8030] > org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8030, call Call#305638 > Retry#0 org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB.allocate from > 11.13.73.101:51083 > java.lang.NullPointerException > at > org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto.isInitialized(YarnProtos.java:6402) > at > org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto$Builder.build(YarnProtos.java:6642) > at > org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.mergeLocalToProto(ResourcePBImpl.java:254) > at > org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.getProto(ResourcePBImpl.java:61) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.convertToProtoFormat(NodeReportPBImpl.java:313) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToBuilder(NodeReportPBImpl.java:264) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToProto(NodeReportPBImpl.java:287) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.getProto(NodeReportPBImpl.java:224) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.convertToProtoFormat(AllocateResponsePBImpl.java:714) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.access$400(AllocateResponsePBImpl.java:69) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:680) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:669) > at > com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336) > at > com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323) > at > org.apache.hadoop.yarn.proto.YarnServiceProtos$AllocateResponseProto$Builder.addAllUpdatedNodes(YarnServiceProtos.java:12846) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToBuilder(AllocateResponsePBImpl.java:145) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToProto(AllocateResponsePBImpl.java:176) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.getProto(AllocateResponsePBImpl.java:97) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:61) > at > org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99) > at > 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:846) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:789) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1804) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2457) > {code} > ApplicationMasterService#allocate will call AllocateResponse#setUpdatedNodes > when NM losting, and AllocateResponse#getProto will call > ResourceBPImpl#getProto to transform NodeReportPBImpl#capacity into format of > PB . Because ResourcePBImpl is not thread safe and > multiple AM will call allocate at the same time, ResourcePBImpl#getProto may > throw NullPointerException or UnsupportedOperationException. > I wrote a test code
[jira] [Updated] (YARN-8664) ApplicationMasterProtocolPBServiceImpl#allocate throw NPE when NM losting
[ https://issues.apache.org/jira/browse/YARN-8664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated YARN-8664: Attachment: YARN-8664-branch-2.8.001.pathch > ApplicationMasterProtocolPBServiceImpl#allocate throw NPE when NM losting > - > > Key: YARN-8664 > URL: https://issues.apache.org/jira/browse/YARN-8664 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.8.2 > Environment: >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Attachments: YARN-8664-branch-2.8.001.pathch, > YARN-8664-branch-2.8.2.001.patch, YARN-8664-branch-2.8.2.002.patch > > > ResourceManager logs about exception is: > {code:java} > 2018-08-09 00:52:30,746 WARN [IPC Server handler 5 on 8030] > org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8030, call Call#305638 > Retry#0 org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB.allocate from > 11.13.73.101:51083 > java.lang.NullPointerException > at > org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto.isInitialized(YarnProtos.java:6402) > at > org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto$Builder.build(YarnProtos.java:6642) > at > org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.mergeLocalToProto(ResourcePBImpl.java:254) > at > org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.getProto(ResourcePBImpl.java:61) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.convertToProtoFormat(NodeReportPBImpl.java:313) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToBuilder(NodeReportPBImpl.java:264) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToProto(NodeReportPBImpl.java:287) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.getProto(NodeReportPBImpl.java:224) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.convertToProtoFormat(AllocateResponsePBImpl.java:714) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.access$400(AllocateResponsePBImpl.java:69) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:680) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:669) > at > com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336) > at > com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323) > at > org.apache.hadoop.yarn.proto.YarnServiceProtos$AllocateResponseProto$Builder.addAllUpdatedNodes(YarnServiceProtos.java:12846) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToBuilder(AllocateResponsePBImpl.java:145) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToProto(AllocateResponsePBImpl.java:176) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.getProto(AllocateResponsePBImpl.java:97) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:61) > at > org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989) > at 
org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:846) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:789) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1804) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2457) > {code} > ApplicationMasterService#allocate will call AllocateResponse#setUpdatedNodes > when NM losting, and AllocateResponse#getProto will call > ResourceBPImpl#getProto to transform NodeReportPBImpl#capacity into format of > PB . Because ResourcePBImpl is not thread safe and > multiple AM will call allocate at the same time, ResourcePBImpl#getProto may > throw NullPointerException or UnsupportedOperationException. > I wrote a test code which can reproduce exception. > {code:java} > @Test > public void testResource1() throws InterruptedException { > ResourcePBImpl resource =
[jira] [Created] (YARN-8645) Yarn NM fail to start when remount cpu control group
Jiandan Yang created YARN-8645: --- Summary: Yarn NM fail to start when remount cpu control group Key: YARN-8645 URL: https://issues.apache.org/jira/browse/YARN-8645 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Jiandan Yang
The NM failed to start after we updated YARN to the latest version. The NM logs are as follows:
{code:java}
2018-08-08 16:07:01,244 INFO [main] org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsHandlerImpl: Mounting controller cpu at /sys/fs/cgroup/cpu
2018-08-08 16:07:01,246 WARN [main] org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: Shell execution returned exit code: 32. Privileged Execution Operation Stderr: Feature disabled: mount cgroup
Stdout: Full command array for failed execution: [/home/hadoop/hadoop_hbase/hadoop-current/bin/container-executor, --mount-cgroups, hadoop-yarn, cpu,cpuset,cpuacct=/sys/fs/cgroup/cpu]
2018-08-08 16:07:01,247 ERROR [main] org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsHandlerImpl: Failed to mount controller: cpu
2018-08-08 16:07:01,247 ERROR [main] org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Failed to bootstrap configured resource subsystems!
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerException: Failed to mount controller: cpu
{code}
The cause of the error is that commit 351cf87c92872d90f62c476f85ae4d02e485769c disables mounting cgroups by default in container-executor, which makes container-executor return a non-zero exit code when executing --mount-cgroups. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
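For clusters hitting this after an upgrade, two hedged ways out. The cfg flag below matches the feature check that prints "Feature disabled: mount cgroup", but it is an assumption and should be verified against the container-executor build in use:
{noformat}
# container-executor.cfg -- assumed flag, re-enables the --mount-cgroups path
feature.mount-cgroup.enabled=1
{noformat}
Alternatively, pre-mount the cgroup hierarchies outside YARN and leave yarn.nodemanager.linux-container-executor.cgroups.mount=false so the NM never invokes container-executor with --mount-cgroups.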
[jira] [Created] (YARN-8664) ApplicationMasterProtocolPBServiceImpl#allocate throw NPE when NM losting
Jiandan Yang created YARN-8664: --- Summary: ApplicationMasterProtocolPBServiceImpl#allocate throw NPE when NM losting Key: YARN-8664 URL: https://issues.apache.org/jira/browse/YARN-8664 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.8.2 Environment: Reporter: Jiandan Yang Assignee: Jiandan Yang
The ResourceManager log of the exception is:
{code:java}
2018-08-09 00:52:30,746 WARN [IPC Server handler 5 on 8030] org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8030, call Call#305638 Retry#0 org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB.allocate from 11.13.73.101:51083
java.lang.NullPointerException
        at org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto.isInitialized(YarnProtos.java:6402)
        at org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto$Builder.build(YarnProtos.java:6642)
        at org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.mergeLocalToProto(ResourcePBImpl.java:254)
        at org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.getProto(ResourcePBImpl.java:61)
        at org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.convertToProtoFormat(NodeReportPBImpl.java:313)
        at org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToBuilder(NodeReportPBImpl.java:264)
        at org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToProto(NodeReportPBImpl.java:287)
        at org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.getProto(NodeReportPBImpl.java:224)
        at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.convertToProtoFormat(AllocateResponsePBImpl.java:714)
        at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.access$400(AllocateResponsePBImpl.java:69)
        at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:680)
        at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:669)
        at com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336)
        at com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323)
        at org.apache.hadoop.yarn.proto.YarnServiceProtos$AllocateResponseProto$Builder.addAllUpdatedNodes(YarnServiceProtos.java:12846)
        at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToBuilder(AllocateResponsePBImpl.java:145)
        at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToProto(AllocateResponsePBImpl.java:176)
        at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.getProto(AllocateResponsePBImpl.java:97)
        at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:61)
        at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:846)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:789)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1804)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2457)
{code}
ApplicationMasterService#allocate calls AllocateResponse#setUpdatedNodes when an NM is lost, and AllocateResponse#getProto then calls ResourcePBImpl#getProto to transform NodeReportPBImpl#capacity into PB format. Because ResourcePBImpl is not thread safe and multiple AMs call allocate at the same time, ResourcePBImpl#getProto may throw NullPointerException or UnsupportedOperationException. I wrote test code that reproduces the exception.
{code:java}
@Test
public void testResource1() throws InterruptedException {
  ResourcePBImpl resource = (ResourcePBImpl) Resource.newInstance(1, 1);
  for (long i = 0; i < 100; i++) {
    resource.setResourceInformation("r" + i,
        ResourceInformation.newInstance("r" + i, i));
  }
  for (int i = 0; i < 10; i++) {
    Thread thread = new PBThread(resource);
    thread.setName("t" + i);
    thread.start();
  }
  Thread.sleep(1);
}

class PBThread extends Thread {
  ResourcePBImpl resourcePB;

  public PBThread(ResourcePBImpl resourcePB) {
    this.resourcePB = resourcePB;
  }

  @Override
  public void run()
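The quoted test is cut off by the digest at PBThread#run(). A minimal sketch of the missing body, reconstructed as an assumption from the race described above (every thread serializing the same shared ResourcePBImpl), could be:
{code:java}
// Hypothetical completion of the truncated PBThread.run(): hammering
// getProto() from several threads races the protobuf builder inside
// mergeLocalToProto and surfaces as the NullPointerException or
// UnsupportedOperationException mentioned in the description.
@Override
public void run() {
  while (!Thread.currentThread().isInterrupted()) {
    resourcePB.getProto();
  }
}
{code}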
[jira] [Updated] (YARN-8664) ApplicationMasterProtocolPBServiceImpl#allocate throw NPE when NM losting
[ https://issues.apache.org/jira/browse/YARN-8664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated YARN-8664: Attachment: YARN-8664-branch-2.8.2.001.patch > ApplicationMasterProtocolPBServiceImpl#allocate throw NPE when NM losting > - > > Key: YARN-8664 > URL: https://issues.apache.org/jira/browse/YARN-8664 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.8.2 > Environment: >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Attachments: YARN-8664-branch-2.8.2.001.patch > > > ResourceManager logs about exception is: > {code:java} > 2018-08-09 00:52:30,746 WARN [IPC Server handler 5 on 8030] > org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8030, call Call#305638 > Retry#0 org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB.allocate from > 11.13.73.101:51083 > java.lang.NullPointerException > at > org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto.isInitialized(YarnProtos.java:6402) > at > org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto$Builder.build(YarnProtos.java:6642) > at > org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.mergeLocalToProto(ResourcePBImpl.java:254) > at > org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.getProto(ResourcePBImpl.java:61) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.convertToProtoFormat(NodeReportPBImpl.java:313) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToBuilder(NodeReportPBImpl.java:264) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToProto(NodeReportPBImpl.java:287) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.getProto(NodeReportPBImpl.java:224) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.convertToProtoFormat(AllocateResponsePBImpl.java:714) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.access$400(AllocateResponsePBImpl.java:69) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:680) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:669) > at > com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336) > at > com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323) > at > org.apache.hadoop.yarn.proto.YarnServiceProtos$AllocateResponseProto$Builder.addAllUpdatedNodes(YarnServiceProtos.java:12846) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToBuilder(AllocateResponsePBImpl.java:145) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToProto(AllocateResponsePBImpl.java:176) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.getProto(AllocateResponsePBImpl.java:97) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:61) > at > org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:846) > at 
org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:789) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1804) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2457) > {code} > ApplicationMasterService#allocate will call AllocateResponse#setUpdatedNodes > when NM losting, and AllocateResponse#getProto will call > ResourceBPImpl#getProto to transform NodeReportPBImpl#capacity into format of > PB . Because ResourcePBImpl is not thread safe and > multiple AM will call allocate at the same time, ResourcePBImpl#getProto may > throw NullPointerException or UnsupportedOperationException. > I wrote a test code which can reproduce exception. > {code:java} > @Test > public void testResource1() throws InterruptedException { > ResourcePBImpl resource = (ResourcePBImpl) Resource.newInstance(1, 1); > for(long i=0;i<100;i++)
[jira] [Commented] (YARN-8664) ApplicationMasterProtocolPBServiceImpl#allocate throw NPE when NM losting
[ https://issues.apache.org/jira/browse/YARN-8664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579884#comment-16579884 ] Jiandan Yang commented on YARN-8664: - Replace rmNode.getTotalCapability() with Resources.clone(rmNode.getTotalCapability()) to avoid concurrent access to the same ResourcePBImpl from multiple threads.
> ApplicationMasterProtocolPBServiceImpl#allocate throw NPE when NM losting
> -
>
> Key: YARN-8664
> URL: https://issues.apache.org/jira/browse/YARN-8664
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.8.2
> Environment:
> Reporter: Jiandan Yang
> Assignee: Jiandan Yang
> Priority: Major
> Attachments: YARN-8664-branch-2.8.2.001.patch
>
>
> ResourceManager logs about exception is:
> {code:java}
> 2018-08-09 00:52:30,746 WARN [IPC Server handler 5 on 8030] org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8030, call Call#305638 Retry#0 org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB.allocate from 11.13.73.101:51083
> java.lang.NullPointerException
> at org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto.isInitialized(YarnProtos.java:6402)
> at org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto$Builder.build(YarnProtos.java:6642)
> at org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.mergeLocalToProto(ResourcePBImpl.java:254)
> at org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.getProto(ResourcePBImpl.java:61)
> at org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.convertToProtoFormat(NodeReportPBImpl.java:313)
> at org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToBuilder(NodeReportPBImpl.java:264)
> at org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToProto(NodeReportPBImpl.java:287)
> at org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.getProto(NodeReportPBImpl.java:224)
> at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.convertToProtoFormat(AllocateResponsePBImpl.java:714)
> at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.access$400(AllocateResponsePBImpl.java:69)
> at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:680)
> at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:669)
> at com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336)
> at com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323)
> at org.apache.hadoop.yarn.proto.YarnServiceProtos$AllocateResponseProto$Builder.addAllUpdatedNodes(YarnServiceProtos.java:12846)
> at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToBuilder(AllocateResponsePBImpl.java:145)
> at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToProto(AllocateResponsePBImpl.java:176)
> at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.getProto(AllocateResponsePBImpl.java:97)
> at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:61)
> at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)
> at
org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:846) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:789) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1804) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2457) > {code} > ApplicationMasterService#allocate will call AllocateResponse#setUpdatedNodes > when NM losting, and AllocateResponse#getProto will call > ResourceBPImpl#getProto to transform NodeReportPBImpl#capacity into format of > PB . Because ResourcePBImpl is not thread safe and > multiple AM will call allocate at the same time, ResourcePBImpl#getProto may > throw NullPointerException or UnsupportedOperationException. > I wrote a test code which can reproduce exception. > {code:java} > @Test > public void testResource1()
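The proposed fix, sketched in context: Resources.clone (from org.apache.hadoop.yarn.util.resource.Resources) returns a fresh Resource instance; the surrounding names (rmNode, report) are stand-ins, not the exact patch hunk.
{code:java}
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

// Sketch: wherever the RM builds the NodeReport that goes into an
// AllocateResponse, hand out a defensive copy so that concurrent allocate()
// handlers never serialize the same mutable ResourcePBImpl instance.
Resource capability = Resources.clone(rmNode.getTotalCapability());
report.setCapability(capability);
{code}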
[jira] [Updated] (YARN-8717) set memory.limit_in_bytes when NodeManager starting
[ https://issues.apache.org/jira/browse/YARN-8717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated YARN-8717: Attachment: YARN-8717.001.patch > set memory.limit_in_bytes when NodeManager starting > --- > > Key: YARN-8717 > URL: https://issues.apache.org/jira/browse/YARN-8717 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Attachments: YARN-8717.001.patch > > > CGroupsCpuResourceHandlerImpl sets cpu quota at hirarchy of hadoop-yarn to > restrict total resource of cpu of NM when NM starting; > CGroupsMemoryResourceHandlerImpl also should set memory.limit_in_bytes at > hirachy of hadoop-yarn to control memory resource of NM -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8717) set memory.limit_in_bytes when NodeManager starting
[ https://issues.apache.org/jira/browse/YARN-8717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594858#comment-16594858 ] Jiandan Yang commented on YARN-8717: - Hi [~cheersyang], thanks for watching. We found the NM was killed by the OOM killer under the following conditions:
{noformat}
yarn.nodemanager.resource.memory.enabled=false
yarn.nodemanager.resource.memory-mb = 100G
Physical memory of the NM machine is 120G
The NM ran two containers; each requested 40G of memory, but each actually used 50G+
{noformat}
So we propose setting the limit on the hadoop-yarn hierarchy.
> set memory.limit_in_bytes when NodeManager starting
> ---
>
> Key: YARN-8717
> URL: https://issues.apache.org/jira/browse/YARN-8717
> Project: Hadoop YARN
> Issue Type: New Feature
> Reporter: Jiandan Yang
> Assignee: Jiandan Yang
> Priority: Major
> Attachments: YARN-8717.001.patch
>
>
> CGroupsCpuResourceHandlerImpl sets cpu quota at hirarchy of hadoop-yarn to
> restrict total resource of cpu of NM when NM starting;
> CGroupsMemoryResourceHandlerImpl also should set memory.limit_in_bytes at
> hirachy of hadoop-yarn to control memory resource of NM
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
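A sketch of what such a bootstrap could look like, mirroring how CGroupsCpuResourceHandlerImpl caps cpu on the root hadoop-yarn hierarchy (in the existing handlers an empty cgroup id addresses that root). The constants and method signatures follow the CGroupsHandler API on recent branches and should be treated as assumptions, not the attached patch:
{code:java}
// Hypothetical CGroupsMemoryResourceHandlerImpl#bootstrap: cap the whole
// hadoop-yarn hierarchy at yarn.nodemanager.resource.memory-mb so the sum
// of all containers can never push the NM past its advertised memory.
@Override
public List<PrivilegedOperation> bootstrap(Configuration conf)
    throws ResourceHandlerException {
  cGroupsHandler.initializeCGroupController(
      CGroupsHandler.CGroupController.MEMORY);
  long nmMemMb = conf.getLong(YarnConfiguration.NM_PMEM_MB,
      YarnConfiguration.DEFAULT_NM_PMEM_MB);
  // An empty cgroup id targets the root hierarchy, i.e. this writes
  // /sys/fs/cgroup/memory/hadoop-yarn/memory.limit_in_bytes.
  cGroupsHandler.updateCGroupParam(CGroupsHandler.CGroupController.MEMORY,
      "", CGroupsHandler.CGROUP_PARAM_MEMORY_HARD_LIMIT_BYTES,
      String.valueOf(nmMemMb * 1024 * 1024));
  return null;
}
{code}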
[jira] [Updated] (YARN-8664) ApplicationMasterProtocolPBServiceImpl#allocate throw NPE when NM losting
[ https://issues.apache.org/jira/browse/YARN-8664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated YARN-8664: Attachment: YARN-8664-branch-2.8.002.pathch > ApplicationMasterProtocolPBServiceImpl#allocate throw NPE when NM losting > - > > Key: YARN-8664 > URL: https://issues.apache.org/jira/browse/YARN-8664 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.8.2 > Environment: >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Attachments: YARN-8664-branch-2.8.001.pathch, > YARN-8664-branch-2.8.002.pathch, YARN-8664-branch-2.8.01.patch, > YARN-8664-branch-2.8.2.001.patch, YARN-8664-branch-2.8.2.002.patch > > > ResourceManager logs about exception is: > {code:java} > 2018-08-09 00:52:30,746 WARN [IPC Server handler 5 on 8030] > org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8030, call Call#305638 > Retry#0 org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB.allocate from > 11.13.73.101:51083 > java.lang.NullPointerException > at > org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto.isInitialized(YarnProtos.java:6402) > at > org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto$Builder.build(YarnProtos.java:6642) > at > org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.mergeLocalToProto(ResourcePBImpl.java:254) > at > org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.getProto(ResourcePBImpl.java:61) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.convertToProtoFormat(NodeReportPBImpl.java:313) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToBuilder(NodeReportPBImpl.java:264) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToProto(NodeReportPBImpl.java:287) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.getProto(NodeReportPBImpl.java:224) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.convertToProtoFormat(AllocateResponsePBImpl.java:714) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.access$400(AllocateResponsePBImpl.java:69) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:680) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:669) > at > com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336) > at > com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323) > at > org.apache.hadoop.yarn.proto.YarnServiceProtos$AllocateResponseProto$Builder.addAllUpdatedNodes(YarnServiceProtos.java:12846) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToBuilder(AllocateResponsePBImpl.java:145) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToProto(AllocateResponsePBImpl.java:176) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.getProto(AllocateResponsePBImpl.java:97) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:61) > at > org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447) > at 
org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:846) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:789) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1804) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2457) > {code} > ApplicationMasterService#allocate will call AllocateResponse#setUpdatedNodes > when NM losting, and AllocateResponse#getProto will call > ResourceBPImpl#getProto to transform NodeReportPBImpl#capacity into format of > PB . Because ResourcePBImpl is not thread safe and > multiple AM will call allocate at the same time, ResourcePBImpl#getProto may > throw NullPointerException or UnsupportedOperationException. > I wrote a test code which can reproduce exception. > {code:java} > @Test > public void testResource1()
[jira] [Updated] (YARN-8717) set memory.limit_in_bytes when NodeManager starting
[ https://issues.apache.org/jira/browse/YARN-8717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated YARN-8717: Description: CGroupsCpuResourceHandlerImpl sets cpu quota at hirarchy of hadoop-yarn to restrict total resource of cpu of NM when NM starting; CGroupsMemoryResourceHandlerImpl also should set memory.limit_in_bytes at hirachy of hadoop-yarn to control memory resource of NM (was: CGroupsCpuResourceHandlerImpl sets cpu quota at hirarchy of hadoop-yarn to restrict total resource of cpu of NM when NM starting; CGroupsMemoryResourceHandlerImpl also should set memory.limit_in_bytes at hirachy of hadoop-yarn to control cpu resource of NM) > set memory.limit_in_bytes when NodeManager starting > --- > > Key: YARN-8717 > URL: https://issues.apache.org/jira/browse/YARN-8717 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > > CGroupsCpuResourceHandlerImpl sets cpu quota at hirarchy of hadoop-yarn to > restrict total resource of cpu of NM when NM starting; > CGroupsMemoryResourceHandlerImpl also should set memory.limit_in_bytes at > hirachy of hadoop-yarn to control memory resource of NM -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-8717) set memory.limit_in_bytes when NodeManager starting
Jiandan Yang created YARN-8717: --- Summary: set memory.limit_in_bytes when NodeManager starting Key: YARN-8717 URL: https://issues.apache.org/jira/browse/YARN-8717 Project: Hadoop YARN Issue Type: New Feature Environment: CGroupsCpuResourceHandlerImpl sets cpu quota at hirarchy of hadoop-yarn to restrict total resource of cpu of NM when NM starting; CGroupsMemoryResourceHandlerImpl also should set memory.limit_in_bytes at hirachy of hadoop-yarn to control cpu resource of NM Reporter: Jiandan Yang Assignee: Jiandan Yang -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8717) set memory.limit_in_bytes when NodeManager starting
[ https://issues.apache.org/jira/browse/YARN-8717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated YARN-8717: Description: CGroupsCpuResourceHandlerImpl sets cpu quota at hirarchy of hadoop-yarn to restrict total resource of cpu of NM when NM starting; CGroupsMemoryResourceHandlerImpl also should set memory.limit_in_bytes at hirachy of hadoop-yarn to control cpu resource of NM > set memory.limit_in_bytes when NodeManager starting > --- > > Key: YARN-8717 > URL: https://issues.apache.org/jira/browse/YARN-8717 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > > CGroupsCpuResourceHandlerImpl sets cpu quota at hirarchy of hadoop-yarn to > restrict total resource of cpu of NM when NM starting; > CGroupsMemoryResourceHandlerImpl also should set memory.limit_in_bytes at > hirachy of hadoop-yarn to control cpu resource of NM -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8717) set memory.limit_in_bytes when NodeManager starting
[ https://issues.apache.org/jira/browse/YARN-8717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated YARN-8717: Environment: (was: CGroupsCpuResourceHandlerImpl sets cpu quota at hirarchy of hadoop-yarn to restrict total resource of cpu of NM when NM starting; CGroupsMemoryResourceHandlerImpl also should set memory.limit_in_bytes at hirachy of hadoop-yarn to control cpu resource of NM) > set memory.limit_in_bytes when NodeManager starting > --- > > Key: YARN-8717 > URL: https://issues.apache.org/jira/browse/YARN-8717 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-8717) set memory.limit_in_bytes when NodeManager starting
[ https://issues.apache.org/jira/browse/YARN-8717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594858#comment-16594858 ] Jiandan Yang edited comment on YARN-8717 at 9/7/18 3:05 AM: - Hi [~cheersyang], thanks for watching. We found the NM was killed by the OOM killer under the following conditions:
{noformat}
yarn.nodemanager.resource.memory.enforced=false
yarn.nodemanager.resource.memory-mb = 100G
Physical memory of the NM machine is 120G
The NM ran two containers; each requested 40G of memory, but each actually used 50G+
{noformat}
So we propose setting the limit on the hadoop-yarn hierarchy.
was (Author: yangjiandan): Hi [~cheersyang], thanks for watching. We found the NM was killed by the OOM killer under the following conditions:
{noformat}
yarn.nodemanager.resource.memory.enabled=false
yarn.nodemanager.resource.memory-mb = 100G
Physical memory of the NM machine is 120G
The NM ran two containers; each requested 40G of memory, but each actually used 50G+
{noformat}
So we propose setting the limit on the hadoop-yarn hierarchy.
> set memory.limit_in_bytes when NodeManager starting
> ---
>
> Key: YARN-8717
> URL: https://issues.apache.org/jira/browse/YARN-8717
> Project: Hadoop YARN
> Issue Type: New Feature
> Reporter: Jiandan Yang
> Assignee: Jiandan Yang
> Priority: Major
> Labels: cgroups
> Attachments: YARN-8717.001.patch
>
>
> CGroupsCpuResourceHandlerImpl sets cpu quota at hirarchy of hadoop-yarn to
> restrict total resource of cpu of NM when NM starting;
> CGroupsMemoryResourceHandlerImpl also should set memory.limit_in_bytes at
> hirachy of hadoop-yarn to control memory resource of NM
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8664) ApplicationMasterProtocolPBServiceImpl#allocate throw NPE when NM losting
[ https://issues.apache.org/jira/browse/YARN-8664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated YARN-8664: Attachment: (was: YARN-8664-branch-2.8.001.pathch) > ApplicationMasterProtocolPBServiceImpl#allocate throw NPE when NM losting > - > > Key: YARN-8664 > URL: https://issues.apache.org/jira/browse/YARN-8664 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.8.2 > Environment: >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Attachments: YARN-8664-branch-2.8.003.patch, > YARN-8664-branch-2.8.004.patch, YARN-8664-branch-2.8.01.patch > > > ResourceManager logs about exception is: > {code:java} > 2018-08-09 00:52:30,746 WARN [IPC Server handler 5 on 8030] > org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8030, call Call#305638 > Retry#0 org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB.allocate from > 11.13.73.101:51083 > java.lang.NullPointerException > at > org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto.isInitialized(YarnProtos.java:6402) > at > org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto$Builder.build(YarnProtos.java:6642) > at > org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.mergeLocalToProto(ResourcePBImpl.java:254) > at > org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.getProto(ResourcePBImpl.java:61) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.convertToProtoFormat(NodeReportPBImpl.java:313) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToBuilder(NodeReportPBImpl.java:264) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToProto(NodeReportPBImpl.java:287) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.getProto(NodeReportPBImpl.java:224) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.convertToProtoFormat(AllocateResponsePBImpl.java:714) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.access$400(AllocateResponsePBImpl.java:69) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:680) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:669) > at > com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336) > at > com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323) > at > org.apache.hadoop.yarn.proto.YarnServiceProtos$AllocateResponseProto$Builder.addAllUpdatedNodes(YarnServiceProtos.java:12846) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToBuilder(AllocateResponsePBImpl.java:145) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToProto(AllocateResponsePBImpl.java:176) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.getProto(AllocateResponsePBImpl.java:97) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:61) > at > org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989) > at 
org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:846) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:789) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1804) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2457) > {code} > ApplicationMasterService#allocate will call AllocateResponse#setUpdatedNodes > when NM losting, and AllocateResponse#getProto will call > ResourceBPImpl#getProto to transform NodeReportPBImpl#capacity into format of > PB . Because ResourcePBImpl is not thread safe and > multiple AM will call allocate at the same time, ResourcePBImpl#getProto may > throw NullPointerException or UnsupportedOperationException. > I wrote a test code which can reproduce exception. > {code:java} > @Test > public void testResource1() throws InterruptedException { > ResourcePBImpl resource =
[jira] [Updated] (YARN-8664) ApplicationMasterProtocolPBServiceImpl#allocate throw NPE when NM losting
[ https://issues.apache.org/jira/browse/YARN-8664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated YARN-8664: Attachment: YARN-8664-branch-2.8.004.patch > ApplicationMasterProtocolPBServiceImpl#allocate throw NPE when NM losting > - > > Key: YARN-8664 > URL: https://issues.apache.org/jira/browse/YARN-8664 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.8.2 > Environment: >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Attachments: YARN-8664-branch-2.8.003.patch, > YARN-8664-branch-2.8.004.patch, YARN-8664-branch-2.8.01.patch > > > ResourceManager logs about exception is: > {code:java} > 2018-08-09 00:52:30,746 WARN [IPC Server handler 5 on 8030] > org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8030, call Call#305638 > Retry#0 org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB.allocate from > 11.13.73.101:51083 > java.lang.NullPointerException > at > org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto.isInitialized(YarnProtos.java:6402) > at > org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto$Builder.build(YarnProtos.java:6642) > at > org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.mergeLocalToProto(ResourcePBImpl.java:254) > at > org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.getProto(ResourcePBImpl.java:61) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.convertToProtoFormat(NodeReportPBImpl.java:313) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToBuilder(NodeReportPBImpl.java:264) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToProto(NodeReportPBImpl.java:287) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.getProto(NodeReportPBImpl.java:224) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.convertToProtoFormat(AllocateResponsePBImpl.java:714) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.access$400(AllocateResponsePBImpl.java:69) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:680) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:669) > at > com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336) > at > com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323) > at > org.apache.hadoop.yarn.proto.YarnServiceProtos$AllocateResponseProto$Builder.addAllUpdatedNodes(YarnServiceProtos.java:12846) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToBuilder(AllocateResponsePBImpl.java:145) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToProto(AllocateResponsePBImpl.java:176) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.getProto(AllocateResponsePBImpl.java:97) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:61) > at > org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989) > at 
org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:846) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:789) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1804) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2457) > {code} > ApplicationMasterService#allocate will call AllocateResponse#setUpdatedNodes > when NM losting, and AllocateResponse#getProto will call > ResourceBPImpl#getProto to transform NodeReportPBImpl#capacity into format of > PB . Because ResourcePBImpl is not thread safe and > multiple AM will call allocate at the same time, ResourcePBImpl#getProto may > throw NullPointerException or UnsupportedOperationException. > I wrote a test code which can reproduce exception. > {code:java} > @Test > public void testResource1() throws InterruptedException { > ResourcePBImpl resource =
[jira] [Updated] (YARN-8664) ApplicationMasterProtocolPBServiceImpl#allocate throw NPE when NM losting
[ https://issues.apache.org/jira/browse/YARN-8664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated YARN-8664: Attachment: (was: YARN-8664-branch-2.8.2.002.patch) > ApplicationMasterProtocolPBServiceImpl#allocate throw NPE when NM losting > - > > Key: YARN-8664 > URL: https://issues.apache.org/jira/browse/YARN-8664 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.8.2 > Environment: >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Attachments: YARN-8664-branch-2.8.003.patch, > YARN-8664-branch-2.8.004.patch, YARN-8664-branch-2.8.01.patch > > > ResourceManager logs about exception is: > {code:java} > 2018-08-09 00:52:30,746 WARN [IPC Server handler 5 on 8030] > org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8030, call Call#305638 > Retry#0 org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB.allocate from > 11.13.73.101:51083 > java.lang.NullPointerException > at > org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto.isInitialized(YarnProtos.java:6402) > at > org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto$Builder.build(YarnProtos.java:6642) > at > org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.mergeLocalToProto(ResourcePBImpl.java:254) > at > org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.getProto(ResourcePBImpl.java:61) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.convertToProtoFormat(NodeReportPBImpl.java:313) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToBuilder(NodeReportPBImpl.java:264) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToProto(NodeReportPBImpl.java:287) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.getProto(NodeReportPBImpl.java:224) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.convertToProtoFormat(AllocateResponsePBImpl.java:714) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.access$400(AllocateResponsePBImpl.java:69) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:680) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:669) > at > com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336) > at > com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323) > at > org.apache.hadoop.yarn.proto.YarnServiceProtos$AllocateResponseProto$Builder.addAllUpdatedNodes(YarnServiceProtos.java:12846) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToBuilder(AllocateResponsePBImpl.java:145) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToProto(AllocateResponsePBImpl.java:176) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.getProto(AllocateResponsePBImpl.java:97) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:61) > at > org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989) > at 
org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:846) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:789) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1804) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2457) > {code} > ApplicationMasterService#allocate will call AllocateResponse#setUpdatedNodes > when NM losting, and AllocateResponse#getProto will call > ResourceBPImpl#getProto to transform NodeReportPBImpl#capacity into format of > PB . Because ResourcePBImpl is not thread safe and > multiple AM will call allocate at the same time, ResourcePBImpl#getProto may > throw NullPointerException or UnsupportedOperationException. > I wrote a test code which can reproduce exception. > {code:java} > @Test > public void testResource1() throws InterruptedException { > ResourcePBImpl resource =
[jira] [Updated] (YARN-8664) ApplicationMasterProtocolPBServiceImpl#allocate throw NPE when NM losting
[ https://issues.apache.org/jira/browse/YARN-8664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated YARN-8664: Attachment: (was: YARN-8664-branch-2.8.2.001.patch) > ApplicationMasterProtocolPBServiceImpl#allocate throw NPE when NM losting > - > > Key: YARN-8664 > URL: https://issues.apache.org/jira/browse/YARN-8664 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.8.2 > Environment: >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Attachments: YARN-8664-branch-2.8.003.patch, > YARN-8664-branch-2.8.004.patch, YARN-8664-branch-2.8.01.patch > > > ResourceManager logs about exception is: > {code:java} > 2018-08-09 00:52:30,746 WARN [IPC Server handler 5 on 8030] > org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8030, call Call#305638 > Retry#0 org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB.allocate from > 11.13.73.101:51083 > java.lang.NullPointerException > at > org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto.isInitialized(YarnProtos.java:6402) > at > org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto$Builder.build(YarnProtos.java:6642) > at > org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.mergeLocalToProto(ResourcePBImpl.java:254) > at > org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.getProto(ResourcePBImpl.java:61) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.convertToProtoFormat(NodeReportPBImpl.java:313) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToBuilder(NodeReportPBImpl.java:264) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToProto(NodeReportPBImpl.java:287) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.getProto(NodeReportPBImpl.java:224) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.convertToProtoFormat(AllocateResponsePBImpl.java:714) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.access$400(AllocateResponsePBImpl.java:69) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:680) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:669) > at > com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336) > at > com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323) > at > org.apache.hadoop.yarn.proto.YarnServiceProtos$AllocateResponseProto$Builder.addAllUpdatedNodes(YarnServiceProtos.java:12846) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToBuilder(AllocateResponsePBImpl.java:145) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToProto(AllocateResponsePBImpl.java:176) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.getProto(AllocateResponsePBImpl.java:97) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:61) > at > org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989) > at 
org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:846) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:789) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1804) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2457) > {code} > ApplicationMasterService#allocate will call AllocateResponse#setUpdatedNodes > when NM losting, and AllocateResponse#getProto will call > ResourceBPImpl#getProto to transform NodeReportPBImpl#capacity into format of > PB . Because ResourcePBImpl is not thread safe and > multiple AM will call allocate at the same time, ResourcePBImpl#getProto may > throw NullPointerException or UnsupportedOperationException. > I wrote a test code which can reproduce exception. > {code:java} > @Test > public void testResource1() throws InterruptedException { > ResourcePBImpl resource =
[jira] [Updated] (YARN-8664) ApplicationMasterProtocolPBServiceImpl#allocate throw NPE when NM losting
[ https://issues.apache.org/jira/browse/YARN-8664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated YARN-8664: Attachment: (was: YARN-8664-branch-2.8.002.pathch) > ApplicationMasterProtocolPBServiceImpl#allocate throw NPE when NM losting > - > > Key: YARN-8664 > URL: https://issues.apache.org/jira/browse/YARN-8664 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.8.2 > Environment: >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Attachments: YARN-8664-branch-2.8.003.patch, > YARN-8664-branch-2.8.004.patch, YARN-8664-branch-2.8.01.patch > > > ResourceManager logs about exception is: > {code:java} > 2018-08-09 00:52:30,746 WARN [IPC Server handler 5 on 8030] > org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8030, call Call#305638 > Retry#0 org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB.allocate from > 11.13.73.101:51083 > java.lang.NullPointerException > at > org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto.isInitialized(YarnProtos.java:6402) > at > org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto$Builder.build(YarnProtos.java:6642) > at > org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.mergeLocalToProto(ResourcePBImpl.java:254) > at > org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.getProto(ResourcePBImpl.java:61) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.convertToProtoFormat(NodeReportPBImpl.java:313) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToBuilder(NodeReportPBImpl.java:264) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToProto(NodeReportPBImpl.java:287) > at > org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.getProto(NodeReportPBImpl.java:224) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.convertToProtoFormat(AllocateResponsePBImpl.java:714) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.access$400(AllocateResponsePBImpl.java:69) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:680) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:669) > at > com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336) > at > com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323) > at > org.apache.hadoop.yarn.proto.YarnServiceProtos$AllocateResponseProto$Builder.addAllUpdatedNodes(YarnServiceProtos.java:12846) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToBuilder(AllocateResponsePBImpl.java:145) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToProto(AllocateResponsePBImpl.java:176) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.getProto(AllocateResponsePBImpl.java:97) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:61) > at > org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989) > at 
org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:846) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:789) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1804) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2457) > {code} > ApplicationMasterService#allocate will call AllocateResponse#setUpdatedNodes > when NM losting, and AllocateResponse#getProto will call > ResourceBPImpl#getProto to transform NodeReportPBImpl#capacity into format of > PB . Because ResourcePBImpl is not thread safe and > multiple AM will call allocate at the same time, ResourcePBImpl#getProto may > throw NullPointerException or UnsupportedOperationException. > I wrote a test code which can reproduce exception. > {code:java} > @Test > public void testResource1() throws InterruptedException { > ResourcePBImpl resource =
[jira] [Updated] (YARN-7693) ContainersMonitor support configurable
[ https://issues.apache.org/jira/browse/YARN-7693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated YARN-7693: Priority: Minor (was: Blocker)
> ContainersMonitor support configurable
> --
>
> Key: YARN-7693
> URL: https://issues.apache.org/jira/browse/YARN-7693
> Project: Hadoop YARN
> Issue Type: New Feature
> Components: nodemanager
> Reporter: Jiandan Yang
> Assignee: Jiandan Yang
> Priority: Minor
> Attachments: YARN-7693.001.patch, YARN-7693.002.patch
>
>
> Currently ContainersMonitor has only one default implementation, ContainersMonitorImpl.
> After introducing the Opportunistic Container, ContainersMonitor needs to monitor
> system metrics and even dynamically adjust Opportunistic and Guaranteed
> resources in the cgroup, so another ContainersMonitor implementation may need to be
> written.
> Currently ContainerManagerImpl creates ContainersMonitorImpl directly with new,
> so ContainersMonitor needs to be configurable.
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
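A minimal sketch of the pluggability being requested. The configuration key "yarn.nodemanager.containers-monitor.class" is an invented name for illustration, not an existing YarnConfiguration constant; the constructor contract follows ContainersMonitorImpl's current signature, and reflective-exception handling is elided:
{code:java}
// Hypothetical wiring inside ContainerManagerImpl: choose the
// ContainersMonitor implementation from configuration instead of
// hard-coding `new ContainersMonitorImpl(exec, dispatcher, context)`.
Class<? extends ContainersMonitor> cls = conf.getClass(
    "yarn.nodemanager.containers-monitor.class",   // hypothetical key
    ContainersMonitorImpl.class, ContainersMonitor.class);
ContainersMonitor containersMonitor = cls
    .getConstructor(ContainerExecutor.class, AsyncDispatcher.class,
        Context.class)
    .newInstance(exec, dispatcher, context);
{code}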
[jira] [Updated] (YARN-7693) ContainersMonitor support configurable
[ https://issues.apache.org/jira/browse/YARN-7693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiandan Yang updated YARN-7693: Priority: Blocker (was: Minor)
> ContainersMonitor support configurable
> --
>
> Key: YARN-7693
> URL: https://issues.apache.org/jira/browse/YARN-7693
> Project: Hadoop YARN
> Issue Type: New Feature
> Components: nodemanager
> Reporter: Jiandan Yang
> Assignee: Jiandan Yang
> Priority: Blocker
> Attachments: YARN-7693.001.patch, YARN-7693.002.patch
>
>
> Currently ContainersMonitor has only one default implementation, ContainersMonitorImpl.
> After introducing the Opportunistic Container, ContainersMonitor needs to monitor
> system metrics and even dynamically adjust Opportunistic and Guaranteed
> resources in the cgroup, so another ContainersMonitor implementation may need to be
> written.
> Currently ContainerManagerImpl creates ContainersMonitorImpl directly with new,
> so ContainersMonitor needs to be configurable.
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org