[jira] [Commented] (YARN-5951) Changes to allow CapacityScheduler to use configuration store

2017-08-02 Thread Jiandan Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16110501#comment-16110501
 ] 

Jiandan Yang  commented on YARN-5951:
-

[~jhung] [~leftnoteasy] I found two problems with this patch:
1. MutableCSConfigurationProvider#recoverConf iterates over pendingMutations, and calling 
removeFirst inside confirmMutation during that iteration leads to a ConcurrentModificationException:
{code:java}
List<LogMutation> uncommittedLogs = confStore.getPendingMutations();
Configuration oldConf = new Configuration(schedConf);
for (LogMutation mutation : uncommittedLogs) {
  ..
  confStore.confirmMutation(mutation.getId(), true);
  ..
}
{code}
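
One way to avoid the exception, sketched below only as an illustration (assuming, as described above, that confirmMutation removes entries from the same list that getPendingMutations returns), is to replay a defensive copy so the store can shrink its own list freely:

{code:java}
// Illustrative sketch only: iterate over a copy (java.util.ArrayList) so that
// confirmMutation() may remove entries from the store's internal list without
// invalidating this iteration.
List<LogMutation> uncommittedLogs =
    new ArrayList<>(confStore.getPendingMutations());
Configuration oldConf = new Configuration(schedConf);
for (LogMutation mutation : uncommittedLogs) {
  // ... apply the mutation on top of oldConf ...
  confStore.confirmMutation(mutation.getId(), true);
}
{code}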

2. LeveldbConfigurationStore#initialize should update txnId only after the corresponding 
mutation has been added to pendingMutations:

{code:java}
  while (itr.hasNext()) {
    Map.Entry<byte[], byte[]> entry = itr.next();
    if (!new String(entry.getKey(), StandardCharsets.UTF_8)
        .startsWith(LOG_PREFIX)) {
      break;
    }
    pendingMutations.add(deserLogMutation(entry.getValue()));
    // update txnId only after the mutation has been queued
    txnId = deserLogMutation(entry.getValue()).getId();
  }
{code}




> Changes to allow CapacityScheduler to use configuration store
> -
>
> Key: YARN-5951
> URL: https://issues.apache.org/jira/browse/YARN-5951
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
> Fix For: YARN-5734
>
> Attachments: YARN-5951-YARN-5734.001.patch, 
> YARN-5951-YARN-5734.002.patch, YARN-5951-YARN-5734.003.patch, 
> YARN-5951-YARN-5734.004.patch
>
>
> EDIT: changing this ticket. Found that the CapacityStoreConfigurationProvider 
> is not necessary, since we can just grab a Configuration object from 
> StoreConfigurationProvider with type "SCHEDULER" and create a 
> CapacitySchedulerConfiguration from it.
> This ticket will track changes needed for integrating other components to be 
> used by the capacity scheduler.






[jira] [Commented] (YARN-5951) Changes to allow CapacityScheduler to use configuration store

2017-08-02 Thread Jiandan Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1611#comment-1611
 ] 

Jiandan Yang  commented on YARN-5951:
-

[~jhung] Sorry, my mistake, it's 
[YARN-5947|https://issues.apache.org/jira/browse/YARN-5947]

> Changes to allow CapacityScheduler to use configuration store
> -
>
> Key: YARN-5951
> URL: https://issues.apache.org/jira/browse/YARN-5951
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Jonathan Hung
>Assignee: Jonathan Hung
> Fix For: YARN-5734
>
> Attachments: YARN-5951-YARN-5734.001.patch, 
> YARN-5951-YARN-5734.002.patch, YARN-5951-YARN-5734.003.patch, 
> YARN-5951-YARN-5734.004.patch
>
>
> EDIT: changing this ticket. Found that the CapacityStoreConfigurationProvider 
> is not necessary, since we can just grab a Configuration object from 
> StoreConfigurationProvider with type "SCHEDULER" and create a 
> CapacitySchedulerConfiguration from it.
> This ticket will track changes needed for integrating other components to be 
> used by the capacity scheduler.






[jira] [Updated] (YARN-7168) The size of dataQueue and ackQueue in DataStreamer has no limit when writer thread is interrupted

2017-09-06 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-7168:

Description: 
In our cluster we found that a NodeManager did frequent full GCs while it was being 
decommissioned, and that the biggest object was the dataQueue of DataStreamer: it held 
almost 60,000 DFSPackets, each about 64 KB, as shown below.
The root cause is that the sizes of dataQueue and ackQueue in DataStreamer have no limit 
when the writer thread is interrupted. I know the NodeManager may stop writing once 
interrupted, but DFSOutputStream could also do something to avoid the full GC.

!mat.jpg|memory_analysis!


  was:
In our cluster, when found NodeManager frequently FullGC when decommissioning 
NodeManager, and we found the biggest object is dataQueue of DataStreamer, it 
has almost 6w DFSPacket, and every DFSPacket is about 64k.
!mat.jpg|memory_analysis!
The root reason is that the size of dataQueue and ackQueue in DataStreamer has 
no limit when writer thread is interrupted. I know NodeManager may stop writing 
when interruped, but DFSOutputStream also could do something to avoid fullgc



> The size of dataQueue and ackQueue in DataStreamer has no limit when writer 
> thread is interrupted
> -
>
> Key: YARN-7168
> URL: https://issues.apache.org/jira/browse/YARN-7168
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Reporter: Jiandan Yang 
> Attachments: mat.jpg
>
>
> In our cluster, when found NodeManager frequently FullGC when decommissioning 
> NodeManager, and we found the biggest object is dataQueue of DataStreamer, it 
> has almost 6w DFSPacket, and every DFSPacket is about 64k, as shown below.
> The root reason is that the size of dataQueue and ackQueue in DataStreamer 
> has no limit when writer thread is interrupted. I know NodeManager may stop 
> writing when interruped, but DFSOutputStream also could do something to avoid 
> fullgc
> !mat.jpg|memory_analysis!






[jira] [Updated] (YARN-7168) The size of dataQueue and ackQueue in DataStreamer has no limit when writer thread is interrupted

2017-09-06 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-7168:

Description: 
In our cluster we found that a NodeManager did frequent full GCs while it was being 
decommissioned, and that the biggest object was the dataQueue of DataStreamer: it held 
almost 60,000 DFSPackets, each about 64 KB, as shown below.
The root cause is that the sizes of dataQueue and ackQueue in DataStreamer have no limit 
when the writer thread is interrupted:
DFSOutputStream#waitAndQueuePacket does not wait once the writer thread has been 
interrupted. I know the NodeManager may stop writing when interrupted, but 
DFSOutputStream could also do something to avoid the full GC.

{code:java}
      while (!streamerClosed && dataQueue.size() + ackQueue.size() >
          dfsClient.getConf().getWriteMaxPackets()) {
        if (firstWait) {
          Span span = Tracer.getCurrentSpan();
          if (span != null) {
            span.addTimelineAnnotation("dataQueue.wait");
          }
          firstWait = false;
        }
        try {
          dataQueue.wait();
        } catch (InterruptedException e) {
          // If we get interrupted while waiting to queue data, we still need to
          // get rid of the current packet. This is because we have an invariant
          // that if currentPacket gets full, it will get queued before the next
          // writeChunk.
          //
          // Rather than wait around for space in the queue, we should instead
          // try to return to the caller as soon as possible, even though we
          // slightly overrun the MAX_PACKETS length.
          Thread.currentThread().interrupt();
          break;
        }
      }
    } finally {
      Span span = Tracer.getCurrentSpan();
      if ((span != null) && (!firstWait)) {
        span.addTimelineAnnotation("end.wait");
      }
    }
{code}
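
To make the failure mode concrete, here is a toy, self-contained reproduction of the pattern (all names and sizes are illustrative; this is not HDFS code): once the writer thread's interrupt flag is set, every wait() throws immediately, the loop breaks, and the packet is queued anyway, so the queue grows far past the configured limit.

{code:java}
import java.util.LinkedList;

/** Toy reproduction of the growth pattern described above; not HDFS code. */
public class UnboundedQueueDemo {
  // Analogous to dfsClient.getConf().getWriteMaxPackets().
  private static final int WRITE_MAX_PACKETS = 80;
  private static final LinkedList<byte[]> dataQueue = new LinkedList<>();

  private static void waitAndQueue(byte[] packet) {
    synchronized (dataQueue) {
      while (dataQueue.size() > WRITE_MAX_PACKETS) {
        try {
          dataQueue.wait(); // normally blocks until the queue is drained
        } catch (InterruptedException e) {
          // Same pattern as DataStreamer: stop waiting and fall through
          // to enqueue the packet anyway.
          Thread.currentThread().interrupt();
          break;
        }
      }
      dataQueue.add(packet);
    }
  }

  public static void main(String[] args) {
    Thread.currentThread().interrupt(); // simulate an interrupted writer thread
    for (int i = 0; i < 1_000; i++) {
      waitAndQueue(new byte[64 * 1024]); // never blocks once interrupted
    }
    // Prints 1000: the queue has grown far beyond WRITE_MAX_PACKETS.
    System.out.println("queued packets: " + dataQueue.size());
  }
}
{code}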

!mat.jpg|memory_analysis!


  was:
In our cluster, when found NodeManager frequently FullGC when decommissioning 
NodeManager, and we found the biggest object is dataQueue of DataStreamer, it 
has almost 6w DFSPacket, and every DFSPacket is about 64k, as shown below.
The root reason is that the size of dataQueue and ackQueue in DataStreamer has 
no limit when writer thread is interrupted.
DFSOutputStream#waitAndQueuePacket does not wait when writer thread is 
interrupted.
{code:java}
while (!streamerClosed && dataQueue.size() + ackQueue.size() >
  dfsClient.getConf().getWriteMaxPackets()) {
if (firstWait) {
  Span span = Tracer.getCurrentSpan();
  if (span != null) {
span.addTimelineAnnotation("dataQueue.wait");
  }
  firstWait = false;
}
try {
  dataQueue.wait();
} catch (InterruptedException e) {
  // If we get interrupted while waiting to queue data, we still 
need to get rid
  // of the current packet. This is because we have an invariant 
that if
  // currentPacket gets full, it will get queued before the next 
writeChunk.
  //
  // Rather than wait around for space in the queue, we should 
instead try to
  // return to the caller as soon as possible, even though we 
slightly overrun
  // the MAX_PACKETS length.
  Thread.currentThread().interrupt();  
  break;
}
  }
} finally {
  Span span = Tracer.getCurrentSpan();
  if ((span != null) && (!firstWait)) {
span.addTimelineAnnotation("end.wait");
  }
}
{code}

 I know NodeManager may stop writing when interruped, but DFSOutputStream also 
could do something to avoid fullgc

!mat.jpg|memory_analysis!



> The size of dataQueue and ackQueue in DataStreamer has no limit when writer 
> thread is interrupted
> -
>
> Key: YARN-7168
> URL: https://issues.apache.org/jira/browse/YARN-7168
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Reporter: Jiandan Yang 
> Attachments: mat.jpg
>
>
> In our cluster, when found NodeManager frequently FullGC when decommissioning 
> NodeManager, and we found the biggest object is dataQueue of DataStreamer, it 
> has almost 6w DFSPacket, and every DFSPacket is about 64k, as shown below.
> The root reason is that the size of dataQueue and ackQueue in DataStreamer 
> has no limit when writer thread is interrupted.
> DFSOutputStream#waitAndQueuePacket does not wait when writer thread is 
> interrupted. I know NodeManager may stop writing when interruped, but 
> DFSOutputStream also could do something to avoid fullgc
> 

[jira] [Updated] (YARN-7168) The size of dataQueue and ackQueue in DataStreamer has no limit when writer thread is interrupted

2017-09-06 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-7168:

Description: 
In our cluster we found that a NodeManager did frequent full GCs while it was being 
decommissioned, and that the biggest object was the dataQueue of DataStreamer: it held 
almost 60,000 DFSPackets, each about 64 KB, as shown below.
The root cause is that the sizes of dataQueue and ackQueue in DataStreamer have no limit 
when the writer thread is interrupted:
DFSOutputStream#waitAndQueuePacket does not wait once the writer thread has been 
interrupted.
{code:java}
      while (!streamerClosed && dataQueue.size() + ackQueue.size() >
          dfsClient.getConf().getWriteMaxPackets()) {
        if (firstWait) {
          Span span = Tracer.getCurrentSpan();
          if (span != null) {
            span.addTimelineAnnotation("dataQueue.wait");
          }
          firstWait = false;
        }
        try {
          dataQueue.wait();
        } catch (InterruptedException e) {
          // If we get interrupted while waiting to queue data, we still need to
          // get rid of the current packet. This is because we have an invariant
          // that if currentPacket gets full, it will get queued before the next
          // writeChunk.
          //
          // Rather than wait around for space in the queue, we should instead
          // try to return to the caller as soon as possible, even though we
          // slightly overrun the MAX_PACKETS length.
          Thread.currentThread().interrupt();
          break;
        }
      }
    } finally {
      Span span = Tracer.getCurrentSpan();
      if ((span != null) && (!firstWait)) {
        span.addTimelineAnnotation("end.wait");
      }
    }
{code}

I know the NodeManager may stop writing when interrupted, but DFSOutputStream could also 
do something to avoid the full GC.

!mat.jpg|memory_analysis!


  was:
In our cluster, when found NodeManager frequently FullGC when decommissioning 
NodeManager, and we found the biggest object is dataQueue of DataStreamer, it 
has almost 6w DFSPacket, and every DFSPacket is about 64k, as shown below.
The root reason is that the size of dataQueue and ackQueue in DataStreamer has 
no limit when writer thread is interrupted. I know NodeManager may stop writing 
when interruped, but DFSOutputStream also could do something to avoid fullgc

!mat.jpg|memory_analysis!



> The size of dataQueue and ackQueue in DataStreamer has no limit when writer 
> thread is interrupted
> -
>
> Key: YARN-7168
> URL: https://issues.apache.org/jira/browse/YARN-7168
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Reporter: Jiandan Yang 
> Attachments: mat.jpg
>
>
> In our cluster, when found NodeManager frequently FullGC when decommissioning 
> NodeManager, and we found the biggest object is dataQueue of DataStreamer, it 
> has almost 6w DFSPacket, and every DFSPacket is about 64k, as shown below.
> The root reason is that the size of dataQueue and ackQueue in DataStreamer 
> has no limit when writer thread is interrupted.
> DFSOutputStream#waitAndQueuePacket does not wait when writer thread is 
> interrupted.
> {code:java}
> while (!streamerClosed && dataQueue.size() + ackQueue.size() >
>   dfsClient.getConf().getWriteMaxPackets()) {
> if (firstWait) {
>   Span span = Tracer.getCurrentSpan();
>   if (span != null) {
> span.addTimelineAnnotation("dataQueue.wait");
>   }
>   firstWait = false;
> }
> try {
>   dataQueue.wait();
> } catch (InterruptedException e) {
>   // If we get interrupted while waiting to queue data, we still 
> need to get rid
>   // of the current packet. This is because we have an invariant 
> that if
>   // currentPacket gets full, it will get queued before the next 
> writeChunk.
>   //
>   // Rather than wait around for space in the queue, we should 
> instead try to
>   // return to the caller as soon as possible, even though we 
> slightly overrun
>   // the MAX_PACKETS length.
>   Thread.currentThread().interrupt();  
>   break;
> }
>   }
> } finally {
>   Span span = Tracer.getCurrentSpan();
>   if ((span != null) && (!firstWait)) {
> span.addTimelineAnnotation("end.wait");
>   }
> }
> {code}
>  I know NodeManager may stop writing when interruped, but DFSOutputStream 
> also could do something to avoid fullgc
> 

[jira] [Commented] (YARN-7168) The size of dataQueue and ackQueue in DataStreamer has no limit when writer thread is interrupted

2017-09-06 Thread Jiandan Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16156350#comment-16156350
 ] 

Jiandan Yang  commented on YARN-7168:
-

Sorry, I should have created this issue in Hadoop HDFS. Can anyone help me move it to the 
Hadoop HDFS project?

> The size of dataQueue and ackQueue in DataStreamer has no limit when writer 
> thread is interrupted
> -
>
> Key: YARN-7168
> URL: https://issues.apache.org/jira/browse/YARN-7168
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Reporter: Jiandan Yang 
> Attachments: mat.jpg
>
>
> In our cluster, when found NodeManager frequently FullGC when decommissioning 
> NodeManager, and we found the biggest object is dataQueue of DataStreamer, it 
> has almost 6w DFSPacket, and every DFSPacket is about 64k, as shown below.
> The root reason is that the size of dataQueue and ackQueue in DataStreamer 
> has no limit when writer thread is interrupted.
> DFSOutputStream#waitAndQueuePacket does not wait when writer thread is 
> interrupted. I know NodeManager may stop writing when interruped, but 
> DFSOutputStream also could do something to avoid Infinite growth of dataQueue.
> {code:java}
> while (!streamerClosed && dataQueue.size() + ackQueue.size() >
>   dfsClient.getConf().getWriteMaxPackets()) {
> if (firstWait) {
>   Span span = Tracer.getCurrentSpan();
>   if (span != null) {
> span.addTimelineAnnotation("dataQueue.wait");
>   }
>   firstWait = false;
> }
> try {
>   dataQueue.wait();
> } catch (InterruptedException e) {
>   // If we get interrupted while waiting to queue data, we still 
> need to get rid
>   // of the current packet. This is because we have an invariant 
> that if
>   // currentPacket gets full, it will get queued before the next 
> writeChunk.
>   //
>   // Rather than wait around for space in the queue, we should 
> instead try to
>   // return to the caller as soon as possible, even though we 
> slightly overrun
>   // the MAX_PACKETS length.
>   Thread.currentThread().interrupt();  
>   break;
> }
>   }
> } finally {
>   Span span = Tracer.getCurrentSpan();
>   if ((span != null) && (!firstWait)) {
> span.addTimelineAnnotation("end.wait");
>   }
> }
> {code}
> !mat.jpg|memory_analysis!






[jira] [Updated] (YARN-7168) The size of dataQueue and ackQueue in DataStreamer has no limit when writer thread is interrupted

2017-09-06 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-7168:

Attachment: mat.jpg

> The size of dataQueue and ackQueue in DataStreamer has no limit when writer 
> thread is interrupted
> -
>
> Key: YARN-7168
> URL: https://issues.apache.org/jira/browse/YARN-7168
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Reporter: Jiandan Yang 
> Attachments: mat.jpg
>
>
> In our cluster, when found NodeManager frequently FullGC when decommissioning 
> NodeManager, and we found the biggest object is dataQueue of DataStreamer, it 
> has almost 6w DFSPacket, and every DFSPacket is about 64k.
> !mat.jpg|memory_analysis!
> The root reason is that the size of dataQueue and ackQueue in DataStreamer 
> has no limit when writer thread is interrupted. I know NodeManager may stop 
> writing when interruped, but DFSOutputStream also could do something to avoid 
> fullgc






[jira] [Created] (YARN-7168) The size of dataQueue and ackQueue in DataStreamer has no limit when writer thread is interrupted

2017-09-06 Thread Jiandan Yang (JIRA)
Jiandan Yang  created YARN-7168:
---

 Summary: The size of dataQueue and ackQueue in DataStreamer has no 
limit when writer thread is interrupted
 Key: YARN-7168
 URL: https://issues.apache.org/jira/browse/YARN-7168
 Project: Hadoop YARN
  Issue Type: Bug
  Components: client
Reporter: Jiandan Yang 


In our cluster we found that a NodeManager did frequent full GCs while it was being 
decommissioned, and that the biggest object was the dataQueue of DataStreamer: it held 
almost 60,000 DFSPackets, each about 64 KB.
!mat.jpg|memory_analysis!
The root cause is that the sizes of dataQueue and ackQueue in DataStreamer have no limit 
when the writer thread is interrupted. I know the NodeManager may stop writing once 
interrupted, but DFSOutputStream could also do something to avoid the full GC.







[jira] [Updated] (YARN-7168) The size of dataQueue and ackQueue in DataStreamer has no limit when writer thread is interrupted

2017-09-06 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-7168:

Description: 
In our cluster we found that a NodeManager did frequent full GCs while it was being 
decommissioned, and that the biggest object was the dataQueue of DataStreamer: it held 
almost 60,000 DFSPackets, each about 64 KB, as shown below.
The root cause is that the sizes of dataQueue and ackQueue in DataStreamer have no limit 
when the writer thread is interrupted:
DFSOutputStream#waitAndQueuePacket does not wait once the writer thread has been 
interrupted. I know the NodeManager may stop writing when interrupted, but 
DFSOutputStream could also do something to avoid the infinite growth of dataQueue.

{code:java}
      while (!streamerClosed && dataQueue.size() + ackQueue.size() >
          dfsClient.getConf().getWriteMaxPackets()) {
        if (firstWait) {
          Span span = Tracer.getCurrentSpan();
          if (span != null) {
            span.addTimelineAnnotation("dataQueue.wait");
          }
          firstWait = false;
        }
        try {
          dataQueue.wait();
        } catch (InterruptedException e) {
          // If we get interrupted while waiting to queue data, we still need to
          // get rid of the current packet. This is because we have an invariant
          // that if currentPacket gets full, it will get queued before the next
          // writeChunk.
          //
          // Rather than wait around for space in the queue, we should instead
          // try to return to the caller as soon as possible, even though we
          // slightly overrun the MAX_PACKETS length.
          Thread.currentThread().interrupt();
          break;
        }
      }
    } finally {
      Span span = Tracer.getCurrentSpan();
      if ((span != null) && (!firstWait)) {
        span.addTimelineAnnotation("end.wait");
      }
    }
{code}

!mat.jpg|memory_analysis!


  was:
In our cluster, when found NodeManager frequently FullGC when decommissioning 
NodeManager, and we found the biggest object is dataQueue of DataStreamer, it 
has almost 6w DFSPacket, and every DFSPacket is about 64k, as shown below.
The root reason is that the size of dataQueue and ackQueue in DataStreamer has 
no limit when writer thread is interrupted.
DFSOutputStream#waitAndQueuePacket does not wait when writer thread is 
interrupted. I know NodeManager may stop writing when interruped, but 
DFSOutputStream also could do something to avoid fullgc

{code:java}
while (!streamerClosed && dataQueue.size() + ackQueue.size() >
  dfsClient.getConf().getWriteMaxPackets()) {
if (firstWait) {
  Span span = Tracer.getCurrentSpan();
  if (span != null) {
span.addTimelineAnnotation("dataQueue.wait");
  }
  firstWait = false;
}
try {
  dataQueue.wait();
} catch (InterruptedException e) {
  // If we get interrupted while waiting to queue data, we still 
need to get rid
  // of the current packet. This is because we have an invariant 
that if
  // currentPacket gets full, it will get queued before the next 
writeChunk.
  //
  // Rather than wait around for space in the queue, we should 
instead try to
  // return to the caller as soon as possible, even though we 
slightly overrun
  // the MAX_PACKETS length.
  Thread.currentThread().interrupt();  
  break;
}
  }
} finally {
  Span span = Tracer.getCurrentSpan();
  if ((span != null) && (!firstWait)) {
span.addTimelineAnnotation("end.wait");
  }
}
{code}

!mat.jpg|memory_analysis!



> The size of dataQueue and ackQueue in DataStreamer has no limit when writer 
> thread is interrupted
> -
>
> Key: YARN-7168
> URL: https://issues.apache.org/jira/browse/YARN-7168
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client
>Reporter: Jiandan Yang 
> Attachments: mat.jpg
>
>
> In our cluster, when found NodeManager frequently FullGC when decommissioning 
> NodeManager, and we found the biggest object is dataQueue of DataStreamer, it 
> has almost 6w DFSPacket, and every DFSPacket is about 64k, as shown below.
> The root reason is that the size of dataQueue and ackQueue in DataStreamer 
> has no limit when writer thread is interrupted.
> DFSOutputStream#waitAndQueuePacket does not wait when writer thread is 
> interrupted. I know NodeManager may stop writing when interruped, but 
> DFSOutputStream also could do something to 

[jira] [Updated] (YARN-7497) Add HDFSSchedulerConfigurationStore for RM HA

2017-11-14 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-7497:

Description: 
YARN-5947 added LeveldbConfigurationStore, which uses LevelDB as the backing store, but 
it does not support YARN RM HA.
YARN-6840 supports RM HA, but a very large scheduler configuration (for example, 
10 thousand queues) may exceed the znode size limit.
HDFSSchedulerConfigurationStore stores the configuration file in HDFS, so that on RM 
failover the new active RM can load the scheduler configuration from HDFS.
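
A minimal sketch of the idea, only for illustration (apart from the name HDFSSchedulerConfigurationStore itself, the class and method names below are assumptions, not the actual patch): write the scheduler configuration to a file under an HDFS directory on every change, and have a newly active RM read it back on failover.

{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Illustrative sketch only; not the actual HDFSSchedulerConfigurationStore. */
public class HdfsConfStoreSketch {
  private final FileSystem fs;
  private final Path confFile;

  public HdfsConfStoreSketch(Configuration hadoopConf, Path confFile)
      throws IOException {
    this.fs = confFile.getFileSystem(hadoopConf);
    this.confFile = confFile;
  }

  /** Persist the current scheduler configuration to HDFS on every change. */
  public void save(Configuration schedConf) throws IOException {
    try (FSDataOutputStream out = fs.create(confFile, true)) {
      schedConf.writeXml(out);
    }
  }

  /** Load the scheduler configuration on the newly active RM after failover. */
  public Configuration load() throws IOException {
    Configuration schedConf = new Configuration(false);
    try (FSDataInputStream in = fs.open(confFile)) {
      schedConf.addResource(in);
      schedConf.size(); // force the stream to be parsed before it is closed
    }
    return schedConf;
  }
}
{code}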

  was:YARN-5947 add LeveldbConfigurationStore using Leveldb as backing store, 
but it does not support Yarn RM HA. HDFSSchedulerConfigurationStore store conf 
file in HDFS, when RM failover, new active RM can load scheduler configuration 
from HDFS.


> Add HDFSSchedulerConfigurationStore for RM HA
> -
>
> Key: YARN-7497
> URL: https://issues.apache.org/jira/browse/YARN-7497
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: yarn
>Reporter: Jiandan Yang 
> Attachments: YARN-7497.001.patch
>
>
> YARN-5947 add LeveldbConfigurationStore using Leveldb as backing store, but 
> it does not support Yarn RM HA. 
> YARN-6840 supports RM HA, but too many scheduler configurations may exceed 
> znode limit, for example 10 thousand queues.
> HDFSSchedulerConfigurationStore store conf file in HDFS, when RM failover, 
> new active RM can load scheduler configuration from HDFS.






[jira] [Updated] (YARN-7497) Add HDFSSchedulerConfigurationStore for RM HA

2017-11-14 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-7497:

Attachment: YARN-7497.001.patch

> Add HDFSSchedulerConfigurationStore for RM HA
> -
>
> Key: YARN-7497
> URL: https://issues.apache.org/jira/browse/YARN-7497
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: yarn
>Reporter: Jiandan Yang 
> Attachments: YARN-7497.001.patch
>
>
> YARN-5947 add LeveldbConfigurationStore using Leveldb as backing store, but 
> it does not support Yarn RM HA. HDFSSchedulerConfigurationStore store conf 
> file in HDFS, when RM failover, new active RM can load scheduler 
> configuration from HDFS.






[jira] [Created] (YARN-7497) Add HDFSSchedulerConfigurationStore for RM HA

2017-11-14 Thread Jiandan Yang (JIRA)
Jiandan Yang  created YARN-7497:
---

 Summary: Add HDFSSchedulerConfigurationStore for RM HA
 Key: YARN-7497
 URL: https://issues.apache.org/jira/browse/YARN-7497
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: yarn
Reporter: Jiandan Yang 


YARN-5947 added LeveldbConfigurationStore, which uses LevelDB as the backing store, but 
it does not support YARN RM HA. HDFSSchedulerConfigurationStore stores the configuration 
file in HDFS, so that on RM failover the new active RM can load the scheduler 
configuration from HDFS.






[jira] [Updated] (YARN-7497) Add HDFSSchedulerConfigurationStore for RM HA

2017-11-29 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-7497:

Attachment: YARN-7497.005.patch

> Add HDFSSchedulerConfigurationStore for RM HA
> -
>
> Key: YARN-7497
> URL: https://issues.apache.org/jira/browse/YARN-7497
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: yarn
>Reporter: Jiandan Yang 
> Attachments: YARN-7497.001.patch, YARN-7497.002.patch, 
> YARN-7497.003.patch, YARN-7497.004.patch, YARN-7497.005.patch
>
>
> YARN-5947 add LeveldbConfigurationStore using Leveldb as backing store, but 
> it does not support Yarn RM HA. 
> YARN-6840 supports RM HA, but too many scheduler configurations may exceed 
> znode limit, for example 10 thousand queues.
> HDFSSchedulerConfigurationStore store conf file in HDFS, when RM failover, 
> new active RM can load scheduler configuration from HDFS.






[jira] [Commented] (YARN-7497) Add HDFSSchedulerConfigurationStore for RM HA

2017-11-29 Thread Jiandan Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16270525#comment-16270525
 ] 

Jiandan Yang  commented on YARN-7497:
-

[~gphillips] I have moved those two static constants into YarnConfiguration.

> Add HDFSSchedulerConfigurationStore for RM HA
> -
>
> Key: YARN-7497
> URL: https://issues.apache.org/jira/browse/YARN-7497
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: yarn
>Reporter: Jiandan Yang 
> Attachments: YARN-7497.001.patch, YARN-7497.002.patch, 
> YARN-7497.003.patch, YARN-7497.004.patch, YARN-7497.005.patch
>
>
> YARN-5947 add LeveldbConfigurationStore using Leveldb as backing store, but 
> it does not support Yarn RM HA. 
> YARN-6840 supports RM HA, but too many scheduler configurations may exceed 
> znode limit, for example 10 thousand queues.
> HDFSSchedulerConfigurationStore store conf file in HDFS, when RM failover, 
> new active RM can load scheduler configuration from HDFS.






[jira] [Commented] (YARN-7497) Add HDFSSchedulerConfigurationStore for RM HA

2017-11-29 Thread Jiandan Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16270497#comment-16270497
 ] 

Jiandan Yang  commented on YARN-7497:
-

[~jhung] Could you please review this patch and give me some comments? Thank you.

> Add HDFSSchedulerConfigurationStore for RM HA
> -
>
> Key: YARN-7497
> URL: https://issues.apache.org/jira/browse/YARN-7497
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: yarn
>Reporter: Jiandan Yang 
> Attachments: YARN-7497.001.patch, YARN-7497.002.patch, 
> YARN-7497.003.patch, YARN-7497.004.patch
>
>
> YARN-5947 add LeveldbConfigurationStore using Leveldb as backing store, but 
> it does not support Yarn RM HA. 
> YARN-6840 supports RM HA, but too many scheduler configurations may exceed 
> znode limit, for example 10 thousand queues.
> HDFSSchedulerConfigurationStore store conf file in HDFS, when RM failover, 
> new active RM can load scheduler configuration from HDFS.






[jira] [Updated] (YARN-7497) Add HDFSSchedulerConfigurationStore for RM HA

2017-11-29 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-7497:

Attachment: YARN-7497.006.patch

Fix the TestYarnConfigurationFields failure.

> Add HDFSSchedulerConfigurationStore for RM HA
> -
>
> Key: YARN-7497
> URL: https://issues.apache.org/jira/browse/YARN-7497
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: yarn
>Reporter: Jiandan Yang 
> Attachments: YARN-7497.001.patch, YARN-7497.002.patch, 
> YARN-7497.003.patch, YARN-7497.004.patch, YARN-7497.005.patch, 
> YARN-7497.006.patch
>
>
> YARN-5947 add LeveldbConfigurationStore using Leveldb as backing store, but 
> it does not support Yarn RM HA. 
> YARN-6840 supports RM HA, but too many scheduler configurations may exceed 
> znode limit, for example 10 thousand queues.
> HDFSSchedulerConfigurationStore store conf file in HDFS, when RM failover, 
> new active RM can load scheduler configuration from HDFS.






[jira] [Updated] (YARN-7497) Add HDFSSchedulerConfigurationStore for RM HA

2017-11-16 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-7497:

Attachment: YARN-7497.004.patch

Fix findbugs error.

> Add HDFSSchedulerConfigurationStore for RM HA
> -
>
> Key: YARN-7497
> URL: https://issues.apache.org/jira/browse/YARN-7497
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: yarn
>Reporter: Jiandan Yang 
> Attachments: YARN-7497.001.patch, YARN-7497.002.patch, 
> YARN-7497.003.patch, YARN-7497.004.patch
>
>
> YARN-5947 add LeveldbConfigurationStore using Leveldb as backing store, but 
> it does not support Yarn RM HA. 
> YARN-6840 supports RM HA, but too many scheduler configurations may exceed 
> znode limit, for example 10 thousand queues.
> HDFSSchedulerConfigurationStore store conf file in HDFS, when RM failover, 
> new active RM can load scheduler configuration from HDFS.






[jira] [Updated] (YARN-7497) Add HDFSSchedulerConfigurationStore for RM HA

2017-11-15 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-7497:

Attachment: YARN-7497.003.patch

Fix UT and whitespace errors.

> Add HDFSSchedulerConfigurationStore for RM HA
> -
>
> Key: YARN-7497
> URL: https://issues.apache.org/jira/browse/YARN-7497
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: yarn
>Reporter: Jiandan Yang 
> Attachments: YARN-7497.001.patch, YARN-7497.002.patch, 
> YARN-7497.003.patch
>
>
> YARN-5947 add LeveldbConfigurationStore using Leveldb as backing store, but 
> it does not support Yarn RM HA. 
> YARN-6840 supports RM HA, but too many scheduler configurations may exceed 
> znode limit, for example 10 thousand queues.
> HDFSSchedulerConfigurationStore store conf file in HDFS, when RM failover, 
> new active RM can load scheduler configuration from HDFS.






[jira] [Updated] (YARN-7497) Add HDFSSchedulerConfigurationStore for RM HA

2017-11-15 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-7497:

Attachment: YARN-7497.002.patch

upload v2 patch

> Add HDFSSchedulerConfigurationStore for RM HA
> -
>
> Key: YARN-7497
> URL: https://issues.apache.org/jira/browse/YARN-7497
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: yarn
>Reporter: Jiandan Yang 
> Attachments: YARN-7497.001.patch, YARN-7497.002.patch
>
>
> YARN-5947 add LeveldbConfigurationStore using Leveldb as backing store, but 
> it does not support Yarn RM HA. 
> YARN-6840 supports RM HA, but too many scheduler configurations may exceed 
> znode limit, for example 10 thousand queues.
> HDFSSchedulerConfigurationStore store conf file in HDFS, when RM failover, 
> new active RM can load scheduler configuration from HDFS.






[jira] [Commented] (YARN-5636) Support reserving resources on certain nodes for certain applications

2017-12-11 Thread Jiandan Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16287229#comment-16287229
 ] 

Jiandan Yang  commented on YARN-5636:
-

[~Tao Jie] I think your solution is good. Could you provide a patch for review?

> Support reserving resources on certain nodes for certain applications
> -
>
> Key: YARN-5636
> URL: https://issues.apache.org/jira/browse/YARN-5636
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: scheduler
>Reporter: Tao Jie
>
> We have met the following circumstance:
> We are trying to run Storm on YARN via Slider, and Storm writes data to the local disk 
> on each node. If some containers or the application fail, we expect those containers to 
> restart on the same nodes they ran on before; otherwise the data written locally would 
> be lost.
> Slider tries to ensure that restarted containers land on the same nodes as before. 
> However, in YARN the resources may be assigned to other applications while the former 
> long-running application is down.
> As a result, it would be better to have a mechanism that reserves some resources for 
> certain long-running applications on certain nodes for a period of time. Does it make 
> sense?






[jira] [Commented] (YARN-7715) Support NM promotion/demotion of running containers.

2018-05-15 Thread Jiandan Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16475703#comment-16475703
 ] 

Jiandan Yang  commented on YARN-7715:
-

[~miklos.szeg...@cloudera.com] How is the AM informed if updating the cgroup resources fails?

> Support NM promotion/demotion of running containers.
> 
>
> Key: YARN-7715
> URL: https://issues.apache.org/jira/browse/YARN-7715
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Arun Suresh
>Assignee: Miklos Szegedi
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: YARN-7715.000.patch, YARN-7715.001.patch, 
> YARN-7715.002.patch, YARN-7715.003.patch, YARN-7715.004.patch
>
>
> In YARN-6673 and YARN-6674, the cgroups resource handlers update the cgroups 
> params for the containers, based on opportunistic or guaranteed, in the 
> *preStart* method.
> Now that YARN-5085 is in, Container executionType (as well as the cpu, memory 
> and any other resources) can be updated after the container has started. This 
> means we need the ability to change cgroups params after container start.






[jira] [Commented] (YARN-7715) Support NM promotion/demotion of running containers.

2018-05-16 Thread Jiandan Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16476961#comment-16476961
 ] 

Jiandan Yang  commented on YARN-7715:
-

Thanks [~miklos.szeg...@cloudera.com]
Updating the execution type also requires updating the cgroup parameters (cfs_period_us, 
cfs_quota_us, shares), but the AM is not notified when the cgroup update fails.

Container recovery will fail when the NM restarts if the updated execution type is not 
stored.
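
For context, a promotion or demotion ultimately comes down to rewriting a few cgroup files for the container. The sketch below is purely illustrative (the cgroup path and values are assumptions, not the actual resource-handler code); the concern above is that when any of these writes fails, nothing reports the failure back to the AM.

{code:java}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Illustrative only: cgroup writes implied by promoting a container to GUARANTEED. */
public class CgroupPromotionSketch {
  public static void main(String[] args) throws IOException {
    // Hypothetical CPU cgroup path for one container.
    String cgroup = "/sys/fs/cgroup/cpu/hadoop-yarn/container_e01_0001_01_000002";
    Files.write(Paths.get(cgroup, "cpu.shares"),
        "1024".getBytes(StandardCharsets.UTF_8));
    Files.write(Paths.get(cgroup, "cpu.cfs_period_us"),
        "100000".getBytes(StandardCharsets.UTF_8));
    // -1 removes the CPU bandwidth cap that an OPPORTUNISTIC container had.
    Files.write(Paths.get(cgroup, "cpu.cfs_quota_us"),
        "-1".getBytes(StandardCharsets.UTF_8));
    // If any of these writes throws, the container keeps running with stale
    // limits and, as noted above, the AM is never told about the failure.
  }
}
{code}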


> Support NM promotion/demotion of running containers.
> 
>
> Key: YARN-7715
> URL: https://issues.apache.org/jira/browse/YARN-7715
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Arun Suresh
>Assignee: Miklos Szegedi
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: YARN-7715.000.patch, YARN-7715.001.patch, 
> YARN-7715.002.patch, YARN-7715.003.patch, YARN-7715.004.patch
>
>
> In YARN-6673 and YARN-6674, the cgroups resource handlers update the cgroups 
> params for the containers, based on opportunistic or guaranteed, in the 
> *preStart* method.
> Now that YARN-5085 is in, Container executionType (as well as the cpu, memory 
> and any other resources) can be updated after the container has started. This 
> means we need the ability to change cgroups params after container start.






[jira] [Updated] (YARN-8320) Add support CPU isolation for latency-sensitive (LS) tasks

2018-05-18 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-8320:

Description: 
Currently NodeManager uses “cpu.cfs_period_us”, “cpu.cfs_quota_us” and “cpu.shares” to 
isolate CPU resources. However,
* Linux Completely Fair Scheduling (CFS) is a throughput-oriented scheduler with no 
support for differentiated latency.
* The request latency of services running in containers may fluctuate frequently when 
all containers share CPUs, which latency-sensitive services cannot afford in our 
production environment.

So we need finer-grained CPU isolation.

My co-workers and I propose a solution that uses the cgroup cpuset controller to bind 
containers to different processors, following a [Google 
PPT|http://schd.ws/hosted_files/lcccna2016/a7/CAT%20@%20Scale.pdf].
Later I will upload a detailed design doc.
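
To make the proposal concrete, here is a rough, purely illustrative sketch of the cpuset writes that binding one container to dedicated cores implies (the cgroup path and core lists are assumptions, not part of the design):

{code:java}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Illustrative only: pin one container's cgroup to dedicated CPU cores. */
public class CpusetBindingSketch {
  public static void main(String[] args) throws IOException {
    // Hypothetical cpuset hierarchy created by the NodeManager.
    String cgroup = "/sys/fs/cgroup/cpuset/hadoop-yarn/container_e01_0001_01_000002";
    // A latency-sensitive container gets exclusive cores 4-7; other containers
    // would be restricted to the remaining, shared cores.
    Files.write(Paths.get(cgroup, "cpuset.cpus"),
        "4-7".getBytes(StandardCharsets.UTF_8));
    // cpuset.mems must also be set before tasks can be attached to the cgroup.
    Files.write(Paths.get(cgroup, "cpuset.mems"),
        "0".getBytes(StandardCharsets.UTF_8));
  }
}
{code}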

 


  was:
Currently NodeManager uses “cpu.cfs_period_us”, “cpu.cfs_quota_us” and 
“cpu.shares”  to isolate cpu resource. However,
* Linux Completely Fair Scheduling (CFS) is a throughput-oriented scheduler; no 
support for differentiated latency
* Request latency of services running on container may be frequent shake when 
all containers share cpus, and latency-sensitive services can not afford in our 
production environment.
So we need more finer cpu isolation.
My co-workers and I propose a solution using cgroup cpuset to binds containers 
to different processors according to a [Google’s 
PPT|http://schd.ws/hosted_files/lcccna2016/a7/CAT%20@%20Scale.pdf].
Later I will upload a detailed design doc.

 



> Add support CPU isolation for latency-sensitive  (LS) tasks
> ---
>
> Key: YARN-8320
> URL: https://issues.apache.org/jira/browse/YARN-8320
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager
>Reporter: Jiandan Yang 
>Priority: Major
>
> Currently NodeManager uses “cpu.cfs_period_us”, “cpu.cfs_quota_us” and 
> “cpu.shares”  to isolate cpu resource. However,
> * Linux Completely Fair Scheduling (CFS) is a throughput-oriented scheduler; 
> no support for differentiated latency
> * Request latency of services running on container may be frequent shake when 
> all containers share cpus, and latency-sensitive services can not afford in 
> our production environment.
> So we need more finer cpu isolation.
> My co-workers and I propose a solution using cgroup cpuset to binds 
> containers to different processors according to a [Google’s 
> PPT|http://schd.ws/hosted_files/lcccna2016/a7/CAT%20@%20Scale.pdf].
> Later I will upload a detailed design doc.
>  






[jira] [Updated] (YARN-8320) Add support CPU isolation for latency-sensitive (LS) tasks

2018-05-18 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-8320:

Attachment: CPU-isolation-for-latency-sensitive-services-v1.pdf

> Add support CPU isolation for latency-sensitive  (LS) tasks
> ---
>
> Key: YARN-8320
> URL: https://issues.apache.org/jira/browse/YARN-8320
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager
>Reporter: Jiandan Yang 
>Priority: Major
> Attachments: CPU-isolation-for-latency-sensitive-services-v1.pdf
>
>
> Currently NodeManager uses “cpu.cfs_period_us”, “cpu.cfs_quota_us” and 
> “cpu.shares”  to isolate cpu resource. However,
> * Linux Completely Fair Scheduling (CFS) is a throughput-oriented scheduler; 
> no support for differentiated latency
> * Request latency of services running on container may be frequent shake when 
> all containers share cpus, and latency-sensitive services can not afford in 
> our production environment.
> So we need more finer cpu isolation.
> My co-workers and I propose a solution using cgroup cpuset to binds 
> containers to different processors according to a [Google’s 
> PPT|http://schd.ws/hosted_files/lcccna2016/a7/CAT%20@%20Scale.pdf].
> Later I will upload a detailed design doc.
>  






[jira] [Updated] (YARN-8320) Add support CPU isolation for latency-sensitive (LS) tasks

2018-05-18 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-8320:

Attachment: (was: CPU-isolation-for-latency-sensitive-services-v1.pdf)

> Add support CPU isolation for latency-sensitive  (LS) tasks
> ---
>
> Key: YARN-8320
> URL: https://issues.apache.org/jira/browse/YARN-8320
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager
>Reporter: Jiandan Yang 
>Priority: Major
> Attachments: CPU-isolation-for-latency-sensitive-services-v1.pdf
>
>
> Currently NodeManager uses “cpu.cfs_period_us”, “cpu.cfs_quota_us” and 
> “cpu.shares”  to isolate cpu resource. However,
> * Linux Completely Fair Scheduling (CFS) is a throughput-oriented scheduler; 
> no support for differentiated latency
> * Request latency of services running on container may be frequent shake when 
> all containers share cpus, and latency-sensitive services can not afford in 
> our production environment.
> So we need more finer cpu isolation.
> My co-workers and I propose a solution using cgroup cpuset to binds 
> containers to different processors according to a [Google’s 
> PPT|http://schd.ws/hosted_files/lcccna2016/a7/CAT%20@%20Scale.pdf].
> Later I will upload a detailed design doc.
>  






[jira] [Updated] (YARN-8320) Add support CPU isolation for latency-sensitive (LS) tasks

2018-05-18 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-8320:

Attachment: CPU-isolation-for-latency-sensitive-services-v1.pdf

> Add support CPU isolation for latency-sensitive  (LS) tasks
> ---
>
> Key: YARN-8320
> URL: https://issues.apache.org/jira/browse/YARN-8320
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager
>Reporter: Jiandan Yang 
>Priority: Major
> Attachments: CPU-isolation-for-latency-sensitive-services-v1.pdf
>
>
> Currently NodeManager uses “cpu.cfs_period_us”, “cpu.cfs_quota_us” and 
> “cpu.shares”  to isolate cpu resource. However,
> * Linux Completely Fair Scheduling (CFS) is a throughput-oriented scheduler; 
> no support for differentiated latency
> * Request latency of services running on container may be frequent shake when 
> all containers share cpus, and latency-sensitive services can not afford in 
> our production environment.
> So we need more finer cpu isolation.
> My co-workers and I propose a solution using cgroup cpuset to binds 
> containers to different processors according to a [Google’s 
> PPT|http://schd.ws/hosted_files/lcccna2016/a7/CAT%20@%20Scale.pdf].
> Later I will upload a detailed design doc.
>  






[jira] [Commented] (YARN-8320) Add support CPU isolation for latency-sensitive (LS) tasks

2018-05-18 Thread Jiandan Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480529#comment-16480529
 ] 

Jiandan Yang  commented on YARN-8320:
-

Uploaded design doc v1.
Please feel free to let me know your questions/comments. If everyone agrees with the 
general approach, I will go ahead and create a patch.

> Add support CPU isolation for latency-sensitive  (LS) tasks
> ---
>
> Key: YARN-8320
> URL: https://issues.apache.org/jira/browse/YARN-8320
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager
>Reporter: Jiandan Yang 
>Priority: Major
> Attachments: CPU-isolation-for-latency-sensitive-services-v1.pdf
>
>
> Currently NodeManager uses “cpu.cfs_period_us”, “cpu.cfs_quota_us” and 
> “cpu.shares”  to isolate cpu resource. However,
> * Linux Completely Fair Scheduling (CFS) is a throughput-oriented scheduler; 
> no support for differentiated latency
> * Request latency of services running on container may be frequent shake when 
> all containers share cpus, and latency-sensitive services can not afford in 
> our production environment.
> So we need more finer cpu isolation.
> My co-workers and I propose a solution using cgroup cpuset to binds 
> containers to different processors according to a [Google’s 
> PPT|http://schd.ws/hosted_files/lcccna2016/a7/CAT%20@%20Scale.pdf].
> Later I will upload a detailed design doc.
>  






[jira] [Updated] (YARN-8320) Add support CPU isolation for latency-sensitive (LS) service

2018-05-18 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-8320:

Summary: Add support CPU isolation for latency-sensitive  (LS) service  
(was: Add support CPU isolation for latency-sensitive  (LS) tasks)

> Add support CPU isolation for latency-sensitive  (LS) service
> -
>
> Key: YARN-8320
> URL: https://issues.apache.org/jira/browse/YARN-8320
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager
>Reporter: Jiandan Yang 
>Priority: Major
> Attachments: CPU-isolation-for-latency-sensitive-services-v1.pdf
>
>
> Currently NodeManager uses “cpu.cfs_period_us”, “cpu.cfs_quota_us” and 
> “cpu.shares”  to isolate cpu resource. However,
> * Linux Completely Fair Scheduling (CFS) is a throughput-oriented scheduler; 
> no support for differentiated latency
> * Request latency of services running on container may be frequent shake when 
> all containers share cpus, and latency-sensitive services can not afford in 
> our production environment.
> So we need more finer cpu isolation.
> My co-workers and I propose a solution using cgroup cpuset to binds 
> containers to different processors according to a [Google’s 
> PPT|http://schd.ws/hosted_files/lcccna2016/a7/CAT%20@%20Scale.pdf].
> Later I will upload a detailed design doc.
>  






[jira] [Updated] (YARN-8320) Add support CPU isolation for latency-sensitive (LS) tasks

2018-05-18 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-8320:

Attachment: CPU-isolation-for-latency-sensitive-services-v1.pdf

> Add support CPU isolation for latency-sensitive  (LS) tasks
> ---
>
> Key: YARN-8320
> URL: https://issues.apache.org/jira/browse/YARN-8320
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager
>Reporter: Jiandan Yang 
>Priority: Major
> Attachments: CPU-isolation-for-latency-sensitive-services-v1.pdf
>
>
> Currently NodeManager uses “cpu.cfs_period_us”, “cpu.cfs_quota_us” and 
> “cpu.shares”  to isolate cpu resource. However,
> * Linux Completely Fair Scheduling (CFS) is a throughput-oriented scheduler; 
> no support for differentiated latency
> * Request latency of services running on container may be frequent shake when 
> all containers share cpus, and latency-sensitive services can not afford in 
> our production environment.
> So we need more finer cpu isolation.
> My co-workers and I propose a solution using cgroup cpuset to binds 
> containers to different processors according to a [Google’s 
> PPT|http://schd.ws/hosted_files/lcccna2016/a7/CAT%20@%20Scale.pdf].
> Later I will upload a detailed design doc.
>  






[jira] [Created] (YARN-8320) Add support CPU isolation for latency-sensitive (LS) tasks

2018-05-18 Thread Jiandan Yang (JIRA)
Jiandan Yang  created YARN-8320:
---

 Summary: Add support CPU isolation for latency-sensitive  (LS) 
tasks
 Key: YARN-8320
 URL: https://issues.apache.org/jira/browse/YARN-8320
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: nodemanager
Reporter: Jiandan Yang 


Currently NodeManager uses “cpu.cfs_period_us”, “cpu.cfs_quota_us” and 
“cpu.shares”  to isolate cpu resource. However,
* Linux Completely Fair Scheduling (CFS) is a throughput-oriented scheduler; no 
support for differentiated latency
* Request latency of services running on container may be frequent shake when 
all containers share cpus, and latency-sensitive services can not afford in our 
production environment.
So we need more finer cpu isolation.
My co-workers and I propose a solution using cgroup cpuset to binds containers 
to different processors according to a [Google’s 
PPT|http://schd.ws/hosted_files/lcccna2016/a7/CAT%20@%20Scale.pdf].
Later I will upload a detailed design doc.

 







[jira] [Updated] (YARN-8320) Add support CPU isolation for latency-sensitive (LS) tasks

2018-05-18 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-8320:

Attachment: (was: CPU-isolation-for-latency-sensitive-services-v1.pdf)

> Add support CPU isolation for latency-sensitive  (LS) tasks
> ---
>
> Key: YARN-8320
> URL: https://issues.apache.org/jira/browse/YARN-8320
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager
>Reporter: Jiandan Yang 
>Priority: Major
>
> Currently NodeManager uses “cpu.cfs_period_us”, “cpu.cfs_quota_us” and 
> “cpu.shares”  to isolate cpu resource. However,
> * Linux Completely Fair Scheduling (CFS) is a throughput-oriented scheduler; 
> no support for differentiated latency
> * Request latency of services running on container may be frequent shake when 
> all containers share cpus, and latency-sensitive services can not afford in 
> our production environment.
> So we need more finer cpu isolation.
> My co-workers and I propose a solution using cgroup cpuset to binds 
> containers to different processors according to a [Google’s 
> PPT|http://schd.ws/hosted_files/lcccna2016/a7/CAT%20@%20Scale.pdf].
> Later I will upload a detailed design doc.
>  






[jira] [Commented] (YARN-8320) Add support CPU isolation for latency-sensitive (LS) service

2018-05-21 Thread Jiandan Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16482336#comment-16482336
 ] 

Jiandan Yang  commented on YARN-8320:
-

Uploaded v1 patch to initiate discussion.

> Add support CPU isolation for latency-sensitive  (LS) service
> -
>
> Key: YARN-8320
> URL: https://issues.apache.org/jira/browse/YARN-8320
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager
>Reporter: Jiandan Yang 
>Priority: Major
> Attachments: CPU-isolation-for-latency-sensitive-services-v1.pdf, 
> YARN-8320.001.patch
>
>
> Currently NodeManager uses “cpu.cfs_period_us”, “cpu.cfs_quota_us” and 
> “cpu.shares” to isolate CPU resources. However,
>  * Linux Completely Fair Scheduling (CFS) is a throughput-oriented scheduler; it 
> offers no support for differentiated latency.
>  * Request latency of services running in a container may fluctuate sharply 
> when all containers share CPUs, which latency-sensitive services cannot afford 
> in our production environment.
> So we need finer-grained CPU isolation.
> My co-workers and I propose a solution using cgroup cpuset to bind containers 
> to different processors, inspired by the isolation technique in the [Borg 
> system|http://schd.ws/hosted_files/lcccna2016/a7/CAT%20@%20Scale.pdf].
>  Later I will upload a detailed design doc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8320) Add support CPU isolation for latency-sensitive (LS) service

2018-05-21 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-8320:

Attachment: YARN-8320.001.patch

> Add support CPU isolation for latency-sensitive  (LS) service
> -
>
> Key: YARN-8320
> URL: https://issues.apache.org/jira/browse/YARN-8320
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager
>Reporter: Jiandan Yang 
>Priority: Major
> Attachments: CPU-isolation-for-latency-sensitive-services-v1.pdf, 
> YARN-8320.001.patch
>
>
> Currently NodeManager uses “cpu.cfs_period_us”, “cpu.cfs_quota_us” and 
> “cpu.shares” to isolate CPU resources. However,
>  * Linux Completely Fair Scheduling (CFS) is a throughput-oriented scheduler; it 
> offers no support for differentiated latency.
>  * Request latency of services running in a container may fluctuate sharply 
> when all containers share CPUs, which latency-sensitive services cannot afford 
> in our production environment.
> So we need finer-grained CPU isolation.
> My co-workers and I propose a solution using cgroup cpuset to bind containers 
> to different processors, inspired by the isolation technique in the [Borg 
> system|http://schd.ws/hosted_files/lcccna2016/a7/CAT%20@%20Scale.pdf].
>  Later I will upload a detailed design doc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8320) Add support CPU isolation for latency-sensitive (LS) service

2018-05-21 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-8320:

Attachment: (was: YARN-8320.001.patch)

> Add support CPU isolation for latency-sensitive  (LS) service
> -
>
> Key: YARN-8320
> URL: https://issues.apache.org/jira/browse/YARN-8320
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager
>Reporter: Jiandan Yang 
>Priority: Major
> Attachments: CPU-isolation-for-latency-sensitive-services-v1.pdf, 
> YARN-8320.001.patch
>
>
> Currently NodeManager uses “cpu.cfs_period_us”, “cpu.cfs_quota_us” and 
> “cpu.shares” to isolate CPU resources. However,
>  * Linux Completely Fair Scheduling (CFS) is a throughput-oriented scheduler; it 
> offers no support for differentiated latency.
>  * Request latency of services running in a container may fluctuate sharply 
> when all containers share CPUs, which latency-sensitive services cannot afford 
> in our production environment.
> So we need finer-grained CPU isolation.
> My co-workers and I propose a solution using cgroup cpuset to bind containers 
> to different processors, inspired by the isolation technique in the [Borg 
> system|http://schd.ws/hosted_files/lcccna2016/a7/CAT%20@%20Scale.pdf].
>  Later I will upload a detailed design doc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8320) Add support CPU isolation for latency-sensitive (LS) service

2018-05-21 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-8320:

Attachment: (was: CPU-isolation-for-latency-sensitive-services-v1.pdf)

> Add support CPU isolation for latency-sensitive  (LS) service
> -
>
> Key: YARN-8320
> URL: https://issues.apache.org/jira/browse/YARN-8320
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager
>Reporter: Jiandan Yang 
>Priority: Major
> Attachments: CPU-isolation-for-latency-sensitive-services-v1.pdf, 
> YARN-8320.001.patch
>
>
> Currently NodeManager uses “cpu.cfs_period_us”, “cpu.cfs_quota_us” and 
> “cpu.shares” to isolate CPU resources. However,
>  * Linux Completely Fair Scheduling (CFS) is a throughput-oriented scheduler; it 
> offers no support for differentiated latency.
>  * Request latency of services running in a container may fluctuate sharply 
> when all containers share CPUs, which latency-sensitive services cannot afford 
> in our production environment.
> So we need finer-grained CPU isolation.
> My co-workers and I propose a solution using cgroup cpuset to bind containers 
> to different processors, inspired by the isolation technique in the [Borg 
> system|http://schd.ws/hosted_files/lcccna2016/a7/CAT%20@%20Scale.pdf].
>  Later I will upload a detailed design doc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8320) Add support CPU isolation for latency-sensitive (LS) service

2018-05-21 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-8320:

Attachment: CPU-isolation-for-latency-sensitive-services-v1.pdf

> Add support CPU isolation for latency-sensitive  (LS) service
> -
>
> Key: YARN-8320
> URL: https://issues.apache.org/jira/browse/YARN-8320
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager
>Reporter: Jiandan Yang 
>Priority: Major
> Attachments: CPU-isolation-for-latency-sensitive-services-v1.pdf, 
> YARN-8320.001.patch
>
>
> Currently NodeManager uses “cpu.cfs_period_us”, “cpu.cfs_quota_us” and 
> “cpu.shares” to isolate CPU resources. However,
>  * Linux Completely Fair Scheduling (CFS) is a throughput-oriented scheduler; it 
> offers no support for differentiated latency.
>  * Request latency of services running in a container may fluctuate sharply 
> when all containers share CPUs, which latency-sensitive services cannot afford 
> in our production environment.
> So we need finer-grained CPU isolation.
> My co-workers and I propose a solution using cgroup cpuset to bind containers 
> to different processors, inspired by the isolation technique in the [Borg 
> system|http://schd.ws/hosted_files/lcccna2016/a7/CAT%20@%20Scale.pdf].
>  Later I will upload a detailed design doc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8320) Support CPU isolation for latency-sensitive (LS) service

2018-05-23 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-8320:

Attachment: CPU-isolation-for-latency-sensitive-services-v2.pdf

> Support CPU isolation for latency-sensitive (LS) service
> 
>
> Key: YARN-8320
> URL: https://issues.apache.org/jira/browse/YARN-8320
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager
>Reporter: Jiandan Yang 
>Priority: Major
> Attachments: CPU-isolation-for-latency-sensitive-services-v1.pdf, 
> CPU-isolation-for-latency-sensitive-services-v2.pdf, YARN-8320.001.patch
>
>
> Currently NodeManager uses “cpu.cfs_period_us”, “cpu.cfs_quota_us” and 
> “cpu.shares” to isolate CPU resources. However,
>  * Linux Completely Fair Scheduling (CFS) is a throughput-oriented scheduler; it 
> offers no support for differentiated latency.
>  * Request latency of services running in a container may fluctuate sharply 
> when all containers share CPUs, which latency-sensitive services cannot afford 
> in our production environment.
> So we need more fine-grained CPU isolation.
> Here we propose a solution using cgroup cpuset to bind containers to different 
> processors, inspired by the isolation technique in the [Borg 
> system|http://schd.ws/hosted_files/lcccna2016/a7/CAT%20@%20Scale.pdf].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8320) Support CPU isolation for latency-sensitive (LS) service

2018-05-23 Thread Jiandan Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487066#comment-16487066
 ] 

Jiandan Yang  commented on YARN-8320:
-

[~cheersyang] and I discussed the design offline. I added more details in the v2 
design doc.

> Support CPU isolation for latency-sensitive (LS) service
> 
>
> Key: YARN-8320
> URL: https://issues.apache.org/jira/browse/YARN-8320
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager
>Reporter: Jiandan Yang 
>Priority: Major
> Attachments: CPU-isolation-for-latency-sensitive-services-v1.pdf, 
> YARN-8320.001.patch
>
>
> Currently NodeManager uses “cpu.cfs_period_us”, “cpu.cfs_quota_us” and 
> “cpu.shares” to isolate CPU resources. However,
>  * Linux Completely Fair Scheduling (CFS) is a throughput-oriented scheduler; it 
> offers no support for differentiated latency.
>  * Request latency of services running in a container may fluctuate sharply 
> when all containers share CPUs, which latency-sensitive services cannot afford 
> in our production environment.
> So we need more fine-grained CPU isolation.
> Here we propose a solution using cgroup cpuset to bind containers to different 
> processors, inspired by the isolation technique in the [Borg 
> system|http://schd.ws/hosted_files/lcccna2016/a7/CAT%20@%20Scale.pdf].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7715) Support NM promotion/demotion of running containers.

2018-05-15 Thread Jiandan Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16476776#comment-16476776
 ] 

Jiandan Yang  commented on YARN-7715:
-

Hi, [~miklos.szeg...@cloudera.com] 
Thanks for your reply. I mean the AM does not know when the NM fails to update the resource.
Consider the following case:
1. The AM increases vcores via updateContainer.
2. The NM fails to update the cgroup while executing 
CGroupsCpuResourceHandlerImpl#updateContainer.
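
To make the concern concrete, a minimal, self-contained illustration of this 
case follows (plain Java, not YARN code; the cgroup path and values are assumed):
{code:java}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Illustrative only: the RM and AM have already agreed on the larger vcore value,
// but the node-local cgroup write is the step that can still fail, and the AM
// gets no signal about it.
public class CgroupUpdateFailureSketch {
  public static void main(String[] args) {
    long newQuotaUs = 400000L; // e.g. 4 vcores with a 100ms cfs period (assumed values)
    String quotaFile =
        "/sys/fs/cgroup/cpu/hadoop-yarn/container_0001_01_000002/cpu.cfs_quota_us";
    try {
      Files.write(Paths.get(quotaFile),
          Long.toString(newQuotaUs).getBytes(StandardCharsets.UTF_8));
      System.out.println("cgroup quota updated to " + newQuotaUs);
    } catch (IOException e) {
      // The container is still capped at the old quota here, but nothing in the
      // allocate response tells the AM that the increase did not take effect.
      System.err.println("cgroup update failed: " + e.getMessage());
    }
  }
}
{code}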

And another question: the updated containers need to be stored, but I did not 
find the related code in your patch.

> Support NM promotion/demotion of running containers.
> 
>
> Key: YARN-7715
> URL: https://issues.apache.org/jira/browse/YARN-7715
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Arun Suresh
>Assignee: Miklos Szegedi
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: YARN-7715.000.patch, YARN-7715.001.patch, 
> YARN-7715.002.patch, YARN-7715.003.patch, YARN-7715.004.patch
>
>
> In YARN-6673 and YARN-6674, the cgroups resource handlers update the cgroups 
> params for the containers, based on opportunistic or guaranteed, in the 
> *preStart* method.
> Now that YARN-5085 is in, Container executionType (as well as the cpu, memory 
> and any other resources) can be updated after the container has started. This 
> means we need the ability to change cgroups params after container start.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8320) Add support CPU isolation for latency-sensitive (LS) service

2018-05-20 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-8320:

Attachment: CPU-isolation-for-latency-sensitive-services-v1.pdf

> Add support CPU isolation for latency-sensitive  (LS) service
> -
>
> Key: YARN-8320
> URL: https://issues.apache.org/jira/browse/YARN-8320
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager
>Reporter: Jiandan Yang 
>Priority: Major
> Attachments: CPU-isolation-for-latency-sensitive-services-v1.pdf
>
>
> Currently NodeManager uses “cpu.cfs_period_us”, “cpu.cfs_quota_us” and 
> “cpu.shares” to isolate CPU resources. However,
>  * Linux Completely Fair Scheduling (CFS) is a throughput-oriented scheduler; it 
> offers no support for differentiated latency.
>  * Request latency of services running in a container may fluctuate sharply 
> when all containers share CPUs, which latency-sensitive services cannot afford 
> in our production environment.
> So we need finer-grained CPU isolation.
> My co-workers and I propose a solution using cgroup cpuset to bind containers 
> to different processors, inspired by the isolation technique in the [Borg 
> system|http://schd.ws/hosted_files/lcccna2016/a7/CAT%20@%20Scale.pdf].
>  Later I will upload a detailed design doc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8320) Add support CPU isolation for latency-sensitive (LS) service

2018-05-20 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-8320:

Attachment: (was: CPU-isolation-for-latency-sensitive-services-v1.pdf)

> Add support CPU isolation for latency-sensitive  (LS) service
> -
>
> Key: YARN-8320
> URL: https://issues.apache.org/jira/browse/YARN-8320
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager
>Reporter: Jiandan Yang 
>Priority: Major
> Attachments: CPU-isolation-for-latency-sensitive-services-v1.pdf
>
>
> Currently NodeManager uses “cpu.cfs_period_us”, “cpu.cfs_quota_us” and 
> “cpu.shares” to isolate CPU resources. However,
>  * Linux Completely Fair Scheduling (CFS) is a throughput-oriented scheduler; it 
> offers no support for differentiated latency.
>  * Request latency of services running in a container may fluctuate sharply 
> when all containers share CPUs, which latency-sensitive services cannot afford 
> in our production environment.
> So we need finer-grained CPU isolation.
> My co-workers and I propose a solution using cgroup cpuset to bind containers 
> to different processors, inspired by the isolation technique in the [Borg 
> system|http://schd.ws/hosted_files/lcccna2016/a7/CAT%20@%20Scale.pdf].
>  Later I will upload a detailed design doc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-6673) Add cpu cgroup configurations for opportunistic containers

2018-01-04 Thread Jiandan Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16312473#comment-16312473
 ] 

Jiandan Yang  edited comment on YARN-6673 at 1/5/18 4:49 AM:
-

[~miklos.szeg...@cloudera.com] How about setting the CPU share for an Opportunistic 
container to 
*CPU_DEFAULT_WEIGHT_OPPORTUNISTIC * containerVCores*?


was (Author: yangjiandan):
[~miklos.szeg...@cloudera.com] How about setting Cpu share for Opportunistic 
container 
* CPU_DEFAULT_WEIGHT_OPPORTUNISTIC * containerVCores* 

> Add cpu cgroup configurations for opportunistic containers
> --
>
> Key: YARN-6673
> URL: https://issues.apache.org/jira/browse/YARN-6673
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Haibo Chen
>Assignee: Miklos Szegedi
> Fix For: 3.0.0-beta1
>
> Attachments: YARN-6673.000.patch
>
>
> In addition to setting cpu.cfs_period_us on a per-container basis, we could 
> also set cpu.shares to 2 for opportunistic containers so they are run on a 
> best-effort basis



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6673) Add cpu cgroup configurations for opportunistic containers

2018-01-04 Thread Jiandan Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16312473#comment-16312473
 ] 

Jiandan Yang  commented on YARN-6673:
-

[~miklos.szeg...@cloudera.com] How about setting the CPU share for an Opportunistic 
container to
* CPU_DEFAULT_WEIGHT_OPPORTUNISTIC * containerVCores* 
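
A tiny sketch of that computation (the constant's value of 2 is an assumption 
here, matching the flat best-effort weight discussed in this issue, not code 
taken from the patch):
{code:java}
// Sketch of the suggestion: scale the opportunistic cpu.shares by the container's
// vcores instead of using one flat value.
public class OpportunisticSharesSketch {
  static final int CPU_DEFAULT_WEIGHT_OPPORTUNISTIC = 2;

  static int opportunisticShares(int containerVCores) {
    return CPU_DEFAULT_WEIGHT_OPPORTUNISTIC * containerVCores;
  }

  public static void main(String[] args) {
    // A 4-vcore opportunistic container would get cpu.shares = 8 instead of a flat 2.
    System.out.println(opportunisticShares(4));
  }
}
{code}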

> Add cpu cgroup configurations for opportunistic containers
> --
>
> Key: YARN-6673
> URL: https://issues.apache.org/jira/browse/YARN-6673
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Haibo Chen
>Assignee: Miklos Szegedi
> Fix For: 3.0.0-beta1
>
> Attachments: YARN-6673.000.patch
>
>
> In addition to setting cpu.cfs_period_us on a per-container basis, we could 
> also set cpu.shares to 2 for opportunistic containers so they are run on a 
> best-effort basis



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7693) ContainersMonitor support configurable

2018-01-05 Thread Jiandan Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16313097#comment-16313097
 ] 

Jiandan Yang  commented on YARN-7693:
-

[~miklos.szeg...@cloudera.com] Opportunistic Containers may impact Guaranteed 
Containers when they are under the same cgroup.
memory.soft_limit_in_bytes is best-effort and not guaranteed. Consider the 
following steps:
1. Memory utilization of the Guaranteed Containers on a NodeManager is very low; 
real memory usage is below the allocation due to little traffic.
2. The scheduler places some Opportunistic Containers on that NodeManager due to 
oversubscription.
3. Memory utilization of the Guaranteed Containers increases due to heavy 
traffic, yet stays below their hard limit.
4. *hadoop-yarn* exceeds its hard limit.
5. If the oom-killer is enabled, a Guaranteed Container may be killed, which is 
not in line with the principle.
6. If the oom-killer is not enabled, a Guaranteed Container may hang.

So Opportunistic Containers may impact Guaranteed Containers when they are under 
the same cgroup.

If they are under different cgroups, Guaranteed and Opportunistic containers each 
have their own hard limit, and Opportunistic Containers never impact Guaranteed 
Containers.
Monitor the resource utilization of the Guaranteed Containers: if there is a gap 
between allocation and actual usage, hand part of that gap over to the 
Opportunistic group; if the gap is less than a given value, decrease the hard 
limit of the Guaranteed group. Kill containers when adjusting the hard limit 
fails a given number of times, in order to protect the resources of the 
Guaranteed Containers.
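
A minimal sketch of one reading of that adjustment loop follows (the interface, 
thresholds, and the choice to shrink the Opportunistic group's limit when the 
gap closes are assumptions for illustration, not code from any patch):
{code:java}
// Sketch: slack memory of the Guaranteed group is lent to the Opportunistic
// group, and when the slack disappears the Opportunistic limit is shrunk back
// (its containers are killed if shrinking keeps failing).
public class MemoryLimitAdjusterSketch {

  private static final long MIN_GAP_BYTES = 512L * 1024 * 1024; // assumed threshold
  private static final int MAX_FAILED_ADJUSTS = 3;              // assumed retry budget

  private int failedAdjusts = 0;

  void adjustOnce(CgroupGroup guaranteed, CgroupGroup opportunistic,
      long guaranteedAllocationBytes) {
    long gap = guaranteedAllocationBytes - guaranteed.usedBytes();
    if (gap > MIN_GAP_BYTES) {
      // Plenty of slack: lend part of it to the opportunistic group.
      opportunistic.trySetHardLimit(opportunistic.hardLimit() + gap / 2);
      failedAdjusts = 0;
    } else {
      // Slack is gone: shrink the opportunistic group's limit back toward its
      // usage; if that keeps failing, kill an opportunistic container so the
      // guaranteed containers keep their resources.
      long target = Math.max(opportunistic.usedBytes(), opportunistic.hardLimit() / 2);
      if (opportunistic.trySetHardLimit(target)) {
        failedAdjusts = 0;
      } else if (++failedAdjusts >= MAX_FAILED_ADJUSTS) {
        opportunistic.killLargestContainer();
        failedAdjusts = 0;
      }
    }
  }

  // Minimal interface so the sketch is self-contained.
  interface CgroupGroup {
    long hardLimit();
    long usedBytes();
    boolean trySetHardLimit(long bytes);
    void killLargestContainer();
  }
}
{code}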


> ContainersMonitor support configurable
> --
>
> Key: YARN-7693
> URL: https://issues.apache.org/jira/browse/YARN-7693
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Minor
> Attachments: YARN-7693.001.patch, YARN-7693.002.patch
>
>
> Currently ContainersMonitor has only one default implementation, 
> ContainersMonitorImpl.
> After introducing Opportunistic Containers, ContainersMonitor needs to monitor 
> system metrics and even dynamically adjust Opportunistic and Guaranteed 
> resources in the cgroup, so another ContainersMonitor implementation may be 
> needed.
> The current ContainerManagerImpl instantiates ContainersMonitorImpl directly 
> with new, so ContainersMonitor needs to be configurable.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7880) FiCaSchedulerApp.commonCheckContainerAllocation throws NPE when running sls

2018-02-02 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-7880:

Description: 
{code}
18/02/02 20:54:28 INFO rmcontainer.RMContainerImpl: 
container_1517575125794_5707_01_86 Container Transitioned from ACQUIRED to 
RUNNING

java.lang.NullPointerException
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.commonCheckContainerAllocation(FiCaSchedulerApp.java:324)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.accept(FiCaSchedulerApp.java:420)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2506)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:541)
{code}

  was:
{code}
18/02/02 20:54:28 INFO rmcontainer.RMContainerImpl: 
container_1517575125794_5707_01_86 Container Transitioned from ACQUIRED to 
RUNNING

java.lang.NullPointerException

        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.commonCheckContainerAllocation(FiCaSchedulerApp.java:324)

        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.accept(FiCaSchedulerApp.java:420)

        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2506)

        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:541)
{code}


> FiCaSchedulerApp.commonCheckContainerAllocation throws NPE when running sls
> ---
>
> Key: YARN-7880
> URL: https://issues.apache.org/jira/browse/YARN-7880
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jiandan Yang 
>Priority: Major
>
> {code}
> 18/02/02 20:54:28 INFO rmcontainer.RMContainerImpl: 
> container_1517575125794_5707_01_86 Container Transitioned from ACQUIRED 
> to RUNNING
> java.lang.NullPointerException
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.commonCheckContainerAllocation(FiCaSchedulerApp.java:324)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.accept(FiCaSchedulerApp.java:420)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2506)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:541)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-7880) FiCaSchedulerApp.commonCheckContainerAllocation throws NPE when running sls

2018-02-02 Thread Jiandan Yang (JIRA)
Jiandan Yang  created YARN-7880:
---

 Summary: FiCaSchedulerApp.commonCheckContainerAllocation throws 
NPE when running sls
 Key: YARN-7880
 URL: https://issues.apache.org/jira/browse/YARN-7880
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jiandan Yang 


18/02/02 20:54:28 INFO rmcontainer.RMContainerImpl: 
container_1517575125794_5707_01_86 Container Transitioned from ACQUIRED to 
RUNNING

java.lang.NullPointerException

        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.commonCheckContainerAllocation(FiCaSchedulerApp.java:324)

        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.accept(FiCaSchedulerApp.java:420)

        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2506)

        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:541)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7880) FiCaSchedulerApp.commonCheckContainerAllocation throws NPE when running sls

2018-02-02 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-7880:

Description: 
{code}
18/02/02 20:54:28 INFO rmcontainer.RMContainerImpl: 
container_1517575125794_5707_01_86 Container Transitioned from ACQUIRED to 
RUNNING

java.lang.NullPointerException

        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.commonCheckContainerAllocation(FiCaSchedulerApp.java:324)

        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.accept(FiCaSchedulerApp.java:420)

        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2506)

        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:541)
{code}

  was:
18/02/02 20:54:28 INFO rmcontainer.RMContainerImpl: 
container_1517575125794_5707_01_86 Container Transitioned from ACQUIRED to 
RUNNING

java.lang.NullPointerException

        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.commonCheckContainerAllocation(FiCaSchedulerApp.java:324)

        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.accept(FiCaSchedulerApp.java:420)

        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2506)

        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:541)


> FiCaSchedulerApp.commonCheckContainerAllocation throws NPE when running sls
> ---
>
> Key: YARN-7880
> URL: https://issues.apache.org/jira/browse/YARN-7880
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jiandan Yang 
>Priority: Major
>
> {code}
> 18/02/02 20:54:28 INFO rmcontainer.RMContainerImpl: 
> container_1517575125794_5707_01_86 Container Transitioned from ACQUIRED 
> to RUNNING
> java.lang.NullPointerException
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.commonCheckContainerAllocation(FiCaSchedulerApp.java:324)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.accept(FiCaSchedulerApp.java:420)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2506)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:541)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7880) CapacityScheduler$ResourceCommitterService throws NPE when running sls

2018-02-04 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-7880:

Summary: CapacityScheduler$ResourceCommitterService throws NPE when running 
sls  (was: FiCaSchedulerApp.commonCheckContainerAllocation throws NPE when 
running sls)

> CapacityScheduler$ResourceCommitterService throws NPE when running sls
> --
>
> Key: YARN-7880
> URL: https://issues.apache.org/jira/browse/YARN-7880
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jiandan Yang 
>Priority: Major
>
> {code}
> 18/02/02 20:54:28 INFO rmcontainer.RMContainerImpl: 
> container_1517575125794_5707_01_86 Container Transitioned from ACQUIRED 
> to RUNNING
> java.lang.NullPointerException
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.commonCheckContainerAllocation(FiCaSchedulerApp.java:324)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.accept(FiCaSchedulerApp.java:420)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2506)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:541)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7880) CapacityScheduler$ResourceCommitterService throws NPE when running sls

2018-02-04 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-7880:

Description: 
sls test case: node count = 9000, job count = 10k, task num of job = 500, task 
run time = 100s

{code}
18/02/02 20:54:28 INFO rmcontainer.RMContainerImpl: 
container_1517575125794_5707_01_86 Container Transitioned from ACQUIRED to 
RUNNING

java.lang.NullPointerException
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.commonCheckContainerAllocation(FiCaSchedulerApp.java:324)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.accept(FiCaSchedulerApp.java:420)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2506)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:541)
{code}

  was:
{code}
18/02/02 20:54:28 INFO rmcontainer.RMContainerImpl: 
container_1517575125794_5707_01_86 Container Transitioned from ACQUIRED to 
RUNNING

java.lang.NullPointerException
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.commonCheckContainerAllocation(FiCaSchedulerApp.java:324)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.accept(FiCaSchedulerApp.java:420)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2506)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:541)
{code}


> CapacityScheduler$ResourceCommitterService throws NPE when running sls
> --
>
> Key: YARN-7880
> URL: https://issues.apache.org/jira/browse/YARN-7880
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jiandan Yang 
>Priority: Major
>
> sls test case: node count = 9000, job count = 10k, task num of job = 500, task 
> run time = 100s
> {code}
> 18/02/02 20:54:28 INFO rmcontainer.RMContainerImpl: 
> container_1517575125794_5707_01_86 Container Transitioned from ACQUIRED 
> to RUNNING
> java.lang.NullPointerException
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.commonCheckContainerAllocation(FiCaSchedulerApp.java:324)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.accept(FiCaSchedulerApp.java:420)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2506)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:541)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7880) CapacityScheduler$ResourceCommitterService throws NPE when running sls

2018-02-04 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-7880:

Affects Version/s: 3.0.0

> CapacityScheduler$ResourceCommitterService throws NPE when running sls
> --
>
> Key: YARN-7880
> URL: https://issues.apache.org/jira/browse/YARN-7880
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Jiandan Yang 
>Priority: Major
>
> sls test case: node count = 9000, job count = 10k, task num of job = 500, task 
> run time = 100s
> {code}
> 18/02/02 20:54:28 INFO rmcontainer.RMContainerImpl: 
> container_1517575125794_5707_01_86 Container Transitioned from ACQUIRED 
> to RUNNING
> java.lang.NullPointerException
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.commonCheckContainerAllocation(FiCaSchedulerApp.java:324)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.accept(FiCaSchedulerApp.java:420)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2506)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:541)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7880) CapacityScheduler$ResourceCommitterService throws NPE when running sls

2018-02-04 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-7880:

Component/s: yarn

> CapacityScheduler$ResourceCommitterService throws NPE when running sls
> --
>
> Key: YARN-7880
> URL: https://issues.apache.org/jira/browse/YARN-7880
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.0.0
>Reporter: Jiandan Yang 
>Priority: Major
>
> sls test case: node count = 9000, job count = 10k, task num of job = 500, task 
> run time = 100s
> {code}
> 18/02/02 20:54:28 INFO rmcontainer.RMContainerImpl: 
> container_1517575125794_5707_01_86 Container Transitioned from ACQUIRED 
> to RUNNING
> java.lang.NullPointerException
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.commonCheckContainerAllocation(FiCaSchedulerApp.java:324)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.accept(FiCaSchedulerApp.java:420)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2506)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:541)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7880) CapacityScheduler$ResourceCommitterService throws NPE when running sls

2018-02-04 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-7880:

Description: 

sls test case: node count = 9000, job count = 10k, task num of job = 500, task 
run time = 100s
{code}
18/02/02 20:54:28 INFO rmcontainer.RMContainerImpl: 
container_1517575125794_5707_01_86 Container Transitioned from ACQUIRED to 
RUNNING

java.lang.NullPointerException
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.commonCheckContainerAllocation(FiCaSchedulerApp.java:324)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.accept(FiCaSchedulerApp.java:420)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2506)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:541)
{code}
Some CapacityScheduler$AsyncScheduleThread threads also throw an NPE:
{code}
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.getResourceRequests(SchedulerApplicationAttempt.java:1341)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.canAssign(RegularContainerAllocator.java:302)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignOffSwitchContainers(RegularContainerAllocator.java:389)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainersOnNode(RegularContainerAllocator.java:470)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.tryAllocateOnNode(RegularContainerAllocator.java:252)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.allocate(RegularContainerAllocator.java:816)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainers(RegularContainerAllocator.java:854)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.ContainerAllocator.assignContainers(ContainerAllocator.java:54)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.assignContainers(FiCaSchedulerApp.java:856)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:735)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:559)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1343)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1337)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1434)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1199)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:474)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:501)
{code}

  was:
sls test case: node count = 9000, job count=10k,task num of job = 500, task run 
time = 100s

{code}
18/02/02 20:54:28 INFO rmcontainer.RMContainerImpl: 
container_1517575125794_5707_01_86 Container Transitioned from ACQUIRED to 
RUNNING

java.lang.NullPointerException
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.commonCheckContainerAllocation(FiCaSchedulerApp.java:324)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.accept(FiCaSchedulerApp.java:420)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2506)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:541)
{code}


> CapacityScheduler$ResourceCommitterService throws NPE when running sls
> --
>
> Key: YARN-7880
> URL: 

[jira] [Updated] (YARN-7880) CapacityScheduler$ResourceCommitterService throws NPE when running sls

2018-02-04 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-7880:

Description: 
sls test case: node count = 9000, job count = 10k, task num of job = 500, task 
run time = 100s; it does not occur when node count = 500 or 2000.
{code}
18/02/02 20:54:28 INFO rmcontainer.RMContainerImpl: 
container_1517575125794_5707_01_86 Container Transitioned from ACQUIRED to 
RUNNING

java.lang.NullPointerException
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.commonCheckContainerAllocation(FiCaSchedulerApp.java:324)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.accept(FiCaSchedulerApp.java:420)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2506)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:541)
{code}
Some CapacityScheduler$AsyncScheduleThread threads also throw an NPE:
{code}
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.getResourceRequests(SchedulerApplicationAttempt.java:1341)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.canAssign(RegularContainerAllocator.java:302)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignOffSwitchContainers(RegularContainerAllocator.java:389)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainersOnNode(RegularContainerAllocator.java:470)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.tryAllocateOnNode(RegularContainerAllocator.java:252)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.allocate(RegularContainerAllocator.java:816)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainers(RegularContainerAllocator.java:854)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.ContainerAllocator.assignContainers(ContainerAllocator.java:54)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.assignContainers(FiCaSchedulerApp.java:856)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:735)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:559)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1343)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1337)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1434)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1199)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:474)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:501)
{code}

  was:

sls test case: node count = 9000, job count=10k,task num of job = 500, task run 
time = 100s
{code}
18/02/02 20:54:28 INFO rmcontainer.RMContainerImpl: 
container_1517575125794_5707_01_86 Container Transitioned from ACQUIRED to 
RUNNING

java.lang.NullPointerException
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.commonCheckContainerAllocation(FiCaSchedulerApp.java:324)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.accept(FiCaSchedulerApp.java:420)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2506)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:541)
{code}
some CapacityScheduler$AsyncScheduleThread also throws NPE
{code}
java.lang.NullPointerException
at 

[jira] [Updated] (YARN-7880) CapacityScheduler$ResourceCommitterService throws NPE when running sls

2018-02-04 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-7880:

Description: 
sls test case: node count = 9000, job count = 10k, task num of job = 500, task 
run time = 100s; it does not occur when node count = 500 or 2000.
{code}
18/02/02 20:54:28 INFO rmcontainer.RMContainerImpl: 
container_1517575125794_5707_01_86 Container Transitioned from ACQUIRED to 
RUNNING

java.lang.NullPointerException
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.commonCheckContainerAllocation(FiCaSchedulerApp.java:324)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.accept(FiCaSchedulerApp.java:420)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2506)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:541)
{code}
Some CapacityScheduler$AsyncScheduleThread threads also throw an NPE:
{code}
18/02/02 20:40:34 INFO resourcemanager.DefaultAMSProcessor: AM registration 
appattempt_1517575125794_4564_01
18/02/02 20:40:34 INFO resourcemanager.RMAuditLogger: USER=default  
OPERATION=Register App Master   TARGET=ApplicationMasterService RESULT=SUCCESS  
APPID=application_1517575125794_4564
APPATTEMPTID=appattempt_1517575125794_4564_01
Exception in thread "Thread-43" 18/02/02 20:40:34 INFO appmaster.AMSimulator: 
Register the application master for application application_1517575125794_4564
18/02/02 20:40:34 INFO resourcemanager.MockAMLauncher: Notify AM launcher 
launched:container_1517575125794_4564_01_01
18/02/02 20:40:34 INFO rmcontainer.RMContainerImpl: 
container_1517575125794_2703_01_01 Container Transitioned from ACQUIRED to 
RUNNING
18/02/02 20:40:34 INFO attempt.RMAppAttemptImpl: 
appattempt_1517575125794_4564_01 State change from ALLOCATED to LAUNCHED on 
event = LAUNCHED
18/02/02 20:40:34 INFO attempt.RMAppAttemptImpl: 
appattempt_1517575125794_4564_01 State change from LAUNCHED to RUNNING on 
event = REGISTERED
18/02/02 20:40:34 INFO rmapp.RMAppImpl: application_1517575125794_4564 State 
change from ACCEPTED to RUNNING on event = ATTEMPT_REGISTERED
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.getResourceRequests(SchedulerApplicationAttempt.java:1341)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.canAssign(RegularContainerAllocator.java:302)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignOffSwitchContainers(RegularContainerAllocator.java:389)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainersOnNode(RegularContainerAllocator.java:470)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.tryAllocateOnNode(RegularContainerAllocator.java:252)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.allocate(RegularContainerAllocator.java:816)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainers(RegularContainerAllocator.java:854)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.ContainerAllocator.assignContainers(ContainerAllocator.java:54)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.assignContainers(FiCaSchedulerApp.java:856)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:735)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:559)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1343)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1337)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1434)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1199)
at 

[jira] [Commented] (YARN-7880) CapacityScheduler$ResourceCommitterService throws NPE when running sls

2018-02-04 Thread Jiandan Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16351990#comment-16351990
 ] 

Jiandan Yang  commented on YARN-7880:
-

Duplicate of YARN-7591.

> CapacityScheduler$ResourceCommitterService throws NPE when running sls
> --
>
> Key: YARN-7880
> URL: https://issues.apache.org/jira/browse/YARN-7880
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.0.0
>Reporter: Jiandan Yang 
>Priority: Major
>
> sls test case: node count = 9000, job count = 10k, task num of job = 500, task 
> run time = 100s; it does not occur when node count = 500 or 2000.
> {code}
> 18/02/02 20:54:28 INFO rmcontainer.RMContainerImpl: 
> container_1517575125794_5707_01_86 Container Transitioned from ACQUIRED 
> to RUNNING
> java.lang.NullPointerException
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.commonCheckContainerAllocation(FiCaSchedulerApp.java:324)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.accept(FiCaSchedulerApp.java:420)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2506)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:541)
> {code}
> Some CapacityScheduler$AsyncScheduleThread threads also throw an NPE:
> {code}
> 18/02/02 20:40:34 INFO resourcemanager.DefaultAMSProcessor: AM registration 
> appattempt_1517575125794_4564_01
> 18/02/02 20:40:34 INFO resourcemanager.RMAuditLogger: USER=default  
> OPERATION=Register App Master   TARGET=ApplicationMasterService 
> RESULT=SUCCESS  APPID=application_1517575125794_4564
> APPATTEMPTID=appattempt_1517575125794_4564_01
> Exception in thread "Thread-43" 18/02/02 20:40:34 INFO appmaster.AMSimulator: 
> Register the application master for application application_1517575125794_4564
> 18/02/02 20:40:34 INFO resourcemanager.MockAMLauncher: Notify AM launcher 
> launched:container_1517575125794_4564_01_01
> 18/02/02 20:40:34 INFO rmcontainer.RMContainerImpl: 
> container_1517575125794_2703_01_01 Container Transitioned from ACQUIRED 
> to RUNNING
> 18/02/02 20:40:34 INFO attempt.RMAppAttemptImpl: 
> appattempt_1517575125794_4564_01 State change from ALLOCATED to LAUNCHED 
> on event = LAUNCHED
> 18/02/02 20:40:34 INFO attempt.RMAppAttemptImpl: 
> appattempt_1517575125794_4564_01 State change from LAUNCHED to RUNNING on 
> event = REGISTERED
> 18/02/02 20:40:34 INFO rmapp.RMAppImpl: application_1517575125794_4564 State 
> change from ACCEPTED to RUNNING on event = ATTEMPT_REGISTERED
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.getResourceRequests(SchedulerApplicationAttempt.java:1341)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.canAssign(RegularContainerAllocator.java:302)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignOffSwitchContainers(RegularContainerAllocator.java:389)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainersOnNode(RegularContainerAllocator.java:470)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.tryAllocateOnNode(RegularContainerAllocator.java:252)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.allocate(RegularContainerAllocator.java:816)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainers(RegularContainerAllocator.java:854)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.ContainerAllocator.assignContainers(ContainerAllocator.java:54)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.assignContainers(FiCaSchedulerApp.java:856)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:735)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:559)
> at 
> 

[jira] [Resolved] (YARN-7880) CapacityScheduler$ResourceCommitterService throws NPE when running sls

2018-02-04 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  resolved YARN-7880.
-
   Resolution: Duplicate
 Assignee: Jiandan Yang 
Fix Version/s: 3.0.0

> CapacityScheduler$ResourceCommitterService throws NPE when running sls
> --
>
> Key: YARN-7880
> URL: https://issues.apache.org/jira/browse/YARN-7880
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.0.0
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Fix For: 3.0.0
>
>
> SLS test case: node count = 9000, job count = 10k, tasks per job = 500, task 
> run time = 100s; the NPE does not occur when node count is 500 or 2000.
> {code}
> 18/02/02 20:54:28 INFO rmcontainer.RMContainerImpl: 
> container_1517575125794_5707_01_86 Container Transitioned from ACQUIRED 
> to RUNNING
> java.lang.NullPointerException
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.commonCheckContainerAllocation(FiCaSchedulerApp.java:324)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.accept(FiCaSchedulerApp.java:420)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.tryCommit(CapacityScheduler.java:2506)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ResourceCommitterService.run(CapacityScheduler.java:541)
> {code}
> Some CapacityScheduler$AsyncScheduleThread threads also throw an NPE
> {code}
> 18/02/02 20:40:34 INFO resourcemanager.DefaultAMSProcessor: AM registration 
> appattempt_1517575125794_4564_01
> 18/02/02 20:40:34 INFO resourcemanager.RMAuditLogger: USER=default  
> OPERATION=Register App Master   TARGET=ApplicationMasterService 
> RESULT=SUCCESS  APPID=application_1517575125794_4564
> APPATTEMPTID=appattempt_1517575125794_4564_01
> Exception in thread "Thread-43" 18/02/02 20:40:34 INFO appmaster.AMSimulator: 
> Register the application master for application application_1517575125794_4564
> 18/02/02 20:40:34 INFO resourcemanager.MockAMLauncher: Notify AM launcher 
> launched:container_1517575125794_4564_01_01
> 18/02/02 20:40:34 INFO rmcontainer.RMContainerImpl: 
> container_1517575125794_2703_01_01 Container Transitioned from ACQUIRED 
> to RUNNING
> 18/02/02 20:40:34 INFO attempt.RMAppAttemptImpl: 
> appattempt_1517575125794_4564_01 State change from ALLOCATED to LAUNCHED 
> on event = LAUNCHED
> 18/02/02 20:40:34 INFO attempt.RMAppAttemptImpl: 
> appattempt_1517575125794_4564_01 State change from LAUNCHED to RUNNING on 
> event = REGISTERED
> 18/02/02 20:40:34 INFO rmapp.RMAppImpl: application_1517575125794_4564 State 
> change from ACCEPTED to RUNNING on event = ATTEMPT_REGISTERED
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.getResourceRequests(SchedulerApplicationAttempt.java:1341)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.canAssign(RegularContainerAllocator.java:302)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignOffSwitchContainers(RegularContainerAllocator.java:389)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainersOnNode(RegularContainerAllocator.java:470)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.tryAllocateOnNode(RegularContainerAllocator.java:252)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.allocate(RegularContainerAllocator.java:816)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainers(RegularContainerAllocator.java:854)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.ContainerAllocator.assignContainers(ContainerAllocator.java:54)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.assignContainers(FiCaSchedulerApp.java:856)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:735)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:559)
> at 
> 

[jira] [Created] (YARN-7929) SLS supports setting container execution

2018-02-13 Thread Jiandan Yang (JIRA)
Jiandan Yang  created YARN-7929:
---

 Summary: SLS supports setting container execution
 Key: YARN-7929
 URL: https://issues.apache.org/jira/browse/YARN-7929
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: scheduler-load-simulator
Reporter: Jiandan Yang 
Assignee: Jiandan Yang 


SLS currently supports three trace types (SYNTH, SLS and RUMEN), but the trace file 
cannot set the execution type of a container.
This jira will introduce the execution type in SLS to allow a better simulation.
RUMEN gets the default execution type GUARANTEED.
SYNTH sets the execution type via the fields map_execution_type and reduce_execution_type.
SLS sets the execution type via the field container.execution_type.
For compatibility, GUARANTEED is used as the default value when the above fields are 
not set in the trace file (see the sketch below).
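
As a rough illustration only (this is not the attached patch), the compatibility default could look like the sketch below. ExecutionType and the field name container.execution_type come from this description; the helper class and method names are made up.
{code:java}
// Sketch: default to GUARANTEED when the trace entry does not set
// container.execution_type. Class and method names are illustrative only.
import java.util.Map;
import org.apache.hadoop.yarn.api.records.ExecutionType;

final class TraceExecutionTypeReader {
  static ExecutionType fromSlsTask(Map<String, String> taskFields) {
    String raw = taskFields.get("container.execution_type");
    if (raw == null || raw.trim().isEmpty()) {
      return ExecutionType.GUARANTEED;          // compatibility default
    }
    return ExecutionType.valueOf(raw.trim());   // e.g. "OPPORTUNISTIC"
  }
}
{code}
The same default would apply to the SYNTH fields map_execution_type and reduce_execution_type when they are absent.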



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7929) SLS supports setting container execution

2018-02-13 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-7929:

Issue Type: Sub-task  (was: New Feature)
Parent: YARN-5065

> SLS supports setting container execution
> 
>
> Key: YARN-7929
> URL: https://issues.apache.org/jira/browse/YARN-7929
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: scheduler-load-simulator
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Minor
>
> SLS currently support three tracetype, SYNTH, SLS and RUMEN, but trace file 
> can not set execution type of container.
> This jira will introduce execution type in SLS to help better simulation.
> RUMEN has default execution type GUARANTEED
> SYNTH set execution type by field map_execution_type and reduce_execution_type
> SLS set execution type by field container.execution_type
> For compatibility set GUARANTEED as default value when not setting above 
> fields in trace file



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7929) SLS supports setting container execution

2018-02-13 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-7929:

Attachment: YARN-7929.001.patch

> SLS supports setting container execution
> 
>
> Key: YARN-7929
> URL: https://issues.apache.org/jira/browse/YARN-7929
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: scheduler-load-simulator
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Minor
> Attachments: YARN-7929.001.patch
>
>
> SLS currently support three tracetype, SYNTH, SLS and RUMEN, but trace file 
> can not set execution type of container.
> This jira will introduce execution type in SLS to help better simulation.
> RUMEN has default execution type GUARANTEED
> SYNTH set execution type by field map_execution_type and reduce_execution_type
> SLS set execution type by field container.execution_type
> For compatibility set GUARANTEED as default value when not setting above 
> fields in trace file



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-7929) SLS supports setting container execution

2018-02-23 Thread Jiandan Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16374021#comment-16374021
 ] 

Jiandan Yang  edited comment on YARN-7929 at 2/24/18 1:50 AM:
--

Hi [~youchen], thanks for your attention. I did run into a merge failure when I 
pulled the latest code in my local development environment. I will upload a new 
patch based on the latest code.

Adding a "water level" to the NMSimulator simulates actual resource utilization; 
the scheduling of OPPORTUNISTIC containers through the central RM needs actual 
node utilization, according to the design doc in YARN-1011.


was (Author: yangjiandan):
Hi [~yochen], thanks for your attention. I did encounter the issue of merging 
failed when I pull latest code in my local develop environment. I will upload a 
new patch based on latest code.

"water level" to the NMSimulator simulates actual resource utilization, the 
scheduling of OPPORTUNISTIC containers through the central RM need actual node 
utilization according to design doc in YARN-1011.

> SLS supports setting container execution
> 
>
> Key: YARN-7929
> URL: https://issues.apache.org/jira/browse/YARN-7929
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: scheduler-load-simulator
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: YARN-7929.001.patch, YARN-7929.002.patch
>
>
> SLS currently support three tracetype, SYNTH, SLS and RUMEN, but trace file 
> can not set execution type of container.
>  This jira will introduce execution type in SLS to help better simulation. 
> This will help the perf testing with regarding to the Opportunistic 
> Containers.
>  RUMEN has default execution type GUARANTEED
>  SYNTH set execution type by field map_execution_type and 
> reduce_execution_type
>  SLS set execution type by field container.execution_type
>  For compatibility set GUARANTEED as default value when not setting above 
> fields in trace file



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-7929) SLS supports setting container execution

2018-02-22 Thread Jiandan Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16374021#comment-16374021
 ] 

Jiandan Yang  edited comment on YARN-7929 at 2/23/18 7:09 AM:
--

Hi [~yochen], thanks for your attention. I did encounter the issue of merging 
failed when I pull latest code in my local develop environment. I will upload a 
new patch based on latest code.

"water level" to the NMSimulator simulates actual resource utilization, the 
scheduling of OPPORTUNISTIC containers through the central RM need actual node 
utilization according to design doc in YARN-1011.


was (Author: yangjiandan):
Hi [~yochen], thanks for your attention. I did encounter the issue of merging 
failed when I pull latest code in my local develop environment. I will upload a 
new patch based latest code.

"water level" to the NMSimulator simulates actual resource utilization, the 
scheduling of OPPORTUNISTIC containers through the central RM need actual node 
utilization according to design doc in YARN-1011.

> SLS supports setting container execution
> 
>
> Key: YARN-7929
> URL: https://issues.apache.org/jira/browse/YARN-7929
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: scheduler-load-simulator
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: YARN-7929.001.patch
>
>
> SLS currently support three tracetype, SYNTH, SLS and RUMEN, but trace file 
> can not set execution type of container.
>  This jira will introduce execution type in SLS to help better simulation. 
> This will help the perf testing with regarding to the Opportunistic 
> Containers.
>  RUMEN has default execution type GUARANTEED
>  SYNTH set execution type by field map_execution_type and 
> reduce_execution_type
>  SLS set execution type by field container.execution_type
>  For compatibility set GUARANTEED as default value when not setting above 
> fields in trace file



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7929) SLS supports setting container execution

2018-02-22 Thread Jiandan Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16374021#comment-16374021
 ] 

Jiandan Yang  commented on YARN-7929:
-

Hi [~yochen], thanks for your attention. I did run into a merge failure when I 
pulled the latest code in my local development environment. I will upload a new 
patch based on the latest code.

Adding a "water level" to the NMSimulator simulates actual resource utilization; 
the scheduling of OPPORTUNISTIC containers through the central RM needs actual 
node utilization, according to the design doc in YARN-1011.

> SLS supports setting container execution
> 
>
> Key: YARN-7929
> URL: https://issues.apache.org/jira/browse/YARN-7929
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: scheduler-load-simulator
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: YARN-7929.001.patch
>
>
> SLS currently support three tracetype, SYNTH, SLS and RUMEN, but trace file 
> can not set execution type of container.
>  This jira will introduce execution type in SLS to help better simulation. 
> This will help the perf testing with regarding to the Opportunistic 
> Containers.
>  RUMEN has default execution type GUARANTEED
>  SYNTH set execution type by field map_execution_type and 
> reduce_execution_type
>  SLS set execution type by field container.execution_type
>  For compatibility set GUARANTEED as default value when not setting above 
> fields in trace file



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7929) SLS supports setting container execution

2018-02-23 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-7929:

Attachment: YARN-7929.002.patch

> SLS supports setting container execution
> 
>
> Key: YARN-7929
> URL: https://issues.apache.org/jira/browse/YARN-7929
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: scheduler-load-simulator
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: YARN-7929.001.patch, YARN-7929.002.patch
>
>
> SLS currently support three tracetype, SYNTH, SLS and RUMEN, but trace file 
> can not set execution type of container.
>  This jira will introduce execution type in SLS to help better simulation. 
> This will help the perf testing with regarding to the Opportunistic 
> Containers.
>  RUMEN has default execution type GUARANTEED
>  SYNTH set execution type by field map_execution_type and 
> reduce_execution_type
>  SLS set execution type by field container.execution_type
>  For compatibility set GUARANTEED as default value when not setting above 
> fields in trace file



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-7970) Compatibility issue: throw RpcNoSuchMethodException when run mapreduce job

2018-02-26 Thread Jiandan Yang (JIRA)
Jiandan Yang  created YARN-7970:
---

 Summary: Compatibility issue: throw RpcNoSuchMethodException when 
run mapreduce job
 Key: YARN-7970
 URL: https://issues.apache.org/jira/browse/YARN-7970
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Affects Versions: 3.0.0
Reporter: Jiandan Yang 


Running teragen fails when the client is hadoop-3.1 and the HDFS server is 2.8.
The reason for the failure is that 2.8 HDFS does not have setErasureCodingPolicy.
The detailed exception trace is:
```
2018-02-26 11:22:53,178 INFO mapreduce.JobSubmitter: Cleaning up the staging 
area /tmp/hadoop-yarn/staging/hadoop/.staging/job_1518615699369_0006
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RpcNoSuchMethodException):
 Unknown method setErasureCodingPolicy called on 
org.apache.hadoop.hdfs.protocol.ClientProtocol protocol.
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:436)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:846)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:789)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1804)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2457)

at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1491)
at org.apache.hadoop.ipc.Client.call(Client.java:1437)
at org.apache.hadoop.ipc.Client.call(Client.java:1347)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
at com.sun.proxy.$Proxy11.setErasureCodingPolicy(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.setErasureCodingPolicy(ClientNamenodeProtocolTranslatorPB.java:1583)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
at com.sun.proxy.$Proxy12.setErasureCodingPolicy(Unknown Source)
at 
org.apache.hadoop.hdfs.DFSClient.setErasureCodingPolicy(DFSClient.java:2678)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$63.doCall(DistributedFileSystem.java:2665)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$63.doCall(DistributedFileSystem.java:2662)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.setErasureCodingPolicy(DistributedFileSystem.java:2680)
at 
org.apache.hadoop.mapreduce.JobResourceUploader.disableErasureCodingForPath(JobResourceUploader.java:882)
at 
org.apache.hadoop.mapreduce.JobResourceUploader.uploadResourcesInternal(JobResourceUploader.java:174)
at 
org.apache.hadoop.mapreduce.JobResourceUploader.uploadResources(JobResourceUploader.java:131)
at 
org.apache.hadoop.mapreduce.JobSubmitter.copyAndConfigureFiles(JobSubmitter.java:102)
at 
org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:197)
at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1570)
at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1567)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1965)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1567)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1588)
at org.apache.hadoop.examples.terasort.TeraGen.run(TeraGen.java:304)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.hadoop.examples.terasort.TeraGen.main(TeraGen.java:308)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 

[jira] [Updated] (YARN-7970) Compatibility issue: throw RpcNoSuchMethodException when run mapreduce job

2018-02-26 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-7970:

Description: 
Running teragen fails when the client is hadoop-3.1 and the HDFS server is 2.8.
The reason for the failure is that 2.8 HDFS does not have setErasureCodingPolicy.
The detailed exception trace is:

{code:java}
2018-02-26 11:22:53,178 INFO mapreduce.JobSubmitter: Cleaning up the staging 
area /tmp/hadoop-yarn/staging/hadoop/.staging/job_1518615699369_0006
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RpcNoSuchMethodException):
 Unknown method setErasureCodingPolicy called on 
org.apache.hadoop.hdfs.protocol.ClientProtocol protocol.
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:436)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:846)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:789)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1804)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2457)

at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1491)
at org.apache.hadoop.ipc.Client.call(Client.java:1437)
at org.apache.hadoop.ipc.Client.call(Client.java:1347)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
at com.sun.proxy.$Proxy11.setErasureCodingPolicy(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.setErasureCodingPolicy(ClientNamenodeProtocolTranslatorPB.java:1583)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
at com.sun.proxy.$Proxy12.setErasureCodingPolicy(Unknown Source)
at 
org.apache.hadoop.hdfs.DFSClient.setErasureCodingPolicy(DFSClient.java:2678)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$63.doCall(DistributedFileSystem.java:2665)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$63.doCall(DistributedFileSystem.java:2662)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.setErasureCodingPolicy(DistributedFileSystem.java:2680)
at 
org.apache.hadoop.mapreduce.JobResourceUploader.disableErasureCodingForPath(JobResourceUploader.java:882)
at 
org.apache.hadoop.mapreduce.JobResourceUploader.uploadResourcesInternal(JobResourceUploader.java:174)
at 
org.apache.hadoop.mapreduce.JobResourceUploader.uploadResources(JobResourceUploader.java:131)
at 
org.apache.hadoop.mapreduce.JobSubmitter.copyAndConfigureFiles(JobSubmitter.java:102)
at 
org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:197)
at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1570)
at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1567)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1965)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1567)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1588)
at org.apache.hadoop.examples.terasort.TeraGen.run(TeraGen.java:304)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.hadoop.examples.terasort.TeraGen.main(TeraGen.java:308)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 

[jira] [Updated] (YARN-7970) Compatibility issue: throw RpcNoSuchMethodException when run mapreduce job

2018-02-26 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-7970:

Description: 
Running teragen fails when the client is hadoop-3.1 and the HDFS server is 2.8.
The reason for the failure is that 2.8 HDFS does not have setErasureCodingPolicy.
The detailed exception trace is:

2018-02-26 11:22:53,178 INFO mapreduce.JobSubmitter: Cleaning up the staging 
area /tmp/hadoop-yarn/staging/hadoop/.staging/job_1518615699369_0006
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RpcNoSuchMethodException):
 Unknown method setErasureCodingPolicy called on 
org.apache.hadoop.hdfs.protocol.ClientProtocol protocol.
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:436)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:846)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:789)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1804)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2457)

at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1491)
at org.apache.hadoop.ipc.Client.call(Client.java:1437)
at org.apache.hadoop.ipc.Client.call(Client.java:1347)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
at com.sun.proxy.$Proxy11.setErasureCodingPolicy(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.setErasureCodingPolicy(ClientNamenodeProtocolTranslatorPB.java:1583)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
at com.sun.proxy.$Proxy12.setErasureCodingPolicy(Unknown Source)
at 
org.apache.hadoop.hdfs.DFSClient.setErasureCodingPolicy(DFSClient.java:2678)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$63.doCall(DistributedFileSystem.java:2665)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$63.doCall(DistributedFileSystem.java:2662)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.setErasureCodingPolicy(DistributedFileSystem.java:2680)
at 
org.apache.hadoop.mapreduce.JobResourceUploader.disableErasureCodingForPath(JobResourceUploader.java:882)
at 
org.apache.hadoop.mapreduce.JobResourceUploader.uploadResourcesInternal(JobResourceUploader.java:174)
at 
org.apache.hadoop.mapreduce.JobResourceUploader.uploadResources(JobResourceUploader.java:131)
at 
org.apache.hadoop.mapreduce.JobSubmitter.copyAndConfigureFiles(JobSubmitter.java:102)
at 
org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:197)
at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1570)
at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1567)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1965)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1567)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1588)
at org.apache.hadoop.examples.terasort.TeraGen.run(TeraGen.java:304)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.hadoop.examples.terasort.TeraGen.main(TeraGen.java:308)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 

[jira] [Updated] (YARN-7693) ContainersMonitor support configurable

2018-01-01 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-7693:

Attachment: YARN-7693.001.patch

> ContainersMonitor support configurable
> --
>
> Key: YARN-7693
> URL: https://issues.apache.org/jira/browse/YARN-7693
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Minor
> Attachments: YARN-7693.001.patch
>
>
> Currently ContainersMonitor has only one default implementation 
> ContainersMonitorImpl,
> After introducing Opportunistic Container, ContainersMonitor needs to monitor 
> system metrics and even dynamically adjust Opportunistic and Guaranteed 
> resources in the cgroup, so another ContainersMonitor may need to be 
> implemented. 
> The current ContainerManagerImpl ContainersMonitorImpl direct new 
> ContainerManagerImpl, so ContainersMonitor need to be configurable.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7693) ContainersMonitor support configurable

2018-01-02 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-7693:

Attachment: YARN-7693.002.patch

fix TestYarnConfigurationFields error

> ContainersMonitor support configurable
> --
>
> Key: YARN-7693
> URL: https://issues.apache.org/jira/browse/YARN-7693
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Minor
> Attachments: YARN-7693.001.patch, YARN-7693.002.patch
>
>
> Currently ContainersMonitor has only one default implementation 
> ContainersMonitorImpl,
> After introducing Opportunistic Container, ContainersMonitor needs to monitor 
> system metrics and even dynamically adjust Opportunistic and Guaranteed 
> resources in the cgroup, so another ContainersMonitor may need to be 
> implemented. 
> The current ContainerManagerImpl ContainersMonitorImpl direct new 
> ContainerManagerImpl, so ContainersMonitor need to be configurable.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7693) ContainersMonitor support configurable

2018-01-02 Thread Jiandan Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16309129#comment-16309129
 ] 

Jiandan Yang  commented on YARN-7693:
-

[~miklos.szeg...@cloudera.com] Thanks for your attention. This jira does not 
conflict with YARN-7064. I filed this jira because ContainersMonitorImpl currently 
has some problems:
1. An online service may crash due to high overall system resource utilization.
ContainersMonitorImpl only checks the pmem and vmem of each container and does not 
check the overall system utilization. This can hurt an online service when 
offline tasks and online services run on YARN at the same time. For example, 
no single container's memory exceeds its limit, but the system's total memory 
utilization may reach 100% because of oversubscription, and the RM's decision to 
kill a container may not be timely enough, so the online service is affected.
2. Directly killing an Opportunistic container is too violent. Dynamically adjusting 
Opportunistic container resources may be a better choice.
So I propose to (see the rough sketch after this list):
1) Separate containers into two different cgroups, Opportunistic_Group and 
Guaranteed_Group, under *hadoop-yarn* 
2) Monitor system resource utilization and dynamically adjust the resources of 
Opportunistic_Group
3) Kill a container only when adjusting its resources has failed a given number of times
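
The following is only a rough sketch of the policy described above, not code from any patch; CgroupAdjuster and ContainerKiller are hypothetical interfaces used purely for illustration.
{code:java}
// Sketch of the proposed "adjust first, kill as a last resort" policy.
interface CgroupAdjuster {
  boolean shrinkOpportunisticGroup(long bytesToReclaim);
}

interface ContainerKiller {
  void killOneOpportunisticContainer();
}

final class OverallUtilizationPolicy {
  private final CgroupAdjuster adjuster;
  private final ContainerKiller killer;
  private final int maxAdjustFailures;
  private int consecutiveFailures = 0;

  OverallUtilizationPolicy(CgroupAdjuster adjuster, ContainerKiller killer,
      int maxAdjustFailures) {
    this.adjuster = adjuster;
    this.killer = killer;
    this.maxAdjustFailures = maxAdjustFailures;
  }

  // Called from the monitoring loop when overall node utilization is too high.
  void onMemoryPressure(long bytesToReclaim) {
    if (adjuster.shrinkOpportunisticGroup(bytesToReclaim)) {
      consecutiveFailures = 0;                  // adjustment worked, no kill
    } else if (++consecutiveFailures >= maxAdjustFailures) {
      killer.killOneOpportunisticContainer();   // last resort
      consecutiveFailures = 0;
    }
  }
}
{code}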

> ContainersMonitor support configurable
> --
>
> Key: YARN-7693
> URL: https://issues.apache.org/jira/browse/YARN-7693
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Minor
> Attachments: YARN-7693.001.patch, YARN-7693.002.patch
>
>
> Currently ContainersMonitor has only one default implementation 
> ContainersMonitorImpl,
> After introducing Opportunistic Container, ContainersMonitor needs to monitor 
> system metrics and even dynamically adjust Opportunistic and Guaranteed 
> resources in the cgroup, so another ContainersMonitor may need to be 
> implemented. 
> The current ContainerManagerImpl ContainersMonitorImpl direct new 
> ContainerManagerImpl, so ContainersMonitor need to be configurable.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7693) ContainersMonitor support configurable

2018-01-01 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-7693:

Description: 
Currently ContainersMonitor has only one default implementation 
ContainersMonitorImpl,
After introducing Opportunistic Container, ContainersMonitor needs to monitor 
system metrics and even dynamically adjust Opportunistic and Guaranteed 
resources in the cgroup, so another ContainersMonitor may need to be 
implemented. 
The current ContainerManagerImpl ContainersMonitorImpl direct new 
ContainerManagerImpl, so ContainersMonitor need to be configurable.

  was:
Currently ContainersMonitor has only one default implementation 
ContainersMonitorImpl,
After introducing Opportunistic Container, ContainersMonitor needs to monitor 
system metrics and even dynamically adjust Opportunistic and Guaranteed 
resources in the cgroup, so another ContainersMonitor may need to be 
implemented. The current ContainerManagerImpl ContainersMonitorImpl direct new 
ContainerManagerImpl, so ContainersMonitor need to be configurable.


> ContainersMonitor support configurable
> --
>
> Key: YARN-7693
> URL: https://issues.apache.org/jira/browse/YARN-7693
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Minor
>
> Currently ContainersMonitor has only one default implementation 
> ContainersMonitorImpl,
> After introducing Opportunistic Container, ContainersMonitor needs to monitor 
> system metrics and even dynamically adjust Opportunistic and Guaranteed 
> resources in the cgroup, so another ContainersMonitor may need to be 
> implemented. 
> The current ContainerManagerImpl ContainersMonitorImpl direct new 
> ContainerManagerImpl, so ContainersMonitor need to be configurable.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-7693) ContainersMonitor support configurable

2018-01-01 Thread Jiandan Yang (JIRA)
Jiandan Yang  created YARN-7693:
---

 Summary: ContainersMonitor support configurable
 Key: YARN-7693
 URL: https://issues.apache.org/jira/browse/YARN-7693
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: nodemanager
Reporter: Jiandan Yang 
Assignee: Jiandan Yang 
Priority: Minor


Currently ContainersMonitor has only one default implementation, 
ContainersMonitorImpl.
After introducing Opportunistic Containers, ContainersMonitor needs to monitor 
system metrics and even dynamically adjust Opportunistic and Guaranteed 
resources in the cgroup, so another ContainersMonitor implementation may be 
needed. The current ContainerManagerImpl constructs ContainersMonitorImpl directly 
with new, so ContainersMonitor needs to be configurable.
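
A minimal sketch of the usual Hadoop pattern for making the implementation pluggable is below; the property name yarn.nodemanager.containers-monitor.class is hypothetical, and the selected class would still be constructed by ContainerManagerImpl with the executor, dispatcher and context it passes today.
{code:java}
// Sketch only: choose the ContainersMonitor implementation class from
// configuration instead of hard-coding ContainersMonitorImpl.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitor;
import org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl;

final class ContainersMonitorSelector {
  static Class<? extends ContainersMonitor> monitorClass(Configuration conf) {
    return conf.getClass(
        "yarn.nodemanager.containers-monitor.class",   // hypothetical key
        ContainersMonitorImpl.class,                   // today's only implementation
        ContainersMonitor.class);
  }
}
{code}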



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7929) SLS supports setting container execution

2018-02-26 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-7929:

Attachment: YARN-7929.004.patch

> SLS supports setting container execution
> 
>
> Key: YARN-7929
> URL: https://issues.apache.org/jira/browse/YARN-7929
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: scheduler-load-simulator
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: YARN-7929.001.patch, YARN-7929.002.patch, 
> YARN-7929.003.patch, YARN-7929.004.patch
>
>
> SLS currently support three tracetype, SYNTH, SLS and RUMEN, but trace file 
> can not set execution type of container.
>  This jira will introduce execution type in SLS to help better simulation. 
> This will help the perf testing with regarding to the Opportunistic 
> Containers.
>  RUMEN has default execution type GUARANTEED
>  SYNTH set execution type by field map_execution_type and 
> reduce_execution_type
>  SLS set execution type by field container.execution_type
>  For compatibility set GUARANTEED as default value when not setting above 
> fields in trace file



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7929) SLS supports setting container execution

2018-02-26 Thread Jiandan Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16378110#comment-16378110
 ] 

Jiandan Yang  commented on YARN-7929:
-

fix checkstyle issues and upload YARN-7929.004.patch

> SLS supports setting container execution
> 
>
> Key: YARN-7929
> URL: https://issues.apache.org/jira/browse/YARN-7929
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: scheduler-load-simulator
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: YARN-7929.001.patch, YARN-7929.002.patch, 
> YARN-7929.003.patch, YARN-7929.004.patch
>
>
> SLS currently support three tracetype, SYNTH, SLS and RUMEN, but trace file 
> can not set execution type of container.
>  This jira will introduce execution type in SLS to help better simulation. 
> This will help the perf testing with regarding to the Opportunistic 
> Containers.
>  RUMEN has default execution type GUARANTEED
>  SYNTH set execution type by field map_execution_type and 
> reduce_execution_type
>  SLS set execution type by field container.execution_type
>  For compatibility set GUARANTEED as default value when not setting above 
> fields in trace file



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7929) SLS supports setting container execution

2018-02-26 Thread Jiandan Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16378179#comment-16378179
 ] 

Jiandan Yang  commented on YARN-7929:
-

fix checkstyle HiddenField and upload 005.patch

> SLS supports setting container execution
> 
>
> Key: YARN-7929
> URL: https://issues.apache.org/jira/browse/YARN-7929
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: scheduler-load-simulator
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: YARN-7929.001.patch, YARN-7929.002.patch, 
> YARN-7929.003.patch, YARN-7929.004.patch, YARN-7929.005.patch
>
>
> SLS currently support three tracetype, SYNTH, SLS and RUMEN, but trace file 
> can not set execution type of container.
>  This jira will introduce execution type in SLS to help better simulation. 
> This will help the perf testing with regarding to the Opportunistic 
> Containers.
>  RUMEN has default execution type GUARANTEED
>  SYNTH set execution type by field map_execution_type and 
> reduce_execution_type
>  SLS set execution type by field container.execution_type
>  For compatibility set GUARANTEED as default value when not setting above 
> fields in trace file



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7929) SLS supports setting container execution

2018-02-26 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-7929:

Attachment: YARN-7929.005.patch

> SLS supports setting container execution
> 
>
> Key: YARN-7929
> URL: https://issues.apache.org/jira/browse/YARN-7929
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: scheduler-load-simulator
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: YARN-7929.001.patch, YARN-7929.002.patch, 
> YARN-7929.003.patch, YARN-7929.004.patch, YARN-7929.005.patch
>
>
> SLS currently support three tracetype, SYNTH, SLS and RUMEN, but trace file 
> can not set execution type of container.
>  This jira will introduce execution type in SLS to help better simulation. 
> This will help the perf testing with regarding to the Opportunistic 
> Containers.
>  RUMEN has default execution type GUARANTEED
>  SYNTH set execution type by field map_execution_type and 
> reduce_execution_type
>  SLS set execution type by field container.execution_type
>  For compatibility set GUARANTEED as default value when not setting above 
> fields in trace file



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8664) ApplicationMasterProtocolPBServiceImpl#allocate throw NPE when NM losting

2018-08-15 Thread Jiandan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-8664:

Description: 
The ResourceManager log for the exception is:
{code:java}
2018-08-09 00:52:30,746 WARN [IPC Server handler 5 on 8030] 
org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8030, call Call#305638 
Retry#0 org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB.allocate from 
11.13.73.101:51083
java.lang.NullPointerException
        at 
org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto.isInitialized(YarnProtos.java:6402)
        at 
org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto$Builder.build(YarnProtos.java:6642)
        at 
org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.mergeLocalToProto(ResourcePBImpl.java:254)
        at 
org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.getProto(ResourcePBImpl.java:61)
        at 
org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.convertToProtoFormat(NodeReportPBImpl.java:313)
        at 
org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToBuilder(NodeReportPBImpl.java:264)
        at 
org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToProto(NodeReportPBImpl.java:287)
        at 
org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.getProto(NodeReportPBImpl.java:224)
        at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.convertToProtoFormat(AllocateResponsePBImpl.java:714)
        at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.access$400(AllocateResponsePBImpl.java:69)
        at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:680)
        at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:669)
        at 
com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336)
        at 
com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323)
        at 
org.apache.hadoop.yarn.proto.YarnServiceProtos$AllocateResponseProto$Builder.addAllUpdatedNodes(YarnServiceProtos.java:12846)
        at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToBuilder(AllocateResponsePBImpl.java:145)
        at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToProto(AllocateResponsePBImpl.java:176)
        at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.getProto(AllocateResponsePBImpl.java:97)
        at 
org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:61)
        at 
org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:846)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:789)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1804)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2457)
{code}
ApplicationMasterService#allocate calls AllocateResponse#setUpdatedNodes 
when an NM is lost, and AllocateResponse#getProto calls 
ResourcePBImpl#getProto to transform NodeReportPBImpl#capacity into its PB 
format. Because ResourcePBImpl is not thread safe and 
multiple AMs can call allocate at the same time, ResourcePBImpl#getProto may 
throw NullPointerException or UnsupportedOperationException.
I wrote test code which can reproduce the exception.
{code:java}
  @Test
  public void testResource1() throws InterruptedException {
    // Ten threads share a single ResourcePBImpl and call getProto() concurrently.
    ResourcePBImpl resource = (ResourcePBImpl) Resource.newInstance(1, 1);
    for (int i = 0; i < 10; i++) {
      Thread thread = new PBThread(resource);
      thread.setName("t" + i);
      thread.start();
    }
    Thread.sleep(1);
  }

  class PBThread extends Thread {
    ResourcePBImpl resourcePB;

    public PBThread(ResourcePBImpl resourcePB) {
      this.resourcePB = resourcePB;
    }

    @Override
    public void run() {
      while (true) {
        this.resourcePB.getProto();
      }
    }
  }
{code}
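
One possible mitigation, sketched below purely as an assumption and not necessarily what the attached patch does, is to hand each NodeReport its own Resource copy so that concurrently serialized responses never share one mutable ResourcePBImpl.
{code:java}
// Sketch: copy the capacity Resource before putting it into a NodeReport,
// so concurrent getProto() calls never touch the same underlying builder.
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

final class NodeReportCapacityCopier {
  static void setCapacityDefensively(NodeReport report, Resource sharedCapacity) {
    // Resources.clone creates a fresh Resource, so this report does not share
    // state with any other allocate response being serialized at the same time.
    report.setCapability(Resources.clone(sharedCapacity));
  }
}
{code}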

  was:
ResourceManager logs about exception is:
{code:java}
2018-08-09 00:52:30,746 WARN [IPC Server handler 5 on 8030] 
org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8030, call Call#305638 
Retry#0 org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB.allocate from 
11.13.73.101:51083
java.lang.NullPointerException
  

[jira] [Commented] (YARN-8664) ApplicationMasterProtocolPBServiceImpl#allocate throw NPE when NM losting

2018-08-15 Thread Jiandan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580881#comment-16580881
 ] 

Jiandan Yang  commented on YARN-8664:
-

[~cheersyang] Jenkins is probably not OK.
Would you please fix it?

> ApplicationMasterProtocolPBServiceImpl#allocate throw NPE when NM losting
> -
>
> Key: YARN-8664
> URL: https://issues.apache.org/jira/browse/YARN-8664
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.8.2
> Environment: 
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: YARN-8664-branch-2.8.001.pathch, 
> YARN-8664-branch-2.8.2.001.patch, YARN-8664-branch-2.8.2.002.patch
>
>
> ResourceManager logs about exception is:
> {code:java}
> 2018-08-09 00:52:30,746 WARN [IPC Server handler 5 on 8030] 
> org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8030, call Call#305638 
> Retry#0 org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB.allocate from 
> 11.13.73.101:51083
> java.lang.NullPointerException
>         at 
> org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto.isInitialized(YarnProtos.java:6402)
>         at 
> org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto$Builder.build(YarnProtos.java:6642)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.mergeLocalToProto(ResourcePBImpl.java:254)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.getProto(ResourcePBImpl.java:61)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.convertToProtoFormat(NodeReportPBImpl.java:313)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToBuilder(NodeReportPBImpl.java:264)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToProto(NodeReportPBImpl.java:287)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.getProto(NodeReportPBImpl.java:224)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.convertToProtoFormat(AllocateResponsePBImpl.java:714)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.access$400(AllocateResponsePBImpl.java:69)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:680)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:669)
>         at 
> com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336)
>         at 
> com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323)
>         at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$AllocateResponseProto$Builder.addAllUpdatedNodes(YarnServiceProtos.java:12846)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToBuilder(AllocateResponsePBImpl.java:145)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToProto(AllocateResponsePBImpl.java:176)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.getProto(AllocateResponsePBImpl.java:97)
>         at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:61)
>         at 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:846)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:789)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1804)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2457)
> {code}
> ApplicationMasterService#allocate will call AllocateResponse#setUpdatedNodes 
> when NM losting, and AllocateResponse#getProto will call 
> ResourceBPImpl#getProto to transform NodeReportPBImpl#capacity into format of 
> PB . Because ResourcePBImpl is not thread safe and 
> multiple AM will call allocate at the same time, ResourcePBImpl#getProto may 
> throw NullPointerException or UnsupportedOperationException.
> I wrote a test code which can reproduce exception.
> {code:java}
> @Test
>   public void testResource1() throws 

[jira] [Updated] (YARN-8664) ApplicationMasterProtocolPBServiceImpl#allocate throw NPE when NM losting

2018-08-14 Thread Jiandan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-8664:

Attachment: YARN-8664-branch-2.8.2.002.patch

> ApplicationMasterProtocolPBServiceImpl#allocate throw NPE when NM losting
> -
>
> Key: YARN-8664
> URL: https://issues.apache.org/jira/browse/YARN-8664
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.8.2
> Environment: 
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: YARN-8664-branch-2.8.2.001.patch, 
> YARN-8664-branch-2.8.2.002.patch
>
>
> ResourceManager logs about exception is:
> {code:java}
> 2018-08-09 00:52:30,746 WARN [IPC Server handler 5 on 8030] 
> org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8030, call Call#305638 
> Retry#0 org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB.allocate from 
> 11.13.73.101:51083
> java.lang.NullPointerException
>         at 
> org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto.isInitialized(YarnProtos.java:6402)
>         at 
> org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto$Builder.build(YarnProtos.java:6642)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.mergeLocalToProto(ResourcePBImpl.java:254)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.getProto(ResourcePBImpl.java:61)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.convertToProtoFormat(NodeReportPBImpl.java:313)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToBuilder(NodeReportPBImpl.java:264)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToProto(NodeReportPBImpl.java:287)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.getProto(NodeReportPBImpl.java:224)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.convertToProtoFormat(AllocateResponsePBImpl.java:714)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.access$400(AllocateResponsePBImpl.java:69)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:680)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:669)
>         at 
> com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336)
>         at 
> com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323)
>         at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$AllocateResponseProto$Builder.addAllUpdatedNodes(YarnServiceProtos.java:12846)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToBuilder(AllocateResponsePBImpl.java:145)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToProto(AllocateResponsePBImpl.java:176)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.getProto(AllocateResponsePBImpl.java:97)
>         at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:61)
>         at 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:846)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:789)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1804)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2457)
> {code}
> ApplicationMasterService#allocate will call AllocateResponse#setUpdatedNodes 
> when an NM is lost, and AllocateResponse#getProto will call 
> ResourcePBImpl#getProto to transform NodeReportPBImpl#capacity into PB format. 
> Because ResourcePBImpl is not thread safe and 
> multiple AMs may call allocate at the same time, ResourcePBImpl#getProto may 
> throw NullPointerException or UnsupportedOperationException.
> I wrote test code which can reproduce the exception.
> {code:java}
> @Test
>   public void testResource1() throws InterruptedException {
> ResourcePBImpl resource = (ResourcePBImpl) 

[jira] [Commented] (YARN-8664) ApplicationMasterProtocolPBServiceImpl#allocate throw NPE when NM losting

2018-08-14 Thread Jiandan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580626#comment-16580626
 ] 

Jiandan Yang  commented on YARN-8664:
-

Jenkins reported "ERROR: Docker failed to build image", which is not related to 
the patch.
Uploading the patch again to trigger Jenkins.

> ApplicationMasterProtocolPBServiceImpl#allocate throw NPE when NM losting
> -
>
> Key: YARN-8664
> URL: https://issues.apache.org/jira/browse/YARN-8664
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.8.2
> Environment: 
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: YARN-8664-branch-2.8.2.001.patch
>
>
> The ResourceManager log for the exception is:
> {code:java}
> 2018-08-09 00:52:30,746 WARN [IPC Server handler 5 on 8030] 
> org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8030, call Call#305638 
> Retry#0 org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB.allocate from 
> 11.13.73.101:51083
> java.lang.NullPointerException
>         at 
> org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto.isInitialized(YarnProtos.java:6402)
>         at 
> org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto$Builder.build(YarnProtos.java:6642)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.mergeLocalToProto(ResourcePBImpl.java:254)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.getProto(ResourcePBImpl.java:61)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.convertToProtoFormat(NodeReportPBImpl.java:313)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToBuilder(NodeReportPBImpl.java:264)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToProto(NodeReportPBImpl.java:287)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.getProto(NodeReportPBImpl.java:224)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.convertToProtoFormat(AllocateResponsePBImpl.java:714)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.access$400(AllocateResponsePBImpl.java:69)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:680)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:669)
>         at 
> com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336)
>         at 
> com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323)
>         at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$AllocateResponseProto$Builder.addAllUpdatedNodes(YarnServiceProtos.java:12846)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToBuilder(AllocateResponsePBImpl.java:145)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToProto(AllocateResponsePBImpl.java:176)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.getProto(AllocateResponsePBImpl.java:97)
>         at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:61)
>         at 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:846)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:789)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1804)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2457)
> {code}
> ApplicationMasterService#allocate will call AllocateResponse#setUpdatedNodes 
> when an NM is lost, and AllocateResponse#getProto will call 
> ResourcePBImpl#getProto to transform NodeReportPBImpl#capacity into PB format. 
> Because ResourcePBImpl is not thread safe and 
> multiple AMs may call allocate at the same time, ResourcePBImpl#getProto may 
> throw NullPointerException or UnsupportedOperationException.
> I wrote test code which can reproduce the exception.
> {code:java}
> @Test
>   public void testResource1() throws 

[jira] [Commented] (YARN-8664) ApplicationMasterProtocolPBServiceImpl#allocate throw NPE when NM losting

2018-08-14 Thread Jiandan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580688#comment-16580688
 ] 

Jiandan Yang  commented on YARN-8664:
-

Thanks [~cheersyang] for the quick response.
There is no problem in trunk, because it replaces ResourcePBImpl with 
LightWeightResource, introduced by YARN-6909.
I will update the patch for branch-2.8.

> ApplicationMasterProtocolPBServiceImpl#allocate throw NPE when NM losting
> -
>
> Key: YARN-8664
> URL: https://issues.apache.org/jira/browse/YARN-8664
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.8.2
> Environment: 
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: YARN-8664-branch-2.8.2.001.patch, 
> YARN-8664-branch-2.8.2.002.patch
>
>
> The ResourceManager log for the exception is:
> {code:java}
> 2018-08-09 00:52:30,746 WARN [IPC Server handler 5 on 8030] 
> org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8030, call Call#305638 
> Retry#0 org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB.allocate from 
> 11.13.73.101:51083
> java.lang.NullPointerException
>         at 
> org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto.isInitialized(YarnProtos.java:6402)
>         at 
> org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto$Builder.build(YarnProtos.java:6642)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.mergeLocalToProto(ResourcePBImpl.java:254)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.getProto(ResourcePBImpl.java:61)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.convertToProtoFormat(NodeReportPBImpl.java:313)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToBuilder(NodeReportPBImpl.java:264)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToProto(NodeReportPBImpl.java:287)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.getProto(NodeReportPBImpl.java:224)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.convertToProtoFormat(AllocateResponsePBImpl.java:714)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.access$400(AllocateResponsePBImpl.java:69)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:680)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:669)
>         at 
> com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336)
>         at 
> com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323)
>         at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$AllocateResponseProto$Builder.addAllUpdatedNodes(YarnServiceProtos.java:12846)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToBuilder(AllocateResponsePBImpl.java:145)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToProto(AllocateResponsePBImpl.java:176)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.getProto(AllocateResponsePBImpl.java:97)
>         at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:61)
>         at 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:846)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:789)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1804)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2457)
> {code}
> ApplicationMasterService#allocate will call AllocateResponse#setUpdatedNodes 
> when an NM is lost, and AllocateResponse#getProto will call 
> ResourcePBImpl#getProto to transform NodeReportPBImpl#capacity into PB format. 
> Because ResourcePBImpl is not thread safe and 
> multiple AMs may call allocate at the same time, ResourcePBImpl#getProto may 
> throw NullPointerException or UnsupportedOperationException.
> I wrote test code 

[jira] [Updated] (YARN-8664) ApplicationMasterProtocolPBServiceImpl#allocate throw NPE when NM losting

2018-08-14 Thread Jiandan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-8664:

Attachment: YARN-8664-branch-2.8.001.pathch

> ApplicationMasterProtocolPBServiceImpl#allocate throw NPE when NM losting
> -
>
> Key: YARN-8664
> URL: https://issues.apache.org/jira/browse/YARN-8664
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.8.2
> Environment: 
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: YARN-8664-branch-2.8.001.pathch, 
> YARN-8664-branch-2.8.2.001.patch, YARN-8664-branch-2.8.2.002.patch
>
>
> The ResourceManager log for the exception is:
> {code:java}
> 2018-08-09 00:52:30,746 WARN [IPC Server handler 5 on 8030] 
> org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8030, call Call#305638 
> Retry#0 org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB.allocate from 
> 11.13.73.101:51083
> java.lang.NullPointerException
>         at 
> org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto.isInitialized(YarnProtos.java:6402)
>         at 
> org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto$Builder.build(YarnProtos.java:6642)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.mergeLocalToProto(ResourcePBImpl.java:254)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.getProto(ResourcePBImpl.java:61)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.convertToProtoFormat(NodeReportPBImpl.java:313)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToBuilder(NodeReportPBImpl.java:264)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToProto(NodeReportPBImpl.java:287)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.getProto(NodeReportPBImpl.java:224)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.convertToProtoFormat(AllocateResponsePBImpl.java:714)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.access$400(AllocateResponsePBImpl.java:69)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:680)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:669)
>         at 
> com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336)
>         at 
> com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323)
>         at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$AllocateResponseProto$Builder.addAllUpdatedNodes(YarnServiceProtos.java:12846)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToBuilder(AllocateResponsePBImpl.java:145)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToProto(AllocateResponsePBImpl.java:176)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.getProto(AllocateResponsePBImpl.java:97)
>         at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:61)
>         at 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:846)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:789)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1804)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2457)
> {code}
> ApplicationMasterService#allocate will call AllocateResponse#setUpdatedNodes 
> when an NM is lost, and AllocateResponse#getProto will call 
> ResourcePBImpl#getProto to transform NodeReportPBImpl#capacity into PB format. 
> Because ResourcePBImpl is not thread safe and 
> multiple AMs may call allocate at the same time, ResourcePBImpl#getProto may 
> throw NullPointerException or UnsupportedOperationException.
> I wrote test code which can reproduce the exception.
> {code:java}
> @Test
>   public void testResource1() throws InterruptedException {
> ResourcePBImpl resource = 

[jira] [Created] (YARN-8645) Yarn NM fail to start when remount cpu control group

2018-08-09 Thread Jiandan Yang (JIRA)
Jiandan Yang  created YARN-8645:
---

 Summary: Yarn NM fail to start when remount cpu control group
 Key: YARN-8645
 URL: https://issues.apache.org/jira/browse/YARN-8645
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Jiandan Yang 


NM failed to start when we updated YARN to the latest version. NM logs are as 
follows:

{code:java}
2018-08-08 16:07:01,244 INFO [main] 
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsHandlerImpl:
 Mounting controller cpu at /sys/fs/cgroup/cpu
2018-08-08 16:07:01,246 WARN [main] 
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor:
 Shell execution returned exit code: 32. Privileged Execution Operation Stderr:
Feature disabled: mount cgroup

Stdout:
Full command array for failed execution:
[/home/hadoop/hadoop_hbase/hadoop-current/bin/container-executor, 
--mount-cgroups, hadoop-yarn, cpu,cpuset,cpuacct=/sys/fs/cgroup/cpu]
2018-08-08 16:07:01,247 ERROR [main] 
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsHandlerImpl:
 Failed to mount controller: cpu
2018-08-08 16:07:01,247 ERROR [main] 
org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Failed to 
bootstrap configured resource subsystems!
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerException:
 Failed to mount controller: cpu
 {code}

The cause of the error is that commit 351cf87c92872d90f62c476f85ae4d02e485769c 
disables mounting cgroups by default in container-executor, which makes 
container-executor return a non-zero exit code when executing mount-cgroups.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8664) ApplicationMasterProtocolPBServiceImpl#allocate throw NPE when NM losting

2018-08-14 Thread Jiandan Yang (JIRA)
Jiandan Yang  created YARN-8664:
---

 Summary: ApplicationMasterProtocolPBServiceImpl#allocate throw NPE 
when NM losting
 Key: YARN-8664
 URL: https://issues.apache.org/jira/browse/YARN-8664
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.8.2
 Environment: 



Reporter: Jiandan Yang 
Assignee: Jiandan Yang 


The ResourceManager log for the exception is:
{code:java}
2018-08-09 00:52:30,746 WARN [IPC Server handler 5 on 8030] 
org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8030, call Call#305638 
Retry#0 org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB.allocate from 
11.13.73.101:51083
java.lang.NullPointerException
        at 
org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto.isInitialized(YarnProtos.java:6402)
        at 
org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto$Builder.build(YarnProtos.java:6642)
        at 
org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.mergeLocalToProto(ResourcePBImpl.java:254)
        at 
org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.getProto(ResourcePBImpl.java:61)
        at 
org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.convertToProtoFormat(NodeReportPBImpl.java:313)
        at 
org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToBuilder(NodeReportPBImpl.java:264)
        at 
org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToProto(NodeReportPBImpl.java:287)
        at 
org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.getProto(NodeReportPBImpl.java:224)
        at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.convertToProtoFormat(AllocateResponsePBImpl.java:714)
        at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.access$400(AllocateResponsePBImpl.java:69)
        at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:680)
        at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:669)
        at 
com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336)
        at 
com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323)
        at 
org.apache.hadoop.yarn.proto.YarnServiceProtos$AllocateResponseProto$Builder.addAllUpdatedNodes(YarnServiceProtos.java:12846)
        at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToBuilder(AllocateResponsePBImpl.java:145)
        at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToProto(AllocateResponsePBImpl.java:176)
        at 
org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.getProto(AllocateResponsePBImpl.java:97)
        at 
org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:61)
        at 
org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:846)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:789)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1804)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2457)
{code}
ApplicationMasterService#allocate will call AllocateResponse#setUpdatedNodes 
when an NM is lost, and AllocateResponse#getProto will call 
ResourcePBImpl#getProto to transform NodeReportPBImpl#capacity into PB format. 
Because ResourcePBImpl is not thread safe and 
multiple AMs may call allocate at the same time, ResourcePBImpl#getProto may 
throw NullPointerException or UnsupportedOperationException.
I wrote test code which can reproduce the exception.
{code:java}
@Test
  public void testResource1() throws InterruptedException {
ResourcePBImpl resource = (ResourcePBImpl) Resource.newInstance(1, 1);
for(long i=0;i<100;i++) {
  resource.setResourceInformation("r" + i, 
ResourceInformation.newInstance("r" + i, i));
}
for (int i =0;i<10;i++ ) {
  Thread thread = new PBThread(resource);
  thread.setName("t"+i);
  thread.start();
}
Thread.sleep(1);
  }

  class PBThread extends Thread {
ResourcePBImpl resourcePB;

public PBThread(ResourcePBImpl resourcePB) {
  this.resourcePB = resourcePB;
}

@Override 
public void run() 
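
(The quoted test is truncated by the archive. Below is a simplified, self-contained 
sketch of the same idea against branch-2.8, where Resource.newInstance returns a 
ResourcePBImpl; the class name and the thread body are guesses for illustration, 
not the attached patch's code, and the 100 extra resource types from the original 
setup are omitted.)
{code:java}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl;

public class ResourcePBImplRaceSketch {
  public static void main(String[] args) throws InterruptedException {
    // One Resource instance shared by all threads, like the RMNode capability
    // that every AllocateResponse serializes.
    final ResourcePBImpl resource = (ResourcePBImpl) Resource.newInstance(1024, 1);
    for (int i = 0; i < 10; i++) {
      Thread t = new Thread(new Runnable() {
        @Override
        public void run() {
          // Unsynchronized writers and serializers racing on the same
          // ResourcePBImpl; getProto() may throw NullPointerException or
          // UnsupportedOperationException, as in the RM stack trace above.
          for (int j = 0; j < 1000000; j++) {
            resource.setMemory(j);
            resource.getProto();
          }
        }
      });
      t.setName("t" + i);
      t.start();
    }
    Thread.sleep(10000);
  }
}
{code}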

[jira] [Updated] (YARN-8664) ApplicationMasterProtocolPBServiceImpl#allocate throw NPE when NM losting

2018-08-14 Thread Jiandan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-8664:

Attachment: YARN-8664-branch-2.8.2.001.patch

> ApplicationMasterProtocolPBServiceImpl#allocate throw NPE when NM losting
> -
>
> Key: YARN-8664
> URL: https://issues.apache.org/jira/browse/YARN-8664
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.8.2
> Environment: 
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: YARN-8664-branch-2.8.2.001.patch
>
>
> The ResourceManager log for the exception is:
> {code:java}
> 2018-08-09 00:52:30,746 WARN [IPC Server handler 5 on 8030] 
> org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8030, call Call#305638 
> Retry#0 org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB.allocate from 
> 11.13.73.101:51083
> java.lang.NullPointerException
>         at 
> org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto.isInitialized(YarnProtos.java:6402)
>         at 
> org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto$Builder.build(YarnProtos.java:6642)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.mergeLocalToProto(ResourcePBImpl.java:254)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.getProto(ResourcePBImpl.java:61)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.convertToProtoFormat(NodeReportPBImpl.java:313)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToBuilder(NodeReportPBImpl.java:264)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToProto(NodeReportPBImpl.java:287)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.getProto(NodeReportPBImpl.java:224)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.convertToProtoFormat(AllocateResponsePBImpl.java:714)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.access$400(AllocateResponsePBImpl.java:69)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:680)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:669)
>         at 
> com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336)
>         at 
> com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323)
>         at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$AllocateResponseProto$Builder.addAllUpdatedNodes(YarnServiceProtos.java:12846)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToBuilder(AllocateResponsePBImpl.java:145)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToProto(AllocateResponsePBImpl.java:176)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.getProto(AllocateResponsePBImpl.java:97)
>         at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:61)
>         at 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:846)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:789)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1804)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2457)
> {code}
> ApplicationMasterService#allocate will call AllocateResponse#setUpdatedNodes 
> when an NM is lost, and AllocateResponse#getProto will call 
> ResourcePBImpl#getProto to transform NodeReportPBImpl#capacity into PB format. 
> Because ResourcePBImpl is not thread safe and 
> multiple AMs may call allocate at the same time, ResourcePBImpl#getProto may 
> throw NullPointerException or UnsupportedOperationException.
> I wrote test code which can reproduce the exception.
> {code:java}
> @Test
>   public void testResource1() throws InterruptedException {
> ResourcePBImpl resource = (ResourcePBImpl) Resource.newInstance(1, 1);
> for(long i=0;i<100;i++) 

[jira] [Commented] (YARN-8664) ApplicationMasterProtocolPBServiceImpl#allocate throw NPE when NM losting

2018-08-14 Thread Jiandan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579884#comment-16579884
 ] 

Jiandan Yang  commented on YARN-8664:
-

Replace rmNode.getTotalCapability() with 
Resources.clone(rmNode.getTotalCapability()) to avoid accessing the same 
ResourcePBImpl from multiple threads.
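
For illustration, a minimal sketch of that defensive copy (the wrapper class and 
method below are assumptions, not the actual ResourceManager code; only 
Resources.clone, RMNode#getTotalCapability and NodeReport#setCapability come 
from the comment above):
{code:java}
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNode;
import org.apache.hadoop.yarn.util.resource.Resources;

final class NodeReportCapabilitySketch {
  // Give each NodeReport its own Resource instance, so concurrent allocate()
  // responses never serialize the shared ResourcePBImpl held by the RMNode.
  static void setCapability(NodeReport report, RMNode rmNode) {
    Resource copy = Resources.clone(rmNode.getTotalCapability());
    report.setCapability(copy);
  }
}
{code}
Cloning costs one extra Resource object per node report, which is negligible 
compared to the RPC serialization itself.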

> ApplicationMasterProtocolPBServiceImpl#allocate throw NPE when NM losting
> -
>
> Key: YARN-8664
> URL: https://issues.apache.org/jira/browse/YARN-8664
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.8.2
> Environment: 
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: YARN-8664-branch-2.8.2.001.patch
>
>
> The ResourceManager log for the exception is:
> {code:java}
> 2018-08-09 00:52:30,746 WARN [IPC Server handler 5 on 8030] 
> org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8030, call Call#305638 
> Retry#0 org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB.allocate from 
> 11.13.73.101:51083
> java.lang.NullPointerException
>         at 
> org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto.isInitialized(YarnProtos.java:6402)
>         at 
> org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto$Builder.build(YarnProtos.java:6642)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.mergeLocalToProto(ResourcePBImpl.java:254)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.getProto(ResourcePBImpl.java:61)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.convertToProtoFormat(NodeReportPBImpl.java:313)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToBuilder(NodeReportPBImpl.java:264)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToProto(NodeReportPBImpl.java:287)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.getProto(NodeReportPBImpl.java:224)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.convertToProtoFormat(AllocateResponsePBImpl.java:714)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.access$400(AllocateResponsePBImpl.java:69)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:680)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:669)
>         at 
> com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336)
>         at 
> com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323)
>         at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$AllocateResponseProto$Builder.addAllUpdatedNodes(YarnServiceProtos.java:12846)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToBuilder(AllocateResponsePBImpl.java:145)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToProto(AllocateResponsePBImpl.java:176)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.getProto(AllocateResponsePBImpl.java:97)
>         at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:61)
>         at 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:846)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:789)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1804)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2457)
> {code}
> ApplicationMasterService#allocate will call AllocateResponse#setUpdatedNodes 
> when an NM is lost, and AllocateResponse#getProto will call 
> ResourcePBImpl#getProto to transform NodeReportPBImpl#capacity into PB format. 
> Because ResourcePBImpl is not thread safe and 
> multiple AMs may call allocate at the same time, ResourcePBImpl#getProto may 
> throw NullPointerException or UnsupportedOperationException.
> I wrote test code which can reproduce the exception.
> {code:java}
> @Test
>   public void testResource1() 

[jira] [Updated] (YARN-8717) set memory.limit_in_bytes when NodeManager starting

2018-08-28 Thread Jiandan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-8717:

Attachment: YARN-8717.001.patch

> set memory.limit_in_bytes when NodeManager starting
> ---
>
> Key: YARN-8717
> URL: https://issues.apache.org/jira/browse/YARN-8717
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: YARN-8717.001.patch
>
>
> CGroupsCpuResourceHandlerImpl sets the cpu quota at the hadoop-yarn hierarchy 
> to restrict the total cpu resource of the NM when the NM starts; 
> CGroupsMemoryResourceHandlerImpl should likewise set memory.limit_in_bytes at 
> the hadoop-yarn hierarchy to control the memory resource of the NM



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8717) set memory.limit_in_bytes when NodeManager starting

2018-08-28 Thread Jiandan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594858#comment-16594858
 ] 

Jiandan Yang  commented on YARN-8717:
-

Hi [~cheersyang],
Thanks for watching.
We found the NM was killed by the OOM-killer.
The conditions are as follows:
```
yarn.nodemanager.resource.memory.enabled=false
yarn.nodemanager.resource.memory-mb = 100G
Physical memory of the NM machine is 120G
NM has two containers, each requests 40G memory, but each actually uses 50G+
```
So we thought of setting the limit on the hierarchy of hadoop-yarn.
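
A minimal sketch of that idea (the memory cgroup mount point and the hadoop-yarn 
hierarchy name below are assumptions based on the examples in this thread; the 
real paths come from the NM cgroups configuration):
{code:java}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class YarnCgroupMemoryLimitSketch {
  public static void main(String[] args) throws IOException {
    // NM-wide limit of 100G, matching yarn.nodemanager.resource.memory-mb above.
    long limitBytes = 100L * 1024 * 1024 * 1024;
    // Writing memory.limit_in_bytes on the hadoop-yarn hierarchy caps the total
    // memory of all containers, so over-using containers hit the cgroup limit
    // instead of driving the whole machine (and the NM daemon) into the
    // system OOM-killer.
    Files.write(
        Paths.get("/sys/fs/cgroup/memory/hadoop-yarn/memory.limit_in_bytes"),
        Long.toString(limitBytes).getBytes(StandardCharsets.UTF_8));
  }
}
{code}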

> set memory.limit_in_bytes when NodeManager starting
> ---
>
> Key: YARN-8717
> URL: https://issues.apache.org/jira/browse/YARN-8717
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: YARN-8717.001.patch
>
>
> CGroupsCpuResourceHandlerImpl sets the cpu quota at the hadoop-yarn hierarchy 
> to restrict the total cpu resource of the NM when the NM starts; 
> CGroupsMemoryResourceHandlerImpl should likewise set memory.limit_in_bytes at 
> the hadoop-yarn hierarchy to control the memory resource of the NM



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8664) ApplicationMasterProtocolPBServiceImpl#allocate throw NPE when NM losting

2018-08-28 Thread Jiandan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-8664:

Attachment: YARN-8664-branch-2.8.002.pathch

> ApplicationMasterProtocolPBServiceImpl#allocate throw NPE when NM losting
> -
>
> Key: YARN-8664
> URL: https://issues.apache.org/jira/browse/YARN-8664
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.8.2
> Environment: 
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: YARN-8664-branch-2.8.001.pathch, 
> YARN-8664-branch-2.8.002.pathch, YARN-8664-branch-2.8.01.patch, 
> YARN-8664-branch-2.8.2.001.patch, YARN-8664-branch-2.8.2.002.patch
>
>
> The ResourceManager log for the exception is:
> {code:java}
> 2018-08-09 00:52:30,746 WARN [IPC Server handler 5 on 8030] 
> org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8030, call Call#305638 
> Retry#0 org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB.allocate from 
> 11.13.73.101:51083
> java.lang.NullPointerException
>         at 
> org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto.isInitialized(YarnProtos.java:6402)
>         at 
> org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto$Builder.build(YarnProtos.java:6642)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.mergeLocalToProto(ResourcePBImpl.java:254)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.getProto(ResourcePBImpl.java:61)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.convertToProtoFormat(NodeReportPBImpl.java:313)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToBuilder(NodeReportPBImpl.java:264)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToProto(NodeReportPBImpl.java:287)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.getProto(NodeReportPBImpl.java:224)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.convertToProtoFormat(AllocateResponsePBImpl.java:714)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.access$400(AllocateResponsePBImpl.java:69)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:680)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:669)
>         at 
> com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336)
>         at 
> com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323)
>         at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$AllocateResponseProto$Builder.addAllUpdatedNodes(YarnServiceProtos.java:12846)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToBuilder(AllocateResponsePBImpl.java:145)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToProto(AllocateResponsePBImpl.java:176)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.getProto(AllocateResponsePBImpl.java:97)
>         at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:61)
>         at 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:846)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:789)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1804)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2457)
> {code}
> ApplicationMasterService#allocate will call AllocateResponse#setUpdatedNodes 
> when an NM is lost, and AllocateResponse#getProto will call 
> ResourcePBImpl#getProto to transform NodeReportPBImpl#capacity into PB format. 
> Because ResourcePBImpl is not thread safe and 
> multiple AMs may call allocate at the same time, ResourcePBImpl#getProto may 
> throw NullPointerException or UnsupportedOperationException.
> I wrote test code which can reproduce the exception.
> {code:java}
> @Test
>   public void testResource1() 

[jira] [Updated] (YARN-8717) set memory.limit_in_bytes when NodeManager starting

2018-08-27 Thread Jiandan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-8717:

Description: CGroupsCpuResourceHandlerImpl sets the cpu quota at the 
hadoop-yarn hierarchy to restrict the total cpu resource of the NM when the NM 
starts; CGroupsMemoryResourceHandlerImpl should likewise set 
memory.limit_in_bytes at the hadoop-yarn hierarchy to control the memory 
resource of the NM  (was: CGroupsCpuResourceHandlerImpl sets the cpu quota at 
the hadoop-yarn hierarchy to restrict the total cpu resource of the NM when the 
NM starts; CGroupsMemoryResourceHandlerImpl should likewise set 
memory.limit_in_bytes at the hadoop-yarn hierarchy to control the cpu resource 
of the NM)

> set memory.limit_in_bytes when NodeManager starting
> ---
>
> Key: YARN-8717
> URL: https://issues.apache.org/jira/browse/YARN-8717
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
>
> CGroupsCpuResourceHandlerImpl sets the cpu quota at the hadoop-yarn hierarchy 
> to restrict the total cpu resource of the NM when the NM starts; 
> CGroupsMemoryResourceHandlerImpl should likewise set memory.limit_in_bytes at 
> the hadoop-yarn hierarchy to control the memory resource of the NM



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8717) set memory.limit_in_bytes when NodeManager starting

2018-08-26 Thread Jiandan Yang (JIRA)
Jiandan Yang  created YARN-8717:
---

 Summary: set memory.limit_in_bytes when NodeManager starting
 Key: YARN-8717
 URL: https://issues.apache.org/jira/browse/YARN-8717
 Project: Hadoop YARN
  Issue Type: New Feature
 Environment: CGroupsCpuResourceHandlerImpl sets the cpu quota at the 
hadoop-yarn hierarchy to restrict the total cpu resource of the NM when the NM 
starts; CGroupsMemoryResourceHandlerImpl should likewise set 
memory.limit_in_bytes at the hadoop-yarn hierarchy to control the cpu resource 
of the NM
Reporter: Jiandan Yang 
Assignee: Jiandan Yang 






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8717) set memory.limit_in_bytes when NodeManager starting

2018-08-26 Thread Jiandan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-8717:

Description: CGroupsCpuResourceHandlerImpl sets the cpu quota at the 
hadoop-yarn hierarchy to restrict the total cpu resource of the NM when the NM 
starts; CGroupsMemoryResourceHandlerImpl should likewise set 
memory.limit_in_bytes at the hadoop-yarn hierarchy to control the cpu resource 
of the NM

> set memory.limit_in_bytes when NodeManager starting
> ---
>
> Key: YARN-8717
> URL: https://issues.apache.org/jira/browse/YARN-8717
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
>
> CGroupsCpuResourceHandlerImpl sets the cpu quota at the hadoop-yarn hierarchy 
> to restrict the total cpu resource of the NM when the NM starts; 
> CGroupsMemoryResourceHandlerImpl should likewise set memory.limit_in_bytes at 
> the hadoop-yarn hierarchy to control the cpu resource of the NM



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8717) set memory.limit_in_bytes when NodeManager starting

2018-08-26 Thread Jiandan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-8717:

Environment: (was: CGroupsCpuResourceHandlerImpl sets the cpu quota at the 
hadoop-yarn hierarchy to restrict the total cpu resource of the NM when the NM 
starts; CGroupsMemoryResourceHandlerImpl should likewise set 
memory.limit_in_bytes at the hadoop-yarn hierarchy to control the cpu resource 
of the NM)

> set memory.limit_in_bytes when NodeManager starting
> ---
>
> Key: YARN-8717
> URL: https://issues.apache.org/jira/browse/YARN-8717
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-8717) set memory.limit_in_bytes when NodeManager starting

2018-09-06 Thread Jiandan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16594858#comment-16594858
 ] 

Jiandan Yang  edited comment on YARN-8717 at 9/7/18 3:05 AM:
-

Hi [~cheersyang],
Thanks for watching.
We found the NM was killed by the OOM-killer.
The conditions are as follows:
```
yarn.nodemanager.resource.memory.enforced=false
yarn.nodemanager.resource.memory-mb = 100G
Physical memory of the NM machine is 120G
NM has two containers, each requests 40G memory, but each actually uses 50G+
```
So we thought of setting the limit on the hierarchy of hadoop-yarn.


was (Author: yangjiandan):
Hi [~cheersyang],
Thanks for watching.
We found the NM was killed by the OOM-killer.
The conditions are as follows:
```
yarn.nodemanager.resource.memory.enabled=false
yarn.nodemanager.resource.memory-mb = 100G
Physical memory of the NM machine is 120G
NM has two containers, each requests 40G memory, but each actually uses 50G+
```
So we thought of setting the limit on the hierarchy of hadoop-yarn.

> set memory.limit_in_bytes when NodeManager starting
> ---
>
> Key: YARN-8717
> URL: https://issues.apache.org/jira/browse/YARN-8717
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
>  Labels: cgroups
> Attachments: YARN-8717.001.patch
>
>
> CGroupsCpuResourceHandlerImpl sets the cpu quota at the hadoop-yarn hierarchy 
> to restrict the total cpu resource of the NM when the NM starts; 
> CGroupsMemoryResourceHandlerImpl should likewise set memory.limit_in_bytes at 
> the hadoop-yarn hierarchy to control the memory resource of the NM



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8664) ApplicationMasterProtocolPBServiceImpl#allocate throw NPE when NM losting

2018-09-11 Thread Jiandan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-8664:

Attachment: (was: YARN-8664-branch-2.8.001.pathch)

> ApplicationMasterProtocolPBServiceImpl#allocate throw NPE when NM losting
> -
>
> Key: YARN-8664
> URL: https://issues.apache.org/jira/browse/YARN-8664
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.8.2
> Environment: 
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: YARN-8664-branch-2.8.003.patch, 
> YARN-8664-branch-2.8.004.patch, YARN-8664-branch-2.8.01.patch
>
>
> The ResourceManager log for the exception is:
> {code:java}
> 2018-08-09 00:52:30,746 WARN [IPC Server handler 5 on 8030] 
> org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8030, call Call#305638 
> Retry#0 org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB.allocate from 
> 11.13.73.101:51083
> java.lang.NullPointerException
>         at 
> org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto.isInitialized(YarnProtos.java:6402)
>         at 
> org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto$Builder.build(YarnProtos.java:6642)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.mergeLocalToProto(ResourcePBImpl.java:254)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.getProto(ResourcePBImpl.java:61)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.convertToProtoFormat(NodeReportPBImpl.java:313)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToBuilder(NodeReportPBImpl.java:264)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToProto(NodeReportPBImpl.java:287)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.getProto(NodeReportPBImpl.java:224)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.convertToProtoFormat(AllocateResponsePBImpl.java:714)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.access$400(AllocateResponsePBImpl.java:69)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:680)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:669)
>         at 
> com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336)
>         at 
> com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323)
>         at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$AllocateResponseProto$Builder.addAllUpdatedNodes(YarnServiceProtos.java:12846)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToBuilder(AllocateResponsePBImpl.java:145)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToProto(AllocateResponsePBImpl.java:176)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.getProto(AllocateResponsePBImpl.java:97)
>         at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:61)
>         at 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:846)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:789)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1804)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2457)
> {code}
> ApplicationMasterService#allocate will call AllocateResponse#setUpdatedNodes 
> when an NM is lost, and AllocateResponse#getProto will call 
> ResourcePBImpl#getProto to transform NodeReportPBImpl#capacity into PB format. 
> Because ResourcePBImpl is not thread safe and 
> multiple AMs may call allocate at the same time, ResourcePBImpl#getProto may 
> throw NullPointerException or UnsupportedOperationException.
> I wrote test code which can reproduce the exception.
> {code:java}
> @Test
>   public void testResource1() throws InterruptedException {
> ResourcePBImpl resource = 

[jira] [Updated] (YARN-8664) ApplicationMasterProtocolPBServiceImpl#allocate throw NPE when NM losting

2018-09-11 Thread Jiandan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-8664:

Attachment: YARN-8664-branch-2.8.004.patch

> ApplicationMasterProtocolPBServiceImpl#allocate throw NPE when NM losting
> -
>
> Key: YARN-8664
> URL: https://issues.apache.org/jira/browse/YARN-8664
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.8.2
> Environment: 
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: YARN-8664-branch-2.8.003.patch, 
> YARN-8664-branch-2.8.004.patch, YARN-8664-branch-2.8.01.patch
>
>
> The ResourceManager log for the exception is:
> {code:java}
> 2018-08-09 00:52:30,746 WARN [IPC Server handler 5 on 8030] 
> org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8030, call Call#305638 
> Retry#0 org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB.allocate from 
> 11.13.73.101:51083
> java.lang.NullPointerException
>         at 
> org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto.isInitialized(YarnProtos.java:6402)
>         at 
> org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto$Builder.build(YarnProtos.java:6642)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.mergeLocalToProto(ResourcePBImpl.java:254)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.getProto(ResourcePBImpl.java:61)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.convertToProtoFormat(NodeReportPBImpl.java:313)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToBuilder(NodeReportPBImpl.java:264)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToProto(NodeReportPBImpl.java:287)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.getProto(NodeReportPBImpl.java:224)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.convertToProtoFormat(AllocateResponsePBImpl.java:714)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.access$400(AllocateResponsePBImpl.java:69)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:680)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:669)
>         at 
> com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336)
>         at 
> com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323)
>         at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$AllocateResponseProto$Builder.addAllUpdatedNodes(YarnServiceProtos.java:12846)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToBuilder(AllocateResponsePBImpl.java:145)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToProto(AllocateResponsePBImpl.java:176)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.getProto(AllocateResponsePBImpl.java:97)
>         at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:61)
>         at 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:846)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:789)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1804)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2457)
> {code}
> ApplicationMasterService#allocate will call AllocateResponse#setUpdatedNodes 
> when an NM is lost, and AllocateResponse#getProto will call 
> ResourcePBImpl#getProto to transform NodeReportPBImpl#capacity into PB format. 
> Because ResourcePBImpl is not thread safe and 
> multiple AMs may call allocate at the same time, ResourcePBImpl#getProto may 
> throw NullPointerException or UnsupportedOperationException.
> I wrote test code which can reproduce the exception.
> {code:java}
> @Test
>   public void testResource1() throws InterruptedException {
> ResourcePBImpl resource = 

[jira] [Updated] (YARN-8664) ApplicationMasterProtocolPBServiceImpl#allocate throw NPE when NM losting

2018-09-11 Thread Jiandan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-8664:

Attachment: (was: YARN-8664-branch-2.8.2.002.patch)

> ApplicationMasterProtocolPBServiceImpl#allocate throw NPE when NM losting
> -
>
> Key: YARN-8664
> URL: https://issues.apache.org/jira/browse/YARN-8664
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.8.2
> Environment: 
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: YARN-8664-branch-2.8.003.patch, 
> YARN-8664-branch-2.8.004.patch, YARN-8664-branch-2.8.01.patch
>
>
> ResourceManager logs about exception is:
> {code:java}
> 2018-08-09 00:52:30,746 WARN [IPC Server handler 5 on 8030] 
> org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8030, call Call#305638 
> Retry#0 org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB.allocate from 
> 11.13.73.101:51083
> java.lang.NullPointerException
>         at 
> org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto.isInitialized(YarnProtos.java:6402)
>         at 
> org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto$Builder.build(YarnProtos.java:6642)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.mergeLocalToProto(ResourcePBImpl.java:254)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.getProto(ResourcePBImpl.java:61)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.convertToProtoFormat(NodeReportPBImpl.java:313)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToBuilder(NodeReportPBImpl.java:264)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToProto(NodeReportPBImpl.java:287)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.getProto(NodeReportPBImpl.java:224)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.convertToProtoFormat(AllocateResponsePBImpl.java:714)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.access$400(AllocateResponsePBImpl.java:69)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:680)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:669)
>         at 
> com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336)
>         at 
> com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323)
>         at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$AllocateResponseProto$Builder.addAllUpdatedNodes(YarnServiceProtos.java:12846)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToBuilder(AllocateResponsePBImpl.java:145)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToProto(AllocateResponsePBImpl.java:176)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.getProto(AllocateResponsePBImpl.java:97)
>         at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:61)
>         at 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:846)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:789)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1804)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2457)
> {code}
> ApplicationMasterService#allocate calls AllocateResponse#setUpdatedNodes 
> when an NM is lost, and AllocateResponse#getProto then calls 
> ResourcePBImpl#getProto to convert NodeReportPBImpl#capacity into its PB 
> format. Because ResourcePBImpl is not thread safe and multiple AMs can call 
> allocate at the same time, ResourcePBImpl#getProto may throw 
> NullPointerException or UnsupportedOperationException.
> I wrote a test that can reproduce the exception.
> {code:java}
> @Test
>   public void testResource1() throws InterruptedException {
> ResourcePBImpl resource = 

[jira] [Updated] (YARN-8664) ApplicationMasterProtocolPBServiceImpl#allocate throw NPE when NM losting

2018-09-11 Thread Jiandan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-8664:

Attachment: (was: YARN-8664-branch-2.8.2.001.patch)

> ApplicationMasterProtocolPBServiceImpl#allocate throw NPE when NM losting
> -
>
> Key: YARN-8664
> URL: https://issues.apache.org/jira/browse/YARN-8664
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.8.2
> Environment: 
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: YARN-8664-branch-2.8.003.patch, 
> YARN-8664-branch-2.8.004.patch, YARN-8664-branch-2.8.01.patch
>
>
> ResourceManager logs about exception is:
> {code:java}
> 2018-08-09 00:52:30,746 WARN [IPC Server handler 5 on 8030] 
> org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8030, call Call#305638 
> Retry#0 org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB.allocate from 
> 11.13.73.101:51083
> java.lang.NullPointerException
>         at 
> org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto.isInitialized(YarnProtos.java:6402)
>         at 
> org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto$Builder.build(YarnProtos.java:6642)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.mergeLocalToProto(ResourcePBImpl.java:254)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.getProto(ResourcePBImpl.java:61)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.convertToProtoFormat(NodeReportPBImpl.java:313)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToBuilder(NodeReportPBImpl.java:264)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToProto(NodeReportPBImpl.java:287)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.getProto(NodeReportPBImpl.java:224)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.convertToProtoFormat(AllocateResponsePBImpl.java:714)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.access$400(AllocateResponsePBImpl.java:69)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:680)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:669)
>         at 
> com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336)
>         at 
> com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323)
>         at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$AllocateResponseProto$Builder.addAllUpdatedNodes(YarnServiceProtos.java:12846)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToBuilder(AllocateResponsePBImpl.java:145)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToProto(AllocateResponsePBImpl.java:176)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.getProto(AllocateResponsePBImpl.java:97)
>         at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:61)
>         at 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:846)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:789)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1804)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2457)
> {code}
> ApplicationMasterService#allocate calls AllocateResponse#setUpdatedNodes 
> when an NM is lost, and AllocateResponse#getProto then calls 
> ResourcePBImpl#getProto to convert NodeReportPBImpl#capacity into its PB 
> format. Because ResourcePBImpl is not thread safe and multiple AMs can call 
> allocate at the same time, ResourcePBImpl#getProto may throw 
> NullPointerException or UnsupportedOperationException.
> I wrote a test that can reproduce the exception.
> {code:java}
> @Test
>   public void testResource1() throws InterruptedException {
> ResourcePBImpl resource = 

[jira] [Updated] (YARN-8664) ApplicationMasterProtocolPBServiceImpl#allocate throw NPE when NM losting

2018-09-11 Thread Jiandan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-8664:

Attachment: (was: YARN-8664-branch-2.8.002.pathch)

> ApplicationMasterProtocolPBServiceImpl#allocate throw NPE when NM losting
> -
>
> Key: YARN-8664
> URL: https://issues.apache.org/jira/browse/YARN-8664
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.8.2
> Environment: 
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Major
> Attachments: YARN-8664-branch-2.8.003.patch, 
> YARN-8664-branch-2.8.004.patch, YARN-8664-branch-2.8.01.patch
>
>
> ResourceManager logs about exception is:
> {code:java}
> 2018-08-09 00:52:30,746 WARN [IPC Server handler 5 on 8030] 
> org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8030, call Call#305638 
> Retry#0 org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB.allocate from 
> 11.13.73.101:51083
> java.lang.NullPointerException
>         at 
> org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto.isInitialized(YarnProtos.java:6402)
>         at 
> org.apache.hadoop.yarn.proto.YarnProtos$ResourceProto$Builder.build(YarnProtos.java:6642)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.mergeLocalToProto(ResourcePBImpl.java:254)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.ResourcePBImpl.getProto(ResourcePBImpl.java:61)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.convertToProtoFormat(NodeReportPBImpl.java:313)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToBuilder(NodeReportPBImpl.java:264)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.mergeLocalToProto(NodeReportPBImpl.java:287)
>         at 
> org.apache.hadoop.yarn.api.records.impl.pb.NodeReportPBImpl.getProto(NodeReportPBImpl.java:224)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.convertToProtoFormat(AllocateResponsePBImpl.java:714)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.access$400(AllocateResponsePBImpl.java:69)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:680)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl$6$1.next(AllocateResponsePBImpl.java:669)
>         at 
> com.google.protobuf.AbstractMessageLite$Builder.checkForNullValues(AbstractMessageLite.java:336)
>         at 
> com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:323)
>         at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$AllocateResponseProto$Builder.addAllUpdatedNodes(YarnServiceProtos.java:12846)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToBuilder(AllocateResponsePBImpl.java:145)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.mergeLocalToProto(AllocateResponsePBImpl.java:176)
>         at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.AllocateResponsePBImpl.getProto(AllocateResponsePBImpl.java:97)
>         at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:61)
>         at 
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:846)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:789)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1804)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2457)
> {code}
> ApplicationMasterService#allocate calls AllocateResponse#setUpdatedNodes 
> when an NM is lost, and AllocateResponse#getProto then calls 
> ResourcePBImpl#getProto to convert NodeReportPBImpl#capacity into its PB 
> format. Because ResourcePBImpl is not thread safe and multiple AMs can call 
> allocate at the same time, ResourcePBImpl#getProto may throw 
> NullPointerException or UnsupportedOperationException.
> I wrote a test that can reproduce the exception.
> {code:java}
> @Test
>   public void testResource1() throws InterruptedException {
> ResourcePBImpl resource = 

[jira] [Updated] (YARN-7693) ContainersMonitor support configurable

2018-01-24 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-7693:

Priority: Minor  (was: Blocker)

> ContainersMonitor support configurable
> --
>
> Key: YARN-7693
> URL: https://issues.apache.org/jira/browse/YARN-7693
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Minor
> Attachments: YARN-7693.001.patch, YARN-7693.002.patch
>
>
> Currently ContainersMonitor has only one default implementation, 
> ContainersMonitorImpl.
> After introducing Opportunistic Containers, ContainersMonitor needs to 
> monitor system metrics and even dynamically adjust Opportunistic and 
> Guaranteed resources in the cgroup, so another ContainersMonitor 
> implementation may be needed.
> The current ContainerManagerImpl instantiates ContainersMonitorImpl directly 
> with new, so ContainersMonitor needs to be made configurable.
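
As a minimal, hypothetical sketch of what "configurable" could look like (the property name, interface, and classes below are placeholders, not the actual YARN API), ContainerManagerImpl would look up the implementation class from configuration instead of hard-coding the new:

{code:java}
import java.util.Properties;

// Placeholder interface standing in for ContainersMonitor.
interface Monitor {
  void start();
}

// Placeholder default, standing in for ContainersMonitorImpl.
class DefaultMonitor implements Monitor {
  public void start() {
    System.out.println("default containers monitor started");
  }
}

public class ConfigurableMonitorSketch {

  // Hypothetical property name; the real key would be defined by the patch.
  static final String MONITOR_CLASS_KEY = "nm.containers-monitor.class";

  static Monitor createMonitor(Properties conf) throws Exception {
    String className =
        conf.getProperty(MONITOR_CLASS_KEY, DefaultMonitor.class.getName());
    // Instantiate whatever implementation the configuration names.
    return (Monitor) Class.forName(className)
        .getDeclaredConstructor().newInstance();
  }

  public static void main(String[] args) throws Exception {
    Properties conf = new Properties();
    // Leaving the key unset falls back to the default implementation.
    createMonitor(conf).start();
  }
}
{code}

A real patch would presumably use YarnConfiguration and the existing Hadoop reflection utilities in the same spirit, but the lookup-then-instantiate shape would be the same.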



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7693) ContainersMonitor support configurable

2018-01-24 Thread Jiandan Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiandan Yang  updated YARN-7693:

Priority: Blocker  (was: Minor)

> ContainersMonitor support configurable
> --
>
> Key: YARN-7693
> URL: https://issues.apache.org/jira/browse/YARN-7693
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager
>Reporter: Jiandan Yang 
>Assignee: Jiandan Yang 
>Priority: Blocker
> Attachments: YARN-7693.001.patch, YARN-7693.002.patch
>
>
> Currently ContainersMonitor has only one default implementation, 
> ContainersMonitorImpl.
> After introducing Opportunistic Containers, ContainersMonitor needs to 
> monitor system metrics and even dynamically adjust Opportunistic and 
> Guaranteed resources in the cgroup, so another ContainersMonitor 
> implementation may be needed.
> The current ContainerManagerImpl instantiates ContainersMonitorImpl directly 
> with new, so ContainersMonitor needs to be made configurable.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org


