[jira] [Created] (YARN-7131) FSDownload.unpack should determine the type of resource by reading the header bytes

2017-08-30 Thread Brook Zhou (JIRA)
Brook Zhou created YARN-7131:


 Summary: FSDownload.unpack should determine the type of 
resource by reading the header bytes
 Key: YARN-7131
 URL: https://issues.apache.org/jira/browse/YARN-7131
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Brook Zhou
Assignee: Brook Zhou


Currently, naive string checks on the resource name are used to determine 
whether a resource is of a particular type (jar, zip, tar.gz). 

There can be cases where this does not work - e.g., the user decides to split 
up a large zip resource as {file1}.zip.001, {file1}.zip.002.

Instead, FSDownload.unpack should read the file header bytes to determine the 
file type.
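For illustration, a minimal sketch of sniffing a local file's type from its leading magic bytes rather than its name suffix (class and method names here are assumptions for the example, not the actual FSDownload code; ZIP/JAR files start with "PK\x03\x04" and gzip streams with 0x1f 0x8b):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class MagicSniff {
  // Classify a file by its header bytes. Returns "gzip", "zip", or "unknown".
  static String sniff(Path p) throws IOException {
    byte[] head = new byte[4];
    try (InputStream in = Files.newInputStream(p)) {
      int n = in.read(head);
      // gzip (and thus tar.gz) members begin with 0x1f 0x8b.
      if (n >= 2 && (head[0] & 0xff) == 0x1f && (head[1] & 0xff) == 0x8b) {
        return "gzip";
      }
      // ZIP local file headers (also used by JAR) begin with "PK\x03\x04".
      if (n >= 4 && head[0] == 'P' && head[1] == 'K'
          && head[2] == 3 && head[3] == 4) {
        return "zip";
      }
    }
    return "unknown";
  }
}
```

With this, a resource named {file1}.zip.001 would still be recognized once reassembled, regardless of its suffix.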



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7098) LocalizerRunner should immediately send heartbeat response LocalizerStatus.DIE when the Container transitions from LOCALIZING to KILLING

2017-08-24 Thread Brook Zhou (JIRA)
Brook Zhou created YARN-7098:


 Summary: LocalizerRunner should immediately send heartbeat 
response LocalizerStatus.DIE when the Container transitions from LOCALIZING to 
KILLING
 Key: YARN-7098
 URL: https://issues.apache.org/jira/browse/YARN-7098
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Brook Zhou
Assignee: Brook Zhou
Priority: Minor


Currently, the following can happen:

1. ContainerLocalizer heartbeats to ResourceLocalizationService.
2. LocalizerTracker.processHeartbeat verifies that there is a LocalizerRunner 
for the localizerId (containerId).
3. Container receives kill event, goes from LOCALIZING -> KILLING. The 
LocalizerRunner for the localizerId is removed from LocalizerTracker.
4. Since check (2) passed, LocalizerRunner sends heartbeat response with 
LocalizerStatus.LIVE and the next file to download.

Instead, (4) should send LocalizerStatus.DIE, since (3) happened before the 
heartbeat response in (4). That spares the container from downloading an extra 
resource that would end up being deleted anyway.
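A minimal sketch of the intended behavior, with assumed names (the real LocalizerRunner/LocalizerTracker plumbing is more involved): compute the response against the runner map at response time, so a runner removed in step (3) yields DIE rather than a stale LIVE:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class HeartbeatRace {
  enum LocalizerAction { LIVE, DIE }

  // Re-check the runner map when building the response instead of trusting
  // the earlier lookup; a localizerId removed in the interim gets DIE.
  static LocalizerAction respond(ConcurrentMap<String, Object> runners,
                                 String localizerId) {
    return runners.containsKey(localizerId)
        ? LocalizerAction.LIVE
        : LocalizerAction.DIE;
  }
}
```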






[jira] [Created] (YARN-6870) ResourceUtilization/ContainersMonitorImpl is calculating CPU utilization as a float, which is imprecise

2017-07-25 Thread Brook Zhou (JIRA)
Brook Zhou created YARN-6870:


 Summary: ResourceUtilization/ContainersMonitorImpl is calculating 
CPU utilization as a float, which is imprecise
 Key: YARN-6870
 URL: https://issues.apache.org/jira/browse/YARN-6870
 Project: Hadoop YARN
  Issue Type: Bug
  Components: api, nodemanager
Reporter: Brook Zhou
Assignee: Brook Zhou


We have seen issues on our clusters where the current way of computing CPU 
usage suffers from float-arithmetic inaccuracies (the bug is still present in 
trunk).
Simple program to illustrate:
{code:title=Bar.java|borderStyle=solid}
  public static void main(String[] args) throws Exception {
    float result = 0.0f;
    for (int i = 0; i < 7; i++) {
      if (i == 6) {
        result += (float) 4 / (float) 18;
      } else {
        result += (float) 2 / (float) 18;
      }
    }
    for (int i = 0; i < 7; i++) {
      if (i == 6) {
        result -= (float) 4 / (float) 18;
      } else {
        result -= (float) 2 / (float) 18;
      }
    }
    System.out.println(result);
  }
{code}
Printed: 4.4703484E-8


2017-04-12 05:43:24,014 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
 Not enough cpu for [container_e3295_1491978508342_0467_01_30], Current CPU 
Allocation: [0.891], Requested CPU Allocation: [0.]

There are a few places with this issue:
1. ResourceUtilization.java - set/getCPU both use float. When 
ContainerScheduler calls 
ContainersMonitor.increase/decreaseResourceUtilization, the same rounding 
drift can accumulate.

2. AllocationBasedResourceUtilizationTracker.java - hasResourcesAvailable 
also uses float for its CPU computation.
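For contrast, a sketch of the same add/subtract pattern tracked as an integer count of milli-vcores (an assumed representation for illustration, not the actual patch), which round-trips exactly to zero:

```java
public class CpuMath {
  // Mirrors the float repro above: repeatedly add and then subtract the same
  // per-container CPU shares, but in integer milli-vcores. Because the same
  // truncated integer value is added and subtracted, the result is exactly 0.
  public static int run() {
    final int totalVcores = 18;
    int resultMilli = 0;
    for (int i = 0; i < 7; i++) {
      resultMilli += (i == 6 ? 4 : 2) * 1000 / totalVcores;
    }
    for (int i = 0; i < 7; i++) {
      resultMilli -= (i == 6 ? 4 : 2) * 1000 / totalVcores;
    }
    return resultMilli; // exactly 0, unlike the float version
  }
}
```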






[jira] [Created] (YARN-5472) WIN_MAX_PATH logic is off by one

2016-08-03 Thread Brook Zhou (JIRA)
Brook Zhou created YARN-5472:


 Summary: WIN_MAX_PATH logic is off by one
 Key: YARN-5472
 URL: https://issues.apache.org/jira/browse/YARN-5472
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
 Environment: Windows
Reporter: Brook Zhou
Assignee: Brook Zhou
Priority: Minor


The following check is incorrect:

if (Shell.WINDOWS && sb.getWrapperScriptPath().toString().length() > 
WIN_MAX_PATH)

It should be >=, as the max path is defined as "D:\some 256-character path 
string", so a path of exactly WIN_MAX_PATH characters is already too long. 
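A toy sketch of the corrected comparison (the 260 value and the method name are illustrative assumptions, not the actual ContainerLaunch code):

```java
public class PathCheck {
  // On Windows, MAX_PATH is 260 characters counting the terminating NUL,
  // i.e. "D:\some 256-character path string<NUL>". 260 here is an assumed
  // illustrative value; the real WIN_MAX_PATH constant lives in the NM code.
  static final int WIN_MAX_PATH = 260;

  // The comparison must be >=: a path of exactly WIN_MAX_PATH characters
  // leaves no room for the terminating NUL.
  static boolean tooLong(String wrapperScriptPath) {
    return wrapperScriptPath.length() >= WIN_MAX_PATH;
  }
}
```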






[jira] [Created] (YARN-4840) Add option to upload files recursively from container directory

2016-03-18 Thread Brook Zhou (JIRA)
Brook Zhou created YARN-4840:


 Summary: Add option to upload files recursively from container 
directory
 Key: YARN-4840
 URL: https://issues.apache.org/jira/browse/YARN-4840
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: log-aggregation
Affects Versions: 2.8.0
Reporter: Brook Zhou
Priority: Minor
 Fix For: 2.8.0


It may be useful to allow users to aggregate their logs recursively from 
container directories.
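A minimal sketch of the recursive collection step such an option would need (assumed names; not the actual AppLogAggregator code): walk the container log directory and gather every regular file, including those in subdirectories.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class RecursiveLogs {
  // Recursively collect the regular files under a container log directory;
  // this is the set a recursive aggregation pass would upload.
  static List<Path> collect(Path containerLogDir) throws IOException {
    try (Stream<Path> s = Files.walk(containerLogDir)) {
      return s.filter(Files::isRegularFile).collect(Collectors.toList());
    }
  }
}
```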





[jira] [Resolved] (YARN-4818) AggregatedLogFormat.LogValue.write() incorrectly truncates files

2016-03-15 Thread Brook Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brook Zhou resolved YARN-4818.
--
Resolution: Invalid

> AggregatedLogFormat.LogValue.write() incorrectly truncates files
> 
>
> Key: YARN-4818
> URL: https://issues.apache.org/jira/browse/YARN-4818
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.8.0
>Reporter: Brook Zhou
>Assignee: Brook Zhou
>  Labels: log-aggregation
> Fix For: 2.8.0
>
> Attachments: YARN-4818-v0.patch
>
>
> AggregatedLogFormat.LogValue.write() currently has a bug where it only writes 
> in blocks of the buffer size (65535). This is because 
> FileInputStream.read(byte[] buf) returns -1 if there are fewer than 
> buf.length bytes remaining. In cases where the file size is not an exact 
> multiple of 65535 bytes, the remaining bytes are truncated.





[jira] [Created] (YARN-4818) AggregatedLogFormat.LogValue writes only in blocks of buffer size

2016-03-14 Thread Brook Zhou (JIRA)
Brook Zhou created YARN-4818:


 Summary: AggregatedLogFormat.LogValue writes only in blocks of 
buffer size
 Key: YARN-4818
 URL: https://issues.apache.org/jira/browse/YARN-4818
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.8.0
Reporter: Brook Zhou
Assignee: Brook Zhou
 Fix For: 2.8.0


AggregatedLogFormat.LogValue.write() currently has a bug where it only writes 
in blocks of the buffer size (65535). This is because 
FileInputStream.read(byte[] buf) returns -1 if there are fewer than 65535 
bytes remaining. In cases where the file is less than 65535 bytes, 0 bytes are 
written.
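A sketch of the standard fix for this class of bug (assumed names, not the actual LogValue code): loop on read() until it returns -1, and write only the number of bytes each call actually produced. InputStream.read(byte[]) may return fewer bytes than the buffer size; only -1 signals end of stream.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class CopyLoop {
  // Copy a stream fully, handling short reads and a final partial buffer.
  static long copy(InputStream in, OutputStream out) throws IOException {
    byte[] buf = new byte[65535];
    long total = 0;
    int n;
    while ((n = in.read(buf)) != -1) {
      out.write(buf, 0, n); // write only the bytes this read() returned
      total += n;
    }
    return total;
  }
}
```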





[jira] [Created] (YARN-4677) RMNodeResourceUpdateEvent update from scheduler can lead to race condition

2016-02-06 Thread Brook Zhou (JIRA)
Brook Zhou created YARN-4677:


 Summary: RMNodeResourceUpdateEvent update from scheduler can lead 
to race condition
 Key: YARN-4677
 URL: https://issues.apache.org/jira/browse/YARN-4677
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: graceful, resourcemanager, scheduler
Affects Versions: 2.7.1
Reporter: Brook Zhou


When a node is in the decommissioning state, there is a time window between 
completedContainer() and the RMNodeResourceUpdateEvent being handled in 
scheduler.nodeUpdate (YARN-3223). 

So if a scheduling effort happens within this window, a new container can 
still be allocated on this node. The even worse case is a scheduling effort 
that happens after the RMNodeResourceUpdateEvent is sent out but before it is 
propagated to the SchedulerNode - then the total resource is lower than the 
used resource and the available resource becomes negative. 
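A toy sketch of one possible guard (assumed names and logic; not the actual scheduler code): refuse new allocations on a draining node so that an in-flight capacity update cannot be overtaken by a scheduling pass.

```java
public class DecommissionGuard {
  enum NodeState { RUNNING, DECOMMISSIONING }

  // Close the window described above: while the node is draining, its
  // total-resource view may be stale, so reject allocations outright
  // instead of comparing against a capacity that is about to shrink.
  static boolean mayAllocate(NodeState state, int availableVcores,
                             int requestedVcores) {
    if (state == NodeState.DECOMMISSIONING) {
      return false; // no new containers on a draining node
    }
    return requestedVcores <= availableVcores;
  }
}
```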


