[jira] [Comment Edited] (YARN-6852) [YARN-6223] Native code changes to support isolate GPU devices by using CGroups

2017-10-16 Thread Jonathan Hung (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16206771#comment-16206771
 ] 

Jonathan Hung edited comment on YARN-6852 at 10/16/17 11:30 PM:


Hi [~leftnoteasy], in {{gpu-module.c#internal_handle_gpu_request}}, there's 
this code: {noformat}+  // Use cgroup helpers to blacklist devices
+  for (int i = 0; i < n_minor_devices_to_block; i++) {
+char param_value[128];
+memset(param_value, 0, sizeof(param_value));
+snprintf(param_value, sizeof(param_value), "c %d:%d rwm",
+ major_device_number, i);
+
+int rc = update_cgroups_parameters_func_p("devices", "deny",
+  container_id, param_value);
+
+if (0 != rc) {
+  fprintf(ERRORFILE, "CGroups: Failed to update cgroups\n");
+  return_code = -1;
+  goto cleanup;
+}
+  }{noformat}

Is {noformat}+snprintf(param_value, sizeof(param_value), "c %d:%d rwm",
+ major_device_number, i);{noformat} supposed to be {noformat}+
snprintf(param_value, sizeof(param_value), "c %d:%d rwm",
+ major_device_number, minor_devices[i]);{noformat}? Seems the 
minor number is not actually the number being written.

If so, I'll create a YARN-6223 subtask to address this.


was (Author: jhung):
Hi [~leftnoteasy], in {{gpu-module.c#internal_handle_gpu_request}}, there's 
this code: {noformat}+  // Use cgroup helpers to blacklist devices
+  for (int i = 0; i < n_minor_devices_to_block; i++) {
+char param_value[128];
+memset(param_value, 0, sizeof(param_value));
+snprintf(param_value, sizeof(param_value), "c %d:%d rwm",
+ major_device_number, i);
+
+int rc = update_cgroups_parameters_func_p("devices", "deny",
+  container_id, param_value);
+
+if (0 != rc) {
+  fprintf(ERRORFILE, "CGroups: Failed to update cgroups\n");
+  return_code = -1;
+  goto cleanup;
+}
+  }{noformat}

Is {noformat}+snprintf(param_value, sizeof(param_value), "c %d:%d rwm",
+ major_device_number, i);{noformat} supposed to be {noformat}+
snprintf(param_value, sizeof(param_value), "c %d:%d rwm",
+ major_device_number, minor_devices[i]);{noformat}? Seems the 
minor number is not actually the number being written.

> [YARN-6223] Native code changes to support isolate GPU devices by using 
> CGroups
> ---
>
> Key: YARN-6852
> URL: https://issues.apache.org/jira/browse/YARN-6852
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Fix For: 3.0.0-beta1
>
> Attachments: YARN-6852.001.patch, YARN-6852.002.patch, 
> YARN-6852.003.patch, YARN-6852.004.patch, YARN-6852.005.patch, 
> YARN-6852.006.patch, YARN-6852.007.patch, YARN-6852.008.patch, 
> YARN-6852.009.patch
>
>
> This JIRA plan to add support of:
> 1) Isolation in CGroups. (native side).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-6852) [YARN-6223] Native code changes to support isolate GPU devices by using CGroups

2017-08-07 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16117607#comment-16117607
 ] 

Wangda Tan edited comment on YARN-6852 at 8/8/17 12:19 AM:
---

Again, thanks [~miklos.szeg...@cloudera.com] for your reviews.

bq. could not find a doc about this, so I ask. Is there a reason to have a 
disable list of devices, instead of an enable list? 
Actually whitelist doesn't work for me, I tried to debug however I cannot find 
the root cause. Since blacklist and whitelist are just two different approaches 
for the same purpose. I think we could continue this approach, do you have any 
concern here? 

And regarding to file name style. I was just about to rename to underscore, 
however I found the container-executor itself is "\-", I suggest to convert 
existing files to "\-" connected, otherwise we have to rename a lot of files. I 
don't quite care about which style to go, I just want to avoid change too many 
file.

Attached ver.004 patch, (TODO): unit test related cleanups.


was (Author: leftnoteasy):
Again, thanks [~miklos.szeg...@cloudera.com] for your reviews.

bq. could not find a doc about this, so I ask. Is there a reason to have a 
disable list of devices, instead of an enable list? 
Actually whitelist doesn't work for me, I tried to debug however I cannot find 
the root cause. Since blacklist and whitelist are just two different approaches 
for the same purpose. I think we could continue this approach, do you have any 
concern here? 

And regarding to file name style. I was just about to rename to underscore, 
however I found the container-executor itself is "-", I suggest to convert 
existing files to "-" connected, otherwise we have to rename a lot of files. I 
don't quite care about which style to go, I just want to avoid change too many 
file.

Attached ver.004 patch, (TODO): unit test related cleanups.

> [YARN-6223] Native code changes to support isolate GPU devices by using 
> CGroups
> ---
>
> Key: YARN-6852
> URL: https://issues.apache.org/jira/browse/YARN-6852
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Assignee: Wangda Tan
> Attachments: YARN-6852.001.patch, YARN-6852.002.patch, 
> YARN-6852.003.patch, YARN-6852.004.patch
>
>
> This JIRA plan to add support of:
> 1) Isolation in CGroups. (native side).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-6852) [YARN-6223] Native code changes to support isolate GPU devices by using CGroups

2017-08-04 Thread Wangda Tan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16115055#comment-16115055
 ] 

Wangda Tan edited comment on YARN-6852 at 8/4/17 10:46 PM:
---

Hi Miklos, really appreciate your thorough reviews, very helpful!

I address most of your comments.

Few items which I haven't addressed in the updated patch.
bq. Why do you have cgroup_cfg_section? You could eliminate it and get it all 
the time or just cache cgroups_root.
I still prefer to have it since this can help us get more configs without 
changing major code structure.

bq. int input_argv_idx = 0; the first argument is the process name.
Actually the argc and argv are modified in main.c before passed to modules, I 
removed process name already:
{code}
+return handle_gpu_request(_cgroups_parameters, "gpu", argc - 1,
+   [1]);
{code}
Please let me know if you have any suggestions to the approach. 

bq. opts->keys = malloc(sizeof(char*) * (argc + 1)); Why argc+1 and not argc-1? 
Updated to argc.

bq. required and has_values could be implemented as a bit array instead of a 
byte array. Another option ...
Since container-executor is not a memory-intensive application, I would prefer 
to spend time on changing it when it is necessary or there's any safety 
concerns. :) 

bq. This pattern is C+0x.
I think Varun mentioned this in YARN-6033, it is C99: 
https://stackoverflow.com/a/330867

bq. arr[idx] = n; There is no overflow check. This could also be exploitable.
This might not be an issue since we have already checked the input string once:
{code}
  for (int i = 0; i < strlen(input); i++) {
if (input[i] == ',') {
  n_numbers++;
}
  }
{code}

bq. container_1 is an invalid container id in the unit tests. They will fail. 
Did you mean we should not fail the check? "container_1" is actually an invalid 
id in YARN. 

bq. There is no indentation after namespace ContainerExecutor
I would prefer to not add extra indention for namespace. There're some 
discussions on SO: 
https://stackoverflow.com/questions/713698/c-namespaces-advice

bq. static std::vector cgroups_parameters_invoked; I think you 
should consider std::string here. No need to malloc later
bq. You do not clean up files in the unit tests, do you? Is there a reason?
(TODO) Will include unit test related changes and clean ups in the next patch.

Updated ver.003 patch. [~miklos.szeg...@cloudera.com], mind to check again?


was (Author: leftnoteasy):
Hi Miklos, really appreciate your thorough reviews, very helpful!

I address most of your comments.

Few items which I haven't addressed in the updated patch.
bq. Why do you have cgroup_cfg_section? You could eliminate it and get it all 
the time or just cache cgroups_root.
I still prefer to have it since this can help us get more configs without 
changing major code structure.

bq. int input_argv_idx = 0; the first argument is the process name.
Actually the argc and argv are modified in main.c before passed to modules, I 
removed process name already:
{code}
+return handle_gpu_request(_cgroups_parameters, "gpu", argc - 1,
+   [1]);
{code}
Please let me know if you have any suggestions to the approach. 

bq. opts->keys = malloc(sizeof(char*) * (argc + 1)); Why argc+1 and not argc-1? 
Updated to argc.

bq. required and has_values could be implemented as a bit array instead of a 
byte array. Another option ...
Since container-executor is not a memory-intensive application, I would prefer 
to spend time on changing it when it is necessary or there's any safety 
concerns. :) 

bq. This pattern is C+0x.
I think Varun mentioned this in YARN-6033, it is C99: 
https://stackoverflow.com/a/330867

bq. arr[idx] = n; There is no overflow check. This could also be exploitable.
This might not be an issue since we have already checked the input string once:
{code}
  for (int i = 0; i < strlen(input); i++) {
if (input[i] == ',') {
  n_numbers++;
}
  }
{code}

bq. container_1 is an invalid container id in the unit tests. They will fail. 
Did you mean we should not fail the check? "container_1" is actually an invalid 
id in YARN. 

bq. There is no indentation after namespace ContainerExecutor
I would prefer to not add extra indention for namespace. There're some 
discussions on SO: 
https://stackoverflow.com/questions/713698/c-namespaces-advice

bq. static std::vector cgroups_parameters_invoked; I think you 
should consider std::string here. No need to malloc later
bq. You do not clean up files in the unit tests, do you? Is there a reason?
(TODO) Will include unit test related changes and clean ups in the next patch.

Updated ver.003 patch.

> [YARN-6223] Native code changes to support isolate GPU devices by using 
> CGroups
> ---
>
> Key: YARN-6852
> URL: 

[jira] [Comment Edited] (YARN-6852) [YARN-6223] Native code changes to support isolate GPU devices by using CGroups

2017-08-02 Thread Miklos Szegedi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16112193#comment-16112193
 ] 

Miklos Szegedi edited comment on YARN-6852 at 8/3/17 5:05 AM:
--

…
{{Return a parsed commandline options.}} There is a typo in this sentence.
{code}
struct section empty_executor_cfg = {.size=0, .kv_pairs=NULL};
{code}
This pattern is C++0x, should not be used in standard C. Note: I am not against 
converting the whole tool to C++…
Moreover {{section->name}} is not set to NULL above.
{code}
const struct section* cfg_section;
static int config_initialized = 0;
{code}
I see the same issues as in the groups case. (cfg_section==NULL can be used 
instead of config_initialized, cfg_section can be static, etc.)
n_minor_devices_to_block could be unsigned int, so that the negative check is 
not needed
strtol is a better alternative to atoi
{code}
char param_value[128];
snprintf(param_value, 128, "c %d:%d rwm", major_device_number, i);
{code}
This could be written as:
{code}
snprintf(param_value, sizeof(param_value), "c %d:%d rwm", 
major_device_number, i);
{code}
I do not see allowed_minor_numbers released anywhere.
{code}
  char container_id[128];
  memset(container_id, 0, 128);
{code}
It should be memset(container_id, 0, sizeof(container_id));
{{strcpy(container_id, optarg);}} This is dangerous without size. Use strncpy.
{{fflush(LOGFILE);}} This avoids caching and can be a performance bottleneck. I 
think it is better to avoid unless there is a good reason.
{code}
  const char *cgroups_param_path;
  const char* cgroups_param_value;
{code}
Misaligned space.
In module_enabled I would name rc something else. You marked 0 as rc success in 
other functions.
all_numbers: You touch the characters n^2 times. You should call strlen() once 
and cache the value.
all_numbers:
{code}
  if (strlen(input) == 0) {
return 0;
  }
{code}
This is not necessary.
{code}
  int* arr = (*numbers);
  arr = malloc(sizeof(int) * n_numbers);
{code}
Does this return anything? I think it should be:
{code}
  (*numbers) = malloc(sizeof(int) * n_numbers);
{code}
.
{code}
  char* input_cpy = malloc(strlen(input));
  strcpy(input_cpy, input);
{code}
There is no null pointer check.
{{arr[idx] = n;}} There is no overflow check. This could also be exploitable.
get_numbers_split_by_comma will return an array if a single 0 for an empty 
string. It should return ret_n_number=0 instead.
{code}
if (strlen(p) == 0) {
  return 0;
}
{code}
You could just check p[0]==0
{code}
if (mkdirs(TEST_ROOT, 0755) != 0) {
  exit(1);
}
{code}
This needs some logging to show what happened.
{{fprintf(LOGFILE, "\nTesting %s\n", __func__);}} GTest prints out the function 
name itself.
container_1 is an invalid container id in the unit tests. They will fail.
There is no indentation after {{namespace ContainerExecutor}}
{{static std::vector cgroups_parameters_invoked;}} I think you 
should consider std::string here. No need to malloc later
You do not clean up files in the unit tests, do you? Is there a reason?


was (Author: miklos.szeg...@cloudera.com):
…
{{Return a parsed commandline options.}} There is a typo in this sentence.
{code}
struct section empty_executor_cfg = {.size=0, .kv_pairs=NULL};
{code}
This pattern is C++0x, should not be used in standard C. Note: I am not against 
converting the whole tool to C++…
Moreover {{section->name}} is not set to NULL above.
{code}
const struct section* cfg_section;
static int config_initialized = 0;
{code}
I see the same issues as in the groups case. (cfg_section==NULL can be used 
instead of config_initialized, cfg_section can be static, etc.)
n_minor_devices_to_block could be unsigned int, so that the negative check is 
not needed
strtol is a better alternative to atoi
{code}
char param_value[128];
snprintf(param_value, 128, "c %d:%d rwm", major_device_number, i);
{code}
This could be written as:
{code}
snprintf(param_value, sizeof(param_value), "c %d:%d rwm", 
major_device_number, i);
{code}
I do not see allowed_minor_numbers released anywhere.
{code}
  char container_id[128];
  memset(container_id, 0, 128);
{code}
It should be memset(container_id, 0, sizeof(container_id));
{{strcpy(container_id, optarg);}} This is dangerous without size. Use strncpy.
{{fflush(LOGFILE);}} This avoids caching and can be a performance bottleneck. I 
think it is better to avoid unless there is a good reason.
{code}
  const char *cgroups_param_path;
  const char* cgroups_param_value;
{code}
Misaligned space.
In module_enabled I would name rc something else. You marked 0 as rc success in 
other functions.
all_numbers: You call strlen n^2 times. You should call it once and cache the 
value.
all_numbers:
{code}
  if (strlen(input) == 0) {
return 0;
  }
{code}
This is not necessary.
{code}
  int* arr = (*numbers);
  arr = malloc(sizeof(int) * n_numbers);
{code}
Does this return anything? I