[jira] [Comment Edited] (YARN-6852) [YARN-6223] Native code changes to support isolate GPU devices by using CGroups
[ https://issues.apache.org/jira/browse/YARN-6852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16206771#comment-16206771 ] Jonathan Hung edited comment on YARN-6852 at 10/16/17 11:30 PM: Hi [~leftnoteasy], in {{gpu-module.c#internal_handle_gpu_request}}, there's this code: {noformat}+ // Use cgroup helpers to blacklist devices + for (int i = 0; i < n_minor_devices_to_block; i++) { +char param_value[128]; +memset(param_value, 0, sizeof(param_value)); +snprintf(param_value, sizeof(param_value), "c %d:%d rwm", + major_device_number, i); + +int rc = update_cgroups_parameters_func_p("devices", "deny", + container_id, param_value); + +if (0 != rc) { + fprintf(ERRORFILE, "CGroups: Failed to update cgroups\n"); + return_code = -1; + goto cleanup; +} + }{noformat} Is {noformat}+snprintf(param_value, sizeof(param_value), "c %d:%d rwm", + major_device_number, i);{noformat} supposed to be {noformat}+ snprintf(param_value, sizeof(param_value), "c %d:%d rwm", + major_device_number, minor_devices[i]);{noformat}? Seems the minor number is not actually the number being written. If so, I'll create a YARN-6223 subtask to address this. was (Author: jhung): Hi [~leftnoteasy], in {{gpu-module.c#internal_handle_gpu_request}}, there's this code: {noformat}+ // Use cgroup helpers to blacklist devices + for (int i = 0; i < n_minor_devices_to_block; i++) { +char param_value[128]; +memset(param_value, 0, sizeof(param_value)); +snprintf(param_value, sizeof(param_value), "c %d:%d rwm", + major_device_number, i); + +int rc = update_cgroups_parameters_func_p("devices", "deny", + container_id, param_value); + +if (0 != rc) { + fprintf(ERRORFILE, "CGroups: Failed to update cgroups\n"); + return_code = -1; + goto cleanup; +} + }{noformat} Is {noformat}+snprintf(param_value, sizeof(param_value), "c %d:%d rwm", + major_device_number, i);{noformat} supposed to be {noformat}+ snprintf(param_value, sizeof(param_value), "c %d:%d rwm", + major_device_number, minor_devices[i]);{noformat}? Seems the minor number is not actually the number being written. > [YARN-6223] Native code changes to support isolate GPU devices by using > CGroups > --- > > Key: YARN-6852 > URL: https://issues.apache.org/jira/browse/YARN-6852 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Wangda Tan >Assignee: Wangda Tan > Fix For: 3.0.0-beta1 > > Attachments: YARN-6852.001.patch, YARN-6852.002.patch, > YARN-6852.003.patch, YARN-6852.004.patch, YARN-6852.005.patch, > YARN-6852.006.patch, YARN-6852.007.patch, YARN-6852.008.patch, > YARN-6852.009.patch > > > This JIRA plan to add support of: > 1) Isolation in CGroups. (native side). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-6852) [YARN-6223] Native code changes to support isolate GPU devices by using CGroups
[ https://issues.apache.org/jira/browse/YARN-6852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16117607#comment-16117607 ] Wangda Tan edited comment on YARN-6852 at 8/8/17 12:19 AM: --- Again, thanks [~miklos.szeg...@cloudera.com] for your reviews. bq. could not find a doc about this, so I ask. Is there a reason to have a disable list of devices, instead of an enable list? Actually whitelist doesn't work for me, I tried to debug however I cannot find the root cause. Since blacklist and whitelist are just two different approaches for the same purpose. I think we could continue this approach, do you have any concern here? And regarding to file name style. I was just about to rename to underscore, however I found the container-executor itself is "\-", I suggest to convert existing files to "\-" connected, otherwise we have to rename a lot of files. I don't quite care about which style to go, I just want to avoid change too many file. Attached ver.004 patch, (TODO): unit test related cleanups. was (Author: leftnoteasy): Again, thanks [~miklos.szeg...@cloudera.com] for your reviews. bq. could not find a doc about this, so I ask. Is there a reason to have a disable list of devices, instead of an enable list? Actually whitelist doesn't work for me, I tried to debug however I cannot find the root cause. Since blacklist and whitelist are just two different approaches for the same purpose. I think we could continue this approach, do you have any concern here? And regarding to file name style. I was just about to rename to underscore, however I found the container-executor itself is "-", I suggest to convert existing files to "-" connected, otherwise we have to rename a lot of files. I don't quite care about which style to go, I just want to avoid change too many file. Attached ver.004 patch, (TODO): unit test related cleanups. > [YARN-6223] Native code changes to support isolate GPU devices by using > CGroups > --- > > Key: YARN-6852 > URL: https://issues.apache.org/jira/browse/YARN-6852 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Wangda Tan >Assignee: Wangda Tan > Attachments: YARN-6852.001.patch, YARN-6852.002.patch, > YARN-6852.003.patch, YARN-6852.004.patch > > > This JIRA plan to add support of: > 1) Isolation in CGroups. (native side). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-6852) [YARN-6223] Native code changes to support isolate GPU devices by using CGroups
[ https://issues.apache.org/jira/browse/YARN-6852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16115055#comment-16115055 ] Wangda Tan edited comment on YARN-6852 at 8/4/17 10:46 PM: --- Hi Miklos, really appreciate your thorough reviews, very helpful! I address most of your comments. Few items which I haven't addressed in the updated patch. bq. Why do you have cgroup_cfg_section? You could eliminate it and get it all the time or just cache cgroups_root. I still prefer to have it since this can help us get more configs without changing major code structure. bq. int input_argv_idx = 0; the first argument is the process name. Actually the argc and argv are modified in main.c before passed to modules, I removed process name already: {code} +return handle_gpu_request(_cgroups_parameters, "gpu", argc - 1, + [1]); {code} Please let me know if you have any suggestions to the approach. bq. opts->keys = malloc(sizeof(char*) * (argc + 1)); Why argc+1 and not argc-1? Updated to argc. bq. required and has_values could be implemented as a bit array instead of a byte array. Another option ... Since container-executor is not a memory-intensive application, I would prefer to spend time on changing it when it is necessary or there's any safety concerns. :) bq. This pattern is C+0x. I think Varun mentioned this in YARN-6033, it is C99: https://stackoverflow.com/a/330867 bq. arr[idx] = n; There is no overflow check. This could also be exploitable. This might not be an issue since we have already checked the input string once: {code} for (int i = 0; i < strlen(input); i++) { if (input[i] == ',') { n_numbers++; } } {code} bq. container_1 is an invalid container id in the unit tests. They will fail. Did you mean we should not fail the check? "container_1" is actually an invalid id in YARN. bq. There is no indentation after namespace ContainerExecutor I would prefer to not add extra indention for namespace. There're some discussions on SO: https://stackoverflow.com/questions/713698/c-namespaces-advice bq. static std::vector cgroups_parameters_invoked; I think you should consider std::string here. No need to malloc later bq. You do not clean up files in the unit tests, do you? Is there a reason? (TODO) Will include unit test related changes and clean ups in the next patch. Updated ver.003 patch. [~miklos.szeg...@cloudera.com], mind to check again? was (Author: leftnoteasy): Hi Miklos, really appreciate your thorough reviews, very helpful! I address most of your comments. Few items which I haven't addressed in the updated patch. bq. Why do you have cgroup_cfg_section? You could eliminate it and get it all the time or just cache cgroups_root. I still prefer to have it since this can help us get more configs without changing major code structure. bq. int input_argv_idx = 0; the first argument is the process name. Actually the argc and argv are modified in main.c before passed to modules, I removed process name already: {code} +return handle_gpu_request(_cgroups_parameters, "gpu", argc - 1, + [1]); {code} Please let me know if you have any suggestions to the approach. bq. opts->keys = malloc(sizeof(char*) * (argc + 1)); Why argc+1 and not argc-1? Updated to argc. bq. required and has_values could be implemented as a bit array instead of a byte array. Another option ... Since container-executor is not a memory-intensive application, I would prefer to spend time on changing it when it is necessary or there's any safety concerns. :) bq. This pattern is C+0x. I think Varun mentioned this in YARN-6033, it is C99: https://stackoverflow.com/a/330867 bq. arr[idx] = n; There is no overflow check. This could also be exploitable. This might not be an issue since we have already checked the input string once: {code} for (int i = 0; i < strlen(input); i++) { if (input[i] == ',') { n_numbers++; } } {code} bq. container_1 is an invalid container id in the unit tests. They will fail. Did you mean we should not fail the check? "container_1" is actually an invalid id in YARN. bq. There is no indentation after namespace ContainerExecutor I would prefer to not add extra indention for namespace. There're some discussions on SO: https://stackoverflow.com/questions/713698/c-namespaces-advice bq. static std::vector cgroups_parameters_invoked; I think you should consider std::string here. No need to malloc later bq. You do not clean up files in the unit tests, do you? Is there a reason? (TODO) Will include unit test related changes and clean ups in the next patch. Updated ver.003 patch. > [YARN-6223] Native code changes to support isolate GPU devices by using > CGroups > --- > > Key: YARN-6852 > URL:
[jira] [Comment Edited] (YARN-6852) [YARN-6223] Native code changes to support isolate GPU devices by using CGroups
[ https://issues.apache.org/jira/browse/YARN-6852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16112193#comment-16112193 ] Miklos Szegedi edited comment on YARN-6852 at 8/3/17 5:05 AM: -- … {{Return a parsed commandline options.}} There is a typo in this sentence. {code} struct section empty_executor_cfg = {.size=0, .kv_pairs=NULL}; {code} This pattern is C++0x, should not be used in standard C. Note: I am not against converting the whole tool to C++… Moreover {{section->name}} is not set to NULL above. {code} const struct section* cfg_section; static int config_initialized = 0; {code} I see the same issues as in the groups case. (cfg_section==NULL can be used instead of config_initialized, cfg_section can be static, etc.) n_minor_devices_to_block could be unsigned int, so that the negative check is not needed strtol is a better alternative to atoi {code} char param_value[128]; snprintf(param_value, 128, "c %d:%d rwm", major_device_number, i); {code} This could be written as: {code} snprintf(param_value, sizeof(param_value), "c %d:%d rwm", major_device_number, i); {code} I do not see allowed_minor_numbers released anywhere. {code} char container_id[128]; memset(container_id, 0, 128); {code} It should be memset(container_id, 0, sizeof(container_id)); {{strcpy(container_id, optarg);}} This is dangerous without size. Use strncpy. {{fflush(LOGFILE);}} This avoids caching and can be a performance bottleneck. I think it is better to avoid unless there is a good reason. {code} const char *cgroups_param_path; const char* cgroups_param_value; {code} Misaligned space. In module_enabled I would name rc something else. You marked 0 as rc success in other functions. all_numbers: You touch the characters n^2 times. You should call strlen() once and cache the value. all_numbers: {code} if (strlen(input) == 0) { return 0; } {code} This is not necessary. {code} int* arr = (*numbers); arr = malloc(sizeof(int) * n_numbers); {code} Does this return anything? I think it should be: {code} (*numbers) = malloc(sizeof(int) * n_numbers); {code} . {code} char* input_cpy = malloc(strlen(input)); strcpy(input_cpy, input); {code} There is no null pointer check. {{arr[idx] = n;}} There is no overflow check. This could also be exploitable. get_numbers_split_by_comma will return an array if a single 0 for an empty string. It should return ret_n_number=0 instead. {code} if (strlen(p) == 0) { return 0; } {code} You could just check p[0]==0 {code} if (mkdirs(TEST_ROOT, 0755) != 0) { exit(1); } {code} This needs some logging to show what happened. {{fprintf(LOGFILE, "\nTesting %s\n", __func__);}} GTest prints out the function name itself. container_1 is an invalid container id in the unit tests. They will fail. There is no indentation after {{namespace ContainerExecutor}} {{static std::vector cgroups_parameters_invoked;}} I think you should consider std::string here. No need to malloc later You do not clean up files in the unit tests, do you? Is there a reason? was (Author: miklos.szeg...@cloudera.com): … {{Return a parsed commandline options.}} There is a typo in this sentence. {code} struct section empty_executor_cfg = {.size=0, .kv_pairs=NULL}; {code} This pattern is C++0x, should not be used in standard C. Note: I am not against converting the whole tool to C++… Moreover {{section->name}} is not set to NULL above. {code} const struct section* cfg_section; static int config_initialized = 0; {code} I see the same issues as in the groups case. (cfg_section==NULL can be used instead of config_initialized, cfg_section can be static, etc.) n_minor_devices_to_block could be unsigned int, so that the negative check is not needed strtol is a better alternative to atoi {code} char param_value[128]; snprintf(param_value, 128, "c %d:%d rwm", major_device_number, i); {code} This could be written as: {code} snprintf(param_value, sizeof(param_value), "c %d:%d rwm", major_device_number, i); {code} I do not see allowed_minor_numbers released anywhere. {code} char container_id[128]; memset(container_id, 0, 128); {code} It should be memset(container_id, 0, sizeof(container_id)); {{strcpy(container_id, optarg);}} This is dangerous without size. Use strncpy. {{fflush(LOGFILE);}} This avoids caching and can be a performance bottleneck. I think it is better to avoid unless there is a good reason. {code} const char *cgroups_param_path; const char* cgroups_param_value; {code} Misaligned space. In module_enabled I would name rc something else. You marked 0 as rc success in other functions. all_numbers: You call strlen n^2 times. You should call it once and cache the value. all_numbers: {code} if (strlen(input) == 0) { return 0; } {code} This is not necessary. {code} int* arr = (*numbers); arr = malloc(sizeof(int) * n_numbers); {code} Does this return anything? I