[jira] [Issue Comment Deleted] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess
[ https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris updated MESOS-5342:
-------------------------
    Comment: was deleted

(was: Implemented a Mesos module and resource estimator; source is posted at https://github.com/ct-clmsn/mesos-cpusets
Added performance-counter-enabled tools and a Mesos executor; source is posted at https://github.com/ct-clmsn/mesos-papi)

> CPU pinning/binding support for CgroupsCpushareIsolatorProcess
> --------------------------------------------------------------
>
>                 Key: MESOS-5342
>                 URL: https://issues.apache.org/jira/browse/MESOS-5342
>             Project: Mesos
>          Issue Type: Improvement
>          Components: cgroups, containerization
>    Affects Versions: 0.28.1
>            Reporter: Chris
>              Labels: cgroups, cpu, cpu-usage, gpu, isolation, isolator, mentor, perfomance
>
> The cgroups isolator currently lacks support for binding (also called
> pinning) containers to a set of cores. The GNU/Linux kernel is known to make
> sub-optimal core assignments for processes and threads. Poor assignments
> hurt program performance, particularly through lost cache locality.
> Applications requiring GPU resources can benefit from this feature by getting
> access to the cores closest to the GPU hardware, which reduces CPU-GPU copy
> latency.
> Most cluster management systems from the HPC community (for example, SLURM)
> provide both cgroup isolation and CPU binding. This feature would provide
> similar capabilities. The current interest in supporting Intel's Cache
> Allocation Technology, and the advent of Intel's Knights-series processors,
> will require making choices about where containers run on the Mesos agent's
> processor cores. This feature is a step toward developing a robust solution.
> The improvement in this JIRA ticket will handle hardware topology detection,
> track container-to-core utilization in a histogram, and use a mathematical
> optimization technique to select cores for container assignment based on
> latency and the container-to-core utilization histogram.
> For GPU tasks, the improvement will prioritize selection of cores based on
> latency between the GPU and cores in an effort to minimize copy latency.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
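For readers unfamiliar with the mechanism behind pinning: on Linux, the cpuset cgroup controller restricts a group's tasks to the cores named in its `cpuset.cpus` file, written as a comma-separated list of ids and ranges such as `0-3,8`. The following is a minimal Python sketch of converting between that list format and a set of core ids. The helper names `parse_cpu_list` and `format_cpu_list` are hypothetical, and this is not code from the ticket or from Mesos.

```python
def parse_cpu_list(text):
    """Parse a cpuset-style list such as '0-3,8' into a sorted list of core ids."""
    cores = set()
    for part in text.strip().split(","):
        if not part:
            continue  # tolerate an empty string or a trailing comma
        if "-" in part:
            lo, hi = part.split("-", 1)
            cores.update(range(int(lo), int(hi) + 1))  # ranges are inclusive
        else:
            cores.add(int(part))
    return sorted(cores)


def format_cpu_list(cores):
    """Format core ids as a compact cpuset-style list, merging consecutive ids."""
    ids = sorted(set(cores))
    ranges = []
    i = 0
    while i < len(ids):
        j = i
        # Extend the run while the next id is consecutive.
        while j + 1 < len(ids) and ids[j + 1] == ids[j] + 1:
            j += 1
        ranges.append(str(ids[i]) if i == j else f"{ids[i]}-{ids[j]}")
        i = j + 1
    return ",".join(ranges)
```

An isolator that has chosen cores for a container could pass them through something like `format_cpu_list` before writing the result to the container's `cpuset.cpus` file; how the ticket's implementation does this is not stated in the thread.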
[jira] [Commented] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess
[ https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15815211#comment-15815211 ] Chris commented on MESOS-5342: -- Implemented a Mesos module and resource estimator; source is posted at https://github.com/ct-clmsn/mesos-cpusets. Added performance-counter-enabled tools and a Mesos executor; source is posted at https://github.com/ct-clmsn/mesos-papi
[jira] [Updated] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess
[ https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris updated MESOS-5342: - Labels: cgroups cpu cpu-usage gpu isolation isolator mentor perfomance (was: )
[jira] [Updated] (MESOS-5358) Design Doc for CPU pinning/binding support (MESOS-5342)
[ https://issues.apache.org/jira/browse/MESOS-5358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris updated MESOS-5358: - Labels: cgroups cpu cpu-usage documentation gpu isolation isolator mentor newbie performance (was: documentation mentor newbie performance) > Design Doc for CPU pinning/binding support (MESOS-5342) > --- > > Key: MESOS-5358 > URL: https://issues.apache.org/jira/browse/MESOS-5358 > Project: Mesos > Issue Type: Documentation > Components: documentation >Affects Versions: 0.28.1 >Reporter: Chris > Labels: cgroups, cpu, cpu-usage, documentation, gpu, isolation, > isolator, mentor, newbie, performance > > Develop design document for MESOS-5342.
[jira] [Commented] (MESOS-5358) Design Doc for CPU pinning/binding support (MESOS-5342)
[ https://issues.apache.org/jira/browse/MESOS-5358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15278570#comment-15278570 ] Chris commented on MESOS-5358: -- Requesting a shepherd!
[jira] [Issue Comment Deleted] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess
[ https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris updated MESOS-5342: - Comment: was deleted (was: Documentation for this ticket is MESOS-5358)
[jira] [Commented] (MESOS-5358) Design Doc for CPU pinning/binding support (MESOS-5342)
[ https://issues.apache.org/jira/browse/MESOS-5358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15278416#comment-15278416 ] Chris commented on MESOS-5358: -- Implementation ticket is MESOS-5342 (https://issues.apache.org/jira/browse/MESOS-5342)
[jira] [Updated] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess
[ https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris updated MESOS-5342: - Documentation for this ticket is MESOS-5358
[jira] [Commented] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess
[ https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15278405#comment-15278405 ] Chris commented on MESOS-5342: -- Note: this is my first design document for Mesos, so it's not perfect.
[jira] [Commented] (MESOS-5358) DesignDoc for CPU pinning/binding support (MESOS-5342)
[ https://issues.apache.org/jira/browse/MESOS-5358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15278400#comment-15278400 ] Chris commented on MESOS-5358: -- Design document posted here: https://docs.google.com/document/d/1G3L1Tdulg5iW7hZ2WXbG-bqROILu7zdBh2aWYu3An6A/edit?usp=sharing
[jira] [Issue Comment Deleted] (MESOS-5358) DesignDoc for CPU pinning/binding support (MESOS-5342)
[ https://issues.apache.org/jira/browse/MESOS-5358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris updated MESOS-5358: - Comment: was deleted (was: https://docs.google.com/document/d/1G3L1Tdulg5iW7hZ2WXbG-bqROILu7zdBh2aWYu3An6A/edit?usp=sharing)
[jira] [Commented] (MESOS-5358) DesignDoc for CPU pinning/binding support (MESOS-5342)
[ https://issues.apache.org/jira/browse/MESOS-5358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15278399#comment-15278399 ] Chris commented on MESOS-5358: -- https://docs.google.com/document/d/1G3L1Tdulg5iW7hZ2WXbG-bqROILu7zdBh2aWYu3An6A/edit?usp=sharing
[jira] [Commented] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess
[ https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15277463#comment-15277463 ] Chris commented on MESOS-5342: -- [~kaysoky] Sure thing. I did some of this work in a local README before writing the code, so it shouldn't be too much trouble to transfer that information to Google Docs. Also, should the source be posted on GitHub under a separate branch for review?
[jira] [Commented] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess
[ https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15277207#comment-15277207 ] Chris commented on MESOS-5342: -- [~kaysoky] Where are design documents supposed to be posted? I've gone through the patch submission documentation and will review the testing documentation and style guides.
[jira] [Commented] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess
[ https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276658#comment-15276658 ] Chris commented on MESOS-5342: -- Forgot to mention: a shepherd is needed to support integration of this feature!
[jira] [Commented] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess
[ https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276521#comment-15276521 ] Chris commented on MESOS-5342: -- For information about submodular functions (and why they were selected for this problem), I strongly suggest watching at least this YouTube lecture (ideally the entire series of videos), publicly available from MLSS Iceland 2014: https://youtu.be/6ThMzlHdKsI
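The thread does not state the objective function used for core selection, so the following is only an illustrative Python sketch of a greedy, marginal-cost selection of the kind commonly paired with submodular objectives. It combines a per-core utilization histogram with a pairwise latency table. The function name `select_cores`, the cost formula, the `alpha` weight, and the data layout are all assumptions made for this example, not the ticket's design.

```python
def select_cores(load, distance, k, alpha=1.0):
    """Greedily pick k cores by lowest marginal cost.

    load:     dict core -> containers already pinned there (the histogram)
    distance: dict (a, b) -> relative latency between cores a and b, a < b
    alpha:    assumed weight of latency relative to load
    """
    if k > len(load):
        raise ValueError("requested more cores than are available")

    def dist(a, b):
        return 0.0 if a == b else distance[(min(a, b), max(a, b))]

    chosen = []
    while len(chosen) < k:
        # Marginal cost of a candidate: its current load plus the weighted
        # latency to every core already chosen. Ties break on core id.
        best = min(
            (c for c in load if c not in chosen),
            key=lambda c: (load[c] + alpha * sum(dist(c, s) for s in chosen), c),
        )
        chosen.append(best)
    return sorted(chosen)


# Hypothetical two-socket host: cores 0-1 share a socket, as do cores 2-3.
example_load = {0: 1, 1: 0, 2: 0, 3: 0}
example_distance = {(0, 1): 1, (2, 3): 1,
                    (0, 2): 3, (0, 3): 3, (1, 2): 3, (1, 3): 3}
```

With the example data, `select_cores(example_load, example_distance, 2)` keeps both cores on one socket despite core 0 being busier, while `alpha=0` ignores latency and simply takes the least-loaded cores. For a GPU task, the same loop could add a per-core GPU latency term to the cost; that extension is likewise an assumption.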
[jira] [Issue Comment Deleted] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess
[ https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris updated MESOS-5342: - Comment: was deleted (was: For information about submodular functions (and why they were selected for this problem), I strongly suggest watching at least this YouTube lecture (ideally the entire series of videos), publicly available from MLSS Iceland 2014: https://youtu.be/6ThMzlHdKsI)
[jira] [Commented] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess
[ https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276518#comment-15276518 ] Chris commented on MESOS-5342: -- For information about submodular functions (and why they were selected for this problem), I strongly suggest watching at least this YouTube lecture (ideally the entire series of videos), publicly available from MLSS Iceland 2014: https://youtu.be/6ThMzlHdKsI
[jira] [Issue Comment Deleted] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess
[ https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris updated MESOS-5342: - Comment: was deleted (was: For information about submodular functions (and why they were selected for this problem), I strongly suggest watching this YouTube video: https://youtu.be/6ThMzlHdKsI)
[jira] [Commented] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess
[ https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276514#comment-15276514 ] Chris commented on MESOS-5342: -- For information about submodular functions (and why they were selected for this problem), I strongly suggest reviewing this YouTube video: https://youtu.be/6ThMzlHdKsI > CPU pinning/binding support for CgroupsCpushareIsolatorProcess > -- > > Key: MESOS-5342 > URL: https://issues.apache.org/jira/browse/MESOS-5342 > Project: Mesos > Issue Type: Improvement > Components: cgroups, containerization >Affects Versions: 0.28.1 >Reporter: Chris > > The cgroups isolator currently lacks support for binding (also called > pinning) containers to a set of cores. The GNU/Linux kernel is known to make > sub-optimal core assignments for processes and threads. Poor assignments > impact program performance, specifically in terms of cache locality. > Applications requiring GPU resources can benefit from this feature by getting > access to cores closest to the GPU hardware, which reduces cpu-gpu copy > latency. > Most cluster management systems from the HPC community (SLURM) provide both > cgroup isolation and cpu binding. This feature would provide similar > capabilities. The current interest in supporting Intel's Cache Allocation > Technology, and the advent of Intel's Knights-series processors, will require > making choices about where container's are going to run on the mesos-agent's > processor(s) cores - this feature is a step toward developing a robust > solution. > The improvement in this JIRA ticket will handle hardware topology detection, > track container-to-core utilization in a histogram, and use a mathematical > optimization technique to select cores for container assignment based on > latency and the container-to-core utilization histogram. > For GPU tasks, the improvement will prioritize selection of cores based on > latency between the GPU and cores in an effort to minimize copy latency. 
[jira] [Issue Comment Deleted] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess
[ https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris updated MESOS-5342: - Comment: was deleted (was: Fixed a small bug in the greedy submodular subset selection algorithm. The "submodular cost" of selecting a core was being used in the knapsack budget test (cores currently have an at-most-budget-cost of "1.0"). The correct cost is now being used in the test.) > CPU pinning/binding support for CgroupsCpushareIsolatorProcess > -- > > Key: MESOS-5342 > URL: https://issues.apache.org/jira/browse/MESOS-5342 > Project: Mesos > Issue Type: Improvement > Components: cgroups, containerization >Affects Versions: 0.28.1 >Reporter: Chris > > The cgroups isolator currently lacks support for binding (also called > pinning) containers to a set of cores. The GNU/Linux kernel is known to make > sub-optimal core assignments for processes and threads. Poor assignments > impact program performance, specifically in terms of cache locality. > Applications requiring GPU resources can benefit from this feature by getting > access to cores closest to the GPU hardware, which reduces cpu-gpu copy > latency. > Most cluster management systems from the HPC community (SLURM) provide both > cgroup isolation and cpu binding. This feature would provide similar > capabilities. The current interest in supporting Intel's Cache Allocation > Technology, and the advent of Intel's Knights-series processors, will require > making choices about where container's are going to run on the mesos-agent's > processor(s) cores - this feature is a step toward developing a robust > solution. > The improvement in this JIRA ticket will handle hardware topology detection, > track container-to-core utilization in a histogram, and use a mathematical > optimization technique to select cores for container assignment based on > latency and the container-to-core utilization histogram. 
> For GPU tasks, the improvement will prioritize selection of cores based on > latency between the GPU and cores in an effort to minimize copy latency. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess
[ https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276500#comment-15276500 ] Chris commented on MESOS-5342: -- Fixed a small bug in the greedy submodular subset selection algorithm. The "submodular cost" of selecting a core was being used in the knapsack budget test (cores currently have an at-most-budget-cost of "1.0"). The correct cost is now being used in the test. > CPU pinning/binding support for CgroupsCpushareIsolatorProcess > -- > > Key: MESOS-5342 > URL: https://issues.apache.org/jira/browse/MESOS-5342 > Project: Mesos > Issue Type: Improvement > Components: cgroups, containerization >Affects Versions: 0.28.1 >Reporter: Chris > > The cgroups isolator currently lacks support for binding (also called > pinning) containers to a set of cores. The GNU/Linux kernel is known to make > sub-optimal core assignments for processes and threads. Poor assignments > impact program performance, specifically in terms of cache locality. > Applications requiring GPU resources can benefit from this feature by getting > access to cores closest to the GPU hardware, which reduces cpu-gpu copy > latency. > Most cluster management systems from the HPC community (SLURM) provide both > cgroup isolation and cpu binding. This feature would provide similar > capabilities. The current interest in supporting Intel's Cache Allocation > Technology, and the advent of Intel's Knights-series processors, will require > making choices about where container's are going to run on the mesos-agent's > processor(s) cores - this feature is a step toward developing a robust > solution. > The improvement in this JIRA ticket will handle hardware topology detection, > track container-to-core utilization in a histogram, and use a mathematical > optimization technique to select cores for container assignment based on > latency and the container-to-core utilization histogram. 
> For GPU tasks, the improvement will prioritize selection of cores based on > latency between the GPU and cores in an effort to minimize copy latency. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
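A minimal sketch of the budget test described in the comment above, assuming a cost-benefit greedy loop. The thread does not state the objective used in the patch, so the facility-location function `coverage`, the toy latency matrix, and the names `greedy_select`, `cost`, and `budget` are all hypothetical. Note that this stand-in objective favors cores that are spread out; it is here only because it is a well-known monotone submodular function. The point illustrated is that the knapsack check compares each core's own cost against the budget, while the marginal gain is used only for ranking.

```python
from typing import Callable, Dict, Iterable, List, Sequence


def coverage(selected: Sequence[int], similarity: List[List[float]]) -> float:
    """Facility-location objective: a standard monotone submodular function.

    Each core is credited with its best similarity to any selected core.
    """
    if not selected:
        return 0.0
    return sum(max(row[j] for j in selected) for row in similarity)


def greedy_select(
    candidates: Iterable[int],
    objective: Callable[[List[int]], float],
    cost: Dict[int, float],
    budget: float,
) -> List[int]:
    """Cost-benefit greedy selection under a knapsack (budget) constraint."""
    selected: List[int] = []
    spent = 0.0
    remaining = sorted(candidates)
    while remaining:
        base = objective(selected)
        # Rank by marginal gain per unit cost (ties go to the lowest index).
        best = max(
            remaining,
            key=lambda c: (objective(selected + [c]) - base) / cost[c],
        )
        remaining.remove(best)
        # Budget test: uses the core's cost, not its marginal gain.
        if spent + cost[best] <= budget:
            selected.append(best)
            spent += cost[best]
    return selected


# Hypothetical two-socket machine: latency 1 within a socket, 3 across.
latency = [
    [0, 1, 3, 3],
    [1, 0, 3, 3],
    [3, 3, 0, 1],
    [3, 3, 1, 0],
]
similarity = [[1.0 / (1.0 + value) for value in row] for row in latency]
cost = {core: 1.0 for core in range(4)}  # each core costs at most 1.0

chosen = greedy_select(range(4), lambda s: coverage(s, similarity), cost, budget=2.0)
print(chosen)
```

With unit costs, the budget simply caps the number of selected cores; non-uniform costs would let heavily loaded cores consume more of the budget.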
[jira] [Updated] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess
[ https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris updated MESOS-5342: - Description: The cgroups isolator currently lacks support for binding (also called pinning) containers to a set of cores. The GNU/Linux kernel is known to make sub-optimal core assignments for processes and threads. Poor assignments impact program performance, specifically in terms of cache locality. Applications requiring GPU resources can benefit from this feature by getting access to cores closest to the GPU hardware, which reduces cpu-gpu copy latency. Most cluster management systems from the HPC community (SLURM) provide both cgroup isolation and cpu binding. This feature would provide similar capabilities. The current interest in supporting Intel's Cache Allocation Technology, and the advent of Intel's Knights-series processors, will require making choices about where container's are going to run on the mesos-agent's processor(s) cores - this feature is a step toward developing a robust solution. The improvement in this JIRA ticket will handle hardware topology detection, track container-to-core utilization in a histogram, and use a mathematical optimization technique to select cores for container assignment based on latency and the container-to-core utilization histogram. For GPU tasks, the improvement will prioritize selection of cores based on latency between the GPU and cores in an effort to minimize copy latency. was: The cgroups isolator currently lacks support for binding (also called pinning) containers to a set of cores. The GNU/Linux kernel is known to make sub-optimal core assignments for processes and threads. Poor assignments impact program performance,specifically in terms of cache locality. Applications requiring GPU resources can benefit from this feature by getting access to cores closest to the GPU hardware, which reduces cpu-gpu copy latency. 
Most cluster management systems from the HPC community (SLURM) provide both cgroup isolation and cpu binding. This feature would provide similar capabilities. The current interest in supporting Intel's Cache Allocation Technology will require making choices about where container's are going to run on the mesos-agent's processor(s) - this feature is a step toward developing a robust solution. The improvement in this JIRA ticket will handle hardware topology detection, track container-to-core utilization in a histogram, and use a mathematical optimization technique to select cores for container assignment based on latency and the container-to-core utilization histogram. For GPU tasks, the improvement will prioritize selection of cores based on latency between the GPU and cores in an effort to minimize copy latency. > CPU pinning/binding support for CgroupsCpushareIsolatorProcess > -- > > Key: MESOS-5342 > URL: https://issues.apache.org/jira/browse/MESOS-5342 > Project: Mesos > Issue Type: Improvement > Components: cgroups, containerization >Affects Versions: 0.28.1 >Reporter: Chris > > The cgroups isolator currently lacks support for binding (also called > pinning) containers to a set of cores. The GNU/Linux kernel is known to make > sub-optimal core assignments for processes and threads. Poor assignments > impact program performance, specifically in terms of cache locality. > Applications requiring GPU resources can benefit from this feature by getting > access to cores closest to the GPU hardware, which reduces cpu-gpu copy > latency. > Most cluster management systems from the HPC community (SLURM) provide both > cgroup isolation and cpu binding. This feature would provide similar > capabilities. 
The current interest in supporting Intel's Cache Allocation > Technology, and the advent of Intel's Knights-series processors, will require > making choices about where container's are going to run on the mesos-agent's > processor(s) cores - this feature is a step toward developing a robust > solution. > The improvement in this JIRA ticket will handle hardware topology detection, > track container-to-core utilization in a histogram, and use a mathematical > optimization technique to select cores for container assignment based on > latency and the container-to-core utilization histogram. > For GPU tasks, the improvement will prioritize selection of cores based on > latency between the GPU and cores in an effort to minimize copy latency. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess
[ https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris updated MESOS-5342: - Description: The cgroups isolator currently lacks support for binding (also called pinning) containers to a set of cores. The GNU/Linux kernel is known to make sub-optimal core assignments for processes and threads. Poor assignments impact program performance,specifically in terms of cache locality. Applications requiring GPU resources can benefit from this feature by getting access to cores closest to the GPU hardware, which reduces cpu-gpu copy latency. Most cluster management systems from the HPC community (SLURM) provide both cgroup isolation and cpu binding. This feature would provide similar capabilities. The current interest in supporting Intel's Cache Allocation Technology will require making choices about where container's are going to run on the mesos-agent's processor(s) - this feature is a step toward developing a robust solution. The improvement in this JIRA ticket will handle hardware topology detection, track container-to-core utilization in a histogram, and use a mathematical optimization technique to select cores for container assignment based on latency and the container-to-core utilization histogram. For GPU tasks, the improvement will prioritize selection of cores based on latency between the GPU and cores in an effort to minimize copy latency. was: The cgroups isolator currently lacks support for binding (also called pinning) containers to a set of cores. The GNU/Linux kernel is known to make sub-optimal core assignments for processes and threads. Poor assignments impact program performance, particularly in the case of applications requiring GPU resources. Most cluster management systems from the HPC community (SLURM) provide both cgroup isolation and cpu binding. This feature would provide similar capabilities. 
The current interest in supporting Intel's Cache Allocation Technology will require making choices about where container's are going to run on the mesos-agent's processor(s) - this feature is a step toward developing a robust solution. The improvement in this JIRA ticket will handle hardware topology detection, track container-to-core utilization in a histogram, and use a mathematical optimization technique to select cores for container assignment based on latency and the container-to-core utilization histogram. For GPU tasks, the improvement will prioritize selection of cores based on latency between the GPU and cores in an effort to minimize copy latency. > CPU pinning/binding support for CgroupsCpushareIsolatorProcess > -- > > Key: MESOS-5342 > URL: https://issues.apache.org/jira/browse/MESOS-5342 > Project: Mesos > Issue Type: Improvement > Components: cgroups, containerization >Affects Versions: 0.28.1 >Reporter: Chris > > The cgroups isolator currently lacks support for binding (also called > pinning) containers to a set of cores. The GNU/Linux kernel is known to make > sub-optimal core assignments for processes and threads. Poor assignments > impact program performance,specifically in terms of cache locality. > Applications requiring GPU resources can benefit from this feature by getting > access to cores closest to the GPU hardware, which reduces cpu-gpu copy > latency. > Most cluster management systems from the HPC community (SLURM) provide both > cgroup isolation and cpu binding. This feature would provide similar > capabilities. The current interest in supporting Intel's Cache Allocation > Technology will require making choices about where container's are going to > run on the mesos-agent's processor(s) - this feature is a step toward > developing a robust solution. 
> The improvement in this JIRA ticket will handle hardware topology detection, > track container-to-core utilization in a histogram, and use a mathematical > optimization technique to select cores for container assignment based on > latency and the container-to-core utilization histogram. > For GPU tasks, the improvement will prioritize selection of cores based on > latency between the GPU and cores in an effort to minimize copy latency. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess
[ https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276195#comment-15276195 ] Chris commented on MESOS-5342: -- The implementation has been posted to Review Board and successfully passed 'make check'. It requires the hwloc library to perform machine hardware topology discovery and CPU binding, so updates were made to configure.ac and Makefile.am. The implementation is a new "device" called "hwloc" under the cgroups isolator directory.

The implementation detects the topology and computes the total number of cores required by the container (it also checks whether the container requires a GPU). If the container requires a GPU, the topology information is used to find the "closest" cores based on latency. If the container requires only CPU, a histogram of task assignments to cores is checked. If the histogram is "empty" (all cores have a value of 1.0), a random core is selected and the latency matrix is used to find the cores that are "closest" to that core. The histogram is then updated. If the histogram is "not empty", a greedy submodular subset selection algorithm is used to select N cores using the latency matrix and a "per-core" cost value. The "per-core" cost value is a normalized version of the histogram divided by the number of processing units available on each core. Greedy submodular subset selection algorithms use the "diminishing returns" property to find a near-optimal subset of items under a knapsack constraint.

When the list of cores is returned, a bit vector representing a cpuset is bound to the container's pid_t. When the container is cleaned up, the histogram is updated by reducing the current task count on each core assigned to the pid_t by 1.0.
> CPU pinning/binding support for CgroupsCpushareIsolatorProcess > -- > > Key: MESOS-5342 > URL: https://issues.apache.org/jira/browse/MESOS-5342 > Project: Mesos > Issue Type: Improvement > Components: cgroups, containerization >Affects Versions: 0.28.1 >Reporter: Chris > > The cgroups isolator currently lacks support for binding (also called > pinning) containers to a set of cores. The GNU/Linux kernel is known to make > sub-optimal core assignments for processes and threads. Poor assignments > impact program performance, particularly in the case of applications > requiring GPU resources. > Most cluster management systems from the HPC community (SLURM) provide both > cgroup isolation and cpu binding. This feature would provide similar > capabilities. The current interest in supporting Intel's Cache Allocation > Technology will require making choices about where container's are going to > run on the mesos-agent's processor(s) - this feature is a step toward > developing a robust solution. > The improvement in this JIRA ticket will handle hardware topology detection, > track container-to-core utilization in a histogram, and use a mathematical > optimization technique to select cores for container assignment based on > latency and the container-to-core utilization histogram. > For GPU tasks, the improvement will prioritize selection of cores based on > latency between the GPU and cores in an effort to minimize copy latency. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
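The bookkeeping described in the comment above can be sketched as follows. This is an illustration under stated assumptions, not the posted patch: the names `CoreHistogram`, `nearest_cores`, `per_core_cost`, and `cpuset_mask` are hypothetical, "normalized" is assumed to mean dividing each count by the histogram total, and the latency matrix is toy data. The actual implementation relies on hwloc for topology discovery and binding, which this sketch does not call.

```python
from typing import Dict, List


def nearest_cores(latency: List[List[float]], anchor: int, n: int) -> List[int]:
    """Return the n cores with the lowest latency to `anchor` (anchor included)."""
    order = sorted(range(len(latency)), key=lambda core: (latency[anchor][core], core))
    return order[:n]


def per_core_cost(histogram: List[float], pus_per_core: List[int]) -> List[float]:
    """Normalized histogram divided by each core's processing-unit count.

    Assumption: normalization divides by the histogram total.
    """
    total = sum(histogram)
    return [(count / total) / pus for count, pus in zip(histogram, pus_per_core)]


def cpuset_mask(cores: List[int]) -> int:
    """Bit vector with bit i set for every selected core i."""
    mask = 0
    for core in cores:
        mask |= 1 << core
    return mask


class CoreHistogram:
    """Tracks how many tasks are assigned to each core."""

    def __init__(self, n_cores: int) -> None:
        # Per the comment, an "empty" histogram holds 1.0 for every core.
        self.counts: List[float] = [1.0] * n_cores
        self.assigned: Dict[int, List[int]] = {}

    def is_empty(self) -> bool:
        return all(count == 1.0 for count in self.counts)

    def assign(self, pid: int, cores: List[int]) -> None:
        for core in cores:
            self.counts[core] += 1.0
        self.assigned[pid] = list(cores)

    def release(self, pid: int) -> None:
        # On container cleanup, reduce each assigned core's count by 1.0.
        for core in self.assigned.pop(pid, []):
            self.counts[core] -= 1.0


# Hypothetical two-socket machine: latency 1 within a socket, 3 across.
latency = [
    [0, 1, 3, 3],
    [1, 0, 3, 3],
    [3, 3, 0, 1],
    [3, 3, 1, 0],
]
histogram = CoreHistogram(4)
cores = nearest_cores(latency, anchor=2, n=2)  # anchor stands in for the random core
histogram.assign(4242, cores)                  # 4242 is a made-up pid
print(cores, bin(cpuset_mask(cores)), per_core_cost(histogram.counts, [2, 2, 2, 2]))
```

For a GPU task, the same nearest-neighbor lookup could be run against a row of GPU-to-core latencies instead of a core's row; that row is likewise an assumption here.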
[jira] [Issue Comment Deleted] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess
[ https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris updated MESOS-5342: - Comment: was deleted (was: The implementation has been posted to Review Board and successfully passed 'make check'. It requires the hwloc library to perform machine hardware topology discovery and CPU binding, so updates were made to configure.ac and Makefile.am. The implementation is a new "device" called "hwloc" under the cgroups isolator directory. The implementation detects the topology and computes the total number of cores required by the container (it also checks whether the container requires a GPU). If the container requires a GPU, the topology information is used to find the "closest" cores based on latency. If the container requires only CPU, a histogram of task assignments to cores is checked. If the histogram is "empty" (all cores have a value of 1.0), a random core is selected and the latency matrix is used to find the cores that are "closest" to that core. The histogram is then updated. If the histogram is "not empty", a greedy submodular subset selection algorithm is used to select N cores using the latency matrix and a "per-core" cost value. The "per-core" cost value is a normalized version of the histogram divided by the number of processing units available on each core. Greedy submodular subset selection algorithms use the "diminishing returns" property to find a near-optimal subset of items under a knapsack constraint. When the list of cores is returned, a bit vector representing a cpuset is bound to the container's pid_t. When the container is cleaned up, the histogram is updated by reducing the current task count on each core assigned to the pid_t by 1.0.)
> CPU pinning/binding support for CgroupsCpushareIsolatorProcess > -- > > Key: MESOS-5342 > URL: https://issues.apache.org/jira/browse/MESOS-5342 > Project: Mesos > Issue Type: Improvement > Components: cgroups, containerization >Affects Versions: 0.28.1 >Reporter: Chris > > The cgroups isolator currently lacks support for binding (also called > pinning) containers to a set of cores. The GNU/Linux kernel is known to make > sub-optimal core assignments for processes and threads. Poor assignments > impact program performance, particularly in the case of applications > requiring GPU resources. > Most cluster management systems from the HPC community (SLURM) provide both > cgroup isolation and cpu binding. This feature would provide similar > capabilities. The current interest in supporting Intel's Cache Allocation > Technology will require making choices about where container's are going to > run on the mesos-agent's processor(s) - this feature is a step toward > developing a robust solution. > The improvement in this JIRA ticket will handle hardware topology detection, > track container-to-core utilization in a histogram, and use a mathematical > optimization technique to select cores for container assignment based on > latency and the container-to-core utilization histogram. > For GPU tasks, the improvement will prioritize selection of cores based on > latency between the GPU and cores in an effort to minimize copy latency. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess
[ https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276194#comment-15276194 ] Chris commented on MESOS-5342: -- The implementation has been posted to Review Board and successfully passed 'make check'. It requires the hwloc library to perform machine hardware topology discovery and CPU binding, so updates were made to configure.ac and Makefile.am. The implementation is a new "device" called "hwloc" under the cgroups isolator directory.

The implementation detects the topology and computes the total number of cores required by the container (it also checks whether the container requires a GPU). If the container requires a GPU, the topology information is used to find the "closest" cores based on latency. If the container requires only CPU, a histogram of task assignments to cores is checked. If the histogram is "empty" (all cores have a value of 1.0), a random core is selected and the latency matrix is used to find the cores that are "closest" to that core. The histogram is then updated. If the histogram is "not empty", a greedy submodular subset selection algorithm is used to select N cores using the latency matrix and a "per-core" cost value. The "per-core" cost value is a normalized version of the histogram divided by the number of processing units available on each core. Greedy submodular subset selection algorithms use the "diminishing returns" property to find a near-optimal subset of items under a knapsack constraint.

When the list of cores is returned, a bit vector representing a cpuset is bound to the container's pid_t. When the container is cleaned up, the histogram is updated by reducing the current task count on each core assigned to the pid_t by 1.0.
> CPU pinning/binding support for CgroupsCpushareIsolatorProcess > -- > > Key: MESOS-5342 > URL: https://issues.apache.org/jira/browse/MESOS-5342 > Project: Mesos > Issue Type: Improvement > Components: cgroups, containerization >Affects Versions: 0.28.1 >Reporter: Chris > > The cgroups isolator currently lacks support for binding (also called > pinning) containers to a set of cores. The GNU/Linux kernel is known to make > sub-optimal core assignments for processes and threads. Poor assignments > impact program performance, particularly in the case of applications > requiring GPU resources. > Most cluster management systems from the HPC community (SLURM) provide both > cgroup isolation and cpu binding. This feature would provide similar > capabilities. The current interest in supporting Intel's Cache Allocation > Technology will require making choices about where container's are going to > run on the mesos-agent's processor(s) - this feature is a step toward > developing a robust solution. > The improvement in this JIRA ticket will handle hardware topology detection, > track container-to-core utilization in a histogram, and use a mathematical > optimization technique to select cores for container assignment based on > latency and the container-to-core utilization histogram. > For GPU tasks, the improvement will prioritize selection of cores based on > latency between the GPU and cores in an effort to minimize copy latency. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess
[ https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris updated MESOS-5342: - Description: The cgroups isolator currently lacks support for binding (also called pinning) containers to a set of cores. The GNU/Linux kernel is known to make sub-optimal core assignments for processes and threads. Poor assignments impact program performance, particularly in the case of applications requiring GPU resources. Most cluster management systems from the HPC community (SLURM) provide both cgroup isolation and cpu binding. This feature would provide similar capabilities. The current interest in supporting Intel's Cache Allocation Technology will require making choices about where container's are going to run on the mesos-agent's processor(s) - this feature is a step toward developing a robust solution. The improvement in this JIRA ticket will handle hardware topology detection, track container-to-core utilization in a histogram, and use a mathematical optimization technique to select cores for container assignment based on latency and the container-to-core utilization histogram. For GPU tasks, the improvement will prioritize selection of cores based on latency between the GPU and cores in an effort to minimize copy latency. was: The cgroups isolator currently lacks support for binding (also called pinning) containers to a set of cores. The GNU/Linux kernel is known to make sub-optimal core assignments for processes and threads. Poor assignments impact program performance; particularly in the case of applications requiring GPU resources. Most cluster management systems from the HPC community (SLURM) provide both cgroup isolation and cpu binding. This feature would provide similar capabilities. 
The current interest in supporting Intel's Cache Allocation Technology will require making choices about where container's are going to run on the mesos-agent's processor(s) - this feature is a step toward developing a robust solution. The improvement in this JIRA ticket will handle hardware topology detection, track container-to-core utilization in a histogram, and use a mathematical optimization technique to select cores for container assignment based on latency and the container-to-core utilization histogram. For GPU tasks, the improvement will prioritize selection of cores based on latency between the GPU and cores in an effort to minimize copy latency. > CPU pinning/binding support for CgroupsCpushareIsolatorProcess > -- > > Key: MESOS-5342 > URL: https://issues.apache.org/jira/browse/MESOS-5342 > Project: Mesos > Issue Type: Improvement > Components: cgroups, containerization >Affects Versions: 0.28.1 >Reporter: Chris > > The cgroups isolator currently lacks support for binding (also called > pinning) containers to a set of cores. The GNU/Linux kernel is known to make > sub-optimal core assignments for processes and threads. Poor assignments > impact program performance, particularly in the case of applications > requiring GPU resources. > Most cluster management systems from the HPC community (SLURM) provide both > cgroup isolation and cpu binding. This feature would provide similar > capabilities. The current interest in supporting Intel's Cache Allocation > Technology will require making choices about where container's are going to > run on the mesos-agent's processor(s) - this feature is a step toward > developing a robust solution. > The improvement in this JIRA ticket will handle hardware topology detection, > track container-to-core utilization in a histogram, and use a mathematical > optimization technique to select cores for container assignment based on > latency and the container-to-core utilization histogram. 
> For GPU tasks, the improvement will prioritize selection of cores based on > latency between the GPU and cores in an effort to minimize copy latency. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess
[ https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15275892#comment-15275892 ] Chris commented on MESOS-5342: -- I've implemented code to support this particular feature and need to submit it for review. > CPU pinning/binding support for CgroupsCpushareIsolatorProcess > -- > > Key: MESOS-5342 > URL: https://issues.apache.org/jira/browse/MESOS-5342 > Project: Mesos > Issue Type: Improvement > Components: cgroups, containerization >Affects Versions: 0.28.1 >Reporter: Chris > > The cgroups isolator currently lacks support for binding (also called > pinning) containers to a set of cores. The GNU/Linux kernel is known to make > sub-optimal core assignments for processes and threads. Poor assignments > impact program performance; particularly in the case of applications > requiring GPU resources. > Most cluster management systems from the HPC community (SLURM) provide both > cgroup isolation and cpu binding. This feature would provide similar > capabilities. The current interest in supporting Intel's Cache Allocation > Technology will require making choices about where container's are going to > run on the mesos-agent's processor(s) - this feature is a step toward > developing a robust solution. > The improvement in this JIRA ticket will handle hardware topology detection, > track container-to-core utilization in a histogram, and use a mathematical > optimization technique to select cores for container assignment based on > latency and the container-to-core utilization histogram. > For GPU tasks, the improvement will prioritize selection of cores based on > latency between the GPU and cores in an effort to minimize copy latency. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess
Chris created MESOS-5342: Summary: CPU pinning/binding support for CgroupsCpushareIsolatorProcess Key: MESOS-5342 URL: https://issues.apache.org/jira/browse/MESOS-5342 Project: Mesos Issue Type: Improvement Components: cgroups, containerization Affects Versions: 0.28.1 Reporter: Chris The cgroups isolator currently lacks support for binding (also called pinning) containers to a set of cores. The GNU/Linux kernel is known to make sub-optimal core assignments for processes and threads. Poor assignments impact program performance; particularly in the case of applications requiring GPU resources. Most cluster management systems from the HPC community (SLURM) provide both cgroup isolation and cpu binding. This feature would provide similar capabilities. The current interest in supporting Intel's Cache Allocation Technology will require making choices about where container's are going to run on the mesos-agent's processor(s) - this feature is a step toward developing a robust solution. The improvement in this JIRA ticket will handle hardware topology detection, track container-to-core utilization in a histogram, and use a mathematical optimization technique to select cores for container assignment based on latency and the container-to-core utilization histogram. For GPU tasks, the improvement will prioritize selection of cores based on latency between the GPU and cores in an effort to minimize copy latency. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4027) Improve task-node affinity
[ https://issues.apache.org/jira/browse/MESOS-4027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15064705#comment-15064705 ] Chris commented on MESOS-4027: -- Great considerations - really like the idea of public/private label fields - what would be a starting point for implementation? Adding security flags to labels? > Improve task-node affinity > -- > > Key: MESOS-4027 > URL: https://issues.apache.org/jira/browse/MESOS-4027 > Project: Mesos > Issue Type: Wish > Components: allocation, general >Reporter: Chris >Priority: Trivial > > Improve task-to-node affinity and anti-affinity (running hadoop or spark jobs > on a node currently running hdfs or to avoid running Ceph on HDFS nodes). > Provide a user-mutable Attribute in TaskInfo (the Attribute is modified by a > Framework Scheduler) that can describe what a Task is running. > The Attribute would propagate to a Task at execution. The Attribute is > passed to Framework Schedulers as part of an Offer's Attributes list. > A Framework Scheduler could then filter out or accept Offers from Nodes that > are currently labeled with a desired set or individual Attribute. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4027) Improve task-node affinity
[ https://issues.apache.org/jira/browse/MESOS-4027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15052902#comment-15052902 ] Chris commented on MESOS-4027: -- [~qianzhang] just noticed that - really should have spent more time reviewing the documentation in the protobuf file. I'd like to see the Task's labels field get propagated into an Offer so customized schedulers can have some view into what is running on a Mesos Node (e.g., scheduler logic may filter out Offers associated with Nodes involved in Spark processing, or schedulers may want to filter out Offers that are not associated with HDFS). Figured out a way to do this without adding more fields to the protobuf IDL for TaskInfo. Thanks for pointing this out!!! > Improve task-node affinity > -- > > Key: MESOS-4027 > URL: https://issues.apache.org/jira/browse/MESOS-4027 > Project: Mesos > Issue Type: Wish > Components: allocation, general >Reporter: Chris >Priority: Trivial > > Improve task-to-node affinity and anti-affinity (running hadoop or spark jobs > on a node currently running hdfs or to avoid running Ceph on HDFS nodes). > Provide a user-mutable Attribute in TaskInfo (the Attribute is modified by a > Framework Scheduler) that can describe what a Task is running. > The Attribute would propagate to a Task at execution. The Attribute is > passed to Framework Schedulers as part of an Offer's Attributes list. > A Framework Scheduler could then filter out or accept Offers from Nodes that > are currently labeled with a desired set or individual Attribute. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-4027) Improve task-node affinity
[ https://issues.apache.org/jira/browse/MESOS-4027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris updated MESOS-4027: - Summary: Improve task-node affinity (was: Improve task-node affinit) > Improve task-node affinity > -- > > Key: MESOS-4027 > URL: https://issues.apache.org/jira/browse/MESOS-4027 > Project: Mesos > Issue Type: Wish > Components: allocation, general >Reporter: Chris >Priority: Trivial > > Improve task-to-node affinity and anti-affinity (running hadoop or spark jobs > on a node currently running hdfs or to avoid running Ceph on HDFS nodes). > Provide a user-mutable Attribute in TaskInfo (the Attribute is modified by a > Framework Scheduler) that can describe what a Task is running. > The Attribute would propagate to a Task at execution. The Attribute is > passed to Framework Schedulers as part of an Offer's Attributes list. > A Framework Scheduler could then filter out or accept Offers from Nodes that > are currently labeled with a desired set or individual Attribute. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-4027) Improve task-node affinit
Chris created MESOS-4027: Summary: Improve task-node affinit Key: MESOS-4027 URL: https://issues.apache.org/jira/browse/MESOS-4027 Project: Mesos Issue Type: Wish Components: allocation, general Reporter: Chris Priority: Trivial Improve task-to-node affinity and anti-affinity (running hadoop or spark jobs on a node currently running hdfs or to avoid running Ceph on HDFS nodes). Provide a user-mutable Attribute in TaskInfo (the Attribute is modified by a Framework Scheduler) that can describe what a Task is running. The Attribute would propagate to a Task at execution. The Attribute is passed to Framework Schedulers as part of an Offer's Attributes list. A Framework Scheduler could then filter out or accept Offers from Nodes that are currently labeled with a desired set or individual Attribute. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
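The offer filtering described in the ticket above could look something like the sketch below. It is a minimal illustration only: offers are modelled as plain dictionaries with a hypothetical `"labels"` key holding what is running on the node, and `filter_offers`, `require` and `avoid` are made-up names. None of this is the Mesos scheduler API or the protobuf layout.

```python
from typing import Dict, FrozenSet, Iterable, List

# Hypothetical stand-in for an Offer: {"node": <name>, "labels": [<label>, ...]}
Offer = Dict[str, object]


def filter_offers(
    offers: Iterable[Offer],
    require: FrozenSet[str] = frozenset(),
    avoid: FrozenSet[str] = frozenset(),
) -> List[Offer]:
    """Keep offers whose node carries every `require` label and no `avoid` label.

    `require` expresses affinity (e.g. run Spark where HDFS already runs);
    `avoid` expresses anti-affinity (e.g. keep Ceph off HDFS nodes).
    """
    accepted = []
    for offer in offers:
        labels = set(offer.get("labels", ()))
        if require <= labels and not (avoid & labels):
            accepted.append(offer)
    return accepted
```

For example, given offers from node `a` (labelled `hdfs`), node `b` (labelled `ceph`) and an unlabelled node `c`, `filter_offers(offers, require=frozenset({"hdfs"}))` keeps only `a`, while `filter_offers(offers, avoid=frozenset({"hdfs"}))` keeps `b` and `c`.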
[jira] [Comment Edited] (MESOS-4027) Improve task-node affinity
[ https://issues.apache.org/jira/browse/MESOS-4027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15032577#comment-15032577 ] Chris edited comment on MESOS-4027 at 11/30/15 10:04 PM: - I've developed a patch in support of this ticket (it's been pushed to my github mesos fork). The patch requires testing (among other things!) before submitting for review. was (Author: ct.clmsn): I've developed a patch in support of this ticket (it's been pushed to my github mesos fork). The patch requires testing (among other things - ie: a Shepard) before submitting for review. > Improve task-node affinity > -- > > Key: MESOS-4027 > URL: https://issues.apache.org/jira/browse/MESOS-4027 > Project: Mesos > Issue Type: Wish > Components: allocation, general >Reporter: Chris >Priority: Trivial > > Improve task-to-node affinity and anti-affinity (running hadoop or spark jobs > on a node currently running hdfs or to avoid running Ceph on HDFS nodes). > Provide a user-mutable Attribute in TaskInfo (the Attribute is modified by a > Framework Scheduler) that can describe what a Task is running. > The Attribute would propagate to a Task at execution. The Attribute is > passed to Framework Schedulers as part of an Offer's Attributes list. > A Framework Scheduler could then filter out or accept Offers from Nodes that > are currently labeled with a desired set or individual Attribute. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4027) Improve task-node affinity
[ https://issues.apache.org/jira/browse/MESOS-4027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15032577#comment-15032577 ] Chris commented on MESOS-4027: -- I've developed a patch for in support of this ticket. The patch requires testing (among other things - ie: a Shepard) before submitting for review. > Improve task-node affinity > -- > > Key: MESOS-4027 > URL: https://issues.apache.org/jira/browse/MESOS-4027 > Project: Mesos > Issue Type: Wish > Components: allocation, general >Reporter: Chris >Priority: Trivial > > Improve task-to-node affinity and anti-affinity (running hadoop or spark jobs > on a node currently running hdfs or to avoid running Ceph on HDFS nodes). > Provide a user-mutable Attribute in TaskInfo (the Attribute is modified by a > Framework Scheduler) that can describe what a Task is running. > The Attribute would propagate to a Task at execution. The Attribute is > passed to Framework Schedulers as part of an Offer's Attributes list. > A Framework Scheduler could then filter out or accept Offers from Nodes that > are currently labeled with a desired set or individual Attribute. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-4027) Improve task-node affinity
[ https://issues.apache.org/jira/browse/MESOS-4027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15032577#comment-15032577 ] Chris edited comment on MESOS-4027 at 11/30/15 10:04 PM: - I've developed a patch in support of this ticket (it's been pushed to my github mesos fork). The patch requires testing (among other things - ie: a Shepard) before submitting for review. was (Author: ct.clmsn): I've developed a patch for in support of this ticket. The patch requires testing (among other things - ie: a Shepard) before submitting for review. > Improve task-node affinity > -- > > Key: MESOS-4027 > URL: https://issues.apache.org/jira/browse/MESOS-4027 > Project: Mesos > Issue Type: Wish > Components: allocation, general >Reporter: Chris >Priority: Trivial > > Improve task-to-node affinity and anti-affinity (running hadoop or spark jobs > on a node currently running hdfs or to avoid running Ceph on HDFS nodes). > Provide a user-mutable Attribute in TaskInfo (the Attribute is modified by a > Framework Scheduler) that can describe what a Task is running. > The Attribute would propagate to a Task at execution. The Attribute is > passed to Framework Schedulers as part of an Offer's Attributes list. > A Framework Scheduler could then filter out or accept Offers from Nodes that > are currently labeled with a desired set or individual Attribute. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-3059) Allow http endpoint to dynamically change the slave attributes
[ https://issues.apache.org/jira/browse/MESOS-3059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15022795#comment-15022795 ] Chris edited comment on MESOS-3059 at 11/23/15 7:19 PM: Has there been any recent work on this capability? was (Author: ct.clmsn): Has there been any work on this capability? > Allow http endpoint to dynamically change the slave attributes > -- > > Key: MESOS-3059 > URL: https://issues.apache.org/jira/browse/MESOS-3059 > Project: Mesos > Issue Type: Wish >Reporter: Nitin > > It is well understood that changing the attributes dynamically is not > safe without a restart, because the slave itself may not know which old framework > tasks running on it depended on the previous attributes. > However, a total restart causes a lot of other history to be deleted. We need to > allow dynamic attribute changes with a soft restart. > It would be good to expose a REST endpoint, either at the slave or at the mesos-master, > which directly changes the state in ZooKeeper. > USE-CASE > We use slave attributes/roles to direct each framework's scheduling to > specific slaves as per its requirements. The Mesos scheduler only creates > offers on the basis of some resources. > In our use case, we have some categorization of our Spark frameworks, or jobs > with a framework (like Marathon), based on multiple factors. We want jobs or > frameworks belonging to one category to run in their specific cluster > of resources. We want to dynamically manage the slaves into these logical > sub-clusters. > Since the number of jobs that will be submitted, and when they will be submitted, is > very dynamic, it makes sense to be able to dynamically assign roles or > attributes to slaves. It is not possible to gauge the requirements at the time of > cluster provisioning. Static role or attribute assignment leads to > sub-optimal use of the cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-3059) Allow http endpoint to dynamically change the slave attributes
[ https://issues.apache.org/jira/browse/MESOS-3059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15022795#comment-15022795 ] Chris commented on MESOS-3059: -- Has there been any work on this capability? > Allow http endpoint to dynamically change the slave attributes > -- > > Key: MESOS-3059 > URL: https://issues.apache.org/jira/browse/MESOS-3059 > Project: Mesos > Issue Type: Wish >Reporter: Nitin > > It is well understood that changing the attributes dynamically is not > safe without a restart, because the slave itself may not know which old framework > tasks running on it depended on the previous attributes. > However, a total restart causes a lot of other history to be deleted. We need to > allow dynamic attribute changes with a soft restart. > It would be good to expose a REST endpoint, either at the slave or at the mesos-master, > which directly changes the state in ZooKeeper. > USE-CASE > We use slave attributes/roles to direct each framework's scheduling to > specific slaves as per its requirements. The Mesos scheduler only creates > offers on the basis of some resources. > In our use case, we have some categorization of our Spark frameworks, or jobs > with a framework (like Marathon), based on multiple factors. We want jobs or > frameworks belonging to one category to run in their specific cluster > of resources. We want to dynamically manage the slaves into these logical > sub-clusters. > Since the number of jobs that will be submitted, and when they will be submitted, is > very dynamic, it makes sense to be able to dynamically assign roles or > attributes to slaves. It is not possible to gauge the requirements at the time of > cluster provisioning. Static role or attribute assignment leads to > sub-optimal use of the cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)