On 09/18/2009 03:43 AM, Mel Gorman wrote:
On Thu, Sep 17, 2009 at 04:59:15PM -0400, Jarod Wilson wrote:
The attached python script has been used successfully on Red Hat
Enterprise Linux 5, Fedora 11 and Fedora 12, and is likely to work for
other distros (though possibly with some minor tweaking required).
...
My preference if possible would be to integrate as much as possible into
hugeadm and have this script converted to using hugeadm where
appropriate.
So based on earlier feedback from David Gibson, I rewrote it slightly to
make more use of hugeadm and pagesize, but I've talked it over w/my
manager, and have the approval to go ahead and work on integrating as
much as possible of what this script does into hugeadm itself.
Great stuff.
Attaching a full diff that implements the bulk of the things that aren't
terribly hard to add to hugeadm itself. Semi-sanely broken out patches
available here:
http://people.redhat.com/jwilson/misc/hugeadm-enhancements/
I think creating groups might be beyond the scope of hugeadm. This is
possibly the most distro-specific part of the entire script so I'd be a
little more wary of integrating it.
Agreed. Creating users and groups definitely doesn't belong in hugeadm.
My thought is that anything not belonging in there can still reside in
an updated version of this script which does everything specific to huge
pages using hugeadm. It'd be much more of a wrapper to hugeadm and
{user,group}{add,mod} -- and possibly sysctl.
Haven't yet rewritten the script, but it should only have to wrap
hugeadm and the user/group add/mod bits now, I think...
Perhaps there is some scope for libhugetlbfs installing silently the first
time and having a forced reinstallation present some configuration options,
such as creating a group and adding users as this script does?
I'm inclined to say no. At least in the RHEL world, people will
primarily be installing via packages, and anything interactive at
install/uninstall/reinstall/etc is pretty much no-go. Now, we *could*
theoretically have the package create something like a hugepage group at
install time, and even set hugetlb_shm_group, but not in a persistent
way (at least not w/o munging /etc/sysctl.conf directly from the package
install scriptlet, which would also probably be frowned upon). I'm
inclined to leave all of this to the user to configure after
installation -- though with the possible aid of said script, once
rewritten...
hmm.... can't decide on this one. Not sure whether hugeadm should know how to
make settings persist or if it should be recommended that hugeadm invocations
be put into an rc script.
Yeah, having hugeadm write to sysctl.conf doesn't sound like the best
idea to me either. What about having hugeadm simply inform the user what
sysctl settings they would need to add to have the settings persist?
That makes sense. It could be suggested by --explain which I just
noticed has no manual page entry. I should fix that.
I've added a bit to --set-recommended-shmmax and --set-shm-group to spit
out a warning "add foo to /etc/sysctl.conf to make these settings
persist" for now. Didn't add anything to --explain though.
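For reference, here is a minimal stand-alone sketch of the value that --set-recommended-shmmax computes in the attached diff: the larger of half of MemTotal and MemTotal minus 2 GB, converted from kB to bytes. The function name and the use of long long are assumptions made for illustration; the patch itself reads MemTotal through read_meminfo() and works in long, which may be too narrow for the byte value on 32-bit systems.

```c
#include <assert.h>

/*
 * Hypothetical helper mirroring the logic of get_recommended_shmmax()
 * in the patch. Input is MemTotal in kB, result is in bytes.
 */
long long recommended_shmmax_bytes(long long mem_total_kb)
{
	const long long two_gb_kb = 2LL * 1024 * 1024;	/* 2 GB expressed in kB */
	long long half_of_mem = mem_total_kb / 2;
	long long mem_less_2g = mem_total_kb - two_gb_kb;
	long long recommended_kb;

	assert(mem_total_kb >= 0);

	/* Pick whichever leaves the larger shared memory limit */
	recommended_kb = half_of_mem >= mem_less_2g ? half_of_mem : mem_less_2g;

	/* shmmax is expressed in bytes */
	return recommended_kb * 1024;
}
```

On a 1 GB machine this yields half of memory (512 MB), at 4 GB the two candidates are equal, and above 4 GB "MemTotal minus 2 GB" wins.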
So in the limits.conf case, it's a stand-alone file in
/etc/security/limits.d/, so maybe it's okay to scribble on this file?
Certainly less contentious than munging sysctl.conf anyway.
I'd view them as being very similar. I think we should be able to
persist all settings or none at all. Maybe that's just me though.
Ideally, yeah, all or none... But it's a bit murkier, if in one case
we're editing a system-wide file, vs. editing a file that could be part
of the libhugetlbfs distribution itself. For example,
/etc/security/limits.d/hugetlbfs.conf could be a file created by the
libhugetlbfs rpm on RHEL, in which case, we're definitely free and clear
to do with it as we please. But /etc/sysctl.conf isn't "ours". Bleah. We
need an /etc/sysctl.d/hugetlbfs.conf. :)
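As a rough illustration of what such a drop-in file would hold, here is a sketch that formats the two settings discussed in this thread. It assumes a sysctl setup that actually reads fragments from /etc/sysctl.d/, which is not a given on every distro; format_hugetlb_sysctl() is a hypothetical helper and not part of hugeadm.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: render the two sysctl lines into buf.
 * Returns what snprintf() returns, i.e. the length that was (or would
 * have been) written, not counting the terminating NUL.
 */
int format_hugetlb_sysctl(char *buf, size_t len,
			  unsigned long long shmmax_bytes, unsigned int gid)
{
	assert(buf != NULL && len > 0);

	return snprintf(buf, len,
			"kernel.shmmax = %llu\n"
			"vm.hugetlb_shm_group = %u\n",
			shmmax_bytes, gid);
}
```

A caller would compare the return value against the buffer size to detect truncation before writing the text anywhere.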
So hopefully, I've not butchered anything *too* badly...
--
Jarod Wilson
ja...@redhat.com
hugeadm.c | 201 ++++++++++++++++++++++++++++++++++++++++++++++++++++++---
man/hugeadm.8 | 40 +++++++++---
2 files changed, 222 insertions(+), 19 deletions(-)
diff --git a/hugeadm.c b/hugeadm.c
index a793267..fbaebfd 100644
--- a/hugeadm.c
+++ b/hugeadm.c
@@ -67,12 +67,15 @@ extern char *optarg;
#define PROCMOUNTS "/proc/mounts"
#define PROCHUGEPAGES_MOVABLE "/proc/sys/vm/hugepages_treat_as_movable"
#define PROCMINFREEKBYTES "/proc/sys/vm/min_free_kbytes"
+#define PROCHUGETLBGROUP "/proc/sys/vm/hugetlb_shm_group"
+#define PROCSHMMAX "/proc/sys/kernel/shmmax"
#define PROCZONEINFO "/proc/zoneinfo"
#define FS_NAME "hugetlbfs"
#define MIN_COL 20
#define MAX_SIZE_MNTENT (64 + PATH_MAX + 32 + 128 + 2 * sizeof(int))
#define FORMAT_LEN 20
+#define MEM_TOTAL "MemTotal:"
#define SWAP_FREE "SwapFree:"
#define SWAP_TOTAL "SwapTotal:"
@@ -86,13 +89,17 @@ void print_usage()
OPTION("--hard", "specified with --pool-pages-min to make");
CONT("multiple attempts at adjusting the pool size to the");
CONT("specified count on failure");
- OPTION("--pool-pages-min <size>:[+|-]<count>", "");
+ OPTION("--pool-pages-min <size|DEFAULT>:[+|-]<pagecount|memsize<G|M|K>>", "");
CONT("Adjust pool 'size' lower bound");
- OPTION("--pool-pages-max <size>:[+|-]<count>", "");
+ OPTION("--pool-pages-max <size|DEFAULT>:[+|-]<pagecount|memsize<G|M|K>>", "");
CONT("Adjust pool 'size' upper bound");
OPTION("--set-recommended-min_free_kbytes", "");
CONT("Sets min_free_kbytes to a recommended value to improve availability of");
CONT("huge pages at runtime");
+ OPTION("--set-recommended-shmmax", "Sets shmmax to a recommended value to");
+ CONT("maximise the size possible for shared memory pools");
+ OPTION("--set-shm-group", "Sets hugetlb_shm_group to a user-specified group,");
+ CONT("which has permission to use hugetlb shared memory pools");
OPTION("--add-temp-swap[=count]", "Specified with --pool-pages-min to create");
CONT("temporary swap space for the duration of the pool resize. Default swap");
CONT("size is 5 huge pages. Optional arg sets size to 'count' huge pages");
@@ -135,6 +142,8 @@ int opt_dry_run = 0;
int opt_hard = 0;
int opt_movable = -1;
int opt_set_recommended_minfreekbytes = 0;
+int opt_set_recommended_shmmax = 0;
+int opt_set_hugetlb_shm_group = 0;
int opt_temp_swap = 0;
int opt_ramdisk_swap = 0;
int opt_swap_persist = 0;
@@ -215,6 +224,8 @@ void verbose_expose(void)
#define LONG_POOL_MAX_ADJ (LONG_POOL|'M')
#define LONG_SET_RECOMMENDED_MINFREEKBYTES ('k' << 8)
+#define LONG_SET_RECOMMENDED_SHMMAX ('x' << 8)
+#define LONG_SET_HUGETLB_SHM_GROUP ('R' << 8)
#define LONG_MOVABLE ('z' << 8)
#define LONG_MOVABLE_ENABLE (LONG_MOVABLE|'e')
@@ -589,6 +600,19 @@ void create_mounts(char *user, char *group, char *base, mode_t mode)
}
/**
+ * show_mem shouldn't change the behavior of any of its
+ * callers, it only prints a message to the user showing the
+ * total amount of memory in the system (in megabytes).
+ */
+void show_mem()
+{
+ long mem_total;
+
+ mem_total = read_meminfo(MEM_TOTAL);
+ printf("Total System Memory: %ld MB\n\n", mem_total / 1024);
+}
+
+/**
* check_swap shouldn't change the behavior of any of its
* callers, it only prints a message to the user if something
* is being done that might fail without swap available. i.e.
@@ -668,12 +692,108 @@ void check_minfreekbytes(void)
/* There should be at least one pageblock free per zone in the system */
if (recommended_min > min_free_kbytes) {
printf("\n");
- printf("The " PROCMINFREEKBYTES " of %ld is too small. To maximiuse efficiency\n", min_free_kbytes);
+ printf("The " PROCMINFREEKBYTES " of %ld is too small. To maximise efficiency\n", min_free_kbytes);
printf("of fragmentation avoidance, there should be at least one huge page free per zone\n");
printf("in the system which minimally requires a min_free_kbytes value of %ld\n", recommended_min);
}
}
+long get_recommended_shmmax(void)
+{
+ long mem_total;
+ long half_of_mem;
+ long mem_less_2g;
+ long recommended_shmmax;
+
+ /* in kB */
+ mem_total = read_meminfo(MEM_TOTAL);
+ half_of_mem = mem_total / 2;
+ mem_less_2g = mem_total - (2 * 1024 * 1024);
+
+ if (half_of_mem >= mem_less_2g)
+ recommended_shmmax = half_of_mem;
+ else
+ recommended_shmmax = mem_less_2g;
+
+ /* need it in bytes */
+ return recommended_shmmax * 1024;
+}
+
+void set_recommended_shmmax(void)
+{
+ int ret;
+ long recommended_shmmax = get_recommended_shmmax();
+
+ DEBUG("Setting shmmax to %ld\n", recommended_shmmax);
+ ret = file_write_ulong(PROCSHMMAX, (unsigned long)recommended_shmmax);
+
+ if (!ret) {
+ printf("To make shmmax settings persistent, add the following line to /etc/sysctl.conf:\n");
+ printf(" kernel.shmmax = %ld\n", recommended_shmmax);
+ }
+}
+
+void check_shmmax(void)
+{
+ long current_shmmax = file_read_ulong(PROCSHMMAX, NULL);
+ long recommended_shmmax = get_recommended_shmmax();
+
+ /* 32 MB is typically the system default */
+ if (current_shmmax <= 32 * 1024 * 1024) {
+ printf("\n");
+ printf("A " PROCSHMMAX " value of %ld bytes may be too small. To maximise\n", current_shmmax);
+ printf("shared memory usage, this should be set to the size of the largest heap size you\n");
+ printf("want to be able to use. Alternatively, set it to a size that ensures enough\n");
+ printf("system memory remains for other tasks (%ld bytes recommended).\n", recommended_shmmax);
+ printf("This can be done automatically, using the --set-recommended-shmmax option.\n");
+ }
+}
+
+void set_hugetlb_shm_group(gid_t gid, char *group)
+{
+ int ret;
+
+ DEBUG("Setting hugetlb_shm_group to %d (%s)\n", gid, group);
+ ret = file_write_ulong(PROCHUGETLBGROUP, (unsigned long)gid);
+
+ if (!ret) {
+ printf("To make hugetlb_shm_group settings persistent, add the following line to /etc/sysctl.conf:\n");
+ printf(" vm.hugetlb_shm_group = %d\n", gid);
+ }
+}
+
+/* heisted from shadow-utils/libmisc/list.c::is_on_list() */
+static int user_in_group(char *const *list, const char *member)
+{
+ while (*list != NULL) {
+ if (strcmp(*list, member) == 0) {
+ return 1;
+ }
+ list++;
+ }
+
+ return 0;
+}
+
+void check_user(void)
+{
+ uid_t uid;
+ gid_t gid;
+ struct passwd *pwd;
+ struct group *grp;
+
+ gid = (gid_t)file_read_ulong(PROCHUGETLBGROUP, NULL);
+ grp = getgrgid(gid);
+
+ uid = getuid();
+ pwd = getpwuid(uid);
+
+ if (grp && pwd && uid != 0 && !user_in_group(grp->gr_mem, pwd->pw_name)) {
+ printf("\n");
+ WARNING("User %s (uid: %d) is not a member of the hugetlb_shm_group %s (gid: %d)!\n", pwd->pw_name, uid, grp->gr_name, gid);
+ }
+}
+
void add_temp_swap(long page_size)
{
char path[PATH_MAX];
@@ -828,18 +948,37 @@ enum {
POOL_BOTH,
};
-static long value_adjust(char *adjust_str, long base)
+static long value_adjust(char *adjust_str, long base, long page_size)
{
long adjust;
char *iter;
/* Convert and validate the adjust. */
+ errno = 0;
adjust = strtol(adjust_str, &iter, 0);
- if (*iter) {
+ /* Catch strtol errors and sizes that overflow the native word size */
+ if (errno || adjust_str == iter) {
+ if (errno == ERANGE)
+ errno = EOVERFLOW;
+ else
+ errno = EINVAL;
ERROR("%s: invalid adjustment\n", adjust_str);
exit(EXIT_FAILURE);
}
+ switch (*iter) {
+ case 'G':
+ case 'g':
+ adjust = size_to_smaller_unit(adjust); /* fall through to 'M' */
+ case 'M':
+ case 'm':
+ adjust = size_to_smaller_unit(adjust); /* fall through to 'K' */
+ case 'K':
+ case 'k':
+ adjust = size_to_smaller_unit(adjust);
+ adjust = adjust / page_size;
+ }
+
if (adjust_str[0] != '+' && adjust_str[0] != '-')
base = 0;
@@ -852,6 +991,8 @@ static long value_adjust(char *adjust_str, long base)
}
base += adjust;
+ INFO("Returning page count of %ld\n", base);
+
return base;
}
@@ -885,7 +1026,12 @@ void pool_adjust(char *cmd, unsigned int counter)
page_size_str, adjust_str, counter);
/* Convert and validate the page_size. */
- page_size = parse_page_size(page_size_str);
+ if (strcmp(page_size_str, "DEFAULT") == 0)
+ page_size = kernel_default_hugepage_size();
+ else
+ page_size = parse_page_size(page_size_str);
+
+ INFO("Working with page_size of %ld\n", page_size);
cnt = hpool_sizes(pools, MAX_POOLS);
if (cnt < 0) {
@@ -905,14 +1051,14 @@ void pool_adjust(char *cmd, unsigned int counter)
max = pools[pos].maximum;
if (counter == POOL_BOTH) {
- min = value_adjust(adjust_str, min);
+ min = value_adjust(adjust_str, min, page_size);
max = min;
} else if (counter == POOL_MIN) {
- min = value_adjust(adjust_str, min);
+ min = value_adjust(adjust_str, min, page_size);
if (min > max)
max = min;
} else {
- max = value_adjust(adjust_str, max);
+ max = value_adjust(adjust_str, max, page_size);
if (max < min)
min = max;
}
@@ -1003,13 +1149,16 @@ void page_sizes(int all)
void explain()
{
+ show_mem();
mounts_list_all();
printf("\nHuge page pools:\n");
pool_list();
printf("\nHuge page sizes with configured pools:\n");
page_sizes(0);
check_minfreekbytes();
+ check_shmmax();
check_swap();
+ check_user();
printf("\nNote: Permanent swap space should be preferred when dynamic "
"huge page pools are used.\n");
}
@@ -1027,6 +1176,9 @@ int main(int argc, char** argv)
int opt_global_mounts = 0, opt_pgsizes = 0, opt_pgsizes_all = 0;
int opt_explain = 0, minadj_count = 0, maxadj_count = 0;
int ret = 0, index = 0;
+ gid_t opt_gid = 0;
+ struct group *opt_grp = NULL;
+ int group_invalid = 0;
struct option long_opts[] = {
{"help", no_argument, NULL, 'h'},
{"verbose", required_argument, NULL, 'v' },
@@ -1036,6 +1188,8 @@ int main(int argc, char** argv)
{"pool-pages-min", required_argument, NULL, LONG_POOL_MIN_ADJ},
{"pool-pages-max", required_argument, NULL, LONG_POOL_MAX_ADJ},
{"set-recommended-min_free_kbytes", no_argument, NULL, LONG_SET_RECOMMENDED_MINFREEKBYTES},
+ {"set-recommended-shmmax", no_argument, NULL, LONG_SET_RECOMMENDED_SHMMAX},
+ {"set-shm-group", required_argument, NULL, LONG_SET_HUGETLB_SHM_GROUP},
{"enable-zone-movable", no_argument, NULL, LONG_MOVABLE_ENABLE},
{"disable-zone-movable", no_argument, NULL, LONG_MOVABLE_DISABLE},
{"hard", no_argument, NULL, LONG_HARD},
@@ -1153,6 +1307,29 @@ int main(int argc, char** argv)
opt_set_recommended_minfreekbytes = 1;
break;
+ case LONG_SET_RECOMMENDED_SHMMAX:
+ opt_set_recommended_shmmax = 1;
+ break;
+
+ case LONG_SET_HUGETLB_SHM_GROUP:
+ opt_grp = getgrnam(optarg);
+ if (!opt_grp) {
+ opt_gid = atoi(optarg);
+ if (opt_gid == 0 && strcmp(optarg, "0"))
+ group_invalid = 1;
+ opt_grp = getgrgid(opt_gid);
+ if (!opt_grp)
+ group_invalid = 1;
+ } else {
+ opt_gid = opt_grp->gr_gid;
+ }
+ if (group_invalid) {
+ ERROR("Invalid group specification (%s)\n", optarg);
+ exit(EXIT_FAILURE);
+ }
+ opt_set_hugetlb_shm_group = 1;
+ break;
+
case LONG_MOVABLE_DISABLE:
opt_movable = 0;
break;
@@ -1208,6 +1385,12 @@ int main(int argc, char** argv)
if (opt_set_recommended_minfreekbytes)
set_recommended_minfreekbytes();
+ if (opt_set_recommended_shmmax)
+ set_recommended_shmmax();
+
+ if (opt_set_hugetlb_shm_group)
+ set_hugetlb_shm_group(opt_gid, opt_grp->gr_name);
+
while (--minadj_count >= 0) {
if (! kernel_has_overcommit())
pool_adjust(opt_min_adj[minadj_count], POOL_BOTH);
diff --git a/man/hugeadm.8 b/man/hugeadm.8
index 6342980..d3a2582 100644
--- a/man/hugeadm.8
+++ b/man/hugeadm.8
@@ -2,7 +2,7 @@
.\" First parameter, NAME, should be all caps
.\" Second parameter, SECTION, should be 1-8, maybe w/ subsection
.\" other parameters are allowed: see man(7), man(1)
-.TH HUGEADM 8 "October 10, 2008"
+.TH HUGEADM 8 "September 30, 2009"
.\" Please adjust this date whenever revising the manpage.
.\"
.\" Some roff macros, for reference:
@@ -87,6 +87,21 @@ avoiding mixing is to increase /proc/sys/vm/min_free_kbytes. This parameter
sets min_free_kbytes to a recommended value to aid fragmentation avoidance.
.TP
+.B --set-recommended-shmmax
+
+The maximum shared memory segment size should be set to at least the size
+of the largest shared memory segment size you want available for applications
+using huge pages, via /proc/sys/kernel/shmmax. Optionally, it can be set to
+what should be a sufficiently large value automatically, using this switch.
+
+.TP
+.B --set-shm-group <gid|groupname>
+
+Users in the group specified in /proc/sys/vm/hugetlb_shm_group are granted
+full access to huge pages. The sysctl takes a numeric gid, but this hugeadm
+option can set it for you, using either a gid or group name.
+
+.TP
.B --page-sizes
This displays every page size supported by the system and has a pool
@@ -107,25 +122,30 @@ This displays all active mount points for hugetlbfs.
The following options configure the pool.
.TP
-.B --pool-pages-min=<size>:[+|-]<count>
+.B --pool-pages-min=<size|DEFAULT>:[+|-]<pagecount|memsize<G|M|K>>
This option sets or adjusts the Minimum number of hugepages in the pool for
pagesize \fBsize\fP. \fBsize\fP may be specified in bytes or in kilobytes,
-megabytes, or gigabytes by appending K, M, or G respectively. The pool is set
-to \fBcount\fP pages if + or - are not specified. If + or - are specified,
-then the size of the pool will adjust by that amount. Note that there is
-no guarantee that the system can allocate the hugepages requested for the
-Minimum pool. The size of the pools should be checked after executing this
-command to ensure they were successful.
+megabytes, or gigabytes by appending K, M, or G respectively, or as DEFAULT,
+which uses the system's default huge page size for \fBsize\fP. The pool size
+adjustment can be specified by \fBpagecount\fP pages or by \fBmemsize\fP, if
+postfixed with G, M, or K, for gigabytes, megabytes, or kilobytes,
+respectively. If the adjustment is specified via \fBmemsize\fP, then the
+\fBpagecount\fP will be calculated for you, based on page size \fBsize\fP.
+The pool is set to \fBpagecount\fP pages if + or - are not specified. If
++ or - are specified, then the size of the pool will adjust by that amount.
+Note that there is no guarantee that the system can allocate the hugepages
+requested for the Minimum pool. The size of the pools should be checked after
+executing this command to ensure they were successful.
.TP
-.B --pool-pages-max=<size>:[+|-]<count>
+.B --pool-pages-max=<size|DEFAULT>:[+|-]<pagecount|memsize<G|M|K>>
This option sets or adjusts the Maximum number of hugepages. Note that while
the Minimum number of pages are guaranteed to be available to applications,
there is not guarantee that the system can allocate the pages on demand when
the number of huge pages requested by applications is between the Minimum and
-Maximum pool sizes.
+Maximum pool sizes. See --pool-pages-min for usage syntax.
.TP
.B --enable-zone-movable
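
The memsize handling described in the --pool-pages-min man page text of the patch can be sketched in isolation as follows. This is an illustrative stand-alone version, not the patch's value_adjust(): the name memsize_to_pages() is hypothetical, it assumes each unit step is a plain multiplication by 1024, and unlike the patch it rejects unknown suffixes and does not apply the result relative to an existing pool size.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical helper: convert "<pagecount>" or "<memsize><G|M|K>" into a
 * number of huge pages of page_size_bytes each. Returns -1 on parse error.
 */
long long memsize_to_pages(const char *str, long long page_size_bytes)
{
	char *end;
	long long value;

	assert(str != NULL && page_size_bytes > 0);

	errno = 0;
	value = strtoll(str, &end, 0);
	if (errno || end == str)
		return -1;	/* overflow, or no digits at all */

	switch (*end) {
	case 'G':
	case 'g':
		value *= 1024;
		/* fall through */
	case 'M':
	case 'm':
		value *= 1024;
		/* fall through */
	case 'K':
	case 'k':
		value *= 1024;	/* value is now in bytes */
		return value / page_size_bytes;
	case '\0':
		return value;	/* no suffix: already a page count */
	default:
		return -1;	/* unknown suffix */
	}
}
```

With 2 MB huge pages, "1G" works out to 512 pages and "512M" to 256, while a bare "20" is taken as 20 pages.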