Re: [HACKERS] munmap() failure due to sloppy handling of hugepage size
Andres Freund writes:
> On 2016-10-12 16:33:38 -0400, Tom Lane wrote:
>> Also, if you look into /sys then you are going to see multiple
>> possible values and it's not clear how to choose the right one.

> That's a fair point. It'd probably be good to use the largest we can,
> bounded by a percentage of max waste or such. But that's likely
> something for another day.

Yeah. Merlin pointed out that on versions of Linux newer than my RHEL6
box, mmap accepts additional flag bits that let you specify which
hugepage size to use. So we would need to use those if we wanted to
work with anything besides the default size.

Now AFAICT from the documentation I've seen, configuring hugepages is
all still pretty manual, ie the sysadmin has to set up so many huge
pages of each size at or near boot. So I'm thinking that using a
non-default size should be something that happens only if the user
tells us to, ie we'd need to add a GUC saying "use size X". That's
pretty ugly but if the admin is intending PG to use pages from a
certain pool, how else would we ensure that the right thing happens?
And it'd provide a way of overriding our default 2MB guess on non-Linux
platforms.

Anyway, anything involving a new GUC is certainly new-feature,
HEAD-only material. I think though that reading the default hugepage
size out of /proc/meminfo is a back-patchable bug fix.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
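For readers following along, here is a minimal sketch of the /proc/meminfo lookup discussed in this message. It is not the patch that was committed: the function names `parse_hugepagesize_line` and `default_hugepage_size`, the buffer size, and the caller-supplied fallback are all assumptions made for illustration. The only thing taken from the thread is the line format "Hugepagesize: 2048 kB".

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper (not from the PostgreSQL tree): parse one line of
 * /proc/meminfo.  Returns the size in bytes if the line looks like
 * "Hugepagesize:       2048 kB", otherwise 0.
 */
size_t
parse_hugepagesize_line(const char *line)
{
    unsigned long kb;
    char        unit;

    assert(line != NULL);
    if (sscanf(line, "Hugepagesize: %lu %c", &kb, &unit) == 2 && unit == 'k')
        return (size_t) kb * 1024;
    return 0;
}

/*
 * Hypothetical helper: scan a meminfo-style file for the Hugepagesize
 * line.  If the file cannot be opened or has no such line, return the
 * caller's fallback (for example the traditional 2MB guess).
 */
size_t
default_hugepage_size(const char *path, size_t fallback)
{
    FILE       *fp = fopen(path, "r");
    char        buf[128];
    size_t      result = fallback;

    if (fp == NULL)
        return fallback;
    while (fgets(buf, sizeof(buf), fp) != NULL)
    {
        size_t      sz = parse_hugepagesize_line(buf);

        if (sz != 0)
        {
            result = sz;
            break;
        }
    }
    fclose(fp);
    return result;
}
```

A caller on Linux would presumably use something like `default_hugepage_size("/proc/meminfo", 2 * 1024 * 1024)`; on other platforms the fallback is simply returned because the file does not exist.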
Re: [HACKERS] munmap() failure due to sloppy handling of hugepage size
On Wed, Oct 12, 2016 at 5:18 PM, Tom Lane wrote:
> Merlin Moncure writes:
>> ISTM all this silliness is pretty much unique to linux anyways.
>> Instead of reading the filesystem, what about doing test map and test
>> unmap?
>
> And if mmap succeeds and munmap fails, you'll recover how exactly?
>
> If this API were less badly designed, we'd not be having this problem
> in the first place ...

I was thinking to 'guess' in a ^2 loop in the case the obvious unmap
didn't work, finally aborting if no guess worked. :-).

merlin
Re: [HACKERS] munmap() failure due to sloppy handling of hugepage size
On 2016-10-12 16:33:38 -0400, Tom Lane wrote:
> Andres Freund writes:
> > On October 12, 2016 1:25:54 PM PDT, Tom Lane wrote:
> >> A little bit of research suggests that on Linux the thing to do would
> >> be to get the actual default hugepage size by reading /proc/meminfo and
> >> looking for a line like "Hugepagesize: 2048 kB".
>
> > We had that, but Heikki ripped it out when merging... I think you're
> > supposed to use /sys to get the available size.
>
> According to
> https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt
> looking into /proc/meminfo is the longer-standing API and thus is
> likely to work on more kernel versions.

MAP_HUGETLB, which we rely on for hugepage support, is newer than the
introduction of the /sys stuff.

> Also, if you look into /sys then you are going to see multiple
> possible values and it's not clear how to choose the right one.

That's a fair point. It'd probably be good to use the largest we can,
bounded by a percentage of max waste or such. But that's likely
something for another day.

Regards,

Andres
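The "largest we can, bounded by a percentage of max waste" idea is only floated here, not specified. One possible reading, sketched under assumptions: waste is the padding added when the request is rounded up to a multiple of the candidate page size, and the bound is a percentage of the request itself. The function name `choose_size_bounded_waste` and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical policy sketch: among the candidate huge page sizes, return
 * the largest one whose round-up padding is at most max_waste_pct percent
 * of the request.  Returns 0 if no candidate satisfies the bound.
 */
uint64_t
choose_size_bounded_waste(uint64_t request, const uint64_t *sizes,
                          size_t nsizes, unsigned max_waste_pct)
{
    uint64_t    best = 0;

    assert(request > 0);
    for (size_t i = 0; i < nsizes; i++)
    {
        uint64_t    sz = sizes[i];
        uint64_t    rounded;
        uint64_t    waste;

        if (sz == 0)
            continue;
        /* Round the request up to the next multiple of this page size. */
        rounded = ((request + sz - 1) / sz) * sz;
        waste = rounded - request;
        if (waste * 100 <= request * (uint64_t) max_waste_pct && sz > best)
            best = sz;
    }
    return best;
}
```

With 2MB and 1GB candidates and a 10% bound, an exact 8GB request would select 1GB pages, while 8GB plus 1MB would fall back to 2MB pages, because rounding up to 9GB wastes more than 10% of the request.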
Re: [HACKERS] munmap() failure due to sloppy handling of hugepage size
Merlin Moncure writes:
> ISTM all this silliness is pretty much unique to linux anyways.
> Instead of reading the filesystem, what about doing test map and test
> unmap?

And if mmap succeeds and munmap fails, you'll recover how exactly?

If this API were less badly designed, we'd not be having this problem
in the first place ...

regards, tom lane
Re: [HACKERS] munmap() failure due to sloppy handling of hugepage size
On Wed, Oct 12, 2016 at 5:10 PM, Tom Lane wrote:
> Alvaro Herrera writes:
>> Tom Lane wrote:
>>> According to
>>> https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt
>>> looking into /proc/meminfo is the longer-standing API and thus is
>>> likely to work on more kernel versions. Also, if you look into
>>> /sys then you are going to see multiple possible values and it's
>>> not clear how to choose the right one.
>
>> I'm not sure that this is the best rationale. In my system there are
>> 2MB and 1GB huge page sizes; in systems with lots of memory (let's say 8
>> GB of shared memory is requested) it seems a clear winner to allocate 8
>> 1GB hugepages than 4096 2MB hugepages because the page table is so much
>> smaller. The /proc interface only shows the 2MB page size, so if we go
>> that route we'd not be getting the full benefit of the feature.
>
> And you'll tell mmap() which one to do how exactly? I haven't found
> anything explaining how applications get to choose which page size applies
> to their request. The kernel document says that /proc/meminfo reflects
> the "default" size, and I'd assume that that's what we'll get from mmap.

hm. for (recent) linux, I see:

MAP_HUGE_2MB, MAP_HUGE_1GB (since Linux 3.8)
    Used in conjunction with MAP_HUGETLB to select alternative
    hugetlb page sizes (respectively, 2 MB and 1 GB) on systems
    that support multiple hugetlb page sizes.

    More generally, the desired huge page size can be configured
    by encoding the base-2 logarithm of the desired page size in
    the six bits at the offset MAP_HUGE_SHIFT. (A value of zero
    in this bit field provides the default huge page size; the
    default huge page size can be discovered via the Hugepagesize
    field exposed by /proc/meminfo.) Thus, the above two
    constants are defined as:

        #define MAP_HUGE_2MB    (21 << MAP_HUGE_SHIFT)
        #define MAP_HUGE_1GB    (30 << MAP_HUGE_SHIFT)

    The range of huge page sizes that are supported by the system
    can be discovered by listing the subdirectories in
    /sys/kernel/mm/hugepages.

via: http://man7.org/linux/man-pages/man2/mmap.2.html#NOTES

ISTM all this silliness is pretty much unique to linux anyways.
Instead of reading the filesystem, what about doing test map and test
unmap? We could zero in on the page size for default I think with
some probing of known possible values.

merlin
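The encoding quoted from the man page can be illustrated with a short sketch that computes the flag bits for an arbitrary power-of-two page size. To keep the example portable it does not include the system headers; `SKETCH_MAP_HUGE_SHIFT` is a local stand-in assumed to match Linux's `MAP_HUGE_SHIFT` (26), and `huge_size_flag` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-in for Linux's MAP_HUGE_SHIFT; assumed to be 26. */
#define SKETCH_MAP_HUGE_SHIFT 26

/*
 * Hypothetical helper: encode the base-2 logarithm of a power-of-two
 * huge page size into the bit field starting at SKETCH_MAP_HUGE_SHIFT.
 * The result would be OR'ed with MAP_HUGETLB in a real mmap() call.
 */
unsigned int
huge_size_flag(uint64_t page_size)
{
    unsigned int lg = 0;

    /* The encoding only makes sense for a nonzero power of two. */
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    while ((page_size >> lg) != 1)
        lg++;
    return lg << SKETCH_MAP_HUGE_SHIFT;
}
```

For 2MB this yields 21 shifted left by 26 bits and for 1GB it yields 30 shifted left by 26 bits, matching the two constants in the man page excerpt.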
Re: [HACKERS] munmap() failure due to sloppy handling of hugepage size
Alvaro Herrera writes:
> Tom Lane wrote:
>> According to
>> https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt
>> looking into /proc/meminfo is the longer-standing API and thus is
>> likely to work on more kernel versions. Also, if you look into
>> /sys then you are going to see multiple possible values and it's
>> not clear how to choose the right one.

> I'm not sure that this is the best rationale. In my system there are
> 2MB and 1GB huge page sizes; in systems with lots of memory (let's say 8
> GB of shared memory is requested) it seems a clear winner to allocate 8
> 1GB hugepages than 4096 2MB hugepages because the page table is so much
> smaller. The /proc interface only shows the 2MB page size, so if we go
> that route we'd not be getting the full benefit of the feature.

And you'll tell mmap() which one to do how exactly? I haven't found
anything explaining how applications get to choose which page size
applies to their request. The kernel document says that /proc/meminfo
reflects the "default" size, and I'd assume that that's what we'll get
from mmap.

regards, tom lane
Re: [HACKERS] munmap() failure due to sloppy handling of hugepage size
Tom Lane wrote:
> Andres Freund writes:
> > On October 12, 2016 1:25:54 PM PDT, Tom Lane wrote:
> >> A little bit of research suggests that on Linux the thing to do would
> >> be to get the actual default hugepage size by reading /proc/meminfo and
> >> looking for a line like "Hugepagesize: 2048 kB".
>
> > We had that, but Heikki ripped it out when merging... I think you're
> > supposed to use /sys to get the available size.
>
> According to
> https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt
> looking into /proc/meminfo is the longer-standing API and thus is
> likely to work on more kernel versions. Also, if you look into
> /sys then you are going to see multiple possible values and it's
> not clear how to choose the right one.

I'm not sure that this is the best rationale. In my system there are
2MB and 1GB huge page sizes; in systems with lots of memory (let's say 8
GB of shared memory is requested) it seems a clear winner to allocate 8
1GB hugepages than 4096 2MB hugepages because the page table is so much
smaller. The /proc interface only shows the 2MB page size, so if we go
that route we'd not be getting the full benefit of the feature.

We could just fall back to reading /proc if we cannot find the /sys
stuff; that makes it continue to work on older kernels.

Regarding choosing one among several size choices, I'd think the larger
huge page size is always better if the request size is at least that
large, otherwise fall back to the next smaller one. (This could stand
some actual research).

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
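The selection rule proposed in this message, together with its 8 versus 4096 page-count arithmetic, can be sketched as follows. This is an illustration only: the names `pick_largest_fitting` and `pages_needed` are hypothetical, and the list of available sizes is assumed to have been collected elsewhere (for example from /sys).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: return the largest available huge page size that
 * is no bigger than the request, or 0 if even the smallest is too large.
 */
uint64_t
pick_largest_fitting(uint64_t request, const uint64_t *sizes, size_t nsizes)
{
    uint64_t    best = 0;

    for (size_t i = 0; i < nsizes; i++)
    {
        if (sizes[i] <= request && sizes[i] > best)
            best = sizes[i];
    }
    return best;
}

/* Number of pages of the given size needed to cover the request. */
uint64_t
pages_needed(uint64_t request, uint64_t page_size)
{
    assert(page_size > 0);
    return (request + page_size - 1) / page_size;
}
```

For an 8GB request with 2MB and 1GB sizes available, the rule picks 1GB, which needs 8 pages where 2MB pages would need 4096. A 512MB request falls back to 2MB pages.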
Re: [HACKERS] munmap() failure due to sloppy handling of hugepage size
Andres Freund writes:
> On October 12, 2016 1:25:54 PM PDT, Tom Lane wrote:
>> A little bit of research suggests that on Linux the thing to do would
>> be to get the actual default hugepage size by reading /proc/meminfo and
>> looking for a line like "Hugepagesize: 2048 kB".

> We had that, but Heikki ripped it out when merging... I think you're
> supposed to use /sys to get the available size.

According to
https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt
looking into /proc/meminfo is the longer-standing API and thus is
likely to work on more kernel versions. Also, if you look into
/sys then you are going to see multiple possible values and it's
not clear how to choose the right one.

regards, tom lane
Re: [HACKERS] munmap() failure due to sloppy handling of hugepage size
On October 12, 2016 1:25:54 PM PDT, Tom Lane wrote:
> If any of you were following the thread at
> https://www.postgresql.org/message-id/flat/CAOan6TnQeSGcu_627NXQ2Z%2BWyhUzBjhERBm5RN9D0QFWmk7PoQ%40mail.gmail.com
> I spent quite a bit of time following a bogus theory, but the problem
> turns out to be very simple: on Linux, munmap() is pickier than mmap()
> about the length of a hugepage allocation. The comments in sysv_shmem.c
> mention that on older kernels mmap() with MAP_HUGETLB will fail if given
> a length request that's not a multiple of the hugepage size. Well, the
> behavior they replaced that with is little better: mmap() succeeds, but
> it gives you back a region that's been silently enlarged to the next
> hugepage boundary, and then munmap() will fail if you specify the region
> size you asked for rather than the region size you were given.
>
> Since AFAICS there is no way to inquire what region size you were given,
> this API is astonishingly brain-dead IMO. But that seems to be what
> we've got. Chris Richards reported it against a 3.16.7 kernel, and
> I can replicate the behavior on RHEL6 (2.6.32) by asking for an odd-size
> huge page region.
>
> We've mostly masked this by rounding up to a 2MB boundary, which is what
> the hugepage size typically is. But that assumption is wrong on some
> hardware, and it's not likely to get less wrong as time passes.
>
> A little bit of research suggests that on Linux the thing to do would be
> to get the actual default hugepage size by reading /proc/meminfo and
> looking for a line like "Hugepagesize: 2048 kB". I don't know
> of any more-portable API, so this does nothing for non-Linux kernels.
> But we have not heard of similar misbehavior on other platforms, even
> though IA64 and PPC64 can both have hugepages larger than 2MB, so it's
> reasonable to hope that other implementations of munmap() don't have
> the same gotcha.

We had that, but Heikki ripped it out when merging... I think you're
supposed to use /sys to get the available size.

Andres

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
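The workaround implied by the original report is to round the request up to a multiple of the actual hugepage size before calling mmap(), and to remember that rounded length for the later munmap(). A minimal sketch of the rounding step, assuming the hugepage size has already been determined; the name `round_up_to_hugepage` is hypothetical and not taken from sysv_shmem.c.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: round a request up to the next multiple of the
 * hugepage size.  The caller should pass the returned length to both
 * mmap() and munmap() so the two calls agree on the region size.
 */
size_t
round_up_to_hugepage(size_t request, size_t hugepagesize)
{
    assert(hugepagesize != 0);
    if (request % hugepagesize != 0)
        request += hugepagesize - (request % hugepagesize);
    return request;
}
```

With a hard-coded 2MB assumption, a 3MB request becomes 4MB, which is still not a multiple of a 16MB hugepage. That mismatch is the failure mode described above on hardware whose default size is larger than 2MB.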