Re: [HACKERS] munmap() failure due to sloppy handling of hugepage size

2016-10-13 Thread Tom Lane
Andres Freund  writes:
> On 2016-10-12 16:33:38 -0400, Tom Lane wrote:
>> Also, if you look into /sys then you are going to see multiple
>> possible values and it's not clear how to choose the right one.

> That's a fair point. It'd probably be good to use the largest we can,
> bounded by a percentage of max waste or such.  But that's likely
> something for another day.

Yeah.  Merlin pointed out that on versions of Linux newer than my
RHEL6 box, mmap accepts additional flag bits that let you specify
which hugepage size to use.  So we would need to use those if we
wanted to work with anything besides the default size.

Now AFAICT from the documentation I've seen, configuring hugepages
is all still pretty manual, ie the sysadmin has to set up so many huge
pages of each size at or near boot.  So I'm thinking that using a
non-default size should be something that happens only if the user
tells us to, ie we'd need to add a GUC saying "use size X".  That's
pretty ugly but if the admin is intending PG to use pages from a
certain pool, how else would we ensure that the right thing happens?
And it'd provide a way of overriding our default 2MB guess on non-Linux
platforms.

Anyway, anything involving a new GUC is certainly new-feature, HEAD-only
material.  I think though that reading the default hugepage size out of
/proc/meminfo is a back-patchable bug fix.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] munmap() failure due to sloppy handling of hugepage size

2016-10-12 Thread Merlin Moncure
On Wed, Oct 12, 2016 at 5:18 PM, Tom Lane  wrote:
> Merlin Moncure  writes:
>> ISTM all this silliness is pretty much unique to linux anyways.
>> Instead of reading the filesystem, what about doing test map and test
>> unmap?
>
> And if mmap succeeds and munmap fails, you'll recover how exactly?
>
> If this API were less badly designed, we'd not be having this problem
> in the first place ...

I was thinking we could 'guess' in a power-of-2 loop in case the obvious
unmap didn't work, finally aborting if no guess worked.  :-)

merlin




Re: [HACKERS] munmap() failure due to sloppy handling of hugepage size

2016-10-12 Thread Andres Freund
On 2016-10-12 16:33:38 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On October 12, 2016 1:25:54 PM PDT, Tom Lane  wrote:
> >> A little bit of research suggests that on Linux the thing to do would
> >> be to get the actual default hugepage size by reading /proc/meminfo and
> >> looking for a line like "Hugepagesize:   2048 kB".
> 
> > We had that, but Heikki ripped it out when merging... I think you're 
> > supposed to use /sys to get the available size.
> 
> According to
> https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt
> looking into /proc/meminfo is the longer-standing API and thus is
> likely to work on more kernel versions.

MAP_HUGETLB, which we rely on for hugepage support, is newer than the
introduction of the /sys stuff.


> Also, if you look into /sys then you are going to see multiple
> possible values and it's not clear how to choose the right one.

That's a fair point. It'd probably be good to use the largest we can,
bounded by a percentage of max waste or such.  But that's likely
something for another day.
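The "largest we can, bounded by a percentage of max waste" idea could be sketched roughly as follows. This is only an illustration of the rule as stated above, not anything that exists in PostgreSQL: the function name choose_with_waste_bound, the max_waste_percent parameter, and the convention of returning 0 when no size qualifies are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: among the available huge page sizes, return the
 * largest one whose rounding waste stays within max_waste_percent of the
 * request.  Returns 0 if no size qualifies.
 */
static uint64_t
choose_with_waste_bound(uint64_t request, const uint64_t *sizes,
                        size_t nsizes, unsigned int max_waste_percent)
{
    uint64_t    best = 0;
    size_t      i;

    assert(sizes != NULL || nsizes == 0);

    for (i = 0; i < nsizes; i++)
    {
        uint64_t    size = sizes[i];
        uint64_t    rem;
        uint64_t    waste;

        if (size == 0)
            continue;           /* ignore nonsensical entries */

        /* bytes added when the request is rounded up to this page size */
        rem = request % size;
        waste = (rem == 0) ? 0 : size - rem;

        /* compare waste/request against the percentage without floats */
        if (waste * 100 <= request * max_waste_percent && size > best)
            best = size;
    }
    return best;
}
```

With 2MB and 1GB pages available and a 10% bound, a request of 8193MB would pick 2MB pages (rounding up to 1GB pages would waste 1023MB), while a request of exactly 8GB would pick 1GB pages.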

Regards,

Andres




Re: [HACKERS] munmap() failure due to sloppy handling of hugepage size

2016-10-12 Thread Tom Lane
Merlin Moncure  writes:
> ISTM all this silliness is pretty much unique to linux anyways.
> Instead of reading the filesystem, what about doing test map and test
> unmap?

And if mmap succeeds and munmap fails, you'll recover how exactly?

If this API were less badly designed, we'd not be having this problem
in the first place ...

regards, tom lane




Re: [HACKERS] munmap() failure due to sloppy handling of hugepage size

2016-10-12 Thread Merlin Moncure
On Wed, Oct 12, 2016 at 5:10 PM, Tom Lane  wrote:
> Alvaro Herrera  writes:
>> Tom Lane wrote:
>>> According to
>>> https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt
>>> looking into /proc/meminfo is the longer-standing API and thus is
>>> likely to work on more kernel versions.  Also, if you look into
>>> /sys then you are going to see multiple possible values and it's
>>> not clear how to choose the right one.
>
>> I'm not sure that this is the best rationale.  In my system there are
>> 2MB and 1GB huge page sizes; in systems with lots of memory (let's say 8
>> GB of shared memory is requested) it seems a clear winner to allocate 8
>> 1GB hugepages rather than 4096 2MB hugepages because the page table is so much
>> smaller.  The /proc interface only shows the 2MB page size, so if we go
>> that route we'd not be getting the full benefit of the feature.
>
> And you'll tell mmap() which one to do how exactly?  I haven't found
> anything explaining how applications get to choose which page size applies
> to their request.  The kernel document says that /proc/meminfo reflects
> the "default" size, and I'd assume that that's what we'll get from mmap.

hm. for (recent) linux, I see:

   MAP_HUGE_2MB, MAP_HUGE_1GB (since Linux 3.8)
  Used in conjunction with MAP_HUGETLB to select alternative
  hugetlb page sizes (respectively, 2 MB and 1 GB) on systems
  that support multiple hugetlb page sizes.

  More generally, the desired huge page size can be configured
  by encoding the base-2 logarithm of the desired page size in
  the six bits at the offset MAP_HUGE_SHIFT.  (A value of zero
  in this bit field provides the default huge page size; the
  default huge page size can be discovered via the Hugepagesize
  field exposed by /proc/meminfo.)  Thus, the above two
  constants are defined as:

  #define MAP_HUGE_2MB    (21 << MAP_HUGE_SHIFT)
  #define MAP_HUGE_1GB    (30 << MAP_HUGE_SHIFT)

  The range of huge page sizes that are supported by the system
  can be discovered by listing the subdirectories in
  /sys/kernel/mm/hugepages.


via: http://man7.org/linux/man-pages/man2/mmap.2.html#NOTES
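The encoding described in the man page excerpt above can be sketched as a small helper that turns a page size into the flag bits. This is an illustration only: the name hugepage_size_to_flag is made up, and HUGE_SHIFT_ASSUMED is a local stand-in that assumes Linux's MAP_HUGE_SHIFT is 26, so that the sketch compiles without the kernel headers.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: same value as Linux's MAP_HUGE_SHIFT (26). */
#define HUGE_SHIFT_ASSUMED 26

/*
 * Hypothetical helper: encode log2(page_size) in the six-bit field that
 * starts at the shift offset.  Returns 0 and sets *flag on success, or -1
 * if page_size is not a nonzero power of two.
 */
static int
hugepage_size_to_flag(size_t page_size, unsigned int *flag)
{
    unsigned int exponent = 0;

    assert(flag != NULL);

    /* huge page sizes are powers of two; reject anything else */
    if (page_size == 0 || (page_size & (page_size - 1)) != 0)
        return -1;

    /* find the base-2 logarithm by counting up to the set bit */
    while (((size_t) 1 << exponent) < page_size)
        exponent++;

    *flag = exponent << HUGE_SHIFT_ASSUMED;
    return 0;
}
```

The result would be OR'ed into the mmap() flags alongside MAP_HUGETLB; for 2MB and 1GB it yields the same values as the two constants quoted above.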

ISTM all this silliness is pretty much unique to linux anyways.
Instead of reading the filesystem, what about doing a test map and test
unmap?  I think we could zero in on the default page size by probing
known possible values.

merlin




Re: [HACKERS] munmap() failure due to sloppy handling of hugepage size

2016-10-12 Thread Tom Lane
Alvaro Herrera  writes:
> Tom Lane wrote:
>> According to
>> https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt
>> looking into /proc/meminfo is the longer-standing API and thus is
>> likely to work on more kernel versions.  Also, if you look into
>> /sys then you are going to see multiple possible values and it's
>> not clear how to choose the right one.

> I'm not sure that this is the best rationale.  In my system there are
> 2MB and 1GB huge page sizes; in systems with lots of memory (let's say 8
> GB of shared memory is requested) it seems a clear winner to allocate 8
> 1GB hugepages rather than 4096 2MB hugepages because the page table is so much
> smaller.  The /proc interface only shows the 2MB page size, so if we go
> that route we'd not be getting the full benefit of the feature.

And you'll tell mmap() which one to do how exactly?  I haven't found
anything explaining how applications get to choose which page size applies
to their request.  The kernel document says that /proc/meminfo reflects
the "default" size, and I'd assume that that's what we'll get from mmap.

regards, tom lane




Re: [HACKERS] munmap() failure due to sloppy handling of hugepage size

2016-10-12 Thread Alvaro Herrera
Tom Lane wrote:
> Andres Freund  writes:
> > On October 12, 2016 1:25:54 PM PDT, Tom Lane  wrote:
> >> A little bit of research suggests that on Linux the thing to do would
> >> be to get the actual default hugepage size by reading /proc/meminfo and
> >> looking for a line like "Hugepagesize:   2048 kB".
> 
> > We had that, but Heikki ripped it out when merging... I think you're 
> > supposed to use /sys to get the available size.
> 
> According to
> https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt
> looking into /proc/meminfo is the longer-standing API and thus is
> likely to work on more kernel versions.  Also, if you look into
> /sys then you are going to see multiple possible values and it's
> not clear how to choose the right one.

I'm not sure that this is the best rationale.  In my system there are
2MB and 1GB huge page sizes; in systems with lots of memory (let's say 8
GB of shared memory is requested) it seems a clear winner to allocate 8
1GB hugepages rather than 4096 2MB hugepages because the page table is so much
smaller.  The /proc interface only shows the 2MB page size, so if we go
that route we'd not be getting the full benefit of the feature.

We could just fall back to reading /proc if we cannot find the /sys
stuff; that makes it continue to work on older kernels.
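A rough sketch of the /sys side of that fallback might look like the following. It assumes the subdirectories of /sys/kernel/mm/hugepages are named like "hugepages-2048kB"; the function names and the convention of returning -1 (meaning "fall back to /proc/meminfo") when the directory cannot be opened are assumptions for illustration.

```c
#include <assert.h>
#include <dirent.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: parse a directory name such as "hugepages-2048kB".
 * Returns 1 and stores the size in bytes on a match, 0 otherwise.
 */
static int
parse_hugepages_dirname(const char *name, uint64_t *bytes)
{
    unsigned long kb;
    int         end = 0;

    assert(name != NULL && bytes != NULL);

    /* %n is only reached, and end only set, if the "kB" suffix matched */
    if (sscanf(name, "hugepages-%lukB%n", &kb, &end) != 1)
        return 0;
    if (end <= 0 || name[end] != '\0')
        return 0;

    *bytes = (uint64_t) kb * 1024;
    return 1;
}

/*
 * Collect up to max huge page sizes from dirpath.  Returns the number of
 * sizes found, or -1 if the directory cannot be opened, in which case the
 * caller would fall back to reading /proc/meminfo.
 */
static int
list_hugepage_sizes(const char *dirpath, uint64_t *out, int max)
{
    DIR        *dir = opendir(dirpath);
    struct dirent *de;
    int         n = 0;

    if (dir == NULL)
        return -1;

    while (n < max && (de = readdir(dir)) != NULL)
    {
        if (parse_hugepages_dirname(de->d_name, &out[n]))
            n++;
    }
    closedir(dir);
    return n;
}
```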

Regarding choosing one among several size choices, I'd think the larger
huge page size is always better if the request size is at least that
large, otherwise fall back to the next smaller one.  (This could stand
some actual research).
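The selection rule in the previous paragraph, together with the page-count arithmetic from the 8GB example, could be sketched like this. The helper names are hypothetical, and returning the smallest available size when the request is smaller than every huge page size is one possible reading of "fall back to the next smaller one".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: pick the largest huge page size that does not
 * exceed the request.  If every size is larger than the request, fall
 * back to the smallest one.  Returns 0 if no sizes are given.
 */
static uint64_t
choose_hugepage_size(uint64_t request, const uint64_t *sizes, size_t nsizes)
{
    uint64_t    best = 0;       /* largest size <= request seen so far */
    uint64_t    smallest = 0;   /* fallback when nothing fits */
    size_t      i;

    assert(sizes != NULL || nsizes == 0);

    for (i = 0; i < nsizes; i++)
    {
        if (smallest == 0 || sizes[i] < smallest)
            smallest = sizes[i];
        if (sizes[i] <= request && sizes[i] > best)
            best = sizes[i];
    }
    return (best != 0) ? best : smallest;
}

/* Number of pages of page_size needed to cover request bytes. */
static uint64_t
pages_needed(uint64_t request, uint64_t page_size)
{
    assert(page_size > 0);
    return (request + page_size - 1) / page_size;
}
```

For an 8GB request with 2MB and 1GB sizes available, this picks 1GB and needs 8 pages, against 4096 pages at 2MB.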

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] munmap() failure due to sloppy handling of hugepage size

2016-10-12 Thread Tom Lane
Andres Freund  writes:
> On October 12, 2016 1:25:54 PM PDT, Tom Lane  wrote:
>> A little bit of research suggests that on Linux the thing to do would
>> be to get the actual default hugepage size by reading /proc/meminfo and
>> looking for a line like "Hugepagesize:   2048 kB".

> We had that, but Heikki ripped it out when merging... I think you're supposed 
> to use /sys to get the available size.

According to
https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt
looking into /proc/meminfo is the longer-standing API and thus is
likely to work on more kernel versions.  Also, if you look into
/sys then you are going to see multiple possible values and it's
not clear how to choose the right one.

regards, tom lane




Re: [HACKERS] munmap() failure due to sloppy handling of hugepage size

2016-10-12 Thread Andres Freund


On October 12, 2016 1:25:54 PM PDT, Tom Lane  wrote:
>If any of you were following the thread at
>https://www.postgresql.org/message-id/flat/CAOan6TnQeSGcu_627NXQ2Z%2BWyhUzBjhERBm5RN9D0QFWmk7PoQ%40mail.gmail.com
>I spent quite a bit of time following a bogus theory, but the problem
>turns out to be very simple: on Linux, munmap() is pickier than mmap()
>about the length of a hugepage allocation.  The comments in sysv_shmem.c
>mention that on older kernels mmap() with MAP_HUGETLB will fail if given
>a length request that's not a multiple of the hugepage size.  Well, the
>behavior they replaced that with is little better: mmap() succeeds, but
>it gives you back a region that's been silently enlarged to the next
>hugepage boundary, and then munmap() will fail if you specify the region
>size you asked for rather than the region size you were given.
>
>Since AFAICS there is no way to inquire what region size you were given,
>this API is astonishingly brain-dead IMO.  But that seems to be what
>we've got.  Chris Richards reported it against a 3.16.7 kernel, and
>I can replicate the behavior on RHEL6 (2.6.32) by asking for an odd-size
>huge page region.
>
>We've mostly masked this by rounding up to a 2MB boundary, which is what
>the hugepage size typically is.  But that assumption is wrong on some
>hardware, and it's not likely to get less wrong as time passes.
>
>A little bit of research suggests that on Linux the thing to do would be
>to get the actual default hugepage size by reading /proc/meminfo and
>looking for a line like "Hugepagesize:   2048 kB".  I don't know
>of any more-portable API, so this does nothing for non-Linux kernels.
>But we have not heard of similar misbehavior on other platforms, even
>though IA64 and PPC64 can both have hugepages larger than 2MB, so it's
>reasonable to hope that other implementations of munmap() don't have
>the same gotcha.

We had that, but Heikki ripped it out when merging... I think you're supposed 
to use /sys to get the available size.

Andres




[HACKERS] munmap() failure due to sloppy handling of hugepage size

2016-10-12 Thread Tom Lane
If any of you were following the thread at
https://www.postgresql.org/message-id/flat/CAOan6TnQeSGcu_627NXQ2Z%2BWyhUzBjhERBm5RN9D0QFWmk7PoQ%40mail.gmail.com
I spent quite a bit of time following a bogus theory, but the problem
turns out to be very simple: on Linux, munmap() is pickier than mmap()
about the length of a hugepage allocation.  The comments in sysv_shmem.c
mention that on older kernels mmap() with MAP_HUGETLB will fail if given
a length request that's not a multiple of the hugepage size.  Well, the
behavior they replaced that with is little better: mmap() succeeds, but
it gives you back a region that's been silently enlarged to the next
hugepage boundary, and then munmap() will fail if you specify the region
size you asked for rather than the region size you were given.

Since AFAICS there is no way to inquire what region size you were given,
this API is astonishingly brain-dead IMO.  But that seems to be what
we've got.  Chris Richards reported it against a 3.16.7 kernel, and
I can replicate the behavior on RHEL6 (2.6.32) by asking for an odd-size
huge page region.

We've mostly masked this by rounding up to a 2MB boundary, which is what
the hugepage size typically is.  But that assumption is wrong on some
hardware, and it's not likely to get less wrong as time passes.
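The rounding in question amounts to something like the sketch below, with the huge page size passed in rather than hard-wired to 2MB. The function name is made up for illustration and this is not a copy of the code in sysv_shmem.c.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: round a requested length up to the next multiple
 * of the huge page size, so that the same length can be given to both
 * mmap() and munmap().
 */
static size_t
round_up_to_hugepage(size_t request, size_t hugepagesize)
{
    size_t      rem;

    assert(hugepagesize > 0);

    rem = request % hugepagesize;
    if (rem != 0)
        request += hugepagesize - rem;
    return request;
}
```

The caller would remember the rounded value and pass exactly that to munmap() later, so the length always matches the region the kernel actually handed back.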

A little bit of research suggests that on Linux the thing to do would be
to get the actual default hugepage size by reading /proc/meminfo and
looking for a line like "Hugepagesize:   2048 kB".  I don't know
of any more-portable API, so this does nothing for non-Linux kernels.
But we have not heard of similar misbehavior on other platforms, even
though IA64 and PPC64 can both have hugepages larger than 2MB, so it's
reasonable to hope that other implementations of munmap() don't have
the same gotcha.
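Reading the default size out of /proc/meminfo could look roughly like this. It is a sketch under the assumption that the line has the "Hugepagesize:   2048 kB" shape shown above; the function names are hypothetical, and a caller would keep its built-in 2MB guess whenever the lookup returns 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: parse one meminfo line of the form
 * "Hugepagesize:   2048 kB".  Returns 1 and stores the size in bytes on a
 * match, 0 otherwise.
 */
static int
parse_hugepagesize_line(const char *line, size_t *bytes)
{
    unsigned long kb;
    char        unit[3];

    assert(line != NULL && bytes != NULL);

    /* whitespace in the format matches any run of blanks in the line */
    if (sscanf(line, "Hugepagesize: %lu %2s", &kb, unit) != 2)
        return 0;
    if (strcmp(unit, "kB") != 0)
        return 0;

    *bytes = (size_t) kb * 1024;
    return 1;
}

/*
 * Scan a meminfo-style file for the default huge page size.  Returns 1 on
 * success, 0 if the file cannot be opened or has no such line.
 */
static int
read_default_hugepage_size(const char *path, size_t *bytes)
{
    FILE       *fp = fopen(path, "r");
    char        buf[128];
    int         found = 0;

    if (fp == NULL)
        return 0;

    while (!found && fgets(buf, sizeof(buf), fp) != NULL)
        found = parse_hugepagesize_line(buf, bytes);

    fclose(fp);
    return found;
}
```

On Linux the path argument would be "/proc/meminfo"; taking it as a parameter just keeps the sketch easy to try against a sample file.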

Barring objections I'll go make this happen.

regards, tom lane

