Re: [HACKERS] things I learned from working on memory allocation

2014-07-15 Thread Robert Haas
On Mon, Jul 14, 2014 at 12:19 PM, Andres Freund and...@2ndquadrant.com wrote:
 On 2014-07-14 11:24:26 -0400, Robert Haas wrote:
 On Sun, Jul 13, 2014 at 2:23 PM, Andres Freund and...@2ndquadrant.com 
 wrote:
  The actual if (lock != NULL) bit costs significant amounts of cycles?
  I'd have assumed that branch prediction takes care of that. Or is it
  actually the icache not keeping up? Did you measure icache vs. dcache
  misses?
  Have you played with unlikely()/likely() type of macros?

 I have not.  I think it's really got more to do with how much stuff
 needs to be saved in the stack frame, but I might be wrong about that.

 I don't really see why that'd play such a big role. The register
 pressure on ppc/amd64 shouldn't be high enough that the (lock != NULL)
 alone requires pushing anything onto the stack. Sure, the call to
 LWLockAcquire() will, but if that's done in the separate branch...

I can't tell you for sure what the cause is; I just saw that the
difference was very significant.

 I found that it was possible to buy back most of the cost by
 replacing (*context->methods->free_p) (context, pointer); with a
 hard-coded AllocSetFree(context, pointer), so that gives you some idea
 what order of magnitude we're talking about here.

 That's measured with a microbenchmark or actual postgres workloads?
 Because in my measurements there wasn't consistent benefit in doing so
 even when benchmarking workloads where allocation is a bottleneck.

Microbenchmark.

  6. In general, I'm worried that it's going to be hard to keep the
  overhead of parallel sort from leaking into the non-parallel case.
  With the no-allocator approach, every place that uses
  GetMemoryChunkSpace() or repalloc() or pfree() will have to handle the
  DSM and non-DSM cases differently, which isn't great for either
  performance or maintainability.  And even with an allocator, the
  SortTuple array will need to use relative pointers in a DSM; that
  might burden the non-DSM case.
 
  Yes, I share that concern.
 
  I somewhat doubt that tuplesort.c really is the right layer to do the
  parallel part of parallel sort including the chunked storage. Why not
  pack the tuples outside it and teach tuplesort.c to not copy tuples for
  some _put* routine? Then tuplesort can entirely happen inside a single
  process and doesn't have to worry about indirect pointers or anything?
  Yes, it'll need slightly more memory, but that doesn't sound too bad.

 I'm not sure I understand what you're describing here:

 - the _put* routine has to do with how tuples get into the tuplesort;
 but parallel sort has to do with how we get them in order once they're
 in there
 - having tuplesort happen inside a single process seems like the exact
 opposite of parallel sort
 - why would we need more memory?

 But having expressed my confusion, I'd certainly welcome further
 comment on this point, because I think figuring out how to set up the
 abstraction is in some ways the hardest part of this problem - and it
 is certainly made harder by the complexity of the existing
 abstraction.

 I think we're imagining quite different approaches to the problem. In my
 mind the parallelism part is essentially a layer *above* tuplesort.c,
 but I think you're thinking about a much more integrated approach.

So, I don't see how that can work.  tuplesort.c is *the* entrypoint
for sorting throughout the backend.  If you didn't use those same
entrypoints, you'd have to update every caller, which seems tedious
and rather pointless.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] things I learned from working on memory allocation

2014-07-14 Thread Robert Haas
On Sun, Jul 13, 2014 at 2:23 PM, Andres Freund and...@2ndquadrant.com wrote:
 The actual if (lock != NULL) bit costs significant amounts of cycles?
 I'd have assumed that branch prediction takes care of that. Or is it
 actually the icache not keeping up? Did you measure icache vs. dcache
 misses?
 Have you played with unlikely()/likely() type of macros?

I have not.  I think it's really got more to do with how much stuff
needs to be saved in the stack frame, but I might be wrong about that.

 Unfortunately, there's some incremental overhead
 in pfree, amounting to a single test of global variable, even when the
 new allocator is never used, which is measurable on a microbenchmark;
 I don't remember the exact numbers off-hand.

 Hm. Why's that needed? Because you're searching for your allocator's
 superblocks in pfree()? I guess you don't store information about the
 size/context for every allocation like aset.c does?

Since the chunks don't have a StandardChunkHeader, pfree has got to
decide whether it's safe to look at the 16 bytes preceding the chunk
before doing so.  If the new allocator's not in use (single global
variable test) that's pretty cheap, but not so cheap as you might
hope.  I found that it was possible to buy back most of the cost by
replacing (*context->methods->free_p) (context, pointer); with a
hard-coded AllocSetFree(context, pointer), so that gives you some idea
what order of magnitude we're talking about here.
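
To make the shape of that check concrete, here is a rough sketch (with
hypothetical names; this is not the actual patch) of the dispatch pfree()
ends up doing once header-less chunks may exist - a single global flag
guards the superblock lookup, so the common case pays only one test before
the usual "read the 16 bytes before the pointer" path:

```c
#include <stdbool.h>
#include <stddef.h>

bool sb_alloc_in_use;               /* set once the new allocator is first used */
int  freed_via_sb;                  /* counters for illustration only */
int  freed_via_aset;

/* Stand-in for the superblock lookup; always false in this sketch. */
bool pointer_is_sb_chunk(void *p)
{
    (void) p;
    return false;
}

void sb_free(void *p)               /* new allocator's free */
{
    (void) p;
    freed_via_sb++;
}

void aset_free_from_header(void *p) /* would read the StandardChunkHeader at p-16 */
{
    (void) p;
    freed_via_aset++;
}

void pfree_sketch(void *p)
{
    if (sb_alloc_in_use && pointer_is_sb_chunk(p))
        sb_free(p);                 /* superblock path, possibly cache-missing */
    else
        aset_free_from_header(p);   /* safe: chunk carries a header */
}
```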

 When the new allocator
 *is* used, pfree gets a lot slower - mostly because one of the data
 structures used by the new allocator suffers occasional cache misses
 during free operations.  I think there might be an argument that some
 hit on pfree would be worth taking in exchange for better
 space-efficiency, because we mostly free contexts by resetting them
 anyway; but I haven't yet managed to make that hit small enough to
 feel especially good about it.

 The speed bump is hit when pfree()ing an allocation that's been allocated
 with the new allocator or also when there's any concurrent activity for
 that allocator?

The latter.

 3. The current MemoryContext abstraction layer is inadequate for
 almost everything one might want to do.  The idea of a context that
 allows only allocation, and ignores requests to free memory, has been
 discussed more than once.  Such an allocator could skip the
 power-of-two rounding that aset.c does, but it couldn't omit the chunk
 header, which means that in practice it would save no memory at all
 for cases like the one mentioned above (REINDEX
 pgbench_accounts_pkey).

 I personally think that the performance benefit of not doing the
 rounding, not accessing the freelist, and such is more interesting for
 such an allocator than the memory savings. I'm pretty sure such an
 'allocation only' allocator for e.g. the parser and parts of the
 executor would be a rather noticeable performance benefit.

I don't have any reason to believe that.  All palloc() does in the
common case is bump the boundary pointer and stick the chunk header in
there.  You're not going to be able to make that much faster.

 But even for the cases where the space savings are the important bit: To
 me it sounds feasible to declare that some allocators don't allow
 reallocations and freeing. Those then would be allowed to omit the chunk
 header.  To make that debuggable we would need to add a chunk header
 when assertions are enabled to error out when such an operation is
 performed - but that's an acceptable price imo.

Hmm, yeah, that could be done.  However, it does have the disadvantage
of requiring you to affirmatively avoid pfreeing anything that might
have come from such a context, rather than simply making pfree a noop.

 Btw, am I the only one that wondered if it wouldn't be rather
 beneficial to make aset.c add the chunk header before rounding?

Well, it would help in some cases, hurt in others, and make no
difference at all in still others.  I mean, it just has to do with how
well your size classes fit your actual memory allocation pattern;
you'd be effectively changing the size classes from 32, 64, 128, 256
to 16, 48, 112, 240 and that could be a win or a loss depending on the
actual allocation pattern.   I'm sure it's pretty trivial to construct
a test case favoring either outcome.
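
The arithmetic behind that trade-off can be sketched directly.  The
rounding helper below is only an illustration of the scheme (including the
16-byte header size), not aset.c's actual code:

```c
#include <stddef.h>

#define CHUNK_HDR 16            /* assumed header size, for illustration */

/* Round a request up to a power-of-two size class (smallest class 8). */
size_t pow2_round(size_t n)
{
    size_t s = 8;

    while (s < n)
        s <<= 1;
    return s;
}

/* Current scheme: round the request, then add the header. */
size_t total_round_then_header(size_t request)
{
    return pow2_round(request) + CHUNK_HDR;
}

/* Alternative scheme: add the header first, then round the sum. */
size_t total_header_then_round(size_t request)
{
    return pow2_round(request + CHUNK_HDR);
}
```

For a 48-byte request the header-first scheme wins (64 vs. 80 total
bytes), while for a 64-byte request it loses (128 vs. 80) - exactly the
"depends on the allocation pattern" point above.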

 5. It's tempting to look at other ways of solving the parallel sort
 problem that don't need an allocator - perhaps by simply packing all
 the tuples into a DSM one after the next.  But this is not easy to do,
 or at least it's not easy to do efficiently.  Tuples enter tuplesort.c
 through one of the tuplesort_put*() functions, most of which end up
 calling one of the copytup_*() functions.  But you can't easily just
 change those functions to stick the tuple someplace else, because they
 don't necessarily get the address of the space to be used from palloc
 directly.  In particular, copytup_heap() calls
 ExecCopySlotMinimalTuple(), and copytup_cluster() calls
 heap_copytuple().  

Re: [HACKERS] things I learned from working on memory allocation

2014-07-14 Thread Andres Freund
On 2014-07-14 11:24:26 -0400, Robert Haas wrote:
 On Sun, Jul 13, 2014 at 2:23 PM, Andres Freund and...@2ndquadrant.com wrote:
  The actual if (lock != NULL) bit costs significant amounts of cycles?
  I'd have assumed that branch prediction takes care of that. Or is it
  actually the icache not keeping up? Did you measure icache vs. dcache
  misses?
  Have you played with unlikely()/likely() type of macros?
 
 I have not.  I think it's really got more to do with how much stuff
 needs to be saved in the stack frame, but I might be wrong about that.

I don't really see why that'd play such a big role. The register
pressure on ppc/amd64 shouldn't be high enough that the (lock != NULL)
alone requires pushing anything onto the stack. Sure, the call to
LWLockAcquire() will, but if that's done in the separate branch...

  Unfortunately, there's some incremental overhead
  in pfree, amounting to a single test of global variable, even when the
  new allocator is never used, which is measurable on a microbenchmark;
  I don't remember the exact numbers off-hand.
 
  Hm. Why's that needed? Because you're searching for your allocator's
  superblocks in pfree()? I guess you don't store information about the
  size/context for every allocation like aset.c does?
 
 Since the chunks don't have a StandardChunkHeader, pfree has got to
 decide whether it's safe to look at the 16 bytes preceding the chunk
 before doing so.  If the new allocator's not in use (single global
 variable test) that's pretty cheap, but not so cheap as you might
 hope.

That's what I was afraid of :/

 I found that it was possible to buy back most of the cost by
 replacing (*context->methods->free_p) (context, pointer); with a
 hard-coded AllocSetFree(context, pointer), so that gives you some idea
 what order of magnitude we're talking about here.

That's measured with a microbenchmark or actual postgres workloads?
Because in my measurements there wasn't consistent benefit in doing so
even when benchmarking workloads where allocation is a bottleneck.

  3. The current MemoryContext abstraction layer is inadequate for
  almost everything one might want to do.  The idea of a context that
  allows only allocation, and ignores requests to free memory, has been
  discussed more than once.  Such an allocator could skip the
  power-of-two rounding that aset.c does, but it couldn't omit the chunk
  header, which means that in practice it would save no memory at all
  for cases like the one mentioned above (REINDEX
  pgbench_accounts_pkey).
 
  I personally think that the performance benefit of not doing the
  rounding, not accessing the freelist, and such is more interesting for
  such an allocator than the memory savings. I'm pretty sure such an
  'allocation only' allocator for e.g. the parser and parts of the
  executor would be a rather noticeable performance benefit.
 
 I don't have any reason to believe that.  All palloc() does in the
 common case is bump the boundary pointer and stick the chunk header in
 there.  You're not going to be able to make that much faster.

In my profiling, the access to the freelist (to test whether there are
free chunks in there) is actually a noticeable stall. Removing it makes things
measurably faster. Also adds memory overhead :) If you look at it, a
simple AllocSetAlloc() will currently access three cachelines inside
AllocSetContext - with some care that can be one.

  But even for the cases where the space savings are the important bit: To
  me it sounds feasible to declare that some allocators don't allow
  reallocations and freeing. Those then would be allowed to omit the chunk
  header.  To make that debuggable we would need to add a chunk header
  when assertions are enabled to error out when such an operation is
  performed - but that's an acceptable price imo.
 
 Hmm, yeah, that could be done.  However, it does have the disadvantage
 of requiring you to affirmatively avoid pfreeing anything that might
 have come from such a context, rather than simply making pfree a noop.

Yep. Sounds feasible for a number of cases imo. I have serious doubts
that adding a measurable cost to pfree()s for allocations from all
contexts is going to be fun. There are some operators that do a lot of
those - especially inside btree opclasses which have to do so. Since
those are the ones you'll often be calling from parallel sort...

  Btw, am I the only one that wondered if it wouldn't be rather
  beneficial to make aset.c add the chunk header before rounding?
 
 Well, it would help in some cases, hurt in others, and make no
 difference at all in still others.  I mean, it just has to do with how
 well your size classes fit your actual memory allocation pattern;
 you'd be effectively changing the size classes from 32, 64, 128, 256
 to 16, 48, 112, 240 and that could be a win or a loss depending on the
 actual allocation pattern.   I'm sure it's pretty trivial to construct
 a test case favoring either outcome.

I was actually thinking about it 

Re: [HACKERS] things I learned from working on memory allocation

2014-07-13 Thread Andres Freund
Hi Robert,

On 2014-07-07 15:57:00 -0400, Robert Haas wrote:
 1. I tried to write a single allocator which would be better than
 aset.c in two ways: first, by allowing allocation from dynamic shared
 memory; and second, by being more memory-efficient than aset.c.
 Heikki told me at PGCon that he thought that was a bad idea, and I now
 think he was right.

I agree with that as well. I am wondering whether it's possible to use
most of the same code and compile it once as a per-process allocator and
once as the allocator for shared memory.

 Allocating from dynamic shared memory requires
 the use of relative rather than absolute pointers (because the DSM
 could be mapped at different addresses in different processes) and, if
 concurrent allocating and freeing is to be supported, locking.
 Although neither of these things requires very many extra
 instructions, the common path for aset.c is extremely short, and we
 haven't got those instructions to spare.  Code like if (lock != NULL)
 LWLockAcquire(lock, LW_EXCLUSIVE) is quite expensive even if the
 branch is never taken, apparently because being prepared to call a
 function at that point requires storing more stuff in our stack frame,
 and extra instructions to save and restore registers are painfully
 expensive on allocation-heavy workloads.  On PPC64, to match aset.c's
 performance, a palloc/pfree cycle must run in ~120 instructions, ~80
 cycles, which doesn't leave much room for fluff.

The actual if (lock != NULL) bit costs significant amounts of cycles?
I'd have assumed that branch prediction takes care of that. Or is it
actually the icache not keeping up? Did you measure icache vs. dcache
misses?
Have you played with unlikely()/likely() type of macros?

I don't think that any such fixes will make enough of a
difference to negate the point that these simply are two different
allocators with different requirements, but it's still interesting.

 2. After ripping the relative-pointer and locking stuff out of my
 allocator, I came up with something which performs about like aset.c
 on allocation, but with significantly better memory efficiency on
 small allocations.  REINDEX pgbench_accounts_pkey saves about 22%,
 because the SortTuple array saves nothing but the individual tuples
 take only half as much space, by avoiding the StandardChunkHeader
 overhead.  This seems like a savings worth pursuing, if we can figure
 out how to get it.

Yes, that'd certainly be nice.

 Unfortunately, there's some incremental overhead
 in pfree, amounting to a single test of global variable, even when the
 new allocator is never used, which is measurable on a microbenchmark;
 I don't remember the exact numbers off-hand.

Hm. Why's that needed? Because you're searching for your allocator's
superblocks in pfree()? I guess you don't store information about the
size/context for every allocation like aset.c does?

 When the new allocator
 *is* used, pfree gets a lot slower - mostly because one of the data
 structures used by the new allocator suffers occasional cache misses
 during free operations.  I think there might be an argument that some
 hit on pfree would be worth taking in exchange for better
 space-efficiency, because we mostly free contexts by resetting them
 anyway; but I haven't yet managed to make that hit small enough to
 feel especially good about it.

The speed bump is hit when pfree()ing an allocation that's been allocated
with the new allocator or also when there's any concurrent activity for
that allocator?

 3. The current MemoryContext abstraction layer is inadequate for
 almost everything one might want to do.  The idea of a context that
 allows only allocation, and ignores requests to free memory, has been
 discussed more than once.  Such an allocator could skip the
 power-of-two rounding that aset.c does, but it couldn't omit the chunk
 header, which means that in practice it would save no memory at all
 for cases like the one mentioned above (REINDEX
 pgbench_accounts_pkey).

I personally think that the performance benefit of not doing the
rounding, not accessing the freelist, and such is more interesting for
such an allocator than the memory savings. I'm pretty sure such an
'allocation only' allocator for e.g. the parser and parts of the
executor would be a rather noticeable performance benefit.

But even for the cases where the space savings are the important bit: To
me it sounds feasible to declare that some allocators don't allow
reallocations and freeing. Those then would be allowed to omit the chunk
header.  To make that debuggable we would need to add a chunk header
when assertions are enabled to error out when such an operation is
performed - but that's an acceptable price imo.
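
A minimal sketch of such an allocation-only ("bump") allocator: no
per-chunk header, no power-of-two rounding, only MAXALIGN-style alignment.
This is hypothetical code, not from the patch, and the assert-only debug
header mentioned above is omitted:

```c
#include <stddef.h>

typedef struct BumpContext
{
    char   *base;               /* start of the current block */
    size_t  used;               /* bytes handed out so far */
    size_t  size;               /* total block size */
} BumpContext;

#define ALIGN_UP(n) (((n) + 7) & ~(size_t) 7)   /* stand-in for MAXALIGN */

void *bump_alloc(BumpContext *cxt, size_t request)
{
    size_t  need = ALIGN_UP(request);   /* no power-of-two rounding */
    void   *p;

    if (cxt->used + need > cxt->size)
        return NULL;            /* a real allocator would grab a new block */
    p = cxt->base + cxt->used;
    cxt->used += need;
    return p;                   /* note: no header precedes this pointer */
}
```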

Btw, am I the only one that wondered if it wouldn't be rather
beneficial to make aset.c add the chunk header before rounding?

 While there might be some benefit in other
 cases, without getting rid of or at least reducing the size of the
 chunk-header, I expect 

Re: [HACKERS] things I learned from working on memory allocation

2014-07-11 Thread Robert Haas
On Thu, Jul 10, 2014 at 1:05 AM, Amit Kapila amit.kapil...@gmail.com wrote:
 On Tue, Jul 8, 2014 at 1:27 AM, Robert Haas robertmh...@gmail.com wrote:
 6. In general, I'm worried that it's going to be hard to keep the
 overhead of parallel sort from leaking into the non-parallel case.
 With the no-allocator approach, every place that uses
 GetMemoryChunkSpace() or repalloc() or pfree() will have to handle the
 DSM and non-DSM cases differently, which isn't great for either
 performance or maintainability.  And even with an allocator, the
 SortTuple array will need to use relative pointers in a DSM; that
 might burden the non-DSM case.

 I think you are right in saying that there can be a performance impact
 for the non-parallel case due to pfree() (or other similar calls) and/or
 due to the usage of relative pointers, but can it have a measurable
 impact during the actual sort operation?

Benchmarking seems to indicate that it indeed can.

 If there is a noticeable impact, then do you think having a separate
 file/infrastructure for parallel sort can help?  Basically, non-parallel
 and parallel sort will have some common and some specific parts.
 The above layer will call the parallel/non-parallel functions based on
 operation need.

tuplesort.c already handles so many cases that adding yet
another axis of variability is intimidating.  But it may work out that
there's no better option.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] things I learned from working on memory allocation

2014-07-11 Thread Amit Kapila
On Fri, Jul 11, 2014 at 11:15 PM, Robert Haas robertmh...@gmail.com wrote:
 On Thu, Jul 10, 2014 at 1:05 AM, Amit Kapila amit.kapil...@gmail.com
wrote:
  If there is a noticeable impact, then do you think having a separate
  file/infrastructure for parallel sort can help?  Basically, non-parallel
  and parallel sort will have some common and some specific parts.
  The above layer will call the parallel/non-parallel functions based on
  operation need.

 tuplesort.c already handles so many cases that adding yet
 another axis of variability is intimidating.  But it may work out that
 there's no better option.

For using the new allocator, one idea is that it works seamlessly with the
current memory allocation routines (palloc, pfree, repalloc, ..); another
could be to have separate routines (shm_palloc, shm_pfree, ..) for
allocation from the new allocator.  I think having separate routines means
all the call sites for allocation/deallocation need to be aware of the
operation type (parallel or non-parallel); however, I think this can avoid
the overhead for non-parallel cases.

I think it is a bit difficult to say which approach (with an allocator or
directly using dsm) will turn out to be better, as both have their pros
and cons as listed by you; however, I feel that if we use dsm directly,
we might need to change more core logic than with an allocator.

I believe we will face the same question again if somebody wants to
parallelize joins, so one deciding parameter could be which approach
will lead to less change in core logic/design.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] things I learned from working on memory allocation

2014-07-09 Thread Peter Geoghegan
On Mon, Jul 7, 2014 at 7:29 PM, Peter Geoghegan p...@heroku.com wrote:
 I do think that's a problem with our sort implementation, but it's not
 clear to me whether it's *more* of a problem for parallel sort than it
 is for single-backend sort.

 If you'll forgive me for going on about my patch on this thread, I
 think the pgbench -c 4 and -c 1 cases that I tested suggest it is
 a particular problem for parallel sorts, as there is a much bigger
 difference in transaction throughput, both absolute and proportional,
 between those two with the patch applied. It seems reasonable to
 suppose the difference would be larger still if we were considering a
 single parallel sort, as opposed to multiple independent sorts (of the
 same data) that happen to occur in parallel.

I think that I may have been too optimistic when I said that there was
an apparent trend of memory bandwidth per core merely stagnating:

http://users.ece.cmu.edu/~omutlu/pub/mutlu_memory-scaling_imw13_invited-talk.pdf

As slide 8 indicates, memory capacity per core is expected to go down
30% every two years, while the trend for memory bandwidth per core is
even worse.

-- 
Peter Geoghegan




Re: [HACKERS] things I learned from working on memory allocation

2014-07-09 Thread Amit Kapila
On Tue, Jul 8, 2014 at 1:27 AM, Robert Haas robertmh...@gmail.com wrote:

 6. In general, I'm worried that it's going to be hard to keep the
 overhead of parallel sort from leaking into the non-parallel case.
 With the no-allocator approach, every place that uses
 GetMemoryChunkSpace() or repalloc() or pfree() will have to handle the
 DSM and non-DSM cases differently, which isn't great for either
 performance or maintainability.  And even with an allocator, the
 SortTuple array will need to use relative pointers in a DSM; that
 might burden the non-DSM case.

I think you are right in saying that there can be a performance impact
for the non-parallel case due to pfree() (or other similar calls) and/or
due to the usage of relative pointers, but can it have a measurable
impact during the actual sort operation?  Changes for any of them don't
seem to cause much impact on the overall sort operation, although I am
not sure what kind of impact the usage of relative pointers can cause;
do you have any specific point in mind that makes you think it can cause
a noticeable performance impact?

If there is a noticeable impact, then do you think having a separate
file/infrastructure for parallel sort can help?  Basically, non-parallel
and parallel sort will have some common and some specific parts.
The above layer will call the parallel/non-parallel functions based on
operation need.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


[HACKERS] things I learned from working on memory allocation

2014-07-07 Thread Robert Haas
I wrote previously about my efforts to create a new memory allocator
(then called sb_alloc, now renamed to BlockAllocatorContext in my git
branch for consistency with existing identifiers) for PostgreSQL.  To
make a long story short, many of the basic concepts behind that patch
still seem sound to me, but significant problems remain, which account
for it not having been submitted to the current CommitFest nor
attached to this email.  Working in this area has led me to a few
conclusions - comments welcome.

1. I tried to write a single allocator which would be better than
aset.c in two ways: first, by allowing allocation from dynamic shared
memory; and second, by being more memory-efficient than aset.c.
Heikki told me at PGCon that he thought that was a bad idea, and I now
think he was right.  Allocating from dynamic shared memory requires
the use of relative rather than absolute pointers (because the DSM
could be mapped at different addresses in different processes) and, if
concurrent allocating and freeing is to be supported, locking.
Although neither of these things requires very many extra
instructions, the common path for aset.c is extremely short, and we
haven't got those instructions to spare.  Code like if (lock != NULL)
LWLockAcquire(lock, LW_EXCLUSIVE) is quite expensive even if the
branch is never taken, apparently because being prepared to call a
function at that point requires storing more stuff in our stack frame,
and extra instructions to save and restore registers are painfully
expensive on allocation-heavy workloads.  On PPC64, to match aset.c's
performance, a palloc/pfree cycle must run in ~120 instructions, ~80
cycles, which doesn't leave much room for fluff.
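
The relative-pointer scheme mentioned above can be sketched with a pair of
hypothetical helpers (not the patch's actual API): offsets from the
segment base are stored instead of raw addresses, so the same value is
meaningful in every process even when the DSM is mapped at a different
address.  By convention, offset 0 is reserved to represent NULL:

```c
#include <stddef.h>

typedef size_t relptr;          /* offset from segment base; 0 means NULL */

relptr make_relative(void *base, void *p)
{
    return p ? (size_t) ((char *) p - (char *) base) : 0;
}

void *make_absolute(void *base, relptr r)
{
    return r ? (char *) base + r : NULL;
}
```

Every pointer dereference inside the segment pays for one of these
conversions, which is part of why the extra instructions matter on a
~80-cycle fast path.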

2. After ripping the relative-pointer and locking stuff out of my
allocator, I came up with something which performs about like aset.c
on allocation, but with significantly better memory efficiency on
small allocations.  REINDEX pgbench_accounts_pkey saves about 22%,
because the SortTuple array saves nothing but the individual tuples
take only half as much space, by avoiding the StandardChunkHeader
overhead.  This seems like a savings worth pursuing, if we can figure
out how to get it.  Unfortunately, there's some incremental overhead
in pfree, amounting to a single test of global variable, even when the
new allocator is never used, which is measurable on a microbenchmark;
I don't remember the exact numbers off-hand.  When the new allocator
*is* used, pfree gets a lot slower - mostly because one of the data
structures used by the new allocator suffers occasional cache misses
during free operations.  I think there might be an argument that some
hit on pfree would be worth taking in exchange for better
space-efficiency, because we mostly free contexts by resetting them
anyway; but I haven't yet managed to make that hit small enough to
feel especially good about it.

3. The current MemoryContext abstraction layer is inadequate for
almost everything one might want to do.  The idea of a context that
allows only allocation, and ignores requests to free memory, has been
discussed more than once.  Such an allocator could skip the
power-of-two rounding that aset.c does, but it couldn't omit the chunk
header, which means that in practice it would save no memory at all
for cases like the one mentioned above (REINDEX
pgbench_accounts_pkey).  While there might be some benefit in other
cases, without getting rid of or at least reducing the size of the
chunk-header, I expect that the benefits will be too narrow to make
this approach worth pursuing.  The chunk-header requirement is also a
killer for DSM, again because segments can be mapped at different
addresses in different backends.  I'm inclined to think that if we
can't find a way to get rid of the chunk-header requirement, we ought
to consider ripping out the whole abstraction layer as a
performance-wasting exercise in futility.

4. The MemoryContext layer embeds assumptions not only about the
layout of memory, but the performance of various operations.  For
example, GetMemoryChunkSpace() is assumed to be cheap, so there's no
interface like void *MemoryContextAlloc(Size request_size, Size
*chunk_size) or Size MemoryContextFree(Size chunk_size), but those
would be significantly more efficient for my new allocator.  Possibly
better still would be to have the context itself track utilization and
soft-fail when we run out of space; even for aset.c, it makes little
sense to refuse to accept another tuple when the AllocSet has a
suitable object on the freelist.  That's probably not a critical flaw
because, since tuples being sorted are likely all about the same size,
the waste may be little or nothing.  Other allocators might have
different space-management strategies, though, where this matters
more.
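
The kind of interface point 4 suggests might look roughly like this
(hypothetical signature and rounding, shown only to illustrate the idea:
allocation reports the chunk's actual size up front, so callers that track
memory use never need a separate GetMemoryChunkSpace() call):

```c
#include <stddef.h>

char   arena[1024];             /* toy backing store */
size_t arena_used;

void *alloc_reporting_size(size_t request, size_t *chunk_size)
{
    size_t  rounded = (request + 7) & ~(size_t) 7;  /* stand-in rounding */
    void   *p;

    if (arena_used + rounded > sizeof(arena))
        return NULL;            /* soft failure: caller can stop adding tuples */
    p = arena + arena_used;
    arena_used += rounded;
    *chunk_size = rounded;      /* caller accounts for this, not request */
    return p;
}
```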

5. It's tempting to look at other ways of solving the parallel sort
problem that don't need an allocator - perhaps by simply packing all
the tuples into a DSM one after the next.  But this is not 

Re: [HACKERS] things I learned from working on memory allocation

2014-07-07 Thread Peter Geoghegan
On Mon, Jul 7, 2014 at 12:57 PM, Robert Haas robertmh...@gmail.com wrote:
 5. It's tempting to look at other ways of solving the parallel sort
 problem that don't need an allocator - perhaps by simply packing all
 the tuples into a DSM one after the next.  But this is not easy to do,
 or at least it's not easy to do efficiently.  Tuples enter tuplesort.c
 through one of the tuplesort_put*() functions, most of which end up
 calling one of the copytup_*() functions.  But you can't easily just
 change those functions to stick the tuple someplace else, because they
 don't necessarily get the address of the space to be used from palloc
 directly.  In particular, copytup_heap() calls
 ExecCopySlotMinimalTuple(), and copytup_cluster() calls
 heap_copytuple().  heap_copytuple() is simple enough that you might be
 able to finagle a new API that would make it work, but I'm not sure
 what we could really do about ExecCopySlotMinimalTuple() except call
 it and then copy the result.  Perhaps that'd be OK for a first
 version.

Perhaps. If my experience with sort support for text is anything to go
on, the cost of copying over the tuples isn't really that high
relative to the cost of performing the sort, particularly when there
are many tuples to sort. OTOH, I'd be concerned about the cost of main
memory accesses for pass-by-reference types during parallel tuple
sorts in particular. The same concern applies to cases where the tuple
proper must be accessed, although presumably that occurs less
frequently.
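For what it's worth, the "call it and then copy the result" approach can be
sketched as a bump allocator over a fixed segment. This is a standalone
illustration with made-up names (SegPacker, seg_pack), not tuplesort.c or DSM
code; it hands back offsets rather than pointers, since a shared segment may
map at a different address in each worker:

```c
#include <string.h>

/* Illustrative sketch of the "copy the result" fallback: a function like
 * ExecCopySlotMinimalTuple() pallocs its own result, so a first version
 * could simply bump-pack a second copy into the shared segment and discard
 * the original.  None of these names are PostgreSQL APIs. */
typedef struct SegPacker {
    char  *base;   /* start of the (DSM-like) segment */
    size_t size;   /* total bytes available */
    size_t off;    /* bump pointer: next free offset */
} SegPacker;

/* Returns the offset of the packed copy, or (size_t) -1 when the segment is
 * full.  Offsets stay valid regardless of where the segment is mapped. */
static size_t seg_pack(SegPacker *seg, const void *tuple, size_t len)
{
    size_t aligned = (len + 7) & ~(size_t) 7;   /* keep 8-byte alignment */
    size_t at = seg->off;

    if (at + aligned > seg->size)
        return (size_t) -1;
    memcpy(seg->base + at, tuple, len);
    seg->off = at + aligned;
    return at;
}
```

The extra memcpy here is exactly the copy whose cost is being debated above.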

The LLVM project's new libc++ library passes remark on a similar issue
in their rationale for yet another C++ standard library: "For example,
it is generally accepted that building std::string using the short
string optimization instead of using Copy On Write (COW) is a
superior approach for multicore machines." It seems likely that
they're mostly referencing the apparent trend of stagnating memory
bandwidth per core [1], [2]. To get the real benefit of a parallel
sort, we must do everything we can to avoid having cores/workers
remain idle during the sort while waiting on memory access. I believe
that that's the bigger problem.

[1] 
http://www.admin-magazine.com/HPC/Articles/Finding-Memory-Bottlenecks-with-Stream
[2] 
https://software.intel.com/en-us/articles/detecting-memory-bandwidth-saturation-in-threaded-applications

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] things I learned from working on memory allocation

2014-07-07 Thread Robert Haas
On Mon, Jul 7, 2014 at 5:37 PM, Peter Geoghegan p...@heroku.com wrote:
 On Mon, Jul 7, 2014 at 12:57 PM, Robert Haas robertmh...@gmail.com wrote:
 5. It's tempting to look at other ways of solving the parallel sort
 problem that don't need an allocator - perhaps by simply packing all
 the tuples into a DSM one after the next.  But this is not easy to do,
 or at least it's not easy to do efficiently.  Tuples enter tuplesort.c
 through one of the tuplesort_put*() functions, most of which end up
 calling one of the copytup_*() functions.  But you can't easily just
 change those functions to stick the tuple someplace else, because they
 don't necessarily get the address of the space to be used from palloc
 directly.  In particular, copytup_heap() calls
 ExecCopySlotMinimalTuple(), and copytup_cluster() calls
 heap_copytuple().  heap_copytuple() is simple enough that you might be
 able to finagle a new API that would make it work, but I'm not sure
 what we could really do about ExecCopySlotMinimalTuple() except call
 it and then copy the result.  Perhaps that'd be OK for a first
 version.

 Perhaps. If my experience with sort support for text is anything to go
 on, the cost of copying over the tuples isn't really that high
 relative to the cost of performing the sort, particularly when there
 are many tuples to sort.

The testing I did showed about a 5% overhead on REINDEX INDEX
pgbench_accounts_pkey from one extra tuple copy (cf.
9f03ca915196dfc871804a1f8aad26207f601fd6).  Of course that could vary
by circumstance for a variety of reasons.

 OTOH, I'd be concerned about the cost of main
 memory accesses for pass-by-reference types during parallel tuple
 sorts in particular. The same concern applies to cases where the tuple
 proper must be accessed, although presumably that occurs less
 frequently.

I do think that's a problem with our sort implementation, but it's not
clear to me whether it's *more* of a problem for parallel sort than it
is for single-backend sort.

 The LLVM project's new libc++ library passes remark on a similar issue
 in their rationale for yet another C++ standard library: "For example,
 it is generally accepted that building std::string using the short
 string optimization instead of using Copy On Write (COW) is a
 superior approach for multicore machines." It seems likely that
 they're mostly referencing the apparent trend of stagnating memory
 bandwidth per core [1], [2]. To get the real benefit of a parallel
 sort, we must do everything we can to avoid having cores/workers
 remain idle during the sort while waiting on memory access. I believe
 that that's the bigger problem.

Yes, I agree that's a problem.  There are algorithms for parallel
quicksort in the literature which seem to offer promising solutions,
but unfortunately these infrastructure issues have prevented me from
reaching them.
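For the record, the basic fork/join shape those parallel quicksort algorithms
share can be sketched with pthreads. This is an illustrative toy under stated
assumptions (a pthreads environment, int keys), not a proposal for the
executor: partition once, then sort the two halves concurrently, with a depth
cap to bound thread creation.

```c
#include <pthread.h>
#include <stdlib.h>

/* Toy fork/join parallel quicksort: Hoare partition, then the two halves
 * are sorted concurrently.  Small or deep subproblems fall back to the
 * serial libc qsort().  Illustrative only; not PostgreSQL code. */
#define PAR_CUTOFF 1024

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *) a, y = *(const int *) b;
    return (x > y) - (x < y);
}

typedef struct { int *a; size_t n; int depth; } SortTask;

static void *psort(void *arg)
{
    SortTask *t = arg;

    if (t->n < PAR_CUTOFF || t->depth <= 0) {
        qsort(t->a, t->n, sizeof(int), cmp_int);
        return NULL;
    }

    /* Hoare-style partition around the middle element. */
    int pivot = t->a[t->n / 2];
    long i = 0, j = (long) t->n - 1;
    while (i <= j) {
        while (t->a[i] < pivot) i++;
        while (t->a[j] > pivot) j--;
        if (i <= j) {
            int tmp = t->a[i]; t->a[i] = t->a[j]; t->a[j] = tmp;
            i++; j--;
        }
    }

    SortTask left  = { t->a, (size_t) (j + 1), t->depth - 1 };
    SortTask right = { t->a + i, t->n - (size_t) i, t->depth - 1 };
    pthread_t th;
    if (pthread_create(&th, NULL, psort, &left) == 0) {
        psort(&right);               /* this thread takes the right half */
        pthread_join(th, NULL);
    } else {
        psort(&left);                /* couldn't spawn: sort serially */
        psort(&right);
    }
    return NULL;
}
```

The interesting part for the discussion above is what the sketch leaves out:
both workers hammer the same array, so memory bandwidth, not CPU, is the
likely ceiling.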

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] things I learned from working on memory allocation

2014-07-07 Thread Peter Geoghegan
On Mon, Jul 7, 2014 at 7:04 PM, Robert Haas robertmh...@gmail.com wrote:
 The testing I did showed about a 5% overhead on REINDEX INDEX
 pgbench_accounts_pkey from one extra tuple copy (cf.
 9f03ca915196dfc871804a1f8aad26207f601fd6).  Of course that could vary
 by circumstance for a variety of reasons.

Be careful: the check for pre-sorted input within qsort() can give an
overly optimistic picture, exaggerating the proportion of time spent
copying if the result is taken as generally representative.
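To illustrate the hazard: a sort with a pre-sorted check degenerates to a
single O(n) scan on ordered input, so a benchmark fed sorted data mostly
measures copying. A minimal sketch of such a check (illustrative, not
PostgreSQL's actual qsort):

```c
#include <stdlib.h>

/* Illustrative sketch (not PostgreSQL's qsort): bail out after one O(n)
 * scan when the input is already ordered.  On sorted input the "sort" is
 * nearly free, so tuple copying dominates any profile taken over it. */
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *) a, y = *(const int *) b;
    return (x > y) - (x < y);
}

static void qsort_with_presorted_check(int *a, size_t n)
{
    size_t i;

    for (i = 1; i < n; i++)
        if (a[i - 1] > a[i])
            break;
    if (i == n)
        return;                 /* already sorted: skip the real sort */
    qsort(a, n, sizeof(int), cmp_int);
}
```

A REINDEX of an already-ordered key would hit the early return, which is why
a 5% copying overhead measured there may overstate the general case.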

 OTOH, I'd be concerned about the cost of main
 memory accesses for pass-by-reference types during parallel tuple
 sorts in particular. The same concern applies to cases where the tuple
 proper must be accessed, although presumably that occurs less
 frequently.

 I do think that's a problem with our sort implementation, but it's not
 clear to me whether it's *more* of a problem for parallel sort than it
 is for single-backend sort.

If you'll forgive me for going on about my patch on this thread, I
think the pgbench -c 4 and -c 1 cases that I tested suggest it is
a particular problem for parallel sorts: there is a much bigger
difference, both absolute and proportional, in transaction throughput
between those two runs with the patch applied. It seems reasonable to
suppose the difference would be larger still if we were considering a
single parallel sort, as opposed to multiple independent sorts (of the
same data) that happen to occur in parallel.

-- 
Peter Geoghegan

