Re: Vacuum: allow usage of more than 1GB of work mem

2018-10-01 Thread Michael Paquier
On Mon, Jul 16, 2018 at 02:33:17PM -0400, Andrew Dunstan wrote:
> Ah, ok. Thanks. Ignore the email I just sent about that.

So...  This thread has basically died of inactivity, even though a
couple of interesting things have been discussed along the way, like
the version from Heikki here:
https://www.postgresql.org/message-id/cd8f7b62-17e1-4307-9f81-427922e5a...@iki.fi

I am marking the patches as returned with feedback for now.
--
Michael




Re: Vacuum: allow usage of more than 1GB of work mem

2018-07-17 Thread Robert Haas
On Fri, Apr 6, 2018 at 4:25 PM, Claudio Freire  wrote:
>> True all that. My point is that the multi-segmented array isn't all that
>> simple and proven, compared to an also straightforward B-tree. It's pretty
>> similar to a B-tree, actually, except that it has exactly two levels, and
>> the node (= segment) sizes grow exponentially. I'd rather go with a true
>> B-tree than something homegrown that resembles a B-tree but isn't quite
>> one.
>
> I disagree.

Yeah, me too.  I think a segmented array is a lot simpler than a
home-grown btree.  I wrote a home-grown btree that ended up becoming
src/backend/utils/mmgr/freepage.c and it took me a long time to get
rid of all the bugs.  Heikki is almost certainly better at coding up a
bug-free btree than I am, but a segmented array is a dead simple data
structure, or should be if done properly, and a btree is not.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Vacuum: allow usage of more than 1GB of work mem

2018-07-16 Thread Andrew Dunstan




On 07/16/2018 11:35 AM, Claudio Freire wrote:

On Mon, Jul 16, 2018 at 11:34 AM Claudio Freire  wrote:

On Fri, Jul 13, 2018 at 5:43 PM Andrew Dunstan
 wrote:



On 07/13/2018 09:44 AM, Heikki Linnakangas wrote:

On 13/07/18 01:39, Andrew Dunstan wrote:

On 07/12/2018 06:34 PM, Alvaro Herrera wrote:

On 2018-Jul-12, Andrew Dunstan wrote:


I fully understand. I think this needs to go back to "Waiting on
Author".

Why?  Heikki's patch applies fine and passes the regression tests.

Well, I understood Claudio was going to do some more work (see
upthread).

Claudio raised a good point, that doing small pallocs leads to
fragmentation, and in particular, it might mean that we can't give
back the memory to the OS. The default glibc malloc() implementation
has a threshold of 4 or 32 MB or something like that - allocations
larger than the threshold are mmap()'d, and can always be returned to
the OS. I think a simple solution to that is to allocate larger
chunks, something like 32-64 MB at a time, and carve out the
allocations for the nodes from those chunks. That's pretty
straightforward, because we don't need to worry about freeing the
nodes in retail. Keep track of the current half-filled chunk, and
allocate a new one when it fills up.


Google seems to suggest the default threshold is much lower, like 128K.
Still, making larger allocations seems sensible. Are you going to work
on that?

Below a few MB the threshold is dynamic, and if a block bigger than
128K but smaller than the higher threshold (32-64MB IIRC) is freed,
the dynamic threshold is set to the size of the freed block.

See M_MMAP_MAX and M_MMAP_THRESHOLD in the man page for mallopt[1]

So I'd suggest allocating blocks bigger than M_MMAP_MAX.

[1] http://man7.org/linux/man-pages/man3/mallopt.3.html

Sorry, substitute M_MMAP_MAX with DEFAULT_MMAP_THRESHOLD_MAX, the
former is something else.



Ah, ok. Thanks. Ignore the email I just sent about that.

cheers

andrew

--
Andrew Dunstan  https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Vacuum: allow usage of more than 1GB of work mem

2018-07-16 Thread Claudio Freire
On Mon, Jul 16, 2018 at 3:30 PM Andrew Dunstan
 wrote:
>
>
>
> On 07/16/2018 10:34 AM, Claudio Freire wrote:
> > On Fri, Jul 13, 2018 at 5:43 PM Andrew Dunstan
> >  wrote:
> >>
> >>
> >> On 07/13/2018 09:44 AM, Heikki Linnakangas wrote:
> >>> On 13/07/18 01:39, Andrew Dunstan wrote:
>  On 07/12/2018 06:34 PM, Alvaro Herrera wrote:
> > On 2018-Jul-12, Andrew Dunstan wrote:
> >
> >> I fully understand. I think this needs to go back to "Waiting on
> >> Author".
> > Why?  Heikki's patch applies fine and passes the regression tests.
>  Well, I understood Claudio was going to do some more work (see
>  upthread).
> >>> Claudio raised a good point, that doing small pallocs leads to
> >>> fragmentation, and in particular, it might mean that we can't give
> >>> back the memory to the OS. The default glibc malloc() implementation
> >>> has a threshold of 4 or 32 MB or something like that - allocations
> >>> larger than the threshold are mmap()'d, and can always be returned to
> >>> the OS. I think a simple solution to that is to allocate larger
> >>> chunks, something like 32-64 MB at a time, and carve out the
> >>> allocations for the nodes from those chunks. That's pretty
> >>> straightforward, because we don't need to worry about freeing the
> >>> nodes in retail. Keep track of the current half-filled chunk, and
> >>> allocate a new one when it fills up.
> >>
> >> Google seems to suggest the default threshold is much lower, like 128K.
> >> Still, making larger allocations seems sensible. Are you going to work
> >> on that?
> > Below a few MB the threshold is dynamic, and if a block bigger than
> > 128K but smaller than the higher threshold (32-64MB IIRC) is freed,
> > the dynamic threshold is set to the size of the freed block.
> >
> > See M_MMAP_MAX and M_MMAP_THRESHOLD in the man page for mallopt[1]
> >
> > So I'd suggest allocating blocks bigger than M_MMAP_MAX.
> >
> > [1] http://man7.org/linux/man-pages/man3/mallopt.3.html
>
>
> That page says:
>
> M_MMAP_MAX
>This parameter specifies the maximum number of allocation
>requests that may be simultaneously serviced using mmap(2).
>This parameter exists because some systems have a limited
>number of internal tables for use by mmap(2), and using more
>than a few of them may degrade performance.
>
>The default value is 65,536, a value which has no special
>significance and which serves only as a safeguard. Setting
>this parameter to 0 disables the use of mmap(2) for servicing
>large allocation requests.
>
>
> I'm confused about the relevance.

It isn't relevant. See my next message, it should have read
DEFAULT_MMAP_THRESHOLD_MAX.



Re: Vacuum: allow usage of more than 1GB of work mem

2018-07-16 Thread Andrew Dunstan




On 07/16/2018 10:34 AM, Claudio Freire wrote:

On Fri, Jul 13, 2018 at 5:43 PM Andrew Dunstan
 wrote:



On 07/13/2018 09:44 AM, Heikki Linnakangas wrote:

On 13/07/18 01:39, Andrew Dunstan wrote:

On 07/12/2018 06:34 PM, Alvaro Herrera wrote:

On 2018-Jul-12, Andrew Dunstan wrote:


I fully understand. I think this needs to go back to "Waiting on
Author".

Why?  Heikki's patch applies fine and passes the regression tests.

Well, I understood Claudio was going to do some more work (see
upthread).

Claudio raised a good point, that doing small pallocs leads to
fragmentation, and in particular, it might mean that we can't give
back the memory to the OS. The default glibc malloc() implementation
has a threshold of 4 or 32 MB or something like that - allocations
larger than the threshold are mmap()'d, and can always be returned to
the OS. I think a simple solution to that is to allocate larger
chunks, something like 32-64 MB at a time, and carve out the
allocations for the nodes from those chunks. That's pretty
straightforward, because we don't need to worry about freeing the
nodes in retail. Keep track of the current half-filled chunk, and
allocate a new one when it fills up.


Google seems to suggest the default threshold is much lower, like 128K.
Still, making larger allocations seems sensible. Are you going to work
on that?

Below a few MB the threshold is dynamic, and if a block bigger than
128K but smaller than the higher threshold (32-64MB IIRC) is freed,
the dynamic threshold is set to the size of the freed block.

See M_MMAP_MAX and M_MMAP_THRESHOLD in the man page for mallopt[1]

So I'd suggest allocating blocks bigger than M_MMAP_MAX.

[1] http://man7.org/linux/man-pages/man3/mallopt.3.html



That page says:

   M_MMAP_MAX
  This parameter specifies the maximum number of allocation
  requests that may be simultaneously serviced using mmap(2).
  This parameter exists because some systems have a limited
  number of internal tables for use by mmap(2), and using more
  than a few of them may degrade performance.

  The default value is 65,536, a value which has no special
  significance and which serves only as a safeguard. Setting
  this parameter to 0 disables the use of mmap(2) for servicing
  large allocation requests.


I'm confused about the relevance.

cheers

andrew

--
Andrew Dunstan  https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Vacuum: allow usage of more than 1GB of work mem

2018-07-16 Thread Claudio Freire
On Mon, Jul 16, 2018 at 11:34 AM Claudio Freire  wrote:
>
> On Fri, Jul 13, 2018 at 5:43 PM Andrew Dunstan
>  wrote:
> >
> >
> >
> > On 07/13/2018 09:44 AM, Heikki Linnakangas wrote:
> > > On 13/07/18 01:39, Andrew Dunstan wrote:
> > >> On 07/12/2018 06:34 PM, Alvaro Herrera wrote:
> > >>> On 2018-Jul-12, Andrew Dunstan wrote:
> > >>>
> >  I fully understand. I think this needs to go back to "Waiting on
> >  Author".
> > >>> Why?  Heikki's patch applies fine and passes the regression tests.
> > >>
> > >> Well, I understood Claudio was going to do some more work (see
> > >> upthread).
> > >
> > > Claudio raised a good point, that doing small pallocs leads to
> > > fragmentation, and in particular, it might mean that we can't give
> > > back the memory to the OS. The default glibc malloc() implementation
> > > has a threshold of 4 or 32 MB or something like that - allocations
> > > larger than the threshold are mmap()'d, and can always be returned to
> > > the OS. I think a simple solution to that is to allocate larger
> > > chunks, something like 32-64 MB at a time, and carve out the
> > > allocations for the nodes from those chunks. That's pretty
> > > straightforward, because we don't need to worry about freeing the
> > > nodes in retail. Keep track of the current half-filled chunk, and
> > > allocate a new one when it fills up.
> >
> >
> > Google seems to suggest the default threshold is much lower, like 128K.
> > Still, making larger allocations seems sensible. Are you going to work
> > on that?
>
> Below a few MB the threshold is dynamic, and if a block bigger than
> 128K but smaller than the higher threshold (32-64MB IIRC) is freed,
> the dynamic threshold is set to the size of the freed block.
>
> See M_MMAP_MAX and M_MMAP_THRESHOLD in the man page for mallopt[1]
>
> So I'd suggest allocating blocks bigger than M_MMAP_MAX.
>
> [1] http://man7.org/linux/man-pages/man3/mallopt.3.html

Sorry, substitute M_MMAP_MAX with DEFAULT_MMAP_THRESHOLD_MAX, the
former is something else.



Re: Vacuum: allow usage of more than 1GB of work mem

2018-07-16 Thread Claudio Freire
On Fri, Jul 13, 2018 at 5:43 PM Andrew Dunstan
 wrote:
>
>
>
> On 07/13/2018 09:44 AM, Heikki Linnakangas wrote:
> > On 13/07/18 01:39, Andrew Dunstan wrote:
> >> On 07/12/2018 06:34 PM, Alvaro Herrera wrote:
> >>> On 2018-Jul-12, Andrew Dunstan wrote:
> >>>
>  I fully understand. I think this needs to go back to "Waiting on
>  Author".
> >>> Why?  Heikki's patch applies fine and passes the regression tests.
> >>
> >> Well, I understood Claudio was going to do some more work (see
> >> upthread).
> >
> > Claudio raised a good point, that doing small pallocs leads to
> > fragmentation, and in particular, it might mean that we can't give
> > back the memory to the OS. The default glibc malloc() implementation
> > has a threshold of 4 or 32 MB or something like that - allocations
> > larger than the threshold are mmap()'d, and can always be returned to
> > the OS. I think a simple solution to that is to allocate larger
> > chunks, something like 32-64 MB at a time, and carve out the
> > allocations for the nodes from those chunks. That's pretty
> > straightforward, because we don't need to worry about freeing the
> > nodes in retail. Keep track of the current half-filled chunk, and
> > allocate a new one when it fills up.
>
>
> Google seems to suggest the default threshold is much lower, like 128K.
> Still, making larger allocations seems sensible. Are you going to work
> on that?

Below a few MB the threshold is dynamic, and if a block bigger than
128K but smaller than the higher threshold (32-64MB IIRC) is freed,
the dynamic threshold is set to the size of the freed block.

See M_MMAP_MAX and M_MMAP_THRESHOLD in the man page for mallopt[1]

So I'd suggest allocating blocks bigger than M_MMAP_MAX.

[1] http://man7.org/linux/man-pages/man3/mallopt.3.html
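
For illustration, a standalone C example of that behavior; the 32 MiB
figure is glibc's internal DEFAULT_MMAP_THRESHOLD_MAX on 64-bit
systems, hard-coded here because glibc doesn't export it:

#include <malloc.h>
#include <stdlib.h>

int
main(void)
{
	/*
	 * Setting M_MMAP_THRESHOLD explicitly disables glibc's dynamic
	 * threshold adjustment (see mallopt(3)).  Allocations at or above
	 * the threshold are served by mmap(2) and returned to the OS on
	 * free().
	 */
	mallopt(M_MMAP_THRESHOLD, 32 * 1024 * 1024);

	void	   *big = malloc((size_t) 64 * 1024 * 1024);	/* mmap()'d */

	free(big);					/* returned to the OS immediately */
	return 0;
}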



Re: Vacuum: allow usage of more than 1GB of work mem

2018-07-13 Thread Andrew Dunstan




On 07/13/2018 09:44 AM, Heikki Linnakangas wrote:

On 13/07/18 01:39, Andrew Dunstan wrote:

On 07/12/2018 06:34 PM, Alvaro Herrera wrote:

On 2018-Jul-12, Andrew Dunstan wrote:

I fully understand. I think this needs to go back to "Waiting on 
Author".

Why?  Heikki's patch applies fine and passes the regression tests.


Well, I understood Claudio was going to do some more work (see
upthread).


Claudio raised a good point, that doing small pallocs leads to 
fragmentation, and in particular, it might mean that we can't give 
back the memory to the OS. The default glibc malloc() implementation 
has a threshold of 4 or 32 MB or something like that - allocations 
larger than the threshold are mmap()'d, and can always be returned to 
the OS. I think a simple solution to that is to allocate larger 
chunks, something like 32-64 MB at a time, and carve out the 
allocations for the nodes from those chunks. That's pretty 
straightforward, because we don't need to worry about freeing the 
nodes in retail. Keep track of the current half-filled chunk, and 
allocate a new one when it fills up.



Google seems to suggest the default threshold is much lower, like 128K. 
Still, making larger allocations seems sensible. Are you going to work 
on that?





He also wanted to refactor the iterator API, to return one ItemPointer 
at a time. I don't think that's necessary, the current iterator API is 
more convenient for the callers, but I don't feel strongly about that.


Anything else?


If we're going to go with Heikki's patch then do we need to
change the author, or add him as an author?


Let's list both of us. At least in the commit message, doesn't matter 
much what the commitfest app says.





I added you as an author in the CF app.

cheers

andrew

--
Andrew Dunstan  https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Vacuum: allow usage of more than 1GB of work mem

2018-07-13 Thread Heikki Linnakangas

On 13/07/18 01:39, Andrew Dunstan wrote:

On 07/12/2018 06:34 PM, Alvaro Herrera wrote:

On 2018-Jul-12, Andrew Dunstan wrote:


I fully understand. I think this needs to go back to "Waiting on Author".

Why?  Heikki's patch applies fine and passes the regression tests.


Well, I understood Claudio was going to do some more work (see
upthread).


Claudio raised a good point, that doing small pallocs leads to 
fragmentation, and in particular, it might mean that we can't give back 
the memory to the OS. The default glibc malloc() implementation has a 
threshold of 4 or 32 MB or something like that - allocations larger than 
the threshold are mmap()'d, and can always be returned to the OS. I 
think a simple solution to that is to allocate larger chunks, something 
like 32-64 MB at a time, and carve out the allocations for the nodes 
from those chunks. That's pretty straightforward, because we don't need 
to worry about freeing the nodes in retail. Keep track of the current 
half-filled chunk, and allocate a new one when it fills up.
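
For illustration, a minimal sketch of that chunk scheme (the names are
hypothetical and it assumes PostgreSQL's palloc; nodes are carved out
sequentially and never freed individually):

#define CHUNK_SIZE	(64 * 1024 * 1024)	/* 64 MB per chunk */

typedef struct NodeChunk
{
	struct NodeChunk *next;		/* previously filled chunks */
	Size		used;			/* bytes handed out so far */
	char		data[FLEXIBLE_ARRAY_MEMBER];
} NodeChunk;

static NodeChunk *current_chunk = NULL;

/* Carve one node out of the current chunk, starting a new chunk on demand. */
static void *
chunk_alloc(Size size)
{
	void	   *ptr;

	size = MAXALIGN(size);
	if (current_chunk == NULL || current_chunk->used + size > CHUNK_SIZE)
	{
		NodeChunk  *chunk = palloc(offsetof(NodeChunk, data) + CHUNK_SIZE);

		chunk->next = current_chunk;
		chunk->used = 0;
		current_chunk = chunk;
	}
	ptr = current_chunk->data + current_chunk->used;
	current_chunk->used += size;
	return ptr;
}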


He also wanted to refactor the iterator API, to return one ItemPointer 
at a time. I don't think that's necessary, the current iterator API is 
more convenient for the callers, but I don't feel strongly about that.


Anything else?


If we're going to go with Heikki's patch then do we need to
change the author, or add him as an author?


Let's list both of us. At least in the commit message, doesn't matter 
much what the commitfest app says.


- Heikki



Re: Vacuum: allow usage of more than 1GB of work mem

2018-07-12 Thread Andrew Dunstan




On 07/12/2018 06:34 PM, Alvaro Herrera wrote:

On 2018-Jul-12, Andrew Dunstan wrote:


I fully understand. I think this needs to go back to "Waiting on Author".

Why?  Heikki's patch applies fine and passes the regression tests.




Well, I understood Claudio was going to do some more work (see 
upthread). If we're going to go with Heikki's patch then do we need to 
change the author, or add him as an author?


cheers

andrew

--
Andrew Dunstan  https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Vacuum: allow usage of more than 1GB of work mem

2018-07-12 Thread Alvaro Herrera
On 2018-Jul-12, Andrew Dunstan wrote:

> I fully understand. I think this needs to go back to "Waiting on Author".

Why?  Heikki's patch applies fine and passes the regression tests.

-- 
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Vacuum: allow usage of more than 1GB of work mem

2018-07-12 Thread Andrew Dunstan




On 07/12/2018 12:38 PM, Claudio Freire wrote:

On Thu, Jul 12, 2018 at 10:44 AM Andrew Dunstan
 wrote:



On 04/06/2018 08:00 PM, Claudio Freire wrote:

On Fri, Apr 6, 2018 at 5:25 PM, Claudio Freire  wrote:

On Fri, Apr 6, 2018 at 10:39 AM, Heikki Linnakangas  wrote:

On 06/04/18 01:59, Claudio Freire wrote:

The iteration interface, however, seems quite specific for the use
case of vacuumlazy, so it's not really a good abstraction.

Can you elaborate? It does return the items one block at a time. Is that
what you mean by being specific for vacuumlazy? I guess that's a bit
special, but if you imagine some other users for this abstraction, it's
probably not that unusual. For example, if we started using it in bitmap
heap scans, a bitmap heap scan would also want to get the TIDs one block
number at a time.

But you're also tying the caller to the format of the buffer holding
those TIDs, for instance. Why would you, when you can have an
interface that just iterates TIDs and let the caller store them
if/however they want?

I do believe a pure iterator interface is a better interface.

Between the b-tree or not discussion and the refactoring to separate
the code, I don't think we'll get this in the next 24 hours.

So I guess we'll have ample time to ponder on both issues during the
next commit fest.




There doesn't seem to have been much pondering done since then, at least
publicly. Can we make some progress on this? It's been around for a long
time now.

Yeah, life has kept me busy and I haven't had much time to make
progress here, but I was planning on doing the refactoring as we were
discussing soon. Can't give a time frame for that, but "soonish".


I fully understand. I think this needs to go back to "Waiting on Author".

cheers

andrew

--
Andrew Dunstan  https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Vacuum: allow usage of more than 1GB of work mem

2018-07-12 Thread Claudio Freire
On Thu, Jul 12, 2018 at 10:44 AM Andrew Dunstan
 wrote:
>
>
>
> On 04/06/2018 08:00 PM, Claudio Freire wrote:
> > On Fri, Apr 6, 2018 at 5:25 PM, Claudio Freire  
> > wrote:
> >> On Fri, Apr 6, 2018 at 10:39 AM, Heikki Linnakangas  
> >> wrote:
> >>> On 06/04/18 01:59, Claudio Freire wrote:
>  The iteration interface, however, seems quite specific for the use
>  case of vacuumlazy, so it's not really a good abstraction.
> >>>
> >>> Can you elaborate? It does return the items one block at a time. Is that
> >>> what you mean by being specific for vacuumlazy? I guess that's a bit
> >>> special, but if you imagine some other users for this abstraction, it's
> >>> probably not that unusual. For example, if we started using it in bitmap
> >>> heap scans, a bitmap heap scan would also want to get the TIDs one block
> >>> number at a time.
> >> But you're also tying the caller to the format of the buffer holding
> >> those TIDs, for instance. Why would you, when you can have an
> >> interface that just iterates TIDs and let the caller store them
> >> if/however they want?
> >>
> >> I do believe a pure iterator interface is a better interface.
> > Between the b-tree or not discussion and the refactoring to separate
> > the code, I don't think we'll get this in the next 24 hours.
> >
> > So I guess we'll have ample time to ponder on both issues during the
> > next commit fest.
> >
>
>
>
> There doesn't seem to have been much pondering done since then, at least
> publicly. Can we make some progress on this? It's been around for a long
> time now.

Yeah, life has kept me busy and I haven't had much time to make
progress here, but I was planning on doing the refactoring as we were
discussing soon. Can't give a time frame for that, but "soonish".



Re: Vacuum: allow usage of more than 1GB of work mem

2018-07-12 Thread Andrew Dunstan




On 04/06/2018 08:00 PM, Claudio Freire wrote:

On Fri, Apr 6, 2018 at 5:25 PM, Claudio Freire  wrote:

On Fri, Apr 6, 2018 at 10:39 AM, Heikki Linnakangas  wrote:

On 06/04/18 01:59, Claudio Freire wrote:

The iteration interface, however, seems quite specific for the use
case of vacuumlazy, so it's not really a good abstraction.


Can you elaborate? It does return the items one block at a time. Is that
what you mean by being specific for vacuumlazy? I guess that's a bit
special, but if you imagine some other users for this abstraction, it's
probably not that unusual. For example, if we started using it in bitmap
heap scans, a bitmap heap scan would also want to get the TIDs one block
number at a time.

But you're also tying the caller to the format of the buffer holding
those TIDs, for instance. Why would you, when you can have an
interface that just iterates TIDs and let the caller store them
if/however they want?

I do believe a pure iterator interface is a better interface.

Between the b-tree or not discussion and the refactoring to separate
> > the code, I don't think we'll get this in the next 24 hours.

> > So I guess we'll have ample time to ponder on both issues during the
next commit fest.





There doesn't seem to have been much pondering done since then, at least 
publicly. Can we make some progress on this? It's been around for a long 
time now.


cheers

andrew

--
Andrew Dunstan  https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Vacuum: allow usage of more than 1GB of work mem

2018-04-06 Thread Claudio Freire
On Fri, Apr 6, 2018 at 5:25 PM, Claudio Freire  wrote:
> On Fri, Apr 6, 2018 at 10:39 AM, Heikki Linnakangas  wrote:
>> On 06/04/18 01:59, Claudio Freire wrote:
>>>
>>> The iteration interface, however, seems quite specific for the use
>>> case of vacuumlazy, so it's not really a good abstraction.
>>
>>
>> Can you elaborate? It does return the items one block at a time. Is that
>> what you mean by being specific for vacuumlazy? I guess that's a bit
>> special, but if you imagine some other users for this abstraction, it's
>> probably not that unusual. For example, if we started using it in bitmap
>> heap scans, a bitmap heap scan would also want to get the TIDs one block
>> number at a time.
>
> But you're also tying the caller to the format of the buffer holding
> those TIDs, for instance. Why would you, when you can have an
> interface that just iterates TIDs and let the caller store them
> if/however they want?
>
> I do believe a pure iterator interface is a better interface.

Between the b-tree or not discussion and the refactoring to separate
the code, I don't think we'll get this in the next 24 hours.

So I guess we'll have ample time to ponder on both issues during the
next commit fest.



Re: Vacuum: allow usage of more than 1GB of work mem

2018-04-06 Thread Alvaro Herrera
Claudio Freire wrote:

> On Fri, Apr 6, 2018 at 11:00 AM, Alvaro Herrera  
> wrote:

> > FWIW I liked the idea of having this abstraction possibly do other
> > things -- for instance to vacuum brin indexes you'd like to mark index
> > tuples as "containing tuples that were removed" and eventually
> > re-summarize the range.  With the current interface we cannot do that,
> > because vacuum expects brin vacuuming to ask for each heap tuple "is
> > this tid dead?" and of course we don't have a list of tids to ask for.
> > So if we can ask instead "how many dead tuples does this block contain?"
> > brin vacuuming will be much happier.
> 
> I don't think either patch gives you that.
> 
> The bulkdelete interface is part of the indexam and unlikely to change
> in this patch.

I'm sure you're correct.  I was just saying that with the abstract
interface it is easier to implement what I suggest as a follow-on patch.

-- 
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Vacuum: allow usage of more than 1GB of work mem

2018-04-06 Thread Alvaro Herrera
Heikki Linnakangas wrote:
> On 06/04/18 01:59, Claudio Freire wrote:
> > The iteration interface, however, seems quite specific for the use
> > case of vacuumlazy, so it's not really a good abstraction.
> 
> Can you elaborate? It does return the items one block at a time. Is that
> what you mean by being specific for vacuumlazy? I guess that's a bit
> special, but if you imagine some other users for this abstraction, it's
> probably not that unusual. For example, if we started using it in bitmap
> heap scans, a bitmap heap scan would also want to get the TIDs one block
> number at a time.

FWIW I liked the idea of having this abstraction possibly do other
things -- for instance to vacuum brin indexes you'd like to mark index
tuples as "containing tuples that were removed" and eventually
re-summarize the range.  With the current interface we cannot do that,
because vacuum expects brin vacuuming to ask for each heap tuple "is
this tid dead?" and of course we don't have a list of tids to ask for.
So if we can ask instead "how many dead tuples does this block contain?"
brin vacuuming will be much happier.
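
A hypothetical signature for that per-block question, purely to make
the idea concrete (no such function exists in either posted patch):

/* How many dead tuples does the tid map hold for this heap block? */
extern int tidmap_count_block(TidMap *map, BlockNumber blkno);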

> Looking at the changes to the regression test in this, I don't quite
> understand what it's all about. What are the "wait_barriers" for? If I
> understand correctly, they're added so that the VACUUMs can remove the
> tuples that are deleted in the test. But why are they needed now? Was that
> an orthogonal change we should've done anyway?
> 
> Rather than add those wait_barriers, should we stop running the 'vacuum'
> test in parallel with the other tests? Or maybe it's a good thing to run it
> in parallel, to test some other things?

20180207235226.zygu4r3yv3yfcnmc@alvherre.pgsql

-- 
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Vacuum: allow usage of more than 1GB of work mem

2018-04-06 Thread Heikki Linnakangas

On 06/04/18 16:39, Heikki Linnakangas wrote:
Sure. I used the attached script to test this. 


Sorry, I attached the wrong script. Here is the correct one that I used. 
Here are also the results I got from running it.


- Heikki


vacuumbenchone.sh
Description: application/shellscript


vacuumbenchone-results-on-heikkis-laptop.tar.gz
Description: application/gzip


Re: Vacuum: allow usage of more than 1GB of work mem

2018-04-06 Thread Heikki Linnakangas

On 06/04/18 01:59, Claudio Freire wrote:

The iteration interface, however, seems quite specific for the use
case of vacuumlazy, so it's not really a good abstraction.


Can you elaborate? It does return the items one block at a time. Is that 
what you mean by being specific for vacuumlazy? I guess that's a bit 
special, but if you imagine some other users for this abstraction, it's 
probably not that unusual. For example, if we started using it in bitmap 
heap scans, a bitmap heap scan would also want to get the TIDs one block 
number at a time.



It also copies stuff a lot, so it's quite heavyweight. I'd suggest
trying to go for a lighter weight interface with less overhead that
is more general at the same time.


Note that there was similar copying, to construct an array of 
OffsetNumbers, happening in lazy_vacuum_page() before this patch. So the 
net amount of copying is the same.


I'm envisioning that this data structure will sooner or later be 
optimized further, so that when you have a lot of TIDs pointing to the 
same block, we would pack them more tightly, storing the block number 
just once, with an array of offset numbers. This interface that returns 
an array of offset numbers matches that future well, as the iterator 
could just return a pointer to the array of offset numbers, with no 
copying. (If we end up doing something even more dense, like a bitmap, 
then it doesn't help, but that's ok too.)
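
A sketch of that denser per-block representation (hypothetical, not
part of either patch):

typedef struct BlockTids
{
	BlockNumber blkno;			/* heap block number, stored once */
	uint16		noffsets;		/* number of dead items on the block */
	OffsetNumber offsets[FLEXIBLE_ARRAY_MEMBER];	/* sorted offsets */
} BlockTids;

An iterator over such entries can hand the caller the whole offsets
array for one block without any copying, which is what the current
block-at-a-time API allows.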



About the B-tree, however, I don't think a B-tree is a good idea.

Trees' main benefit is that they support efficient insertion. When
all your data is loaded sequentially, in order, in memory, and is
immutable, a tree is pointless: more costly to build, and harder to
maintain in terms of code complexity.

In this use case, the only benefit of B-trees would be that they're
optimized for disk access.


Those are not the reasons for which I'd prefer a B-tree. A B-tree has 
good cache locality, and when you don't need to worry about random 
insertions, page splits, deletions etc., it's also very simple to 
implement. This patch is not much longer than the segmented multi-array.



On the other side, using B-trees incurs memory overhead due to the
need for internal nodes, can fragment memory because internal nodes
aren't the same size as leaf nodes, is easier to get wrong and
introduce bugs... I don't see a gain.


The memory overhead incurred by the internal nodes is quite minimal, and 
can be adjusted by changing the node sizes. After some experimentation, 
I settled on 2048 items per leaf node, and 64 items per internal node. 
With those values, the overhead caused by the internal nodes is minimal, 
below 0.5%. That seems fine, but we could increase the node sizes to 
bring it further down, if we'd prefer that tradeoff.
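
(Rough arithmetic behind that figure, assuming 6-byte ItemPointers in
the leaves and roughly 16 bytes per internal-node entry: a full leaf
holds 2048 * 6 = 12288 bytes of TIDs, and the tree needs about one
internal entry per leaf, plus a geometrically shrinking number for the
higher levels, so the internal overhead is on the order of 16 / 12288,
i.e. about 0.13%.)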


I don't understand what memory fragmentation problems you're worried 
about. The tree grows one node at a time, as new TIDs are added, until 
it's all released at the end. I don't see how the size of internal vs 
leaf nodes matters.



If you propose its use, at least benchmark it to show some gain.


Sure. I used the attached script to test this. It's inspired by the test 
script you posted. It creates a pgbench database with scale factor 100, 
deletes 80% of the rows, and runs vacuum. To stress lazy_tid_reaped() 
more heavily, the test script creates a number of extra indexes. Half of 
them are on the primary key, just to get more repetitions without having 
to re-initialize in between, and the rest are like this:


create index random_1 on pgbench_accounts((hashint4(aid)))

to stress lazy_tid_reaped() with a random access pattern, rather
than the sequential one that you get with the primary key index.


I ran the test script on my laptop, with unpatched master, with your 
latest multi-array patch, and with the attached version of the b-tree 
patch. The results are quite noisy, unfortunately, so I wouldn't draw 
very strong conclusions from it, but it seems that the performance of 
all three versions is roughly the same. I looked in particular at the 
CPU time spent in the index vacuums, as reported by VACUUM VERBOSE.



Furthermore, among the 200-ish messages this thread has accumulated,
better ideas have been proposed, better because they do use less
memory and are faster (like using bitmaps when possible), but if we
can't push a simple refactoring first, there's no chance a bigger
rewrite will fare better. Remember, in this use case, using less
memory far outweighs any other consideration. Less memory directly
translates to less iterations over the indexes, because more can be
crammed into m_w_m, which is a huge time saving. Far more than any
micro-optimization.

About 2 years ago, I chose to try to push this simple algorithm first,
then try to improve on it with better data structures. Nobody
complained at the time (I think, IIRC), and I don't think it's fair to
go and revisit that now. It just delays getting a solution for this
issue in pursuit of "the perfect implementation" that might never
arrive. Or even if it does, there's nothing stopping us from pushing
another patch in the future with that better implementation if we
wish. Let's get something simple and proven first.

Re: Vacuum: allow usage of more than 1GB of work mem

2018-04-05 Thread Claudio Freire
On Thu, Apr 5, 2018 at 5:02 PM, Heikki Linnakangas  wrote:
> On 03/04/18 17:20, Claudio Freire wrote:
>>
>> Ok, rebased patches attached
>
>
> Thanks! I took a look at this.
>
> First, now that the data structure is more complicated, I think it's time to
> abstract it, and move it out of vacuumlazy.c. The Tid Map needs to support
> the following operations:
>
> * Add TIDs, in order (in 1st phase of vacuum)
> * Random lookup, by TID (when scanning indexes)
> * Iterate through all TIDs, in order (2nd pass over heap)
>
> Let's add a new source file to hold the code for the tid map data structure,
> with functions corresponding to those operations.
>
> I took a stab at doing that, and I think it makes vacuumlazy.c nicer.

About the refactoring to split this into their own set of files and
abstract away the underlying structure, I can totally get behind that.

The iteration interface, however, seems quite specific for the use
case of vacuumlazy, so it's not really a good abstraction. It also
copies stuff a lot, so it's quite heavyweight. I'd suggest trying to
go for a lighter weight interface with less overhead that is more
general at the same time.

If it were C++, I'd say build an iterator class.

In C you'd probably do it with macros: one to get the current
element, another to advance to the next element, and another to check
whether you've reached the end.
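
A sketch of what those macros might look like, reusing the
DeadTuplesSegment fields from the multiarray patch (the iterator
struct and tiditer_next_segment() are hypothetical):

typedef struct TidIter
{
	DeadTuplesSegment *seg;		/* current segment, NULL when done */
	int			pos;			/* index into seg->dt_tids */
} TidIter;

#define TIDITER_CURRENT(it)	(&(it)->seg->dt_tids[(it)->pos])
#define TIDITER_DONE(it)	((it)->seg == NULL)
#define TIDITER_ADVANCE(it) \
	do { \
		if (++(it)->pos >= (it)->seg->num_dead_tuples) \
		{ \
			(it)->seg = tiditer_next_segment(it);	/* NULL at the end */ \
			(it)->pos = 0; \
		} \
	} while (0)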

I can do that if we agree on the points below:

> Secondly, I'm not a big fan of the chosen data structure. I think the only
> reason that the segmented "multi-array" was chosen is that each "segment"
> works similarly to the simple array that we used to have. After putting it
> behind the abstraction, it seems rather ad hoc. There are many standard
> textbook data structures that we could use instead, and would be easier to
> understand and reason about, I think.
>
> So, I came up with the attached patch. I used a B-tree as the data
> structure. Not sure it's the best one, I'm all ears for suggestions and
> bikeshedding on alternatives, but I'm pretty happy with that. I would expect
> it to be pretty close to the simple array with binary search in performance
> characteristics. It'd be pretty straightforward to optimize further, and
> e.g. use a bitmap of OffsetNumbers or some other more dense data structure
> in the B-tree leaf pages, but I resisted doing that as part of this patch.

About the B-tree, however, I don't think a B-tree is a good idea.

Trees' main benefit is that they support efficient insertion. When
all your data is loaded sequentially, in order, in memory, and is
immutable, a tree is pointless: more costly to build, and harder to
maintain in terms of code complexity.

In this use case, the only benefit of B-trees would be that they're
optimized for disk access. If we planned to store this on-disk,
perhaps I'd grant you that. But we don't plan to do that, and it's not
even clear doing it would be efficient enough for the intended use.

On the other side, using B-trees incurs memory overhead due to the
need for internal nodes, can fragment memory because internal nodes
aren't the same size as leaf nodes, is easier to get wrong and
introduce bugs... I don't see a gain. If you propose its use, at least
benchmark it to show some gain.

So I don't think B-tree is a good idea, the sorted array already is
good enough, and if not, it's at least close to the earlier
implementation and less likely to introduce bugs.

Furthermore, among the 200-ish messages this thread has accumulated,
better ideas have been proposed, better because they do use less
memory and are faster (like using bitmaps when possible), but if we
can't push a simple refactoring first, there's no chance a bigger
rewrite will fare better. Remember, in this use case, using less
memory far outweighs any other consideration. Less memory directly
translates to less iterations over the indexes, because more can be
crammed into m_w_m, which is a huge time saving. Far more than any
micro-optimization.

About 2 years ago, I chose to try to push this simple algorithm first,
then try to improve on it with better data structures. Nobody
complained at the time (I think, IIRC), and I don't think it's fair to
go and revisit that now. It just delays getting a solution for this
issue in pursuit of "the perfect implementation" that might never
arrive. Or even if it does, there's nothing stopping us from pushing
another patch in the future with that better implementation if we
wish. Let's get something simple and proven first.

> I haven't done any performance testing of this (and not much testing in
> general), but at least the abstraction seems better this way. Performance
> testing would be good, too. In particular, I'd like to know how this might
> affect the performance of lazy_tid_reaped(). That's a hot spot when
> vacuuming indexes, so we don't want to add any cycles there. Were there any
> ready-made test kits on that in this thread? I didn't see any at a quick
> glance, but it's a long thread..

Re: Vacuum: allow usage of more than 1GB of work mem

2018-04-05 Thread Heikki Linnakangas

On 03/04/18 17:20, Claudio Freire wrote:

Ok, rebased patches attached


Thanks! I took a look at this.

First, now that the data structure is more complicated, I think it's 
time to abstract it, and move it out of vacuumlazy.c. The Tid Map needs 
to support the following operations:


* Add TIDs, in order (in 1st phase of vacuum)
* Random lookup, by TID (when scanning indexes)
* Iterate through all TIDs, in order (2nd pass over heap)

Let's add a new source file to hold the code for the tid map data 
structure, with functions corresponding to those operations.
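
For concreteness, such an interface might look like the following
sketch; all names and signatures here are hypothetical, not taken from
any posted patch:

typedef struct TidMap TidMap;	/* opaque */
typedef struct TidMapIterator TidMapIterator;

extern TidMap *tidmap_create(Size max_bytes);
extern void tidmap_free(TidMap *map);

/* 1st phase of vacuum: TIDs must be added in increasing TID order */
extern void tidmap_add_tids(TidMap *map, BlockNumber blkno,
				OffsetNumber *offsets, int num_offsets);

/* index scan phase: random lookup by TID */
extern bool tidmap_lookup(TidMap *map, ItemPointer tid);

/* 2nd pass over the heap: iterate through all TIDs, in order */
extern TidMapIterator *tidmap_begin_iterate(TidMap *map);
extern bool tidmap_iterate_next(TidMapIterator *iter, ItemPointer tid);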


I took a stab at doing that, and I think it makes vacuumlazy.c nicer.

Secondly, I'm not a big fan of the chosen data structure. I think the 
only reason that the segmented "multi-array" was chosen is that each 
"segment" works is similar to the simple array that we used to have. 
After putting it behind the abstraction, it seems rather ad hoc. There 
are many standard textbook data structures that we could use instead, 
and would be easier to understand and reason about, I think.


So, I came up with the attached patch. I used a B-tree as the data 
structure. Not sure it's the best one, I'm all ears for suggestions and 
bikeshedding on alternatives, but I'm pretty happy with that. I would 
expect it to be pretty close to the simple array with binary search in 
performance characteristics. It'd be pretty straightforward to optimize 
further, and e.g. use a bitmap of OffsetNumbers or some other more dense 
data structure in the B-tree leaf pages, but I resisted doing that as 
part of this patch.


I haven't done any performance testing of this (and not much testing in 
general), but at least the abstraction seems better this way. 
Performance testing would be good, too. In particular, I'd like to know 
how this might affect the performance of lazy_tid_reaped(). That's a hot 
spot when vacuuming indexes, so we don't want to add any cycles there. 
Were there any ready-made test kits on that in this thread? I didn't see 
any at a quick glance, but it's a long thread..


- Heikki
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 4a6c99e090..634059b8ff 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -20,6 +20,6 @@ OBJS = amcmds.o aggregatecmds.o alter.o analyze.o async.o cluster.o comment.o \
 	policy.o portalcmds.o prepare.o proclang.o publicationcmds.o \
 	schemacmds.o seclabel.o sequence.o statscmds.o subscriptioncmds.o \
 	tablecmds.o tablespace.o trigger.o tsearchcmds.o typecmds.o user.o \
-	vacuum.o vacuumlazy.o variable.o view.o
+	vacuum.o vacuumlazy.o vacuumtidmap.o variable.o view.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index d2a006671a..0cc6b98831 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -4,23 +4,21 @@
  *	  Concurrent ("lazy") vacuuming.
  *
  *
- * The major space usage for LAZY VACUUM is storage for the array of dead tuple
- * TIDs.  We want to ensure we can vacuum even the very largest relations with
- * finite memory space usage.  To do that, we set upper bounds on the number of
- * tuples we will keep track of at once.
+ * The major space usage for LAZY VACUUM is storage of dead tuple TIDs.
+ * We want to ensure we can vacuum even the very largest relations with
+ * finite memory space usage.  To do that, we set upper bounds on the amount
+ * of memory used to track the TIDs at any one time.
  *
  * We are willing to use at most maintenance_work_mem (or perhaps
  * autovacuum_work_mem) memory space to keep track of dead tuples.  We
- * initially allocate an array of TIDs of that size, with an upper limit that
- * depends on table size (this limit ensures we don't allocate a huge area
- * uselessly for vacuuming small tables).  If the array threatens to overflow,
- * we suspend the heap scan phase and perform a pass of index cleanup and page
- * compaction, then resume the heap scan with an empty TID array.
+ * use a specialized data structure to hold the TIDs, which keeps track of
+ * the memory used.  If the memory limit is about to be exceeded, we suspend
+ * the heap scan phase and perform a pass of index cleanup and page
+ * compaction, then resume the heap scan with an empty tid map.
  *
  * If we're processing a table with no indexes, we can just vacuum each page
  * as we go; there's no need to save up multiple tuples to minimize the number
- * of index scans performed.  So we don't use maintenance_work_mem memory for
- * the TID array, just enough to hold as many heap tuples as fit on one page.
+ * of index scans performed.
  *
  *
  * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
@@ -94,13 +92,6 @@
 	((BlockNumber) (((uint64) 8 * 1024 * 1024 * 1024) / BLCKSZ))
 
 /*
- * Guesstimation of number of dead tuples per page.  This is used to
- * provide an upper limit to memory allocated when vacuuming small
- * 

Re: Vacuum: allow usage of more than 1GB of work mem

2018-04-03 Thread Claudio Freire
On Tue, Apr 3, 2018 at 11:09 AM, Claudio Freire  wrote:
> On Tue, Apr 3, 2018 at 11:06 AM, Claudio Freire  
> wrote:
>> I didn't receive your comment, I just saw it. Nevertheless, I rebased the 
>> patches a while ago just because I noticed they didn't apply anymore in 
>> cputube, and they still seem to apply.
>
> Sorry, that is false.
>
> They appear green in cputube, so I was confident they did apply, but I
> just double-checked on a recent pull and they don't. I'll rebase them
> shortly.


Ok, rebased patches attached
From 712aff9a856c2bae1d87555057e6546d029ecc47 Mon Sep 17 00:00:00 2001
From: Claudio Freire 
Date: Mon, 12 Sep 2016 23:36:42 -0300
Subject: [PATCH 1/2] Vacuum: allow using more than 1GB work mem

Turn the dead_tuples array into a structure composed of several
exponentially bigger arrays, to enable usage of more than 1GB
of work mem during vacuum and thus reduce the number of full
index scans necessary to remove all dead tids when the memory is
available.

Improve test ability to spot vacuum errors
---
 src/backend/commands/vacuumlazy.c| 402 ---
 src/test/regress/expected/vacuum.out |  72 ++-
 src/test/regress/sql/vacuum.sql  |  40 +++-
 3 files changed, 438 insertions(+), 76 deletions(-)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index d2a0066..2f82dc6 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -11,11 +11,24 @@
  *
  * We are willing to use at most maintenance_work_mem (or perhaps
  * autovacuum_work_mem) memory space to keep track of dead tuples.  We
- * initially allocate an array of TIDs of that size, with an upper limit that
+ * initially allocate an array of TIDs of 128MB, or an upper limit that
  * depends on table size (this limit ensures we don't allocate a huge area
- * uselessly for vacuuming small tables).  If the array threatens to overflow,
+ * uselessly for vacuuming small tables). Additional arrays of increasingly
+ * large sizes are allocated as they become necessary.
+ *
+ * The TID array is thus represented as a list of multiple segments of
+ * varying size, beginning with the initial size of up to 128MB, and growing
+ * exponentially until the whole budget of (autovacuum_)maintenance_work_mem
+ * is used up.
+ *
+ * Lookup in that structure happens with a binary search of segments, and then
+ * with a binary search within each segment. Since segment sizes grow
+ * exponentially, this retains O(log N) lookup complexity.
+ *
+ * If the multiarray's total size threatens to grow beyond maintenance_work_mem,
  * we suspend the heap scan phase and perform a pass of index cleanup and page
- * compaction, then resume the heap scan with an empty TID array.
+ * compaction, then resume the heap scan with an array of logically empty but
+ * already preallocated TID segments to be refilled with more dead tuple TIDs.
  *
  * If we're processing a table with no indexes, we can just vacuum each page
  * as we go; there's no need to save up multiple tuples to minimize the number
@@ -101,6 +114,14 @@
 #define LAZY_ALLOC_TUPLES		MaxHeapTuplesPerPage
 
 /*
+ * Minimum (starting) size of the dead_tuples array segments. Will allocate
+ * space for 128MB worth of tid pointers in the first segment, further segments
+ * will grow in size exponentially. Don't make it too small or the segment list
+ * will grow bigger than the sweetspot for search efficiency on big vacuums.
+ */
+#define LAZY_INIT_TUPLES		Max(MaxHeapTuplesPerPage, (128<<20) / sizeof(ItemPointerData))
+
+/*
  * Before we consider skipping a page that's marked as clean in
  * visibility map, we must've seen at least this many clean pages.
  */
@@ -112,6 +133,27 @@
  */
 #define PREFETCH_SIZE			((BlockNumber) 32)
 
+typedef struct DeadTuplesSegment
+{
+	ItemPointerData last_dead_tuple;	/* Copy of the last dead tuple (unset
+		 * until the segment is fully
+		 * populated). Keep it first to simplify
+		 * binary searches */
+	int			num_dead_tuples;	/* # of entries in the segment */
+	int			max_dead_tuples;	/* # of entries allocated in the segment */
+	ItemPointer dt_tids;		/* Array of dead tuples */
+}	DeadTuplesSegment;
+
+typedef struct DeadTuplesMultiArray
+{
+	int			num_entries;	/* current # of entries */
+	int			max_entries;	/* total # of slots that can be allocated in
+ * array */
+	int			num_segs;		/* number of dead tuple segments allocated */
+	int			last_seg;		/* last dead tuple segment with data (or 0) */
+	DeadTuplesSegment *dt_segments;		/* array of num_segs segments */
+}	DeadTuplesMultiArray;
+
 typedef struct LVRelStats
 {
 	/* hasindex = true means two-pass strategy; false means one-pass */
@@ -132,14 +174,20 @@ typedef struct LVRelStats
 	BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
 	/* List of TIDs of tuples we intend to delete */
 	/* NB: this list is 
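
For reference, the two-level lookup the patch comment describes might
look like the following sketch, using the DeadTuplesSegment and
DeadTuplesMultiArray structs above and PostgreSQL's
ItemPointerCompare(); this is illustrative, not the patch's actual
code:

static bool
multiarray_lookup(DeadTuplesMultiArray *arr, ItemPointer tid)
{
	int			lo = 0;
	int			hi = arr->last_seg;
	DeadTuplesSegment *seg;

	if (arr->num_segs == 0)
		return false;

	/* binary search over segments, keyed on each segment's last TID */
	while (lo < hi)
	{
		int			mid = (lo + hi) / 2;

		if (ItemPointerCompare(&arr->dt_segments[mid].last_dead_tuple, tid) < 0)
			lo = mid + 1;
		else
			hi = mid;
	}
	seg = &arr->dt_segments[lo];

	/* binary search within the chosen segment */
	lo = 0;
	hi = seg->num_dead_tuples;	/* exclusive upper bound */
	while (lo < hi)
	{
		int			mid = (lo + hi) / 2;
		int			cmp = ItemPointerCompare(&seg->dt_tids[mid], tid);

		if (cmp == 0)
			return true;
		if (cmp < 0)
			lo = mid + 1;
		else
			hi = mid;
	}
	return false;
}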

Re: Vacuum: allow usage of more than 1GB of work mem

2018-04-03 Thread Claudio Freire
On Tue, Apr 3, 2018 at 11:06 AM, Claudio Freire  wrote:
> I didn't receive your comment, I just saw it. Nevertheless, I rebased the 
> patches a while ago just because I noticed they didn't apply anymore in 
> cputube, and they still seem to apply.

Sorry, that is false.

They appear green in cputube, so I was confident they did apply, but I
just double-checked on a recent pull and they don't. I'll rebase them
shortly.



Re: Vacuum: allow usage of more than 1GB of work mem

2018-04-03 Thread Claudio Freire
I didn't receive your comment, I just saw it. Nevertheless, I rebased the 
patches a while ago just because I noticed they didn't apply anymore in 
cputube, and they still seem to apply.

Patch number 2 was committed a long while ago; that's why it's missing. It was 
a simple patch, and it landed rewritten as commit 
7e26e02eec90370dd222f35f00042f8188488ac4.

Re: Vacuum: allow usage of more than 1GB of work mem

2018-03-27 Thread Aleksander Alekseev
Hello everyone,

I would like to let you know that unfortunately these patches don't apply 
anymore. Also, personally, I'm a bit confused by the last message, which has 
0001- and 0003- patches attached but not the 0002- one.

Re: Vacuum: allow usage of more than 1GB of work mem

2018-01-22 Thread Aleksander Alekseev
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:tested, passed

I can confirm that these patches don't break anything; the code is well
documented, has some tests, and doesn't do anything obviously wrong. However,
I would recommend that someone more familiar with the VACUUM mechanism than I
am recheck these patches.

The new status of this patch is: Ready for Committer