Re: Vacuum: allow usage of more than 1GB of work mem
On Mon, Jul 16, 2018 at 02:33:17PM -0400, Andrew Dunstan wrote:
> Ah, ok. Thanks. ignore the email I just sent about that.

So... This thread has basically died of inactivity, while there have been a couple of interesting things discussed, like the version from Heikki here:
https://www.postgresql.org/message-id/cd8f7b62-17e1-4307-9f81-427922e5a...@iki.fi

I am marking the patches as returned with feedback for now.
--
Michael
Re: Vacuum: allow usage of more than 1GB of work mem
On Fri, Apr 6, 2018 at 4:25 PM, Claudio Freire wrote:
>> True all that. My point is that the multi-segmented array isn't all that
>> simple and proven, compared to an also straightforward B-tree. It's pretty
>> similar to a B-tree, actually, except that it has exactly two levels, and
>> the node (= segment) sizes grow exponentially. I'd rather go with a true
>> B-tree, than something homegrown that resembles a B-tree, but not quite.
>
> I disagree.

Yeah, me too. I think a segmented array is a lot simpler than a home-grown btree. I wrote a home-grown btree that ended up becoming src/backend/utils/mmgr/freepage.c and it took me a long time to get rid of all the bugs. Heikki is almost certainly better at coding up a bug-free btree than I am, but a segmented array is a dead simple data structure, or should be if done properly, and a btree is not.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Vacuum: allow usage of more than 1GB of work mem
On 07/16/2018 11:35 AM, Claudio Freire wrote:
> On Mon, Jul 16, 2018 at 11:34 AM Claudio Freire wrote:
>> On Fri, Jul 13, 2018 at 5:43 PM Andrew Dunstan wrote:
>>> On 07/13/2018 09:44 AM, Heikki Linnakangas wrote:
>>>> On 13/07/18 01:39, Andrew Dunstan wrote:
>>>>> On 07/12/2018 06:34 PM, Alvaro Herrera wrote:
>>>>>> On 2018-Jul-12, Andrew Dunstan wrote:
>>>>>>> I fully understand. I think this needs to go back to "Waiting on Author".
>>>>>> Why? Heikki's patch applies fine and passes the regression tests.
>>>>> Well, I understood Claudio was going to do some more work (see upthread).
>>>> Claudio raised a good point, that doing small pallocs leads to fragmentation, and in particular, it might mean that we can't give back the memory to the OS. The default glibc malloc() implementation has a threshold of 4 or 32 MB or something like that - allocations larger than the threshold are mmap()'d, and can always be returned to the OS. I think a simple solution to that is to allocate larger chunks, something like 32-64 MB at a time, and carve out the allocations for the nodes from those chunks. That's pretty straightforward, because we don't need to worry about freeing the nodes in retail. Keep track of the current half-filled chunk, and allocate a new one when it fills up.
>>> Google seems to suggest the default threshold is much lower, like 128K. Still, making larger allocations seems sensible. Are you going to work on that?
>> Below a few MB the threshold is dynamic, and if a block bigger than 128K but smaller than the higher threshold (32-64MB IIRC) is freed, the dynamic threshold is set to the size of the freed block.
>>
>> See M_MMAP_MAX and M_MMAP_THRESHOLD in the man page for mallopt[1]
>>
>> So I'd suggest allocating blocks bigger than M_MMAP_MAX.
>>
>> [1] http://man7.org/linux/man-pages/man3/mallopt.3.html
> Sorry, substitute M_MMAP_MAX with DEFAULT_MMAP_THRESHOLD_MAX, the former is something else.

Ah, ok. Thanks. ignore the email I just sent about that.

cheers

andrew

--
Andrew Dunstan  https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Vacuum: allow usage of more than 1GB of work mem
On Mon, Jul 16, 2018 at 3:30 PM Andrew Dunstan wrote: > > > > On 07/16/2018 10:34 AM, Claudio Freire wrote: > > On Fri, Jul 13, 2018 at 5:43 PM Andrew Dunstan > > wrote: > >> > >> > >> On 07/13/2018 09:44 AM, Heikki Linnakangas wrote: > >>> On 13/07/18 01:39, Andrew Dunstan wrote: > On 07/12/2018 06:34 PM, Alvaro Herrera wrote: > > On 2018-Jul-12, Andrew Dunstan wrote: > > > >> I fully understand. I think this needs to go back to "Waiting on > >> Author". > > Why? Heikki's patch applies fine and passes the regression tests. > Well, I understood Claudio was going to do some more work (see > upthread). > >>> Claudio raised a good point, that doing small pallocs leads to > >>> fragmentation, and in particular, it might mean that we can't give > >>> back the memory to the OS. The default glibc malloc() implementation > >>> has a threshold of 4 or 32 MB or something like that - allocations > >>> larger than the threshold are mmap()'d, and can always be returned to > >>> the OS. I think a simple solution to that is to allocate larger > >>> chunks, something like 32-64 MB at a time, and carve out the > >>> allocations for the nodes from those chunks. That's pretty > >>> straightforward, because we don't need to worry about freeing the > >>> nodes in retail. Keep track of the current half-filled chunk, and > >>> allocate a new one when it fills up. > >> > >> Google seems to suggest the default threshold is much lower, like 128K. > >> Still, making larger allocations seems sensible. Are you going to work > >> on that? > > Below a few MB the threshold is dynamic, and if a block bigger than > > 128K but smaller than the higher threshold (32-64MB IIRC) is freed, > > the dynamic threshold is set to the size of the freed block. > > > > See M_MMAP_MAX and M_MMAP_THRESHOLD in the man page for mallopt[1] > > > > So I'd suggest allocating blocks bigger than M_MMAP_MAX. 
>>
>> [1] http://man7.org/linux/man-pages/man3/mallopt.3.html
>
> That page says:
>
>     M_MMAP_MAX
>            This parameter specifies the maximum number of allocation
>            requests that may be simultaneously serviced using mmap(2).
>            This parameter exists because some systems have a limited
>            number of internal tables for use by mmap(2), and using more
>            than a few of them may degrade performance.
>
>            The default value is 65,536, a value which has no special
>            significance and which serves only as a safeguard. Setting
>            this parameter to 0 disables the use of mmap(2) for servicing
>            large allocation requests.
>
> I'm confused about the relevance.

It isn't relevant. See my next message, it should have read DEFAULT_MMAP_THRESHOLD_MAX.
Re: Vacuum: allow usage of more than 1GB of work mem
On 07/16/2018 10:34 AM, Claudio Freire wrote:
> On Fri, Jul 13, 2018 at 5:43 PM Andrew Dunstan wrote:
>> On 07/13/2018 09:44 AM, Heikki Linnakangas wrote:
>>> On 13/07/18 01:39, Andrew Dunstan wrote:
>>>> On 07/12/2018 06:34 PM, Alvaro Herrera wrote:
>>>>> On 2018-Jul-12, Andrew Dunstan wrote:
>>>>>> I fully understand. I think this needs to go back to "Waiting on Author".
>>>>> Why? Heikki's patch applies fine and passes the regression tests.
>>>> Well, I understood Claudio was going to do some more work (see upthread).
>>> Claudio raised a good point, that doing small pallocs leads to fragmentation, and in particular, it might mean that we can't give back the memory to the OS. The default glibc malloc() implementation has a threshold of 4 or 32 MB or something like that - allocations larger than the threshold are mmap()'d, and can always be returned to the OS. I think a simple solution to that is to allocate larger chunks, something like 32-64 MB at a time, and carve out the allocations for the nodes from those chunks. That's pretty straightforward, because we don't need to worry about freeing the nodes in retail. Keep track of the current half-filled chunk, and allocate a new one when it fills up.
>> Google seems to suggest the default threshold is much lower, like 128K. Still, making larger allocations seems sensible. Are you going to work on that?
> Below a few MB the threshold is dynamic, and if a block bigger than 128K but smaller than the higher threshold (32-64MB IIRC) is freed, the dynamic threshold is set to the size of the freed block.
>
> See M_MMAP_MAX and M_MMAP_THRESHOLD in the man page for mallopt[1]
>
> So I'd suggest allocating blocks bigger than M_MMAP_MAX.
>
> [1] http://man7.org/linux/man-pages/man3/mallopt.3.html

That page says:

    M_MMAP_MAX
           This parameter specifies the maximum number of allocation
           requests that may be simultaneously serviced using mmap(2).
           This parameter exists because some systems have a limited
           number of internal tables for use by mmap(2), and using more
           than a few of them may degrade performance.

           The default value is 65,536, a value which has no special
           significance and which serves only as a safeguard. Setting
           this parameter to 0 disables the use of mmap(2) for servicing
           large allocation requests.

I'm confused about the relevance.

cheers

andrew

--
Andrew Dunstan  https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Vacuum: allow usage of more than 1GB of work mem
On Mon, Jul 16, 2018 at 11:34 AM Claudio Freire wrote: > > On Fri, Jul 13, 2018 at 5:43 PM Andrew Dunstan > wrote: > > > > > > > > On 07/13/2018 09:44 AM, Heikki Linnakangas wrote: > > > On 13/07/18 01:39, Andrew Dunstan wrote: > > >> On 07/12/2018 06:34 PM, Alvaro Herrera wrote: > > >>> On 2018-Jul-12, Andrew Dunstan wrote: > > >>> > > I fully understand. I think this needs to go back to "Waiting on > > Author". > > >>> Why? Heikki's patch applies fine and passes the regression tests. > > >> > > >> Well, I understood Claudio was going to do some more work (see > > >> upthread). > > > > > > Claudio raised a good point, that doing small pallocs leads to > > > fragmentation, and in particular, it might mean that we can't give > > > back the memory to the OS. The default glibc malloc() implementation > > > has a threshold of 4 or 32 MB or something like that - allocations > > > larger than the threshold are mmap()'d, and can always be returned to > > > the OS. I think a simple solution to that is to allocate larger > > > chunks, something like 32-64 MB at a time, and carve out the > > > allocations for the nodes from those chunks. That's pretty > > > straightforward, because we don't need to worry about freeing the > > > nodes in retail. Keep track of the current half-filled chunk, and > > > allocate a new one when it fills up. > > > > > > Google seems to suggest the default threshold is much lower, like 128K. > > Still, making larger allocations seems sensible. Are you going to work > > on that? > > Below a few MB the threshold is dynamic, and if a block bigger than > 128K but smaller than the higher threshold (32-64MB IIRC) is freed, > the dynamic threshold is set to the size of the freed block. > > See M_MMAP_MAX and M_MMAP_THRESHOLD in the man page for mallopt[1] > > So I'd suggest allocating blocks bigger than M_MMAP_MAX. 
> > [1] http://man7.org/linux/man-pages/man3/mallopt.3.html Sorry, substitute M_MMAP_MAX with DEFAULT_MMAP_THRESHOLD_MAX, the former is something else.
Re: Vacuum: allow usage of more than 1GB of work mem
On Fri, Jul 13, 2018 at 5:43 PM Andrew Dunstan wrote: > > > > On 07/13/2018 09:44 AM, Heikki Linnakangas wrote: > > On 13/07/18 01:39, Andrew Dunstan wrote: > >> On 07/12/2018 06:34 PM, Alvaro Herrera wrote: > >>> On 2018-Jul-12, Andrew Dunstan wrote: > >>> > I fully understand. I think this needs to go back to "Waiting on > Author". > >>> Why? Heikki's patch applies fine and passes the regression tests. > >> > >> Well, I understood Claudio was going to do some more work (see > >> upthread). > > > > Claudio raised a good point, that doing small pallocs leads to > > fragmentation, and in particular, it might mean that we can't give > > back the memory to the OS. The default glibc malloc() implementation > > has a threshold of 4 or 32 MB or something like that - allocations > > larger than the threshold are mmap()'d, and can always be returned to > > the OS. I think a simple solution to that is to allocate larger > > chunks, something like 32-64 MB at a time, and carve out the > > allocations for the nodes from those chunks. That's pretty > > straightforward, because we don't need to worry about freeing the > > nodes in retail. Keep track of the current half-filled chunk, and > > allocate a new one when it fills up. > > > Google seems to suggest the default threshold is much lower, like 128K. > Still, making larger allocations seems sensible. Are you going to work > on that? Below a few MB the threshold is dynamic, and if a block bigger than 128K but smaller than the higher threshold (32-64MB IIRC) is freed, the dynamic threshold is set to the size of the freed block. See M_MMAP_MAX and M_MMAP_THRESHOLD in the man page for mallopt[1] So I'd suggest allocating blocks bigger than M_MMAP_MAX. [1] http://man7.org/linux/man-pages/man3/mallopt.3.html
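[Editor's note: the glibc behavior described above can be sidestepped by pinning the threshold explicitly. The sketch below uses the real `mallopt(3)` API from `<malloc.h>`; the 4 MB value is only an example, not taken from any patch in this thread. By default the threshold is dynamic and can be ratcheted up toward DEFAULT_MMAP_THRESHOLD_MAX by large frees; setting it once disables that adjustment, so large allocations stay mmap()'d and are returned to the OS on free().]

```c
/*
 * Pin glibc's mmap threshold so that large allocations are always
 * served by mmap() and given back to the OS when freed.
 *
 * glibc-specific; guarded so the sketch still compiles elsewhere.
 * The 4 MB cutoff is illustrative only.
 */
#include <malloc.h>
#include <stdlib.h>
#include <assert.h>

static void
pin_mmap_threshold(void)
{
#ifdef M_MMAP_THRESHOLD
	/*
	 * After this call, allocations of 4 MB and up are always mmap()'d,
	 * regardless of what has been freed before (the dynamic threshold
	 * adjustment is disabled once the parameter is set explicitly).
	 */
	mallopt(M_MMAP_THRESHOLD, 4 * 1024 * 1024);
#endif
}
```

With this in place, allocating the tid-map storage in blocks of, say, 8 MB would guarantee each block is an independent mapping that free() can return to the kernel.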
Re: Vacuum: allow usage of more than 1GB of work mem
On 07/13/2018 09:44 AM, Heikki Linnakangas wrote:
> On 13/07/18 01:39, Andrew Dunstan wrote:
>> On 07/12/2018 06:34 PM, Alvaro Herrera wrote:
>>> On 2018-Jul-12, Andrew Dunstan wrote:
>>>> I fully understand. I think this needs to go back to "Waiting on Author".
>>> Why? Heikki's patch applies fine and passes the regression tests.
>> Well, I understood Claudio was going to do some more work (see upthread).
> Claudio raised a good point, that doing small pallocs leads to fragmentation, and in particular, it might mean that we can't give back the memory to the OS. The default glibc malloc() implementation has a threshold of 4 or 32 MB or something like that - allocations larger than the threshold are mmap()'d, and can always be returned to the OS. I think a simple solution to that is to allocate larger chunks, something like 32-64 MB at a time, and carve out the allocations for the nodes from those chunks. That's pretty straightforward, because we don't need to worry about freeing the nodes in retail. Keep track of the current half-filled chunk, and allocate a new one when it fills up.

Google seems to suggest the default threshold is much lower, like 128K. Still, making larger allocations seems sensible. Are you going to work on that?

> He also wanted to refactor the iterator API, to return one ItemPointer at a time. I don't think that's necessary, the current iterator API is more convenient for the callers, but I don't feel strongly about that. Anything else?
>
>> If we're going to go with Heikki's patch then do we need to change the author, or add him as an author?
>
> Let's list both of us. At least in the commit message, doesn't matter much what the commitfest app says.

I added you as an author in the CF App

cheers

andrew

--
Andrew Dunstan  https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Vacuum: allow usage of more than 1GB of work mem
On 13/07/18 01:39, Andrew Dunstan wrote:
> On 07/12/2018 06:34 PM, Alvaro Herrera wrote:
>> On 2018-Jul-12, Andrew Dunstan wrote:
>>> I fully understand. I think this needs to go back to "Waiting on Author".
>> Why? Heikki's patch applies fine and passes the regression tests.
> Well, I understood Claudio was going to do some more work (see upthread).

Claudio raised a good point, that doing small pallocs leads to fragmentation, and in particular, it might mean that we can't give back the memory to the OS. The default glibc malloc() implementation has a threshold of 4 or 32 MB or something like that - allocations larger than the threshold are mmap()'d, and can always be returned to the OS. I think a simple solution to that is to allocate larger chunks, something like 32-64 MB at a time, and carve out the allocations for the nodes from those chunks. That's pretty straightforward, because we don't need to worry about freeing the nodes in retail. Keep track of the current half-filled chunk, and allocate a new one when it fills up.

He also wanted to refactor the iterator API, to return one ItemPointer at a time. I don't think that's necessary, the current iterator API is more convenient for the callers, but I don't feel strongly about that. Anything else?

> If we're going to go with Heikki's patch then do we need to change the author, or add him as an author?

Let's list both of us. At least in the commit message, doesn't matter much what the commitfest app says.

- Heikki
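[Editor's note: the chunked allocation scheme Heikki describes above — carve node allocations out of large chunks with no retail freeing, tracking the current half-filled chunk — can be sketched as a simple bump allocator. All names are illustrative, not taken from the actual patch.]

```c
/*
 * Minimal sketch of the chunked allocation scheme: allocate large
 * chunks (32 MB here) and carve fixed allocations out of them with a
 * bump pointer.  Nodes are never freed individually; all chunks are
 * released together at the end.
 */
#include <stddef.h>
#include <stdlib.h>

#define CHUNK_SIZE	(32 * 1024 * 1024)

typedef struct Chunk
{
	struct Chunk *next;			/* previously filled chunks */
	size_t		used;			/* bytes already carved out of data[] */
	char		data[];			/* the chunk payload */
} Chunk;

typedef struct ChunkAllocator
{
	Chunk	   *current;		/* the current half-filled chunk */
} ChunkAllocator;

static void *
chunk_alloc(ChunkAllocator *a, size_t size)
{
	void	   *p;

	size = (size + 7) & ~(size_t) 7;	/* keep allocations aligned */

	if (a->current == NULL || a->current->used + size > CHUNK_SIZE)
	{
		/* current chunk is full (or absent): start a new one */
		Chunk	   *c = malloc(offsetof(Chunk, data) + CHUNK_SIZE);

		if (c == NULL)
			return NULL;
		c->used = 0;
		c->next = a->current;
		a->current = c;
	}
	p = a->current->data + a->current->used;
	a->current->used += size;
	return p;
}

static void
chunk_free_all(ChunkAllocator *a)
{
	while (a->current != NULL)
	{
		Chunk	   *next = a->current->next;

		free(a->current);
		a->current = next;
	}
}
```

Because every chunk is well above any plausible malloc mmap threshold, each one is its own mapping and chunk_free_all() gives the memory straight back to the OS, which addresses the fragmentation concern raised in the quoted discussion.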
Re: Vacuum: allow usage of more than 1GB of work mem
On 07/12/2018 06:34 PM, Alvaro Herrera wrote:
> On 2018-Jul-12, Andrew Dunstan wrote:
>> I fully understand. I think this needs to go back to "Waiting on Author".
> Why? Heikki's patch applies fine and passes the regression tests.

Well, I understood Claudio was going to do some more work (see upthread).

If we're going to go with Heikki's patch then do we need to change the author, or add him as an author?

cheers

andrew

--
Andrew Dunstan  https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Vacuum: allow usage of more than 1GB of work mem
On 2018-Jul-12, Andrew Dunstan wrote:
> I fully understand. I think this needs to go back to "Waiting on Author".

Why? Heikki's patch applies fine and passes the regression tests.

--
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Vacuum: allow usage of more than 1GB of work mem
On 07/12/2018 12:38 PM, Claudio Freire wrote:
> On Thu, Jul 12, 2018 at 10:44 AM Andrew Dunstan wrote:
>> On 04/06/2018 08:00 PM, Claudio Freire wrote:
>>> On Fri, Apr 6, 2018 at 5:25 PM, Claudio Freire wrote:
>>>> On Fri, Apr 6, 2018 at 10:39 AM, Heikki Linnakangas wrote:
>>>>> On 06/04/18 01:59, Claudio Freire wrote:
>>>>>> The iteration interface, however, seems quite specific for the use case of vacuumlazy, so it's not really a good abstraction.
>>>>> Can you elaborate? It does return the items one block at a time. Is that what you mean by being specific for vacuumlazy? I guess that's a bit special, but if you imagine some other users for this abstraction, it's probably not that unusual. For example, if we started using it in bitmap heap scans, a bitmap heap scan would also want to get the TIDs one block number at a time.
>>>> But you're also tying the caller to the format of the buffer holding those TIDs, for instance. Why would you, when you can have an interface that just iterates TIDs and let the caller store them if/however they want?
>>>>
>>>> I do believe a pure iterator interface is a better interface.
>>> Between the b-tree or not discussion and the refactoring to separate the code, I don't think we'll get this in the next 24hs.
>>>
>>> So I guess we'll have ample time to ponder on both issues during the next commit fest.
>> There doesn't seem to have been much pondering done since then, at least publicly. Can we make some progress on this? It's been around for a long time now.
> Yeah, life has kept me busy and I haven't had much time to make progress here, but I was planning on doing the refactoring as we were discussing soon. Can't give a time frame for that, but "soonish".

I fully understand. I think this needs to go back to "Waiting on Author".

cheers

andrew

--
Andrew Dunstan  https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Vacuum: allow usage of more than 1GB of work mem
On Thu, Jul 12, 2018 at 10:44 AM Andrew Dunstan wrote: > > > > On 04/06/2018 08:00 PM, Claudio Freire wrote: > > On Fri, Apr 6, 2018 at 5:25 PM, Claudio Freire > > wrote: > >> On Fri, Apr 6, 2018 at 10:39 AM, Heikki Linnakangas > >> wrote: > >>> On 06/04/18 01:59, Claudio Freire wrote: > The iteration interface, however, seems quite specific for the use > case of vacuumlazy, so it's not really a good abstraction. > >>> > >>> Can you elaborate? It does return the items one block at a time. Is that > >>> what you mean by being specific for vacuumlazy? I guess that's a bit > >>> special, but if you imagine some other users for this abstraction, it's > >>> probably not that unusual. For example, if we started using it in bitmap > >>> heap scans, a bitmap heap scan would also want to get the TIDs one block > >>> number at a time. > >> But you're also tying the caller to the format of the buffer holding > >> those TIDs, for instance. Why would you, when you can have an > >> interface that just iterates TIDs and let the caller store them > >> if/however they want? > >> > >> I do believe a pure iterator interface is a better interface. > > Between the b-tree or not discussion and the refactoring to separate > > the code, I don't think we'll get this in the next 24hs. > > > > So I guess we'll have ample time to poner on both issues during the > > next commit fest. > > > > > > There doesn't seem to have been much pondering done since then, at least > publicly. Can we make some progress on this? It's been around for a long > time now. Yeah, life has kept me busy and I haven't had much time to make progress here, but I was planning on doing the refactoring as we were discussing soon. Can't give a time frame for that, but "soonish".
Re: Vacuum: allow usage of more than 1GB of work mem
On 04/06/2018 08:00 PM, Claudio Freire wrote:
> On Fri, Apr 6, 2018 at 5:25 PM, Claudio Freire wrote:
>> On Fri, Apr 6, 2018 at 10:39 AM, Heikki Linnakangas wrote:
>>> On 06/04/18 01:59, Claudio Freire wrote:
>>>> The iteration interface, however, seems quite specific for the use case of vacuumlazy, so it's not really a good abstraction.
>>> Can you elaborate? It does return the items one block at a time. Is that what you mean by being specific for vacuumlazy? I guess that's a bit special, but if you imagine some other users for this abstraction, it's probably not that unusual. For example, if we started using it in bitmap heap scans, a bitmap heap scan would also want to get the TIDs one block number at a time.
>> But you're also tying the caller to the format of the buffer holding those TIDs, for instance. Why would you, when you can have an interface that just iterates TIDs and let the caller store them if/however they want?
>>
>> I do believe a pure iterator interface is a better interface.
> Between the b-tree or not discussion and the refactoring to separate the code, I don't think we'll get this in the next 24hs.
>
> So I guess we'll have ample time to ponder on both issues during the next commit fest.

There doesn't seem to have been much pondering done since then, at least publicly. Can we make some progress on this? It's been around for a long time now.

cheers

andrew

--
Andrew Dunstan  https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Vacuum: allow usage of more than 1GB of work mem
On Fri, Apr 6, 2018 at 5:25 PM, Claudio Freire wrote:
> On Fri, Apr 6, 2018 at 10:39 AM, Heikki Linnakangas wrote:
>> On 06/04/18 01:59, Claudio Freire wrote:
>>>
>>> The iteration interface, however, seems quite specific for the use
>>> case of vacuumlazy, so it's not really a good abstraction.
>>
>> Can you elaborate? It does return the items one block at a time. Is that
>> what you mean by being specific for vacuumlazy? I guess that's a bit
>> special, but if you imagine some other users for this abstraction, it's
>> probably not that unusual. For example, if we started using it in bitmap
>> heap scans, a bitmap heap scan would also want to get the TIDs one block
>> number at a time.
>
> But you're also tying the caller to the format of the buffer holding
> those TIDs, for instance. Why would you, when you can have an
> interface that just iterates TIDs and let the caller store them
> if/however they want?
>
> I do believe a pure iterator interface is a better interface.

Between the b-tree or not discussion and the refactoring to separate the code, I don't think we'll get this in the next 24hs.

So I guess we'll have ample time to ponder on both issues during the next commit fest.
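[Editor's note: the "pure iterator" interface argued for above — one TID at a time, with the caller free to store them however it likes — could look something like the following sketch. The type and function names are hypothetical stand-ins; a real implementation would walk the segmented multi-array (or b-tree) internally rather than a flat array.]

```c
/*
 * Sketch of a pure TID iterator: callers pull one TID at a time and
 * never see the layout of the underlying buffer.  Iteration state
 * lives entirely in the iterator struct.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct
{
	uint32_t	block;			/* heap block number */
	uint16_t	offset;			/* line pointer within the block */
} DeadTid;

typedef struct
{
	const DeadTid *tids;		/* the sorted TID store (one flat segment here) */
	size_t		ntids;
	size_t		pos;			/* current position */
} DeadTidIterator;

static inline void
dead_tid_iter_init(DeadTidIterator *it, const DeadTid *tids, size_t ntids)
{
	it->tids = tids;
	it->ntids = ntids;
	it->pos = 0;
}

/* Returns false once the iterator is exhausted. */
static inline bool
dead_tid_iter_next(DeadTidIterator *it, DeadTid *out)
{
	if (it->pos >= it->ntids)
		return false;
	*out = it->tids[it->pos++];
	return true;
}
```

A caller that still wants block-at-a-time batches (as the existing interface provides) can layer that on top by accumulating results until `block` changes, so the per-block behavior need not be baked into the store itself.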
Re: Vacuum: allow usage of more than 1GB of work mem
Claudio Freire wrote:
> On Fri, Apr 6, 2018 at 11:00 AM, Alvaro Herrera wrote:
>> FWIW I liked the idea of having this abstraction possibly do other
>> things -- for instance to vacuum brin indexes you'd like to mark index
>> tuples as "containing tuples that were removed" and eventually
>> re-summarize the range. With the current interface we cannot do that,
>> because vacuum expects brin vacuuming to ask for each heap tuple "is
>> this tid dead?" and of course we don't have a list of tids to ask for.
>> So if we can ask instead "how many dead tuples does this block contain?"
>> brin vacuuming will be much happier.
>
> I don't think either patch gives you that.
>
> The bulkdelete interface is part of the indexam and unlikely to change
> in this patch.

I'm sure you're correct. I was just saying that with the abstract interface it is easier to implement what I suggest as a follow-on patch.

--
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Vacuum: allow usage of more than 1GB of work mem
Heikki Linnakangas wrote:
> On 06/04/18 01:59, Claudio Freire wrote:
>> The iteration interface, however, seems quite specific for the use
>> case of vacuumlazy, so it's not really a good abstraction.
>
> Can you elaborate? It does return the items one block at a time. Is that
> what you mean by being specific for vacuumlazy? I guess that's a bit
> special, but if you imagine some other users for this abstraction, it's
> probably not that unusual. For example, if we started using it in bitmap
> heap scans, a bitmap heap scan would also want to get the TIDs one block
> number at a time.

FWIW I liked the idea of having this abstraction possibly do other things -- for instance to vacuum brin indexes you'd like to mark index tuples as "containing tuples that were removed" and eventually re-summarize the range. With the current interface we cannot do that, because vacuum expects brin vacuuming to ask for each heap tuple "is this tid dead?" and of course we don't have a list of tids to ask for. So if we can ask instead "how many dead tuples does this block contain?" brin vacuuming will be much happier.

> Looking at the changes to the regression test in this, I don't quite
> understand what it's all about. What are the "wait_barriers" for? If I
> understand correctly, they're added so that the VACUUMs can remove the
> tuples that are deleted in the test. But why are they needed now? Was that
> an orthogonal change we should've done anyway?
>
> Rather than add those wait_barriers, should we stop running the 'vacuum'
> test in parallel with the other tests? Or maybe it's a good thing to run it
> in parallel, to test some other things?

20180207235226.zygu4r3yv3yfcnmc@alvherre.pgsql

--
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
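[Editor's note: the per-block query Álvaro asks for above ("how many dead tuples does this block contain?") is cheap to provide over any TID store kept in sorted order — two binary searches bracket the block's entries. The sketch below uses a flat sorted array and hypothetical names to illustrate the idea.]

```c
/*
 * Count the dead TIDs recorded for a given heap block, over a TID
 * array sorted by (block, offset), using lower- and upper-bound
 * binary searches.  O(log n) regardless of how many TIDs the block has.
 */
#include <stddef.h>
#include <stdint.h>

typedef struct
{
	uint32_t	block;
	uint16_t	offset;
} DeadTid;

static size_t
tidmap_count_block(const DeadTid *tids, size_t n, uint32_t block)
{
	size_t		lo = 0,
				hi = n,
				first;

	/* lower bound: first entry with tids[i].block >= block */
	while (lo < hi)
	{
		size_t		mid = lo + (hi - lo) / 2;

		if (tids[mid].block < block)
			lo = mid + 1;
		else
			hi = mid;
	}
	first = lo;

	/* upper bound: first entry with tids[i].block > block */
	hi = n;
	while (lo < hi)
	{
		size_t		mid = lo + (hi - lo) / 2;

		if (tids[mid].block <= block)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo - first;
}
```

The same bracketing trick would also let an iterator hand back a pointer to a block's offsets without copying, which ties in with the iterator discussion elsewhere in the thread.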
Re: Vacuum: allow usage of more than 1GB of work mem
On 06/04/18 16:39, Heikki Linnakangas wrote:
> Sure. I used the attached script to test this.

Sorry, I attached the wrong script. Here is the correct one that I used. Here are also the results I got from running it.

- Heikki

[Attachments: vacuumbenchone.sh (shell script), vacuumbenchone-results-on-heikkis-laptop.tar.gz (gzip)]
Re: Vacuum: allow usage of more than 1GB of work mem
On 06/04/18 01:59, Claudio Freire wrote:
> The iteration interface, however, seems quite specific for the use
> case of vacuumlazy, so it's not really a good abstraction.

Can you elaborate? It does return the items one block at a time. Is that what you mean by being specific for vacuumlazy? I guess that's a bit special, but if you imagine some other users for this abstraction, it's probably not that unusual. For example, if we started using it in bitmap heap scans, a bitmap heap scan would also want to get the TIDs one block number at a time.

> It also copies stuff a lot, so it's quite heavyweight. I'd suggest
> trying to go for a lighter weight interface with less overhead that
> is more general at the same time.

Note that there was similar copying, to construct an array of OffsetNumbers, happening in lazy_vacuum_page() before this patch. So the net amount of copying is the same.

I'm envisioning that this data structure will sooner or later be optimized further, so that when you have a lot of TIDs pointing to the same block, we would pack them more tightly, storing the block number just once, with an array of offset numbers. This interface that returns an array of offset numbers matches that future well, as the iterator could just return a pointer to the array of offset numbers, with no copying. (If we end up doing something even more dense, like a bitmap, then it doesn't help, but that's ok too.)

> About the B-tree, however, I don't think a B-tree is a good idea.
> Trees' main benefit is that they can be inserted to efficiently. When
> all your data is loaded sequentially, in-order, in-memory and
> immutable; the tree is pointless, more costly to build, and harder to
> maintain - in terms of code complexity.
>
> In this use case, the only benefit of B-trees would be that they're
> optimized for disk access.

Those are not the reasons for which I'd prefer a B-tree. A B-tree has good cache locality, and when you don't need to worry about random insertions, page splits, deletions etc., it's also very simple to implement. This patch is not much longer than the segmented multi-array.

> On the other side, using B-trees incurs memory overhead due to the
> need for internal nodes, can fragment memory because internal nodes
> aren't the same size as leaf nodes, is easier to get wrong and
> introduce bugs... I don't see a gain.

The memory overhead incurred by the internal nodes is quite minimal, and can be adjusted by changing the node sizes. After some experimentation, I settled on 2048 items per leaf node, and 64 items per internal node. With those values, the overhead caused by the internal nodes is minimal, below 0.5%. That seems fine, but we could increase the node sizes to bring it further down, if we'd prefer that tradeoff.

I don't understand what memory fragmentation problems you're worried about. The tree grows one node at a time, as new TIDs are added, until it's all released at the end. I don't see how the size of internal vs leaf nodes matters.

> If you propose its use, at least benchmark it to show some gain.

Sure. I used the attached script to test this. It's inspired by the test script you posted. It creates a pgbench database with scale factor 100, deletes 80% of the rows, and runs vacuum. To stress lazy_tid_reaped() more heavily, the test script creates a number of extra indexes. Half of them are on the primary key, just to get more repetitions without having to re-initialize in between, and the rest are like this:

    create index random_1 on pgbench_accounts((hashint4(aid)))

to stress lazy_vacuum_tid_reaped() with a random access pattern, rather than the sequential one that you get with the primary key index.

I ran the test script on my laptop, with unpatched master, with your latest multi-array patch, and with the attached version of the b-tree patch. The results are quite noisy, unfortunately, so I wouldn't draw very strong conclusions from it, but it seems that the performance of all three versions is roughly the same. I looked in particular at the CPU time spent in the index vacuums, as reported by VACUUM VERBOSE.

> Furthermore, among the 200-ish messages this thread has accumulated,
> better ideas have been proposed, better because they do use less
> memory and are faster (like using bitmaps when possible), but if we
> can't push a simple refactoring first, there's no chance a bigger
> rewrite will fare better.
>
> Remember, in this use case, using less memory far outweighs any other
> consideration. Less memory directly translates to less iterations
> over the indexes, because more can be crammed into m_w_m, which is a
> huge time saving. Far more than any micro-optimization.
>
> About 2 years ago, I chose to try to push this simple algorithm
> first, then try to improve on it with better data structures. Nobody
> complained at the time (I think, IIRC), and I don't think it fair to
> go and revisit that now. It just delays getting a solution for this
> issue for the pursuit of "the perfect implementation" that might
> never arrive.
Re: Vacuum: allow usage of more than 1GB of work mem
On Thu, Apr 5, 2018 at 5:02 PM, Heikki Linnakangaswrote: > On 03/04/18 17:20, Claudio Freire wrote: >> >> Ok, rebased patches attached > > > Thanks! I took a look at this. > > First, now that the data structure is more complicated, I think it's time to > abstract it, and move it out of vacuumlazy.c. The Tid Map needs to support > the following operations: > > * Add TIDs, in order (in 1st phase of vacuum) > * Random lookup, by TID (when scanning indexes) > * Iterate through all TIDs, in order (2nd pass over heap) > > Let's add a new source file to hold the code for the tid map data structure, > with functions corresponding those operations. > > I took a stab at doing that, and I think it makes vacuumlazy.c nicer. About the refactoring to split this into their own set of files and abstract away the underlying structure, I can totally get behind that. The iteration interface, however, seems quite specific for the use case of vacuumlazy, so it's not really a good abstraction. It also copies stuff a lot, so it's quite heavyweight. I'd suggest trying to go for a lighter weight interface with less overhead that is more general at the same time. If it was C++, I'd say build an iterator class. C would do it probably with macros, so you can have a macro to get to the current element, another to advance to the next element, and another to check whether you've reached the end. I can do that if we agree on the points below: > Secondly, I'm not a big fan of the chosen data structure. I think the only > reason that the segmented "multi-array" was chosen is that each "segment" > works is similar to the simple array that we used to have. After putting it > behind the abstraction, it seems rather ad hoc. There are many standard > textbook data structures that we could use instead, and would be easier to > understand and reason about, I think. > > So, I came up with the attached patch. I used a B-tree as the data > structure. 
Not sure it's the best one, I'm all ears for suggestions and > bikeshedding on alternatives, but I'm pretty happy with that. I would expect > it to be pretty close to the simple array with binary search in performance > characteristics. It'd be pretty straightforward to optimize further, and > e.g. use a bitmap of OffsetNumbers or some other more dense data structure > in the B-tree leaf pages, but I resisted doing that as part of this patch. About the B-tree, however, I don't think a B-tree is a good idea. Trees' main benefit is that they can be inserted into efficiently. When all your data is loaded sequentially, in order, in memory, and is immutable, a tree is pointless: more costly to build, and harder to maintain in terms of code complexity. In this use case, the only benefit of B-trees would be that they're optimized for disk access. If we planned to store this on disk, perhaps I'd grant you that. But we don't plan to do that, and it's not even clear doing it would be efficient enough for the intended use. On the other hand, using B-trees incurs memory overhead due to the need for internal nodes, can fragment memory because internal nodes aren't the same size as leaf nodes, and is easier to get wrong and introduce bugs... I don't see a gain. If you propose its use, at least benchmark it to show some gain. So I don't think a B-tree is a good idea: the sorted array is already good enough, and if not, it's at least close to the earlier implementation and less likely to introduce bugs. Furthermore, among the 200-ish messages this thread has accumulated, better ideas have been proposed, better because they use less memory and are faster (like using bitmaps when possible), but if we can't push a simple refactoring first, there's no chance a bigger rewrite will fare better. Remember, in this use case, using less memory far outweighs any other consideration.
Less memory directly translates to fewer iterations over the indexes, because more can be crammed into m_w_m, which is a huge time saving. Far more than any micro-optimization. About 2 years ago, I chose to try to push this simple algorithm first, then try to improve on it with better data structures. Nobody complained at the time (I think, IIRC), and I don't think it's fair to go and revisit that now. It just delays getting a solution for this issue in the pursuit of "the perfect implementation" that might never arrive. And even if it does, there's nothing stopping us from pushing another patch in the future with that better implementation if we wish. Let's get something simple and proven first. > I haven't done any performance testing of this (and not much testing in > general), but at least the abstraction seems better this way. Performance > testing would be good, too. In particular, I'd like to know how this might > affect the performance of lazy_tid_reaped(). That's a hot spot when > vacuuming indexes, so we don't want to add any cycles there. Were there any > ready-made test kits on that in this thread? I didn't see any at a quick glance, but it's a long thread..
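Claudio's suggested macro-based iterator could look something like the sketch below. This is only an illustration of the shape of such an interface: none of these names (TidSegment, TIDITER_*, tiditer_sum) appear in the patches, plain ints stand in for ItemPointerData, and segments are assumed non-empty.

```c
#include <assert.h>

/*
 * Hypothetical segment type: a chunk of a segmented array.
 * (Illustrative only; not the patch's actual structures.)
 */
typedef struct TidSegment
{
	int		   *items;			/* entries in this segment */
	int			num_items;		/* # of used entries (assumed > 0) */
} TidSegment;

/* Iterator state: which segment we're in, and where within it */
typedef struct TidIter
{
	TidSegment *segs;
	int			num_segs;
	int			cur_seg;
	int			cur_item;
} TidIter;

/* The three operations Claudio describes: get, advance, check-end */
#define TIDITER_INIT(it, s, n) \
	((it)->segs = (s), (it)->num_segs = (n), \
	 (it)->cur_seg = 0, (it)->cur_item = 0)
#define TIDITER_DONE(it) \
	((it)->cur_seg >= (it)->num_segs)
#define TIDITER_GET(it) \
	((it)->segs[(it)->cur_seg].items[(it)->cur_item])
#define TIDITER_NEXT(it) \
	do { \
		if (++(it)->cur_item >= (it)->segs[(it)->cur_seg].num_items) \
		{ \
			(it)->cur_item = 0; \
			(it)->cur_seg++; \
		} \
	} while (0)

/* Visit every entry in order, as the 2nd heap pass would, without copying */
static int
tiditer_sum(TidSegment *segs, int num_segs)
{
	TidIter		it;
	int			sum = 0;

	TIDITER_INIT(&it, segs, num_segs);
	while (!TIDITER_DONE(&it))
	{
		sum += TIDITER_GET(&it);
		TIDITER_NEXT(&it);
	}
	return sum;
}
```

The point of the macro form is that the caller walks the segments in place; nothing is copied out of the structure, unlike the iteration interface Claudio objects to.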
Re: Vacuum: allow usage of more than 1GB of work mem
On 03/04/18 17:20, Claudio Freire wrote: Ok, rebased patches attached Thanks! I took a look at this. First, now that the data structure is more complicated, I think it's time to abstract it, and move it out of vacuumlazy.c. The Tid Map needs to support the following operations: * Add TIDs, in order (in 1st phase of vacuum) * Random lookup, by TID (when scanning indexes) * Iterate through all TIDs, in order (2nd pass over heap) Let's add a new source file to hold the code for the tid map data structure, with functions corresponding to those operations. I took a stab at doing that, and I think it makes vacuumlazy.c nicer. Secondly, I'm not a big fan of the chosen data structure. I think the only reason that the segmented "multi-array" was chosen is that each "segment" is similar to the simple array that we used to have. After putting it behind the abstraction, it seems rather ad hoc. There are many standard textbook data structures that we could use instead, and they would be easier to understand and reason about, I think. So, I came up with the attached patch. I used a B-tree as the data structure. Not sure it's the best one, I'm all ears for suggestions and bikeshedding on alternatives, but I'm pretty happy with that. I would expect it to be pretty close to the simple array with binary search in performance characteristics. It'd be pretty straightforward to optimize further, and e.g. use a bitmap of OffsetNumbers or some other more dense data structure in the B-tree leaf pages, but I resisted doing that as part of this patch. I haven't done any performance testing of this (and not much testing in general), but at least the abstraction seems better this way. Performance testing would be good, too. In particular, I'd like to know how this might affect the performance of lazy_tid_reaped(). That's a hot spot when vacuuming indexes, so we don't want to add any cycles there. Were there any ready-made test kits on that in this thread?
I didn't see any at a quick glance, but it's a long thread..

- Heikki

diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index 4a6c99e090..634059b8ff 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -20,6 +20,6 @@ OBJS = amcmds.o aggregatecmds.o alter.o analyze.o async.o cluster.o comment.o \
 	policy.o portalcmds.o prepare.o proclang.o publicationcmds.o \
 	schemacmds.o seclabel.o sequence.o statscmds.o subscriptioncmds.o \
 	tablecmds.o tablespace.o trigger.o tsearchcmds.o typecmds.o user.o \
-	vacuum.o vacuumlazy.o variable.o view.o
+	vacuum.o vacuumlazy.o vacuumtidmap.o variable.o view.o

 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index d2a006671a..0cc6b98831 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -4,23 +4,21 @@
  * Concurrent ("lazy") vacuuming.
  *
  *
- * The major space usage for LAZY VACUUM is storage for the array of dead tuple
- * TIDs.  We want to ensure we can vacuum even the very largest relations with
- * finite memory space usage.  To do that, we set upper bounds on the number of
- * tuples we will keep track of at once.
+ * The major space usage for LAZY VACUUM is storage of dead tuple TIDs.
+ * We want to ensure we can vacuum even the very largest relations with
+ * finite memory space usage.  To do that, we set upper bounds on the amount
+ * of memory used to track the TIDs at any one time.
  *
  * We are willing to use at most maintenance_work_mem (or perhaps
  * autovacuum_work_mem) memory space to keep track of dead tuples.  We
- * initially allocate an array of TIDs of that size, with an upper limit that
- * depends on table size (this limit ensures we don't allocate a huge area
- * uselessly for vacuuming small tables).  If the array threatens to overflow,
- * we suspend the heap scan phase and perform a pass of index cleanup and page
- * compaction, then resume the heap scan with an empty TID array.
+ * use a specialized data structure to hold the TIDs, which keeps track of
+ * the memory used.  If the memory limit is about to be exceeded, we suspend
+ * the heap scan phase and perform a pass of index cleanup and page
+ * compaction, then resume the heap scan with an empty tid map.
  *
  * If we're processing a table with no indexes, we can just vacuum each page
  * as we go; there's no need to save up multiple tuples to minimize the number
- * of index scans performed.  So we don't use maintenance_work_mem memory for
- * the TID array, just enough to hold as many heap tuples as fit on one page.
+ * of index scans performed.
  *
  *
  * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
@@ -94,13 +92,6 @@
 	((BlockNumber) (((uint64) 8 * 1024 * 1024 * 1024) / BLCKSZ))

 /*
- * Guesstimation of number of dead tuples per page.  This is used to
- * provide an upper limit to memory allocated when vacuuming small
- *
Re: Vacuum: allow usage of more than 1GB of work mem
On Tue, Apr 3, 2018 at 11:09 AM, Claudio Freire wrote: > On Tue, Apr 3, 2018 at 11:06 AM, Claudio Freire > wrote: >> I didn't receive your comment, I just saw it. Nevertheless, I rebased the >> patches a while ago just because I noticed they didn't apply anymore in >> cputube, and they still seem to apply. > > Sorry, that is false. > > They appear green in cputube, so I was confident they did apply, but I > just double-checked on a recent pull and they don't. I'll rebase them > shortly. Ok, rebased patches attached

From 712aff9a856c2bae1d87555057e6546d029ecc47 Mon Sep 17 00:00:00 2001
From: Claudio Freire
Date: Mon, 12 Sep 2016 23:36:42 -0300
Subject: [PATCH 1/2] Vacuum: allow using more than 1GB work mem

Turn the dead_tuples array into a structure composed of several
exponentially bigger arrays, to enable usage of more than 1GB of work
mem during vacuum and thus reduce the number of full index scans
necessary to remove all dead tids when the memory is available.

Improve test ability to spot vacuum errors
---
 src/backend/commands/vacuumlazy.c    | 402 ---
 src/test/regress/expected/vacuum.out |  72 ++-
 src/test/regress/sql/vacuum.sql      |  40 +++-
 3 files changed, 438 insertions(+), 76 deletions(-)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index d2a0066..2f82dc6 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -11,11 +11,24 @@
  *
  * We are willing to use at most maintenance_work_mem (or perhaps
  * autovacuum_work_mem) memory space to keep track of dead tuples.  We
- * initially allocate an array of TIDs of that size, with an upper limit that
+ * initially allocate an array of TIDs of 128MB, or an upper limit that
  * depends on table size (this limit ensures we don't allocate a huge area
- * uselessly for vacuuming small tables).  If the array threatens to overflow,
+ * uselessly for vacuuming small tables).  Additional arrays of increasingly
+ * large sizes are allocated as they become necessary.
+ *
+ * The TID array is thus represented as a list of multiple segments of
+ * varying size, beginning with the initial size of up to 128MB, and growing
+ * exponentially until the whole budget of (autovacuum_)maintenance_work_mem
+ * is used up.
+ *
+ * Lookup in that structure happens with a binary search of segments, and then
+ * with a binary search within each segment.  Since segment's size grows
+ * exponentially, this retains O(log N) lookup complexity.
+ *
+ * If the multiarray's total size threatens to grow beyond maintenance_work_mem,
  * we suspend the heap scan phase and perform a pass of index cleanup and page
- * compaction, then resume the heap scan with an empty TID array.
+ * compaction, then resume the heap scan with an array of logically empty but
+ * already preallocated TID segments to be refilled with more dead tuple TIDs.
  *
  * If we're processing a table with no indexes, we can just vacuum each page
  * as we go; there's no need to save up multiple tuples to minimize the number
@@ -101,6 +114,14 @@
 #define LAZY_ALLOC_TUPLES	MaxHeapTuplesPerPage

 /*
+ * Minimum (starting) size of the dead_tuples array segments. Will allocate
+ * space for 128MB worth of tid pointers in the first segment, further segments
+ * will grow in size exponentially. Don't make it too small or the segment list
+ * will grow bigger than the sweetspot for search efficiency on big vacuums.
+ */
+#define LAZY_INIT_TUPLES	Max(MaxHeapTuplesPerPage, (128<<20) / sizeof(ItemPointerData))
+
+/*
  * Before we consider skipping a page that's marked as clean in
  * visibility map, we must've seen at least this many clean pages.
  */
@@ -112,6 +133,27 @@
  */
 #define PREFETCH_SIZE	((BlockNumber) 32)

+typedef struct DeadTuplesSegment
+{
+	ItemPointerData last_dead_tuple;	/* Copy of the last dead tuple (unset
+										 * until the segment is fully
+										 * populated).  Keep it first to simplify
+										 * binary searches */
+	int			num_dead_tuples;	/* # of entries in the segment */
+	int			max_dead_tuples;	/* # of entries allocated in the segment */
+	ItemPointer	dt_tids;			/* Array of dead tuples */
+} DeadTuplesSegment;
+
+typedef struct DeadTuplesMultiArray
+{
+	int			num_entries;	/* current # of entries */
+	int			max_entries;	/* total # of slots that can be allocated in
+								 * array */
+	int			num_segs;		/* number of dead tuple segments allocated */
+	int			last_seg;		/* last dead tuple segment with data (or 0) */
+	DeadTuplesSegment *dt_segments;	/* array of num_segs segments */
+} DeadTuplesMultiArray;
+
 typedef struct LVRelStats
 {
 	/* hasindex = true means two-pass strategy; false means one-pass */
@@ -132,14 +174,20 @@ typedef struct LVRelStats
 	BlockNumber nonempty_pages;	/* actually, last nonempty page + 1 */
 	/* List of TIDs of tuples we intend to delete */
 	/* NB: this list is
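The lookup scheme the patch comments describe, an outer binary search over segments keyed on each segment's last entry followed by an inner binary search within the chosen segment, can be sketched in miniature like this. Plain ints stand in for ItemPointerData, and the names (Segment, segment_lookup) are illustrative, not taken from the patch.

```c
#include <assert.h>

/* Simplified stand-in for DeadTuplesSegment, with int keys */
typedef struct
{
	int			last_key;		/* copy of the segment's last (largest) entry */
	int			num;			/* entries used in this segment */
	const int  *keys;			/* sorted entries */
} Segment;

/* Two-level binary search: O(log num_segs) + O(log segment size) */
static int
segment_lookup(const Segment *segs, int num_segs, int key)
{
	int			lo = 0, hi = num_segs - 1, seg = -1;

	/* outer search: first segment whose last entry >= key */
	while (lo <= hi)
	{
		int			mid = lo + (hi - lo) / 2;

		if (segs[mid].last_key >= key)
		{
			seg = mid;
			hi = mid - 1;
		}
		else
			lo = mid + 1;
	}
	if (seg < 0)
		return 0;				/* key is beyond every segment */

	/* inner search within the chosen segment */
	lo = 0;
	hi = segs[seg].num - 1;
	while (lo <= hi)
	{
		int			mid = lo + (hi - lo) / 2;

		if (segs[seg].keys[mid] == key)
			return 1;
		if (segs[seg].keys[mid] < key)
			lo = mid + 1;
		else
			hi = mid - 1;
	}
	return 0;
}
```

Because the segments grow exponentially, the number of segments stays logarithmic in the total entry count, which is why the patch comment can claim the combined lookup retains O(log N) complexity overall.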
Re: Vacuum: allow usage of more than 1GB of work mem
On Tue, Apr 3, 2018 at 11:06 AM, Claudio Freire wrote: > I didn't receive your comment, I just saw it. Nevertheless, I rebased the > patches a while ago just because I noticed they didn't apply anymore in > cputube, and they still seem to apply. Sorry, that is false. They appear green in cputube, so I was confident they did apply, but I just double-checked on a recent pull and they don't. I'll rebase them shortly.
Re: Vacuum: allow usage of more than 1GB of work mem
I didn't receive your comment, I just saw it. Nevertheless, I rebased the patches a while ago just because I noticed they didn't apply anymore in cputube, and they still seem to apply. Patch number 2 was committed a long while ago; that's why it's missing. It was a simple patch, and it landed rewritten as commit 7e26e02eec90370dd222f35f00042f8188488ac4
Re: Vacuum: allow usage of more than 1GB of work mem
Hello everyone, I would like to let you know that unfortunately these patches don't apply anymore. Also personally I'm a bit confused by the last message that has 0001- and 0003- patches attached but not the 0002- one.
Re: Vacuum: allow usage of more than 1GB of work mem
The following review has been posted through the commitfest application:

make installcheck-world: tested, passed
Implements feature:      tested, passed
Spec compliant:          tested, passed
Documentation:           tested, passed

I can confirm that these patches don't break anything; the code is well documented, has some tests, and doesn't do anything obviously wrong. However, I would recommend that someone who is more familiar with the VACUUM mechanism than I am recheck these patches. The new status of this patch is: Ready for Committer