Re: [HACKERS] Reducing tuple overhead
On Sun, Jun 7, 2015 at 3:02 PM, Simon Riggs si...@2ndquadrant.com wrote: On 23 April 2015 at 17:24, Andres Freund and...@anarazel.de wrote: It's hard to see how to save space there without reference to a specific use case.

I see different solutions depending upon whether we assume a low number of transactions or a high number of transactions.

I have tried to check theoretically how much difference such a change could give us. Assuming BLKSZ = 8192 bytes, page header = 24 bytes, each line pointer = 4 bytes, and an average tuple of 150 bytes, roughly 53 tuples can be accommodated in one page. Each of these tuples contains 12 bytes of transaction information (xmin, xmax, cid/combocid). Assuming that in an average workload 4 transactions operate on a page at the same time (I think for a workload like pgbench tpc-b it shouldn't be more, otherwise it would have been visible in perf reports), 4 additional tuples [1] could be accommodated on a page, which is approximately 7% savings in space (which in turn means that much less I/O). This gain will vary with tuple size, the number of transactions that can operate on a page, and the page size.

Some additional benefits that I could see from such a change:
1. I think we wouldn't need to traverse the whole page while freezing, so there should be some savings in the freeze operation as well.
2. We might be able to reduce WAL as well, if we can avoid some transaction-related info in the cases where it is currently stored (update/delete).
3. Another gain could come if we want to add transaction information to the index segment as well: if such information can be stored at page level, then there won't be much impact in adding it there, which will help us avoid multiple passes of Vacuum (heap and index could be vacuumed separately, which will definitely help with I/O and bloat reduction).
[1] Calc for 4 additional tuples: (savings from removing transaction info from tuples - new space consumed by page-level transaction info) / new tuple size = (53 * 12 - 12 * 4) / (150 - 12) = 4.26

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
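The arithmetic above is easy to reproduce; here is a quick sketch using the values from the message (the 4-concurrent-transactions figure is an assumption in the original, not a measurement):

```python
# Model of Amit's space-savings estimate; all constants come from the message.
BLKSZ = 8192          # page size in bytes
PAGE_HEADER = 24      # page header size
LINE_POINTER = 4      # per-tuple line pointer
AVG_TUPLE = 150       # average tuple size, including per-tuple transaction info
XACT_INFO = 12        # xmin + xmax + cid/combocid currently stored per tuple
CONCURRENT_XACTS = 4  # assumed transactions touching a page at the same time

tuples_per_page = (BLKSZ - PAGE_HEADER) // (AVG_TUPLE + LINE_POINTER)

# Bytes freed by dropping per-tuple info, minus the new page-level slots.
freed = tuples_per_page * XACT_INFO - CONCURRENT_XACTS * XACT_INFO
extra_tuples = freed / (AVG_TUPLE - XACT_INFO)  # divided by new, smaller tuple size

print(tuples_per_page, round(extra_tuples, 2))  # 53 tuples/page, ~4.26 extra tuples
```

Four extra 138-byte tuples on an 8 KB page is where the "approximately 7%" figure comes from.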
Re: [HACKERS] Reducing tuple overhead
On 23 April 2015 at 17:24, Andres Freund and...@anarazel.de wrote: Split into a new thread, the other one is already growing fast enough. This discussion started at http://archives.postgresql.org/message-id/55391469.5010506%40iki.fi

On April 23, 2015 6:48:57 PM GMT+03:00, Heikki Linnakangas hlinn...@iki.fi wrote: Stop right there. You need to reserve enough space on the page to store an xmax for *every* tuple on the page, because if you don't, what are you going to do when every tuple on the page is deleted by a different transaction? Even if you store the xmax somewhere else than the page header, you need to reserve the same amount of space for them, so it doesn't help at all.

Depends on how you do it and what you optimize for (disk space, runtime, code complexity...). You can e.g. apply a somewhat similar trick to xmin/xmax as is done for cmin/cmax; only the data structure needs to be persistent. In fact, we already have a combocid-like structure for xids that's persistent: multixacts. We could have just one xid saved that's either xmin or xmax (indicated by flag bits), or a multixact. When a tuple is updated/deleted whose xmin is still required, we could replace the former xmin with a multixact; otherwise just change the flag so that it's now an xmax without an xmin. To check visibility, if the xid is a multixact we'd just have to look up the relevant member for the actual xmin and xmax. To avoid excessive overhead when a tuple is repeatedly updated within one session, we could store some of the data in the combocid entry that we need anyway in that case. Whether that's feasible complexity-wise is debatable, but it's certainly possible.

I do wonder what, in realistic cases, is actually the bigger contributor to the overhead: the tuple header, or the padding we liberally add in many cases...

It's hard to see how to save space there without reference to a specific use case.
I see different solutions depending upon whether we assume a low number of transactions or a high number of transactions.

A case we could optimize for is insert-mostly tables. But in that case, if you get rid of the xmax then you still have to freeze the tuples later. I would have thought a better optimization would be to use the xmax for the xid epoch by default, so that such rows would never need freezing. Then at least we are using the xmax for something useful in a larger insert-mostly database rather than just leaving it at zero.

-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
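Andres's single-slot idea quoted above (one xid field that is either xmin, xmax, or a multixact reference, distinguished by flag bits) can be modeled in a few lines. This is purely a toy illustration; the flag names and the `multixacts` table are invented for the sketch and are not PostgreSQL code:

```python
# Toy model of "one xid slot that is either xmin or xmax, or a multixact".
IS_XMAX = 0x1
IS_MULTI = 0x2

multixacts = {}   # multixact id -> (xmin, xmax); the persistent lookup structure
next_multi = [1]

def delete_tuple(tup, deleter_xid, xmin_still_needed):
    """Mark tup (a dict with 'xid' and 'flags') as deleted by deleter_xid."""
    if xmin_still_needed:
        # xmin is still interesting: swap the slot for a multixact carrying both.
        mid = next_multi[0]; next_multi[0] += 1
        multixacts[mid] = (tup["xid"], deleter_xid)
        tup["xid"], tup["flags"] = mid, IS_MULTI
    else:
        # xmin no longer matters: reuse the single slot as xmax.
        tup["xid"], tup["flags"] = deleter_xid, IS_XMAX

def effective_xmin_xmax(tup):
    """Resolve the slot back into (xmin, xmax) for visibility checks."""
    if tup["flags"] & IS_MULTI:
        return multixacts[tup["xid"]]
    if tup["flags"] & IS_XMAX:
        return (None, tup["xid"])
    return (tup["xid"], None)

tup = {"xid": 100, "flags": 0}            # inserted by xid 100
delete_tuple(tup, 200, xmin_still_needed=True)
print(effective_xmin_xmax(tup))           # (100, 200), via the multixact
```

The point of the sketch is that only tuples whose xmin is still live pay for a second xid, which is how the scheme avoids reserving an xmax for every tuple.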
Re: [HACKERS] Reducing tuple overhead
On Thu, Apr 30, 2015 at 6:54 AM, Robert Haas robertmh...@gmail.com wrote: The other, related problem is that the ordering operator might start to return different results than it did at index creation time. For example, consider a btree index built on a text column. Now consider 'yum update'. glibc gets updated, collation ordering of various strings changes, and now you've got tuples that are in the wrong place in the index, because when the index was built, we thought A < B, but now we think B < A. You would think the glibc maintainers might avoid such changes in minor releases, or that the Red Hat guys would avoid packaging and shipping those changes in minor releases, but you'd be wrong.

I would not think that. Unicode Technical Standard #10 states: "Collation order is not fixed. Over time, collation order will vary: there may be fixes needed as more information becomes available about languages; there may be new government or industry standards for the language that require changes; and finally, new characters added to the Unicode Standard will interleave with the previously-defined ones. This means that collations must be carefully versioned."

Also, in Modern B-Tree Techniques, by Goetz Graefe, page 238, it states: In many operating systems, appropriate functions are provided to compute a normalized key from a localized string value, date value, or time value. This functionality is used, for example, to list files in a directory as appropriate for the local language. Adding normalization for numeric data types is relatively straightforward, as is concatenation of multiple normalized values. Database code must not rely on such operating system code, however. The problem with relying on operating system support for database indexes is the update frequency.
An operating system might update its normalization code due to an error or extension in the code or in the definition of a local sort order; it is unacceptable, however, if such an update silently renders existing large database indexes incorrect.

Unfortunately, it is simply not the case that we can rely on OS collations being immutable. We have *no* contract with any C standard library concerning collation stability whatsoever. I'm surprised that we don't hear more about this kind of corruption.

-- Peter Geoghegan

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
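The failure mode Peter and Robert describe can be demonstrated in miniature: build a sorted structure under one ordering, then search it under another. Here byte-wise order stands in for the "old" collation and case-insensitive order for the "new" one; both stand-ins are assumptions chosen for the demo, not actual glibc behaviour:

```python
from bisect import bisect_left

def old_key(s):   # "collation" at index build time: plain byte order
    return s

def new_key(s):   # "collation" after the library update: case-insensitive
    return s.casefold()

# Index built and sorted under the old collation: 'B' < 'a' byte-wise.
index = sorted(["Banana", "apple", "cherry"], key=old_key)

def btree_lookup(keys, target, key):
    """Binary search, modeling a btree descent that trusts the given ordering."""
    pos = bisect_left([key(k) for k in keys], key(target))
    return pos < len(keys) and keys[pos] == target

print(btree_lookup(index, "Banana", old_key))  # True: found where it was put
print(btree_lookup(index, "Banana", new_key))  # False: under the new ordering
                                               # the entry is "in the wrong place"
```

The entry is still physically present, but a descent using the new comparison rules never visits it, which is exactly the "tuples in the wrong place in the index" symptom.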
Re: [HACKERS] Reducing tuple overhead
On Thu, Apr 23, 2015 at 9:54 PM, Andres Freund and...@anarazel.de wrote: Split into a new thread, the other one is already growing fast enough. This discussion started at http://archives.postgresql.org/message-id/55391469.5010506%40iki.fi

On April 23, 2015 6:48:57 PM GMT+03:00, Heikki Linnakangas hlinn...@iki.fi wrote: Stop right there. You need to reserve enough space on the page to store an xmax for *every* tuple on the page, because if you don't, what are you going to do when every tuple on the page is deleted by a different transaction? Even if you store the xmax somewhere else than the page header, you need to reserve the same amount of space for them, so it doesn't help at all.

Depends on how you do it and what you optimize for (disk space, runtime, code complexity...). You can e.g. apply a somewhat similar trick to xmin/xmax as is done for cmin/cmax; only the data structure needs to be persistent.

Today, while reading about how other databases (that store similar information at page level) tackle this problem, I came across a link [1] which indicates that this is controlled by some clauses (options) at table level. The idea seems to be to have some preallocated space for this information in the page (the minimum and maximum values for which can be specified by the user; of course there are default values for them), and if more space than that is required for a concurrent write operation, then the operation needs to wait until space becomes available. I am not sure if this is the best way, as it depends on how the preallocated space for transaction information at page level gets re-used, but I still think it is worth considering.

[1] - https://techiedba.wordpress.com/2011/09/03/what-is-initrans-and-maxtrans/

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
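For reference, the INITRANS/MAXTRANS scheme Amit links to amounts to a fixed per-page array of transaction slots; when all slots are busy, the writer must wait. This is a toy model of the idea only, not of any particular implementation:

```python
class PageXactSlots:
    """Toy model of per-page transaction slots (INITRANS/MAXTRANS style)."""

    def __init__(self, maxtrans=4):
        self.maxtrans = maxtrans   # hard cap on concurrent writers per page
        self.xids = set()          # xids currently holding a slot

    def acquire(self, xid):
        """Return True if xid gets (or already holds) a slot, else False
        to signal that the caller must wait for a slot to free up."""
        if xid in self.xids:
            return True
        if len(self.xids) >= self.maxtrans:
            return False
        self.xids.add(xid)
        return True

    def release(self, xid):
        self.xids.discard(xid)

page = PageXactSlots(maxtrans=2)
print(page.acquire(100))  # True
print(page.acquire(200))  # True
print(page.acquire(300))  # False: page full, this writer would block
page.release(100)
print(page.acquire(300))  # True: a slot freed up
```

The trade-off Amit flags is visible here: a small `maxtrans` saves space but turns concurrent writes into waits, which is presumably why the real feature makes the bounds user-tunable.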
Re: [HACKERS] Reducing tuple overhead
On 4/30/15 7:37 AM, Robert Haas wrote: On Thu, Apr 30, 2015 at 8:05 AM, Simon Riggs si...@2ndquadrant.com wrote: A much better idea is to work out how to avoid index bloat at cause. If we are running an UPDATE and we cannot get a cleanup lock, we give up and do a non-HOT update, causing the index to bloat. It seems better to wait for a short period to see if we can get the cleanup lock. The short period is currently 0, so let's start there and vary the duration of the wait upwards proportionally as the index gets more bloated.

That only happens if there already wasn't enough space on the page and we need to defrag, yes? If there is enough space, we can HOT update without the cleanup lock. What would be useful to know is how often we abort a HOT update because of lack of free space; that would indicate to a DBA that a lower fill factor may be in order. What would be useful to -hackers would be stats on how often an update would have been HOT if only the page had been pruned.

What I'd be worried about there is that it would be very hard to tune the wait time, and that the operating system scheduling granularity (10ms?) would be way too long.

[1] indicates between 0.75 and 6 ms by default on Linux. I think FreeBSD still uses a 1000 Hz scheduler (1 ms), but it's not as clear. What might be more promising is ways to avoid holding a pin for a long time (like the outer side of a nested loop), or being more aggressive about attempting the lock (i.e., lowering the threshold that triggers cleaning). There's also an (in hindsight) questionable bit of logic in heap_page_prune_opt(): once we get the cleanup lock, we check the page free space a second time. If we managed to actually get the lock, we should probably just clean the page anyway.

But I'm in vigorous agreement with you on one point: the solution to index bloat (and probably heap bloat, too) is not to clean it up faster but to create less of it in the first place. Making more updates HOT is one way to do that.

+1.
1: http://stackoverflow.com/questions/16401294/how-to-know-linux-scheduler-time-slice

-- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
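One way the proportional wait Simon proposes (and that Jim worries would be hard to tune) might be parameterized: derive the wait from an estimated index bloat ratio, starting at zero and capping well below scheduler-granularity territory. The constants here are illustrative guesses for the sketch, not recommendations:

```python
def cleanup_lock_wait_us(bloat_ratio, base_us=50, cap_us=2000):
    """How long to retry the cleanup lock before falling back to a
    non-HOT update.  bloat_ratio: estimated dead/total space in the index.
    The current behaviour corresponds to a constant 0."""
    if bloat_ratio <= 0:
        return 0                                   # no bloat: don't wait at all
    return min(cap_us, int(base_us * (1 + 10 * bloat_ratio)))

print(cleanup_lock_wait_us(0.0))   # 0    (today's behaviour)
print(cleanup_lock_wait_us(0.2))   # 150  (mildly bloated: brief spin/retry)
print(cleanup_lock_wait_us(5.0))   # 2000 (cap keeps us under scheduler ticks)
```

Keeping the cap in the low-millisecond range addresses Jim's point that a 10 ms-class sleep would dominate the cost of the update itself.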
Re: [HACKERS] Reducing tuple overhead
On Thu, Apr 30, 2015 at 5:35 PM, Simon Riggs si...@2ndquadrant.com wrote: On 25 April 2015 at 01:12, Amit Kapila amit.kapil...@gmail.com wrote: On Sat, Apr 25, 2015 at 1:58 AM, Jim Nasby jim.na...@bluetreble.com wrote: On 4/23/15 10:40 PM, Amit Kapila wrote:

I agree, and I think one of the major reasons for bloat is that the index segment doesn't have visibility information, due to which clearing of the index needs to be tied to the heap. Now if we can move transaction information to page level, then we can even think of having it in the index segment as well, and then the index can delete/prune its tuples on its own, which can reduce the bloat in the index significantly; and there is a benefit for Vacuum as well.

I don't see how putting visibility at the page level helps indexes at all. We could already put XMIN in indexes if we wanted, but it won't help, because...

We can do that by putting transaction info at tuple level in the index as well, but that will be a huge increase in the size of the index unless we devise a way to have a variable index tuple header rather than a fixed one. Now this has some downsides as well, like Delete needing to traverse the index segment to delete-mark the tuples, but I think the upsides of reducing bloat can certainly outweigh the downsides.

... which isn't possible. You can not go from a heap tuple to an index tuple.

We will have access to the index value during the delete, so why do you think that we need linkage between the heap and index tuple to perform the Delete operation? I think we need to think more about how to design Delete .. by CTID, but that should be doable.

I see some assumptions here that need to be challenged. We can keep xmin and/or xmax on index entries. The above discussion assumes that the information needs to be updated synchronously. We already store visibility information on index entries using the lazily updated killtuple mechanism, so I don't see much problem in setting the xmin in a similar lazy manner. That way, when we use the index, if xmax is set we use it; if it is not, we check the heap. (And then you get to freeze indexes as well ;-( ) Anyway, I have no objection to making the index AM pass visibility information to indexes that wish to know the information, as long as it is provided lazily.

Providing such information lazily can help to an extent, but I think it won't help much in bloat reduction. For example, when an insert tries to insert a row into an index page and finds that there is no space, it can't kill or overwrite any tuple that is actually dead unless that tuple was lazily marked by that time, which is I think one of the main reasons for index bloat.

The second assumption is that if we had visibility information in the index, it would make a difference to bloat. Since, as I mention, we already do have visibility information, I don't see that adding xmax or xmin would make any difference at all to bloat. So -1 to adding it **for that reason**.

The visibility information is only updated when such an index item is accessed (lazily), and by that time the new space for index insertion has already been used, whereas if the information were provided synchronously, the dead space could be reclaimed much earlier (for insertions on a page which has dead tuples), reducing the chances of bloat.

A much better idea is to work out how to avoid index bloat at cause. If we are running an UPDATE and we cannot get a cleanup lock, we give up and do a non-HOT update, causing the index to bloat. It seems better to wait for a short period to see if we can get the cleanup lock. The short period is currently 0, so let's start there and vary the duration of the wait upwards proportionally as the index gets more bloated.

I think this is a separate and another good way of avoiding the bloat, but independent of this, having something like what we discussed above will reduce the chances of bloat even for a non-HOT update in the scenario described by you.

With Regards, Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
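The lazy killtuple behaviour Amit is contrasting with synchronous marking can be modeled in a few lines. This is a toy model of the LP_DEAD-style hinting idea only, not actual nbtree code:

```python
# Toy model: index entries only become reclaimable once a scan hints them.
page = [{"key": k, "dead_in_heap": True, "lp_dead": False} for k in range(4)]

def index_scan(page):
    """A scan that visits an entry, finds the heap tuple dead, and hints it."""
    for e in page:
        if e["dead_in_heap"]:
            e["lp_dead"] = True

def reclaimable(page):
    """An insert under space pressure may only remove *hinted* entries."""
    return sum(e["lp_dead"] for e in page)

print(reclaimable(page))  # 0: all four entries are dead, yet none reclaimable,
                          # so an insert here would have to split/bloat the page
index_scan(page)
print(reclaimable(page))  # 4: only after some scan happens to visit them
```

This is the gap Amit points at: between the heap delete and the first index scan that sets the hints, space-pressured inserts cannot reuse the dead entries, whereas synchronous marking would make them reusable immediately.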
Re: [HACKERS] Reducing tuple overhead
On 25 April 2015 at 01:12, Amit Kapila amit.kapil...@gmail.com wrote: On Sat, Apr 25, 2015 at 1:58 AM, Jim Nasby jim.na...@bluetreble.com wrote: On 4/23/15 10:40 PM, Amit Kapila wrote:

I agree, and I think one of the major reasons for bloat is that the index segment doesn't have visibility information, due to which clearing of the index needs to be tied to the heap. Now if we can move transaction information to page level, then we can even think of having it in the index segment as well, and then the index can delete/prune its tuples on its own, which can reduce the bloat in the index significantly; and there is a benefit for Vacuum as well.

I don't see how putting visibility at the page level helps indexes at all. We could already put XMIN in indexes if we wanted, but it won't help, because...

We can do that by putting transaction info at tuple level in the index as well, but that will be a huge increase in the size of the index unless we devise a way to have a variable index tuple header rather than a fixed one. Now this has some downsides as well, like Delete needing to traverse the index segment to delete-mark the tuples, but I think the upsides of reducing bloat can certainly outweigh the downsides.

... which isn't possible. You can not go from a heap tuple to an index tuple.

We will have access to the index value during the delete, so why do you think that we need linkage between the heap and index tuple to perform the Delete operation? I think we need to think more about how to design Delete .. by CTID, but that should be doable.

I see some assumptions here that need to be challenged. We can keep xmin and/or xmax on index entries. The above discussion assumes that the information needs to be updated synchronously. We already store visibility information on index entries using the lazily updated killtuple mechanism, so I don't see much problem in setting the xmin in a similar lazy manner. That way, when we use the index, if xmax is set we use it; if it is not, we check the heap. (And then you get to freeze indexes as well ;-( ) Anyway, I have no objection to making the index AM pass visibility information to indexes that wish to know the information, as long as it is provided lazily.

The second assumption is that if we had visibility information in the index, it would make a difference to bloat. Since, as I mention, we already do have visibility information, I don't see that adding xmax or xmin would make any difference at all to bloat. So -1 to adding it **for that reason**.

A much better idea is to work out how to avoid index bloat at cause. If we are running an UPDATE and we cannot get a cleanup lock, we give up and do a non-HOT update, causing the index to bloat. It seems better to wait for a short period to see if we can get the cleanup lock. The short period is currently 0, so let's start there and vary the duration of the wait upwards proportionally as the index gets more bloated.

-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [HACKERS] Reducing tuple overhead
On Thu, Apr 30, 2015 at 8:05 AM, Simon Riggs si...@2ndquadrant.com wrote: A much better idea is to work out how to avoid index bloat at cause. If we are running an UPDATE and we cannot get a cleanup lock, we give up and do a non-HOT update, causing the index to bloat. It seems better to wait for a short period to see if we can get the cleanup lock. The short period is currently 0, so let's start there and vary the duration of the wait upwards proportionally as the index gets more bloated.

What I'd be worried about there is that it would be very hard to tune the wait time, and that the operating system scheduling granularity (10ms?) would be way too long.

But I'm in vigorous agreement with you on one point: the solution to index bloat (and probably heap bloat, too) is not to clean it up faster but to create less of it in the first place. Making more updates HOT is one way to do that.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Reducing tuple overhead
On Thu, Apr 30, 2015 at 12:31 AM, Amit Kapila amit.kapil...@gmail.com wrote: I think I am missing something here, but when is this second evaluation needed? Basically, what I understand from index insertion is that it evaluates the value to be inserted in the index before calling the nbtree module, and then nbtree just inserts the value/tuple passed to it.

Sure, but what happens if it doesn't evaluate to the same value? Consider a tuple where a = 1 and a function f(a). You insert the tuple, evaluate f(a), and get 17. So you insert an index tuple into the btree with a value of 17, pointing at the tuple. Now you delete the tuple, evaluate f(a) again, and this time you get 42. You search the btree for an index tuple with a value of 42, and you don't find one. But the index tuple is still there.

With the current approach, that doesn't happen, because we effectively search the entire index for tuples pointing at the heap tuple we're trying to get rid of. The only problem with that is that it's crushingly expensive when the index is large and the number of tuples we're cleaning out is comparatively small. But that's a big problem.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
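Robert's scenario can be played out with a toy expression index. `f` here stands for a function wrongly labeled IMMUTABLE; the 17/42 values are taken from his example:

```python
# Toy expression index: computed value -> set of heap ctids pointing back.
index = {}

def f_at_insert(a):   # "immutable" expression, behaviour at insert time
    return 17

def f_at_delete(a):   # the same expression after its behaviour silently changed
    return 42

def index_insert(f, a, ctid):
    index.setdefault(f(a), set()).add(ctid)

def index_delete_by_value(f, a, ctid):
    # Re-evaluates the expression and searches under the *new* value,
    # so it never finds the entry filed under the old one.
    index.get(f(a), set()).discard(ctid)

index_insert(f_at_insert, a=1, ctid=(0, 1))
index_delete_by_value(f_at_delete, a=1, ctid=(0, 1))

print(index)  # {17: {(0, 1)}} -- a dead entry nobody will ever find or clean up
```

The current whole-index scan avoids this because it matches on ctid rather than on a re-computed value, at the cost Robert describes.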
Re: [HACKERS] Reducing tuple overhead
On Thu, Apr 30, 2015 at 5:03 PM, Robert Haas robertmh...@gmail.com wrote: On Thu, Apr 30, 2015 at 12:31 AM, Amit Kapila amit.kapil...@gmail.com wrote: I think I am missing something here, but when is this second evaluation needed? Basically, what I understand from index insertion is that it evaluates the value to be inserted in the index before calling the nbtree module, and then nbtree just inserts the value/tuple passed to it.

Sure, but what happens if it doesn't evaluate to the same value? Consider a tuple where a = 1 and a function f(a). You insert the tuple, evaluate f(a), and get 17. So you insert an index tuple into the btree with a value of 17, pointing at the tuple. Now you delete the tuple, evaluate f(a) again, and this time you get 42.

As the index expression contains table columns, and all the functions or operators used in the expression must be IMMUTABLE, won't that guarantee that such a situation is avoided? If not, then I think for such indexes we might need to search for the tuple offset in the entire index and delete-mark it, or maybe for such indexes follow the current mechanism. I think it will still give us a lot of benefit in the more common cases.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Reducing tuple overhead
On Thu, Apr 30, 2015 at 9:46 AM, Amit Kapila amit.kapil...@gmail.com wrote: As the index expression contains table columns, and all the functions or operators used in the expression must be IMMUTABLE, won't that guarantee that such a situation is avoided?

The concern is that they might be labeled as immutable but not actually behave that way. The other, related problem is that the ordering operator might start to return different results than it did at index creation time. For example, consider a btree index built on a text column. Now consider 'yum update'. glibc gets updated, collation ordering of various strings changes, and now you've got tuples that are in the wrong place in the index, because when the index was built, we thought A < B, but now we think B < A. You would think the glibc maintainers might avoid such changes in minor releases, or that the Red Hat guys would avoid packaging and shipping those changes in minor releases, but you'd be wrong. We've seen cases where the master and the standby were both running RHEL X.Y (same X.Y on both servers) but they didn't agree on the collation definitions, so queries that worked on the master failed on the slave.

If not, then I think for such indexes we might need to search for the tuple offset in the entire index and delete-mark it, or maybe for such indexes follow the current mechanism.

I think that *is* the current mechanism.

I think it will still give us a lot of benefit in the more common cases.

It's hard to figure out exactly what can work here. Aside from correctness issues, the biggest problem with refinding the index tuples one-by-one and killing them is that it may suck for bulk operations. When you delete one tuple from the table, refinding the index tuples and killing them immediately is probably the smartest thing you can do. If you postpone it, somebody may be forced to split the page because it's full, whereas if you do it right away, the next guy who wants to put a tuple on that page may be able to make it fit.
That's a big win. But if you want to delete 10 million tuples, doing an index scan for each one may induce a tremendous amount of random I/O on the index pages, possibly visiting and dirtying the same pages more than once. Scanning the whole index for tuples to kill avoids that.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
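Robert's bulk-delete trade-off can be made concrete with a back-of-the-envelope cost model. The constants are arbitrary; only the crossover behaviour matters:

```python
from math import ceil, log2

def retail_cost(n_deleted, index_tuples):
    """One btree descent per deleted tuple: ~log2(N) random page reads each."""
    return n_deleted * ceil(log2(index_tuples))

def bulk_cost(index_tuples, tuples_per_page=100):
    """One sequential pass over every page of the index."""
    return index_tuples // tuples_per_page

N = 100_000_000                     # index tuples
print(retail_cost(1, N))            # 27 page reads: retail wins for one delete
print(bulk_cost(N))                 # 1_000_000 pages, but read sequentially
print(retail_cost(10_000_000, N))   # 270_000_000 random reads: retail loses badly
```

In reality the crossover sits even further in bulk's favour, since the retail reads are random while the full scan is sequential, but even this crude count shows both of Robert's regimes.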
Re: [HACKERS] Reducing tuple overhead
On Thu, Apr 30, 2015 at 7:24 PM, Robert Haas robertmh...@gmail.com wrote: On Thu, Apr 30, 2015 at 9:46 AM, Amit Kapila amit.kapil...@gmail.com wrote: As the index expression contains table columns, and all the functions or operators used in the expression must be IMMUTABLE, won't that guarantee that such a situation is avoided?

The concern is that they might be labeled as immutable but not actually behave that way. The other, related problem is that the ordering operator might start to return different results than it did at index creation time. For example, consider a btree index built on a text column. Now consider 'yum update'. glibc gets updated, collation ordering of various strings changes, and now you've got tuples that are in the wrong place in the index, because when the index was built, we thought A < B, but now we think B < A. You would think the glibc maintainers might avoid such changes in minor releases, or that the Red Hat guys would avoid packaging and shipping those changes in minor releases, but you'd be wrong. We've seen cases where the master and the standby were both running RHEL X.Y (same X.Y on both servers) but they didn't agree on the collation definitions, so queries that worked on the master failed on the slave.

This is quite similar to the IMMUTABLE property for a function, but for an operator. I think guaranteeing the immutable nature of a function is the job of the application/user. Oracle uses something similar (DETERMINISTIC functions) for function-based indexes and asks the user to ensure the DETERMINISTIC property of the function. I think having synchronous delete (delete from the index at the same time as from the heap) is one of the main differentiating factors for not having bloat in indexes in some of the other databases.
If we don't want to change the current property of functions or operators for indexes, then we could have a new type of index which has visibility information, and users would be advised to use such an index only where they can ensure that the functions or operators for that index are truly IMMUTABLE. Here, there is an argument that users might or might not be able to ensure that, but I think if it can be ensured by users of other successful databases, then the same should be possible for PostgreSQL users as well; after all, this can bring a lot of value to the table (avoiding index bloat) for users.

I think it will still give us a lot of benefit in the more common cases.

It's hard to figure out exactly what can work here. Aside from correctness issues, the biggest problem with refinding the index tuples one-by-one and killing them is that it may suck for bulk operations. When you delete one tuple from the table, refinding the index tuples and killing them immediately is probably the smartest thing you can do. If you postpone it, somebody may be forced to split the page because it's full, whereas if you do it right away, the next guy who wants to put a tuple on that page may be able to make it fit. That's a big win.

Exactly, I have that big win in my mind.

But if you want to delete 10 million tuples, doing an index scan for each one may induce a tremendous amount of random I/O on the index pages, possibly visiting and dirtying the same pages more than once. Scanning the whole index for tuples to kill avoids that.

Even doing it only from the heap might not be cheap. I think for doing a BULK delete (as you described), users of other databases have some smart ways, like:
a. If possible, drop the indexes.
b. Instead of deleting the records you don't want, SELECT the records you do want into a new table, drop the old table, and rename the new one.
c. Archive instead of delete: instead of deleting, just set a flag to mark a row as archived, invalid, deleted, etc. This avoids the immediate overhead of index maintenance.
Ignore the rows temporarily and let some other job delete them later. The large downside to this is that it affects any query on the table. ... possibly some other ways. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
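Amit's option (c), archiving via a flag instead of deleting, is the classic soft-delete pattern. A minimal toy model (Python stands in for what would really be an `archived` column plus filtered queries; every name here is invented for the sketch):

```python
rows = [{"id": i, "archived": False} for i in range(5)]

def soft_delete(row_id):
    # No index maintenance happens now; a background job reclaims space later.
    for r in rows:
        if r["id"] == row_id:
            r["archived"] = True

def visible(rows):
    # The downside Amit notes: *every* query must now filter on the flag.
    return [r for r in rows if not r["archived"]]

soft_delete(2)
print([r["id"] for r in visible(rows)])  # [0, 1, 3, 4]
```

The deleted row keeps occupying heap and index space until the deferred cleanup runs, which is exactly the trade: deferred index maintenance in exchange for a filter on every query.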
Re: [HACKERS] Reducing tuple overhead
Robert Haas wrote: On Mon, Apr 27, 2015 at 5:01 PM, Jim Nasby jim.na...@bluetreble.com wrote: The problem with just having the value is that if *anything* changes between how you evaluated the value when you created the index tuple and when you evaluate it a second time, you'll corrupt your index. This is actually an incredibly easy problem to have; witness how we allowed indexing timestamptz::date until very recently. That was clearly broken, but because we never attempted to re-run the index expression to do vacuuming, at least we never corrupted the index itself.

True. But I guess what I don't understand is: how big a deal is this, really? The uncorrupted index can still return wrong answers to queries. The fact that you won't end up with index entries pointing to completely unrelated tuples is nice, but if index scans are missing tuples that they should see, aren't you still pretty hosed?

As I recall, there's a desire not to run expressions when vacuuming, because running them means getting a snapshot, in case any functions look at database state; and you can't get a snapshot while vacuuming, because that makes the vacuum visible to concurrent processes; in particular, snapshots keep all processes' xmin from moving forward.

-- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [HACKERS] Reducing tuple overhead
On Mon, Apr 27, 2015 at 5:01 PM, Jim Nasby jim.na...@bluetreble.com wrote: The problem with just having the value is that if *anything* changes between how you evaluated the value when you created the index tuple and when you evaluate it a second time you'll corrupt your index. This is actually an incredibly easy problem to have; witness how we allowed indexing timestamptz::date until very recently. That was clearly broken, but because we never attempted to re-run the index expression to do vacuuming at least we never corrupted the index itself.

True. But I guess what I don't understand is: how big a deal is this, really? The uncorrupted index can still return wrong answers to queries. The fact that you won't end up with index entries pointing to completely unrelated tuples is nice, but if index scans are missing tuples that they should see, aren't you still pretty hosed?

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Reducing tuple overhead
On 4/29/15 12:18 PM, Robert Haas wrote: On Mon, Apr 27, 2015 at 5:01 PM, Jim Nasby jim.na...@bluetreble.com wrote: The problem with just having the value is that if *anything* changes between how you evaluated the value when you created the index tuple and when you evaluate it a second time, you'll corrupt your index. This is actually an incredibly easy problem to have; witness how we allowed indexing timestamptz::date until very recently. That was clearly broken, but because we never attempted to re-run the index expression to do vacuuming, at least we never corrupted the index itself.

True. But I guess what I don't understand is: how big a deal is this, really? The uncorrupted index can still return wrong answers to queries. The fact that you won't end up with index entries pointing to completely unrelated tuples is nice, but if index scans are missing tuples that they should see, aren't you still pretty hosed?

Maybe, maybe not. You could argue it's better to miss some rows than to have completely unrelated ones. My recollection is that there are other scenarios where this causes problems, but that's from several years ago, and I wasn't able to find anything in a quick search of the archives. I've wondered the same in the past, and Tom had reasons it was bad, but perhaps they're overstated.

-- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: [HACKERS] Reducing tuple overhead
On Tue, Apr 28, 2015 at 2:31 AM, Jim Nasby jim.na...@bluetreble.com wrote: On 4/25/15 12:12 AM, Amit Kapila wrote: ... which isn't possible. You can not go from a heap tuple to an index tuple. We will have access to the index value during delete, so why do you think we need a linkage between heap and index tuples to perform the Delete operation? I think we need to think more about how to design Delete-by-CTID, but that should be doable. The problem with just having the value is that if *anything* changes between how you evaluated the value when you created the index tuple and when you evaluate it a second time you'll corrupt your index. I think I am missing something here: when is this second evaluation needed? Basically, what I understand from index insertion is that it evaluates the value to be inserted into the index before calling the nbtree module, and then nbtree just inserts the value/tuple passed to it. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Reducing tuple overhead
On 4/25/15 12:12 AM, Amit Kapila wrote: ... which isn't possible. You can not go from a heap tuple to an index tuple. We will have the access to index value during delete, so why do you think that we need linkage between heap and index tuple to perform Delete operation? I think we need to think more to design Delete .. by CTID, but that should be doable. The problem with just having the value is that if *anything* changes between how you evaluated the value when you created the index tuple and when you evaluate it a second time you'll corrupt your index. This is actually an incredibly easy problem to have; witness how we allowed indexing timestamptz::date until very recently. That was clearly broken, but because we never attempted to re-run the index expression to do vacuuming at least we never corrupted the index itself. This has been discussed in the past. I have tried to search in archive, but not getting what is the exact problem. Unfortunately I can't find prior discussion now either... :/ -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
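To make the timestamptz::date hazard concrete, here is a small Python sketch (illustrative only, not PostgreSQL code): the date a given instant falls on depends on the session's TimeZone setting, so re-evaluating the index expression under a different setting can yield a different key than the one originally indexed.

```python
# Why indexing timestamptz::date was broken: the cast's result depends
# on the session TimeZone, so it is not a stable (immutable) function
# of the indexed value. Illustrated here with Python's datetime.
from datetime import datetime, timezone, timedelta

# One fixed instant: 23:30 UTC on 2015-04-23.
instant = datetime(2015, 4, 23, 23, 30, tzinfo=timezone.utc)

# The "date" of that instant differs by zone: evaluated under UTC at
# index creation, then re-evaluated under UTC+9 later, the expression
# yields two different index keys for the same stored value.
date_under_utc = instant.date()
date_under_utc9 = instant.astimezone(timezone(timedelta(hours=9))).date()

print(date_under_utc)   # 2015-04-23
print(date_under_utc9)  # 2015-04-24
```

As Jim notes, the index itself escaped corruption only because vacuum never re-ran the expression; the stored key is still wrong for queries run under a different TimeZone.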
Re: [HACKERS] Reducing tuple overhead
On Sat, Apr 25, 2015 at 1:58 AM, Jim Nasby jim.na...@bluetreble.com wrote: On 4/23/15 10:40 PM, Amit Kapila wrote: I agree with you. I think one of the major reasons for bloat is that the index segment doesn't have visibility information, which is why clearing the index needs to be tied to the heap. Now if we can move transaction information to the page level, then we can even think of having it in the index segment as well, and then the index can delete/prune its tuples on its own, which can reduce index bloat significantly; there is a benefit to Vacuum as well. I don't see how putting visibility at the page level helps indexes at all. We could already put XMIN in indexes if we wanted, but it won't help, because... We can do that by putting transaction info at the tuple level in the index as well, but that would be a huge increase in index size unless we devise a way to have a variable index tuple header rather than a fixed one. Now this has some downsides as well, such as Delete needing to traverse the index segment to delete-mark the tuples, but I think the upside of reducing bloat can certainly outweigh the downsides. ... which isn't possible. You can not go from a heap tuple to an index tuple. We will have access to the index value during delete, so why do you think we need a linkage between heap and index tuples to perform the Delete operation? I think we need to think more about how to design Delete-by-CTID, but that should be doable. This has been discussed in the past. I have tried to search the archives, but couldn't find what the exact problem is. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Reducing tuple overhead
On 4/23/15 10:40 PM, Amit Kapila wrote: I agree with you and what I think one of the major reasons of bloat is that Index segment doesn't have visibility information due to which clearing of Index needs to be tied along with heap. Now if we can move transaction information at page level, then we can even think of having it in Index segment as well and then Index can delete/prune it's tuples on it's own which can reduce the bloat in index significantly and there is a benefit to Vacuum as well. I don't see how putting visibility at the page level helps indexes at all. We could already put XMIN in indexes if we wanted, but it won't help, because... Now this has some downsides as well like Delete needs to traverse Index segment as well to Delete mark the tuples, but I think the upsides of reducing bloat can certainly outweigh the downsides. ... which isn't possible. You can not go from a heap tuple to an index tuple. This has been discussed in the past. If we could do that then vacuum would become REALLY cheap compared to today. BTW, before actually tackling anything we should try and get more data from Robert/Jan about where the extra 80% came from. We don't know if it's indexes or what. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: [HACKERS] Reducing tuple overhead
On 04/23/2015 09:42 AM, Jim Nasby wrote: On 4/23/15 11:24 AM, Andres Freund wrote: I do wonder what, in realistic cases, is actually the bigger contributor to the overhead. The tuple header or the padding we liberally add in many cases... Assuming you're talking about padding between fields... Several years ago Enova paid Command Prompt to start work on logical column ordering, and part of the motivation for that was to allow re-ordering physical tuples into the most efficient on-disk format possible. I think I did some tests re-arranging some tables into the theoretically most efficient order and measuring heap size. I think there was some modest size improvement, maybe 10-15%? This was several years ago so it's all foggy. Maybe Josh can find some of this in CMD's ticketing system? Yeah I dug around. I don't see anything about size improvement but here are our notes: Alvaro said: I ended up not producing notes as regularly as I had initially hoped. To try and make up for it, here's an update covering everything I've done since I started working on this issue. This patch turned out to be completely different than what we had initially thought. We had thought it was going to be a matter of finding out places that used attnum and replacing it with either attnum, attlognum or attphysnum, depending on what order was necessary on any given spot. This wasn't an easy thing to do because there are several hundred of those. So it was supposed to be amazingly time-consuming and rather boring work. This has nothing to do with reality: anywhere from the parser down to the optimizer and executor, the way things work is that a list of attributes is built, processed, and referenced. Some places assume that the list is in a certain order that's always the same order for those three cases. So the way to develop this feature is to change those places so that instead of receiving the list in one of these orders, they instead receive it in a different order. 
So what I had to do early on was find a way to retrieve the sort order from catalogs, preserve it when TupleDescriptors are built, and ensure the attribute list is extracted from TupleDesc in the correct order. But it turned out that this is not enough, because down in the parser guts, a target list is constructed; and later, a TupleDescriptor is built from the target list. So it's necessary to preserve the sorting info from the original tuple descriptor into the target list (which means adding order info to Var and TargetEntry nodes), so that the new TupleDesc can also have it. Today I'm finding that even more than that is necessary. It turns out that the RangeTableEntries (i.e. the entries in the FROM clause of a query) have an item dubbed eref which is a list of column names; due to my changes in the earlier parser stages, this list is sorted in logical column order; but the code to resolve things such as columns used in JOIN/ON clauses walks the list (which is in logical order) and then uses the number of times it had to walk the elements in the list to construct a Var (column reference) in attnum order -- so it finds a different column, and it all fails. So what I'm doing now is modifying the RangeTableEntry node to keep a mapping list of logical to identity numbers. Then I'll have to search for places using rte->eref->colnames and make sure that they correctly use attlognum as an index into it. And then later: First of all I should note that I discussed the approach mentioned above on pgsql-hackers and got a very interesting comment from Tom Lane that adding sorting info to Var and TargetEntry nodes was not a very good idea because it'd break stored rules whenever a table column changed. So I went back and studied that code and noticed that it was really the change in RangeTableEntry that's doing the good magic; those other changes are fortunately not necessary. (Though they were a necessary vehicle for me to understand how the other stuff works.) 
I've been continuing to study the backend code looking for uses of attribute lists that assume a single ordering. As I get more into it, more complex cases appear. The number of cases is fortunately bounded, though. Most of the uses of straight attribute lists are in places that do not require modification, or require little work or thought to update correctly. However, some other places are not like that. I have fixed SQL functions twice now; I believed the second fix (which seemed mostly correct) was going to be the final one, but I just found out that it's not, and the proper fix is going to require something a bit more low-level (namely, a projection step that reorders columns correctly after the fact). Fortunately, I believe that this extra projection step is going to fix a lot of other cases too, which I originally had no idea how to attack. Moreover, understanding that bit means I also figured out what Tom Lane meant
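The eref resolution bug described in these notes can be sketched in a few lines of Python (a toy model with hypothetical column names, not actual PostgreSQL structures): walking a logically-ordered name list and using the walk position as the physical attribute number picks the wrong column unless a logical-to-physical map is consulted.

```python
# Toy model of the bug: column names listed in logical (display)
# order, while Vars must reference physical attribute numbers.
physical = ["id", "payload", "flag"]   # attnum order: 1, 2, 3
logical = ["id", "flag", "payload"]    # logical (display) order
lognum_to_attnum = [physical.index(name) + 1 for name in logical]

def resolve_buggy(name):
    # Walks the logical list and uses the position as the attnum,
    # as the pre-fix JOIN/ON resolution code effectively did.
    return logical.index(name) + 1

def resolve_fixed(name):
    # Maps the logical position back to the real attnum.
    return lognum_to_attnum[logical.index(name)]

print(resolve_buggy("payload"))  # 3 -- attnum of "flag": wrong column
print(resolve_fixed("payload"))  # 2 -- the correct attnum
```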
Re: [HACKERS] Reducing tuple overhead
On 23/04/15 18:24, Andres Freund wrote: Whether that's feasible complexity wise is debatable, but it's certainly possible. I do wonder what, in realistic cases, is actually the bigger contributor to the overhead. The tuple header or the padding we liberally add in many cases... The logical ordering patch + auto optimizations of column layout on table creation/rewrite might help partially there. But what seems to be clear is that we need more in depth analysis of what really contributes most to the tuple size in various use-cases and then we can debate what we can do about it. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
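As a rough illustration of how much the padding can matter, here is a back-of-the-envelope sketch. The (size, alignment) pairs are assumptions that mirror PostgreSQL's typical typalign rules (int8/timestamptz double-aligned, int4 on 4 bytes, and so on); it ignores the tuple header and uses made-up columns, so it is only a toy model of the idea, not actual catalog data.

```python
# Toy model of alignment padding in a row's data area. Assumed
# (size, alignment) pairs roughly match PostgreSQL: int8/timestamptz
# align to 8 bytes, int4 to 4, int2 to 2, bool to 1.

def padded_size(cols):
    """Total bytes consumed when each column starts on its alignment."""
    off = 0
    for size, align in cols:
        off = (off + align - 1) // align * align  # round up to alignment
        off += size
    return off

# Declared order: bool, int8, int2, timestamptz, int4
declared = [(1, 1), (8, 8), (2, 2), (8, 8), (4, 4)]
# Same columns, sorted by descending alignment (the "auto layout" idea).
optimized = sorted(declared, key=lambda c: c[1], reverse=True)

print(padded_size(declared))   # 36 bytes
print(padded_size(optimized))  # 23 bytes -- the padding is gone
```

An automatic rewrite into the optimized order would be invisible at the SQL level if logical and physical column order were decoupled, which is exactly what the logical ordering patch is after.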
Re: [HACKERS] Reducing tuple overhead
Thanks for posting this. Joshua D. Drake wrote: First of all I should note that I discussed the approach mentioned above on pgsql-hackers and got a very interesting comment from Tom Lane that adding sorting info to Var and TargetEntry nodes was not a very good idea because it'd break stored rules whenever a table column changed. So I went back and studied that code and noticed that it was really the change in RangeTableEntry that's doing the good magic; those other changes are fortunately not necessary. (Though they were a necessary vehicle for me to understand how the other stuff works.) So in the logical columns thread I was saying that this change of eref didn't work either; it seems that most of the time (if not always) the list should continue to be in attnum order, and the logical-ordering-info data should be obtained from the tuple descriptor and whatever needs to be sorted should be sorted afterwards, i.e. in setrefs.c, using data previously stored in RelOptInfo. I tried doing that and ran into some other mess elsewhere. I've been continuing to study the backend code looking for uses of attribute lists that assume a single ordering. As I get more into it, more complex cases appear. The number of cases is fortunately bounded, though. he hopes, and eventually gives up However, some other places are not like that. I have fixed SQL functions two times now, and I just found out that the second fix (which I believed to be mostly correct) was to be the final one, but I found out just now that it's not, and the proper fix is going to require something a bit more low-level (namely, a projection step that reorders columns correctly after the fact). Fortunately, I believe that this extra projection step is going to fix a lot of other cases too, which I originally had no idea how to attack. Moreover, understanding that bit means I also figured out what Tom Lane meant on the second half of his response to my original pgsql-hackers comment. 
So I think we're good on that front. I had forgotten my intention to rework things using projection. In any case, I have posted a lot of patches now which others could study, if there's interest. I would sure like to see this split, and there are multiple benefits such as reduction of padding space. I'm sure there's a nonempty set of guys brighter than me that could figure it out in not that much time. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
Re: [HACKERS] Reducing tuple overhead
On Fri, Apr 24, 2015 at 1:03 AM, Jim Nasby jim.na...@bluetreble.com wrote: On 4/23/15 11:45 AM, Petr Jelinek wrote: On 23/04/15 18:24, Andres Freund wrote: Whether that's feasible complexity wise is debatable, but it's certainly possible. I do wonder what, in realistic cases, is actually the bigger contributor to the overhead. The tuple header or the padding we liberally add in many cases... The logical ordering patch + auto optimizations of column layout on table creation/rewrite might help partially there. But what seems to be clear is that we need more in depth analysis of what really contributes most to the tuple size in various use-cases and then we can debate what we can do about it. Also, what Robert posted was that while we started at something like 15%-30% larger, we ended the test at 80% larger. That makes me think that the bigger win is not in reducing tuple size but tackling bloat. I agree with you. I think one of the major reasons for bloat is that the index segment doesn't have visibility information, which is why clearing the index needs to be tied to the heap. Now if we can move transaction information to the page level, then we can even think of having it in the index segment as well, and then the index can delete/prune its tuples on its own, which can reduce index bloat significantly; there is a benefit to Vacuum as well. Now this has some downsides as well, such as Delete needing to traverse the index segment to delete-mark the tuples, but I think the upside of reducing bloat can certainly outweigh the downsides. In short, moving transaction information to the page level can not only reduce the tuple size but also help in reducing bloat. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
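The space-saving side of this argument can be made concrete with the rough numbers Amit uses elsewhere in the thread (8192-byte pages, 24-byte page header, 4-byte line pointers, 150-byte average tuples carrying 12 bytes of transaction info, and an assumed 4 concurrent transactions per page). This is arithmetic on assumptions, not a measurement:

```python
# Back-of-the-envelope: tuples per page today vs. with the per-tuple
# transaction info (xmin, xmax, cid/combocid) moved to a few
# page-level slots. All figures are assumptions from the thread.
BLK_SZ = 8192
PAGE_HEADER = 24
LINE_POINTER = 4
AVG_TUPLE = 150       # includes 12 bytes of transaction info
XACT_INFO = 12
PAGE_XACT_SLOTS = 4   # assumed concurrent transactions per page

tuples_now = (BLK_SZ - PAGE_HEADER) // (AVG_TUPLE + LINE_POINTER)

slim_tuple = AVG_TUPLE - XACT_INFO
usable = BLK_SZ - PAGE_HEADER - PAGE_XACT_SLOTS * XACT_INFO
tuples_then = usable // (slim_tuple + LINE_POINTER)

print(tuples_now, tuples_then)  # 53 vs 57: ~4 extra tuples per page
```

Four extra tuples out of 53 is roughly the 7% I/O saving claimed upthread; the gain obviously shifts with tuple size, page size, and how many transactions actually touch a page concurrently.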
Re: [HACKERS] Reducing tuple overhead
On 4/23/15 11:45 AM, Petr Jelinek wrote: On 23/04/15 18:24, Andres Freund wrote: Whether that's feasible complexity wise is debatable, but it's certainly possible. I do wonder what, in realistic cases, is actually the bigger contributor to the overhead. The tuple header or the padding we liberally add in many cases... The logical ordering patch + auto optimizations of column layout on table creation/rewrite might help partially there. But what seems to be clear is that we need more in depth analysis of what really contributes most to the tuple size in various use-cases and then we can debate what we can do about it. Also, what Robert posted was that while we started at something like 15%-30% larger, we ended the test at 80% larger. That makes me think that the bigger win is not in reducing tuple size but tackling bloat. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com