Re: [PATCHES] HOT patch - version 15
On Mon, 2007-09-10 at 09:03 +0100, Heikki Linnakangas wrote: As I understand it, there are two HOT features: single-chain pruning, which trims HOT chains but doesn't reuse the space, and defragmentation, which prunes the entire page, reuses space, and handles deleted rows, etc. I'm repeating myself, but I would split that second operation into two parts while we think about this: pruning the entire page, and defragmenting (PageRepairFragmentation). Tom wondered why they're separated in the patch. As the patch stands, there is no reason, but I feel that separating them and doing them at different times might be an important piece in the puzzle. Yes, it does seem important to keep that separation, since it is possible to do these things at different times, even if, for now, we choose not to. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PATCHES] HOT patch - version 15
Bruce Momjian wrote: Tom Lane wrote: The we-already-pinned-the-page problem is a bit nasty but may not be insurmountable. If you find a way, that would be great. As I understand it, there are two HOT features: single-chain pruning, which trims HOT chains but doesn't reuse the space, and defragmentation, which prunes the entire page, reuses space, and handles deleted rows, etc. I'm repeating myself, but I would split that second operation into two parts while we think about this: pruning the entire page, and defragmenting (PageRepairFragmentation). Tom wondered why they're separated in the patch. As the patch stands, there is no reason, but I feel that separating them and doing them at different times might be an important piece in the puzzle. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] HOT patch - version 15
On 9/10/07, Heikki Linnakangas [EMAIL PROTECTED] wrote: I'm repeating myself, but I would split that second operation into two parts while we think about this: pruning the entire page, and defragmenting (PageRepairFragmentation). Tom wondered why they're separated in the patch. As the patch stands, there is no reason, but I feel that separating them and doing them at different times might be an important piece in the puzzle. I think we all agree that defragmentation should be done less aggressively, and ideally when we are going to insert a tuple in the page and are running low on free space in the page. If we could really do it in heap_update(), after we figure out that there is not enough free space available in the page, that would be great. We already have an exclusive lock on the page, and hopefully we also have the vacuum-lock. (This is how an earlier version of the patch used to work, before we took out all the complexity associated with tracking LP_DELETEd tuples and fragmented tuples and replaced it with a simple PageRepairFragmentation. Since the earlier version never used to call PageRepairFragmentation, we could easily do pruning in heap_update().) If we can't find a way to defragment inside heap_update(), then we have three choices: 1. Pass a hint to heap_fetch that an UPDATE may follow. If we are running low on free space, we try to prune and defragment at that point. 2. Move some of the work to bgwriter. 3. Use some heuristic to decide whether to defragment or not. I am not sure how easy it is to know a priori that we are fetching a tuple for UPDATE. Assuming we can, this seems like a good idea. Eventually we may want to move some work to bgwriter or some other background process, but my take would be to save that for 8.4. As a starting point, we are doing 3 in the patch. We always try to keep just one tuple's worth of space free in the page.
So if an UPDATE takes up the remaining free space in the page, the page will be pruned/defragged in the subsequent lookup (index or seq). In a read-mostly scenario, only the first lookup would trigger prune/defrag, but the next many lookups would be quick. In an update-mostly scenario, most of the heap_fetches would anyway end up doing an update, so doing pruning/defragmentation in heap_fetch shouldn't be too bad. For a balanced scenario, we might be moving the cost of pruning/defragmentation to the SELECT path, but UPDATEs would be quick. The other issue is when to prune the HOT chain. I tend to agree that the chances of having long HOT chains are low and the cost of following the chain won't be too significant. At the same time, we probably don't want to live with very long chains forever. An example would be: a tuple gets HOT-updated several times in a transaction. If the page is never updated again, the long HOT chain would never be pruned. I like Simon's idea of pruning the chain if it goes beyond a limit. Instead of adding any significant complexity to track the length of each HOT chain, we can just mark the page with a flag if heap_hot_fetch() detects a long chain. The next lookup (index or seq) of the page will try to prune the page and clear the flag if successful. I also agree that we should avoid taking an exclusive lock before checking free space in heap_page_prune_defrag(). That might be unsafe, but we don't care if we occasionally skip the maintenance work (or do it a little early). Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] HOT patch - version 15
Bruce Momjian wrote: (Can someone time the access time for following a chain that fills an entire page (the worst case) vs. having a single tuple on the page?) Here are some results, on my laptop. The test case is a table with a single integer column, and a single row in the table. The row is updated 250 times to make a 251-tuple-long chain. The page can hold 255 tuples, so this is pretty much the worst case. After that, the row is fetched 100 times, in a PL/pgSQL function. I ran the tests with enable_seqscan on and off, see the seqscan and idxscan rows below. Numbers are the times spent doing the fetches, in seconds. I repeated each test a few times to make sure that the results are repeatable; they seem to be repeatable to ~0.1s precision. The test script used is attached. Autovacuum was disabled, and shared_buffers=320MB, otherwise all settings were left at defaults.

         HEAD   HOT    HOT-opt  HOT-pruned
seqscan  19.9   21.1   20.1     11.5
idxscan  27.8   31.4   30.4     13.7

Explanations of the columns:
HEAD:       CVS HEAD
HOT-pruned: CVS HEAD + HOT patch v15
HOT:        CVS HEAD + HOT patch v15, but with heap_page_prune_defrag short-circuited to do nothing
HOT-opt:    CVS HEAD + HOT patch v15, but with static XidInMVCCSnapshot like in CVS HEAD

I didn't expect a difference in seqscan performance between HEAD and HOT. I oprofiled it, and figured out that it's because the HOT patch removed the static qualifier from XidInMVCCSnapshot, because it's needed in plancat.c. I changed it back to static, dummying out the call in plancat.c, and the results are now closer to each other (HOT-opt column). Comparing the idxscan columns, it looks like following the chain *is* more expensive than having to go through killed index pointers. Pruning clearly does help. Given that this test is pretty much the worst case scenario, I'm ok with not pruning for the purpose of keeping chains short.
-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com

set enable_seqscan=on;

DROP TABLE hottest;
CREATE TABLE hottest (id integer);
INSERT INTO hottest VALUES (1);
CREATE INDEX i_hottest ON hottest(id);

-- Fill the table
CREATE OR REPLACE FUNCTION longchain (n integer) RETURNS void LANGUAGE PLPGSQL AS $$
DECLARE
  i integer;
BEGIN
  FOR i IN 1..n LOOP
    UPDATE hottest SET id=id WHERE id = 1;
  END LOOP;
END;
$$;

SELECT longchain(250);

CREATE OR REPLACE FUNCTION fetchchain (n integer) RETURNS void LANGUAGE PLPGSQL AS $$
DECLARE
  i integer;
  foo integer;
BEGIN
  FOR i IN 1..n LOOP
    SELECT id INTO foo FROM hottest WHERE id = 1;
  END LOOP;
END;
$$;

\timing
SELECT fetchchain(100);
\timing
Re: [PATCHES] HOT patch - version 15
Heikki Linnakangas wrote: Explanations of the columns: HEAD: CVS HEAD HOT-pruned: CVS HEAD + HOT patch v15 HOT: CVS HEAD + HOT patch v15, but with heap_page_prune_defrag short-circuited to do nothing HOT-opt: CVS HEAD + HOT patch v15, but with static XidInMVCCSnapshot like in CVS HEAD One correction: HOT-opt was also short-circuited like HOT. IOW, HOT and HOT-opt are the same, except for the static qualifier on XidInMVCCSnapshot. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] HOT patch - version 15
On 9/6/07, Tom Lane [EMAIL PROTECTED] wrote: Heikki Linnakangas [EMAIL PROTECTED] writes: When I suggested that we get rid of the LP_DELETE flag for heap tuples, the tuple-level fragmentation and all that, and just take the vacuum lock and call PageRepairFragmentation, I was thinking that we'd do it in heap_update and only when we run out of space on the page. But as Greg said, it doesn't work because you're already holding a reference to at least one tuple on the page, the one you're updating, by the time you get to heap_update. That's why I put the pruning code in heap_fetch instead. Yes, though the amortized cost is the same, it does push the pruning work to the foreground query path. The amortized cost is only the same if every heap_fetch is associated with a heap update. I feel pretty urgently unhappy about this choice. Have you tested the impact of the patch on read-mostly workloads? For read-mostly workloads, only the first SELECT after an UPDATE would trigger pruning/defragmentation. heap_page_prune_defrag() would be a no-op for subsequent SELECTs (PageIsPrunable() would be false until the next UPDATE). I think Heikki's recent test confirms this. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] HOT patch - version 15
On 9/8/07, Bruce Momjian [EMAIL PROTECTED] wrote: Would someone tell us exactly when pruning and defragmenting happen in the current version of the patch? If we don't nail this issue down soon, PostgreSQL 8.3 is going to sail without this patch. I guess you already have the answers by now, but let me repeat here: Until patch version 14, defragmentation and pruning were separate operations, though we used to invoke both at the same time. The only difference was that pruning can work with a simple exclusive lock whereas defragmentation needs a vacuum-strength lock. Tom argued that it's not worth the complexity, so I changed that and clubbed both operations together. What it means: pruning also now needs a vacuum-strength lock and is always followed by defragmentation. So as the patch stands (version 15), we call heap_page_prune_defrag() inside heap_release_fetch() and heapgetpage(), apart from the vacuum loop. It checks for PageIsPrunable() before proceeding. PageIsPrunable is set when a tuple is updated/deleted, so for read-mostly workloads heap_page_prune_defrag() will mostly be a no-op. If PageIsPrunable is set, the cleanup lock is available (the exclusive lock is tried conditionally, so we don't wait if there is contention) and we are running low on free space (right now, if free space is less than 1.2 times the average tuple size or less than BLCKSZ/8), the entire page is pruned and its fragmentation is repaired. We also prune/defrag in lazy vacuum and vacuum full, but I assume we are not worried about these maintenance activities. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] HOT patch - version 15
On 9/10/07, Simon Riggs [EMAIL PROTECTED] wrote: On Mon, 2007-09-10 at 12:17 +0100, Heikki Linnakangas wrote: Bruce Momjian wrote: (Can someone time the access time for following a chain that fills an entire page (the worst case) vs. having a single tuple on the page?) Here are some results, on my laptop.

         HEAD   HOT    HOT-opt  HOT-pruned
seqscan  19.9   21.1   20.1     11.5
idxscan  27.8   31.4   30.4     13.7

Comparing the idxscan columns, it looks like following the chain *is* more expensive than having to go through killed index pointers. Pruning clearly does help. Given that this test is pretty much the worst case scenario, I'm ok with not pruning for the purpose of keeping chains short. I wasn't expecting that result and had accepted the counter-argument. Yes, I go with Simon. I am also surprised that HOT-pruned did so well in this setup. I always thought that HOT would do well in update-intensive scenarios, but from the results it seems that HOT is also doing well for read-mostly queries. In this particular example, the first SELECT after the 250 UPDATEs would have pruned all the dead tuples and reduced the HOT chain to a single tuple. Hence the total time for subsequent SELECTs improved tremendously. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] [HACKERS] Include Lists for Text Search
On Mon, 2007-09-10 at 10:21 -0400, Tom Lane wrote: Oleg Bartunov [EMAIL PROTECTED] writes: On Mon, 10 Sep 2007, Simon Riggs wrote: Can we include that functionality now? This could be realized very easily using dict_strict, which returns only known words, and a mapping containing only this dictionary. So, feel free to write it and submit. ... for 8.4. I've coded a small patch to allow CaseSensitive synonyms.

CREATE TEXT SEARCH DICTIONARY my_diction (
    TEMPLATE = biglist,
    DictFile = words,
    CaseSensitive = true
);

-- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com

Index: src/backend/tsearch/dict_synonym.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/tsearch/dict_synonym.c,v
retrieving revision 1.4
diff -c -r1.4 dict_synonym.c
*** src/backend/tsearch/dict_synonym.c	25 Aug 2007 02:29:45 -0000	1.4
--- src/backend/tsearch/dict_synonym.c	10 Sep 2007 15:14:21 -0000
***************
*** 29,34 ****
--- 29,35 ----
  typedef struct
  {
  	int			len;			/* length of syn array */
+ 	bool		case_sensitive;
  	Syn		   *syn;
  } DictSyn;
***************
*** 83,88 ****
--- 84,90 ----
  			   *end = NULL;
  	int			cur = 0;
  	char	   *line = NULL;
+ 	bool		case_sensitive = false;
  
  	foreach(l, dictoptions)
  	{
***************
*** 95,100 ****
--- 97,107 ----
  			 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
  			  errmsg("unrecognized synonym parameter: \"%s\"",
  					 defel->defname)));
+ 
+ 		if (pg_strcasecmp("CaseSensitive", defel->defname) == 0 &&
+ 			pg_strcasecmp("True", defGetString(defel)) == 0)
+ 			case_sensitive = true;
+ 
  	}
  
  	if (!filename)
***************
*** 168,173 ****
--- 175,182 ----
  	d->len = cur;
  	qsort(d->syn, d->len, sizeof(Syn), compareSyn);
  
+ 	d->case_sensitive = case_sensitive;
+ 
  	PG_RETURN_POINTER(d);
  }
***************
*** 180,195 ****
  	Syn			key,
  			   *found;
  	TSLexeme   *res;
  
  	/* note: d->len test protects against Solaris bsearch-of-no-items bug */
  	if (len <= 0 || d->len <= 0)
  		PG_RETURN_POINTER(NULL);
  
! 	key.in = lowerstr_with_len(in, len);
  	key.out = NULL;
  
  	found = (Syn *) bsearch(&key, d->syn, d->len, sizeof(Syn), compareSyn);
! 	pfree(key.in);
  
  	if (!found)
  		PG_RETURN_POINTER(NULL);
--- 189,214 ----
  	Syn			key,
  			   *found;
  	TSLexeme   *res;
+ 	bool		need_pfree = false;
  
  	/* note: d->len test protects against Solaris bsearch-of-no-items bug */
  	if (len <= 0 || d->len <= 0)
  		PG_RETURN_POINTER(NULL);
  
! 	if (d->case_sensitive)
! 		key.in = in;
! 	else
! 	{
! 		key.in = lowerstr_with_len(in, len);
! 		need_pfree = true;
! 	}
  	key.out = NULL;
  
  	found = (Syn *) bsearch(&key, d->syn, d->len, sizeof(Syn), compareSyn);
! 
! 	if (need_pfree)
! 		pfree(key.in);
  
  	if (!found)
  		PG_RETURN_POINTER(NULL);
[PATCHES] Yet more tsearch refactoring
* Defined a new struct WordEntryPosVector that holds a uint16 length and a variable-size array of WordEntryPos entries. This replaces the previous convention of a variable-size uint16 array, with the first element implying the length. WordEntryPosVector has the same layout in memory, but is more readable in source code. The POSDATAPTR and POSDATALEN macros are still used, though it would now be more readable to access the fields in WordEntryPosVector directly.
* Removed the needfree field from DocRepresentation. It was always set to false.
* Miscellaneous other commenting and refactoring

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com

Index: src/backend/utils/adt/tsginidx.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/utils/adt/tsginidx.c,v
retrieving revision 1.3
diff -c -r1.3 tsginidx.c
*** src/backend/utils/adt/tsginidx.c	7 Sep 2007 16:03:40 -0000	1.3
--- src/backend/utils/adt/tsginidx.c	10 Sep 2007 12:04:14 -0000
***************
*** 25,37 ****
  	int32	   *nentries = (int32 *) PG_GETARG_POINTER(1);
  	Datum	   *entries = NULL;
  
! 	*nentries = 0;
  	if (vector->size > 0)
  	{
  		int			i;
  		WordEntry  *we = ARRPTR(vector);
  
- 		*nentries = (uint32) vector->size;
  		entries = (Datum *) palloc(sizeof(Datum) * vector->size);
  
  		for (i = 0; i < vector->size; i++)
--- 25,36 ----
  	int32	   *nentries = (int32 *) PG_GETARG_POINTER(1);
  	Datum	   *entries = NULL;
  
! 	*nentries = vector->size;
  	if (vector->size > 0)
  	{
  		int			i;
  		WordEntry  *we = ARRPTR(vector);
  
  		entries = (Datum *) palloc(sizeof(Datum) * vector->size);
  
  		for (i = 0; i < vector->size; i++)
***************
*** 134,144 ****
  	if (query->size > 0)
  	{
! 		int4		i,
  					j = 0;
  		QueryItem  *item;
  		GinChkVal	gcv;
  
  		gcv.frst = item = GETQUERY(query);
  		gcv.mapped_check = (bool *) palloc(sizeof(bool) * query->size);
--- 133,151 ----
  	if (query->size > 0)
  	{
! 		int			i,
  					j = 0;
  		QueryItem  *item;
  		GinChkVal	gcv;
  
+ 		/*
+ 		 * The check-parameter array has one entry for each value (operand) in
+ 		 * the query. We expand that array into mapped_check, so that there's
+ 		 * one entry in mapped_check for every node in the query, including
+ 		 * operators, to allow quick lookups in checkcondition_gin. Only the
+ 		 * entries corresponding to operands are actually used.
+ 		 */
+ 
  		gcv.frst = item = GETQUERY(query);
  		gcv.mapped_check = (bool *) palloc(sizeof(bool) * query->size);
Index: src/backend/utils/adt/tsgistidx.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/utils/adt/tsgistidx.c,v
retrieving revision 1.3
diff -c -r1.3 tsgistidx.c
*** src/backend/utils/adt/tsgistidx.c	7 Sep 2007 15:09:56 -0000	1.3
--- src/backend/utils/adt/tsgistidx.c	10 Sep 2007 15:56:09 -0000
***************
*** 133,152 ****
  }
  
  static int
! compareint(const void *a, const void *b)
  {
! 	if (*((int4 *) a) == *((int4 *) b))
  		return 0;
! 	return (*((int4 *) a) > *((int4 *) b)) ? 1 : -1;
  }
  
  static int
  uniqueint(int4 *a, int4 l)
  {
  	int4	   *ptr,
  			   *res;
  
! 	if (l == 1)
  		return l;
  
  	ptr = res = a;
--- 133,159 ----
  }
  
  static int
! compareint(const void *va, const void *vb)
  {
! 	int4		a = *((int4 *) va);
! 	int4		b = *((int4 *) vb);
! 
! 	if (a == b)
  		return 0;
! 	return (a > b) ? 1 : -1;
  }
  
+ /*
+  * Removes duplicates from an array of int4. 'l' is
+  * size of the input array. Returns the new size of the array.
+  */
  static int
  uniqueint(int4 *a, int4 l)
  {
  	int4	   *ptr,
  			   *res;
  
! 	if (l <= 1)
  		return l;
  
  	ptr = res = a;
***************
*** 570,581 ****
  }	SPLITCOST;
  
  static int
! comparecost(const void *a, const void *b)
  {
! 	if (((SPLITCOST *) a)->cost == ((SPLITCOST *) b)->cost)
  		return 0;
  	else
! 		return (((SPLITCOST *) a)->cost > ((SPLITCOST *) b)->cost) ? 1 : -1;
  }
--- 577,591 ----
  }	SPLITCOST;
  
  static int
! comparecost(const void *va, const void *vb)
  {
! 	SPLITCOST  *a = (SPLITCOST *) va;
! 	SPLITCOST  *b = (SPLITCOST *) vb;
! 
! 	if (a->cost == b->cost)
  		return 0;
  	else
! 		return (a->cost > b->cost) ? 1 : -1;
  }
Index: src/backend/utils/adt/tsrank.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/utils/adt/tsrank.c,v
retrieving revision 1.4
diff -c -r1.4 tsrank.c
*** src/backend/utils/adt/tsrank.c	7 Sep 2007 16:03:40 -0000	1.4
--- src/backend/utils/adt/tsrank.c	10 Sep 2007 15:56:51 -0000
***************
*** 53,74 ****
  {
  	WordEntry  *ptr = ARRPTR(t),
  			   *end = (WordEntry *) STRPTR(t);
! 	int			len = 0,
! 				clen;
  
  	while (ptr < end)
  	{
  		if ((clen = POSDATALEN(t, ptr)) == 0)
  			len += 1;
  		else
  			len += clen;
  		ptr++;
  	}
  	return len;
  }
  
! static int4
  WordECompareQueryItem(char *eval, char *qval, WordEntry *ptr, QueryOperand
Re: [PATCHES] HOT patch - version 15
Bruce Momjian wrote: My guess is that once you prune a tuple you no longer have to check its visibility, and that is where the win is coming from. Yes, I think that's right. Oh, one more thing occurred to me. Without HOT, we not only mark index tuples pointing to dead tuples as killed, we remove them altogether if the index page gets full. If you modify the test case so that after doing the updates, you insert a bunch of tuples with a different key to fill the index page, you should see CVS HEAD winning HOT-without-pruning hands down. If we check a tuple in a chain and the tuple is dead, is it possible the pruning operation is cheaper than having to check the tuple again for visibility the next time we see it? If so, we can just always prune when we see a dead tuple in the chain, which I believe was the original design. Pruning becomes an operation similar to marking an index entry as dead. (I assume pruning does not generate WAL traffic.) Pruning does generate a WAL record at the moment. Maybe you could do some kind of a quick pruning without a WAL record. Like just modify the ctid of the oldest dead tuple in the chain, or the redirecting line pointer if there is one, to point to the latest live tuple, without removing the dead tuples or the line pointers. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] HOT patch - version 15
On Mon, 2007-09-10 at 18:23 +0100, Heikki Linnakangas wrote: If we check a tuple in a chain and the tuple is dead, is it possible the pruning operation is cheaper than having to check the tuple again for visibility the next time we see it? If so, we can just always prune when we see a dead tuple in the chain, which I believe was the original design. Pruning becomes an operation similar to marking an index entry as dead. (I assume pruning does not generate WAL traffic.) Pruning does generate a WAL record at the moment. Maybe you could do some kind of a quick pruning without a WAL record. Like just modify the ctid of the oldest dead tuple in the chain, or the redirecting line pointer if there is one, to point to the latest live tuple, without removing the dead tuples or the line pointers. Sounds like a great idea. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Re: [PATCHES] HOT patch - version 15
On 9/10/07, Florian Pflug [EMAIL PROTECTED] wrote: Maybe that risk could be lowered if, instead of a flag, we stored the minimal global xmin needed to prune at least one tuple. I like the idea. The question is whether the chances of a prunable page being looked up again and again in the presence of a long-running transaction are high enough to justify adding 4 bytes to the page header. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] HOT patch - version 15
On 9/10/07, Heikki Linnakangas [EMAIL PROTECTED] wrote: Oh, one more thing occurred to me. Without HOT, we not only mark index tuples pointing to dead tuples as killed, we remove them altogether if the index page gets full. If you modify the test case so that after doing the updates, you insert a bunch of tuples with a different key to fill the index page, you should see CVS HEAD winning HOT-without-pruning hands down. Um, I am not sure I follow that. With HOT, even if the HOT chain is long, there is still just one index entry for the entire chain. So I don't see why CVS HEAD would do better than HOT in the above case. Net-net, there will be an equal number of index keys after the inserts. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] HOT patch - version 15
On 9/11/07, Bruce Momjian [EMAIL PROTECTED] wrote: Heikki Linnakangas wrote: Pruning does generate a WAL record at the moment. Maybe you could do some kind of a quick pruning without a WAL record. Like just modify the ctid of the oldest dead tuple in the chain, or the redirecting line pointer if there is one, to point to the latest live tuple, without removing the dead tuples or the line pointers. I am wondering what you even mean by removing the dead tuples when pruning. I thought only defragmentation removed tuples. Pruning removes intermediate dead tuples by marking their line pointers ~LP_USED and redirecting the root line pointer to the first live/recently-dead tuple in the chain. OTOH, defragmentation moves tuples around to repair fragmentation. IOW, defragmentation is nothing but calling PageRepairFragmentation on the page. For example, if we have a HOT chain of 5 tuples: 1 -> 2 -> 3 -> 4 -> 5, of which 1, 2 and 3 are DEAD, 4 is RECENTLY_DEAD and 5 is LIVE.

At the end of pruning:
- the line pointer of 1 is redirected to 4
- the line pointers of 2 and 3 are marked ~LP_USED
- the offsets of 4 and 5 are unchanged

At the end of defragmentation:
- 4 and 5 are moved to the end of the page to create a larger contiguous free space in the page

Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] HOT patch - version 15
Heikki Linnakangas wrote: Bruce Momjian wrote: My guess is that once you prune a tuple you no longer have to check its visibility, and that is where the win is coming from. Yes, I think that's right. Oh, one more thing occurred to me. Without HOT, we not only mark index tuples pointing to dead tuples as killed, we remove them altogether if the index page gets full. If you modify the test case so that after doing the updates, you insert a bunch of tuples with a different key to fill the index page, you should see CVS HEAD winning HOT-without-pruning hands down. If we check a tuple in a chain and the tuple is dead, is it possible the pruning operation is cheaper than having to check the tuple again for visibility the next time we see it? If so, we can just always prune when we see a dead tuple in the chain, which I believe was the original design. Pruning becomes an operation similar to marking an index entry as dead. (I assume pruning does not generate WAL traffic.) Pruning does generate a WAL record at the moment. Maybe you could do some kind of a quick pruning without a WAL record. Like just modify the ctid of the oldest dead tuple in the chain, or the redirecting line pointer if there is one, to point to the latest live tuple, without removing the dead tuples or the line pointers. I am wondering what you even mean by removing the dead tuples when pruning. I thought only defragmentation removed tuples. -- Bruce Momjian [EMAIL PROTECTED] http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Re: [PATCHES] HOT patch - version 15
Pavan Deolasee wrote: Pruning does generate a WAL record at the moment. Maybe you could do some kind of a quick pruning without a WAL record. Like just modify the ctid of the oldest dead tuple in the chain, or the redirecting line pointer if there is one, to point to the latest live tuple, without removing the dead tuples or the line pointers. I am wondering what you even mean by removing the dead tuples when pruning. I thought only defragmentation removed tuples. Pruning removes intermediate dead tuples by marking their line pointers ~LP_USED and redirecting the root line pointer to the first live/recently-dead tuple in the chain. OTOH, defragmentation moves tuples around to repair fragmentation. IOW, defragmentation is nothing but calling PageRepairFragmentation on the page. For example, if we have a HOT chain of 5 tuples: 1 -> 2 -> 3 -> 4 -> 5, of which 1, 2 and 3 are DEAD, 4 is RECENTLY_DEAD and 5 is LIVE. At the end of pruning: - the line pointer of 1 is redirected to 4 - the line pointers of 2 and 3 are marked ~LP_USED - the offsets of 4 and 5 are unchanged. At the end of defragmentation: - 4 and 5 are moved to the end of the page to create a larger contiguous free space in the page. Right. My point is that pruning only modifies item pointers. It does not change the actual heap tuples. In the quote above, how is Heikki's quick pruning different from the pruning you are describing? -- Bruce Momjian [EMAIL PROTECTED] http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Re: [PATCHES] HOT patch - version 15
Bruce Momjian wrote: Pavan Deolasee wrote: Pruning does generate a WAL record at the moment. Maybe you could do some kind of a quick pruning without a WAL record. Like just modify the ctid of the oldest dead tuple in the chain, or the redirecting line pointer if there is one, to point to the latest live tuple, without removing the dead tuples or the line pointers. I am wondering what you even mean by removing the dead tuples when pruning. I thought only defragmentation removed tuples. Pruning removes intermediate dead tuples by marking their line pointers ~LP_USED and redirecting the root line pointer to the first live/recently-dead tuple in the chain. OTOH, defragmentation moves tuples around to repair fragmentation. IOW, defragmentation is nothing but calling PageRepairFragmentation on the page. For example, if we have a HOT chain of 5 tuples: 1 -> 2 -> 3 -> 4 -> 5, of which 1, 2 and 3 are DEAD, 4 is RECENTLY_DEAD and 5 is LIVE. At the end of pruning: - the line pointer of 1 is redirected to 4 - the line pointers of 2 and 3 are marked ~LP_USED - the offsets of 4 and 5 are unchanged. At the end of defragmentation: - 4 and 5 are moved to the end of the page to create a larger contiguous free space in the page. Right. My point is that pruning only modifies item pointers. It does not change the actual heap tuples. In the quote above, how is Heikki's quick pruning different from the pruning you are describing? In quick pruning, you don't mark the line pointer as not used. Otherwise someone might reuse the line pointer for another tuple, which is a problem if you then crash. WAL replay would see an insert to a line pointer that's already in use (the quick pruning wouldn't be WAL logged), which would raise an error. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] HOT patch - version 15
On 9/11/07, Bruce Momjian [EMAIL PROTECTED] wrote: Right. My point is that pruning only modifies item pointers. It does not change the actual heap tuples. In the quote above, how is Heikki's quick pruning different from the pruning you are describing? I think the only difference is that the quick pruning does not mark intermediate tuples ~LP_USED and hence we may avoid WAL logging. Btw, I am not too enthusiastic about quick pruning because it leaves behind dead heap-only tuples which are not reachable from any root heap tuples. Not that we can't handle them, but I am worried about making such changes right now unless we are sure about the benefits. We can always tune and tweak in 8.4 Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] HOT patch - version 15
On Mon, 2007-09-10 at 20:46 +0100, Heikki Linnakangas wrote: Bruce Momjian wrote: Pavan Deolasee wrote: At the end of pruning: - the line pointer of 1 is redirected to 4 - the line pointers of 2 and 3 are marked ~LP_USED - the offsets of 4 and 5 are unchanged In the quote above, how is Heikki's quick pruning different from the pruning you are describing? In quick pruning, you don't mark the line pointer as not used. Otherwise someone might reuse the line pointer for another tuple, which is a problem if you then crash. WAL replay would see an insert to a line pointer that's already in use (the quick pruning wouldn't be WAL logged), which would raise an error. That seems like the best way forward to me. It's just a hint that says where the first live tuple is in the chain. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Re: [PATCHES] HOT patch - version 15
Gregory Stark wrote: Bruce Momjian [EMAIL PROTECTED] writes: Looking at the patch I see: + /* +* Mark the page as clear of prunable tuples. If we find a tuple which +* may become prunable, we shall set the hint again. +*/ + PageClearPrunable(page); I like the idea of the page hint bit, but my question is if there is a long-running transaction, isn't each SELECT going to try to defragment a page over and over again because there is still something prunable on the page? Well it'll try to prune the chains over and over again. If it doesn't find anything it won't defragment, but yes. I think we could tackle that by storing on the page the oldest xmax which would allow us to prune a tuple. Then you could compare RecentGlobalXmin against that and not bother looking at any other chains if it hasn't been passed yet. Yea, that might work if we have space on the page. It would be hard to do that with single-chain pruning though: once you prune the limiting tuple you then wouldn't know what the next limiting xmax is until the next time you do a full-page prune. Still, that gets us down to at most two full-page prunes per update instead of a potentially unbounded number of prunes. This seems like a further optimization to think about after we have a place to trigger the pruning where it'll do the most good. I was thinking if we could only try defragmenting when we are doing an INSERT or UPDATE, then the defragment would either free enough space so we could stay on the page or the new row would go on another page and we would probably not return to that page again anytime soon. -- Bruce Momjian [EMAIL PROTECTED] http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
[PATCHES] External builds need tsearch include files
I tried to compile a custom tsearch dictionary as an external module, but it couldn't find the tsearch include files in the server installation. The tsearch subdirectory needs to be added to src/include/Makefile, otherwise they don't get installed:

*** src/include/Makefile 9 Dec 2005 21:19:35 - 1.21
--- src/include/Makefile 10 Sep 2007 19:53:54 -
***************
*** 18,24 ****
  # Subdirectories containing headers for server-side dev
  SUBDIRS = access bootstrap catalog commands executor lib libpq mb \
!   nodes optimizer parser regex rewrite storage tcop utils \
    port port/win32 port/win32/arpa port/win32/netinet port/win32/sys

  # Install all headers
--- 18,24 ----
  # Subdirectories containing headers for server-side dev
  SUBDIRS = access bootstrap catalog commands executor lib libpq mb \
!   nodes optimizer parser regex rewrite storage tcop tsearch utils \
    port port/win32 port/win32/arpa port/win32/netinet port/win32/sys

  # Install all headers

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [PATCHES] HOT patch - version 15
On Tue, 2007-09-11 at 01:22 +0530, Pavan Deolasee wrote: I think the only difference is that the quick pruning does not mark intermediate tuples ~LP_USED and hence we may avoid WAL logging. Sounds great. Btw, I am not too enthusiastic about quick pruning because it leaves behind dead heap-only tuples which are not reachable from any root heap tuples. Not that we can't handle them, but I am worried about making such changes right now unless we are sure about the benefits. We can always tune and tweak in 8.4 ...but we don't want to reach them. They are waiting to be removed. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PATCHES] HOT patch - version 15
Pavan Deolasee [EMAIL PROTECTED] writes: On 9/11/07, Bruce Momjian [EMAIL PROTECTED] wrote: Right. My point is that pruning only modifies item pointers. It does not change the actual heap tuples. In the quote above, how is Heikki's quick pruning different from the pruning you are describing? I think the only difference is that the quick pruning does not mark intermediate tuples ~LP_USED and hence we may avoid WAL logging. Btw, I am not too enthusiastic about quick pruning because it leaves behind dead heap-only tuples which are not reachable from any root heap tuples. Not that we can't handle them, but I am worried about making such changes right now unless we are sure about the benefits. We can always tune and tweak in 8.4. You could mark such tuples with LP_DELETE. That would also let other transactions quickly tot up how much space would be available if they were to run PageRepairFragmentation. But if you don't WAL log setting LP_DELETE then you would still have to deal with unreachable HOT tuples that lost their LP_DELETE flag. I suppose that as long as PageRepairFragmentation first looks for any dead HOT tuples without depending on LP_DELETE that would be enough. I wonder if you could do the quick prune without even taking the exclusive page lock? You're overwriting 16 bits and you know nobody else will be modifying any of the line pointers in question to anything else (because your pin prevents the vacuum lock from coming in and trying to mark it unused). -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [PATCHES] Yet more tsearch refactoring
Heikki Linnakangas wrote: * Defined new struct WordEntryPosVector that holds a uint16 length and a variable size array of WordEntries. This replaces the previous convention of a variable size uint16 array, with the first element implying the length. WordEntryPosVector has the same layout in memory, but is more readable in source code. The POSDATAPTR and POSDATALEN macros are still used, though it would now be more readable to access the fields in WordEntryPosVector directly. Did you check it on 64-bit boxes with strict alignment? I remember that was a headache for me. -- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http://www.sigaev.ru/ ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PATCHES] Yet more tsearch refactoring
Teodor Sigaev wrote: Heikki Linnakangas wrote: * Defined new struct WordEntryPosVector that holds a uint16 length and a variable size array of WordEntries. This replaces the previous convention of a variable size uint16 array, with the first element implying the length. WordEntryPosVector has the same layout in memory, but is more readable in source code. The POSDATAPTR and POSDATALEN macros are still used, though it would now be more readable to access the fields in WordEntryPosVector directly. Did you check it on 64-bit boxes with strict alignment? I remember that was a headache for me. No, I didn't. But I don't see how it would make a difference, the resulting memory layout is the same AFAICS. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PATCHES] HOT patch - version 15
Simon Riggs [EMAIL PROTECTED] writes: On Tue, 2007-09-11 at 01:22 +0530, Pavan Deolasee wrote: I think the only difference is that the quick pruning does not mark intermediate tuples ~LP_USED and hence we may avoid WAL logging. Sounds great. What it sounds is utterly unsafe. You can get away with not WAL-logging individual bit flips (that is, hint-bit-setting) because either state of the page is valid. If I read this proposal correctly it is to change t_ctid without WAL-logging, which means that a partial page write (torn page syndrome) could leave the page undetectably corrupted --- t_ctid is 6 bytes and could easily cross a hardware sector boundary. regards, tom lane ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [PATCHES] HOT patch - version 15
Tom Lane wrote: Simon Riggs [EMAIL PROTECTED] writes: On Tue, 2007-09-11 at 01:22 +0530, Pavan Deolasee wrote: I think the only difference is that the quick pruning does not mark intermediate tuples ~LP_USED and hence we may avoid WAL logging. Sounds great. What it sounds is utterly unsafe. You can get away with not WAL-logging individual bit flips (that is, hint-bit-setting) because either state of the page is valid. If I read this proposal correctly it is to change t_ctid without WAL-logging, which means that a partial page write (torn page syndrome) could leave the page undetectably corrupted --- t_ctid is 6 bytes and could easily cross a hardware sector boundary. OK, we certainly want to avoid WAL writes for pruning a single chain if possible. One idea would be to just mark the item pointer as dead but not repoint the first item pointer to point to the first live tuple. Heikki, would that give us speedups similar to what you saw in your earlier tests? It would if the majority of the cost of scanning the page looking for the first live tuple was in doing the visibility tests on every tuple. Basically, when we are walking the chain via an index and we see the tuple is dead we mark it as dead but don't repoint. We would repoint and reclaim on a defragmentation. -- Bruce Momjian [EMAIL PROTECTED] http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [PATCHES] HOT patch - version 15
On 9/11/07, Gregory Stark [EMAIL PROTECTED] wrote: You could mark such tuples with LP_DELETE. That would also let other transactions quickly tot up how much space would be available if they were to run PageRepairFragmentation. IMHO we are going in circles here. We have already tried LP_DELETE techniques and moved away to simplify things. We also tried reusing dead space without running PageRepairFragmentation. Each of these techniques worked just fine with slightly different performance characteristics. What we now have is a simplified algorithm which is much easier to follow and is safer, yet giving us a very good performance boost. I am not sure if this is the right time to throw in new ideas because we would never be sure that what we are doing is the optimal solution. Would it help if we go with some solution right now, get the rest of the review process done and then use the feedback during beta testing to tune things? We may have far more data points at that time to choose one technique over another. And we would also know what areas to focus on. I am also worried that by focusing too much on this issue we may overlook some other correctness issue in the patch. From whatever we have discussed so far, IMHO we should do the following things and let the rest of the review process proceed: - Defragment a page only when the free space left in the page is not enough to accommodate even a single tuple (use average tuple length for this decision). This would mean we might be defragmenting even though there is no immediate UPDATE to the page. But we can treat this as fillfactor which allows us to provision for the next UPDATE coming to the page. Since we are defragmenting when the page is almost full, hopefully we would reclaim a good amount of space in the page and won't call defragmentation for the next few UPDATEs. We already have a mechanism to track average tuple request size in relcache. 
Maybe we can have some relcache invalidation to keep the information in sync (send invalidation when the average request size changes by, say, 5%) - Avoid pruning chains in every index or seq lookup. But if the chain becomes longer than X tuples, mark the page to be pruned in the next lookup. We can choose to separate prune and defragmentation and only do pruning in this case. But I would prefer to keep them together for now. - Track the minimum xmin in the page header to avoid repeated (wasted) attempts to prune a Prunable page in the presence of long running transactions. We can save the rest of the techniques for the beta testing period or 8.4. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [PATCHES] HOT patch - version 15
Pavan Deolasee wrote: IMHO we are making full circles here. We have already tried LP_DELETE techniques and moved away to simplify things. We also tried reusing dead space without running PageRepairFragmentation. Each of these techniques worked just fine with slightly different performance characteristics. What we now have is a simplified algorithm which is much easier to follow and is safer, yet giving us a very good performance boost. I am not sure if this is the right time to throw new ideas because we would never be sure as what we are doing would be the most optimal solution. Would it help if we go with some solution right now, get rest of the review process done and then use the feedback during beta testing to tune things ? We may have far more data points at that time to choose one technique over other. And we would also know what areas to focus on. I totally understand. I was just throwing out ideas to make sure I fully understood it and to explore various options. I do agree we should just wait for the review, knowing we have these trade-offs. -- Bruce Momjian [EMAIL PROTECTED] http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match