[HACKERS] GSoC proposal. Index-only scans for GIST
Hello! Here is the text of my proposal which I've submitted to GSoC (link: http://www.google-melange.com/gsoc/proposal/public/google/gsoc2014/lubennikovaav/5629499534213120). Any suggestions and comments are welcome.

*Project name* Support for index-only scans for the GiST index

*Brief review* Currently the GiST index has no index-only scan functionality. Index-only scans are a major performance feature added to Postgres 9.2: they allow certain types of queries to be satisfied just by retrieving data from indexes, not from tables, which makes those queries significantly faster by reducing the amount of I/O necessary to satisfy them. For B-trees this has been implemented since PostgreSQL 9.2. I think it will be useful to implement this feature for user-defined data types that use the GiST index.

*Benefits to the PostgreSQL Community* Faster GiST index search would be valuable for many PostgreSQL applications (for example, some GIS systems).

*Quantifiable results* Acceleration of GiST index search.

*Project details*
1. The GiST is a balanced tree structure like a B-tree, containing (key, pointer) pairs. A GiST key is a member of a user-defined class, and represents some property that is true of all data items reachable from the pointer associated with the key. GiST makes it possible to create custom data types with indexed access methods and an extensible set of queries. There are seven methods that an index operator class for GiST must provide, and an eighth that is optional:
-consistent
-union
-compress
-decompress
-penalty
-picksplit
-equal
-distance (optional)
I'm going to create a new optional method, fetch(), in addition to the existing ones. With it, a user can define how to retrieve data from the index without loss. It will be used when performing search queries to speed up data retrieval.
2. gistget algorithm (several parts omitted): Check the key with gistindex_keytest() - does this index tuple satisfy the scan key(s)?
Scan all items on the GiST index page and insert them into the queue (or directly into the output areas):
- plain scan: heap tuple TIDs are returned into so->pageData[];
- ordered scan: heap tuple TIDs are pushed into individual search queue items.
If fetch() is specified by the developer, the algorithm can use it to retrieve the data directly into the output areas at this stage, without referring to the heap.
3. Create a function gistcanreturn to check whether fetch() is specified by the user. amcanreturn - function to check whether the index supports index-only scans, or zero if none. There is the question of supporting index-only scans for multicolumn indexes. Probably it would require extending the interface to allow checking columns separately. To solve this question I need the community's help.
4. Add details for index-only scans to the gistcostestimate function.

*Links*
1) Hellerstein J. M., Naughton J. F., Pfeffer A. Generalized search trees for database systems. - September, 1995.
2) http://www.sai.msu.su/~megera/postgres/gist/
3) PostgreSQL 9.3.3 Documentation: chapters 11. Indexes, 55. GiST Indexes.
-- Best regards, Lubennikova Anastasia
Re: [HACKERS] GSoC proposal. Index-only scans for GIST
2014-03-18 18:47 GMT+04:00 Robert Haas robertmh...@gmail.com:

    If the fetch() is specified by the developer, then using it, algorithm can retrieve the data directly to output areas at this stage, without reference to the heap.

This seems to be the crux of your proposal, but it seems vague: what exactly do you mean by "retrieve the data directly to output areas"? What data are you going to retrieve, and where are you going to put it?

I meant the Datum stored in GiST-tree nodes. Now gistgettuple() returns xs_ctup.t_self (the item pointer). I'm going to add index-only scan functionality: gistgettuple() will return both the pointer and the Datum itself as xs_itup. So the queue will contain both the pointer and the Datum. If visibilitymap_test returns true, the Datum from xs_itup will be added into the queue. Otherwise the heap page must be scanned.

    Another question to consider is: which operator classes do you anticipate that this will work for and which ones do you anticipate that it will not work for? Any operator class that lossifies the input data before storing it in the index is presumably doomed, but which ones do that, and which do not?

About amcanreturn: I'm going to create a function gistcanreturn() = "is fetch() defined for all indexed columns?" And the last point of my project is to implement fetch() for existing opclasses based on GiST.
-- Best regards, Lubennikova Anastasia
[HACKERS] Index-only scans for GIST
Hi, hackers! Here are the first results of my work on the GSoC project "Index-only scans for GIST".
1. The current version of my project code is in the forked repository https://github.com/lubennikovaav/postgres/tree/indexonlygist2 Patch is in attachments.
- This version is only for one-column indexes.
- The fetch() method is implemented only for the box opclass (because it's trivial).
2. I tested index-only scans with the SQL script box_test.sql and it really works faster. (Results are in box_test_out.)
I'll be glad to get your feedback about this feature.
-- Best regards, Lubennikova Anastasia
indexonlygist_2.0.patch Description: Binary data
box_test.sql Description: Binary data
box_test_out.out Description: Binary data
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Changes in amcanreturn() interface to support multicolumn indexes
Changes in the amcanreturn() interface to support multicolumn indexes.

Hi, hackers! I work on the GSoC project Support_for_Index-only_scans_for_GIST_GSoC_2014 https://wiki.postgresql.org/wiki/Support_for_Index-only_scans_for_GIST_GSoC_2014 There is a question of supporting multicolumn index-only scans for GiST, and I need help with this problem.

bool amcanreturn() - function to check whether the index supports index-only scans. Now it's defined for:
- B-tree. It always returns true, so there are no questions.
- SP-GiST. It doesn't support multicolumn indexes, so there are no problems in spgcanreturn either.
- GiST. In the first version it works only for one-column indexes. gistcanreturn() checks whether the fetch() method is defined for the column's data type.

gistcanreturn() could return true if fetch() is implemented for all indexed columns and false otherwise, but that's not a very good approach for multicolumn indexes. I think it requires extending the interface to allow checking columns separately. But I can't understand what kind of changes is required and how it would affect the existing amcanreturn interface.
-- Regards, Lubennikova Anastasia
[HACKERS] Index-only scans for multicolumn GIST
Hi, hackers! Here are new results of my work on the GSoC project "Index-only scans for GIST". Previous post is here: http://postgresql.1045698.n5.nabble.com/Index-only-scans-for-GIST-td5804892.html Repository is https://github.com/lubennikovaav/postgres/tree/indexonlygist2 Patch is in attachments. It includes index-only scan for multicolumn GiST. It works correctly - the results are the same as with other scan plans. The fetch() method is implemented for the box and circle opclasses. The improvement for the circle opclass is not as distinct as for the box opclass, because of recheck. I remember that all elog calls and other bad comments must be fixed before this patch can be committed. Any comments are welcome.
-- Best regards, Lubennikova Anastasia
indexonlygist_2.1.patch Description: Binary data
[HACKERS] Index-only scans for GIST
Hi, hackers! I work on the GSoC project "Index-only scans for GIST" https://wiki.postgresql.org/wiki/Support_for_Index-only_scans_for_GIST_GSoC_2014 Repository is https://github.com/lubennikovaav/postgres/tree/indexonlygist2 Patch is in attachments. It includes index-only scans for multicolumn GIST and a new regression test. The fetch() method is implemented for the box and point opclasses. Documentation is not updated yet, but I'm going to do it by the end of GSoC. I've got one question about a query with an OR condition. It is the last query in the regression test gist_indexonly. It doesn't fail, but it doesn't use index-only scans. Could someone explain to me how it works? It seems to depend on the build_paths_for_OR function http://doxygen.postgresql.org/indxpath_8c.html#ae660d2e886355e53ed3b9ec693e4afd2 But I couldn't understand how.
-- Best regards, Lubennikova Anastasia
indexonlyscan_gist.patch Description: Binary data
Re: [HACKERS] Index-only scans for GIST
Thank you for the comment. The patch is already added in the Performance topic.

2014-08-01 20:25 GMT+04:00 Fabrízio de Royes Mello fabriziome...@gmail.com: On Fri, Aug 1, 2014 at 4:58 AM, Anastasia Lubennikova lubennikov...@gmail.com wrote:

    Hi, hackers! I work on a GSoC project "Index-only scans for GIST" https://wiki.postgresql.org/wiki/Support_for_Index-only_scans_for_GIST_GSoC_2014 Repository is https://github.com/lubennikovaav/postgres/tree/indexonlygist2 Patch is in attachments. It includes index-only scans for multicolumn GIST and a new regression test. Fetch() method is realized for box and point opclasses. Documentation is not updated yet, but I'm going to do it till the end of GSoC. I've got one question about query with OR condition. It is the last query in regression test gist_indexonly. It doesn't fail but it doesn't use index-only scans. Could someone explain to me how it works? It seems to depend on build_paths_for_OR function. But I couldn't understand how.

Very nice... please add your patch to the next commitfest [1]. Regards, [1] https://commitfest.postgresql.org/action/commitfest_view?id=23
-- Fabrízio de Royes Mello Consultoria/Coaching PostgreSQL Timbira: http://www.timbira.com.br Blog sobre TI: http://fabriziomello.blogspot.com Perfil Linkedin: http://br.linkedin.com/in/fabriziomello Twitter: http://twitter.com/fabriziomello
-- Best regards, Lubennikova Anastasia
Re: [HACKERS] Index-only scans for GIST
2014-08-07 0:30 GMT+04:00 Heikki Linnakangas hlinnakan...@vmware.com:

    * I'm getting two regression failures with this (opr_sanity and join).

The opr_sanity failure is corrected, but the question about join remains. I checked the latest version of my github repo and there's no failure in the join regression test ("All 145 tests passed."). To tell the truth, I don't understand which changes could lead to this failure. Could you show me the regression diffs? I want to understand what's wrong with the patch.

    * The regression test queries that use LIMIT are not guaranteed to always return the same rows, hence they're not very good regression test cases. I'd suggest using more restricting WHERE clauses, so that each query only returns a handful of rows.

Thank you for the comment, I rewrote the wrong queries. But could you explain why LIMIT queries may return different results? Does it happen because of different query optimization?

    * I think it's leaking memory, in GIST scan context. I tested this with a variant of the regression tests:
    insert into gist_tbl select box(point(0.05*i, 0.05*i), point(0.05*i, 0.05*i)), point(0.05*i, 0.05*i) FROM generate_series(0, 1000) as i;
    CREATE INDEX gist_tbl_point_index ON gist_tbl USING gist (p);
    set enable_seqscan=off;
    set enable_bitmapscan=off;
    explain analyze select p from gist_tbl where p <@ box(point(0,0), point(999,999)) and length(p::text) < 10;
    While the final query runs, 'top' shows constantly increasing memory usage.

I don't think it's a memory leak. After some initial growth, memory usage remains the same. It behaves similarly without using index-only scans.
-- Best regards, Lubennikova Anastasia
Re: [HACKERS] Index-only scans for GIST
Updated patch:
* Compiler, merge, and regression failures checked
* Regression tests were improved
* GiST and amcanreturn docs updated
-- Best regards, Lubennikova Anastasia
indexonlyscan_gist2.patch Description: Binary data
indexonlyscan_gist_docs.patch Description: Binary data
[HACKERS] Index-only scans for GiST.
Finally, there is a new version of the patch (in attachments). It provides multicolumn index-only scan for GiST indexes.
- Memory leak is fixed.
- Little code cleanup.
- An example of a performance test is in attachments.
- Function OIDs have debugging values (*) just to avoid merge conflicts while testing the patch.
The wiki page of the project is https://wiki.postgresql.org/wiki/Support_for_Index-only_scans_for_GIST_GSoC_2014
Waiting for feedback.
-- Best regards, Lubennikova Anastasia
indexonlyscan_gist_2.0.patch Description: Binary data
test_performance.sql Description: Binary data
indexonlyscan_gist_docs.patch Description: Binary data
Re: [HACKERS] Index-only scans for GiST.
Thanks for the answer. Now it seems to be applied correctly.

2015-02-12 3:12 GMT+04:00 Thom Brown t...@linux.com: On 11 February 2015 at 22:50, Anastasia Lubennikova lubennikov...@gmail.com wrote:

    Finally there is a new version of patch (in attachments). It provides multicolumn index-only scan for GiST indexes. - Memory leak is fixed. - little code cleanup - example of performance test in attachments - function OIDs have debugging values (*) just to avoid merge conflicts while testing patch Wiki page of the project is https://wiki.postgresql.org/wiki/Support_for_Index-only_scans_for_GIST_GSoC_2014 Waiting for feedback.

Hi Anastasia. Thanks for the updated patch. I've just tried applying it to head and it doesn't appear to apply cleanly.

$ patch -p1 < ~/Downloads/indexonlyscan_gist_2.0.patch
(Stripping trailing CRs from patch; use --binary to disable.)
patching file src/backend/access/gist/gist.c
Hunk #1 succeeded at 1404 (offset 9 lines).
Hunk #2 succeeded at 1434 (offset 9 lines).
(Stripping trailing CRs from patch; use --binary to disable.)
patching file src/backend/access/gist/gistget.c
Hunk #1 succeeded at 227 (offset 1 line).
Hunk #2 succeeded at 243 (offset 1 line).
Hunk #3 succeeded at 293 (offset -4 lines).
Hunk #4 succeeded at 330 (offset -4 lines).
Hunk #5 succeeded at 365 (offset -5 lines).
Hunk #6 succeeded at 444 (offset -27 lines).
Hunk #7 succeeded at 474 (offset -27 lines).
Hunk #8 FAILED at 518.
Hunk #9 succeeded at 507 (offset -28 lines).
Hunk #10 succeeded at 549 with fuzz 1 (offset -28 lines).
Hunk #11 FAILED at 601.
2 out of 11 hunks FAILED -- saving rejects to file src/backend/access/gist/gistget.c.rej
...
-- Thom

-- Best regards, Lubennikova Anastasia

diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index db2a452..53750da 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -1404,6 +1404,14 @@ initGISTstate(Relation index)
 	else
 		giststate->distanceFn[i].fn_oid = InvalidOid;
 
+	/* opclasses are not required to provide a Fetch method */
+	if (OidIsValid(index_getprocid(index, i + 1, GIST_FETCH_PROC)))
+		fmgr_info_copy(&(giststate->fetchFn[i]),
+					   index_getprocinfo(index, i + 1, GIST_FETCH_PROC),
+					   scanCxt);
+	else
+		giststate->fetchFn[i].fn_oid = InvalidOid;
+
 	/*
 	 * If the index column has a specified collation, we should honor that
 	 * while doing comparisons. However, we may have a collatable storage
@@ -1426,6 +1434,22 @@ initGISTstate(Relation index)
 	return giststate;
 }
 
+/*
+ * Gistcanreturn is supposed to be true if ANY FetchFn method is defined.
+ * If FetchFn exists it would be used in index-only scan.
+ * Thus the responsibility rests with the opclass developer.
+ */
+Datum
+gistcanreturn(PG_FUNCTION_ARGS)
+{
+	Relation	index = (Relation) PG_GETARG_POINTER(0);
+	int			i = PG_GETARG_INT32(1);
+
+	if (OidIsValid(index_getprocid(index, i + 1, GIST_FETCH_PROC)))
+		PG_RETURN_BOOL(true);
+	else
+		PG_RETURN_BOOL(false);
+}
+
 void
 freeGISTstate(GISTSTATE *giststate)
 {
diff --git a/src/backend/access/gist/gistget.c b/src/backend/access/gist/gistget.c
index 717cb85..0925e56 100644
--- a/src/backend/access/gist/gistget.c
+++ b/src/backend/access/gist/gistget.c
@@ -227,8 +227,10 @@ gistindex_keytest(IndexScanDesc scan,
  * If tbm/ntids aren't NULL, we are doing an amgetbitmap scan, and heap
  * tuples should be reported directly into the bitmap. If they are NULL,
  * we're doing a plain or ordered indexscan. For a plain indexscan, heap
- * tuple TIDs are returned into so->pageData[]. For an ordered indexscan,
+ * tuple TIDs are returned into so->pageData. For an ordered indexscan,
  * heap tuple TIDs are pushed into individual search queue items.
+ * If index-only scan is possible, heap tuples themselves are returned
+ * into so->pageData or into search queue.
  *
  * If we detect that the index page has split since we saw its downlink
  * in the parent, we push its new right sibling onto the queue so the
@@ -241,6 +243,10 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem, double *myDistances,
 	GISTScanOpaque so = (GISTScanOpaque) scan->opaque;
 	Buffer		buffer;
 	Page		page;
+	GISTSTATE  *giststate = so->giststate;
+	Relation	r = scan->indexRelation;
+	bool		isnull[INDEX_MAX_KEYS];
+	GISTSearchHeapItem *tmpListItem;
 	GISTPageOpaque opaque;
 	OffsetNumber maxoff;
 	OffsetNumber i;
@@ -287,8 +293,6 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem, double *myDistances,
 		MemoryContextSwitchTo(oldcxt);
 	}
 
-	so->nPageData = so->curPageData = 0;
-
 	/*
 	 * check all tuples on page
 	 */
@@ -326,11 +330,20 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem, double *myDistances,
 		else if (scan->numberOfOrderBys == 0 && GistPageIsLeaf(page))
 		{
 			/*
-			 * Non-ordered scan, so report heap tuples in so->pageData[]
+			 * Non-ordered scan, so report tuples in so->pageData
+			 */
+
+			/* form tmpListItem
Re: [HACKERS] GSoC 2015 - mentors, students and admins.
I'm interested in participating as a student again. -- Best regards, Lubennikova Anastasia
[HACKERS] GSoC 2015 proposal. Bitmap Index-only Count
Hi, hackers! Here is the text of my proposal which I've submitted to GSoC (link: http://www.google-melange.com/gsoc/proposal/public/google/gsoc2015/lubennikovaav/5657382461898752). Any suggestions and comments are welcome.

*Project name* Bitmap Index-only Count

*Brief review* There is a problem of slow counting in PostgreSQL [1]. The reason why this is slow is related to the *MVCC* implementation in PostgreSQL. Index-only scans (implemented since PostgreSQL 9.2) provide some performance improvement where the *visibility map* of the table allows it. That’s good, but it works only for access methods which provide the amgettuple method. Unfortunately, GIN supports only BitmapIndexScan and has no implementation of the index_getnext() interface [2]. The idea of the proposal is to implement a Bitmap Index-Only Count method that allows counting elements without reference to the heap.

*Benefits to the PostgreSQL Community* Faster count(*) would be valuable for GIN. Probably it would accelerate count(*) for other access methods too, but I am not sure whether the difference would be significant.

*Quantifiable results* Acceleration of count(*) queries with a WHERE clause which use GIN.

*Project details* Let’s look at a count(*) query:

EXPLAIN SELECT count(*) from tbl_mytbl where val @> '{"b":2}';

Now the query plan looks like this:

Aggregate
  -> Bitmap Heap Scan on tbl_mytbl
       Recheck Cond: (val @> '{"b": 2}'::jsonb)
       -> Bitmap Index Scan on idx_mytbl
            Index Cond: (val @> '{"b": 2}'::jsonb)

The idea of the proposal is to implement a Bitmap Index-Only Count method that allows counting elements without reference to the heap. Conditions:
- all tuples are visible (the same problem as for index-only count(*));
- the result TIDBitmap is not lossy:
nchunks == 0 (int nchunks; /* number of lossy entries in pagetable */ http://doxygen.postgresql.org/structTIDBitmap.html#a85d5883056bad6863cb47a15446581c7)
- pages in TIDBitmap don’t require recheck:
    * recheck is used only on exact pages --- it indicates that although
    * only the stated tuples need be checked, the full index qual condition
    * must be checked for each (ie, these are candidate matches).
If all conditions are true, it’s possible to call the aggregate COUNT function for each tuple from the TIDBitmap returned by the Bitmap Index Scan (or BitmapAnd/BitmapOr nodes), and there’s no need to call Bitmap Heap Scan at all. As a GSoC student I will create a new node, “Bitmap Index-Only Scan”, which would take tuples from the Bitmap Index Scan node and pass them to the Aggregate node. Thus, the new query plan will be as follows:

Aggregate
  -> Bitmap Index-only Count
       -> Bitmap Index Scan on idx_mytbl
            Index Cond: (val @> '{"b": 2}'::jsonb)

*Project Schedule*
until May 25: Read documentation and source code, clarify details of implementation.
until June 30: Implement the Bitmap Index-only Count node.
1 July – 31 July: Add the Bitmap Index-only Count node to the planner.
1 August – 15 August: Final refactoring, testing and submitting a patch.

*Links*
1. https://wiki.postgresql.org/wiki/Slow_Counting
2. http://postgresql.nabble.com/Todo-item-Support-amgettuple-in-GIN-td5780810.html
-- Best regards, Lubennikova Anastasia
Re: [HACKERS] GSoC 2015 proposal. Bitmap Index-only Count
2015-03-24 18:01 GMT+04:00 Tom Lane t...@sss.pgh.pa.us: Anastasia Lubennikova lubennikov...@gmail.com writes:

    There is a problem of slow counting in PostgreSQL [1]. The reason why this is slow is related to the *MVCC* implementation in PostgreSQL. Index-only scans (implemented since PostgreSQL-9.2) providing some performance improvements where the *visibility map* of the table allows it. That’s good. But it works only for access methods which provide amgettuple method. Unfortunately GIN supports only BitmapIndexScan and has no implementation of index_getnext() interface [2].

Right ...

    As a GSoC student I will create new Node “Bitmap Index-Only Scan”, which would catch tuples from Bitmap Index Scan node and pass them to Aggregate node. Thus, new query plan will be as follow:

I'm pretty hesitant about adding a whole new plan node type (which will require quite a lot of infrastructure) for such a narrow use-case. I think the odds are good that if you proceed down this path, you will end up with something that never gets committed to Postgres.

Thanks a lot for the reply. It was just a rough idea, and I suspected it wasn't very good.

    I wonder whether it'd be possible to teach GIN to support index_getnext instead. Initially it would probably work only for cases where the index didn't have to return any columns ... but if we did it, maybe the door would be open to cases where GIN could reconstruct actual values.

Another idea is to write index_getnext() for GIN which would return some fake tuple, since the COUNT aggregate doesn't care what the tuple contains. COUNT just wants to know whether we have a tuple that satisfies the qual. Is this idea better? Is it possible for the planner to use index_getnext() for GIN only with the COUNT aggregate?
-- Best regards, Lubennikova Anastasia
Re: [HACKERS] Index-only scans for GiST.
I added a MemoryContext listCxt to avoid the memory leak. listCxt is created once in gistrescan (only for an index-only scan plan) and reset when the scan of the leaf page is finished. I am not sure whether the problem is completely solved, so I'm waiting for feedback.

    * What's the reason for turning GISTScanOpaqueData.pageData from an array to a List?

This array is a field of the structure GISTScanOpaqueData. Memory for that structure is allocated in the function gistbeginscan(). The array is static, so it's declared only once in the structure: GISTSearchHeapItem pageData[BLCKSZ/sizeof(IndexTupleData)]. But how could we know the size of the array if we don't know what data would be returned? I mean the type and the amount. I asked Alexander about that and he suggested using a List instead of an array.

indexonlyscan_gist_2.2.patch Description: Binary data
indexonlyscan_gist_docs.patch Description: Binary data
test_performance.sql Description: Binary data
[HACKERS] Wrong Assert in PageIndexMultiDelete?
Hi, hackers! I am trying to create a new index access method, and I found a strange Assert in the PageIndexMultiDelete function (http://doxygen.postgresql.org/bufpage_8c_source.html#l00791):

Assert(nitems < MaxIndexTuplesPerPage);

Is the '<' sign correct? I thought it should be '<='. Is it a bug or just my misunderstanding?
-- Best regards, Lubennikova Anastasia
Re: [HACKERS] [PATCH] Microvacuum for gist.
On Mon, Aug 3, 2015 at 12:27 PM, Anastasia Lubennikova a.lubennik...@postgrespro.ru wrote:

    1) Test and results are in attachments. Everything seems to work as expected.
    2) I dropped these notices. It was done only for debug purposes. Updated patch is attached.
    3) fixed

Good! Another couple of notes from me:
1) I think gistvacuumpage() and gistkillitems() need function-level comments.
2) ItemIdIsDead() can be used in index scan like it's done in _bt_checkkeys().

-- Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

I've added some comments. ItemIdIsDead() is used now (dead tuples are simply skipped as not matching the quals). And there is one more check of the LSN in gistkillitems to make sure that the page was not changed between reads.

-- Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

*** a/src/backend/access/gist/gist.c
--- b/src/backend/access/gist/gist.c
***************
*** 36,42 ****
  static bool gistinserttuples(GISTInsertState *state, GISTInsertStack *stack,
  				 bool unlockbuf, bool unlockleftchild);
  static void gistfinishsplit(GISTInsertState *state, GISTInsertStack *stack,
  				GISTSTATE *giststate, List *splitinfo, bool releasebuf);
! 
  #define ROTATEDIST(d) do { \
  	SplitedPageLayout *tmp=(SplitedPageLayout*)palloc(sizeof(SplitedPageLayout)); \
--- 36,42 ----
  				 bool unlockbuf, bool unlockleftchild);
  static void gistfinishsplit(GISTInsertState *state, GISTInsertStack *stack,
  				GISTSTATE *giststate, List *splitinfo, bool releasebuf);
! static void gistvacuumpage(Relation rel, Page page, Buffer buffer);
 
  #define ROTATEDIST(d) do { \
  	SplitedPageLayout *tmp=(SplitedPageLayout*)palloc(sizeof(SplitedPageLayout)); \
***************
*** 209,214 **** gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
--- 209,225 ----
  	 * because the tuple vector passed to gistSplit won't include this tuple.
  	 */
  	is_split = gistnospace(page, itup, ntup, oldoffnum, freespace);
+ 
+ 	/*
+ 	 * If leaf page is full, try at first to delete dead tuples. And then
+ 	 * check again.
+ 	 */
+ 	if ((is_split) && GistPageIsLeaf(page) && GistPageHasGarbage(page))
+ 	{
+ 		gistvacuumpage(rel, page, buffer);
+ 		is_split = gistnospace(page, itup, ntup, oldoffnum, freespace);
+ 	}
+ 
  	if (is_split)
  	{
  		/* no space for insertion */
***************
*** 1440,1442 **** freeGISTstate(GISTSTATE *giststate)
--- 1451,1522 ----
  	/* It's sufficient to delete the scanCxt */
  	MemoryContextDelete(giststate->scanCxt);
  }
+ 
+ /*
+  * gistvacuumpage() -- try to remove LP_DEAD items from the given page.
+  */
+ static void
+ gistvacuumpage(Relation rel, Page page, Buffer buffer)
+ {
+ 	OffsetNumber deletable[MaxOffsetNumber];
+ 	int			ndeletable = 0;
+ 	OffsetNumber offnum,
+ 				minoff,
+ 				maxoff;
+ 
+ 	/*
+ 	 * Scan over all items to see which ones need to be deleted according to
+ 	 * LP_DEAD flags.
+ 	 */
+ 	maxoff = PageGetMaxOffsetNumber(page);
+ 	for (offnum = FirstOffsetNumber;
+ 		 offnum <= maxoff;
+ 		 offnum = OffsetNumberNext(offnum))
+ 	{
+ 		ItemId		itemId = PageGetItemId(page, offnum);
+ 
+ 		if (ItemIdIsDead(itemId))
+ 			deletable[ndeletable++] = offnum;
+ 	}
+ 
+ 	if (ndeletable > 0)
+ 	{
+ 		START_CRIT_SECTION();
+ 
+ 		PageIndexMultiDelete(page, deletable, ndeletable);
+ 
+ 		/*
+ 		 * Mark the page as not containing any LP_DEAD items. This is not
+ 		 * certainly true (there might be some that have recently been marked,
+ 		 * but weren't included in our target-item list), but it will almost
+ 		 * always be true and it doesn't seem worth an additional page scan to
+ 		 * check it. Remember that F_HAS_GARBAGE is only a hint anyway.
+ 		 */
+ 		GistClearPageHasGarbage(page);
+ 
+ 		MarkBufferDirty(buffer);
+ 
+ 		/* XLOG stuff */
+ 		if (RelationNeedsWAL(rel))
+ 		{
+ 			XLogRecPtr	recptr;
+ 
+ 			recptr = gistXLogUpdate(rel->rd_node, buffer,
+ 									deletable, ndeletable,
+ 									NULL, 0, InvalidBuffer);
+ 
+ 			PageSetLSN(page, recptr);
+ 		}
+ 		else
[HACKERS] [PATCH] Microvacuum for gist.
Hi, I have written microvacuum support for the GiST access method. Briefly, microvacuum includes two steps:
1. When a search tells us that a tuple is invisible to all transactions, it is marked LP_DEAD and the page is marked as having dead tuples.
2. Then, when an insert touches a full page which has dead tuples, it calls microvacuum instead of splitting the page.
You can find a kind of review here [1].
[1] http://www.google-melange.com/gsoc/proposal/public/google/gsoc2015/ivanitskiy_ilya/5629499534213120
Patch is in attachments. Please review it.
-- Best regards, Lubennikova Anastasia
microvacuum_for_gist.patch Description: Binary data
Re: [HACKERS] [PATCH] Microvacuum for gist.
30.07.2015 16:33, Alexander Korotkov wrote: Hi! On Thu, Jul 30, 2015 at 2:51 PM, Anastasia Lubennikova lubennikov...@gmail.com wrote:

    I have written microvacuum support for gist access method. Briefly microvacuum includes two steps: 1. When search tells us that the tuple is invisible to all transactions it is marked LP_DEAD and page is marked as has dead tuples, 2. Then, when insert touches full page which has dead tuples it calls microvacuum instead of splitting page. You can find a kind of review here [1]. [1] http://www.google-melange.com/gsoc/proposal/public/google/gsoc2015/ivanitskiy_ilya/5629499534213120 Patch is in attachments. Please review it.

Nice! Some notes about this patch.

1) Could you give some test case demonstrating that microvacuum really works with the patch? In the end, we should see the index growing less with microvacuum.

2) Generating notices for every dead tuple would be too noisy. I suggest replacing the notice with one of the debug levels.

+ elog(NOTICE, "gistkillitems. Mark Item Dead offnum %hd, blkno %d", offnum, BufferGetBlockNumber(buffer));

3) Please, recheck the coding style. For instance, this line needs more spaces and the open brace should be on the next line.

+ if ((scan->kill_prior_tuple) && (so->curPageData > 0) && (so->curPageData == so->nPageData)) {

-- Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

1) Test and results are in attachments. Everything seems to work as expected.
2) I dropped these notices. It was done only for debug purposes. Updated patch is attached.
3) fixed // -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company *** a/src/backend/access/gist/gist.c --- b/src/backend/access/gist/gist.c *** *** 36,42 static bool gistinserttuples(GISTInsertState *state, GISTInsertStack *stack, bool unlockbuf, bool unlockleftchild); static void gistfinishsplit(GISTInsertState *state, GISTInsertStack *stack, GISTSTATE *giststate, List *splitinfo, bool releasebuf); ! #define ROTATEDIST(d) do { \ SplitedPageLayout *tmp=(SplitedPageLayout*)palloc(sizeof(SplitedPageLayout)); \ --- 36,42 bool unlockbuf, bool unlockleftchild); static void gistfinishsplit(GISTInsertState *state, GISTInsertStack *stack, GISTSTATE *giststate, List *splitinfo, bool releasebuf); ! static void gistvacuumpage(Relation rel, Page page, Buffer buffer); #define ROTATEDIST(d) do { \ SplitedPageLayout *tmp=(SplitedPageLayout*)palloc(sizeof(SplitedPageLayout)); \ *** *** 209,214 gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate, --- 209,225 * because the tuple vector passed to gistSplit won't include this tuple. */ is_split = gistnospace(page, itup, ntup, oldoffnum, freespace); + + /* +* If leaf page is full, try at first to delete dead tuples. And then +* check again. +*/ + if ((is_split) GistPageIsLeaf(page) GistPageHasGarbage(page)) + { + gistvacuumpage(rel, page, buffer); + is_split = gistnospace(page, itup, ntup, oldoffnum, freespace); + } + if (is_split) { /* no space for insertion */ *** *** 1440,1442 freeGISTstate(GISTSTATE *giststate) --- 1451,1519 /* It's sufficient to delete the scanCxt */ MemoryContextDelete(giststate-scanCxt); } + + static void + gistvacuumpage(Relation rel, Page page, Buffer buffer) + { + OffsetNumber deletable[MaxOffsetNumber]; + int ndeletable = 0; + OffsetNumber offnum, + minoff, + maxoff; + + /* +* Scan over all items to see which ones need to be deleted according to +* LP_DEAD flags. 
+ */ + maxoff = PageGetMaxOffsetNumber(page); + for (offnum = FirstOffsetNumber; + offnum <= maxoff; + offnum = OffsetNumberNext(offnum)) + { + ItemId itemId = PageGetItemId(page, offnum); + + if (ItemIdIsDead(itemId)) + deletable[ndeletable++] = offnum; + } + + if (ndeletable > 0) + { + START_CRIT_SECTION(); + + PageIndexMultiDelete(page, deletable, ndeletable); + + /* + * Mark the page as not containing any LP_DEAD items. This is not + * certainly true (there might be some that have recently been marked, + * but weren't included in our target-item list), but it will almost + * always be true and it doesn't seem worth an additional
Re: [HACKERS] How to compare different datums within from a tuple?
Can someone tell me how I can compare two datum fields when I do not know the data type in advance inside an executor function? In a nutshell, there is no way to compare Datums. Datum is an abstract data type. It's the backend-internal representation of a single value of any SQL data type. The code using a Datum has to know which type it is, since the Datum itself doesn't contain that information. For example, x less than y, where x and y are of various types that form intervals. I have found the method ExecTuplesMatch, but it is only for equality comparison, I think. Another one is ApplySortComparator... maybe that's the correct way to go? Some code to make things clearer... Datum x = heap_getattr(out->tts_tuple, node->xpos, out->tts_tupleDescriptor, &isNull1); Datum y = slot_getattr(curr, node->ypos, &isNull2); if (compareDatumWithCorrectMethod(x, y) < 0) { /* do something */ } Thx, Peter -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers Maybe you can use a DatumGetXXX function to extract the value. For example, DatumGetInt32. http://doxygen.postgresql.org/postgres_8h.html#aacbc8a3ac6d52d85feaf0b7ac1b1160c -- Best regards, Lubennikova Anastasia
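The core point here — a Datum is just an opaque machine word, and the caller must supply the type — can be sketched outside the backend. This is a toy analogy (only `Datum`, `Int32GetDatum`, and `DatumGetInt32` mirror real backend names; everything else is hypothetical), not the real route, which is to look up the type's comparison support function and apply it, e.g. via ApplySortComparator as suggested above:

```c
#include <stdint.h>
#include <string.h>

/* A Datum-like opaque value: just a machine word. The value alone
 * carries no type information, so the caller must say what it is. */
typedef uintptr_t Datum;

typedef enum { TYPE_INT32, TYPE_CSTRING } TypeTag;

static Datum Int32GetDatum(int32_t v) { return (Datum)(intptr_t) v; }
static int32_t DatumGetInt32(Datum d) { return (int32_t)(intptr_t) d; }

/* Compare two datums of the SAME caller-known type;
 * returns <0, 0, >0 like strcmp. */
static int compare_datums(TypeTag type, Datum x, Datum y)
{
    switch (type)
    {
        case TYPE_INT32:
        {
            int32_t a = DatumGetInt32(x), b = DatumGetInt32(y);
            return (a > b) - (a < b);
        }
        case TYPE_CSTRING:
            return strcmp((const char *) x, (const char *) y);
    }
    return 0;
}
```

If the type is only known at runtime (as in the question), the backend equivalent of the TypeTag is the attribute's type OID from the TupleDesc, which is what the sort-support machinery is keyed on.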
[HACKERS] Microvacuum for gist. Question about GISTPageOpaqueData flag
Hi, I'm working on microvacuum for the gist access method. Briefly, microvacuum includes two steps: 1. When a search tells us that a tuple is invisible to all transactions, it is marked LP_DEAD and the page is marked as having dead tuples. 2. Then, when an insert touches a full page which has dead tuples, it calls microvacuum instead of splitting the page. You can find a kind of review here [1]. While writing the patch, I found a strange GISTPageOpaqueData flag - F_TUPLES_DELETED http://doxygen.postgresql.org/gist_8h.html#a23812efd70313b9b10ae61376e2594f6 . Its description looks the same as that of BTP_HAS_GARBAGE http://doxygen.postgresql.org/nbtree_8h.html#a3b7c77849276ff8617edc1f84441c230 #define F_TUPLES_DELETED (1 << 2) /* some tuples on the page are dead */ #define BTP_HAS_GARBAGE (1 << 6) /* page has LP_DEAD tuples */ But they are definitely not the same thing. I found only two mentions of this flag. The function GistMarkTuplesDeleted http://doxygen.postgresql.org/gist_8h.html#a96fc3c6bb5aecfc8d2818b7010d68aac sets the flag after dead tuple deletion. Does anyone need it at all? I found no place where this flag is checked. Is this a correct use of the flag? I need advice on what would be better: - to add a new flag like F_HAS_GARBAGE, - or to delete all mentions of F_TUPLES_DELETED and use it in gist microvacuum. [1] http://www.google-melange.com/gsoc/proposal/public/google/gsoc2015/ivanitskiy_ilya/5629499534213120 -- Best regards, Lubennikova Anastasia
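The two-step scheme described above can be sketched standalone: step 1 sets per-item dead flags plus a page-level garbage hint during scans; step 2 lets insertion into a full page reclaim dead items before resorting to a split. This is a toy model with hypothetical names, not the backend API:

```c
#include <stdbool.h>

/* Toy model of microvacuum: a fixed-size "page" of items, each with
 * an LP_DEAD-style flag, plus a page-level has-garbage hint. */
#define PAGE_CAPACITY 4

typedef struct { int key; bool dead; } Item;
typedef struct { Item items[PAGE_CAPACITY]; int nitems; bool has_garbage; } Page;

/* Step 1: a scan notices an all-dead tuple and sets the hints. */
static void mark_dead(Page *p, int idx)
{
    p->items[idx].dead = true;
    p->has_garbage = true;
}

/* Step 2: insertion into a full page reclaims dead items first,
 * splitting only if that still leaves no room. Returns false when a
 * real page split would be needed. */
static bool insert_item(Page *p, int key)
{
    if (p->nitems == PAGE_CAPACITY && p->has_garbage)
    {
        int kept = 0;
        for (int i = 0; i < p->nitems; i++)
            if (!p->items[i].dead)
                p->items[kept++] = p->items[i];
        p->nitems = kept;
        p->has_garbage = false;   /* hint only: cleared after vacuum */
    }
    if (p->nitems == PAGE_CAPACITY)
        return false;             /* caller would split the page */
    p->items[p->nitems] = (Item){ key, false };
    p->nitems++;
    return true;
}
```

The point of the page-level hint is that the expensive compaction only runs when a scan has already proven there is something to reclaim.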
Re: [HACKERS] Microvacuum for gist. Question about GISTPageOpaqueData flag
2015-07-27 20:05 GMT+04:00 Heikki Linnakangas hlinn...@iki.fi: On 07/27/2015 06:46 PM, Teodor Sigaev wrote: I need an advice, what would be better: - to add new flag like F_HAS_GARBAGE, - or to delete all mentions of F_TUPLES_DELETED and use it in gist microvacuum. According to commit message: commit 2effb72e682a7dbdc9a8a60a80c22ec1fa9d8079 Author: Heikki Linnakangas heikki.linnakan...@iki.fi Date: Fri Nov 7 15:03:46 2014 +0200 .. The code that generated a record to clear the F_TUPLES_DELETED flag hasn't existed since we got rid of old-style VACUUM FULL. I kept the code that sets the flag, although it's not used for anything anymore, because it might still be interesting information for debugging purposes that some tuples have been deleted from a page. .. If Heikki doesn't change his opinion then introduce new flag. Although I don't think that we need to keep F_TUPLES_DELETED. It's certainly not needed for anything at the moment, although conceivably we might reintroduce code that needs it in the future. There are plenty of flag bits available, so let's use a new flag. If there was a shortage, I wouldn't blink reusing F_TUPLES_DELETED. - Heikki Thanks for the quick reply -- Best regards, Lubennikova Anastasia
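For illustration, the bit arithmetic behind adding a new page flag can be sketched standalone. F_TUPLES_DELETED's value is quoted from the message above; F_HAS_GARBAGE's value here is an assumption for the sketch, not taken from gist.h:

```c
#include <stdint.h>

/* Page flags share one word, one bit each, so a new garbage bit can
 * coexist with the legacy F_TUPLES_DELETED without disturbing it.
 * F_HAS_GARBAGE's value is hypothetical here. */
#define F_TUPLES_DELETED (1 << 2)  /* some tuples on the page are dead */
#define F_HAS_GARBAGE    (1 << 4)  /* assumed: page has LP_DEAD tuples */

static uint16_t set_flag(uint16_t flags, uint16_t f)   { return flags | f; }
static int      has_flag(uint16_t flags, uint16_t f)   { return (flags & f) != 0; }
static uint16_t clear_flag(uint16_t flags, uint16_t f) { return (uint16_t)(flags & ~f); }
```

Because each flag occupies a distinct bit, setting or clearing one never affects the others, which is why adding a new flag is the cheap, safe option Heikki recommends.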
[HACKERS] WIP: Covering + unique indexes.
Hi hackers, I'm working on a patch that allows combining covering and unique functionality for btree indexes. _Previous discussion was here:_ 1) Proposal thread <http://www.postgresql.org/message-id/55f2ccd0.7040...@postgrespro.ru> 2) Message with proposal clarification <http://www.postgresql.org/message-id/55f84df4.5030...@postgrespro.ru> In a nutshell, the feature allows creating an index with "key" columns and "included" columns. "key" columns can be used as scan keys. The unique constraint relates only to "key" columns. "included" columns may be used as scan keys if they have a suitable opclass. Both "key" and "included" columns can be returned from the index by an IndexOnlyScan. Btree is the default index and it's used everywhere, so it requires proper testing. Volunteers are welcome) _Use case:_ - We have a table (c1, c2, c3, c4); - We need to have a unique index on (c1, c2). - We would like to have a covering index on all columns to avoid reading heap pages. Old way: CREATE UNIQUE INDEX olduniqueidx ON oldt USING btree (c1, c2); CREATE INDEX oldcoveringidx ON oldt USING btree (c1, c2, c3, c4); What's wrong? The two indexes contain repeated data, which adds overhead to data manipulation operations and database size. New way: CREATE UNIQUE INDEX newidx ON newt USING btree (c1, c2) INCLUDING (c3, c4); The patch is attached. In 'test.sql' you can find a test with detailed comments on each step, and a comparison of the old and new indexes. The new feature has the following syntax: CREATE UNIQUE INDEX newidx ON newt USING btree (c1, c2) INCLUDING (c3, c4); The keyword INCLUDING defines the "included" columns of the index. These columns are not part of the unique constraint. Also, they are not stored in index inner pages, which allows decreasing the index size. _Results:_ 1) An additional covering index is not required anymore. 2) The new index can use an IndexOnlyScan on queries where the old index can't.
For example, explain analyze select c1, c2 from newt where c1<1 and c3<20; *more examples in 'test.sql' _Future work:_ Make opclasses for "included" columns optional. CREATE TABLE tbl (c1 int, c4 box); CREATE UNIQUE INDEX idx ON tbl USING btree (c1) INCLUDING (c4); If we don't need c4 as an index scankey, we don't need any btree opclass on it. But we still want to have it in the covering index for queries like SELECT c4 FROM tbl WHERE c1=1000; SELECT * FROM tbl WHERE c1=1000; -- Anastasia Lubennikova Postgres Professional:http://www.postgrespro.com The Russian Postgres Company test.sql Description: application/sql diff --git a/src/backend/access/common/indextuple.c b/src/backend/access/common/indextuple.c index dc588d7..83f24c3 100644 --- a/src/backend/access/common/indextuple.c +++ b/src/backend/access/common/indextuple.c @@ -19,6 +19,7 @@ #include "access/heapam.h" #include "access/itup.h" #include "access/tuptoaster.h" +#include "utils/rel.h" /* @@ -441,3 +442,30 @@ CopyIndexTuple(IndexTuple source) memcpy(result, source, size); return result; } + +/* + * Reform index tuple. Truncate nonkey (INCLUDED) attributes.
+ */ +IndexTuple +index_reform_tuple(Relation idxrel, IndexTuple olditup, int natts, int nkeyatts) +{ + TupleDesc itupdesc = RelationGetDescr(idxrel); + Datum values[INDEX_MAX_KEYS]; + bool isnull[INDEX_MAX_KEYS]; + IndexTuple newitup; + + Assert(natts <= INDEX_MAX_KEYS); + Assert(nkeyatts > 0); + Assert(nkeyatts <= natts); + + index_deform_tuple(olditup, itupdesc, values, isnull); + + /* form new tuple that will contain only key attributes */ + itupdesc->natts = nkeyatts; + newitup = index_form_tuple(itupdesc, values, isnull); + newitup->t_tid = olditup->t_tid; + + itupdesc->natts = natts; + + return newitup; +} diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c index 77c2fdf..d14df12 100644 --- a/src/backend/access/nbtree/nbtinsert.c +++ b/src/backend/access/nbtree/nbtinsert.c @@ -108,18 +108,23 @@ _bt_doinsert(Relation rel, IndexTuple itup, IndexUniqueCheck checkUnique, Relation heapRel) { bool is_unique = false; - int natts = rel->rd_rel->relnatts; + int nkeyatts = rel->rd_rel->relnatts; ScanKey itup_scankey; BTStack stack; Buffer buf; OffsetNumber offset; + Assert(rel->rd_index != NULL); + Assert(rel->rd_index->indnatts != 0); + Assert(rel->rd_index->indnkeyatts != 0); + nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel); + /* we need an insertion scan key to do our search, so build one */ itup_scankey = _bt_mkscankey(rel, itup); top: /* find the first page containing this key */ - stack = _bt_search(rel, natts, itup_scankey, false, &buf, BT_WRITE); + stack = _bt_search(rel, nkeyatts, itup_scankey, false, &buf, BT_WRITE); offset = InvalidOffset
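The truncation that index_reform_tuple() performs — deform the tuple, keep only the first nkeyatts attributes, and form a new tuple — can be modeled standalone. This toy model (hypothetical ToyTuple/reform_tuple names, not the backend API) captures just the key-prefix idea:

```c
#include <string.h>
#include <assert.h>

/* Toy model of index_reform_tuple(): an "index tuple" is just an
 * array of attribute values; truncation keeps the first nkeyatts. */
#define MAX_ATTS 8

typedef struct { int natts; int values[MAX_ATTS]; } ToyTuple;

static ToyTuple reform_tuple(const ToyTuple *old, int nkeyatts)
{
    ToyTuple out = { nkeyatts, {0} };

    assert(nkeyatts > 0 && nkeyatts <= old->natts);
    /* keep only the key prefix; included attributes are dropped */
    memcpy(out.values, old->values, (size_t) nkeyatts * sizeof(int));
    return out;
}
```

This is what lets inner pages of the covering index store only (c1, c2) while leaf pages keep (c1, c2, c3, c4), which is where the size reduction comes from.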
Re: [HACKERS] WIP: Covering + unique indexes.
08.10.2015 19:31, Thom Brown writes: On 8 October 2015 at 16:18, Anastasia Lubennikova <a.lubennik...@postgrespro.ru> wrote: [...] The definition output needs a space after "INCLUDING": # SELECT pg_get_indexdef('people_first_name_last_name_email_idx'::regclass::oid); pg_get_indexdef -- CREATE UNIQUE INDEX people_first_name_last_name_email_idx ON people USING btree (first_name, last_name) INCLUDING(email) (1 row) There is also no collation output: # CREATE UNIQUE INDEX test_idx ON people (first_name COLLATE "en_GB", last_name) INCLUDING (email COLLATE "pl_PL"); CREATE INDEX # SELECT pg_get_indexdef('test_idx'::regclass::oid); pg_get_indexdef - CREATE UNIQUE INDEX test_idx ON people USING btree (first_name COLLATE "en_GB", last_name) INCLUDING(email) (1 row) As for functioning, it works as described: # EXPLAIN SELECT email FROM people WHERE (first_name,last_name) = ('Paul','Freeman'); QUERY PLAN -- Index Only Scan using people_first_name_last_name_email_idx on people (cost=0.28..1.40 rows=1 width=21) Index Cond: ((first_name = 'Paul'::text) AND (last_name = 'Freeman'::text)) (2 rows) Typo: "included columns must not intersects with key columns" should be: "included columns must not intersect with key columns" Thank you for testing. The mentioned issues are fixed.
One thing I've noticed you can do with your patch, which you haven't mentioned, is have a non-unique covering index: # CREATE INDEX covering_idx ON people (first_name) INCLUDING (last_name); CREATE INDEX # EXPLAIN SELECT first_name, last_name FROM people WHERE first_name = 'Paul'; QUERY PLAN - Index Only Scan using covering_idx on people (cost=0.28..1.44 rows=4 width=13) Index Cond: (first_name = 'Paul'::text) (2 rows) But this appears to behave as if it were a regular multi-column index, in that it will use the index for ordering rather than sort after fetching from the index. So is this really stored the same as a multi-column index? The index sizes aren't identical, so some
[HACKERS] [PROPOSAL] Effective storage of duplicates in B-tree index.
Hi, hackers! I'm going to begin work on effective storage of duplicate keys in B-tree indexes. The main idea is to implement posting lists and posting trees for B-tree index pages, as is already done for GIN. In a nutshell, effective storage of duplicates in GIN is organised as follows: the index stores a single index tuple for each unique key. That index tuple points to a posting list which contains pointers to heap tuples (TIDs). If too many rows have the same key, multiple pages are allocated for the TIDs, and these constitute a so-called posting tree. You can find wonderful detailed descriptions in the gin readme <https://github.com/postgres/postgres/blob/master/src/backend/access/gin/README> and articles <http://www.cybertec.at/gin-just-an-index-type/>. It also makes it possible to apply a compression algorithm to the posting list/tree and significantly decrease index size. Read more in the presentation (part 1) <http://www.pgcon.org/2014/schedule/attachments/329_PGCon2014-GIN.pdf>. Currently, a new B-tree index tuple must be inserted for each table row that we index. This can possibly cause a page split. Because of MVCC, even a unique index can contain duplicates. Storing duplicates in a posting list/tree helps to avoid superfluous splits, so it seems to be a very useful improvement. Of course it requires a lot of changes in the B-tree implementation, so I need approval from the community. 1. Compatibility. It's important to keep compatibility with older index versions. I'm going to change BTREE_VERSION to 3, use the new (posting) features for v3, and keep the old implementation for v2. Any objections? 2. There are several tricks to handle non-unique keys in B-tree. More info in the btree readme <https://github.com/postgres/postgres/blob/master/src/backend/access/nbtree/README> (chapter - Differences to the Lehman & Yao algorithm). In the new version they'll become useless. Am I right? 3. Microvacuum. Killed items are marked LP_DEAD and can be deleted from the page at the time of insertion.
Now it's fine, because each item corresponds to a separate TID. But a posting list implementation requires another way. I've got two ideas: The first is to mark LP_DEAD only those tuples in which all TIDs are invisible. The second is to add an LP_DEAD flag to each TID in the posting list (tree). The latter requires a bit more space, but allows microvacuum of the posting list/tree. Which one is better? -- Anastasia Lubennikova Postgres Professional:http://www.postgrespro.com The Russian Postgres Company
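The space saving that motivates posting lists can be sketched with simple size arithmetic: N duplicates stored as plain index tuples repeat the key N times, while a posting list stores the key once followed by N TIDs. A toy model with illustrative (not on-disk) sizes and hypothetical names:

```c
#include <stddef.h>

/* Toy model: bytes needed to index ndups duplicates of one key as
 * ndups separate (key, tid) tuples vs. one posting-list entry
 * (key stored once, then ndups tids). Sizes are illustrative only. */
typedef struct { unsigned block; unsigned short offset; } Tid;

static size_t plain_entries_size(size_t key_size, size_t ndups)
{
    return ndups * (key_size + sizeof(Tid));
}

static size_t posting_list_size(size_t key_size, size_t ndups)
{
    return key_size + ndups * sizeof(Tid);
}
```

For any positive key size, the posting list wins as soon as there is more than one duplicate, and the saving grows linearly with the number of duplicates — which is exactly why superfluous page splits become avoidable.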
Re: [HACKERS] Adding since-version tags to the docs?
31.08.2015 14:06, Shulgin, Oleksandr writes: Hello, I often find it a pity that our docs are missing any information on when a certain GUC setting, SQL-level command or function was introduced. Clicking through the "this page in other versions" links at the top of a webpage does help, but you still need to do some guessing (binary search?) with the clicks. :-) It would be nice if we could make a script that would parse the sgml files and, for every symbol it finds, add a tag like "Since version 9.x". Such a script could start by checking out REL9_0_STABLE and looking through all symbols it can find, tagging them "Since 9.0". Then it could commit the result, check out the next version branch and apply said commit (some manual effort to merge it might be required), and repeat the process, assuming all newly found symbols must be introduced in this new version. That works for lists; a tabular representation might require adding a new column, and I'm not sure what would be the best format. After this process is done once, we can have a requirement that every newly introduced symbol/command be tagged manually by the patch author. Do you think such an approach will work? Is there interest in having this done? -- Alex I think that you're looking for the Feature matrix <http://www.postgresql.org/about/featurematrix/>. -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Should \o mean "everything?"
31.08.2015 23:25, David Fetter writes: On Mon, Aug 31, 2015 at 07:18:02PM +0000, Kevin Grittner wrote: David Fetter <da...@fetter.org> wrote: In a failed attempt to send the output of \pset to a pipe, I noticed that, for reasons I find difficult to explain, not every output gets redirected with \o. At first blush, I'd consider this inconsistency a bug. What have I missed? The documentation says: | Arranges to save future query results to the file filename or | pipe future results to the shell command command. If no argument | is specified, the query output is reset to the standard output. | | "Query results" includes all tables, command responses, and | notices obtained from the database server, as well as output of | various backslash commands that query the database (such as \d), | but not error messages. Are you seeing anything inconsistent with the documentation? If so, what? Perhaps an example would help clarify... postgres=# \o | perl -pE 's/^/PREFIXED!/' postgres=# \dt postgres=# PREFIXED!No relations found. postgres=# \set AUTOCOMMIT = 'on' ON_ERROR_STOP = '' PROMPT1 = '%/%R%# ' PROMPT2 = '%/%R%# ' PROMPT3 = '>> ' VERBOSITY = 'default' VERSION = 'PostgreSQL 9.4.4 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu 4.8.2-19ubuntu1) 4.8.2, 64-bit' DBNAME = 'postgres' USER = 'postgres' HOST = '/var/run/postgresql' PORT = '5432' ENCODING = 'UTF8' PSQL_EDITOR = '"/usr/local/bin/vim"' Cheers, David. Thanks for your test case. It made the question clear. That's definitely not a bug, just not a use case of the "\o" command. Maybe the documentation is a bit vague here. | "Query results" includes <...> output of various backslash commands that query the database (such as \d), | but not error messages. As I understand it, this phrase means ONLY those commands which start with '\d' (such as \dt, \di, \des etc.)
-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] [PROPOSAL] Effective storage of duplicates in B-tree index.
01.09.2015 21:23, Peter Geoghegan: On Mon, Aug 31, 2015 at 12:41 AM, Anastasia Lubennikova <a.lubennik...@postgrespro.ru> wrote: Now new B-tree index tuple must be inserted for each table row that we index. It can possibly cause page split. Because of MVCC even unique index could contain duplicates. Storing duplicates in posting list/tree helps to avoid superfluous splits. I'm glad someone is thinking about this, because it is certainly needed. I thought about working on it myself, but there is always something else to do. I should be able to assist with review, though. Thank you) So it seems to be very useful improvement. Of course it requires a lot of changes in B-tree implementation, so I need approval from community. 1. Compatibility. It's important to save compatibility with older index versions. I'm going to change BTREE_VERSION to 3. And use new (posting) features for v3, saving old implementation for v2. Any objections? It might be better to just have a flag bit for pages that are compressed -- there are IIRC 8 free bits in the B-Tree page special area flags variable. But no real opinion on this from me, yet. You have plenty of bitspace to work with to mark B-Tree pages, in any case. Hmm... If we are talking about storing duplicates in posting lists (and trees) as in GIN, I don't see a way to apply it to some pages while not applying it to others. See some notes below. 2. There are several tricks to handle non-unique keys in B-tree. More info in btree readme (chapter - Differences to the Lehman & Yao algorithm). In the new version they'll become useless. Am I right? I think that the L&Y algorithm makes assumptions for the sake of simplicity, rather than because they really believed that there were real problems. For example, they say that deletion can occur offline or something along those lines, even though that's clearly impractical. They say that because they didn't want to write a paper about deletion within B-Trees, I suppose.
See also my opinion of how they claim to not need read locks [1]. Also, note that despite the fact that the GIN README mentions "Lehman & Yao style right links", it doesn't actually do the L&Y trick of avoiding lock coupling -- the whole point of L&Y -- so that remark is misleading. This must be why B-Tree has much better concurrency than GIN in practice. Yes, thanks for the extensive explanation. I mean such tricks as moving right in _bt_findinsertloc(), for example. /*-- * If we will need to split the page to put the item on this page, * check whether we can put the tuple somewhere to the right, * instead. Keep scanning right until we * (a) find a page with enough free space, * (b) reach the last page where the tuple can legally go, or * (c) get tired of searching. * (c) is not flippant; it is important because if there are many * pages' worth of equal keys, it's better to split one of the early * pages than to scan all the way to the end of the run of equal keys * on every insert. We implement "get tired" as a random choice, * since stopping after scanning a fixed number of pages wouldn't work * well (we'd never reach the right-hand side of previously split * pages). Currently the probability of moving right is set at 0.99, * which may seem too high to change the behavior much, but it does an * excellent job of preventing O(N^2) behavior with many equal keys. *-- */ If there are no multiple tuples with the same key, we shouldn't care about this at all. It would be possible to skip these steps in the "effective B-tree implementation". That's why I want to change btree_version. So I'm really talking about a slightly different thing -- prefix compression, rather than handling duplicates. Whether or not you should do prefix compression instead of deduplication is certainly not clear to me, but it should be considered.
Also, I always imagined that prefix compression would use the highkey as the thing that is offset for each "real" IndexTuple, because it's there anyway, and that's simple. However, I suppose that that means that duplicate handling can't really work in a way that makes duplicates have a fixed cost, which may be a particularly important property to you. You're right, those are two different techniques. 1. Effective storage of duplicates, which I propose, works with equal keys and allows us to remove repeats. Index tuples are stored like this: IndexTupleData + Attrs (key) | IndexTupleData + Attrs (key) | IndexTupleData + Attrs (key) If all Attrs are equal, it seems reasonable not to repeat them. So we can store them in the following structure: MetaData + Attrs (key) | IndexTupleData | IndexTupleData | IndexTupleData This is a posting list. It doesn't require significant changes to the index page layout, because we can use an ordinary IndexTupleData for meta informa
Re: [HACKERS] PATCH: index-only scans with partial indexes
25.08.2015 20:19, Jeff Janes writes: On Fri, Jul 10, 2015 at 11:29 AM, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote: Hi, currently partial indexes end up not using index only scans in most cases, because check_index_only() is overly conservative, as explained in this comment: * XXX this is overly conservative for partial indexes, since we will * consider attributes involved in the index predicate as required even * though the predicate won't need to be checked at runtime. (The same * is true for attributes used only in index quals, if we are certain * that the index is not lossy.) However, it would be quite expensive * to determine that accurately at this point, so for now we take the * easy way out. In other words, unless you include columns from the index predicate in the index, the planner will decide index only scans are not possible. Which is a bit unfortunate, because those columns are not needed at runtime, and will only increase the index size (and the main benefit of partial indexes is size reduction). The attached patch fixes this by only considering clauses that are not implied by the index predicate. The effect is simple: create table t as select i as a, i as b from generate_series(1,1000) s(i); create index tidx_partial on t(b) where a > 1000 and a < 2000; vacuum freeze t; analyze t; explain analyze select count(b) from t where a > 1000 and a < 2000; However, "explain analyze select sum(b) from t where a > 1000 and a < 1999;" still doesn't use the index only scan. Isn't that also implied by the predicate? In this example it is correct that the index-only scan is not used. If I understand partial indexes right, when the index predicate and the search clause are not equal, the index scan must recheck values when fetching them. 'tidx_partial' in the example above has no information about the 'a' attribute besides the index->indpred, so it is impossible to recheck the qual without referring to the table.
In this example: create index tidx_partial on t(a) where a > 1000 and a < 2000; explain analyze select sum(a) from t where a > 1000 and a < 1999; it can use an IndexOnlyScan.
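The implication test at the heart of this discussion can be sketched for integer interval predicates: a query clause needs no runtime recheck only if the index predicate's interval is contained in the clause's interval. This toy model (hypothetical names; the real planner uses predicate_implied_by()) shows why a < 1999 still needs a recheck under a predicate of a > 1000 AND a < 2000:

```c
#include <stdbool.h>

/* Toy model of the implication test the planner needs: a predicate
 * of the form lo < a < hi. A query clause is implied by the index
 * predicate iff every value satisfying the predicate also satisfies
 * the clause, i.e. the predicate interval is contained in the
 * clause interval. */
typedef struct { int lo, hi; } OpenInterval;   /* lo < a < hi */

static bool clause_implied_by_predicate(OpenInterval clause, OpenInterval pred)
{
    return pred.lo >= clause.lo && pred.hi <= clause.hi;
}
```

A clause that is not implied must be rechecked at runtime, so its columns must be present in the index; that is exactly why the index on (b) cannot serve a < 1999 as an index-only scan.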
Re: [HACKERS] [PATCH] Microvacuum for gist.
04.09.2015 15:11, Teodor Sigaev: Some notes: 1. gistvacuumpage(): OffsetNumber deletable[MaxOffsetNumber]; Seems MaxOffsetNumber is too much; MaxIndexTuplesPerPage is enough. Fixed. 2. Loop in gistkillitems() for searching the heap pointer. It seems to me that it could be a performance problem. To fix it, we need to remember the index tuple's offset number somewhere near GISTScanOpaqueData->killedItems. Then gistkillitems() will loop over offsets and compare the heap pointer from killedItems with the index tuple; if they don't match, just skip this index tuple. Thanks for the suggestion. I've rewritten this function. Now killedItems[] contains only the OffsetNumbers of tuples which we are going to delete. The PageLSN check is enough to ensure that nothing has changed on the page; the heap pointer recheck is unnecessary. (It's really important for btree, where a tuple could be inserted in the middle of a page. But we can't have such a situation for a GiST index page.) It works 50% faster than before. 3. Connected with the previous point: could you show some performance tests? A performance test is attached. The test is as follows: a portion of tuples is deleted and then selected several times. Without microvacuum, all 'select' queries are executed in about the same time: Time: 360,468 ms Time: 243,197 ms Time: 238,310 ms Time: 238,018 ms With microvacuum, the first 'select' invokes gistkillitems(). It's executed a bit slower than before, but the following queries are executed significantly faster than without microvacuum: Time: 368,780 ms Time: 69,769 ms Time: 9,545 ms Time: 12,427 ms Please review the patch again. I could have missed something. P.S. Do I need to write any documentation update?
-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company *** a/src/backend/access/gist/gist.c --- b/src/backend/access/gist/gist.c *** *** 36,41 static bool gistinserttuples(GISTInsertState *state, GISTInsertStack *stack, --- 36,42 bool unlockbuf, bool unlockleftchild); static void gistfinishsplit(GISTInsertState *state, GISTInsertStack *stack, GISTSTATE *giststate, List *splitinfo, bool releasebuf); + static void gistvacuumpage(Relation rel, Page page, Buffer buffer); #define ROTATEDIST(d) do { \ *** *** 209,214 gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate, --- 210,226 * because the tuple vector passed to gistSplit won't include this tuple. */ is_split = gistnospace(page, itup, ntup, oldoffnum, freespace); + + /* + * If leaf page is full, try at first to delete dead tuples. And then + * check again. + */ + if (is_split && GistPageIsLeaf(page) && GistPageHasGarbage(page)) + { + gistvacuumpage(rel, page, buffer); + is_split = gistnospace(page, itup, ntup, oldoffnum, freespace); + } + if (is_split) { /* no space for insertion */ *** *** 1440,1442 freeGISTstate(GISTSTATE *giststate) --- 1452,1523 /* It's sufficient to delete the scanCxt */ MemoryContextDelete(giststate->scanCxt); } + + /* + * gistvacuumpage() -- try to remove LP_DEAD items from the given page. + */ + static void + gistvacuumpage(Relation rel, Page page, Buffer buffer) + { + OffsetNumber deletable[MaxOffsetNumber]; + int ndeletable = 0; + OffsetNumber offnum, + minoff, + maxoff; + + /* + * Scan over all items to see which ones need to be deleted according to + * LP_DEAD flags. 
+ */ + maxoff = PageGetMaxOffsetNumber(page); + for (offnum = FirstOffsetNumber; + offnum <= maxoff; + offnum = OffsetNumberNext(offnum)) + { + ItemId itemId = PageGetItemId(page, offnum); + + if (ItemIdIsDead(itemId)) + deletable[ndeletable++] = offnum; + } + + if (ndeletable > 0) + { + START_CRIT_SECTION(); + + PageIndexMultiDelete(page, deletable, ndeletable); + + /* + * Mark the page as not containing any LP_DEAD items. This is not + * certainly true (there might be some that have recently been marked, + * but weren't included in our target-item list), but it will almost + * always be true and it doesn't seem worth an additional page scan to + * check it. Remember that F_HAS_GARBAGE is only a hint anyway. + */ + GistClearPageHasGarbage(page); + + MarkBufferDirty(buffer); + + /* XLOG stuff */ + if (RelationNeedsWAL(rel)) + { + XLogRecPtr recptr; + + recptr = gistXLogUpdate(rel->rd_node, buffer, + deletable, ndeletable, + NULL, 0, InvalidBuffer); + + PageSetLSN(page, recptr); + } + else + PageSetLSN(page, gistGetFakeLSN(rel)); + + END_CRIT_SECTION(); + } + + /* + * Note: if we didn't find any LP_DEAD items, then the page's + * F_HAS_GARBAGE hint bit is falsely set. We do not bother expending a + * separate write to clear it, however. We will clear it when we split + * the page. + */ + } *** a/src/backend/access/gist/gistget.
Re: [HACKERS] [PATCH] Microvacuum for gist.
Fixed patch is attached. 08.09.2015 13:47, Teodor Sigaev: BTW, I slightly modify your test to provide more stable results. Thank you, I tried it, Results are nearly the same. Without microvacuum Time: 312,955 ms Time: 264,597 ms Time: 243,286 ms Time: 243,679 ms With microvacuum: Time: 354,097 ms Time: 82,206 ms Time: 11,714 ms Time: 11,277 ms -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c index 0e49959..bb48782 100644 --- a/src/backend/access/gist/gist.c +++ b/src/backend/access/gist/gist.c @@ -36,6 +36,7 @@ static bool gistinserttuples(GISTInsertState *state, GISTInsertStack *stack, bool unlockbuf, bool unlockleftchild); static void gistfinishsplit(GISTInsertState *state, GISTInsertStack *stack, GISTSTATE *giststate, List *splitinfo, bool releasebuf); +static void gistvacuumpage(Relation rel, Page page, Buffer buffer); #define ROTATEDIST(d) do { \ @@ -209,6 +210,17 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate, * because the tuple vector passed to gistSplit won't include this tuple. */ is_split = gistnospace(page, itup, ntup, oldoffnum, freespace); + + /* + * If leaf page is full, try at first to delete dead tuples. And then + * check again. + */ + if (is_split && GistPageIsLeaf(page) && GistPageHasGarbage(page)) + { + gistvacuumpage(rel, page, buffer); + is_split = gistnospace(page, itup, ntup, oldoffnum, freespace); + } + if (is_split) { /* no space for insertion */ @@ -1440,3 +1452,70 @@ freeGISTstate(GISTSTATE *giststate) /* It's sufficient to delete the scanCxt */ MemoryContextDelete(giststate->scanCxt); } + +/* + * gistvacuumpage() -- try to remove LP_DEAD items from the given page. 
+ */
+static void
+gistvacuumpage(Relation rel, Page page, Buffer buffer)
+{
+    OffsetNumber deletable[MaxOffsetNumber];
+    int         ndeletable = 0;
+    OffsetNumber offnum,
+                maxoff;
+
+    /*
+     * Scan over all items to see which ones need to be deleted according to
+     * LP_DEAD flags.
+     */
+    maxoff = PageGetMaxOffsetNumber(page);
+    for (offnum = FirstOffsetNumber;
+         offnum <= maxoff;
+         offnum = OffsetNumberNext(offnum))
+    {
+        ItemId      itemId = PageGetItemId(page, offnum);
+
+        if (ItemIdIsDead(itemId))
+            deletable[ndeletable++] = offnum;
+    }
+
+    if (ndeletable > 0)
+    {
+        START_CRIT_SECTION();
+
+        PageIndexMultiDelete(page, deletable, ndeletable);
+
+        /*
+         * Mark the page as not containing any LP_DEAD items. This is not
+         * certainly true (there might be some that have recently been marked,
+         * but weren't included in our target-item list), but it will almost
+         * always be true and it doesn't seem worth an additional page scan to
+         * check it. Remember that F_HAS_GARBAGE is only a hint anyway.
+         */
+        GistClearPageHasGarbage(page);
+
+        MarkBufferDirty(buffer);
+
+        /* XLOG stuff */
+        if (RelationNeedsWAL(rel))
+        {
+            XLogRecPtr  recptr;
+
+            recptr = gistXLogUpdate(rel->rd_node, buffer,
+                                    deletable, ndeletable,
+                                    NULL, 0, InvalidBuffer);
+
+            PageSetLSN(page, recptr);
+        }
+        else
+            PageSetLSN(page, gistGetFakeLSN(rel));
+
+        END_CRIT_SECTION();
+    }
+
+    /*
+     * Note: if we didn't find any LP_DEAD items, then the page's
+     * F_HAS_GARBAGE hint bit is falsely set. We do not bother expending a
+     * separate write to clear it, however. We will clear it when we split
+     * the page.
+     */
+}
diff --git a/src/backend/access/gist/gistget.c b/src/backend/access/gist/gistget.c
index 20f695c..8a8d121 100644
--- a/src/backend/access/gist/gistget.c
+++ b/src/backend/access/gist/gistget.c
@@ -24,6 +24,74 @@
 #include "utils/memutils.h"
 #include "utils/rel.h"

+/*
+ * gistkillitems() -- set LP_DEAD state for items an indexscan caller has
+ * told us were killed.
+ *
+ * We re-read the page here, so it's important to check the page LSN. If the
+ * page has been modified since the last read (as determined by LSN), we
+ * cannot flag any entries because it is possible that the old entry was
+ * vacuumed away and the TID was re-used by a completely different heap tuple.
+ */
+static void
+gistkillitems(IndexScanDesc scan)
+{
+    GISTScanOpaque so = (GISTScanOpaque) scan->opaque;
+    Buffer      buffer;
+    Page        page;
+    OffsetNumber offnum;
+    ItemId      iid;
+    int         i;
+    bool        killedsomething = false;
+
+    Assert(so->curBlkno != InvalidBlockNumber);
+    Assert(so->killedItems != NULL);
+
+    buffer = ReadBuffer(scan->indexRelation, so->curBlkno);
+    if (!BufferIsValid(buffer))
+        return;
+
+    LockBuffer(buffer, GIST_SHARE);
+    gistcheckpage(scan->indexRelation, buffer);
+    page = BufferGetPage(buffer);
+
+    /*
+     * If the page LSN differs, the page was modified since the last read.
+     * killedItems could be not valid, so applying LP_DEAD hints is not safe.
+     */
+    if (PageGetLSN(page) != so->curPageLSN)
+    {
+        UnlockReleaseBuffer(buffer);
+        so->
Re: [HACKERS] [PATCH] Microvacuum for gist.
> Hi, I don't know too much about gist, but did a quick read through.
> Mostly spotting some stylistic issues. Please fix those, making it
> easier for the next reviewer.

Thank you for the review! All mentioned issues are fixed.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 0e49959..f658831 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -36,6 +36,7 @@
 static bool gistinserttuples(GISTInsertState *state, GISTInsertStack *stack,
                  bool unlockbuf, bool unlockleftchild);
 static void gistfinishsplit(GISTInsertState *state, GISTInsertStack *stack,
                 GISTSTATE *giststate, List *splitinfo, bool releasebuf);
+static void gistvacuumpage(Relation rel, Page page, Buffer buffer);

 #define ROTATEDIST(d) do { \
@@ -209,6 +210,17 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
      * because the tuple vector passed to gistSplit won't include this tuple.
      */
     is_split = gistnospace(page, itup, ntup, oldoffnum, freespace);
+
+    /*
+     * If leaf page is full, try at first to delete dead tuples. And then
+     * check again.
+     */
+    if (is_split && GistPageIsLeaf(page) && GistPageHasGarbage(page))
+    {
+        gistvacuumpage(rel, page, buffer);
+        is_split = gistnospace(page, itup, ntup, oldoffnum, freespace);
+    }
+
     if (is_split)
     {
         /* no space for insertion */
@@ -1440,3 +1452,72 @@ freeGISTstate(GISTSTATE *giststate)
     /* It's sufficient to delete the scanCxt */
     MemoryContextDelete(giststate->scanCxt);
 }
+
+/*
+ * gistvacuumpage() -- try to remove LP_DEAD items from the given page.
+ */
+static void
+gistvacuumpage(Relation rel, Page page, Buffer buffer)
+{
+    OffsetNumber deletable[MaxOffsetNumber];
+    int         ndeletable = 0;
+    OffsetNumber offnum,
+                minoff,
+                maxoff;
+
+    /*
+     * Scan over all items to see which ones need to be deleted according to
+     * LP_DEAD flags.
+     */
+    maxoff = PageGetMaxOffsetNumber(page);
+    for (offnum = FirstOffsetNumber;
+         offnum <= maxoff;
+         offnum = OffsetNumberNext(offnum))
+    {
+        ItemId      itemId = PageGetItemId(page, offnum);
+
+        if (ItemIdIsDead(itemId))
+            deletable[ndeletable++] = offnum;
+    }
+
+    if (ndeletable > 0)
+    {
+        START_CRIT_SECTION();
+
+        PageIndexMultiDelete(page, deletable, ndeletable);
+
+        /*
+         * Mark the page as not containing any LP_DEAD items. This is not
+         * certainly true (there might be some that have recently been marked,
+         * but weren't included in our target-item list), but it will almost
+         * always be true and it doesn't seem worth an additional page scan to
+         * check it. Remember that F_HAS_GARBAGE is only a hint anyway.
+         */
+        GistClearPageHasGarbage(page);
+
+        MarkBufferDirty(buffer);
+
+        /* XLOG stuff */
+        if (RelationNeedsWAL(rel))
+        {
+            XLogRecPtr  recptr;
+
+            recptr = gistXLogUpdate(rel->rd_node, buffer,
+                                    deletable, ndeletable,
+                                    NULL, 0, InvalidBuffer);
+
+            PageSetLSN(page, recptr);
+        }
+        else
+            PageSetLSN(page, gistGetFakeLSN(rel));
+
+        END_CRIT_SECTION();
+    }
+
+    /*
+     * Note: if we didn't find any LP_DEAD items, then the page's
+     * F_HAS_GARBAGE hint bit is falsely set. We do not bother expending a
+     * separate write to clear it, however. We will clear it when we split
+     * the page.
+     */
+}
diff --git a/src/backend/access/gist/gistget.c b/src/backend/access/gist/gistget.c
index 20f695c..e655a05 100644
--- a/src/backend/access/gist/gistget.c
+++ b/src/backend/access/gist/gistget.c
@@ -24,6 +24,97 @@
 #include "utils/memutils.h"
 #include "utils/rel.h"

+/*
+ * gistkillitems() -- set LP_DEAD state for items an indexscan caller has
+ * told us were killed.
+ *
+ * We match items by heap TID before marking them. If an item has moved off
+ * the current page due to a split, we'll fail to find it and do nothing
+ * (this is not an error case --- we assume the item will eventually get
+ * marked in a future indexscan).
+ *
+ * We re-read the page here, so it's important to check the page LSN. If the
+ * page has been modified since the last read (as determined by LSN), we
+ * cannot flag any entries because it is possible that the old entry was
+ * vacuumed away and the TID was re-used by a completely different heap tuple.
+ */
+static void
+gistkillitems(IndexScanDesc scan)
+{
+    GISTScanOpaque so = (GISTScanOpaque) scan->opaque;
+    Buffer      buffer;
+    Page        page;
+    OffsetNumber minoff;
+    OffsetNumber maxoff;
+    int         i;
+    bool        killedsomething = false;
+
+    Assert(so->curBlkno != InvalidBlockNumber);
+
+    buffer = ReadBuffer(scan->indexRelation, so->curBlkno);
+    if (!BufferIsValid(buffer))
+        return;
+
+    LockBuffer(buffer, GIST_SHARE);
+    gistcheckpage(scan->indexRelation, buffer);
+    page = BufferGetPage(buffer);
+
+    /*
+     * If the page LSN differs, the page was modified since the last read.
+     * killedItems could be not valid, so applying LP_DEAD hints is not safe.
Re: [HACKERS] [PATCH] Microvacuum for gist.
16.09.2015 07:30, Jeff Janes:
> The commit of this patch seems to have created a bug in which updated
> tuples can disappear from the index, while remaining in the table. It
> looks like the bug depends on going through a crash-recovery cycle, but
> I am not sure of that yet.
>
> I've looked through the commit diff and don't see anything obviously
> wrong. I notice index tuples are marked dead with only a buffer content
> share lock, and the page is defragmented with only a buffer exclusive
> lock (as opposed to a super-exclusive buffer clean-up lock). But as far
> as I can tell, both of those should be safe on an index. Also, if that
> were the bug, it should happen without crash-recovery.
>
> The test is pretty simple. I create a 10,000-row table with a
> unique-by-construction id column with a btree_gist index on it and a
> counter column, and fire single-row updates of the counter for random
> ids in high concurrency (8 processes running flat out). I force the
> server to crash frequently with simulated torn-page writes in which
> md.c writes a partial page and then PANICs. Eventually (1 to 3 hours)
> the updates start indicating they updated 0 rows. At that point, a
> forced table scan will find the row, but the index doesn't.
>
> Any hints on how to proceed with debugging this? If I can't get it to
> reproduce the problem in the absence of crash-recovery cycles with an
> overnight run, then I think my next step will be to run it over
> hot-standby and see if WAL replay in the absence of crashes might be
> broken as well.

I've found the bug. It is caused by mixing calls to PageIndexMultiDelete() in gistvacuumpage() and PageIndexTupleDelete() in gistRedoPageUpdateRecord(): these two functions conflict. I've fixed my patch by changing MultiDelete to TupleDelete in gistvacuumpage(). The patch is attached. But it seems to me that it would be better to rewrite all mentions of TupleDelete to MultiDelete in the GiST code. I'm working on it.
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 4edc5a7..1d02e1b 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -1478,14 +1478,20 @@ gistvacuumpage(Relation rel, Page page, Buffer buffer)
         ItemId      itemId = PageGetItemId(page, offnum);

         if (ItemIdIsDead(itemId))
-            deletable[ndeletable++] = offnum;
+        {
+            deletable[ndeletable] = offnum - ndeletable;
+            ndeletable++;
+        }
     }

     if (ndeletable > 0)
     {
+        int         i;
+
         START_CRIT_SECTION();

-        PageIndexMultiDelete(page, deletable, ndeletable);
+        for (i = 0; i < ndeletable; i++)
+            PageIndexTupleDelete(page, deletable[i]);

         /*
          * Mark the page as not containing any LP_DEAD items. This is not

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS][PROPOSAL] Covering + unique indexes.
Proposal Clarification.

I see that the discussion has become too complicated, so I'd like to clarify what we are talking about. We are discussing two different improvements of indexes. One is the "partially unique index" and the other is the "index with included columns".

Let's look at an example.
- We have a table tbl(f1, f2, f3, f4).
- We want to have a unique index on (f1, f2).
- We want to have an index on (f1, f2, f3) which allows us to use the index for complex "where" clauses.
- We would like to have a covering index on all columns to avoid reading heap pages.

What we are doing now:
CREATE UNIQUE INDEX on tbl(f1, f2);
CREATE INDEX on tbl(f1, f2, f3, f4);

What's wrong?
- Two indexes with repeated data: overhead to data manipulation operations and database size.
- We don't need f4 as an index key. But we have to...
- A problem related to the previous one: it's possible that f4 has no opclass for our index, so there's no way to include it in the index at all, even though we don't need any opclass for it.

Suggestion:
CREATE INDEX idx ON tbl (f1, f2, f3) [UNIQUE ON (f1, f2)] [INCLUDE (f4)];

Let's review it step by step.

1. "partially unique index"

CREATE INDEX idx ON tbl (f1, f2, f3) UNIQUE ON (f1, f2);

It means that we want to have columns (f1, f2, f3) as index keys in our index, but enforce uniqueness only on the first two. It allows us to insert (1,1,1), (1,2,1) and restricts inserting (1,1,1), (1,1,2). It doesn't affect "select" queries. The following query can use an index-only scan:

SELECT f1, f2, f3 WHERE f1 ... AND f2 ... AND f3 ...;

We no longer have to maintain two indexes. Just one!
_bt_isequal() <http://doxygen.postgresql.org/nbtinsert_8c.html#abcfb7a3dcd40a5d1759652239f3a0115> works with (f1, f2).
_bt_compare() <http://doxygen.postgresql.org/nbtsearch_8c.html#af689dabb25e99f551b68aa9b7a1e6ea6> works with (f1, f2, f3).

2. "index with included columns"

It goes well with both unique and non-unique indexes.

CREATE INDEX idx ON tbl (f1, f2, f3) INCLUDE (f4);

What we get here:
- f4 is not a search key.
- f4 need not have an opclass for our index.
- f4 is only in the leaf pages, and it does not bloat the internal nodes of the b-tree.
- f4 can still be retrieved without going to the heap, so including it in the leaf nodes makes it possible to do index-only scans more often.

The following query can use an index-only scan:
SELECT f1, f2, f3, f4 WHERE f1 ... AND f2 ... AND f3 ...;

And this one wouldn't use an index-only scan, because a recheck on f4 is required:
SELECT f1, f2, f3, f4 WHERE f1 ... AND f2 ... AND f3 ... AND f4 ...;

Catalog changes.

Now (pg_index <http://doxygen.postgresql.org/pg__index_8h.html#a5065be0408f03576083a30c97b43a09a>):
int16  indnatts;     /* number of columns in index */
bool   indisunique;  /* is this a unique index? */

New (pg_index):
int16  ind_n_unique_atts;  /* number of unique columns in index, counted from the beginning of the index; 0 means the index is not unique */
int16  ind_n_key_atts;     /* number of index key columns, counted from the beginning of the index */
int16  ind_n_total_atts;   /* number of columns in index */

In our case:
ind_n_unique_atts = 2;  // f1, f2
ind_n_key_atts = 3;     // f1, f2, f3
ind_n_total_atts = 4;   // f1, f2, f3, f4

P.S. I use many ideas from the discussion without quotations just because I'd like to keep this message readable. Thanks to everyone.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS][PROPOSAL] Covering + unique indexes.
15.09.2015 12:11, Vik Fearing:
> On 09/15/2015 10:57 AM, David Rowley wrote:
>> I agree that the form
>> CREATE UNIQUE INDEX i ON t (f1, f2, f3) INCLUDE (f4)
>> is clear. f4 will be used in row compare, and actually the planner will
>> be able to use it as a unique index (f1, f2, f3) with additional f4, or
>> as a unique index (f1, f2, f3, f4), because uniqueness on (f1, f2, f3)
>> guarantees uniqueness on (f1, f2, f3, f4).
>
> I'd vote for this too. However, INCLUDE does not seem to be a reserved
> word at the moment. What about
> CREATE UNIQUE INDEX i ON t (f1, f2, f3) WITH (f4);
> ?

WITH seems ambiguous to me. It refers to CTEs, so I expect to see a kind of query expression after it. But maybe that's just a matter of habit.

BTW, that's the first syntax change I'm working with. Is there any convention in PostgreSQL about new keywords and so on? Where can I find it?

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
[HACKERS][PROPOSAL] Covering + unique indexes.
Hi, hackers!

Use case:
Index-only scans are a wonderful feature that speeds up select queries on indexed columns. Therefore some users want to create multicolumn indexes on columns which are queried often. But if there's a unique constraint on some column, they have to maintain another unique index, even if the column is already in the covering index. This adds overhead to data manipulation operations and database size.

I've started work on a patch that allows covering and unique functionality to be combined. The main idea is to allow users to create multicolumn indexes with a definite number of unique columns. For example (don't mind the SQL syntax here, please):

CREATE INDEX index ON table (c1, c2, c3) UNIQUE ON (c1, c2);

The created index has three columns, but only two of them carry the unique constraint.

This idea has an obvious restriction: we can enforce uniqueness only on the leading index columns. There is no clear way to maintain the following index:

CREATE INDEX index ON table (c1, c2, c3) UNIQUE ON (c1, c3);

So I suggest the following syntax:

CREATE [UNIQUE [ON FIRST {COLUMN | n_unique_column COLUMNS}]] INDEX ON table_name (column_name1, column_name2 ...);

Examples:
CREATE UNIQUE INDEX ON table_name (c1, c2, c3);                    // (c1, c2, c3) must be UNIQUE. That's how it works now.
CREATE UNIQUE ON FIRST COLUMN INDEX ON table_name (c1, c2, c3);    // (c1) must be UNIQUE
CREATE UNIQUE ON FIRST 2 COLUMNS INDEX ON table_name (c1, c2, c3); // (c1, c2) must be UNIQUE
CREATE UNIQUE ON FIRST 3 COLUMNS INDEX ON table_name (c1, c2, c3); // (c1, c2, c3) must be UNIQUE

The next issue is pg_index changes. Now there's only a boolean flag:

bool indisunique;  /* is this a unique index? */

But the new algorithm requires storing a single number:

uint16 n_unique_columns;  /* number of leading index columns which carry the unique constraint */

I think that the numbers of the attributes themselves are not needed. Is that right?

I'd like to see your suggestions about the syntax changes. And of course any other comments are welcome.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS][PROPOSAL] Covering + unique indexes.
15.09.2015 12:18, David Rowley:
> On 12 September 2015 at 00:45, Anastasia Lubennikova
> <a.lubennik...@postgrespro.ru <mailto:a.lubennik...@postgrespro.ru>> wrote:
>> I've started work on a patch that allows covering and unique
>> functionality to be combined.
>
> Great to hear someone is working on this!

Thank you! It looks like a very interesting project to me :)

>> The next issue is pg_index changes. Now there's only a boolean flag
>>   bool indisunique; /* is this a unique index? */
>> But the new algorithm requires storing a single number
>>   uint16 n_unique_columns; /* number of leading index columns which
>>   carry the unique constraint */
>> I think that the numbers of the attributes themselves are not needed.
>> Is that right?
>
> I think the total number of attributes is already in indnatts. I imagine
> you're planning to keep the indexed columns at the start of the indkey[]
> array, and just use n_unique_columns to mark how many of the indkey[]
> attributes to check as indexed columns. I'd imagine the change would be
> fairly simple from a planner point of view, as you'd just need to check
> columns 1 to n_unique_columns instead of 1 to indnatts. Although I have
> a tendency to underestimate these things :(

I've asked about that for the same reason. I'm not too deep into the access method and btree code, so I don't want to miss any hidden dependencies.

As I see it, changes are required in _bt_isequal() <http://doxygen.postgresql.org/nbtinsert_8c.html#abcfb7a3dcd40a5d1759652239f3a0115>. Instead of

for (i = 1; i <= keysz; i++) {}  // where keysz is actually natts = rel->rd_rel->relnatts

the loop should be

for (i = 1; i <= nunique; i++) {}  // where nunique is the value of rel->rd_index->indnunique

Look at _bt_check_unique <http://doxygen.postgresql.org/nbtinsert_8c.html#a96eb8c53ffdf53f139b037194a9721a3> and pg_class <http://doxygen.postgresql.org/pg__class_8h.html#ac8bf924d36feee5f3ca4c36aa01c75ec> for more info. indnunique is a new field which I suggest adding to pg_index <http://doxygen.postgresql.org/pg__index_8h.html#a5065be0408f03576083a30c97b43a09a> instead of indisunique (or maybe in addition to it).

But I have doubts about the problem which Jim has mentioned:
> Are we certain that no index type could ever support an index on
> (f1, f2, f3) UNIQUE(f1, f3)? Even if it doesn't make sense for btree,
> perhaps some other index could handle it.

If that's really so, we certainly have to keep in pg_index not just the number of unique columns (indnunique), but their numbers themselves (an array indnuniqueatts, on the model of indnatts).

> I imagine you don't want to name the new column n_unique_columns, since
> it does not fit too well with non-unique indexes.

Hm, I think it would be quite clear to set it to zero for non-unique indexes: (nunique == 0) is equivalent to (indisunique == false). But maybe I've missed some reason why we should keep the indisunique flag.

> Perhaps just indindexedatts, or something slightly along those lines.
> But perhaps it would be a good idea to also rename "ncolumns" in code,
> to ensure that any non-updated code does not even compile. Perhaps
> including "tot" or "total" in there might help indicate its new meaning.

I hope that all these changes are not needed. Just continue to use ncolumns for all indexed columns, and the new variable nunique (or something like that) is for the number of unique columns in the index.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS]WIP: Covering + unique indexes.
03.12.2015 04:03, Robert Haas wrote:
> On Tue, Dec 1, 2015 at 7:53 AM, Anastasia Lubennikova
> <a.lubennik...@postgrespro.ru> wrote:
>> If we don't need c4 as an index scankey, we don't need any btree
>> opclass on it. But we still want to have it in the covering index for
>> queries like
>> SELECT c4 FROM tbl WHERE c1=1000; // uses the IndexOnlyScan
>> SELECT * FROM tbl WHERE c1=1000;  // uses the IndexOnlyScan
>> The patch "optional_opclass" completely ignores opclasses of included
>> attributes.
>
> OK, I don't get it. Why have an opclass here at all, even optionally?

We have no opclass on c4, and there's no need to have one. But now (without the patch) it's impossible to create a covering index which contains columns with no btree opclass:

test=# create index on tbl using btree (c1, c4);
ERROR: data type box has no default operator class for access method "btree"

ComputeIndexAttrs() processes the list of index attributes, trying to get an opclass for each of them via GetIndexOpClass(). The patch drops this check for included attributes. So it makes it possible to store any datatype in the btree and gain the IndexOnlyScan advantages.

I hope that this helps to clarify.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS]WIP: Covering + unique indexes.
Finally, the completed patch "covering_unique_3.0.patch" is here. It includes the functionality discussed above in the thread, regression tests and a docs update. I think it's quite ready for review.

Future work:
Besides that, I'd like to get feedback about the attached patch "optional_opclass_3.0.patch". It should be applied on top of "covering_unique_3.0.patch". Actually, this patch is the first step to make opclasses for "included" columns optional and implement real covering indexing.

Example:
CREATE TABLE tbl (c1 int, c4 box);
CREATE UNIQUE INDEX idx ON tbl USING btree (c1) INCLUDING (c4);

If we don't need c4 as an index scankey, we don't need any btree opclass on it. But we still want to have it in the covering index for queries like

SELECT c4 FROM tbl WHERE c1=1000; // uses the IndexOnlyScan
SELECT * FROM tbl WHERE c1=1000;  // uses the IndexOnlyScan

The patch "optional_opclass" completely ignores opclasses of included attributes. To see the difference, look at the explain analyze output:

explain analyze select * from tbl where c1=2 and c4 && box '(0,0,1,1)';
                              QUERY PLAN
---------------------------------------------------------------------
 Index Only Scan using idx on tbl (cost=0.13..4.15 rows=1 width=36) (actual time=0.010..0.013 rows=1 loops=1)
   Index Cond: (c1 = 2)
   Filter: (c4 && '(1,1),(0,0)'::box)

"Index Cond" shows the index ScanKey conditions, and "Filter" is for conditions which are applied after the index scan. Anyway, it is faster than the SeqScan that we had before, because IndexOnlyScan avoids extra heap fetches.

As I already said, this patch is just WIP, so the included opclass is not "optional" but actually "ignored". And the following example works worse than without the patch. Please don't mind it for now.

CREATE TABLE tbl2 (c1 int, c2 int);
CREATE UNIQUE INDEX idx2 ON tbl2 USING btree (c1) INCLUDING (c2);

explain analyze select * from tbl2 where c1<20 and c2<5;
                              QUERY PLAN
---------------------------------------------------------------------
 Index Only Scan using idx2 on tbl2 (cost=0.28..4.68 rows=10 width=8) (actual time=0.055..0.066 rows=9 loops=1)
   Index Cond: (c1 < 20)
   Filter: (c2 < 5)

The question is more about suitable syntax. We have two different optimizations here:
1. INCLUDED columns
2. Optional opclasses

It's logical to provide optional opclasses only for included columns. Is it ok to handle both with the same syntax and resolve all opclass conflicts while creating the index?

CREATE TABLE tbl2 (c1 int, c2 int, c4 box);
CREATE UNIQUE INDEX idx2 ON tbl2 USING btree (c1) INCLUDING (c2, c4);
CREATE UNIQUE INDEX idx3 ON tbl2 USING btree (c1) INCLUDING (c4, c2);

Of course, the order of attributes is important. Attributes which have an opclass and should use it in a ScanKey must be placed before the others: idx2 will use c2 in the IndexCond, while idx3 will not. But I think that's a job for the DBA.

If you see any related changes in the planner, please mention them. I haven't explored that part of the code yet and could have missed something.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index b4ea227..4973e1b 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -644,6 +644,13 @@
       Does an index of this type manage fine-grained predicate locks?

+      amcanincluding
+      bool
+
+      Does the access method support included columns?
+
       amkeytype
       oid
@@ -3704,6 +3711,14 @@
       (pg_class.relnatts)

+      indnkeyatts
+      int2
+
+      The number of key columns in the index. "Key columns" are ordinary
+      index columns in contrast with "included" columns.
+
       indisunique
       bool
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index 1c09bae..0287c62 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -765,9 +765,11 @@ amrestrpos (IndexScanDesc scan);
    PostgreSQL enforces SQL uniqueness constraints using
    unique indexes, which are indexes that disallow
-   multiple entries with identical keys. An access method that supports this
+   multiple entries with identical keys. An access method that supports this
    feature sets pg_am.amcanunique true.
-   (At present, only b-tree supports it.)
+   Columns included with clause INCLUDING aren't used to enforce uniqueness.
+   An access method that supports this feature sets pg_am.amcanincluding true.
+   (At present, only b-tree supports them.)
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/i
Re: [HACKERS]WIP: Covering + unique indexes.
13.01.2016 04:47, David Rowley:
> On 13 January 2016 at 06:47, Jeff Janes <jeff.ja...@gmail.com
> <mailto:jeff.ja...@gmail.com>> wrote:
>> Why is omit_opclass a separate patch? If the included columns now never
>> participate in the index ordering, shouldn't it be an inherent property
>> of the main patch that you can "cover" things without btree opclasses?
>
> I don't personally think the covering_unique_4.0.patch is that close to
> being too big to review; I think things would make more sense if the
> omit_opclass_4.0.patch was included together with this.

I agree that these patches should be merged. It'll be fixed in the next update. I kept them separate only for historical reasons: it was more convenient for me to debug them. Furthermore, I wanted to show some performance degradation caused by "omit_opclass" and give a way to reproduce it by performing the test with and without the patch.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS]WIP: Covering + unique indexes.
13.01.2016 04:27, David Rowley:
> I've also done some testing:
>
> create table ab (a int, b int);
> insert into ab select a,b from generate_Series(1,10) a(a),
>   generate_series(1,1) b(b);
> set enable_bitmapscan=off;
> set enable_indexscan=off;
>
> select * from ab where a = 1 and b=1;
>  a | b
> ---+---
>  1 | 1
> (1 row)
>
> set enable_indexscan = on;
>
> select * from ab where a = 1 and b=1;
>  a | b
> ---+---
> (0 rows)
>
> This is broken. I've not looked into why yet, but from looking at the
> EXPLAIN output I was a bit surprised to see b=1 as an index condition.
> I'd have expected a Filter maybe, but I've not looked at the EXPLAIN
> code to see how those are determined yet.

Hmm... Do you use both patches? And could you provide the index definition? I can't reproduce the problem, assuming that the index is created by the statement:

CREATE INDEX idx ON ab (a) INCLUDING (b);

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS]WIP: Covering + unique indexes.
04.01.2016 11:49, David Rowley: On 2 December 2015 at 01:53, Anastasia Lubennikova <a.lubennik...@postgrespro.ru <mailto:a.lubennik...@postgrespro.ru>> wrote: Finally, completed patch "covering_unique_3.0.patch" is here. It includes the functionality discussed above in the thread, regression tests and docs update. I think it's quite ready for review. Hi Anastasia, I've maybe mentioned before that I think this is a great feature and I think it will be very useful to have, so I've signed up to review the patch, and below is the results of my first pass from reading the code. Apologies if some of the things seem like nitpicks, I've basically just listed everything I've noticed during, no matter how small. First of all, I would like to thank you for writing such a detailed review. All mentioned style problems, comments and typos are fixed in the patch v4.0. + An access method that supports this feature sets pg_am.amcanincluding true. I don't think this belongs under the "Index Uniqueness Checks" title. I think the "Columns included with clause INCLUDING aren't used to enforce uniqueness." that you've added before it is a good idea, but perhaps the details of amcanincluding are best explained elsewhere. agree +This clause specifies additional columns to be appended to the set of index columns. +Included columns don't support any constraints (UNIQUE, PRMARY KEY, EXCLUSION CONSTRAINT). +These columns can improve the performance of some queries through using advantages of index-only scan +(Or so called covering indexes. Covering index is the index that +covers all columns required in the query and prevents a table access). +Besides that, included attributes are not stored in index inner pages. +It allows to decrease index size and furthermore it provides a way to extend included +columns to store atttributes without suitable opclass (not implemented yet). +This clause could be applied to both unique and nonunique indexes. 
+It's possible to have non-unique covering index, which behaves as a regular +multi-column index with a bit smaller index-size. +Currently, only the B-tree access method supports this feature. "PRMARY KEY" should be "PRIMARY KEY". I ended up rewriting this paragraph as follows. "An optional INCLUDING clause allows a list of columns to be specified which will be included in the index, in the non-key portion of the index. Columns which are part of this clause cannot also exist in the indexed columns portion of the index, and vice versa. The INCLUDING columns exist solely to allow more queries to benefit from index only scans by including certain columns in the index, the value of which would otherwise have to be obtained by reading the table's heap. Having these columns in the INCLUDING clause in some cases allows PostgreSQL to skip the heap read completely. This also allows UNIQUE indexes to be defined on one set of columns, which can include another set of column in the INCLUDING clause, on which the uniqueness is not enforced upon. This can also be useful for non-unique indexes as any columns which are not required for the searching or ordering of records can defined in the INCLUDING clause, which can often reduce the size of the index." Maybe not perfect, but maybe it's an improvement? Yes, this explanation is much better. I've just added couple of notes. -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index 97ef618..d17a06c 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -644,6 +644,13 @@ Does an index of this type manage fine-grained predicate locks? + + amcaninclude + bool + + Does the access method support included columns? + + amkeytype oid @@ -3714,6 +3721,14 @@ pg_class.relnatts) + + indnkeyatts + int2 + + The number of key columns in the index. 
"Key columns" are ordinary + index columns in contrast with "included" columns. + + indisunique bool diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml index 1c09bae..a102391 100644 --- a/doc/src/sgml/indexam.sgml +++ b/doc/src/sgml/indexam.sgml @@ -765,9 +765,10 @@ amrestrpos (IndexScanDesc scan); PostgreSQL enforces SQL uniqueness constraints using unique indexes, which are indexes that disallow - multiple entries with identical keys. An access method that supports this + multiple entries with identical keys. An access method that supports this feature sets pg_am.amcanunique true. - (At present, only b-tree supports it.) + Columns included with clause INCLUDING aren't used to enforce uniqueness. + (At present, only b-tree supports them.)
Re: [HACKERS] WIP: Covering + unique indexes.
for the data type. create index on tbl (a) including (b); CREATE INDEX This functionality is provided by the attached patch "omit_opclass_4.0", which must be applied over covering_unique_4.0.patch. I see what you were confused about; I'd had the same question at the very beginning of the discussion of this patch. Now it seems a bit clearer to me. INCLUDING columns are not used for the searching or ordering of records, so there is no need to check whether they have an opclass. INCLUDING columns perform as expected, and this agrees with the experience of other databases. And this patch is complete. But it definitely isn't perfect... I found a test case to illustrate that. See below. That's why we need the optional_opclass functionality, which will use the opclass where possible and omit it in other cases. This idea has already been described in a message Re: [PROPOSAL] Covering + unique indexes <http://www.postgresql.org/message-id/55f84df4.5030...@postgrespro.ru> as "partially unique index". I suggest separating the optional_opclass task to ease the syntax discussion and the following review. I'll implement it in the next patch a bit later. Test case: 1) patch covering_unique_4.0 + test_covering_unique_4.0 If included columns' opclasses are used, the new query plan is the same as the old one and has nearly the same execution time:

QUERY PLAN
Index Only Scan using oldcoveringidx on oldt (cost=0.43..301.72 rows=1 width=8) (actual time=0.021..0.676 rows=6 loops=1)
  Index Cond: ((c1 < 1) AND (c3 < 20))
  Heap Fetches: 0
Planning time: 0.101 ms
Execution time: 0.697 ms
(5 rows)

QUERY PLAN
Index Only Scan using newidx on newt (cost=0.43..276.51 rows=1 width=8) (actual time=0.020..0.665 rows=6 loops=1)
  Index Cond: ((c1 < 1) AND (c3 < 20))
  Heap Fetches: 0
Planning time: 0.082 ms
Execution time: 0.687 ms
(5 rows)

2) patch covering_unique_4.0 + patch omit_opclass_4.0 + test_covering_unique_4.0 Otherwise, the new query cannot use the included column in the Index Cond and uses a filter instead.
It slows down the query significantly.

QUERY PLAN
Index Only Scan using oldcoveringidx on oldt (cost=0.43..230.39 rows=1 width=8) (actual time=0.021..0.722 rows=6 loops=1)
  Index Cond: ((c1 < 1) AND (c3 < 20))
  Heap Fetches: 0
Planning time: 0.091 ms
Execution time: 0.744 ms
(5 rows)

QUERY PLAN
Index Only Scan using newidx on newt (cost=0.43..374.68 rows=1 width=8) (actual time=0.018..2.595 rows=6 loops=1)
  Index Cond: (c1 < 1)
  Filter: (c3 < 20)
  Rows Removed by Filter: 9993
  Heap Fetches: 0
Planning time: 0.078 ms
Execution time: 2.612 ms

-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index 97ef618..d17a06c 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -644,6 +644,13 @@ Does an index of this type manage fine-grained predicate locks? + + amcaninclude + bool + + Does the access method support included columns? + + amkeytype oid @@ -3714,6 +3721,14 @@ pg_class.relnatts) + + indnkeyatts + int2 + + The number of key columns in the index. "Key columns" are ordinary + index columns in contrast with "included" columns. + + indisunique bool diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml index 1c09bae..a102391 100644 --- a/doc/src/sgml/indexam.sgml +++ b/doc/src/sgml/indexam.sgml @@ -765,9 +765,10 @@ amrestrpos (IndexScanDesc scan); PostgreSQL enforces SQL uniqueness constraints using unique indexes, which are indexes that disallow - multiple entries with identical keys. An access method that supports this + multiple entries with identical keys. An access method that supports this feature sets pg_am.amcanunique true. - (At present, only b-tree supports it.) + Columns included with clause INCLUDING aren't used to enforce uniqueness. + (At present, only b-tree supports them.) diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml index 23bbec6..44dddb6 100644
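The behavior being measured in the test case above — an index-only scan answering a query without touching the heap — can be illustrated with a small Python sketch. This is a toy model, not PostgreSQL code; class and method names are invented for illustration. Key columns define the sort order and are searchable, while INCLUDING columns ride along as payload that the scan can return without a heap fetch.

```python
from bisect import bisect_left

class CoveringIndex:
    """Toy model of a B-tree index with INCLUDING columns.

    Entries are kept sorted by the key columns only; the included
    columns are stored as payload and never affect ordering or
    uniqueness checks.
    """
    def __init__(self, rows, key_cols, include_cols):
        # each entry: (key tuple, included payload tuple, heap TID)
        self.entries = sorted(
            (tuple(r[c] for c in key_cols),
             tuple(r[c] for c in include_cols),
             tid)
            for tid, r in enumerate(rows))

    def index_only_scan(self, key):
        """Return key + included columns without any heap access."""
        keys = [e[0] for e in self.entries]
        i = bisect_left(keys, key)
        results = []
        while i < len(self.entries) and self.entries[i][0] == key:
            k, payload, _tid = self.entries[i]
            results.append(k + payload)   # no heap fetch needed
            i += 1
        return results

heap = [{"a": 1, "b": 10, "c": 100},
        {"a": 2, "b": 20, "c": 200},
        {"a": 2, "b": 21, "c": 201}]
# create index on tbl (a) including (b): 'a' is the key, 'b' is payload
idx = CoveringIndex(heap, key_cols=["a"], include_cols=["b"])
print(idx.index_only_scan((2,)))   # [(2, 20), (2, 21)]
```

A query needing only `a` and `b` is served entirely from the index; a query needing `c` would still have to visit the heap, which is the planner trade-off the EXPLAIN outputs above demonstrate.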
[HACKERS] Some refactoring of index structures.
Hi, hackers. Long story short, I'd like to do some refactoring of the code related to indexes. I work on a patch which provides the INCLUDING columns functionality [1]. The patch was reviewed again and again; I fixed many issues, but we still aren't sure enough that all occurrences of rd_index->indnatts are replaced with rd_index->indnkeyatts accurately. I have no idea how to do it without a thorough reading of the code, so I am doing it now. I have already replaced all these rd_index->indnatts with the macro IndexRelationGetNumberOfAttributes and rd_index->indnkeyatts with the macro IndexRelationGetNumberOfKeyAttributes. But there are still a lot of places with vague naming. For example, indnatts, natts, ncolumns and so on for the very same rd_index->indnatts. I changed them to indnatts. Or indexStruct, index, idxForm, index_form for index->rd_index. They definitely should be unified. I'd prefer indexForm. Any objections? One more point is indkey in FormData_pg_index <http://doxygen.postgresql.org/pg__index_8h.html#a5065be0408f03576083a30c97b43a09a>. The same for indkeys <http://doxygen.postgresql.org/struct__indxInfo.html#a66d6eab0efe4bcb9fcbb472d403fa981>, ii_KeyAttrNumbers <http://doxygen.postgresql.org/structIndexInfo.html#a7723798af613a780d505231bad72c0a3>, indexkeys <http://doxygen.postgresql.org/structIndexOptInfo.html#ab95b0da6ff1e527506239dab1a480b82>. With my patch an index can have key and non-key (included) columns. Therefore indkey is not just the key now. It contains both key and included columns. And its length is not indnkeyatts, as one might think, but indnatts. I suppose it's worth renaming it somehow, but I can't find the proper name. Do you have any ideas? If you see any related changes, please let me know. 1. http://www.postgresql.org/message-id/flat/56168952.4010...@postgrespro.ru#56168952.4010...@postgrespro.ru -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
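A toy Python sketch of why the two attribute counts discussed above must stay distinct (all names here are illustrative analogues of the catalog fields, not actual PostgreSQL structures): indnatts counts every stored column, indnkeyatts only those that participate in search and uniqueness, so any loop that builds scan keys must stop at indnkeyatts.

```python
class IndexForm:
    """Toy analogue of Form_pg_index: indkey holds the key columns
    first, then the included columns; only the first indnkeyatts
    entries are searchable."""
    def __init__(self, indkey, indnkeyatts):
        self.indkey = indkey              # attribute numbers, key + included
        self.indnatts = len(indkey)       # all stored attributes
        self.indnkeyatts = indnkeyatts    # key attributes only

def build_scan_keys(index_form):
    # Iterating up to indnatts here would wrongly treat included
    # columns as search keys -- exactly the bug class the
    # refactoring is hunting for.
    return index_form.indkey[:index_form.indnkeyatts]

# unique index on (a, b) INCLUDING (c): indkey = [a, b, c]
idx = IndexForm(indkey=[1, 2, 3], indnkeyatts=2)
assert idx.indnatts == 3              # length of indkey, as noted above
assert build_scan_keys(idx) == [1, 2]
```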
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
28.01.2016 20:03, Thom Brown: On 28 January 2016 at 16:12, Anastasia Lubennikova <a.lubennik...@postgrespro.ru <mailto:a.lubennik...@postgrespro.ru>> wrote: 28.01.2016 18:12, Thom Brown: On 28 January 2016 at 14:06, Anastasia Lubennikova <a.lubennik...@postgrespro.ru <mailto:a.lubennik...@postgrespro.ru>> wrote: 31.08.2015 10:41, Anastasia Lubennikova: Hi, hackers! I'm going to begin work on effective storage of duplicate keys in B-tree index. The main idea is to implement posting lists and posting trees for B-tree index pages, as it's already done for GIN. In a nutshell, effective storing of duplicates in GIN is organised as follows. The index stores a single index tuple for each unique key. That index tuple points to a posting list which contains pointers to heap tuples (TIDs). If too many rows have the same key, multiple pages are allocated for the TIDs, and these constitute a so-called posting tree. You can find wonderful detailed descriptions in the gin readme <https://github.com/postgres/postgres/blob/master/src/backend/access/gin/README> and articles <http://www.cybertec.at/gin-just-an-index-type/>. It also makes it possible to apply a compression algorithm to the posting list/tree and significantly decrease index size. Read more in presentation (part 1) <http://www.pgcon.org/2014/schedule/attachments/329_PGCon2014-GIN.pdf>. Now a new B-tree index tuple must be inserted for each table row that we index. It can possibly cause a page split. Because of MVCC, even a unique index can contain duplicates. Storing duplicates in a posting list/tree helps to avoid superfluous splits. I'd like to share the progress of my work. So here is a WIP patch. It provides effective duplicate handling using posting lists, the same way GIN does it.
Layout of the tuples on the page is changed in the following way:

before: TID (ip_blkid, ip_posid) + key, TID (ip_blkid, ip_posid) + key, TID (ip_blkid, ip_posid) + key

with patch: TID (N item pointers, posting list offset) + key, TID (ip_blkid, ip_posid), TID (ip_blkid, ip_posid), TID (ip_blkid, ip_posid)

It seems that backward compatibility works well without any changes. But I haven't tested it properly yet. Here are some test results. They are obtained by the test functions test_btbuild and test_ginbuild, which you can find in the attached sql file. i - number of distinct values in the index. So i=1 means that all rows have the same key, and i=1000 means that all keys are different. The other columns contain the index size (MB).

   i     B-tree Old   B-tree New    GIN
   1     214,234375   87,7109375    10,2109375
   10    214,234375   87,7109375    10,71875
   100   214,234375   87,4375       15,640625
   1000  214,234375   86,2578125    31,296875
   1     214,234375   78,421875     104,3046875
   10    214,234375   65,359375     49,078125
   100   214,234375   90,140625     106,8203125
   1000  214,234375   214,234375    534,0625

You can note that the last row contains the same index sizes for B-tree, which is quite logical - there is no compression if all the keys are distinct. The other cases look really nice to me. The next thing to say is that I haven't implemented posting list compression yet. So there is still potential to decrease the size of the compressed btree. I'm almost sure there are still some tiny bugs and missed functions, but on the whole, the patch is ready for testing. I'd like to get feedback about the patch testing on some real datasets. Any bug reports and suggestions are welcome. Here are a couple of useful queries to inspect the data inside the index pages: create extension pageinspect; select * from bt_metap('idx'); select bt.* from generate_series(1,1) as n, lateral bt_page_stats('idx', n) as bt; select n, bt.* from generate_series(1,1) as n, lateral bt_page_items('idx', n) as bt; And at last, the list of items I'm going to complete in the near future: 1.
Add storage_parameter 'enable_compression' for the btree access method, which specifies whether the index handles duplicates. Default is 'off'.
2. Bring back microvacuum functionality for compressed indexes.
3. Improve insertion speed. Insertions became significantly slower with compressed btree, which is obviously not what we want.
4. Clean the code and comments, add related documentation.
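The layout change described in this message ("TID (N item pointers, posting list offset) + key" followed by a run of plain TIDs) can be modeled with a short Python sketch. This is a simplification for illustration, not the patch's C code: duplicates collapse into one key entry plus a posting list of TIDs, which is where the size savings in the table above come from.

```python
def plain_entries(rows):
    """One (key, tid) index entry per heap row, as in unpatched B-tree."""
    return [(key, tid) for tid, key in enumerate(rows)]

def posted_entries(rows):
    """One entry per distinct key, carrying a posting list of TIDs,
    as in the patched layout (and as GIN already does)."""
    posting = {}
    for tid, key in enumerate(rows):
        posting.setdefault(key, []).append(tid)
    return sorted(posting.items())

rows = [1, 1, 1, 2, 2, 3]          # indexed column values, in heap order
plain = plain_entries(rows)
posted = posted_entries(rows)

print(len(plain))    # 6 entries: the key 1 is stored three times
print(len(posted))   # 3 entries: each key stored once
print(posted[0])     # (1, [0, 1, 2])
```

With all keys distinct, both layouts store one entry per row, mirroring the last table row where old and new B-tree sizes coincide.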
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
28.01.2016 18:12, Thom Brown: On 28 January 2016 at 14:06, Anastasia Lubennikova <a.lubennik...@postgrespro.ru <mailto:a.lubennik...@postgrespro.ru>> wrote: 31.08.2015 10:41, Anastasia Lubennikova: Hi, hackers! I'm going to begin work on effective storage of duplicate keys in B-tree index. The main idea is to implement posting lists and posting trees for B-tree index pages, as it's already done for GIN. In a nutshell, effective storing of duplicates in GIN is organised as follows. The index stores a single index tuple for each unique key. That index tuple points to a posting list which contains pointers to heap tuples (TIDs). If too many rows have the same key, multiple pages are allocated for the TIDs, and these constitute a so-called posting tree. You can find wonderful detailed descriptions in the gin readme <https://github.com/postgres/postgres/blob/master/src/backend/access/gin/README> and articles <http://www.cybertec.at/gin-just-an-index-type/>. It also makes it possible to apply a compression algorithm to the posting list/tree and significantly decrease index size. Read more in presentation (part 1) <http://www.pgcon.org/2014/schedule/attachments/329_PGCon2014-GIN.pdf>. Now a new B-tree index tuple must be inserted for each table row that we index. It can possibly cause a page split. Because of MVCC, even a unique index can contain duplicates. Storing duplicates in a posting list/tree helps to avoid superfluous splits. I'd like to share the progress of my work. So here is a WIP patch. It provides effective duplicate handling using posting lists, the same way GIN does it. Layout of the tuples on the page is changed in the following way:

before: TID (ip_blkid, ip_posid) + key, TID (ip_blkid, ip_posid) + key, TID (ip_blkid, ip_posid) + key

with patch: TID (N item pointers, posting list offset) + key, TID (ip_blkid, ip_posid), TID (ip_blkid, ip_posid), TID (ip_blkid, ip_posid)

It seems that backward compatibility works well without any changes. But I haven't tested it properly yet.
Here are some test results. They are obtained by the test functions test_btbuild and test_ginbuild, which you can find in the attached sql file. i - number of distinct values in the index. So i=1 means that all rows have the same key, and i=1000 means that all keys are different. The other columns contain the index size (MB).

   i     B-tree Old   B-tree New    GIN
   1     214,234375   87,7109375    10,2109375
   10    214,234375   87,7109375    10,71875
   100   214,234375   87,4375       15,640625
   1000  214,234375   86,2578125    31,296875
   1     214,234375   78,421875     104,3046875
   10    214,234375   65,359375     49,078125
   100   214,234375   90,140625     106,8203125
   1000  214,234375   214,234375    534,0625

You can note that the last row contains the same index sizes for B-tree, which is quite logical - there is no compression if all the keys are distinct. The other cases look really nice to me. The next thing to say is that I haven't implemented posting list compression yet. So there is still potential to decrease the size of the compressed btree. I'm almost sure there are still some tiny bugs and missed functions, but on the whole, the patch is ready for testing. I'd like to get feedback about the patch testing on some real datasets. Any bug reports and suggestions are welcome. Here are a couple of useful queries to inspect the data inside the index pages: create extension pageinspect; select * from bt_metap('idx'); select bt.* from generate_series(1,1) as n, lateral bt_page_stats('idx', n) as bt; select n, bt.* from generate_series(1,1) as n, lateral bt_page_items('idx', n) as bt; And at last, the list of items I'm going to complete in the near future:
1. Add storage_parameter 'enable_compression' for the btree access method, which specifies whether the index handles duplicates. Default is 'off'.
2. Bring back microvacuum functionality for compressed indexes.
3. Improve insertion speed. Insertions became significantly slower with compressed btree, which is obviously not what we want.
4. Clean the code and comments, add related documentation.
This doesn't apply cleanly against current git head. Have you caught up past commit 65c5fcd35? Thank you for the notice. New patch is attached. -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c index 9673fe0..0c8e4fb 100644 --- a/contrib/pg_stat_statements/pg_stat_statements.c +++ b/contrib/pg_stat_statements/pg_stat
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
29.01.2016 19:01, Thom Brown: On 29 January 2016 at 15:47, Aleksander Alekseev <a.aleks...@postgrespro.ru> wrote: I tested this patch on x64 and ARM servers for a few hours today. The only problem I could find is that INSERT works considerably slower after applying the patch. Besides that, everything looks fine - no crashes, tests pass, memory doesn't seem to leak, etc. Thank you for testing. I rechecked that, and insertions are really very very very slow. It seems like a bug. Okay, now for some badness. I've restored a database containing 2 tables, one 318MB, another 24kB. The 318MB table contains 5 million rows with a sequential id column. I get a problem if I try to delete many rows from it: # delete from contacts where id % 3 != 0 ; WARNING: out of shared memory WARNING: out of shared memory WARNING: out of shared memory I didn't manage to reproduce this. Thom, could you describe the exact steps to reproduce this issue, please? Sure, I used my pg_rep_test tool to create a primary (pg_rep_test -r0), which creates an instance with a custom config, which is as follows:

shared_buffers = 8MB
max_connections = 7
wal_level = 'hot_standby'
cluster_name = 'primary'
max_wal_senders = 3
wal_keep_segments = 6

Then create a pgbench data set (I didn't originally use pgbench, but you can get the same results with it): createdb -p 5530 pgbench pgbench -p 5530 -i -s 100 pgbench And delete some stuff: thom@swift:~/Development/test$ psql -p 5530 pgbench Timing is on. psql (9.6devel) Type "help" for help. ➤ psql://thom@[local]:5530/pgbench # DELETE FROM pgbench_accounts WHERE aid % 3 != 0; WARNING: out of shared memory WARNING: out of shared memory WARNING: out of shared memory WARNING: out of shared memory WARNING: out of shared memory WARNING: out of shared memory WARNING: out of shared memory ... WARNING: out of shared memory WARNING: out of shared memory DELETE 667 Time: 22218.804 ms There were 358 lines of that warning message. I don't get these messages without the patch.
Thom Thank you for this report. I tried to reproduce it, but I couldn't. Debug will be much easier now. I hope I'll fix these issues within the next few days. BTW, I found a silly mistake: the previous patch contained some unrelated changes. I fixed it in the new version (attached). -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c index e3c55eb..3908cc1 100644 --- a/src/backend/access/nbtree/nbtinsert.c +++ b/src/backend/access/nbtree/nbtinsert.c @@ -24,6 +24,7 @@ #include "storage/predicate.h" #include "utils/tqual.h" +#include "catalog/catalog.h" typedef struct { @@ -60,7 +61,8 @@ static void _bt_findinsertloc(Relation rel, ScanKey scankey, IndexTuple newtup, BTStack stack, - Relation heapRel); + Relation heapRel, + bool *updposing); static void _bt_insertonpg(Relation rel, Buffer buf, Buffer cbuf, BTStack stack, IndexTuple itup, @@ -113,6 +115,7 @@ _bt_doinsert(Relation rel, IndexTuple itup, BTStack stack; Buffer buf; OffsetNumber offset; + bool updposting = false; /* we need an insertion scan key to do our search, so build one */ itup_scankey = _bt_mkscankey(rel, itup); @@ -162,8 +165,9 @@ top: { TransactionId xwait; uint32 speculativeToken; + bool fakeupdposting = false; /* Never update posting in unique index */ - offset = _bt_binsrch(rel, buf, natts, itup_scankey, false, &fakeupdposting); xwait = _bt_check_unique(rel, itup, heapRel, buf, offset, itup_scankey, checkUnique, &is_unique, &speculativeToken); @@ -200,8 +204,54 @@ top: CheckForSerializableConflictIn(rel, NULL, buf); /* do the insertion */ _bt_findinsertloc(rel, &buf, &offset, natts, itup_scankey, itup, - stack, heapRel); - _bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, offset, false); + stack, heapRel, &updposting); + + if (IsSystemRelation(rel)) + updposting = false; + + /* + * New tuple has the same key with tuple at the page.
+ * Unite them into one posting. + */ + if (updposting) + { + Page page; + IndexTuple olditup, newitup; + ItemPointerData *ipd; + int nipd; + + page = BufferGetPage(buf); + olditup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offset)); + + if (BtreeTupleIsPosting(olditup)) + nipd = BtreeGetNPosting(olditup); + else + nipd = 1; + + ipd = palloc0(sizeof(ItemPointerData)*(nipd + 1)); + /* copy item pointers from old tuple into ipd */ + if (BtreeTupleIsPosting(olditup)) + memcpy(ipd, BtreeGetPosting(olditup), sizeof(ItemPointerData)*nipd); + else + memcpy(ipd, olditup, sizeof(ItemPointerData)); + + /* add item pointer of the new tuple into ipd */ + memcpy(ipd+nipd, itup, sizeof(ItemPointerData)); + + /* + * Form posting tuple,
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
29.01.2016 20:43, Thom Brown: On 29 January 2016 at 16:50, Anastasia Lubennikova <a.lubennik...@postgrespro.ru> wrote: 29.01.2016 19:01, Thom Brown: On 29 January 2016 at 15:47, Aleksander Alekseev <a.aleks...@postgrespro.ru> wrote: I tested this patch on x64 and ARM servers for a few hours today. The only problem I could find is that INSERT works considerably slower after applying the patch. Besides that, everything looks fine - no crashes, tests pass, memory doesn't seem to leak, etc. Thank you for testing. I rechecked that, and insertions are really very very very slow. It seems like a bug. Okay, now for some badness. I've restored a database containing 2 tables, one 318MB, another 24kB. The 318MB table contains 5 million rows with a sequential id column. I get a problem if I try to delete many rows from it: # delete from contacts where id % 3 != 0 ; WARNING: out of shared memory WARNING: out of shared memory WARNING: out of shared memory I didn't manage to reproduce this. Thom, could you describe the exact steps to reproduce this issue, please? Sure, I used my pg_rep_test tool to create a primary (pg_rep_test -r0), which creates an instance with a custom config, which is as follows:

shared_buffers = 8MB
max_connections = 7
wal_level = 'hot_standby'
cluster_name = 'primary'
max_wal_senders = 3
wal_keep_segments = 6

Then create a pgbench data set (I didn't originally use pgbench, but you can get the same results with it): createdb -p 5530 pgbench pgbench -p 5530 -i -s 100 pgbench And delete some stuff: thom@swift:~/Development/test$ psql -p 5530 pgbench Timing is on. psql (9.6devel) Type "help" for help. ➤ psql://thom@[local]:5530/pgbench # DELETE FROM pgbench_accounts WHERE aid % 3 != 0; WARNING: out of shared memory WARNING: out of shared memory WARNING: out of shared memory WARNING: out of shared memory WARNING: out of shared memory WARNING: out of shared memory WARNING: out of shared memory ...
WARNING: out of shared memory WARNING: out of shared memory DELETE 667 Time: 22218.804 ms There were 358 lines of that warning message. I don't get these messages without the patch. Thom Thank you for this report. I tried to reproduce it, but I couldn't. Debug will be much easier now. I hope I'll fix these issues within the next few days. BTW, I found a silly mistake: the previous patch contained some unrelated changes. I fixed it in the new version (attached). Thanks. Well, I've tested this latest patch, and the warnings are no longer generated. However, the index sizes show that the patch doesn't seem to be doing its job, so I'm wondering if you removed too much from it. Huh, this patch seems to be enchanted) It works fine for me. Did you perform "make distclean"? Anyway, I'll send a new version soon. I'm just writing here to say that I haven't disappeared and I do remember about the issue. I even almost fixed the insert speed problem. But I'm very, very busy this week. I'll send an updated patch next week as soon as possible. Thank you for your attention to this work. -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
31.08.2015 10:41, Anastasia Lubennikova: Hi, hackers! I'm going to begin work on effective storage of duplicate keys in B-tree index. The main idea is to implement posting lists and posting trees for B-tree index pages, as it's already done for GIN. In a nutshell, effective storing of duplicates in GIN is organised as follows. The index stores a single index tuple for each unique key. That index tuple points to a posting list which contains pointers to heap tuples (TIDs). If too many rows have the same key, multiple pages are allocated for the TIDs, and these constitute a so-called posting tree. You can find wonderful detailed descriptions in the gin readme <https://github.com/postgres/postgres/blob/master/src/backend/access/gin/README> and articles <http://www.cybertec.at/gin-just-an-index-type/>. It also makes it possible to apply a compression algorithm to the posting list/tree and significantly decrease index size. Read more in presentation (part 1) <http://www.pgcon.org/2014/schedule/attachments/329_PGCon2014-GIN.pdf>. Now a new B-tree index tuple must be inserted for each table row that we index. It can possibly cause a page split. Because of MVCC, even a unique index can contain duplicates. Storing duplicates in a posting list/tree helps to avoid superfluous splits. I'd like to share the progress of my work. So here is a WIP patch. It provides effective duplicate handling using posting lists, the same way GIN does it. Layout of the tuples on the page is changed in the following way:

before: TID (ip_blkid, ip_posid) + key, TID (ip_blkid, ip_posid) + key, TID (ip_blkid, ip_posid) + key

with patch: TID (N item pointers, posting list offset) + key, TID (ip_blkid, ip_posid), TID (ip_blkid, ip_posid), TID (ip_blkid, ip_posid)

It seems that backward compatibility works well without any changes. But I haven't tested it properly yet. Here are some test results. They are obtained by the test functions test_btbuild and test_ginbuild, which you can find in the attached sql file.
i - number of distinct values in the index. So i=1 means that all rows have the same key, and i=1000 means that all keys are different. The other columns contain the index size (MB).

   i     B-tree Old   B-tree New    GIN
   1     214,234375   87,7109375    10,2109375
   10    214,234375   87,7109375    10,71875
   100   214,234375   87,4375       15,640625
   1000  214,234375   86,2578125    31,296875
   1     214,234375   78,421875     104,3046875
   10    214,234375   65,359375     49,078125
   100   214,234375   90,140625     106,8203125
   1000  214,234375   214,234375    534,0625

You can note that the last row contains the same index sizes for B-tree, which is quite logical - there is no compression if all the keys are distinct. The other cases look really nice to me. The next thing to say is that I haven't implemented posting list compression yet. So there is still potential to decrease the size of the compressed btree. I'm almost sure there are still some tiny bugs and missed functions, but on the whole, the patch is ready for testing. I'd like to get feedback about the patch testing on some real datasets. Any bug reports and suggestions are welcome. Here are a couple of useful queries to inspect the data inside the index pages: create extension pageinspect; select * from bt_metap('idx'); select bt.* from generate_series(1,1) as n, lateral bt_page_stats('idx', n) as bt; select n, bt.* from generate_series(1,1) as n, lateral bt_page_items('idx', n) as bt; And at last, the list of items I'm going to complete in the near future:
1. Add storage_parameter 'enable_compression' for the btree access method, which specifies whether the index handles duplicates. Default is 'off'.
2. Bring back microvacuum functionality for compressed indexes.
3. Improve insertion speed. Insertions became significantly slower with compressed btree, which is obviously not what we want.
4. Clean the code and comments, add related documentation.
-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c index 77c2fdf..3b61e8f 100644 --- a/src/backend/access/nbtree/nbtinsert.c +++ b/src/backend/access/nbtree/nbtinsert.c @@ -24,6 +24,7 @@ #include "storage/predicate.h" #include "utils/tqual.h" +#include "catalog/catalog.h" typedef struct { @@ -60,7 +61,8 @@ static void _bt_findinsertloc(Relation rel, ScanKey scankey, IndexTuple newtup, BTStack stack, - Relation heapRel); + Relation heapRel, + bool *updposing); static void _bt_insertonpg(Relation rel, Buffer buf, Buffer cbuf, BTStack stack, IndexTuple itup, @@ -113,6 +115,7 @@ _bt_doinsert(Relation rel, IndexTuple itup, BTStack stack; Buffer buf; OffsetNumber offset; + bool updposting = false; /
Re: [HACKERS]WIP: Covering + unique indexes.
25.02.2016 21:39, Jeff Janes: As promised, here's the new version of the patch "including_columns_4.0". I fixed all issues except some points mentioned below. Thanks for the updated patch. I get a compiler warning: genam.c: In function 'BuildIndexValueDescription': genam.c:259: warning: unused variable 'tupdesc' Thank you for the notice, I'll fix it in the next update. Also, I can't create a primary key INCLUDING columns directly: jjanes=# create table foobar (a int, b int, c int); jjanes=# alter table foobar add constraint foobar_pkey primary key (a,b) including (c); ERROR: syntax error at or near "including" But I can get there using a circuitous route: jjanes=# create unique index on foobar (a,b) including (c); jjanes=# alter table foobar add constraint foobar_pkey primary key using index foobar_a_b_c_idx; The description of the table's index knows to include the including column:

jjanes=# \d foobar
      Table "public.foobar"
 Column |  Type   | Modifiers
--------+---------+-----------
 a      | integer | not null
 b      | integer | not null
 c      | integer |
Indexes:
    "foobar_pkey" PRIMARY KEY, btree (a, b) INCLUDING (c)

Since the machinery appears to all be in place to have primary keys with INCLUDING columns, it would be nice if the syntax for adding primary keys allowed one to implement them directly. Is this something for future expansion, or could it be added at the same time as the main patch? Good point. At a quick glance, this looks easy to implement. The only problem is that there are too many places in the code which must be updated. I'll try to do it, and if there are difficulties, it's fine with me to delay this feature for future work. I found one more thing to do: pg_dump does not handle included columns yet. I will fix it in the next version of the patch.
-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex
30.12.2015 16:01, Aleksander Alekseev: Here is a clean version of the patch. Step 1 is an optimization. Step 2 refactors this: HTAB * ShmemInitHash(const char *name, /* table string name for shmem index */ - long init_size, /* initial table size */ + long init_size, /* initial table size XXX ignored, refactor! */ Hi, I found that the patch is still not reviewed, so I've looked it over. I haven't dived into this subject before, so my comments will be more general than specific to your investigation. Sorry if some things seem like nitpicks; I just suppose that a cleaner patch would be more convenient for reviewers. First of all, why not merge both patches into one? They aren't too big anyway. Also, this thread has become too tangled. I think it's worth writing a new message with the patch, the test script, some results, and a brief overview of how it really works. It will make the following review much easier. + /* number of entries in hash table */ + long nentries[NUM_LOCK_PARTITIONS]; + /* linked list of free elements */ + HASHELEMENT *freeList[NUM_LOCK_PARTITIONS]; I think these comments should be changed to be more informative. + if (IS_PARTITIONED(hashp->hctl)) + partitions_number = NUM_LOCK_PARTITIONS; + else + partitions_number = 1; + + nelem_alloc = ((int) nelem) / partitions_number; + if (nelem_alloc == 0) + nelem_alloc = 1; Add a comment here too. -/* Number of partitions of the shared buffer mapping hashtable */ -#define NUM_BUFFER_PARTITIONS 128 - /* Number of partitions the shared lock tables are divided into */ -#define LOG2_NUM_LOCK_PARTITIONS 4 +#define LOG2_NUM_LOCK_PARTITIONS 7 #define NUM_LOCK_PARTITIONS (1 << LOG2_NUM_LOCK_PARTITIONS) +/* Number of partitions of the shared buffer mapping hashtable */ +#define NUM_BUFFER_PARTITIONS NUM_LOCK_PARTITIONS + I'm sure it was discussed above, but these definitions do not look obvious at all. Maybe you should explain this magic number 7 in a comment above?
-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
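To make the structure under review concrete, here is a minimal standalone sketch of the idea behind the patch: the single freelist and entry counter are split per lock partition, and a key's partition is chosen from the lower bits of its hash value. This is an illustration only, not actual dynahash code; the names HashHeaderSketch, HashElementSketch, and hash_partition are ours.

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors LOG2_NUM_LOCK_PARTITIONS = 7 from the quoted hunk. */
#define LOG2_NUM_LOCK_PARTITIONS 7
#define NUM_LOCK_PARTITIONS (1 << LOG2_NUM_LOCK_PARTITIONS)

typedef struct HashElementSketch
{
    struct HashElementSketch *next;     /* link in a per-partition free list */
} HashElementSketch;

typedef struct HashHeaderSketch
{
    /* one entry counter per partition, instead of one global counter */
    long nentries[NUM_LOCK_PARTITIONS];
    /* one free list per partition, instead of one global free list */
    HashElementSketch *freeList[NUM_LOCK_PARTITIONS];
} HashHeaderSketch;

/* The partition is determined by the lower bits of the key's hash value. */
static int
hash_partition(unsigned int hashvalue)
{
    return (int) (hashvalue & (NUM_LOCK_PARTITIONS - 1));
}
```

Because each partition has its own counter and free list, the spinlock protecting them can also be per-partition, which is what removes the contention on HASHHDR.mutex.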
Re: [HACKERS] Batch update of indexes
20.01.2016 17:55, Konstantin Knizhnik: Hi, Hi, I'm glad to see that you're interested in that too. I think this is a good feature, and it will be very useful to have. I have already mentioned some related problems and possible improvements in my presentation. http://www.slideshare.net/AnastasiaLubennikova/indexes-dont-mean-slow-inserts The last two slides concern this thread. Briefly, I suggested thinking about an insertion buffer. Something very similar is already implemented in BRIN: it does not index the latest heap data while the number of trailing pages is less than pages_per_block. Do you mean GIN-like usage of an insertion buffer (there it is called the "pending list")? So that we have to combine search in the main tree and in the insert buffer? Actually, this is what I wanted to avoid (because, at least in the case of GIN, the pending list causes significant performance degradation, while an up-to-date state of a full-text index is rarely required). What I meant is more like a BRIN-like combination of an index scan and a heap scan. Maybe it could be called "deferred inserts" or a "temporary read-only index". Maybe it's similar to the MySQL insert buffer: http://dev.mysql.com/doc/refman/5.7/en/innodb-insert-buffering.html I think it'll be clearer with an example. Please don't mind the syntax. CREATE TABLE tbl (c1 int); CREATE INDEX idx on tbl(c1); SET enable_deferred_insert(idx) = on; At this moment, we save the last_indexed_item (its TID) somewhere in the index metapage. From that moment on, the data inserted into the table doesn't touch the index. We perform some heavy insert and then go back to the normal index behavior. SET enable_deferred_insert(idx) = off; This command takes all the data between the last_indexed_item and the end of the table and inserts it into the index at once. Of course there are new problems to deal with, but it's really useful for balancing an irregular heavy write load, isn't it?
BTW, could you explain what the reason is to copy data into the pending list and then copy it again while flushing the pending list into the index? Why not read this data directly from the table? I feel that I've missed something important here. -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
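The "deferred insert" bookkeeping proposed above can be modeled in a few lines. This is a purely illustrative toy, not PostgreSQL code: a real implementation would store last_indexed_item (a TID) in the index metapage, whereas here the "heap" is an int array and the "index" just tracks how many heap rows it covers.

```c
#include <assert.h>

#define HEAP_CAPACITY 100

typedef struct
{
    int heap[HEAP_CAPACITY];
    int heap_len;       /* number of rows in the table */
    int last_indexed;   /* rows [0, last_indexed) are covered by the index */
    int deferred;       /* deferred-insert mode on/off */
} ToyIndex;

static void
toy_insert(ToyIndex *t, int value)
{
    t->heap[t->heap_len++] = value;
    if (!t->deferred)
        t->last_indexed = t->heap_len;  /* normal path: index keeps up */
}

/* Models SET enable_deferred_insert(idx) = off: bulk-index the heap tail. */
static void
toy_catch_up(ToyIndex *t)
{
    /* a real index would batch-insert heap items [last_indexed, heap_len) */
    t->last_indexed = t->heap_len;
    t->deferred = 0;
}
```

The point of the design is visible in the model: while deferred mode is on, inserts only touch the heap, and the single catch-up pass reads the unindexed tail directly from the table, with no intermediate pending-list copy.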
Re: [HACKERS]WIP: Covering + unique indexes.
25.01.2016 03:32, Jeff Janes: On Fri, Jan 22, 2016 at 7:19 AM, Anastasia Lubennikova <a.lubennik...@postgrespro.ru> wrote: Done. I hope that my patch is close to being committed too. Thanks for the update. I've run into this problem: create table foobar (x text, w text); create unique index foobar_pkey on foobar (x) including (w); alter table foobar add constraint foobar_pkey primary key using index foobar_pkey; ERROR: index "foobar_pkey" does not have default sorting behavior LINE 1: alter table foobar add constraint foobar_pkey primary key us... ^ DETAIL: Cannot create a primary key or unique constraint using such an index. Time: 1.577 ms If I instead define the table as create table foobar (x int, w xml); then I can create the index and then the primary key the first time I do this in a session. But then if I drop the table and repeat the process, I get the "does not have default sorting behavior" error even for this index that previously succeeded, so I think there is some kind of problem with the backend syscache or catcache. create table foobar (x int, w xml); create unique index foobar_pkey on foobar (x) including (w); alter table foobar add constraint foobar_pkey primary key using index foobar_pkey; drop table foobar ; create table foobar (x int, w xml); create unique index foobar_pkey on foobar (x) including (w); alter table foobar add constraint foobar_pkey primary key using index foobar_pkey; ERROR: index "foobar_pkey" does not have default sorting behavior LINE 1: alter table foobar add constraint foobar_pkey primary key us... ^ DETAIL: Cannot create a primary key or unique constraint using such an index. Great, I've fixed that. Thank you for the tip about the cache. I've also found and fixed a related bug in copying tables with indexes: create table tbl2 (like tbl including all); And there's one more tiny fix in get_pkey_attnames in the dblink module. including_columns_3.0 is the latest version of the patch.
And changes regarding the previous version are attached in a separate patch. Just to ease the review and debug. I've changed size of pg_index.indclass array. It contains indnkeyatts elements now. While pg_index.indkey still contains all attributes. And this query Retrieve primary key columns <https://wiki.postgresql.org/wiki/Retrieve_primary_key_columns> provides pretty non-obvious result. Is it a normal behavior here or some changes are required? Do you know any similar queries? -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c index 9c8e308..ef6aee5 100644 --- a/contrib/dblink/dblink.c +++ b/contrib/dblink/dblink.c @@ -2058,7 +2058,7 @@ get_pkey_attnames(Relation rel, int16 *numatts) /* we're only interested if it is the primary key */ if (index->indisprimary) { - *numatts = index->indnatts; + *numatts = index->indnkeyatts; if (*numatts > 0) { result = (char **) palloc(*numatts * sizeof(char *)); diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index 412c845..c201f8b 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -3515,6 +3515,14 @@ pg_class.relnatts) + + indnkeyatts + int2 + + The number of key columns in the index. "Key columns" are ordinary + index columns in contrast with "included" columns. + + indisunique bool diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml index 5f7befb..aaed5a7 100644 --- a/doc/src/sgml/indexam.sgml +++ b/doc/src/sgml/indexam.sgml @@ -111,6 +111,8 @@ typedef struct IndexAmRoutine boolamclusterable; /* does AM handle predicate locks? */ boolampredlocks; +/* does AM support columns included with clause INCLUDING? */ +boolamcaninclude; /* type of data stored in index, or InvalidOid if variable */ Oid amkeytype; @@ -852,7 +854,8 @@ amrestrpos (IndexScanDesc scan); using unique indexes, which are indexes that disallow multiple entries with identical keys. 
An access method that supports this feature sets amcanunique true. - (At present, only b-tree supports it.) + (At present, only b-tree supports it.) Columns which are present in the + INCLUDING clause are not used to enforce uniqueness. diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml index 23bbec6..09d4e6b 100644 --- a/doc/src/sgml/indices.sgml +++ b/doc/src/sgml/indices.sgml @@ -633,7 +633,8 @@ CREATE INDEX test3_desc_index ON test3 (id DESC NULLS LAST); Indexes can also be used to enforce uniqueness of a column's value, or the uniqueness of the combined values of more than one column. -CREATE UNIQUE INDEX name ON table (column , ...); +CREATE UNIQUE INDEX name ON ta
Re: [HACKERS] Batch update of indexes
tion plan while it is not needed in this case: the search condition is exactly the same as the partial index condition. The optimal plan should be: Index Scan using idx1 on t (cost=0.00..4.13 rows=12263 width=0) Index Cond: (c1 < '10'::double precision) There was a discussion of a patch for partial indexes: http://postgresql.nabble.com/PATCH-index-only-scans-with-partial-indexes-td5857568.html I haven't followed it closely, but it seems to be still open. I think it'll be interesting to you. -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS]WIP: Covering + unique indexes.
18.01.2016 01:02, David Rowley wrote: On 14 January 2016 at 08:24, David Rowley <david.row...@2ndquadrant.com <mailto:david.row...@2ndquadrant.com>> wrote: I will try to review the omit_opclass_4.0.patch soon. Hi, as promised, here's my review of the omit_opclass_4.0.patch patch. Thank you again. All the mentioned points are fixed and the patches are merged. I hope it's all right now. Please check the comments one more time. I rather doubt that I wrote everything correctly. Also this makes me think that the name ii_KeyAttrNumbers is now out-of-date, as it contains the including columns too by the looks of it. Maybe it just needs to drop the "Key" and become "ii_AttrNumbers". It would be interesting to hear what others think of that. I'm also wondering if indexkeys is still a good name for the IndexOptInfo struct member. Including columns are not really keys, but I feel renaming that might cause a fair bit of code churn, so I'd be interested to hear what others have to say. I agree that KeyAttrNumbers and indexkeys are somewhat confusing names, but I'd like to keep them at least in this patch. It may be worth doing "index structures refactoring" as a separate patch. -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index 97ef618..d17a06c 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -644,6 +644,13 @@ Does an index of this type manage fine-grained predicate locks? + + amcaninclude + bool + + Does the access method support included columns? + + amkeytype oid @@ -3714,6 +3721,14 @@ pg_class.relnatts) + + indnkeyatts + int2 + + The number of key columns in the index. "Key columns" are ordinary + index columns in contrast with "included" columns. 
+ + indisunique bool diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml index 1c09bae..d01af17 100644 --- a/doc/src/sgml/indexam.sgml +++ b/doc/src/sgml/indexam.sgml @@ -767,7 +767,8 @@ amrestrpos (IndexScanDesc scan); using unique indexes, which are indexes that disallow multiple entries with identical keys. An access method that supports this feature sets pg_am.amcanunique true. - (At present, only b-tree supports it.) + (At present, only b-tree supports it.) Columns included with clause + INCLUDING aren't used to enforce uniqueness. diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml index 23bbec6..09d4e6b 100644 --- a/doc/src/sgml/indices.sgml +++ b/doc/src/sgml/indices.sgml @@ -633,7 +633,8 @@ CREATE INDEX test3_desc_index ON test3 (id DESC NULLS LAST); Indexes can also be used to enforce uniqueness of a column's value, or the uniqueness of the combined values of more than one column. -CREATE UNIQUE INDEX name ON table (column , ...); +CREATE UNIQUE INDEX name ON table (column , ...) +INCLUDING (column , ...); Currently, only B-tree indexes can be declared unique. @@ -642,7 +643,9 @@ CREATE UNIQUE INDEX name ON tableINCLUDING aren't used to enforce constraints (UNIQUE, + PRIMARY KEY, etc). diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml index ce36a1b..8360bb6 100644 --- a/doc/src/sgml/ref/create_index.sgml +++ b/doc/src/sgml/ref/create_index.sgml @@ -23,6 +23,7 @@ PostgreSQL documentation CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] name ] ON table_name [ USING method ] ( { column_name | ( expression ) } [ COLLATE collation ] [ opclass ] [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [, ...] ) +[ INCLUDING ( { column_name | ( expression ) } [ COLLATE collation ] [ opclass ] [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [, ...] ) [ WITH ( storage_parameter = value [, ... 
] ) ] [ TABLESPACE tablespace_name ] [ WHERE predicate ] @@ -139,6 +140,32 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] + INCLUDING + + +An optional INCLUDING clause allows a list of columns to be +specified which will be included in the index, in the non-key portion of +the index. Columns which are part of this clause cannot also exist in the +key columns portion of the index, and vice versa. The +INCLUDING columns exist solely to allow more queries to benefit +from index-only scans by including certain columns in the +index, the value of which would otherwise have to be obtained by reading +the table's heap. Having these columns in the INCLUDING clause +in some cases allows PostgreSQL to skip the heap read +completely. This also allows UNIQUE indexes to be defined on +one set of columns, which can include another set of co
Re: [HACKERS]WIP: Covering + unique indexes.
12.01.2016 20:47, Jeff Janes: It looks like the "covering" patch, with or without the "omit_opclass" patch, does not support expressions as included columns: create table foobar (x text, y xml); create index on foobar (x) including (md5(x)); ERROR: unrecognized node type: 904 create index on foobar (x) including ((y::text)); ERROR: unrecognized node type: 911 I think we would probably want it to work with those (or at least to throw a better error message). Thank you for the notice. I couldn't fix it quickly and added a stub in the latest patch. But I'll try to fix it and add expressions support a bit later. -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Patch: fix lock contention for HASHHDR.mutex
22.01.2016 13:48, Aleksander Alekseev: Then, this thread has become too tangled. I think it's worth writing a new message with the patch, the test script, some results, and a brief overview of how it really works. That would make further review much easier. Sure. HASHHDR represents a hash table. It could be regular or partitioned. A partitioned table is stored in shared memory and accessed by multiple processes simultaneously. To prevent data corruption the hash table is partitioned, and each process has to acquire a lock for the corresponding partition before accessing data in it. The number of the partition is determined by the lower bits of the key's hash value. The trickiest part is that dynahash knows nothing about these locks; all locking is done on the calling side. Since shared memory is pre-allocated and can't grow, we use a freeList to allocate memory in a hash table. We also use the nentries field to count the current number of elements in the hash table. Since the hash table is used by multiple processes, these fields are protected by a spinlock, which, as it turned out, could cause lock contention and create a bottleneck. After trying a few approaches I discovered that using one spinlock per partition works quite well. Here are the latest benchmark results: http://www.postgresql.org/message-id/20151229184851.1bb7d1bd@fujitsu Note that the "no locks" solution can't be used because it doesn't guarantee that all available memory will be used in some corner cases. You can find a few more details and a test script in the first message of this thread. If you have any other questions regarding this patch please don't hesitate to ask. InitLocks /* * Compute init/max size to request for lock hashtables. Note these * calculations must agree with LockShmemSize! */ This comment certainly requires some changes. BTW, could you explain why init_table_size was two times less than max_table_size? Although, I don't see any problems with your changes. -hctl->nentries = 0; -hctl->freeList = NULL; Why did you delete these two lines?
I wonder if you should rewrite them instead? + * This particular number of partitions significantly reduces lock contention + * in partitioned hash tables, almost if partitioned tables didn't use any + * locking at all As far as I understood, this number was obtained experimentally? Maybe you should note that in the comment. And last, but not least: +nelem_alloc = ((int) nelem) / partitions_number; +if (nelem_alloc == 0) +nelem_alloc = 1; + +for (i = 0; i < partitions_number; i++) +if (!element_alloc(hashp, nelem_alloc, i)) +ereport(ERROR, +(errcode(ERRCODE_OUT_OF_MEMORY), + errmsg("out of memory"))); It looks like you forgot to round the result of the division up. For example, if you have nelem=25 and partitions_number=6, then 25 / 6 = 4, and you allocate 24 elements while 1 is lost. As for performance, I ran a test on my notebook. The patched version shows a small performance improvement. pgbench -j 1 -c 1 -f pgbench.sql -T 300 postgres: base 127 tps, patched 145 tps. pgbench -j 8 -c 8 -f pgbench.sql -T 300 postgres: base 247 tps, patched 270 tps. But I don't have any big machine to run tests on, where the patch is supposed to make a significant difference, so I would rely on your results and those of the other reviewers. Except for the notes mentioned above, I suppose the patch is good enough to pass it on to a reviewer. -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
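The rounding problem pointed out above can be fixed by rounding the division up instead of down, so that the partitions together always hold at least nelem elements. A minimal sketch of the fix (the function name is ours; in the patch this logic sits inline in the init path):

```c
#include <assert.h>

/*
 * Round-up division: allocate enough elements per partition so that
 * partitions_number * nelem_alloc >= nelem, i.e. no requested elements
 * are lost, unlike plain integer division (25 / 6 = 4 loses one).
 */
static int
nelem_alloc_per_partition(int nelem, int partitions_number)
{
    int nelem_alloc = (nelem + partitions_number - 1) / partitions_number;

    if (nelem_alloc == 0)
        nelem_alloc = 1;        /* always allocate at least one element */
    return nelem_alloc;
}
```

For the example from the review, nelem=25 with 6 partitions now yields 5 elements per partition (30 total) instead of 4 (24 total, one short).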
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
18.02.2016 20:18, Anastasia Lubennikova: 04.02.2016 20:16, Peter Geoghegan: On Fri, Jan 29, 2016 at 8:50 AM, Anastasia Lubennikova <a.lubennik...@postgrespro.ru> wrote: I fixed it in the new version (attached). Thank you for the review. At last, there is a new patch version 3.0. After some refactoring it looks much better. I described all the details of the compression in this document https://goo.gl/50O8Q0 (the same text without pictures is attached in btc_readme_1.0.txt). Consider it a rough copy of the readme. It contains some notes about tricky moments of the implementation and questions about future work. Please don't hesitate to comment on it. Sorry, the previous patch was dirty. A hotfix is attached. -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company Compression. To be correct, it's not actually compression, but just an effective layout of ItemPointers on an index page. compressed tuple = IndexTuple (with metadata in TID field + key) + PostingList 1. A GIN index fits extremely well for really large sets of repeating keys, but on the other hand, it completely fails to handle unique keys. For btree it is essential to have good performance and concurrency in any corner case, with any number of duplicates. That's why we can't just copy the GIN implementation of item pointer compression. The first difference is that the btree algorithm performs compression (or, in other words, changes the index tuple layout) only if there's more than one tuple with a given key. This allows us to avoid the overhead of storing useless metadata for mostly distinct keys (see picture below). It seems that compression could be useful for unique indexes under heavy write/update load (because of MVCC copies), but I'm not sure whether this use case really exists. Those tuples should be deleted by microvacuum as soon as possible. Anyway, I think it's worth adding a storage_parameter for btree which enables/disables compression for each particular index.
And set compression of unique indexes to off by default. System indexes do not support compression for several reasons. First of all, because of the WIP state of the patch (debugging the system catalog isn't a big pleasure). The next reason is that I know many places in the code where hardcoded values or some non-obvious syscache routines are used, and I do not feel brave enough to change that code. And last but not least, I don't see good reasons to do it. 2. If the index key is very small (smaller than the metadata) and the number of duplicates is small, compression could lead to index bloat instead of an index size decrease (see picture below). I'm not sure whether it's worth handling this case separately, because it's really rare, and I consider it the DBA's job to disable compression on such indexes. But if you see any clear way to do this, it would be great. 3. For GIN indexes, if a posting list is too large, a posting tree is created. It is built on the following assumptions: Indexed keys are never deleted, which makes all the tree algorithms much easier. There are always many duplicates; otherwise, GIN becomes really inefficient. There is no high concurrency: in order to add a new entry into a posting tree, we hold a lock on its root, so only one backend at a time can perform insertion. In btree we can't afford these assumptions, so we should handle big posting lists in another way: if there are too many ItemPointers to fit into a single posting list, we just create another one. The overhead of this approach is that we have to store a duplicate of the key and metadata. It leads to the problem of big keys: if the key size is close to BTMaxItemSize, compression gives us a really small benefit, if any at all (see picture below). 4. The more item pointers fit into a single posting list, the more rarely we have to split it and repeat the key. Therefore, the bigger BTMaxItemSize is, the better.
The comment in nbtree.h says: "We actually need to be able to fit three items on every page, so restrict any one item to 1/3 the per-page available space." That is quite right for regular items, but a compressed index tuple already contains more than one item. Taking that into account, we can assert that BTMaxItemSize ~ 1/3 pagesize for regular items, and ~ 1/2 pagesize for compressed items. Are there any objections? I wonder if we can increase BTMaxItemSize under some other assumption? The problem I see here is that a varlena high key could be as big as the compressed tuple. 5. CREATE INDEX. _bt_load. The algorithm of btree build is the following: do the heap scan, add tuples into the spool, sort the data, insert the ordered data from the spool into leaf index pages (_bt_load), build the inner pages and the root. The main changes are applied to the _bt_load function. While loading tuples, we do not insert them one by one, but instead compare each tuple with the previous one, and if they are equal we put them into a posting list. If the posting list is large
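As a back-of-the-envelope check of points 2-4 above, here is a small model of the space trade-off. The sizes are illustrative (an 8-byte IndexTuple header, 6-byte ItemPointers), and alignment padding and line pointers are ignored, so this is a sketch of the reasoning, not of the on-disk format:

```c
#include <assert.h>

/* Illustrative sizes; real tuples are MAXALIGNed and have line pointers. */
#define TUPLE_HEADER_SIZE 8
#define ITEM_POINTER_SIZE 6

/* n regular duplicates: each one repeats the header and the key. */
static int
regular_size(int key_size, int n)
{
    return n * (TUPLE_HEADER_SIZE + key_size);
}

/* One compressed tuple: header + key once, plus a posting list of n TIDs. */
static int
compressed_size(int key_size, int n)
{
    return TUPLE_HEADER_SIZE + key_size + n * ITEM_POINTER_SIZE;
}

static int
savings(int key_size, int n)
{
    return regular_size(key_size, n) - compressed_size(key_size, n);
}
```

Under these assumptions, two duplicates of a 4-byte key merely break even, and any real-world padding tips that into bloat, which matches point 2 (small keys, few duplicates); savings grow with both the key size and the number of duplicates, which matches points 3 and 4.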
Re: [HACKERS]WIP: Covering + unique indexes.
02.02.2016 15:50, Anastasia Lubennikova: 31.01.2016 11:04, David Rowley: On 27 January 2016 at 03:35, Anastasia Lubennikova <a.lubennik...@postgrespro.ru> wrote: including_columns_3.0 is the latest version of the patch. And changes regarding the previous version are attached in a separate patch. Just to ease review and debugging. Hi, I've made another pass over the patch. There's still a couple of things that I think need to be looked at. Thank you again. I'm just writing to say that I haven't disappeared and I do remember the issue. But I'm very very busy this week. I'll send an updated patch next week as soon as possible. As promised, here's the new version of the patch, "including_columns_4.0". I fixed all issues except some points mentioned below. Besides, I did some refactoring: - use the macros IndexRelationGetNumberOfAttributes, IndexRelationGetNumberOfKeyAttributes where possible, and the macro RelationGetNumberOfAttributes. Maybe these changes are a bit unrelated, but they'll make development much easier in the future. - rename related variables to indnatts, indnkeyatts. I'm also not that keen on index_reform_tuple() in general. I wonder if there's a way we can just keep the Datum/isnull arrays a bit longer, and only form the tuple when needed. I've not looked into this in detail, but it does look like reforming the tuple is not going to be cheap. It is used in splits, for example. There is no datum array; we just move the tuple key from a child page to a parent page or something like that. And according to the INCLUDING algorithm we need to truncate nonkey attributes. If we do need to keep this function, I think a better name might be index_trim_tuple() and I don't think you need to pass the original length. It might make sense to Assert() that the trim length is smaller than the tuple size As regards the performance, I don't think that it's a big problem here. Do you suggest doing it in the following way: memcpy(oldtup, newtup, newtuplength)?
I've tested it some more, and still didn't find any performance issues. in gistrescan() there is some code: for (attno = 1; attno <= natts; attno++) { TupleDescInitEntry(so->giststate->fetchTupdesc, attno, NULL, scan->indexRelation->rd_opcintype[attno - 1], -1, 0); } Going by RelationInitIndexAccessInfo() rd_opcintype[] is allocated to be sized by the number of key columns, but this loop goes over the number of attribute columns. Perhaps this is not a big problem since GIST does not support INCLUDING columns, but it does seem wrong still. GiST doesn't support the INCLUDING clause, so natts and nkeyatts are always equal. I don't see any problem here, and I think that it's extra work beyond this patch. Maybe I or someone else will add this feature to other access methods later. Still the same. Which brings me to the fact that I've spent a bit of time trying to look for places where you've forgotten to change natts to nkeyatts. I did find this one, but I don't have much confidence that there's not lots more places that have been forgotten. Apart from this one, how confident are you that you've found all the places? I'm getting towards being happy with the code that I see that's been changed, but I'm hesitant to mark as "Ready for committer" due to not being all that comfortable that all the code that needs to be updated has been updated. I'm not quite sure of a good way to find all these places. I found all mentions of natts and other related variables with grep, and replaced (or expanded) them with nkeyatts where necessary. As mentioned before, I didn't change other AMs. I strongly agree that any changes related to btree require thorough inspection, so I'll recheck it again. But I'm almost sure that it's okay. I rechecked everything again and fixed a couple of omissions. Thank you for being such an exacting reviewer! I don't know how else to ensure that everything is okay; I have no idea what else I can do.
I'm wondering if hacking the code so that each btree index created with > 1 column puts all but the first column into the INCLUDING columns, and then running the regression tests to see if there are any crashes, would help. I'm really not sure how else to increase the confidence levels on this. Do you have ideas? Do I understand correctly that you suggest replacing all multicolumn indexes with (1 key column) + included? I don't think it's a good idea. The INCLUDING clause brings some disadvantages. For example, included columns must be filtered after the search, while key columns can be used in the scan key directly. I already mentioned this in the test example: explain analyze select c1, c2 from tbl where c1<1 and c3<20; If the columns' opclasses are used, the new query plan uses them in Index Cond: ((c1 < 1) AND (c3 < 20)) Otherwise, the new query cannot use the included column in the Index Cond and uses a filter instead: Index Cond: (c1 < 1) Filter: (c3 < 20) Rows Removed by Filter: 999
Re: [HACKERS]WIP: Covering + unique indexes.
31.01.2016 11:04, David Rowley: On 27 January 2016 at 03:35, Anastasia Lubennikova <a.lubennik...@postgrespro.ru> wrote: including_columns_3.0 is the latest version of the patch. And changes regarding the previous version are attached in a separate patch. Just to ease review and debugging. Hi, I've made another pass over the patch. There's still a couple of things that I think need to be looked at. Thank you again. I'm just writing to say that I haven't disappeared and I do remember the issue. But I'm very very busy this week. I'll send an updated patch next week as soon as possible. Do we need the "b (included)" here? The key is (a) = (1). Having irrelevant details might be confusing. postgres=# create table a (a int not null, b int not null); CREATE TABLE postgres=# create unique index on a (a) including(b); CREATE INDEX postgres=# insert into a values(1,1); INSERT 0 1 postgres=# insert into a values(1,1); ERROR: duplicate key value violates unique constraint "a_a_b_idx" DETAIL: Key (a, b (included))=(1, 1) already exists. I thought that it could be strange if a user inserts two values and then sees only one of them in the error message. But now I see that you're right. I'll also look at the same functionality in other DBs and fix it. In index_reform_tuple() I find it a bit scary that you change the TupleDesc's number of attributes then set it back again once you're finished reforming the shortened tuple. Maybe it would be better to modify index_form_tuple() to accept a new argument with a number of attributes, then you can just Assert that this number is never higher than the number of attributes in the TupleDesc. Good point. I agree that this function is a bit strange. I have to set tupdesc->natts to support compatibility with index_form_tuple(). I didn't want to add either a new field to the tupledesc or a new parameter to index_form_tuple(), because they are widely used. I'm also not that keen on index_reform_tuple() in general. 
I wonder if there's a way we can just keep the Datum/isnull arrays a bit longer, and only form the tuple when needed. I've not looked into this in detail, but it does look like reforming the tuple is not going to be cheap. It is used in splits, for example. There is no datum array; we just move the tuple key from a child page to a parent page or something like that. And according to the INCLUDING algorithm we need to truncate nonkey attributes. If we do need to keep this function, I think a better name might be index_trim_tuple() and I don't think you need to pass the original length. It might make sense to Assert() that the trim length is smaller than the tuple size As regards the performance, I don't think that it's a big problem here. Do you suggest doing it in the following way: memcpy(oldtup, newtup, newtuplength)? I will. In gistrescan() there is some code: for (attno = 1; attno <= natts; attno++) { TupleDescInitEntry(so->giststate->fetchTupdesc, attno, NULL, scan->indexRelation->rd_opcintype[attno - 1], -1, 0); } Going by RelationInitIndexAccessInfo() rd_opcintype[] is allocated to be sized by the number of key columns, but this loop goes over the number of attribute columns. Perhaps this is not a big problem since GIST does not support INCLUDING columns, but it does seem wrong still. GiST doesn't support the INCLUDING clause, so natts and nkeyatts are always equal. I don't see any problem here, and I think that it's extra work beyond this patch. Maybe I or someone else will add this feature to other access methods later. Which brings me to the fact that I've spent a bit of time trying to look for places where you've forgotten to change natts to nkeyatts. I did find this one, but I don't have much confidence that there's not lots more places that have been forgotten. Apart from this one, how confident are you that you've found all the places? 
I'm getting towards being happy with the code that I see that's been changed, but I'm hesitant to mark as "Ready for committer" due to not being all that comfortable that all the code that needs to be updated has been updated. I'm not quite sure of a good way to find all these places. I found all mentions of natts and other related variables with grep, and replaced (or expanded) them with nkeyatts where necessary. As mentioned before, I didn't change other AMs. I strongly agree that any changes related to btree require thorough inspection, so I'll recheck it again. But I'm almost sure that it's okay. I'm wondering if hacking the code so that each btree index created with > 1 column puts all but the first column into the INCLUDING columns, and then running the regression tests to see if there are any crashes, would help. I'm really not sure how else to increase the confidence levels on this. Do you have ideas? Do I understand correctly that you suggest replacing all multicolumn indexes with (1 key column) + included? I don't think it's a good idea.
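The index_reform_tuple()/index_trim_tuple() discussion above boils down to one operation: when a tuple moves up into a non-leaf page, keep only the first indnkeyatts attributes and drop the included ones. Here is a toy sketch of that idea, including the Assert() suggested in the review; real code forms an IndexTuple from Datum/isnull arrays, whereas here a "tuple" is just an int array:

```c
#include <assert.h>
#include <string.h>

/*
 * Keep only the key attributes of a "tuple".  indnatts is the total
 * attribute count, indnkeyatts the number of key (non-INCLUDING)
 * attributes; returns how many attributes were kept.
 */
static int
trim_to_key_atts(const int *values, int indnatts, int indnkeyatts, int *out)
{
    /* the trim length must never exceed the tuple's attribute count */
    assert(indnkeyatts <= indnatts);
    memcpy(out, values, indnkeyatts * sizeof(int));
    return indnkeyatts;
}
```

This also illustrates why the original length need not be passed separately: the trim length (indnkeyatts) alone determines the result, and the assertion is enough to catch misuse.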
[HACKERS] Re: [PATCH] Integer overflow in timestamp[tz]_part() and date/time boundaries check
14.03.2016 16:23, David Steele: On 2/25/16 4:44 PM, Vitaly Burovoy wrote: Added to the commitfest 2016-03. [CF] https://commitfest.postgresql.org/9/540/ This looks like a fairly straight-forward bug fix (the size of the patch is deceptive because there are a lot of new tests included). It applies cleanly. Anastasia, I see you have signed up to review. Do you have an idea when you will get the chance to do that? Thanks, I've read the patch thoroughly and haven't found any problems. I think that the patch is in very good shape. It fixes a bug and has an excellent set of tests. There is an issue, mentioned in the thread above: postgres=# select postgres-# to_char(date_trunc('week', '4713-01-01 BC'::date),'day') postgres-# ,to_char(date_trunc('week', '4714-12-29 BC'::date),'day') postgres-# ,to_char(date_trunc('week', '4714-12-28 BC'::date),'day'); to_char | to_char | to_char ---+---+--- monday | monday | thursday (1 row) Starting from 4714-12-28 BC and going back in time, detection of where a week starts is broken (because it is the boundary of isoyears -4713 and -4712). Is it worth breaking the undocumented range, or should we leave it as is? But I suppose that the behavior of undocumented dates is not essential. -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Re: [PATCH] Integer overflow in timestamp[tz]_part() and date/time boundaries check
15.03.2016 03:21, Vitaly Burovoy: On 3/14/16, Anastasia Lubennikova <a.lubennik...@postgrespro.ru> wrote: 14.03.2016 16:23, David Steele: On 2/25/16 4:44 PM, Vitaly Burovoy wrote: Added to the commitfest 2016-03. [CF] https://commitfest.postgresql.org/9/540/ ... I've read the patch thoroughly and haven't found any problems. I think that the patch is in very good shape. It fixes a bug and has an excellent set of tests. There is an issue, mentioned in the thread above ... Is it worth breaking the undocumented range, or should we leave it as is? I suppose that the behavior for undocumented dates is not essential. I'm sorry... What should I do with the "Waiting on Author" state if you don't have complaints? I was going to set "Ready for Committer", but then I noticed the message from Mark Dilger and changed my mind. Now that you have answered him, I have no objections. -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
14.03.2016 16:02, David Steele: Hi Anastasia, On 2/18/16 12:29 PM, Anastasia Lubennikova wrote: 18.02.2016 20:18, Anastasia Lubennikova: 04.02.2016 20:16, Peter Geoghegan: On Fri, Jan 29, 2016 at 8:50 AM, Anastasia Lubennikova <a.lubennik...@postgrespro.ru> wrote: I fixed it in the new version (attached). Thank you for the review. At last, there is a new patch, version 3.0. After some refactoring it looks much better. I described all the details of the compression in this document https://goo.gl/50O8Q0 (the same text without pictures is attached in btc_readme_1.0.txt). Consider it a rough draft of the README. It contains some notes about tricky moments of the implementation and questions about future work. Please don't hesitate to comment on it. Sorry, the previous patch was dirty. A hotfix is attached. This looks like an extremely valuable optimization for btree indexes, but unfortunately it is not getting a lot of attention. It still applies cleanly for anyone interested in reviewing. Thank you for the attention. I would be indebted to any reviewers who can try this patch on real data and workloads (except WAL for now). The B-tree needs a great deal of testing. It's not clear to me that you answered all of Peter's questions in [1]. I understand that you've provided a README, but it may not be clear whether the answers are in there (and where). I described all the points Peter asked about in the README. But I see that it'd be better to answer directly. Thanks for the reminder; I'll do it tomorrow. Also, at the end of the README it says: 13. Xlog. TODO. Does that mean the patch is not yet complete? Yes, you're right. Frankly speaking, I had supposed that someone would help me with that stuff, but now I have almost completed it. I'll send an updated patch in my next message. I'm still doubtful about some patch details. I mentioned them in the README (bold type), but they are mostly about future improvements.
-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
fer, copy it into local memory allocated by palloc(), and then immediately release the buffer lock and drop the pin. This is the same in all instances. There is never more than one buffer lock or pin held at a time. We do, on the other hand, have a detailed rationale for why it's okay that we don't have an ExclusiveLock on the index relation for checks that span the key space of more than one page by following right links to compare items across sibling pages. This isn't the same thing as having an explicit interlock like a pin -- our interlock is one against *recycling* by vacuum, which is based on recentGlobalXmin. This rationale requires expert review. I'm not an expert, but I promise to take a closer look at the locking. I will send another email soon. Performance == My priority here was keeping the tool as simple as possible while still making its verification as useful as possible, not performance. Still, verification completes fairly quickly. Certainly, it takes far less time than a REINDEX of the index, and doesn't need much memory. I think that in practice most problems that can be detected by the B-Tree checker functions will be detected by the lighter variant. I do not see any performance issues. I'm sure that if someone wants to check whether an index is corrupted, they can certainly wait a minute. But I have some concerns about compatibility with my patches. I've tried to call bt_index_check() over my "including" patch [1] and caught a segfault. LOG: server process (PID 31794) was terminated by signal 11: Segmentation fault DETAIL: Failed process was running: select bt_index_check('idx'); I do hope that my patch will be accepted in 9.6, so this conflict looks really bad. I think that the error is caused by changes in the page layout. To save some space, nonkey attributes are truncated when btree copies the indexed value into inner pages or into the high key. You can look at the index_reform_tuple() calls.
I wonder whether you can add some check of the catalog version to your module, or does this case require more changes? With the second patch, which implements compression [2], amcheck causes another error.

postgres=# insert into tbl select 1 from generate_series(0,5);
INSERT 0 6
postgres=# SELECT * FROM bt_page_items('idx', 1);
 itemoffset |      ctid      | itemlen | nulls | vars | data
------------+----------------+---------+-------+------+------
          1 | (2147483664,6) |      56 | f     | f    | 01 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 00 00 02 00 00 00 00 00 03 00 00 00 00 00 04 00 00 00 00 00 05 00 00 00 00 00 06 00 00 00 00 00

postgres=# select bt_index_check('idx');
ERROR: cannot check index "idx"
DETAIL: index is not yet ready for insertions

But I'm sure that this is a problem with my patch, so I'll fix it and try again. [1] https://commitfest.postgresql.org/9/433/ [2] https://commitfest.postgresql.org/9/494/ -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
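The invariants amcheck verifies here can be pictured with a toy model. This is a hedged sketch, not amcheck's real code: pages are simplified Python objects, and the three checks below only mirror the *ideas* of in-page ordering, high-key bounding, and the cross-sibling comparison done by following right-links.

```python
# Toy model of the B-tree invariants a checker like amcheck verifies.
class Page:
    def __init__(self, items, high_key=None, right=None):
        self.items = items        # keys, expected to be sorted
        self.high_key = high_key  # None on the right-most page ("+infinity")
        self.right = right        # right sibling Page, or None

def check_page(page):
    # Invariant 1: items appear in key order within the page.
    for a, b in zip(page.items, page.items[1:]):
        if a > b:
            return False
    # Invariant 2: every item is bound by the page's high key.
    if page.high_key is not None and any(k > page.high_key for k in page.items):
        return False
    # Invariant 3 (cross-page, via the right-link): the first item of the
    # right sibling must not sort below our high key.
    if page.right is not None and page.right.items and page.high_key is not None:
        if page.right.items[0] < page.high_key:
            return False
    return True

right = Page([40, 50])                          # right-most: no high key
left = Page([10, 20, 30], high_key=35, right=right)
assert check_page(left) and check_page(right)

corrupt = Page([10, 30, 20], high_key=35)       # out-of-order items
assert not check_page(corrupt)
```

The real tool, of course, does all of this under buffer locks and with a vacuum-recycling interlock, which is exactly the part of the rationale under discussion above.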
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
Please find the new version of the patch attached. Now it has WAL functionality. You can find a detailed description of the feature in the README draft https://goo.gl/50O8Q0 This patch is pretty complicated, so I ask everyone who is interested in this feature to help with reviewing and testing it. I will be grateful for any feedback. But please don't complain about code style; it is still a work in progress. Next things I'm going to do: 1. More debugging and testing. I'm going to attach a couple of sql scripts for testing in my next message. 2. Fix NULLs processing. 3. Add a flag to pg_index that allows enabling/disabling compression for each particular index. 4. Recheck locking considerations. I tried to write the code to be as non-invasive as possible, but we need to make sure that the algorithm is still correct. 5. Change BTMaxItemSize. 6. Bring back the microvacuum functionality. -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company

diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index e3c55eb..72acc0f 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -24,6 +24,8 @@
 #include "storage/predicate.h"
 #include "utils/tqual.h"
 
+#include "catalog/catalog.h"
+#include "utils/datum.h"
 
 typedef struct
 {
@@ -82,6 +84,7 @@
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup, OffsetNumber itup_off);
 static bool _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum, int keysz, ScanKey scankey);
+
 static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
@@ -113,6 +116,11 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 	BTStack		stack;
 	Buffer		buf;
 	OffsetNumber offset;
+	Page		page;
+	TupleDesc	itupdesc;
+	int			nipd;
+	IndexTuple	olditup;
+	Size		sizetoadd;
 
 	/* we need an insertion scan key to do our search, so build one */
 	itup_scankey = _bt_mkscankey(rel, itup);
@@ -190,6 +198,7 @@ top:
 	if (checkUnique != UNIQUE_CHECK_EXISTING)
 	{
+		bool		updposting = false;
 		/*
 		 * The only conflict predicate locking cares about for indexes is when
 		 * an index tuple insert conflicts with an existing lock. Since the
@@ -201,7 +210,42 @@ top:
 		/* do the insertion */
 		_bt_findinsertloc(rel, &buf, &offset, natts, itup_scankey, itup, stack, heapRel);
-		_bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, offset, false);
+
+		/*
+		 * Decide, whether we can apply compression
+		 */
+		page = BufferGetPage(buf);
+
+		if (!IsSystemRelation(rel)
+			&& !rel->rd_index->indisunique
+			&& offset != InvalidOffsetNumber
+			&& offset <= PageGetMaxOffsetNumber(page))
+		{
+			itupdesc = RelationGetDescr(rel);
+			sizetoadd = sizeof(ItemPointerData);
+			olditup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offset));
+
+			if (_bt_isbinaryequal(itupdesc, olditup,
+								  rel->rd_index->indnatts, itup))
+			{
+				if (!BtreeTupleIsPosting(olditup))
+				{
+					nipd = 1;
+					sizetoadd = sizetoadd * 2;
+				}
+				else
+					nipd = BtreeGetNPosting(olditup);
+
+				if ((IndexTupleSize(olditup) + sizetoadd) <= BTMaxItemSize(page)
+					&& PageGetFreeSpace(page) > sizetoadd)
+					updposting = true;
+			}
+		}
+
+		if (updposting)
+			_bt_pgupdtup(rel, buf, offset, itup, olditup, nipd);
+		else
+			_bt_insertonpg(rel, buf, InvalidBuffer, stack, itup, offset, false);
 	}
 	else
 	{
@@ -1042,6 +1086,7 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 	itemid = PageGetItemId(origpage, P_HIKEY);
 	itemsz = ItemIdGetLength(itemid);
 	item = (IndexTuple) PageGetItem(origpage, itemid);
+
 	if (PageAddItem(rightpage, (Item) item, itemsz, rightoff,
 					false, false) == InvalidOffsetNumber)
 	{
@@ -1072,13 +1117,39 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
 	}
-	if (PageAddItem(leftpage, (Item) item, itemsz, leftoff,
+
+	if (BtreeTupleIsPosting(item))
+	{
+		Size		hikeysize = BtreeGetPostingOffset(item);
+		IndexTuple	hikey = palloc0(hikeysize);
+
+		/* Truncate posting before insert it as a hikey. */
+		memcpy(hikey, item, hikeysize);
+		hikey->t_info &= ~INDEX_SIZE_MASK;
+		hikey->t_info |= hikeysize;
+		ItemPointerSet(&(hikey->t_tid), origpagenumber, P_HIKEY);
+
+		if (PageAddItem(leftpage, (Item) hikey, hikeysize, leftoff,
+						false, false) == InvalidOffsetNumber)
+		{
+			memset(rightpage, 0, BufferGetPageSize(rbuf));
+			elog(ERROR, "failed to add hikey to the left sibling"
+				 " while splitting block %u of index \"%s\"",
+				 origpagenumber, RelationGetRelationName(rel));
+		}
+
+		pfree(hikey);
+	}
+	else
 	{
-		memset(rightpage, 0, BufferGetPageSize(rbuf));
-		elog(ERROR, "failed to add hikey to the left siblin
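The duplicate-compression idea in the hunks above — when the new key is binary-equal to an existing tuple's key, append the new heap TID to that tuple's posting list, provided the enlarged tuple still fits — can be sketched with a toy model. This is a hedged illustration: the sizes, class, and constant names below are simplified stand-ins, not the patch's actual on-page layout.

```python
# Simplified model of B-tree duplicate compression: equal keys share one
# index tuple whose body carries a list of heap TIDs (a "posting list"),
# instead of one full index tuple per duplicate.
TID_SIZE = 6          # bytes per heap ItemPointerData
TUPLE_OVERHEAD = 8    # per-IndexTuple header, simplified
KEY_SIZE = 8
MAX_ITEM_SIZE = 2704  # illustrative stand-in for BTMaxItemSize

class PostingTuple:
    def __init__(self, key, tid):
        self.key = key
        self.tids = [tid]

    def size(self):
        return TUPLE_OVERHEAD + KEY_SIZE + TID_SIZE * len(self.tids)

    def try_add(self, tid):
        """Mirror the patch's check: only compress while the tuple still fits."""
        if self.size() + TID_SIZE > MAX_ITEM_SIZE:
            return False          # caller would fall back to a plain insert
        self.tids.append(tid)
        return True

def uncompressed_size(ndups):
    # One full tuple per duplicate, as without the patch.
    return ndups * (TUPLE_OVERHEAD + KEY_SIZE + TID_SIZE)

t = PostingTuple(key=1, tid=(0, 1))
for i in range(2, 101):
    assert t.try_add((0, i))

# 100 duplicates: one posting tuple vs. 100 full tuples.
assert t.size() == TUPLE_OVERHEAD + KEY_SIZE + 100 * TID_SIZE   # 616 bytes
assert t.size() < uncompressed_size(100)                        # vs. 2200 bytes
```

The real patch has the further complication shown in the `_bt_split()` hunk: when a posting tuple becomes a high key, the posting list must be truncated away, since the high key only needs the key part.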
Re: [HACKERS] [PATCH] Supporting +-Infinity values by to_timestamp(float8)
15.03.2016 22:28, David Steele: On 3/4/16 2:56 PM, Vitaly Burovoy wrote: On 3/4/16, Anastasia Lubennikova <a.lubennik...@postgrespro.ru> wrote: I think that you should update the documentation. At least the description of epoch on this page: http://www.postgresql.org/docs/devel/static/functions-datetime.html Thank you very much for pointing out where it is located (I saw only "to_timestamp(TEXT, TEXT)"). I'll think about how to update it. Vitaly, have you decided how to update this yet? 3. (nitpicking) I'm not sure about the "4STAMPS" suffix. "4" is a nice abbreviation, but it seems slightly confusing to me. It doesn't matter to me what it is called; it is short enough and reflects the type it is applied to. What would the best name be for it? Anastasia, any suggestions for a better name, or should we just leave it as is? I'm not in favor of the "4", either. I think I would prefer JULIAN_MAXYEAR_STAMP. This point is related to another patch, https://commitfest.postgresql.org/9/540/, and was added to this patch just for compatibility. Unless Tom changes the name of the macro there, I don't see any reason why we should do it in this patch. -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
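The behavior this patch is after — to_timestamp(float8) mapping ±Infinity to the special "infinity"/"-infinity" timestamps instead of overflowing — can be sketched as below. All names and the epoch bounds here are invented for illustration; they are not PostgreSQL's actual constants or code.

```python
# Hedged sketch of to_timestamp(float8) handling of special float values.
import math

TIMESTAMP_INFINITY = "infinity"
TIMESTAMP_MINUS_INFINITY = "-infinity"
# Illustrative epoch-seconds bounds only; the real limits come from the
# supported timestamp range (Julian-date based), not these numbers.
MIN_EPOCH, MAX_EPOCH = -210866803200.0, 9224318016000.0

def to_timestamp(seconds):
    if math.isnan(seconds):
        raise ValueError("timestamp cannot be NaN")
    if math.isinf(seconds):
        # Map the IEEE infinities to the timestamp sentinels rather than
        # letting the arithmetic below overflow.
        return TIMESTAMP_INFINITY if seconds > 0 else TIMESTAMP_MINUS_INFINITY
    if not (MIN_EPOCH <= seconds <= MAX_EPOCH):
        raise OverflowError("timestamp out of range")
    return seconds  # a real implementation converts to a timestamp here

assert to_timestamp(float("inf")) == "infinity"
assert to_timestamp(float("-inf")) == "-infinity"
```

The essential point is that the special cases are tested *before* any range arithmetic, so neither infinity nor NaN ever reaches the overflow-prone conversion.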
Re: [HACKERS] WIP: Covering + unique indexes.
02.03.2016 08:50, Michael Paquier: On Wed, Mar 2, 2016 at 2:10 AM, Anastasia Lubennikova <a.lubennik...@postgrespro.ru> wrote: 01.03.2016 19:55, Anastasia Lubennikova: It is not the final version, because it breaks pg_dump for previous versions. I need some help from hackers here. pg_dump, line 5466: if (fout->remoteVersion >= 90400) What does 'remoteVersion' mean? And what is the right way to change it? Or does it change between releases? I guess that 90400 is for 9.4 and 80200 is for 8.2, but is it really so? That is totally new to me. Yes, you got it. That's basically PG_VERSION_NUM as compiled on the server that has been queried, in this case the server from which a dump is taken. If you are changing the system catalog layer, you would need to provide a query at least equivalent to what has been done until now for your patch, then modify pg_dump as follows:

if (fout->remoteVersion >= 90600)
{
    query = my_new_query;
}
else if (fout->remoteVersion >= 90400)
{
    query = the existing 9.4 query
}
etc.

In short, you just need to add a new block so that remote servers of version 9.6 and newer will be able to dump objects correctly. pg_upgrade is a good way to check the validity of pg_dump, actually; this explains why some objects are not dropped in the regression tests. Perhaps you'd want to do the same with your patch if the current test coverage of pg_dump is not enough. I have not looked at your patch, so I cannot say for sure. Thank you for the explanation. The new version of the patch handles pg_dump well. The documentation related to constraints is updated. I hope the patch is in good shape now. Brief overview for reviewers: This patch allows unique indexes to be defined on one set of columns and include another set of columns in the INCLUDING clause, upon which the uniqueness is not enforced. It allows more queries to benefit from using index-only scans. Currently, only the B-tree access method supports this feature.
Syntax example:

CREATE TABLE tbl (c1 int, c2 int, c3 box);
CREATE INDEX idx ON tbl (c1) INCLUDING (c2, c3);

In contrast to key columns (c1), included columns (c2, c3) are not used in index scan keys, neither in "search" scan keys nor in "insertion" scan keys. Included columns are stored only in leaf pages, which can help to slightly reduce index size. Hence, included columns do not require any opclass for the btree access method. As you can see from the example above, it's possible to add columns of the "box" type to the index. The most common use case for this feature is the combination of a UNIQUE or PRIMARY KEY constraint on columns (a,b) and a covering index on columns (a,b,c). So there is a new syntax for constraints.

CREATE TABLE tblu (c1 int, c2 int, c3 box, UNIQUE (c1,c2) INCLUDING (c3));

The index created for this constraint contains three columns: "tblu_c1_c2_c3_key" UNIQUE CONSTRAINT, btree (c1, c2) INCLUDING (c3)

CREATE TABLE tblpk (c1 int, c2 int, c3 box, PRIMARY KEY (c1) INCLUDING (c3));

The index created for this constraint contains two columns. Note that the NOT NULL constraint, like the unique constraint, is applied only to the key column(s).

postgres=# \d tblpk
     Table "public.tblpk"
 Column |  Type   | Modifiers
--------+---------+-----------
 c1     | integer | not null
 c2     | integer |
 c3     | box     |
Indexes:
    "tblpk_pkey" PRIMARY KEY, btree (c1) INCLUDING (c3)

Same for ALTER TABLE statements:

CREATE TABLE tblpka (c1 int, c2 int, c3 box);
ALTER TABLE tblpka ADD PRIMARY KEY (c1) INCLUDING (c3);

pg_dump is updated and seems to work fine with this kind of index. I see only one problem left (maybe I've mentioned it before). Queries like this [1] must be rewritten, because after the catalog changes, i.indkey contains both key and included attrs. One more thing to do is some refactoring of names, since "indkey" looks really confusing to me. But that could be done as a separate patch [2].
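The core semantic described above — uniqueness enforced on key columns only, included columns merely stored alongside for index-only scans — can be modeled in a few lines. This is an illustrative sketch, not the patch's code: the class and its dictionary storage are invented stand-ins for the btree.

```python
# Toy model of a UNIQUE (c1, c2) INCLUDING (c3) index: the key is the first
# nkeyatts attributes (like indnkeyatts vs. indnatts); the rest ride along.
class CoveringUniqueIndex:
    def __init__(self, nkeyatts):
        self.nkeyatts = nkeyatts
        self.entries = {}

    def insert(self, row):
        key = row[:self.nkeyatts]     # only key columns are compared
        if key in self.entries:
            raise ValueError("duplicate key value violates unique constraint")
        self.entries[key] = row       # included columns stored, never compared

    def lookup(self, key):
        """Index-only scan: the whole stored row comes back from the index."""
        return self.entries.get(key)

idx = CoveringUniqueIndex(nkeyatts=2)     # UNIQUE (c1, c2) INCLUDING (c3)
idx.insert((1, 1, "box-a"))
idx.insert((1, 2, "box-b"))               # different key: OK
try:
    idx.insert((1, 1, "box-c"))           # same key, different included value
    assert False
except ValueError:
    pass                                  # uniqueness ignores the included column
assert idx.lookup((1, 2)) == (1, 2, "box-b")
```

This also shows why included columns need no opclass: they are never passed to a comparison, only copied in and out of leaf entries.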
[1] https://wiki.postgresql.org/wiki/Retrieve_primary_key_columns [2] http://www.postgresql.org/message-id/56bb7788.30...@postgrespro.ru -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company

diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c
index 9c8e308..891325d 100644
--- a/contrib/dblink/dblink.c
+++ b/contrib/dblink/dblink.c
@@ -100,7 +100,7 @@ static remoteConn *getConnectionByName(const char *name);
 static HTAB *createConnHash(void);
 static void createNewConnection(const char *name, remoteConn *rconn);
 static void deleteConnection(const char *name);
-static char **get_pkey_attnames(Relation rel, int16 *numatts);
+static char **get_pkey_attnames(Relation rel, int16 *indnkeyatts);
 static char **get_text_array_contents(ArrayType *array, int *numitems);
 static char *get_sql_insert(Relation rel, int *pkattnums, int pknumatts, char **src_pkattvals, char **tgt_pkattvals);
 static char *get_sql_delete(Relation rel, int *pkattnums, int pknuma
Re: [HACKERS] WIP: Covering + unique indexes.
05.04.2016 01:48, Peter Geoghegan: On Mon, Mar 21, 2016 at 9:53 AM, Anastasia Lubennikova <a.lubennik...@postgrespro.ru> wrote: It's a bit more complicated to add it into the index creation algorithm. There's a trick with a "high key".

/*
 * We copy the last item on the page into the new page, and then
 * rearrange the old page so that the 'last item' becomes its high key
 * rather than a true data item. There had better be at least two
 * items on the page already, else the page would be empty of useful
 * data.
 */
/*
 * Move 'last' into the high key position on opage
 */

To be consistent with the other steps of the algorithm (all high keys must be truncated tuples), I had to update this high key in place: delete the old one, and insert a truncated high key. Hmm. But the high key comparing equal to the scankey gives insertion the choice of where to put its IndexTuple (it can go on the page with the high key, or its right sibling, according only to considerations about fillfactor, etc.). Is this changed? Does it not matter? Why not? Is it just worth it? I would say this is changed, but it doesn't matter. When performing any search in the btree (including choosing a suitable page for insertion), we use only key attributes. We assume that included columns are stored in the index unordered. A simple example:

create table tbl(id int, data int);
create index idx on tbl (id) including (data);

A SELECT query does not consider included columns in the scan key. It selects all tuples satisfying the condition on the key column, and only after that does it apply a filter to remove non-matching rows from the result. If the key attribute doesn't satisfy the query condition, there are no more tuples to return and we can stop the scan. You can find more explanation in the attached sql script, which contains queries to retrieve detailed information about the index structure using pageinspect. The right-most page on every level has no high key.
But you say those pages have an "imaginary" *positive* infinity high key, just as internal pages have (non-imaginary) minus infinity downlinks as their first item/downlink. So tuples in a (say) leaf page are always bound by the downlink lower bound in the parent, while their own high key is an upper bound. Either (and, rarely, both) could be (positive or negative) infinity. Maybe you now see why I talked about special _bt_compare() logic for this. I proposed special logic that is similar to the existing minus infinity thing _bt_compare() does (although _bt_binsrch(), an important caller of _bt_compare(), also does special things for the internal vs. leaf case, so I'm not sure any new special logic must go in _bt_compare()). In my view, it's the correct way to fix this problem, because the caller is responsible for passing proper arguments to the function. Of course I will add a check into bt_compare, but I'd rather make it an assertion (see the patch attached). I see what you mean, but I think we need to decide what to do about the key space when leaf high keys are truncated. I do think that truncating the high key was the right idea, though, and it nicely illustrates that nothing special should happen in upper levels. Suffix truncation should only happen when leaf pages are split, generally speaking. As I said, the high key is very similar to the downlinks, in that both bound the things that go on each page. Together with downlinks they represent *discrete* ranges *unambiguously*, so INCLUDING truncation needs to make it clear which page new items go on. As I said, _bt_binsrch() already takes special actions for internal pages, making sure to return the item that is < the scankey, not <= the scankey, which is only allowed for leaf pages. (See README, from "Lehman and Yao assume that the key range for a subtree S is described by Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent page...").
To give a specific example, I worry about the case where two sibling downlinks in a parent page are distinct, but per the specific-to-Postgres "Ki <= v <= Ki+1" thing (which differs from the classic L&Y invariant), some tuples with all of the right downlink's attributes matching end up in the left child page, not the right child page. I worry that since _bt_findsplitloc() doesn't consider this (for example), the split point doesn't *reliably* and unambiguously divide the key space between the new halves of a page being split. I think the "Ki <= v <= Ki+1"/_bt_binsrch() thing might save you in common cases where all downlink attributes are distinct, so maybe that simpler case is okay. But to be even more specific, what about the more complicated case where the downlinks *are* fully _bt_compare()-wise equal? This could happen even though they're constrained to be unique in leaf pages, due to bloat. Unique indexes aren't special here; they just make it far less likely that this w
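The truncated-high-key mechanics being debated above can be sketched with simplified tuples. This is a hedged model, not the patch's layout: tuples are plain Python tuples, and `truncate_tuple` stands in for the patch's index_truncate_tuple().

```python
# Sketch of suffix truncation at a leaf split: the left page's new high key
# keeps only the key attributes, because descent and search compare key
# attributes only.
NKEYATTS = 1  # index on (id) INCLUDING (data)

def truncate_tuple(itup, nkeyatts=NKEYATTS):
    """Stand-in for index_truncate_tuple(): keep only the key attributes."""
    truncated = itup[:nkeyatts]
    assert len(truncated) <= len(itup)   # truncation never grows a tuple
    return truncated

def split_leaf(items):
    """Split a sorted leaf in half; the left page's high key is the
    truncated last item remaining on it."""
    mid = len(items) // 2
    left, right = items[:mid], items[mid:]
    high_key = truncate_tuple(left[-1])
    return left, high_key, right

items = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
left, high_key, right = split_leaf(items)
assert high_key == (2,)                  # included column "b" dropped

# Search still works: comparisons use key attributes only, so the truncated
# high key still bounds the left page and the right page from below.
assert all(item[:NKEYATTS] <= high_key for item in left)
assert all(item[:NKEYATTS] >= high_key for item in right)
```

Peter's worry lives exactly in the `>=`/`<=` boundary of the last two assertions: when key attributes of items on both sides compare equal to the high key, the truncated key alone no longer says unambiguously which page a new equal-keyed item belongs on.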
Re: [HACKERS] WIP: Covering + unique indexes.
06.04.2016 23:52, Peter Geoghegan: On Wed, Apr 6, 2016 at 1:50 PM, Peter Geoghegan <p...@heroku.com> wrote: Personally, I like documenting assertions, and will sometimes write assertions that the compiler could easily optimize away. Maybe going *that* far is more a matter of personal style, but I think an assertion about the new index tuple size being <= the old one is just a good idea. It's not about a problem in your code at all. You should make index_truncate_tuple()/index_reform_tuple() promise to always do this in its comments/contract with the caller as part of this, IMV. The mentioned issues are fixed; the patch is attached. I'd like to remind you that the commitfest will be closed very soon, so I'd like to get your final resolution on the patch. Not having it in the 9.6 release would be very disappointing. I agree that the B-tree is a crucial subsystem. But it seems to me that we lack improvements in this area not only because of the algorithm's complexity but also because of a lack of enthusiasts willing to work on it and struggle through endless discussions. But that's off-topic here. Attention to these development difficulties will be one of the messages of my pgcon talk. You know, we lost a lot of time discussing various B-tree problems. Besides that, I am sure that the patch is really in good shape. It has no open problems left to fix, and possible subtle bugs can be found at the testing stage of the release. Looking forward to your reply.
-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company

diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c
index 9c8e308..891325d 100644
--- a/contrib/dblink/dblink.c
+++ b/contrib/dblink/dblink.c
@@ -100,7 +100,7 @@ static remoteConn *getConnectionByName(const char *name);
 static HTAB *createConnHash(void);
 static void createNewConnection(const char *name, remoteConn *rconn);
 static void deleteConnection(const char *name);
-static char **get_pkey_attnames(Relation rel, int16 *numatts);
+static char **get_pkey_attnames(Relation rel, int16 *indnkeyatts);
 static char **get_text_array_contents(ArrayType *array, int *numitems);
 static char *get_sql_insert(Relation rel, int *pkattnums, int pknumatts, char **src_pkattvals, char **tgt_pkattvals);
 static char *get_sql_delete(Relation rel, int *pkattnums, int pknumatts, char **tgt_pkattvals);
@@ -1485,7 +1485,7 @@ PG_FUNCTION_INFO_V1(dblink_get_pkey);
 Datum
 dblink_get_pkey(PG_FUNCTION_ARGS)
 {
-	int16		numatts;
+	int16		indnkeyatts;
 	char	  **results;
 	FuncCallContext *funcctx;
 	int32		call_cntr;
@@ -1511,7 +1511,7 @@ dblink_get_pkey(PG_FUNCTION_ARGS)
 	rel = get_rel_from_relname(PG_GETARG_TEXT_P(0), AccessShareLock, ACL_SELECT);
 
 	/* get the array of attnums */
-	results = get_pkey_attnames(rel, &numatts);
+	results = get_pkey_attnames(rel, &indnkeyatts);
 
 	relation_close(rel, AccessShareLock);
@@ -1531,9 +1531,9 @@ dblink_get_pkey(PG_FUNCTION_ARGS)
 	attinmeta = TupleDescGetAttInMetadata(tupdesc);
 	funcctx->attinmeta = attinmeta;
 
-	if ((results != NULL) && (numatts > 0))
+	if ((results != NULL) && (indnkeyatts > 0))
 	{
-		funcctx->max_calls = numatts;
+		funcctx->max_calls = indnkeyatts;
 
 		/* got results, keep track of them */
 		funcctx->user_fctx = results;
@@ -2023,10 +2023,10 @@ dblink_fdw_validator(PG_FUNCTION_ARGS)
  * get_pkey_attnames
  *
  * Get the primary key attnames for the given relation.
- * Return NULL, and set numatts = 0, if no primary key exists.
+ * Return NULL, and set indnkeyatts = 0, if no primary key exists.
  */
 static char **
-get_pkey_attnames(Relation rel, int16 *numatts)
+get_pkey_attnames(Relation rel, int16 *indnkeyatts)
 {
 	Relation	indexRelation;
 	ScanKeyData skey;
@@ -2036,8 +2036,8 @@ get_pkey_attnames(Relation rel, int16 *numatts)
 	char	  **result = NULL;
 	TupleDesc	tupdesc;
 
-	/* initialize numatts to 0 in case no primary key exists */
-	*numatts = 0;
+	/* initialize indnkeyatts to 0 in case no primary key exists */
+	*indnkeyatts = 0;
 
 	tupdesc = rel->rd_att;
@@ -2058,12 +2058,12 @@ get_pkey_attnames(Relation rel, int16 *numatts)
 	/* we're only interested if it is the primary key */
 	if (index->indisprimary)
 	{
-		*numatts = index->indnatts;
-		if (*numatts > 0)
+		*indnkeyatts = index->indnkeyatts;
+		if (*indnkeyatts > 0)
 		{
-			result = (char **) palloc(*numatts * sizeof(char *));
+			result = (char **) palloc(*indnkeyatts * sizeof(char *));
 
-			for (i = 0; i < *numatts; i++)
+			for (i = 0; i < *indnkeyatts; i++)
 				result[i] = SPI_fname(tupdesc, index->indkey.values[i]);
 		}
 		break;
diff --git a/contrib/tcn/tcn.c b/contrib/tcn/tcn.c
index 7352b29..142730a 100644
--- a/contrib/tcn/tcn.c
+++ b/contrib/tcn/tcn.c
@@ -138,9 +138,9 @@ triggered_change_notification(PG_FUNCTION_ARGS)
 	/* we're only interested if it is the primary key and valid */
 	if (index->indisprimary
Re: [HACKERS] WIP: Covering + unique indexes.
08.04.2016 15:06, Teodor Sigaev: On Wed, Apr 6, 2016 at 1:50 PM, Peter Geoghegan <p...@heroku.com> wrote: Personally, I like documenting assertions, and will sometimes write assertions that the compiler could easily optimize away. Maybe going *that* far is more a matter of personal style, but I think an assertion about the new index tuple size being <= the old one is just a good idea. It's not about a problem in your code at all. You should make index_truncate_tuple()/index_reform_tuple() promise to always do this in its comments/contract with the caller as part of this, IMV. Some notes: - index_truncate_tuple(Relation idxrel, IndexTuple olditup, int indnatts, int indnkeyatts): why do we need indnatts/indnkeyatts? They are already present in the idxrel struct. - Following the code where index_truncate_tuple() is called, it should never be called in a case where indnatts == indnkeyatts. So indnkeyatts should be strictly less than indnatts; please change the assertion. If they are equal, this function becomes a complicated variant of CopyIndexTuple(). Good point. These attributes seem to have been there since previous versions of the function, but now they are definitely unnecessary.
An updated patch is attached. -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company

diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c
index 9c8e308..891325d 100644
--- a/contrib/dblink/dblink.c
+++ b/contrib/dblink/dblink.c
@@ -100,7 +100,7 @@ static remoteConn *getConnectionByName(const char *name);
 static HTAB *createConnHash(void);
 static void createNewConnection(const char *name, remoteConn *rconn);
 static void deleteConnection(const char *name);
-static char **get_pkey_attnames(Relation rel, int16 *numatts);
+static char **get_pkey_attnames(Relation rel, int16 *indnkeyatts);
 static char **get_text_array_contents(ArrayType *array, int *numitems);
 static char *get_sql_insert(Relation rel, int *pkattnums, int pknumatts, char **src_pkattvals, char **tgt_pkattvals);
 static char *get_sql_delete(Relation rel, int *pkattnums, int pknumatts, char **tgt_pkattvals);
@@ -1485,7 +1485,7 @@ PG_FUNCTION_INFO_V1(dblink_get_pkey);
 Datum
 dblink_get_pkey(PG_FUNCTION_ARGS)
 {
-	int16		numatts;
+	int16		indnkeyatts;
 	char	  **results;
 	FuncCallContext *funcctx;
 	int32		call_cntr;
@@ -1511,7 +1511,7 @@ dblink_get_pkey(PG_FUNCTION_ARGS)
 	rel = get_rel_from_relname(PG_GETARG_TEXT_P(0), AccessShareLock, ACL_SELECT);
 
 	/* get the array of attnums */
-	results = get_pkey_attnames(rel, &numatts);
+	results = get_pkey_attnames(rel, &indnkeyatts);
 
 	relation_close(rel, AccessShareLock);
@@ -1531,9 +1531,9 @@ dblink_get_pkey(PG_FUNCTION_ARGS)
 	attinmeta = TupleDescGetAttInMetadata(tupdesc);
 	funcctx->attinmeta = attinmeta;
 
-	if ((results != NULL) && (numatts > 0))
+	if ((results != NULL) && (indnkeyatts > 0))
 	{
-		funcctx->max_calls = numatts;
+		funcctx->max_calls = indnkeyatts;
 
 		/* got results, keep track of them */
 		funcctx->user_fctx = results;
@@ -2023,10 +2023,10 @@ dblink_fdw_validator(PG_FUNCTION_ARGS)
  * get_pkey_attnames
  *
  * Get the primary key attnames for the given relation.
- * Return NULL, and set numatts = 0, if no primary key exists.
+ * Return NULL, and set indnkeyatts = 0, if no primary key exists.
 */
 static char **
-get_pkey_attnames(Relation rel, int16 *numatts)
+get_pkey_attnames(Relation rel, int16 *indnkeyatts)
 {
 	Relation	indexRelation;
 	ScanKeyData skey;
@@ -2036,8 +2036,8 @@ get_pkey_attnames(Relation rel, int16 *numatts)
 	char	  **result = NULL;
 	TupleDesc	tupdesc;
 
-	/* initialize numatts to 0 in case no primary key exists */
-	*numatts = 0;
+	/* initialize indnkeyatts to 0 in case no primary key exists */
+	*indnkeyatts = 0;
 
 	tupdesc = rel->rd_att;
@@ -2058,12 +2058,12 @@ get_pkey_attnames(Relation rel, int16 *numatts)
 	/* we're only interested if it is the primary key */
 	if (index->indisprimary)
 	{
-		*numatts = index->indnatts;
-		if (*numatts > 0)
+		*indnkeyatts = index->indnkeyatts;
+		if (*indnkeyatts > 0)
 		{
-			result = (char **) palloc(*numatts * sizeof(char *));
+			result = (char **) palloc(*indnkeyatts * sizeof(char *));
 
-			for (i = 0; i < *numatts; i++)
+			for (i = 0; i < *indnkeyatts; i++)
 				result[i] = SPI_fname(tupdesc, index->indkey.values[i]);
 		}
 		break;
diff --git a/contrib/tcn/tcn.c b/contrib/tcn/tcn.c
index 7352b29..142730a 100644
--- a/contrib/tcn/tcn.c
+++ b/contrib/tcn/tcn.c
@@ -138,9 +138,9 @@ triggered_change_notification(PG_FUNCTION_ARGS)
 	/* we're only interested if it is the primary key and valid */
 	if (index->indisprimary && IndexIsValid(index))
 	{
-		int			numatts = index->indnatts;
+		int			indnkeyatts = index->indnkeyatts;
 
-		if (numatts > 0)
+		if (indnkeyatts > 0)
 		{
 			int			i;
 
@@ -150,7 +150,7 @@ triggered_change_notification(P
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
ake it as simple as possible despite significant effort. It's hard to describe these things in an accessible way. Maybe someone can try and understand what I've said there, and let me know how I might have fallen short. I have used a new term, "cousin page", in this explanation (cf. sibling page). This seems like useful terminology that makes these difficult explanations a bit more accessible. I'm not an expert in btree locking races, but I tested the patch on different workloads, and it works as promised. Nor did I find mistakes in the code logic. As a reviewer, I don't have any objections to marking it "Ready for Committer". The only note I'd like to add is about its compatibility with covering indexes. We already discussed it in the thread https://commitfest.postgresql.org/9/433/ The tiny attached patch fixes this issue. -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/contrib/amcheck/amcheck.c b/contrib/amcheck/amcheck.c index 66d1f4f..91b2bc5 100644 --- a/contrib/amcheck/amcheck.c +++ b/contrib/amcheck/amcheck.c @@ -1118,6 +1118,9 @@ invariant_key_less_than_equal_offset(BtreeCheckState *state, ScanKey key, int16 natts = state->rel->rd_rel->relnatts; int32 cmp; +#if (PG_VERSION_NUM >= 90600) + natts = state->rel->rd_index->indnkeyatts; +#endif cmp = _bt_compare(state->rel, natts, key, state->target, upperbound); return cmp <= 0; @@ -1139,6 +1142,9 @@ invariant_key_greater_than_equal_offset(BtreeCheckState *state, ScanKey key, int16 natts = state->rel->rd_rel->relnatts; int32 cmp; +#if (PG_VERSION_NUM >= 90600) + natts = state->rel->rd_index->indnkeyatts; +#endif cmp = _bt_compare(state->rel, natts, key, state->target, lowerbound); return cmp >= 0; @@ -1164,6 +1170,9 @@ invariant_key_less_than_equal_nontarget_offset(BtreeCheckState *state, int16 natts = state->rel->rd_rel->relnatts; int32 cmp; +#if (PG_VERSION_NUM >= 90600) + natts = state->rel->rd_index->indnkeyatts; +#endif cmp = 
_bt_compare(state->rel, natts, key, nontarget, upperbound); return cmp <= 0; -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: Covering + unique indexes.
08.04.2016 15:45, Anastasia Lubennikova: 08.04.2016 15:06, Teodor Sigaev: On Wed, Apr 6, 2016 at 1:50 PM, Peter Geoghegan <p...@heroku.com> wrote: Personally, I like documenting assertions, and will sometimes write assertions that the compiler could easily optimize away. Maybe going *that* far is more a matter of personal style, but I think an assertion that the new index tuple size is <= the old one is just a good idea. It's not about a problem in your code at all. You should make index_truncate_tuple()/index_reform_tuple() promise to always do this in its comments/contract with the caller as part of this, IMV. Some notes: - index_truncate_tuple(Relation idxrel, IndexTuple olditup, int indnatts, int indnkeyatts) Why do we need indnatts/indnkeyatts? They are already present in the idxrel struct. - Follow the code where index_truncate_tuple() is called: it should never be called in the case where indnatts == indnkeyatts. So indnkeyatts should be strictly less than indnatts; please change the assertion. If they are equal, then this function becomes a complicated variant of CopyIndexTuple(). Good point. These attributes seem to have been there since previous versions of the function, but now they are definitely unnecessary. Updated patch is attached. One more improvement: a note about expressions in the documentation. -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml index b5f67af..61a21a9 100644 --- a/doc/src/sgml/ref/create_index.sgml +++ b/doc/src/sgml/ref/create_index.sgml @@ -161,6 +161,8 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] INCLUDING clause, which can slightly reduce the size of the index, due to storing included attributes only in leaf index pages. Currently, only the B-tree access method supports this feature. +Expressions as included columns are not supported since they cannot be used +in an index-only scan. 
Re: [HACKERS] WIP: Covering + unique indexes.
Attached version has a fix of pg_dump suggested by Stephen Frost in the -committers thread: http://postgresql.nabble.com/pgsql-CREATE-INDEX-INCLUDING-column-td5897653.html Sooner or later, I'd like to see this patch finished. For now, it has two complaints: - Support of expressions as included columns. Frankly, I don't understand why this is a problem with the patch. The patch is already big enough, and it will be much easier to add expression support in a follow-up patch, after the first one is stable. I wonder if anyone has objections to that? Yes, it's a kind of delayed feature, but should we hold back every patch until it is entirely complete? - Lack of review and testing. Obviously, I did as much testing as I could, so if reviewers have any concerns about the patch, I'm looking forward to seeing them. -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] WIP: Covering + unique indexes.
b; ==12310== Invalid read of size 4 ==12310==at 0x656615: match_clause_to_indexcol (indxpath.c:2226) ==12310==by 0x656615: match_clause_to_index (indxpath.c:2144) ==12310==by 0x656DBC: match_clauses_to_index (indxpath.c:2115) ==12310==by 0x658054: match_restriction_clauses_to_index (indxpath.c:2026) ==12310==by 0x658054: create_index_paths (indxpath.c:269) ==12310==by 0x64D1DB: set_plain_rel_pathlist (allpaths.c:649) ==12310==by 0x64D1DB: set_rel_pathlist (allpaths.c:427) ==12310==by 0x64D93B: set_base_rel_pathlists (allpaths.c:299) ==12310==by 0x64D93B: make_one_rel (allpaths.c:170) ==12310==by 0x66876C: query_planner (planmain.c:246) ==12310==by 0x669FBA: grouping_planner (planner.c:1666) ==12310==by 0x66D0C9: subquery_planner (planner.c:751) ==12310==by 0x66D3DA: standard_planner (planner.c:300) ==12310==by 0x66D714: planner (planner.c:170) ==12310==by 0x6FD692: pg_plan_query (postgres.c:798) ==12310==by 0x59082D: ExplainOneQuery (explain.c:350) ==12310== Address 0xbff290c is 2,508 bytes inside a block of size 8,192 alloc'd ==12310==at 0x4C2AB80: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so) ==12310==by 0x81B7FA: AllocSetAlloc (aset.c:853) ==12310==by 0x81D257: palloc (mcxt.c:907) ==12310==by 0x4B6F65: RelationGetIndexScan (genam.c:94) ==12310==by 0x4C135D: btbeginscan (nbtree.c:431) ==12310==by 0x4B7A5C: index_beginscan_internal (indexam.c:279) ==12310==by 0x4B7C5A: index_beginscan (indexam.c:222) ==12310==by 0x4B73D1: systable_beginscan (genam.c:379) ==12310==by 0x7E8CF9: ScanPgRelation (relcache.c:341) ==12310==by 0x7EB3C4: RelationBuildDesc (relcache.c:951) ==12310==by 0x7ECD35: RelationIdGetRelation (relcache.c:1800) ==12310==by 0x4A4D37: relation_open (heapam.c:1118) ==12310== { Memcheck:Addr4 fun:match_clause_to_indexcol fun:match_clause_to_index fun:match_clauses_to_index fun:match_restriction_clauses_to_index fun:create_index_paths fun:set_plain_rel_pathlist fun:set_rel_pathlist fun:set_base_rel_pathlists fun:make_one_rel 
fun:query_planner fun:grouping_planner fun:subquery_planner fun:standard_planner fun:planner fun:pg_plan_query fun:ExplainOneQuery } Separately, I tried "make installcheck-tests TESTS=index_including" from Postgres + Valgrind, with Valgrind's --track-origins option enabled (as it was above). I recommend installing Valgrind, and making sure that the patch shows no errors. I didn't actually find a Valgrind issue from just using your regression tests (nor did I find an issue from separately running the regression tests with CLOBBER_CACHE_ALWAYS, FWIW). Thank you for the advice. Another missed index->ncolumns to index->nkeycolumns replacement, this time in match_clause_to_index. Fixed. I also fixed a couple of typos in the documentation. Thank you again for the detailed review. -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] WIP: Covering + unique indexes.
06.04.2016 16:15, Anastasia Lubennikova: 06.04.2016 03:05, Peter Geoghegan: * There is some stray whitespace within RelationGetIndexAttrBitmap(). I think you should have updated it with code, though. I don't think it's necessary for HOT updates to work, but I think it could be necessary so that we don't need to get a row lock that blocks non-conflict foreign key locking (see heap_update() callers). I think you need to be careful with non-key columns within the loop in RelationGetIndexAttrBitmap(), basically, because it seems to still go through all columns. UPSERT also must call this code, FWIW. * I think that a similar omission is also made for the replica identity stuff in RelationGetIndexAttrBitmap(). Some thought is needed on how this patch interacts with logical decoding, I guess. Good point. Indexes are everywhere in the code; I missed that RelationGetIndexAttrBitmap() is used not only for REINDEX. I'll discuss it with Teodor and send an updated patch tomorrow. As promised, the updated patch is attached. But I'm not an expert in this area, so it needs a critical look. 
-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] WIP: Covering + unique indexes.
19.03.2016 08:00, Peter Geoghegan: On Fri, Mar 18, 2016 at 5:15 AM, David Steele <da...@pgmasters.net> wrote: It looks like this patch should be marked "needs review" and I have done so. Uh, no it shouldn't. I've posted an extensive review on the original design thread. See CF entry: https://commitfest.postgresql.org/9/433/ Marked "Waiting on Author". Thanks to David; I'd missed these messages at first. I'll answer here. * You truncate (remove suffix attributes -- the "included" attributes) within _bt_insertonpg(): - right_item = CopyIndexTuple(item); + indnatts = IndexRelationGetNumberOfAttributes(rel); + indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel); + + if (indnatts != indnkeyatts) + { + right_item = index_reform_tuple(rel, item, indnatts, indnkeyatts); + right_item_sz = IndexTupleDSize(*right_item); + right_item_sz = MAXALIGN(right_item_sz); + } + else + right_item = CopyIndexTuple(item); ItemPointerSet(&(right_item->t_tid), rbkno, P_HIKEY); I suggest that you do this within _bt_insert_parent(), instead, iff the original target page is known to be a leaf page. That's where it needs to happen for conventional suffix truncation, which has special considerations when determining which attributes are safe to truncate (or even which byte in the first distinguishing attribute it is okay to truncate past). I agree that _bt_insertonpg() is not the right place for truncation. Furthermore, I've noticed that all internal keys are simply copies of the high keys from the leaf pages, which is pretty logical. Therefore, if we have already truncated the tuple when it became a high key, we do not need the same truncation within _bt_insert_parent() or any other function. So the only thing to worry about is the high key truncation. I rewrote the code; now only _bt_split cares about truncation. It's a bit more complicated to add it into the index creation algorithm. There's a trick with the "high key". 
/* * We copy the last item on the page into the new page, and then * rearrange the old page so that the 'last item' becomes its high key * rather than a true data item. There had better be at least two * items on the page already, else the page would be empty of useful * data. */ /* * Move 'last' into the high key position on opage */ To be consistent with the other steps of the algorithm (all high keys must be truncated tuples), I had to update this high key in place: delete the old one, and insert a truncated high key. The very same logic I use to truncate the posting list of a compressed tuple in the "btree_compression" patch. [1] I hope both patches will be accepted, and then I'll thoroughly merge them. * I think the comparison logic may have a bug. Does this work with amcheck? Maybe it works with bt_index_check(), but not bt_index_parent_check()? I think that you need to make sure that _bt_compare() knows about this, too. That's because it isn't good enough to let a truncated internal IndexTuple compare equal to a scankey when non-truncated attributes are equal. It is a very important issue. But I don't think there's a bug there. I've read the amcheck sources thoroughly and found that the problem appears in invariant_key_less_than_equal_nontarget_offset(): static bool invariant_key_less_than_equal_nontarget_offset(BtreeCheckState *state, Page nontarget, ScanKey key, OffsetNumber upperbound) { int16 natts = state->rel->rd_rel->relnatts; int32 cmp; cmp = _bt_compare(state->rel, natts, key, nontarget, upperbound); return cmp <= 0; } It uses a scankey made with _bt_mkscankey(), which uses only key attributes, but calls _bt_compare with the wrong keysz. If we use nkeyatts = state->rel->rd_index->indnkeyatts; instead of natts, all the checks pass successfully. Same for invariant_key_less_than_equal_offset() and invariant_key_greater_than_equal_offset(). 
In my view, it's the correct way to fix this problem, because the caller is responsible for passing proper arguments to the function. Of course I will add a check into _bt_compare, but I'd rather make it an assertion (see the patch attached). I'll add a flag to distinguish regular and truncated tuples, but it will not be used in this patch. Please comment if I've missed something. As you've already mentioned, neither high keys nor tuples on internal pages use "itup->t_tid.ip_posid", so I'll take one bit of it. It will definitely require changes in the future work on suffix truncation or something like that, but IMHO for now it's enough. Do you have any objections or comments? [1] https://commitfest.postgresql.org/9/494/ -- Anastasia Lubennikova Pos
Re: [HACKERS][PATCH] Supporting +-Infinity values by to_timestamp(float8)
27.02.2016 09:57, Vitaly Burovoy: Hello, Hackers! I worked on a patch[1] that allows "EXTRACT(epoch FROM +-Inf::timestamp[tz])" to return "+-Inf::float8". There is an opposite function "to_timestamp(float8)", which is now defined as: SELECT ('epoch'::timestamptz + $1 * '1 second'::interval) Hi, thank you for the patches. Could you explain whether they depend on each other? Since intervals do not support infinity values, it is impossible to do something like: SELECT to_timestamp('infinity'::float8); ... which is not good. Support for such a conversion is in the TODO list[2] (under "converting between infinity timestamp and float8"). You mention intervals here, and the TODO item definitely talks about an 'infinity' interval, while the patch and all the following discussion concern timestamps. Is it a typo, or have I misunderstood something important? I assumed that the following query would work, but it doesn't. Could you clarify that? select to_timestamp('infinity'::interval); The proposed patch implements it. There is another patch in the 2016-03 CF[3] that implements checking whether a timestamp[tz] is in the allowed range. Since it is wise to set (fix) the upper boundary of timestamp[tz]s, I've included the file "src/include/datatype/timestamp.h" from there to check that an input value and a result are in the allowed range. There are no changes in the documentation because the allowed range is the same as the officially supported one[4] (i.e. until 294277 AD). I think that you should update the documentation, at least the description of epoch on this page: http://www.postgresql.org/docs/devel/static/functions-datetime.html Here is how you can convert an epoch value back to a time stamp: SELECT TIMESTAMP WITH TIME ZONE 'epoch' + 982384720.12 * INTERVAL '1 second'; (The to_timestamp function encapsulates the above conversion.) More thoughts about the patch: 1. When I copy a value from the hints for the min and max values (see examples below), it works fine for min, while max still leads to an error. 
It comes from the check "if (seconds >= epoch_ubound)". I wonder whether you should change the hint message. select to_timestamp(-210866803200.00); to_timestamp - 4714-11-24 02:30:17+02:30:17 BC (1 row) select to_timestamp(9224318016000.00); ERROR: UNIX epoch out of range: "9224318016000.00" HINT: Maximal UNIX epoch value is "9224318016000.00" 2. There is a comment about JULIAN_MAXYEAR inaccuracy in timestamp.h: * IS_VALID_JULIAN checks the minimum date exactly, but is a bit sloppy * about the maximum, since it's far enough out to not be especially * interesting. Maybe you can expand it? - Does JULIAN_MAXYEAR4STAMPS help to avoid overflow in all possible cases? - Why do we need to keep both definitions? I suppose it's a matter of backward compatibility, isn't it? 3. (nitpicking) I'm not sure about the "4STAMPS" suffix. "4" is a nice abbreviation, but it seems slightly confusing to me. -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] WIP: Covering + unique indexes.
29.02.2016 18:17, Anastasia Lubennikova: 25.02.2016 21:39, Jeff Janes: As promised, here's the new version of the patch, "including_columns_4.0". I fixed all issues except some points mentioned below. Thanks for the updated patch. I get a compiler warning: genam.c: In function 'BuildIndexValueDescription': genam.c:259: warning: unused variable 'tupdesc' Thank you for the notice; I'll fix it in the next update. Also, I can't create a primary key INCLUDING columns directly: jjanes=# create table foobar (a int, b int, c int); jjanes=# alter table foobar add constraint foobar_pkey primary key (a,b) including (c); ERROR: syntax error at or near "including" But I can get there using a circuitous route: jjanes=# create unique index on foobar (a,b) including (c); jjanes=# alter table foobar add constraint foobar_pkey primary key using index foobar_a_b_c_idx; The description of the table's index knows to include the including column: jjanes=# \d foobar Table "public.foobar" Column | Type | Modifiers --------+---------+----------- a | integer | not null b | integer | not null c | integer | Indexes: "foobar_pkey" PRIMARY KEY, btree (a, b) INCLUDING (c) Since the machinery appears to all be in place to have primary keys with INCLUDING columns, it would be nice if the syntax for adding primary keys allowed one to implement them directly. Is this something for future expansion, or could it be added at the same time as the main patch? Good point. At first glance, this looks easy to implement. The only problem is that there are too many places in the code which must be updated. I'll try to do it, and if there are difficulties, it's fine with me to delay this feature for future work. I found one more thing to do: pg_dump does not handle included columns now. I will fix it in the next version of the patch. As promised, the fixed patch is in the attachments. 
It allows performing the following statements: create table utbl (a int, b box); alter table utbl add unique (a) including(b); create table ptbl (a int, b box); alter table ptbl add primary key (a) including(b); And now they can be dumped/restored successfully. I used the following commands: pg_dump --verbose -Fc postgres -f pg.dump pg_restore -d newdb pg.dump It is not the final version, because it breaks pg_dump for previous versions. I need some help from hackers here. pg_dump, line 5466: if (fout->remoteVersion >= 90400) What does 'remoteVersion' mean? And what is the right way to change it? Or does it change between releases? I guess that 90400 is for 9.4 and 80200 is for 8.2, but is it really so? That is totally new to me. BTW, while we are on the subject, maybe it's worth replacing these magic numbers with some set of macros? P.S. I'll update the documentation for ALTER TABLE in the next patch. -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] WIP: Covering + unique indexes.
21.03.2016 19:53, Anastasia Lubennikova: 19.03.2016 08:00, Peter Geoghegan: On Fri, Mar 18, 2016 at 5:15 AM, David Steele <da...@pgmasters.net> wrote: It looks like this patch should be marked "needs review" and I have done so. Uh, no it shouldn't. I've posted an extensive review on the original design thread. See CF entry: https://commitfest.postgresql.org/9/433/ Marked "Waiting on Author". Thanks to David, I've missed these letters at first. I'll answer here. * You truncate (remove suffix attributes -- the "included" attributes) within _bt_insertonpg(): - right_item = CopyIndexTuple(item); + indnatts = IndexRelationGetNumberOfAttributes(rel); + indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel); + + if (indnatts != indnkeyatts) + { + right_item = index_reform_tuple(rel, item, indnatts, indnkeyatts); + right_item_sz = IndexTupleDSize(*right_item); + right_item_sz = MAXALIGN(right_item_sz); + } + else + right_item = CopyIndexTuple(item); ItemPointerSet(&(right_item->t_tid), rbkno, P_HIKEY); I suggest that you do this within _bt_insert_parent(), instead, iff the original target page is know to be a leaf page. That's where it needs to happen for conventional suffix truncation, which has special considerations when determining which attributes are safe to truncate (or even which byte in the first distinguishing attribute it is okay to truncate past) I agree that _bt_insertonpg() is not right for truncation. Furthermore, I've noticed that all internal keys are solely the copies of "High keys" from the leaf pages. Which is pretty logical. Therefore, if we have already truncated the tuple, when it became a High key, we do not need the same truncation within _bt_insert_parent() or any other function. So the only thing to worry about is the HighKey truncation. I rewrote the code. Now only _bt_split cares about truncation. It's a bit more complicated to add it into index creation algorithm. There's a trick with a "high key". 
/* * We copy the last item on the page into the new page, and then * rearrange the old page so that the 'last item' becomes its high key * rather than a true data item. There had better be at least two * items on the page already, else the page would be empty of useful * data. */ /* * Move 'last' into the high key position on opage */ To be consistent with other steps of algorithm ( all high keys must be truncated tuples), I had to update this high key on place: delete the old one, and insert truncated high key. The very same logic I use to truncate posting list of a compressed tuple in the "btree_compression" patch. [1] I hope, both patches will be accepted, and then I'll thoroughly merge them . * I think the comparison logic may have a bug. Does this work with amcheck? Maybe it works with bt_index_check(), but not bt_index_parent_check()? I think that you need to make sure that _bt_compare() knows about this, too. That's because it isn't good enough to let a truncated internal IndexTuple compare equal to a scankey when non-truncated attributes are equal. It is a very important issue. But I don't think it's a bug there. I've read amcheck sources thoroughly and found that the problem appears at "invariant_key_less_than_equal_nontarget_offset() static bool invariant_key_less_than_equal_nontarget_offset(BtreeCheckState *state, Page nontarget, ScanKey key, OffsetNumber upperbound) { int16natts = state->rel->rd_rel->relnatts; int32cmp; cmp = _bt_compare(state->rel, natts, key, nontarget, upperbound); return cmp <= 0; } It uses scankey, made with _bt_mkscankey() which uses only key attributes, but calls _bt_compare with wrong keysz. If we wiil use nkeyatts = state->rel->rd_index->relnatts; instead of natts, all the checks would be passed successfully. Same for invariant_key_greater_than_equal_offset() and invariant_key_less_than_equal_nontarget_offset(). 
In my view, that's the correct way to fix this problem, because the caller is responsible for passing proper arguments to the function. Of course I will add a check into _bt_compare(), but I'd rather make it an assertion (see the patch attached). I'll add a flag to distinguish regular and truncated tuples, but it will not be used in this patch. Please comment if I've missed something. As you've already mentioned, neither high keys nor tuples on internal pages use "itup->t_tid.ip_posid", so I'll take one bit of it. It will definitely require changes in future work on suffix truncation or something like that, but IMHO for now it's enough. Do you have any objections or comments? [1] https://commitfest.pos
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
25.03.2016 01:12, Peter Geoghegan:
On Thu, Mar 24, 2016 at 7:17 AM, Robert Haas <robertmh...@gmail.com> wrote:
I really like this idea, and the performance results seem impressive, but I think we should push this out to 9.7. A btree patch that didn't have WAL support until two and a half weeks into the final CommitFest just doesn't seem to me like a good candidate. First, as a general matter, if a patch isn't code-complete at the start of a CommitFest, it's reasonable to say that it should be reviewed but not necessarily committed in that CommitFest.

You're right. Frankly, I thought that someone would help me with the patch, but I had to finish it myself.

*off-topic* I wonder if we could add a new flag to the commitfest application. Something like "Needs assistance", which would be used to mark big and complicated patches still in progress, while "Needs review" means that the patch is almost ready and only requires a final review.

This patch has had some review, but I'm not sure how deep that review is, and I think it's had no code review at all of the WAL logging changes, which were submitted only a week ago, well after the CF deadline. Second, the btree AM is a particularly poor place to introduce possibly destabilizing changes. Everybody depends on it, all the time, for everything. And despite new tools like amcheck, it's not a particularly easy thing to debug.

Regrettably, I must agree. I don't see a plausible path to commit for this patch in the ongoing CF. I think that Anastasia did an excellent job here, and I wish I could have been of greater help sooner. Nevertheless, it would be unwise to commit this given the maturity of the code. There have been very few instances of performance improvements to the B-Tree code for as long as I've been interested, because it's so hard, and the standard is so high.
The only example I can think of from the last few years is Kevin's commit 2ed5b87f96 and Tom's commit 1a77f8b63d, both of which were far less invasive, and Simon's commit c7111d11b1, which we just outright reverted from 9.5 due to subtle bugs (and even that was significantly less invasive than this patch). Improving nbtree is something that requires several rounds of expert review, and that's something that's in short supply for the B-Tree code in particular. I think that a new testing strategy is needed to make this easier, and I hope to get that going with amcheck. I need help with formalizing a "testing first" approach for improving the B-Tree code, because I think it's the only way that we can move forward with projects like this. It's *incredibly* hard to push forward patches like this given our current, limited testing strategy.

Unfortunately, I must agree. This patch seems unlikely to reach a final version before the feature freeze, so I'll move it to a future commitfest. Anyway, it means that now we have more time to improve the patch. If you have any ideas related to this patch, like prefix/suffix compression, I'll be glad to discuss them. The same goes for any other ideas on B-tree optimization.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Supporting +-Infinity values by to_timestamp(float8)
17.03.2016 06:27, Vitaly Burovoy:
On 2016-03-15, David Steele <da...@pgmasters.net> wrote:
On 3/4/16 2:56 PM, Vitaly Burovoy wrote:
On 3/4/16, Anastasia Lubennikova <a.lubennik...@postgrespro.ru> wrote:
I think that you should update the documentation. At least the description of epoch on this page: http://www.postgresql.org/docs/devel/static/functions-datetime.html

Thank you very much for pointing out where it is located (I saw only "to_timestamp(TEXT, TEXT)"). I'll think about how to update it.

Vitaly, have you decided how to update this yet?

Yes, there are three versions:
* remove the mention of how to get timestamptz from a UNIX stamp;
* leave a note on how to get timestamptz and add a note that such encapsulation existed prior to 9.6;
* replace it with the proposed current behavior (without interval).
I decided to implement the third case (but the result there has a time zone, which can differ, so the result may not be exactly the same for a particular user). If a committer decides to do otherwise, he is free to choose one from the list above or to do something else.

3. (nitpicking) I'm not sure about the "4STAMPS" suffix. "4" is a nice abbreviation, but it seems slightly confusing to me.

It doesn't matter to me what it is called; it is short enough and reflects the type it is applied to. What would the best name for it be?

Anastasia, any suggestions for a better name, or just leave it as is?

I'm not in favor of the "4", either. I think I would prefer JULIAN_MAXYEAR_STAMP.

It turns out that Tom has changed almost one third of timestamp.h, and now the constant is named "TIMESTAMP_END_JULIAN". I've rebased the patch onto the current master (5db5146) and changed it according to the new timestamp.h. Now it passes with both versions (integer and float timestamps). I deleted the test for the upper boundary for integer timestamps and tweaked some comments. I decided to delete the hints about minimum and maximum allowed values because no other type has such a hint. Please find attached a new version of the patch.
I think I should write something as a reviewer. I read the patch again and I don't see any issues with it. It applies to master and works as expected. So I'll mark it "Ready for committer".

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
[HACKERS] Processes and caches in postgresql
Hi, hackers. I have a couple of questions about processes. I found the EXEC_BACKEND flag while reading the code. As I understand it, it exists because we have to emulate fork() on WIN32, and it also allows debugging the same behavior on Linux. Is that right? Are there any other use cases?

Another question is about the "--fork" argument (see the code below). I didn't find it in the documentation, so I'm a bit confused. I wonder if it exists only for internal purposes?

#ifdef EXEC_BACKEND
    if (argc > 1 && strncmp(argv[1], "--fork", 6) == 0)
        SubPostmasterMain(argc, argv);  /* does not return */
#endif

And last, but not least: do we have any presentations/articles/READMEs/whatever about the caches (src/backend/utils/cache/*) in PostgreSQL? I found nothing besides comments in the code.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] Leaking memory in text_overlay function
I found this thread in the open CF. As I see it, the discussion ended with a decision to reject the patch, so I've just changed the status to "Rejected".

02.07.2016 18:11, Dirk Rudolph:
Well, that's good to know. It was just curious that, in my case, I only saw this in this particular function. Anyway, I will consider handling the memory the same way, if this is the recommendation. Many thanks.

/Closed

On Sat, Jul 2, 2016 at 4:45 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
Dirk Rudolph <dirk.rudo...@netcentric.biz> writes:
> while implementing my own C function, I noticed that some memory is not freed by the text_overlay function in varlena.c

By and large, that's intentional. SQL-called functions normally run in short-lived memory contexts, so any memory they don't bother to pfree will be cleaned up automatically at the end of the tuple cycle. If we did not follow that approach, there would need to be many thousands more explicit pfree calls than there are today, and the system would be net slower overall, because multiple retail pfree's are slower than a MemoryContextReset. There are places where it's important to avoid leaking memory, certainly. But I don't think this is one of them, unless you can demonstrate a scenario with query-lifespan or worse leakage.

> Particularly I mean both substrings allocated by text_substring (according to the documentation of text_substring, "The result is always a freshly palloc'd datum.") and one of the concatenation results. I'm aware of the MemoryContext being deleted in my case, but IMHO builtin functions should be memory safe.

That is not the project policy.
regards, tom lane

--
Dirk Rudolph | Senior Software Engineer
Netcentric AG
M: +41 79 642 37 11
D: +49 174 966 84 34
dirk.rudo...@netcentric.biz | www.netcentric.biz

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
[HACKERS] Refactoring of heapam code.
Working on page compression and some other issues related to access methods, I found out that the code related to the heap looks too complicated. Much more complicated than it should be. Since I got into this area anyway, I want to suggest a set of improvements. There are a number of problems I see in the existing code:

Problem 1. Heap != Relation.
File heapam.c has a note:

 * This file contains the heap_ routines which implement
 * the POSTGRES heap access method used for all POSTGRES
 * relations.

This statement is wrong, since we also have index relations that are implemented using other access methods. Also, I think that it's quite strange to have a function heap_create() that is called from index_create(). It has absolutely nothing to do with the heap access method. And so on, and so on. One more thing that requires refactoring is the "RELKIND_RELATION" name. We have a type of relation called "relation"...

Problem 2. Some functions are shared between the heap and index access methods. Maybe that used to be handy someday, but now, I think, it's time to separate them, because IndexTuples and HeapTuples have really little in common.

Problem 3. The word "heap" is used in many different contexts.
Heap - a storage where we have tuples and visibility information. Generally speaking, it can be implemented over any data structure, not just a plain table as it is done now.
Heap - an access method that provides a set of routines for plain table storage, such as insert, delete, update, gettuple, vacuum and so on. We use this access method not only for users' tables.
There are many types of relations (pg_class.h):

#define RELKIND_RELATION        'r'   /* ordinary table */
#define RELKIND_INDEX           'i'   /* secondary index */
#define RELKIND_SEQUENCE        'S'   /* sequence object */
#define RELKIND_TOASTVALUE      't'   /* for out-of-line values */
#define RELKIND_VIEW            'v'   /* view */
#define RELKIND_COMPOSITE_TYPE  'c'   /* composite type */
#define RELKIND_FOREIGN_TABLE   'f'   /* foreign table */
#define RELKIND_MATVIEW         'm'   /* materialized view */

They can be logically separated into three categories:
"primary storage" - r, S, t, v. They store data and visibility information. The only implementation is heapam.c.
"secondary index" - i. They store some data and pointers to the primary storage. Various existing AMs, managed by the AM API.
"virtual relations" - c, f, m. They have no physical storage, only entries in caches and catalogs.

Now, for some reason, we sometimes use the name "heap" for both "primary storage" and "virtual relations". See src/backend/catalog/heap.c. But some functions work only with "primary storage". See pgstat_relation(). And we constantly have to check relkind everywhere. I'd like it to be clear from the code interface and naming.

So here is a couple of patches. They do not cover all the mentioned problems, but I'd like to get feedback before continuing.

Patch 1:
There is a macro "PageGetItem" in bufpage.h that is used to retrieve an item from a given page. It is used for both heap and index tuples. That doesn't seem to be a problem, until you try to find anything in this code. The first patch, "PageGetItem_refactoring.patch", is intended to fix it. It is just renaming:
(IndexTuple) PageGetItem ---> PageGetItemIndex
(HeapTupleHeader) PageGetItem ---> PageGetItemHeap
Other types of tuples (BrinTuple, SpGistInnerTuple, SpGistLeafTuple, SpGistDeadTuple) still use PageGetItem. I also changed functions that do not access the tuple's data, but only need the HeapTupleHeader fields. They use a macro PageGetItemHeapHeaderOnly.
I do not insist on these particular names, by the way.

Patch 2.
heapam.c, hio.c and src/backend/catalog/heap.c have a lot of confusing names and outdated comments. The patch is intended to improve this. Fun fact: I found several "check it later" comments unchanged since the "Postgres95 1.01 Distribution - Virgin Sources" commit. I wonder if we can find a better name for relation_drop_with_catalog(), since, again, it's not about all kinds of relations. We use another function to drop index relations.

I hope these changes will be useful for future development. The suggested patches are mostly about renaming and rearranging functions between files, so, if nobody has conceptual objections, all I need from reviewers is an attentive look to find typos, grammar mistakes and overlooked areas.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 4983bbb..257b609 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
Re: [HACKERS] Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]
30.07.2016 14:00, Andrew Borodin:
Here is the BRIN-compatible version of the patch. Now the function PageIndexTupleOverwrite consumes the tuple size as a parameter; this behavior is coherent with PageAddItem. Also, I had to remove the assertion that the item has storage in the loop that adjusts line pointers. It would be great if someone could check that place (the commented-out Assert(ItemIdHasStorage(ii)); in PageIndexTupleOverwrite). I suspect that there might be a case when linp's should not be adjusted.

Hi,
I reviewed the code and I have a couple of questions about the following code:

    /* may have negative size here if new tuple is larger */
    size_diff = oldsize - alignednewsize;
    offset = ItemIdGetOffset(tupid);

    if (offset < phdr->pd_upper ||
        (offset + size_diff) > phdr->pd_special ||
        offset != MAXALIGN(offset) ||
        size_diff != MAXALIGN(size_diff))
        ereport(ERROR,
                (errcode(ERRCODE_DATA_CORRUPTED),
                 errmsg("corrupted item offset: offset = %u, size = %u",
                        offset, (unsigned int) size_diff)));

First of all, shouldn't we use MAXALIGN(oldsize) instead of oldsize? Although I'm quite sure that it was already aligned somewhere before. I doubt that the check (size_diff != MAXALIGN(size_diff)) is necessary. I'd rather add the check (offset + size_diff < pd_lower). Besides that, the patch seems pretty good.

BTW, I'm very surprised that it improves performance so much. And the size is also reduced significantly: 89MB against 289MB without the patch. Could you explain in detail why that happens?

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] Refactoring of heapam code.
05.08.2016 16:30, Kevin Grittner:
On Fri, Aug 5, 2016 at 2:54 AM, Anastasia Lubennikova <a.lubennik...@postgrespro.ru> wrote:
They can be logically separated into three categories:
"primary storage" - r, S, t, v. They store data and visibility information. The only implementation is heapam.c.
"secondary index" - i. They store some data and pointers to the primary storage. Various existing AMs, managed by the AM API.
"virtual relations" - c, f, m. They have no physical storage, only entries in caches and catalogs.

Materialized views (relkind == "m") have physical storage, and may have indexes.

Good point. I confused the letters for views (virtual relations) and materialized views (primary storage). It should be:
"primary storage" - r, S, t, m. They store data and visibility information. The only implementation is heapam.c.
"secondary index" - i. They store some data and pointers to the primary storage. Various existing AMs, managed by the AM API.
"virtual relations" - c, f, v. They have no physical storage, only entries in caches and catalogs.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]
06.08.2016 19:51, Andrew Borodin:
Anastasia, thank you for your attentive code examination.

2016-08-05 21:19 GMT+05:00 Anastasia Lubennikova <a.lubennik...@postgrespro.ru>:
First of all, shouldn't we use MAXALIGN(oldsize) instead of oldsize? Although I'm quite sure that it was already aligned somewhere before. I doubt that the check (size_diff != MAXALIGN(size_diff)) is necessary. I'd rather add the check (offset + size_diff < pd_lower).

Actually, that's a tricky question. There may be different code expectations:
1. If we expect that a not-maxaligned tuple may be placed by some other routine, we should remove the check (size_diff != MAXALIGN(size_diff)), since it will fail for a not-maxaligned tuple.
2. If we expect that no one should use PageIndexTupleOverwrite with not-maxaligned tuples, then the current checks are OK: we will break execution if we find non-maxaligned tuples, with an almost correct message.
I suggest that this check could be a debug-only assertion: in production code the routine will work with not-maxaligned tuples if they already reside on the page, while in a dev build it will inform the developer that something is going wrong. Is this behavior compliant with Postgres style? I agree that the pd_lower check makes sense.

Pointing at this comparison, I was worried about a possible call of MAXALIGN on a negative size_diff value, although I'm not sure whether it can cause any problem. Tuples on a page are always maxaligned. But, as far as I recall, itemid->lp_len can contain a non-maxaligned value. So, I'd suggest performing MAXALIGN(oldsize) before computing size_diff:
size_diff = oldsize - alignednewsize;
Since both arguments are then maxaligned, the check of size_diff is not needed.

BTW, I'm very surprised that it improves performance so much. And the size is also reduced significantly: 89MB against 289MB without the patch. Could you explain in detail why that happens?

Size reduction is unexpected for me. There are 4 plausible explanations. I'll list them ordered by descending probability:
1.
Before this patch, every update threw the recently updated tuple to the end of a page. Thus, in somewhat ordered data, the recent best-fit will be discovered last. This can cause an increase of MBB overlap in a spatial index and a slightly higher tree. But a 3-times size decrease is unlikely. How did you obtain those results? Can I look at a test case?

Nothing special, I've just run the test you provided in this thread, and checked the index size via select pg_size_pretty(pg_relation_size('idx'));

2. A bug in PageIndexTupleDelete causing unused space to appear. I've searched for it, unsuccessfully.
3. A bug in PageIndexTupleOverwrite. I cannot imagine the nature of such a bug. Maybe we are not placing something big but not very important on a page?
4. Magic. I do not see what one should do to the R-tree to change its size by a factor of 3.
The first three explanations are not much better than the fourth, actually. Those 89 MB, they do not include WAL, right?

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] [Patch] Temporary tables that do not bloat pg_catalog (a.k.a fast temp tables)
05.08.2016 19:41, Robert Haas:
2. This inserts additional code in a bunch of really low-level places like heap_hot_search_buffer, heap_update, heap_delete, etc. I think what you've basically done here is create a new, in-memory heap AM and then, because we don't have an API for adding new storage managers, you've bolted it onto the existing heapam layer. That's certainly a reasonable approach for creating a PoC, but I think we actually need a real API here. Otherwise, when the next person comes along and wants to add a third heap implementation, they've got to modify all of these same places again. I don't think this code is reasonably maintainable in this form.

As I see it, you recommend cleaning up the API of the storage management code. I strongly agree that it's high time to do that. So, I started a discussion about refactoring and improving the API of heapam and heap relations. You can find it on the commitfest: https://commitfest.postgresql.org/10/700/ I'll be glad to see your thoughts on the thread.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] Refactoring of heapam code.
05.08.2016 20:56, Alvaro Herrera:
Anastasia Lubennikova wrote:
Working on page compression and some other issues related to access methods, I found out that the code related to the heap looks too complicated. Much more complicated than it should be. Since I got into this area anyway, I want to suggest a set of improvements.

Hm. I'm hacking on heapam.c pretty heavily ATM. Your first patch causes no problem I think, or if it does it's probably pretty minor, so that's okay. I'm unsure about the second one -- I think it's okay too, because it mostly seems to be about moving stuff from heapam.c to hio.c and shuffling some code around that I don't think I'm modifying.

I'm sure these patches will not cause any problems. The second one is mostly about moving unrelated stuff from heapam.c to hio.c and renaming a couple of functions in heap.c. Anyway, they are not a final solution, just a proof of concept. By the way, I thought about looking at the mentioned patch and probably reviewing it, but didn't find any of your patches in the current commitfest. Could you point me to the thread?

Also I think that it's quite strange to have a function heap_create() that is called from index_create(). It has absolutely nothing to do with the heap access method.

Agreed. But changing its name while keeping it in heapam.c does not really improve things enough. I'd rather have it moved elsewhere that's not specific to "heaps" (somewhere in access/common perhaps). However, renaming the functions would break third-party code for no good reason. I propose that we only rename things if we also break them for other reasons (say, because the API changes in a fundamental way).

Yes, I agree that it should be moved to another file. Just to be more accurate: it's not in heapam.c now, it is in src/backend/catalog/heap.c, which requires many more changes than I made. Concerning the third-party code: it's a good observation, and we definitely should keep it in mind.
Nevertheless, I doubt that there are a lot of external calls to these functions. And I hope this thread will end up with fundamental changes introducing a clear API for all the mentioned problems.

One more thing that requires refactoring is the "RELKIND_RELATION" name. We have a type of relation called "relation"...

I don't see us fixing this bit, really. Historical baggage and all that. For example, a lot of SQL queries use "WHERE relkind = 'r'". If we change the name, we should probably also change the relkind constant, and that would break the queries. If we change the name and do not change the constant, then we have to have a comment "we call them RELKIND_TABLE but the char is r because it was called RELATION previously", which isn't any better.

Good point. I'd rather change it to RELKIND_BASIC_RELATION or something like that, which is more specific and still goes well with the 'r' constant. But minor changes like that can wait until we clarify the API in general.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] Refactoring of heapam code.
08.08.2016 03:51, Michael Paquier:
On Sat, Aug 6, 2016 at 2:56 AM, Alvaro Herrera <alvhe...@2ndquadrant.com> wrote:
Anastasia Lubennikova wrote:
So here is a couple of patches. They do not cover all the mentioned problems, but I'd like to get feedback before continuing.

I agree that we could improve things in this general area, but I do not endorse any particular changes you propose in these patches; I haven't reviewed your patches.

Well, I am interested in the topic. And just had a look at the patches proposed.

+/*
+ * Split PageGetItem into set of different macros
+ * in order to make code more readable.
+ */
+#define PageGetItemHeap(page, itemId) \
+( \
+	AssertMacro(PageIsValid(page)), \
+	AssertMacro(ItemIdHasStorage(itemId)), \
+	(HeapTupleHeader)(((char *)(page)) + ItemIdGetOffset(itemId)) \
+)

Placing into bufpage.h macros that are specific to heap and indexes is not that much a good idea. I'd suggest having the heap ones in heapam.h, and the index ones in a more generic place, such as src/access/common, as already mentioned by Alvaro.

+ onpage_tup = PageGetItemHeapHeaderOnly(page, newitemid);
Just PageGetItemHeapHeader would be fine.

@@ -2324,7 +2087,6 @@ FreeBulkInsertState(BulkInsertState bistate)
 	pfree(bistate);
 }
 
-
 /*
  * heap_insert - insert tuple into a heap

Those patches have some noise. Patch 0002 is doing two things: reorganizing the order of the routines in heapam.c and moving some routines from heapam.c to hio.c. Those two could be separate patches. If the final result is to be able to extend custom AMs for tables, we should have a structure like src/access/common/index.c, src/access/common/table.c and src/access/common/relation.c, for example, and have headers dedicated to them. But yes, there is definitely a lot of room for refactoring, as we are aiming, at least I guess so, at having more various code paths for access methods.

Thank you for the review, I'll fix these problems in the final version. By posting the first message I intended to raise a discussion.
Patches were attached mainly to illustrate the problem and to be more concrete. Thank you for the attention and feedback. Since there are no objections to the idea in general, I'll write an exhaustive README about the current state of the code and then propose an API design to discuss the details. Stay tuned.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] WIP: Covering + unique indexes.
14.08.2016 20:11, Andrey Borodin:
The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, failed
Spec compliant: tested, passed
Documentation: tested, passed

Hi hackers! I've read the patch and here is my code review.

==PURPOSE==
I've used this feature from time to time with MS SQL. From my experience, INCLUDE is a 'sugar on top' feature. Some MS SQL classes do not even mention INCLUDE despite its having been there since 2005 (though classes do not mention lots of important things, so it's not that valuable an indicator). But those who use it, use it whenever possible. For example, a system view with recommended indices rarely lists one without INCLUDE columns. So, this feature is very important from the perspective of converting MS SQL DBAs to PostgreSQL. This is how I see it.

Thank you for the review, I hope this feature will be useful for many people.

==SUGGESTIONS==
0. Index build is broken. This script https://github.com/x4m/pggistopt/blob/8ad65d2e305e98c836388a07909af5983dba9c73/test.sql SEGFAULTs and may cause a situation when you cannot insert anything into the table (I think dropping the index would help, but I didn't test this).

Thank you for reporting. That was a bug caused by high key truncation, which occurs when the index has more than 3 levels. Fixed. See the attached file.

1. I think the MS SQL syntax INCLUDE instead of INCLUDING would be better (for the purpose listed above).

I've chosen this particular name to avoid adding a new keyword. We already have INCLUDING in Postgres in the context of inheritance, which will never intersect with covering indexes. I'm sure it won't be a big problem for migration from MS SQL.

2. An empty line is added in ruleutils.c. Is it there for a reason?

No, just a missed line. Fixed.

3. Now we have indnatts and indnkeyatts instead of indnatts. I think it is worth considering renaming indnatts to something different from the old name. Someone somewhere could still suppose it's the number of keys.
I agree that the naming became vague after this patch. I've already suggested replacing "indkeys[]" with a more specific name, and AFAIR there was no reaction, so I didn't do that. But I'm not sure about your suggestion regarding indnatts. Old queries (and old indexes) can still use it correctly. I don't see a reason to break compatibility for all users. Those who will use this new feature should ensure that their queries to pg_index behave as expected.

==PERFORMANCE==
Due to suggestion number 0 I could not measure the performance of index build. The index crashes when there are more than 1.1 million rows in a table. The performance test script is here: https://github.com/x4m/pggistopt/blob/f206b4395baa15a2fa42897eeb27bd555619119a/test.sql
The test scenario is the following:
1. Create a table, then create an index, then add data.
2. Make a query touching data in INCLUDING columns.
This scenario is tested against a table with:
A. an index that does not contain the touched columns, just the PK;
B. an index with all columns in the index;
C. an index with the PK in the keys and INCLUDING all other columns.
Tests were executed 5 times on an Ubuntu VM under Hyper-V, i5 2500 CPU, 16 GB of RAM, SSD disk.

Time to insert 10M rows:
A. AVG 110 seconds, STD 4.8
B. AVG 121 seconds, STD 2.0
C. AVG 111 seconds, STD 5.7
Inserts into the INCLUDING index are almost as fast as inserts into the index without extra columns.

Time to run the SELECT query:
A. AVG 2864 ms, STD 794
B. AVG 2329 ms, STD 84
C. AVG 2293 ms, STD 58
Selects with INCLUDING columns are almost as fast as with the full index.

Index size (deterministic measure, STD = 0):
A. 317 MB
B. 509 MB
C. 399 MB
Index size is in the middle between the full index and the minimal index.

I think these numbers agree with the expectations for the feature.

==CONCLUSION==
This patch brings a useful and important feature. The build shall be repaired; my other suggestions are only suggestions.

Best regards, Andrey Borodin, Octonica & Ural Federal University.
The new status of this patch is: Waiting on Author

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c
index 9c8e308..891325d 100644
--- a/contrib/dblink/dblink.c
+++ b/contrib/dblink/dblink.c
@@ -100,7 +100,7 @@ static remoteConn *getConnectionByName(const char *name);
 static HTAB *createConnHash(void);
 static void createNewConnection(const char *name, remoteConn *rconn);
 static void deleteConnection(const char *name);
-static char **get_pkey_attnames(Relation rel, int16 *numatts);
+static char **get_pkey_attnames(Relation rel, int16 *indnkeyatts);
 static char **get_text_array_contents(ArrayType *array, int *numitems);
 static char *get_sql_insert(Relation rel, int *pkattnums, int pknumatts, char **src_pkattvals, char **tgt_pkattvals);
 static char *get_sql_delete(Relation rel, int *pkattnums,
Re: [HACKERS] Refactoring of heapam code.
08.08.2016 12:43, Anastasia Lubennikova: 08.08.2016 03:51, Michael Paquier: On Sat, Aug 6, 2016 at 2:56 AM, Alvaro Herrera <alvhe...@2ndquadrant.com> wrote: Anastasia Lubennikova wrote:

So there are a couple of patches. They do not cover all the mentioned problems, but I'd like to get feedback before continuing.

I agree that we could improve things in this general area, but I do not endorse any particular changes you propose in these patches; I haven't reviewed your patches.

Well, I am interested in the topic, and just had a look at the patches proposed.

+ /*
+  * Split PageGetItem into a set of different macros
+  * in order to make the code more readable.
+  */
+#define PageGetItemHeap(page, itemId) \
+( \
+	AssertMacro(PageIsValid(page)), \
+	AssertMacro(ItemIdHasStorage(itemId)), \
+	(HeapTupleHeader)(((char *)(page)) + ItemIdGetOffset(itemId)) \
+)

Placing macros that are specific to heap and indexes into bufpage.h is not such a good idea. I'd suggest having the heap ones in heapam.h, and the index ones in a more generic place, such as src/access/common as already mentioned by Alvaro.

+ onpage_tup = PageGetItemHeapHeaderOnly(page, newitemid);

Just PageGetItemHeapHeader would be fine.

@@ -2324,7 +2087,6 @@ FreeBulkInsertState(BulkInsertState bistate)
 	pfree(bistate);
 }
-
 /*
  * heap_insert - insert tuple into a heap

Those patches have some noise. Patch 0002 is doing two things: reorganizing the order of the routines in heapam.c and moving some routines from heapam.c to hio.c. Those two could be separate patches. If the final result is to be able to extend custom AMs for tables, we should have a structure like src/access/common/index.c, src/access/common/table.c and src/access/common/relation.c, for example, and have headers dedicated to them. But yes, there is definitely a lot of room for refactoring, as we are aiming, at least I guess so, at having more various code paths for access methods.

Thank you for the review, I'll fix these problems in the final version.
Posting the first message, I intended to raise a discussion. The patches were attached mainly to illustrate the problem and to be more concrete. Thank you for the attention and feedback. Since there are no objections to the idea in general, I'll write an exhaustive README about the current state of the code and then propose an API design to discuss the details. Stay tuned.

Here is the promised design draft: https://wiki.postgresql.org/wiki/HeapamRefactoring

Looking forward to your comments.

-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pluggable storage
alone. This is somewhat messy in the case of heap_multi_insert because it returns several items; I think it's acceptable to return an array of ItemPointers in the same order as the input tuples. This works fine for the only caller, which is COPY in batch mode. For the other routines, they don't really care where the TID is returned AFAICS.

Additional noteworthy items:

i) Speculative insertion: the speculative insertion token is no longer installed directly in the heap tuple by the executor (of course). Instead, the token becomes part of the slot. When the tuple_insert method is called, the insertion routine is in charge of setting the token from the slot into the storage tuple. The executor is in charge of calling method->speculative_finish() / abort() once the insertion has been confirmed by the indexes.

ii) execTuples has additional accessors for tuples-in-slot, such as ExecFetchSlotTuple and friends. I expect some of them to return abstract StorageTuples, others HeapTuple or MinimalTuples (possibly wrapped in Datum), depending on callers. We might be able to cut down on these later; my first cut will try to avoid API changes to keep fallout to a minimum.

iii) All tuples need to be identifiable by ItemPointers. Storages that have different requirements will need careful additional thought across the board.

iv) System catalogs cannot use pluggable storage. We continue to use heap_open etc. in the DDL code, in order not to make this more invasive than it already is. We may lift this restriction later for specific catalogs, as needed.

v) Currently, one Buffer may be associated with one HeapTuple living in a slot; when the slot is cleared, the buffer pin is released. My current patch moves the buffer pin to inside the heapam-based storage AM, and the buffer is released by the ->slot_clear_tuple method.
The rationale for doing this is that some storage AMs might want to keep several buffers pinned at once, for example, and must release those pins in batches rather than individually, as the scan moves forward (say a batch of tuples in a columnar storage AM has column values spread across many buffers; they must all be kept pinned until the scan has moved past the whole set of tuples). But I'm not really sure that this is a great design.

I welcome comments on these ideas. My patch for this is nowhere near completion yet; expect things to change for items that I've overlooked, but I hope I didn't overlook any major ones. If things are handwavy, it is probably because I haven't fully figured them out yet.

-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
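The batched pin release described above can be sketched in a few lines. This is a toy model of the lifecycle, not PostgreSQL code: the PinBatch class and release callback are illustrative stand-ins for bufmgr pins and the proposed ->slot_clear_tuple method.

```python
# Toy model of batched buffer-pin release: instead of releasing one pin
# when one slot is cleared, the storage AM accumulates pins for a batch of
# tuples and releases them all at once when the scan moves past the batch.

class PinBatch:
    def __init__(self, release_fn):
        self._release = release_fn   # stand-in for ReleaseBuffer()
        self._pinned = []

    def pin(self, buf):
        self._pinned.append(buf)

    def clear(self):
        # all pins dropped together, mirroring the proposed slot_clear hook
        for buf in self._pinned:
            self._release(buf)
        self._pinned.clear()

released = []
batch = PinBatch(released.append)
for buf in (101, 102, 103):   # buffers backing one batch of column values
    batch.pin(buf)
batch.clear()
print(released)
```

The design question raised in the message is exactly whether this batching belongs inside the storage AM or in the slot machinery.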
Re: [HACKERS] Pluggable storage
as the scan moves forward (say a batch of tuples in a columnar storage AM has column values spread across many buffers; they must all be kept pinned until the scan has moved past the whole set of tuples). But I'm not really sure that this is a great design.

Frankly, I doubt that it's realistic to implement columnar storage just as a variant of pluggable storage. It requires a lot of changes in the executor and optimizer and so on, which are hardly compatible with the existing tuple-oriented model. However, I'm not that well-versed in this area, so if you feel that it's possible, go ahead.

I welcome comments on these ideas. My patch for this is nowhere near completion yet; expect things to change for items that I've overlooked, but I hope I didn't overlook any major ones. If things are handwavy, it is probably because I haven't fully figured them out yet.

Thank you again for beginning the big project. Looking forward to the prototype. I think it will make the discussion more concrete and useful.

-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
[HACKERS] Cast jsonb to numeric, int, float, bool
Now the simplest way to extract booleans and numbers from json/jsonb is to cast it to text and then cast to the appropriate type:

postgres=# select 'true'::jsonb::text::bool;
 bool
------
 t

postgres=# select '1.0'::jsonb::text::numeric;
 numeric
---------
 1.0

This patch implements direct casts from jsonb numeric (jbvNumeric) to numeric, int4 and float8, and from jsonb bool (jbvBool) to bool.

postgres=# select 'true'::jsonb::bool;
 bool
------
 t

postgres=# select '1.0'::jsonb::numeric;
 numeric
---------
 1.0

Waiting for your feedback. If you find it useful, I can also add support for json and other types, such as smallint and bigint.

-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company

diff --git a/src/backend/utils/adt/jsonb.c b/src/backend/utils/adt/jsonb.c
index b9bf18f..4bbe81c 100644
--- a/src/backend/utils/adt/jsonb.c
+++ b/src/backend/utils/adt/jsonb.c
@@ -1939,3 +1939,129 @@ jsonb_object_agg_finalfn(PG_FUNCTION_ARGS)
 	PG_RETURN_POINTER(out);
 }
+
+Datum
+jsonb_numeric(PG_FUNCTION_ARGS)
+{
+	Jsonb	   *in = PG_GETARG_JSONB(0);
+	JsonbIterator *it;
+	JsonbValue	v;
+
+	if (!JB_ROOT_IS_OBJECT(in)
+		&& !(JB_ROOT_IS_ARRAY(in) && !JB_ROOT_IS_SCALAR(in)))
+	{
+		Assert(JB_ROOT_IS_SCALAR(in));
+
+		it = JsonbIteratorInit(&in->root);
+
+		/*
+		 * A root scalar is stored as an array of one element, so we get the
+		 * array and then its first (and only) member.
+		 */
+		(void) JsonbIteratorNext(&it, &v, true);
+		Assert(v.type == jbvArray);
+		(void) JsonbIteratorNext(&it, &v, true);
+
+		if (v.type == jbvNumeric)
+			PG_RETURN_NUMERIC(v.val.numeric);
+	}
+
+	ereport(ERROR,
+			(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+			 errmsg("key value must be json numeric")));
+}
+
+Datum
+jsonb_int4(PG_FUNCTION_ARGS)
+{
+	Jsonb	   *in = PG_GETARG_JSONB(0);
+	JsonbIterator *it;
+	JsonbValue	v;
+
+	if (!JB_ROOT_IS_OBJECT(in)
+		&& !(JB_ROOT_IS_ARRAY(in) && !JB_ROOT_IS_SCALAR(in)))
+	{
+		Assert(JB_ROOT_IS_SCALAR(in));
+
+		it = JsonbIteratorInit(&in->root);
+
+		/*
+		 * A root scalar is stored as an array of one element, so we get the
+		 * array and then its first (and only) member.
+		 */
+		(void) JsonbIteratorNext(&it, &v, true);
+		Assert(v.type == jbvArray);
+		(void) JsonbIteratorNext(&it, &v, true);
+
+		if (v.type == jbvNumeric)
+			PG_RETURN_INT32(DatumGetInt32(DirectFunctionCall1(numeric_int4,
+							NumericGetDatum(v.val.numeric))));
+	}
+
+	ereport(ERROR,
+			(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+			 errmsg("key value must be json numeric")));
+}
+
+Datum
+jsonb_float8(PG_FUNCTION_ARGS)
+{
+	Jsonb	   *in = PG_GETARG_JSONB(0);
+	JsonbIterator *it;
+	JsonbValue	v;
+
+	if (!JB_ROOT_IS_OBJECT(in)
+		&& !(JB_ROOT_IS_ARRAY(in) && !JB_ROOT_IS_SCALAR(in)))
+	{
+		Assert(JB_ROOT_IS_SCALAR(in));
+
+		it = JsonbIteratorInit(&in->root);
+
+		/*
+		 * A root scalar is stored as an array of one element, so we get the
+		 * array and then its first (and only) member.
+		 */
+		(void) JsonbIteratorNext(&it, &v, true);
+		Assert(v.type == jbvArray);
+		(void) JsonbIteratorNext(&it, &v, true);
+
+		if (v.type == jbvNumeric)
+			PG_RETURN_FLOAT8(DatumGetFloat8(DirectFunctionCall1(numeric_float8,
+							 NumericGetDatum(v.val.numeric))));
+	}
+
+	ereport(ERROR,
+			(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+			 errmsg("key value must be json numeric")));
+}
+
+Datum
+jsonb_bool(PG_FUNCTION_ARGS)
+{
+	Jsonb	   *in = PG_GETARG_JSONB(0);
+	JsonbIterator *it;
+	JsonbValue	v;
+
+	if (!JB_ROOT_IS_OBJECT(in)
+		&& !(JB_ROOT_IS_ARRAY(in) && !JB_ROOT_IS_SCALAR(in)))
+	{
+		Assert(JB_ROOT_IS_SCALAR(in));
+
+		it = JsonbIteratorInit(&in->root);
+
+		/*
+		 * A root scalar is stored as an array of one element, so we get the
+		 * array and then its first (and only) member.
+		 */
+		(void) JsonbIteratorNext(&it, &v, true);
+		Assert(v.type == jbvArray);
+		(void) JsonbIteratorNext(&it, &v, true);
+
+		if (v.type == jbvBool)
+			PG_RETURN_BOOL(v.val.boolean);
+	}
+
+	ereport(ERROR,
+			(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+			 errmsg("key value must be json boolean")));
+}
\ No newline at end of file
diff --git a/src/include/catalog/pg_cast.h b/src/include/catalog/pg_cast.h
index 80a40ab..0646f99 100644
--- a/src/include/catalog/pg_cast.h
+++ b/src/include/catalog/pg_cast.h
@@ -377,5 +377,9 @@ DATA(insert ( 1700 1700 1703 i f ));
 /* json to/from jsonb */
 DATA(insert ( 114 3802 0 a i ));
 DATA(insert ( 3802 114 0 a i ));
+DATA(insert ( 3802 1700 774 e f ));
+DATA(insert ( 3802 23 775 e f ));
+DATA(insert ( 3802 701 776 e f ));
+DATA(insert ( 3802 16 777 e f ));
 #endif   /* PG_CAST_H */
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 31c828a..ff7da5b 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2364,6 +2364,15 @@ DESCR("convert int2 to numeric");
DATA(insert OID = 1783
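The cast semantics implemented in the patch above can be modeled outside PostgreSQL. The sketch below is a Python illustration of the behavior (a bare scalar of the right type succeeds, everything else errors out), not the actual C implementation; the function name jsonb_cast is made up for the example.

```python
import json

# Rough Python model of the new casts: succeed only for a bare jsonb
# scalar of the expected type; objects, arrays, and wrong-typed scalars
# raise an error, mirroring ereport(ERROR, ...) in the C code.

def jsonb_cast(text, expected_type):
    value = json.loads(text)
    if isinstance(value, (dict, list)):
        raise ValueError("cannot cast jsonb object or array")
    if expected_type is bool:
        if isinstance(value, bool):
            return value
        raise ValueError("key value must be json boolean")
    # bool is a subclass of int in Python, so exclude it explicitly
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return expected_type(value)
    raise ValueError("key value must be json numeric")

print(jsonb_cast("true", bool))   # like 'true'::jsonb::bool
print(jsonb_cast("1.0", float))   # like '1.0'::jsonb::float8
print(jsonb_cast("42", int))      # like '42'::jsonb::int4
```

Note that, as in the patch, casting the boolean true to a numeric type fails rather than yielding 1.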
Re: [HACKERS] IF NOT EXISTS option for CREATE SERVER and CREATE USER MAPPING statements
13.02.2017 19:34, Andrew Dunstan: On 01/13/2017 08:36 AM, Anastasia Lubennikova wrote:

I implemented IF NOT EXISTS option for CREATE SERVER and CREATE USER MAPPING statements for one of our customers. I think other users can also find it useful for scripting and automated tasks. The patches themselves are small and simple. Documentation is updated as well.

This looks good and useful. Please add some regression tests.

Done.

-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company

diff --git a/doc/src/sgml/ref/create_server.sgml b/doc/src/sgml/ref/create_server.sgml
index 734c6c9..6777679 100644
--- a/doc/src/sgml/ref/create_server.sgml
+++ b/doc/src/sgml/ref/create_server.sgml
@@ -21,7 +21,7 @@ PostgreSQL documentation
-CREATE SERVER server_name [ TYPE 'server_type' ] [ VERSION 'server_version' ]
+CREATE SERVER [IF NOT EXISTS] server_name [ TYPE 'server_type' ] [ VERSION 'server_version' ]
     FOREIGN DATA WRAPPER fdw_name
     [ OPTIONS ( option 'value' [, ... ] ) ]
@@ -56,6 +56,19 @@ CREATE SERVER server_name [ TYPE '< Parameters
+
+IF NOT EXISTS
+
+ Do not throw an error if a server with the same name already exists.
+ A notice is issued in this case. Note that there is no guarantee that
+ the existing server is anything like the one that would have been
+ created.
+
+
+
 server_name
diff --git a/src/backend/commands/foreigncmds.c b/src/backend/commands/foreigncmds.c
index 6ff8b69..b4ae5de 100644
--- a/src/backend/commands/foreigncmds.c
+++ b/src/backend/commands/foreigncmds.c
@@ -883,12 +883,25 @@ CreateForeignServer(CreateForeignServerStmt *stmt)
 	/*
 	 * Check that there is no other foreign server by this name.
+	 * Do nothing if IF NOT EXISTS was enforced.
	 */
 	if (GetForeignServerByName(stmt->servername, true) != NULL)
-		ereport(ERROR,
-				(errcode(ERRCODE_DUPLICATE_OBJECT),
-				 errmsg("server \"%s\" already exists",
-						stmt->servername)));
+	{
+		if (stmt->if_not_exists)
+		{
+			ereport(NOTICE,
+					(errcode(ERRCODE_DUPLICATE_OBJECT),
+					 errmsg("foreign server \"%s\" already exists, skipping",
+							stmt->servername)));
+			heap_close(rel, RowExclusiveLock);
+			return InvalidObjectAddress;
+		}
+		else
+			ereport(ERROR,
+					(errcode(ERRCODE_DUPLICATE_OBJECT),
+					 errmsg("foreign server \"%s\" already exists",
+							stmt->servername)));
+	}

 	/*
 	 * Check that the FDW exists and that we have USAGE on it. Also get the
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index a4edea0..d007468 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -4655,6 +4655,19 @@ CreateForeignServerStmt: CREATE SERVER name opt_type opt_foreign_server_version
 					n->version = $5;
 					n->fdwname = $9;
 					n->options = $10;
+					n->if_not_exists = false;
 					$$ = (Node *) n;
 				}
+			| CREATE SERVER IF_P NOT EXISTS name opt_type opt_foreign_server_version
+				FOREIGN DATA_P WRAPPER name create_generic_options
+				{
+					CreateForeignServerStmt *n = makeNode(CreateForeignServerStmt);
+					n->servername = $6;
+					n->servertype = $7;
+					n->version = $8;
+					n->fdwname = $12;
+					n->options = $13;
+					n->if_not_exists = true;
+					$$ = (Node *) n;
+				}
 		;
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 07a8436..704bc6b 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -2113,6 +2113,7 @@ typedef struct CreateForeignServerStmt
 	char	   *servertype;		/* optional server type */
 	char	   *version;		/* optional server version */
 	char	   *fdwname;		/* FDW name */
+	bool		if_not_exists;	/* just do nothing if it already exists?
	 */
 	List	   *options;		/* generic options to server */
 } CreateForeignServerStmt;
diff --git a/src/test/regress/expected/foreign_data.out b/src/test/regress/expected/foreign_data.out
index 3a9fb8f..17f9f40 100644
--- a/src/test/regress/expected/foreign_data.out
+++ b/src/test/regress/expected/foreign_data.out
@@ -283,7 +283,9 @@ ERROR:  foreign-data wrapper "foo" does not exist
 CREATE FOREIGN DATA WRAPPER foo OPTIONS ("test wrapper" 'true');
 CREATE SERVER s1 FOREIGN DATA WRAPPER foo;
 CREATE SERVER s1 FOREIGN DATA WRAPPER foo; -- ERROR
-ERROR: server "s1" already exists
+ERROR: foreign server "s1" already exists
+CREATE SERVER IF NOT EXISTS s1 FOREIGN DATA WRAPPER foo; -- No ERROR, just NOTICE
+NOTICE: foreign server "s1" already exists, skipping
 CREATE SERVER s2 FOREIGN DATA WRAPPER foo OPTIONS (host 'a', dbname 'b');
 CREATE SERVER s3 TYPE 'oracle' FOREIGN DATA WRAPPER foo;
 CREATE SERVER s4 TYPE 'oracle' FOREIGN DATA WRAPPER foo OPTIONS (host 'a', dbname 'b');
diff --git a/src/test/regress/sql
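The control flow that the foreigncmds.c hunk adds can be summarized in a short sketch. This is a language-agnostic model of the behavior shown in the regression test, not PostgreSQL code; the servers dict and notices list are illustrative stand-ins for the catalog and for ereport(NOTICE).

```python
# Minimal model of CREATE SERVER [IF NOT EXISTS]: a duplicate name either
# raises an error or, with if_not_exists=True, emits a notice and returns
# without creating anything (like returning InvalidObjectAddress).

servers = {}
notices = []

def create_server(name, fdw, if_not_exists=False):
    if name in servers:
        if if_not_exists:
            notices.append('foreign server "%s" already exists, skipping' % name)
            return None
        raise ValueError('foreign server "%s" already exists' % name)
    servers[name] = {"fdw": fdw}
    return name

create_server("s1", "foo")
create_server("s1", "foo", if_not_exists=True)  # notice, no error
print(notices[0])
```

As the documentation hunk notes, the skip path gives no guarantee that the existing server resembles the one that would have been created.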
Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem
28.12.2016 23:43, Claudio Freire: Attached v4 patches with the requested fixes.

Sorry for being late, but the tests took a lot of time.

create table t1 as select i, md5(random()::text) from generate_series(0,4) as i;
create index md5_idx ON t1(md5);
update t1 set md5 = md5((random() * (100 + 500))::text);
vacuum;

Patched vacuum used 2.9 GB of memory and vacuumed the index in one pass, while the old version took three passes (1GB + 1GB + 0.9GB).

Vacuum duration results:
vanilla: LOG: duration: 4359006.327 ms statement: vacuum verbose t1;
patched: LOG: duration: 3076827.378 ms statement: vacuum verbose t1;

We can see a 30% vacuum speedup. I should note that this case can be considered favorable to vanilla vacuum: the table is not that big, it has just one index, and the disk used is a fast FusionIO. We can expect an even bigger gain on slower disks.

Thank you again for the patch. Hope to see it in 10.0.

-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
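The pass counts reported above follow from simple arithmetic: each dead-tuple TID (ItemPointerData) takes 6 bytes, and unpatched vacuum caps the dead-TID array at 1 GB. The sketch below is a back-of-the-envelope check, not vacuum's actual accounting:

```python
import math

# Each dead tuple TID is 6 bytes; vacuum must do one index-cleanup pass
# per work_mem-sized chunk of the dead-TID array. A vacuum needing ~2.9 GB
# of TIDs therefore takes 3 passes at the 1 GB cap (1 GB + 1 GB + 0.9 GB),
# but only 1 pass if it may use the whole 2.9 GB at once.

TID_BYTES = 6
GB = 1024 ** 3

def index_passes(dead_tuples, work_mem_bytes):
    """Number of index-cleanup passes a vacuum of this size needs."""
    tids_per_pass = work_mem_bytes // TID_BYTES
    return math.ceil(dead_tuples / tids_per_pass)

dead = int((2.9 * GB) // TID_BYTES)   # ~519M dead tuples implied above
print(index_passes(dead, 1 * GB))     # vanilla cap: 3 passes
print(index_passes(dead, 3 * GB))     # enough memory: 1 pass
```

Fewer index passes is exactly where the reported 30% speedup comes from, since each pass rescans every index in full.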
Re: [HACKERS] GSoC 2017
I'm ready to be a mentor.

10.01.2017 12:53, Alexander Korotkov: Hi all! In 2016 the PostgreSQL project wasn't accepted into the GSoC program. In my understanding, the reasons for that are the following:
1. We made a last-minute submission of our application to GSoC.
2. In 2016 the GSoC application form for mentoring organizations was changed. In particular, it required more detailed information about possible projects. As a result, we didn't manage to make a good enough application that time.
Thus, our application was declined. See [1] and [2] for details.

I think that the right way to manage this in 2017 would be to start collecting the required information in advance. According to the GSoC 2017 timeline [3], mentoring organizations can submit their applications from January 19 to February 9. Thus, now is a good time to start collecting project ideas and make a call for mentors. Also, we need to decide who will be our admin this year.

In sum, we have the following questions:
1. What project ideas do we have?
2. Who is going to be a mentor this year?
3. Who is going to be the project admin this year?

BTW, I'm ready to be a mentor this year. I'm also open to being an admin if needed.

[1] https://www.postgresql.org/message-id/flat/CAA-aLv4p1jfuMpsRaY2jDUQqypkEXUxeb7z8Mp-0mW6M03St7A%40mail.gmail.com
[2] https://www.postgresql.org/message-id/flat/CALxAEPuGpAjBSN-PTuxHfuLLqDS47BEbO_ZYxUYQR3ud1nwbww%40mail.gmail.com
[3] https://developers.google.com/open-source/gsoc/timeline

-- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company

-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Re: GiST optimizing memmoves in gistplacetopage for fixed-size updates [PoC]
09.08.2016 19:45, Andrew Borodin: Here is a new version of the patch, now including recommendations from Anastasia Lubennikova.

I've investigated the anomalous index size decrease. The most probable explanation turned out to be true. The cube extension, like some others, uses Guttman's polynomial-time split algorithm. It is known to generate "needle-like" MBBs (MBRs) for sorted data due to imbalanced splits (splitting 100 tuples as 98 to 2). Due to the previous throw-to-the-end behavior of GiST, this imbalance was further amplified (most inserts were going to the bigger part after a split). So GiST inserts were extremely slow for sorted data. There is no need to do exact sorting to trigger this behavior.

This behavior can be avoided by implementing the small-m restriction in split (AFAIR this is described in the original R-tree paper from '84), which prescribes splitting into parts no smaller than m, where m is around 20% of a page's capacity (in number of tuples). After application of this patch, performance for sorted data is roughly the same as performance for randomized data.

Thank you for the explanation. Now it's clear to me. I did some more testing and found nothing special. The declared feature is implemented correctly. It passes all regression tests and improves performance.

I still have a few minor nitpicks about the patch style. You can find the style guide at https://www.postgresql.org/docs/9.6/static/source.html
1) remove extra whitespace in README
2) add whitespace: if(ntup == 1)
3) fix comments in accordance with coding conventions
/* In case of single tuple update GiST calls overwrite
 * In all other cases function gistplacetopage deletes
 * old tuples and place updated at the end
 * */
+/* Normally here was following assertion
+ * Assert(ItemIdHasStorage(ii));
+ * This assertion was introduced in PageIndexTupleDelete
+ * But if this function will be used from BRIN index
+ * this assertion will fail. Thus, here we do not check that
+ * item has the storage.
+ * */
4) remove unrelated changes
-data += sizeof(OffsetNumber) * xldata->ntodelete;
+data += sizeof(OffsetNumber) *xldata->ntodelete;
5) If the comment is correct, maxalignment is not necessary.
+/* tuples on a page are always maxaligned */
+oldsize = MAXALIGN(oldsize);
6) I'd rather use alignednewsize here.
+ItemIdSetNormal(tupid, offset + size_diff, newsize);

After the cleanup you can change the status to "Ready for Committer" without waiting for a response.

If the data is randomized, this effect of Guttman's poly-time split does not occur; the test from the start of the thread will show no statistically confident improvement from the patch. To prove the performance increase in the randomized case, I've composed a modified test: https://gist.github.com/x4m/856b2e1a5db745f8265c76b9c195f2e1

This test with five runs shows the following:
Before patch: Insert Time AVG 78 seconds, STD 9.5
After patch: Insert Time AVG 68 seconds, STD 7.7
This is a marginal but statistically significant performance improvement. For sorted data, performance is improved by a factor of 3.

Best regards, Andrey Borodin, Octonica & Ural Federal University.

-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
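The small-m (minimum-fill) restriction discussed in this thread can be illustrated with a simplified sketch. This is an abstract model of the idea from Guttman's R-tree paper, not the GiST split code: a lopsided partition proposed by the split algorithm is rebalanced so that each side holds at least m = 20% of the tuples.

```python
# Sketch of the small-m restriction: given a proposed split (left, right),
# move entries so that neither side holds fewer than min_fill of the total.
# A real implementation would choose WHICH entries to move by penalty;
# here we just pop from the larger side for illustration.

def enforce_min_fill(left, right, min_fill=0.2):
    """Rebalance a proposed split so neither side is below min_fill."""
    total = len(left) + len(right)
    m = max(1, int(total * min_fill))
    left, right = list(left), list(right)
    while len(left) < m:
        left.append(right.pop())
    while len(right) < m:
        right.append(left.pop())
    return left, right

# The pathological 98/2 split mentioned above becomes 80/20 with m = 20%:
l, r = enforce_min_fill(list(range(98)), list(range(98, 100)))
print(len(l), len(r))
```

Bounding the imbalance this way prevents the "needle-like" MBBs for sorted input, which is why sorted-data insert performance catches up with the randomized case.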
Re: [HACKERS] Refactoring of heapam code.
06.09.2016 07:44, Pavan Deolasee: On Mon, Aug 8, 2016 at 3:13 PM, Anastasia Lubennikova <a.lubennik...@postgrespro.ru <mailto:a.lubennik...@postgrespro.ru>> wrote: Thank you for the review, I'll fix these problems in the final version. Posting the first message I intended to raise the discussion. Patches were attached mainly to illustrate the problem and to be more concrete.

I started looking at the first patch posted above, but it seems you'll rewrite these patches after the discussion on the heapam refactoring ideas you posted at https://wiki.postgresql.org/wiki/HeapamRefactoring <https://wiki.postgresql.org/wiki/HeapamRefactoring> concludes.

Thank you for the review.

Some comments anyway on the first patch:

1. It does not apply cleanly with git-apply - many whitespace errors.

I'll fix all merge issues and attach it to the following message.

2. I don't understand the difference between PageGetItemHeapHeaderOnly() and PageGetItemHeap(). They seem to do exactly the same thing and can be used interchangeably.

The only difference between these two macros is that PageGetItemHeapHeaderOnly() doesn't touch any key fields of a tuple; it only checks header fields (see HeapTupleHeaderData). I divided it into two separate functions while I was working on page compression, because I wanted to avoid unnecessary decompression calls. These names are just a kind of 'marker' to make the code easier to read and improve.

3. While I like the idea of using separate interfaces to get heap/index tuples from a page, maybe they can internally use a common interface instead of duplicating what PageGetItem() does already.

I'm not sure I get it right. Do you suggest leaving PageGetItem() and writing PageGetItemHeap() and PageGetItemIndex() as wrappers over it? That sounds reasonable while we have a similar layout for both heap and index pages. In any case, it'll be easy to separate them when necessary.

4. Should we rename PageGetItemHeap() to PageGetHeapTuple() or something like that?
I don't feel like doing that, because HeapTuple is a different structure. What we get from a page is a HeapTupleHeaderData structure followed by the user's data.

5. If we do this, we should probably have corresponding wrapper functions/macros for the remaining callers of PageGetItem().

There are some calls of PageGetItem() left in the BRIN and SP-GiST indexes; I can write simple wrappers for them. I left them just because they were out of scope for my compression prototype.

I also looked at the refactoring design doc. Looks like a fine approach to me, but the API will need much more elaborate discussion. I am not sure if the interfaces as presented are enough to support everything that even heapam can do today.

What features of heapam do you think could be unsupportable in this API? Maybe I've just missed them. My point here is that heapam is too complicated for many simpler use-cases: read-only tables and append-only tables do not need many MVCC-related tricks like vacuum and complicated locking, tables with fixed-length attributes can use a more optimal page layout, and so on. I suggest refactoring that will allow us to develop new heap-like access methods. For the first version, they must have methods to "heapify" a tuple, i.e. turn the internal representation of the data into a regular HeapTuple, for example by putting stubs into the MVCC fields. Of course this approach has its disadvantages, such as performance issues. It definitely won't be enough to write a column store or to implement other great data structures. But it allows us not to depend on the executor's code.

And much more thought will be required to ensure we don't miss out on things that a new primary access method may need.

Do you have any particular ideas for new access methods?

A few points regarding how the proposed API maps to heapam:

- How do we support parallel scans on heap?

As far as I understand, parallel heap scan requires one more API function, heap_beginscan_parallel().
It uses the same API function - heap_getnext() - but in addition to an ordinary seq scan it selects the page to read in a synchronized manner.

- Does the primary AM need to support locking of rows?

That's a good question. I'd say it should be optional.

- Would there be a separate API to interpret the PAM tuple itself? (something that htup_details.h does today). It seems natural that every PAM will have its own interpretation of the tuple structure, and we would like to hide that inside the PAM implementation.

As I already wrote, for the first prototype I'd use a function to "heapify" the tuple and not care about executor issues at all. A more detailed discussion of this question can be found in another thread [1].

- There are many upper modules that need access to system attributes (xmin, xmax for starters). How do you plan to handle that? You think we can provide enough abstraction so that the callers don't need to know