Re: [HACKERS] [PATCHES] GIN improvements
Updated: http://www.sigaev.ru/misc/fast_insert_gin-0.9.gz

> need more review of fast_insert yet? It looked like a number of people
> commented on it already.

I'm still not clear on whether the suggested aminsertcleanup call is acceptable.

--
Teodor Sigaev  E-mail: [EMAIL PROTECTED]  WWW: http://www.sigaev.ru/

--
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches
Re: [HACKERS] [PATCHES] GIN improvements
> I've committed the multicolumn one with minor revisions (fix some poor
> English in docs and comments, add regression-test coverage).

Thank you very much.

> Do you need more review of fast_insert yet? It looked like a number of
> people commented on it already.

I should modify it to support/synchronize with multicolumn GIN - both patches touch the same pieces of code, and I didn't make a single combined patch, to simplify review.
Re: [HACKERS] [PATCHES] GIN improvements
> I looked this over and it looks good in general.

May I take it that the patch has passed review, and commit it?
Re: [PATCHES] GIN improvements
> I looked this over and it looks good in general. I was only wondering
> about for single-column indexes -- we're storing attribute numbers too,
> right?

No, the GinState->oneCol field tells GinFormTuple and gin_index_getattr/gintuple_get_attrnum about the actual storage. A single-column index is binary-compatible with the current index format :)
Re: [PATCHES] Multi-column GIN
> What's the benefit of a multi-column GIN index over multiple
> single-column GIN indexes?

Page 12 of the PGCon presentation (http://www.sigaev.ru/gin/fastinsert_and_multicolumn_GIN.pdf):

Multicolumn index vs. 2 single-column indexes
  Size:    539 Mb     / 538 Mb
  Speed:   *1.885* ms / 4.994 ms
  Index:   ~340 s     / ~200 s
  Insert:  72 s       / 166 s
Re: [PATCHES] GIN improvements
Synced with current CVS HEAD, and posted to -hackers too because -patches is close to being shut down.

http://www.sigaev.ru/misc/fast_insert_gin-0.7.gz
http://www.sigaev.ru/misc/multicolumn_gin-0.3.gz
Re: [PATCHES] GIN improvements
> 1) multicolumn GIN
> Unlike other indexes, the performance of search doesn't depend on which
> column of the index (first, last, any subset) is used in the search
> clause. This property could be used in gincostestimate, but I haven't
> looked at it yet.

After some playing I didn't find anything in the *costestimate functions that treats the cost of the first column differently from any other column in the clauses, so, IMHO, the issue above isn't an issue. :)

Since I haven't seen any comments/objections, I intend to commit this patch within the next two days and then synchronize the 'fast insert into GIN' patch with CVS. Objections?
Re: [PATCHES] GIN improvements
> Right now this is not a problem because there is no insert_cleanup
> function for btree, but I wonder if we should clean it up.

Look at gistbulkdelete and gistvacuumcleanup: the first function wants to pass a bool flag to the second, so they use a GiSTBulkDelete structure instead of the usual IndexBulkDeleteResult. When it becomes needed, btree can use the same method.

> FWIW there's a typo in catalogs.sgml (finction - function)

Thank you, will fix.

> What's the use of the FASTUPDATE parameter? Is there a case when a user
> is interested in turning it off?

Yeah - when search time is much, much more important (or crucial) than insertion time, or when the table stores read-only data.
Re: [PATCHES] GIN improvements
2) fast insert into GIN

New version: http://www.sigaev.ru/misc/fast_insert_gin-0.6.gz

Changes:
- added the option FASTUPDATE=(1|t|true|on|enable|0|f|false|disable) to the CREATE/ALTER commands for GIN indexes
- since there weren't any comments on the first email, the optional pg_am.aminsertcleanup method was introduced
- added documentation

I believe the patch is ready to review/commit...
Re: [PATCHES] GIN improvements
> How about having a constant sized fastupdate buffer, of say 100 rows or
> a fixed number of pages, and when that becomes full, the next inserter
> will have to pay the price of updating the index and flushing

I'm not sure that's acceptable, because flushing the pending list may take several seconds, at an unpredictable moment.

> the buffer. To keep that overhead out of the main codepath, we could
> make autovacuum flush the buffers periodically.

Do you mean that GIN would send a smoke signal to the autovacuum launcher process to ask for a vacuum?
[PATCHES] GIN improvements
Improvements to GIN indexes were presented at PGCon 2008. Presentation: http://www.sigaev.ru/gin/fastinsert_and_multicolumn_GIN.pdf

1) multicolumn GIN

This patch ( http://www.sigaev.ru/misc/multicolumn_gin-0.2.gz ) adds multicolumn support to GIN. The basic idea: keys (entries, in GIN terminology) extracted from values are stored in separate tuples along with their column number. A multicolumn clause is then just the AND of the per-column clauses. Unlike other indexes, the performance of search doesn't depend on which column of the index (first, last, any subset) is used in the search clause. This property could be used in gincostestimate, but I haven't looked at it yet.

2) fast insert into GIN

This patch ( http://www.sigaev.ru/misc/fast_insert_gin-0.4.gz ) implements the idea of using the bulk-insert technique that is used at index creation time. Inserted rows are stored in a linked list of pending pages and moved into the regular GIN structure at vacuum time. The algorithm is shown in the presentation, but the insert completion process (vacuum) was significantly reworked to improve concurrency. Now the list of pending pages is locked for much less time - only during deletion of pages from the list.

Open item: what is the right time to call insert completion? Currently it is called by ginbulkdelete and ginvacuumcleanup; ginvacuumcleanup calls completion if ginbulkdelete wasn't called. That's not good, but it works. The completion process should start before ginbulkdelete, because ginbulkdelete doesn't look at pending pages at all. Since insert completion (of any index, if that method comes to exist, I think) runs fast when the number of inserted tuples is small - it doesn't go through the whole index - IMHO the existing statistics fields should not be changed. The idea discussed at PGCon was to have a trigger in vacuum that fires when the number of inserted tuples becomes big. Now I don't think that idea is useful, for two reasons: for a small number of tuples completion is cheap, and it should be called before ginbulkdelete. IMHO it's enough to add an optional method to pg_am (aminsertcleanup, per Tom's suggestion). This method would be called before ambulkdelete and amvacuumcleanup. Opinions, objections, suggestions?

At the presentation some people were interested in how our changes affect search speed after rows are inserted. The tests are below. We use the same tables as in the presentation and measure search times (after insertion of some rows) before and after vacuum. All times are in ms. Test tables contain 10 rows; in the first table the number of elements in the array is 100 with cardinality 500, in the second - 100 and 50, in the last - 1000 and 500.

Insert 1 into table with 10 rows (10%), query: v '{v1}'

                | novac-d | vac-d | found rows
  n:100,  c:500 |   118   |  35   |   19909
  n:100,  c:50  |    95   |  0.7  |      25
  n:1000, c:500 |   380   |  79   |   95211

Insert 1000 into table with 10 rows (1%), query: v '{v1}'

                | novac-d | vac-d | found rows
  n:100,  c:500 |    40   |  31   |   18327
  n:100,  c:50  |    13   |  0.5  |      26
  n:1000, c:500 |   102   |  71   |   87499

Insert 100 into table with 10 rows (0.1%), query: v '{v1}'

                | novac-d | vac-d | found rows
  n:100,  c:500 |    32   |  31   |   18171
  n:100,  c:50  |   1.7   |  0.5  |      20
  n:1000, c:500 |    74   |  71   |   87499

Looking at the results it's easy to conclude that:
- search time over the pending list is O(number of inserted rows), i.e. search time equals (time of search in GIN) + K1 * (number of rows inserted after the last vacuum)
- search time is O(average length of the indexed columns); the observations made above also apply here
- a significant performance gap starts around 5-10% of inserts, or near 500-1000 inserts; this depends heavily on the specific dataset

Notice that insert performance into GIN was increased by up to 10 times; see exact results in the presentation.

Do we need to add an option to control this (fast insertion) feature? If so, what should the default value be? It's not clear to me.

Note: these patches are mutually exclusive because they touch the same pieces of code and I'm too lazy to manage several interdependent patches. I don't see any problem joining the patches into one, but IMHO it would be more difficult to review.
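The pending-list behaviour described in this message can be sketched as a toy model (plain Python, not the actual C implementation; class and method names are invented for illustration): inserts are cheap appends to a pending list, searches must scan it sequentially, and "insert completion" (flush at vacuum time) moves the rows into the regular entry/posting-list structure.

```python
# Toy model of GIN "fast insert": new rows land in a pending list; a
# search scans both the main structure and the pending list, so lookup
# cost grows with the number of rows inserted since the last flush.
class ToyGin:
    def __init__(self):
        self.main = {}        # entry -> list of row ids ("posting list")
        self.pending = []     # (rowid, entries) pairs awaiting a flush

    def insert(self, rowid, entries):
        # cheap: just append to the pending list
        self.pending.append((rowid, set(entries)))

    def flush(self):
        # "insert completion": move pending rows into the regular structure
        for rowid, entries in self.pending:
            for e in entries:
                self.main.setdefault(e, []).append(rowid)
        self.pending.clear()

    def search(self, entry):
        hits = set(self.main.get(entry, []))
        # the pending list must be scanned sequentially: O(pending size)
        for rowid, entries in self.pending:
            if entry in entries:
                hits.add(rowid)
        return hits

gin = ToyGin()
gin.insert(1, ["star", "gin"])
gin.insert(2, ["star"])
print(sorted(gin.search("star")))   # found via the pending list: [1, 2]
gin.flush()
print(sorted(gin.search("star")))   # same answer from the main structure: [1, 2]
```

This makes the O(number of inserted rows) search-time observation above concrete: `search` pays a linear scan over `pending` until `flush` runs.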
Re: [PATCHES] Partial match in GIN (next version)
> There seems to be something broken here: it's acting like prefix search
> is on all the time, eg

I'm in sackcloth and ashes... Fixed, and extended the regression tests.

http://www.sigaev.ru/misc/tsearch_prefix-0.9.gz
Re: [PATCHES] Partial match in GIN (next version)
>> It might be useful, although I don't see any usage of that right now.
>> I'll add this option.
>
> Ping? I'd like to get this patch out of the way.

I'm very sorry for the long delay.

http://www.sigaev.ru/misc/partial_match_gin-0.10.gz
http://www.sigaev.ru/misc/tsearch_prefix-0.8.gz
http://www.sigaev.ru/misc/wildspeed-0.12.tgz

Changes:
- sync with CVS HEAD
- add a third argument (StrategyNumber) to comparePartialFn
Re: [PATCHES] Partial match in GIN (next version)
> Looking at this now. Wouldn't it be a good idea for comparePartial to
> get the strategy number of the operator? As you have it set up, I doubt
> that an opclass can support more than one partial-match operator.

It might be useful, although I don't see any usage of that right now. I'll add this option.
Re: [PATCHES] Partial match in GIN (next version)
http://www.sigaev.ru/misc/partial_match_gin-0.9.gz

Synced with CVS.
Re: [PATCHES] Partial match in GIN (next version)
http://www.sigaev.ru/misc/partial_match_gin-0.8.gz
Reworked the interface as suggested by Gregory (http://archives.postgresql.org/pgsql-patches/2008-04/msg00199.php) and moved the index check into expand_indexqual_opclause() as suggested by Heikki (http://archives.postgresql.org/pgsql-patches/2008-04/msg00200.php).

http://www.sigaev.ru/misc/tsearch_prefix-0.7.gz
Synced with current CVS and with the partial match GIN patch. Allows a full index scan, so GIN now supports searches with queries like '!foo'; implemented by using an empty string for the prefix search.

http://www.sigaev.ru/misc/wildspeed-0.11.tgz
Synced with CVS changes and with the partial match GIN patch. Teaches the opclass to use the recheck feature correctly.
Re: [PATCHES] [HACKERS] Remove lossy-operator RECHECK flag?
> Actually, I think I figured it out. Please look over the GIST/GIN parts
> of this patch and see if they're OK with you.

Looks good, thank you. Why not use the suggested way for the GIN index over tsvector? http://archives.postgresql.org/pgsql-hackers/2008-04/msg00812.php It's a cost-free, lightweight optimization...

> This is still WIP because I haven't touched any contrib code, but as
> far as the main backend goes I think it's ready to apply.

Patch for all contrib modules: http://www.sigaev.ru/misc/contrib.patch.gz
Re: [PATCHES] Partial match in GIN
> How about forcing the use of a bitmap index scan, and modify the
> indexam API so that GIN could return a lossy bitmap, and let the bitmap
> heap scan do the rechecking?

Partial match might be used for only one search entry out of many. In the text search example 'a:* qwertyuiop', the second lexeme matches only a few tuples. But GIN itself doesn't know the semantic meaning of the operation and cannot distinguish the following tsqueries:
  '!a:* qwertyuiop'
  '!a:* qwertyuiop'
  'a:* !qwertyuiop'
So your suggestion is equivalent to marking every operation with the RECHECK flag and OR-ing all posting lists. That would give a lot of false matches and be too slow.

> I don't think the storage size of tsquery matters much, so whatever is
> the best solution in terms of code readability etc.

That was about the tsquerysend/recv format, not on-disk storage. We don't require compatibility of the binary format of database files, but I have some doubts about binary dumps.

> We generally don't make any promises about cross-version compatibility
> of binary dumps, though it would be nice not to break it if it's not
> too much effort.
>
> Hmm. match_special_index_operator() already checks that the index's
> opfamily is pattern_ops, or text_ops with C-locale. Are you reusing the
> same operator families for wildspeed? Doesn't it then also get confused
> if you do a WHERE textcol 'foo' query by hand?

No, wildspeed uses the same operator, ~~. match_special_index_operator() isn't called at all: in match_clause_to_indexcol(), is_indexable_operator() is called before match_special_index_operator() and returns true. expand_indexqual_opclause() sees that the operator is OID_TEXT_LIKE_OP and calls prefix_quals(), which fails because it expects only certain btree opfamilies.

> Oh, I see. So this assumption mentioned in the comment there:
>   /*
>    * LIKE and regex operators are not members of any index opfamily,
>    * so if we find one in an indexqual list we can assume that it
>    * was accepted by match_special_index_operator().
>    */
> is no longer true with wildspeed. So we do need to check that in
> expand_indexqual_opclause() then.

>> NOTICE 2: it seems to me, that a similar technique could be
>> implemented for ordinary btree to eliminate the hack around LIKE
>> support.
>
> I wonder what the size and performance of that would be like, in
> comparison to the proposed GIN solution?

GIN speeds up '%foo%' too - which is impossible for btree. But I don't like the hack around LIKE support in btree; it uses a workaround that bypasses the regular path.

> You could satisfy '%foo%' using a regular and a reverse B-tree index,
> and a bitmap AND. Which is interestingly similar to the way you
> proposed to use a TIDBitmap within GIN.
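The '%foo%' speed-up discussed here can be modelled with the classic rotation trick (a sketch of the general technique only; wildspeed's actual on-disk layout may differ): index every rotation of each string, and a substring search becomes a prefix scan over the sorted rotations. The `$` end marker and all function names below are invented for illustration.

```python
# Sketch: turning LIKE '%foo%' into a prefix search over string
# rotations (the general trick; not wildspeed's exact implementation).
import bisect

def rotations(s):
    s = s + "$"                       # '$' marks the original string start
    return [s[i:] + s[:i] for i in range(len(s))]

def build(values):
    idx = []                          # sorted (rotation, original value) pairs
    for v in values:
        for r in rotations(v):
            idx.append((r, v))
    idx.sort()
    return idx

def contains(idx, needle):
    # LIKE '%needle%'  ==  prefix search for 'needle' among rotations
    lo = bisect.bisect_left(idx, (needle,))
    out = set()
    while lo < len(idx) and idx[lo][0].startswith(needle):
        out.add(idx[lo][1])
        lo += 1
    return out

idx = build(["foobar", "barfoo", "baz"])
print(sorted(contains(idx, "foo")))   # ['barfoo', 'foobar']
```

This is also why such an index is larger than a plain btree: every value is stored once per character, the price paid for making any substring a prefix.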
Re: [PATCHES] Partial match in GIN
> Looking at the patch, you require that the TIDBitmap fits in work_mem
> in non-lossy format. I don't think that's acceptable, it can easily
> exceed work_mem if you search for some very common word. Failing to
> execute a valid query is not good.

But this way is better than nothing. Really, this way was chosen to get a fast merge of (potentially) hundreds of sorted lists of ItemPointers; the other ways are much slower. Some calculations: with 8Mb of work_mem, a TIDBitmap in non-lossy mode can store at least 20 pages, which gives us no less than 20 tuples. For a frequent word that number should be multiplied by 10 or 100, because practically every tuple will contain it. The practical limit on the number of articles/documents served by one server is about 10 millions.

There are not so many alternatives:
- collect all needed ItemPointers, then sort and unique them
- merge each posting list with the already-collected ones
- do an N-way merge, where N can be very big
- rerun the index scan with all possible combinations

All of these would be much slower, even for not-very-big collections.

> I don't think the storage size of tsquery matters much, so whatever is
> the best solution in terms of code readability etc.

That was about the tsquerysend/recv format, not on-disk storage. We don't require compatibility of the binary format of database files, but I have some doubts about binary dumps.

> Hmm. match_special_index_operator() already checks that the index's
> opfamily is pattern_ops, or text_ops with C-locale. Are you reusing the
> same operator families for wildspeed? Doesn't it then also get confused
> if you do a WHERE textcol 'foo' query by hand?

No, wildspeed uses the same operator, ~~. match_special_index_operator() isn't called at all: in match_clause_to_indexcol(), is_indexable_operator() is called before match_special_index_operator() and returns true. expand_indexqual_opclause() sees that the operator is OID_TEXT_LIKE_OP and calls prefix_quals(), which fails because it expects only certain btree opfamilies.

>> NOTICE 2: it seems to me, that a similar technique could be
>> implemented for ordinary btree to eliminate the hack around LIKE
>> support.
>
> I wonder what the size and performance of that would be like, in
> comparison to the proposed GIN solution?

GIN speeds up '%foo%' too - which is impossible for btree. But I don't like the hack around LIKE support in btree; it bypasses the regular path. I'm thinking about adding a new strategy to btree to support prefix LIKE search directly; btree would scan the index while the compare method (with the option) returns true.
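The merge problem behind the TIDBitmap discussion above - combining (potentially) hundreds of sorted ItemPointer lists - can be sketched as an N-way heap merge, O(total · log N). This is a toy Python model only, not the backend's TIDBitmap; the `(block, offset)` tuples stand in for ItemPointers.

```python
# Toy N-way merge of sorted posting lists into a sorted, de-duplicated
# union (the set a TIDBitmap would represent).
import heapq

def merge_posting_lists(lists):
    result = []
    for ptr in heapq.merge(*lists):      # N-way merge of sorted inputs
        if not result or result[-1] != ptr:
            result.append(ptr)           # drop duplicates across lists
    return result

postings = [
    [(1, 1), (1, 3), (4, 2)],            # (block, offset) "ItemPointers"
    [(1, 3), (2, 7)],
    [(4, 2), (9, 1)],
]
print(merge_posting_lists(postings))
# [(1, 1), (1, 3), (2, 7), (4, 2), (9, 1)]
```

A bitmap sidesteps the log N factor by ORing pages directly, at the cost of needing the whole bitmap to fit in memory - which is exactly the work_mem objection raised above.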
[PATCHES] Partial match in GIN
We (Oleg and I) would like to present a patch implementing partial match for GIN indexes, plus two extensions which use this new feature. We hope that after a short review they will be committed to CVS. This work was sponsored by EnterpriseDB.

http://www.sigaev.ru/misc/partial_match_gin-0.7.gz

Implements partial match for GIN. It extends the interface of the support functions but keeps backward compatibility. The basic idea is to find the first value in the index greater than or equal to the search key and scan sequentially until the support function says stop. For each matched entry, all corresponding ItemPointers are collected in a TIDBitmap structure, to effectively merge ItemPointers from different entries.

The patch introduces the following interface changes:
- The compare function gets a third (optional) argument, of boolean type, indicating the kind of comparison: partial or exact match. If the argument is 'false', the function should compare as usual; otherwise its result is treated as:
    = 0 - match
    < 0 - doesn't match, but continue the scan
    > 0 - stop the scan
- The extractQuery function gets a fourth (optional) argument of type bool**. The function is responsible for allocating an array of the same size as the returned array of search entries. If extractQuery wishes to request partial match for some entry, it should set the corresponding element of the bool array to true.

If the functions described above lack the extra arguments, GIN will not be able to use partial match.

http://www.sigaev.ru/misc/tsearch_prefix-0.6.gz

Implements prefix search - one of the most wanted text search features. A lexeme to be partially matched should be labelled with an asterisk:
  select count(*) from apod where fti @@ 'star:*';
or even
  select count(*) from apod where fti @@ to_tsquery('star:*');
A dictionary may mark a normalized lexeme with a flag (TSL_PREFIX) to indicate its prefix path. There is an unresolved issue here: tsquery now has a new flag to mark prefix search, and while the cstring representation remains backward-compatible, the external binary representation does not. Currently an extra byte is used to store this flag. On the other hand, there are 4 unused bits in the external binary representation (in the byte that stores the lexeme's weights), so it would be possible to use one of them for this flag. What are the opinions?

http://www.sigaev.ru/misc/wildspeed-0.10.tgz
docs: http://mira.sai.msu.su/~megera/pgsql/pgdoc/wildspeed.html
http://www.sai.msu.su/~megera/wiki/wildspeed

In short, this is a contrib module that speeds up the LIKE operation with any kind of pattern, like 'foo%bar' or '%foo%' or even '%foo%bar'. The module is based on the partial match GIN patch.

NOTICE 1: the current index support for LIKE believes that only btree can speed up LIKE, and gets confused by this module with the error 'unexpected opfamily' in prefix_quals(). For this reason, the partial match patch adds a small check before calling expand_indexqual_opclause().

NOTICE 2: it seems to me that a similar technique could be implemented for ordinary btree to eliminate the hack around LIKE support.
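The partial-match scan described in this announcement - find the first entry >= the key, then scan forward under a three-state compare - can be sketched in Python (a model of the interface, not the backend code; the prefix-search compare shown is just one possible comparePartial):

```python
# Sketch of GIN partial match: locate the first sorted entry >= the key,
# then scan while the three-state compare allows it:
#   0 = match, < 0 = no match but continue, > 0 = stop the scan.
import bisect

def compare_partial(prefix, entry):
    # prefix search on sorted entries: once an entry no longer starts
    # with the prefix, no later entry can match either -> stop (return 1)
    return 0 if entry.startswith(prefix) else 1

def partial_match(sorted_entries, prefix):
    matches = []
    i = bisect.bisect_left(sorted_entries, prefix)   # first entry >= prefix
    while i < len(sorted_entries):
        r = compare_partial(prefix, sorted_entries[i])
        if r == 0:
            matches.append(sorted_entries[i])
        elif r > 0:
            break                    # stop scan
        # r < 0 would mean "no match, keep scanning" for other operators
        i += 1
    return matches

entries = sorted(["star", "starship", "start", "stone", "sun"])
print(partial_match(entries, "star"))   # ['star', 'starship', 'start']
```

The `< 0` branch is what lets an opclass support partial-match operators other than plain prefix, where non-matching entries can be interleaved with matching ones.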
Re: [PATCHES] [HACKERS] Include Lists for Text Search
> Hmm, I can see how some middleware would help with folding or not
> folding the input token, but what about the words coming from the
> dictionary file (particularly the *output* lexeme)? It's not apparent
> to me that it's sensible to try to control that from outside the
> dictionary.

Right now I see a significant advantage of such a layer: the two possible extensions of a dictionary (filtering and storing the original form) are independent of the nature of the dictionary. So instead of modifying every dictionary we can add a layer common to all dictionaries, with syntax like:
  CREATE/ALTER TEXT SEARCH DICTIONARY foo (...) WITH ( filtering=on|off, store_original=on|off );
Or per token-type/dictionary pair.
Re: [PATCHES] [HACKERS] Include Lists for Text Search
> Well, if you think this can/should be done somewhere outside the
> dictionary, should I revert the applied patch?

No, that patch is about case sensitivity of the synonym dictionary. I suppose Simon wants to replace 'bill' with 'account' but doesn't want to get 'account Clinton'. For other dictionaries (the dictionary of numbers, snowball) that option is meaningless.
Re: [PATCHES] [HACKERS] Include Lists for Text Search
> Right now I see a significant advantage of such a layer: the two
> possible extensions of a dictionary (filtering and storing the original
> form) are

One more extension: dropping too-long words. For example, decreasing the maximum word length to keep overly long words from being indexed - a word of 100 characters is suspiciously long for human input.
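The "common layer" idea from this subthread can be modelled in Python: wrap any dictionary's lexize function with options that are independent of the dictionary itself. Everything here is hypothetical illustration - the option names mirror the proposed syntax, and `toy_stem` stands in for a real dictionary.

```python
# Sketch of a dictionary-independent layer: options like store_original
# or a max word length apply uniformly, whatever the wrapped dictionary is.
def with_layer(lexize, store_original=False, max_word_len=None):
    def wrapped(word):
        if max_word_len is not None and len(word) > max_word_len:
            return []                          # drop suspiciously long words
        lexemes = list(lexize(word))
        if store_original and word.lower() not in lexemes:
            lexemes.append(word.lower())       # keep the original form too
        return lexemes
    return wrapped

# a trivial "stemming" dictionary, standing in for any real one
def toy_stem(word):
    return [word.lower().rstrip("s")]

lexize = with_layer(toy_stem, store_original=True, max_word_len=20)
print(lexize("Stars"))            # ['star', 'stars']
print(lexize("x" * 100))          # [] - filtered out by the layer
```

The point of the sketch: `with_layer` never inspects how `toy_stem` works, which is exactly the argument for implementing these extensions once, outside every dictionary.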
Re: [PATCHES] [BUGS] BUG #3975: tsearch2 index should not bomb out of 1Mb limit
To be precise about tsvector:

1) A GiST index is lossy for any kind of tsearch query; a GIN index is not lossy for the @@ operator, but is lossy for @@@.

2) The number of positions per word is limited to 256 - a bigger number of positions doesn't help ranking, but produces a big tsvector. If a word has a lot of positions in a document, it is close to being a stopword. We could easily increase this limit to 65536 positions.

3) The maximum value of a position is 2^14, because positions are stored in a uint16 and 2 bits of that integer are reserved to store the weight of the position. It would be possible to widen uint16 to uint32, but that would double the tsvector size, which is impractical, I suppose. So the part of a document used for ranking is its first 16384 words - roughly the first 50-100 kilobytes.

4) The limit on the total size of a tsvector comes from the WordEntry->pos field (ts_type.h). It holds the number of bytes between the first lexeme in the tsvector and the needed lexeme, so the limitation is on the total length of the lexemes plus their positional information.
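Item 3 above can be made concrete with a small bit-packing sketch (a Python model, not the ts_type.h definitions): a position word is 16 bits, with the top 2 bits carrying the weight and the remaining 14 bits the position.

```python
# Model of a tsvector position word: 2 weight bits + 14 position bits
# packed into one uint16.
WEIGHT_BITS = 2
POS_BITS = 16 - WEIGHT_BITS          # 14 usable position bits
MAX_POS = (1 << POS_BITS) - 1        # 16383

def pack(pos, weight):
    assert 0 <= pos <= MAX_POS and 0 <= weight <= 3
    return (weight << POS_BITS) | pos

def unpack(word):
    return word & MAX_POS, word >> POS_BITS

w = pack(16383, 3)                   # largest position, highest weight
assert w <= 0xFFFF                   # still fits in a uint16
print(unpack(w))                     # (16383, 3)
```

Widening the position to 32 bits would double this per-position cost, which is the size objection raised in item 3.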
Re: [PATCHES] Proposed patch to change TOAST compression strategy
Tsvector dump (taken by Magnus from the mail archives of pgsql's lists): http://www.sigaev.ru/misc/tstest.sql.bz2

Query: select sum(ts_rank( vector, 'asdfjkl' )) from tstest;

ts_rank detoasts the value in any case, even if the tsvector doesn't contain the needed lexemes. The test ran on my notebook, Core2 Duo 1.8GHz, 2Gb, with the default postgresql.conf:

8.4 without patch: Time: 10883,368 ms
8.4 with patch (db was reinitialized): Time: 9654,266 ms

---(end of broadcast)---
TIP 4: Have you searched our list archives?
   http://archives.postgresql.org
Re: [PATCHES] Proposed patch to change TOAST compression strategy
> This proposed patch addresses some issues in TOAST compression strategy
> that I have not done any performance testing of these changes --- does
> anyone have specific test scenarios they'd like to try?

That's a very important change for text search, because of tsvector storage: the headline and rank functions read a lot of tsvectors. It seems to me that a ranking test would be very telling, since the rank function reads the whole tsvector and returns only a small amount of data (just a number). Another testing focus may be lossy indexes, like an index over polygons.
Re: [PATCHES] minor compiler warning in backend/utils/adt/tsrank.c
Thank you, committed.

Hannes Eder wrote:
> while rebuilding postgres with msvc 2005 I noticed some minor compiler
> warnings:
> .\src\backend\utils\adt\tsrank.c(24): warning C4305: 'initializing' : truncation from 'double' to 'float'
> .\src\backend\utils\adt\tsrank.c(24): warning C4305: 'initializing' : truncation from 'double' to 'float'
> .\src\backend\utils\adt\tsrank.c(24): warning C4305: 'initializing' : truncation from 'double' to 'float'
> .\src\backend\utils\adt\tsrank.c(46): warning C4305: 'return' : truncation from 'double' to 'float4'
> .\src\backend\utils\adt\tsrank.c(339): warning C4305: '=' : truncation from 'double' to 'float'
> see attached diff
> -Hannes
Re: [PATCHES] [COMMITTERS] pgsql: Remove QueryOperand->istrue flag, it was used only in cover
> This change introduced a compiler warning. Here is a fix for it.
>
>> Remove QueryOperand->istrue flag, it was used only in cover ranking
>> (ts_rank_cd). Use palloc'ed array in ranking instead of flag.

Thank you, committed.
Re: [PATCHES] Yet more tsearch refactoring
Heikki Linnakangas wrote:
> * Defined new struct WordEntryPosVector that holds a uint16 length and
> a variable size array of WordEntries. This replaces the previous
> convention of a variable size uint16 array, with the first element
> implying the length. WordEntryPosVector has the same layout in memory,
> but is more readable in source code. The POSDATAPTR and POSDATALEN
> macros are still used, though it would now be more readable to access
> the fields in WordEntryPosVector directly.

Did you check it on 64-bit boxes with strict alignment? I remember that was a headache for me.
Re: [PATCHES] tsearch refactorings
> BTW, Teodor, are you intending to review/apply Heikki's tsearch fixes,
> or do you want someone else to do it?

I'll do it.
Re: [PATCHES] tsearch refactorings
> I am getting confused with the patches and versions I have lying around
> here... I think I'll have to wait for review of the patches I've posted
> this far before I continue hacking.

Sorry for the delay - I was busy with other work. All your patches are committed, with very small changes.
Re: [PATCHES] tsearch refactorings
Heikki, I see some strange changes in your patch, not related to tsearch at all: contrib/pageinspect/pageinspect.sql.in contrib/pageinspect/rawpage.c The usage of the QueryItem struct was very confusing. It was used for both operators and operands. For operators, val was a single character casted to a int4, marking the operator type. For operands, val was the CRC-32 of the value. Other fields were used only either for operands or for operators. The biggest change in the patch is that I broke the QueryItem struct into QueryOperator and QueryOperand. Type was really ... - Removed ParseQueryNode struct used internally by makepol and friends. push*-functions now construct QueryItems directly. It's needed to set unused bytes in QueryItem to zero, it's common requiremens for types in pgsql. After allocating space for tsquery in parse_tsquery you copy just sizeof(QueryOperator) bytes and leave sizeof(QueryItem) - sizeof(QueryOperator) bytes untouched. QueryOperand is a biggest component in QueryItem union. I don't check other places. that? And parse_query always produces trees that are in prefix notation, so the left-field is really redundant, but using tsqueryrecv, you could inject queries that are not in prefix notation; is there anything in the code that depends on that? It's used by TS_execute for optimization reason. With clear postfix notation you should go through every nodes. For example: FALSE FALSE FALSE You will go to the end of query to produce correct result. In fact, TSQuery is a prefix notation with pointer to another operand or, by another words, just a plain view of tree where right operand of operation is always placed after operation. That notation allows to calculate only one of operand if it possible: FALSE FALSE FALSE 1 2 3 4 5 --Nodes After evaluating of second node you can return FALSE for whole expression and do not evaluate nodes 3-5. For query TRUE FALSE FALSE it's needed to evaluate 1,2,3,4 nodes. 
In most cases checking a QI_VAL node is much more expensive than checking a QI_OPR one. - There's many internal intermediate representations of a query: TSQuery, a QTNode-tree, NODE-tree (in tsquery_cleanup.c), prefix notation stack of QueryItems (in parser), infix-tree. Could we remove some of these? I don't have strong objections; QTNode and NODE are tree-like structures, but TSQuery is a prefix notation for storage in plain memory. NODE is used only to clean up stop-word placeholders, so it's a binary tree, while QTNode represents an n-ary tree (any number of children). Thank you for your interest in tsearch - after rechecking the problem pointed out above I'll commit your patch. -- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http://www.sigaev.ru/
Re: [PATCHES] tsearch refactorings
I fixed this by making the left-field a uint32. There's no reason to arbitrarily limit it to 16 bits, and it won't increase the disk/memory footprint either now that QueryOperator and QueryOperand are separate structs. ... I added a check_stack_depth() call to all recursive functions I found. Some of them might have a natural limit so that you can't force arbitrarily deep recursion, but check_stack_depth() is cheap enough that it seems best to just stick it into anything that might be a problem. Agreed. -- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http://www.sigaev.ru/
Re: [PATCHES] tsearch refactorings
Ok. Probably easiest to do that by changing the palloc to palloc0 in parse_tsquery. ...and change the sizeof to sizeof(QueryItem). -- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http://www.sigaev.ru/
Re: [PATCHES] tsearch refactorings
Heikki Linnakangas wrote: Teodor Sigaev wrote: Ok. Probably easiest to do that by changing the palloc to palloc0 in parse_tsquery. and change sizeof to sizeof(QueryItem) Do you mean the sizeofs in the memcpys in parse_tsquery? You can't Oops, I meant the pallocs in the push* functions; palloc0 in parse_tsquery is another way. BTW, can you explain what the CRC-32 of a value is used for? It looks like it's used to speed up some operations, by comparing the CRCs before comparing the values, but I didn't quite figure out how it works. It's mostly used in GiST indexes - recalculating the crc32 every time for each index tuple to be checked is rather expensive. How much of a performance difference does it make? Would hash_any do a better/cheaper job? crc32 was chosen after testing a lot of hash functions. Perl's hash was the fastest, but crc32 produced far fewer collisions. Interestingly, for ASCII many functions produce a rather small number of collisions, but for the upper part of the table (0x7f-0xff) crc32 was the best. CRC32's collisions are evenly distributed over characters; the others' are not. In any case, I think we need to calculate the CRC/hash in tsqueryrecv, instead of trusting the client. Agreed. -- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http://www.sigaev.ru/
Re: [PATCHES] tsearch core path, v0.58
Tom Lane wrote: Teodor Sigaev [EMAIL PROTECTED] writes: http://www.sigaev.ru/misc/tsearch_core-0.58.gz What is src/backend/utils/tsearch/dict_ispell/parse.h ? oops - that is an unused file left over from some experiments. It will be removed from the patch. Well, I know what it *is*: it's bison output. The question is what is it doing here? It doesn't seem to be used, and if it is used then I do not see the bison grammar file it's made from. BTW, I would like to shorten some of the path names in this fileset. Is there a reason not to combine src/backend/utils/adt/tsearch, src/backend/utils/tsearch, and src/backend/utils/tsearch/dict_ispell into one place, perhaps src/backend/tsearch? There is no strong reason - src/backend/utils/tsearch contains functions for processing text; they are rather complex. -- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http://www.sigaev.ru/
[PATCHES] tsearch core path, v0.58
http://www.sigaev.ru/misc/tsearch_core-0.58.gz Changes since the 0.52 version: 1) Introduce dictionary templates, which contain only the methods of a dictionary and can be managed only by a superuser. CREATE TEXT SEARCH DICTIONARY dictname TEMPLATE dicttmplname [OPTION opt_text ] ; CREATE TEXT SEARCH DICTIONARY TEMPLATE dicttmplname LEXIZE lexize_function [INIT init_function ] ; DROP TEXT SEARCH DICTIONARY TEMPLATE [IF EXISTS] dicttmplname [CASCADE] ALTER TEXT SEARCH DICTIONARY TEMPLATE dicttmplname RENAME TO newname; psql has a \dFt command to work with templates. 2) Parsers and dictionary templates can be managed only by a superuser (due to the security reasons pointed out by Tom). So they don't have owner columns, and the ALTER .. PARSER .. OWNER TO command is removed. 4) As Bruce suggested, the GUC variable tsearch_conf_name is renamed to default_text_search_config and the trigger tsearch is renamed to tsvector_update_trigger. 5) Remove the cfglocale and cfgdefault columns in configuration. So CREATE/ALTER .. CONFIGURATION no longer has the AS DEFAULT and LOCALE options. Instead, initdb tries to find a suitable configuration name for the selected locale, or it uses the -T, --text-search-config=CFG switch. 6) pg_dump and psql are changed accordingly. -- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http://www.sigaev.ru/
[PATCHES] index support is NULL
http://www.sigaev.ru/misc/indexnulls-0.8.gz Initially the patch was developed by Martijn van Oosterhout, but it has been reworked, adding support for searching NULLs to GiST too. The patch adds a new column named amsearchnull to pg_am. To recognize an IS NULL clause, ScanKey->sk_flags contains (SK_ISNULL | SK_INDEXFINDNULL) and ScanKey->sk_strategy is set to InvalidStrategy. IS NOT NULL isn't supported. The patch was already suggested to the community (http://archives.postgresql.org/pgsql-patches/2006-12/msg00019.php and http://archives.postgresql.org/pgsql-hackers/2007-02/msg01162.php), but the single objection was: http://archives.postgresql.org/pgsql-patches/2006-12/msg00028.php The objection was about representing the IS [NOT] NULL clause in the ScanKey structure, because it required setting the strategy to BTEqualStrategyNumber, and Tom doubted the need for such support. Now ScanKey->sk_strategy = InvalidStrategy, and the btree code transforms it to BTEqualStrategyNumber in _bt_preprocess_keys(); btcostestimate knows that fact too. GiST doesn't need to transform the strategy - it looks at SK_INDEXFINDNULL only. -- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http://www.sigaev.ru/
[PATCHES] tsearch_core patch for inclusion
http://www.sigaev.ru/misc/tsearch_core-0.43.gz Changes: 1) The Ispell dictionary now supports the hunspell dictionary format, which is used by OpenOffice >= 2.0.2 (http://wiki.services.openoffice.org/wiki/Dictionaries). The format changes are aimed, basically, at better support for compound words (German, Norwegian). So please test it - we don't know those languages at all. 2) Added recent fixes of contrib/tsearch2. 3) Fixed usage of fopen/fclose. -- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http://www.sigaev.ru/
Re: [PATCHES] seg regression failures
Anybody see a smoking gun? I don't really know what seg is supposed to be doing, so I don't know where to start. As a crazy idea only: try modifying seg_overlap to

bool
seg_overlap(SEG *a, SEG *b)
{
    return (
        ((a->upper >= b->upper) && (a->lower <= b->upper))
        ||
        ((b->upper >= a->upper) && (b->lower <= a->upper))
    ) ? true : false;
}

-- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http://www.sigaev.ru/
Re: [PATCHES] Updated bitmap index patch
I don't much like the GiST changes: gistgetbitmap will lock/unlock the page for every tuple. It seems to me that it would be better to change the gistnext function to use some sort of callback, for example, or to have a static array for tids in gistgetbitmap... -- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http://www.sigaev.ru/
Re: [PATCHES] First implementation of GIN for pg_trgm
From a previous discussion with Teodor, it would be better to store an int in the index instead of a text (it takes less space and is faster). I couldn't find any example, so if anyone has advice on how to fix that, it's welcome (mostly how to pack the trigram into an int instead of a text). Something like that:

trg = generate_trgm(VARDATA(text), VARSIZE(text) - VARHDRSZ);
nentries = ARRNELEM(trg);
if (nentries > 0)
{
    *entries = palloc(sizeof(Datum) * nentries);
    for (i = 0; i < nentries; i++)
    {
        int   tmp = 0;
        trgm *ptr = GETARR(trg) + i;

        CPTRGM(&tmp, ptr);
        tmp >>= 8;
        entries[i] = Int32GetDatum(tmp);
    }
}

Do not forget to change CREATE OPERATOR CLASS accordingly. The last problem is that the similarity calculated in the GIN index is higher than the one with GIST, so I have to set the trgm_limit quite high to have decent results (a limit of 0.8 instead of 0.3 seems to be quite good). AFAICS, it comes from the fact that I couldn't find any way to get the length of the indexed trigram which is taken into account with GIST, so we're not exactly filtering the results in the same way. Does anyone have an idea on how to fix this point? For calculating similarity you need three values: the length in trigrams of the first word (let it be the indexed text), the length of the second word (the query word), and the number of trigrams common to both words. It's a pity, but during an index scan we don't know the length of the indexed text. So, in the index scan (consistent function) we cannot compute the exact value of the similarity, but we can compute a bound. For example, if our query has 50 trigrams and only one of them is common to the indexed value and the query, we can conclude that the indexed value cannot be similar to our query. So, our consistent function should return FALSE only when the indexed value is definitely not similar to the query, and TRUE otherwise.
Let lquery be the length of the query and ncommon the number of common trigrams (look at the cnt_sml() function); the consistent function should then be:

#ifdef DIVUNION
    /* the original formula is count/(lvalue+lquery-lcommon), so with any
     * lvalue > 0 the resulting similarity is smaller than computed below */
    return ( count/(lquery-lcommon) > limit ) ? TRUE : FALSE;
#else
    /* the original formula is count/max(lvalue,lquery) - same reasoning */
    return ( count/lquery > limit ) ? TRUE : FALSE;
#endif

Now the consistent function doesn't guarantee an exact result, so we should mark the '%' operator in CREATE OPERATOR CLASS with the RECHECK option. -- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http://www.sigaev.ru/
Re: [PATCHES] First implementation of GIN for pg_trgm
I didn't see any improvement in terms of size of the index (14 MB for 642 738 rows in the index in both cases) or speed. Our dictionary table contains 78367 words and its size is 3 MB. Did I miss something? Comparing integers is cheaper than comparing strings, although it doesn't matter much for an index scan. The attached patch adds a RECHECK too. It seems to work correctly but the RECHECK COND costs a lot of time :/. :( What is the average length of the strings in the table? -- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http://www.sigaev.ru/
Re: [PATCHES] First implementation of GIN for pg_trgm
I think it can be interesting for other flavours of GIN usage. Is there a way to add the number of entries of the considered indexed item to the consistent prototype without adding too much overhead and complexity? We are thinking about adding an extra value, but it's still only thinking.
[pgsql-patches] Docs improvements
1) docguide.sgml.patch Update comments about installation of DocBook on FreeBSD. DocBook v4.2 is present in ports now. 2) mvcc.sgml.patch A table of compatibility of table-lock modes. IMHO, it's useful and clear for understanding. A text view of the table is:

      | AS | RS | RE | SUE | S | SRE | E | AE |
 -----+----+----+----+-----+---+-----+---+----+
  AS  | O  | O  | O  | O   | O | O   | O | X  |
  RS  | O  | O  | O  | O   | O | O   | X | X  |
  RE  | O  | O  | O  | O   | X | X   | X | X  |
  SUE | O  | O  | O  | X   | X | X   | X | X  |
  S   | O  | O  | X  | X   | O | X   | X | X  |
  SRE | O  | O  | X  | X   | X | X   | X | X  |
  E   | O  | X  | X  | X   | X | X   | X | X  |
  AE  | X  | X  | X  | X   | X | X   | X | X  |

The top row and left column contain abbreviations of the lock modes by their first characters (AS = ACCESS SHARE). -- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http://www.sigaev.ru/ *** mvcc.sgml.orig Fri Jan 26 16:27:27 2007 --- mvcc.sgml Fri Jan 26 17:49:29 2007 *** *** 741,746 --- 741,863 releases locks acquired within it.
    </para>
+
+   <table tocentry="1" id="table-lock-compatibility">
+    <title>Compatibility of lock modes</title>
+    <tgroup cols="9">
+     <colspec colnum="1" colwidth="1*">
+     <colspec colnum="2" colwidth="1*">
+     <colspec colnum="3" colwidth="1*">
+     <colspec colnum="4" colwidth="1*">
+     <colspec colnum="5" colwidth="1*">
+     <colspec colnum="6" colwidth="1*">
+     <colspec colnum="7" colwidth="1*">
+     <colspec colnum="8" colwidth="1*">
+     <colspec colnum="9" colwidth="1*">
+     <thead>
+      <row>
+       <entry>Modes</entry> <entry>AS</entry> <entry>RS</entry> <entry>RE</entry> <entry>SUE</entry> <entry>S</entry> <entry>SRE</entry> <entry>E</entry> <entry>AE</entry>
+      </row>
+     </thead>
+     <tbody>
+      <row> <entry>AS</entry> <entry align="center">Y</entry> <entry align="center">Y</entry> <entry align="center">Y</entry> <entry align="center">Y</entry> <entry align="center">Y</entry> <entry align="center">Y</entry> <entry align="center">Y</entry> <entry align="center">N</entry> </row>
+      <row> <entry>RS</entry> <entry align="center">Y</entry> <entry align="center">Y</entry> <entry align="center">Y</entry> <entry align="center">Y</entry> <entry align="center">Y</entry> <entry align="center">Y</entry> <entry align="center">N</entry> <entry align="center">N</entry> </row>
+      <row> <entry>RE</entry> <entry align="center">Y</entry> <entry align="center">Y</entry> <entry align="center">Y</entry> <entry align="center">Y</entry> <entry align="center">N</entry> <entry align="center">N</entry> <entry align="center">N</entry> <entry align="center">N</entry> </row>
+      <row> <entry>SUE</entry> <entry align="center">Y</entry> <entry align="center">Y</entry> <entry align="center">Y</entry> <entry align="center">N</entry> <entry align="center">N</entry> <entry align="center">N</entry> <entry align="center">N</entry> <entry align="center">N</entry> </row>
+      <row> <entry>S</entry> <entry align="center">Y</entry> <entry align="center">Y</entry> <entry align="center">N</entry> <entry align="center">N</entry> <entry align="center">Y</entry> <entry align="center">N</entry> <entry align="center">N</entry> <entry align="center">N</entry> </row>
+      <row> <entry>SRE</entry> <entry align="center">Y</entry> <entry align="center">Y</entry> <entry align="center">N</entry> <entry align="center">N</entry> <entry align="center">N</entry> <entry align="center">N</entry> <entry align="center">N</entry> <entry align="center">N</entry> </row>
+      <row> <entry>E</entry> <entry align="center">Y</entry> <entry align="center">N</entry> <entry align="center">N</entry> <entry align="center">N</entry> <entry align="center">N</entry> <entry align="center">N</entry> <entry align="center">N</entry> <entry align="center">N</entry> </row>
+      <row> <entry>AE</entry> <entry align="center">N</entry> <entry align="center">N</entry> <entry align="center">N</entry> <entry align="center">N</entry> <entry align="center">N</entry> <entry align="center">N</entry> <entry align="center">N</entry> <entry align="center">N</entry> </row>
+     </tbody>
+    </tgroup>
+   </table>
Re: [pgsql-patches] [HACKERS] [PATCHES] Bundle of patches
Nice, thanks a lot. Tom Lane wrote: Teodor Sigaev [EMAIL PROTECTED] writes: Just a refresh for clean applying.. http://www.sigaev.ru/misc/user_defined_typmod-0.11.gz Applied with some revisions, and pg_dump support and regression tests added. regards, tom lane -- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http://www.sigaev.ru/
Re: [HACKERS] [PATCHES] Bundle of patches
Just a refresh for clean applying.. http://www.sigaev.ru/misc/user_defined_typmod-0.11.gz Are there any objections to committing it? -- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http://www.sigaev.ru/
Re: [PATCHES] Bundle of patches
This is not responding to my concern. What you presented was an Sorry, I see your point now. Is that test enough, or should I make more? -- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http://www.sigaev.ru/
Re: [HACKERS] [PATCHES] Bundle of patches
0.9 doesn't apply cleanly after Peter's changes, so here is a new version: http://www.sigaev.ru/misc/user_defined_typmod-0.10.gz Teodor Sigaev wrote: Perhaps an array of int4 would be better? How much Done http://www.sigaev.ru/misc/user_defined_typmod-0.9.gz The patch needs more cleanup before applying, too, eg make comments match code, get rid of unused keywords added to gram.y. Cleaned. -- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http://www.sigaev.ru/
Re: [PATCHES] Bundle of patches
Perhaps an array of int4 would be better? How much Done http://www.sigaev.ru/misc/user_defined_typmod-0.9.gz The patch needs more cleanup before applying, too, eg make comments match code, get rid of unused keywords added to gram.y. Cleaned. -- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http://www.sigaev.ru/
Re: [PATCHES] Bundle of patches
10;

 Limit (cost=28.85..28.87 rows=10 width=8) (actual time=0.586..0.627 rows=10 loops=1)
   ->  Sort (cost=28.85..28.90 rows=20 width=8) (actual time=0.581..0.594 rows=10 loops=1)
         Sort Key: t.f1, t.f2
         ->  Result (cost=0.00..28.42 rows=20 width=8) (actual time=0.085..0.431 rows=20 loops=1)
               ->  Append (cost=0.00..28.42 rows=20 width=8) (actual time=0.076..0.345 rows=20 loops=1)
                     ->  Limit (cost=0.00..14.11 rows=10 width=8) (actual time=0.074..0.127 rows=10 loops=1)
                           ->  Index Scan using idx on foo (cost=0.00..1358815.69 rows=963067 width=8) (actual time=0.069..0.098 rows=10 loops=1)
                                 Index Cond: ((f1 4) AND (f1 5))
                     ->  Limit (cost=0.00..14.11 rows=10 width=8) (actual time=0.101..0.154 rows=10 loops=1)
                           ->  Index Scan using idx on foo (cost=0.00..1527039.50 rows=1082489 width=8) (actual time=0.097..0.121 rows=10 loops=1)
                                 Index Cond: ((f1 6) AND (f1 7))
 Total runtime: 0.952 ms

[3] # explain analyze select * from (
        select * from (select * from foo where f149980 and f15 order by f1, f2 limit 10) as t1
        union all
        select * from (select * from foo where f169980 and f17 order by f1, f2 limit 10) as t2
    ) as t order by f1, f2 limit 10;

 Limit (cost=35.61..35.64 rows=10 width=8) (actual time=0.630..0.673 rows=10 loops=1)
   ->  Sort (cost=35.61..35.66 rows=20 width=8) (actual time=0.626..0.641 rows=10 loops=1)
         Sort Key: t.f1, t.f2
         ->  Result (cost=0.00..35.18 rows=20 width=8) (actual time=0.114..0.476 rows=20 loops=1)
               ->  Append (cost=0.00..35.18 rows=20 width=8) (actual time=0.107..0.401 rows=20 loops=1)
                     ->  Limit (cost=0.00..17.51 rows=10 width=8) (actual time=0.103..0.156 rows=10 loops=1)
                           ->  Index Scan using idx on foo (cost=0.00..3550.22 rows=2028 width=8) (actual time=0.099..0.126 rows=10 loops=1)
                                 Index Cond: ((f1 49980) AND (f1 5))
                     ->  Limit (cost=0.00..17.47 rows=10 width=8) (actual time=0.129..0.181 rows=10 loops=1)
                           ->  Index Scan using idx on foo (cost=0.00..3804.01 rows=2177 width=8) (actual time=0.125..0.152 rows=10 loops=1)
                                 Index Cond: ((f1 69980) AND (f1 7))
 Total runtime: 1.004 ms

=[IV] With patch===
[1] # explain analyze select * from foo where (f1=7 and f295) or f17 order by f1, f2 limit 1;

 Limit (cost=0.00..1.41 rows=1 width=8) (actual time=0.314..0.316 rows=1 loops=1)
   ->  Index Scan using idx on foo (cost=0.00..4210762.24 rows=2982440 width=8) (actual time=0.309..0.309 rows=1 loops=1)
         Index Cond: (f1 = 7)
         Filter: (((f1 = 7) AND (f2 95)) OR (f1 7))
 Total runtime: 0.444 ms

[2] # explain analyze select * from foo where (f14 and f15) or (f16 and f17) order by f1, f2 limit 10;

 Limit (cost=0.00..14.16 rows=10 width=8) (actual time=0.087..0.192 rows=10 loops=1)
   ->  Result (cost=0.00..2932816.48 rows=2071539 width=8) (actual time=0.081..0.160 rows=10 loops=1)
         ->  Append (cost=0.00..2932816.48 rows=2071539 width=8) (actual time=0.074..0.123 rows=10 loops=1)
               ->  Index Scan using idx on foo (cost=0.00..1382937.86 rows=976723 width=8) (actual time=0.069..0.092 rows=10 loops=1)
                     Index Cond: ((f1 4) AND (f1 5))
               ->  Index Scan using idx on foo (cost=0.00..1549878.62 rows=1094816 width=8) (never executed)
                     Index Cond: ((f1 6) AND (f1 7))
 Total runtime: 0.410 ms

[3] # explain analyze select * from foo where (f149980 and f15) or (f169980 and f17) order by f1, f2 limit 10;

 Limit (cost=0.00..17.57 rows=10 width=8) (actual time=0.115..0.226 rows=10 loops=1)
   ->  Result (cost=0.00..6902.27 rows=3928 width=8) (actual time=0.111..0.197 rows=10 loops=1)
         ->  Append (cost=0.00..6902.27 rows=3928 width=8) (actual time=0.102..0.156 rows=10 loops=1)
               ->  Index Scan using idx on foo (cost=0.00..3427.16 rows=1950 width=8) (actual time=0.099..0.125 rows=10 loops=1)
                     Index Cond: ((f1 49980) AND (f1 5))
               ->  Index Scan using idx on foo (cost=0.00..3475.11 rows=1978 width=8) (never executed)
                     Index Cond: ((f1 69980) AND (f1 7))
 Total runtime: 0.679 ms

-- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http://www.sigaev.ru/
[PATCHES] Bundle of patches
) AND (a 21000)) 4.3) As a side effect of the previous point, overlapping clauses can be eliminated:

SELECT a FROM tst WHERE ( a > 5 AND a < 61000 ) OR ( a > 6 AND a < 60100 ) ORDER BY a LIMIT 20;

 Limit (cost=0.00..4.14 rows=20 width=4) (actual time=0.168..1.001 rows=20 loops=1)
   ->  Index Scan using aidx on tst (cost=0.00..2344.85 rows=11338 width=4) (actual time=0.162..0.935 rows=20 loops=1)
         Index Cond: ((a > 5) AND (a < 61000))

-- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http://www.sigaev.ru/
Re: [PATCHES] Bundle of patches
This has obvious semantic disadvantages (what if foo is an expensive function?); Agree. but the real problem is that there's no way for the planner to reason about ordering in this representation. This patch would guarantee that an ORDER BY with the NULLS option couldn't use an indexscan, even if the index sorts nulls at the correct end.

create table foo ( i int);
insert into foo values (1), (5), (NULL);
create index fooidx on foo (i);
set enable_seqscan=off;
set enable_bitmapscan=off;
explain select i from foo order by i asc nulls last;
 QUERY PLAN
---
 Index Scan using fooidx on foo (cost=0.00..12.05 rows=3 width=4)

explain select i from foo order by i desc nulls first;
 QUERY PLAN
---
 Index Scan Backward using fooidx on foo (cost=0.00..12.05 rows=3 width=4)

The patch is smart enough about native NULL ordering, so it adds quals only when needed. Index support of non-native NULL ordering, IMHO, has some correlation with the suggested OR-patch. Sorting by ASC NULLS FIRST may be done by two index scans under an append node:

Append
  Index Scan, Cond: foo IS NULL
  Index Scan, Cond: foo IS NOT NULL

I think a reasonable implementation requires introducing an explicit concept of nulls-first-or-last into the planner's model of sort order. Agree, but I tried to keep the patches as independent as possible... If we reach agreement on how to resolve this, I'll find time to work on it further in the foreseeable future. -- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http://www.sigaev.ru/
Re: [PATCHES] Bundle of patches
This seems kinda ugly --- it looks very expensive and unlikely to find useful optimizations most of the time. Have you done any benchmarking to find out what the cost is when the optimizations don't succeed? Yep, it's a big win with order and limit specified. Let table (a,b) have an index over (a,b); then queries with ((a=3 AND b>10) OR a>3) ORDER BY a,b may be done with one pass of an index scan with condition a>=3 and filter ((a=3 and b>10) or a>3). And the scan output is already sorted. The only way to execute it without the patch is a bitmap-or scan over two index scans followed by a sort; if the limit is small, that is a lot of unnecessary work for the executor. That case can be found by find_common_quals(), which is fast enough. The simplest case of the second kind is ( a < 3 or a > 5 ). If it's possible to prove that the set of rows for the first condition and the set for the second do not intersect, then the output of the correlated index scans can simply be appended. In this case, we broaden the applicability of the Append node. What is more, it's possible to order the index scans by their conditions, which lets us omit the sort node. I understand that proving non-intersection and ordering by conditions are expensive operations because of the use of predicate_implied_by/predicate_refuted_by, but in most cases they aren't even called. For this optimization to apply, all conditions should be on a single index - and that's the first step. The suggested plans are very effective when a or b are large values like varchar, not just integers. Also, what's with the pull_tlist bit? It pulls the target list from the subplan. I've added it because the query select b,a from foo where a < 3 or a > 5 order by a limit 1; with plan

Result
  Append
    Index Scan
    Index Scan

fails because the Result node thinks that it gets a (b,a) tuple, but in fact it gets (a,b). So, if pull_tlist is TRUE, then create_append_plan takes the target list from the first subplan. Currently, that bit is set only with the OR-optimized plan, and the optimizer of the OR-clause guarantees that the target lists are the same.
Sorry, but I didn't find a clearer solution... -- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http://www.sigaev.ru/
Re: [PATCHES] Bundle of patches
This one seems generally workable, but I really dislike the approach that's been used for passing typmod arguments to the typmod_in function. Representing them with an internal parameter means it'll be forever impossible to write typmod functions in anything but C, which seems an ugly restriction. Perhaps an array of int4 would be better? I don't think that is a problem - I'll change that. How much flexibility do we really want to provide for typmod arguments? Allowing a full expr_list in the grammar seems less than sane, considering the result is still going to have to pack into 32 bits. As I remember, I tried to use something else, but I got a lot of conflicts with AexprConst: func_name '(' expr_list ')' Sconst. The patch needs more cleanup before applying, too, eg make comments match code, get rid of unused keywords added to gram.y. Ok. -- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http://www.sigaev.ru/
Re: [PATCHES] Bundle of patches
And what happens when we implement NULLS FIRST/LAST correctly? This is really a poor choice of representation. If it's just appending index scans, it's not a problem... One thing I find questionable about this is the assumption that indexes can support foo IS NULL and foo IS NOT NULL searches equally conveniently. This is demonstrably false for, say, hash. (Hash could store null keys by assigning them a fixed hashcode, say 0. Then it would be able to handle IS NULL searches, but not IS NOT NULL, because it can't do full-index scans.) Is there a guarantee that the hash value of some not-null key never equals that special hash code? the patch to do IS NULL only. But if we are going to support both, we probably have to have two pg_am flags, not one. GiST isn't effective with a single NOT NULL condition... So, using two flags may be useful. -- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http://www.sigaev.ru/
Re: [HACKERS] [PATCHES] Use of backslash in tsearch2
The patch isn't complete; a simple test (values taken from regression.diffs): and try to dump the table and restore: ERROR: syntax error CONTEXT: COPY tt, line 5, column tq: '1 ''2' The attached cumulative patch fixes the problem, but I have some doubts - is it really needed? -- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http://www.sigaev.ru/ ttt.gz Description: Unix tar archive
Re: [PATCHES] Use of backslash in tsearch2
Teodor, are the new attached regression results correct? If so, I will apply the patch and update the expected file. The patch isn't complete; a simple test (values taken from regression.diffs):

# create table tt (tv tsvector, tq tsquery);
# insert into tt values (E'''1 \\''2''', NULL);
# insert into tt values (E'''1 \\''2''3', NULL);
# insert into tt values ( E'''1 \\''2'' 3', NULL);
# insert into tt values ( E'''1 \\''2'' '' 3'' 4 ', NULL);
# insert into tt values ( NULL, E'''1 \\''2''');
# insert into tt values ( NULL, E'''1 \\''2''');
# insert into tt values ( NULL, E'1(''2''('' 4''(\\|5 | ''6 \\'' !|'')))');
# insert into tt values ( NULL, E'1(''2''('' 4''(\\|5 | ''6 \\'' !|'')))');

and try to dump the table and restore:

ERROR: syntax error CONTEXT: COPY tt, line 5, column tq: '1 ''2'

PS I'm not subscribed to -patches, so I post to -hackers -- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http://www.sigaev.ru/
Re: [PATCHES] tsearch2 makefile fixes
Hi! I don't see a patch :) Alvaro Herrera wrote: Hi, I needed to apply this patch in order for tsearch2 to build here. This is a VPATH build, so maybe that's the reason why it's not a common test scenario. The changes themselves also seem sound to me, at least as far as I understand our makefile structure. Please consider applying this to 8.1 (or just let me know and I'll do it for you). -- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http://www.sigaev.ru/
Re: [PATCHES] tsearch2 makefile fixes
As far as I can see, this patch was already applied on 18 Oct:

% grep subdir */Makefile
ispell/Makefile:subdir = contrib/tsearch2/ispell
snowball/Makefile:subdir = contrib/tsearch2/snowball
wordparser/Makefile:subdir = contrib/tsearch2/wordparser

CVS HEAD

Alvaro Herrera wrote: [Resend, this time with the patch attached] Hi, I needed to apply this patch in order for tsearch2 to build here. This is a VPATH build, so maybe that's the reason why it's not a common test scenario. The changes themselves also seem sound to me, at least as far as I understand our makefile structure. Please consider applying this to 8.1 (or just let me know and I'll do it for you).

Index: ispell/Makefile
===
RCS file: /home/alvherre/Code/cvs/pgsql/contrib/tsearch2/ispell/Makefile,v
retrieving revision 1.8
diff -c -r1.8 Makefile
*** ispell/Makefile 27 Sep 2005 17:13:11 - 1.8
--- ispell/Makefile 2 Nov 2005 09:56:58 -
***
*** 8,14
  PGXS := $(shell pg_config --pgxs)
  include $(PGXS)
  else
! subdir = contrib/tsearch2
  top_builddir = ../../..
  include $(top_builddir)/src/Makefile.global
  include $(top_srcdir)/contrib/contrib-global.mk
--- 8,14
  PGXS := $(shell pg_config --pgxs)
  include $(PGXS)
  else
! subdir = contrib/tsearch2/ispell
  top_builddir = ../../..
  include $(top_builddir)/src/Makefile.global
  include $(top_srcdir)/contrib/contrib-global.mk

Index: snowball/Makefile
===
RCS file: /home/alvherre/Code/cvs/pgsql/contrib/tsearch2/snowball/Makefile,v
retrieving revision 1.7
diff -c -r1.7 Makefile
*** snowball/Makefile 27 Sep 2005 17:13:12 - 1.7
--- snowball/Makefile 2 Nov 2005 09:53:41 -
***
*** 9,15
  PGXS := $(shell pg_config --pgxs)
  include $(PGXS)
  else
! subdir = contrib/tsearch2
  top_builddir = ../../..
  include $(top_builddir)/src/Makefile.global
  include $(top_srcdir)/contrib/contrib-global.mk
--- 9,15
  PGXS := $(shell pg_config --pgxs)
  include $(PGXS)
  else
! subdir = contrib/tsearch2/snowball
  top_builddir = ../../..
  include $(top_builddir)/src/Makefile.global
  include $(top_srcdir)/contrib/contrib-global.mk

Index: wordparser/Makefile
===
RCS file: /home/alvherre/Code/cvs/pgsql/contrib/tsearch2/wordparser/Makefile,v
retrieving revision 1.7
diff -c -r1.7 Makefile
*** wordparser/Makefile 27 Sep 2005 17:13:12 - 1.7
--- wordparser/Makefile 2 Nov 2005 09:54:30 -
***
*** 9,15
  PGXS := $(shell pg_config --pgxs)
  include $(PGXS)
  else
! subdir = contrib/tsearch2
  top_builddir = ../../..
  include $(top_builddir)/src/Makefile.global
  include $(top_srcdir)/contrib/contrib-global.mk
--- 9,15
  PGXS := $(shell pg_config --pgxs)
  include $(PGXS)
  else
! subdir = contrib/tsearch2/wordparser
  top_builddir = ../../..
  include $(top_builddir)/src/Makefile.global
  include $(top_srcdir)/contrib/contrib-global.mk

-- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http://www.sigaev.ru/