Re: [HACKERS] Proposal: q-gram GIN and GiST indexes
Alexander Korotkov writes: > On Tue, Apr 5, 2011 at 5:05 PM, Robert Haas wrote: >> I am probably being stupid here, but doesn't the number of links to >> rows grow proportionately to the number of n-grams? > Number of links to rows grow proportionally to total number of extracted > q-grams, but not proportionally to number of unique q-grams. Sure. The number of links is exactly proportional to the size of the text, no? An n-character text contains exactly n-q+1 q-grams, no more, no less. You might have some rules that cause you to discard some of them, but basically the TID portion of the index will be proportional to data volume, with no measurable dependence on q. Or at least that's what it seems like before I've had my morning caffeine fix... regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: q-gram GIN and GiST indexes
On Tue, Apr 5, 2011 at 5:05 PM, Robert Haas wrote: > I am probably being stupid here, but doesn't the number of links to > rows grow proportionately to the number of n-grams? Number of links to rows grow proportionally to total number of extracted q-grams, but not proportionally to number of unique q-grams. Though, if extracted q-grams are not unique inside same indexed value, then it can reduce number of links (but it is rarity). Lets consider simple example. Two rows contains strings 'aaa' and 'aaab'. We extract 3-gram 'aaa' from first string and 3-grams 'aaa' and 'aab' from second string (for simplicity, there is no padding here). GIN index will contain structure, which can be represented so: 'aaa' => 1, 2 'aab' => 2 We can see, that there are 2 unique 3-grams, but 3 links to the rows. With best regards, Alexander Korotkov.
Re: [HACKERS] Proposal: q-gram GIN and GiST indexes
On Tue, Apr 5, 2011 at 8:41 AM, Alexander Korotkov wrote: > For example, here is distribution of q-grams count in 120 Mb of dblp paper > titles (pretty large dataset). > q count > 2 7218 > 3 115107 > 4 589428 > 5 1648453 > 6 3336685 > Number of 5-grams if about 15x larger than number of 3-grams. But most part > of index space will be occupied by links to the rows(about 120 millions of > links), while size of q-grams itself will be almost ignorable in comparison > with it. I am probably being stupid here, but doesn't the number of links to rows grow proportionately to the number of n-grams? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: q-gram GIN and GiST indexes
For example, here is distribution of q-grams count in 120 Mb of dblp paper titles (pretty large dataset). q count 27218 3 115107 4 589428 5 1648453 6 3336685 Number of 5-grams if about 15x larger than number of 3-grams. But most part of index space will be occupied by links to the rows(about 120 millions of links), while size of q-grams itself will be almost ignorable in comparison with it. With best regards, Alexander Korotkov.
Re: [HACKERS] Proposal: q-gram GIN and GiST indexes
On Tue, Apr 5, 2011 at 3:56 PM, Robert Haas wrote: > So with q=5, the index will be approximately 10x larger than with q=3. > Maybe that's OK, I'm not sure. But it is a big difference. Not whole index will be approximately 10x larger, but only entries pages number (which contains btree on gin keys, i.e. q-grams), while data pages number (which contains links to rows in lists or btrees) will be similar. In dependence on data volume index size can be 10x larger (on small datasets) or few percents larger (on large datasets). With best regards, Alexander Korotkov.
Re: [HACKERS] Proposal: q-gram GIN and GiST indexes
On Tue, Apr 5, 2011 at 4:52 AM, Alexander Korotkov wrote: > On Mon, Apr 4, 2011 at 9:01 PM, Robert Haas wrote: >> >> On Mon, Apr 4, 2011 at 12:38 PM, Alexander Korotkov >> wrote: >> > relatively small when q <= 5. Accordingly, I think we should expect >> > indexes >> > to be usable with at least with q = 5. >> >> I defer to your opinion on this, since you know more about it than I >> do. But I think it would still be worthwhile to write a quick Perl >> script and calculate the number q-grams in various sample texts for >> various values of q. The worst case is surely exponential in q, so >> it'd be nice to have some evidence of what the real-world behavior is. > > Here is distribution of numbers of different q-grams count in various > datasets. Q-grams didn't pass any preprocessing, preprocessed q-grams (for > example, lowercased) should have lower counts. > q ds1 ds2 ds3 ds4 > 2 2313 3461 1625 1288 > 3 15146 25094 14090 10728 > 4 58510 105908 69127 47499 > 5 161801 298466 182680 110929 > 6 351175 633750 331090 176336 > 7 613299 1049088 496426 234730 > 8 921962 1450715 657965 283698 > 9 1248339 1793158 802188 321261 > 10 1556838 2066775 926043 348058 > ds1 - J. R. R. Tolkien, The Lord of the Rings, 2805204 bytes > ds2 - Leo Tolstoy, War and Peace volume 1, 3197190 bytes > ds3 - set of person first and last names, 2142298 bytes > ds4 - english dictionary, 931708 bytes > Sure, q-grams count grows with q increasing. At low q we can see > approximately exponential grow. At high q grow is slowing and it is > approximately linear. > In the worst case count of q-grams is exponential in q if we think data > volume to be much higher then number of possible q-grams. But with high q > real limitation is total number of q-grams extracted from dataset. In worst > case each extracted q-gram is unique. This means that entries pages number > would be comparable with data pages number. In this case index size with > high q would be few times greater that index size with low q. So with q=5, the index will be approximately 10x larger than with q=3. Maybe that's OK, I'm not sure. But it is a big difference. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: q-gram GIN and GiST indexes
On Mon, Apr 4, 2011 at 9:01 PM, Robert Haas wrote: > On Mon, Apr 4, 2011 at 12:38 PM, Alexander Korotkov > wrote: > > relatively small when q <= 5. Accordingly, I think we should expect > indexes > > to be usable with at least with q = 5. > > I defer to your opinion on this, since you know more about it than I > do. But I think it would still be worthwhile to write a quick Perl > script and calculate the number q-grams in various sample texts for > various values of q. The worst case is surely exponential in q, so > it'd be nice to have some evidence of what the real-world behavior is. Here is distribution of numbers of different q-grams count in various datasets. Q-grams didn't pass any preprocessing, preprocessed q-grams (for example, lowercased) should have lower counts. q ds1 ds2 ds3ds4 2 231334611625 1288 315146 25094 14090 10728 458510 105908 69127 47499 5 161801 298466 182680 110929 6 351175 633750 331090 176336 7 613299 1049088 496426 234730 8 921962 1450715 657965 283698 9 1248339 1793158 802188 321261 10 1556838 2066775 926043 348058 ds1 - J. R. R. Tolkien, The Lord of the Rings, 2805204 bytes ds2 - Leo Tolstoy, War and Peace volume 1, 3197190 bytes ds3 - set of person first and last names, 2142298 bytes ds4 - english dictionary, 931708 bytes Sure, q-grams count grows with q increasing. At low q we can see approximately exponential grow. At high q grow is slowing and it is approximately linear. In the worst case count of q-grams is exponential in q if we think data volume to be much higher then number of possible q-grams. But with high q real limitation is total number of q-grams extracted from dataset. In worst case each extracted q-gram is unique. This means that entries pages number would be comparable with data pages number. In this case index size with high q would be few times greater that index size with low q. With best regards, Alexander Korotkov.
Re: [HACKERS] Proposal: q-gram GIN and GiST indexes
On Mon, Apr 4, 2011 at 12:38 PM, Alexander Korotkov wrote: > relatively small when q <= 5. Accordingly, I think we should expect indexes > to be usable with at least with q = 5. I defer to your opinion on this, since you know more about it than I do. But I think it would still be worthwhile to write a quick Perl script and calculate the number q-grams in various sample texts for various values of q. The worst case is surely exponential in q, so it'd be nice to have some evidence of what the real-world behavior is. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: q-gram GIN and GiST indexes
On Mon, Apr 4, 2011 at 6:56 PM, Robert Haas wrote: > On Mon, Apr 4, 2011 at 7:35 AM, Alexander Korotkov > wrote: > > I would like to propose a q-gram module which would have following > > differences in comparison with pg_trgm: > > 1) Focus on acceleration of edit distance (e.g. levenshtein distance) > > queries and LIKE/ILIKE queries > > 2) Support of various q > > Doesn't the index size grow rather drastically with increasing q? 1) GIN GIN index size can be represent as entries pages size + data pages size. With increasement of data volume grow of entries pages size is slowing , because set of q-grams which are actually occurs in text is limited. It can be illustrated on following example: data set count avg. length entries pages data pages english dictionary98569 8.5 529 656 names of persons 105420 19.3 475 1985 paper titles2526871 47.2123484819 Statistics of GIN index of pg_trgm module is presented in table above. We can see that ratio of entries pages to data pages is decreasing with increasing of data volume. Since number of q-grams extracted from one indexed value is similar for distinct q, we should not expect significant grow of data pages size with increasing q. With increasing q we should expect increase of entries pages size, but on large datasets and with not very high q we shoudn't expect significant grow of index size. In some papers I met assumption that set of whole q-grams in natural text is relatively small when q <= 5. Accordingly, I think we should expect indexes to be usable with at least with q = 5. 2) GiST Since I'm going to store exact values in leaf nodes instead sets of q-grams, index grow isn't directly expected with increasing of q. Because index size would depends on size of indexed values and signature size(both don't depends on q). There would be possible indirect index grow, because we probably need longer signatures for appropriate representation of q-gram set. With best regards, Alexander Korotkov.
Re: [HACKERS] Proposal: q-gram GIN and GiST indexes
On Mon, Apr 4, 2011 at 7:35 AM, Alexander Korotkov wrote: > I would like to propose a q-gram module which would have following > differences in comparison with pg_trgm: > 1) Focus on acceleration of edit distance (e.g. levenshtein distance) > queries and LIKE/ILIKE queries > 2) Support of various q Doesn't the index size grow rather drastically with increasing q? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: q-gram GIN and GiST indexes
Here is text of my GSoC proposal. Given details probably makes essence of my proposal clear. Any comments are welcome. *Name of project* Q-gram indexing module * * *Synopsis* Currently PostgreSQL has support for trigram-based string collection indexing in pg_trgm module. Indexes in pg_trgm was originally focused on trigram similatiry queries optimization. LIKE/ILIKE operations support was added by my patch, but this support is limited, because trigrams not always can be extracted from wildcard. I would like to propose a q-gram module which would have following differences in comparison with pg_trgm: 1) Focus on acceleration of edit distance (e.g. levenshtein distance) queries and LIKE/ILIKE queries 2) Support of various q 3) Support of positional q-grams (q-gram which contain position in which it appears in text) in GIN which give a hope to handle edit distance queries better than ordinal q-grams 4) LIKE/ILIKE acceleration even in case when no q-gram can be extracted from wildcard (via partial match) 5) Support of various signature size in GiST 6) Store exact string value in leaf pages of GiST index tree (In spite of set of q-grams, exact string value takes less space and allow exact filtering for LIKE/ILIKE and edit distance filtering) *Benefits to the PostgreSQL Community* Additional support of q-gram indexing will consolidate PostgreSQL as most advanced DBMS in fuzzy search on string collections. Q-gram indexes will be applicable for industry tasks. *Quantifiable results* Acceleration of edit distance queries and LIKE/ILIKE queries. *Project Details* *Q-gram index for LIKE/ILIKE queries* Basic idea is so [1]: 1) Extract q-grams which should anyway present in string which conforms to wildcard. 2) Filter only string which contain q-grams extracted on previous step. As the rule, recheck is required. Currently this approach was impelemented for pg_trgm, but there is following limitation besides fixed q. In some cases we can extract no q-grams from wildcard (for example '%zz%' and q = 3). In GIN we could filter all q-grams which starts from 'zz' using partial match. But it require to store original q-gram in index (in pg_trgm crc32 in stored in the case of multibyte encoding). *Using q-gram index for edit distance queries* Number of common trigrams in strings p and r is at least max{length(p), length(r)} - q + 1 - k*q [2,5]. If p is query string and r is indexed string, we don't know exactly length of r. Then we can use following minimal common q-grams number bound length(p) - q + 1 - k*q. This bound allow us to filter strings using q-gram GIN and GiST indexes. Moreover, when filtering using GIN or GiST index we know about absence in indexing string of particular q-grams. In some cases it should be more effective then minimal bound of common q-grams. For examle, if query string is 'abcdefg' then absence of trigrams 'abc' and 'efg' can ensure us that minimal possible edit distance is 2. *Using positional q-gram index for edit distance queries* In string to set of q-grams decomposition not only q-gram presence itself but also q-gram position is useful information [3,4,5]. GIN index can store q-gram position as lower part of key. Q-gram position can be-used for edit distance filtering. For example, if we extract q-gram x with position l from query with threshold k, then corresponding q-gram in indexed string should have position in interval [l-k,l+k]. This condition can be used for more effective filtering than with ordinal q-grams. *Questions to discuss with community* 1) Find a way to give additional parameters to index (value of q and signature size for GiST). If it wouldn't be possible to add extension-specific parameters to GiST and GIN indexes then distinct opclasses can be created for various parameters values; 2) Find a way to represent edit distance query. Edit distance query require two parameters query string and threshold. Threshold is maximum edit distance between result string and query string. Similar problem presents in pg_trgm for trigram similarity query: we need threshold of matchind trigrams amount. In pg_trgm threshold is declared as global session parameter and it can be manipulated by set_limit show_limit. However this approach have some limitation. For example, different thresholds can't be combined in same query. Probably, we should consider using of composite type for edit distance query respresentation. 3) Find appropriate place for proposed functionality: core, contrib module or a separate project. *Inch-stones (project broken into small, distinct chunks)* 1) Produce a solution af architecture questions with help of community. 2) Edit distance operator implementation 3) Implement GIN qgram index b) Index building implementation c) Edit distance strategy implementation d) LIKE/ILIKE strategy implementation 4) Implement GIN pqgram index a) Index building implementation b) Edit distance strategy implementation c) LIKE/ILIKE strategy implementati
Re: [HACKERS] Proposal: q-gram GIN and GiST indexes
On Mon, Mar 28, 2011 at 7:27 PM, Tom Lane wrote: > Robert Haas writes: > > I'm afraid I don't know this code well enough to give you any > > meaningful feedback, but I hope someone will. > > Really Oleg and Teodor are the only people well-qualified to comment on > such stuff. (It sounds reasonable to me, but I wouldn't know if there > are problems in the idea.) They may be too busy right at the moment. > Thank you for reply. I'm going to ask Oleg and Teodor for their feedback. With best regards, Alexander Korotkov.
Re: [HACKERS] Proposal: q-gram GIN and GiST indexes
Robert Haas writes: > On Fri, Mar 25, 2011 at 1:32 PM, Alexander Korotkov > wrote: >> I would like to ask you about currency of the work above. I propose to >> develop functionality of GIN and GiST q-gram indexes with following >> features: > I'm afraid I don't know this code well enough to give you any > meaningful feedback, but I hope someone will. Really Oleg and Teodor are the only people well-qualified to comment on such stuff. (It sounds reasonable to me, but I wouldn't know if there are problems in the idea.) They may be too busy right at the moment. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: q-gram GIN and GiST indexes
On Fri, Mar 25, 2011 at 1:32 PM, Alexander Korotkov wrote: > I would like to ask you about currency of the work above. I propose to > develop functionality of GIN and GiST q-gram indexes with following > features: > 1) Handle edit distance (e.g. levenshtein distance) and LIKE/ILIKE > queries(using GIN partial match if no full q-grams can be extracted > from wildcard) > 2) Support of various q > 3) Support of positional q-grams in GIN (for more effective edit > distance filtering) > 4) Various signature size in GiST > As you can see, there are some significant differences from pg_trgm. > Do you see this functionality useful? If you think this functionality > useful, where do you like to see it: separate project, contrib module, > core (of course, in the case when code have sufficient quality)? > I have stong confidence level about implementability of this project > in few month. That's why I could propose this as an GSoC project. I'm afraid I don't know this code well enough to give you any meaningful feedback, but I hope someone will. All - note that Alexander has contributed a number of patches in this area previously that have been committed, so it'd be great if we can do our part to help him continue contributing. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal: q-gram GIN and GiST indexes
On Fri, Mar 25, 2011 at 8:32 PM, Alexander Korotkov wrote: > I would like to ask you about currency of the work above. This seems to be a mess of words. Sorry for my bad english. Actually, I meant that I need a appraisal of my proposal. With best regards, Alexander Korotkov. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Proposal: q-gram GIN and GiST indexes
Hackers, I would like to ask you about currency of the work above. I propose to develop functionality of GIN and GiST q-gram indexes with following features: 1) Handle edit distance (e.g. levenshtein distance) and LIKE/ILIKE queries(using GIN partial match if no full q-grams can be extracted from wildcard) 2) Support of various q 3) Support of positional q-grams in GIN (for more effective edit distance filtering) 4) Various signature size in GiST As you can see, there are some significant differences from pg_trgm. Do you see this functionality useful? If you think this functionality useful, where do you like to see it: separate project, contrib module, core (of course, in the case when code have sufficient quality)? I have stong confidence level about implementability of this project in few month. That's why I could propose this as an GSoC project. With best regards, Alexander Korotkov. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers