Re: [HACKERS] Rethinking our fulltext phrase-search implementation
Artur Zakirovwrites: > Otherwise it seems that queries like 'a <-> (b & c)' will always return > false. Then we need maybe some warning message. Well, the query as written is pointless, but it could be useful with something other than "b" and "c" as the AND-ed terms. In this usage "&" is equivalent to "<0>", which we know has corner-case uses. I'm not inclined to issue any sort of warning for unsatisfiable queries. We don't issue a warning when a SQL WHERE condition collapses to constant FALSE, and that seems like exactly the same sort of situation. It strikes me though that the documentation should point this out. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Rethinking our fulltext phrase-search implementation
Hello Tom, On 17.12.2016 21:36, Tom Lane wrote: 4. The transformations are wrong anyway. The OR case I showed above is all right, but as I argued in <24331.1480199...@sss.pgh.pa.us>, the AND case is not: regression=# select 'a <-> (b & c)'::tsquery; tsquery --- 'a' <-> 'b' & 'a' <-> 'c' (1 row) This matches 'a b a c', because 'a <-> b' and 'a <-> c' can each be matched at different places in that text; but it seems highly unlikely to me that that's what the writer of such a query wanted. (If she did want that, she would write it that way to start with.) NOT is not very nice either: If I'm not mistaken PostgreSQL 9.6 and master with patch "fix-phrase-search.patch" return false for the query: select 'a b a c' @@ 'a <-> (b & c)'::tsquery; ?column? -- f (1 row) I agree that such query is confusing. Maybe it is better to return true for such queries? Otherwise it seems that queries like 'a <-> (b & c)' will always return false. Then we need maybe some warning message. -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Rethinking our fulltext phrase-search implementation
I wrote: > I've been thinking about how to fix the problem Andreas Seltenreich > reported at > https://postgr.es/m/87eg1y2s3x@credativ.de Attached is a proposed patch that deals with the problems discussed here and in <26706.1482087...@sss.pgh.pa.us>. Is anyone interested in reviewing this, or should I just push it? BTW, I noticed that ts_headline() seems to not behave all that nicely for phrase searches, eg regression=# SELECT ts_headline('simple', '1 2 3 1 3'::text, '2 <-> 3', 'ShortWord=0'); ts_headline 1 2 3 1 3 (1 row) Highlighting the second "3", which is not a match, seems pretty dubious. Negative items are even worse, they don't change the results at all: regression=# SELECT ts_headline('simple', '1 2 3 1 3'::text, '!2 <-> 3', 'ShortWord=0'); ts_headline 1 2 3 1 3 (1 row) However, the code involved seems unrelated to the present patch, and it's also about as close to completely uncommented as I've seen anywhere in the PG code base. So I'm not excited about touching it. regards, tom lane diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml index 67d0c34..464ce83 100644 *** a/doc/src/sgml/datatype.sgml --- b/doc/src/sgml/datatype.sgml *** SELECT 'fat rat ! cat'::tsqu *** 3959,3973 tsquery 'fat' 'rat' !'cat' - - SELECT '(fat | rat) - cat'::tsquery; - tsquery - --- - 'fat' - 'cat' | 'rat' - 'cat' - - The last example demonstrates that tsquery sometimes - rearranges nested operators into a logically equivalent formulation. --- 3959,3965 diff --git a/doc/src/sgml/textsearch.sgml b/doc/src/sgml/textsearch.sgml index 2da7595..bc33a70 100644 *** a/doc/src/sgml/textsearch.sgml --- b/doc/src/sgml/textsearch.sgml *** SELECT 'fat cow'::tsquery @@ 'a fa *** 264,270 text, any more than a tsvector is. A tsquery contains search terms, which must be already-normalized lexemes, and may combine multiple terms using AND, OR, NOT, and FOLLOWED BY operators. ! (For details see .) There are functions to_tsquery, plainto_tsquery, and phraseto_tsquery that are helpful in converting user-written text into a proper --- 264,270 text, any more than a tsvector is. A tsquery contains search terms, which must be already-normalized lexemes, and may combine multiple terms using AND, OR, NOT, and FOLLOWED BY operators. ! (For syntax details see .) There are functions to_tsquery, plainto_tsquery, and phraseto_tsquery that are helpful in converting user-written text into a proper *** text @@ text *** 323,328 --- 323,330 at least one of its arguments must appear, while the ! (NOT) operator specifies that its argument must not appear in order to have a match. + For example, the query fat ! rat matches documents that + contain fat but not rat. *** SELECT phraseto_tsquery('the cats ate th *** 377,382 --- 379,401 then , then -, and ! most tightly. + + + It's worth noticing that the AND/OR/NOT operators mean something subtly + different when they are within the arguments of a FOLLOWED BY operator + than when they are not, because then the position of the match is + significant. Normally, !x matches only documents that do not + contain x anywhere. But x - !y + matches x if it is not immediately followed by y; + an occurrence of y elsewhere in the document does not prevent + a match. Another example is that x y normally only + requires that x and y both appear somewhere in the + document, but (x y) - z requires x + and y to match at the same place, immediately before + a z. Thus this query behaves differently from x + - z y - z, which would match a document + containing two separate sequences x z and y z. + diff --git a/src/backend/utils/adt/tsginidx.c b/src/backend/utils/adt/tsginidx.c index efc111e..3e0a444 100644 *** a/src/backend/utils/adt/tsginidx.c --- b/src/backend/utils/adt/tsginidx.c *** checkcondition_gin(void *checkval, Query *** 212,218 * Evaluate tsquery boolean expression using ternary logic. */ static GinTernaryValue ! TS_execute_ternary(GinChkVal *gcv, QueryItem *curitem) { GinTernaryValue val1, val2, --- 212,218 * Evaluate tsquery boolean expression using ternary logic. */ static GinTernaryValue ! TS_execute_ternary(GinChkVal *gcv, QueryItem *curitem, bool in_phrase) { GinTernaryValue val1, val2, *** TS_execute_ternary(GinChkVal *gcv, Query *** 230,236 switch (curitem->qoperator.oper) { case OP_NOT: ! result = TS_execute_ternary(gcv,
Re: [HACKERS] Rethinking our fulltext phrase-search implementation
I wrote: > It's worth noting that with these rules, phrase searches will act as > though "!x" always matches somewhere; for instance "!a <-> !b" will match > any tsvector. I argue that this is not wrong, not even if the tsvector is > empty: there could have been adjacent stopwords matching !a and !b in the > original text. Since we've adjusted the phrase matching rules to treat > stopwords as unknown-but-present words in a phrase, I think this is > consistent. It's also pretty hard to assert this is wrong and at the same > time accept "!a <-> b" matching b at the start of the document. To clarify this point, I'm imagining that the patch would include documentation changes like the attached. regards, tom lane diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml index 67d0c34..464ce83 100644 *** a/doc/src/sgml/datatype.sgml --- b/doc/src/sgml/datatype.sgml *** SELECT 'fat rat ! cat'::tsqu *** 3959,3973 tsquery 'fat' 'rat' !'cat' - - SELECT '(fat | rat) - cat'::tsquery; - tsquery - --- - 'fat' - 'cat' | 'rat' - 'cat' - - The last example demonstrates that tsquery sometimes - rearranges nested operators into a logically equivalent formulation. --- 3959,3965 diff --git a/doc/src/sgml/textsearch.sgml b/doc/src/sgml/textsearch.sgml index 2da7595..bc33a70 100644 *** a/doc/src/sgml/textsearch.sgml --- b/doc/src/sgml/textsearch.sgml *** text @@ text *** 323,328 --- 323,330 at least one of its arguments must appear, while the ! (NOT) operator specifies that its argument must not appear in order to have a match. + For example, the query fat ! rat matches documents that + contain fat but not rat. *** SELECT phraseto_tsquery('the cats ate th *** 377,382 --- 379,401 then , then -, and ! most tightly. + + + It's worth noticing that the AND/OR/NOT operators mean something subtly + different when they are within the arguments of a FOLLOWED BY operator + than when they are not, because then the position of the match is + significant. Normally, !x matches only documents that do not + contain x anywhere. But x - !y + matches x if it is not immediately followed by y; + an occurrence of y elsewhere in the document does not prevent + a match. Another example is that x y normally only + requires that x and y both appear somewhere in the + document, but (x y) - z requires x + and y to match at the same place, immediately before + a z. Thus this query behaves differently from x + - z y - z, which would match a document + containing two separate sequences x z and y z. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Rethinking our fulltext phrase-search implementation
I've been thinking about how to fix the problem Andreas Seltenreich reported at https://postgr.es/m/87eg1y2s3x@credativ.de The core of that problem is that the phrase-search patch attempts to restructure tsquery trees so that there are no operators underneath a PHRASE operator, except possibly other PHRASE operators. The goal is to not have to deal with identifying specific match locations except while processing a PHRASE operator. Well, that's an OK idea if it can be implemented reasonably, but there are several problems with the code as it stands: 1. The transformation is done by normalize_phrase_tree(), which is currently invoked (via cleanup_fakeval_and_phrase()) in a rather ad-hoc set of places including tsqueryin() and the various variants of to_tsquery(). This leaves lots of scope for errors of omission, which is exactly the proximate cause of Andreas' bug: ts_rewrite() is neglecting to re-normalize its result tsquery. I have little faith that there aren't other similar errors of omission today, and even less that we won't introduce more in future. 2. Because the transformation is invoked as early as tsqueryin(), it's user-visible: regression=# select 'a <-> (b | c)'::tsquery; tsquery --- 'a' <-> 'b' | 'a' <-> 'c' (1 row) This is confusing to users, and I do not think it's a good idea to expose what's basically an optimization detail this way. For stored tsqueries, we're frozen into this approach forever even if we later decide that checking for 'a' twice isn't such a hot idea. 3. Worse, early application of the transformation creates problems for operations such as ts_rewrite: a rewrite might fail to match, or produce surprising results, because the query trees actually being operated on aren't what the user probably thinks they are. At a minimum, to get consistent results I think we'd have to re-normalize after *each step* of ts_rewrite, not only at the end. That would be expensive, and it's far from clear that it would eliminate all the surprises. 4. The transformations are wrong anyway. The OR case I showed above is all right, but as I argued in <24331.1480199...@sss.pgh.pa.us>, the AND case is not: regression=# select 'a <-> (b & c)'::tsquery; tsquery --- 'a' <-> 'b' & 'a' <-> 'c' (1 row) This matches 'a b a c', because 'a <-> b' and 'a <-> c' can each be matched at different places in that text; but it seems highly unlikely to me that that's what the writer of such a query wanted. (If she did want that, she would write it that way to start with.) NOT is not very nice either: regression=# select '!a <-> !b'::tsquery; tsquery !'a' & !( !( 'a' <-> 'b' ) & 'b' ) (1 row) If you dig through that, you realize that the <-> test is pointless; the query can't match any vector containing 'a', so certainly the <-> condition can't succeed, making this an expensive way to spell "!a & !b". And that's not the right semantics anyway: if 'x y' matches this query, which it does and should, why doesn't 'x y a' match? 5. The case with only one NOT under <-> looks like this: regression=# select '!a <-> b'::tsquery; tsquery !( 'a' <-> 'b' ) & 'b' (1 row) This is more or less in line with the naive view of what its semantics should be, although I notice that it will match a 'b' at position 1, which might be a surprising result. We're not out of the woods though: this will (and should) match, eg, 'c b a'. But that means that '!a & b' is not a safe lossy approximation to '!a <-> b', which is an assumption that is wired into a number of places. Simple testing shows that indeed GIN and GIST index searches get the wrong answers, different from what you get in a non-indexed search, for queries like this. So we have a mess here, which needs to be cleaned up quite aside from the fact that it's capable of provoking Assert failures and/or crashes. I thought for awhile about moving the normalize_phrase_tree() work to occur at the start of tsquery execution, rather than in tsqueryin() et al. That addresses points 1..3, but doesn't by itself do anything for points 4 or 5. Also, it would be expensive because right now execution works directly from the flat tsquery representation; there's no conversion to an explicit tree structure on which normalize_phrase_tree() could conveniently be applied. Nor do we know at the start whether the tsquery contains any PHRASE operators. On the whole, it seems like the best fix would be to get rid of normalize_phrase_tree() altogether, and instead fix the TS_execute() engine so it can cope with regular operators underneath phrase operators. We can still have the optimization of not worrying about lexeme locations when no phrase operator has been seen, but we have to change TS_phrase_execute to cope with plain AND/OR/NOT operators and calculate proper location