Re: [HACKERS] Rethinking our fulltext phrase-search implementation

2016-12-21 Thread Tom Lane
Artur Zakirov  writes:
> Otherwise it seems that queries like 'a <-> (b & c)' will always return 
> false. Then we need maybe some warning message.

Well, the query as written is pointless, but it could be useful with
something other than "b" and "c" as the AND-ed terms.  In this usage
"&" is equivalent to "<0>", which we know has corner-case uses.

I'm not inclined to issue any sort of warning for unsatisfiable queries.
We don't issue a warning when a SQL WHERE condition collapses to constant
FALSE, and that seems like exactly the same sort of situation.

It strikes me though that the documentation should point this out.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Rethinking our fulltext phrase-search implementation

2016-12-21 Thread Artur Zakirov

Hello Tom,

On 17.12.2016 21:36, Tom Lane wrote:


4. The transformations are wrong anyway.  The OR case I showed above is
all right, but as I argued in <24331.1480199...@sss.pgh.pa.us>, the AND
case is not:

regression=# select 'a <-> (b & c)'::tsquery;
  tsquery
---
 'a' <-> 'b' & 'a' <-> 'c'
(1 row)

This matches 'a b a c', because 'a <-> b' and 'a <-> c' can each be
matched at different places in that text; but it seems highly unlikely to
me that that's what the writer of such a query wanted.  (If she did want
that, she would write it that way to start with.)  NOT is not very nice
either:


If I'm not mistaken PostgreSQL 9.6 and master with patch 
"fix-phrase-search.patch" return false for the query:


select 'a b a c' @@ 'a <-> (b & c)'::tsquery;
 ?column?
--
 f
(1 row)

I agree that such query is confusing. Maybe it is better to return true 
for such queries?
Otherwise it seems that queries like 'a <-> (b & c)' will always return 
false. Then we need maybe some warning message.


--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Rethinking our fulltext phrase-search implementation

2016-12-20 Thread Tom Lane
I wrote:
> I've been thinking about how to fix the problem Andreas Seltenreich
> reported at
> https://postgr.es/m/87eg1y2s3x@credativ.de

Attached is a proposed patch that deals with the problems discussed
here and in <26706.1482087...@sss.pgh.pa.us>.  Is anyone interested
in reviewing this, or should I just push it?

BTW, I noticed that ts_headline() seems to not behave all that nicely
for phrase searches, eg

regression=# SELECT ts_headline('simple', '1 2 3 1 3'::text, '2 <-> 3', 
'ShortWord=0');
  ts_headline   

 1 2 3 1 3
(1 row)

Highlighting the second "3", which is not a match, seems pretty dubious.
Negative items are even worse, they don't change the results at all:

regression=# SELECT ts_headline('simple', '1 2 3 1 3'::text, '!2 <-> 3', 
'ShortWord=0');
  ts_headline   

 1 2 3 1 3
(1 row)

However, the code involved seems unrelated to the present patch, and
it's also about as close to completely uncommented as I've seen anywhere
in the PG code base.  So I'm not excited about touching it.

regards, tom lane

diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml
index 67d0c34..464ce83 100644
*** a/doc/src/sgml/datatype.sgml
--- b/doc/src/sgml/datatype.sgml
*** SELECT 'fat  rat  ! cat'::tsqu
*** 3959,3973 
  tsquery 
  
   'fat'  'rat'  !'cat'
- 
- SELECT '(fat | rat) - cat'::tsquery;
-   tsquery
- ---
-  'fat' - 'cat' | 'rat' - 'cat'
  
- 
-  The last example demonstrates that tsquery sometimes
-  rearranges nested operators into a logically equivalent formulation.
  
  
  
--- 3959,3965 
diff --git a/doc/src/sgml/textsearch.sgml b/doc/src/sgml/textsearch.sgml
index 2da7595..bc33a70 100644
*** a/doc/src/sgml/textsearch.sgml
--- b/doc/src/sgml/textsearch.sgml
*** SELECT 'fat  cow'::tsquery @@ 'a fa
*** 264,270 
  text, any more than a tsvector is.  A tsquery
  contains search terms, which must be already-normalized lexemes, and
  may combine multiple terms using AND, OR, NOT, and FOLLOWED BY operators.
! (For details see .)  There are
  functions to_tsquery, plainto_tsquery,
  and phraseto_tsquery
  that are helpful in converting user-written text into a proper
--- 264,270 
  text, any more than a tsvector is.  A tsquery
  contains search terms, which must be already-normalized lexemes, and
  may combine multiple terms using AND, OR, NOT, and FOLLOWED BY operators.
! (For syntax details see .)  There are
  functions to_tsquery, plainto_tsquery,
  and phraseto_tsquery
  that are helpful in converting user-written text into a proper
*** text @@ text
*** 323,328 
--- 323,330 
  at least one of its arguments must appear, while the ! (NOT)
  operator specifies that its argument must not appear in
  order to have a match.
+ For example, the query fat  ! rat matches documents that
+ contain fat but not rat.
 
  
 
*** SELECT phraseto_tsquery('the cats ate th
*** 377,382 
--- 379,401 
  then , then -,
  and ! most tightly.
 
+ 
+
+ It's worth noticing that the AND/OR/NOT operators mean something subtly
+ different when they are within the arguments of a FOLLOWED BY operator
+ than when they are not, because then the position of the match is
+ significant.  Normally, !x matches only documents that do not
+ contain x anywhere.  But x - !y
+ matches x if it is not immediately followed by y;
+ an occurrence of y elsewhere in the document does not prevent
+ a match.  Another example is that x  y normally only
+ requires that x and y both appear somewhere in the
+ document, but (x  y) - z requires x
+ and y to match at the same place, immediately before
+ a z.  Thus this query behaves differently from x
+ - z  y - z, which would match a document
+ containing two separate sequences x z and y z.
+

  

diff --git a/src/backend/utils/adt/tsginidx.c b/src/backend/utils/adt/tsginidx.c
index efc111e..3e0a444 100644
*** a/src/backend/utils/adt/tsginidx.c
--- b/src/backend/utils/adt/tsginidx.c
*** checkcondition_gin(void *checkval, Query
*** 212,218 
   * Evaluate tsquery boolean expression using ternary logic.
   */
  static GinTernaryValue
! TS_execute_ternary(GinChkVal *gcv, QueryItem *curitem)
  {
  	GinTernaryValue val1,
  val2,
--- 212,218 
   * Evaluate tsquery boolean expression using ternary logic.
   */
  static GinTernaryValue
! TS_execute_ternary(GinChkVal *gcv, QueryItem *curitem, bool in_phrase)
  {
  	GinTernaryValue val1,
  val2,
*** TS_execute_ternary(GinChkVal *gcv, Query
*** 230,236 
  	switch (curitem->qoperator.oper)
  	{
  		case OP_NOT:
! 			result = TS_execute_ternary(gcv, 

Re: [HACKERS] Rethinking our fulltext phrase-search implementation

2016-12-17 Thread Tom Lane
I wrote:
> It's worth noting that with these rules, phrase searches will act as
> though "!x" always matches somewhere; for instance "!a <-> !b" will match
> any tsvector.  I argue that this is not wrong, not even if the tsvector is
> empty: there could have been adjacent stopwords matching !a and !b in the
> original text.  Since we've adjusted the phrase matching rules to treat
> stopwords as unknown-but-present words in a phrase, I think this is
> consistent.  It's also pretty hard to assert this is wrong and at the same
> time accept "!a <-> b" matching b at the start of the document.

To clarify this point, I'm imagining that the patch would include
documentation changes like the attached.

regards, tom lane

diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml
index 67d0c34..464ce83 100644
*** a/doc/src/sgml/datatype.sgml
--- b/doc/src/sgml/datatype.sgml
*** SELECT 'fat  rat  ! cat'::tsqu
*** 3959,3973 
  tsquery 
  
   'fat'  'rat'  !'cat'
- 
- SELECT '(fat | rat) - cat'::tsquery;
-   tsquery
- ---
-  'fat' - 'cat' | 'rat' - 'cat'
  
- 
-  The last example demonstrates that tsquery sometimes
-  rearranges nested operators into a logically equivalent formulation.
  
  
  
--- 3959,3965 
diff --git a/doc/src/sgml/textsearch.sgml b/doc/src/sgml/textsearch.sgml
index 2da7595..bc33a70 100644
*** a/doc/src/sgml/textsearch.sgml
--- b/doc/src/sgml/textsearch.sgml
*** text @@ text
*** 323,328 
--- 323,330 
  at least one of its arguments must appear, while the ! (NOT)
  operator specifies that its argument must not appear in
  order to have a match.
+ For example, the query fat  ! rat matches documents that
+ contain fat but not rat.
 
  
 
*** SELECT phraseto_tsquery('the cats ate th
*** 377,382 
--- 379,401 
  then , then -,
  and ! most tightly.
 
+ 
+
+ It's worth noticing that the AND/OR/NOT operators mean something subtly
+ different when they are within the arguments of a FOLLOWED BY operator
+ than when they are not, because then the position of the match is
+ significant.  Normally, !x matches only documents that do not
+ contain x anywhere.  But x - !y
+ matches x if it is not immediately followed by y;
+ an occurrence of y elsewhere in the document does not prevent
+ a match.  Another example is that x  y normally only
+ requires that x and y both appear somewhere in the
+ document, but (x  y) - z requires x
+ and y to match at the same place, immediately before
+ a z.  Thus this query behaves differently from x
+ - z  y - z, which would match a document
+ containing two separate sequences x z and y z.
+

  


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Rethinking our fulltext phrase-search implementation

2016-12-17 Thread Tom Lane
I've been thinking about how to fix the problem Andreas Seltenreich
reported at
https://postgr.es/m/87eg1y2s3x@credativ.de

The core of that problem is that the phrase-search patch attempts to
restructure tsquery trees so that there are no operators underneath a
PHRASE operator, except possibly other PHRASE operators.  The goal is
to not have to deal with identifying specific match locations except
while processing a PHRASE operator.  Well, that's an OK idea if it can
be implemented reasonably, but there are several problems with the code
as it stands:

1. The transformation is done by normalize_phrase_tree(), which is
currently invoked (via cleanup_fakeval_and_phrase()) in a rather
ad-hoc set of places including tsqueryin() and the various variants
of to_tsquery().  This leaves lots of scope for errors of omission,
which is exactly the proximate cause of Andreas' bug: ts_rewrite() is
neglecting to re-normalize its result tsquery.  I have little faith
that there aren't other similar errors of omission today, and even
less that we won't introduce more in future.

2. Because the transformation is invoked as early as tsqueryin(),
it's user-visible:

regression=# select 'a <-> (b | c)'::tsquery;
  tsquery  
---
 'a' <-> 'b' | 'a' <-> 'c'
(1 row)

This is confusing to users, and I do not think it's a good idea to
expose what's basically an optimization detail this way.  For stored
tsqueries, we're frozen into this approach forever even if we later
decide that checking for 'a' twice isn't such a hot idea.

3. Worse, early application of the transformation creates problems for
operations such as ts_rewrite: a rewrite might fail to match, or produce
surprising results, because the query trees actually being operated on
aren't what the user probably thinks they are.  At a minimum, to get
consistent results I think we'd have to re-normalize after *each step*
of ts_rewrite, not only at the end.  That would be expensive, and it's
far from clear that it would eliminate all the surprises.

4. The transformations are wrong anyway.  The OR case I showed above is
all right, but as I argued in <24331.1480199...@sss.pgh.pa.us>, the AND
case is not:

regression=# select 'a <-> (b & c)'::tsquery;
  tsquery  
---
 'a' <-> 'b' & 'a' <-> 'c'
(1 row)

This matches 'a b a c', because 'a <-> b' and 'a <-> c' can each be
matched at different places in that text; but it seems highly unlikely to
me that that's what the writer of such a query wanted.  (If she did want
that, she would write it that way to start with.)  NOT is not very nice
either:

regression=# select '!a <-> !b'::tsquery;
  tsquery   

 !'a' & !( !( 'a' <-> 'b' ) & 'b' )
(1 row)

If you dig through that, you realize that the <-> test is pointless;
the query can't match any vector containing 'a', so certainly the <->
condition can't succeed, making this an expensive way to spell "!a & !b".
And that's not the right semantics anyway: if 'x y' matches this query,
which it does and should, why doesn't 'x y a' match?

5. The case with only one NOT under <-> looks like this:

regression=# select '!a <-> b'::tsquery;
tsquery 

 !( 'a' <-> 'b' ) & 'b'
(1 row)

This is more or less in line with the naive view of what its semantics
should be, although I notice that it will match a 'b' at position 1,
which might be a surprising result.  We're not out of the woods though:
this will (and should) match, eg, 'c b a'.  But that means that '!a & b'
is not a safe lossy approximation to '!a <-> b', which is an assumption
that is wired into a number of places.  Simple testing shows that indeed
GIN and GIST index searches get the wrong answers, different from what
you get in a non-indexed search, for queries like this.


So we have a mess here, which needs to be cleaned up quite aside from the
fact that it's capable of provoking Assert failures and/or crashes.

I thought for awhile about moving the normalize_phrase_tree() work
to occur at the start of tsquery execution, rather than in tsqueryin()
et al.  That addresses points 1..3, but doesn't by itself do anything
for points 4 or 5.  Also, it would be expensive because right now
execution works directly from the flat tsquery representation; there's no
conversion to an explicit tree structure on which normalize_phrase_tree()
could conveniently be applied.  Nor do we know at the start whether the
tsquery contains any PHRASE operators.

On the whole, it seems like the best fix would be to get rid of
normalize_phrase_tree() altogether, and instead fix the TS_execute()
engine so it can cope with regular operators underneath phrase operators.
We can still have the optimization of not worrying about lexeme locations
when no phrase operator has been seen, but we have to change
TS_phrase_execute to cope with plain AND/OR/NOT operators and calculate
proper location