Re: [HACKERS] [GENERAL] Incorrect FTS result with GIN index

2010-07-29 Thread Oleg Bartunov

Tom,

we're not able to work on this right now, so go ahead if you have time.
I also wonder why I got the right result :) I just repeated the query:

test=# select count(*) from search_tab
       where (to_tsvector('german', keywords) @@ to_tsquery('german', 'ee:* & dd:*'));
 count
-------
   123
(1 row)

Time: 26.185 ms


Oleg
On Wed, 28 Jul 2010, Tom Lane wrote:


Oleg Bartunov o...@sai.msu.su writes:

you can download dump http://mira.sai.msu.su/~megera/tmp/search_tab.dump


Hmm ... I'm not sure why you're failing to reproduce it, because it's
falling over pretty easily for me.  After poking at it for awhile,
I am of the opinion that scanGetItem's handling of multiple keys is
fundamentally broken and needs to be rewritten completely.  The
particular case I'm seeing here is that one key returns this sequence of
TIDs/lossy flags:

...
1085/4 0
1086/65535 1
1087/4 0
...

while the other one returns this:

...
1083/11 0
1086/6 0
1086/10 0
1087/10 0
...

and what comes out of scanGetItem is just

...
1086/6 1
...

because after returning that, on the next call it advances both input
keystreams.  So 1086/10 should be visited and is not.

I think that depending on the previous entryRes state to determine what
to do is basically unworkable, and what should probably be done instead
is to remember the last-returned TID and advance keystreams with TIDs <=
that.  I haven't quite thought through how that should interact with
lossy-page TIDs but it seems more robust than what we've got.

I'm also noticing that the ANDing behavior for the ee:* & dd:* query
style seems very much stupider than it needs to be --- it's returning
lossy pages that very obviously don't need to be examined because the
other keystream has no match at all on that page.  But I haven't had
time to probe into the reason why.

I'm out of time for today, do you want to work on it?

regards, tom lane



Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [GENERAL] Incorrect FTS result with GIN index

2010-07-29 Thread Tom Lane
Oleg Bartunov o...@sai.msu.su writes:
 I also wonder why I got the right result :) I just repeated the query:

 test=# select count(*) from search_tab
        where (to_tsvector('german', keywords) @@ to_tsquery('german', 'ee:* & dd:*'));
  count
 -------
    123
 (1 row)

Yeah, that case works (though I think it's unnecessarily slow).  The one
that gives the wrong answer is the equivalent form with two AND'ed @@
operators.

regards, tom lane



Re: [HACKERS] [GENERAL] Incorrect FTS result with GIN index

2010-07-29 Thread Oleg Bartunov

On Thu, 29 Jul 2010, Tom Lane wrote:


Oleg Bartunov o...@sai.msu.su writes:

I also wonder why I got the right result :) I just repeated the query:



test=# select count(*) from search_tab
       where (to_tsvector('german', keywords) @@ to_tsquery('german', 'ee:* & dd:*'));
 count
-------
   123
(1 row)


Yeah, that case works (though I think it's unnecessarily slow).  The one
that gives the wrong answer is the equivalent form with two AND'ed @@
operators.


hmm, that query works too :)

test=# select count(*) from search_tab
       where (to_tsvector('german', keywords) @@ to_tsquery('german', 'ee:*'))
         and (to_tsvector('german', keywords) @@ to_tsquery('german', 'dd:*'));
 count
-------
   123
(1 row)

Time: 26.155 ms


test=# explain analyze select count(*) from search_tab
       where (to_tsvector('german', keywords) @@ to_tsquery('german', 'ee:*'))
         and (to_tsvector('german', keywords) @@ to_tsquery('german', 'dd:*'));
                                  QUERY PLAN
------------------------------------------------------------------------------
 Aggregate  (cost=103.87..103.88 rows=1 width=0) (actual time=22.819..22.820 rows=1 loops=1)
   ->  Bitmap Heap Scan on search_tab  (cost=5.21..103.80 rows=25 width=0) (actual time=22.677..22.799 rows=123 loops=1)
         Recheck Cond: ((to_tsvector('german'::regconfig, keywords) @@ '''ee'':*'::tsquery) AND (to_tsvector('german'::regconfig, keywords) @@ '''dd'':*'::tsquery))
         ->  Bitmap Index Scan on idx_keywords_ger  (cost=0.00..5.21 rows=25 width=0) (actual time=22.655..22.655 rows=123 loops=1)
               Index Cond: ((to_tsvector('german'::regconfig, keywords) @@ '''ee'':*'::tsquery) AND (to_tsvector('german'::regconfig, keywords) @@ '''dd'':*'::tsquery))
 Total runtime: 22.865 ms



Regards,
Oleg



Re: [HACKERS] [GENERAL] Incorrect FTS result with GIN index

2010-07-29 Thread Tom Lane
Oleg Bartunov o...@sai.msu.su writes:
 On Thu, 29 Jul 2010, Tom Lane wrote:
 Yeah, that case works (though I think it's unnecessarily slow).  The one
 that gives the wrong answer is the equivalent form with two AND'ed @@
 operators.

 hmm, that query works too :)

There may be some platform dependency involved --- in particular, you
wouldn't see the issue unless one keystream has two nonlossy TIDs on the
same page as the other one has a lossy TID, so it's going to depend on
the placement of heap rows.  Anyway, I can reproduce it just by loading
the given dump, on both 8.4 and HEAD.  Will work on a fix.
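
The placement condition described above can be written down as a small check. The following is a minimal Python sketch, not PostgreSQL code: the function name pages_at_risk, the (block, offset) tuple representation, and the use of offset 65535 as the lossy marker (taken from the trace elsewhere in this thread) are assumptions made for illustration. The second data set is hypothetical and only shows how a different heap row placement would hide the problem.

```python
from collections import Counter

LOSSY = 65535  # offset that marks a whole-page (lossy) TID in the thread's trace


def pages_at_risk(s1, s2):
    """Pages where one stream is lossy and the other has >= 2 exact TIDs."""
    def one_way(lossy_side, exact_side):
        lossy_pages = {blk for blk, off in lossy_side if off == LOSSY}
        exact_per_page = Counter(blk for blk, off in exact_side if off != LOSSY)
        return {blk for blk in lossy_pages if exact_per_page[blk] >= 2}
    return one_way(s1, s2) | one_way(s2, s1)


# Placement reported in the thread's trace
key1 = [(1085, 4), (1086, LOSSY), (1087, 4)]
key2 = [(1083, 11), (1086, 6), (1086, 10), (1087, 10)]
print(pages_at_risk(key1, key2))        # {1086}

# Hypothetical placement: the second matching row landed on another page
key2_other = [(1083, 11), (1086, 6), (1087, 10), (1088, 2)]
print(pages_at_risk(key1, key2_other))  # set()
```

With the second placement no page satisfies the condition, which is consistent with the same query giving the right count on one installation and the wrong count on another.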

regards, tom lane



Re: [HACKERS] [GENERAL] Incorrect FTS result with GIN index

2010-07-28 Thread Tom Lane
Oleg Bartunov o...@sai.msu.su writes:
 you can download dump http://mira.sai.msu.su/~megera/tmp/search_tab.dump

Hmm ... I'm not sure why you're failing to reproduce it, because it's
falling over pretty easily for me.  After poking at it for awhile,
I am of the opinion that scanGetItem's handling of multiple keys is
fundamentally broken and needs to be rewritten completely.  The
particular case I'm seeing here is that one key returns this sequence of
TIDs/lossy flags:

...
1085/4 0
1086/65535 1
1087/4 0
...

while the other one returns this:

...
1083/11 0
1086/6 0
1086/10 0
1087/10 0
...

and what comes out of scanGetItem is just

...
1086/6 1
...

because after returning that, on the next call it advances both input
keystreams.  So 1086/10 should be visited and is not.
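
To make the failure concrete, here is a minimal Python sketch of an AND-merge over the two keystreams shown above. It is not the actual scanGetItem code, which is C inside PostgreSQL's GIN implementation. The function names, the (block, offset) tuples, and the rule "advance both streams after every match" are simplifying assumptions chosen only to reproduce the behaviour described.

```python
# Simplified model of the failure; NOT the real scanGetItem. Names are hypothetical.

LOSSY = 65535  # offset that marks a whole-page (lossy) TID in the trace above

# (block, offset) pairs copied from the two keystreams in the message
key1 = [(1085, 4), (1086, LOSSY), (1087, 4)]
key2 = [(1083, 11), (1086, 6), (1086, 10), (1087, 10)]


def tids_match(a, b):
    """Same TID, or same page with at least one side lossy."""
    return a[0] == b[0] and (a[1] == b[1] or LOSSY in (a[1], b[1]))


def and_merge_advance_both(s1, s2):
    """AND two sorted TID streams, advancing BOTH streams after each match."""
    i = j = 0
    out = []
    while i < len(s1) and j < len(s2):
        a, b = s1[i], s2[j]
        if tids_match(a, b):
            # min() picks the exact TID, because LOSSY sorts after real offsets
            recheck = LOSSY in (a[1], b[1])
            out.append((min(a, b), recheck))
            i += 1  # assumed faulty step: the lossy entry is consumed too early
            j += 1
        elif a < b:
            i += 1
        else:
            j += 1
    return out


result = and_merge_advance_both(key1, key2)
print(result)  # [((1086, 6), True)]
```

The lossy entry 1086/65535 is dropped as soon as 1086/6 is returned, so 1086/10 finds no partner in the first stream and is skipped.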

I think that depending on the previous entryRes state to determine what
to do is basically unworkable, and what should probably be done instead
is to remember the last-returned TID and advance keystreams with TIDs <=
that.  I haven't quite thought through how that should interact with
lossy-page TIDs but it seems more robust than what we've got.
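
A minimal Python sketch of that idea follows, as an illustration rather than a proposed patch. It remembers the last-returned TID and, before each step, advances every stream whose current TID is at or before it. The interaction with lossy pages is left open in the message; the sketch assumes a lossy entry is represented as (block, 65535), so it sorts after every exact TID on its page and stays in place until the page is exhausted. That assumption, the function names, and the tuple layout are all hypothetical.

```python
# Sketch of "remember the last-returned TID"; NOT PostgreSQL code.

LOSSY = 65535  # assumed lossy marker; sorts after any real offset on the page

key1 = [(1085, 4), (1086, LOSSY), (1087, 4)]
key2 = [(1083, 11), (1086, 6), (1086, 10), (1087, 10)]


def tids_match(a, b):
    """Same TID, or same page with at least one side lossy."""
    return a[0] == b[0] and (a[1] == b[1] or LOSSY in (a[1], b[1]))


def and_merge_remember_last(s1, s2):
    """AND two sorted TID streams, advancing only streams at or before `last`."""
    i = j = 0
    last = None  # last TID handed back to the caller
    out = []
    while True:
        if last is not None:
            while i < len(s1) and s1[i] <= last:
                i += 1
            while j < len(s2) and s2[j] <= last:
                j += 1
        if i == len(s1) or j == len(s2):
            break
        a, b = s1[i], s2[j]
        if tids_match(a, b):
            last = min(a, b)  # the exact TID, or the lossy one if both are lossy
            out.append((last, LOSSY in (a[1], b[1])))
        elif a < b:
            i += 1
        else:
            j += 1
    return out


fixed = and_merge_remember_last(key1, key2)
print(fixed)  # [((1086, 6), True), ((1086, 10), True)]
```

Because (1086, 65535) is greater than (1086, 6), the lossy entry is not advanced after the first match, and 1086/10 is returned on the next call.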

I'm also noticing that the ANDing behavior for the ee:* & dd:* query
style seems very much stupider than it needs to be --- it's returning
lossy pages that very obviously don't need to be examined because the
other keystream has no match at all on that page.  But I haven't had
time to probe into the reason why.
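
What "don't need to be examined" means for an AND query can be shown with a page-level intersection. This Python sketch does not diagnose the cause, which the message leaves open; the function name and the sample data are hypothetical and only illustrate which pages an AND of two keystreams could possibly match.

```python
LOSSY = 65535  # assumed marker for a whole-page (lossy) TID


def pages_needing_heap_visit(s1, s2):
    """Pages on which BOTH keystreams report at least one entry."""
    return {blk for blk, _ in s1} & {blk for blk, _ in s2}


# Hypothetical data: key1 is lossy on pages 1086 and 1090,
# while key2 has nothing at all on page 1090.
key1 = [(1086, LOSSY), (1090, LOSSY)]
key2 = [(1086, 6), (1086, 10), (1091, 3)]
print(sorted(pages_needing_heap_visit(key1, key2)))  # [1086]
```

Page 1090 is lossy in one stream but absent from the other, so no row on it can satisfy both conditions and it could be skipped without a heap recheck.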

I'm out of time for today, do you want to work on it?

regards, tom lane
