Re: [HACKERS] [sqlsmith] FailedAssertion("!(k == indices_count)", File: "tsvector_op.c", Line: 511)

2016-08-05 Thread Tom Lane
Thomas Munro  writes:
> The assertion in tsvector_delete_by_indices fails because its counting
> algorithm doesn't expect indices_to_delete to contain multiple
> references to the same index.  Maybe that could be fixed by
> uniquifying in tsvector_delete_arr before calling it, but since
> tsvector_delete_by_indices already qsorts its input, it should be able
> to handle duplicates cheaply.

I poked at this and realized that that's not sufficient.  If there are
duplicates in indices_to_delete, then the initial estimate

tsout->size = tsv->size - indices_count;

is wrong because indices_count is an overestimate of how many lexemes
will be removed.  And because the calculation "dataout = STRPTR(tsout)"
depends on tsout->size, we can't just wait till later to get it right.

We could possibly initialize tsout->size = tsv->size (the maximum
possible value), thereby ensuring that the WordEntry array doesn't
overlap the dataout area; compute the correct tsout->size in the loop;
and then memmove the data area into place to collapse out wasted space.
But I think it might be simpler and better-performant just to de-dup the
indices_to_delete array after qsort'ing it; that would certainly win
for the case of indices_count == 1.

The other problems I noted with failure to delete items seem to stem
from the fact that tsvector_delete_arr relies on tsvector_bsearch to
find items, but the input tsvector is not sorted (never mind de'duped)
by array_to_tsvector.  This seems like simple brain fade in
array_to_tsvector, as AFAICS that's a required property of tsvectors.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [sqlsmith] FailedAssertion("!(k == indices_count)", File: "tsvector_op.c", Line: 511)

2016-08-03 Thread Thomas Munro
On Thu, Aug 4, 2016 at 9:39 AM, Tom Lane  wrote:
> Andreas Seltenreich  writes:
>> the following statement triggers an assertion in tsearch:
>
>> select ts_delete(array_to_tsvector('{smith,smith,smith}'::text[]),  
>> '{smith,smith}'::text[]);
>> -- TRAP: FailedAssertion("!(k == indices_count)", File: "tsvector_op.c", 
>> Line: 511)
>
> Confirmed here.  I notice that the output of array_to_tsvector() is
> already fishy in this example:
>
> regression=# select array_to_tsvector('{smith,smith,smith}'::text[]);
> array_to_tsvector
> -
>  'smith' 'smith' 'smith'
> (1 row)
>
> Shouldn't those have been merged together?  You certainly don't get
> results like that from other tsvector-producing operations:
>
> regression=# select to_tsvector('smith smith smith');
>   to_tsvector
> ---
>  'smith':1,2,3
> (1 row)
> regression=# select 'smith smith smith'::tsvector;
>  tsvector
> --
>  'smith'
> (1 row)
>
> However, that does not seem to be the proximate cause of the crash
> in ts_delete, because this non-duplicated case still crashes:
>
> select ts_delete(array_to_tsvector('{smith,smithx,smithy}'::text[]),  
> '{smith,smith}'::text[]);
>
> It kinda looks like you need more than one deletion request for
> the first entry in the sorted tsvector, because for example
> {smith,foo,toolbox} works but not {smith,too,toolbox}.
>
> I'm thinking there are two distinct bugs here.

The assertion in tsvector_delete_by_indices fails because its counting
algorithm doesn't expect indices_to_delete to contain multiple
references to the same index.  Maybe that could be fixed by
uniquifying in tsvector_delete_arr before calling it, but since
tsvector_delete_by_indices already qsorts its input, it should be able
to handle duplicates cheaply.  I was thinking something like this:

for (i = j = k = 0; i < tsv->size; i++)
{
+   bool drop_lexeme = false;
+
/*
 * Here we should check whether current i is present in
 * indices_to_delete or not. Since indices_to_delete
is already sorted
-* we can advance it index only when we have match.
+* we can advance it index only when we have match.  We do this
+* repeatedly, in case indices_to_delete contains
duplicate references
+* to the same index.
 */
-   if (k < indices_count && i == indices_to_delete[k])
+   while (k < indices_count && i == indices_to_delete[k])
{
+   drop_lexeme = true;
k++;
-   continue;
}
+   if (drop_lexeme)
+   continue;

But that doesn't seem to be enough, there is something else wrong here
resulting in garbage output, maybe related to the failure to merge the
tsvector...

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [sqlsmith] FailedAssertion("!(k == indices_count)", File: "tsvector_op.c", Line: 511)

2016-08-03 Thread Tom Lane
I wrote:
> I'm thinking there are two distinct bugs here.

Actually, make that three bugs.  I was so focused on the crashing
that I failed to notice that ts_delete wasn't producing sane answers
even when it didn't crash:

regression=# select ts_delete(array_to_tsvector('{smith,foo,bar}'::text[]),  
'{smith,smith}'::text[]);
  ts_delete  
-
 'smith' 'foo' 'bar'
(1 row)

regression=# select ts_delete(array_to_tsvector('{smith,foo,bar}'::text[]),  
'{smith,foo}'::text[]);
   ts_delete   
---
 'smith' 'bar'
(1 row)

regression=# select ts_delete(array_to_tsvector('{smith,foo,bar}'::text[]),  
'{bar,smith}'::text[]);
  ts_delete  
-
 'smith' 'foo' 'bar'
(1 row)

The non-array version is no better:

regression=# select ts_delete(array_to_tsvector('{smith,foo,bar}'::text[]),  
'smith'::text);
  ts_delete  
-
 'smith' 'foo' 'bar'
(1 row)

regression=# select ts_delete(array_to_tsvector('{smith,foo,bar}'::text[]),  
'foo'::text);
   ts_delete   
---
 'smith' 'bar'
(1 row)

regression=# select ts_delete(array_to_tsvector('{smith,foo,bar}'::text[]),  
'bar'::text);
  ts_delete  
-
 'smith' 'foo' 'bar'
(1 row)

I'm not sure if ts_delete takes its second argument as verbatim lexemes
or normalizes them first, but none of these words are changed by
to_tsvector, so either way it seems to fail to delete stuff it should.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [sqlsmith] FailedAssertion("!(k == indices_count)", File: "tsvector_op.c", Line: 511)

2016-08-03 Thread Tom Lane
Andreas Seltenreich  writes:
> the following statement triggers an assertion in tsearch:

> select ts_delete(array_to_tsvector('{smith,smith,smith}'::text[]),  
> '{smith,smith}'::text[]);
> -- TRAP: FailedAssertion("!(k == indices_count)", File: "tsvector_op.c", 
> Line: 511)

Confirmed here.  I notice that the output of array_to_tsvector() is
already fishy in this example:

regression=# select array_to_tsvector('{smith,smith,smith}'::text[]);
array_to_tsvector
-
 'smith' 'smith' 'smith'
(1 row)

Shouldn't those have been merged together?  You certainly don't get
results like that from other tsvector-producing operations:

regression=# select to_tsvector('smith smith smith');
  to_tsvector  
---
 'smith':1,2,3
(1 row)
regression=# select 'smith smith smith'::tsvector;
 tsvector 
--
 'smith'
(1 row)

However, that does not seem to be the proximate cause of the crash
in ts_delete, because this non-duplicated case still crashes:

select ts_delete(array_to_tsvector('{smith,smithx,smithy}'::text[]),  
'{smith,smith}'::text[]);

It kinda looks like you need more than one deletion request for
the first entry in the sorted tsvector, because for example
{smith,foo,toolbox} works but not {smith,too,toolbox}.

I'm thinking there are two distinct bugs here.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers