Re: [PATCHES] [BUGS] BUG #3975: tsearch2 index should not bomb out of 1Mb limit

2008-05-06 Thread Bruce Momjian

Added to TODO:

o Consider changing error to warning for strings larger than one
  megabyte

  http://archives.postgresql.org/pgsql-bugs/2008-02/msg00190.php
  http://archives.postgresql.org/pgsql-patches/2008-03/msg00062.php


---

Tom Lane wrote:
 Bruce Momjian [EMAIL PROTECTED] writes:
  Tom Lane wrote:
  I don't think that follows.  A tsearch index is lossy anyway, so there's
 
  Uh, the index is lossy but I thought it was lossy in a way that just
  required additional heap accesses, not lossy in that it doesn't index
  everything.
 
 Sure it's lossy.  It doesn't index stopwords, and it doesn't index the
 difference between various forms of a word (when the dictionaries reduce
 them to a common root).
 
  I am concerned a 1mb limit is too low though.  Exactly why can't we have
  a higher limit?  Is positional information that significant?
 
 That's pretty much exactly the point: it's not very significant, and it
 doesn't justify a total inability to index large documents.
 
 One thing we could do is index words that are past the limit but not
 store a position, or perhaps have the convention that the maximum
 position value means somewhere past here.
 
   regards, tom lane
 

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

-- 
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [PATCHES] [BUGS] BUG #3975: tsearch2 index should not bomb out of 1Mb limit

2008-03-09 Thread Euler Taveira de Oliveira

Tom Lane wrote:


Well, there is exactly zero chance of that happening in 8.3.x, because
the bit allocations for on-disk tsvector representation are already
determined.  It's fairly hard to see a way of doing it in future
releases that would have acceptable costs, either.


I think you missed my point, or I didn't explain it in detail. I'm
talking about making the error-or-notice condition a GUC variable
(e.g. ignore_tsearch_limits = on/off).


But more to the point: no matter what the document length limit is,
why should it be a hard error to exceed it?  The downside of not
indexing words beyond the length limit is that searches won't find
documents in which the search terms occur only very far into the
document.  The downside of throwing an error is that we can't store such
documents at all, which surely guarantees that searches won't find
them.  How can you possibly argue that that option is better?

IMHO, both approaches are bad; that's why I propose a user option, so
the user can choose between being strict (error out when the limit is
exceeded) and being relaxed (emit a notice when the limit is exceeded).
Maybe some day we can increase the limits (e.g. a ts_type.h redesign).
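As a sketch, the knob Euler proposes (hypothetical: it was never adopted, and the name is taken from his message, not from any released PostgreSQL) might have appeared in postgresql.conf like this:

```
# Hypothetical setting from this proposal, not a real PostgreSQL GUC.
# on  = emit a NOTICE and truncate when a tsvector exceeds its limits
# off = keep the current hard ERROR (the 8.3 behavior)
ignore_tsearch_limits = off
```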



--
  Euler Taveira de Oliveira
  http://www.timbira.com/




Re: [PATCHES] [BUGS] BUG #3975: tsearch2 index should not bomb out of 1Mb limit

2008-03-07 Thread Bruce Momjian
Tom Lane wrote:
 Euler Taveira de Oliveira [EMAIL PROTECTED] writes:
  Edwin Groothuis wrote:
  Is it possible to make it a WARNING instead of an ERROR? Right now I get:
  
  No. All of the other types emit an ERROR if you're trying an out of 
  range value.
 
 I don't think that follows.  A tsearch index is lossy anyway, so there's

Uh, the index is lossy but I thought it was lossy in a way that just
required additional heap accesses, not lossy in that it doesn't index
everything.

I think just indexing what we can and throwing away the rest is a MySQL
approach we don't want to take.  We could throw a warning on truncation
but that seems wrong too.

I am concerned a 1mb limit is too low though.  Exactly why can't we have
a higher limit?  Is positional information that significant?

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [PATCHES] [BUGS] BUG #3975: tsearch2 index should not bomb out of 1Mb limit

2008-03-07 Thread Bruce Momjian
Edwin Groothuis wrote:
 On Thu, Mar 06, 2008 at 08:19:35PM -0300, Euler Taveira de Oliveira wrote:
  Edwin Groothuis wrote:
  
  Is it possible to make it a WARNING instead of an ERROR? Right now I get:
  
  No. All of the other types emit an ERROR if you're trying an out of 
  range value.
 
 Does that then mean that, because of the triggers on inserts, we
 will never be able to add this record?

Right now, yes, unless we figure something else out.

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [PATCHES] [BUGS] BUG #3975: tsearch2 index should not bomb out of 1Mb limit

2008-03-07 Thread Tom Lane
Bruce Momjian [EMAIL PROTECTED] writes:
 Tom Lane wrote:
 I don't think that follows.  A tsearch index is lossy anyway, so there's

 Uh, the index is lossy but I thought it was lossy in a way that just
 required additional heap accesses, not lossy in that it doesn't index
 everything.

Sure it's lossy.  It doesn't index stopwords, and it doesn't index the
difference between various forms of a word (when the dictionaries reduce
them to a common root).

 I am concerned a 1mb limit is too low though.  Exactly why can't we have
 a higher limit?  Is positional information that significant?

That's pretty much exactly the point: it's not very significant, and it
doesn't justify a total inability to index large documents.

One thing we could do is index words that are past the limit but not
store a position, or perhaps have the convention that the maximum
position value means "somewhere past here".

regards, tom lane



Re: [PATCHES] [BUGS] BUG #3975: tsearch2 index should not bomb out of 1Mb limit

2008-03-07 Thread Bruce Momjian
Tom Lane wrote:
 Bruce Momjian [EMAIL PROTECTED] writes:
  Tom Lane wrote:
  I don't think that follows.  A tsearch index is lossy anyway, so there's
 
  Uh, the index is lossy but I thought it was lossy in a way that just
  required additional heap accesses, not lossy in that it doesn't index
  everything.
 
 Sure it's lossy.  It doesn't index stopwords, and it doesn't index the
 difference between various forms of a word (when the dictionaries reduce
 them to a common root).

Yes, but you specify the stop words and stemming rules --- it isn't like
it drops words that are out of your control.

  I am concerned a 1mb limit is too low though.  Exactly why can't we have
  a higher limit?  Is positional information that significant?
 
 That's pretty much exactly the point: it's not very significant, and it
 doesn't justify a total inability to index large documents.

Agreed.  I think losing positional information after 1mb is acceptable.

 One thing we could do is index words that are past the limit but not
 store a position, or perhaps have the convention that the maximum
 position value means somewhere past here.

Sure.

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [PATCHES] [BUGS] BUG #3975: tsearch2 index should not bomb out of 1Mb limit

2008-03-07 Thread Teodor Sigaev

To be precise about tsvector:

1) A GiST index is lossy for any kind of tsearch query; a GIN index is
not lossy for the @@ operator, but is lossy for @@@.

2) The number of positions per word is limited to 256 - a bigger number
of positions does not help ranking, but produces a big tsvector. If a
word has many positions in a document, it is close to being a stopword.
We could easily increase this limit to 65536 positions.

3) The maximum position value is 2^14, because positions are stored in
a uint16, of which 2 bits are reserved for the weight of the position.
It would be possible to widen the uint16 to uint32, but that would
double the tsvector size, which is impractical, I suppose. So the part
of a document usable for ranking is its first 16384 words - about the
first 50-100 kilobytes.

4) The limit on the total size of a tsvector comes from the
WordEntry->pos field (ts_type.h), which holds the number of bytes
between the first lexeme in the tsvector and the lexeme in question.

So the limitation is on the total length of the lexemes plus their
positional information.


--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/



Re: [PATCHES] [BUGS] BUG #3975: tsearch2 index should not bomb out of 1Mb limit

2008-03-06 Thread Euler Taveira de Oliveira

Edwin Groothuis wrote:


Is it possible to make it a WARNING instead of an ERROR? Right now I get:

No. All of the other types emit an ERROR if you try an out-of-range
value.



--
  Euler Taveira de Oliveira
  http://www.timbira.com/



Re: [PATCHES] [BUGS] BUG #3975: tsearch2 index should not bomb out of 1Mb limit

2008-03-06 Thread Edwin Groothuis
On Thu, Mar 06, 2008 at 08:19:35PM -0300, Euler Taveira de Oliveira wrote:
 Edwin Groothuis wrote:
 
 Is it possible to make it a WARNING instead of an ERROR? Right now I get:
 
 No. All of the other types emit an ERROR if you're trying an out of 
 range value.

Does that then mean that, because of the triggers on inserts, we
will never be able to add this record?

Edwin
-- 
Edwin Groothuis  |Personal website: http://www.mavetju.org
[EMAIL PROTECTED]|  Weblog: http://www.mavetju.org/weblog/



Re: [PATCHES] [BUGS] BUG #3975: tsearch2 index should not bomb out of 1Mb limit

2008-03-06 Thread Tom Lane
Euler Taveira de Oliveira [EMAIL PROTECTED] writes:
 Edwin Groothuis wrote:
 Is it possible to make it a WARNING instead of an ERROR? Right now I get:
 
 No. All of the other types emit an ERROR if you're trying an out of 
 range value.

I don't think that follows.  A tsearch index is lossy anyway, so there's
no hard and fast reason why it should reject entries that it can't index
completely.  I think it would be more useful to index whatever it can
(probably just the words in the first N bytes of the document) than to
prevent you from storing the document.

There is another precedent too, namely that tsearch already discards
individual words over so-many-bytes long.  If it doesn't throw an error
for that case, why is it throwing an error for this one?

regards, tom lane



Re: [PATCHES] [BUGS] BUG #3975: tsearch2 index should not bomb out of 1Mb limit

2008-03-06 Thread Euler Taveira de Oliveira

Tom Lane wrote:


I don't think that follows.  A tsearch index is lossy anyway, so there's
no hard and fast reason why it should reject entries that it can't index
completely.  I think it would be more useful to index whatever it can
(probably just the words in the first N bytes of the document) than to
prevent you from storing the document.

The problem with this approach is how to select the part of the document 
to index. How will you ensure you're not ignoring the most important 
words of the document?
IMHO, Postgres shouldn't decide that; it would be good if a user could 
set it at runtime and/or in postgresql.conf.



--
  Euler Taveira de Oliveira
  http://www.timbira.com/



Re: [PATCHES] [BUGS] BUG #3975: tsearch2 index should not bomb out of 1Mb limit

2008-03-06 Thread Tom Lane
Euler Taveira de Oliveira [EMAIL PROTECTED] writes:
 The problem with this approach is how to select the part of the document 
 to index. How will you ensure you're not ignoring the more important 
 words of the document?

That's *always* a risk, anytime you do any sort of processing or
normalization on the text.  The question here is not whether or not
we will make tradeoffs, only which ones to make.

 IMHO Postgres shouldn't decide it; it would be good if an user could set 
 it runtime and/or on postgresql.conf.

Well, there is exactly zero chance of that happening in 8.3.x, because
the bit allocations for on-disk tsvector representation are already
determined.  It's fairly hard to see a way of doing it in future
releases that would have acceptable costs, either.

But more to the point: no matter what the document length limit is,
why should it be a hard error to exceed it?  The downside of not
indexing words beyond the length limit is that searches won't find
documents in which the search terms occur only very far into the
document.  The downside of throwing an error is that we can't store such
documents at all, which surely guarantees that searches won't find
them.  How can you possibly argue that that option is better?

regards, tom lane



Re: [PATCHES] [BUGS] BUG #3975: tsearch2 index should not bomb out of 1Mb limit

2008-03-05 Thread Edwin Groothuis
On Wed, Mar 05, 2008 at 10:53:38AM -0500, Bruce Momjian wrote:
 Euler Taveira de Oliveira wrote:
  Edwin Groothuis wrote:
  
   Ouch. But... since very long words are already not indexed (is the length
   configurable anywhere because I don't mind setting it to 50 characters), I
   don't think that it should bomb out of this but print a similar warning 
   like
   String only partly indexed.
   
  This is not a bug. I would say it's a limitation. Look at 
  src/include/tsearch/ts_type.h. You could decrease len in WordEntry to 9 
  (512 characters) and increase pos to 22 (4 Mb). Don't forget to update 
  MAXSTRLEN and MAXSTRPOS accordingly.
  
   I'm still trying to determine how big the message it failed on was...
   
  Maybe we should change "string is too long for tsvector" to "string 
  is too long (%ld bytes, max %ld bytes) for tsvector".
 
 Good idea.  I have applied the following patch to report in the error
 message the string length and maximum, like we already do for long
 words:
 
 Old:
   test=> select repeat('a', 3000)::tsvector;
   ERROR:  word is too long (3000 bytes, max 2046 bytes)
 
 New:
   test=> select repeat('a ', 300)::tsvector;
   ERROR:  string is too long for tsvector (1048576 bytes, max 1048575 bytes)

Is it possible to make it a WARNING instead of an ERROR? Right now I get:

NOTICE: word is too long to be indexed
DETAIL: Words longer than 2047 characters are ignored. 

when updating the dictionary on a table, which will make it continue,
but with some long messages I get:

ERROR: string is too long for tsvector 

Which is quite fatal for the whole UPDATE / INSERT statement.

Edwin

--
Edwin Groothuis  |Personal website: http://www.mavetju.org
[EMAIL PROTECTED]|  Weblog: http://www.mavetju.org/weblog/
