subject:"\[HACKERS\] default_text_search_config and expression indexes"

Re: [HACKERS] default_text_search_config and expression indexes

2007-08-14 Thread Alvaro Herrera

Oleg Bartunov wrote:
 On Thu, 9 Aug 2007, [EMAIL PROTECTED] wrote:

 Maybe I'm missing something, but it seems to me that the configuration
 is more attached to a column/index thatn to the whole database. If
 there's a default in an expression, I'd rather expect this default to be
 drawn from the index involved than from a global value (like a functional
 index does now).

 I'm tired to repeat - index itself doesn't know about configuration !

Is there a way to change that?  For example store the configuration in a
metapage or something?

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org

Re: [HACKERS] default_text_search_config and expression indexes

2007-08-14 Thread Gregory Stark


Alvaro Herrera [EMAIL PROTECTED] writes:

 Oleg Bartunov wrote:

 I'm tired to repeat - index itself doesn't know about configuration !

 Is there a way to change that?  For example store the configuration in a
 metapage or something?

I think Heikki's suggestion of having each configuration create a new type
would effectively do the same thing.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly

Re: [HACKERS] default_text_search_config and expression indexes

2007-08-14 Thread Mike Rylander

On 8/13/07, Bruce Momjian [EMAIL PROTECTED] wrote:
 Heikki Linnakangas wrote:
  Bruce Momjian wrote:
   Heikki Linnakangas wrote:
   Removing the default configuration setting altogether removes the 2nd
   problem, but that's not good from a usability point of view. And it
   doesn't solve the general issue, you can still do things like:
   SELECT * FROM foo WHERE to_tsvector('confA', textcol) @@
   to_tsquery('confB', 'query');
  
   True, but in that case you are specifically naming different
   configurations, so it is hopefully obvious you have a mismatch.
 
  There's many more subtle ways to do that. For example, filling a
  tsvector column using a DEFAULT clause. But then you sometimes fill it
  in the application instead, with a different configuration. Or if one of
  the function calls is buried in another user defined function.
 
  I don't think explicitly naming the configuration gives enough protection.

 Oh, wow, OK, well in that case the text search API isn't ready and we
 will have to hold this for 8.4.


I've been watching this thread with a mixture of dread and hope,
waiting to see where the developers' inclination will end up; whether
leaving a useful foot gun available will be allowed.

This is just my $0.02 as a fairly heavy user of the current tsearch2
code, but I sincerely hope you do not cripple the system by removing
the ability to store tsvectors built using arbitrary configurations in
a single column.  Yes, it can lead to unexpected results if you do not
know what you are doing, but if you have gone beyond building a single
tsearch2 configuration then you are required to know what you are
doing.  What's more, IMO the default configuration mechanism feels
very much like a CONSTRAINT, as Oleg suggests.  That point is one of
cognizance, where if one has gone to the trouble of setting up
multiple configurations and has learned enough to do so correctly,
then one necessarily understands the importance of the setting and can
use it (or not, and use explicit configurations) correctly.  The
default config lowers the bar to an acceptable level for beginners
that have no need of multiple configurations, and while I don't feel
too strongly, personally, about having a default, I think it is both
useful and helpful for new users -- it was for me.

Now, so this email isn't entirely complaining, and as a data point for
the discussion, I'll explain why I do not want to see tsearch2
crippled in the way suggested by Heikki and Bruce.

My application (http://open-ils.org, which run 80% of the public
libraries in Georgia, USA, http://gapines.org and
http://georgialibraries.org/lib/pines.html) requires that I be able to
search a corpus of bibliographic records in a mix of languages, and
potentially with mixed stop-word rules, with one query.  I cannot know
ahead of time what languages will be used in the corpus and I cannot
restrict any one query to one language.  To accomplish this, the
record itself will be inspected inside an INSERT/UPDATE trigger to
determine the language and type, and use the correct configuration for
creating the tsvector.  This will obviously result in a mixed
tsvector column, but that's exactly what I need.  I can filter on
record language if the user happens to specify a query language (and
thus configuration), or simply rank the assumed (IP based, perhaps, or
browser preference based) preferred language higher, or one of a
hundred other things.  But I won't be able to do any of that if
tsvectors are required to have one and only one configuration per
column.

Anyway, I felt I needed to provide some outside perspective to this,
as a user, since it seems that the external viewpoint (my particular
viewpoint, at least) was missing from the discussion.

Thanks, folks, for all the work on this so far!

--miker

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster

Re: [HACKERS] default_text_search_config and expression indexes

2007-08-14 Thread Heikki Linnakangas

Mike Rylander wrote:
 This is just my $0.02 as a fairly heavy user of the current tsearch2
 code, but I sincerely hope you do not cripple the system by removing
 the ability to store tsvectors built using arbitrary configurations in
 a single column.  Yes, it can lead to unexpected results if you do not
 know what you are doing, but if you have gone beyond building a single
 tsearch2 configuration then you are required to know what you are
 doing.  What's more, IMO the default configuration mechanism feels
 very much like a CONSTRAINT, as Oleg suggests.  That point is one of
 cognizance, where if one has gone to the trouble of setting up
 multiple configurations and has learned enough to do so correctly,
 then one necessarily understands the importance of the setting and can
 use it (or not, and use explicit configurations) correctly.  The
 default config lowers the bar to an acceptable level for beginners
 that have no need of multiple configurations, and while I don't feel
 too strongly, personally, about having a default, I think it is both
 useful and helpful for new users -- it was for me.

Thanks for chiming in. As a disclaimer: I've never used tsearch2 in a
real application myself.

 My application (http://open-ils.org, which run 80% of the public
 libraries in Georgia, USA, http://gapines.org and
 http://georgialibraries.org/lib/pines.html) requires that I be able to
 search a corpus of bibliographic records in a mix of languages, and
 potentially with mixed stop-word rules, with one query.  I cannot know
 ahead of time what languages will be used in the corpus and I cannot
 restrict any one query to one language.  To accomplish this, the
 record itself will be inspected inside an INSERT/UPDATE trigger to
 determine the language and type, and use the correct configuration for
 creating the tsvector.  This will obviously result in a mixed
 tsvector column, but that's exactly what I need.  I can filter on
 record language if the user happens to specify a query language (and
 thus configuration), or simply rank the assumed (IP based, perhaps, or
 browser preference based) preferred language higher, or one of a
 hundred other things.  But I won't be able to do any of that if
 tsvectors are required to have one and only one configuration per
 column.

Don't you need to use the right configuration to parse the query into a
tsquery as well?

What you have is basically the classic problem problem of representing
inheritance in a relational model. You have a superclass, bibliographic
record, and multiple subclasses, one per language with one extra field,
the corpus in the right language. You've solved it by storing them all
in one table, with an extra column (language) to denote which subclass
the record is. Alternatively, you could solve it by using inherited
tables, or having one table per language with a foreign key referencing
the master table containing the other fields, or having a single table
with one nullable field per configuration, and a check constraint to
check that exactly one of them is not null.

As a thought experiment, let me present another, not text search related
example that's isomorphic to your problem:

Imagine an international online auction system, where you can put items
for sale and specify a minimum price. Part of the database schema is:

CREATE TABLE item (
  id INTEGER,
  description TEXT,
  currency CHAR(3),
  price NUMERIC
);

It clearly doesn't make sense to directly compare prices in different
currencies against each other. A query like WHERE price  1000 doesn't
make sense, unless you also restrict the currency, or use an exchange
rate to convert between currencies. Also, different currencies might
have slightly different rules on how many decimal places are
significant. In this example, as well as your bibliographic scenario, we
can conveniently store prices in all currencies in the same field
because they're all numerics. If we wanted to enforce per-currency
rules, like NUMERIC(10, 2) for USD and NUMERIC(10,0) for Italian lires
(which doesn't really exists anymore, I know), we'd need to store them
in separate columns. And before the decimalisation in 1971, the British
pound was divided into 20 shillings, each of which was divided to 12
pence, so it wouldn't have fit to normal numeric field, and we would
again have to store that in a separate column.

What I'm trying to point out is that the problem isn't unique to text
search. It's an old problem, with many alternative solutions, even with
strong typing. Arguably if you store data in multiple languages in same
field, you have a denormalized schema. Granted, loose typing can be more
convenient, but you give up the benefits of strong typing as well.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq

Re: [HACKERS] default_text_search_config and expression indexes

2007-08-14 Thread Heikki Linnakangas

Bruce Momjian wrote:
 Heikki Linnakangas wrote:
 Bruce Momjian wrote:
 Heikki Linnakangas wrote:
 Removing the default configuration setting altogether removes the 2nd
 problem, but that's not good from a usability point of view. And it
 doesn't solve the general issue, you can still do things like:
 SELECT * FROM foo WHERE to_tsvector('confA', textcol) @@
 to_tsquery('confB', 'query');
 True, but in that case you are specifically naming different
 configurations, so it is hopefully obvious you have a mismatch.
 There's many more subtle ways to do that. For example, filling a
 tsvector column using a DEFAULT clause. But then you sometimes fill it
 in the application instead, with a different configuration. Or if one of
 the function calls is buried in another user defined function.

 I don't think explicitly naming the configuration gives enough protection.
 
 Oh, wow, OK, well in that case the text search API isn't ready and we
 will have to hold this for 8.4.

That would be unfortunate :(. Sorry I haven't looked at this earlier. Do
you think that implementing a strongly typed system is too much work for
8.3?

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match

Re: [HACKERS] default_text_search_config and expression indexes

2007-08-14 Thread Oleg Bartunov


On Tue, 14 Aug 2007, Alvaro Herrera wrote:


Oleg Bartunov wrote:

On Thu, 9 Aug 2007, [EMAIL PROTECTED] wrote:


Maybe I'm missing something, but it seems to me that the configuration
is more attached to a column/index thatn to the whole database. If
there's a default in an expression, I'd rather expect this default to be
drawn from the index involved than from a global value (like a functional
index does now).


I'm tired to repeat - index itself doesn't know about configuration !


Is there a way to change that?  For example store the configuration in a
metapage or something?


it's useless, in general, since you could use different configuration to 
build tsvector.



Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

---(end of broadcast)---
TIP 6: explain analyze is your friend

Re: [HACKERS] default_text_search_config and expression indexes

2007-08-14 Thread Mike Rylander

On 8/14/07, Heikki Linnakangas [EMAIL PROTECTED] wrote:
 Mike Rylander wrote:
[snip]

 Don't you need to use the right configuration to parse the query into a
 tsquery as well?


Only if the user (or user agent) can supply enough information to move
away from the configured default of, say, en-US.  And even then, other
tsvector configurations need to be searched.  Configurations are not,
of course, just about language/stemming, but also stop-words and
thesauri.  If it were just languages it wouldn't be too big of a deal
(other than in-app DDL management...).

 What you have is basically the classic problem problem of representing
 inheritance in a relational model. You have a superclass, bibliographic
 record, and multiple subclasses, one per language with one extra field,
 the corpus in the right language. You've solved it by storing them all
 in one table, with an extra column (language) to denote which subclass
 the record is. Alternatively, you could solve it by using inherited
 tables, or having one table per language with a foreign key referencing
 the master table containing the other fields, or having a single table
 with one nullable field per configuration, and a check constraint to
 check that exactly one of them is not null.


Sorry, communication problem here ... I provided an oversimplified
example meant more to show the issues than to find alternate
solutions, though I appreciate you taking the time to consider the
problem.  If I only had to consider one delimiting facet per record
then it would be much simpler. But with the complication that some
fields have stop-word lists (and some not), some use thesauri (and
different ones, at that), and these fields (as extracted from the
records) and their configurations (stem? use a stop-word list? use
thesaurus x, y or z?) are user-defined ...

 As a thought experiment, let me present another, not text search related
 example that's isomorphic to your problem:


Unfortunately, the problem you describe is not quite the same as the
problem I need to solve.

Fortunately, the current incarnation of tsearch2 does a pretty good
job of solving the problem if you store everything in one place and
use the query-time environment to apply some adjustments to the
ranking of items.  I could still work around this problem by creating
inherited tables, one for each configuration on each index-providing
table but I /really/ hope to avoid that.  Creating new configurations
for local requirements doesn't require creating new tables (and the
associated management overhead in the app) today, something I'd really
like to avoid.  In fact, I'm starting to sweat just thinking about
what the planner would go through with the number tables needed for
the potential configurations in an installation that makes use of
multiple thesauri and a mix of stop-word lists across, say, 30
languages.  Such a dataset is not uncommon.

In any case, thanks again for taking the time to think about the
problem.  I still think having the ability to store any old tsvector I
happen to have hanging around in any column of the correct type is a
GoodThing(tm).  I see from Oleg's message down-thread that that's the
way things will be (the tsvector type doesn't know about columns, just
lexem tuples).

--miker

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly

Re: [HACKERS] default_text_search_config and expression indexes

2007-08-14 Thread Bruce Momjian

Heikki Linnakangas wrote:
 Bruce Momjian wrote:
  Heikki Linnakangas wrote:
  Bruce Momjian wrote:
  Heikki Linnakangas wrote:
  Removing the default configuration setting altogether removes the 2nd
  problem, but that's not good from a usability point of view. And it
  doesn't solve the general issue, you can still do things like:
  SELECT * FROM foo WHERE to_tsvector('confA', textcol) @@
  to_tsquery('confB', 'query');
  True, but in that case you are specifically naming different
  configurations, so it is hopefully obvious you have a mismatch.
  There's many more subtle ways to do that. For example, filling a
  tsvector column using a DEFAULT clause. But then you sometimes fill it
  in the application instead, with a different configuration. Or if one of
  the function calls is buried in another user defined function.
 
  I don't think explicitly naming the configuration gives enough protection.
  
  Oh, wow, OK, well in that case the text search API isn't ready and we
  will have to hold this for 8.4.
 
 That would be unfortunate :(. Sorry I haven't looked at this earlier. Do
 you think that implementing a strongly typed system is too much work for
 8.3?

Yea, probably.  See my other posting with an updated subject line.

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 6: explain analyze is your friend

Re: [HACKERS] default_text_search_config and expression indexes

2007-08-14 Thread Bruce Momjian

Mike Rylander wrote:
 This is just my $0.02 as a fairly heavy user of the current tsearch2
 code, but I sincerely hope you do not cripple the system by removing
 the ability to store tsvectors built using arbitrary configurations in
 a single column.  Yes, it can lead to unexpected results if you do not
 know what you are doing, but if you have gone beyond building a single
 tsearch2 configuration then you are required to know what you are
 doing.  What's more, IMO the default configuration mechanism feels
 very much like a CONSTRAINT, as Oleg suggests.  That point is one of
 cognizance, where if one has gone to the trouble of setting up
 multiple configurations and has learned enough to do so correctly,
 then one necessarily understands the importance of the setting and can
 use it (or not, and use explicit configurations) correctly.  The
 default config lowers the bar to an acceptable level for beginners
 that have no need of multiple configurations, and while I don't feel
 too strongly, personally, about having a default, I think it is both
 useful and helpful for new users -- it was for me.

What has really hurt the default GUC idea is how to do restores from a
pg_dump.  How do you make sure the right default is used on a restore,
particularly if multiple objects are being restored, and each has a
different default GUC.  I suppose your trigger handles that but that
isn't going to help with an expression index, nor in cases where the
default of the old database is different from the new one.

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate

Re: [HACKERS] default_text_search_config and expression indexes

2007-08-14 Thread Alvaro Herrera

Bruce Momjian escribió:
 Mike Rylander wrote:
  This is just my $0.02 as a fairly heavy user of the current tsearch2
  code, but I sincerely hope you do not cripple the system by removing
  the ability to store tsvectors built using arbitrary configurations in
  a single column.  Yes, it can lead to unexpected results if you do not
  know what you are doing, but if you have gone beyond building a single
  tsearch2 configuration then you are required to know what you are
  doing.  What's more, IMO the default configuration mechanism feels
  very much like a CONSTRAINT, as Oleg suggests.  That point is one of
  cognizance, where if one has gone to the trouble of setting up
  multiple configurations and has learned enough to do so correctly,
  then one necessarily understands the importance of the setting and can
  use it (or not, and use explicit configurations) correctly.  The
  default config lowers the bar to an acceptable level for beginners
  that have no need of multiple configurations, and while I don't feel
  too strongly, personally, about having a default, I think it is both
  useful and helpful for new users -- it was for me.
 
 What has really hurt the default GUC idea is how to do restores from a
 pg_dump.  How do you make sure the right default is used on a restore,
 particularly if multiple objects are being restored, and each has a
 different default GUC.  I suppose your trigger handles that but that
 isn't going to help with an expression index, nor in cases where the
 default of the old database is different from the new one.

I guess what should happen is that pg_dump should include a
SET default_text_search_config = 'foo'
just before the CREATE INDEX, like we do for other variables.  Of
course, in order for this to work, the index itself must know what value
was used on creation.  Oleg already dismissed my suggestion of putting
it into the index itself (a metapage or something).

Maybe store it in reloptions?

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate

Re: [HACKERS] default_text_search_config and expression indexes

2007-08-14 Thread Alvaro Herrera

Oleg Bartunov wrote:
 On Tue, 14 Aug 2007, Alvaro Herrera wrote:

 Oleg Bartunov wrote:
 On Thu, 9 Aug 2007, [EMAIL PROTECTED] wrote:

 Maybe I'm missing something, but it seems to me that the configuration
 is more attached to a column/index thatn to the whole database. If
 there's a default in an expression, I'd rather expect this default to be
 drawn from the index involved than from a global value (like a 
 functional
 index does now).

 I'm tired to repeat - index itself doesn't know about configuration !

 Is there a way to change that?  For example store the configuration in a
 metapage or something?

 it's useless, in general, since you could use different configuration to 
 build tsvector.

Hmm, sorry, I think I just understood what this was about: so you mean
that the configuration is really *per row* and not per index?  So I can
store rows into an index using more than one configuration, and it will
work?

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match

Re: [HACKERS] default_text_search_config and expression indexes

2007-08-14 Thread Mike Rylander

On 8/14/07, Alvaro Herrera [EMAIL PROTECTED] wrote:
 Oleg Bartunov wrote:
  On Tue, 14 Aug 2007, Alvaro Herrera wrote:
 
  Oleg Bartunov wrote:
  On Thu, 9 Aug 2007, [EMAIL PROTECTED] wrote:
 
  Maybe I'm missing something, but it seems to me that the configuration
  is more attached to a column/index thatn to the whole database. If
  there's a default in an expression, I'd rather expect this default to be
  drawn from the index involved than from a global value (like a
  functional
  index does now).
 
  I'm tired to repeat - index itself doesn't know about configuration !
 
  Is there a way to change that?  For example store the configuration in a
  metapage or something?
 
  it's useless, in general, since you could use different configuration to
  build tsvector.

 Hmm, sorry, I think I just understood what this was about: so you mean
 that the configuration is really *per row* and not per index?  So I can
 store rows into an index using more than one configuration, and it will
 work?

Can and does, to great success.  :)

--miker

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster

Re: [HACKERS] default_text_search_config and expression indexes

2007-08-14 Thread Tom Lane

Alvaro Herrera [EMAIL PROTECTED] writes:
 Bruce Momjian escribió:
 What has really hurt the default GUC idea is how to do restores from a
 pg_dump.

 I guess what should happen is that pg_dump should include a
 SET default_text_search_config = 'foo'
 just before the CREATE INDEX, like we do for other variables.

The nasty cases are in data-only dumps, that is, where you're trying to
load data into a table with pre-existing indexes or triggers.  A SET
like the above is at least as likely to be wrong as right, if the index
or trigger depends on it to tell it what to do.

regards, tom lane

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster

Re: [HACKERS] default_text_search_config and expression indexes

2007-08-14 Thread Gregory Stark

Mike Rylander [EMAIL PROTECTED] writes:

 My application (http://open-ils.org, which run 80% of the public
 libraries in Georgia, USA, http://gapines.org and
 http://georgialibraries.org/lib/pines.html) requires that I be able to
 search a corpus of bibliographic records in a mix of languages, and
 potentially with mixed stop-word rules, with one query.  I cannot know
 ahead of time what languages will be used in the corpus and I cannot
 restrict any one query to one language.  To accomplish this, the
 record itself will be inspected inside an INSERT/UPDATE trigger to
 determine the language and type, and use the correct configuration for
 creating the tsvector.  This will obviously result in a mixed
 tsvector column, but that's exactly what I need.  I can filter on
 record language if the user happens to specify a query language (and
 thus configuration), or simply rank the assumed (IP based, perhaps, or
 browser preference based) preferred language higher, or one of a
 hundred other things.  But I won't be able to do any of that if
 tsvectors are required to have one and only one configuration per
 column.

 Anyway, I felt I needed to provide some outside perspective to this,
 as a user, since it seems that the external viewpoint (my particular
 viewpoint, at least) was missing from the discussion.

This is *extremely* useful. I think it's precisely what we've been missing so
far. At least, what I've been missing.

So the question is what exactly happens in this case? If I search for the
does that mean it will ignore matches in English where that's a stop-word but
find me books on tea in French? Is that what I should expect to happen? What
if I search for earl and the? Does that find me French books on Early Grey
Tea but English books on all earls?

What happens if I use the same operator directly on the text column? Or
perhaps it's not even possible to specify stop-words when operating on a text
column? Should it be?

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com

---(end of broadcast)---
TIP 6: explain analyze is your friend

Re: [HACKERS] default_text_search_config and expression indexes

2007-08-14 Thread Bruce Momjian

Tom Lane wrote:
 Alvaro Herrera [EMAIL PROTECTED] writes:
  Bruce Momjian escribi?:
  What has really hurt the default GUC idea is how to do restores from a
  pg_dump.
 
  I guess what should happen is that pg_dump should include a
  SET default_text_search_config = 'foo'
  just before the CREATE INDEX, like we do for other variables.
 
 The nasty cases are in data-only dumps, that is, where you're trying to
 load data into a table with pre-existing indexes or triggers.  A SET
 like the above is at least as likely to be wrong as right, if the index
 or trigger depends on it to tell it what to do.

Ouch.  I had not even thought that far.

FYI, yes, the default tsearch GUC controls operations per row _if_ you
have triggers or expression indexes that rely on the default
configuration GUC.  If you have specified the configuration, there is no
problem, and hence my conclusion that the default GUC is too
error-prone.

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match

Re: [HACKERS] default_text_search_config and expression indexes

2007-08-14 Thread Mike Rylander

On 8/14/07, Gregory Stark [EMAIL PROTECTED] wrote:
 Mike Rylander [EMAIL PROTECTED] writes:

  My application (http://open-ils.org, which run 80% of the public
  libraries in Georgia, USA, http://gapines.org and
  http://georgialibraries.org/lib/pines.html) requires that I be able to
  search a corpus of bibliographic records in a mix of languages, and
  potentially with mixed stop-word rules, with one query.  I cannot know
  ahead of time what languages will be used in the corpus and I cannot
  restrict any one query to one language.  To accomplish this, the
  record itself will be inspected inside an INSERT/UPDATE trigger to
  determine the language and type, and use the correct configuration for
  creating the tsvector.  This will obviously result in a mixed
  tsvector column, but that's exactly what I need.  I can filter on
  record language if the user happens to specify a query language (and
  thus configuration), or simply rank the assumed (IP based, perhaps, or
  browser preference based) preferred language higher, or one of a
  hundred other things.  But I won't be able to do any of that if
  tsvectors are required to have one and only one configuration per
  column.
 
  Anyway, I felt I needed to provide some outside perspective to this,
  as a user, since it seems that the external viewpoint (my particular
  viewpoint, at least) was missing from the discussion.

 This is *extremely* useful. I think it's precisely what we've been missing so
 far. At least, what I've been missing.

 So the question is what exactly happens in this case? If I search for the
 does that mean it will ignore matches in English where that's a stop-word but
 find me books on tea in French? Is that what I should expect to happen? What
 if I search for earl and the? Does that find me French books on Early Grey
 Tea but English books on all earls?

Oh dear ... you went and got me started...

Each field type has a different set of configurations for potenial
use.  Title and subject fields, for instance, do not (generally) use
stop-word lists, so a subject search for the will match any record
with the lexem the in a subject field.  Title fields are a little
more complicated, because there is information in the bibliographic
record about how and when to skip leading articles, but generally
those are indexed as well for FTS purposes.  English LCSH subjects
generally don't have stop(like) words in them, so you'll probably just
get French tea records.  Title results would be a mix of earls and
French tea records (probably) and the correlation between the user's
preferred language (either chosen query lang or UI lang, whichever is
available) will help adjust the ranking, pushing what are likely to be
the more appropriate records to the top.

Note, however, that much of this multi-tsearch2-configuration setup is
not used in the implementation at http://gapines.org/ because, well,
there's not much need (read: demand from librarians) for that dataset
to support these more complex tricks.  It's basically all en-US and
exclude stop-words.  Other implementations are making more use of what
I describe above, including a (government mandated) French-English
bilingual institution who shall remain nameless for the time being...


 What happens if I use the same operator directly on the text column? Or
 perhaps it's not even possible to specify stop-words when operating on a text
 column? Should it be?

You mean with an expression index on a text column?  I haven't
considered using them for FTS.  It just feels easier and more flexible
to me to use an external tsvector column because of the fairly heavy
processing that goes into creating each tsvector value.  I may
re-evaluate that position now that CREATE INDEX CONCURRENTLY exists,
but I'm not developing with 8.2+ only features yet.  Once 8.3 is out
that may change.

Also, unless I misunderstand, you have to wrap the text column in the
function used to build the index.  For my purposes, that makes
building a generic FTS driver for my app (which, admittedly, only has
a Postgresql driver ;) ) more difficult than having a hidden extra
column.  Again, that could change if the benefits of CREATE INDEX
CONCURRENTLY end up outweighing simpler FTS driver code.

--miker

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings

Re: [HACKERS] default_text_search_config and expression indexes

2007-08-14 Thread Oleg Bartunov


On Tue, 14 Aug 2007, Alvaro Herrera wrote:


Oleg Bartunov wrote:

On Tue, 14 Aug 2007, Alvaro Herrera wrote:


Oleg Bartunov wrote:

On Thu, 9 Aug 2007, [EMAIL PROTECTED] wrote:


Maybe I'm missing something, but it seems to me that the configuration
is more attached to a column/index thatn to the whole database. If
there's a default in an expression, I'd rather expect this default to be
drawn from the index involved than from a global value (like a
functional
index does now).


I'm tired to repeat - index itself doesn't know about configuration !


Is there a way to change that?  For example store the configuration in a
metapage or something?


it's useless, in general, since you could use different configuration to
build tsvector.


Hmm, sorry, I think I just understood what this was about: so you mean
that the configuration is really *per row* and not per index?  So I can


in the very extreme case, yes. Index doesn't care about configuration.
Everything should works without index !


store rows into an index using more than one configuration, and it will
work?


why not. For one set of documents you can use one configuration
(parser+mappings), for another - different configuration.


Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster

Re: [HACKERS] default_text_search_config and expression indexes

2007-08-13 Thread Bruce Momjian

Heikki Linnakangas wrote:
 Oleg Bartunov wrote:
  On Wed, 8 Aug 2007, Bruce Momjian wrote:
  Heikki Linnakangas wrote:
  If I understood correctly, the basic issue is that a tsvector datum
  created using configuration A is incompatible with a tsquery datum
  created using configuration B, in the sense that you won't get
  reasonable results if you use the tsquery to search the tsvector, or do
  ranking or highlighting. If the configurations happen to be similar
  enough, it can work, but not in general.
 
  Right.
  
  not fair. There are many cases when one can intentionally use different
  configurations. But I agree, this is not for beginners.
 
 Can you give an example of that?
 
 I certainly can see the need to use different configurations in one
 database, but what's the use case for comparing a tsvector created with
 configuration A against a tsquery created with configuration B?

I assume you could have a configuration with different stop words or
synonymns and compare them.

  - using an expression index instead of a tsvector-field, and always
  explicitly specifying the configuration, you can avoid that problem (a
  query with a different configuration won't use the index). But an
  expression index, without explicitly specifying the configuration, will
  get corrupted if you change the default configuration.
 
  Right.
  
  the same problem if you drop constrain from table (accidently) and then
  gets surprised by select results.
 
 The difference is that if you change the default configuration, you
 won't expect that your queries start to return funny results. It looks
 harmless, like changing the date style. If you drop a constraint, it's
 much more obvious what the consequences are.
 
  We should agree that all you describe is only for DUMMY users. From
  authors point of view I dislike your approach to treat text searching as
  a very limited tool. But I understand that we should preserve people
  from stupid errors.
  
  I want for beginners easy setup and error-prone functionality,
  but leaving experienced users to develop complex search engines.
  Can we have separate safe interface for text searching and explicitly
  recommend it for beginners ?
 
 I don't see how any of the suggestions limits what you can do with it.
 If we remove the default configuration parameter, you just have to be
 explicit. If we go with the type-system I suggested, you could still add
 casts and conversion functions between different tsvector types, where
 it make sense.

I don't think the type system is workable given the ability to create
new configurations on the fly.  I think the configuration must be
specified each time.

At this point, if we keep discussing the tsearch2 API we are not going
to have this in 8.3.

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate

Re: [HACKERS] default_text_search_config and expression indexes

2007-08-13 Thread Bruce Momjian

Heikki Linnakangas wrote:
 Bruce Momjian wrote:
  Heikki Linnakangas wrote:
  Removing the default configuration setting altogether removes the 2nd
  problem, but that's not good from a usability point of view. And it
  doesn't solve the general issue, you can still do things like:
  SELECT * FROM foo WHERE to_tsvector('confA', textcol) @@
  to_tsquery('confB', 'query');
  
  True, but in that case you are specifically naming different
  configurations, so it is hopefully obvious you have a mismatch.
 
 There's many more subtle ways to do that. For example, filling a
 tsvector column using a DEFAULT clause. But then you sometimes fill it
 in the application instead, with a different configuration. Or if one of
 the function calls is buried in another user defined function.
 
 I don't think explicitly naming the configuration gives enough protection.

Oh, wow, OK, well in that case the text search API isn't ready and we
will have to hold this for 8.4.

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings

Re: [HACKERS] default_text_search_config and expression indexes

2007-08-10 Thread Heikki Linnakangas

Bruce Momjian wrote:
 Heikki Linnakangas wrote:
 Removing the default configuration setting altogether removes the 2nd
 problem, but that's not good from a usability point of view. And it
 doesn't solve the general issue, you can still do things like:
 SELECT * FROM foo WHERE to_tsvector('confA', textcol) @@
 to_tsquery('confB', 'query');
 
 True, but in that case you are specifically naming different
 configurations, so it is hopefully obvious you have a mismatch.

There's many more subtle ways to do that. For example, filling a
tsvector column using a DEFAULT clause. But then you sometimes fill it
in the application instead, with a different configuration. Or if one of
the function calls is buried in another user defined function.

I don't think explicitly naming the configuration gives enough protection.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster

Re: [HACKERS] default_text_search_config and expression indexes

2007-08-10 Thread Heikki Linnakangas

Oleg Bartunov wrote:
 On Wed, 8 Aug 2007, Bruce Momjian wrote:
 Heikki Linnakangas wrote:
 If I understood correctly, the basic issue is that a tsvector datum
 created using configuration A is incompatible with a tsquery datum
 created using configuration B, in the sense that you won't get
 reasonable results if you use the tsquery to search the tsvector, or do
 ranking or highlighting. If the configurations happen to be similar
 enough, it can work, but not in general.

 Right.
 
 not fair. There are many cases when one can intentionally use different
 configurations. But I agree, this is not for beginners.

Can you give an example of that?

I certainly can see the need to use different configurations in one
database, but what's the use case for comparing a tsvector created with
configuration A against a tsquery created with configuration B?

 - using an expression index instead of a tsvector-field, and always
 explicitly specifying the configuration, you can avoid that problem (a
 query with a different configuration won't use the index). But an
 expression index, without explicitly specifying the configuration, will
 get corrupted if you change the default configuration.

 Right.
 
 the same problem if you drop constrain from table (accidently) and then
 gets surprised by select results.

The difference is that if you change the default configuration, you
won't expect that your queries start to return funny results. It looks
harmless, like changing the date style. If you drop a constraint, it's
much more obvious what the consequences are.

 We should agree that all you describe is only for DUMMY users. From
 authors point of view I dislike your approach to treat text searching as
 a very limited tool. But I understand that we should preserve people
 from stupid errors.
 
 I want for beginners easy setup and error-prone functionality,
 but leaving experienced users to develop complex search engines.
 Can we have separate safe interface for text searching and explicitly
 recommend it for beginners ?

I don't see how any of the suggestions limits what you can do with it.
If we remove the default configuration parameter, you just have to be
explicit. If we go with the type-system I suggested, you could still add
casts and conversion functions between different tsvector types, where
it make sense.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 6: explain analyze is your friend

Re: [HACKERS] default_text_search_config and expression indexes

2007-08-09 Thread Bruce Momjian

Oleg Bartunov wrote:
  Yea, seems more work than is necessary.  If we require the configuration
  to be always supplied, and document that mismatches are a problem, I
  think we are in good shape.
 
 We should agree that all you describe is only for DUMMY users. 
 From authors point of view I dislike your approach to treat text searching 
 as a very limited tool. But I understand that we should preserve people from 
 stupid errors.
 
 I want for beginners easy setup and error-prone functionality,
 but leaving experienced users to develop complex search engines.
 Can we have separate safe interface for text searching and explicitly
 recommend it for beginners ?

I am glad we are moving this interface discussion forward.  It seems 
Heikki has similar concerns about the interface being error-prone.

It would be nice to have a novice and advanced interface, but we would
have to document both, and then that is going to be confusing for users.

As I see it, specifying the configuration name in every function call is
the novice interface, and avoids the most common errors.  I can see
defaulting the interface name as being an advanced interface, but I
don't think it has enough of a feature to be worth documenting and
implementing.

If we figure out something better in 8.4 we can implement it, but at
this point I can't think of any good solution to not specifying the
configuration name every time.

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster

Re: [HACKERS] default_text_search_config and expression indexes

2007-08-09 Thread tomas

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On Thu, Aug 09, 2007 at 02:36:41AM -0400, Bruce Momjian wrote:
 Oleg Bartunov wrote:
   Yea, seems more work than is necessary.  If we require the configuration
   to be always supplied, and document that mismatches are a problem, I
   think we are in good shape.
  
  We should agree that all you describe is only for DUMMY users. 
  From authors point of view I dislike your approach to treat text searching 
  as a very limited tool [...]

[...]

 I am glad we are moving this interface discussion forward.  It seems 
 Heikki has similar concerns about the interface being error-prone.
 
 It would be nice to have a novice and advanced interface, but we would
 have to document both, and then that is going to be confusing for users.
 
 As I see it, specifying the configuration name in every function call is
 the novice interface, and avoids the most common errors.  I can see
 defaulting the interface name as being an advanced interface, but I
 don't think it has enough of a feature to be worth documenting and
 implementing.
 
 If we figure out something better in 8.4 we can implement it, but at
 this point I can't think of any good solution to not specifying the
 configuration name every time.

Maybe I'm missing something, but it seems to me that the configuration
is more attached to a column/index thatn to the whole database. If
there's a default in an expression, I'd rather expect this default to be
drawn from the index involved than from a global value (like a functional
index does now).

Regards
- -- tomás
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.1 (GNU/Linux)

iD8DBQFGuuAoBcgs9XrR2kYRAqiiAJsFL+Iu/b/xYaLza5ozmi839Qh5awCeOp+f
SZHKDPUHZ3u99XzLBn2ZKjw=
=twEt
-END PGP SIGNATURE-


---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org

Re: [HACKERS] default_text_search_config and expression indexes

2007-08-09 Thread Oleg Bartunov


On Thu, 9 Aug 2007, [EMAIL PROTECTED] wrote:


Maybe I'm missing something, but it seems to me that the configuration
is more attached to a column/index thatn to the whole database. If
there's a default in an expression, I'd rather expect this default to be
drawn from the index involved than from a global value (like a functional
index does now).


I'm tired to repeat - index itself doesn't know about configuration !
It just index tsvector data type. tsvector in turn can be obtained
using various ways:
1. manually
2. to_tsvector

tsvector can be stored in a separate attribute or be fully virtual
like in expressional index.

Moreover, tsvector can be obtained using various configurations depending
on your application.


Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings

Re: [HACKERS] default_text_search_config and expression indexes

2007-08-09 Thread tomas

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On Thu, Aug 09, 2007 at 02:03:13PM +0400, Oleg Bartunov wrote:
 On Thu, 9 Aug 2007, [EMAIL PROTECTED] wrote:
 
 Maybe I'm missing something [...]

 I'm tired to repeat - index itself doesn't know about configuration !
 It just index tsvector data type. tsvector in turn can be obtained
 using various ways:
 1. manually
 2. to_tsvector
 
 tsvector can be stored in a separate attribute or be fully virtual
 like in expressional index.
 
 Moreover, tsvector can be obtained using various configurations depending
 on your application.

Yep. I think I got that.

Now what will be the normal case for a new user?

* build an index with a virtual tsvector (using a configuration).
  Then I would expect the index to know the whole function to
  calculate its entries -- that would include the config used.

  This would be more compatible with the approach stated elsewhere to
  always mention explicitly the config.

* manually. Would a novice do that? Or is that advanced stuff?

Regards -- and sorry for my stupid questions :)

- -- tomás
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.1 (GNU/Linux)

iD8DBQFGu1ciBcgs9XrR2kYRAnVqAJ9mSc463I44JxLLDPpUZ/JirUVH5ACeOhUr
2/4aXs0ukMnvP8YCq8pamwQ=
=IgfC
-END PGP SIGNATURE-


---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org

Re: [HACKERS] default_text_search_config and expression indexes

2007-08-08 Thread Bruce Momjian

Heikki Linnakangas wrote:
  Sure, but you have make sure you use the right configuration in the
  trigger, no?  Does the tsquery have to use the same configuration?
  I wish I knew this myself. :-)   Whatever I had done happened to work
  but that was largely through people on IRC walking me through it.
  
  This illustrates the major issue --- that this has to be simple for
  people to get started, while keeping the capabilities for experienced
  users.
  
  I am now thinking that making users always specify the configuration
  name and not allowing :: casting is going to be the best approach.  We
  can always add more in 8.4 after it is in wide use.
 
 I just read the docs and I'm trying to get a grip of the problem here.
 
 If I understood correctly, the basic issue is that a tsvector datum
 created using configuration A is incompatible with a tsquery datum
 created using configuration B, in the sense that you won't get
 reasonable results if you use the tsquery to search the tsvector, or do
 ranking or highlighting. If the configurations happen to be similar
 enough, it can work, but not in general.

Right.

 That underlying issue manifests itself in many ways, including:
 - if you create table with a field of type tsvector, typically kept
 up-to-date by triggers, and do a search on it using a different
 configuration, you get incorrect results.

Right.

 - using an expression index instead of a tsvector-field, and always
 explicitly specifying the configuration, you can avoid that problem (a
 query with a different configuration won't use the index). But an
 expression index, without explicitly specifying the configuration, will
 get corrupted if you change the default configuration.

Right.

 Removing the default configuration setting altogether removes the 2nd
 problem, but that's not good from a usability point of view. And it
 doesn't solve the general issue, you can still do things like:
 SELECT * FROM foo WHERE to_tsvector('confA', textcol) @@
 to_tsquery('confB', 'query');

True, but in that case you are specifically naming different
configurations, so it is hopefully obvious you have a mismatch.

 ISTM we should have a separate tsvector and tsquery data type for each
 configuration, and throw an error if you try to mix and match them in a
 query. to_tsquery and to_tsvector would be new kind of polymorphic
 functions that work with the types. Or we could automatically create a
 copy of them when you create a new configuration. We could have a
 default configuration setting and rewrite queries that don't explicitly
 specify a configuration to use the default.

That is going to make multiple configurations quite complex in the
backend, and I think for little value.

 You could still get into trouble if you alter the configuration after
 starting to use it. We could solve that by not allowing you to ALTER
 CONFIGURATION, at least not if it's used in tables or indexes. Forcing
 people to create a new configuration, and to recreate all indexes and
 tsvector columns every time you add a word to a stop-list, for example,
 seems too onerous, though. Not sure what to do about that.

Yea, seems more work than is necessary.  If we require the configuration
to be always supplied, and document that mismatches are a problem, I
think we are in good shape.

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq

Re: [HACKERS] default_text_search_config and expression indexes

2007-08-08 Thread Oleg Bartunov


On Wed, 8 Aug 2007, Bruce Momjian wrote:


Heikki Linnakangas wrote:

Sure, but you have make sure you use the right configuration in the
trigger, no?  Does the tsquery have to use the same configuration?

I wish I knew this myself. :-)   Whatever I had done happened to work
but that was largely through people on IRC walking me through it.


This illustrates the major issue --- that this has to be simple for
people to get started, while keeping the capabilities for experienced
users.

I am now thinking that making users always specify the configuration
name and not allowing :: casting is going to be the best approach.  We
can always add more in 8.4 after it is in wide use.


I just read the docs and I'm trying to get a grip of the problem here.

If I understood correctly, the basic issue is that a tsvector datum
created using configuration A is incompatible with a tsquery datum
created using configuration B, in the sense that you won't get
reasonable results if you use the tsquery to search the tsvector, or do
ranking or highlighting. If the configurations happen to be similar
enough, it can work, but not in general.


Right.


not fair. There are many cases when one can intentionally use different
configurations. But I agree, this is not for beginners.




That underlying issue manifests itself in many ways, including:
- if you create table with a field of type tsvector, typically kept
up-to-date by triggers, and do a search on it using a different
configuration, you get incorrect results.


Right.


again, you might want to use different configuration.




- using an expression index instead of a tsvector-field, and always
explicitly specifying the configuration, you can avoid that problem (a
query with a different configuration won't use the index). But an
expression index, without explicitly specifying the configuration, will
get corrupted if you change the default configuration.


Right.


the same problem if you drop constrain from table (accidently) and then
gets surprised by select results.




Removing the default configuration setting altogether removes the 2nd
problem, but that's not good from a usability point of view. And it
doesn't solve the general issue, you can still do things like:
SELECT * FROM foo WHERE to_tsvector('confA', textcol) @@
to_tsquery('confB', 'query');


True, but in that case you are specifically naming different
configurations, so it is hopefully obvious you have a mismatch.


ISTM we should have a separate tsvector and tsquery data type for each
configuration, and throw an error if you try to mix and match them in a
query. to_tsquery and to_tsvector would be new kind of polymorphic
functions that work with the types. Or we could automatically create a
copy of them when you create a new configuration. We could have a
default configuration setting and rewrite queries that don't explicitly
specify a configuration to use the default.


That is going to make multiple configurations quite complex in the
backend, and I think for little value.


You could still get into trouble if you alter the configuration after
starting to use it. We could solve that by not allowing you to ALTER
CONFIGURATION, at least not if it's used in tables or indexes. Forcing
people to create a new configuration, and to recreate all indexes and
tsvector columns every time you add a word to a stop-list, for example,
seems too onerous, though. Not sure what to do about that.


Yea, seems more work than is necessary.  If we require the configuration
to be always supplied, and document that mismatches are a problem, I
think we are in good shape.


We should agree that all you describe is only for DUMMY users. 
From authors point of view I dislike your approach to treat text searching 
as a very limited tool. But I understand that we should preserve people from 
stupid errors.


I want for beginners easy setup and error-prone functionality,
but leaving experienced users to develop complex search engines.
Can we have separate safe interface for text searching and explicitly
recommend it for beginners ?

Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

   http://www.postgresql.org/about/donate

Re: [HACKERS] default_text_search_config and expression indexes

2007-08-05 Thread Heikki Linnakangas

Bruce Momjian wrote:
 Ron Mayer wrote:
 We need more feedback from users.
 Well, I am waiting for other hackers to get involved, but if they don't,
 I have to evaluate it myself on the email lists.
 Personally, I think documentation changes would be an OK way to
 to handle it.   Something that makes it extremely clear to the
 user the advantages of having the extra column and the risks
 of avoiding them.
 Sure, but you have make sure you use the right configuration in the
 trigger, no?  Does the tsquery have to use the same configuration?
 I wish I knew this myself. :-)   Whatever I had done happened to work
 but that was largely through people on IRC walking me through it.
 
 This illustrates the major issue --- that this has to be simple for
 people to get started, while keeping the capabilities for experienced
 users.
 
 I am now thinking that making users always specify the configuration
 name and not allowing :: casting is going to be the best approach.  We
 can always add more in 8.4 after it is in wide use.

I just read the docs and I'm trying to get a grip of the problem here.

If I understood correctly, the basic issue is that a tsvector datum
created using configuration A is incompatible with a tsquery datum
created using configuration B, in the sense that you won't get
reasonable results if you use the tsquery to search the tsvector, or do
ranking or highlighting. If the configurations happen to be similar
enough, it can work, but not in general.

That underlying issue manifests itself in many ways, including:
- if you create table with a field of type tsvector, typically kept
up-to-date by triggers, and do a search on it using a different
configuration, you get incorrect results.
- using an expression index instead of a tsvector-field, and always
explicitly specifying the configuration, you can avoid that problem (a
query with a different configuration won't use the index). But an
expression index, without explicitly specifying the configuration, will
get corrupted if you change the default configuration.

Removing the default configuration setting altogether removes the 2nd
problem, but that's not good from a usability point of view. And it
doesn't solve the general issue, you can still do things like:
SELECT * FROM foo WHERE to_tsvector('confA', textcol) @@
to_tsquery('confB', 'query');

ISTM we should have a separate tsvector and tsquery data type for each
configuration, and throw an error if you try to mix and match them in a
query. to_tsquery and to_tsvector would be new kind of polymorphic
functions that work with the types. Or we could automatically create a
copy of them when you create a new configuration. We could have a
default configuration setting and rewrite queries that don't explicitly
specify a configuration to use the default.

You could still get into trouble if you alter the configuration after
starting to use it. We could solve that by not allowing you to ALTER
CONFIGURATION, at least not if it's used in tables or indexes. Forcing
people to create a new configuration, and to recreate all indexes and
tsvector columns every time you add a word to a stop-list, for example,
seems too onerous, though. Not sure what to do about that.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly

Re: [HACKERS] default_text_search_config and expression indexes

2007-08-01 Thread Ron Mayer

Bruce Momjian wrote:
 Oleg Bartunov wrote:
 What is a basis of your assumption ? In my opinion, it's very limited
 use of text search, because it doesn't supports ranking. For 4-5 years
 of tsearch2 usage I never used it and I never seem in mailing lists.
 This is very user-oriented feature and we could probably ask 
 -general people for their opinion.

I think I asked about this kind of usage a couple years back;
and Oleg pointed out other reasons why it wasn't as good an
idea too.

http://archives.postgresql.org/pgsql-general/2005-10/msg00475.php
http://archives.postgresql.org/pgsql-general/2005-10/msg00477.php

The particular question I had asked why the functional index was
slower than maintaining the extra column; with the explanation
that the lossy index having to call the function (including
parsing, dictionary lookup, etc) for re-checking the data made
it inadvisable to avoid the extra column anyway.

 I doubt 'general' is going to understand the details of merging this
 into the backend.  I assume we have enough people on hackers to decide
 this.
 
 Are you saying the majority of users have a separate column with a
 trigger?

I think so.   At least when I was using it in 2005 the second
column with the trigger was faster than using a functional index.

 We need more feedback from users.
 
 Well, I am waiting for other hackers to get involved, but if they don't,
 I have to evaluate it myself on the email lists.

Personally, I think documentation changes would be an OK way to
to handle it.   Something that makes it extremely clear to the
user the advantages of having the extra column and the risks
of avoiding them.

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq

Re: [HACKERS] default_text_search_config and expression indexes

2007-08-01 Thread Bruce Momjian

Oleg Bartunov wrote:
 On Tue, 31 Jul 2007, Bruce Momjian wrote:
 
  Oleg Bartunov wrote:
  On Tue, 31 Jul 2007, Bruce Momjian wrote:
 
  And if we have to require the configuration name in CREATE INDEX, it has
  to be used in WHERE, so we might as well just remove the default
  capability and always require the configuration name.
 
  this is very rare use case for text searching
  1. expression index without configuration name
  2. default_text_search_config can be changed by somebody
 
  If you are going to be using the configuration name with the create
  expression index, you have to use it in the WHERE clause (or the index
  doesn't work), and I assume that is 90% of the text search uses.  I
  don't see it as rare at all.
 
  What is a basis of your assumption ? In my opinion, it's very limited
  use of text search, because it doesn't supports ranking. For 4-5 years
  of tsearch2 usage I never used it and I never seem in mailing lists.
  This is very user-oriented feature and we could probably ask
  -general people for their opinion.
 
  I doubt 'general' is going to understand the details of merging this
  into the backend.  I assume we have enough people on hackers to decide
  this.
 
 I mean not technical details, but use case. Does they need expressional
 index without ranking but sacrifice ability to use default configuration
 in other cases too ? My prediction is that  people doesn't ever thought about 
 this possibility until we said them about.

In a choice between expression indexes and default_text_search_config,
there is no question in my mind that expression indexes are more useful.
Lack of default_text_search_config only means you have to specify the
configuration name every time, and can't do casting to a text search
data type.

  Are you saying the majority of users have a separate column with a
  trigger?  Does the trigger specify the configuation?  I don't see that
  as a parameter argument to tsvector_update_trigger().  If you reload a
  pg_dump, what does it use for the configuration?
 
 
 yes, separate column with custom trigger works fine. It's up to you how
 to keep your data actual and it's up to you how to write trigger. 
 Our tsvector_update_trigger() is a tsvector_update_trigger_example() !

Well, that is the major problem --- that this is very error-prone,
especially considering that the tsvector_update_trigger() doesn't get it
right either.

  Why is a separate column better than the index?  Just ranking?
 
 ranking + composite documents. I already mentioned, that this could be
 rather expensive. Also, having separate column allow people various
 ways to say what is a document and even change it.

OK, I am confused why an expression index can't use those features if a
separate column can.  I realize the index can't store that information,
but why can the code pick it out of a heap column but not run the
function on the heap row to get that information.  I assume it is
something that is just hard to implement.

  The reason the expression index is nice is this feature has to be easy
  to use for people who are new to full text and even PostgreSQL.  Right
  now /contrib is fine for experts to use, but we want a larger user base
  for this feature.
 
 I agree here. This was one of the main reason of our work for 8.3.
 Probably, we shold think in another direction - not to curtail tsearch2
 and confuse rather big existing users, but to add an ability to save somehow
 configuration used for creating of *document*
 either implicitly (in expression index, or just gin(text_column)), or
 explicitly (separate column). There is no problem with index itself !

Agreed.  We need to find a way to save the configuration when the output
of a text search function is stored, either in an expression index or
via a trigger into a separate column, but only if we allow the default
configuration to be changed by non-super-users.

 
  Should we hold the patch for 8.4?
 
 If we're not agree to say in docs, that implicit usage of text search 
 configuration in CREATE INDEX command doesn't supported. Could we leave
 default_text_search_config for super-users, at least ?
 
 Anyway, let's wait what other people say.

The big problem is that not many people have taken the time to fully
understand how full text search works. I hoped that putting the updated
documentation online would help:

http://momjian.us/expire/fulltext/HTML/textsearch.html

but it seems it hasn't.

What we could do it if we make default_text_search_config
super-user-only and tell users at the start that if
default_text_search_config doesn't match the language they want to use,
then they have to read a documentation section that explains the problem
of configuration mismatches.

The problem with that is that we should be setting
default_text_search_config in the pg_dump output, like we do for
client_encoding, but because it is a super-user-only, it will fail for
non-super-user restores.

So, I am back to thinking

Re: [HACKERS] default_text_search_config and expression indexes

2007-08-01 Thread Bruce Momjian

Ron Mayer wrote:
 Bruce Momjian wrote:
  Oleg Bartunov wrote:
  What is a basis of your assumption ? In my opinion, it's very limited
  use of text search, because it doesn't supports ranking. For 4-5 years
  of tsearch2 usage I never used it and I never seem in mailing lists.
  This is very user-oriented feature and we could probably ask 
  -general people for their opinion.
 
 I think I asked about this kind of usage a couple years back;
 and Oleg pointed out other reasons why it wasn't as good an
 idea too.
 
 http://archives.postgresql.org/pgsql-general/2005-10/msg00475.php
 http://archives.postgresql.org/pgsql-general/2005-10/msg00477.php
 
 The particular question I had asked why the functional index was
 slower than maintaining the extra column; with the explanation
 that the lossy index having to call the function (including
 parsing, dictionary lookup, etc) for re-checking the data made
 it inadvisable to avoid the extra column anyway.
 
  I doubt 'general' is going to understand the details of merging this
  into the backend.  I assume we have enough people on hackers to decide
  this.
  
  Are you saying the majority of users have a separate column with a
  trigger?
 
 I think so.   At least when I was using it in 2005 the second
 column with the trigger was faster than using a functional index.

OK, it is good you measured it.  I wonder how GIN would behave because
it is not lossy.

  We need more feedback from users.
  
  Well, I am waiting for other hackers to get involved, but if they don't,
  I have to evaluate it myself on the email lists.
 
 Personally, I think documentation changes would be an OK way to
 to handle it.   Something that makes it extremely clear to the
 user the advantages of having the extra column and the risks
 of avoiding them.

Sure, but you have make sure you use the right configuration in the
trigger, no?  Does the tsquery have to use the same configuration?

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match

Re: [HACKERS] default_text_search_config and expression indexes

2007-08-01 Thread Ron Mayer

Bruce Momjian wrote:
 Ron Mayer wrote:
 Bruce Momjian wrote:
 Oleg Bartunov wrote:
 What is a basis of your assumption ? 
 I think I asked about this kind of usage a couple years back;...

 http://archives.postgresql.org/pgsql-general/2005-10/msg00475.php
 http://archives.postgresql.org/pgsql-general/2005-10/msg00477.php

 ...why the functional index was
 slower than maintaining the extra column; with the explanation
 that the lossy index having to call the function (including
 parsing, dictionary lookup, etc) for re-checking the data ...
 ...

 Are you saying the majority of users have a separate column with a
 trigger?
 I think so.   At least when I was using it in 2005 the second
 column with the trigger was faster than using a functional index.
 
 OK, it is good you measured it.  I wonder how GIN would behave because
 it is not lossy.

Too bad I don't have the same database around anymore.
It seems the re-parsing for re-checking for the lossy index was very
expensive, tho.
In the end, I suspect it depends greatly on what fraction of rows match.

 We need more feedback from users.
 Well, I am waiting for other hackers to get involved, but if they don't,
 I have to evaluate it myself on the email lists.
 Personally, I think documentation changes would be an OK way to
 to handle it.   Something that makes it extremely clear to the
 user the advantages of having the extra column and the risks
 of avoiding them.
 
 Sure, but you have make sure you use the right configuration in the
 trigger, no?  Does the tsquery have to use the same configuration?

I wish I knew this myself. :-)   Whatever I had done happened to work
but that was largely through people on IRC walking me through it.

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings

Re: [HACKERS] default_text_search_config and expression indexes

2007-08-01 Thread Bruce Momjian

Ron Mayer wrote:
  We need more feedback from users.
  Well, I am waiting for other hackers to get involved, but if they don't,
  I have to evaluate it myself on the email lists.
  Personally, I think documentation changes would be an OK way to
  to handle it.   Something that makes it extremely clear to the
  user the advantages of having the extra column and the risks
  of avoiding them.
  
  Sure, but you have make sure you use the right configuration in the
  trigger, no?  Does the tsquery have to use the same configuration?
 
 I wish I knew this myself. :-)   Whatever I had done happened to work
 but that was largely through people on IRC walking me through it.

This illustrates the major issue --- that this has to be simple for
people to get started, while keeping the capabilities for experienced
users.

I am now thinking that making users always specify the configuration
name and not allowing :: casting is going to be the best approach.  We
can always add more in 8.4 after it is in wide use.

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 6: explain analyze is your friend

Re: [HACKERS] default_text_search_config and expression indexes

2007-08-01 Thread Ron Mayer

Bruce Momjian wrote:
 Ron Mayer wrote:
 I wish I knew this myself. :-)   Whatever I had done happened to work
 but that was largely through people on IRC walking me through it.
 
 This illustrates the major issue --- that this has to be simple for
 people to get started, while keeping the capabilities for experienced
 users.
 
 I am now thinking that making users always specify the configuration
 name and not allowing :: casting is going to be the best approach.  We
 can always add more in 8.4 after it is in wide use.

That's fair.   Either the docs need to make it totally obvious or
the software should force people to do something safe.

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match

Re: [HACKERS] default_text_search_config and expression indexes

2007-07-31 Thread Oleg Bartunov


On Mon, 30 Jul 2007, Bruce Momjian wrote:


Oleg Bartunov wrote:

OK, here is what I am thinking.  If we make default_text_search_config
super-user-only, then the user can't do SET (using zero_damaged_pages
as a superuser-only example):

test= set zero_damaged_pages = on;
ERROR:  permission denied to set parameter zero_damaged_pages

test= alter user guest set zero_damaged_pages = on;
ERROR:  permission denied to set parameter zero_damaged_pages

but the super-user can set it in postgresql.conf, or:

test=# alter user guest set zero_damaged_pages = on;
ALTER ROLE

or

test=# alter database vendor3 set zero_damaged_pages = on;
ALTER ROLE

meaning while it will be super-user-only, the administrator can set the
default for specific databases and users.  Is that the best approach?

A user can still over-ride the default by specifying the configuration
in the function call.


This is ok, but it will not work in hosting environment and still
doesn't prevent errors.


Agreed.  super-user-only now seems strange to me because it isn't a
security issue, but rather an attempt to avoid people causing errors.

The fundamental issue is that if you do a query using tsvector and
tsquery everything will work find because default_text_search_config
will be the same for both queries.  The problem is if do an expression
index lookup that doesn't specify the configuration name and your
default_text_search_config doesn't match the index, or you INSERT or
UPDATE into an expression index with a mismatched
default_text_search_config.

If we do make default_text_search_config super-user-only it prevents a
database owner from doing ALTER DATABASE db1 SET
default_text_search_config = 'english', which seems like a pretty big
limitation because I think per-database default_text_search_config makes
the most sense.

And, again, if you specify the configuration in the expression index you
have to specify it in the WHERE clause and then
default_text_search_config is pretty useless.


agree. Notice, this is very limited usage case.



If we required the configuration to always be specified, you could still
store multiple configurations in the same column by having a secondary
column hold the configuration name:


I don't understand this. Let's don't discuss indexes at all, since indexes
doesn't know about configuratons at all



CREATE INDEX i on x USING gist (to_tsvector(config_col, body));


Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

---(end of broadcast)---
TIP 6: explain analyze is your friend

Re: [HACKERS] default_text_search_config and expression indexes

2007-07-31 Thread Oleg Bartunov


On Mon, 30 Jul 2007, Bruce Momjian wrote:


Bruce Momjian wrote:

We have to decide if we want a GUC default_text_search_config, and if so
when can it be changed.

Right now there are three ways to create a tsvector (or tsquery)

::tsvector
to_tsvector(value)
to_tsvector(config, value)

(ignoring plainto_tsvector)

Only the last one specifies the configuration. The others use the
configuration specified by default_text_search_config.  (We had an
previous discussion on what the default value of
default_text_search_config should be, and it was decided it should be
set via initdb based on a flag or the locale.)

Now, because most people use a single configuration, they can just set
default_text_search_config and there is no need to specify the
configuration name.

However, expression indexes cause a problem here:


http://momjian.us/expire/fulltext/HTML/textsearch-tables.html#TEXTSEARCH-TABLES-INDEX

We recommend that users create an expression index on the column they
want to do a full text search on, e.g.

CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector(body));

However, the big problem is that the expressions used in expression
indexes should not change their output based on the value of a GUC
variable (because it would corrupt the index), but in the case above,
default_text_search_config controls what configuration is used, and
hence the output of to_tsvector is changed if default_text_search_config
changes.

We have a few possible options:

1) Document the problem and do nothing else.
2) Make default_text_search_config a postgresql.conf-only
   setting, thereby making it impossible to change by non-super
   users, or make it a super-user-only setting.
3) Remove default_text_search_config and require the
   configuration to be specified in each function call.

If we remove default_text_search_config, it would also make ::tsvector
casting useless as well.


OK, I just found a case that I think is going to make #3 a requirement
(remove default_text_search_config).

How is a CREATE INDEX ... to_tsvector(col) going to restore from a
pg_dump?  I see no way of guaranteeing that the
default_text_search_config is correct on the restore, and in fact I
don't think we have any way of knowing the default_text_search_config
used for the index.

And if we have to require the configuration name in CREATE INDEX, it has
to be used in WHERE, so we might as well just remove the default
capability and always require the configuration name.


this is very rare use case for text searching 
1. expression index without configuration name

2. default_text_search_config can be changed by somebody

If somebody really need it, then he should be adviced to use configuration 
name, else we don't guarantee that somebody could change 
default_text_search_config  variable and this could lead to 
incorrect dump/restore.


I don't think we should remove default_text_search_config because of 
this rare case.


Regards,
Oleg

PS.

Bruce, I'm in the mountains the Northern Caucasia and internet is
a bit  unreliable :(

_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

---(end of broadcast)---
TIP 6: explain analyze is your friend

Re: [HACKERS] default_text_search_config and expression indexes

2007-07-31 Thread Bruce Momjian

Oleg Bartunov wrote:
  If we remove default_text_search_config, it would also make ::tsvector
  casting useless as well.
 
  OK, I just found a case that I think is going to make #3 a requirement
  (remove default_text_search_config).
 
  How is a CREATE INDEX ... to_tsvector(col) going to restore from a
  pg_dump?  I see no way of guaranteeing that the
  default_text_search_config is correct on the restore, and in fact I
  don't think we have any way of knowing the default_text_search_config
  used for the index.
 
  And if we have to require the configuration name in CREATE INDEX, it has
  to be used in WHERE, so we might as well just remove the default
  capability and always require the configuration name.
 
 this is very rare use case for text searching 
 1. expression index without configuration name
 2. default_text_search_config can be changed by somebody

If you are going to be using the configuration name with the create
expression index, you have to use it in the WHERE clause (or the index
doesn't work), and I assume that is 90% of the text search uses.  I
don't see it as rare at all.

 If somebody really need it, then he should be adviced to use configuration 
 name, else we don't guarantee that somebody could change 
 default_text_search_config  variable and this could lead to 
 incorrect dump/restore.
 
 I don't think we should remove default_text_search_config because of 
 this rare case.

I still feel the default_text_search_config has to be removed.  We have
tried all sorts of ways to make it work but having it not be 100%
reliable for pg_dump/restore means it might as well be in /contrib and
unsupported.  If we have it in core, it has to work 100%.  We can't have
tons of examples that don't specify the configuration name and then
expect every create expression index and WHERE clause to use it. 
default_text_search_config _can_ work, but it seems so easy to break and
so easy to get wrong that I think it must be removed.

If we are going to keep it, I need someone to explain why my comments
above are wrong.  If I am right, someone has to remove
default_text_search_config from the patch.   I can do the documentation.

 Bruce, I'm in the mountains the Northern Caucasia and internet is
 a bit  unreliable :(

Thanks.  I noticed  a lag in your reply.  Hope you are having a good
time.

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match

Re: [HACKERS] default_text_search_config and expression indexes

2007-07-31 Thread Oleg Bartunov


On Tue, 31 Jul 2007, Bruce Momjian wrote:


And if we have to require the configuration name in CREATE INDEX, it has
to be used in WHERE, so we might as well just remove the default
capability and always require the configuration name.


this is very rare use case for text searching
1. expression index without configuration name
2. default_text_search_config can be changed by somebody


If you are going to be using the configuration name with the create
expression index, you have to use it in the WHERE clause (or the index
doesn't work), and I assume that is 90% of the text search uses.  I
don't see it as rare at all.


What is a basis of your assumption ? In my opinion, it's very limited
use of text search, because it doesn't supports ranking. For 4-5 years
of tsearch2 usage I never used it and I never seem in mailing lists.
This is very user-oriented feature and we could probably ask 
-general people for their opinion.



If somebody really need it, then he should be adviced to use configuration
name, else we don't guarantee that somebody could change
default_text_search_config  variable and this could lead to
incorrect dump/restore.

I don't think we should remove default_text_search_config because of
this rare case.


I still feel the default_text_search_config has to be removed.  We have
tried all sorts of ways to make it work but having it not be 100%
reliable for pg_dump/restore means it might as well be in /contrib and
unsupported.  If we have it in core, it has to work 100%.  We can't have
tons of examples that don't specify the configuration name and then
expect every create expression index and WHERE clause to use it.
default_text_search_config _can_ work, but it seems so easy to break and
so easy to get wrong that I think it must be removed.


I'd better say we don't support text searching using expression index
than remove default_text_search_config. Anyway, I don't feel myself
responisble for such important problem. We need more feedback from 
users.




If we are going to keep it, I need someone to explain why my comments
above are wrong.  If I am right, someone has to remove
default_text_search_config from the patch.   I can do the documentation.


I'm in conference and then will be busy writing my applications and
earning money, Teodor is in vacation. I don't want to do 
hasty conclusion, since we're very tired to change our patch from 
one solution to another. We need consensus of developers and users.

I'm almost exhausted and have  no time  to continue this discussion.

Would you be so kind to write separate post about this problem and
call -hackers and -general for feedback. Let's experienced users
show their needs. We said everything and has nothing to add.


Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

---(end of broadcast)---
TIP 6: explain analyze is your friend

Re: [HACKERS] default_text_search_config and expression indexes

2007-07-31 Thread Bruce Momjian

Oleg Bartunov wrote:
 On Tue, 31 Jul 2007, Bruce Momjian wrote:
 
  And if we have to require the configuration name in CREATE INDEX, it has
  to be used in WHERE, so we might as well just remove the default
  capability and always require the configuration name.
 
  this is very rare use case for text searching
  1. expression index without configuration name
  2. default_text_search_config can be changed by somebody
 
  If you are going to be using the configuration name with the create
  expression index, you have to use it in the WHERE clause (or the index
  doesn't work), and I assume that is 90% of the text search uses.  I
  don't see it as rare at all.
 
 What is a basis of your assumption ? In my opinion, it's very limited
 use of text search, because it doesn't supports ranking. For 4-5 years
 of tsearch2 usage I never used it and I never seem in mailing lists.
 This is very user-oriented feature and we could probably ask 
 -general people for their opinion.

I doubt 'general' is going to understand the details of merging this
into the backend.  I assume we have enough people on hackers to decide
this.

Are you saying the majority of users have a separate column with a
trigger?  Does the trigger specify the configuation?  I don't see that
as a parameter argument to tsvector_update_trigger().  If you reload a
pg_dump, what does it use for the configuration?

Why is a separate column better than the index?  Just ranking?

The reason the expression index is nice is this feature has to be easy
to use for people who are new to full text and even PostgreSQL.  Right
now /contrib is fine for experts to use, but we want a larger user base
for this feature.

  If somebody really need it, then he should be adviced to use configuration
  name, else we don't guarantee that somebody could change
  default_text_search_config  variable and this could lead to
  incorrect dump/restore.
 
  I don't think we should remove default_text_search_config because of
  this rare case.
 
  I still feel the default_text_search_config has to be removed.  We have
  tried all sorts of ways to make it work but having it not be 100%
  reliable for pg_dump/restore means it might as well be in /contrib and
  unsupported.  If we have it in core, it has to work 100%.  We can't have
  tons of examples that don't specify the configuration name and then
  expect every create expression index and WHERE clause to use it.
  default_text_search_config _can_ work, but it seems so easy to break and
  so easy to get wrong that I think it must be removed.
 
 I'd better say we don't support text searching using expression index
 than remove default_text_search_config. Anyway, I don't feel myself
 responisble for such important problem. We need more feedback from 
 users.

Well, I am waiting for other hackers to get involved, but if they don't,
I have to evaluate it myself on the email lists.

  If we are going to keep it, I need someone to explain why my comments
  above are wrong.  If I am right, someone has to remove
  default_text_search_config from the patch.   I can do the documentation.
 
 I'm in conference and then will be busy writing my applications and
 earning money, Teodor is in vacation. I don't want to do 
 hasty conclusion, since we're very tired to change our patch from 
 one solution to another. We need consensus of developers and users.
 I'm almost exhausted and have  no time  to continue this discussion.
 
 Would you be so kind to write separate post about this problem and
 call -hackers and -general for feedback. Let's experienced users
 show their needs. We said everything and has nothing to add.

If you have no time to continue discussion and perhaps update the patch,
we can consider this patch dead for 8.3 and we can hold it for 8.4
because I can guarantee you this is going to need more discussion and
patch modification before it gets into CVS.

This patch is being treated fairly and exactly the same as every other
patch.

Should we hold the patch for 8.4?

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings

Re: [HACKERS] default_text_search_config and expression indexes

2007-07-31 Thread Oleg Bartunov


On Tue, 31 Jul 2007, Bruce Momjian wrote:


Oleg Bartunov wrote:

On Tue, 31 Jul 2007, Bruce Momjian wrote:


And if we have to require the configuration name in CREATE INDEX, it has
to be used in WHERE, so we might as well just remove the default
capability and always require the configuration name.


this is very rare use case for text searching
1. expression index without configuration name
2. default_text_search_config can be changed by somebody


If you are going to be using the configuration name with the create
expression index, you have to use it in the WHERE clause (or the index
doesn't work), and I assume that is 90% of the text search uses.  I
don't see it as rare at all.


What is a basis of your assumption ? In my opinion, it's very limited
use of text search, because it doesn't supports ranking. For 4-5 years
of tsearch2 usage I never used it and I never seem in mailing lists.
This is very user-oriented feature and we could probably ask
-general people for their opinion.


I doubt 'general' is going to understand the details of merging this
into the backend.  I assume we have enough people on hackers to decide
this.


I mean not technical details, but use case. Does they need expressional
index without ranking but sacrifice ability to use default configuration
in other cases too ? My prediction is that  people doesn't ever thought about 
this possibility until we said them about.




Are you saying the majority of users have a separate column with a
trigger?  Does the trigger specify the configuation?  I don't see that
as a parameter argument to tsvector_update_trigger().  If you reload a
pg_dump, what does it use for the configuration?



yes, separate column with custom trigger works fine. It's up to you how
to keep your data actual and it's up to you how to write trigger. 
Our tsvector_update_trigger() is a tsvector_update_trigger_example() !



Why is a separate column better than the index?  Just ranking?


ranking + composite documents. I already mentioned, that this could be
rather expensive. Also, having separate column allow people various
ways to say what is a document and even change it.



The reason the expression index is nice is this feature has to be easy
to use for people who are new to full text and even PostgreSQL.  Right
now /contrib is fine for experts to use, but we want a larger user base
for this feature.


I agree here. This was one of the main reason of our work for 8.3.
Probably, we shold think in another direction - not to curtail tsearch2
and confuse rather big existing users, but to add an ability to save somehow
configuration used for creating of *document*
either implicitly (in expression index, or just gin(text_column)), or
explicitly (separate column). There is no problem with index itself !





I'd better say we don't support text searching using expression index
than remove default_text_search_config. Anyway, I don't feel myself
responisble for such important problem. We need more feedback from
users.


Well, I am waiting for other hackers to get involved, but if they don't,
I have to evaluate it myself on the email lists.


If we are going to keep it, I need someone to explain why my comments
above are wrong.  If I am right, someone has to remove
default_text_search_config from the patch.   I can do the documentation.


I'm in conference and then will be busy writing my applications and
earning money, Teodor is in vacation. I don't want to do
hasty conclusion, since we're very tired to change our patch from
one solution to another. We need consensus of developers and users.
I'm almost exhausted and have  no time  to continue this discussion.

Would you be so kind to write separate post about this problem and
call -hackers and -general for feedback. Let's experienced users
show their needs. We said everything and has nothing to add.


If you have no time to continue discussion and perhaps update the patch,
we can consider this patch dead for 8.3 and we can hold it for 8.4
because I can guarantee you this is going to need more discussion and
patch modification before it gets into CVS.

This patch is being treated fairly and exactly the same as every other
patch.


why do you say this ? I didn't complain about this.



Should we hold the patch for 8.4?


If we're not agree to say in docs, that implicit usage of text search 
configuration in CREATE INDEX command doesn't supported. Could we leave

default_text_search_config for super-users, at least ?

Anyway, let's wait what other people say.

Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

---(end of broadcast)---
TIP 6: explain analyze is your friend

Re: [HACKERS] default_text_search_config and expression indexes

2007-07-30 Thread Bruce Momjian

Oleg Bartunov wrote:
  OK, here is what I am thinking.  If we make default_text_search_config
  super-user-only, then the user can't do SET (using zero_damaged_pages
  as a superuser-only example):
 
  test= set zero_damaged_pages = on;
  ERROR:  permission denied to set parameter zero_damaged_pages
 
  test= alter user guest set zero_damaged_pages = on;
  ERROR:  permission denied to set parameter zero_damaged_pages
 
  but the super-user can set it in postgresql.conf, or:
 
  test=# alter user guest set zero_damaged_pages = on;
  ALTER ROLE
 
  or
 
  test=# alter database vendor3 set zero_damaged_pages = on;
  ALTER ROLE
 
  meaning while it will be super-user-only, the administrator can set the
  default for specific databases and users.  Is that the best approach?
 
  A user can still over-ride the default by specifying the configuration
  in the function call.
 
 This is ok, but it will not work in hosting environment and still 
 doesn't prevent errors.

Agreed.  super-user-only now seems strange to me because it isn't a
security issue, but rather an attempt to avoid people causing errors.

The fundamental issue is that if you do a query using tsvector and
tsquery everything will work find because default_text_search_config
will be the same for both queries.  The problem is if do an expression
index lookup that doesn't specify the configuration name and your
default_text_search_config doesn't match the index, or you INSERT or
UPDATE into an expression index with a mismatched
default_text_search_config.

If we do make default_text_search_config super-user-only it prevents a
database owner from doing ALTER DATABASE db1 SET
default_text_search_config = 'english', which seems like a pretty big
limitation because I think per-database default_text_search_config makes
the most sense.

And, again, if you specify the configuration in the expression index you
have to specify it in the WHERE clause and then
default_text_search_config is pretty useless.

If we required the configuration to always be specified, you could still
store multiple configurations in the same column by having a secondary
column hold the configuration name:

CREATE INDEX i on x USING gist (to_tsvector(config_col, body));

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings

Re: [HACKERS] default_text_search_config and expression indexes

2007-07-30 Thread Bruce Momjian

Bruce Momjian wrote:
 We have to decide if we want a GUC default_text_search_config, and if so
 when can it be changed.
 
 Right now there are three ways to create a tsvector (or tsquery)
 
   ::tsvector
   to_tsvector(value)
   to_tsvector(config, value)
 
 (ignoring plainto_tsvector)
 
 Only the last one specifies the configuration. The others use the
 configuration specified by default_text_search_config.  (We had an
 previous discussion on what the default value of
 default_text_search_config should be, and it was decided it should be
 set via initdb based on a flag or the locale.)
 
 Now, because most people use a single configuration, they can just set
 default_text_search_config and there is no need to specify the
 configuration name.
 
 However, expression indexes cause a problem here:
 
   
 http://momjian.us/expire/fulltext/HTML/textsearch-tables.html#TEXTSEARCH-TABLES-INDEX
 
 We recommend that users create an expression index on the column they
 want to do a full text search on, e.g.
 
   CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector(body));
 
 However, the big problem is that the expressions used in expression
 indexes should not change their output based on the value of a GUC
 variable (because it would corrupt the index), but in the case above,
 default_text_search_config controls what configuration is used, and
 hence the output of to_tsvector is changed if default_text_search_config
 changes.
 
 We have a few possible options:
 
   1) Document the problem and do nothing else.
   2) Make default_text_search_config a postgresql.conf-only
  setting, thereby making it impossible to change by non-super
  users, or make it a super-user-only setting.
   3) Remove default_text_search_config and require the
  configuration to be specified in each function call.
 
 If we remove default_text_search_config, it would also make ::tsvector
 casting useless as well.

OK, I just found a case that I think is going to make #3 a requirement
(remove default_text_search_config).

How is a CREATE INDEX ... to_tsvector(col) going to restore from a
pg_dump?  I see no way of guaranteeing that the
default_text_search_config is correct on the restore, and in fact I
don't think we have any way of knowing the default_text_search_config
used for the index.

And if we have to require the configuration name in CREATE INDEX, it has
to be used in WHERE, so we might as well just remove the default
capability and always require the configuration name.

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq

Re: [HACKERS] default_text_search_config and expression indexes

2007-07-30 Thread Alvaro Herrera

Bruce Momjian wrote:
 Bruce Momjian wrote:

  3) Remove default_text_search_config and require the
 configuration to be specified in each function call.
  
  If we remove default_text_search_config, it would also make ::tsvector
  casting useless as well.
 
 OK, I just found a case that I think is going to make #3 a requirement
 (remove default_text_search_config).
 
 How is a CREATE INDEX ... to_tsvector(col) going to restore from a
 pg_dump?  I see no way of guaranteeing that the
 default_text_search_config is correct on the restore, and in fact I
 don't think we have any way of knowing the default_text_search_config
 used for the index.

Make pg_dump emit only CREATE INDEX sentences with two-param format.  In
fact I think it would make sense to convert internally the one-param
format to two-param, before hitting the catalogs.

This would also solve your problem about usability of WHERE clauses, if
you rewrite the one-param calls to two-params before the optimizer kicks
in.

-- 
Alvaro Herrera http://www.amazon.com/gp/registry/DXLWNGRJD34J
Nadie esta tan esclavizado como el que se cree libre no siendolo (Goethe)

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate

Re: [HACKERS] default_text_search_config and expression indexes

2007-07-30 Thread Bruce Momjian

Alvaro Herrera wrote:
 Bruce Momjian wrote:
  Bruce Momjian wrote:
 
 3) Remove default_text_search_config and require the
configuration to be specified in each function call.
   
   If we remove default_text_search_config, it would also make ::tsvector
   casting useless as well.
  
  OK, I just found a case that I think is going to make #3 a requirement
  (remove default_text_search_config).
  
  How is a CREATE INDEX ... to_tsvector(col) going to restore from a
  pg_dump?  I see no way of guaranteeing that the
  default_text_search_config is correct on the restore, and in fact I
  don't think we have any way of knowing the default_text_search_config
  used for the index.
 
 Make pg_dump emit only CREATE INDEX sentences with two-param format.  In
 fact I think it would make sense to convert internally the one-param
 format to two-param, before hitting the catalogs.

 This would also solve your problem about usability of WHERE clauses, if
 you rewrite the one-param calls to two-params before the optimizer kicks
 in.

Yes, that could be done but as far as I know we weren't planning to have
those areas of our backend be aware of configuration parameters to
text search functions, and I doubt we want to do that for 8.3, if ever.
There are many tsearch functions and doing this would spill tsearch
function checks all over the backend, reducing our modularity.

The default capability just isn't worth it, and in addition is
error-prone.

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate

Re: [HACKERS] default_text_search_config and expression indexes

2007-07-28 Thread Magnus Hagander

Bruce Momjian wrote:
 Magnus Hagander wrote:
 However, the big problem is that the expressions used in expression
 indexes should not change their output based on the value of a GUC
 variable (because it would corrupt the index), but in the case above,
 default_text_search_config controls what configuration is used, and
 hence the output of to_tsvector is changed if default_text_search_config
 changes.
 It wuoldn't actually *corrupt* the index, right? You could end up with
 wrong results, which might be regarded as corruption in one way, but as
 long as you change the value back the index still works, no?
 
 Right, it would _temporarily_ corrupt it.  ;-)
 
 We have a few possible options:

 1) Document the problem and do nothing else.
 2) Make default_text_search_config a postgresql.conf-only
setting, thereby making it impossible to change by non-super
users, or make it a super-user-only setting.
 3) Remove default_text_search_config and require the
configuration to be specified in each function call.

 If we remove default_text_search_config, it would also make ::tsvector
 casting useless as well.
 I think 3 is a really bad solution.

 2 is a half-bad solution. Do we have a way to say that it can be set at
 database-level for example, but not at user session? Making it
 superuser-only to change it but not postgresql.conf-only could accomplish
 that, along with warnings in the docs for the super user about the effects
 on current indexes by changing it.
 
 OK, here is what I am thinking.  If we make default_text_search_config
 super-user-only, then the user can't do SET (using zero_damaged_pages
 as a superuser-only example):
 
   test= set zero_damaged_pages = on;
   ERROR:  permission denied to set parameter zero_damaged_pages
   
   test= alter user guest set zero_damaged_pages = on;
   ERROR:  permission denied to set parameter zero_damaged_pages
 
 but the super-user can set it in postgresql.conf, or:
 
   test=# alter user guest set zero_damaged_pages = on;
   ALTER ROLE
 
 or
 
   test=# alter database vendor3 set zero_damaged_pages = on;
   ALTER ROLE
 
 meaning while it will be super-user-only, the administrator can set the
 default for specific databases and users.  Is that the best approach?

That's exactly the one I was trying to suggest. And then with clear
warnings in the docs around the place that if you set it to different
values accessing the same index (for different users, for example) weird
things may happen.

But I see the ALTER DATABASE a fairly common scenario (I know I would
use it), and from what I can tell thereis no risk at all with that.

//Magnus

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq

Re: [HACKERS] default_text_search_config and expression indexes

2007-07-28 Thread Oleg Bartunov


On Fri, 27 Jul 2007, Bruce Momjian wrote:


Magnus Hagander wrote:

However, the big problem is that the expressions used in expression
indexes should not change their output based on the value of a GUC
variable (because it would corrupt the index), but in the case above,
default_text_search_config controls what configuration is used, and
hence the output of to_tsvector is changed if default_text_search_config
changes.


It wuoldn't actually *corrupt* the index, right? You could end up with
wrong results, which might be regarded as corruption in one way, but as
long as you change the value back the index still works, no?


Right, it would _temporarily_ corrupt it.  ;-)


We have a few possible options:

1) Document the problem and do nothing else.
2) Make default_text_search_config a postgresql.conf-only
   setting, thereby making it impossible to change by non-super
   users, or make it a super-user-only setting.
3) Remove default_text_search_config and require the
   configuration to be specified in each function call.

If we remove default_text_search_config, it would also make ::tsvector
casting useless as well.


I think 3 is a really bad solution.

2 is a half-bad solution. Do we have a way to say that it can be set at
database-level for example, but not at user session? Making it
superuser-only to change it but not postgresql.conf-only could accomplish
that, along with warnings in the docs for the super user about the effects
on current indexes by changing it.


OK, here is what I am thinking.  If we make default_text_search_config
super-user-only, then the user can't do SET (using zero_damaged_pages
as a superuser-only example):

test= set zero_damaged_pages = on;
ERROR:  permission denied to set parameter zero_damaged_pages

test= alter user guest set zero_damaged_pages = on;
ERROR:  permission denied to set parameter zero_damaged_pages

but the super-user can set it in postgresql.conf, or:

test=# alter user guest set zero_damaged_pages = on;
ALTER ROLE

or

test=# alter database vendor3 set zero_damaged_pages = on;
ALTER ROLE

meaning while it will be super-user-only, the administrator can set the
default for specific databases and users.  Is that the best approach?

A user can still over-ride the default by specifying the configuration
in the function call.


This is ok, but it will not work in hosting environment and still 
doesn't prevent errors.


Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match

Re: [HACKERS] default_text_search_config and expression indexes

2007-07-27 Thread Pavel Stehule


 1) Document the problem and do nothing else.
 2) Make default_text_search_config a postgresql.conf-only
setting, thereby making it impossible to change by non-super
users, or make it a super-user-only setting.
 3) Remove default_text_search_config and require the
configuration to be specified in each function call.


Hello,

2+.

Regards
Pavel Stehule

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly

Re: [HACKERS] default_text_search_config and expression indexes

2007-07-27 Thread Oleg Bartunov


On Fri, 27 Jul 2007, Pavel Stehule wrote:



1) Document the problem and do nothing else.
2) Make default_text_search_config a postgresql.conf-only
   setting, thereby making it impossible to change by non-super
   users, or make it a super-user-only setting.
3) Remove default_text_search_config and require the
   configuration to be specified in each function call.



Hello,

2+.



One of the most important purpose of integrating tsearch2 was to 
facilitate full-text search for people in hosting environment. Usually,

they have no superuser rights. I'm asking don't forget about them !

There is no problem with current behaviour once user understand what he do.


Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

   http://www.postgresql.org/about/donate

Re: [HACKERS] default_text_search_config and expression indexes

2007-07-27 Thread Oleg Bartunov


On Fri, 27 Jul 2007, Pavel Stehule wrote:


2007/7/27, Oleg Bartunov [EMAIL PROTECTED]:

On Fri, 27 Jul 2007, Pavel Stehule wrote:



1) Document the problem and do nothing else.
2) Make default_text_search_config a postgresql.conf-only
   setting, thereby making it impossible to change by non-super
   users, or make it a super-user-only setting.
3) Remove default_text_search_config and require the
   configuration to be specified in each function call.



Hello,

2+.



One of the most important purpose of integrating tsearch2 was to
facilitate full-text search for people in hosting environment. Usually,
they have no superuser rights. I'm asking don't forget about them !

There is no problem with current behaviour once user understand what he do.




I am not sure if postgresql is well for multilangual hosting
environment. There is problem with locales. Without COLLATE support
postgresql can't be used in similar environment. :(



configuration has NOTHING with language ! This is a most frequent myth about
configuration. It's just the way we chose for default_text_search_config to
use language part of locale at initdb time.
text search configuration is just a bind between parser to use for
breaking document by lexems and mapping between lexeme type and dictionaries.



nice a day
Pavel Stehule



Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

---(end of broadcast)---
TIP 6: explain analyze is your friend

Re: [HACKERS] default_text_search_config and expression indexes

2007-07-27 Thread Pavel Stehule

2007/7/27, Oleg Bartunov [EMAIL PROTECTED]:
 On Fri, 27 Jul 2007, Pavel Stehule wrote:

 
  1) Document the problem and do nothing else.
  2) Make default_text_search_config a postgresql.conf-only
 setting, thereby making it impossible to change by non-super
 users, or make it a super-user-only setting.
  3) Remove default_text_search_config and require the
 configuration to be specified in each function call.
 
 
  Hello,
 
  2+.


 One of the most important purpose of integrating tsearch2 was to
 facilitate full-text search for people in hosting environment. Usually,
 they have no superuser rights. I'm asking don't forget about them !

 There is no problem with current behaviour once user understand what he do.



I am not sure if postgresql is well for multilangual hosting
environment. There is problem with locales. Without COLLATE support
postgresql can't be used in similar environment. :(

nice a day
Pavel Stehule

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster

Re: [HACKERS] default_text_search_config and expression indexes

2007-07-27 Thread Pavel Stehule

 configuration has NOTHING with language ! This is a most frequent myth about
 configuration. It's just the way we chose for default_text_search_config to
 use language part of locale at initdb time.
 text search configuration is just a bind between parser to use for
 breaking document by lexems and mapping between lexeme type and dictionaries.

 

I spoke about impossibility well configuration of postgresql without
administrator's rights. For my czech environment is administrator's
rights necessary too, because czech dictionary aren't in default
installation.

Regards
Pavel Stehule

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq

Re: [HACKERS] default_text_search_config and expression indexes

2007-07-27 Thread Magnus Hagander

On Thu, Jul 26, 2007 at 06:23:51PM -0400, Bruce Momjian wrote:
 Oleg Bartunov wrote:
   Second, I can't figure out how to reference a non-default
   configuration.
  
   See the multi-argument versions of to_tsvector etc.
  
   I do see a problem with having to_tsvector(config, text) plus
   to_tsvector(text) where the latter implicitly references a config
   selected by a GUC variable: how can you tell whether a query using the
   latter matches a particular index using the former?  There isn't
   anything in the current planner mechanisms that would make that work.
  
  Probably, having default text search configuration is not a good idea
  and we could just require it as a mandatory parameter, which could
  eliminate many confusion with selecting text search configuration.
 
 We have to decide if we want a GUC default_text_search_config, and if so
 when can it be changed.
 
 Right now there are three ways to create a tsvector (or tsquery)
 
   ::tsvector
   to_tsvector(value)
   to_tsvector(config, value)
 
 (ignoring plainto_tsvector)
 
 Only the last one specifies the configuration. The others use the
 configuration specified by default_text_search_config.  (We had an
 previous discussion on what the default value of
 default_text_search_config should be, and it was decided it should be
 set via initdb based on a flag or the locale.)
 
 Now, because most people use a single configuration, they can just set
 default_text_search_config and there is no need to specify the
 configuration name.
 
 However, expression indexes cause a problem here:
 
   
 http://momjian.us/expire/fulltext/HTML/textsearch-tables.html#TEXTSEARCH-TABLES-INDEX
 
 We recommend that users create an expression index on the column they
 want to do a full text search on, e.g.
 
   CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector(body));
 
 However, the big problem is that the expressions used in expression
 indexes should not change their output based on the value of a GUC
 variable (because it would corrupt the index), but in the case above,
 default_text_search_config controls what configuration is used, and
 hence the output of to_tsvector is changed if default_text_search_config
 changes.

It wuoldn't actually *corrupt* the index, right? You could end up with
wrong results, which might be regarded as corruption in one way, but as
long as you change the value back the index still works, no?


 We have a few possible options:
 
   1) Document the problem and do nothing else.
   2) Make default_text_search_config a postgresql.conf-only
  setting, thereby making it impossible to change by non-super
  users, or make it a super-user-only setting.
   3) Remove default_text_search_config and require the
  configuration to be specified in each function call.
 
 If we remove default_text_search_config, it would also make ::tsvector
 casting useless as well.

I think 3 is a really bad solution.

2 is a half-bad solution. Do we have a way to say that it can be set at
database-level for example, but not at user session? Making it
superuser-only to change it but not postgresql.conf-only could accomplish
that, along with warnings in the docs for the super user about the effects
on current indexes by changing it.

//Magnus

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings

Re: [HACKERS] default_text_search_config and expression indexes

2007-07-27 Thread Bruce Momjian

Magnus Hagander wrote:
  However, the big problem is that the expressions used in expression
  indexes should not change their output based on the value of a GUC
  variable (because it would corrupt the index), but in the case above,
  default_text_search_config controls what configuration is used, and
  hence the output of to_tsvector is changed if default_text_search_config
  changes.
 
 It wuoldn't actually *corrupt* the index, right? You could end up with
 wrong results, which might be regarded as corruption in one way, but as
 long as you change the value back the index still works, no?

Right, it would _temporarily_ corrupt it.  ;-)

  We have a few possible options:
  
  1) Document the problem and do nothing else.
  2) Make default_text_search_config a postgresql.conf-only
 setting, thereby making it impossible to change by non-super
 users, or make it a super-user-only setting.
  3) Remove default_text_search_config and require the
 configuration to be specified in each function call.
  
  If we remove default_text_search_config, it would also make ::tsvector
  casting useless as well.
 
 I think 3 is a really bad solution.
 
 2 is a half-bad solution. Do we have a way to say that it can be set at
 database-level for example, but not at user session? Making it
 superuser-only to change it but not postgresql.conf-only could accomplish
 that, along with warnings in the docs for the super user about the effects
 on current indexes by changing it.

OK, here is what I am thinking.  If we make default_text_search_config
super-user-only, then the user can't do SET (using zero_damaged_pages
as a superuser-only example):

test= set zero_damaged_pages = on;
ERROR:  permission denied to set parameter zero_damaged_pages

test= alter user guest set zero_damaged_pages = on;
ERROR:  permission denied to set parameter zero_damaged_pages

but the super-user can set it in postgresql.conf, or:

test=# alter user guest set zero_damaged_pages = on;
ALTER ROLE

or

test=# alter database vendor3 set zero_damaged_pages = on;
ALTER ROLE

meaning while it will be super-user-only, the administrator can set the
default for specific databases and users.  Is that the best approach?

A user can still over-ride the default by specifying the configuration
in the function call.

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq

[HACKERS] default_text_search_config and expression indexes

2007-07-26 Thread Bruce Momjian

Oleg Bartunov wrote:
  Second, I can't figure out how to reference a non-default
  configuration.
 
  See the multi-argument versions of to_tsvector etc.
 
  I do see a problem with having to_tsvector(config, text) plus
  to_tsvector(text) where the latter implicitly references a config
  selected by a GUC variable: how can you tell whether a query using the
  latter matches a particular index using the former?  There isn't
  anything in the current planner mechanisms that would make that work.
 
 Probably, having default text search configuration is not a good idea
 and we could just require it as a mandatory parameter, which could
 eliminate many confusion with selecting text search configuration.

We have to decide if we want a GUC default_text_search_config, and if so
when can it be changed.

Right now there are three ways to create a tsvector (or tsquery)

::tsvector
to_tsvector(value)
to_tsvector(config, value)

(ignoring plainto_tsvector)

Only the last one specifies the configuration. The others use the
configuration specified by default_text_search_config.  (We had an
previous discussion on what the default value of
default_text_search_config should be, and it was decided it should be
set via initdb based on a flag or the locale.)

Now, because most people use a single configuration, they can just set
default_text_search_config and there is no need to specify the
configuration name.

However, expression indexes cause a problem here:


http://momjian.us/expire/fulltext/HTML/textsearch-tables.html#TEXTSEARCH-TABLES-INDEX

We recommend that users create an expression index on the column they
want to do a full text search on, e.g.

CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector(body));

However, the big problem is that the expressions used in expression
indexes should not change their output based on the value of a GUC
variable (because it would corrupt the index), but in the case above,
default_text_search_config controls what configuration is used, and
hence the output of to_tsvector is changed if default_text_search_config
changes.

We have a few possible options:

1) Document the problem and do nothing else.
2) Make default_text_search_config a postgresql.conf-only
   setting, thereby making it impossible to change by non-super
   users, or make it a super-user-only setting.
3) Remove default_text_search_config and require the
   configuration to be specified in each function call.

If we remove default_text_search_config, it would also make ::tsvector
casting useless as well.

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq

54 matches

Mail list logo