Re: [HACKERS] How does the tsearch configuration get selected?

2007-06-16 Thread Bruce Momjian
Teodor Sigaev wrote:
 So, added to my plan 
 (http://archives.postgresql.org/pgsql-hackers/2007-06/msg00618.php)
 n) single encoded files. That will touch snowball, ispell, synonym, thesaurus
 and simple dictionaries
 n+1) use encoding names instead of locale's names in configuration

FYI, I am continuing with the documentation cleanup, though I will not
do the /ref directory until we are sure which commands will be kept.

We can later modify the documentation to match the new behavior.

---


 
 Tom Lane wrote:
  Teodor Sigaev [EMAIL PROTECTED] writes:
  But configuration for different languages might be differ, for example
  russian (and any cyrillic-based) configuration is differ from
  west-european configuration based on different character sets.
  
  Sure.  I'm just assuming that the set of stopwords doesn't need to vary
  depending on the encoding you're using for a language --- that is, if
  you're willing to convert the encoding then the same stopword list file
  should serve for all encodings of a given language.  Do you think this
  might be wrong?
  
  regards, tom lane
  
  ---(end of broadcast)---
  TIP 9: In versions below 8.0, the planner will ignore your desire to
 choose an index scan if your joining column's datatypes do not
 match
 
 -- 
 Teodor Sigaev   E-mail: [EMAIL PROTECTED]
 WWW: http://www.sigaev.ru/

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +


Re: [HACKERS] How does the tsearch configuration get selected?

2007-06-15 Thread Teodor Sigaev

Probably, having default text search configuration is not a good idea
and we could just require it as a mandatory parameter, which could
eliminate many confusion with selecting text search configuration.

Ugh. Having a default configuration (by locale, by postgresql.conf, or some
other way) simplifies life a lot in most cases.


--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/


Re: [HACKERS] How does the tsearch configuration get selected?

2007-06-15 Thread Bruce Momjian
Tom Lane wrote:
 Bruce Momjian [EMAIL PROTECTED] writes:
  First, why are we specifying the server locale here since it never
  changes:
 
 It's poorly described.  What it should really say is the language
 that the text-to-be-searched is in.  We can actually support multiple
 languages here today, the restriction being that there have to be
 stemmer instances for the languages with the database encoding you're
 using.  With UTF8 encoding this isn't much of a restriction.  We do need
 to put code into the dictionary stuff to enforce that you can't use a
 stemmer when the database encoding isn't compatible with it.
 
 I would prefer that we not drive any of this stuff off the server's
 LC_xxx settings, since as you say that restricts things to just one
 locale.

The idea they had was to set the _default_ full text configuration to
match the locale, e.g. UTF8.en_US.  This works well for cases where we
ship a number of pre-installed full text configurations in pg_catalog.
But of course you can support multiple languages with that
encoding/locale, so you have to have the ability to do other languages,
but not necessarily by default.

  Second, I can't figure out how to reference a non-default
  configuration.
 
 See the multi-argument versions of to_tsvector etc.
 
 I do see a problem with having to_tsvector(config, text) plus
 to_tsvector(text) where the latter implicitly references a config
 selected by a GUC variable: how can you tell whether a query using the
 latter matches a particular index using the former?  There isn't
 anything in the current planner mechanisms that would make that work.

Well, now that I have gotten feedback, we have a few options:

1)  Require the configuration to be always specified.  The problem with
this is that casting (::tsquery) and operators (@@) have no way to
specify a configuration.

2)  Use a GUC that you can set for the configuration, and perhaps
default it if possible to match the locale.  Is the default affected by
search_path (ouch)?

How do we make sure that any index that is accessed is using the same
configuration that is being used by the query, e.g. ::tsquery?  Do we
have to store the configuration name in the index and somehow throw an
error if it doesn't match?  What about changes to the configuration
after the index has been created, e.g. new stop words or dictionaries?

The two big open issues are whether we allow a default configuration,
and whether we require the configuration name to be always specified.

My guess right now is that we use a GUC that will default if a
pg_catalog configuration name matches the lc_ctype locale name, and we
have to throw an error if an accessed index creation GUC doesn't match
the current GUC.

So we create a pg_catalog full text configuration named UTF8.en-US, and
some others like ru_RU.UTF-8.

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +


Re: [HACKERS] How does the tsearch configuration get selected?

2007-06-15 Thread Bruce Momjian
Bruce Momjian wrote:
 My guess right now is that we use a GUC that will default if a
 pg_catalog configuration name matches the lc_ctype locale name, and we
 have to throw an error if an accessed index creation GUC doesn't match
 the current GUC.
 
 So we create a pg_catalog full text configuration named UTF8.en-US, and
 some others like ru_RU.UTF-8.

Do locale names vary across operating systems?  If so, we might as well
skip trying to find a default.

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +


Re: [HACKERS] How does the tsearch configuration get selected?

2007-06-15 Thread Teodor Sigaev

1)  Require the configuration to be always specified.  The problem with
this is that casting (::tsquery) and operators (@@) have no way to
specify a configuration.

It's not convenient in the most common cases.



2)  Use a GUC that you can set for the configuration, and perhaps
default it if possible to match the locale.  Is the default affected by
search_path (ouch)?

Right now it works that way.



How do we make sure that any index that is accessed is using the same
configuration that is being used by the query, e.g. ::tsquery?  Do we
have to store the configuration name in the index and somehow throw an
error if it doesn't match?  What about changes to the configuration
after the index has been created, e.g. new stop words or dictionaries?

That may be an intentional case, so we should not throw an ERROR!




The two big open issues are whether we allow a default configuration,
and whether we require the configuration name to be always specified.

My guess right now is that we use a GUC that will default if a
pg_catalog configuration name matches the lc_ctype locale name, and we
have to throw an error if an accessed index creation GUC doesn't match
the current GUC.


Where will the index store the index-creation GUC?


So we create a pg_catalog full text configuration named UTF8.en-US, and
some others like ru_RU.UTF-8.



--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/


Re: [HACKERS] How does the tsearch configuration get selected?

2007-06-15 Thread Tom Lane
Teodor Sigaev [EMAIL PROTECTED] writes:
 My guess right now is that we use a GUC that will default if a
 pg_catalog configuration name matches the lc_ctype locale name, and we
 have to throw an error if an accessed index creation GUC doesn't match
 the current GUC.

 Where will index store index creation GUC?

It's not really the index's problem; IIUC the behavior of the gist and
gin index opclasses is not locale-specific.  It's the to_tsvector calls
that built the tsvector heap column that have a locale specified or
implicit.  We need some way of annotating the heap column about this.

In the case of a functional index you can expose the locale:

create index ... (to_tsvector('english'::regconfig, mytextcol))

but there's still the problem that the planner cannot match that to
a query specified as just WHERE to_tsvector(mytextcol) @@ query.
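
A toy model of the matching problem described above, written in Python and not resembling the planner's real code. It assumes, purely for illustration, that an index expression is usable only when it is structurally identical to the expression in the query. The tuple encoding and the names index_expr, usable_index and current_default_config are hypothetical.

```python
# Expressions are modeled as nested tuples: (function_name, arg1, arg2, ...).
# This is an illustrative encoding, not PostgreSQL's expression tree format.

# Expression stored with the functional index:
#   to_tsvector('english'::regconfig, mytextcol)
index_expr = ("to_tsvector", ("const", "english"), ("column", "mytextcol"))

# Expression written in the query with the configuration left implicit.
query_one_arg = ("to_tsvector", ("column", "mytextcol"))

# Expression written in the query with the configuration spelled out.
query_two_arg = ("to_tsvector", ("const", "english"), ("column", "mytextcol"))

# Hypothetical session default; the comparison below never consults it,
# which is the heart of the problem.
current_default_config = "english"


def usable_index(index_expression, query_expression):
    """Return True only if the two expression trees are identical."""
    return index_expression == query_expression


print(usable_index(index_expr, query_two_arg))  # explicit config, same tree
print(usable_index(index_expr, query_one_arg))  # implicit config, different tree
```

In this model the one-argument form never matches the index, even when the default happens to be the same configuration, because nothing in the comparison knows what the default resolves to.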

regards, tom lane


Re: [HACKERS] How does the tsearch configuration get selected?

2007-06-15 Thread Teodor Sigaev

I'd suggest allowing either full names (swedish) or the standard
two-letter abbreviations (sv).  But let's stay away from locale names.

We can use the database's encoding name (the same names used in initdb -E).


--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/


Re: [HACKERS] How does the tsearch configuration get selected?

2007-06-15 Thread Tom Lane
Bruce Momjian [EMAIL PROTECTED] writes:
 Do locale names vary across operating systems?

Yes, which is the fatal flaw in the whole thing.  The ru_RU part is
reasonably well standardized, but the encoding part is not.  Considering
that encoding is exactly the part of it we don't care about for this
purpose (because we should look to the database encoding instead),
I think it's just going to make life harder not easier to model search
language names on locales.

I'd suggest allowing either full names (swedish) or the standard
two-letter abbreviations (sv).  But let's stay away from locale names.
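
A minimal sketch of what accepting either spelling could look like, in Python for brevity. The abbreviation table and the function name resolve_language are assumptions made for this example; the thread does not specify a list of languages or any lookup code.

```python
# Hypothetical table of accepted two-letter abbreviations; a real list would
# be longer and is not specified anywhere in this thread.
ABBREVIATIONS = {"en": "english", "ru": "russian", "sv": "swedish"}
FULL_NAMES = set(ABBREVIATIONS.values())


def resolve_language(name):
    """Map 'sv' or 'swedish' to 'swedish'; reject anything else."""
    key = name.strip().lower()
    if key in FULL_NAMES:
        return key
    if key in ABBREVIATIONS:
        return ABBREVIATIONS[key]
    raise ValueError(f"unknown search language: {name!r}")


print(resolve_language("sv"))
print(resolve_language("Swedish"))

# A locale-style name is deliberately not understood.
try:
    resolve_language("ru_RU.UTF-8")
except ValueError as exc:
    print(exc)
```

Because the encoding never appears in the name, the same spelling works on every operating system.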

regards, tom lane


Re: [HACKERS] How does the tsearch configuration get selected?

2007-06-15 Thread Tom Lane
Teodor Sigaev [EMAIL PROTECTED] writes:
 I'd suggest allowing either full names (swedish) or the standard
 two-letter abbreviations (sv).  But let's stay away from locale names.

 We can use database's encoding name (the same names used in initdb -E)

AFAICS the encoding name shouldn't be anywhere near this.

The only reason the TS stuff needs an encoding spec is to figure out how
to read an external stop word file.  I think my suggestion upthread is a
lot better: have just one stop word file per language, store them all in
UTF8, and convert to database encoding when loading them.  The database
encoding is implicit and doesn't need to be mentioned anywhere in the TS
configuration.
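
A rough sketch of the "store in UTF8, convert when loading" idea, in Python and not the server's C. The function name load_stopwords and the sample words are assumptions for illustration, and Python codec names (koi8_r, latin1) stand in for PostgreSQL's encoding names.

```python
def load_stopwords(raw_bytes, db_encoding):
    """Decode a UTF-8 stop word file and re-encode each word for the database.

    Raises UnicodeDecodeError if the file is not valid UTF-8 and
    UnicodeEncodeError if a word cannot be represented in db_encoding.
    """
    text = raw_bytes.decode("utf-8")
    words = set()
    for line in text.splitlines():
        word = line.strip()
        if word:
            words.add(word.encode(db_encoding))
    return words


# Contents of a hypothetical Russian stop word file, stored as UTF-8.
russian_file = "и\nв\nне\n".encode("utf-8")

koi8_words = load_stopwords(russian_file, "koi8_r")
print(len(koi8_words))

# The same file cannot be loaded into a Latin-1 database; the conversion
# step reports that instead of silently storing garbage.
try:
    load_stopwords(russian_file, "latin1")
except UnicodeEncodeError:
    print("stop words not representable in latin1")
```

The caller never names the file's encoding, so nothing a user supplies can put the file and the database out of step.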

regards, tom lane


Re: [HACKERS] How does the tsearch configuration get selected?

2007-06-15 Thread Gregory Stark

Tom Lane [EMAIL PROTECTED] writes:

 It's not really the index's problem; IIUC the behavior of the gist and
 gin index opclasses is not locale-specific.  It's the to_tsvector calls
 that built the tsvector heap column that have a locale specified or
 implicit.  We need some way of annotating the heap column about this.

 In the case of a functional index you can expose the locale:

   create index ... (to_tsvector('english'::regconfig, mytextcol))

Maybe there should be a different type for each locale.

I'm not exactly following this thread so I'm not entirely sure whether that
would actually fit well but it's just a thought I had.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com



Re: [HACKERS] How does the tsearch configuration get selected?

2007-06-15 Thread Tom Lane
Gregory Stark [EMAIL PROTECTED] writes:
 Tom Lane [EMAIL PROTECTED] writes:
 It's not really the index's problem; IIUC the behavior of the gist and
 gin index opclasses is not locale-specific.  It's the to_tsvector calls
 that built the tsvector heap column that have a locale specified or
 implicit.  We need some way of annotating the heap column about this.

 Maybe there should be a different type for each locale.

I had been idly wondering if we could do anything with using tsvector's
typmod for the purpose ...

regards, tom lane


Re: [HACKERS] How does the tsearch configuration get selected?

2007-06-15 Thread Teodor Sigaev

The only reason the TS stuff needs an encoding spec is to figure out how
to read an external stop word file.  I think my suggestion upthread is a
lot better: have just one stop word file per language, store them all in
UTF8, and convert to database encoding when loading them.  The database


Hmm. You mean to use the language name in the configuration, use the current
encoding to determine which dictionary should be used (stemmers for the same
language differ between encodings), and recode the dictionary files from UTF8
to the current locale. Did I understand you right?


That's possible to do. But it's an incompatible change and will cause some
difficulties for DBAs. If the server locale is ISO (or KOI8 or any other) and
the file is in UTF8, then text editors/tools might be confused.



--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/


Re: [HACKERS] How does the tsearch configuration get selected?

2007-06-15 Thread Teodor Sigaev

It's not really the index's problem; IIUC the behavior of the gist and
gin index opclasses is not locale-specific.  


Right


It's the to_tsvector calls
that built the tsvector heap column that have a locale specified or
implicit.  We need some way of annotating the heap column about this.

It seems too restrictive for advanced users.



In the case of a functional index you can expose the locale:

create index ... (to_tsvector('english'::regconfig, mytextcol))

but there's still the problem that the planner cannot match that to
a query specified as just WHERE to_tsvector(mytextcol) @@ query.



--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/


Re: [HACKERS] How does the tsearch configuration get selected?

2007-06-15 Thread Tom Lane
Teodor Sigaev [EMAIL PROTECTED] writes:
 It's the to_tsvector calls
 that built the tsvector heap column that have a locale specified or
 implicit.  We need some way of annotating the heap column about this.

 It seems too restrictive to advanced users.

Hm, are you trying to say that it's sane to have different tsvectors in
a column computed under different language settings?  Maybe we're all
overthinking the problem.  If the tsvector representation is presumed
language-independent then I could see this being a workable approach.

regards, tom lane


Re: [HACKERS] How does the tsearch configuration get selected?

2007-06-15 Thread Tom Lane
Teodor Sigaev [EMAIL PROTECTED] writes:
 Hmm. You mean to use language name in configuration, use current encoding to
 define which dictionary should be used (stemmers for the same language are 
 different for different encoding) and recode dictionaries file from UTF8 to 
 current locale. Did I understand you right?

Right.

 That's possible to do. But it's incompatible changes and cause some
 difficulties for DBA. If server locale is ISO (or KOI8 or any other)
 and file is in UTF8 then text editor/tools might be confused.

Well, I'm not as worried about that as I am about the database being
confused ;-).  We need some way to deal with stopword files that are in
a different encoding than the database encoding, and this has to be
proof against accidental or malicious mistakes by the non-superuser
users who are going to be able to specify which stopword file to use.
So I don't want the specification that goes into the CREATE DICTIONARY
command to involve an encoding.

One possibility is that the user-visible specification is just a name
(eg, english), but the actual filename out on the filesystem is,
say, name.encoding.stop (eg, english.utf8.stop) where we use PG's
names for the encodings.  We could just fail if there's not a file
matching the database encoding, or we could try that and then try
utf8, or some other rule.  In any case I'd want it to verify and
convert encoding as necessary while reading.
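
One way the lookup rule above could be sketched, in Python for brevity. This covers only the "try the database encoding, then try utf8" variant; the function name find_stopword_file, the directory layout, and the use of Python-style encoding names in place of PG's names are all assumptions for illustration.

```python
import os
import tempfile


def find_stopword_file(directory, name, db_encoding, fallback="utf8"):
    """Return (path, file_encoding) for the first candidate file that exists.

    Tries name.<db_encoding>.stop first, then name.<fallback>.stop.
    """
    for encoding in (db_encoding, fallback):
        path = os.path.join(directory, f"{name}.{encoding}.stop")
        if os.path.isfile(path):
            return path, encoding
    raise FileNotFoundError(f"no stop word file for {name!r}")


with tempfile.TemporaryDirectory() as directory:
    # Only a utf8 file is installed for "english".
    with open(os.path.join(directory, "english.utf8.stop"), "wb") as handle:
        handle.write(b"the\na\nof\n")

    # The database encoding is latin1, so the first candidate is missing
    # and the utf8 file is chosen instead.
    path, found = find_stopword_file(directory, "english", "latin1")
    print(os.path.basename(path), found)

    # No file at all for this name: the lookup fails outright.
    try:
        find_stopword_file(directory, "swedish", "latin1")
        missing = False
    except FileNotFoundError:
        missing = True
    print(missing)
```

The user-visible name ("english") carries no encoding; the encoding is recovered from the filename that was actually found, which tells the reader how to verify and convert the contents.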

regards, tom lane


Re: [HACKERS] How does the tsearch configuration get selected?

2007-06-15 Thread Teodor Sigaev

Hm, are you trying to say that it's sane to have different tsvectors in
a column computed under different language settings?  Maybe we're all


Yes, I think so.

That might make sense for closely related languages. Norwegian has two dialects,
and one of them has advanced rules for compound words; Russian and Ukrainian have
similar rules, etc. The @@ operation is language (and encoding) independent: it
uses just a strcmp call.


The most common use case for mixing configurations, described by me somewhere in
this thread, is using two different configurations for indexing (tsvector
creation) and search (tsquery creation). BTW, a thesaurus dictionary could be
used for similar reasons in a search-only configuration.


OpenFTS doesn't use the tsearch2 configuration at all; it has such infrastructure
itself. So tsvector shouldn't carry any information about the configuration.


The most common configuration change is adding new stop words, which doesn't
affect the correctness of search. Removing stop words makes it impossible to find
already-indexed documents with a query that contains only the removed stop words.
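
The two cases above can be walked through with a toy model, sketched here in Python. It treats a tsvector as a plain set of lowercased words and a query as a list of words that must all be present. The helper names make_tsvector, make_query and matches are hypothetical, and real dictionaries do far more than stop word filtering.

```python
def make_tsvector(text, stopwords):
    """Lexemes kept for a document: lowercased words minus stop words."""
    return {word for word in text.lower().split() if word not in stopwords}


def make_query(text, stopwords):
    """Query terms, all of which must be present (an AND query)."""
    return [word for word in text.lower().split() if word not in stopwords]


def matches(tsvector, terms):
    """Plain string comparison of lexemes; an empty query matches nothing."""
    return bool(terms) and all(term in tsvector for term in terms)


old_stopwords = {"to", "be", "or", "not", "the"}

# Documents are indexed once, under the old configuration.
doc_a = make_tsvector("the cat sat", old_stopwords)
doc_b = make_tsvector("to be or not to be", old_stopwords)  # nothing survives

# Case 1: a stop word ("sat") is added later. The query drops it too,
# so the stored document is still found.
added = old_stopwords | {"sat"}
print(matches(doc_a, make_query("cat sat", added)))

# Case 2: stop words are removed later. The query now keeps "to" and "be",
# but doc_b was stored without them, so it cannot be found until reindexed.
removed = {"the"}
print(matches(doc_b, make_query("to be", removed)))
```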




overthinking the problem.  If the tsvector representation is presumed
language-independent then I could see this being a workable approach.


Actually, we should allow only 'compatible' changes of configuration, but it is
very hard (or even impossible) to formulate rules about that. Any dictionary has
its specific dictinitoption changes that make it incompatible with itself; the
same goes for compatibility between two dictionaries or lists of dictionaries.


In practice, we didn't see any disasters after changes in configuration; until
reindexing, search just becomes less precise.



--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/


Re: [HACKERS] How does the tsearch configuration get selected?

2007-06-15 Thread Teodor Sigaev
So, added to my plan 
(http://archives.postgresql.org/pgsql-hackers/2007-06/msg00618.php)

n) single encoded files. That will touch snowball, ispell, synonym, thesaurus
   and simple dictionaries
n+1) use encoding names instead of locale's names in configuration

Tom Lane wrote:

Teodor Sigaev [EMAIL PROTECTED] writes:

But configuration for different languages might be differ, for example
russian (and any cyrillic-based) configuration is differ from
west-european configuration based on different character sets.


Sure.  I'm just assuming that the set of stopwords doesn't need to vary
depending on the encoding you're using for a language --- that is, if
you're willing to convert the encoding then the same stopword list file
should serve for all encodings of a given language.  Do you think this
might be wrong?

regards, tom lane


--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/


Re: [HACKERS] How does the tsearch configuration get selected?

2007-06-15 Thread Teodor Sigaev

One possibility is that the user-visible specification is just a name
(eg, english), but the actual filename out on the filesystem is,
say, name.encoding.stop (eg, english.utf8.stop) where we use PG's
names for the encodings.  We could just fail if there's not a file
matching the database encoding, or we could try that and then try
utf8, or some other rule.  In any case I'd want it to verify and
convert encoding as necessary while reading.


I have no strong objection to UTF8-encoded files (stop words, ispell,
synonym, or thesaurus). Just recode them after reading.


But configurations for different languages may differ; for example, a Russian
(or any Cyrillic-based) configuration differs from a West European
configuration because they are based on different character sets. So we would
need non-obvious rules for stemmers to define exactly which stemmer and
stop-word file should be used. For the Russian language with utf8 encoding, the
english stemmer should be used for lword, but for the Italian language, the
italian stemmer. ASCII characters cannot appear in a Russian word, but an
Italian word may contain only ASCII.
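
A small sketch of the per-language rule described above, in Python. The mapping, the dictionary names, and the token type "nlword" (used here for any word containing non-ASCII characters) are assumptions for illustration; only "lword" and the choice of stemmers come from the text.

```python
# Hypothetical mapping: configuration -> token type -> dictionary name.
# "lword" is a word made only of ASCII characters, "nlword" is any other word.
CONFIGURATIONS = {
    "russian": {"lword": "english_stemmer", "nlword": "russian_stemmer"},
    "italian": {"lword": "italian_stemmer", "nlword": "italian_stemmer"},
}


def token_type(word):
    """Classify a word by whether every character is ASCII."""
    return "lword" if all(ord(ch) < 128 for ch in word) else "nlword"


def pick_dictionary(language, word):
    """Choose the dictionary for a word under the given configuration."""
    return CONFIGURATIONS[language][token_type(word)]


print(pick_dictionary("russian", "postgres"))  # ASCII word in a Russian text
print(pick_dictionary("russian", "слово"))     # Cyrillic word
print(pick_dictionary("italian", "gatto"))     # ASCII-only Italian word
print(pick_dictionary("italian", "città"))     # Italian word with an accent
```

The same ASCII-only word is routed to different stemmers depending on the language, which is why the language name alone, without such a table, is not enough to pick a stemmer.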




--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/


Re: [HACKERS] How does the tsearch configuration get selected?

2007-06-15 Thread Tom Lane
Teodor Sigaev [EMAIL PROTECTED] writes:
 But configuration for different languages might be differ, for example
 russian (and any cyrillic-based) configuration is differ from
 west-european configuration based on different character sets.

Sure.  I'm just assuming that the set of stopwords doesn't need to vary
depending on the encoding you're using for a language --- that is, if
you're willing to convert the encoding then the same stopword list file
should serve for all encodings of a given language.  Do you think this
might be wrong?

regards, tom lane


Re: [HACKERS] How does the tsearch configuration get selected?

2007-06-15 Thread Teodor Sigaev




Sure.  I'm just assuming that the set of stopwords doesn't need to vary
depending on the encoding you're using for a language --- that is, if
you're willing to convert the encoding then the same stopword list file
should serve for all encodings of a given language.  Do you think this
might be wrong?

No. I believe that pgsql doesn't support any encoding that cannot be recoded
from UTF8, at least for non-hieroglyphic languages.


--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/


Re: [HACKERS] How does the tsearch configuration get selected?

2007-06-15 Thread Gregory Stark
Teodor Sigaev [EMAIL PROTECTED] writes:

 Hm, are you trying to say that it's sane to have different tsvectors in
 a column computed under different language settings?  Maybe we're all

 Yes, I think so.

 That might have sense for close languages. Norwegian languages has two dialects
 and one of them has advanced rules for compound words, russian and ukranian has
 similar rules etc. Operation @@ is language (and encoding) independent, it use
 just strcmp call.

To support this sanely though wouldn't you need to know which language rule a
tsvector was generated with? Like, have a byte in the tsvector tagging it with
the language rule forever more?

What I'm wondering about is if you use a different rule than what was used
when an index entry was inserted will you get different results using the
index than you would doing a sequential scan and reapplying the operator to
every datum?


-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com



Re: [HACKERS] How does the tsearch configuration get selected?

2007-06-15 Thread Teodor Sigaev

To support this sanely though wouldn't you need to know which language rule a
tsvector was generated with? Like, have a byte in the tsvector tagging it with
the language rule forever more?


No. As a corner case, a dictionary might return just a number or a hash value.




What I'm wondering about is if you use a different rule than what was used
when an index entry was inserted will you get different results using the
index than you would doing a sequential scan and reapplying the operator to
every datum?


Rules are applied during creation of the tsvector, not during indexing of
tsvectors. So sequential and index scans will return identical results.
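
The point can be illustrated with a toy model in Python, under the assumption that an index is nothing more than a lookup structure built from the stored tsvectors. The helpers make_tsvector, build_index, seq_scan and index_scan are hypothetical names, and the inverted index here is a stand-in, not a description of the gist or gin opclasses.

```python
from collections import defaultdict


def make_tsvector(text, stopwords):
    """Apply the configuration's rules once, when the tsvector is created."""
    return frozenset(w for w in text.lower().split() if w not in stopwords)


def build_index(column):
    """Inverted index over the stored tsvectors; no rules are reapplied."""
    index = defaultdict(set)
    for row_id, tsvector in enumerate(column):
        for lexeme in tsvector:
            index[lexeme].add(row_id)
    return index


def seq_scan(column, lexeme):
    """Check every stored tsvector in turn."""
    return {row_id for row_id, tsvector in enumerate(column) if lexeme in tsvector}


def index_scan(index, lexeme):
    """Look the lexeme up in the inverted index."""
    return set(index.get(lexeme, ()))


# Two rows whose tsvectors were created under different stop word lists.
column = [
    make_tsvector("the quick fox", {"the"}),
    make_tsvector("the quick fox", set()),
]
index = build_index(column)

for lexeme in ("the", "quick", "missing"):
    print(lexeme, seq_scan(column, lexeme) == index_scan(index, lexeme))
```

Both scans read the same stored values, so they agree even though the two rows were produced under different rules. Whether those stored values are what the user wanted is a separate question.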



Re: [HACKERS] How does the tsearch configuration get selected?

2007-06-14 Thread Tom Lane
Bruce Momjian [EMAIL PROTECTED] writes:
 First, why are we specifying the server locale here since it never
 changes:

It's poorly described.  What it should really say is the language
that the text-to-be-searched is in.  We can actually support multiple
languages here today, the restriction being that there have to be
stemmer instances for the languages with the database encoding you're
using.  With UTF8 encoding this isn't much of a restriction.  We do need
to put code into the dictionary stuff to enforce that you can't use a
stemmer when the database encoding isn't compatible with it.

I would prefer that we not drive any of this stuff off the server's
LC_xxx settings, since as you say that restricts things to just one
locale.

 Second, I can't figure out how to reference a non-default
 configuration.

See the multi-argument versions of to_tsvector etc.

I do see a problem with having to_tsvector(config, text) plus
to_tsvector(text) where the latter implicitly references a config
selected by a GUC variable: how can you tell whether a query using the
latter matches a particular index using the former?  There isn't
anything in the current planner mechanisms that would make that work.

regards, tom lane


Re: [HACKERS] How does the tsearch configuration get selected?

2007-06-14 Thread Oleg Bartunov

On Thu, 14 Jun 2007, Tom Lane wrote:


Bruce Momjian [EMAIL PROTECTED] writes:

First, why are we specifying the server locale here since it never
changes:


The server's locale is used for just one purpose: to select which text search
configuration to use by default. Any text search function can accept a
text search configuration as an optional parameter.



It's poorly described.  What it should really say is the language
that the text-to-be-searched is in.  We can actually support multiple
languages here today, the restriction being that there have to be
stemmer instances for the languages with the database encoding you're
using.  With UTF8 encoding this isn't much of a restriction.  We do need
to put code into the dictionary stuff to enforce that you can't use a
stemmer when the database encoding isn't compatible with it.

I would prefer that we not drive any of this stuff off the server's
LC_xxx settings, since as you say that restricts things to just one
locale.


Something like
CREATE TEXT SEARCH DICTIONARY dictname [LOCALE=ru_RU.UTF-8]
and raise a warning/error if the database encoding doesn't match the dictionary
encoding, if specified (not all dictionaries depend on encoding, so it
should be an optional parameter).




Second, I can't figure out how to reference a non-default
configuration.


See the multi-argument versions of to_tsvector etc.

I do see a problem with having to_tsvector(config, text) plus
to_tsvector(text) where the latter implicitly references a config
selected by a GUC variable: how can you tell whether a query using the
latter matches a particular index using the former?  There isn't
anything in the current planner mechanisms that would make that work.


Probably, having a default text search configuration is not a good idea,
and we could just require it as a mandatory parameter, which could
eliminate much of the confusion around selecting a text search configuration.


Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
