Re: [HACKERS] How does the tsearch configuration get selected?

2007-06-16 Thread Bruce Momjian
Teodor Sigaev wrote: So, added to my plan (http://archives.postgresql.org/pgsql-hackers/2007-06/msg00618.php) n) single encoded files. That will touch snowball, ispell, synonym, thesaurus and simple dictionaries n+1) use encoding names instead of locale's names in configuration FYI, I

Re: [HACKERS] How does the tsearch configuration get selected?

2007-06-15 Thread Teodor Sigaev
Probably, having a default text search configuration is not a good idea and we could just require it as a mandatory parameter, which would eliminate much of the confusion about selecting a text search configuration. Ugh. Having default configuration (by locale or by postgresql.conf or some other way)

Re: [HACKERS] How does the tsearch configuration get selected?

2007-06-15 Thread Bruce Momjian
Tom Lane wrote: Bruce Momjian writes: First, why are we specifying the server locale here since it never changes: It's poorly described. What it should really say is the language that the text-to-be-searched is in. We can actually support multiple languages here

Re: [HACKERS] How does the tsearch configuration get selected?

2007-06-15 Thread Bruce Momjian
Bruce Momjian wrote: My guess right now is that we use a GUC that will default if a pg_catalog configuration name matches the lc_ctype locale name, and we have to throw an error if an accessed index creation GUC doesn't match the current GUC. So we create a pg_catalog full text

Re: [HACKERS] How does the tsearch configuration get selected?

2007-06-15 Thread Teodor Sigaev
1) Require the configuration to be always specified. The problem with this is that casting (::tsquery) and operators (@@) have no way to specify a configuration. It's not convenient in the most common cases. 2) Use a GUC that you can set for the configuration, and perhaps default it if

Re: [HACKERS] How does the tsearch configuration get selected?

2007-06-15 Thread Tom Lane
Teodor Sigaev writes: My guess right now is that we use a GUC that will default if a pg_catalog configuration name matches the lc_ctype locale name, and we have to throw an error if an accessed index creation GUC doesn't match the current GUC. Where will index store index

Re: [HACKERS] How does the tsearch configuration get selected?

2007-06-15 Thread Teodor Sigaev
I'd suggest allowing either full names (swedish) or the standard two-letter abbreviations (sv). But let's stay away from locale names. We can use the database's encoding name (the same names used in initdb -E)

Re: [HACKERS] How does the tsearch configuration get selected?

2007-06-15 Thread Tom Lane
Bruce Momjian writes: Do locale names vary across operating systems? Yes, which is the fatal flaw in the whole thing. The ru_RU part is reasonably well standardized, but the encoding part is not. Considering that encoding is exactly the part of it we don't care about for this

Re: [HACKERS] How does the tsearch configuration get selected?

2007-06-15 Thread Tom Lane
Teodor Sigaev writes: I'd suggest allowing either full names (swedish) or the standard two-letter abbreviations (sv). But let's stay away from locale names. We can use the database's encoding name (the same names used in initdb -E) AFAICS the encoding name shouldn't be anywhere

Re: [HACKERS] How does the tsearch configuration get selected?

2007-06-15 Thread Gregory Stark
Tom Lane writes: It's not really the index's problem; IIUC the behavior of the gist and gin index opclasses is not locale-specific. It's the to_tsvector calls that built the tsvector heap column that have a locale specified or implicit. We need some way of annotating the

Re: [HACKERS] How does the tsearch configuration get selected?

2007-06-15 Thread Tom Lane
Gregory Stark writes: Tom Lane writes: It's not really the index's problem; IIUC the behavior of the gist and gin index opclasses is not locale-specific. It's the to_tsvector calls that built the tsvector heap column that have a locale specified or

Re: [HACKERS] How does the tsearch configuration get selected?

2007-06-15 Thread Teodor Sigaev
The only reason the TS stuff needs an encoding spec is to figure out how to read an external stop word file. I think my suggestion upthread is a lot better: have just one stop word file per language, store them all in UTF8, and convert to database encoding when loading them. The database Hmm.
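The suggestion quoted above (one stop word file per language, stored in UTF8 and converted to the database encoding at load time) can be illustrated with a minimal sketch. This is not PostgreSQL code: load_stopwords is a hypothetical helper, and the encoding argument uses Python codec names (such as koi8_r) rather than PG's encoding names.

```python
def load_stopwords(utf8_bytes, database_encoding):
    """Decode a UTF-8 stop word file and re-encode each word.

    Hypothetical helper for illustration only. database_encoding is a
    Python codec name (an assumption), not a PostgreSQL encoding name.
    Raises UnicodeEncodeError if a word cannot be represented in the
    target encoding.
    """
    words = []
    for line in utf8_bytes.decode("utf-8").splitlines():
        word = line.strip()
        if word:  # skip blank lines
            words.append(word.encode(database_encoding))
    return words


if __name__ == "__main__":
    # Two Russian stop words, stored as UTF-8, loaded for a KOI8-R database.
    raw = "и\nв\n".encode("utf-8")
    for w in load_stopwords(raw, "koi8_r"):
        print(w.decode("koi8_r"))
```

With this approach the same file serves every encoding that can represent the language's characters; a word that cannot be converted surfaces as an error at load time.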

Re: [HACKERS] How does the tsearch configuration get selected?

2007-06-15 Thread Teodor Sigaev
It's not really the index's problem; IIUC the behavior of the gist and gin index opclasses is not locale-specific. Right It's the to_tsvector calls that built the tsvector heap column that have a locale specified or implicit. We need some way of annotating the heap column about this. It

Re: [HACKERS] How does the tsearch configuration get selected?

2007-06-15 Thread Tom Lane
Teodor Sigaev writes: It's the to_tsvector calls that built the tsvector heap column that have a locale specified or implicit. We need some way of annotating the heap column about this. It seems too restrictive to advanced users. Hm, are you trying to say that it's sane to

Re: [HACKERS] How does the tsearch configuration get selected?

2007-06-15 Thread Tom Lane
Teodor Sigaev writes: Hmm. You mean to use language name in configuration, use current encoding to define which dictionary should be used (stemmers for the same language are different for different encoding) and recode dictionaries file from UTF8 to current locale. Did I

Re: [HACKERS] How does the tsearch configuration get selected?

2007-06-15 Thread Teodor Sigaev
Hm, are you trying to say that it's sane to have different tsvectors in a column computed under different language settings? Maybe we're all Yes, I think so. That might make sense for closely related languages. Norwegian has two dialects and one of them has advanced rules for compound words,

Re: [HACKERS] How does the tsearch configuration get selected?

2007-06-15 Thread Teodor Sigaev
So, added to my plan (http://archives.postgresql.org/pgsql-hackers/2007-06/msg00618.php) n) single encoded files. That will touch snowball, ispell, synonym, thesaurus and simple dictionaries n+1) use encoding names instead of locale's names in configuration Tom Lane wrote: Teodor Sigaev

Re: [HACKERS] How does the tsearch configuration get selected?

2007-06-15 Thread Teodor Sigaev
One possibility is that the user-visible specification is just a name (eg, english), but the actual filename out on the filesystem is, say, name.encoding.stop (eg, english.utf8.stop) where we use PG's names for the encodings. We could just fail if there's not a file matching the database
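The name.encoding.stop naming scheme described above can be sketched as follows. This is only an illustration of the lookup-and-fail idea, not PostgreSQL code: stopword_path, its parameters, and the directory layout are hypothetical.

```python
from pathlib import Path


def stopword_path(directory, language, encoding):
    """Return <directory>/<language>.<encoding>.stop, or fail if absent.

    Hypothetical helper: the user-visible specification is just the
    language name (e.g. "english"); the encoding part comes from the
    database (e.g. "utf8").
    """
    path = Path(directory) / f"{language}.{encoding.lower()}.stop"
    if not path.is_file():
        # No file matches the database encoding, so give up.
        raise FileNotFoundError(f"no stop word file {path.name} in {directory}")
    return path
```

For example, stopword_path(some_dir, "english", "UTF8") would resolve to english.utf8.stop if that file exists in some_dir, and raise otherwise.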

Re: [HACKERS] How does the tsearch configuration get selected?

2007-06-15 Thread Tom Lane
Teodor Sigaev writes: But configurations for different languages might differ; for example, a Russian (or any Cyrillic-based) configuration differs from a West European configuration, being based on a different character set. Sure. I'm just assuming that the set of stopwords doesn't

Re: [HACKERS] How does the tsearch configuration get selected?

2007-06-15 Thread Teodor Sigaev
Sure. I'm just assuming that the set of stopwords doesn't need to vary depending on the encoding you're using for a language --- that is, if you're willing to convert the encoding then the same stopword list file should serve for all encodings of a given language. Do you think this might be

Re: [HACKERS] How does the tsearch configuration get selected?

2007-06-15 Thread Gregory Stark
Teodor Sigaev writes: Hm, are you trying to say that it's sane to have different tsvectors in a column computed under different language settings? Maybe we're all Yes, I think so. That might make sense for closely related languages. Norwegian has two dialects and one

Re: [HACKERS] How does the tsearch configuration get selected?

2007-06-15 Thread Teodor Sigaev
To support this sanely though wouldn't you need to know which language rule a tsvector was generated with? Like, have a byte in the tsvector tagging it with the language rule forever more? No. As a corner case, a dictionary might return just a number or a hash value. What I'm wondering about is

Re: [HACKERS] How does the tsearch configuration get selected?

2007-06-14 Thread Tom Lane
Bruce Momjian writes: First, why are we specifying the server locale here since it never changes: It's poorly described. What it should really say is the language that the text-to-be-searched is in. We can actually support multiple languages here today, the restriction being

Re: [HACKERS] How does the tsearch configuration get selected?

2007-06-14 Thread Oleg Bartunov
On Thu, 14 Jun 2007, Tom Lane wrote: Bruce Momjian writes: First, why are we specifying the server locale here since it never changes: The server's locale is used for just one purpose: to select which text search configuration to use by default. Any text search functions can