Teodor Sigaev wrote:
So, added to my plan
(http://archives.postgresql.org/pgsql-hackers/2007-06/msg00618.php)
n) single-encoding files. That will touch the snowball, ispell, synonym, thesaurus
and simple dictionaries
n+1) use encoding names instead of locale names in configuration
FYI, I
Probably, having a default text search configuration is not a good idea
and we could just require it as a mandatory parameter, which could
eliminate much of the confusion around selecting a text search configuration.
Ugh. Having a default configuration (by locale or by postgresql.conf or some other
way)
Tom Lane wrote:
Bruce Momjian [EMAIL PROTECTED] writes:
First, why are we specifying the server locale here since it never
changes:
It's poorly described. What it should really say is the language
that the text-to-be-searched is in. We can actually support multiple
languages here
Bruce Momjian wrote:
My guess right now is that we use a GUC that will default if a
pg_catalog configuration name matches the lc_ctype locale name, and we
have to throw an error if an accessed index creation GUC doesn't match
the current GUC.
So we create a pg_catalog full text
1) Require the configuration to be always specified. The problem with
this is that casting (::tsquery) and operators (@@) have no way to
specify a configuration.
It's not convenient for the most common cases
2) Use a GUC that you can set for the configuration, and perhaps
default it if
Teodor Sigaev [EMAIL PROTECTED] writes:
My guess right now is that we use a GUC that will default if a
pg_catalog configuration name matches the lc_ctype locale name, and we
have to throw an error if an accessed index creation GUC doesn't match
the current GUC.
Where will index store index
I'd suggest allowing either full names (swedish) or the standard
two-letter abbreviations (sv). But let's stay away from locale names.
We can use the database's encoding name (the same names used in initdb -E)
--
Teodor Sigaev E-mail: [EMAIL PROTECTED]
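
The suggestion quoted above, accepting either a full language name (swedish) or its two-letter abbreviation (sv) while rejecting locale names, can be sketched roughly as follows. This is an illustration only, not code from PostgreSQL: the function name normalize_language and the small lookup table are hypothetical, and the table covers just a few languages.

```python
# Hypothetical lookup table: two-letter abbreviation -> full language name.
# Only a handful of entries, for illustration.
LANGUAGE_ABBREVIATIONS = {
    "en": "english",
    "no": "norwegian",
    "ru": "russian",
    "sv": "swedish",
}


def normalize_language(name):
    """Return the full language name for a full name or a two-letter code.

    Anything else, including locale names such as ru_RU.KOI8-R, is rejected.
    """
    key = name.strip().lower()
    if key in LANGUAGE_ABBREVIATIONS:
        return LANGUAGE_ABBREVIATIONS[key]
    if key in LANGUAGE_ABBREVIATIONS.values():
        return key
    raise ValueError(f"unknown language name: {name!r}")
```

With this sketch, "sv" and "Swedish" both resolve to the same name, and a locale string is refused rather than parsed.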
Bruce Momjian [EMAIL PROTECTED] writes:
Do locale names vary across operating systems?
Yes, which is the fatal flaw in the whole thing. The ru_RU part is
reasonably well standardized, but the encoding part is not. Considering
that encoding is exactly the part of it we don't care about for this
Teodor Sigaev [EMAIL PROTECTED] writes:
I'd suggest allowing either full names (swedish) or the standard
two-letter abbreviations (sv). But let's stay away from locale names.
We can use the database's encoding name (the same names used in initdb -E)
AFAICS the encoding name shouldn't be anywhere
Tom Lane [EMAIL PROTECTED] writes:
It's not really the index's problem; IIUC the behavior of the gist and
gin index opclasses is not locale-specific. It's the to_tsvector calls
that built the tsvector heap column that have a locale specified or
implicit. We need some way of annotating the
Gregory Stark [EMAIL PROTECTED] writes:
Tom Lane [EMAIL PROTECTED] writes:
It's not really the index's problem; IIUC the behavior of the gist and
gin index opclasses is not locale-specific. It's the to_tsvector calls
that built the tsvector heap column that have a locale specified or
The only reason the TS stuff needs an encoding spec is to figure out how
to read an external stop word file. I think my suggestion upthread is a
lot better: have just one stop word file per language, store them all in
UTF8, and convert to database encoding when loading them. The database
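
As a rough illustration of the idea in this message (one stop word file per language, stored in UTF8 and converted to the database encoding at load time), here is a minimal sketch. It is not PostgreSQL code: the function name load_stopwords is hypothetical, and Python codec names such as "koi8_r" stand in for PostgreSQL's encoding names.

```python
def load_stopwords(path, database_encoding):
    """Read a UTF-8 stop word file and return each word as bytes in the
    target (database) encoding.

    Raises UnicodeEncodeError if a word cannot be represented in the
    target encoding.
    """
    words = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            word = line.strip()
            if not word:
                continue  # skip blank lines
            words.append(word.encode(database_encoding))
    return words
```

The same file then serves every encoding of a given language; a language whose words cannot be represented in the database encoding fails at load time instead of silently producing garbage.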
Hmm.
It's not really the index's problem; IIUC the behavior of the gist and
gin index opclasses is not locale-specific.
Right
It's the to_tsvector calls
that built the tsvector heap column that have a locale specified or
implicit. We need some way of annotating the heap column about this.
It
Teodor Sigaev [EMAIL PROTECTED] writes:
It's the to_tsvector calls
that built the tsvector heap column that have a locale specified or
implicit. We need some way of annotating the heap column about this.
It seems too restrictive to advanced users.
Hm, are you trying to say that it's sane to
Teodor Sigaev [EMAIL PROTECTED] writes:
Hmm. You mean to use the language name in the configuration, use the current encoding to
determine which dictionary should be used (stemmers for the same language
differ between encodings) and recode dictionary files from UTF8 to the
current locale. Did I
Hm, are you trying to say that it's sane to have different tsvectors in
a column computed under different language settings? Maybe we're all
Yes, I think so.
That might make sense for closely related languages. Norwegian has two dialects
and one of them has advanced rules for compound words,
So, added to my plan
(http://archives.postgresql.org/pgsql-hackers/2007-06/msg00618.php)
n) single-encoding files. That will touch the snowball, ispell, synonym, thesaurus
and simple dictionaries
n+1) use encoding names instead of locale names in configuration
Tom Lane wrote:
Teodor Sigaev
One possibility is that the user-visible specification is just a name
(eg, english), but the actual filename out on the filesystem is,
say, name.encoding.stop (eg, english.utf8.stop) where we use PG's
names for the encodings. We could just fail if there's not a file
matching the database
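
The naming scheme floated here (a user-visible name such as english, with the file on disk called name.encoding.stop, and a failure when no file matches the database encoding) might look like the sketch below. The function name stopword_file and the directory argument are hypothetical, and lower-casing the encoding name is an assumption made to match the english.utf8.stop example.

```python
from pathlib import Path


def stopword_file(directory, name, pg_encoding):
    """Return the path of the stop word file for `name` in `pg_encoding`.

    The file is expected to be called <name>.<encoding>.stop, for example
    english.utf8.stop. Raises FileNotFoundError if no such file exists.
    """
    candidate = Path(directory) / f"{name}.{pg_encoding.lower()}.stop"
    if not candidate.is_file():
        raise FileNotFoundError(
            f"no stop word file for {name!r} in encoding {pg_encoding!r}"
        )
    return candidate
```

Failing outright keeps the behavior predictable: a missing file for the database encoding is reported as an error instead of falling back to a file in some other encoding.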
Teodor Sigaev [EMAIL PROTECTED] writes:
But configurations for different languages might differ, for example the
Russian (and any Cyrillic-based) configuration differs from the
west-European configuration, which is based on different character sets.
Sure. I'm just assuming that the set of stopwords doesn't
Sure. I'm just assuming that the set of stopwords doesn't need to vary
depending on the encoding you're using for a language --- that is, if
you're willing to convert the encoding then the same stopword list file
should serve for all encodings of a given language. Do you think this
might be
Teodor Sigaev [EMAIL PROTECTED] writes:
Hm, are you trying to say that it's sane to have different tsvectors in
a column computed under different language settings? Maybe we're all
Yes, I think so.
That might make sense for closely related languages. Norwegian has two dialects
and one
To support this sanely though wouldn't you need to know which language rule a
tsvector was generated with? Like, have a byte in the tsvector tagging it with
the language rule forever more?
No. As a corner case, a dictionary might return just a number or a hash value.
What I'm wondering about is
Bruce Momjian [EMAIL PROTECTED] writes:
First, why are we specifying the server locale here since it never
changes:
It's poorly described. What it should really say is the language
that the text-to-be-searched is in. We can actually support multiple
languages here today, the restriction being
On Thu, 14 Jun 2007, Tom Lane wrote:
Bruce Momjian [EMAIL PROTECTED] writes:
First, why are we specifying the server locale here since it never
changes:
The server's locale is used for just one purpose: to select which text search
configuration to use by default. Any text search functions can