Re: [HACKERS] How does the tsearch configuration get selected?
Teodor Sigaev wrote:
> So, added to my plan
> (http://archives.postgresql.org/pgsql-hackers/2007-06/msg00618.php):
> n) single encoded files. That will touch snowball, ispell, synonym, thesaurus and simple dictionaries
> n+1) use encoding names instead of locale's names in configuration

FYI, I am continuing with the documentation cleanup, though I will not do the /ref directory until we are sure which commands will be kept. We can later modify the documentation to match the new behavior.

> Tom Lane wrote:
>> Teodor Sigaev writes:
>>> But configurations for different languages might differ; for example, a Russian (or any Cyrillic-based) configuration differs from a West European one because of the different character sets.
>> Sure. I'm just assuming that the set of stopwords doesn't need to vary depending on the encoding you're using for a language --- that is, if you're willing to convert the encoding then the same stopword list file should serve for all encodings of a given language. Do you think this might be wrong?
>> regards, tom lane

--
Bruce Momjian http://momjian.us
EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] How does the tsearch configuration get selected?
> Probably, having a default text search configuration is not a good idea, and we could just require it as a mandatory parameter, which could eliminate much of the confusion with selecting a text search configuration.

Ugh. Having a default configuration (by locale or by postgresql.conf or some other way) simplifies life a lot in most cases.

--
Teodor Sigaev
WWW: http://www.sigaev.ru/
Re: [HACKERS] How does the tsearch configuration get selected?
Tom Lane wrote:
> Bruce Momjian writes:
>> First, why are we specifying the server locale here since it never changes:
> It's poorly described. What it should really say is the language that the text-to-be-searched is in. We can actually support multiple languages here today, the restriction being that there have to be stemmer instances for the languages with the database encoding you're using. With UTF8 encoding this isn't much of a restriction. We do need to put code into the dictionary stuff to enforce that you can't use a stemmer when the database encoding isn't compatible with it. I would prefer that we not drive any of this stuff off the server's LC_xxx settings, since as you say that restricts things to just one locale.

The idea they had was to set the _default_ full text configuration to match the locale, e.g. UTF8.en_US. This works well for cases where we ship a number of pre-installed full text configurations in pg_catalog. But of course you can support multiple languages with that encoding/locale, so you have to have the ability to do other languages, but not necessarily by default.

>> Second, I can't figure out how to reference a non-default configuration.
> See the multi-argument versions of to_tsvector etc. I do see a problem with having to_tsvector(config, text) plus to_tsvector(text) where the latter implicitly references a config selected by a GUC variable: how can you tell whether a query using the latter matches a particular index using the former? There isn't anything in the current planner mechanisms that would make that work.

Well, now that I have gotten feedback, we have a few options:

1) Require the configuration to be always specified. The problem with this is that casting (::tsquery) and operators (@@) have no way to specify a configuration.

2) Use a GUC that you can set for the configuration, and perhaps default it if possible to match the locale. Is the default affected by search_path (ouch)?

How do we make sure that any index that is accessed is using the same configuration that is being used by the query, e.g. ::tsquery? Do we have to store the configuration name in the index and somehow throw an error if it doesn't match? What about changes to the configuration after the index has been created, e.g. new stop words or dictionaries?

The two big open issues are whether we allow a default configuration, and whether we require the configuration name to be always specified.

My guess right now is that we use a GUC that will default if a pg_catalog configuration name matches the lc_ctype locale name, and we have to throw an error if an accessed index creation GUC doesn't match the current GUC.

So we create a pg_catalog full text configuration named UTF8.en-US, and some others like ru_RU.UTF-8.

--
Bruce Momjian http://momjian.us
EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] How does the tsearch configuration get selected?
Bruce Momjian wrote:
> My guess right now is that we use a GUC that will default if a pg_catalog configuration name matches the lc_ctype locale name, and we have to throw an error if an accessed index creation GUC doesn't match the current GUC.
>
> So we create a pg_catalog full text configuration named UTF8.en-US, and some others like ru_RU.UTF-8.

Do locale names vary across operating systems? If so, we might as well skip trying to find a default.

--
Bruce Momjian http://momjian.us
EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] How does the tsearch configuration get selected?
> 1) Require the configuration to be always specified. The problem with this is that casting (::tsquery) and operators (@@) have no way to specify a configuration.

That is not convenient for the most common cases.

> 2) Use a GUC that you can set for the configuration, and perhaps default it if possible to match the locale. Is the default affected by search_path (ouch)?

Right now it works that way.

> How do we make sure that any index that is accessed is using the same configuration that is being used by the query, e.g. ::tsquery? Do we have to store the configuration name in the index and somehow throw an error if it doesn't match? What about changes to the configuration after the index has been created, e.g. new stop words or dictionaries?

That may well be an intentional case, so we should not throw an ERROR!

> The two big open issues are whether we allow a default configuration, and whether we require the configuration name to be always specified.
>
> My guess right now is that we use a GUC that will default if a pg_catalog configuration name matches the lc_ctype locale name, and we have to throw an error if an accessed index creation GUC doesn't match the current GUC.

Where will the index store the index-creation GUC?

> So we create a pg_catalog full text configuration named UTF8.en-US, and some others like ru_RU.UTF-8.

--
Teodor Sigaev
WWW: http://www.sigaev.ru/
Re: [HACKERS] How does the tsearch configuration get selected?
Teodor Sigaev writes:
>> My guess right now is that we use a GUC that will default if a pg_catalog configuration name matches the lc_ctype locale name, and we have to throw an error if an accessed index creation GUC doesn't match the current GUC.
> Where will the index store the index-creation GUC?

It's not really the index's problem; IIUC the behavior of the gist and gin index opclasses is not locale-specific. It's the to_tsvector calls that built the tsvector heap column that have a locale specified or implicit. We need some way of annotating the heap column about this.

In the case of a functional index you can expose the locale:

    create index ... (to_tsvector('english'::regconfig, mytextcol))

but there's still the problem that the planner cannot match that to a query specified as just WHERE to_tsvector(mytextcol) @@ query.

regards, tom lane
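To make the matching problem concrete, here is a minimal Python sketch. It is a toy model and not PostgreSQL's planner: the names `ToTsvector` and `index_usable` are invented for illustration, and the only assumption it encodes is that an expression index is usable when the query contains the same expression, not merely one that would evaluate to the same value under the current settings.

```python
from typing import NamedTuple, Optional


class ToTsvector(NamedTuple):
    """Toy stand-in for a to_tsvector(...) call in an index or a WHERE clause."""
    column: str
    # None models the one-argument form, whose configuration is only
    # known at run time (taken from a GUC).
    config: Optional[str] = None


def index_usable(index_expr: ToTsvector, query_expr: ToTsvector) -> bool:
    # Structural comparison: the two expressions must be identical.
    return index_expr == query_expr


index_expr = ToTsvector("mytextcol", "english")

# Two-argument form in the query: same expression, so the index can be used.
print(index_usable(index_expr, ToTsvector("mytextcol", "english")))

# One-argument form: even if the GUC happens to say 'english', the
# expressions differ, so this toy planner cannot use the index.
print(index_usable(index_expr, ToTsvector("mytextcol")))
```

The second call is the case described above: nothing in the expression itself says which configuration it will use.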
Re: [HACKERS] How does the tsearch configuration get selected?
> I'd suggest allowing either full names (swedish) or the standard two-letter abbreviations (sv). But let's stay away from locale names.

We can use the database's encoding name (the same names used in initdb -E).

--
Teodor Sigaev
WWW: http://www.sigaev.ru/
Re: [HACKERS] How does the tsearch configuration get selected?
Bruce Momjian writes:
> Do locale names vary across operating systems?

Yes, which is the fatal flaw in the whole thing. The ru_RU part is reasonably well standardized, but the encoding part is not. Considering that encoding is exactly the part of it we don't care about for this purpose (because we should look to the database encoding instead), I think it's just going to make life harder not easier to model search language names on locales.

I'd suggest allowing either full names (swedish) or the standard two-letter abbreviations (sv). But let's stay away from locale names.

regards, tom lane
Re: [HACKERS] How does the tsearch configuration get selected?
Teodor Sigaev writes:
>> I'd suggest allowing either full names (swedish) or the standard two-letter abbreviations (sv). But let's stay away from locale names.
> We can use the database's encoding name (the same names used in initdb -E).

AFAICS the encoding name shouldn't be anywhere near this. The only reason the TS stuff needs an encoding spec is to figure out how to read an external stop word file. I think my suggestion upthread is a lot better: have just one stop word file per language, store them all in UTF8, and convert to database encoding when loading them. The database encoding is implicit and doesn't need to be mentioned anywhere in the TS configuration.

regards, tom lane
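The "one UTF8 file per language, convert on load" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `load_stopwords` is a hypothetical helper, the file is assumed to hold one word per line, and Python codec names (such as `koi8_r`) stand in for PostgreSQL's encoding names.

```python
import codecs
from typing import List


def load_stopwords(utf8_bytes: bytes, database_encoding: str) -> List[bytes]:
    """Decode a UTF-8 stop word file and re-encode each word for the database."""
    # Raises LookupError if the encoding name is unknown.
    codecs.lookup(database_encoding)
    # Raises UnicodeDecodeError if the file is not valid UTF-8.
    text = utf8_bytes.decode("utf-8")
    words = [line.strip() for line in text.splitlines() if line.strip()]
    # Strict encoding: fail loudly (UnicodeEncodeError) rather than
    # silently storing a mangled stop word.
    return [word.encode(database_encoding) for word in words]


# A few Russian stop words, stored as UTF-8 and loaded into a KOI8-R database.
sample = "и\nв\nне\n".encode("utf-8")
print(load_stopwords(sample, "koi8_r"))
```

Loading the same sample into an encoding that cannot represent Cyrillic (for example `latin_1`) raises an error in this sketch, which is one possible policy; the thread does not settle what should happen in that case.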
Re: [HACKERS] How does the tsearch configuration get selected?
Tom Lane writes:
> It's not really the index's problem; IIUC the behavior of the gist and gin index opclasses is not locale-specific. It's the to_tsvector calls that built the tsvector heap column that have a locale specified or implicit. We need some way of annotating the heap column about this.
>
> In the case of a functional index you can expose the locale:
>
>     create index ... (to_tsvector('english'::regconfig, mytextcol))

Maybe there should be a different type for each locale. I'm not exactly following this thread so I'm not entirely sure whether that would actually fit well but it's just a thought I had.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] How does the tsearch configuration get selected?
Gregory Stark writes:
> Tom Lane writes:
>> It's not really the index's problem; IIUC the behavior of the gist and gin index opclasses is not locale-specific. It's the to_tsvector calls that built the tsvector heap column that have a locale specified or implicit. We need some way of annotating the heap column about this.
> Maybe there should be a different type for each locale.

I had been idly wondering if we could do anything with using tsvector's typmod for the purpose ...

regards, tom lane
Re: [HACKERS] How does the tsearch configuration get selected?
> The only reason the TS stuff needs an encoding spec is to figure out how to read an external stop word file. I think my suggestion upthread is a lot better: have just one stop word file per language, store them all in UTF8, and convert to database encoding when loading them. The database

Hmm. You mean to use the language name in the configuration, use the current encoding to decide which dictionary should be used (stemmers for the same language differ between encodings), and recode the dictionary files from UTF8 to the current locale. Did I understand you right?

That's possible to do. But it is an incompatible change and will cause some difficulties for DBAs. If the server locale is ISO (or KOI8 or any other) and the file is in UTF8, then text editors/tools might be confused.

--
Teodor Sigaev
WWW: http://www.sigaev.ru/
Re: [HACKERS] How does the tsearch configuration get selected?
> It's not really the index's problem; IIUC the behavior of the gist and gin index opclasses is not locale-specific.

Right.

> It's the to_tsvector calls that built the tsvector heap column that have a locale specified or implicit. We need some way of annotating the heap column about this.

That seems too restrictive for advanced users.

> In the case of a functional index you can expose the locale:
>
>     create index ... (to_tsvector('english'::regconfig, mytextcol))
>
> but there's still the problem that the planner cannot match that to a query specified as just WHERE to_tsvector(mytextcol) @@ query.

--
Teodor Sigaev
WWW: http://www.sigaev.ru/
Re: [HACKERS] How does the tsearch configuration get selected?
Teodor Sigaev writes:
>> It's the to_tsvector calls that built the tsvector heap column that have a locale specified or implicit. We need some way of annotating the heap column about this.
> That seems too restrictive for advanced users.

Hm, are you trying to say that it's sane to have different tsvectors in a column computed under different language settings? Maybe we're all overthinking the problem. If the tsvector representation is presumed language-independent then I could see this being a workable approach.

regards, tom lane
Re: [HACKERS] How does the tsearch configuration get selected?
Teodor Sigaev writes:
> Hmm. You mean to use the language name in the configuration, use the current encoding to decide which dictionary should be used (stemmers for the same language differ between encodings), and recode the dictionary files from UTF8 to the current locale. Did I understand you right?

Right.

> That's possible to do. But it is an incompatible change and will cause some difficulties for DBAs. If the server locale is ISO (or KOI8 or any other) and the file is in UTF8, then text editors/tools might be confused.

Well, I'm not as worried about that as I am about the database being confused ;-). We need some way to deal with stopword files that are in a different encoding than the database encoding, and this has to be proof against accidental or malicious mistakes by the non-superuser users who are going to be able to specify which stopword file to use. So I don't want the specification that goes into the CREATE DICTIONARY command to involve an encoding.

One possibility is that the user-visible specification is just a name (eg, english), but the actual filename out on the filesystem is, say, name.encoding.stop (eg, english.utf8.stop) where we use PG's names for the encodings. We could just fail if there's not a file matching the database encoding, or we could try that and then try utf8, or some other rule. In any case I'd want it to verify and convert encoding as necessary while reading.

regards, tom lane
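One way the name.encoding.stop lookup could work is sketched below in Python. It implements the "try the database encoding, then try utf8" variant mentioned above; `find_stopword_file` is a hypothetical helper, and the rule that a list name must be purely alphanumeric is an assumption added here to illustrate the "proof against malicious mistakes" requirement, not something the thread specifies.

```python
import os
import tempfile


def find_stopword_file(directory: str, name: str, db_encoding: str) -> str:
    """Return the path of the stop word file to use for `name`."""
    # The name comes from an unprivileged user, so refuse anything that
    # could escape the stop word directory (slashes, dots, and so on).
    if not name.isalnum():
        raise ValueError(f"invalid stop word list name: {name!r}")
    # Prefer a file in the database encoding, then fall back to utf8.
    for encoding in (db_encoding, "utf8"):
        path = os.path.join(directory, f"{name}.{encoding}.stop")
        if os.path.isfile(path):
            return path
    raise FileNotFoundError(f"no stop word file for {name!r}")


with tempfile.TemporaryDirectory() as d:
    # Only a UTF-8 file exists, but the database encoding is latin1.
    open(os.path.join(d, "english.utf8.stop"), "w").close()
    print(os.path.basename(find_stopword_file(d, "english", "latin1")))
    # prints "english.utf8.stop"
```

The sketch only resolves the file name; verifying and converting the contents while reading would be a separate step.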
Re: [HACKERS] How does the tsearch configuration get selected?
> Hm, are you trying to say that it's sane to have different tsvectors in a column computed under different language settings? Maybe we're all

Yes, I think so. That might make sense for closely related languages. Norwegian has two dialects, and one of them has advanced rules for compound words; Russian and Ukrainian have similar rules, etc. The @@ operation is language (and encoding) independent; it just uses a strcmp call.

The most common use case for mixing configurations, which I described somewhere in a thread, is using two different configurations for indexing (tsvector creation) and search (tsquery creation). BTW, a thesaurus dictionary could be used for similar reasons in a search-only configuration. OpenFTS doesn't use the tsearch2 configuration at all; it has such infrastructure itself. So tsvector shouldn't carry any information about the configuration.

The most common configuration change is adding new stop words, which doesn't affect the correctness of search. Removing stop words makes it impossible to find already indexed documents with a query that contains only the removed stop words.

> overthinking the problem. If the tsvector representation is presumed language-independent then I could see this being a workable approach.

Actually, we should allow only 'compatible' changes of configuration, but it is very hard (or even impossible) to formulate rules for that. Any dictionary has its specific dictinitoption changes that make it incompatible with itself; the same goes for compatibility between two dictionaries, or between lists of dictionaries. In practice, we haven't seen any disasters after changes in configuration; until reindexing, search just becomes less precise.

--
Teodor Sigaev
WWW: http://www.sigaev.ru/
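The stop word argument above can be checked with a small Python toy. It is not tsearch2: `lexemes` is a made-up normalizer (lowercase, split on whitespace, drop stop words) and `matches` models @@ as "every query lexeme is present, compared as plain strings", in the spirit of the strcmp remark.

```python
def lexemes(text, stopwords):
    """Toy normalizer: lowercase, split on whitespace, drop stop words."""
    return {word for word in text.lower().split() if word not in stopwords}


def matches(tsvector, tsquery):
    """All query lexemes must be present; comparison is plain string equality."""
    return bool(tsquery) and tsquery <= tsvector


doc = "the cat sat on a mat"

# Case 1: the document was indexed with stop words {"the"}; afterwards "a"
# is added as a stop word. The query loses "a" and still finds the document.
stored = lexemes(doc, {"the"})
print(matches(stored, lexemes("a cat", {"the", "a"})))  # True

# Case 2: the document was indexed with stop words {"the", "a"}; afterwards
# "a" is removed from the list. A query for "a" alone can no longer find the
# document, because "a" was never stored in its tsvector.
stored = lexemes(doc, {"the", "a"})
print(matches(stored, lexemes("a", {"the"})))  # False
```

Case 2 is the "until reindexing" situation: rebuilding the stored vector under the new stop word list would make the document findable again.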
Re: [HACKERS] How does the tsearch configuration get selected?
So, added to my plan (http://archives.postgresql.org/pgsql-hackers/2007-06/msg00618.php):

n) single encoded files. That will touch snowball, ispell, synonym, thesaurus and simple dictionaries
n+1) use encoding names instead of locale's names in configuration

Tom Lane wrote:
> Teodor Sigaev writes:
>> But configurations for different languages might differ; for example, a Russian (or any Cyrillic-based) configuration differs from a West European one because of the different character sets.
> Sure. I'm just assuming that the set of stopwords doesn't need to vary depending on the encoding you're using for a language --- that is, if you're willing to convert the encoding then the same stopword list file should serve for all encodings of a given language. Do you think this might be wrong?
> regards, tom lane

--
Teodor Sigaev
WWW: http://www.sigaev.ru/
Re: [HACKERS] How does the tsearch configuration get selected?
> One possibility is that the user-visible specification is just a name (eg, english), but the actual filename out on the filesystem is, say, name.encoding.stop (eg, english.utf8.stop) where we use PG's names for the encodings. We could just fail if there's not a file matching the database encoding, or we could try that and then try utf8, or some other rule. In any case I'd want it to verify and convert encoding as necessary while reading.

I have no strong objection to UTF8-encoded files (stop words, ispell, synonym or thesaurus). Just recode them after reading. But configurations for different languages might differ; for example, a Russian (or any Cyrillic-based) configuration differs from a West European one because of the different character sets. So we would need non-obvious rules for stemmers to define exactly which stemmer and stop file should be used. For the Russian language with utf8 encoding, lword should use the English stemmer, but for the Italian language it should use the Italian stemmer. A Russian word cannot contain any ASCII characters, whereas an Italian word may consist only of ASCII.

--
Teodor Sigaev
WWW: http://www.sigaev.ru/
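The per-language stemmer rule above can be pictured as a small lookup table, sketched here in Python. Everything except `lword` is an assumption made for illustration: the token type name `nlword`, the dictionary names (`english_stem` and so on), and the classification rule "ASCII-only words are lword" are hypothetical and not taken from the thread.

```python
# Hypothetical mapping: configuration -> token type -> stemmer dictionary.
CONFIGS = {
    "russian": {"lword": "english_stem", "nlword": "russian_stem"},
    "italian": {"lword": "italian_stem", "nlword": "italian_stem"},
}


def token_type(word: str) -> str:
    # Simplified classifier: ASCII-only words count as lword (Latin word),
    # anything else as nlword. A real parser distinguishes many more types.
    return "lword" if word.isascii() else "nlword"


def pick_dictionary(config: str, word: str) -> str:
    """Choose the stemmer for a word under the given configuration."""
    return CONFIGS[config][token_type(word)]


# In a Russian configuration an ASCII-only word cannot be Russian,
# so it is handed to the English stemmer.
print(pick_dictionary("russian", "database"))
print(pick_dictionary("russian", "база"))

# In an Italian configuration ASCII-only words are ordinary Italian words.
print(pick_dictionary("italian", "casa"))
print(pick_dictionary("italian", "città"))
```

The point of the table is that the choice depends on the configuration's language, not on the encoding of the stop word or dictionary files.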
Re: [HACKERS] How does the tsearch configuration get selected?
Teodor Sigaev writes:
> But configurations for different languages might differ; for example, a Russian (or any Cyrillic-based) configuration differs from a West European one because of the different character sets.

Sure. I'm just assuming that the set of stopwords doesn't need to vary depending on the encoding you're using for a language --- that is, if you're willing to convert the encoding then the same stopword list file should serve for all encodings of a given language. Do you think this might be wrong?

regards, tom lane
Re: [HACKERS] How does the tsearch configuration get selected?
> Sure. I'm just assuming that the set of stopwords doesn't need to vary depending on the encoding you're using for a language --- that is, if you're willing to convert the encoding then the same stopword list file should serve for all encodings of a given language. Do you think this might be wrong?

No. I believe pgsql doesn't support any encoding that cannot be recoded from UTF8, at least for non-hieroglyphic languages.

--
Teodor Sigaev
WWW: http://www.sigaev.ru/
Re: [HACKERS] How does the tsearch configuration get selected?
Teodor Sigaev writes:
>> Hm, are you trying to say that it's sane to have different tsvectors in a column computed under different language settings? Maybe we're all
> Yes, I think so. That might make sense for closely related languages. Norwegian has two dialects, and one of them has advanced rules for compound words; Russian and Ukrainian have similar rules, etc. The @@ operation is language (and encoding) independent; it just uses a strcmp call.

To support this sanely though wouldn't you need to know which language rule a tsvector was generated with? Like, have a byte in the tsvector tagging it with the language rule forever more?

What I'm wondering about is if you use a different rule than what was used when an index entry was inserted will you get different results using the index than you would doing a sequential scan and reapplying the operator to every datum?

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] How does the tsearch configuration get selected?
> To support this sanely though wouldn't you need to know which language rule a tsvector was generated with? Like, have a byte in the tsvector tagging it with the language rule forever more?

No. As a corner case, a dictionary might return just a number or a hash value.

> What I'm wondering about is if you use a different rule than what was used when an index entry was inserted will you get different results using the index than you would doing a sequential scan and reapplying the operator to every datum?

Rules are applied during creation of the tsvector, not during indexing of tsvectors. So sequential and index scans will return identical results.
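A small Python model shows why the two scan types agree under that claim. It assumes, as stated above, that the tsvector stored in the heap is already the finished set of lexemes; the inverted index below is a deliberately simple stand-in for gin/gist, and all function names are invented for illustration.

```python
from collections import defaultdict


def build_index(tsvectors):
    """Inverted index built from the stored tsvectors, not from the raw text."""
    index = defaultdict(set)
    for doc_id, lexemes in tsvectors.items():
        for lexeme in lexemes:
            index[lexeme].add(doc_id)
    return index


def seq_scan(tsvectors, query):
    """Reapply the match to every stored tsvector."""
    return {doc_id for doc_id, lexemes in tsvectors.items() if query <= lexemes}


def index_scan(index, all_ids, query):
    """Intersect the posting lists of all query lexemes."""
    result = set(all_ids)
    for lexeme in query:
        result &= index.get(lexeme, set())
    return result


# The configuration was applied once, when these stored values were created.
stored = {1: {"cat", "sat", "mat"}, 2: {"dog", "sat"}, 3: {"cat", "dog"}}
index = build_index(stored)

for query in ({"cat"}, {"sat", "dog"}, {"bird"}):
    print(sorted(seq_scan(stored, query)), sorted(index_scan(index, stored, query)))
```

Both scans read the same stored lexemes, so changing the configuration afterwards affects how new tsvectors and tsqueries are built, but not whether the index and the heap agree with each other.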
Re: [HACKERS] How does the tsearch configuration get selected?
Bruce Momjian writes:
> First, why are we specifying the server locale here since it never changes:

It's poorly described. What it should really say is the language that the text-to-be-searched is in. We can actually support multiple languages here today, the restriction being that there have to be stemmer instances for the languages with the database encoding you're using. With UTF8 encoding this isn't much of a restriction. We do need to put code into the dictionary stuff to enforce that you can't use a stemmer when the database encoding isn't compatible with it.

I would prefer that we not drive any of this stuff off the server's LC_xxx settings, since as you say that restricts things to just one locale.

> Second, I can't figure out how to reference a non-default configuration.

See the multi-argument versions of to_tsvector etc.

I do see a problem with having to_tsvector(config, text) plus to_tsvector(text) where the latter implicitly references a config selected by a GUC variable: how can you tell whether a query using the latter matches a particular index using the former? There isn't anything in the current planner mechanisms that would make that work.

regards, tom lane
Re: [HACKERS] How does the tsearch configuration get selected?
On Thu, 14 Jun 2007, Tom Lane wrote:
> Bruce Momjian writes:
>> First, why are we specifying the server locale here since it never changes:

The server's locale is used for just one purpose: to select which text search configuration to use by default. Any text search function can accept a text search configuration as an optional parameter.

> It's poorly described. What it should really say is the language that the text-to-be-searched is in. We can actually support multiple languages here today, the restriction being that there have to be stemmer instances for the languages with the database encoding you're using. With UTF8 encoding this isn't much of a restriction. We do need to put code into the dictionary stuff to enforce that you can't use a stemmer when the database encoding isn't compatible with it.
>
> I would prefer that we not drive any of this stuff off the server's LC_xxx settings, since as you say that restricts things to just one locale.

Something like

    CREATE TEXT SEARCH DICTIONARY dictname [LOCALE=ru_RU.UTF-8]

and raise a warning/error if the database encoding doesn't match the dictionary encoding, if specified (not all dictionaries depend on encoding, so it should be an optional parameter).

>> Second, I can't figure out how to reference a non-default configuration.
> See the multi-argument versions of to_tsvector etc.
>
> I do see a problem with having to_tsvector(config, text) plus to_tsvector(text) where the latter implicitly references a config selected by a GUC variable: how can you tell whether a query using the latter matches a particular index using the former? There isn't anything in the current planner mechanisms that would make that work.

Probably, having a default text search configuration is not a good idea, and we could just require it as a mandatory parameter, which could eliminate much of the confusion with selecting a text search configuration.

Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: http://www.sai.msu.su/~megera/