Re: [HACKERS] default_text_search_config and expression indexes
Oleg Bartunov wrote: On Thu, 9 Aug 2007, [EMAIL PROTECTED] wrote: Maybe I'm missing something, but it seems to me that the configuration is more attached to a column/index thatn to the whole database. If there's a default in an expression, I'd rather expect this default to be drawn from the index involved than from a global value (like a functional index does now). I'm tired to repeat - index itself doesn't know about configuration ! Is there a way to change that? For example store the configuration in a metapage or something? -- Alvaro Herrerahttp://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] default_text_search_config and expression indexes
Alvaro Herrera [EMAIL PROTECTED] writes: Oleg Bartunov wrote: I'm tired to repeat - index itself doesn't know about configuration ! Is there a way to change that? For example store the configuration in a metapage or something? I think Heikki's suggestion of having each configuration create a new type would effectively do the same thing. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] default_text_search_config and expression indexes
On 8/13/07, Bruce Momjian [EMAIL PROTECTED] wrote: Heikki Linnakangas wrote: Bruce Momjian wrote: Heikki Linnakangas wrote: Removing the default configuration setting altogether removes the 2nd problem, but that's not good from a usability point of view. And it doesn't solve the general issue, you can still do things like: SELECT * FROM foo WHERE to_tsvector('confA', textcol) @@ to_tsquery('confB', 'query'); True, but in that case you are specifically naming different configurations, so it is hopefully obvious you have a mismatch. There's many more subtle ways to do that. For example, filling a tsvector column using a DEFAULT clause. But then you sometimes fill it in the application instead, with a different configuration. Or if one of the function calls is buried in another user defined function. I don't think explicitly naming the configuration gives enough protection. Oh, wow, OK, well in that case the text search API isn't ready and we will have to hold this for 8.4. I've been watching this thread with a mixture of dread and hope, waiting to see where the developers' inclination will end up; whether leaving a useful foot gun available will be allowed. This is just my $0.02 as a fairly heavy user of the current tsearch2 code, but I sincerely hope you do not cripple the system by removing the ability to store tsvectors built using arbitrary configurations in a single column. Yes, it can lead to unexpected results if you do not know what you are doing, but if you have gone beyond building a single tsearch2 configuration then you are required to know what you are doing. What's more, IMO the default configuration mechanism feels very much like a CONSTRAINT, as Oleg suggests. That point is one of cognizance, where if one has gone to the trouble of setting up multiple configurations and has learned enough to do so correctly, then one necessarily understands the importance of the setting and can use it (or not, and use explicit configurations) correctly. The default config lowers the bar to an acceptable level for beginners that have no need of multiple configurations, and while I don't feel too strongly, personally, about having a default, I think it is both useful and helpful for new users -- it was for me. Now, so this email isn't entirely complaining, and as a data point for the discussion, I'll explain why I do not want to see tsearch2 crippled in the way suggested by Heikki and Bruce. My application (http://open-ils.org, which run 80% of the public libraries in Georgia, USA, http://gapines.org and http://georgialibraries.org/lib/pines.html) requires that I be able to search a corpus of bibliographic records in a mix of languages, and potentially with mixed stop-word rules, with one query. I cannot know ahead of time what languages will be used in the corpus and I cannot restrict any one query to one language. To accomplish this, the record itself will be inspected inside an INSERT/UPDATE trigger to determine the language and type, and use the correct configuration for creating the tsvector. This will obviously result in a mixed tsvector column, but that's exactly what I need. I can filter on record language if the user happens to specify a query language (and thus configuration), or simply rank the assumed (IP based, perhaps, or browser preference based) preferred language higher, or one of a hundred other things. But I won't be able to do any of that if tsvectors are required to have one and only one configuration per column. Anyway, I felt I needed to provide some outside perspective to this, as a user, since it seems that the external viewpoint (my particular viewpoint, at least) was missing from the discussion. Thanks, folks, for all the work on this so far! --miker ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] default_text_search_config and expression indexes
Mike Rylander wrote: This is just my $0.02 as a fairly heavy user of the current tsearch2 code, but I sincerely hope you do not cripple the system by removing the ability to store tsvectors built using arbitrary configurations in a single column. Yes, it can lead to unexpected results if you do not know what you are doing, but if you have gone beyond building a single tsearch2 configuration then you are required to know what you are doing. What's more, IMO the default configuration mechanism feels very much like a CONSTRAINT, as Oleg suggests. That point is one of cognizance, where if one has gone to the trouble of setting up multiple configurations and has learned enough to do so correctly, then one necessarily understands the importance of the setting and can use it (or not, and use explicit configurations) correctly. The default config lowers the bar to an acceptable level for beginners that have no need of multiple configurations, and while I don't feel too strongly, personally, about having a default, I think it is both useful and helpful for new users -- it was for me. Thanks for chiming in. As a disclaimer: I've never used tsearch2 in a real application myself. My application (http://open-ils.org, which run 80% of the public libraries in Georgia, USA, http://gapines.org and http://georgialibraries.org/lib/pines.html) requires that I be able to search a corpus of bibliographic records in a mix of languages, and potentially with mixed stop-word rules, with one query. I cannot know ahead of time what languages will be used in the corpus and I cannot restrict any one query to one language. To accomplish this, the record itself will be inspected inside an INSERT/UPDATE trigger to determine the language and type, and use the correct configuration for creating the tsvector. This will obviously result in a mixed tsvector column, but that's exactly what I need. I can filter on record language if the user happens to specify a query language (and thus configuration), or simply rank the assumed (IP based, perhaps, or browser preference based) preferred language higher, or one of a hundred other things. But I won't be able to do any of that if tsvectors are required to have one and only one configuration per column. Don't you need to use the right configuration to parse the query into a tsquery as well? What you have is basically the classic problem problem of representing inheritance in a relational model. You have a superclass, bibliographic record, and multiple subclasses, one per language with one extra field, the corpus in the right language. You've solved it by storing them all in one table, with an extra column (language) to denote which subclass the record is. Alternatively, you could solve it by using inherited tables, or having one table per language with a foreign key referencing the master table containing the other fields, or having a single table with one nullable field per configuration, and a check constraint to check that exactly one of them is not null. As a thought experiment, let me present another, not text search related example that's isomorphic to your problem: Imagine an international online auction system, where you can put items for sale and specify a minimum price. Part of the database schema is: CREATE TABLE item ( id INTEGER, description TEXT, currency CHAR(3), price NUMERIC ); It clearly doesn't make sense to directly compare prices in different currencies against each other. A query like WHERE price 1000 doesn't make sense, unless you also restrict the currency, or use an exchange rate to convert between currencies. Also, different currencies might have slightly different rules on how many decimal places are significant. In this example, as well as your bibliographic scenario, we can conveniently store prices in all currencies in the same field because they're all numerics. If we wanted to enforce per-currency rules, like NUMERIC(10, 2) for USD and NUMERIC(10,0) for Italian lires (which doesn't really exists anymore, I know), we'd need to store them in separate columns. And before the decimalisation in 1971, the British pound was divided into 20 shillings, each of which was divided to 12 pence, so it wouldn't have fit to normal numeric field, and we would again have to store that in a separate column. What I'm trying to point out is that the problem isn't unique to text search. It's an old problem, with many alternative solutions, even with strong typing. Arguably if you store data in multiple languages in same field, you have a denormalized schema. Granted, loose typing can be more convenient, but you give up the benefits of strong typing as well. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] default_text_search_config and expression indexes
Bruce Momjian wrote: Heikki Linnakangas wrote: Bruce Momjian wrote: Heikki Linnakangas wrote: Removing the default configuration setting altogether removes the 2nd problem, but that's not good from a usability point of view. And it doesn't solve the general issue, you can still do things like: SELECT * FROM foo WHERE to_tsvector('confA', textcol) @@ to_tsquery('confB', 'query'); True, but in that case you are specifically naming different configurations, so it is hopefully obvious you have a mismatch. There's many more subtle ways to do that. For example, filling a tsvector column using a DEFAULT clause. But then you sometimes fill it in the application instead, with a different configuration. Or if one of the function calls is buried in another user defined function. I don't think explicitly naming the configuration gives enough protection. Oh, wow, OK, well in that case the text search API isn't ready and we will have to hold this for 8.4. That would be unfortunate :(. Sorry I haven't looked at this earlier. Do you think that implementing a strongly typed system is too much work for 8.3? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] default_text_search_config and expression indexes
On Tue, 14 Aug 2007, Alvaro Herrera wrote: Oleg Bartunov wrote: On Thu, 9 Aug 2007, [EMAIL PROTECTED] wrote: Maybe I'm missing something, but it seems to me that the configuration is more attached to a column/index thatn to the whole database. If there's a default in an expression, I'd rather expect this default to be drawn from the index involved than from a global value (like a functional index does now). I'm tired to repeat - index itself doesn't know about configuration ! Is there a way to change that? For example store the configuration in a metapage or something? it's useless, in general, since you could use different configuration to build tsvector. Regards, Oleg _ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83 ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] default_text_search_config and expression indexes
On 8/14/07, Heikki Linnakangas [EMAIL PROTECTED] wrote: Mike Rylander wrote: [snip] Don't you need to use the right configuration to parse the query into a tsquery as well? Only if the user (or user agent) can supply enough information to move away from the configured default of, say, en-US. And even then, other tsvector configurations need to be searched. Configurations are not, of course, just about language/stemming, but also stop-words and thesauri. If it were just languages it wouldn't be too big of a deal (other than in-app DDL management...). What you have is basically the classic problem problem of representing inheritance in a relational model. You have a superclass, bibliographic record, and multiple subclasses, one per language with one extra field, the corpus in the right language. You've solved it by storing them all in one table, with an extra column (language) to denote which subclass the record is. Alternatively, you could solve it by using inherited tables, or having one table per language with a foreign key referencing the master table containing the other fields, or having a single table with one nullable field per configuration, and a check constraint to check that exactly one of them is not null. Sorry, communication problem here ... I provided an oversimplified example meant more to show the issues than to find alternate solutions, though I appreciate you taking the time to consider the problem. If I only had to consider one delimiting facet per record then it would be much simpler. But with the complication that some fields have stop-word lists (and some not), some use thesauri (and different ones, at that), and these fields (as extracted from the records) and their configurations (stem? use a stop-word list? use thesaurus x, y or z?) are user-defined ... As a thought experiment, let me present another, not text search related example that's isomorphic to your problem: Unfortunately, the problem you describe is not quite the same as the problem I need to solve. Fortunately, the current incarnation of tsearch2 does a pretty good job of solving the problem if you store everything in one place and use the query-time environment to apply some adjustments to the ranking of items. I could still work around this problem by creating inherited tables, one for each configuration on each index-providing table but I /really/ hope to avoid that. Creating new configurations for local requirements doesn't require creating new tables (and the associated management overhead in the app) today, something I'd really like to avoid. In fact, I'm starting to sweat just thinking about what the planner would go through with the number tables needed for the potential configurations in an installation that makes use of multiple thesauri and a mix of stop-word lists across, say, 30 languages. Such a dataset is not uncommon. In any case, thanks again for taking the time to think about the problem. I still think having the ability to store any old tsvector I happen to have hanging around in any column of the correct type is a GoodThing(tm). I see from Oleg's message down-thread that that's the way things will be (the tsvector type doesn't know about columns, just lexem tuples). --miker ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] default_text_search_config and expression indexes
Heikki Linnakangas wrote: Bruce Momjian wrote: Heikki Linnakangas wrote: Bruce Momjian wrote: Heikki Linnakangas wrote: Removing the default configuration setting altogether removes the 2nd problem, but that's not good from a usability point of view. And it doesn't solve the general issue, you can still do things like: SELECT * FROM foo WHERE to_tsvector('confA', textcol) @@ to_tsquery('confB', 'query'); True, but in that case you are specifically naming different configurations, so it is hopefully obvious you have a mismatch. There's many more subtle ways to do that. For example, filling a tsvector column using a DEFAULT clause. But then you sometimes fill it in the application instead, with a different configuration. Or if one of the function calls is buried in another user defined function. I don't think explicitly naming the configuration gives enough protection. Oh, wow, OK, well in that case the text search API isn't ready and we will have to hold this for 8.4. That would be unfortunate :(. Sorry I haven't looked at this earlier. Do you think that implementing a strongly typed system is too much work for 8.3? Yea, probably. See my other posting with an updated subject line. -- Bruce Momjian [EMAIL PROTECTED] http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] default_text_search_config and expression indexes
Mike Rylander wrote: This is just my $0.02 as a fairly heavy user of the current tsearch2 code, but I sincerely hope you do not cripple the system by removing the ability to store tsvectors built using arbitrary configurations in a single column. Yes, it can lead to unexpected results if you do not know what you are doing, but if you have gone beyond building a single tsearch2 configuration then you are required to know what you are doing. What's more, IMO the default configuration mechanism feels very much like a CONSTRAINT, as Oleg suggests. That point is one of cognizance, where if one has gone to the trouble of setting up multiple configurations and has learned enough to do so correctly, then one necessarily understands the importance of the setting and can use it (or not, and use explicit configurations) correctly. The default config lowers the bar to an acceptable level for beginners that have no need of multiple configurations, and while I don't feel too strongly, personally, about having a default, I think it is both useful and helpful for new users -- it was for me. What has really hurt the default GUC idea is how to do restores from a pg_dump. How do you make sure the right default is used on a restore, particularly if multiple objects are being restored, and each has a different default GUC. I suppose your trigger handles that but that isn't going to help with an expression index, nor in cases where the default of the old database is different from the new one. -- Bruce Momjian [EMAIL PROTECTED] http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] default_text_search_config and expression indexes
Bruce Momjian escribió: Mike Rylander wrote: This is just my $0.02 as a fairly heavy user of the current tsearch2 code, but I sincerely hope you do not cripple the system by removing the ability to store tsvectors built using arbitrary configurations in a single column. Yes, it can lead to unexpected results if you do not know what you are doing, but if you have gone beyond building a single tsearch2 configuration then you are required to know what you are doing. What's more, IMO the default configuration mechanism feels very much like a CONSTRAINT, as Oleg suggests. That point is one of cognizance, where if one has gone to the trouble of setting up multiple configurations and has learned enough to do so correctly, then one necessarily understands the importance of the setting and can use it (or not, and use explicit configurations) correctly. The default config lowers the bar to an acceptable level for beginners that have no need of multiple configurations, and while I don't feel too strongly, personally, about having a default, I think it is both useful and helpful for new users -- it was for me. What has really hurt the default GUC idea is how to do restores from a pg_dump. How do you make sure the right default is used on a restore, particularly if multiple objects are being restored, and each has a different default GUC. I suppose your trigger handles that but that isn't going to help with an expression index, nor in cases where the default of the old database is different from the new one. I guess what should happen is that pg_dump should include a SET default_text_search_config = 'foo' just before the CREATE INDEX, like we do for other variables. Of course, in order for this to work, the index itself must know what value was used on creation. Oleg already dismissed my suggestion of putting it into the index itself (a metapage or something). Maybe store it in reloptions? -- Alvaro Herrerahttp://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc. ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] default_text_search_config and expression indexes
Oleg Bartunov wrote: On Tue, 14 Aug 2007, Alvaro Herrera wrote: Oleg Bartunov wrote: On Thu, 9 Aug 2007, [EMAIL PROTECTED] wrote: Maybe I'm missing something, but it seems to me that the configuration is more attached to a column/index thatn to the whole database. If there's a default in an expression, I'd rather expect this default to be drawn from the index involved than from a global value (like a functional index does now). I'm tired to repeat - index itself doesn't know about configuration ! Is there a way to change that? For example store the configuration in a metapage or something? it's useless, in general, since you could use different configuration to build tsvector. Hmm, sorry, I think I just understood what this was about: so you mean that the configuration is really *per row* and not per index? So I can store rows into an index using more than one configuration, and it will work? -- Alvaro Herrerahttp://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc. ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] default_text_search_config and expression indexes
On 8/14/07, Alvaro Herrera [EMAIL PROTECTED] wrote: Oleg Bartunov wrote: On Tue, 14 Aug 2007, Alvaro Herrera wrote: Oleg Bartunov wrote: On Thu, 9 Aug 2007, [EMAIL PROTECTED] wrote: Maybe I'm missing something, but it seems to me that the configuration is more attached to a column/index thatn to the whole database. If there's a default in an expression, I'd rather expect this default to be drawn from the index involved than from a global value (like a functional index does now). I'm tired to repeat - index itself doesn't know about configuration ! Is there a way to change that? For example store the configuration in a metapage or something? it's useless, in general, since you could use different configuration to build tsvector. Hmm, sorry, I think I just understood what this was about: so you mean that the configuration is really *per row* and not per index? So I can store rows into an index using more than one configuration, and it will work? Can and does, to great success. :) --miker ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] default_text_search_config and expression indexes
Alvaro Herrera [EMAIL PROTECTED] writes: Bruce Momjian escribió: What has really hurt the default GUC idea is how to do restores from a pg_dump. I guess what should happen is that pg_dump should include a SET default_text_search_config = 'foo' just before the CREATE INDEX, like we do for other variables. The nasty cases are in data-only dumps, that is, where you're trying to load data into a table with pre-existing indexes or triggers. A SET like the above is at least as likely to be wrong as right, if the index or trigger depends on it to tell it what to do. regards, tom lane ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] default_text_search_config and expression indexes
Mike Rylander [EMAIL PROTECTED] writes: My application (http://open-ils.org, which run 80% of the public libraries in Georgia, USA, http://gapines.org and http://georgialibraries.org/lib/pines.html) requires that I be able to search a corpus of bibliographic records in a mix of languages, and potentially with mixed stop-word rules, with one query. I cannot know ahead of time what languages will be used in the corpus and I cannot restrict any one query to one language. To accomplish this, the record itself will be inspected inside an INSERT/UPDATE trigger to determine the language and type, and use the correct configuration for creating the tsvector. This will obviously result in a mixed tsvector column, but that's exactly what I need. I can filter on record language if the user happens to specify a query language (and thus configuration), or simply rank the assumed (IP based, perhaps, or browser preference based) preferred language higher, or one of a hundred other things. But I won't be able to do any of that if tsvectors are required to have one and only one configuration per column. Anyway, I felt I needed to provide some outside perspective to this, as a user, since it seems that the external viewpoint (my particular viewpoint, at least) was missing from the discussion. This is *extremely* useful. I think it's precisely what we've been missing so far. At least, what I've been missing. So the question is what exactly happens in this case? If I search for the does that mean it will ignore matches in English where that's a stop-word but find me books on tea in French? Is that what I should expect to happen? What if I search for earl and the? Does that find me French books on Early Grey Tea but English books on all earls? What happens if I use the same operator directly on the text column? Or perhaps it's not even possible to specify stop-words when operating on a text column? Should it be? -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] default_text_search_config and expression indexes
Tom Lane wrote: Alvaro Herrera [EMAIL PROTECTED] writes: Bruce Momjian escribi?: What has really hurt the default GUC idea is how to do restores from a pg_dump. I guess what should happen is that pg_dump should include a SET default_text_search_config = 'foo' just before the CREATE INDEX, like we do for other variables. The nasty cases are in data-only dumps, that is, where you're trying to load data into a table with pre-existing indexes or triggers. A SET like the above is at least as likely to be wrong as right, if the index or trigger depends on it to tell it what to do. Ouch. I had not even thought that far. FYI, yes, the default tsearch GUC controls operations per row _if_ you have triggers or expression indexes that rely on the default configuration GUC. If you have specified the configuration, there is no problem, and hence my conclusion that the default GUC is too error-prone. -- Bruce Momjian [EMAIL PROTECTED] http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] default_text_search_config and expression indexes
On 8/14/07, Gregory Stark [EMAIL PROTECTED] wrote: Mike Rylander [EMAIL PROTECTED] writes: My application (http://open-ils.org, which run 80% of the public libraries in Georgia, USA, http://gapines.org and http://georgialibraries.org/lib/pines.html) requires that I be able to search a corpus of bibliographic records in a mix of languages, and potentially with mixed stop-word rules, with one query. I cannot know ahead of time what languages will be used in the corpus and I cannot restrict any one query to one language. To accomplish this, the record itself will be inspected inside an INSERT/UPDATE trigger to determine the language and type, and use the correct configuration for creating the tsvector. This will obviously result in a mixed tsvector column, but that's exactly what I need. I can filter on record language if the user happens to specify a query language (and thus configuration), or simply rank the assumed (IP based, perhaps, or browser preference based) preferred language higher, or one of a hundred other things. But I won't be able to do any of that if tsvectors are required to have one and only one configuration per column. Anyway, I felt I needed to provide some outside perspective to this, as a user, since it seems that the external viewpoint (my particular viewpoint, at least) was missing from the discussion. This is *extremely* useful. I think it's precisely what we've been missing so far. At least, what I've been missing. So the question is what exactly happens in this case? If I search for the does that mean it will ignore matches in English where that's a stop-word but find me books on tea in French? Is that what I should expect to happen? What if I search for earl and the? Does that find me French books on Early Grey Tea but English books on all earls? Oh dear ... you went and got me started... Each field type has a different set of configurations for potenial use. Title and subject fields, for instance, do not (generally) use stop-word lists, so a subject search for the will match any record with the lexem the in a subject field. Title fields are a little more complicated, because there is information in the bibliographic record about how and when to skip leading articles, but generally those are indexed as well for FTS purposes. English LCSH subjects generally don't have stop(like) words in them, so you'll probably just get French tea records. Title results would be a mix of earls and French tea records (probably) and the correlation between the user's preferred language (either chosen query lang or UI lang, whichever is available) will help adjust the ranking, pushing what are likely to be the more appropriate records to the top. Note, however, that much of this multi-tsearch2-configuration setup is not used in the implementation at http://gapines.org/ because, well, there's not much need (read: demand from librarians) for that dataset to support these more complex tricks. It's basically all en-US and exclude stop-words. Other implementations are making more use of what I describe above, including a (government mandated) French-English bilingual institution who shall remain nameless for the time being... What happens if I use the same operator directly on the text column? Or perhaps it's not even possible to specify stop-words when operating on a text column? Should it be? You mean with an expression index on a text column? I haven't considered using them for FTS. It just feels easier and more flexible to me to use an external tsvector column because of the fairly heavy processing that goes into creating each tsvector value. I may re-evaluate that position now that CREATE INDEX CONCURRENTLY exists, but I'm not developing with 8.2+ only features yet. Once 8.3 is out that may change. Also, unless I misunderstand, you have to wrap the text column in the function used to build the index. For my purposes, that makes building a generic FTS driver for my app (which, admittedly, only has a Postgresql driver ;) ) more difficult than having a hidden extra column. Again, that could change if the benefits of CREATE INDEX CONCURRENTLY end up outweighing simpler FTS driver code. --miker ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] default_text_search_config and expression indexes
On Tue, 14 Aug 2007, Alvaro Herrera wrote: Oleg Bartunov wrote: On Tue, 14 Aug 2007, Alvaro Herrera wrote: Oleg Bartunov wrote: On Thu, 9 Aug 2007, [EMAIL PROTECTED] wrote: Maybe I'm missing something, but it seems to me that the configuration is more attached to a column/index thatn to the whole database. If there's a default in an expression, I'd rather expect this default to be drawn from the index involved than from a global value (like a functional index does now). I'm tired to repeat - index itself doesn't know about configuration ! Is there a way to change that? For example store the configuration in a metapage or something? it's useless, in general, since you could use different configuration to build tsvector. Hmm, sorry, I think I just understood what this was about: so you mean that the configuration is really *per row* and not per index? So I can in the very extreme case, yes. Index doesn't care about configuration. Everything should works without index ! store rows into an index using more than one configuration, and it will work? why not. For one set of documents you can use one configuration (parser+mappings), for another - different configuration. Regards, Oleg _ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83 ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] default_text_search_config and expression indexes
Heikki Linnakangas wrote: Oleg Bartunov wrote: On Wed, 8 Aug 2007, Bruce Momjian wrote: Heikki Linnakangas wrote: If I understood correctly, the basic issue is that a tsvector datum created using configuration A is incompatible with a tsquery datum created using configuration B, in the sense that you won't get reasonable results if you use the tsquery to search the tsvector, or do ranking or highlighting. If the configurations happen to be similar enough, it can work, but not in general. Right. not fair. There are many cases when one can intentionally use different configurations. But I agree, this is not for beginners. Can you give an example of that? I certainly can see the need to use different configurations in one database, but what's the use case for comparing a tsvector created with configuration A against a tsquery created with configuration B? I assume you could have a configuration with different stop words or synonymns and compare them. - using an expression index instead of a tsvector-field, and always explicitly specifying the configuration, you can avoid that problem (a query with a different configuration won't use the index). But an expression index, without explicitly specifying the configuration, will get corrupted if you change the default configuration. Right. the same problem if you drop constrain from table (accidently) and then gets surprised by select results. The difference is that if you change the default configuration, you won't expect that your queries start to return funny results. It looks harmless, like changing the date style. If you drop a constraint, it's much more obvious what the consequences are. We should agree that all you describe is only for DUMMY users. From authors point of view I dislike your approach to treat text searching as a very limited tool. But I understand that we should preserve people from stupid errors. I want for beginners easy setup and error-prone functionality, but leaving experienced users to develop complex search engines. Can we have separate safe interface for text searching and explicitly recommend it for beginners ? I don't see how any of the suggestions limits what you can do with it. If we remove the default configuration parameter, you just have to be explicit. If we go with the type-system I suggested, you could still add casts and conversion functions between different tsvector types, where it make sense. I don't think the type system is workable given the ability to create new configurations on the fly. I think the configuration must be specified each time. At this point, if we keep discussing the tsearch2 API we are not going to have this in 8.3. -- Bruce Momjian [EMAIL PROTECTED] http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] default_text_search_config and expression indexes
Heikki Linnakangas wrote: Bruce Momjian wrote: Heikki Linnakangas wrote: Removing the default configuration setting altogether removes the 2nd problem, but that's not good from a usability point of view. And it doesn't solve the general issue, you can still do things like: SELECT * FROM foo WHERE to_tsvector('confA', textcol) @@ to_tsquery('confB', 'query'); True, but in that case you are specifically naming different configurations, so it is hopefully obvious you have a mismatch. There's many more subtle ways to do that. For example, filling a tsvector column using a DEFAULT clause. But then you sometimes fill it in the application instead, with a different configuration. Or if one of the function calls is buried in another user defined function. I don't think explicitly naming the configuration gives enough protection. Oh, wow, OK, well in that case the text search API isn't ready and we will have to hold this for 8.4. -- Bruce Momjian [EMAIL PROTECTED] http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] default_text_search_config and expression indexes
Bruce Momjian wrote: Heikki Linnakangas wrote: Removing the default configuration setting altogether removes the 2nd problem, but that's not good from a usability point of view. And it doesn't solve the general issue, you can still do things like: SELECT * FROM foo WHERE to_tsvector('confA', textcol) @@ to_tsquery('confB', 'query'); True, but in that case you are specifically naming different configurations, so it is hopefully obvious you have a mismatch. There's many more subtle ways to do that. For example, filling a tsvector column using a DEFAULT clause. But then you sometimes fill it in the application instead, with a different configuration. Or if one of the function calls is buried in another user defined function. I don't think explicitly naming the configuration gives enough protection. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] default_text_search_config and expression indexes
Oleg Bartunov wrote: On Wed, 8 Aug 2007, Bruce Momjian wrote: Heikki Linnakangas wrote: If I understood correctly, the basic issue is that a tsvector datum created using configuration A is incompatible with a tsquery datum created using configuration B, in the sense that you won't get reasonable results if you use the tsquery to search the tsvector, or do ranking or highlighting. If the configurations happen to be similar enough, it can work, but not in general. Right. not fair. There are many cases when one can intentionally use different configurations. But I agree, this is not for beginners. Can you give an example of that? I certainly can see the need to use different configurations in one database, but what's the use case for comparing a tsvector created with configuration A against a tsquery created with configuration B? - using an expression index instead of a tsvector-field, and always explicitly specifying the configuration, you can avoid that problem (a query with a different configuration won't use the index). But an expression index, without explicitly specifying the configuration, will get corrupted if you change the default configuration. Right. the same problem if you drop constrain from table (accidently) and then gets surprised by select results. The difference is that if you change the default configuration, you won't expect that your queries start to return funny results. It looks harmless, like changing the date style. If you drop a constraint, it's much more obvious what the consequences are. We should agree that all you describe is only for DUMMY users. From authors point of view I dislike your approach to treat text searching as a very limited tool. But I understand that we should preserve people from stupid errors. I want for beginners easy setup and error-prone functionality, but leaving experienced users to develop complex search engines. Can we have separate safe interface for text searching and explicitly recommend it for beginners ? I don't see how any of the suggestions limits what you can do with it. If we remove the default configuration parameter, you just have to be explicit. If we go with the type-system I suggested, you could still add casts and conversion functions between different tsvector types, where it make sense. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] default_text_search_config and expression indexes
Oleg Bartunov wrote: Yea, seems more work than is necessary. If we require the configuration to be always supplied, and document that mismatches are a problem, I think we are in good shape. We should agree that all you describe is only for DUMMY users. From authors point of view I dislike your approach to treat text searching as a very limited tool. But I understand that we should preserve people from stupid errors. I want for beginners easy setup and error-prone functionality, but leaving experienced users to develop complex search engines. Can we have separate safe interface for text searching and explicitly recommend it for beginners ? I am glad we are moving this interface discussion forward. It seems Heikki has similar concerns about the interface being error-prone. It would be nice to have a novice and advanced interface, but we would have to document both, and then that is going to be confusing for users. As I see it, specifying the configuration name in every function call is the novice interface, and avoids the most common errors. I can see defaulting the interface name as being an advanced interface, but I don't think it has enough of a feature to be worth documenting and implementing. If we figure out something better in 8.4 we can implement it, but at this point I can't think of any good solution to not specifying the configuration name every time. -- Bruce Momjian [EMAIL PROTECTED] http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] default_text_search_config and expression indexes
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On Thu, Aug 09, 2007 at 02:36:41AM -0400, Bruce Momjian wrote: Oleg Bartunov wrote: Yea, seems more work than is necessary. If we require the configuration to be always supplied, and document that mismatches are a problem, I think we are in good shape. We should agree that all you describe is only for DUMMY users. From authors point of view I dislike your approach to treat text searching as a very limited tool [...] [...] I am glad we are moving this interface discussion forward. It seems Heikki has similar concerns about the interface being error-prone. It would be nice to have a novice and advanced interface, but we would have to document both, and then that is going to be confusing for users. As I see it, specifying the configuration name in every function call is the novice interface, and avoids the most common errors. I can see defaulting the interface name as being an advanced interface, but I don't think it has enough of a feature to be worth documenting and implementing. If we figure out something better in 8.4 we can implement it, but at this point I can't think of any good solution to not specifying the configuration name every time. Maybe I'm missing something, but it seems to me that the configuration is more attached to a column/index thatn to the whole database. If there's a default in an expression, I'd rather expect this default to be drawn from the index involved than from a global value (like a functional index does now). Regards - -- tomás -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.1 (GNU/Linux) iD8DBQFGuuAoBcgs9XrR2kYRAqiiAJsFL+Iu/b/xYaLza5ozmi839Qh5awCeOp+f SZHKDPUHZ3u99XzLBn2ZKjw= =twEt -END PGP SIGNATURE- ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] default_text_search_config and expression indexes
On Thu, 9 Aug 2007, [EMAIL PROTECTED] wrote: Maybe I'm missing something, but it seems to me that the configuration is more attached to a column/index thatn to the whole database. If there's a default in an expression, I'd rather expect this default to be drawn from the index involved than from a global value (like a functional index does now). I'm tired to repeat - index itself doesn't know about configuration ! It just index tsvector data type. tsvector in turn can be obtained using various ways: 1. manually 2. to_tsvector tsvector can be stored in a separate attribute or be fully virtual like in expressional index. Moreover, tsvector can be obtained using various configurations depending on your application. Regards, Oleg _ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83 ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] default_text_search_config and expression indexes
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On Thu, Aug 09, 2007 at 02:03:13PM +0400, Oleg Bartunov wrote: On Thu, 9 Aug 2007, [EMAIL PROTECTED] wrote: Maybe I'm missing something [...] I'm tired to repeat - index itself doesn't know about configuration ! It just index tsvector data type. tsvector in turn can be obtained using various ways: 1. manually 2. to_tsvector tsvector can be stored in a separate attribute or be fully virtual like in expressional index. Moreover, tsvector can be obtained using various configurations depending on your application. Yep. I think I got that. Now what will be the normal case for a new user? * build an index with a virtual tsvector (using a configuration). Then I would expect the index to know the whole function to calculate its entries -- that would include the config used. This would be more compatible with the approach stated elsewhere to always mention explicitly the config. * manually. Would a novice do that? Or is that advanced stuff? Regards -- and sorry for my stupid questions :) - -- tomás -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.1 (GNU/Linux) iD8DBQFGu1ciBcgs9XrR2kYRAnVqAJ9mSc463I44JxLLDPpUZ/JirUVH5ACeOhUr 2/4aXs0ukMnvP8YCq8pamwQ= =IgfC -END PGP SIGNATURE- ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] default_text_search_config and expression indexes
Heikki Linnakangas wrote: Sure, but you have make sure you use the right configuration in the trigger, no? Does the tsquery have to use the same configuration? I wish I knew this myself. :-) Whatever I had done happened to work but that was largely through people on IRC walking me through it. This illustrates the major issue --- that this has to be simple for people to get started, while keeping the capabilities for experienced users. I am now thinking that making users always specify the configuration name and not allowing :: casting is going to be the best approach. We can always add more in 8.4 after it is in wide use. I just read the docs and I'm trying to get a grip of the problem here. If I understood correctly, the basic issue is that a tsvector datum created using configuration A is incompatible with a tsquery datum created using configuration B, in the sense that you won't get reasonable results if you use the tsquery to search the tsvector, or do ranking or highlighting. If the configurations happen to be similar enough, it can work, but not in general. Right. That underlying issue manifests itself in many ways, including: - if you create table with a field of type tsvector, typically kept up-to-date by triggers, and do a search on it using a different configuration, you get incorrect results. Right. - using an expression index instead of a tsvector-field, and always explicitly specifying the configuration, you can avoid that problem (a query with a different configuration won't use the index). But an expression index, without explicitly specifying the configuration, will get corrupted if you change the default configuration. Right. Removing the default configuration setting altogether removes the 2nd problem, but that's not good from a usability point of view. And it doesn't solve the general issue, you can still do things like: SELECT * FROM foo WHERE to_tsvector('confA', textcol) @@ to_tsquery('confB', 'query'); True, but in that case you are specifically naming different configurations, so it is hopefully obvious you have a mismatch. ISTM we should have a separate tsvector and tsquery data type for each configuration, and throw an error if you try to mix and match them in a query. to_tsquery and to_tsvector would be new kind of polymorphic functions that work with the types. Or we could automatically create a copy of them when you create a new configuration. We could have a default configuration setting and rewrite queries that don't explicitly specify a configuration to use the default. That is going to make multiple configurations quite complex in the backend, and I think for little value. You could still get into trouble if you alter the configuration after starting to use it. We could solve that by not allowing you to ALTER CONFIGURATION, at least not if it's used in tables or indexes. Forcing people to create a new configuration, and to recreate all indexes and tsvector columns every time you add a word to a stop-list, for example, seems too onerous, though. Not sure what to do about that. Yea, seems more work than is necessary. If we require the configuration to be always supplied, and document that mismatches are a problem, I think we are in good shape. -- Bruce Momjian [EMAIL PROTECTED] http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] default_text_search_config and expression indexes
On Wed, 8 Aug 2007, Bruce Momjian wrote: Heikki Linnakangas wrote: Sure, but you have make sure you use the right configuration in the trigger, no? Does the tsquery have to use the same configuration? I wish I knew this myself. :-) Whatever I had done happened to work but that was largely through people on IRC walking me through it. This illustrates the major issue --- that this has to be simple for people to get started, while keeping the capabilities for experienced users. I am now thinking that making users always specify the configuration name and not allowing :: casting is going to be the best approach. We can always add more in 8.4 after it is in wide use. I just read the docs and I'm trying to get a grip of the problem here. If I understood correctly, the basic issue is that a tsvector datum created using configuration A is incompatible with a tsquery datum created using configuration B, in the sense that you won't get reasonable results if you use the tsquery to search the tsvector, or do ranking or highlighting. If the configurations happen to be similar enough, it can work, but not in general. Right. not fair. There are many cases when one can intentionally use different configurations. But I agree, this is not for beginners. That underlying issue manifests itself in many ways, including: - if you create table with a field of type tsvector, typically kept up-to-date by triggers, and do a search on it using a different configuration, you get incorrect results. Right. again, you might want to use different configuration. - using an expression index instead of a tsvector-field, and always explicitly specifying the configuration, you can avoid that problem (a query with a different configuration won't use the index). But an expression index, without explicitly specifying the configuration, will get corrupted if you change the default configuration. Right. the same problem if you drop constrain from table (accidently) and then gets surprised by select results. Removing the default configuration setting altogether removes the 2nd problem, but that's not good from a usability point of view. And it doesn't solve the general issue, you can still do things like: SELECT * FROM foo WHERE to_tsvector('confA', textcol) @@ to_tsquery('confB', 'query'); True, but in that case you are specifically naming different configurations, so it is hopefully obvious you have a mismatch. ISTM we should have a separate tsvector and tsquery data type for each configuration, and throw an error if you try to mix and match them in a query. to_tsquery and to_tsvector would be new kind of polymorphic functions that work with the types. Or we could automatically create a copy of them when you create a new configuration. We could have a default configuration setting and rewrite queries that don't explicitly specify a configuration to use the default. That is going to make multiple configurations quite complex in the backend, and I think for little value. You could still get into trouble if you alter the configuration after starting to use it. We could solve that by not allowing you to ALTER CONFIGURATION, at least not if it's used in tables or indexes. Forcing people to create a new configuration, and to recreate all indexes and tsvector columns every time you add a word to a stop-list, for example, seems too onerous, though. Not sure what to do about that. Yea, seems more work than is necessary. If we require the configuration to be always supplied, and document that mismatches are a problem, I think we are in good shape. We should agree that all you describe is only for DUMMY users. From authors point of view I dislike your approach to treat text searching as a very limited tool. But I understand that we should preserve people from stupid errors. I want for beginners easy setup and error-prone functionality, but leaving experienced users to develop complex search engines. Can we have separate safe interface for text searching and explicitly recommend it for beginners ? Regards, Oleg _ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83 ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] default_text_search_config and expression indexes
Bruce Momjian wrote: Ron Mayer wrote: We need more feedback from users. Well, I am waiting for other hackers to get involved, but if they don't, I have to evaluate it myself on the email lists. Personally, I think documentation changes would be an OK way to to handle it. Something that makes it extremely clear to the user the advantages of having the extra column and the risks of avoiding them. Sure, but you have make sure you use the right configuration in the trigger, no? Does the tsquery have to use the same configuration? I wish I knew this myself. :-) Whatever I had done happened to work but that was largely through people on IRC walking me through it. This illustrates the major issue --- that this has to be simple for people to get started, while keeping the capabilities for experienced users. I am now thinking that making users always specify the configuration name and not allowing :: casting is going to be the best approach. We can always add more in 8.4 after it is in wide use. I just read the docs and I'm trying to get a grip of the problem here. If I understood correctly, the basic issue is that a tsvector datum created using configuration A is incompatible with a tsquery datum created using configuration B, in the sense that you won't get reasonable results if you use the tsquery to search the tsvector, or do ranking or highlighting. If the configurations happen to be similar enough, it can work, but not in general. That underlying issue manifests itself in many ways, including: - if you create table with a field of type tsvector, typically kept up-to-date by triggers, and do a search on it using a different configuration, you get incorrect results. - using an expression index instead of a tsvector-field, and always explicitly specifying the configuration, you can avoid that problem (a query with a different configuration won't use the index). But an expression index, without explicitly specifying the configuration, will get corrupted if you change the default configuration. Removing the default configuration setting altogether removes the 2nd problem, but that's not good from a usability point of view. And it doesn't solve the general issue, you can still do things like: SELECT * FROM foo WHERE to_tsvector('confA', textcol) @@ to_tsquery('confB', 'query'); ISTM we should have a separate tsvector and tsquery data type for each configuration, and throw an error if you try to mix and match them in a query. to_tsquery and to_tsvector would be new kind of polymorphic functions that work with the types. Or we could automatically create a copy of them when you create a new configuration. We could have a default configuration setting and rewrite queries that don't explicitly specify a configuration to use the default. You could still get into trouble if you alter the configuration after starting to use it. We could solve that by not allowing you to ALTER CONFIGURATION, at least not if it's used in tables or indexes. Forcing people to create a new configuration, and to recreate all indexes and tsvector columns every time you add a word to a stop-list, for example, seems too onerous, though. Not sure what to do about that. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] default_text_search_config and expression indexes
Bruce Momjian wrote: Oleg Bartunov wrote: What is a basis of your assumption ? In my opinion, it's very limited use of text search, because it doesn't supports ranking. For 4-5 years of tsearch2 usage I never used it and I never seem in mailing lists. This is very user-oriented feature and we could probably ask -general people for their opinion. I think I asked about this kind of usage a couple years back; and Oleg pointed out other reasons why it wasn't as good an idea too. http://archives.postgresql.org/pgsql-general/2005-10/msg00475.php http://archives.postgresql.org/pgsql-general/2005-10/msg00477.php The particular question I had asked why the functional index was slower than maintaining the extra column; with the explanation that the lossy index having to call the function (including parsing, dictionary lookup, etc) for re-checking the data made it inadvisable to avoid the extra column anyway. I doubt 'general' is going to understand the details of merging this into the backend. I assume we have enough people on hackers to decide this. Are you saying the majority of users have a separate column with a trigger? I think so. At least when I was using it in 2005 the second column with the trigger was faster than using a functional index. We need more feedback from users. Well, I am waiting for other hackers to get involved, but if they don't, I have to evaluate it myself on the email lists. Personally, I think documentation changes would be an OK way to to handle it. Something that makes it extremely clear to the user the advantages of having the extra column and the risks of avoiding them. ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] default_text_search_config and expression indexes
Oleg Bartunov wrote: On Tue, 31 Jul 2007, Bruce Momjian wrote: Oleg Bartunov wrote: On Tue, 31 Jul 2007, Bruce Momjian wrote: And if we have to require the configuration name in CREATE INDEX, it has to be used in WHERE, so we might as well just remove the default capability and always require the configuration name. this is very rare use case for text searching 1. expression index without configuration name 2. default_text_search_config can be changed by somebody If you are going to be using the configuration name with the create expression index, you have to use it in the WHERE clause (or the index doesn't work), and I assume that is 90% of the text search uses. I don't see it as rare at all. What is a basis of your assumption ? In my opinion, it's very limited use of text search, because it doesn't supports ranking. For 4-5 years of tsearch2 usage I never used it and I never seem in mailing lists. This is very user-oriented feature and we could probably ask -general people for their opinion. I doubt 'general' is going to understand the details of merging this into the backend. I assume we have enough people on hackers to decide this. I mean not technical details, but use case. Does they need expressional index without ranking but sacrifice ability to use default configuration in other cases too ? My prediction is that people doesn't ever thought about this possibility until we said them about. In a choice between expression indexes and default_text_search_config, there is no question in my mind that expression indexes are more useful. Lack of default_text_search_config only means you have to specify the configuration name every time, and can't do casting to a text search data type. Are you saying the majority of users have a separate column with a trigger? Does the trigger specify the configuation? I don't see that as a parameter argument to tsvector_update_trigger(). If you reload a pg_dump, what does it use for the configuration? yes, separate column with custom trigger works fine. It's up to you how to keep your data actual and it's up to you how to write trigger. Our tsvector_update_trigger() is a tsvector_update_trigger_example() ! Well, that is the major problem --- that this is very error-prone, especially considering that the tsvector_update_trigger() doesn't get it right either. Why is a separate column better than the index? Just ranking? ranking + composite documents. I already mentioned, that this could be rather expensive. Also, having separate column allow people various ways to say what is a document and even change it. OK, I am confused why an expression index can't use those features if a separate column can. I realize the index can't store that information, but why can the code pick it out of a heap column but not run the function on the heap row to get that information. I assume it is something that is just hard to implement. The reason the expression index is nice is this feature has to be easy to use for people who are new to full text and even PostgreSQL. Right now /contrib is fine for experts to use, but we want a larger user base for this feature. I agree here. This was one of the main reason of our work for 8.3. Probably, we shold think in another direction - not to curtail tsearch2 and confuse rather big existing users, but to add an ability to save somehow configuration used for creating of *document* either implicitly (in expression index, or just gin(text_column)), or explicitly (separate column). There is no problem with index itself ! Agreed. We need to find a way to save the configuration when the output of a text search function is stored, either in an expression index or via a trigger into a separate column, but only if we allow the default configuration to be changed by non-super-users. Should we hold the patch for 8.4? If we're not agree to say in docs, that implicit usage of text search configuration in CREATE INDEX command doesn't supported. Could we leave default_text_search_config for super-users, at least ? Anyway, let's wait what other people say. The big problem is that not many people have taken the time to fully understand how full text search works. I hoped that putting the updated documentation online would help: http://momjian.us/expire/fulltext/HTML/textsearch.html but it seems it hasn't. What we could do it if we make default_text_search_config super-user-only and tell users at the start that if default_text_search_config doesn't match the language they want to use, then they have to read a documentation section that explains the problem of configuration mismatches. The problem with that is that we should be setting default_text_search_config in the pg_dump output, like we do for client_encoding, but because it is a super-user-only, it will fail for non-super-user restores. So, I am back to thinking
Re: [HACKERS] default_text_search_config and expression indexes
Ron Mayer wrote: Bruce Momjian wrote: Oleg Bartunov wrote: What is a basis of your assumption ? In my opinion, it's very limited use of text search, because it doesn't supports ranking. For 4-5 years of tsearch2 usage I never used it and I never seem in mailing lists. This is very user-oriented feature and we could probably ask -general people for their opinion. I think I asked about this kind of usage a couple years back; and Oleg pointed out other reasons why it wasn't as good an idea too. http://archives.postgresql.org/pgsql-general/2005-10/msg00475.php http://archives.postgresql.org/pgsql-general/2005-10/msg00477.php The particular question I had asked why the functional index was slower than maintaining the extra column; with the explanation that the lossy index having to call the function (including parsing, dictionary lookup, etc) for re-checking the data made it inadvisable to avoid the extra column anyway. I doubt 'general' is going to understand the details of merging this into the backend. I assume we have enough people on hackers to decide this. Are you saying the majority of users have a separate column with a trigger? I think so. At least when I was using it in 2005 the second column with the trigger was faster than using a functional index. OK, it is good you measured it. I wonder how GIN would behave because it is not lossy. We need more feedback from users. Well, I am waiting for other hackers to get involved, but if they don't, I have to evaluate it myself on the email lists. Personally, I think documentation changes would be an OK way to to handle it. Something that makes it extremely clear to the user the advantages of having the extra column and the risks of avoiding them. Sure, but you have make sure you use the right configuration in the trigger, no? Does the tsquery have to use the same configuration? -- Bruce Momjian [EMAIL PROTECTED] http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] default_text_search_config and expression indexes
Bruce Momjian wrote: Ron Mayer wrote: Bruce Momjian wrote: Oleg Bartunov wrote: What is a basis of your assumption ? I think I asked about this kind of usage a couple years back;... http://archives.postgresql.org/pgsql-general/2005-10/msg00475.php http://archives.postgresql.org/pgsql-general/2005-10/msg00477.php ...why the functional index was slower than maintaining the extra column; with the explanation that the lossy index having to call the function (including parsing, dictionary lookup, etc) for re-checking the data ... ... Are you saying the majority of users have a separate column with a trigger? I think so. At least when I was using it in 2005 the second column with the trigger was faster than using a functional index. OK, it is good you measured it. I wonder how GIN would behave because it is not lossy. Too bad I don't have the same database around anymore. It seems the re-parsing for re-checking for the lossy index was very expensive, tho. In the end, I suspect it depends greatly on what fraction of rows match. We need more feedback from users. Well, I am waiting for other hackers to get involved, but if they don't, I have to evaluate it myself on the email lists. Personally, I think documentation changes would be an OK way to to handle it. Something that makes it extremely clear to the user the advantages of having the extra column and the risks of avoiding them. Sure, but you have make sure you use the right configuration in the trigger, no? Does the tsquery have to use the same configuration? I wish I knew this myself. :-) Whatever I had done happened to work but that was largely through people on IRC walking me through it. ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] default_text_search_config and expression indexes
Ron Mayer wrote: We need more feedback from users. Well, I am waiting for other hackers to get involved, but if they don't, I have to evaluate it myself on the email lists. Personally, I think documentation changes would be an OK way to to handle it. Something that makes it extremely clear to the user the advantages of having the extra column and the risks of avoiding them. Sure, but you have make sure you use the right configuration in the trigger, no? Does the tsquery have to use the same configuration? I wish I knew this myself. :-) Whatever I had done happened to work but that was largely through people on IRC walking me through it. This illustrates the major issue --- that this has to be simple for people to get started, while keeping the capabilities for experienced users. I am now thinking that making users always specify the configuration name and not allowing :: casting is going to be the best approach. We can always add more in 8.4 after it is in wide use. -- Bruce Momjian [EMAIL PROTECTED] http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] default_text_search_config and expression indexes
Bruce Momjian wrote: Ron Mayer wrote: I wish I knew this myself. :-) Whatever I had done happened to work but that was largely through people on IRC walking me through it. This illustrates the major issue --- that this has to be simple for people to get started, while keeping the capabilities for experienced users. I am now thinking that making users always specify the configuration name and not allowing :: casting is going to be the best approach. We can always add more in 8.4 after it is in wide use. That's fair. Either the docs need to make it totally obvious or the software should force people to do something safe. ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] default_text_search_config and expression indexes
On Mon, 30 Jul 2007, Bruce Momjian wrote: Oleg Bartunov wrote: OK, here is what I am thinking. If we make default_text_search_config super-user-only, then the user can't do SET (using zero_damaged_pages as a superuser-only example): test= set zero_damaged_pages = on; ERROR: permission denied to set parameter zero_damaged_pages test= alter user guest set zero_damaged_pages = on; ERROR: permission denied to set parameter zero_damaged_pages but the super-user can set it in postgresql.conf, or: test=# alter user guest set zero_damaged_pages = on; ALTER ROLE or test=# alter database vendor3 set zero_damaged_pages = on; ALTER ROLE meaning while it will be super-user-only, the administrator can set the default for specific databases and users. Is that the best approach? A user can still over-ride the default by specifying the configuration in the function call. This is ok, but it will not work in hosting environment and still doesn't prevent errors. Agreed. super-user-only now seems strange to me because it isn't a security issue, but rather an attempt to avoid people causing errors. The fundamental issue is that if you do a query using tsvector and tsquery everything will work find because default_text_search_config will be the same for both queries. The problem is if do an expression index lookup that doesn't specify the configuration name and your default_text_search_config doesn't match the index, or you INSERT or UPDATE into an expression index with a mismatched default_text_search_config. If we do make default_text_search_config super-user-only it prevents a database owner from doing ALTER DATABASE db1 SET default_text_search_config = 'english', which seems like a pretty big limitation because I think per-database default_text_search_config makes the most sense. And, again, if you specify the configuration in the expression index you have to specify it in the WHERE clause and then default_text_search_config is pretty useless. agree. Notice, this is very limited usage case. If we required the configuration to always be specified, you could still store multiple configurations in the same column by having a secondary column hold the configuration name: I don't understand this. Let's don't discuss indexes at all, since indexes doesn't know about configuratons at all CREATE INDEX i on x USING gist (to_tsvector(config_col, body)); Regards, Oleg _ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83 ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] default_text_search_config and expression indexes
On Mon, 30 Jul 2007, Bruce Momjian wrote: Bruce Momjian wrote: We have to decide if we want a GUC default_text_search_config, and if so when can it be changed. Right now there are three ways to create a tsvector (or tsquery) ::tsvector to_tsvector(value) to_tsvector(config, value) (ignoring plainto_tsvector) Only the last one specifies the configuration. The others use the configuration specified by default_text_search_config. (We had an previous discussion on what the default value of default_text_search_config should be, and it was decided it should be set via initdb based on a flag or the locale.) Now, because most people use a single configuration, they can just set default_text_search_config and there is no need to specify the configuration name. However, expression indexes cause a problem here: http://momjian.us/expire/fulltext/HTML/textsearch-tables.html#TEXTSEARCH-TABLES-INDEX We recommend that users create an expression index on the column they want to do a full text search on, e.g. CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector(body)); However, the big problem is that the expressions used in expression indexes should not change their output based on the value of a GUC variable (because it would corrupt the index), but in the case above, default_text_search_config controls what configuration is used, and hence the output of to_tsvector is changed if default_text_search_config changes. We have a few possible options: 1) Document the problem and do nothing else. 2) Make default_text_search_config a postgresql.conf-only setting, thereby making it impossible to change by non-super users, or make it a super-user-only setting. 3) Remove default_text_search_config and require the configuration to be specified in each function call. If we remove default_text_search_config, it would also make ::tsvector casting useless as well. OK, I just found a case that I think is going to make #3 a requirement (remove default_text_search_config). How is a CREATE INDEX ... to_tsvector(col) going to restore from a pg_dump? I see no way of guaranteeing that the default_text_search_config is correct on the restore, and in fact I don't think we have any way of knowing the default_text_search_config used for the index. And if we have to require the configuration name in CREATE INDEX, it has to be used in WHERE, so we might as well just remove the default capability and always require the configuration name. this is very rare use case for text searching 1. expression index without configuration name 2. default_text_search_config can be changed by somebody If somebody really need it, then he should be adviced to use configuration name, else we don't guarantee that somebody could change default_text_search_config variable and this could lead to incorrect dump/restore. I don't think we should remove default_text_search_config because of this rare case. Regards, Oleg PS. Bruce, I'm in the mountains the Northern Caucasia and internet is a bit unreliable :( _ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83 ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] default_text_search_config and expression indexes
Oleg Bartunov wrote: If we remove default_text_search_config, it would also make ::tsvector casting useless as well. OK, I just found a case that I think is going to make #3 a requirement (remove default_text_search_config). How is a CREATE INDEX ... to_tsvector(col) going to restore from a pg_dump? I see no way of guaranteeing that the default_text_search_config is correct on the restore, and in fact I don't think we have any way of knowing the default_text_search_config used for the index. And if we have to require the configuration name in CREATE INDEX, it has to be used in WHERE, so we might as well just remove the default capability and always require the configuration name. this is very rare use case for text searching 1. expression index without configuration name 2. default_text_search_config can be changed by somebody If you are going to be using the configuration name with the create expression index, you have to use it in the WHERE clause (or the index doesn't work), and I assume that is 90% of the text search uses. I don't see it as rare at all. If somebody really need it, then he should be adviced to use configuration name, else we don't guarantee that somebody could change default_text_search_config variable and this could lead to incorrect dump/restore. I don't think we should remove default_text_search_config because of this rare case. I still feel the default_text_search_config has to be removed. We have tried all sorts of ways to make it work but having it not be 100% reliable for pg_dump/restore means it might as well be in /contrib and unsupported. If we have it in core, it has to work 100%. We can't have tons of examples that don't specify the configuration name and then expect every create expression index and WHERE clause to use it. default_text_search_config _can_ work, but it seems so easy to break and so easy to get wrong that I think it must be removed. If we are going to keep it, I need someone to explain why my comments above are wrong. If I am right, someone has to remove default_text_search_config from the patch. I can do the documentation. Bruce, I'm in the mountains the Northern Caucasia and internet is a bit unreliable :( Thanks. I noticed a lag in your reply. Hope you are having a good time. -- Bruce Momjian [EMAIL PROTECTED] http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] default_text_search_config and expression indexes
On Tue, 31 Jul 2007, Bruce Momjian wrote: And if we have to require the configuration name in CREATE INDEX, it has to be used in WHERE, so we might as well just remove the default capability and always require the configuration name. this is very rare use case for text searching 1. expression index without configuration name 2. default_text_search_config can be changed by somebody If you are going to be using the configuration name with the create expression index, you have to use it in the WHERE clause (or the index doesn't work), and I assume that is 90% of the text search uses. I don't see it as rare at all. What is a basis of your assumption ? In my opinion, it's very limited use of text search, because it doesn't supports ranking. For 4-5 years of tsearch2 usage I never used it and I never seem in mailing lists. This is very user-oriented feature and we could probably ask -general people for their opinion. If somebody really need it, then he should be adviced to use configuration name, else we don't guarantee that somebody could change default_text_search_config variable and this could lead to incorrect dump/restore. I don't think we should remove default_text_search_config because of this rare case. I still feel the default_text_search_config has to be removed. We have tried all sorts of ways to make it work but having it not be 100% reliable for pg_dump/restore means it might as well be in /contrib and unsupported. If we have it in core, it has to work 100%. We can't have tons of examples that don't specify the configuration name and then expect every create expression index and WHERE clause to use it. default_text_search_config _can_ work, but it seems so easy to break and so easy to get wrong that I think it must be removed. I'd better say we don't support text searching using expression index than remove default_text_search_config. Anyway, I don't feel myself responisble for such important problem. We need more feedback from users. If we are going to keep it, I need someone to explain why my comments above are wrong. If I am right, someone has to remove default_text_search_config from the patch. I can do the documentation. I'm in conference and then will be busy writing my applications and earning money, Teodor is in vacation. I don't want to do hasty conclusion, since we're very tired to change our patch from one solution to another. We need consensus of developers and users. I'm almost exhausted and have no time to continue this discussion. Would you be so kind to write separate post about this problem and call -hackers and -general for feedback. Let's experienced users show their needs. We said everything and has nothing to add. Regards, Oleg _ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83 ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] default_text_search_config and expression indexes
Oleg Bartunov wrote: On Tue, 31 Jul 2007, Bruce Momjian wrote: And if we have to require the configuration name in CREATE INDEX, it has to be used in WHERE, so we might as well just remove the default capability and always require the configuration name. this is very rare use case for text searching 1. expression index without configuration name 2. default_text_search_config can be changed by somebody If you are going to be using the configuration name with the create expression index, you have to use it in the WHERE clause (or the index doesn't work), and I assume that is 90% of the text search uses. I don't see it as rare at all. What is a basis of your assumption ? In my opinion, it's very limited use of text search, because it doesn't supports ranking. For 4-5 years of tsearch2 usage I never used it and I never seem in mailing lists. This is very user-oriented feature and we could probably ask -general people for their opinion. I doubt 'general' is going to understand the details of merging this into the backend. I assume we have enough people on hackers to decide this. Are you saying the majority of users have a separate column with a trigger? Does the trigger specify the configuation? I don't see that as a parameter argument to tsvector_update_trigger(). If you reload a pg_dump, what does it use for the configuration? Why is a separate column better than the index? Just ranking? The reason the expression index is nice is this feature has to be easy to use for people who are new to full text and even PostgreSQL. Right now /contrib is fine for experts to use, but we want a larger user base for this feature. If somebody really need it, then he should be adviced to use configuration name, else we don't guarantee that somebody could change default_text_search_config variable and this could lead to incorrect dump/restore. I don't think we should remove default_text_search_config because of this rare case. I still feel the default_text_search_config has to be removed. We have tried all sorts of ways to make it work but having it not be 100% reliable for pg_dump/restore means it might as well be in /contrib and unsupported. If we have it in core, it has to work 100%. We can't have tons of examples that don't specify the configuration name and then expect every create expression index and WHERE clause to use it. default_text_search_config _can_ work, but it seems so easy to break and so easy to get wrong that I think it must be removed. I'd better say we don't support text searching using expression index than remove default_text_search_config. Anyway, I don't feel myself responisble for such important problem. We need more feedback from users. Well, I am waiting for other hackers to get involved, but if they don't, I have to evaluate it myself on the email lists. If we are going to keep it, I need someone to explain why my comments above are wrong. If I am right, someone has to remove default_text_search_config from the patch. I can do the documentation. I'm in conference and then will be busy writing my applications and earning money, Teodor is in vacation. I don't want to do hasty conclusion, since we're very tired to change our patch from one solution to another. We need consensus of developers and users. I'm almost exhausted and have no time to continue this discussion. Would you be so kind to write separate post about this problem and call -hackers and -general for feedback. Let's experienced users show their needs. We said everything and has nothing to add. If you have no time to continue discussion and perhaps update the patch, we can consider this patch dead for 8.3 and we can hold it for 8.4 because I can guarantee you this is going to need more discussion and patch modification before it gets into CVS. This patch is being treated fairly and exactly the same as every other patch. Should we hold the patch for 8.4? -- Bruce Momjian [EMAIL PROTECTED] http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] default_text_search_config and expression indexes
On Tue, 31 Jul 2007, Bruce Momjian wrote: Oleg Bartunov wrote: On Tue, 31 Jul 2007, Bruce Momjian wrote: And if we have to require the configuration name in CREATE INDEX, it has to be used in WHERE, so we might as well just remove the default capability and always require the configuration name. this is very rare use case for text searching 1. expression index without configuration name 2. default_text_search_config can be changed by somebody If you are going to be using the configuration name with the create expression index, you have to use it in the WHERE clause (or the index doesn't work), and I assume that is 90% of the text search uses. I don't see it as rare at all. What is a basis of your assumption ? In my opinion, it's very limited use of text search, because it doesn't supports ranking. For 4-5 years of tsearch2 usage I never used it and I never seem in mailing lists. This is very user-oriented feature and we could probably ask -general people for their opinion. I doubt 'general' is going to understand the details of merging this into the backend. I assume we have enough people on hackers to decide this. I mean not technical details, but use case. Does they need expressional index without ranking but sacrifice ability to use default configuration in other cases too ? My prediction is that people doesn't ever thought about this possibility until we said them about. Are you saying the majority of users have a separate column with a trigger? Does the trigger specify the configuation? I don't see that as a parameter argument to tsvector_update_trigger(). If you reload a pg_dump, what does it use for the configuration? yes, separate column with custom trigger works fine. It's up to you how to keep your data actual and it's up to you how to write trigger. Our tsvector_update_trigger() is a tsvector_update_trigger_example() ! Why is a separate column better than the index? Just ranking? ranking + composite documents. I already mentioned, that this could be rather expensive. Also, having separate column allow people various ways to say what is a document and even change it. The reason the expression index is nice is this feature has to be easy to use for people who are new to full text and even PostgreSQL. Right now /contrib is fine for experts to use, but we want a larger user base for this feature. I agree here. This was one of the main reason of our work for 8.3. Probably, we shold think in another direction - not to curtail tsearch2 and confuse rather big existing users, but to add an ability to save somehow configuration used for creating of *document* either implicitly (in expression index, or just gin(text_column)), or explicitly (separate column). There is no problem with index itself ! I'd better say we don't support text searching using expression index than remove default_text_search_config. Anyway, I don't feel myself responisble for such important problem. We need more feedback from users. Well, I am waiting for other hackers to get involved, but if they don't, I have to evaluate it myself on the email lists. If we are going to keep it, I need someone to explain why my comments above are wrong. If I am right, someone has to remove default_text_search_config from the patch. I can do the documentation. I'm in conference and then will be busy writing my applications and earning money, Teodor is in vacation. I don't want to do hasty conclusion, since we're very tired to change our patch from one solution to another. We need consensus of developers and users. I'm almost exhausted and have no time to continue this discussion. Would you be so kind to write separate post about this problem and call -hackers and -general for feedback. Let's experienced users show their needs. We said everything and has nothing to add. If you have no time to continue discussion and perhaps update the patch, we can consider this patch dead for 8.3 and we can hold it for 8.4 because I can guarantee you this is going to need more discussion and patch modification before it gets into CVS. This patch is being treated fairly and exactly the same as every other patch. why do you say this ? I didn't complain about this. Should we hold the patch for 8.4? If we're not agree to say in docs, that implicit usage of text search configuration in CREATE INDEX command doesn't supported. Could we leave default_text_search_config for super-users, at least ? Anyway, let's wait what other people say. Regards, Oleg _ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83 ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] default_text_search_config and expression indexes
Oleg Bartunov wrote: OK, here is what I am thinking. If we make default_text_search_config super-user-only, then the user can't do SET (using zero_damaged_pages as a superuser-only example): test= set zero_damaged_pages = on; ERROR: permission denied to set parameter zero_damaged_pages test= alter user guest set zero_damaged_pages = on; ERROR: permission denied to set parameter zero_damaged_pages but the super-user can set it in postgresql.conf, or: test=# alter user guest set zero_damaged_pages = on; ALTER ROLE or test=# alter database vendor3 set zero_damaged_pages = on; ALTER ROLE meaning while it will be super-user-only, the administrator can set the default for specific databases and users. Is that the best approach? A user can still over-ride the default by specifying the configuration in the function call. This is ok, but it will not work in hosting environment and still doesn't prevent errors. Agreed. super-user-only now seems strange to me because it isn't a security issue, but rather an attempt to avoid people causing errors. The fundamental issue is that if you do a query using tsvector and tsquery everything will work find because default_text_search_config will be the same for both queries. The problem is if do an expression index lookup that doesn't specify the configuration name and your default_text_search_config doesn't match the index, or you INSERT or UPDATE into an expression index with a mismatched default_text_search_config. If we do make default_text_search_config super-user-only it prevents a database owner from doing ALTER DATABASE db1 SET default_text_search_config = 'english', which seems like a pretty big limitation because I think per-database default_text_search_config makes the most sense. And, again, if you specify the configuration in the expression index you have to specify it in the WHERE clause and then default_text_search_config is pretty useless. If we required the configuration to always be specified, you could still store multiple configurations in the same column by having a secondary column hold the configuration name: CREATE INDEX i on x USING gist (to_tsvector(config_col, body)); -- Bruce Momjian [EMAIL PROTECTED] http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] default_text_search_config and expression indexes
Bruce Momjian wrote: We have to decide if we want a GUC default_text_search_config, and if so when can it be changed. Right now there are three ways to create a tsvector (or tsquery) ::tsvector to_tsvector(value) to_tsvector(config, value) (ignoring plainto_tsvector) Only the last one specifies the configuration. The others use the configuration specified by default_text_search_config. (We had an previous discussion on what the default value of default_text_search_config should be, and it was decided it should be set via initdb based on a flag or the locale.) Now, because most people use a single configuration, they can just set default_text_search_config and there is no need to specify the configuration name. However, expression indexes cause a problem here: http://momjian.us/expire/fulltext/HTML/textsearch-tables.html#TEXTSEARCH-TABLES-INDEX We recommend that users create an expression index on the column they want to do a full text search on, e.g. CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector(body)); However, the big problem is that the expressions used in expression indexes should not change their output based on the value of a GUC variable (because it would corrupt the index), but in the case above, default_text_search_config controls what configuration is used, and hence the output of to_tsvector is changed if default_text_search_config changes. We have a few possible options: 1) Document the problem and do nothing else. 2) Make default_text_search_config a postgresql.conf-only setting, thereby making it impossible to change by non-super users, or make it a super-user-only setting. 3) Remove default_text_search_config and require the configuration to be specified in each function call. If we remove default_text_search_config, it would also make ::tsvector casting useless as well. OK, I just found a case that I think is going to make #3 a requirement (remove default_text_search_config). How is a CREATE INDEX ... to_tsvector(col) going to restore from a pg_dump? I see no way of guaranteeing that the default_text_search_config is correct on the restore, and in fact I don't think we have any way of knowing the default_text_search_config used for the index. And if we have to require the configuration name in CREATE INDEX, it has to be used in WHERE, so we might as well just remove the default capability and always require the configuration name. -- Bruce Momjian [EMAIL PROTECTED] http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] default_text_search_config and expression indexes
Bruce Momjian wrote: Bruce Momjian wrote: 3) Remove default_text_search_config and require the configuration to be specified in each function call. If we remove default_text_search_config, it would also make ::tsvector casting useless as well. OK, I just found a case that I think is going to make #3 a requirement (remove default_text_search_config). How is a CREATE INDEX ... to_tsvector(col) going to restore from a pg_dump? I see no way of guaranteeing that the default_text_search_config is correct on the restore, and in fact I don't think we have any way of knowing the default_text_search_config used for the index. Make pg_dump emit only CREATE INDEX sentences with two-param format. In fact I think it would make sense to convert internally the one-param format to two-param, before hitting the catalogs. This would also solve your problem about usability of WHERE clauses, if you rewrite the one-param calls to two-params before the optimizer kicks in. -- Alvaro Herrera http://www.amazon.com/gp/registry/DXLWNGRJD34J Nadie esta tan esclavizado como el que se cree libre no siendolo (Goethe) ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] default_text_search_config and expression indexes
Alvaro Herrera wrote: Bruce Momjian wrote: Bruce Momjian wrote: 3) Remove default_text_search_config and require the configuration to be specified in each function call. If we remove default_text_search_config, it would also make ::tsvector casting useless as well. OK, I just found a case that I think is going to make #3 a requirement (remove default_text_search_config). How is a CREATE INDEX ... to_tsvector(col) going to restore from a pg_dump? I see no way of guaranteeing that the default_text_search_config is correct on the restore, and in fact I don't think we have any way of knowing the default_text_search_config used for the index. Make pg_dump emit only CREATE INDEX sentences with two-param format. In fact I think it would make sense to convert internally the one-param format to two-param, before hitting the catalogs. This would also solve your problem about usability of WHERE clauses, if you rewrite the one-param calls to two-params before the optimizer kicks in. Yes, that could be done but as far as I know we weren't planning to have those areas of our backend be aware of configuration parameters to text search functions, and I doubt we want to do that for 8.3, if ever. There are many tsearch functions and doing this would spill tsearch function checks all over the backend, reducing our modularity. The default capability just isn't worth it, and in addition is error-prone. -- Bruce Momjian [EMAIL PROTECTED] http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] default_text_search_config and expression indexes
Bruce Momjian wrote: Magnus Hagander wrote: However, the big problem is that the expressions used in expression indexes should not change their output based on the value of a GUC variable (because it would corrupt the index), but in the case above, default_text_search_config controls what configuration is used, and hence the output of to_tsvector is changed if default_text_search_config changes. It wuoldn't actually *corrupt* the index, right? You could end up with wrong results, which might be regarded as corruption in one way, but as long as you change the value back the index still works, no? Right, it would _temporarily_ corrupt it. ;-) We have a few possible options: 1) Document the problem and do nothing else. 2) Make default_text_search_config a postgresql.conf-only setting, thereby making it impossible to change by non-super users, or make it a super-user-only setting. 3) Remove default_text_search_config and require the configuration to be specified in each function call. If we remove default_text_search_config, it would also make ::tsvector casting useless as well. I think 3 is a really bad solution. 2 is a half-bad solution. Do we have a way to say that it can be set at database-level for example, but not at user session? Making it superuser-only to change it but not postgresql.conf-only could accomplish that, along with warnings in the docs for the super user about the effects on current indexes by changing it. OK, here is what I am thinking. If we make default_text_search_config super-user-only, then the user can't do SET (using zero_damaged_pages as a superuser-only example): test= set zero_damaged_pages = on; ERROR: permission denied to set parameter zero_damaged_pages test= alter user guest set zero_damaged_pages = on; ERROR: permission denied to set parameter zero_damaged_pages but the super-user can set it in postgresql.conf, or: test=# alter user guest set zero_damaged_pages = on; ALTER ROLE or test=# alter database vendor3 set zero_damaged_pages = on; ALTER ROLE meaning while it will be super-user-only, the administrator can set the default for specific databases and users. Is that the best approach? That's exactly the one I was trying to suggest. And then with clear warnings in the docs around the place that if you set it to different values accessing the same index (for different users, for example) weird things may happen. But I see the ALTER DATABASE a fairly common scenario (I know I would use it), and from what I can tell thereis no risk at all with that. //Magnus ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] default_text_search_config and expression indexes
On Fri, 27 Jul 2007, Bruce Momjian wrote: Magnus Hagander wrote: However, the big problem is that the expressions used in expression indexes should not change their output based on the value of a GUC variable (because it would corrupt the index), but in the case above, default_text_search_config controls what configuration is used, and hence the output of to_tsvector is changed if default_text_search_config changes. It wuoldn't actually *corrupt* the index, right? You could end up with wrong results, which might be regarded as corruption in one way, but as long as you change the value back the index still works, no? Right, it would _temporarily_ corrupt it. ;-) We have a few possible options: 1) Document the problem and do nothing else. 2) Make default_text_search_config a postgresql.conf-only setting, thereby making it impossible to change by non-super users, or make it a super-user-only setting. 3) Remove default_text_search_config and require the configuration to be specified in each function call. If we remove default_text_search_config, it would also make ::tsvector casting useless as well. I think 3 is a really bad solution. 2 is a half-bad solution. Do we have a way to say that it can be set at database-level for example, but not at user session? Making it superuser-only to change it but not postgresql.conf-only could accomplish that, along with warnings in the docs for the super user about the effects on current indexes by changing it. OK, here is what I am thinking. If we make default_text_search_config super-user-only, then the user can't do SET (using zero_damaged_pages as a superuser-only example): test= set zero_damaged_pages = on; ERROR: permission denied to set parameter zero_damaged_pages test= alter user guest set zero_damaged_pages = on; ERROR: permission denied to set parameter zero_damaged_pages but the super-user can set it in postgresql.conf, or: test=# alter user guest set zero_damaged_pages = on; ALTER ROLE or test=# alter database vendor3 set zero_damaged_pages = on; ALTER ROLE meaning while it will be super-user-only, the administrator can set the default for specific databases and users. Is that the best approach? A user can still over-ride the default by specifying the configuration in the function call. This is ok, but it will not work in hosting environment and still doesn't prevent errors. Regards, Oleg _ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83 ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] default_text_search_config and expression indexes
1) Document the problem and do nothing else. 2) Make default_text_search_config a postgresql.conf-only setting, thereby making it impossible to change by non-super users, or make it a super-user-only setting. 3) Remove default_text_search_config and require the configuration to be specified in each function call. Hello, 2+. Regards Pavel Stehule ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] default_text_search_config and expression indexes
On Fri, 27 Jul 2007, Pavel Stehule wrote: 1) Document the problem and do nothing else. 2) Make default_text_search_config a postgresql.conf-only setting, thereby making it impossible to change by non-super users, or make it a super-user-only setting. 3) Remove default_text_search_config and require the configuration to be specified in each function call. Hello, 2+. One of the most important purpose of integrating tsearch2 was to facilitate full-text search for people in hosting environment. Usually, they have no superuser rights. I'm asking don't forget about them ! There is no problem with current behaviour once user understand what he do. Regards, Oleg _ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83 ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] default_text_search_config and expression indexes
On Fri, 27 Jul 2007, Pavel Stehule wrote: 2007/7/27, Oleg Bartunov [EMAIL PROTECTED]: On Fri, 27 Jul 2007, Pavel Stehule wrote: 1) Document the problem and do nothing else. 2) Make default_text_search_config a postgresql.conf-only setting, thereby making it impossible to change by non-super users, or make it a super-user-only setting. 3) Remove default_text_search_config and require the configuration to be specified in each function call. Hello, 2+. One of the most important purpose of integrating tsearch2 was to facilitate full-text search for people in hosting environment. Usually, they have no superuser rights. I'm asking don't forget about them ! There is no problem with current behaviour once user understand what he do. I am not sure if postgresql is well for multilangual hosting environment. There is problem with locales. Without COLLATE support postgresql can't be used in similar environment. :( configuration has NOTHING with language ! This is a most frequent myth about configuration. It's just the way we chose for default_text_search_config to use language part of locale at initdb time. text search configuration is just a bind between parser to use for breaking document by lexems and mapping between lexeme type and dictionaries. nice a day Pavel Stehule Regards, Oleg _ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83 ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] default_text_search_config and expression indexes
2007/7/27, Oleg Bartunov [EMAIL PROTECTED]: On Fri, 27 Jul 2007, Pavel Stehule wrote: 1) Document the problem and do nothing else. 2) Make default_text_search_config a postgresql.conf-only setting, thereby making it impossible to change by non-super users, or make it a super-user-only setting. 3) Remove default_text_search_config and require the configuration to be specified in each function call. Hello, 2+. One of the most important purpose of integrating tsearch2 was to facilitate full-text search for people in hosting environment. Usually, they have no superuser rights. I'm asking don't forget about them ! There is no problem with current behaviour once user understand what he do. I am not sure if postgresql is well for multilangual hosting environment. There is problem with locales. Without COLLATE support postgresql can't be used in similar environment. :( nice a day Pavel Stehule ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] default_text_search_config and expression indexes
configuration has NOTHING with language ! This is a most frequent myth about configuration. It's just the way we chose for default_text_search_config to use language part of locale at initdb time. text search configuration is just a bind between parser to use for breaking document by lexems and mapping between lexeme type and dictionaries. I spoke about impossibility well configuration of postgresql without administrator's rights. For my czech environment is administrator's rights necessary too, because czech dictionary aren't in default installation. Regards Pavel Stehule ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] default_text_search_config and expression indexes
On Thu, Jul 26, 2007 at 06:23:51PM -0400, Bruce Momjian wrote: Oleg Bartunov wrote: Second, I can't figure out how to reference a non-default configuration. See the multi-argument versions of to_tsvector etc. I do see a problem with having to_tsvector(config, text) plus to_tsvector(text) where the latter implicitly references a config selected by a GUC variable: how can you tell whether a query using the latter matches a particular index using the former? There isn't anything in the current planner mechanisms that would make that work. Probably, having default text search configuration is not a good idea and we could just require it as a mandatory parameter, which could eliminate many confusion with selecting text search configuration. We have to decide if we want a GUC default_text_search_config, and if so when can it be changed. Right now there are three ways to create a tsvector (or tsquery) ::tsvector to_tsvector(value) to_tsvector(config, value) (ignoring plainto_tsvector) Only the last one specifies the configuration. The others use the configuration specified by default_text_search_config. (We had an previous discussion on what the default value of default_text_search_config should be, and it was decided it should be set via initdb based on a flag or the locale.) Now, because most people use a single configuration, they can just set default_text_search_config and there is no need to specify the configuration name. However, expression indexes cause a problem here: http://momjian.us/expire/fulltext/HTML/textsearch-tables.html#TEXTSEARCH-TABLES-INDEX We recommend that users create an expression index on the column they want to do a full text search on, e.g. CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector(body)); However, the big problem is that the expressions used in expression indexes should not change their output based on the value of a GUC variable (because it would corrupt the index), but in the case above, default_text_search_config controls what configuration is used, and hence the output of to_tsvector is changed if default_text_search_config changes. It wuoldn't actually *corrupt* the index, right? You could end up with wrong results, which might be regarded as corruption in one way, but as long as you change the value back the index still works, no? We have a few possible options: 1) Document the problem and do nothing else. 2) Make default_text_search_config a postgresql.conf-only setting, thereby making it impossible to change by non-super users, or make it a super-user-only setting. 3) Remove default_text_search_config and require the configuration to be specified in each function call. If we remove default_text_search_config, it would also make ::tsvector casting useless as well. I think 3 is a really bad solution. 2 is a half-bad solution. Do we have a way to say that it can be set at database-level for example, but not at user session? Making it superuser-only to change it but not postgresql.conf-only could accomplish that, along with warnings in the docs for the super user about the effects on current indexes by changing it. //Magnus ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] default_text_search_config and expression indexes
Magnus Hagander wrote: However, the big problem is that the expressions used in expression indexes should not change their output based on the value of a GUC variable (because it would corrupt the index), but in the case above, default_text_search_config controls what configuration is used, and hence the output of to_tsvector is changed if default_text_search_config changes. It wuoldn't actually *corrupt* the index, right? You could end up with wrong results, which might be regarded as corruption in one way, but as long as you change the value back the index still works, no? Right, it would _temporarily_ corrupt it. ;-) We have a few possible options: 1) Document the problem and do nothing else. 2) Make default_text_search_config a postgresql.conf-only setting, thereby making it impossible to change by non-super users, or make it a super-user-only setting. 3) Remove default_text_search_config and require the configuration to be specified in each function call. If we remove default_text_search_config, it would also make ::tsvector casting useless as well. I think 3 is a really bad solution. 2 is a half-bad solution. Do we have a way to say that it can be set at database-level for example, but not at user session? Making it superuser-only to change it but not postgresql.conf-only could accomplish that, along with warnings in the docs for the super user about the effects on current indexes by changing it. OK, here is what I am thinking. If we make default_text_search_config super-user-only, then the user can't do SET (using zero_damaged_pages as a superuser-only example): test= set zero_damaged_pages = on; ERROR: permission denied to set parameter zero_damaged_pages test= alter user guest set zero_damaged_pages = on; ERROR: permission denied to set parameter zero_damaged_pages but the super-user can set it in postgresql.conf, or: test=# alter user guest set zero_damaged_pages = on; ALTER ROLE or test=# alter database vendor3 set zero_damaged_pages = on; ALTER ROLE meaning while it will be super-user-only, the administrator can set the default for specific databases and users. Is that the best approach? A user can still over-ride the default by specifying the configuration in the function call. -- Bruce Momjian [EMAIL PROTECTED] http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
[HACKERS] default_text_search_config and expression indexes
Oleg Bartunov wrote: Second, I can't figure out how to reference a non-default configuration. See the multi-argument versions of to_tsvector etc. I do see a problem with having to_tsvector(config, text) plus to_tsvector(text) where the latter implicitly references a config selected by a GUC variable: how can you tell whether a query using the latter matches a particular index using the former? There isn't anything in the current planner mechanisms that would make that work. Probably, having default text search configuration is not a good idea and we could just require it as a mandatory parameter, which could eliminate many confusion with selecting text search configuration. We have to decide if we want a GUC default_text_search_config, and if so when can it be changed. Right now there are three ways to create a tsvector (or tsquery) ::tsvector to_tsvector(value) to_tsvector(config, value) (ignoring plainto_tsvector) Only the last one specifies the configuration. The others use the configuration specified by default_text_search_config. (We had an previous discussion on what the default value of default_text_search_config should be, and it was decided it should be set via initdb based on a flag or the locale.) Now, because most people use a single configuration, they can just set default_text_search_config and there is no need to specify the configuration name. However, expression indexes cause a problem here: http://momjian.us/expire/fulltext/HTML/textsearch-tables.html#TEXTSEARCH-TABLES-INDEX We recommend that users create an expression index on the column they want to do a full text search on, e.g. CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector(body)); However, the big problem is that the expressions used in expression indexes should not change their output based on the value of a GUC variable (because it would corrupt the index), but in the case above, default_text_search_config controls what configuration is used, and hence the output of to_tsvector is changed if default_text_search_config changes. We have a few possible options: 1) Document the problem and do nothing else. 2) Make default_text_search_config a postgresql.conf-only setting, thereby making it impossible to change by non-super users, or make it a super-user-only setting. 3) Remove default_text_search_config and require the configuration to be specified in each function call. If we remove default_text_search_config, it would also make ::tsvector casting useless as well. -- Bruce Momjian [EMAIL PROTECTED] http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq