Re: [HACKERS] tsearch_core patch: permissions and security issues
On Fri, 2007-06-15 at 10:36 -0400, Tom Lane wrote: > "Simon Riggs" <[EMAIL PROTECTED]> writes: > > Although I'm happy to see tsearch finally hit the big time, I'm a bit > > disappointed to see so many new datatype-specific SQL commands created. > > Per subsequent discussion we are down to just one new set of commands, > CREATE/ALTER/DROP TEXT SEARCH CONFIGURATION, so it's not as big a > footprint as it was to start with. Thats a lot better, thanks. I'm sure that will work better in PgAdmin and many other places too. > I have been thinking that it would be smart to try to use the generic > "definition list" syntax, like CREATE OPERATOR and CREATE AGGREGATE. > But the motivation for that is just to avoid defining more keywords > (which has an overall impact on parser size and performance). It's > not really going to do anything for us in terms of having an > implementation that can be shared with anything else. It's OK; ALTER RFID TAG NOMENCLATURE has a nice ring to it. :-) -- Simon Riggs EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] tsearch_core patch: permissions and security issues
On Friday 15 June 2007 00:46, Oleg Bartunov wrote: > On Thu, 14 Jun 2007, Tom Lane wrote: > > [ thinks some more... ] If we revived the GENERATED AS patch, > > you could imagine computing tsvector columns via "GENERATED AS > > to_tsvector('english'::regconfig, big_text_col)" instead of a > > trigger, and then again you've got the dependency exposed where > > the system can see it. I don't wanna try to do that for 8.3, > > but it might be a good path to pursue in future, instead of assuming > > that triggers will be the way forevermore. > > > > Thoughts? > > No way with standard. GENERATED AS says that "all columns references in an > expression associated with a generated column must be to columns of the > base table containing that generated column." > > tsvector could be result of rather complex select involving several tables. > Is there some reason for this restriction in the standard? I might be in favor of "extending" the standard to allow this case if not. -- Robert Treat Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] tsearch_core patch: permissions and security issues
On Thursday 14 June 2007 15:10, Teodor Sigaev wrote: > That changes are doable for several days. I'd like to make changes together > with replacing of FULLTEXT keyword to TEXT SEARCH as you suggested. AIUI the discussion on this change took place off list? Can we get a preview of what the commands will look like and maybe a summary of the discussion? It may sound a bit pedantic, but the concept/wording of "full text searching" is pretty well understood by the database community, so switching to just TEXT SEARCH sounds like we're adding ambiguity for reasons that escape me -- Robert Treat Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] tsearch_core patch: permissions and security issues
"Simon Riggs" <[EMAIL PROTECTED]> writes: > Although I'm happy to see tsearch finally hit the big time, I'm a bit > disappointed to see so many new datatype-specific SQL commands created. Per subsequent discussion we are down to just one new set of commands, CREATE/ALTER/DROP TEXT SEARCH CONFIGURATION, so it's not as big a footprint as it was to start with. > Can we consider CREATE TYPE CONFIGURATION with subsets such as... > CREATE TYPE CONFIGURATION name USING FULLTEXT (map) > CREATE TYPE CONFIGURATION name USING FULLTEXT (dictionary) > CREATE TYPE CONFIGURATION name USING FULLTEXT (parser) This seems entirely cosmetic, unless you have some proposal for allowing a uniform underlying implementation not only syntax. In the absence of some concrete cases to consider, I don't see how we could imagine that we know how to implement a useful generic approach. I have been thinking that it would be smart to try to use the generic "definition list" syntax, like CREATE OPERATOR and CREATE AGGREGATE. But the motivation for that is just to avoid defining more keywords (which has an overall impact on parser size and performance). It's not really going to do anything for us in terms of having an implementation that can be shared with anything else. regards, tom lane ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] tsearch_core patch: permissions and security issues
On Wed, 2007-06-13 at 18:06 -0400, Bruce Momjian wrote: > You bring up a very good point. There are fifteen new commands being > added for full text indexing: > > alter-fulltext-config.sgml alter-fulltext-owner.sgml > create-fulltext-dict.sgml drop-fulltext-dict.sgml > alter-fulltext-dict.sgmlalter-fulltext-parser.sgml > create-fulltext-map.sgmldrop-fulltext-map.sgml > alter-fulltext-dictset.sgml comment-fulltext.sgml > create-fulltext-parser.sgml drop-fulltext-parser.sgml > alter-fulltext-map.sgml create-fulltext-config.sgml > drop-fulltext-config.sgml Although I'm happy to see tsearch finally hit the big time, I'm a bit disappointed to see so many new datatype-specific SQL commands created. That sets a bad precedent for other datatype authors, as well as all those people that want to invent new things like SKYLINE etc.. Whatever the reasons for the new commands, those reasons must also be potentially shared by authors of other datatypes too. Is there a way to genericise these commands so that we can offer those same benefits to all datatypes, rather than setting a double standard? Or do we think that when a Geodetic datatype tries to come into core we would have commands like CREATE COORDINATE TRANSFORM? Can we consider CREATE TYPE CONFIGURATION with subsets such as... CREATE TYPE CONFIGURATION name USING FULLTEXT (map) CREATE TYPE CONFIGURATION name USING FULLTEXT (dictionary) CREATE TYPE CONFIGURATION name USING FULLTEXT (parser) Your choice of syntax may vary, but my point is about creating a mechanism by which any datatype author can reference complex configuration details. We managed to do this for INDEXes and OPERATORs, so it seems a shame to go for the full 15 new commands when we could do the same thing with much fewer commands and ones that could then be utilised by others, e.g. PostGIS. Last minute change true, but only parser+docs changes suggested. I want the full tsearch2 functionality as much as anyone. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] tsearch_core patch: permissions and security issues
On Thu, 14 Jun 2007, Tom Lane wrote: Oleg Bartunov <[EMAIL PROTECTED]> writes: You're correct. But we can't defend users from all possible errors. Other side, that we need somehow to help user to identify what fts configuration was used to produce tsvector. For example, comment on tsvector column would be useful, but we don't know how to do this automatically. Yeah, I was wondering about that too. The only way we could relax the superuser, you-better-know-what-you're-doing restriction on changing configurations would be if we had a way to identify which tsvector columns needed to be updated. Right now that's pretty hard to find out because the references to configurations are buried in the bodies of trigger functions. That whole trigger-function business is not the nicest part of tsearch2, either ... it'd be better if we could automate tsvector maintenance more. yes, trigger function is a complex stuff, our tsearch() trigger is an example of automated stuff. It could be written very easy on plpgsql, for example. =# create function my_update() returns trigger as $$ BEGIN NEW.fts= setweight( to_tsvector('english',NEW.t1),'A') || ' ' || setweight( to_tsvector('english',NEW.t2),'B'); RETURN NEW; END; $$ language plpgsql; One thing I was thinking about is that rather than storing a physical tsvector column, people might index a "virtual" column using functional indexes: create index ... (to_tsvector('english', big_text_col)) which could be queried select ... where to_tsvector('english', big_text_col) @@ tsquery this is already possible for gin index create index gin_text_idx on test using gin ( ( coalesce(to_tsvector(title),'') || coalesce(to_tsvector(body),'') ) ); apod=# select title from test where (coalesce(to_tsvector(title),'') || coalesce(to_tsvector(body),'') ) @@ to_tsquery('supernovae') order by sdate desc limit 10; Assuming that the index is lossy, the index condition would have to be rechecked, so to_tsvector() would have to be recomputed, but only at the rows identified as candidate matches by the index. The I/O savings from eliminating the heap's tsvector column might counterbalance the extra CPU for recomputing tsvectors. Or not, but in any case this is attractive because it doesn't need any handmade maintenance support like a trigger --- the regular index maintenance code does it all. I'm afraid it wouldn't work for all cases. We already have headline() function which had to reparse document to produce text snippet and it's very slow and eats most select time. ALso, trigger stuff is a normal machinery for databases. It strikes me that we could play the same kind of game we played to make nextval() references to sequences be recognized as dependencies on sequences. Invent a "regconfig" OID type that's just like regclass except it handles OIDs of ts_config entries instead of pg_class entries, and make the first argument of to_tsvector be one of those: create index ... (to_tsvector('english'::regconfig, big_text_col)) Now dependency.c can be taught to recognize the regconfig Const as depending on the referenced ts_config entry, and voila we have a pg_depend entry showing that the index depends on the configuration. What we actually do about it is another question, but this at least gets the knowledge into the system. interesting. And \di could display all configuration stuff for text search indexes ? [ thinks some more... ] If we revived the GENERATED AS patch, you could imagine computing tsvector columns via "GENERATED AS to_tsvector('english'::regconfig, big_text_col)" instead of a trigger, and then again you've got the dependency exposed where the system can see it. I don't wanna try to do that for 8.3, but it might be a good path to pursue in future, instead of assuming that triggers will be the way forevermore. Thoughts? No way with standard. GENERATED AS says that "all columns references in an expression associated with a generated column must be to columns of the base table containing that generated column." tsvector could be result of rather complex select involving several tables. Regards, Oleg _ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83 ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] tsearch_core patch: permissions and security issues
It should be. Instances of ispell (and synonym, thesaurus) dictionaries are different only in dict_initoption part, so it will be only one entry in pg_ts_dict_template and several ones in pg_ts_dict. No, I was thinking of still having just one pg_ts_dict catalog (no template) but removing its dictinit field. Instead, the init strings would be stored with configuration mapping entries. This would mean having to remember to provide the right option along with the dictionary name when doing ALTER CONFIGURATION ADD MAPPING. Not sure if that would be harder or easier to use than what you're thinking of. Hmm. Dictionary may present in several lists of dictionaries in one configuration. Suppose, it isn't practical to store dictinitoption several times. In other hand, the same dictionary (template) with different init option may present on configuration too. Typical example is configuration for russian language: lword, lpword tokens have dictionary's list {ispell_en, stem_en} nlword, nlpword tokens have dictionary's list {ispell_ru, stem_ru} stem_(ru|en) is a Snowball's stemmer, but ispell_(ru|en) is a ispell dictionary (template) with different dictinitoption. Next, configurations may share dictionaries. And, init option may be rather big. ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] tsearch_core patch: permissions and security issues
Michael Paesold <[EMAIL PROTECTED]> writes: > After reading the discussion and the introduction, here is what I think > tsearch in core should at least accomplish in 8.3. > ... > - Stop words in tables, not in external files. I realized that there's a pretty serious problem with doing that, which is encoding. We don't have any way to deal with preloaded catalog data that exceeds 7-bit-ASCII, because when you do CREATE DATABASE ... ENCODING it's going to be copied over exactly as-is. And there's plenty of not-ASCII stuff in the non-English stopword files. This is something we need to solve eventually, but I think it ties into the whole multiple locale can-of-worms; there's no way we're getting it done for 8.3. So I'm afraid we have to settle for stop words in external files for the moment. I do have two suggestions though: * Let's have just one stopword file for each language, with the convention that the file is stored in UTF8 no matter what language you're talking about. We can have the stopword reading code convert to the database encoding on-the-fly when it reads the file. Without this there's just a whole bunch of foot-guns there. We'd at least need to have encoding verification checks when reading the files, which seems hardly cheaper than just translating the data. * Let's fix it so the reference to the stoplist in the user-visible options is just a name, with no path or anything like that. (Similar to the handling of timezone_abbreviations.) Then it will be feasible to re-interpret the option as a reference to a named list in a catalog someday, when we solve the encoding problem. Right now the patch has things like + DATA(insert OID = 5140 ( "ru_stem_koi8" PGNSP PGUID 5135 5137 "dicts_data/russian.stop.koi8")); which is really binding the option pretty tightly to being a filename; not to mention the large security risks involved in letting anyone but a superuser have control of such an option. > What I don't really like is the number of commands introduced without > any strong reference to full text search. E.g. CREATE CONFIGURATION > gives no hint at all that this is about full text search. Yeah. We had some off-list discussion about this and concluded that TEXT SEARCH seemed to be the right phrase to use in the command names. That hasn't gotten reflected into the patch yet. regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] tsearch_core patch: permissions and security issues
"Oleg Bartunov" <[EMAIL PROTECTED]> writes: > On Thu, 14 Jun 2007, Gregory Stark wrote: > >> Am I correct to think of this like changing collations leaving your btree >> index "corrupt"? In that case it probably won't cause any backend crash >> either >> but you will get incorrect results. For example, returning different results >> depending on whether the index or a full table scan is used. > > You're correct. But we can't defend users from all possible errors. Sure, but it seems like a the line, at least in existing cases, is that if you fiddle with catalogs directly then you should know what consequences you need to be careful of. But when if you make changes through a supported, documented interface then the system will protect you from breaking things. Hm, I went to construct an example and accidentally found a precedent for not necessarily protecting users from themselves in every case: postgres=# create table x (i integer); CREATE TABLE postgres=# create function f(integer) returns integer as 'select $1' immutable strict language sql; CREATE FUNCTION postgres=# select f(1); f --- 1 (1 row) postgres=# create index xi on x (f(i)); CREATE INDEX postgres=# insert into x values (1); INSERT 0 1 postgres=# insert into x values (2); INSERT 0 1 postgres=# create or replace function f(integer) returns integer as 'select -$1' immutable strict language sql; CREATE FUNCTION Uhm. Oops! And yes, the resulting index is, of course, corrupt: postgres=# insert into x (select random() from generate_series(1,2000)); INSERT 0 2000 postgres=# select count(*) from x where f(i) = -1; count --- 0 (1 row) postgres=# set enable_bitmapscan = off; SET postgres=# set enable_indexscan = off; SET postgres=# select count(*) from x where f(i) = -1; count --- 1003 (1 row) -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] tsearch_core patch: permissions and security issues
Teodor Sigaev <[EMAIL PROTECTED]> writes: > Tom Lane wrote: >> Teodor Sigaev <[EMAIL PROTECTED]> writes: >>> The reason to save SQLish interface to dictionaries is a simplicity of >>> configuration. Snowball's stemmers are useful as is, but ispell dictionary >>> requires some configuration action before using. >> >> Yeah. I had been wondering about moving the dict_initoption over to the >> configuration entry --- is that sane at all? It would mean that > It should be. Instances of ispell (and synonym, thesaurus) dictionaries are > different only in dict_initoption part, so it will be only one entry in > pg_ts_dict_template and several ones in pg_ts_dict. No, I was thinking of still having just one pg_ts_dict catalog (no template) but removing its dictinit field. Instead, the init strings would be stored with configuration mapping entries. This would mean having to remember to provide the right option along with the dictionary name when doing ALTER CONFIGURATION ADD MAPPING. Not sure if that would be harder or easier to use than what you're thinking of. regards, tom lane ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] tsearch_core patch: permissions and security issues
Bruce Momjian wrote: I an attempt to communicate what full text search does, and what features we are thinking of adding/removing, I have put up the introduction in HTML: http://momjian.us/expire/fulltext/HTML/fulltext-intro.html Very good idea, Bruce! After reading the discussion and the introduction, here is what I think tsearch in core should at least accomplish in 8.3. Please bear in mind, that (a) I am talking from a user perspective (there might be technical arguments against my thoughts) and (b) I have no hands-on experience with Tsearch2 yet, so more experienced users might have different needs. - Basic full text search usable for non-superusers - Out-of-the-box working configuration for as many languages as reasonable (Teodor named quite a number of languages working as-is, so this is really an improvement over contrib, great!) - No foot-guns accessible to non-superuser - Agreement on function names, perhaps some should be changed. For instance to_tsquery() and plainto_tsquery() seem rather unintuitive because they don't have a common prefix, and they are not consistent about using underscores. Perhaps to_tsquery() and to_tsquery_plain()? - Future compatibility for all features available to non-superusers - Stop words in tables, not in external files. - At least for superusers, all features available in contrib now, should be available, too (don't know about pg_dump). What I don't really like is the number of commands introduced without any strong reference to full text search. E.g. CREATE CONFIGURATION gives no hint at all that this is about full text search. IMHO there are more configurations than just full text ones. :-) So perhaps better spell this CREATE FULLTEXT CONFIGURATION etc.? (Think about tab completion in psql, for instance.) I guess this is in line with what Tom said about mapping objects and CREATE ATTRIBUTE vs. CREATE/ALTER CONFIGURATION. (http://archives.postgresql.org/pgsql-hackers/2007-06/msg00522.php) After all, I would really welcome having full text search capabilities in core. Best Regards Michael Paesold ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] tsearch_core patch: permissions and security issues
Oleg Bartunov <[EMAIL PROTECTED]> writes: > You're correct. But we can't defend users from all possible errors. > Other side, that we need somehow to help user to identify what fts > configuration was used to produce tsvector. For example, comment on > tsvector column would be useful, but we don't know how to do this > automatically. Yeah, I was wondering about that too. The only way we could relax the superuser, you-better-know-what-you're-doing restriction on changing configurations would be if we had a way to identify which tsvector columns needed to be updated. Right now that's pretty hard to find out because the references to configurations are buried in the bodies of trigger functions. That whole trigger-function business is not the nicest part of tsearch2, either ... it'd be better if we could automate tsvector maintenance more. One thing I was thinking about is that rather than storing a physical tsvector column, people might index a "virtual" column using functional indexes: create index ... (to_tsvector('english', big_text_col)) which could be queried select ... where to_tsvector('english', big_text_col) @@ tsquery Assuming that the index is lossy, the index condition would have to be rechecked, so to_tsvector() would have to be recomputed, but only at the rows identified as candidate matches by the index. The I/O savings from eliminating the heap's tsvector column might counterbalance the extra CPU for recomputing tsvectors. Or not, but in any case this is attractive because it doesn't need any handmade maintenance support like a trigger --- the regular index maintenance code does it all. It strikes me that we could play the same kind of game we played to make nextval() references to sequences be recognized as dependencies on sequences. Invent a "regconfig" OID type that's just like regclass except it handles OIDs of ts_config entries instead of pg_class entries, and make the first argument of to_tsvector be one of those: create index ... (to_tsvector('english'::regconfig, big_text_col)) Now dependency.c can be taught to recognize the regconfig Const as depending on the referenced ts_config entry, and voila we have a pg_depend entry showing that the index depends on the configuration. What we actually do about it is another question, but this at least gets the knowledge into the system. [ thinks some more... ] If we revived the GENERATED AS patch, you could imagine computing tsvector columns via "GENERATED AS to_tsvector('english'::regconfig, big_text_col)" instead of a trigger, and then again you've got the dependency exposed where the system can see it. I don't wanna try to do that for 8.3, but it might be a good path to pursue in future, instead of assuming that triggers will be the way forevermore. Thoughts? regards, tom lane ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] tsearch_core patch: permissions and security issues
Am I correct to think of this like changing collations leaving your btree index "corrupt"? In that case it probably won't cause any backend crash either but you will get incorrect results. For example, returning different results depending on whether the index or a full table scan is used. Without exotic cases, maximum disaster may be that queries with some words will return more or less results than should be. Because of wrong stemming or wrong match of stop-word or wrong mapping. By default, configuration is useful for most users and works for danish, dutch, finnish, french, german, hungarian, italian, norwegian, portuguese, spanish, swedish, russin and english languages. -- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http://www.sigaev.ru/ ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] tsearch_core patch: permissions and security issues
But, there are reasons to allow users register new templates and in fact we know people/projects with application-dependent dictionaries. How they could dump/reload their dictionaries ? The same way as pg_am does. -- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http://www.sigaev.ru/ ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] tsearch_core patch: permissions and security issues
On Thu, 14 Jun 2007, Gregory Stark wrote: "Teodor Sigaev" <[EMAIL PROTECTED]> writes: But they still need some more thought about permissions, because AFAICS mucking with a configuration can invalidate some other user's data. ouch. could mucking with a configuration create a corrupt index? Depending on what you mean 'corrupted'. It will not corrupted as non-readable or cause backend crash. But usage of such tsvector column could be limited - not all words will be searchable. Am I correct to think of this like changing collations leaving your btree index "corrupt"? In that case it probably won't cause any backend crash either but you will get incorrect results. For example, returning different results depending on whether the index or a full table scan is used. You're correct. But we can't defend users from all possible errors. Other side, that we need somehow to help user to identify what fts configuration was used to produce tsvector. For example, comment on tsvector column would be useful, but we don't know how to do this automatically. Regards, Oleg _ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83 ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] tsearch_core patch: permissions and security issues
On Thu, 14 Jun 2007, Tom Lane wrote: Teodor Sigaev <[EMAIL PROTECTED]> writes: The reason to save SQLish interface to dictionaries is a simplicity of configuration. Snowball's stemmers are useful as is, but ispell dictionary requires some configuration action before using. Yeah. I had been wondering about moving the dict_initoption over to the configuration entry --- is that sane at all? It would mean that dict_init functions would have to guard themselves against invalid options, but they probably ought to do that anyway. If we did that, I think we could have a fixed set of dictionaries without too much problem, and focus on just configurations as being user-alterable. currently, all dictionaries we provide are all template dictionaries, so users could change only parameters. But, there are reasons to allow users register new templates and in fact we know people/projects with application-dependent dictionaries. How they could dump/reload their dictionaries ? Regards, Oleg _ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83 ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] tsearch_core patch: permissions and security issues
Tom Lane wrote: Teodor Sigaev <[EMAIL PROTECTED]> writes: The reason to save SQLish interface to dictionaries is a simplicity of configuration. Snowball's stemmers are useful as is, but ispell dictionary requires some configuration action before using. Yeah. I had been wondering about moving the dict_initoption over to the configuration entry --- is that sane at all? It would mean that It should be. Instances of ispell (and synonym, thesaurus) dictionaries are different only in dict_initoption part, so it will be only one entry in pg_ts_dict_template and several ones in pg_ts_dict. ALTER FULLTEXT CONFIGURATION cfgname ADD MAPPING FOR tokentypename[, ...] WITH dictname1[, ...]; ALTER FULLTEXT CONFIGURATION cfgname ALTER MAPPING FOR tokentypename[, ...] WITH dictname1[, ...]; sets dictionary's list for token's type(s) ALTER FULLTEXT CONFIGURATION cfgname ALTER MAPPING [FOR tokentypename[, ...]] REPLACE olddictname TO newdictname; Replace dictionary to another dictionary in dictionary's list for token's type(s). This command is very useful for tweaking configuration and for creating new configuration which differs from already existing one only by pair of dictionary. ALTER FULLTEXT CONFIGURATION cfgname DROP MAPPING [IF EXISTS] FOR tokentypename; Is it looking reasonable? Er ... what's the difference between the second and third forms? That changes are doable for several days. I'd like to make changes together with replacing of FULLTEXT keyword to TEXT SEARCH as you suggested. -- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http://www.sigaev.ru/ ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] tsearch_core patch: permissions and security issues
"Teodor Sigaev" <[EMAIL PROTECTED]> writes: >>> But they still need some more thought about permissions, because AFAICS >>> mucking with a configuration can invalidate some other user's data. >> >> ouch. could mucking with a configuration create a corrupt index? > > Depending on what you mean 'corrupted'. It will not corrupted as non-readable > or cause backend crash. But usage of such tsvector column could be limited - > not all words will be searchable. Am I correct to think of this like changing collations leaving your btree index "corrupt"? In that case it probably won't cause any backend crash either but you will get incorrect results. For example, returning different results depending on whether the index or a full table scan is used. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] tsearch_core patch: permissions and security issues
Teodor Sigaev <[EMAIL PROTECTED]> writes: > The reason to save SQLish interface to dictionaries is a simplicity of > configuration. Snowball's stemmers are useful as is, but ispell dictionary > requires some configuration action before using. Yeah. I had been wondering about moving the dict_initoption over to the configuration entry --- is that sane at all? It would mean that dict_init functions would have to guard themselves against invalid options, but they probably ought to do that anyway. If we did that, I think we could have a fixed set of dictionaries without too much problem, and focus on just configurations as being user-alterable. >>> Next, it took me a while to understand how Mapping objects fit into >>> the scheme at all, and now that (I think) I understand, I'm wondering >>> why treat them as an independent concept. > ALTER FULLTEXT CONFIGURATION cfgname ADD MAPPING FOR tokentypename[, ...] > WITH > dictname1[, ...]; > ALTER FULLTEXT CONFIGURATION cfgname ALTER MAPPING FOR tokentypename[, ...] > WITH > dictname1[, ...]; > ALTER FULLTEXT CONFIGURATION cfgname ALTER MAPPING [FOR tokentypename[, ...]] > REPLACE olddictname TO newdictname; > ALTER FULLTEXT CONFIGURATION cfgname DROP MAPPING [IF EXISTS] FOR > tokentypename; > Is it looking reasonable? Er ... what's the difference between the second and third forms? regards, tom lane ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] tsearch_core patch: permissions and security issues
But they still need some more thought about permissions, because AFAICS mucking with a configuration can invalidate some other user's data. ouch. could mucking with a configuration create a corrupt index? Depending on what you mean 'corrupted'. It will not corrupted as non-readable or cause backend crash. But usage of such tsvector column could be limited - not all words will be searchable. This sounds sort of analogous to the issues collation bring up. -- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http://www.sigaev.ru/ ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] tsearch_core patch: permissions and security issues
can we hard-code into the backend, and just update for every major release like we do for encodings? Sorry, no one of them :(. We know projects which introduce new parser, new dictionary. Config and map are changes very often. -- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http://www.sigaev.ru/ ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] tsearch_core patch: permissions and security issues
Well assuming we have any SQL-level support at all I think we should strive to avoid these functions taking INTERNAL arguments. That gets us down to just needing to worry about whether we like the SQL representation of configurations. Which is still a nontrivial issue, but at least it seems manageable on a timescale that's reasonable for 8.3. Possible solution is to split pg_ts_dict (I'll talk about dictionaries, but the same way is possible to parsers, but now it's looked as overdesign) to two table like pg_am and pg_opclass. First table, pg_ts_dict_template (I don't know the exact name yet) which contains columns: oid, template_name, dict_init, dict_lexize and second: pg_ts_dict with colimns: oid, template_oid, owner, schema, dict_initoption. CREATE/ALTER/DROP DICTIONARY affects only second table and access to first one is only select/update/insert/delete similar to pg_am. IMHO, this interface solves problems with security and dumping. The reason to save SQLish interface to dictionaries is a simplicity of configuration. Snowball's stemmers are useful as is, but ispell dictionary requires some configuration action before using. Next, INTERNAL arguments parser's and dictionary's APIs are used because if performance reason. During creation of tsvector from text, there are a lot of calls of parsers and dictionaries. And internal structures of they states may be rather complex and cannot be matched in any pgsql's type, even in flat memory structure. > Next, it took me a while to understand how Mapping objects fit into > the scheme at all, and now that (I think) I understand, I'm wondering > why treat them as an independent concept. ALTER FULLTEXT CONFIGURATION cfgname ADD MAPPING FOR tokentypename[, ...] WITH dictname1[, ...]; ALTER FULLTEXT CONFIGURATION cfgname ALTER MAPPING FOR tokentypename[, ...] WITH dictname1[, ...]; ALTER FULLTEXT CONFIGURATION cfgname ALTER MAPPING [FOR tokentypename[, ...]] REPLACE olddictname TO newdictname; ALTER FULLTEXT CONFIGURATION cfgname DROP MAPPING [IF EXISTS] FOR tokentypename; Is it looking reasonable? > TSConfiguration objects are a different story, since they have only > type-safe dependencies on parsers, locales, and dictionaries. But they > still need some more thought about permissions, because AFAICS mucking > with a configuration can invalidate some other user's data.Do we want > to allow runtime changes in a configuration that existing tsvector > columns already depend on? How can we even recognize whether there is > stored data that will be affected by a configuration change? (AFAICS Very complex task: experienced users could use several configuration simultaneously. For example: indexing use configuration which doesn't reject stop-words, but for default searching use configuration which rejects stop-words. BTW, the same effects may be produced by dictionary's change. -- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http://www.sigaev.ru/ ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] tsearch_core patch: permissions and security issues
I an attempt to communicate what full text search does, and what features we are thinking of adding/removing, I have put up the introduction in HTML: http://momjian.us/expire/fulltext/HTML/fulltext-intro.html The links to the other sections don't work yet. --- Tom Lane wrote: > "Joshua D. Drake" <[EMAIL PROTECTED]> writes: > > Well my argument has always been the "core" feature argument. Perhaps I > > am missing some info here, but when I read what you wrote, I read that > > Tsearch will now be "harder" to work with. Not easier. :( > > Then you misread it. What I was proposing was essentially that there > won't be any need for pg_dump support because everything's built-in > (at least as far as parsers/dictionaries go). > > As for the permissions issues, that's just formalizing something that's > true today with the contrib module: if you change a configuration, it's > *your* problem whether that invalidates any table entries, the system > won't take care of it for you. > > regards, tom lane > > ---(end of broadcast)--- > TIP 6: explain analyze is your friend -- Bruce Momjian <[EMAIL PROTECTED]> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] tsearch_core patch: permissions and security issues
Tom Lane wrote: "Joshua D. Drake" <[EMAIL PROTECTED]> writes: Well my argument has always been the "core" feature argument. Perhaps I am missing some info here, but when I read what you wrote, I read that Tsearch will now be "harder" to work with. Not easier. :( Then you misread it. What I was proposing was essentially that there won't be any need for pg_dump support because everything's built-in (at least as far as parsers/dictionaries go). As for the permissions issues, that's just formalizing something that's true today with the contrib module: if you change a configuration, it's *your* problem whether that invalidates any table entries, the system won't take care of it for you. O.k. :) Joshua D. Drake regards, tom lane -- === The PostgreSQL Company: Command Prompt, Inc. === Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 Providing the most comprehensive PostgreSQL solutions since 1997 http://www.commandprompt.com/ Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate PostgreSQL Replication: http://www.commandprompt.com/products/ ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] tsearch_core patch: permissions and security issues
"Joshua D. Drake" <[EMAIL PROTECTED]> writes: > Well my argument has always been the "core" feature argument. Perhaps I > am missing some info here, but when I read what you wrote, I read that > Tsearch will now be "harder" to work with. Not easier. :( Then you misread it. What I was proposing was essentially that there won't be any need for pg_dump support because everything's built-in (at least as far as parsers/dictionaries go). As for the permissions issues, that's just formalizing something that's true today with the contrib module: if you change a configuration, it's *your* problem whether that invalidates any table entries, the system won't take care of it for you. regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] tsearch_core patch: permissions and security issues
Tom Lane wrote: "Joshua D. Drake" <[EMAIL PROTECTED]> writes: O.k. I am not trying to throw any cold water on this, but with the limitations we are suggesting, does the patch gain us anything over just leaving tsearch in contrib? Well, if you want to take a hard-nosed approach, no form of the patch would gain us anything over leaving it in contrib, at least not from a functionality standpoint. The argument in favor has always been about perception, really: if it's a "core" feature not an "add-on", then people will take it more seriously. And there's a rather weak ease-of-use argument that you don't have to install a contrib module. (The idea that it's targeted at people who can't or won't install a contrib module is another reason why I think we can skip user-defined parsers and dictionaries ...) Well my argument has always been the "core" feature argument. Perhaps I am missing some info here, but when I read what you wrote, I read that Tsearch will now be "harder" to work with. Not easier. :( Removal of pg_dump support kind of hurts us, as we already have problems with pg_dump support and tsearch2. Adding work to have to re-assign permissions to vector columns because we make changes... I would grant that having the SQL extensions would certainly be nice. Anyway, I am not trying to stop the progress. I would like to see Tsearch2 in core but I also don't want to add complexity. You did say here: And we reserve the right to whack around the API for the functions referenced by the catalog entries. Which kind of gets us back to upgrade problems doesn't it? Sincerely, Joshua D. Drake regards, tom lane -- === The PostgreSQL Company: Command Prompt, Inc. === Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 Providing the most comprehensive PostgreSQL solutions since 1997 http://www.commandprompt.com/ Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate PostgreSQL Replication: http://www.commandprompt.com/products/ ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] tsearch_core patch: permissions and security issues
Joshua D. Drake wrote: > Tom Lane wrote: > > Gregory Stark <[EMAIL PROTECTED]> writes: > >> Well assuming we have any SQL-level support at all I think we should > >> strive to avoid these functions taking INTERNAL arguments. > > > That gets us down to just needing to worry about whether we like the > > SQL representation of configurations. Which is still a nontrivial > > issue, but at least it seems manageable on a timescale that's > > reasonable for 8.3. > > O.k. I am not trying to throw any cold water on this, but with the > limitations we are suggesting, does the patch gain us anything over just > leaving tsearch in contrib? The idea is that common operations like searching and mapping dictionaries will be easier to do, but the more complex stuff will require catalog manipulations. -- Bruce Momjian <[EMAIL PROTECTED]> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] tsearch_core patch: permissions and security issues
"Joshua D. Drake" <[EMAIL PROTECTED]> writes: > O.k. I am not trying to throw any cold water on this, but with the > limitations we are suggesting, does the patch gain us anything over just > leaving tsearch in contrib? Well, if you want to take a hard-nosed approach, no form of the patch would gain us anything over leaving it in contrib, at least not from a functionality standpoint. The argument in favor has always been about perception, really: if it's a "core" feature not an "add-on", then people will take it more seriously. And there's a rather weak ease-of-use argument that you don't have to install a contrib module. (The idea that it's targeted at people who can't or won't install a contrib module is another reason why I think we can skip user-defined parsers and dictionaries ...) regards, tom lane ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] tsearch_core patch: permissions and security issues
Tom Lane wrote: Gregory Stark <[EMAIL PROTECTED]> writes: Well assuming we have any SQL-level support at all I think we should strive to avoid these functions taking INTERNAL arguments. That gets us down to just needing to worry about whether we like the SQL representation of configurations. Which is still a nontrivial issue, but at least it seems manageable on a timescale that's reasonable for 8.3. O.k. I am not trying to throw any cold water on this, but with the limitations we are suggesting, does the patch gain us anything over just leaving tsearch in contrib? Sincerely, Joshua D. Drake regards, tom lane ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate -- === The PostgreSQL Company: Command Prompt, Inc. === Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 Providing the most comprehensive PostgreSQL solutions since 1997 http://www.commandprompt.com/ Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate PostgreSQL Replication: http://www.commandprompt.com/products/ ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] tsearch_core patch: permissions and security issues
Gregory Stark <[EMAIL PROTECTED]> writes: > Well assuming we have any SQL-level support at all I think we should > strive to avoid these functions taking INTERNAL arguments. I don't think I want to get into redesigning the patch at that level of detail, at least not for 8.3. It seems like something possibly worth thinking about for 8.4 though. The idea that we might want to change the API for parser and dictionary support routines seems like another good argument for not exposing user-level facilities for creating them right now. What I'm realizing as I look at it is that this is an enormous patch, and it's not as close to being ready to apply as I had supposed. If we don't scale it back, then either it doesn't get into 8.3 or 8.3 gets delayed a whole lot longer. So we need to look at what we can trim or postpone for a later release. So all these factors seem to me to point in the same direction: at least for the time being, we should treat TS parsers and dictionaries the way we treat index access methods. There'll be a catalog, which the adventurous can insert new entries into, but no SQL-level support for doing it, hence no pg_dump support. And we reserve the right to whack around the API for the functions referenced by the catalog entries. That still leaves us with the question of SQL-level support for TS configurations, which are built on top of parsers and dictionaries. We definitely need some level of capability for that. For the permissions and dependencies issues, the minimalistic approach is to say "only superusers can create or alter TS configurations, and if you alter one it's your responsibility to fix up any dependent tsvector columns or indexes." We currently handle index operator classes that way, so it's not completely ridiculous. Sure it would be nice to do better, but maybe that's a post-8.3 project. That gets us down to just needing to worry about whether we like the SQL representation of configurations. Which is still a nontrivial issue, but at least it seems manageable on a timescale that's reasonable for 8.3. regards, tom lane ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] tsearch_core patch: permissions and security issues
"Tom Lane" <[EMAIL PROTECTED]> writes: > You could remove the immediate source of this objection if you could > redesign the APIs for the underlying support functions to be more > type-safe. I'm not sure how feasible or useful that would be though. > The bottom-line question here is whether developing a new parser or > dictionary implementation is really something that ordinary users might > do. If not, then having all this SQL-level support for setting up > catalog entries seems like wasted effort. Well assuming we have any SQL-level support at all I think we should strive to avoid these functions taking INTERNAL arguments. I feel like having them in the GIST interface has been a major impediment to more people defining GIST indexes for more datatypes. Because you need to write C code dealing with internal data structures to handle page splits the bar to implement GIST index operator classes is too high for most users. So instead of a simple SQL command we end up with contrib modules implementing each type of GIST index. A while back I proposed that we implement the same page-split algorithm that most (or all?) of those contrib modules copy-paste between them as a default implementation. That would allow defining a GIST index in terms of a handful of operators like "distance" which could be defined with a type-safe api. This would be less flexible than the existing generic solution but it would allow defining new GIST indexes without writing C code. > But they still need some more thought about permissions, because AFAICS > mucking with a configuration can invalidate some other user's data. ouch. could mucking with a configuration create a corrupt index? This sounds sort of analogous to the issues collation bring up. > It seems to me that the single easiest and most useful part of a > configuration to change is the stop word list; but this setup guarantees > that no one but a DBA can do that, and what's more that pg_dump won't record > your changes. I would second that, in the past I was expected to provide an administrative web interface to adjust the list of stop words. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] tsearch_core patch: permissions and security issues
You bring up a very good point. There are fifteen new commands being added for full text indexing: alter-fulltext-config.sgml alter-fulltext-owner.sgml create-fulltext-dict.sgml drop-fulltext-dict.sgml alter-fulltext-dict.sgmlalter-fulltext-parser.sgml create-fulltext-map.sgmldrop-fulltext-map.sgml alter-fulltext-dictset.sgml comment-fulltext.sgml create-fulltext-parser.sgml drop-fulltext-parser.sgml alter-fulltext-map.sgml create-fulltext-config.sgml drop-fulltext-config.sgml I think encoding is a good example to follow. We allow users to create new conversions (CREATE CONVERSION), but we don't allow them to create new encodings --- those are hard-coded in the backend. Which of the following full text objects: config dict map dictset parser can we hard-code into the backend, and just update for every major release like we do for encodings? --- Tom Lane wrote: > I've been looking at the tsearch patch a bit, and I think there needs to > be more thought given to the permissions required to mess around with > tsearch configuration objects. > > The TSParser objects reference functions declared to take and return > INTERNAL arguments. This means that the underlying functions must be > coded in C and can only be installed by a superuser, which in turn means > that there is no scenario where it is really useful for a non-superuser > to execute CREATE PARSER. What's more, allowing a non-superuser to do > it creates security holes: if you can find an unrelated function taking > the right number of INTERNAL arguments, you can install it as a TSParser > support function. That trivially allows crashing the backend, and it > could allow worse security holes than that. > > TSDictionary objects have exactly the same issues since they also depend > on functions with INTERNAL arguments. > > At minimum this means that we should restrict CREATE/DROP/ALTER commands > for these objects to superusers. (Which in turn means there's no point > in tracking an ownership column for them; every superuser is the same as > every other one, permissions-wise.) I'm wondering though whether this > doesn't mean that we don't need manipulation commands for them at all. > Is it likely that people will be adding parser or dictionary support to > an installation on the fly? Maybe we can just create 'em all at initdb > time and be done, similar to the way index access methods are treated. > This doesn't say that it's not possible to add more; you can add an > index access method on the fly too, if you want, by inserting stuff into > pg_am by hand. I'm just wondering whether all that SQL-statement > support and pg_dump support for custom parsers and dictionaries is > really worth the code space and future maintenance effort it'll eat up. > > You could remove the immediate source of this objection if you could > redesign the APIs for the underlying support functions to be more > type-safe. I'm not sure how feasible or useful that would be though. > The bottom-line question here is whether developing a new parser or > dictionary implementation is really something that ordinary users might > do. If not, then having all this SQL-level support for setting up > catalog entries seems like wasted effort. > > TSConfiguration objects are a different story, since they have only > type-safe dependencies on parsers, locales, and dictionaries. But they > still need some more thought about permissions, because AFAICS mucking > with a configuration can invalidate some other user's data. Do we want > to allow runtime changes in a configuration that existing tsvector > columns already depend on? How can we even recognize whether there is > stored data that will be affected by a configuration change? (AFAICS > the patch doesn't put anything into the pg_depend machinery that could > deal with this.) And who gets to decide which configuration is default, > anyway? > > I'm also a bit disturbed that you've made searches for TSConfiguration > objects be search-path-sensitive. That is likely to create problems > similar to those we've recently recognized for function lookup, eg, > an insertion into a full-text-indexed column gets treated differently > depending on the caller's search path. It's particularly bad to have > the default object be search-path-dependent. We learned the hard way > not to do that for default index operator classes; let's not make the > same mistake again for tsearch configurations. > > Next, it took me a while to understand how Mapping objects fit into > the scheme at all, and now that (I think) I understand, I'm wondering > why treat them as an independent concept. Seems like the mapping from > token types to dictionaries is really a property of a configuration, > and we ought to be handling it t