Re: [HACKERS] Updated tsearch documentation
On Thu, 26 Jul 2007, Bruce Momjian wrote:

> Oleg Bartunov wrote:
>> Bruce,
>>
>> I sent you a link to my wiki page with a summary of changes:
>> http://www.sai.msu.su/~megera/wiki/ts_changes
>>
>> Your documentation looks rather old.
>
> I have updated it to reflect your changes:
>
>     http://momjian.us/expire/fulltext/HTML/textsearch-tables.html

Bruce,

I noticed you missed many changes. For example, the options for the stemmer have changed (it's documented in my ts_changes), so in
http://momjian.us/expire/fulltext/HTML/textsearch-tables.html#TEXTSEARCH-TABLES-CONFIGURATION

    ALTER TEXT SEARCH DICTIONARY en_stem SET OPTION 'english-utf8.stop';

should be

    ALTER TEXT SEARCH DICTIONARY en_stem SET OPTION 'StopFile=english-utf8.stop, Language=english';

Also, this is wrong:

    DROP TEXT SEARCH CONFIGURATION MAPPING ON pg FOR email, url, sfloat, uri, float;

It should be:

    ALTER TEXT SEARCH CONFIGURATION pg DROP MAPPING FOR email, url, sfloat, uri, float;

A configuration no longer has a DEFAULT flag, so \dF should not display 'Y':

    = \dF
    pg_catalog | russian | Y
    public     | pg      | Y

This is what I see now:

    postgres=# \dF public.*
    List of fulltext configurations
     Schema | Name | Description
    --------+------+-------------
     public | pg   |

>> On Tue, 24 Jul 2007, Bruce Momjian wrote:
>>> I have added more documentation to try to show how full text search is used by user tables. I think the documentation is almost done:
>>>
>>>     http://momjian.us/expire/fulltext/HTML/textsearch-tables.html
>>>
>>> Oleg Bartunov wrote:
>>>> On Wed, 18 Jul 2007, Bruce Momjian wrote:
>>>>> Oleg, Teodor, I am confused by the following example. How does gin know to create a tsvector, or does it? Does gist know too?
>>>>
>>>> No, gist doesn't know. I don't remember why, Teodor?
>>>>
>>>> For GIN see http://archives.postgresql.org/pgsql-hackers/2007-05/msg00625.php for discussion.
>>>>
>>>>> FYI, at some point we need to chat via instant messenger or IRC to discuss the open items. My chat information is here:
>>>>>
>>>>>     http://momjian.us/main/contact.html
>>>>
>>>> I sent you an invitation for Google Talk; I use only the chat in Gmail. My Gmail account is [EMAIL PROTECTED]
>>>>
>>>>> SELECT title
>>>>> FROM pgweb
>>>>> WHERE textcat(title,body) @@ plainto_tsquery('create table')
>>>>> ORDER BY dlm DESC LIMIT 10;
>>>>>
>>>>> CREATE INDEX pgweb_idx ON pgweb USING gin(textcat(title,body));

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
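The commands corrected in the message above use the syntax that was current during 8.3 development. As a rough sketch for readers on a released PostgreSQL (8.3 or later), the same kind of changes are written with parenthesized option lists. The names `public.pg` and `public.my_en_stem` are illustrative assumptions, and the token type that the thread calls `uri` is named `url_path` in released versions:

```sql
-- Sketch only: released (8.3+) syntax, not the pre-release syntax quoted in the thread.
-- "public.pg" and "public.my_en_stem" are illustrative names.
CREATE TEXT SEARCH CONFIGURATION public.pg (COPY = pg_catalog.english);

-- Dictionary options are name = value pairs. The snowball template accepts
-- Language and StopWords (the base name of a stop-word file).
CREATE TEXT SEARCH DICTIONARY public.my_en_stem (
    TEMPLATE  = snowball,
    Language  = english,
    StopWords = english
);

-- Change an option on an existing dictionary.
ALTER TEXT SEARCH DICTIONARY public.my_en_stem (StopWords = english);

-- Stop indexing selected token types in the configuration.
ALTER TEXT SEARCH CONFIGURATION public.pg
    DROP MAPPING FOR email, url, url_path, sfloat, float;
```

In psql, `\dF public.*` then lists the configuration without any default marker, which matches the output Oleg reports.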
Re: [HACKERS] Updated tsearch documentation
On Thu, 26 Jul 2007, Bruce Momjian wrote:

> Oleg Bartunov wrote:
>> On Wed, 25 Jul 2007, Erikjan wrote:
>>> In http://momjian.us/expire/fulltext/HTML/textsearch-intro.html#TEXTSEARCH-DOCUMENT it says:
>>>
>>>   "A document is any text file that can be opened, read, and modified."
>>
>> Oops, in my original documentation it was: "Document, in usual meaning, is a text file, that one could open, read and modify." I stress that in a database a document is something different.
>>
>> http://www.sai.msu.su/~megera/postgres/fts/doc/fts-whatdb.html
>
> I have updated the documentation:
>
>     http://momjian.us/expire/fulltext/HTML/textsearch-intro.html#TEXTSEARCH-DOCUMENT

Is it worth referencing OpenFTS, which is used for indexing a file system?

Regards,
Oleg
Re: [HACKERS] Updated tsearch documentation
Oleg Bartunov wrote:
> On Thu, 26 Jul 2007, Bruce Momjian wrote:
>> I have updated the documentation:
>>
>>     http://momjian.us/expire/fulltext/HTML/textsearch-intro.html#TEXTSEARCH-DOCUMENT
>
> Is it worth referencing OpenFTS, which is used for indexing a file system?

Uh, not sure. I don't think so, but we can add a URL to it if you can find the right place.

--
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
Re: [HACKERS] Updated tsearch documentation
Thanks, I found a few more places that needed updating. It should be accurate now. Thanks for the report.

---

Oleg Bartunov wrote:
> Bruce, I noticed you missed many changes. For example, the options for the stemmer have changed (it's documented in my ts_changes), so in
> http://momjian.us/expire/fulltext/HTML/textsearch-tables.html#TEXTSEARCH-TABLES-CONFIGURATION
>
>     ALTER TEXT SEARCH DICTIONARY en_stem SET OPTION 'english-utf8.stop';
>
> should be
>
>     ALTER TEXT SEARCH DICTIONARY en_stem SET OPTION 'StopFile=english-utf8.stop, Language=english';
>
> Also, this is wrong:
>
>     DROP TEXT SEARCH CONFIGURATION MAPPING ON pg FOR email, url, sfloat, uri, float;
>
> It should be:
>
>     ALTER TEXT SEARCH CONFIGURATION pg DROP MAPPING FOR email, url, sfloat, uri, float;
>
> A configuration no longer has a DEFAULT flag, so \dF should not display 'Y':
>
>     = \dF
>     pg_catalog | russian | Y
>     public     | pg      | Y
>
> This is what I see now:
>
>     postgres=# \dF public.*
>     List of fulltext configurations
>      Schema | Name | Description
>     --------+------+-------------
>      public | pg   |

--
  Bruce Momjian
Re: [HACKERS] Updated tsearch documentation
Dimitri Fontaine wrote:
> Hi,
>
> On Wednesday 25 July 2007, Bruce Momjian wrote:
>> I have added more documentation to try to show how full text search is used by user tables. I think the documentation is almost done:
>>
>>     http://momjian.us/expire/fulltext/HTML/textsearch-tables.html
>
> I've come to understand that GIN indexes are far more costly to update than GiST ones, and Oleg's wiki advises users to partition data and use a GiST index for the live part and a GIN index for the archive part only.
>
> Is it worth mentioning this in this part of the documentation? And if mentioned here, the partitioning step could certainly be part of the example... or be left as a user exercise, but then with an explanation of why GIN is a good choice in the provided example.

Partitioning is already in the documentation:

    Partitioning of big collections and the proper use of GiST and GIN
    indexes allows the implementation of very fast searches with online
    update. Partitioning can be done at the database level using table
    inheritance and constraint_exclusion, or distributing documents over
    servers and collecting search results using the contrib/dblink
    extension module. The latter is possible because ranking functions
    use only local information.

I don't see a reason to provide an example beyond the existing examples of how to do partitioning.

--
  Bruce Momjian
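The layout Dimitri describes (GiST on the frequently updated part, GIN on the archive) can be sketched with table inheritance and constraint exclusion, the mechanism named in the quoted documentation paragraph. This is only an illustration: the table `docs`, its columns, and the cut-off date are assumptions, not taken from the thread.

```sql
-- Hypothetical schema: parent table plus two children with disjoint date ranges.
CREATE TABLE docs (
    id      bigint NOT NULL,
    created date   NOT NULL,
    body    text,
    fts     tsvector            -- precomputed document vector
);

CREATE TABLE docs_live    (CHECK (created >= DATE '2007-07-01')) INHERITS (docs);
CREATE TABLE docs_archive (CHECK (created <  DATE '2007-07-01')) INHERITS (docs);

-- GiST is cheaper to update, so it goes on the live part;
-- GIN is faster to search, so it goes on the rarely changed archive.
CREATE INDEX docs_live_fts_idx    ON docs_live    USING gist (fts);
CREATE INDEX docs_archive_fts_idx ON docs_archive USING gin  (fts);

-- Let the planner skip children whose CHECK constraint rules them out.
SET constraint_exclusion = on;

-- Searches only docs_live, because of the date predicate.
SELECT id
FROM docs
WHERE created >= DATE '2007-07-01'
  AND fts @@ to_tsquery('english', 'create & table');
```

Rows still have to be moved from the live child to the archive child periodically; that step is left out of the sketch.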
Re: [HACKERS] Updated tsearch documentation
Oleg Bartunov wrote:
> On Wed, 25 Jul 2007, Erikjan wrote:
>> In http://momjian.us/expire/fulltext/HTML/textsearch-intro.html#TEXTSEARCH-DOCUMENT it says:
>>
>>   "A document is any text file that can be opened, read, and modified."
>
> Oops, in my original documentation it was: "Document, in usual meaning, is a text file, that one could open, read and modify." I stress that in a database a document is something different.
>
> http://www.sai.msu.su/~megera/postgres/fts/doc/fts-whatdb.html

I have updated the documentation:

    http://momjian.us/expire/fulltext/HTML/textsearch-intro.html#TEXTSEARCH-DOCUMENT

--
  Bruce Momjian
Re: [HACKERS] Updated tsearch documentation
Oleg Bartunov wrote:
> Bruce,
>
> I sent you a link to my wiki page with a summary of changes:
> http://www.sai.msu.su/~megera/wiki/ts_changes
>
> Your documentation looks rather old.

I have updated it to reflect your changes:

    http://momjian.us/expire/fulltext/HTML/textsearch-tables.html

--
  Bruce Momjian
Re: [HACKERS] Updated tsearch documentation
Hi,

On Wednesday 25 July 2007, Bruce Momjian wrote:
> I have added more documentation to try to show how full text search is used by user tables. I think the documentation is almost done:
>
>     http://momjian.us/expire/fulltext/HTML/textsearch-tables.html

I've come to understand that GIN indexes are far more costly to update than GiST ones, and Oleg's wiki advises users to partition data and use a GiST index for the live part and a GIN index for the archive part only.

Is it worth mentioning this in this part of the documentation? And if mentioned here, the partitioning step could certainly be part of the example... or be left as a user exercise, but then with an explanation of why GIN is a good choice in the provided example.

Hope this helps, regards,
--
dim
Re: [HACKERS] Updated tsearch documentation
In http://momjian.us/expire/fulltext/HTML/textsearch-intro.html#TEXTSEARCH-DOCUMENT it says:

  "A document is any text file that can be opened, read, and modified."

Is this an openfts docs relic? tsearch2 is not meant to be reading out-of-database *files*, or is it?

If it is actually the case that the present tsearch2 implementation (for 8.3) is going to be able to store pointers into external files, maybe this should be made more explicitly clear?

oh, and another little derussification (russians don't seem to like articles, be they definite or indefinite):

  "is seen as different function"

should be

  "is seen as a different function"

Thanks,

Erik Rijkers
Re: [HACKERS] Updated tsearch documentation
On Wed, 25 Jul 2007, Erikjan wrote:

> In http://momjian.us/expire/fulltext/HTML/textsearch-intro.html#TEXTSEARCH-DOCUMENT it says:
>
>   "A document is any text file that can be opened, read, and modified."

Oops, in my original documentation it was: "Document, in usual meaning, is a text file, that one could open, read and modify." I stress that in a database a document is something different.

http://www.sai.msu.su/~megera/postgres/fts/doc/fts-whatdb.html

> Is this an openfts docs relic? tsearch2 is not meant to be reading out-of-database *files*, or is it?
>
> If it is actually the case that the present tsearch2 implementation (for 8.3) is going to be able to store pointers into external files, maybe this should be made more explicitly clear?
>
> oh, and another little derussification (russians don't seem to like articles, be they definite or indefinite):
>
>   "is seen as different function"
>
> should be
>
>   "is seen as a different function"
>
> Thanks,
>
> Erik Rijkers

Regards,
Oleg
Re: [HACKERS] Updated tsearch documentation
I have added more documentation to try to show how full text search is used by user tables. I think the documentation is almost done:

    http://momjian.us/expire/fulltext/HTML/textsearch-tables.html

---

Oleg Bartunov wrote:
> On Wed, 18 Jul 2007, Bruce Momjian wrote:
>> Oleg, Teodor, I am confused by the following example. How does gin know to create a tsvector, or does it? Does gist know too?
>
> No, gist doesn't know. I don't remember why, Teodor?
>
> For GIN see http://archives.postgresql.org/pgsql-hackers/2007-05/msg00625.php for discussion.
>
>> FYI, at some point we need to chat via instant messenger or IRC to discuss the open items. My chat information is here:
>>
>>     http://momjian.us/main/contact.html
>
> I sent you an invitation for Google Talk; I use only the chat in Gmail. My Gmail account is [EMAIL PROTECTED]
>
>> SELECT title
>> FROM pgweb
>> WHERE textcat(title,body) @@ plainto_tsquery('create table')
>> ORDER BY dlm DESC LIMIT 10;
>>
>> CREATE INDEX pgweb_idx ON pgweb USING gin(textcat(title,body));
>
> Regards,
> Oleg

--
  Bruce Momjian
Re: [HACKERS] Updated tsearch documentation
Bruce,

I sent you a link to my wiki page with a summary of changes:
http://www.sai.msu.su/~megera/wiki/ts_changes

Your documentation looks rather old.

Oleg

On Tue, 24 Jul 2007, Bruce Momjian wrote:

> I have added more documentation to try to show how full text search is used by user tables. I think the documentation is almost done:
>
>     http://momjian.us/expire/fulltext/HTML/textsearch-tables.html

Regards,
Oleg
Re: [HACKERS] Updated tsearch documentation
On Tue, 17 Jul 2007, Bruce Momjian wrote:

> Oleg Bartunov wrote:
>> On Tue, 17 Jul 2007, Bruce Momjian wrote:
>>
>>> I think the tsearch documentation is nearing completion:
>>>
>>>     http://momjian.us/expire/fulltext/HTML/textsearch.html
>>>
>>> but I am not happy with how tsearch is enabled in a user table:
>>>
>>>     http://momjian.us/expire/fulltext/HTML/textsearch-app-tutorial.html
>>>
>>> Aside from the fact that it needs more examples, it only illustrates an example where someone creates a table, populates it, then adds a tsvector column, populates that, then creates an index.
>>>
>>> That seems quite inflexible. Is there a way to avoid having a separate tsvector column? What happens if the table is dynamic? How is that column updated based on table changes? Triggers? Where are the examples? Can you create an index like this:
>>
>> I agree that there could be more examples, but text search doesn't require anything special!
>> An *example* of a trigger function is documented at
>> http://momjian.us/expire/fulltext/HTML/textsearch-opfunc.html
>
> Yes, I see that in tsearch() here:
>
>     http://momjian.us/expire/fulltext/HTML/textsearch-opfunc.html#TEXTSEARC$
>
> I assume my_filter_name is optional, right? I have updated the prototype to be:
>
>     tsearch([vector_column_name], [my_filter_name], text_column_name [, ... ])
>
> Is this accurate? What does this text below it mean?

No, this is inaccurate. First, vector_column_name is not an optional argument; it is the name of the tsvector column.

>     There can be many functions and text columns specified in a
>     tsearch() trigger. The following rule is used: a function is
>     applied to all subsequent TEXT columns until the next matching
>     column occurs.

The idea is to let the user preprocess text before applying the tsearch machinery.

    my_filter_name() preprocess text_column_name1, text_column_name2,

The original syntax allows a preprocessing function to be specified for every text column. So I suggest keeping the original syntax and changing 'vector_column_name' to 'tsvector_column_name'.

> Why are we allowing my_filter_name here? Isn't that something for a custom trigger? Is calling it tsearch() a good idea? Why not tsvector_trigger()?

I don't see any benefit from the tsvector_trigger() name. If you want to add some semantics, then tsvector_update_trigger() would be better. Anyway, this trigger is an illustration.

>>>     CREATE INDEX textsearch_id ON pgweb USING gin(to_tsvector(column));
>>>
>>> That avoids having to have a separate column because you can just say:
>>>
>>>     WHERE to_tsquery('XXX') @@ to_tsvector(column)
>>
>> Yes, it's possible, but without ranking, since currently it's impossible to store any information in the index (it's pg's feature). By the way, this should work for a GiST index also.
>
> What if they use @@@? Wouldn't that work because it is going to check the heap?

It would work; it would recalculate to_tsvector(column) for the rows found (for GiST, to remove false hits and for weight information; for GIN, for weight information only).

>> That kind of search is useful if there is another natural ordering of search results, for example, by timestamp.
>>
>>> How do we make sure that the to_tsquery is using the same text search configuration as the 'column' or index? Perhaps we should suggest:
>>
>> Please keep in mind that it's not mandatory to use the same configuration at search time that was used at index creation.
>
> Well, sort of. If you have stop words in the tsquery configuration, you aren't going to hit any matches in the tsvector, right? Same for synonyms, I suppose. I can see that stemming would work if there was a mismatch between tsquery and tsvector.
>
>>>     CREATE INDEX textsearch_idx ON pgweb USING gin(to_tsvector('english',column));
>>>
>>> so that at least the configuration is documented in the index.
>>
>> Yes, it's better to always explicitly specify the configuration name and not rely on the default configuration. Unfortunately, the configuration name isn't saved in the index.

As Teodor corrected me, the index doesn't know about the configuration at all! What a careful user could do is record the configuration name in the comment for the tsvector column. The configuration name belongs to the to_tsvector() function. In principle, a tsvector, like any data type, could be obtained in other ways; for example, OpenFTS constructs tsvectors following its own rules.

> I was more concerned that there is nothing documenting the configuration used by the index or the tsvector table column trigger. By doing:

Again, the index has nothing to do with the configuration name. Our trigger function is an example, which uses the default configuration name. A user could easily write their own trigger to keep the tsvector column up to date and use the configuration name as a parameter.

>     CREATE INDEX textsearch_idx ON pgweb USING gin(to_tsvector('english',column));
>
> you guarantee that the index uses 'english' for all its entries. If you omit the 'english' or use a different configuration, it will heap scan the
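Oleg's suggestion, a user-written trigger that keeps the tsvector column current and takes the configuration name as a parameter, might look roughly like the following. It is a sketch under assumptions: the `pgweb` table is taken to have `title` and `body` text columns, and the column `fts`, the function `pgweb_fts_update()`, and the trigger name are hypothetical. It is not the `tsearch()` trigger discussed in the thread.

```sql
-- Hypothetical tsvector column on the thread's example table.
ALTER TABLE pgweb ADD COLUMN fts tsvector;

CREATE FUNCTION pgweb_fts_update() RETURNS trigger AS $$
BEGIN
    -- TG_ARGV[0] is the configuration name given in CREATE TRIGGER,
    -- so every row is processed with the same, explicit configuration.
    NEW.fts := to_tsvector(TG_ARGV[0]::regconfig,
                           coalesce(NEW.title, '') || ' ' || coalesce(NEW.body, ''));
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER pgweb_fts_trigger
    BEFORE INSERT OR UPDATE ON pgweb
    FOR EACH ROW EXECUTE PROCEDURE pgweb_fts_update('english');
```

Passing the configuration in the trigger definition also documents it in the schema, which addresses Bruce's concern that nothing records which configuration populated the column. Released PostgreSQL versions ship a built-in `tsvector_update_trigger(tsvector_column_name, config_name, text_column_name [, ...])` that covers the simple case.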
Re: [HACKERS] Updated tsearch documentation
Oleg Bartunov wrote:
>>> I agree that there could be more examples, but text search doesn't require anything special!
>>> An *example* of a trigger function is documented at
>>> http://momjian.us/expire/fulltext/HTML/textsearch-opfunc.html
>>
>> Yes, I see that in tsearch() here:
>>
>>     http://momjian.us/expire/fulltext/HTML/textsearch-opfunc.html#TEXTSEARC$
>>
>> I assume my_filter_name is optional, right? I have updated the prototype to be:
>>
>>     tsearch([vector_column_name], [my_filter_name], text_column_name [, ... ])
>>
>> Is this accurate? What does this text below it mean?
>
> No, this is inaccurate. First, vector_column_name is not an optional argument; it is the name of the tsvector column.

Fixed.

>>     There can be many functions and text columns specified in a
>>     tsearch() trigger. The following rule is used: a function is
>>     applied to all subsequent TEXT columns until the next matching
>>     column occurs.
>
> The idea is to let the user preprocess text before applying the tsearch machinery.
>
>     my_filter_name() preprocess text_column_name1, text_column_name2,
>
> The original syntax allows a preprocessing function to be specified for every text column. So I suggest keeping the original syntax and changing 'vector_column_name' to 'tsvector_column_name'.

OK, change made.

>> Why are we allowing my_filter_name here? Isn't that something for a custom trigger? Is calling it tsearch() a good idea? Why not tsvector_trigger()?
>
> I don't see any benefit from the tsvector_trigger() name. If you want to add some semantics, then tsvector_update_trigger() would be better. Anyway, this trigger is an illustration.

Well, the filter that removes '@' might be an example, but tsearch() is indeed sort of a built-in trigger function to be used for simple cases. My point is that because it is only for simple cases, why add complexity and allow a filter? It seems best to just remove the filter idea and let people write their own triggers if they want that functionality.

>>>>     CREATE INDEX textsearch_id ON pgweb USING gin(to_tsvector(column));
>>>>
>>>> That avoids having to have a separate column because you can just say:
>>>>
>>>>     WHERE to_tsquery('XXX') @@ to_tsvector(column)
>>>
>>> Yes, it's possible, but without ranking, since currently it's impossible to store any information in the index (it's pg's feature). By the way, this should work for a GiST index also.
>>
>> What if they use @@@? Wouldn't that work because it is going to check the heap?
>
> It would work; it would recalculate to_tsvector(column) for the rows found (for GiST, to remove false hits and for weight information; for GIN, for weight information only).

Right. Currently to use text search on a table, you have to do three things:

    o  add a tsvector column to the table
    o  add a trigger to keep the tsvector column current
    o  add an index to the tsvector column

My question is why bother with the first two steps? If you do:

    CREATE INDEX textsearch_idx ON pgweb USING gist(to_tsvector('english',column));

you don't need a separate column and a trigger to keep it current. The index is kept current as part of normal query processing. The only downside is that you have to do to_tsvector() in the heap to avoid false hits, but that seems minor compared to the disk savings of not having the separate column. Is to_tsvector() an expensive function?

>>>>     CREATE INDEX textsearch_idx ON pgweb USING gin(to_tsvector('english',column));
>>>>
>>>> so that at least the configuration is documented in the index.
>>>
>>> Yes, it's better to always explicitly specify the configuration name and not rely on the default configuration. Unfortunately, the configuration name isn't saved in the index.
>
> As Teodor corrected me, the index doesn't know about the configuration at all! What a careful user could do is record the configuration name in the comment for the tsvector column. The configuration name belongs to the to_tsvector() function.

Well, if you create the index with the configuration name it is guaranteed to match:

    CREATE INDEX textsearch_idx ON pgweb USING gist(to_tsvector('english',column));
                                                                ---------

And if someone does:

    WHERE 'friend'::tsquery @@ to_tsvector('english',column)

the index is used. Now if the default configuration is 'english' and they use:

    WHERE 'friend'::tsquery @@ to_tsvector(column)

the index is not used, but this is just a good example of why default configurations aren't that useful. One problem I see is that if the default configuration is not 'english', then when the index consults the heap, it would be using a different configuration and yield incorrect results. I am unsure how to fix that.

With the trigger idea, you have to be sure your configuration is the same every time you INSERT/UPDATE the table or the index will have mixed configuration entries and it will yield incorrect results, aside from the heap configuration lookup not matching the index.

Once
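Bruce's point about matching expressions can be made concrete with a short sketch. It assumes the thread's `pgweb` table has `title` and `body` text columns (the thread writes a placeholder `column`); the index name is hypothetical. In released PostgreSQL the one-argument `to_tsvector()` depends on the session's default configuration, so only the two-argument form can be used in an expression index.

```sql
-- Expression index: no separate tsvector column and no trigger are needed.
CREATE INDEX pgweb_body_idx ON pgweb
    USING gin (to_tsvector('english', body));

-- Can use the index: the WHERE clause repeats the indexed expression,
-- including the explicit configuration name.
SELECT title
FROM pgweb
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'friend');

-- Cannot use that index: the expression differs, because the
-- configuration is left to the session default.
SELECT title
FROM pgweb
WHERE to_tsvector(body) @@ to_tsquery('friend');
```

Running `EXPLAIN` on the two queries is a simple way to confirm which one the planner can satisfy from the index.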
Re: [HACKERS] Updated tsearch documentation
On Wed, 18 Jul 2007, Bruce Momjian wrote:

>>> Why are we allowing my_filter_name here? Isn't that something for a custom trigger? Is calling it tsearch() a good idea? Why not tsvector_trigger()?
>>
>> I don't see any benefit from the tsvector_trigger() name. If you want to add some semantics, then tsvector_update_trigger() would be better. Anyway, this trigger is an illustration.
>
> Well, the filter that removes '@' might be an example, but tsearch() is indeed a sort of built-in trigger function to be used for simple cases. My point is that because it is only for simple cases, why add complexity and allow a filter? It seems best to just remove the filter idea and let people write their own triggers if they want that functionality.

If you care about documentation simplicity, then we could just document two versions:

1. without filter function - simple, well understood syntax
2. with filter function - for advanced users

I don't want to remove a feature which has worked for years without any problem.

>>>>> CREATE INDEX textsearch_id ON pgweb USING gin(to_tsvector(column));
>>>>>
>>>>> That avoids having to have a separate column because you can just say:
>>>>>
>>>>> WHERE to_tsquery('XXX') @@ to_tsvector(column)
>>>>
>>>> yes, it's possible, but without ranking, since currently it's impossible to store any information in the index (it's pg's feature). btw, this should work for a GiST index also.
>>>
>>> What if they use @@@? Wouldn't that work because it is going to check the heap?
>>
>> It would work, it'd recalculate to_tsvector(column) for the rows found (for GiST - to remove false hits and for weight information, for GIN - for weight information only).
>
> Right. Currently to use text search on a table, you have to do three things:
>
>   o add a tsvector column to the table
>   o add a trigger to keep the tsvector column current
>   o add an index to the tsvector column
>
> My question is why bother with the first two steps? If you do:
>
>   CREATE INDEX textsearch_idx ON pgweb USING gist(to_tsvector('english',column));
>
> you don't need a separate column and a trigger to keep it current. The index is kept current as part of normal query processing. The only downside is that you have to do to_tsvector() in the heap to avoid false hits, but that seems minor compared to the disk savings of not having the separate column. Is to_tsvector() an expensive function?

Bruce, you oversimplify the text search. The document could be fully virtual, not a column (or columns); it could be the result of any SQL commands, so it could be very expensive just to obtain the document, and yes, to_tsvector() could be very expensive, depending on the document size, parser and dictionaries used. And, again, the current postgres architecture forces us to use the heap to store positional and weight information for ranking. The use case for what you described is very limited - simple text search on one or several columns of the same table without ranking.

>>>>> CREATE INDEX textsearch_idx ON pgweb USING gin(to_tsvector('english',column));
>>>>>
>>>>> so that at least the configuration is documented in the index.
>>>>
>>>> yes, it's better to always explicitly specify the configuration name and not rely on the default configuration. Unfortunately, the configuration name isn't saved in the index.
>>
>> as Teodor corrected me, the index doesn't know about the configuration at all! What a careful user could do is to provide the configuration name in the comment for the tsvector column. The configuration name is an accessory of the to_tsvector() function.
>
> Well, if you create the index with the configuration name it is guaranteed to match:
>
>   CREATE INDEX textsearch_idx ON pgweb USING gist(to_tsvector('english',column));
>
> And if someone does:
>
>   WHERE 'friend'::tsquery @@ to_tsvector('english',column)
>
> the index is used. Now if the default configuration is 'english' and they use:
>
>   WHERE 'friend'::tsquery @@ to_tsvector(column)
>
> the index is not used, but this is just a good example of why default configurations aren't that useful. One problem I see is that if the default configuration is not 'english', then when the index consults the heap, it would be using a different configuration and yield incorrect results. I am unsure how to fix that.

again, you consider a very simple case, and actually your example is a good example of the usefulness of the default configuration! Just think before you develop your application, but this is a very general rule. There are zillions of situations where you could do bad things, after all. Moreover, consider text search on a text column: there is no way to specify the configuration at all! We rely on the default configuration here:

  CREATE INDEX textsearch_idx ON pgweb USING gin(title);

> With the trigger idea, you have to be sure your configuration is the same every time you INSERT/UPDATE the table or the index will have mixed configuration entries and it will yield incorrect results, aside from the heap configuration lookup not matching the index.
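For readers following the index-only approach argued over in this message, here is a minimal sketch of how it behaves. It assumes the function names of released PostgreSQL (8.3 and later); the `id` column and the index name are hypothetical, and only `pgweb`, `title`, `body` and `dlm` appear in the thread.

```sql
-- Hypothetical table, used only for illustration.
CREATE TABLE pgweb (
    id    serial PRIMARY KEY,
    title text,
    body  text,
    dlm   timestamp DEFAULT now()
);

-- Expression index: no separate tsvector column and no trigger.
-- The two-argument form pins the configuration inside the index expression.
CREATE INDEX pgweb_body_idx ON pgweb
    USING gin (to_tsvector('english', body));

-- Eligible for the index: the WHERE clause repeats the indexed expression.
SELECT title
FROM pgweb
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'friend')
ORDER BY dlm DESC
LIMIT 10;

-- Not eligible: the one-argument form is a different expression whose
-- result depends on default_text_search_config, so to_tsvector() is
-- recomputed for each row that is scanned.
SELECT title
FROM pgweb
WHERE to_tsvector(body) @@ to_tsquery('friend');
```

As Oleg points out, this form carries no stored positions or weights, so it suits searches ordered by something other than rank, such as `dlm` here.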
Re: [HACKERS] Updated tsearch documentation
Oleg Bartunov wrote:

> If you care about documentation simplicity, then we could just document two versions:
>
> 1. without filter function - simple, well understood syntax
> 2. with filter function - for advanced users
>
> I don't want to remove a feature which has worked for years without any problem.

Yes, this is what I want. I would like to show the simple usage first, then explain that a more complex usage is possible. This will help people get started using text search. Triggers and secondary columns are fine, but to start using it the CREATE INDEX-only case is best. I don't suggest we remove any capabilities, only suggest simple solutions.

> Bruce, you oversimplify the text search. The document could be fully virtual, not a column (or columns); it could be the result of any SQL commands, so it could be very expensive just to obtain the document, and yes, to_tsvector() could be very expensive, depending on the document size, parser and dictionaries used. And, again, the current postgres architecture forces us to use the heap to store positional and weight information for ranking. The use case for what you described is very limited - simple text search on one or several columns of the same table without ranking.

Right, but I bet that that is all the majority of users need, at least at first as they start to use text search.
Re: [HACKERS] Updated tsearch documentation
Oleg, Teodor, I am confused by the following example. How does gin know to create a tsvector, or does it? Does gist know too?

FYI, at some point we need to chat via instant messenger or IRC to discuss the open items. My chat information is here:

  http://momjian.us/main/contact.html

---

  SELECT title
  FROM pgweb
  WHERE textcat(title,body) @@ plainto_tsquery('create table')
  ORDER BY dlm DESC LIMIT 10;

  CREATE INDEX pgweb_idx ON pgweb USING gin(textcat(title,body));

--
Bruce Momjian [EMAIL PROTECTED] http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Updated tsearch documentation
On Wed, 18 Jul 2007, Bruce Momjian wrote:

> Oleg, Teodor, I am confused by the following example. How does gin know to create a tsvector, or does it? Does gist know too?

No, gist doesn't know. I don't remember why, Teodor? For GIN see http://archives.postgresql.org/pgsql-hackers/2007-05/msg00625.php for discussion.

> FYI, at some point we need to chat via instant messenger or IRC to discuss the open items. My chat information is here:
>
>   http://momjian.us/main/contact.html

I sent you an invitation for google talk; I use only chat in gmail. My gmail account is [EMAIL PROTECTED]

> ---
>
>   SELECT title
>   FROM pgweb
>   WHERE textcat(title,body) @@ plainto_tsquery('create table')
>   ORDER BY dlm DESC LIMIT 10;
>
>   CREATE INDEX pgweb_idx ON pgweb USING gin(textcat(title,body));

Regards, Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
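The example under discussion relies on the index handling a plain text expression. A minimal sketch of the same query and index with the conversion spelled out is shown below. It assumes the function names of released PostgreSQL (8.3 and later) and an explicit 'english' configuration, neither of which comes from this message, and a `pgweb` table with `title`, `body` and `dlm` columns.

```sql
-- Index the concatenated title and body as an explicit tsvector.
-- coalesce() keeps a NULL title or body from turning the whole
-- concatenation into NULL.
CREATE INDEX pgweb_idx ON pgweb
    USING gin (to_tsvector('english',
                           coalesce(title, '') || ' ' || coalesce(body, '')));

-- The WHERE clause repeats the indexed expression exactly,
-- which is what makes the index usable for this query.
SELECT title
FROM pgweb
WHERE to_tsvector('english',
                  coalesce(title, '') || ' ' || coalesce(body, ''))
      @@ plainto_tsquery('english', 'create table')
ORDER BY dlm DESC
LIMIT 10;
```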
Re: [HACKERS] Updated tsearch documentation
On Tue, 17 Jul 2007, Bruce Momjian wrote:

> I think the tsearch documentation is nearing completion:
>
>   http://momjian.us/expire/fulltext/HTML/textsearch.html
>
> but I am not happy with how tsearch is enabled in a user table:
>
>   http://momjian.us/expire/fulltext/HTML/textsearch-app-tutorial.html
>
> Aside from the fact that it needs more examples, it only illustrates an example where someone creates a table, populates it, then adds a tsvector column, populates that, then creates an index. That seems quite inflexible. Is there a way to avoid having a separate tsvector column? What happens if the table is dynamic? How is that column updated based on table changes? Triggers? Where are the examples? Can you create an index like this:

I agree that there could be more examples, but text search doesn't require anything special! An *example* of a trigger function is documented at http://momjian.us/expire/fulltext/HTML/textsearch-opfunc.html

>   CREATE INDEX textsearch_id ON pgweb USING gin(to_tsvector(column));
>
> That avoids having to have a separate column because you can just say:
>
>   WHERE to_tsquery('XXX') @@ to_tsvector(column)

yes, it's possible, but without ranking, since currently it's impossible to store any information in the index (it's pg's feature). btw, this should work for a GiST index also. That kind of search is useful if there is another natural ordering of search results, for example, by timestamp.

> How do we make sure that to_tsquery is using the same text search configuration as the 'column' or index? Perhaps we should suggest:

please keep in mind, it's not mandatory to use the same configuration at search time that was used at index creation.

>   CREATE INDEX textsearch_idx ON pgweb USING gin(to_tsvector('english',column));
>
> so that at least the configuration is documented in the index.

yes, it's better to always explicitly specify the configuration name and not rely on the default configuration. Unfortunately, the configuration name isn't saved in the index.
Regards, Oleg
Re: [HACKERS] Updated tsearch documentation
On Tue, 17 Jul 2007, Oleg Bartunov wrote:

> I agree that there could be more examples, but text search doesn't require anything special! An *example* of a trigger function is documented at http://momjian.us/expire/fulltext/HTML/textsearch-opfunc.html

Bruce, below is an example of a trigger for insert/update of the example table:

  create function pgweb_update() returns trigger as $$
  BEGIN
      -- column values must be read from the NEW row inside a trigger
      NEW.textsearch_index =
          setweight(to_tsvector(coalesce(NEW.title, '')), 'A') ||
          setweight(to_tsvector(coalesce(NEW.body, '')), 'D');
      RETURN NEW;
  END;
  $$ language plpgsql;

  CREATE TRIGGER fts_update BEFORE INSERT OR UPDATE ON pgweb
  FOR EACH ROW EXECUTE PROCEDURE pgweb_update();

> please keep in mind, it's not mandatory to use the same configuration at search time that was used at index creation.

One example is when a text search index is created without taking stop words into account. Then you could search for the famous 'to be or not to be' with the same configuration, or ignore stop words with another.

Regards, Oleg
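The stop-word case can be tried directly. This is only a sketch, assuming the built-in 'simple' and 'english' configurations of released PostgreSQL (8.3 and later), where 'simple' keeps every word and 'english' drops stop words; these configuration names are not taken from the thread.

```sql
-- Document indexed without stop-word removal: every word is kept.
SELECT to_tsvector('simple', 'to be or not to be');

-- Same document through a configuration with a stop-word list:
-- common words such as "to" and "be" are dropped.
SELECT to_tsvector('english', 'to be or not to be');

-- Searching with the same stop-word-free configuration can still
-- find the phrase's words in the 'simple' vector.
SELECT to_tsvector('simple', 'to be or not to be')
       @@ plainto_tsquery('simple', 'to be or not to be') AS matches;

-- Searching the same vector with a different configuration ignores the
-- stop words in the query and matches on the remaining word only.
SELECT to_tsvector('simple', 'to be or not to be, that is the question')
       @@ plainto_tsquery('english', 'the question') AS matches;
```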
Re: [HACKERS] Updated tsearch documentation
Oleg Bartunov wrote:
> On Tue, 17 Jul 2007, Bruce Momjian wrote:
>
>> I think the tsearch documentation is nearing completion:
>>
>>   http://momjian.us/expire/fulltext/HTML/textsearch.html
>>
>> but I am not happy with how tsearch is enabled in a user table:
>>
>>   http://momjian.us/expire/fulltext/HTML/textsearch-app-tutorial.html
>>
>> Aside from the fact that it needs more examples, it only illustrates an example where someone creates a table, populates it, then adds a tsvector column, populates that, then creates an index. That seems quite inflexible. Is there a way to avoid having a separate tsvector column? What happens if the table is dynamic? How is that column updated based on table changes? Triggers? Where are the examples? Can you create an index like this:
>
> I agree that there could be more examples, but text search doesn't require anything special! An *example* of a trigger function is documented at http://momjian.us/expire/fulltext/HTML/textsearch-opfunc.html

Yes, I see that in tsearch() here:

  http://momjian.us/expire/fulltext/HTML/textsearch-opfunc.html#TEXTSEARC$

I assume my_filter_name is optional, right? I have updated the prototype to be:

  tsearch([vector_column_name], [my_filter_name], text_column_name [, ... ])

Is this accurate? What does this text below it mean?

  There can be many functions and text columns specified in a tsearch() trigger. The following rule is used: a function is applied to all subsequent TEXT columns until the next matching column occurs.

Why are we allowing my_filter_name here? Isn't that something for a custom trigger? Is calling it tsearch() a good idea? Why not tsvector_trigger()?

>>   CREATE INDEX textsearch_id ON pgweb USING gin(to_tsvector(column));
>>
>> That avoids having to have a separate column because you can just say:
>>
>>   WHERE to_tsquery('XXX') @@ to_tsvector(column)
>
> yes, it's possible, but without ranking, since currently it's impossible to store any information in the index (it's pg's feature). btw, this should work for a GiST index also.

What if they use @@@? Wouldn't that work because it is going to check the heap?

> That kind of search is useful if there is another natural ordering of search results, for example, by timestamp.
>
>> How do we make sure that to_tsquery is using the same text search configuration as the 'column' or index? Perhaps we should suggest:
>
> please keep in mind, it's not mandatory to use the same configuration at search time that was used at index creation.

Well, sort of. If you have stop words in the tsquery configuration, you aren't going to hit any matches in the tsvector, right? Same for synonyms, I suppose. I can see that stemming would work if there was a mismatch between tsquery and tsvector.

>>   CREATE INDEX textsearch_idx ON pgweb USING gin(to_tsvector('english',column));
>>
>> so that at least the configuration is documented in the index.
>
> yes, it's better to always explicitly specify the configuration name and not rely on the default configuration. Unfortunately, the configuration name isn't saved in the index.

I was more concerned that there is nothing documenting the configuration used by the index or the tsvector table column trigger. By doing:

  CREATE INDEX textsearch_idx ON pgweb USING gin(to_tsvector('english',column));

you guarantee that the index uses 'english' for all its entries. If you omit the 'english' or use a different configuration, it will heap scan the table, which at least gives the right answer.

Also, how do you guarantee that tsearch() triggers always use the same configuration? The existing tsearch() API seems to make that impossible. I am wondering if we need to add the configuration name as a mandatory parameter to tsearch().

--
Bruce Momjian
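For comparison, a trigger with a mandatory configuration name could look like the sketch below. It uses `tsvector_update_trigger()`, the built-in trigger function of released PostgreSQL (8.3 and later), rather than the `tsearch()` prototype discussed in this message. The column name `textsearch_index` is borrowed from Oleg's trigger example; the trigger and index names are hypothetical.

```sql
-- Separate tsvector column kept current by a built-in trigger function.
ALTER TABLE pgweb ADD COLUMN textsearch_index tsvector;

-- Arguments: tsvector column, configuration name, then the text columns.
-- The configuration is named explicitly (and schema-qualified), so every
-- INSERT and UPDATE builds the vector with the same configuration.
CREATE TRIGGER pgweb_tsvector_update
    BEFORE INSERT OR UPDATE ON pgweb
    FOR EACH ROW
    EXECUTE PROCEDURE
        tsvector_update_trigger(textsearch_index, 'pg_catalog.english',
                                title, body);

-- Index the stored column rather than an expression.
CREATE INDEX pgweb_textsearch_idx ON pgweb USING gin (textsearch_index);
```

Because the configuration is fixed in the trigger definition, the stored vectors cannot end up with mixed-configuration entries when the session default changes.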
Re: [HACKERS] Updated tsearch documentation
On Jul 17, 2007, at 16:24 , Bruce Momjian wrote:

> I assume my_filter_name is optional right? I have updated the prototype to be:
>
>   tsearch([vector_column_name], [my_filter_name], text_column_name [, ... ])

Just a style point, but would [filter_name] be better than [my_filter_name]? You're not qualifying the others with my_ ... or is there something you want to tell us, Bruce? :)

Michael Glaesemann
grzm seespotcode net
Re: [HACKERS] Updated tsearch documentation
Michael Glaesemann wrote:
> On Jul 17, 2007, at 16:24 , Bruce Momjian wrote:
>
>> I assume my_filter_name is optional right? I have updated the prototype to be:
>>
>>   tsearch([vector_column_name], [my_filter_name], text_column_name [, ... ])
>
> Just a style point, but would [filter_name] be better than [my_filter_name]? You're not qualifying the others with my_ ... or is there something you want to tell us, Bruce? :)

Agreed. Done.

--
Bruce Momjian
Re: [HACKERS] Updated tsearch documentation
I think the tsearch documentation is nearing completion:

  http://momjian.us/expire/fulltext/HTML/textsearch.html

but I am not happy with how tsearch is enabled in a user table:

  http://momjian.us/expire/fulltext/HTML/textsearch-app-tutorial.html

Aside from the fact that it needs more examples, it only illustrates an example where someone creates a table, populates it, then adds a tsvector column, populates that, then creates an index.

That seems quite inflexible. Is there a way to avoid having a separate tsvector column? What happens if the table is dynamic? How is that column updated based on table changes? Triggers? Where are the examples? Can you create an index like this:

  CREATE INDEX textsearch_id ON pgweb USING gin(to_tsvector(column));

That avoids having to have a separate column because you can just say:

  WHERE to_tsquery('XXX') @@ to_tsvector(column)

How do we make sure that to_tsquery is using the same text search configuration as the 'column' or index? Perhaps we should suggest:

  CREATE INDEX textsearch_idx ON pgweb USING gin(to_tsvector('english',column));

so that at least the configuration is documented in the index.

--
Bruce Momjian
Re: [HACKERS] Updated tsearch documentation
Thanks, I applied this patch and rebuilt the HTML version. I was wondering how I was going to make all the changes accurately. ;-)

---

Nicolas Barbier wrote:
> 2007/7/7, Bruce Momjian [EMAIL PROTECTED]:
>
>> FYI, I have massively reorganized the text search documentation and it is getting closer to something I am happy with:
>>
>>   http://momjian.us/expire/fulltext/HTML/textsearch.html
>
> The following is the result of me proofreading, mainly searching for small mistakes such as spelling/grammatical errors (that means no document structure comments, etc). All corrections are relative to the version of the text at above URL at the time of me reading it :-).
>
> General
>
> It seems to be a recurring problem that commas are not put between the brackets when an argument is optional. For example:
>
>   to_tsvector([conf_name], document TEXT)
>
> I guess this should be
>
>   to_tsvector([conf_name,] document TEXT)
>
> "Full-text" vs. "full text" and "stop-word" vs. "stop word" are not used consistently. Also, capitalization of "full text searching" is not used consistently.

--
Bruce Momjian
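The bracket placement matters because the comma belongs to the optional argument: either both the configuration and its comma are present, or neither is. A short sketch of the two call forms, assuming the function names of released PostgreSQL (8.3 and later) and the built-in 'english' configuration; the sample strings are made up for illustration.

```sql
-- One-argument form: the configuration and its comma are both omitted,
-- and the session default configuration is used instead.
SHOW default_text_search_config;
SELECT to_tsvector('The rats are running');

-- Two-argument form: configuration and comma appear together,
-- which is what the notation to_tsvector([conf_name,] document) expresses.
SELECT to_tsvector('english', 'The rats are running');

-- The same optional leading argument applies to the query functions.
SELECT to_tsquery('english', 'rat & run');
SELECT plainto_tsquery('english', 'running rats');
```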
Re: [HACKERS] Updated tsearch documentation
2007/7/7, Bruce Momjian [EMAIL PROTECTED]:

> FYI, I have massively reorganized the text search documentation and it is getting closer to something I am happy with:
>
>   http://momjian.us/expire/fulltext/HTML/textsearch.html

The following is the result of me proofreading, mainly searching for small mistakes such as spelling/grammatical errors (that means no document structure comments, etc). All corrections are relative to the version of the text at above URL at the time of me reading it :-).

General

It seems to be a recurring problem that commas are not put between the brackets when an argument is optional. For example:

  to_tsvector([conf_name], document TEXT)

I guess this should be

  to_tsvector([conf_name,] document TEXT)

"Full-text" vs. "full text" and "stop-word" vs. "stop word" are not used consistently. Also, capitalization of "full text searching" is not used consistently.

14.1. Introduction

* "indexinging" - "indexing"
* "There is no linguistic support, even in English" - "for" instead of "in"?
* "e.g.satisfies" - add a space before "satisfies"
* "have several thousands derivatives" - should this not use the singular form "thousand"?
* "infinitive form" - is this the right term? I think it only applies to verbs (also occurs in 14.4 and probably others)
* "over how lexemes creation" - not sure what this should be. "are created" maybe?
* "Map synonyms to a single word. ispell." - why is "ispell" a standalone word?
* "so it is natural to introduce a new data type" - this does not sound like documentation
* "Also, full-text search operator @@" - add "the" before "full-text"
* "A document is any text file that can be opened, read, and modified" - "file" sounds as if it should be a file on a filesystem.
* "However, the document file must be uniquely identified in the database." - why?
* "COALESCE" - should be a link
* "during calculation of document rank" - add "the" before "calculation" and before "document"
* "which supports boolean operators, (AND)" - remove the ",". maybe add "the" before "boolean"
* "parenthesis" - "parentheses"
* "Tsquery consists of" - maybe add "A" before "Tsquery"

14.2. Operators And Functions

* "Operators And Functions" (section title) - a non-capital "a" in "and" seems to be more consistent with the rest of the manual
* "TSVECTOR, otherwise false:" - "and false if not" or "and false otherwise" (occurs 3 times in this section)
* "The text should be formatted to match the way a vector is displayed by SELECT." - what a strange definition, I think something like "input format" or so should be used (and defined somewhere, didn't see it yet) (used twice in this section)
* "tsearch([vector_column_name], my_filter_name | text_column_name1 [...], text_column_nameN)" - I do not understand the notation
* "The following rule is used: a function is applied to all subsequent TEXT columns until next matching column occurs." - I don't get it
* "stat([sqlquery text ], [weight text ]) returns SETOF statinfo" - I guess that not both of the arguments are optional?
* "stop-words candidates" - "stop-word candidates"
* "tsvectors are compared with each other using lexicographical ordering." - of the output representation or something else?
* "Accepts querytext, which should be single tokens separated by" - replace "be" with "consist of"
* "and | or, and ! not" - putting parentheses around the "and", "or" and "not" would be more readable. also, a comma is missing before the "|" sign
* "break it onto tokens" - "into" instead of "onto"
* "since GIN indexes do not support negate queries" - something like: "queries with negation" or "negated queries" (depending on what the correct rule is)
* "Arguments to rewrite() function" - "the .. function's" or "to .." (without the "function")
* "can be column names of type tsquery" - "names of columns of type tsquery" (the names are not of type tsquery, the columns are)
* "we can change rewriting rule online" - add "the", possibly use another word for "online" (it is not clear what that means to me)

14.3. Additional Controls

* "Full text searching in PostgreSQL provides function" - add "the"
* "we see the resulting" - maybe "we see that the resulting"
* "does not contain a, on, or it, word rats became rat, and the punctuation sign - was ignored" - "does not contain the words" (or lexemes, or tokens), add "the" before "word rats", add quotes around the "-"
* "on words" - "into words"
* "they are too frequent" - "they occur too frequently" (I think a word cannot be frequent)
* "The Punctuation sign -" - "The punctuation sign -" + put quotes around the "-"
* "which shows all details of full text machinery" - add "the" before "full"
* "is to mark out the different parts of document" - add "a" before "document"
* "by the 1 + logarithm" - "by 1 + the logarithm"
* "i.e., ordering of search results will not change" - add "the" before "ordering", maybe also before "search"
* "note that second example" - add "the" before "second"
* "than ones with labeled with D" - "than ones labeled with D" or "than ones that are labeled with D"
* "Unfortunately, it is almost impossible to avoid since full text indexing in a database should work without indexes" - I don't get it
* "to show part of each document" - add
Re: [HACKERS] Updated tsearch documentation
Oleg Bartunov wrote:
> On Wed, 20 Jun 2007, Bruce Momjian wrote:
>
>>> We need to decide if we need oids as user-visible argument. I don't see any value, probably Teodor think other way.
>>
>> This is a good time to clean up the API because there are going to be user-visible changes anyway.
>
> Bruce, just remove oid argument specification from documentation.

Done. I am attaching the current function prototypes. If they don't match the C code, please let me know.

I have also updated with some minor corrections I received from Erik.

I will be adding more to the documentation hopefully this week:

  http://momjian.us/expire/fulltext/HTML/

--
Bruce Momjian

*** /pgsgml/fulltext-opfunc.sgml	Sat Jun 16 23:30:11 2007
--- fulltext-opfunc.sgml	Mon Jul  2 21:17:15 2007
***************
*** 141,147 ****
  <term>
  <synopsis>
! to_tsvector(<optional><replaceable class="PARAMETER">configuration</replaceable>,</optional> <replaceable class="PARAMETER">document</replaceable> TEXT) returns TSVECTOR
  </synopsis>
  </term>
--- 141,147 ----
  <term>
  <synopsis>
! to_tsvector(<optional><replaceable class="PARAMETER">conf_name</replaceable></optional>, <replaceable class="PARAMETER">document</replaceable> TEXT) returns TSVECTOR
  </synopsis>
  </term>
***************
*** 285,306 ****
  <term>
  <synopsis>
! tsearch(<replaceable class="PARAMETER">vector_column_name</replaceable><optional>, (<replaceable class="PARAMETER">my_filter_name</replaceable> | <replaceable class="PARAMETER">text_column_name1</replaceable>) <optional>...</optional> </optional>, <replaceable class="PARAMETER">text_column_nameN</replaceable>)
  </synopsis>
  </term>
***************
*** 323,329 ****
  <term>
  <synopsis>
! stat(<replaceable class="PARAMETER">sqlquery</replaceable> text <optional>, weight text </optional>) returns SETOF statinfo
  </synopsis>
  <listitem>
--- 322,328 ----
  <term>
  <synopsis>
! stat(<optional><replaceable class="PARAMETER">sqlquery</replaceable> text </optional>, weight text </optional>) returns SETOF statinfo
  </synopsis>
  <listitem>
***************
*** 403,409 ****
  <term>
  <synopsis>
! to_tsquery(<optional><replaceable class="PARAMETER">configuration</replaceable>,</optional> <replaceable class="PARAMETER">querytext</replaceable> text) returns TSQUERY
  </synopsis>
  </term>
--- 402,408 ----
  <term>
  <synopsis>
! to_tsquery(<optional><replaceable class="PARAMETER">conf_name</replaceable></optional>, <replaceable class="PARAMETER">querytext</replaceable> text) returns TSQUERY
  </synopsis>
  </term>
***************
*** 446,452 ****
  <term>
  <synopsis>
! plainto_tsquery(<optional><replaceable class="PARAMETER">configuration</replaceable>,</optional> <replaceable class="PARAMETER">querytext</replaceable> text) returns TSQUERY
  </synopsis>
  </term>
--- 445,451 ----
  <term>
  <synopsis>
! plainto_tsquery(<optional><replaceable class="PARAMETER">conf_name</replaceable></optional>, <replaceable class="PARAMETER">querytext</replaceable> text) returns TSQUERY
  </synopsis>
  </term>
***************
*** 989,995 ****
  <term>
  <synopsis>
! rank(<optional> <replaceable class="PARAMETER">weights</replaceable> float4[], </optional> <replaceable class="PARAMETER">vector</replaceable> TSVECTOR, <replaceable class="PARAMETER">query</replaceable> TSQUERY, <optional> <replaceable class="PARAMETER">normalization</replaceable> int4 </optional>) returns float4
  </synopsis>
  </term>
--- 988,994 ----
  <term>
  <synopsis>
! rank(<optional> <replaceable class="PARAMETER">weights</replaceable> float4[]</optional>, <replaceable class="PARAMETER">vector</replaceable> TSVECTOR, <replaceable class="PARAMETER">query</replaceable> TSQUERY, <optional> <replaceable class="PARAMETER">normalization</replaceable> int4 </optional>) returns float4
  </synopsis>
  </term>
***************
*** 1084,1090 ****
  <term>
  <synopsis>
! headline(<optional> <replaceable class="PARAMETER">id</replaceable> int4, | <replaceable class="PARAMETER">ts_name</replaceable> text, </optional> <replaceable class="PARAMETER">document</replaceable> text, <replaceable class="PARAMETER">query</replaceable> TSQUERY, <optional> <replaceable class="PARAMETER">options</replaceable> text </optional>) returns text
  </synopsis>
  </term>
--- 1083,1089 ----
  <term>
  <synopsis>
! headline(<optional> <replaceable class="PARAMETER">ts_name</replaceable> text</optional>, <replaceable class="PARAMETER">document</replaceable> text, <replaceable class="PARAMETER">query</replaceable> TSQUERY, <optional> <replaceable class="PARAMETER">options</replaceable> text </optional>) returns text
  </synopsis>
  </term>
***************
*** 1351,1357 ****
  <term>
  <synopsis>
! lexize(<optional> <replaceable class="PARAMETER">oid</replaceable>, | <replaceable class="PARAMETER">dict_name</replaceable> text, <replaceable class="PARAMETER">lexeme</replaceable> text) returns text[]
  </synopsis>
  </term>
--- 1350,1356 ----
  <term>
  <synopsis>
! lexize(<optional> <replaceable class="PARAMETER">dict_name</replaceable> text</optional>, <replaceable class="PARAMETER">lexeme</replaceable> text) returns text[]
  </synopsis>
  </term>
Re: [HACKERS] Updated tsearch documentation
On Wed, 20 Jun 2007, Bruce Momjian wrote:
Oleg Bartunov wrote:
On Wed, 20 Jun 2007, Bruce Momjian wrote:

Comments to editorial work of Bruce Momjian.

fulltext-intro.sgml:

  it is useful to have a predefined list of lexemes.

Bruce, here should be list of types of lexemes !

Agreed. Are the list of lexemes parser-specific?

yes, it it parser which defines types of lexemes.

OK, how will users get a list of supported lexemes? Do we need a list per supported parser?

it's documented, see Parser functions for token_type();

postgres=# select * from token_type('default');
 tokid |    alias     |            description
-------+--------------+-----------------------------------
     1 | lword        | Latin word
     2 | nlword       | Non-latin word
     3 | word         | Word
     4 | email        | Email
     5 | url          | URL
     6 | host         | Host
     7 | sfloat       | Scientific notation
     8 | version      | VERSION
     9 | part_hword   | Part of hyphenated word
    10 | nlpart_hword | Non-latin part of hyphenated word
    11 | lpart_hword  | Latin part of hyphenated word
    12 | blank        | Space symbols
    13 | tag          | HTML Tag
    14 | protocol     | Protocol head
    15 | hword        | Hyphenated word
    16 | lhword       | Latin hyphenated word
    17 | nlhword      | Non-latin hyphenated word
    18 | uri          | URI
    19 | file         | File or path name
    20 | float        | Decimal notation
    21 | int          | Signed integer
    22 | uint         | Unsigned integer
    23 | entity       | HTML Entity

  The integer option controls several behaviors which is done using bit-wise fields and <literal>|</literal> (for example, <literal>2|4</literal>): <!-- why so complex? -->

to avoid 2 arguments

But I don't see why you would want to set two of those values --- they seem mutually exclusive, e.g.

  1 divides the rank by the 1 + logarithm of the document length
  2 divides the rank by the length itself

I assume you do either one, not both.

but what's about others variants ?
OK, here is the full list: 0 (the default) ignores document length 1 divides the rank by the 1 + logarithm of the document length 2 divides the rank by the length itself 4 divides the rank by the mean harmonic distance between extents 8 divides the rank by the number of unique words in document 16 divides the rank by 1 + logarithm of the number of unique words in document so which ones would be both enabled? no one ! This is a list of possible values of rank normalization flag, which could be ORed together. =# select rank_cd('1:1,2,3 4:5 6:7', '14',1); rank_cd --- 0.0279055 =# select rank_cd('1:1,2,3 4:5 6:7', '14',1|16); rank_cd --- 0.0139528 What I missed is the definition of extent. From http://www.sai.msu.su/~megera/wiki/NewExtentsBasedRanking Extent is a shortest and non-nested sequence of words, which satisfy a query. I don't understand how that relates to this. because of 4 divides the rank by the mean harmonic distance between extents ^^^ it reflects how dense extents which satisfy query are in document. its replaceableid/replaceable or replaceablets_name/replaceable; !-- n if none is specified that the current configuration is used. I don't understand this question Same issue as above --- why allow a number here when the name works just fine. We don't allow tables to be specified by number, so why configurations? para !-- why? -- Note that the cascade dropping of the functionheadline/function function cause dropping of the literalparser/literal used in fulltext configuration replaceabletsname/replaceable. /para hmm, probably it should be reversed - cascade dropping of the parser cause dropping of the headline function. Agreed. In example below, literalfulltext_idx/literal is a GIN index:!-- why isn't this automatic -- It's explained above. The problem is that current index api doesn't allow to say if search was lossy or exact, so to preserve performance of GIN index we had to introduce @@@ operator, which is the same as @@, but lossy. 
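For readers following the extent discussion above, here is a minimal Python sketch of what "a shortest and non-nested sequence of words, which satisfy a query" could look like in practice. It assumes a plain AND query matched on exact words; the function name `find_extents` and the (start, end) tuple layout are made up for illustration, and the ranking code discussed in this thread may handle details differently.

```python
def find_extents(words, query_terms):
    """Return (start, end) index pairs of the shortest, non-nested spans of
    `words` that contain every term in `query_terms` (AND semantics).

    Illustrative sketch only; not taken from the tsearch sources.
    """
    terms = set(query_terms)
    candidates = []
    for start, word in enumerate(words):
        if word not in terms:
            continue  # an extent has to begin on a query term
        seen = set()
        for end in range(start, len(words)):
            if words[end] in terms:
                seen.add(words[end])
            if seen == terms:
                # shortest span beginning at `start` that satisfies the query
                candidates.append((start, end))
                break
    # Keep only spans that do not contain another candidate span.
    return [
        span for span in candidates
        if not any(
            other != span and span[0] <= other[0] and other[1] <= span[1]
            for other in candidates
        )
    ]


if __name__ == "__main__":
    doc = "a x b a y b".split()
    print(find_extents(doc, {"a", "b"}))
```

The closer together such spans sit in a document, the smaller the distance between extents, which is the quantity flag 4 divides by.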
Well, then we have to fix the API. Telling users to use a different operator based on what index is defined is just bad style. This was raised by Heikki and we discussed it a bit in Ottawa, but it's unclear if it's doable for 8.3. @@@ operator is in rare use, so we could say it will be improved in future versions. Uh, I am wondering if we just have to force heap access in all cases until it is fixed. no-no ! We'll lost performance of GIN index, which isn't lossy and don't need heap access. I don't see what's wrong if we say that some feature doesn't supported by text search operator with GIN index. We need to decide if we need oids as user-visible argument. I don't see any value, probably Teodor think other way. This is a good time
Re: [HACKERS] Updated tsearch documentation
On Wed, 20 Jun 2007, Bruce Momjian wrote:

We need to decide if we need oids as a user-visible argument. I don't see any value; probably Teodor thinks otherwise.

This is a good time to clean up the API because there are going to be user-visible changes anyway.

Bruce, just remove the oid argument specification from the documentation.

Regards, Oleg

Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia
Internet: http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
Re: [HACKERS] Updated tsearch documentation
Oleg Bartunov wrote: On Sun, 17 Jun 2007, Bruce Momjian wrote: I have completed my first pass over the tsearch documentation: http://momjian.us/expire/fulltext/HTML/sql.html They are from section 14 and following. I have come up with a number of questions that I placed in SGML comments in these files: http://momjian.us/expire/fulltext/SGML/ Teodor/Oleg, let me know when you want to go over my questions. Below are my answers (marked as ) OK. Comments to editorial work of Bruce Momjian. fulltext-intro.sgml: it is useful to have a predefined list of lexemes. Bruce, here should be list of types of lexemes ! Agreed. Are the list of lexemes parser-specific? /para/listitem !-- SEEMS UNNECESSARY It useless to attempt normalize typeemail address/type using morphological dictionary of russian language, but looks reasonable to pick out typedomain name/type and be able to search for typedomain name/type. -- I dont' understand where did you get this para :) Uh, it was in the SGML. I have removed it. fulltext-opfunc.sgml: All of the following functions that accept a configuration argument can use either an integer !-- why an integer -- or a textual configuration name to select a configuration. originally it was integer id, probably better use typeoid/type Uh, my question is why are you allowing specification as an integer/oid when the name works just fine. I don't see the value in allowing numbers here. This returns the query used for searching an index. It can be used to test for an empty query. The commandSELECT/ below returns literal'T'/, !-- lowercase? -- which corresponds to an empty query since GIN indexes do not support negate queries (a full index scan is inefficient): capital case. This looks cumbersome, probably querytree() should just return NULL. Agreed. The integer option controls several behaviors which is done using bit-wise fields and literal|/literal (for example, literal2|4/literal): !-- why so complex? 
-- to avoid 2 arguments But I don't see why you would want to set two of those values --- they seem mutually exclusive, e.g. 1 divides the rank by the 1 + logarithm of the document length 2 divides the rank by the length itself I assume you do either one, not both. its replaceableid/replaceable or replaceablets_name/replaceable; !-- n if none is specified that the current configuration is used. I don't understand this question Same issue as above --- why allow a number here when the name works just fine. We don't allow tables to be specified by number, so why configurations? para !-- why? -- Note that the cascade dropping of the functionheadline/function function cause dropping of the literalparser/literal used in fulltext configuration replaceabletsname/replaceable. /para hmm, probably it should be reversed - cascade dropping of the parser cause dropping of the headline function. Agreed. In example below, literalfulltext_idx/literal is a GIN index:!-- why isn't this automatic -- It's explained above. The problem is that current index api doesn't allow to say if search was lossy or exact, so to preserve performance of GIN index we had to introduce @@@ operator, which is the same as @@, but lossy. Well, then we have to fix the API. Telling users to use a different operator based on what index is defined is just bad style. nly the tokenlword/token lexeme, then a acronymTZ/acronym definition like ' one 1:11' will not work since lexeme type tokendigit/token is not assigned to the acronymTZ/acronym. !-- what do these numbers mean? -- /para OK, I changed it to be clearer. nothing special, just numbers for example. functionts_debug/ displays information about every token of replaceable class=PARAMETERdocument/replaceable as produced by the parser and processed by the configured dictionaries using the configuration specified by replaceable class=PARAMETERcfgname/replaceable or replaceable class=PARAMETERoid/replaceable. !-- no need for oid don't understand this comment. 
ts_debug accepts cfgname or its oid

Again, no need for oid.

-- Bruce Momjian http://momjian.us EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Updated tsearch documentation
On Wed, 20 Jun 2007, Bruce Momjian wrote: Oleg Bartunov wrote: On Sun, 17 Jun 2007, Bruce Momjian wrote: I have completed my first pass over the tsearch documentation: http://momjian.us/expire/fulltext/HTML/sql.html They are from section 14 and following. I have come up with a number of questions that I placed in SGML comments in these files: http://momjian.us/expire/fulltext/SGML/ Teodor/Oleg, let me know when you want to go over my questions. Below are my answers (marked as ) OK. Comments to editorial work of Bruce Momjian. fulltext-intro.sgml: it is useful to have a predefined list of lexemes. Bruce, here should be list of types of lexemes ! Agreed. Are the list of lexemes parser-specific? yes, it it parser which defines types of lexemes. fulltext-opfunc.sgml: All of the following functions that accept a configuration argument can use either an integer !-- why an integer -- or a textual configuration name to select a configuration. originally it was integer id, probably better use typeoid/type Uh, my question is why are you allowing specification as an integer/oid when the name works just fine. I don't see the value in allowing numbers here. for compatibility reason. Hmm, indeed, i don't recall where oid's could be important. This returns the query used for searching an index. It can be used to test for an empty query. The commandSELECT/ below returns literal'T'/, !-- lowercase? -- which corresponds to an empty query since GIN indexes do not support negate queries (a full index scan is inefficient): capital case. This looks cumbersome, probably querytree() should just return NULL. Agreed. The integer option controls several behaviors which is done using bit-wise fields and literal|/literal (for example, literal2|4/literal): !-- why so complex? -- to avoid 2 arguments But I don't see why you would want to set two of those values --- they seem mutually exclusive, e.g. 
1 divides the rank by the 1 + logarithm of the document length 2 divides the rank by the length itself I assume you do either one, not both. but what's about others variants ? What I missed is the definition of extent. From http://www.sai.msu.su/~megera/wiki/NewExtentsBasedRanking Extent is a shortest and non-nested sequence of words, which satisfy a query. its replaceableid/replaceable or replaceablets_name/replaceable; !-- n if none is specified that the current configuration is used. I don't understand this question Same issue as above --- why allow a number here when the name works just fine. We don't allow tables to be specified by number, so why configurations? para !-- why? -- Note that the cascade dropping of the functionheadline/function function cause dropping of the literalparser/literal used in fulltext configuration replaceabletsname/replaceable. /para hmm, probably it should be reversed - cascade dropping of the parser cause dropping of the headline function. Agreed. In example below, literalfulltext_idx/literal is a GIN index:!-- why isn't this automatic -- It's explained above. The problem is that current index api doesn't allow to say if search was lossy or exact, so to preserve performance of GIN index we had to introduce @@@ operator, which is the same as @@, but lossy. Well, then we have to fix the API. Telling users to use a different operator based on what index is defined is just bad style. This was raised by Heikki and we discussed it a bit in Ottawa, but it's unclear if it's doable for 8.3. @@@ operator is in rare use, so we could say it will be improved in future versions. nly the tokenlword/token lexeme, then a acronymTZ/acronym definition like ' one 1:11' will not work since lexeme type tokendigit/token is not assigned to the acronymTZ/acronym. !-- what do these numbers mean? -- /para OK, I changed it to be clearer. nothing special, just numbers for example. 
<function>ts_debug</> displays information about every token of <replaceable class="PARAMETER">document</replaceable> as produced by the parser and processed by the configured dictionaries using the configuration specified by <replaceable class="PARAMETER">cfgname</replaceable> or <replaceable class="PARAMETER">oid</replaceable>. <!-- no need for oid

I don't understand this comment. ts_debug accepts cfgname or its oid.

Again, no need for oid.

We need to decide if we need oids as a user-visible argument. I don't see any value; probably Teodor thinks otherwise.

Regards, Oleg
Re: [HACKERS] Updated tsearch documentation
Oleg Bartunov wrote: On Wed, 20 Jun 2007, Bruce Momjian wrote: Comments to editorial work of Bruce Momjian. fulltext-intro.sgml: it is useful to have a predefined list of lexemes. Bruce, here should be list of types of lexemes ! Agreed. Are the list of lexemes parser-specific? yes, it it parser which defines types of lexemes. OK, how will users get a list of supported lexemes? Do we need a list per supported parser? fulltext-opfunc.sgml: All of the following functions that accept a configuration argument can use either an integer !-- why an integer -- or a textual configuration name to select a configuration. originally it was integer id, probably better use typeoid/type Uh, my question is why are you allowing specification as an integer/oid when the name works just fine. I don't see the value in allowing numbers here. for compatibility reason. Hmm, indeed, i don't recall where oid's could be important. Well, if neither of ussee no reason for it, let's remove it. We don't need to support a feature that has no usefulness. This returns the query used for searching an index. It can be used to test for an empty query. The commandSELECT/ below returns literal'T'/, !-- lowercase? -- which corresponds to an empty query since GIN indexes do not support negate queries (a full index scan is inefficient): capital case. This looks cumbersome, probably querytree() should just return NULL. Agreed. The integer option controls several behaviors which is done using bit-wise fields and literal|/literal (for example, literal2|4/literal): !-- why so complex? -- to avoid 2 arguments But I don't see why you would want to set two of those values --- they seem mutually exclusive, e.g. 1 divides the rank by the 1 + logarithm of the document length 2 divides the rank by the length itself I assume you do either one, not both. but what's about others variants ? 
OK, here is the full list: 0 (the default) ignores document length 1 divides the rank by the 1 + logarithm of the document length 2 divides the rank by the length itself 4 divides the rank by the mean harmonic distance between extents 8 divides the rank by the number of unique words in document 16 divides the rank by 1 + logarithm of the number of unique words in document so which ones would be both enabled? What I missed is the definition of extent. From http://www.sai.msu.su/~megera/wiki/NewExtentsBasedRanking Extent is a shortest and non-nested sequence of words, which satisfy a query. I don't understand how that relates to this. its replaceableid/replaceable or replaceablets_name/replaceable; !-- n if none is specified that the current configuration is used. I don't understand this question Same issue as above --- why allow a number here when the name works just fine. We don't allow tables to be specified by number, so why configurations? para !-- why? -- Note that the cascade dropping of the functionheadline/function function cause dropping of the literalparser/literal used in fulltext configuration replaceabletsname/replaceable. /para hmm, probably it should be reversed - cascade dropping of the parser cause dropping of the headline function. Agreed. In example below, literalfulltext_idx/literal is a GIN index:!-- why isn't this automatic -- It's explained above. The problem is that current index api doesn't allow to say if search was lossy or exact, so to preserve performance of GIN index we had to introduce @@@ operator, which is the same as @@, but lossy. Well, then we have to fix the API. Telling users to use a different operator based on what index is defined is just bad style. This was raised by Heikki and we discussed it a bit in Ottawa, but it's unclear if it's doable for 8.3. @@@ operator is in rare use, so we could say it will be improved in future versions. Uh, I am wondering if we just have to force heap access in all cases until it is fixed. 
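On the normalization flags listed earlier in this message: since each value is a distinct power of two, several of them can be set at once and tested independently. The Python sketch below follows only the wording of the list ("1 + logarithm", natural log assumed), not the actual C implementation, so the numbers it produces should not be expected to match rank_cd output. The constant names and the `normalize` helper are hypothetical.

```python
import math

# Flag values as given in the list above; the names are invented here.
NORM_NONE = 0          # ignore document length (default)
NORM_LOG_LENGTH = 1    # divide by 1 + log(document length)
NORM_LENGTH = 2        # divide by document length
NORM_EXTENT_DIST = 4   # divide by mean harmonic distance between extents
NORM_UNIQUE = 8        # divide by number of unique words
NORM_LOG_UNIQUE = 16   # divide by 1 + log(number of unique words)


def normalize(rank, flags, doc_length, unique_words, extent_distance=1.0):
    """Apply every divisor whose bit is set in `flags`.

    Assumes doc_length >= 1 and unique_words >= 1; the natural logarithm
    is an assumption, the thread does not state the base.
    """
    if flags & NORM_LOG_LENGTH:
        rank /= 1 + math.log(doc_length)
    if flags & NORM_LENGTH:
        rank /= doc_length
    if flags & NORM_EXTENT_DIST:
        rank /= extent_distance
    if flags & NORM_UNIQUE:
        rank /= unique_words
    if flags & NORM_LOG_UNIQUE:
        rank /= 1 + math.log(unique_words)
    return rank


if __name__ == "__main__":
    raw = 1.0
    # One integer argument carries two independent choices.
    combined = NORM_LOG_LENGTH | NORM_LOG_UNIQUE
    print(combined, normalize(raw, combined, doc_length=5, unique_words=3))
```

Packing the choices into one integer is what lets the SQL function take a single option argument instead of one boolean per normalization.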
nly the tokenlword/token lexeme, then a acronymTZ/acronym definition like ' one 1:11' will not work since lexeme type tokendigit/token is not assigned to the acronymTZ/acronym. !-- what do these numbers mean? -- /para OK, I changed it to be clearer. nothing special, just numbers for example. functionts_debug/ displays information about every token of replaceable class=PARAMETERdocument/replaceable as produced by the parser and processed by the configured dictionaries using the configuration specified by replaceable class=PARAMETERcfgname/replaceable or replaceable class=PARAMETERoid/replaceable. !-- no need for oid don't understand this comment. ts_debug accepts cfgname or its oid Again, no need for oid. We need to decide if we need oids as user-visible argument. I don't see any value, probably Teodor think
Re: [HACKERS] Updated tsearch documentation
On Sun, 17 Jun 2007, Bruce Momjian wrote:

I have completed my first pass over the tsearch documentation:

http://momjian.us/expire/fulltext/HTML/sql.html

They are from section 14 and following. I have come up with a number of questions that I placed in SGML comments in these files:

http://momjian.us/expire/fulltext/SGML/

Teodor/Oleg, let me know when you want to go over my questions.

Below are my answers (marked as )

Comments to editorial work of Bruce Momjian.

fulltext-intro.sgml:

it is useful to have a predefined list of lexemes.

Bruce, here should be a list of types of lexemes!

</para></listitem> <!-- SEEMS UNNECESSARY It useless to attempt normalize <type>email address</type> using morphological dictionary of russian language, but looks reasonable to pick out <type>domain name</type> and be able to search for <type>domain name</type>. -->

I don't understand where you got this para :)

fulltext-opfunc.sgml:

All of the following functions that accept a configuration argument can use either an integer <!-- why an integer --> or a textual configuration name to select a configuration.

Originally it was an integer id; probably better to use <type>oid</type>.

This returns the query used for searching an index. It can be used to test for an empty query. The <command>SELECT</> below returns <literal>'T'</>, <!-- lowercase? --> which corresponds to an empty query since GIN indexes do not support negate queries (a full index scan is inefficient):

Capital case. This looks cumbersome; probably querytree() should just return NULL.

The integer option controls several behaviors which is done using bit-wise fields and <literal>|</literal> (for example, <literal>2|4</literal>): <!-- why so complex? -->

to avoid 2 arguments

its <replaceable>id</replaceable> or <replaceable>ts_name</replaceable>; <!-- n if none is specified that the current configuration is used.

I don't understand this question

<para> <!-- why? --> Note that the cascade dropping of the <function>headline</function> function cause dropping of the <literal>parser</literal> used in fulltext configuration <replaceable>tsname</replaceable>. </para>

Hmm, probably it should be reversed: cascade dropping of the parser causes dropping of the headline function.

In example below, <literal>fulltext_idx</literal> is a GIN index: <!-- why isn't this automatic -->

It's explained above. The problem is that the current index API doesn't allow us to say whether a search was lossy or exact, so to preserve the performance of the GIN index we had to introduce the @@@ operator, which is the same as @@, but lossy.

nly the <token>lword</token> lexeme, then a <acronym>TZ</acronym> definition like ' one 1:11' will not work since lexeme type <token>digit</token> is not assigned to the <acronym>TZ</acronym>. <!-- what do these numbers mean? --> </para>

Nothing special, just numbers for example.

<function>ts_debug</> displays information about every token of <replaceable class="PARAMETER">document</replaceable> as produced by the parser and processed by the configured dictionaries using the configuration specified by <replaceable class="PARAMETER">cfgname</replaceable> or <replaceable class="PARAMETER">oid</replaceable>. <!-- no need for oid

I don't understand this comment. ts_debug accepts cfgname or its oid.

Regards, Oleg