Re: [HACKERS] Updated tsearch documentation

2007-07-27 Thread Oleg Bartunov

On Thu, 26 Jul 2007, Bruce Momjian wrote:


Oleg Bartunov wrote:

Bruce,

I sent you a link to my wiki page with a summary of the changes:
http://www.sai.msu.su/~megera/wiki/ts_changes

Your documentation looks rather old.


I have updated it to reflect your changes:

http://momjian.us/expire/fulltext/HTML/textsearch-tables.html



Bruce, I noticed that you missed many changes. For example, the options
for the stemmer have changed (this is documented in my ts_changes), so in
http://momjian.us/expire/fulltext/HTML/textsearch-tables.html#TEXTSEARCH-TABLES-CONFIGURATION


ALTER TEXT SEARCH DICTIONARY en_stem SET OPTION 'english-utf8.stop';

should be


ALTER TEXT SEARCH DICTIONARY en_stem SET OPTION 
'StopFile=english-utf8.stop, Language=english';



Also, this is wrong

DROP TEXT SEARCH CONFIGURATION MAPPING ON pg FOR email, url, sfloat, uri, float;

it should be

ALTER TEXT SEARCH CONFIGURATION pg DROP MAPPING FOR email, url, sfloat, uri, 
float;
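
For comparison with the interim syntax quoted in this message, here is a minimal sketch of how the same two changes look in the documentation of PostgreSQL 8.3 as released. The names public.en_stem and public.pg are illustrative assumptions, and the released token type is url_path rather than uri:

```sql
-- Illustrative names: public.en_stem and public.pg are assumptions.
-- Released 8.3 passes dictionary options as a parenthesized list.
CREATE TEXT SEARCH DICTIONARY public.en_stem (
    TEMPLATE = snowball,
    Language = english,
    StopWords = english
);

-- Start from the built-in English configuration, then drop the mappings
-- for token types that should not be indexed.
CREATE TEXT SEARCH CONFIGURATION public.pg ( COPY = pg_catalog.english );

ALTER TEXT SEARCH CONFIGURATION public.pg
    DROP MAPPING FOR email, url, url_path, sfloat, float;
```

The commands in the message itself reflect the development tree of July 2007 and may not be accepted by a released server.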

A configuration no longer has a DEFAULT flag, so \dF should not display 'Y':


=# \dF
pg_catalog | russian  | Y
public | pg   | Y


This is what I see now:

postgres=# \dF public.*
List of fulltext configurations
 Schema | Name | Description
--------+------+-------------
 public | pg   |





Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

   http://www.postgresql.org/about/donate


Re: [HACKERS] Updated tsearch documentation

2007-07-27 Thread Oleg Bartunov

On Thu, 26 Jul 2007, Bruce Momjian wrote:


Oleg Bartunov wrote:

On Wed, 25 Jul 2007, Erikjan wrote:


In
http://momjian.us/expire/fulltext/HTML/textsearch-intro.html#TEXTSEARCH-DOCUMENT

it says:

A document is any text file that can be opened, read, and modified.


Oops, in my original documentation it was:
A document, in the usual meaning, is a text file that one can open, read, and
modify.
I stress that in a database, a document is something different.

http://www.sai.msu.su/~megera/postgres/fts/doc/fts-whatdb.html


I have updated the documentation:


http://momjian.us/expire/fulltext/HTML/textsearch-intro.html#TEXTSEARCH-DOCUMENT



Is it worth referencing OpenFTS, which is used for indexing a file system?


Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83



Re: [HACKERS] Updated tsearch documentation

2007-07-27 Thread Bruce Momjian
Oleg Bartunov wrote:
 Is it worth referencing OpenFTS, which is used for indexing a file system?

Uh, not sure.  I don't think so but we can add a URL to it if you can
find the right place.

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [HACKERS] Updated tsearch documentation

2007-07-27 Thread Bruce Momjian

Thanks, I found a few more places that needed updating.  It should be
accurate now.  Thanks for the report.



Re: [HACKERS] Updated tsearch documentation

2007-07-27 Thread Bruce Momjian
Dimitri Fontaine wrote:
 Hi,
 
 On Wednesday, 25 July 2007, Bruce Momjian wrote:
  I have added more documentation to try to show how full text search is
  used by user tables.  I think the documentation is almost done:
 
  http://momjian.us/expire/fulltext/HTML/textsearch-tables.html
 
 I've come to understand that GIN indexes are far more costly to update than
 GiST ones, and Oleg's wiki advises users to partition data and use a GiST
 index for the live part and a GIN index for the archive part only.
 
 Is it worth mentioning this in this part of the documentation?
 If it is mentioned here, the partitioning step could certainly be part of
 the example, or be left as a user exercise, but then the example should
 explain why GIN is a good choice for it.

Partitioning is already in the documentation:

Partitioning of big collections and the proper use of GiST and GIN
indexes allows the implementation of very fast searches with online
update. Partitioning can be done at the database level using table
inheritance and constraint_exclusion, or distributing
documents over servers and collecting search results using the
contrib/dblink extension module. The latter is possible
because ranking functions use only local information.

I don't see a reason to provide an example beyond the existing examples
of how to do partitioning.



Re: [HACKERS] Updated tsearch documentation

2007-07-26 Thread Bruce Momjian
Oleg Bartunov wrote:
 On Wed, 25 Jul 2007, Erikjan wrote:
 
  In
  http://momjian.us/expire/fulltext/HTML/textsearch-intro.html#TEXTSEARCH-DOCUMENT
 
  it says:
 
  A document is any text file that can be opened, read, and modified.
 
 Oops, in my original documentation it was:
 A document, in the usual meaning, is a text file that one can open, read, and
 modify.
 I stress that in a database, a document is something different.
 
 http://www.sai.msu.su/~megera/postgres/fts/doc/fts-whatdb.html

I have updated the documentation:


http://momjian.us/expire/fulltext/HTML/textsearch-intro.html#TEXTSEARCH-DOCUMENT



Re: [HACKERS] Updated tsearch documentation

2007-07-26 Thread Bruce Momjian
Oleg Bartunov wrote:
 Bruce,
 
 I sent you a link to my wiki page with a summary of the changes:
 http://www.sai.msu.su/~megera/wiki/ts_changes
 
 Your documentation looks rather old.

I have updated it to reflect your changes:

http://momjian.us/expire/fulltext/HTML/textsearch-tables.html



Re: [HACKERS] Updated tsearch documentation

2007-07-25 Thread Dimitri Fontaine
Hi,

On Wednesday, 25 July 2007, Bruce Momjian wrote:
 I have added more documentation to try to show how full text search is
 used by user tables.  I think the documentation is almost done:

   http://momjian.us/expire/fulltext/HTML/textsearch-tables.html

I've come to understand that GIN indexes are far more costly to update than
GiST ones, and Oleg's wiki advises users to partition data and use a GiST
index for the live part and a GIN index for the archive part only.

Is it worth mentioning this in this part of the documentation?
If it is mentioned here, the partitioning step could certainly be part of
the example, or be left as a user exercise, but then the example should
explain why GIN is a good choice for it.
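
The live/archive split described above could be sketched roughly as follows, using table inheritance and constraint exclusion. This is only an illustration: the table docs, its columns, the child table names and the cutoff date are all hypothetical, and the tsv column is assumed to be filled by the application or a trigger.

```sql
-- Hypothetical schema: a parent table and two children split by date.
CREATE TABLE docs (
    id   integer,
    dlm  date,
    body text,
    tsv  tsvector
);

CREATE TABLE docs_archive (CHECK (dlm <  DATE '2007-07-01')) INHERITS (docs);
CREATE TABLE docs_live    (CHECK (dlm >= DATE '2007-07-01')) INHERITS (docs);

-- GIN on the rarely updated archive, GiST on the frequently updated part.
CREATE INDEX docs_archive_tsv_idx ON docs_archive USING gin  (tsv);
CREATE INDEX docs_live_tsv_idx    ON docs_live    USING gist (tsv);

-- Let the planner skip children whose CHECK constraint rules them out.
SET constraint_exclusion = on;

SELECT id
FROM docs
WHERE dlm >= DATE '2007-07-01'
  AND tsv @@ to_tsquery('english', 'create & table');
```

Rows would have to be moved from the live table to the archive table periodically, which this sketch leaves out.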

Hope this helps, regards,
-- 
dim




Re: [HACKERS] Updated tsearch documentation

2007-07-25 Thread Erikjan
In
http://momjian.us/expire/fulltext/HTML/textsearch-intro.html#TEXTSEARCH-DOCUMENT

it says:

A document is any text file that can be opened, read, and modified.

Is this an OpenFTS docs relic? tsearch2 is not meant to be reading
out-of-database *files*, or is it?

If it is actually the case that the present tsearch2 implementation (for
8.3) is going to be able to store pointers into external files, maybe this
should be made more explicit?


Oh, and another little derussification (Russians don't seem to like
articles, be they definite or indefinite):
"is seen as different function" should be "is seen as a different function".


Thanks,

Erik Rijkers











Re: [HACKERS] Updated tsearch documentation

2007-07-25 Thread Oleg Bartunov

On Wed, 25 Jul 2007, Erikjan wrote:


In
http://momjian.us/expire/fulltext/HTML/textsearch-intro.html#TEXTSEARCH-DOCUMENT

it says:

A document is any text file that can be opened, read, and modified.


Oops, in my original documentation it was:
A document, in the usual meaning, is a text file that one can open, read, and
modify.
I stress that in a database, a document is something different.

http://www.sai.msu.su/~megera/postgres/fts/doc/fts-whatdb.html






Re: [HACKERS] Updated tsearch documentation

2007-07-24 Thread Bruce Momjian

I have added more documentation to try to show how full text search is
used by user tables.  I think the documentation is almost done:

http://momjian.us/expire/fulltext/HTML/textsearch-tables.html

---

Oleg Bartunov wrote:
 On Wed, 18 Jul 2007, Bruce Momjian wrote:
 
  Oleg, Teodor,
 
  I am confused by the following example.  How does gin know to create a
  tsvector, or does it?  Does gist know too?
 
 No, GiST doesn't know. I don't remember why. Teodor?
 
 For GIN see http://archives.postgresql.org/pgsql-hackers/2007-05/msg00625.php
 for discussion
 
 
  FYI, at some point we need to chat via instant messenger or IRC to
  discuss the open items.  My chat information is here:
 
  http://momjian.us/main/contact.html
 
 I sent you an invitation for Google Talk; I only use the chat in Gmail.
 My Gmail account is [EMAIL PROTECTED]
 
 
  ---
 
  SELECT title
  FROM pgweb
  WHERE textcat(title,body) @@ plainto_tsquery('create table')
  ORDER BY dlm DESC LIMIT 10;
 
  CREATE INDEX pgweb_idx ON pgweb USING gin(textcat(title,body));
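
The example above leaves the text-to-tsvector conversion implicit, which is what the question is about. A more explicit form, sketched here on the assumption that pgweb has text columns title and body and a timestamp column dlm, spells out to_tsvector() with a configuration in both the index and the query:

```sql
-- Expression index: the two-argument to_tsvector() is used because only the
-- form with an explicit configuration can appear in an index expression.
CREATE INDEX pgweb_idx ON pgweb
    USING gin (to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, '')));

-- The WHERE clause repeats the indexed expression exactly, so the planner
-- can consider the index.
SELECT title
FROM pgweb
WHERE to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, ''))
      @@ plainto_tsquery('english', 'create table')
ORDER BY dlm DESC
LIMIT 10;
```

The coalesce() calls are an addition for safety: without them a NULL title or body would make the whole concatenation NULL.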
 
 
 


Re: [HACKERS] Updated tsearch documentation

2007-07-24 Thread Oleg Bartunov

Bruce,

I sent you a link to my wiki page with a summary of the changes:
http://www.sai.msu.su/~megera/wiki/ts_changes

Your documentation looks rather old.

Oleg


Re: [HACKERS] Updated tsearch documentation

2007-07-18 Thread Oleg Bartunov

On Tue, 17 Jul 2007, Bruce Momjian wrote:


Oleg Bartunov wrote:

On Tue, 17 Jul 2007, Bruce Momjian wrote:


I think the tsearch documentation is nearing completion:

http://momjian.us/expire/fulltext/HTML/textsearch.html

but I am not happy with how tsearch is enabled in a user table:

http://momjian.us/expire/fulltext/HTML/textsearch-app-tutorial.html

Aside from the fact that it needs more examples, it only illustrates an
example where someone creates a table, populates it, then adds a
tsvector column, populates that, then creates an index.

That seems quite inflexible.  Is there a way to avoid having a separate
tsvector column?  What happens if the table is dynamic?  How is that
column updated based on table changes?  Triggers?  Where are the
examples?  Can you create an index like this:


I agree that there could be more examples, but text search doesn't
require anything special!
An *example* of a trigger function is documented at
http://momjian.us/expire/fulltext/HTML/textsearch-opfunc.html


Yes, I see that in tsearch() here:

http://momjian.us/expire/fulltext/HTML/textsearch-opfunc.html#TEXTSEARC$

I assume my_filter_name is optional right?  I have updated the prototype
to be:

tsearch([vector_column_name], [my_filter_name], text_column_name [, ... ])

Is this accurate?  What does this text below it mean?


No, this is inaccurate. First, vector_column_name is not an optional argument;
it is the name of the tsvector column.



There can be many functions and text columns specified in a tsearch()
trigger. The following rule is used: a function is applied to all
subsequent TEXT columns until the next matching column occurs.


The idea is to let the user preprocess text before the tsearch machinery
is applied. my_filter_name() preprocesses text_column_name1,
text_column_name2, and so on.
The original syntax allows a preprocessing function to be specified for
each text column.


So I suggest keeping the original syntax and changing 'vector_column_name' to
'tsvector_column_name'.



Why are we allowing my_filter_name here?  Isn't that something for a
custom trigger.  Is calling it tsearch() a good idea?  Why not
tsvector_trigger().


I don't see any benefit in the tsvector_trigger() name. If you want to add
some semantics, then tsvector_update_trigger() would be better.  Anyway,
this trigger is an illustration.




CREATE INDEX textsearch_id ON pgweb USING gin(to_tsvector(column));

That avoids having to have a separate column because you can just say:

WHERE to_query('XXX') @@ to_tsvector(column)


Yes, it's possible, but without ranking, since it is currently impossible
to store any information in the index (that is a PostgreSQL characteristic).
By the way, this should also work for a GiST index.


What if they use @@@.  Wouldn't that work because it is going to check
the heap?


It would work; it would recalculate to_tsvector(column) for the rows found
(for GiST, to remove false hits and to get weight information; for
GIN, to get weight information only).





That kind of search is useful if there is  another natural ordering of search
results, for example, by timestamp.



How do we make sure that the to_query is using the same text search
configuration as the 'column' or index?  Perhaps we should suggest:


Please keep in mind that it's not mandatory to use the same configuration
at search time that was used at index creation.


Well, sort of.  If you have stop words in the tsquery configuration, you
aren't going to hit any matches in the tsvector, right?  Same for
synonyms, I suppose.  I can see that stemming would work if there was a
mismatch between tsquery and tsvector.
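
One way to explore the question raised above is to compare what two configurations produce for the same text. This is a minimal sketch that assumes the built-in 'english' and 'simple' configurations of a released PostgreSQL server; it is not taken from the thread:

```sql
-- 'english' drops the stop word 'the' and stems 'cats' to 'cat'.
SELECT to_tsvector('english', 'the cats');

-- 'simple' keeps every word as written.
SELECT to_tsvector('simple', 'the cats');

-- Same configuration on both sides: the query lexeme is also 'cat'.
SELECT to_tsvector('english', 'the cats') @@ to_tsquery('english', 'cats');

-- Mixed configurations: the query lexeme stays 'cats', which is not the
-- lexeme stored in the tsvector.
SELECT to_tsvector('english', 'the cats') @@ to_tsquery('simple', 'cats');
```

Running the four statements side by side shows which kinds of mismatch still match and which do not.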


 CREATE INDEX textsearch_idx ON pgweb USING gin(to_tsvector('english',column));

so that at least the configuration is documented in the index.


Yes, it's better to always specify the configuration name explicitly and not
rely on the default configuration.
Unfortunately, the configuration name isn't saved in the index.


As Teodor corrected me, the index doesn't know about the configuration at all!
What a careful user could do is record the configuration name in the
comment for the tsvector column. The configuration name belongs to the
to_tsvector() function.

In principle, a tsvector, like any data type, could be obtained in other ways;
for example, OpenFTS constructs a tsvector following its own rules.



I was more concerned that there is nothing documenting the configuration
used by the index or the tsvector table column trigger.  By doing:


Again, the index has nothing to do with the configuration name.
Our trigger function is an example that uses the default configuration name.
A user could easily write their own trigger to keep the tsvector column up to
date and take the configuration name as a parameter.
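
A user-written trigger of the kind described above might look like the following sketch. The function name, the trigger name and the tsvector column textsearch are hypothetical, and pgweb is assumed to have text columns title and body:

```sql
-- Hypothetical: pgweb has text columns title and body and a tsvector
-- column named textsearch.
CREATE FUNCTION pgweb_tsv_update() RETURNS trigger AS $$
BEGIN
    -- TG_ARGV[0] carries the configuration name given in CREATE TRIGGER.
    NEW.textsearch :=
        to_tsvector(TG_ARGV[0]::regconfig,
                    coalesce(NEW.title, '') || ' ' || coalesce(NEW.body, ''));
    RETURN NEW;
END
$$ LANGUAGE plpgsql;

CREATE TRIGGER pgweb_tsv_trigger
    BEFORE INSERT OR UPDATE ON pgweb
    FOR EACH ROW EXECUTE PROCEDURE pgweb_tsv_update('pg_catalog.english');
```

Passing the configuration as a trigger argument keeps it visible in the trigger definition, which addresses the concern that nothing documents the configuration in use.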




CREATE INDEX textsearch_idx ON pgweb USING gin(to_tsvector('english',column));

you guarantee that the index uses 'english' for all its entries.  If you
omit the 'english' or use a different configuration, it will heap scan
the 

Re: [HACKERS] Updated tsearch documentation

2007-07-18 Thread Bruce Momjian
Oleg Bartunov wrote:
  I agree that there could be more examples, but text search doesn't
  require anything special!
  An *example* of a trigger function is documented at
  http://momjian.us/expire/fulltext/HTML/textsearch-opfunc.html
 
  Yes, I see that in tsearch() here:
 
  http://momjian.us/expire/fulltext/HTML/textsearch-opfunc.html#TEXTSEARC$
 
  I assume my_filter_name is optional right?  I have updated the prototype
  to be:
 
  tsearch([vector_column_name], [my_filter_name], text_column_name [, ... ])
 
  Is this accurate?  What does this text below it mean?
 
 No, this is inaccurate. First, vector_column_name is not an optional argument;
 it is the name of the tsvector column.

Fixed.

  There can be many functions and text columns specified in a tsearch()
  trigger. The following rule is used: a function is applied to all
  subsequent TEXT columns until the next matching column occurs.
 
 The idea is to let the user preprocess text before the tsearch machinery
 is applied. my_filter_name() preprocesses text_column_name1,
 text_column_name2, and so on.
 The original syntax allows a preprocessing function to be specified for
 each text column.
 
 So I suggest keeping the original syntax and changing 'vector_column_name' to
 'tsvector_column_name'.

OK, change made.

  Why are we allowing my_filter_name here?  Isn't that something for a
  custom trigger.  Is calling it tsearch() a good idea?  Why not
  tsvector_trigger().
 
 I don't see any benefit in the tsvector_trigger() name. If you want to add
 some semantics, then tsvector_update_trigger() would be better.  Anyway,
 this trigger is an illustration.

Well, the filter that removes '@' might be an example, but tsearch() is
indeed sort of built-in trigger function to be used for simple cases. 
My point is that because it is only for simple cases, why add complexity
and allow a filter?  It seems best to just remove the filter idea and
let people write their own triggers if they want that functionality.

CREATE INDEX textsearch_id ON pgweb USING gin(to_tsvector(column));
 
  That avoids having to have a separate column because you can just say:
 
WHERE to_query('XXX') @@ to_tsvector(column)
 
  Yes, it's possible, but without ranking, since it is currently impossible
  to store any information in the index (that is a PostgreSQL characteristic).
  By the way, this should also work for a GiST index.
 
  What if they use @@@.  Wouldn't that work because it is going to check
  the heap?
 
 It would work; it would recalculate to_tsvector(column) for the rows found
 (for GiST, to remove false hits and to get weight information; for
 GIN, to get weight information only).

Right.  Currently to use text search on a table, you have to do three
things:

o  add a tsvector column to the table
o  add a trigger to keep the tsvector column current
o  add an index to the tsvector column

My question is why bother with the first two steps?  If you do:

 CREATE INDEX textsearch_idx ON pgweb USING gist(to_tsvector('english',column));

you don't need a separate column and a trigger to keep it current.  The
index is kept current as part of normal query processing.  The only
downside is that you have to do to_tsvector() in the heap to avoid false
hits, but that seems minor compared to the disk savings of not having
the separate column.  Is to_tsvector() an expensive function?
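
For reference, the three steps listed above could be sketched as follows. This assumes the trigger function as it was eventually released, tsvector_update_trigger(), rather than the tsearch() function discussed in this thread; the column name textsearch and the trigger and index names are hypothetical:

```sql
-- 1. Add a tsvector column (hypothetical name) and fill it for existing rows.
ALTER TABLE pgweb ADD COLUMN textsearch tsvector;
UPDATE pgweb
SET textsearch = to_tsvector('english',
                             coalesce(title, '') || ' ' || coalesce(body, ''));

-- 2. Keep the column current with the built-in trigger function.
CREATE TRIGGER pgweb_tsvectorupdate
    BEFORE INSERT OR UPDATE ON pgweb
    FOR EACH ROW EXECUTE PROCEDURE
    tsvector_update_trigger(textsearch, 'pg_catalog.english', title, body);

-- 3. Index the column.
CREATE INDEX pgweb_textsearch_idx ON pgweb USING gin (textsearch);
```

The expression-index alternative proposed in the message replaces all three steps with the single CREATE INDEX statement, at the cost of recomputing to_tsvector() whenever heap rows have to be rechecked or ranked.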

   CREATE INDEX textsearch_idx ON pgweb USING gin(to_tsvector('english',column));
 
  so that at least the configuration is documented in the index.
 
  Yes, it's better to always specify the configuration name explicitly and not
  rely on the default configuration.
  Unfortunately, the configuration name isn't saved in the index.
 
 As Teodor corrected me, the index doesn't know about the configuration at all!
 What a careful user could do is record the configuration name in the
 comment for the tsvector column. The configuration name belongs to the
 to_tsvector() function.

Well, if you create the index with the configuration name it is
guaranteed to match:

 CREATE INDEX textsearch_idx ON pgweb USING gist(to_tsvector('english',column));
And if someone does:

WHERE 'friend'::tsquery @@ to_tsvector('english',column)

the index is used.  Now if the default configuration is 'english' and
they use:

WHERE 'friend'::tsquery @@ to_tsvector(column)

the index is not used, but this is just a good example of why default
configurations aren't that useful.  One problem I see is that if the
default configuration is not 'english', then when the index consults the
heap, it would be using a different configuration and yield incorrect
results.  I am unsure how to fix that.

With the trigger idea, you have to be sure your configuration is the same
every time you INSERT/UPDATE the table or the index will have mixed
configuration entries and it will yield incorrect results, aside from
the heap configuration lookup not matching the index.

Once 

Re: [HACKERS] Updated tsearch documentation

2007-07-18 Thread Oleg Bartunov

On Wed, 18 Jul 2007, Bruce Momjian wrote:




Why are we allowing my_filter_name here?  Isn't that something for a
custom trigger.  Is calling it tsearch() a good idea?  Why not
tsvector_trigger().


I don't see any benefit in the tsvector_trigger() name. If you want to add
some semantics, then tsvector_update_trigger() would be better.  Anyway,
this trigger is an illustration.


Well, the filter that removes '@' might be an example, but tsearch() is
indeed sort of built-in trigger function to be used for simple cases.
My point is that because it is only for simple cases, why add complexity
and allow a filter?  It seems best to just remove the filter idea and
let people write their own triggers if they want that functionality.


If you are concerned about documentation simplicity, then we could just document
two versions:

1. without filter function - simple, well understood syntax
2. with filter function - for advanced users

I don't want to remove a feature that has worked for years without any problem.





CREATE INDEX textsearch_id ON pgweb USING gin(to_tsvector(column));

That avoids having to have a separate column because you can just say:

WHERE to_query('XXX') @@ to_tsvector(column)


yes, it's possible, but without ranking, since currently it's impossible
to store any information in the index (that's how pg works). btw, this should
work for a GiST index too.


What if they use @@@.  Wouldn't that work because it is going to check
the heap?


It would work, it'd recalculate to_tsvector(column) for rows found
( for GiST - to remove false hits and for weight information, for
GIN - for weight information only).


Right.  Currently to use text search on a table, you have to do three
things:

o  add a tsvector column to the table
o  add a trigger to keep the tsvector column current
o  add an index to the tsvector column

My question is why bother with the first two steps?  If you do:

CREATE INDEX textsearch_idx ON pgweb USING gist(to_tsvector('english',column));

you don't need a separate column and a trigger to keep it current.  The
index is kept current as part of normal query processing.  The only
downside is that you have to do to_tsvector() in the heap to avoid false
hits, but that seems minor compared to the disk savings of not having
the separate column.  Is to_tsvector() an expensive function?


Bruce, you oversimplify text search. The document could be fully virtual,
not a column (or columns); it could be the result of any SQL command, so it could be
very expensive just to obtain the document. And yes, to_tsvector could be
very expensive, depending on the document size, parser and dictionaries used.

And, again, the current postgres architecture forces us to use the heap to store
positional and weight information for ranking.

The use case for what you described is very limited - simple text search
on one or several columns of the same table, without ranking.
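
As an illustration of the ranking case, the sketch below assumes pgweb has a stored tsvector column named textsearch_index (the name used in the trigger example elsewhere in this thread) that already holds positions and weights. It uses ts_rank(), which is the name in released PostgreSQL; the draft documentation discussed in this thread calls the function rank().

```sql
-- Ranking needs positions and weights, which live in the stored tsvector
-- column in the heap, not in the index.
SELECT title,
       ts_rank(textsearch_index, query) AS score
FROM pgweb,
     to_tsquery('english', 'friend') AS query
WHERE textsearch_index @@ query
ORDER BY score DESC
LIMIT 10;
```

With an expression index alone, the same ordering would require recomputing to_tsvector() for every matching row.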




 CREATE INDEX textsearch_idx ON pgweb USING gin(to_tsvector('english',column));

so that at least the configuration is documented in the index.


yes, it's better to always explicitly specify the configuration name and not
rely on the default configuration.
Unfortunately, the configuration name isn't saved in the index.


As Teodor corrected me, the index doesn't know about the configuration at all!
What a careful user could do is put the configuration name in the
comment for the tsvector column. The configuration name belongs to the
to_tsvector() function.


Well, if you create the index with the configuration name it is
guaranteed to match:

CREATE INDEX textsearch_idx ON pgweb USING gist(to_tsvector('english',column));
 ---
And if someone does:

WHERE 'friend'::tsquery @@ to_tsvector('english',column)

the index is used.  Now if the default configuration is 'english' and
they use:

WHERE 'friend'::tsquery @@ to_tsvector(column)

the index is not used, but this is just a good example of why default
configurations aren't that useful.  One problem I see is that if the
default configuration is not 'english', then when the index consults the
heap, it would be using a different configuration and yield incorrect
results.  I am unsure how to fix that.


again, you consider a very simple case, and actually your example is a
good example of the usefulness of the default configuration! Just think before
you develop your application, but this is a very general rule. There are
zillions of situations where you could do bad things, after all.

Moreover, consider text search on a text column: there is no way to specify
the configuration at all! We rely on the default configuration here


CREATE INDEX textsearch_idx ON pgweb USING gin(title);



With the trigger idea, you have to be sure your configuration is the same
every time you INSERT/UPDATE the table or the index will have mixed
configuration entries and it will yield incorrect results, aside from
the heap configuration lookup not matching the index.

Re: [HACKERS] Updated tsearch documentation

2007-07-18 Thread Bruce Momjian
Oleg Bartunov wrote:
 On Wed, 18 Jul 2007, Bruce Momjian wrote:
 
 
  Why are we allowing my_filter_name here?  Isn't that something for a
  custom trigger.  Is calling it tsearch() a good idea?  Why not
  tsvector_trigger().
 
  I don't see any benefit from the tsvector_trigger() name. If you want to 
  add
  some semantic, than tsvector_update_trigger() would be better.  Anyway,
  this trigger is an illustration.
 
  Well, the filter that removes '@' might be an example, but tsearch() is
  indeed sort of built-in trigger function to be used for simple cases.
  My point is that because it is only for simple cases, why add complexity
  and allow a filter?  It seems best to just remove the filter idea and
  let people write their own triggers if they want that functionality.
 
 If you aware about documentation simplicity than we could just document 
 two versions:
 1. without filter function - simple, well understood syntax
 2. with filter function - for advanced users
 
 I don't want to remove the feature which works for year without any problem.

Yes, this is what I want.  I would like to show the simple usage first,
then explain that a more complex usage is possible.  This will help
people get started using text search.  Triggers and secondary columns
are fine, but to start using it the CREATE INDEX-only case is best.  I
don't suggest we remove any capabilities, only suggest simple solutions.
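
A minimal sketch of the CREATE INDEX-only case follows. It assumes pgweb has the title, body and dlm columns used in other examples in this thread; the index name is made up for illustration, and the two-argument to_tsquery() form is the one in released PostgreSQL.

```sql
-- Expression index: no separate tsvector column and no trigger.
-- The configuration name is spelled out so every entry is built the same way.
CREATE INDEX pgweb_body_fts_idx ON pgweb
    USING gin (to_tsvector('english', body));

-- The query repeats the indexed expression exactly, including the
-- configuration name, so the planner can match it to the index.
SELECT title
FROM pgweb
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'friend')
ORDER BY dlm DESC
LIMIT 10;
```

Ordering by dlm rather than by rank fits this setup, since no weights or positions are stored anywhere.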

  CREATE INDEX textsearch_id ON pgweb USING 
  gin(to_tsvector(column));
 
  That avoids having to have a separate column because you can just say:
 
  WHERE to_query('XXX') @@ to_tsvector(column)
 
  yes, it's possible, but without ranking, since currently it's impossible
  to store any information in index (it's pg's feature). btw, this should
  works and for GiST index also.
 
  What if they use @@@.  Wouldn't that work because it is going to check
  the heap?
 
  It would work, it'd recalculate to_tsvector(column) for rows found
  ( for GiST - to remove false hits and for weight information, for
  GIN - for weight information only).
 
  Right.  Currently to use text search on a table, you have to do three
  things:
 
  o  add a tsvector column to the table
  o  add a trigger to keep the tsvector column current
  o  add an index to the tsvector column
 
  My question is why bother with the first two steps?  If you do:
 
  CREATE INDEX textsearch_idx ON pgweb USING 
  gist(to_tsvector('english',column));
 
  you don't need a separate column and a trigger to keep it current.  The
  index is kept current as part of normal query processing.  The only
  downside is that you have to do to_tsvector() in the heap to avoid false
  hits, but that seems minor compared to the disk savings of not having
  the separate column.  Is to_tsvector() an expensive function?
 
 Bruce, you oversimplify the text search, the document could be fully virtual,
 not a column(s), it could be a result of any SQL commands, so it could be 
 very expensive just to obtain document, and yes, to_tsvector could be
 very expensive, depending on the document size, parser and dictionaries used.
 
 And, again, current postgres architecture forces to use heap to store
 positional and weight information for ranking.
 
 The use case for what you described is very limited - simple text search
 on one/several column of the same table without ranking.

Right, but I bet that that is all the majority of users need, at least
at first as they start to use text search.

   CREATE INDEX textsearch_idx ON pgweb USING 
  gin(to_tsvector('english',column));
 
  so that at least the configuration is documented in the index.
 
  yes, it's better to always explicitly specify configuration name and not
  rely on default configuration.
  Unfortunately, configuration name doesn't saved in the index.
 
  as Teodor corrected me, index doesn't know about configuration at all !
  What accurate user could do, is to provide configuration name in the
  comment for tsvector column. Configuration name is an accessory of
  to_tsvector() function.
 
  Well, if you create the index with the configuration name it is
  guaranteed to match:
 
  CREATE INDEX textsearch_idx ON pgweb USING 
  gist(to_tsvector('english',column));
   ---
  And if someone does:
 
  WHERE 'friend'::tsquery @@ to_tsvector('english',column)
 
  the index is used.  Now if the default configuration is 'english' and
  they use:
 
  WHERE 'friend'::tsquery @@ to_tsvector(column)
 
  the index is not used, but this is just a good example of why default
  configurations aren't that useful.  One problem I see is that if the
  default configuration is not 'english', then when the index consults the
  heap, it would be using a different configuration and yield incorrect
  results.  I am unsure how to fix that.
 
 again, you consider very simple case  and actually, your example is a 
 good example of usefulness of default 

Re: [HACKERS] Updated tsearch documentation

2007-07-18 Thread Bruce Momjian
Oleg, Teodor,

I am confused by the following example.  How does gin know to create a
tsvector, or does it?  Does gist know too?   

FYI, at some point we need to chat via instant messenger or IRC to
discuss the open items.  My chat information is here:

http://momjian.us/main/contact.html

---

SELECT title
FROM pgweb
WHERE textcat(title,body) @@ plainto_tsquery('create table')
ORDER BY dlm DESC LIMIT 10;

CREATE INDEX pgweb_idx ON pgweb USING gin(textcat(title,body));

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Updated tsearch documentation

2007-07-18 Thread Oleg Bartunov

On Wed, 18 Jul 2007, Bruce Momjian wrote:


Oleg, Teodor,

I am confused by the following example.  How does gin know to create a
tsvector, or does it?  Does gist know too?


No, gist doesn't know. I don't remember why, Teodor ?

For GIN see http://archives.postgresql.org/pgsql-hackers/2007-05/msg00625.php
for discussion



FYI, at some point we need to chat via instant messenger or IRC to
discuss the open items.  My chat information is here:

http://momjian.us/main/contact.html


I sent you an invitation for Google Talk; I only use chat in Gmail.
My gmail account is [EMAIL PROTECTED]



---

SELECT title
FROM pgweb
WHERE textcat(title,body) @@ plainto_tsquery('create table')
ORDER BY dlm DESC LIMIT 10;

CREATE INDEX pgweb_idx ON pgweb USING gin(textcat(title,body));




Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83



Re: [HACKERS] Updated tsearch documentation

2007-07-17 Thread Oleg Bartunov

On Tue, 17 Jul 2007, Bruce Momjian wrote:


I think the tsearch documentation is nearing completion:

http://momjian.us/expire/fulltext/HTML/textsearch.html

but I am not happy with how tsearch is enabled in a user table:

http://momjian.us/expire/fulltext/HTML/textsearch-app-tutorial.html

Aside from the fact that it needs more examples, it only illustrates an
example where someone creates a table, populates it, then adds a
tsvector column, populates that, then creates an index.

That seems quite inflexible.  Is there a way to avoid having a separate
tsvector column?  What happens if the table is dynamic?  How is that
column updated based on table changes?  Triggers?  Where are the
examples?  Can you create an index like this:


I agree that there could be more examples, but text search doesn't
require anything special!
An *example* of a trigger function is documented at
http://momjian.us/expire/fulltext/HTML/textsearch-opfunc.html





CREATE INDEX textsearch_id ON pgweb USING gin(to_tsvector(column));

That avoids having to have a separate column because you can just say:

WHERE to_query('XXX') @@ to_tsvector(column)


yes, it's possible, but without ranking, since currently it's impossible
to store any information in the index (that's how pg works). btw, this should
work for a GiST index too.

That kind of search is useful if there is another natural ordering of search
results, for example, by timestamp.




How do we make sure that the to_query is using the same text search
configuration as the 'column' or index?  Perhaps we should suggest:


please keep in mind, it's not mandatory to use the same configuration
at search time that was used at index creation.



 CREATE INDEX textsearch_idx ON pgweb USING gin(to_tsvector('english',column));

so that at least the configuration is documented in the index.


yes, it's better to always explicitly specify the configuration name and not
rely on the default configuration.
Unfortunately, the configuration name isn't saved in the index.


Regards,
Oleg


Re: [HACKERS] Updated tsearch documentation

2007-07-17 Thread Oleg Bartunov

On Tue, 17 Jul 2007, Oleg Bartunov wrote:


On Tue, 17 Jul 2007, Bruce Momjian wrote:


I think the tsearch documentation is nearing completion:

http://momjian.us/expire/fulltext/HTML/textsearch.html

but I am not happy with how tsearch is enabled in a user table:

http://momjian.us/expire/fulltext/HTML/textsearch-app-tutorial.html

Aside from the fact that it needs more examples, it only illustrates an
example where someone creates a table, populates it, then adds a
tsvector column, populates that, then creates an index.

That seems quite inflexible.  Is there a way to avoid having a separate
tsvector column?  What happens if the table is dynamic?  How is that
column updated based on table changes?  Triggers?  Where are the
examples?  Can you create an index like this:


I agree that there could be more examples, but text search doesn't
require anything special!
An *example* of a trigger function is documented at
http://momjian.us/expire/fulltext/HTML/textsearch-opfunc.html




Bruce,

below is an example of a trigger for insert/update of the example table

create function pgweb_update() returns trigger as
$$
BEGIN
   -- build the tsvector from the new row: title gets weight A, body weight D
   NEW.textsearch_index :=
      setweight( to_tsvector( coalesce (NEW.title,'')), 'A' ) ||
      setweight( to_tsvector( coalesce (NEW.body,'')), 'D' );
   RETURN NEW;
END;
$$
language plpgsql;


CREATE TRIGGER fts_update BEFORE INSERT OR UPDATE ON pgweb
FOR EACH ROW EXECUTE PROCEDURE pgweb_update();






CREATE INDEX textsearch_id ON pgweb USING gin(to_tsvector(column));

That avoids having to have a separate column because you can just say:

WHERE to_query('XXX') @@ to_tsvector(column)


yes, it's possible, but without ranking, since currently it's impossible to
store any information in the index (that's how pg works). btw, this should
work for a GiST index too.

That kind of search is useful if there is another natural ordering of search
results, for example, by timestamp.




How do we make sure that the to_query is using the same text search
configuration as the 'column' or index?  Perhaps we should suggest:


please keep in mind, it's not mandatory to use the same configuration
at search time that was used at index creation.



One example is when a text search index is created without taking stop words
into account. Then you could search for the famous 'to be or not to be' with the
same configuration, or ignore stop words with another one.
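
To make the stop-word point concrete, here is a small sketch that can be run in psql. It assumes the built-in 'simple' and 'english' configurations as they exist in released PostgreSQL; configuration names in the patch under discussion may differ.

```sql
-- 'simple' keeps every word; 'english' drops stop words and stems the rest.
SELECT to_tsvector('simple',  'to be or not to be');
SELECT to_tsvector('english', 'to be or not to be');

-- The same difference applies on the query side.
SELECT plainto_tsquery('simple',  'to be or not to be');
SELECT plainto_tsquery('english', 'to be or not to be');
```

An index built with a stop-word-free configuration can therefore serve both kinds of query, while the reverse is not true.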




 CREATE INDEX textsearch_idx ON pgweb USING 
gin(to_tsvector('english',column));


so that at least the configuration is documented in the index.


yes, it's better to always explicitly specify the configuration name and not rely
on the default configuration. Unfortunately, the configuration name isn't saved in
the index.


Regards,
Oleg


Re: [HACKERS] Updated tsearch documentation

2007-07-17 Thread Bruce Momjian
Oleg Bartunov wrote:
 On Tue, 17 Jul 2007, Bruce Momjian wrote:
 
  I think the tsearch documentation is nearing completion:
 
  http://momjian.us/expire/fulltext/HTML/textsearch.html
 
  but I am not happy with how tsearch is enabled in a user table:
 
  http://momjian.us/expire/fulltext/HTML/textsearch-app-tutorial.html
 
  Aside from the fact that it needs more examples, it only illustrates an
  example where someone creates a table, populates it, then adds a
  tsvector column, populates that, then creates an index.
 
  That seems quite inflexible.  Is there a way to avoid having a separate
  tsvector column?  What happens if the table is dynamic?  How is that
  column updated based on table changes?  Triggers?  Where are the
  examples?  Can you create an index like this:
 
 I agree, that there are could be more examples, but text search doesn't
 require something special !
 *Example* of trigger function is documented on 
 http://momjian.us/expire/fulltext/HTML/textsearch-opfunc.html

Yes, I see that in tsearch() here:

http://momjian.us/expire/fulltext/HTML/textsearch-opfunc.html#TEXTSEARC$

I assume my_filter_name is optional right?  I have updated the prototype
to be:

tsearch([vector_column_name], [my_filter_name], text_column_name [, ... ])

Is this accurate?  What does this text below it mean?

There can be many functions and text columns specified in a tsearch()
trigger. The following rule is used: a function is applied to all
subsequent TEXT columns until the next matching column occurs. 

Why are we allowing my_filter_name here?  Isn't that something for a
custom trigger.  Is calling it tsearch() a good idea?  Why not
tsvector_trigger().

  CREATE INDEX textsearch_id ON pgweb USING gin(to_tsvector(column));
 
  That avoids having to have a separate column because you can just say:
 
  WHERE to_query('XXX') @@ to_tsvector(column)
 
 yes, it's possible, but without ranking, since currently it's impossible 
 to store any information in index (it's pg's feature). btw, this should
 works and for GiST index also.

What if they use @@@.  Wouldn't that work because it is going to check
the heap?

 That kind of search is useful if there is  another natural ordering of search 
 results, for example, by timestamp.
 
 
  How do we make sure that the to_query is using the same text search
  configuration as the 'column' or index?  Perhaps we should suggest:
 
 please, keep in mind, it's not mandatory to use the same configuration
 at search time, that was used at index creation.

Well, sort of.  If you have stop words in the tsquery configuration, you
aren't going to hit any matches in the tsvector, right?  Same for
synonyms, I suppose.  I can see that stemming would work if there was a
mismatch between tsquery and tsvector.

   CREATE INDEX textsearch_idx ON pgweb USING 
  gin(to_tsvector('english',column));
 
  so that at least the configuration is documented in the index.
 
 yes, it's better to always explicitly specify configuration name and not 
 rely on default configuration. 
 Unfortunately, configuration name doesn't saved in the index.

I was more concerned that there is nothing documenting the configuration
used by the index or the tsvector table column trigger.  By doing:

CREATE INDEX textsearch_idx ON pgweb USING 
gin(to_tsvector('english',column));

you guarantee that the index uses 'english' for all its entries.  If you
omit the 'english' or use a different configuration, it will heap scan
the table, which at least gives the right answer.

Also, how do you guarantee that tsearch() triggers always use the same
configuration?  The existing tsearch() API seems to make that
impossible.  I am wondering if we need to add the configuration name as
a mandatory parameter to tsearch().
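
One way to pin the configuration without changing tsearch() is a hand-written trigger that names it explicitly. The sketch below is a variation on the pgweb_update() example posted in this thread; the function and trigger names are made up, and it assumes pgweb has title, body and a tsvector column called textsearch_index.

```sql
-- Every INSERT/UPDATE builds the tsvector with 'english', regardless of
-- the session's default configuration.
CREATE FUNCTION pgweb_update_english() RETURNS trigger AS $$
BEGIN
    NEW.textsearch_index :=
        setweight(to_tsvector('english', coalesce(NEW.title, '')), 'A') ||
        setweight(to_tsvector('english', coalesce(NEW.body,  '')), 'D');
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER pgweb_fts_update_english
    BEFORE INSERT OR UPDATE ON pgweb
    FOR EACH ROW EXECUTE PROCEDURE pgweb_update_english();
```

Because the configuration is hard-coded in the function body, rows written by different sessions cannot end up with mixed configurations.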



Re: [HACKERS] Updated tsearch documentation

2007-07-17 Thread Michael Glaesemann


On Jul 17, 2007, at 16:24 , Bruce Momjian wrote:

I assume my_filter_name is optional right?  I have updated the prototype
to be:

	tsearch([vector_column_name], [my_filter_name], text_column_name [, ... ])


Just a style point, but would [filter_name] be better than  
[my_filter_name]? You're not qualifying the others with my_ ... or is  
there something you want to tell us, Bruce? :)


Michael Glaesemann
grzm seespotcode net





Re: [HACKERS] Updated tsearch documentation

2007-07-17 Thread Bruce Momjian
Michael Glaesemann wrote:
 
 On Jul 17, 2007, at 16:24 , Bruce Momjian wrote:
 
  I assume my_filter_name is optional right?  I have updated the  
  prototype
  to be:
 
  tsearch([vector_column_name], [my_filter_name], text_column_name  
  [, ... ])
 
 Just a style point, but would [filter_name] be better than  
 [my_filter_name]? You're not qualifying the others with my_ ... or is  
 there something you want to tell us, Bruce? :)

Agreed.  Done.



Re: [HACKERS] Updated tsearch documentation

2007-07-16 Thread Bruce Momjian
I think the tsearch documentation is nearing completion:

http://momjian.us/expire/fulltext/HTML/textsearch.html

but I am not happy with how tsearch is enabled in a user table:

http://momjian.us/expire/fulltext/HTML/textsearch-app-tutorial.html

Aside from the fact that it needs more examples, it only illustrates an
example where someone creates a table, populates it, then adds a
tsvector column, populates that, then creates an index.

That seems quite inflexible.  Is there a way to avoid having a separate
tsvector column?  What happens if the table is dynamic?  How is that
column updated based on table changes?  Triggers?  Where are the
examples?  Can you create an index like this:

CREATE INDEX textsearch_id ON pgweb USING gin(to_tsvector(column));

That avoids having to have a separate column because you can just say:

WHERE to_tsquery('XXX') @@ to_tsvector(column)

How do we make sure that the to_tsquery is using the same text search
configuration as the 'column' or index?  Perhaps we should suggest:

  CREATE INDEX textsearch_idx ON pgweb USING gin(to_tsvector('english',column));

so that at least the configuration is documented in the index.



Re: [HACKERS] Updated tsearch documentation

2007-07-10 Thread Bruce Momjian

Thanks, I applied this patch and rebuilt the HTML version.  I was wondering
how I was going to make all the changes accurately.  ;-)

---

Nicolas Barbier wrote:
 2007/7/7, Bruce Momjian [EMAIL PROTECTED]:
 
  FYI, I have massively reorganized the text search documentation and it
  is getting closer to something I am happy with:
 
  http://momjian.us/expire/fulltext/HTML/textsearch.html
 
 The following is the result of me proofreading, mainly searching for
 small mistakes such as spelling/grammatical errors (that means no
 document structure comments, etc).
 
 All corrections are relative to the version of the text at above URL
 at the time of me reading it :-).
 
 General
 
 It seems to be a recurring problem that commas are not put between the
 brackets when an argument is optional. For example:
 to_tsvector([conf_name], document TEXT) - I guess this should be
 to_tsvector([conf_name,] document TEXT)
 
 Full-text vs. full text and stop-word vs. stop word are not used
 consistently. Also, capitalization of full text searching is not used
 consistently.
 



Re: [HACKERS] Updated tsearch documentation

2007-07-07 Thread Nicolas Barbier

2007/7/7, Bruce Momjian [EMAIL PROTECTED]:


FYI, I have massively reorganized the text search documentation and it
is getting closer to something I am happy with:

http://momjian.us/expire/fulltext/HTML/textsearch.html


The following is the result of me proofreading, mainly searching for
small mistakes such as spelling/grammatical errors (that means no
document structure comments, etc).

All corrections are relative to the version of the text at above URL
at the time of me reading it :-).

General

It seems to be a recurring problem that commas are not put between the
brackets when an argument is optional. For example:
"to_tsvector([conf_name], document TEXT)" -> I guess this should be
"to_tsvector([conf_name,] document TEXT)"

"Full-text" vs. "full text" and "stop-word" vs. "stop word" are not used
consistently. Also, capitalization of "full text searching" is not used
consistently.

14.1. Introduction

* "indexinging" -> "indexing"
* "There is no linguistic support, even in English" -> "for" instead of "in"?
* "e.g.satisfies" -> add a space before "satisfies"
* "have several thousands derivatives" -> should this not use the
singular form "thousand"?
* "infinitive form" -> is this the right term? I think it only applies
to verbs (also occurs in 14.4 and probably others)
* "over how lexemes creation" -> not sure what this should be. "are
created" maybe?
* "Map synonyms to a single word. ispell." -> why is "ispell" a standalone word?
* "so it is natural to introduce a new data type" -> this does not
sound like documentation
* "Also, full-text search operator @@" -> add "the" before "full-text"
* "A document is any text file that can be opened, read, and modified"
-> "file" sounds as if it should be a file on a filesystem.
* "However, the document file must be uniquely identified in the
database." -> why?
* "COALESCE" -> should be a link
* "during calculation of document rank" -> add "the" before
"calculation" and before "document"
* "which supports boolean operators, & (AND)" -> remove the ",". maybe
add "the" before "boolean"
* "parenthesis" -> "parentheses"
* "Tsquery consists of" -> maybe add "A" before "Tsquery"

14.2. Operators And Functions
   ^^^ -> a non-capital "a" in "and" seems to be more
consistent with the rest of the manual

* "TSVECTOR, otherwise false:" -> "and false if not" or "and false
otherwise" (occurs 3 times in this section)
* "The text should be formatted to match the way a vector is displayed
by SELECT." -> what a strange definition, I think something like
"input format" or so should be used (and defined somewhere, didn't see
it yet) (used twice in this section)
* "tsearch([vector_column_name], my_filter_name | text_column_name1
[...], text_column_nameN)" -> I do not understand the notation
* "The following rule is used: a function is applied to all subsequent
TEXT columns until next matching column occurs." -> I don't get it
* "stat([sqlquery text ], [weight text ]) returns SETOF statinfo" -> I
guess that not both of the arguments are optional?
* "stop-words candidates" -> "stop-word candidates"
* "tsvectors are compared with each other using lexicographical
ordering." -> of the output representation or something else?
* "Accepts querytext, which should be single tokens separated by" ->
replace "be" with "consist of"
* "& and | or, and ! not" -> putting parentheses around the "and", "or",
and "not" would be more readable. also, a comma is missing before the
"|" sign
* "break it onto tokens" -> "into" instead of "onto"
* "since GIN indexes do not support negate queries" -> something like:
"queries with negation" or "negated queries" (depending on what the
correct rule is)
* "Arguments to rewrite() function" -> "the .. functions" or "to .."
(without the "function")
* "can be column names of type tsquery" -> "names of columns of type
tsquery" (the names are not of type tsquery, the columns are)
* "we can change rewriting rule online" -> add "the", possibly use
another word for "online" (it is not clear what that means to me)

14.3. Additional Controls

* "Full text searching in PostgreSQL provides function" -> add "the"
* "we see the resulting" -> maybe "we see that the resulting"
"does not contain a, on, or it, word rats became rat, and the
punctuation sign - was ignored" -> "does not contain the words" (or
lexemes, or tokens), add "the" before "word rats", add quotes around
the "-"
* "on words" -> "into words"
* "they are too frequent" -> "they occur too frequently" (I think a
word cannot be "frequent")
* "The Punctuation sign -" -> "The punctuation sign -" + put quotes
around the "-"
* "which shows all details of full text machinery" -> add "the" before "full"
* "is to mark out the different parts of document" -> add "a" before "document"
* "by the 1 + logarithm" -> "by 1 + the logarithm"
* "i.e., ordering of search results will not change" -> add "the"
before "ordering", maybe also before "search"
* "note that second example" -> add "the" before "second"
* "than ones with labeled with D" -> "than ones labeled with D" or
"than ones that are labeled with D"
* "Unfortunately, it is almost impossible to avoid since full text
indexing in a database should work without indexes" -> I don't get it
* to show part of each document - add 

Re: [HACKERS] Updated tsearch documentation

2007-07-02 Thread Bruce Momjian
Oleg Bartunov wrote:
 On Wed, 20 Jun 2007, Bruce Momjian wrote:
 
  We need to decide if we need oids as user-visible argument. I don't see
  any value, probably Teodor think other way.
 
  This is a good time to clean up the API because there are going to be
  user-visible changes anyway.
 
 Bruce, just remove oid argument specification from documentation.

Done.  I am attaching the current function prototypes.  If they don't
match the C code, please let me know.

I have also updated with some minor corrections I received from Erik.  I
will be adding more to the documentation hopefully this week:

http://momjian.us/expire/fulltext/HTML/

*** /pgsgml/fulltext-opfunc.sgml	Sat Jun 16 23:30:11 2007
--- fulltext-opfunc.sgml	Mon Jul  2 21:17:15 2007
***************
*** 141,147 ****
  
  <term>
  <synopsis>
! to_tsvector(<optional><replaceable class=PARAMETER>configuration</replaceable>,</optional>  <replaceable class=PARAMETER>document</replaceable> TEXT) returns TSVECTOR
  </synopsis>
  </term>
  
--- 141,147 ----
  
  <term>
  <synopsis>
! to_tsvector(<optional><replaceable class=PARAMETER>conf_name</replaceable></optional>,  <replaceable class=PARAMETER>document</replaceable> TEXT) returns TSVECTOR
  </synopsis>
  </term>
  
***************
*** 285,306 ****
  
  <term>
  <synopsis>
! tsearch(<replaceable class=PARAMETER>vector_column_name</replaceable><optional>, (<replaceable class=PARAMETER>my_filter_name</replaceable> | <replaceable class=PARAMETER>text_column_name1</replaceable>) <optional>...</optional> </optional>, <replaceable class=PARAMETER>text_column_nameN</replaceable>)
  </synopsis>
  </term>
  
***************
*** 323,329 ****
  
  <term>
  <synopsis>
! stat(<replaceable class=PARAMETER>sqlquery</replaceable> text <optional>, weight text </optional>) returns SETOF statinfo
  </synopsis>
  
  <listitem>
--- 322,328 ----
  
  <term>
  <synopsis>
! stat(<optional><replaceable class=PARAMETER>sqlquery</replaceable> text </optional>, weight text </optional>) returns SETOF statinfo
  </synopsis>
  
  <listitem>
***************
*** 403,409 ****
  
  <term>
  <synopsis>
! to_tsquery(<optional><replaceable class=PARAMETER>configuration</replaceable>,</optional> <replaceable class=PARAMETER>querytext</replaceable> text) returns TSQUERY
  </synopsis>
  </term>
  
--- 402,408 ----
  
  <term>
  <synopsis>
! to_tsquery(<optional><replaceable class=PARAMETER>conf_name</replaceable></optional>, <replaceable class=PARAMETER>querytext</replaceable> text) returns TSQUERY
  </synopsis>
  </term>
  
***************
*** 446,452 ****
  
  <term>
  <synopsis>
! plainto_tsquery(<optional><replaceable class=PARAMETER>configuration</replaceable>,</optional>  <replaceable class=PARAMETER>querytext</replaceable> text) returns TSQUERY
  </synopsis>
  </term>
  
--- 445,451 ----
  
  <term>
  <synopsis>
! plainto_tsquery(<optional><replaceable class=PARAMETER>conf_name</replaceable></optional>,  <replaceable class=PARAMETER>querytext</replaceable> text) returns TSQUERY
  </synopsis>
  </term>
  
***************
*** 989,995 ****
  
  <term>
  <synopsis>
! rank(<optional> <replaceable class=PARAMETER>weights</replaceable> float4[], </optional> <replaceable class=PARAMETER>vector</replaceable> TSVECTOR, <replaceable class=PARAMETER>query</replaceable> TSQUERY, <optional> <replaceable class=PARAMETER>normalization</replaceable> int4 </optional>) returns float4
  </synopsis>
  </term>
  
--- 988,994 ----
  
  <term>
  <synopsis>
! rank(<optional> <replaceable class=PARAMETER>weights</replaceable> float4[]</optional>, <replaceable class=PARAMETER>vector</replaceable> TSVECTOR, <replaceable class=PARAMETER>query</replaceable> TSQUERY, <optional> <replaceable class=PARAMETER>normalization</replaceable> int4 </optional>) returns float4
  </synopsis>
  </term>
  
***************
*** 1084,1090 ****
  
  <term>
  <synopsis>
! headline(<optional> <replaceable class=PARAMETER>id</replaceable> int4, | <replaceable class=PARAMETER>ts_name</replaceable> text, </optional> <replaceable class=PARAMETER>document</replaceable> text, <replaceable class=PARAMETER>query</replaceable> TSQUERY, <optional> <replaceable class=PARAMETER>options</replaceable> text </optional>) returns text
  </synopsis>
  </term>
  
--- 1083,1089 ----
  
  <term>
  <synopsis>
! headline(<optional> <replaceable class=PARAMETER>ts_name</replaceable> text</optional>, <replaceable class=PARAMETER>document</replaceable> text, <replaceable class=PARAMETER>query</replaceable> TSQUERY, <optional> <replaceable class=PARAMETER>options</replaceable> text </optional>) returns text
  </synopsis>
  </term>
  
***************
*** 1351,1357 ****
  
  <term>
  <synopsis>
! lexize(<optional> <replaceable class=PARAMETER>oid</replaceable>, | <replaceable class=PARAMETER>dict_name</replaceable> text, <replaceable class=PARAMETER>lexeme</replaceable> text) returns text[]
  </synopsis>
  </term>
  
--- 1350,1356 ----
  
  <term>
  <synopsis>
! lexize(<optional> <replaceable class=PARAMETER>dict_name</replaceable> text</optional>, <replaceable class=PARAMETER>lexeme</replaceable> text) returns text[]
  </synopsis>
  </term>
  

Re: [HACKERS] Updated tsearch documentation

2007-06-21 Thread Oleg Bartunov

On Wed, 20 Jun 2007, Bruce Momjian wrote:


Oleg Bartunov wrote:

On Wed, 20 Jun 2007, Bruce Momjian wrote:

Comments to editorial work of Bruce Momjian.

fulltext-intro.sgml:

it is useful to have a predefined list of lexemes.

Bruce, there should be a list of lexeme types here!


Agreed.  Is the list of lexemes parser-specific?



Yes, it is the parser that defines the types of lexemes.


OK, how will users get a list of supported lexemes?  Do we need a list
per supported parser?


it's documented, see Parser functions for token_type();

postgres=# select * from token_type('default');
 tokid |    alias     |            description
-------+--------------+-----------------------------------
     1 | lword        | Latin word
     2 | nlword       | Non-latin word
     3 | word         | Word
     4 | email        | Email
     5 | url          | URL
     6 | host         | Host
     7 | sfloat       | Scientific notation
     8 | version      | VERSION
     9 | part_hword   | Part of hyphenated word
    10 | nlpart_hword | Non-latin part of hyphenated word
    11 | lpart_hword  | Latin part of hyphenated word
    12 | blank        | Space symbols
    13 | tag          | HTML Tag
    14 | protocol     | Protocol head
    15 | hword        | Hyphenated word
    16 | lhword       | Latin hyphenated word
    17 | nlhword      | Non-latin hyphenated word
    18 | uri          | URI
    19 | file         | File or path name
    20 | float        | Decimal notation
    21 | int          | Signed integer
    22 | uint         | Unsigned integer
    23 | entity       | HTML Entity


The integer option controls several behaviors which is done using bit-wise
fields and <literal>|</literal> (for example, <literal>2|4</literal>):
<!-- why so complex? -->


to avoid 2 arguments


But I don't see why you would want to set two of those values --- they
seem mutually exclusive, e.g.

1 divides the rank by the 1 + logarithm of the document length
2 divides the rank by the length itself

I assume you do either one, not both.


But what about the other variants?


OK, here is the full list:

0 (the default) ignores document length
1 divides the rank by the 1 + logarithm of the document length
2 divides the rank by the length itself
4 divides the rank by the mean harmonic distance between extents
8 divides the rank by the number of unique words in document
16 divides the rank by 1 + logarithm of the number of unique words in
   document

so which ones would be both enabled?


None! This is a list of the possible values of the rank normalization flag,
which can be ORed together.


=# select rank_cd('1:1,2,3 4:5 6:7', '14',1);
  rank_cd
-----------
 0.0279055
=# select rank_cd('1:1,2,3 4:5 6:7', '14',1|16);
  rank_cd
-----------
 0.0139528
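
To illustrate how one integer can carry several normalization choices, here is a minimal Python sketch that decodes an ORed value back into the divisors it selects. The bit values and descriptions are the ones listed in this thread; the function name and labels are hypothetical, and this is not the tsearch C code.

```python
# Bit values as listed in this thread; the labels are illustrative only.
NORMALIZATION_FLAGS = {
    1: "1 + log(document length)",
    2: "document length",
    4: "mean harmonic distance between extents",
    8: "number of unique words",
    16: "1 + log(number of unique words)",
}


def decode_normalization(flags):
    """Return the divisors selected by an ORed normalization value."""
    return [label for bit, label in NORMALIZATION_FLAGS.items() if flags & bit]


print(decode_normalization(0))       # 0 selects nothing: document length is ignored
print(decode_normalization(1 | 16))  # two divisors, as in the rank_cd example above
```

Because each option owns its own bit, a value such as 1|16 names two divisors at once without needing a second argument.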






What I missed is the definition of extent.


From http://www.sai.msu.su/~megera/wiki/NewExtentsBasedRanking

An extent is a shortest, non-nested sequence of words that satisfies a query.


I don't understand how that relates to this.


Because of

4 divides the rank by the mean harmonic distance between extents

It reflects how densely the extents that satisfy the query are packed in the
document.
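
To make the definition of an extent concrete, here is a minimal Python sketch. It assumes the query is a plain AND of terms and treats an extent as a window that contains every term and has no shorter such window inside it. The function name is hypothetical and this is not the rank_cd implementation.

```python
from collections import Counter


def find_extents(words, query_terms):
    """Return (start, end) word positions of the minimal windows that
    contain every query term (assumed AND semantics)."""
    terms = set(query_terms)
    # Only positions holding a query term matter.
    hits = [(pos, w) for pos, w in enumerate(words) if w in terms]
    counts = Counter()
    extents = []
    left = 0
    last_left = None
    for pos, word in hits:
        counts[word] += 1
        # Drop leading hits whose term occurs again later in the window.
        while counts[hits[left][1]] > 1:
            counts[hits[left][1]] -= 1
            left += 1
        covers = all(counts[t] > 0 for t in terms)
        # An unchanged left edge means this window would contain the
        # previously recorded extent, so it is skipped as nested.
        if covers and left != last_left:
            extents.append((hits[left][0], pos))
            last_left = left
    return extents


print(find_extents("a b x a y b a".split(), {"a", "b"}))
# [(0, 1), (1, 3), (3, 5), (5, 6)]
```

The closer together these windows sit, the denser the matches, which is the quantity option 4 is described as measuring.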





its <replaceable>id</replaceable> or <replaceable>ts_name</replaceable>; <!-- n
if none is specified that the current configuration is used.


I don't understand this question


Same issue as above --- why allow a number here when the name works just
fine.  We don't allow tables to be specified by number, so why
configurations?


<para>
<!-- why?  -->
Note that the cascade dropping of the <function>headline</function> function
cause dropping of the <literal>parser</literal> used in fulltext configuration
<replaceable>tsname</replaceable>.
</para>


Hmm, it should probably be reversed: cascade dropping of the parser causes
dropping of the headline function.


Agreed.



In example below, <literal>fulltext_idx</literal> is
a GIN index:<!-- why isn't this automatic -->


It's explained above. The problem is that the current index API has no way
to say whether a search was lossy or exact, so to preserve the performance of
the GIN index we had to introduce the @@@ operator, which is the same as @@ but
lossy.


Well, then we have to fix the API.  Telling users to use a different
operator based on what index is defined is just bad style.


This was raised by Heikki and we discussed it a bit in Ottawa, but it's
unclear if it's doable for 8.3.  The @@@ operator is rarely used, so we could
say it will be improved in future versions.


Uh, I am wondering if we just have to force heap access in all cases
until it is fixed.


No, no! We would lose the performance of the GIN index, which isn't lossy and
doesn't need heap access. I don't see what's wrong with saying that some feature
isn't supported by the text search operator with a GIN index.


We need to decide if we need oids as a user-visible argument. I don't see
any value; Teodor probably thinks otherwise.


This is a good time 

Re: [HACKERS] Updated tsearch documentation

2007-06-21 Thread Oleg Bartunov

On Wed, 20 Jun 2007, Bruce Momjian wrote:


We need to decide if we need oids as a user-visible argument. I don't see
any value; Teodor probably thinks otherwise.


This is a good time to clean up the API because there are going to be
user-visible changes anyway.


Bruce, just remove the oid argument specification from the documentation.


Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [HACKERS] Updated tsearch documentation

2007-06-20 Thread Bruce Momjian
Oleg Bartunov wrote:
 On Sun, 17 Jun 2007, Bruce Momjian wrote:
 
  I have completed my first pass over the tsearch documentation:
 
  http://momjian.us/expire/fulltext/HTML/sql.html
 
  They are from section 14 and following.
 
  I have come up with a number of questions that I placed in SGML comments
  in these files:
 
  http://momjian.us/expire/fulltext/SGML/
 
  Teodor/Oleg, let me know when you want to go over my questions.
 
 Below are my answers (marked as )

OK.
 
 Comments to editorial work of Bruce Momjian.
 
 fulltext-intro.sgml:
 
 it is useful to have a predefined list of lexemes.
 
Bruce, there should be a list of lexeme types here!

Agreed.  Is the list of lexemes parser-specific?

 </para></listitem>
 
 <!--
 SEEMS UNNECESSARY
 It useless to attempt normalize <type>email address</type> using
 morphological dictionary of russian language, but looks reasonable to pick
 out <type>domain name</type> and be able to search for <type>domain
 name</type>.
 -->
 
 I don't understand where you got this para :)

Uh, it was in the SGML.  I have removed it.

 fulltext-opfunc.sgml:
 
 All of the following functions that accept a configuration argument can
 use either an integer <!-- why an integer --> or a textual configuration
 name to select a configuration.
 
 Originally it was an integer id; it is probably better to use <type>oid</type>.

Uh, my question is why are you allowing specification as an integer/oid
when the name works just fine.  I don't see the value in allowing
numbers here.

 This returns the query used for searching an index. It can be used to test
 for an empty query. The <command>SELECT</> below returns <literal>'T'</>,
 <!-- lowercase? --> which corresponds to an empty query since GIN indexes
 do not support negate queries (a full index scan is inefficient):
 
  Upper case. This looks cumbersome; querytree() should probably
  just return NULL.

Agreed.

 The integer option controls several behaviors which is done using bit-wise
 fields and <literal>|</literal> (for example, <literal>2|4</literal>):
 <!-- why so complex? -->
 
  to avoid 2 arguments

But I don't see why you would want to set two of those values --- they
seem mutually exclusive, e.g.

1 divides the rank by the 1 + logarithm of the document length
2 divides the rank by the length itself

I assume you do either one, not both.

 its <replaceable>id</replaceable> or <replaceable>ts_name</replaceable>; <!-- n
 if none is specified that the current configuration is used.
 
  I don't understand this question

Same issue as above --- why allow a number here when the name works just
fine.  We don't allow tables to be specified by number, so why
configurations?

 <para>
 <!-- why?  -->
 Note that the cascade dropping of the <function>headline</function> function
 cause dropping of the <literal>parser</literal> used in fulltext configuration
 <replaceable>tsname</replaceable>.
 </para>
 
  Hmm, it should probably be reversed: cascade dropping of the parser causes
  dropping of the headline function.

Agreed.

 
 In example below, <literal>fulltext_idx</literal> is
 a GIN index:<!-- why isn't this automatic -->
 
  It's explained above. The problem is that the current index API has no way
  to say whether a search was lossy or exact, so to preserve the performance of
  the GIN index we had to introduce the @@@ operator, which is the same as @@ but
  lossy.

Well, then we have to fix the API.  Telling users to use a different
operator based on what index is defined is just bad style.

 nly the <token>lword</token> lexeme, then a <acronym>TZ</acronym>
 definition like ' one 1:11' will not work since lexeme type
 <token>digit</token> is not assigned to the <acronym>TZ</acronym>.
 <!-- what do these numbers mean? -->
 </para>

OK, I changed it to be clearer.

  nothing special, just numbers for example.
 
 <function>ts_debug</> displays information about every token of
 <replaceable class=PARAMETER>document</replaceable> as produced by the
 parser and processed by the configured dictionaries using the configuration
 specified by <replaceable class=PARAMETER>cfgname</replaceable> or
 <replaceable class=PARAMETER>oid</replaceable>. <!-- no need for oid
 
  I don't understand this comment. ts_debug accepts a cfgname or its oid.

Again, no need for oid.



Re: [HACKERS] Updated tsearch documentation

2007-06-20 Thread Oleg Bartunov

On Wed, 20 Jun 2007, Bruce Momjian wrote:


Oleg Bartunov wrote:

On Sun, 17 Jun 2007, Bruce Momjian wrote:


I have completed my first pass over the tsearch documentation:

http://momjian.us/expire/fulltext/HTML/sql.html

They are from section 14 and following.

I have come up with a number of questions that I placed in SGML comments
in these files:

http://momjian.us/expire/fulltext/SGML/

Teodor/Oleg, let me know when you want to go over my questions.


Below are my answers (marked as )


OK.


Comments to editorial work of Bruce Momjian.

fulltext-intro.sgml:

it is useful to have a predefined list of lexemes.

Bruce, there should be a list of lexeme types here!


Agreed.  Is the list of lexemes parser-specific?



Yes, it is the parser that defines the types of lexemes.


fulltext-opfunc.sgml:

All of the following functions that accept a configuration argument can
use either an integer <!-- why an integer --> or a textual configuration
name to select a configuration.

Originally it was an integer id; it is probably better to use <type>oid</type>.


Uh, my question is why are you allowing specification as an integer/oid
when the name works just fine.  I don't see the value in allowing
numbers here.


For compatibility reasons. Hmm, indeed, I don't recall where oids could be
important.





This returns the query used for searching an index. It can be used to test
for an empty query. The <command>SELECT</> below returns <literal>'T'</>,
<!-- lowercase? --> which corresponds to an empty query since GIN indexes
do not support negate queries (a full index scan is inefficient):


Upper case. This looks cumbersome; querytree() should probably
just return NULL.


Agreed.


The integer option controls several behaviors which is done using bit-wise
fields and <literal>|</literal> (for example, <literal>2|4</literal>):
<!-- why so complex? -->


to avoid 2 arguments


But I don't see why you would want to set two of those values --- they
seem mutually exclusive, e.g.

1 divides the rank by the 1 + logarithm of the document length
2 divides the rank by the length itself

I assume you do either one, not both.


But what about the other variants?

What I missed is the definition of extent.


From http://www.sai.msu.su/~megera/wiki/NewExtentsBasedRanking

An extent is a shortest, non-nested sequence of words that satisfies a query.





its <replaceable>id</replaceable> or <replaceable>ts_name</replaceable>; <!-- n
if none is specified that the current configuration is used.


I don't understand this question


Same issue as above --- why allow a number here when the name works just
fine.  We don't allow tables to be specified by number, so why
configurations?


<para>
<!-- why?  -->
Note that the cascade dropping of the <function>headline</function> function
cause dropping of the <literal>parser</literal> used in fulltext configuration
<replaceable>tsname</replaceable>.
</para>


Hmm, it should probably be reversed: cascade dropping of the parser causes
dropping of the headline function.


Agreed.



In example below, <literal>fulltext_idx</literal> is
a GIN index:<!-- why isn't this automatic -->


It's explained above. The problem is that the current index API has no way
to say whether a search was lossy or exact, so to preserve the performance of
the GIN index we had to introduce the @@@ operator, which is the same as @@ but
lossy.


Well, then we have to fix the API.  Telling users to use a different
operator based on what index is defined is just bad style.


This was raised by Heikki and we discussed it a bit in Ottawa, but it's
unclear if it's doable for 8.3.  The @@@ operator is rarely used, so we could
say it will be improved in future versions.




nly the <token>lword</token> lexeme, then a <acronym>TZ</acronym>
definition like ' one 1:11' will not work since lexeme type
<token>digit</token> is not assigned to the <acronym>TZ</acronym>.
<!-- what do these numbers mean? -->
</para>


OK, I changed it to be clearer.


nothing special, just numbers for example.


<function>ts_debug</> displays information about every token of
<replaceable class=PARAMETER>document</replaceable> as produced by the
parser and processed by the configured dictionaries using the configuration
specified by <replaceable class=PARAMETER>cfgname</replaceable> or
<replaceable class=PARAMETER>oid</replaceable>. <!-- no need for oid


I don't understand this comment. ts_debug accepts a cfgname or its oid.


Again, no need for oid.


We need to decide if we need oids as a user-visible argument. I don't see
any value; Teodor probably thinks otherwise.


Regards,
Oleg


Re: [HACKERS] Updated tsearch documentation

2007-06-20 Thread Bruce Momjian
Oleg Bartunov wrote:
 On Wed, 20 Jun 2007, Bruce Momjian wrote:
  Comments to editorial work of Bruce Momjian.
 
  fulltext-intro.sgml:
 
  it is useful to have a predefined list of lexemes.
 
  Bruce, there should be a list of lexeme types here!
 
  Agreed.  Is the list of lexemes parser-specific?
 
 
 Yes, it is the parser that defines the types of lexemes.

OK, how will users get a list of supported lexemes?  Do we need a list
per supported parser?

  fulltext-opfunc.sgml:
 
  All of the following functions that accept a configuration argument can
  use either an integer <!-- why an integer --> or a textual configuration
  name to select a configuration.
 
  Originally it was an integer id; it is probably better to use <type>oid</type>.
 
  Uh, my question is why are you allowing specification as an integer/oid
  when the name works just fine.  I don't see the value in allowing
  numbers here.
 
 For compatibility reasons. Hmm, indeed, I don't recall where oids could be
 important.

Well, if neither of us sees a reason for it, let's remove it.  We don't
need to support a feature that has no usefulness.

  This returns the query used for searching an index. It can be used to test
  for an empty query. The <command>SELECT</> below returns <literal>'T'</>,
  <!-- lowercase? --> which corresponds to an empty query since GIN indexes
  do not support negate queries (a full index scan is inefficient):
 
  Upper case. This looks cumbersome; querytree() should probably
  just return NULL.
 
  Agreed.
 
  The integer option controls several behaviors which is done using bit-wise
  fields and <literal>|</literal> (for example, <literal>2|4</literal>):
  <!-- why so complex? -->
 
  to avoid 2 arguments
 
  But I don't see why you would want to set two of those values --- they
  seem mutually exclusive, e.g.
 
  1 divides the rank by the 1 + logarithm of the document length
  2 divides the rank by the length itself
 
  I assume you do either one, not both.
 
 But what about the other variants?

OK, here is the full list:

0 (the default) ignores document length
1 divides the rank by the 1 + logarithm of the document length
2 divides the rank by the length itself
4 divides the rank by the mean harmonic distance between extents
8 divides the rank by the number of unique words in document
16 divides the rank by 1 + logarithm of the number of unique words in
   document

so which ones would be both enabled?

 
 What I missed is the definition of extent.
 
 From http://www.sai.msu.su/~megera/wiki/NewExtentsBasedRanking
 An extent is a shortest, non-nested sequence of words that satisfies a query.

I don't understand how that relates to this.

 
  its <replaceable>id</replaceable> or <replaceable>ts_name</replaceable>; <!-- n
  if none is specified that the current configuration is used.
 
  I don't understand this question
 
  Same issue as above --- why allow a number here when the name works just
  fine.  We don't allow tables to be specified by number, so why
  configurations?
 
  <para>
  <!-- why?  -->
  Note that the cascade dropping of the <function>headline</function> function
  cause dropping of the <literal>parser</literal> used in fulltext configuration
  <replaceable>tsname</replaceable>.
  </para>
 
  Hmm, it should probably be reversed: cascade dropping of the parser causes
  dropping of the headline function.
 
  Agreed.
 
 
  In example below, <literal>fulltext_idx</literal> is
  a GIN index:<!-- why isn't this automatic -->
 
  It's explained above. The problem is that the current index API has no way
  to say whether a search was lossy or exact, so to preserve the performance of
  the GIN index we had to introduce the @@@ operator, which is the same as @@ but
  lossy.
 
  Well, then we have to fix the API.  Telling users to use a different
  operator based on what index is defined is just bad style.
 
 This was raised by Heikki and we discussed it a bit in Ottawa, but it's
 unclear if it's doable for 8.3.  The @@@ operator is rarely used, so we could
 say it will be improved in future versions.

Uh, I am wondering if we just have to force heap access in all cases
until it is fixed.

  nly the <token>lword</token> lexeme, then a <acronym>TZ</acronym>
  definition like ' one 1:11' will not work since lexeme type
  <token>digit</token> is not assigned to the <acronym>TZ</acronym>.
  <!-- what do these numbers mean? -->
  </para>
 
  OK, I changed it to be clearer.
 
  nothing special, just numbers for example.
 
  <function>ts_debug</> displays information about every token of
  <replaceable class=PARAMETER>document</replaceable> as produced by the
  parser and processed by the configured dictionaries using the configuration
  specified by <replaceable class=PARAMETER>cfgname</replaceable> or
  <replaceable class=PARAMETER>oid</replaceable>. <!-- no need for oid
 
  I don't understand this comment. ts_debug accepts a cfgname or its oid.
 
  Again, no need for oid.
 
 We need to decide if we need oids as user-visible argument. I don't see
 any value, probably Teodor think 

Re: [HACKERS] Updated tsearch documentation

2007-06-17 Thread Oleg Bartunov

On Sun, 17 Jun 2007, Bruce Momjian wrote:


I have completed my first pass over the tsearch documentation:

http://momjian.us/expire/fulltext/HTML/sql.html

They are from section 14 and following.

I have come up with a number of questions that I placed in SGML comments
in these files:

http://momjian.us/expire/fulltext/SGML/

Teodor/Oleg, let me know when you want to go over my questions.


Below are my answers (marked as )


Comments to editorial work of Bruce Momjian.

fulltext-intro.sgml:

it is useful to have a predefined list of lexemes.


Bruce, there should be a list of lexeme types here!



</para></listitem>

<!--
SEEMS UNNECESSARY
It useless to attempt normalize <type>email address</type> using
morphological dictionary of russian language, but looks reasonable to pick
out <type>domain name</type> and be able to search for <type>domain
name</type>.
-->


I don't understand where you got this para :)


fulltext-opfunc.sgml:

All of the following functions that accept a configuration argument can
use either an integer <!-- why an integer --> or a textual configuration
name to select a configuration.


Originally it was an integer id; it is probably better to use <type>oid</type>.



This returns the query used for searching an index. It can be used to test
for an empty query. The <command>SELECT</> below returns <literal>'T'</>,
<!-- lowercase? --> which corresponds to an empty query since GIN indexes
do not support negate queries (a full index scan is inefficient):


Upper case. This looks cumbersome; querytree() should probably
just return NULL.


The integer option controls several behaviors which is done using bit-wise
fields and <literal>|</literal> (for example, <literal>2|4</literal>):
<!-- why so complex? -->


to avoid 2 arguments


its <replaceable>id</replaceable> or <replaceable>ts_name</replaceable>; <!-- n
if none is specified that the current configuration is used.


I don't understand this question


<para>
<!-- why?  -->
Note that the cascade dropping of the <function>headline</function> function
cause dropping of the <literal>parser</literal> used in fulltext configuration
<replaceable>tsname</replaceable>.
</para>


Hmm, it should probably be reversed: cascade dropping of the parser causes
dropping of the headline function.


In example below, <literal>fulltext_idx</literal> is
a GIN index:<!-- why isn't this automatic -->


It's explained above. The problem is that the current index API has no way
to say whether a search was lossy or exact, so to preserve the performance of
the GIN index we had to introduce the @@@ operator, which is the same as @@ but
lossy.


nly the <token>lword</token> lexeme, then a <acronym>TZ</acronym>
definition like ' one 1:11' will not work since lexeme type
<token>digit</token> is not assigned to the <acronym>TZ</acronym>.
<!-- what do these numbers mean? -->
</para>


nothing special, just numbers for example.


<function>ts_debug</> displays information about every token of
<replaceable class=PARAMETER>document</replaceable> as produced by the
parser and processed by the configured dictionaries using the configuration
specified by <replaceable class=PARAMETER>cfgname</replaceable> or
<replaceable class=PARAMETER>oid</replaceable>. <!-- no need for oid


I don't understand this comment. ts_debug accepts a cfgname or its oid.





Regards,
Oleg