Re: [HACKERS] Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords

2008-05-06 Thread Bruce Momjian
Tom Lane wrote:
> Bruce Momjian <[EMAIL PROTECTED]> writes:
> > Added to TODO:
> 
> > * Allow text search dictionary to filter out only stop words
> 
> >   http://archives.postgresql.org/pgsql-patches/2007-11/msg00081.php
> 
> That's a poor description.  I thought the TODO was something more like
> "allow dictionaries to change the token that is passed on to later
> dictionaries".

TODO updated as described.

-- 
  Bruce Momjian  <[EMAIL PROTECTED]>        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

-- 
Sent via pgsql-patches mailing list (pgsql-patches@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-patches


Re: [HACKERS] Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords

2008-03-07 Thread Tom Lane
Bruce Momjian <[EMAIL PROTECTED]> writes:
> Added to TODO:

> * Allow text search dictionary to filter out only stop words

>   http://archives.postgresql.org/pgsql-patches/2007-11/msg00081.php

That's a poor description.  I thought the TODO was something more like
"allow dictionaries to change the token that is passed on to later
dictionaries".

regards, tom lane



Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords

2008-03-07 Thread Bruce Momjian

Added to TODO:

* Allow text search dictionary to filter out only stop words

  http://archives.postgresql.org/pgsql-patches/2007-11/msg00081.php


---

Tom Lane wrote:
> Oleg Bartunov <[EMAIL PROTECTED]> writes:
> > Let's consider one example - removing accents.
> > In the past I always recommended that people use regex functions to
> > remove accents before the to_tsvector conversion, but recently I noticed
> > that this trick doesn't work with headline(). So the only way is to put
> > a special dictionary, dict_remove_accent, ahead of the others, where it
> > works as a filter.
> 
> > I don't remember why we left this for future releases, though.
> 
> That would require a system-to-dictionary API change (to be able to
> modify the token under inspection), no?  So it's certainly something
> I'd say is too late for 8.3.
> 
> One thought that came to mind is that the option name should be just
> "Accept" not "AcceptAll".  To me "All" implies that it would accept
> *everything* ... including stopwords.
> 
>   regards, tom lane
> 
> ---(end of broadcast)---
> TIP 4: Have you searched our list archives?
> 
>http://archives.postgresql.org

-- 
  Bruce Momjian  <[EMAIL PROTECTED]>        http://momjian.us
  EnterpriseDB http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords

2007-11-14 Thread Tom Lane
Oleg Bartunov <[EMAIL PROTECTED]> writes:
> On Wed, 14 Nov 2007, Tom Lane wrote:
>> Huh?  This is just an option for the "simple" dictionary, it's got
>> nothing to do with thesaurus AFAICS.

> I can assign simple dictionary as a normalization dictionary for thesaurus

Sure.  So what?  You wouldn't use this option in that case.

regards, tom lane



Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords

2007-11-14 Thread Oleg Bartunov

On Wed, 14 Nov 2007, Tom Lane wrote:

> Oleg Bartunov <[EMAIL PROTECTED]> writes:
>> On Wed, 14 Nov 2007, Tom Lane wrote:
>>> Huh?  This is just an option for the "simple" dictionary, it's got
>>> nothing to do with thesaurus AFAICS.
>
>> I can assign simple dictionary as a normalization dictionary for thesaurus
>
> Sure.  So what?  You wouldn't use this option in that case.

Right. That should be documented to avoid possible confusion.

Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83



Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords

2007-11-14 Thread Oleg Bartunov

On Wed, 14 Nov 2007, Tom Lane wrote:

> Oleg Bartunov <[EMAIL PROTECTED]> writes:
>> On Wed, 14 Nov 2007, Tom Lane wrote:
>>> One thought that came to mind is that the option name should be just
>>> "Accept" not "AcceptAll".  To me "All" implies that it would accept
>>> *everything* ... including stopwords.
>
>> Wait, I remember the problem with filters. How will it work with thesaurus?
>
> Huh?  This is just an option for the "simple" dictionary, it's got
> nothing to do with thesaurus AFAICS.

I can assign simple dictionary as a normalization dictionary for thesaurus

Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83



Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords

2007-11-14 Thread Tom Lane
Jan Urbański <[EMAIL PROTECTED]> writes:
>> This bit should be replaced with defGetBoolean.  Otherwise it looks
>> reasonably sane.

> Fixed that, thank you.

Applied with minor revisions (changed the parameter name, avoided
probably-insignificant memory leak).

regards, tom lane



Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords

2007-11-14 Thread Tom Lane
Oleg Bartunov <[EMAIL PROTECTED]> writes:
> On Wed, 14 Nov 2007, Tom Lane wrote:
>> One thought that came to mind is that the option name should be just
>> "Accept" not "AcceptAll".  To me "All" implies that it would accept
>> *everything* ... including stopwords.

> Wait, I remember the problem with filters. How will it work with thesaurus?

Huh?  This is just an option for the "simple" dictionary, it's got
nothing to do with thesaurus AFAICS.

regards, tom lane



Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords

2007-11-14 Thread Oleg Bartunov

On Wed, 14 Nov 2007, Tom Lane wrote:

> Oleg Bartunov <[EMAIL PROTECTED]> writes:
>> Let's consider one example - removing accents.
>> In the past I always recommended that people use regex functions to
>> remove accents before the to_tsvector conversion, but recently I noticed
>> that this trick doesn't work with headline(). So the only way is to put
>> a special dictionary, dict_remove_accent, ahead of the others, where it
>> works as a filter.
>
>> I don't remember why we left this for future releases, though.
>
> That would require a system-to-dictionary API change (to be able to
> modify the token under inspection), no?  So it's certainly something

It requires one reserved option for dictionaries and the ability to get a
dictionary option.  Unless somebody has a dictionary with the same option,
this change looks harmless.

> I'd say is too late for 8.3.

Yes, probably we will get a better idea.

> One thought that came to mind is that the option name should be just
> "Accept" not "AcceptAll".  To me "All" implies that it would accept
> *everything* ... including stopwords.

Wait, I remember the problem with filters. How will it work with thesaurus?

Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83



Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords

2007-11-14 Thread Tom Lane
Oleg Bartunov <[EMAIL PROTECTED]> writes:
> Let's consider one example - removing accents.
> In the past I always recommended that people use regex functions to
> remove accents before the to_tsvector conversion, but recently I noticed
> that this trick doesn't work with headline(). So the only way is to put
> a special dictionary, dict_remove_accent, ahead of the others, where it
> works as a filter.

> I don't remember why we left this for future releases, though.

That would require a system-to-dictionary API change (to be able to
modify the token under inspection), no?  So it's certainly something
I'd say is too late for 8.3.
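
As a rough illustration of the kind of step being discussed, the standalone C sketch below rewrites a token and leaves the rewritten form for whatever dictionary comes next. It is not PostgreSQL code: remove_accents, AccentMap and the four-entry table are hypothetical names and toy data invented for this example, and nothing here reflects how the dictionary API would actually expose such a filter.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy replacement table: a few UTF-8 accented letters and their base letters. */
typedef struct
{
    const char *from;           /* UTF-8 byte sequence to look for */
    char        to;             /* single ASCII replacement */
} AccentMap;

static const AccentMap accent_map[] = {
    {"\xC3\xA9", 'e'},          /* e with acute */
    {"\xC3\xA8", 'e'},          /* e with grave */
    {"\xC3\xB4", 'o'},          /* o with circumflex */
    {"\xC3\xBC", 'u'},          /* u with diaeresis */
};

/*
 * Hypothetical filter step: copy `in` to `out`, replacing mapped sequences.
 * It neither accepts nor rejects the token; it only produces the form that
 * the next dictionary in the chain would see.  Returns 1 if anything was
 * changed, 0 otherwise.  Output is truncated byte-wise if `out` is too small.
 */
int
remove_accents(const char *in, char *out, size_t outlen)
{
    size_t      o = 0;
    int         changed = 0;

    assert(outlen > 0);
    while (*in != '\0' && o + 1 < outlen)
    {
        size_t      i;
        size_t      matched = 0;

        for (i = 0; i < sizeof(accent_map) / sizeof(accent_map[0]); i++)
        {
            size_t      n = strlen(accent_map[i].from);

            if (strncmp(in, accent_map[i].from, n) == 0)
            {
                out[o++] = accent_map[i].to;
                matched = n;
                changed = 1;
                break;
            }
        }
        if (matched == 0)
        {
            /* No mapping for this byte: copy it through unchanged. */
            out[o++] = *in;
            matched = 1;
        }
        in += matched;
    }
    out[o] = '\0';
    return changed;
}
```

The point of the sketch is only that such a step has an output that is neither "accept", "reject" nor "pass on unchanged", which is why it does not fit the existing return convention.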

One thought that came to mind is that the option name should be just
"Accept" not "AcceptAll".  To me "All" implies that it would accept
*everything* ... including stopwords.

regards, tom lane



Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords

2007-11-14 Thread Oleg Bartunov
In principle the right way is to allow any dictionary to have an option
like 'PassThrough', plus an internal function get_dict_options(dict, option)
to check whether the PassThrough option is true.

Let's consider one example - removing accents.
In the past I always recommended that people use regex functions to remove
accents before the to_tsvector conversion, but recently I noticed that this
trick doesn't work with headline(). So the only way is to put a special
dictionary, dict_remove_accent, ahead of the others, where it works as a
filter.

I don't remember why we left this for future releases, though.

Oleg
On Wed, 14 Nov 2007, Tom Lane wrote:

> This patch:
> http://archives.postgresql.org/pgsql-patches/2007-11/msg00137.php
> seems simple and useful enough that I think we ought to slip it into
> 8.3, even though we are far past feature freeze.
>
> As the "simple" dictionary type stands in CVS HEAD, it is only useful as
> the last dictionary in a stack, since it never passes anything on as
> unrecognized.  With the proposed AcceptAll = false option, it could be
> used to filter out some stopwords before feeding tokens to another
> dictionary.  While most dictionary types have their own stopword support,
> some of them match stopwords after their own normalization processing,
> and so there's no way to filter on pre-normalized words.  That seems
> like a good improvement, even without the specific need-example that
> Jan provided at the start of the thread.
>
> Normally we'd never consider adding a new feature so late in the
> development cycle, but this seems small enough and useful enough
> to make an exception.  Comments?
>
> regards, tom lane



Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83



Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords

2007-11-14 Thread Bruce Momjian
Tom Lane wrote:
> This patch:
> http://archives.postgresql.org/pgsql-patches/2007-11/msg00137.php
> seems simple and useful enough that I think we ought to slip it into
> 8.3, even though we are far past feature freeze.
> 
> As the "simple" dictionary type stands in CVS HEAD, it is only useful as
> the last dictionary in a stack, since it never passes anything on as
> unrecognized.  With the proposed AcceptAll = false option, it could be
> used to filter out some stopwords before feeding tokens to another
> dictionary.  While most dictionary types have their own stopword support,
> some of them match stopwords after their own normalization processing,
> and so there's no way to filter on pre-normalized words.  That seems
> like a good improvement, even without the specific need-example that
> Jan provided at the start of the thread.
> 
> Normally we'd never consider adding a new feature so late in the
> development cycle, but this seems small enough and useful enough
> to make an exception.  Comments?

Agreed.  The logic is that textsearch is getting a major overhaul in 8.3
and it is reasonable to keep adjusting things during beta.

-- 
  Bruce Momjian  <[EMAIL PROTECTED]>        http://momjian.us
  EnterpriseDB http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +



Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords

2007-11-14 Thread Tom Lane
This patch:
http://archives.postgresql.org/pgsql-patches/2007-11/msg00137.php
seems simple and useful enough that I think we ought to slip it into
8.3, even though we are far past feature freeze.

As the "simple" dictionary type stands in CVS HEAD, it is only useful as
the last dictionary in a stack, since it never passes anything on as
unrecognized.  With the proposed AcceptAll = false option, it could be
used to filter out some stopwords before feeding tokens to another
dictionary.  While most dictionary types have their own stopword support,
some of them match stopwords after their own normalization processing,
and so there's no way to filter on pre-normalized words.  That seems
like a good improvement, even without the specific need-example that
Jan provided at the start of the thread.
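
To make the two modes concrete, here is a minimal standalone C model of the decision described above. It is a sketch, not the backend code: ToySimpleDict, SimpleResult and toy_simple_lexize are hypothetical names, the stop list is a plain array rather than a stop-word file, and the three enum values stand in for "return NULL", "return an empty array" and "return the lexeme".

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef enum
{
    RESULT_NULL,                /* not recognized: pass token to next dictionary */
    RESULT_EMPTY,               /* stop word: discard the token */
    RESULT_LEXEME               /* recognized: `out` holds the lower-cased lexeme */
} SimpleResult;

typedef struct
{
    const char *const *stopwords;   /* NULL-terminated list, lower case */
    bool        accept_all;         /* models the proposed AcceptAll option */
} ToySimpleDict;

SimpleResult
toy_simple_lexize(const ToySimpleDict *d, const char *token,
                  char *out, size_t outlen)
{
    size_t      i;

    assert(outlen > 0);

    /* Lower-case the token first, as the "simple" template does. */
    for (i = 0; token[i] != '\0' && i + 1 < outlen; i++)
        out[i] = (char) tolower((unsigned char) token[i]);
    out[i] = '\0';

    /* Empty tokens and stop words are discarded in both modes. */
    if (out[0] == '\0')
        return RESULT_EMPTY;
    for (i = 0; d->stopwords[i] != NULL; i++)
        if (strcmp(d->stopwords[i], out) == 0)
            return RESULT_EMPTY;

    /* Only the treatment of non-stop-words depends on the option. */
    return d->accept_all ? RESULT_LEXEME : RESULT_NULL;
}
```

With accept_all set to false the model never returns RESULT_LEXEME, so it can sit in front of another dictionary purely as a stop-word filter; with accept_all set to true it behaves like the existing last-in-stack dictionary.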

Normally we'd never consider adding a new feature so late in the
development cycle, but this seems small enough and useful enough
to make an exception.  Comments?

regards, tom lane



Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords

2007-11-14 Thread Jan Urbański
> This bit should be replaced with defGetBoolean.  Otherwise it looks
> reasonably sane.

Fixed that, thank you.

Regards,
Jan Urbanski
-- 
Jan Urbanski
GPG key ID: E583D7D2

ouden estin
diff -Naur postgresql-8.3beta2-orig/doc/src/sgml/textsearch.sgml postgresql-8.3beta2/doc/src/sgml/textsearch.sgml
--- postgresql-8.3beta2-orig/doc/src/sgml/textsearch.sgml   2007-10-27 02:19:45.0 +0200
+++ postgresql-8.3beta2/doc/src/sgml/textsearch.sgml    2007-11-14 03:35:48.0 +0100
@@ -2090,9 +2090,10 @@

 The simple dictionary template operates by converting the
 input token to lower case and checking it against a file of stop words.
-If it is found in the file then NULL is returned, causing
-the token to be discarded.  If not, the lower-cased form of the word
-is returned as the normalized lexeme.
+If it is found in the file then an empty array is returned. If not, the
+return value depends on the configuration. The default is to return the
+lower-cased form of the word, but one might choose to
+return NULL instead.

 

@@ -2135,6 +2136,34 @@
 

 
+   
+ We can also choose to return NULL instead of the lower-cased
+ lexeme if it is not found in the stop words file. This can be useful if
+ we just want to pass the unchanged lexeme to another dictionary instead
+ of reporting it as recognized. We can control this behaviour through
+ the AcceptAll parameter. Correct values for this parameter
+ are true and false, the default
+ is true.
+   
+
+   
+ Using the same configuration as in the previous example:
+
+
+ALTER TEXT SEARCH DICTIONARY public.simple_dict ( AcceptAll = false );
+
+SELECT ts_lexize('public.simple_dict','YeS');
+ ts_lexize
+---
+
+
+SELECT ts_lexize('public.simple_dict','The');
+ ts_lexize
+---
+ {}
+
+   
+

 
  Most types of dictionaries rely on configuration files, such as files of
diff -Naur postgresql-8.3beta2-orig/src/backend/tsearch/dict_simple.c postgresql-8.3beta2/src/backend/tsearch/dict_simple.c
--- postgresql-8.3beta2-orig/src/backend/tsearch/dict_simple.c  2007-08-25 02:03:59.0 +0200
+++ postgresql-8.3beta2/src/backend/tsearch/dict_simple.c   2007-11-14 12:17:05.0 +0100
@@ -23,6 +23,7 @@
 typedef struct
 {
     StopList    stoplist;
+    bool        acceptAll;
 } DictSimple;
 
 
@@ -31,9 +32,12 @@
 {
     List       *dictoptions = (List *) PG_GETARG_POINTER(0);
     DictSimple *d = (DictSimple *) palloc0(sizeof(DictSimple));
-    bool        stoploaded = false;
+    bool        stoploaded = false,
+                acceptloaded = false;
     ListCell   *l;
 
+    d->acceptAll = true;
+
     foreach(l, dictoptions)
     {
         DefElem    *defel = (DefElem *) lfirst(l);
@@ -47,6 +51,18 @@
             readstoplist(defGetString(defel), &d->stoplist, lowerstr);
             stoploaded = true;
         }
+        else if (pg_strcasecmp("AcceptAll", defel->defname) == 0)
+        {
+            if (acceptloaded)
+                ereport(ERROR,
+                        (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                         errmsg("multiple AcceptAll parameters")));
+            if (defGetBoolean(defel))
+                d->acceptAll = true;
+            else
+                d->acceptAll = false;
+            acceptloaded = true;
+        }
         else
         {
             ereport(ERROR,
@@ -71,9 +87,18 @@
     txt = lowerstr_with_len(in, len);
 
     if (*txt == '\0' || searchstoplist(&(d->stoplist), txt))
+    {
         pfree(txt);
+        PG_RETURN_POINTER(res);
+    }
     else
-        res[0].lexeme = txt;
-
-    PG_RETURN_POINTER(res);
+    {
+        if (d->acceptAll)
+        {
+            res[0].lexeme = txt;
+            PG_RETURN_POINTER(res);
+        }
+        else
+            PG_RETURN_POINTER(NULL);
+    }
 }




Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords

2007-11-14 Thread Jan Urbański
Jan Urbański wrote:
> Great, I didn't know the API was that convenient in 8.3. I'll try
> posting a working patch for 8.3 during the weekend.

Here's the patch for 8.3beta2. As was suggested I added a configuration
parameter to the 'simple' dictionary called AcceptAll so now it can work
in two modes: either accept everything (the default) or do not
recognize anything (return NULL). Of course stopwords are still being
weeded out.

The patch includes changes to the documentation (which was inconsistent
by the way: it stated that the 'simple' dictionary returns NULL for
stopwords, when in fact it returns an empty array).

Regards,
Jan Urbanski
-- 
Jan Urbanski
GPG key ID: E583D7D2

ouden estin
diff -Naur postgresql-8.3beta2-orig/doc/src/sgml/textsearch.sgml postgresql-8.3beta2/doc/src/sgml/textsearch.sgml
--- postgresql-8.3beta2-orig/doc/src/sgml/textsearch.sgml   2007-10-27 02:19:45.0 +0200
+++ postgresql-8.3beta2/doc/src/sgml/textsearch.sgml    2007-11-14 03:35:48.0 +0100
@@ -2090,9 +2090,10 @@

 The simple dictionary template operates by converting the
 input token to lower case and checking it against a file of stop words.
-If it is found in the file then NULL is returned, causing
-the token to be discarded.  If not, the lower-cased form of the word
-is returned as the normalized lexeme.
+If it is found in the file then an empty array is returned. If not, the
+return value depends on the configuration. The default is to return the
+lower-cased form of the word, but one might choose to
+return NULL instead.

 

@@ -2135,6 +2136,34 @@
 

 
+   
+ We can also choose to return NULL instead of the lower-cased
+ lexeme if it is not found in the stop words file. This can be useful if
+ we just want to pass the unchanged lexeme to another dictionary instead
+ of reporting it as recognized. We can control this behaviour through
+ the AcceptAll parameter. Correct values for this parameter
+ are true and false, the default
+ is true.
+   
+
+   
+ Using the same configuration as in the previous example:
+
+
+ALTER TEXT SEARCH DICTIONARY public.simple_dict ( AcceptAll = false );
+
+SELECT ts_lexize('public.simple_dict','YeS');
+ ts_lexize
+---
+
+
+SELECT ts_lexize('public.simple_dict','The');
+ ts_lexize
+---
+ {}
+
+   
+

 
  Most types of dictionaries rely on configuration files, such as files of
diff -Naur postgresql-8.3beta2-orig/src/backend/tsearch/dict_simple.c postgresql-8.3beta2/src/backend/tsearch/dict_simple.c
--- postgresql-8.3beta2-orig/src/backend/tsearch/dict_simple.c  2007-08-25 02:03:59.0 +0200
+++ postgresql-8.3beta2/src/backend/tsearch/dict_simple.c   2007-11-14 03:39:45.0 +0100
@@ -23,6 +23,7 @@
 typedef struct
 {
     StopList    stoplist;
+    bool        acceptAll;
 } DictSimple;
 
 
@@ -31,8 +32,12 @@
 {
     List       *dictoptions = (List *) PG_GETARG_POINTER(0);
     DictSimple *d = (DictSimple *) palloc0(sizeof(DictSimple));
-    bool        stoploaded = false;
+    bool        stoploaded = false,
+                acceptloaded = false;
     ListCell   *l;
+    const char *defstring;
+
+    d->acceptAll = true;
 
     foreach(l, dictoptions)
     {
@@ -47,6 +52,24 @@
             readstoplist(defGetString(defel), &d->stoplist, lowerstr);
             stoploaded = true;
         }
+        else if (pg_strcasecmp("AcceptAll", defel->defname) == 0)
+        {
+            if (acceptloaded)
+                ereport(ERROR,
+                        (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                         errmsg("multiple AcceptAll parameters")));
+            defstring = defGetString(defel);
+            if (pg_strcasecmp(defstring, "True") == 0)
+                d->acceptAll = true;
+            else if (pg_strcasecmp(defstring, "False") == 0)
+                d->acceptAll = false;
+            else
+                ereport(ERROR,
+                        (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                         errmsg("invalid value for AcceptAll parameter: \"%s\"",
+                                defstring)));
+            acceptloaded = true;
+        }
         else
         {
             ereport(ERROR,
@@ -71,9 +94,18 @@
     txt = lowerstr_with_len(in, len);
 
     if (*txt == '\0' || searchstoplist(&(d->stoplist), txt))
+    {
         pfree(txt);
+        PG_RETURN_POINTER(res);
+    }
     else
-        res[0].lexeme = txt;
-
-    PG_RETURN_POINTER(res);
+    {
+        if (d->acceptAll)
+        {
+ 

Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords

2007-11-13 Thread Tom Lane
> +            defstring = defGetString(defel);
> +            if (pg_strcasecmp(defstring, "True") == 0)
> +                d->acceptAll = true;
> +            else if (pg_strcasecmp(defstring, "False") == 0)
> +                d->acceptAll = false;
> +            else
> +                ereport(ERROR,
> +                        (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +                         errmsg("invalid value for AcceptAll parameter: \"%s\"",
> +                                defstring)));

This bit should be replaced with defGetBoolean.  Otherwise it looks
reasonably sane.
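
The review point is that boolean option parsing belongs in one shared helper rather than in hand-written string comparisons inside each dictionary. The standalone sketch below illustrates that idea only; it is not the backend's defGetBoolean. parse_bool_option and ci_equal are hypothetical names, and the set of accepted spellings ("true", "false", "1", "0") is this sketch's own choice, not a statement about what defGetBoolean accepts.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>

/* Case-insensitive string equality (portable stand-in for pg_strcasecmp). */
static bool
ci_equal(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0')
    {
        if (tolower((unsigned char) *a) != tolower((unsigned char) *b))
            return false;
        a++;
        b++;
    }
    return *a == '\0' && *b == '\0';
}

/*
 * Hypothetical helper: parse a boolean option value in one place.
 * Returns true on success and stores the parsed value in *result;
 * returns false for an unrecognized spelling so the caller can report it.
 */
bool
parse_bool_option(const char *value, bool *result)
{
    if (ci_equal(value, "true") || ci_equal(value, "1"))
    {
        *result = true;
        return true;
    }
    if (ci_equal(value, "false") || ci_equal(value, "0"))
    {
        *result = false;
        return true;
    }
    return false;
}
```

A caller then reduces to a single call plus one error path, which is the shape the revised patch takes once the hand-rolled comparisons are replaced.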

regards, tom lane



Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords

2007-11-09 Thread Jan Urbański
> That doesn't have a whole lot to do with where we are today:
> http://developer.postgresql.org/pgdocs/postgres/textsearch-dictionaries.html#TEXTSEARCH-SIMPLE-DICTIONARY
> http://developer.postgresql.org/cvsweb.cgi/pgsql/src/backend/tsearch/dict_simple.c

Great, I didn't know the API was that convenient in 8.3. I'll try
posting a working patch for 8.3 during the weekend.

Regards,
-- 
Jan Urbanski
GPG key ID: E583D7D2

ouden estin





Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords

2007-11-09 Thread Tom Lane
Jan Urbański <[EMAIL PROTECTED]> writes:
>> If there is a use-case for it, IMHO it'd be better to add a boolean
>> accept-or-pass-on parameter to the "simple" dictionary than to add a
>> whole new dictionary type.

> Ah, I never thought of it. You may be very right - it does look like an
> easier solution. However, it would require coding some basic parsing
> logic into the dex_init procedure, because right now the 'simple'
> dictionary expects dict_initoption to be a path to the stopwords file.

That doesn't have a whole lot to do with where we are today:
http://developer.postgresql.org/pgdocs/postgres/textsearch-dictionaries.html#TEXTSEARCH-SIMPLE-DICTIONARY
http://developer.postgresql.org/cvsweb.cgi/pgsql/src/backend/tsearch/dict_simple.c

regards, tom lane



Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords

2007-11-09 Thread Jan Urbański
> This example still doesn't seem very convincing --- why would you not
> merely attach the stopword list to the pl_ispell dictionary?

Because the ispell-based dictionaries first stem the lexeme and then
search for it in the stopwords file. The situation here is that a
stopword is first stemmed to produce another lexeme (which is not in the
stopwords file, as it's a perfectly valid word) and then gets indexed,
instead of being discarded.
To restate: the word 'od' in Polish is both a preposition and a declined
form of the noun 'oda'. The ispell dictionary when passed the lexeme
'od' first stems it to produce 'oda' and then fails to find it in the
stopwords file. If I'd include the word 'oda' in the stopwords file, I'd
be losing information about the noun 'oda' appearing in documents.
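
The ordering problem can be reduced to a few lines of standalone C. This is a toy model under stated assumptions, not ispell: toy_stem knows only the single "od" -> "oda" rule taken from the paragraph above, toy_is_stopword holds a one-word stop list, and all four function names are hypothetical. In this sketch a NULL return simply means "discarded"; it is unrelated to the NULL that a real lexize function returns.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stemmer: the only rule, taken from the thread, is "od" -> "oda". */
const char *
toy_stem(const char *word)
{
    if (strcmp(word, "od") == 0)
        return "oda";
    return word;
}

/* Toy stop list: contains the preposition "od" but not the noun "oda". */
int
toy_is_stopword(const char *word)
{
    return strcmp(word, "od") == 0;
}

/*
 * Stem first, then consult the stop list (the order described above).
 * Returns the lexeme to index, or NULL if the word is discarded.
 */
const char *
stem_then_filter(const char *word)
{
    const char *stem = toy_stem(word);

    return toy_is_stopword(stem) ? NULL : stem;
}

/*
 * Consult the stop list first, then stem (what a separate stop-word
 * filter placed in front of the stemmer achieves).
 */
const char *
filter_then_stem(const char *word)
{
    return toy_is_stopword(word) ? NULL : toy_stem(word);
}
```

With stem_then_filter, "od" comes out as the indexable lexeme "oda"; with filter_then_stem it is discarded, while a genuine occurrence of "oda" is kept in both cases.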

I'm still trying to find an English example, as I'm sure it would be
easier to understand by most readers of this list. Nothing comes to my
mind, however - I guess some languages just have rotten luck with their
grammar.

> If there is a use-case for it, IMHO it'd be better to add a boolean
> accept-or-pass-on parameter to the "simple" dictionary than to add a
> whole new dictionary type.

Ah, I never thought of it. You may be very right - it does look like an
easier solution. However, it would require coding some basic parsing
logic into the dex_init procedure, because right now the 'simple'
dictionary expects dict_initoption to be a path to the stopwords file.
Do you mean something like 'StopFile="/path/to/stopwords",
AcceptUnknown=0'?

Regards,
Jan Urbanski
-- 
Jan Urbanski
GPG key ID: E583D7D2

ouden estin





Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords

2007-11-09 Thread Jan Urbański
> dictionaries. In this case, you would first check against one stopword
> list, eliminating 'od', then check the ispell dictionary, and then check
> another stopword list without 'od'.

My problem is basically solved using the patch I sent earlier. I use
'{stop, pl_ispell, simple}' which has the effect of:
a) eliminating words that are stopwords but, when stemmed, produce
non-stopwords (such as 'od', which gets stemmed to 'oda')
b) stemming non-stopwords properly (using an ispell dictionary)
c) indexing words that are not recognized by ispell (for instance
'postgresql' gets indexed as 'postgresql')
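
The three effects listed above can be traced with a small standalone C model of such a chain. Everything in it is a stand-in chosen for illustration: toy_stop, toy_ispell and toy_simple are hypothetical functions, the "ispell" step knows only the words "od" and "oda", and Outcome replaces the real return convention of the dictionary API.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

typedef enum { PASS_ON, REJECT, ACCEPT } Outcome;

/* A toy dictionary writes its lexeme into `out` when it returns ACCEPT. */
typedef Outcome (*ToyDict) (const char *token, char *out, size_t outlen);

/* Copy src to out in lower case, truncating if needed. */
static void
lower_copy(const char *src, char *out, size_t outlen)
{
    size_t      i;

    assert(outlen > 0);
    for (i = 0; src[i] != '\0' && i + 1 < outlen; i++)
        out[i] = (char) tolower((unsigned char) src[i]);
    out[i] = '\0';
}

/* "stop": rejects the stop word "od", passes everything else on. */
Outcome
toy_stop(const char *token, char *out, size_t outlen)
{
    lower_copy(token, out, outlen);
    return strcmp(out, "od") == 0 ? REJECT : PASS_ON;
}

/* "pl_ispell" stand-in: knows only "od" and "oda", both stemming to "oda". */
Outcome
toy_ispell(const char *token, char *out, size_t outlen)
{
    lower_copy(token, out, outlen);
    if (strcmp(out, "od") == 0 || strcmp(out, "oda") == 0)
    {
        lower_copy("oda", out, outlen);
        return ACCEPT;
    }
    return PASS_ON;
}

/* "simple" in its default mode: accepts anything, lower-cased. */
Outcome
toy_simple(const char *token, char *out, size_t outlen)
{
    lower_copy(token, out, outlen);
    return ACCEPT;
}

/* Ask each dictionary in turn until one accepts or rejects the token. */
Outcome
run_chain(const ToyDict *chain, size_t n, const char *token,
          char *out, size_t outlen)
{
    size_t      i;

    for (i = 0; i < n; i++)
    {
        Outcome     r = chain[i] (token, out, outlen);

        if (r != PASS_ON)
            return r;
    }
    return PASS_ON;
}
```

Running the same tokens through a chain without the leading stop filter shows the original problem: "od" reaches the stemmer and is accepted as "oda".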

> I suggested that a while ago
> (http://archives.postgresql.org/pgsql-hackers/2007-08/msg01036.php).
> Hopefully Oleg or someone else gets around restructuring the
> dictionaries in a future release.

I'm glad to see I'm not the only one who needs a more sophisticated way
of chaining dictionaries. I do understand, however, the problems that
arise when one wants to extend the dictionary API beyond the
reject/accept/pass-on scheme. For these three we have an easy way of
passing the result from lexize - it returns an empty array, an array of
stemmed lexemes or NULL. If more complex actions were to be taken, I'm
afraid lexize would have to return something more complex than just
text[].
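
The mapping between the three result shapes and the three actions can be written down directly. The standalone C sketch below assumes a simplified representation, a NULL-terminated array of C strings in place of text[], and uses hypothetical names (LexOutcome, classify_result) that do not exist in PostgreSQL.

```c
#include <assert.h>
#include <stddef.h>

typedef enum
{
    LEX_PASS_ON,                /* NULL: not recognized, try the next dictionary */
    LEX_REJECT,                 /* empty array: stop word, discard the token */
    LEX_ACCEPT                  /* non-empty array: use these lexemes */
} LexOutcome;

/*
 * Classify a lexize-style result.  `lexemes` is either NULL or a
 * NULL-terminated array of strings (a stand-in for text[]).
 */
LexOutcome
classify_result(const char *const *lexemes)
{
    if (lexemes == NULL)
        return LEX_PASS_ON;
    if (lexemes[0] == NULL)
        return LEX_REJECT;
    return LEX_ACCEPT;
}
```

Because all three cases are already encoded in the shape of one array, a fourth action such as "pass on a modified token" has nowhere to go without changing the return type, which is the difficulty described above.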

> I wonder if you could hack the ispell dictionary file to treat oda
> specially?

I thought about it, but it turned out that writing a custom dictionary
was easier than figuring out how ispell works internally.

Regards,
-- 
Jan Urbanski
GPG key ID: E583D7D2

ouden estin





Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords

2007-11-09 Thread Heikki Linnakangas

Jan Urbański wrote:
>>> The solution I came up with was simple: write a dictionary, that does
>>> only one thing: looks up the lexeme in a stopwords file and either
>>> discards it or returns NULL.
>>
>> Doesn't the "simple" dictionary handle this?
>
> I don't think so. The 'simple' dictionary discards stopwords, but
> accepts any other lexemes. So if I use {'simple', 'pl_ispell'} for my
> config, I'll get rid of the stopwords, but I won't get any lexemes
> stemmed by ispell. Every lexeme that's not a stopword will produce the
> very same lexeme (this is how I think the 'simple' dictionary works).
>
> My dictionary does basically the same thing as the 'simple' dictionary,
> but it returns NULL instead of the original lexeme in case the lexeme is
> not found in the stopwords file.

In the long term, what we really need is a more flexible way to chain
dictionaries. In this case, you would first check against one stopword
list, eliminating 'od', then check the ispell dictionary, and then check
another stopword list without 'od'.

I suggested that a while ago
(http://archives.postgresql.org/pgsql-hackers/2007-08/msg01036.php).
Hopefully Oleg or someone else gets around restructuring the
dictionaries in a future release.

I wonder if you could hack the ispell dictionary file to treat oda
specially?


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords

2007-11-09 Thread Jan Urbański
>> The solution I came up with was simple: write a dictionary, that does
>> only one thing: looks up the lexeme in a stopwords file and either
>> discards it or returns NULL.
> 
> Doesn't the "simple" dictionary handle this?

I don't think so. The 'simple' dictionary discards stopwords, but
accepts any other lexemes. So if I use {'simple', 'pl_ispell'} for my
config, I'll get rid of the stopwords, but I won't get any lexemes
stemmed by ispell. Every lexeme that's not a stopword will produce the
very same lexeme (this is how I think the 'simple' dictionary works).

My dictionary does basically the same thing as the 'simple' dictionary,
but it returns NULL instead of the original lexeme in case the lexeme is
not found in the stopwords file.

Regards,
-- 
Jan Urbanski
GPG key ID: E583D7D2

ouden estin





Re: [PATCHES] a tsearch2 (8.2.4) dictionary that only filters out stopwords

2007-11-08 Thread Tom Lane
Jan Urbański <[EMAIL PROTECTED]> writes:
> The solution I came up with was simple: write a dictionary, that does
> only one thing: looks up the lexeme in a stopwords file and either
> discards it or returns NULL.

Doesn't the "simple" dictionary handle this?

regards, tom lane
