Re: [HACKERS] Tsearch vs Snowball, or what's a source file?

2007-06-18 Thread Tom Lane
Teodor Sigaev [EMAIL PROTECTED] writes:
> 1) rename FULLTEXT to TEXT SEARCH in SQL command
>
>    Working on it, I found a rather obvious undesired side effect: if TEXT
> becomes a keyword, then any output of the name of the text type becomes
> quoted, even if TEXT is in the unreserved_keyword list.

Yeah, I was aware that would happen.  What I've been thinking for some
time is that we ought to fix quote_ident and ruleutils.c to distinguish
unreserved keywords from the other ones, and not quote unreserved
ones.  The list of unreserved words has gotten longer and more invasive
in every release, and I don't see that stopping soon.  It's already
annoying that perfectly ordinary words like "document" and "key" get
quoted.

The main argument I can think of for not doing this is that if a
currently non-reserved keyword becomes reserved in a future release,
then having quoted it would prevent problems with restoring dump
scripts.  I don't find this argument very compelling, though; it seems
just as likely that words we don't even have as keywords today will
become reserved later.  The only thing I see on the horizon that does
fit that argument is WITH, which we could special-case.
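
To make the proposal concrete, here is a minimal sketch in Python. It is not
the actual quote_ident or ruleutils.c code (that is C); it only illustrates
the rule being discussed. The keyword sets are tiny hypothetical samples, not
the real grammar lists, and ALWAYS_QUOTE is an invented name for the WITH
special case.

```python
import re

# Hypothetical samples only; the real keyword lists live in the grammar.
OTHER_KEYWORDS = {"select", "table", "where", "order"}   # anything not unreserved
UNRESERVED_KEYWORDS = {"document", "key", "text", "with"}
# Unreserved words that are expected to become reserved, so still quoted.
ALWAYS_QUOTE = {"with"}

# Bare identifiers: lower-case letter or underscore, then letters/digits/underscores.
SAFE_IDENT = re.compile(r"[a-z_][a-z0-9_]*")


def quote_ident(name: str) -> str:
    """Return name unchanged if it is safe bare, else double-quote it."""
    needs_quotes = (
        SAFE_IDENT.fullmatch(name) is None
        or name in OTHER_KEYWORDS
        or name in ALWAYS_QUOTE
    )
    if not needs_quotes:
        # Covers plain identifiers and unreserved keywords alike.
        return name
    # Embedded double quotes are doubled, as in SQL quoted identifiers.
    return '"' + name.replace('"', '""') + '"'
```

With these sample lists, "document" and "key" come back bare, while "select"
and "with" are still quoted.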

Comments?  I'm willing to make this happen if there are no objections.

regards, tom lane

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Tsearch vs Snowball, or what's a source file?

2007-06-15 Thread Teodor Sigaev

> I've done some more work on this point.  After looking at the Snowball
> code in more detail, I'm thinking it'd be a good idea to keep it at
> arm's length in a loadable shared library, instead of incorporating it


I split the stemmers into two sets because of the regression tests. As I
remember, there were some problems with loadable conversions and configure's
--disable-shared flag.




> the only entry points being snowball_init() and snowball_lexize()
> tsearch dictionary support functions.  (I am thinking of having just one
> such function pair, with the init function taking an init option to
> select which stemmer to use, instead of a separate Postgres function
> pair per stemmer.)
So we need to change the dictinitoption format of the snowball dictionaries to
specify both the stop-word file and the language name.
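
As an illustration only, a combined init option could be parsed as in the
Python sketch below. The thread does not define the new format, so the
"Key=value, Key=value" syntax, the key names Language and StopFile, and the
function name parse_dictinitoption are all assumptions made for this example.

```python
def parse_dictinitoption(option: str) -> dict:
    """Parse a hypothetical 'Key=value, Key=value' init option string."""
    parsed = {}
    for item in option.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate stray commas
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {item!r}")
        # Keys are case-insensitive; values may be wrapped in double quotes.
        parsed[key.strip().lower()] = value.strip().strip('"')
    if "language" not in parsed:
        raise ValueError("init option must name a language")
    return parsed
```

The language is made mandatory here because a single shared init function
could not pick a stemmer without it; the stop-word file stays optional.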




> If there're no objections, I'll push forward with completing the
> dictionary support functions to go with this infrastructure.

How will we synchronize our changes to the patch?

--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/



Re: [HACKERS] Tsearch vs Snowball, or what's a source file?

2007-06-15 Thread Tom Lane
Teodor Sigaev [EMAIL PROTECTED] writes:
> I split the stemmers into two sets because of the regression tests. As I
> remember, there were some problems with loadable conversions and
> configure's --disable-shared flag.

I'm not worried about supporting --disable-shared installations very
much.  They didn't have tsearch support before, either, and they have
not passed regression in many years for lack of plpgsql.

> So we need to change the dictinitoption format of the snowball dictionaries
> to specify both the stop-word file and the language name.

Right.

> How will we synchronize our changes to the patch?

Go ahead and make the changes you want, and then I'll work on this.

regards, tom lane



Re: [HACKERS] Tsearch vs Snowball, or what's a source file?

2007-06-15 Thread Teodor Sigaev
danish, dutch, finnish, french, german, hungarian, italian, norwegian, 
portuguese, spanish, swedish, russian and english


Albe Laurenz wrote:
> Tom Lane wrote:
>> Teodor Sigaev [EMAIL PROTECTED] writes:
>>> So we need to change the dictinitoption format of the snowball
>>> dictionaries to specify both the stop-word file and the language name.
>>
>> Right.
>
> Is there any chance to get support for other languages than English and
> Russian into the tsearch2 distribution?
>
> Yours,
> Laurenz Albe



--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/



Re: [HACKERS] Tsearch vs Snowball, or what's a source file?

2007-06-15 Thread Albe Laurenz
Tom Lane wrote:
> Teodor Sigaev [EMAIL PROTECTED] writes:
>> So we need to change the dictinitoption format of the snowball
>> dictionaries to specify both the stop-word file and the language name.
>
> Right.

Is there any chance to get support for other languages than English and
Russian into the tsearch2 distribution?

Yours,
Laurenz Albe



Re: [HACKERS] Tsearch vs Snowball, or what's a source file?

2007-06-15 Thread Teodor Sigaev

> Go ahead and make the changes you want, and then I'll work on this.


So, my plan for this weekend:
1) rename FULLTEXT to TEXT SEARCH in the SQL commands
2) rework the Snowball stemmers as Tom suggested
3) ALTER FULLTEXT CONFIGURATION cfgname ADD/ALTER/DROP MAPPING
4) remove support for a default configuration per schema. There will be only
   one default configuration per locale.


About the security holes in PARSER/DICTIONARY, I see the following options now:
1) Allow only superusers to run CREATE/ALTER/DROP PARSER/DICTIONARY
   Disadvantage: hosting users will not be able to change dictionaries
2) Remove CREATE/ALTER/DROP PARSER, split pg_ts_dict into pg_ts_dict_template
   and pg_ts_dict, and change CREATE/ALTER/DROP DICTIONARY accordingly
   Disadvantage: parsers and dictionary templates will not be dumped/restored;
   they would have to be restored manually (just an INSERT into
   pg_ts_parser/pg_ts_dict_template)
3) Similar to the previous option, but:
   * CREATE/ALTER/DROP PARSER - superuser only
   * CREATE/ALTER/DROP DICTIONARY TEMPLATE - superuser only
   * CREATE/ALTER/DROP DICTIONARY - allowed for non-superusers
   Disadvantage: a new command, CREATE/ALTER/DROP DICTIONARY TEMPLATE


Which option should we choose? Or am I missing some variant?

I would like to go with option 3). Comments?
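
The privilege split in option 3 can be pictured with the small Python sketch
below. It is only an illustration of the rule, not backend code; the table
name SUPERUSER_ONLY and the function may_modify are hypothetical.

```python
# For each object kind in option 3: is CREATE/ALTER/DROP superuser-only?
SUPERUSER_ONLY = {
    "PARSER": True,
    "DICTIONARY TEMPLATE": True,
    "DICTIONARY": False,
}


def may_modify(object_kind: str, is_superuser: bool) -> bool:
    """Return True if the user may CREATE/ALTER/DROP this kind of object."""
    if object_kind not in SUPERUSER_ONLY:
        raise ValueError(f"unknown object kind {object_kind!r}")
    return is_superuser or not SUPERUSER_ONLY[object_kind]
```

So a hosting user could still define a dictionary on top of an existing
template, but could not install a new parser or template.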

--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/



Re: [HACKERS] Tsearch vs Snowball, or what's a source file?

2007-06-14 Thread Tom Lane
I wrote:
> Teodor Sigaev [EMAIL PROTECTED] writes:
>> 2 Snowball's compiling infrastructure doesn't support Windows target.
>
> Yeah.  Another problem with using their original source code is that
> running the Snowball compiler during build would not work for
> cross-compiled builds of Postgres, at least not without solving the
> problem of building some code for the host platform instead of the
> target.
>
> So what I'm thinking now is we should import libstemmer instead of the
> snowball_code representation.  I haven't gotten as far as thinking about
> exactly how to lay out the files though.

I've done some more work on this point.  After looking at the Snowball
code in more detail, I'm thinking it'd be a good idea to keep it at
arm's length in a loadable shared library, instead of incorporating it
directly into the backend.  This is because they don't see anything
wrong with exporting random global function names like eq_v and
skip_utf8; so the probability of name collisions is a bit too high for
my taste.  The current tsearch_core patch envisions having a couple of
the snowball stemmers in the core backend and the rest in a loadable
library, but I suggest we just put them all in a loadable library, with
the only entry points being snowball_init() and snowball_lexize()
tsearch dictionary support functions.  (I am thinking of having just one
such function pair, with the init function taking an init option to
select which stemmer to use, instead of a separate Postgres function
pair per stemmer.)
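
The "one function pair" idea can be sketched in Python as follows. This is
not the real dictionary code, which would be C inside the loadable module:
only the names snowball_init and snowball_lexize come from the text above.
The signatures, the STEMMERS table, the SnowballState type and the toy
stemmers are assumptions for illustration, as is the choice to return an
empty list for a stop word.

```python
from typing import Callable, Dict, Iterable, List, NamedTuple


def _toy_english_stem(word: str) -> str:
    # Placeholder, NOT the Snowball algorithm: drop a trailing plural "s".
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def _identity_stem(word: str) -> str:
    # Placeholder that leaves the word unchanged.
    return word


# One table of stemmers inside the module, keyed by language name.
STEMMERS: Dict[str, Callable[[str], str]] = {
    "english": _toy_english_stem,
    "russian": _identity_stem,
}


class SnowballState(NamedTuple):
    stem: Callable[[str], str]
    stop_words: frozenset


def snowball_init(language: str, stop_words: Iterable[str] = ()) -> SnowballState:
    """Select the stemmer by name once, at dictionary init time."""
    try:
        stem = STEMMERS[language.lower()]
    except KeyError:
        raise ValueError(f"no stemmer for language {language!r}") from None
    return SnowballState(stem, frozenset(w.lower() for w in stop_words))


def snowball_lexize(state: SnowballState, token: str) -> List[str]:
    """Return the stemmed lexeme, or an empty list for a stop word."""
    word = token.lower()
    if word in state.stop_words:
        return []
    return [state.stem(word)]
```

Adding a language then means adding one entry to the table, rather than
registering another pair of functions.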

Attached is a rough proof-of-concept patch for this.  It doesn't do
anything useful, but it does prove that we can compile and link the
Snowball stemmers into a Postgres loadable module with only trivial
changes to their source code.  The code compiles cleanly (zero warnings
in gcc).  The file layout is

src/backend/snowball/Makefile            our files
src/backend/snowball/README
src/backend/snowball/dict_snowball.c
src/backend/snowball/libstemmer/*.c      their .c files

src/include/snowball/header.h            intercepting .h file
src/include/snowball/libstemmer/*.h      their .h files

If there're no objections, I'll push forward with completing the
dictionary support functions to go with this infrastructure.

regards, tom lane



[Attachment: snowball-add.tar.gz]



Re: [HACKERS] Tsearch vs Snowball, or what's a source file?

2007-06-09 Thread Teodor Sigaev
1 Compiling from .sbl with the original Snowball makefile requires Perl and
doesn't work cleanly:

AFAIK, you haven't received any answer to your email on the Snowball mailing
list.


2 Snowball's build infrastructure doesn't support Windows as a target.


3 I understand your wish to include only real sources, but snowball_code.tgz
contains some extra data (docs, examples) which isn't needed for tsearch and
pgsql.


4 You suggested simplifying Snowball support by using the original file layout
from Snowball.


For these reasons, we suggest putting into the source tree (in src/snowball)
only three directories from snowball_code.tgz:

 - /compiler - compiler from *.sbl to *.c
 - /runtime - common code for all stemmers
 - /algorithms - *.sbl files
and using pgsql's makefile infrastructure to compile the stemmers.

Comments, objections?

--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/



Re: [HACKERS] Tsearch vs Snowball, or what's a source file?

2007-06-09 Thread Tom Lane
Teodor Sigaev [EMAIL PROTECTED] writes:
> 2 Snowball's compiling infrastructure doesn't support Windows target.

Yeah.  Another problem with using their original source code is that
running the Snowball compiler during build would not work for
cross-compiled builds of Postgres, at least not without solving the
problem of building some code for the host platform instead of the
target.

So what I'm thinking now is we should import libstemmer instead of the
snowball_code representation.  I haven't gotten as far as thinking about
exactly how to lay out the files though.

regards, tom lane



Re: [HACKERS] Tsearch vs Snowball, or what's a source file?

2007-06-04 Thread Teodor Sigaev

> This makes no difference in terms of the ease of tracking their changes,
> of course, but it just feels better to me to be distributing real
> source code and not derived files.


Hmm.
1 Compiling from .sbl with the original Snowball makefile requires Perl and
doesn't work cleanly:

% gmake

cc -o snowball compiler/space.o compiler/tokeniser.o compiler/analyser.o 
compiler/generator.o compiler/driver.o compiler/generator_java.o
gmake: *** No rule to make target `libstemmer/libstemmer_c.in', needed by 
`libstemmer/libstemmer.c'.  Stop.

I used the http://snowball.tartarus.org/dist/snowball_code.tgz tarball.

2 Snowball's build infrastructure doesn't support Windows as a target.

I agree with simplifying the support process but, IMHO, it's much simpler to do
that with the C sources and pgsql's build infrastructure.


And where should it be placed? In a src/snowball directory?

--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/



Re: [HACKERS] Tsearch vs Snowball, or what's a source file?

2007-06-02 Thread Josh Berkus
Tom,

> Is there a reasonable way to treat libstemmer as an external library?

Hmmm ... do we want to do that if we're distributing it in core?  That 
would require us to have a --with-tsearch compile switch so that people 
who don't want to find and build libstemmer can build PostgreSQL.  I thought 
the whole point of this feature was to have a version of Tsearch which 
just worked for users.

As annoying as it may be to keep it updated, I think it's probably worth it 
from a user experience standpoint.  However, we should definitely put the 
exact libstemmer C files, as distributed by the project, somewhere so that 
updating stemmer each time we do a patch release is simply a matter of 
download and rsync.

-- 
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco



Re: [HACKERS] Tsearch vs Snowball, or what's a source file?

2007-06-02 Thread Tom Lane
Josh Berkus [EMAIL PROTECTED] writes:
>> Is there a reasonable way to treat libstemmer as an external library?
>
> Hmmm ... do we want to do that if we're distributing it in core?  That
> would require us to have a --with-tsearch compile switch so that people
> who don't want to find and build libstemmer can build PostgreSQL.  I thought
> the whole point of this feature was to have a version of Tsearch which
> just worked for users.

True.

I just noticed that the upstream master distribution (their compiler
source and .sbl files) weighs in at half the size of the libstemmer
distribution: 68K vs 129K in tar.gz format --- no doubt due to all the
repetitive boilerplate in the generated files.  I'm not sure if the
compiler source has any portability issues, but if not it is interesting
to consider the idea of bundling the master distro instead of
libstemmer.  This would fix at least one issue that we otherwise will
have, which is that the #include-paths they chose to generate libstemmer
with seem a bit unfriendly for our purposes.  The #include commands are
determined by compiler options, so we could fix them if compiling the
.sbl files on the fly.

This makes no difference in terms of the ease of tracking their changes,
of course, but it just feels better to me to be distributing real
source code and not derived files.

regards, tom lane
