Re: [HACKERS] Tsearch vs Snowball, or what's a source file?
Teodor Sigaev [EMAIL PROTECTED] writes: 1) rename FULLTEXT to TEXT SEARCH in SQL command Working on it, I found rather obvious undesired side-effect: if TEXT becomes a keyword then any output of name of text type becomes quoted. Even if TEXT is in unreserved_keyword list. Yeah, I was aware that would happen. What I've been thinking for some time is that we ought to fix quote_ident and ruleutils.c to distinguish unreserved keywords from the other ones, and not quote unreserved ones. The list of unreserved words has gotten longer and more invasive in every release, and I don't see that stopping soon. It's already annoying that perfectly ordinary words like document and key get quoted. The main argument I can think of for not doing this is that if a currently non-reserved keyword becomes reserved in a future release, then having quoted it would prevent problems with restoring dump scripts. I don't find this argument very compelling, though; it seems just as likely that words we don't even have as keywords today will become reserved later. The only thing I see on the horizon that does fit that argument is WITH, which we could special-case. Comments? I'm willing to make this happen if there are no objections. regards, tom lane ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Tsearch vs Snowball, or what's a source file?
I've done some more work on this point. After looking at the Snowball code in more detail, I'm thinking it'd be a good idea to keep it at arm's length in a loadable shared library, instead of incorporating it I splited stemmers to two sets because of regression test. As I remember, there was some problems with loadable conversions and configure's flag --disable-shared the only entry points being snowball_init() and snowball_lexize() tsearch dictionary support functions. (I am thinking of having just one such function pair, with the init function taking an init option to select which stemmer to use, instead of a separate Postgres function pair per stemmer.) So, it's needed to change dictinitoption format of snowball dictionaries to point both stop-word file and language's name. If there're no objections, I'll push forward with completing the dictionary support functions to go with this infrastructure. How will we synchronize our changes in patch? -- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http://www.sigaev.ru/ ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Tsearch vs Snowball, or what's a source file?
Teodor Sigaev [EMAIL PROTECTED] writes: I splited stemmers to two sets because of regression test. As I remember, there was some problems with loadable conversions and configure's flag --disable-shared I'm not worried about supporting --disable-shared installations very much. They didn't have tsearch support before, either, and they have not passed regression in many years for lack of plpgsql. So, it's needed to change dictinitoption format of snowball dictionaries to point both stop-word file and language's name. Right. How will we synchronize our changes in patch? Go ahead and make the changes you want, and then I'll work on this. regards, tom lane ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Tsearch vs Snowball, or what's a source file?
danish, dutch, finnish, french, german, hungarian, italian, norwegian, portuguese, spanish, swedish, russin and english Albe Laurenz wrote: Tom Lane wrote: Teodor Sigaev [EMAIL PROTECTED] writes: So, it's needed to change dictinitoption format of snowball dictionaries to point both stop-word file and language's name. Right. Is there any chance to get support for other languages than English and Russian into the tsearch2 distribution? Yours, Laurenz Albe ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings -- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http://www.sigaev.ru/ ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Tsearch vs Snowball, or what's a source file?
Tom Lane wrote: Teodor Sigaev [EMAIL PROTECTED] writes: So, it's needed to change dictinitoption format of snowball dictionaries to point both stop-word file and language's name. Right. Is there any chance to get support for other languages than English and Russian into the tsearch2 distribution? Yours, Laurenz Albe ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Tsearch vs Snowball, or what's a source file?
Go ahead and make the changes you want, and then I'll work on this. So, I'm planing on this weekend: 1) rename FULLTEXT to TEXT SEARCH in SQL command 2) rework Snowball stemmer's as Tom suggested 3) ALTER FULLTEXT CONFIGURATION cfgname ADD/ALTER/DROP MAPPING 4) remove support of default configuration per scheme. Default configuration will be only one per locale. About security holes in PARSER/DICTIONARY. I see following ways now: 1) Allow to superuser only to do CREATE/ALTER/DROP PARSER/DICTIONARY Disadvantage: hosting users will not be able to change dictionaries 2) Remove CREATE/ALTER/DROP PARSER, split pg_ts_dict to pg_ts_dict_template and pg_ts_dict and accordingly change CREATE/ALTER/DROP DICTIONARY Disadvantage: parser and dictionary's template will not dump/restore, it should be restored manually (just a INSERT into pg_ts_parser/pg_ts_dict_template) 3) Similar to previous point, but: * CREATE/ALTER/DROP PARSER - super-user only * CREATE/ALTER/DROP DICTIONARY TEMPLATE - super-user only * CREATE/ALTER/DROP DICTIONARY - allowed to non-superuser Disadvantage: new command CREATE/ALTER/DROP DICTIONARY TEMPLATE Which way do we choose? or I miss some variant? I would like to go by 3) way... Comments? -- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http://www.sigaev.ru/ ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Tsearch vs Snowball, or what's a source file?
I wrote: Teodor Sigaev [EMAIL PROTECTED] writes: 2 Snowball's compiling infrastructure doesn't support Windows target. Yeah. Another problem with using their original source code is that running the Snowball compiler during build would not work for cross-compiled builds of Postgres, at least not without solving the problem of building some code for the host platform instead of the target. So what I'm thinking now is we should import libstemmer instead of the snowball_code representation. I haven't gotten as far as thinking about exactly how to lay out the files though. I've done some more work on this point. After looking at the Snowball code in more detail, I'm thinking it'd be a good idea to keep it at arm's length in a loadable shared library, instead of incorporating it directly into the backend. This is because they don't see anything wrong with exporting random global function names like eq_v and skip_utf8; so the probability of name collisions is a bit too high for my taste. The current tsearch_core patch envisions having a couple of the snowball stemmers in the core backend and the rest in a loadable library, but I suggest we just put them all in a loadable library, with the only entry points being snowball_init() and snowball_lexize() tsearch dictionary support functions. (I am thinking of having just one such function pair, with the init function taking an init option to select which stemmer to use, instead of a separate Postgres function pair per stemmer.) Attached is a rough proof-of-concept patch for this. It doesn't do anything useful, but it does prove that we can compile and link the Snowball stemmers into a Postgres loadable module with only trivial changes to their source code. The code compiles cleanly (zero warnings in gcc). The file layout is src/backend/snowball/Makefile our files src/backend/snowball/README src/backend/snowball/dict_snowball.c src/backend/snowball/libstemmer/*.c their .c files src/include/snowball/header.h intercepting .h file src/include/snowball/libstemmer/*.h their .h files If there're no objections, I'll push forward with completing the dictionary support functions to go with this infrastructure. regards, tom lane binodtypuIVWP.bin Description: snowball-add.tar.gz ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Tsearch vs Snowball, or what's a source file?
1 Compiling from .sbl by original Snowball's makefile requires Perl and doesn't work cleanly: AFAIK, you don't receive any answers on your email in snowball mailing list. 2 Snowball's compiling infrastructure doesn't support Windows target. 3 I understand your wish about including only real sources and snowball_code.tgz contains some extra data (doc, examples) which isn't needed for tsearch pgsql 4 Your suggest to simplify support of snowball by using original file's layout from Snowball For that reasons, we suggest to put into source tree (in src/snowball) only three directory for snowball_code.tgz: - /compiler - compiler from *.sbl to *.c - /runtime - common code for all stemmers - /algorithms - *.sbl files and use pgsql's makefile infrastructure to compiling stemmers. Comments, objections? -- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http://www.sigaev.ru/ ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Tsearch vs Snowball, or what's a source file?
Teodor Sigaev [EMAIL PROTECTED] writes: 2 Snowball's compiling infrastructure doesn't support Windows target. Yeah. Another problem with using their original source code is that running the Snowball compiler during build would not work for cross-compiled builds of Postgres, at least not without solving the problem of building some code for the host platform instead of the target. So what I'm thinking now is we should import libstemmer instead of the snowball_code representation. I haven't gotten as far as thinking about exactly how to lay out the files though. regards, tom lane ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Tsearch vs Snowball, or what's a source file?
This makes no difference in terms of the ease of tracking their changes, of course, but it just feels better to me to be distributing real source code and not derived files. Hmm. 1 Compiling from .sbl by original Snowball's makefile requires Perl and doesn't work cleanly: % gmake cc -o snowball compiler/space.o compiler/tokeniser.o compiler/analyser.o compiler/generator.o compiler/driver.o compiler/generator_java.o gmake: *** No rule to make target `libstemmer/libstemmer_c.in', needed by `libstemmer/libstemmer.c'. Stop. I used http://snowball.tartarus.org/dist/snowball_code.tgz tarball. 2 Snowball's compiling infrastructure doesn't support Windows target. I agree with simplify support process but, IMHO, it's much simpler to do it with C sources with pgsql's building infrastructure And where should it be placed? src/snowball directory? -- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http://www.sigaev.ru/ ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Tsearch vs Snowball, or what's a source file?
Tom, Is there a reasonable way to treat libstemmer as an external library? Hmmm ... do we want to do that if we're distributing it in core? That would require us to have a --with-tsearch compile switch so that people who don't want to find build libstemmer can build PostgreSQL. I thought the whole point of this feature was to have a version of Tsearch which just worked for users. As annoying as it may be to keep it updated, I think it's probably worth it from a user experience standpoint. However, we should definitely put the exact libstemmer C files, as distributed by the project, somewhere so that updating stemmer each time we do a patch release is simply a matter of download and rsync. -- --Josh Josh Berkus PostgreSQL @ Sun San Francisco ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Tsearch vs Snowball, or what's a source file?
Josh Berkus [EMAIL PROTECTED] writes: Is there a reasonable way to treat libstemmer as an external library? Hmmm ... do we want to do that if we're distributing it in core? That would require us to have a --with-tsearch compile switch so that people who don't want to find build libstemmer can build PostgreSQL. I thought the whole point of this feature was to have a version of Tsearch which just worked for users. True. I just noticed that the upstream master distribution (their compiler source and .sbl files) weighs in at half the size of the libstemmer distribution: 68K vs 129K in tar.gz format --- no doubt due to all the repetitive boilerplate in the generated files. I'm not sure if the compiler source has any portability issues, but if not it is interesting to consider the idea of bundling the master distro instead of libstemmer. This would fix at least one issue that we otherwise will have, which is that the #include-paths they chose to generate libstemmer with seem a bit unfriendly for our purposes. The #include commands are determined by compiler options, so we could fix them if compiling the .sbl files on the fly. This makes no difference in terms of the ease of tracking their changes, of course, but it just feels better to me to be distributing real source code and not derived files. regards, tom lane ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match