Re: [HACKERS] integrated tsearch has different results than tsearch2
2007/9/4, Heikki Linnakangas <[EMAIL PROTECTED]>:
> Pavel Stehule wrote:
> > I used dictionaries from fedora core packages
> >
> > hunspell-cs-20060303-5.fc7.i386.rpm
> >
> > then I converted it to utf8 with iconv
>
> Ok, thanks.
>
> Apparently it's a bug I introduced when I refactored spell.c to use the
> readline function for reading and recoding the input file. I didn't
> notice that some calls to STRNCMP used the non-lowercased version of the
> input line. Patch attached.
>
> --

It works

Thank you
Pavel

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
       subscribe-nomail command to [EMAIL PROTECTED] so that your
       message can get through to the mailing list cleanly
Re: [HACKERS] integrated tsearch has different results than tsearch2
Pavel Stehule wrote:
> I used dictionaries from fedora core packages
>
> hunspell-cs-20060303-5.fc7.i386.rpm
>
> then I converted it to utf8 with iconv

Ok, thanks.

Apparently it's a bug I introduced when I refactored spell.c to use the
readline function for reading and recoding the input file. I didn't
notice that some calls to STRNCMP used the non-lowercased version of the
input line. Patch attached.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Index: src/backend/tsearch/spell.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/tsearch/spell.c,v
retrieving revision 1.2
diff -c -r1.2 spell.c
*** src/backend/tsearch/spell.c 25 Aug 2007 00:03:59 - 1.2
--- src/backend/tsearch/spell.c 4 Sep 2007 12:31:55 -
***************
*** 733,739 ****
      while ((recoded = t_readline(affix)) != NULL)
      {
          pstr = lowerstr(recoded);
-         pfree(recoded);

          lineno++;

--- 733,738 ----
***************
*** 813,820 ****
              flag = (unsigned char) *s;
              goto nextline;
          }
!         if (STRNCMP(str, "COMPOUNDFLAG") == 0 || STRNCMP(str, "COMPOUNDMIN") == 0 ||
!             STRNCMP(str, "PFX") == 0 || STRNCMP(str, "SFX") == 0)
          {
              if (oldformat)
                  ereport(ERROR,
--- 812,819 ----
              flag = (unsigned char) *s;
              goto nextline;
          }
!         if (STRNCMP(recoded, "COMPOUNDFLAG") == 0 || STRNCMP(recoded, "COMPOUNDMIN") == 0 ||
!             STRNCMP(recoded, "PFX") == 0 || STRNCMP(recoded, "SFX") == 0)
          {
              if (oldformat)
                  ereport(ERROR,
***************
*** 834,839 ****
--- 833,839 ----
          NIAddAffix(Conf, flag, flagflags, mask, find, repl, suffixes ? FF_SUFFIX : FF_PREFIX);

      nextline:
+         pfree(recoded);
          pfree(pstr);
      }
      FreeFile(affix);
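To illustrate why it matters which copy of the line those STRNCMP calls see, here is a minimal, self-contained sketch. It is not the spell.c code: the STRNCMP definition below is an assumption (the macro body is not shown in the thread), and ascii_lower() is a hypothetical ASCII-only stand-in for lowerstr(). It only shows that a case-sensitive prefix check against an uppercase keyword such as "SFX" matches the raw line but not its lowercased copy.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Assumed definition: case-sensitive comparison of the prefix of s with p. */
#define STRNCMP(s, p) strncmp((s), (p), strlen(p))

/* Hypothetical stand-in for lowerstr(): lowercases ASCII text into dst,
 * truncating if dst is too small, and always NUL-terminates. */
void ascii_lower(const char *src, char *dst, size_t dstsize)
{
    size_t i;

    assert(src != NULL && dst != NULL && dstsize > 0);
    for (i = 0; src[i] != '\0' && i + 1 < dstsize; i++)
        dst[i] = (char) tolower((unsigned char) src[i]);
    dst[i] = '\0';
}

/* Returns 1 if the line starts with one of the uppercase keywords
 * tested in the patched condition, 0 otherwise. */
int starts_with_affix_keyword(const char *line)
{
    assert(line != NULL);
    return STRNCMP(line, "COMPOUNDFLAG") == 0 ||
           STRNCMP(line, "COMPOUNDMIN") == 0 ||
           STRNCMP(line, "PFX") == 0 ||
           STRNCMP(line, "SFX") == 0;
}
```

Calling starts_with_affix_keyword() on a raw line such as "SFX A Y 1" reports a match, while calling it on the output of ascii_lower() for the same line does not, so the uppercase keyword checks have to look at the non-lowercased buffer.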
Re: [HACKERS] integrated tsearch has different results than tsearch2
I used dictionaries from fedora core packages

hunspell-cs-20060303-5.fc7.i386.rpm

then I converted it to utf8 with iconv

Pavel

2007/9/4, Heikki Linnakangas <[EMAIL PROTECTED]>:
> Pavel Stehule wrote:
> > 2007/9/3, Teodor Sigaev <[EMAIL PROTECTED]>:
> >>> 1. I am not able to use fulltext with latin2 encoding :( (a note
> >>> about utf8-only dictionaries is missing from the docs).
> >> You can use any server encoding, but the dictionary's files should be
> >> in utf8 - the dictionary will convert the utf8 files into the server
> >> encoding.
> >>
> >>> 2. with hspell dictionaries (fresh copy from open office) I got
> >>> different and wrong results.
> >>> postgres=# select to_tsvector('cs','Příliš žlutý kůň se napil žluté vody') @@ to_tsquery('cs','napít');
> >>>  ?column?
> >>> ----------
> >>>  f
> >>> (1 row)
> >> Please post the output of:
> >> select ts_lexize('cspell','napil');
> >> select to_tsvector('cs','Příliš žlutý kůň se napil žluté vody');
> >>
> > postgres=# select ts_lexize('cspell','napil');
> >  ts_lexize
> > -----------
> >
> > (1 row)
> > postgres=# select to_tsvector('cs','Příliš žlutý kůň se napil žluté vody');
> >                         to_tsvector
> > -----------------------------------------------------------
> >  'vody':7 'kůň':3 'napil':5 'žluté':6 'žlutý':2 'příliš':1
> > (1 row)
> >
> > There is a difference:
> > 8.2.x
> > postgres=# select lexize('cz_ispell','jablka');
> >   lexize
> > ----------
> >  {jablko}
> > (1 row)
> > 8.3
> > postgres=# select ts_lexize('cspell','jablka');
> >  ts_lexize
> > -----------
> >
> > (1 row)
> > postgres=# select ts_lexize('cspell','jablko');
> >  ts_lexize
> > -----------
> >  {jablko}
> > (1 row)
>
> Can you post a link to the ispell dictionary file you're using so I and
> others can reproduce that?
>
> --
> Heikki Linnakangas
> EnterpriseDB http://www.enterprisedb.com
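The conversion step mentioned in this message was done with the iconv tool; the same recoding can be sketched in C with the POSIX iconv(3) API. This is only an illustration under assumptions: the thread does not say which encoding the packaged files were in, so ISO-8859-2 (latin2) is assumed as the source, and latin2_to_utf8() is a hypothetical helper name.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: converts a NUL-terminated ISO-8859-2 string to
 * UTF-8. Writes at most outsize - 1 bytes plus a terminating NUL.
 * Returns the number of bytes written (without the NUL), or -1 if the
 * converter is unavailable, the input is invalid, or out is too small. */
long latin2_to_utf8(const char *in, char *out, size_t outsize)
{
    iconv_t cd;
    char   *inp = (char *) in;   /* iconv() takes a non-const pointer */
    char   *outp = out;
    size_t  inleft;
    size_t  outleft;
    size_t  rc;

    assert(in != NULL && out != NULL && outsize > 0);

    /* Assumption: the source encoding is latin2 (ISO-8859-2). */
    cd = iconv_open("UTF-8", "ISO-8859-2");
    if (cd == (iconv_t) -1)
        return -1;

    inleft = strlen(in);
    outleft = outsize - 1;       /* reserve room for the NUL */
    rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);

    if (rc == (size_t) -1)
        return -1;

    *outp = '\0';
    return (long) (outp - out);
}
```

For example, the latin2 bytes for "kůň" (0x6B 0xF9 0xF2) become the five UTF-8 bytes 0x6B 0xC5 0xAF 0xC5 0x88. On systems where iconv is not part of the C library, linking may additionally need -liconv.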
Re: [HACKERS] integrated tsearch has different results than tsearch2
Pavel Stehule wrote:
> 2007/9/3, Teodor Sigaev <[EMAIL PROTECTED]>:
>>> 1. I am not able to use fulltext with latin2 encoding :( (a note
>>> about utf8-only dictionaries is missing from the docs).
>> You can use any server encoding, but the dictionary's files should be
>> in utf8 - the dictionary will convert the utf8 files into the server
>> encoding.
>>
>>> 2. with hspell dictionaries (fresh copy from open office) I got
>>> different and wrong results.
>>> postgres=# select to_tsvector('cs','Příliš žlutý kůň se napil žluté vody') @@ to_tsquery('cs','napít');
>>>  ?column?
>>> ----------
>>>  f
>>> (1 row)
>> Please post the output of:
>> select ts_lexize('cspell','napil');
>> select to_tsvector('cs','Příliš žlutý kůň se napil žluté vody');
>>
> postgres=# select ts_lexize('cspell','napil');
>  ts_lexize
> -----------
>
> (1 row)
> postgres=# select to_tsvector('cs','Příliš žlutý kůň se napil žluté vody');
>                         to_tsvector
> -----------------------------------------------------------
>  'vody':7 'kůň':3 'napil':5 'žluté':6 'žlutý':2 'příliš':1
> (1 row)
>
> There is a difference:
> 8.2.x
> postgres=# select lexize('cz_ispell','jablka');
>   lexize
> ----------
>  {jablko}
> (1 row)
> 8.3
> postgres=# select ts_lexize('cspell','jablka');
>  ts_lexize
> -----------
>
> (1 row)
> postgres=# select ts_lexize('cspell','jablko');
>  ts_lexize
> -----------
>  {jablko}
> (1 row)

Can you post a link to the ispell dictionary file you're using so I and
others can reproduce that?

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] integrated tsearch has different results than tsearch2
2007/9/3, Teodor Sigaev <[EMAIL PROTECTED]>:
> > 1. I am not able to use fulltext with latin2 encoding :( (a note
> > about utf8-only dictionaries is missing from the docs).
> You can use any server encoding, but the dictionary's files should be
> in utf8 - the dictionary will convert the utf8 files into the server
> encoding.
>
> > 2. with hspell dictionaries (fresh copy from open office) I got
> > different and wrong results.
> > postgres=# select to_tsvector('cs','Příliš žlutý kůň se napil žluté vody') @@ to_tsquery('cs','napít');
> >  ?column?
> > ----------
> >  f
> > (1 row)
>
> Please post the output of:
> select ts_lexize('cspell','napil');
> select to_tsvector('cs','Příliš žlutý kůň se napil žluté vody');
>

postgres=# select ts_lexize('cspell','napil');
 ts_lexize
-----------

(1 row)
postgres=# select to_tsvector('cs','Příliš žlutý kůň se napil žluté vody');
                        to_tsvector
-----------------------------------------------------------
 'vody':7 'kůň':3 'napil':5 'žluté':6 'žlutý':2 'příliš':1
(1 row)

There is a difference:
8.2.x
postgres=# select lexize('cz_ispell','jablka');
  lexize
----------
 {jablko}
(1 row)
8.3
postgres=# select ts_lexize('cspell','jablka');
 ts_lexize
-----------

(1 row)
postgres=# select ts_lexize('cspell','jablko');
 ts_lexize
-----------
 {jablko}
(1 row)

Pavel Stehule
Re: [HACKERS] integrated tsearch has different results than tsearch2
> 1. I am not able to use fulltext with latin2 encoding :( (a note
> about utf8-only dictionaries is missing from the docs).

You can use any server encoding, but the dictionary's files should be
in utf8 - the dictionary will convert the utf8 files into the server
encoding.

> 2. with hspell dictionaries (fresh copy from open office) I got
> different and wrong results.
> postgres=# select to_tsvector('cs','Příliš žlutý kůň se napil žluté vody') @@ to_tsquery('cs','napít');
>  ?column?
> ----------
>  f
> (1 row)

Please post the output of:
select ts_lexize('cspell','napil');
select to_tsvector('cs','Příliš žlutý kůň se napil žluté vody');

--
Teodor Sigaev                                   E-mail: [EMAIL PROTECTED]
                                                   WWW: http://www.sigaev.ru/
Re: [HACKERS] integrated tsearch has different results than tsearch2
Pavel,

I can't read your posting. Can you use plain text format?

Oleg

On Mon, 3 Sep 2007, Pavel Stehule wrote:

> Hello
>
> I am testing fulltext.
>
> 1. I am not able use fulltext with latin2 encoding :( I missing note
> about only utf8 dictionaries in doc).
>
> 2. with hspell dictionaries (fresh copy from open office) I got
> different and wrong results.
>
> Original (old) result
> ts=# select * from ts_debug('P??li? ?lu?ou?k? k?? se napil ?lut? vody');
>     ts_name    | tok_type | description |   token   |     dict_name      |  tsvector
> ---------------+----------+-------------+-----------+--------------------+-------------
>  default_czech | word     | Word        | P??li?    | {cz_ispell,simple} | 'p??li?'
>  default_czech | word     | Word        | ?lu?ou?k? | {cz_ispell,simple} | '?lu?ou?k?'
>  default_czech | word     | Word        | k??       | {cz_ispell,simple} | 'k??'
>  default_czech | lword    | Latin word  | se        | {cz_ispell,simple} |
>  default_czech | lword    | Latin word  | napil     | {cz_ispell,simple} | 'nap?t'
>  default_czech | word     | Word        | ?lut?     | {cz_ispell,simple} | '?lut?'
>  default_czech | lword    | Latin word  | vody      | {cz_ispell,simple} | 'voda'
> (7 ??dek)
>
> New results:
> postgres=# create Text search dictionary cspell(template=ispell, dictfile=czech, afffile=czech, stopwords=czech);
> CREATE TEXT SEARCH DICTIONARY
> postgres=# CREATE text search configuration cs (copy=english);
> CREATE TEXT SEARCH CONFIGURATION
> postgres=# alter text search configuration cs alter mapping for word, lword with cspell, simple;
> ALTER TEXT SEARCH CONFIGURATION
> postgres=# select * from ts_debug('cs','P??li? ?lu?ou?k? k?? se napil ?lut? vody');
>  Alias | Description   | Token     | Dictionaries    | Lexized token
> -------+---------------+-----------+-----------------+---------------------
>  word  | Word          | P??li?    | {cspell,simple} | cspell: {p??li?}
>  blank | Space symbols |           | {}              |
>  word  | Word          | ?lu?ou?k? | {cspell,simple} | cspell: {?lu?ou?k?}
>  blank | Space symbols |           | {}              |
>  word  | Word          | k??       | {cspell,simple} | cspell: {k??}
>  blank | Space symbols |           | {}              |
>  lword | Latin word    | se        | {cspell,simple} | cspell: {}
>  blank | Space symbols |           | {}              |
>  lword | Latin word    | napil     | {cspell,simple} | simple: {napil}
>  blank | Space symbols |           | {}              |
>  word  | Word          | ?lut?     | {cspell,simple} | simple: {?lut?}
>  blank | Space symbols |           | {}              |
>  lword | Latin word    | vody      | {cspell,simple} | simple: {vody}
> (13 rows)
>
> This query returned true in 8.2 and now:
> postgres=# select to_tsvector('cs','P??li? ?lut? k?? se napil ?lut? vody') @@ to_tsquery('cs','nap?t');
>  ?column?
> ----------
>  f
> (1 row)
>
> Regards
> Pavel Stehule

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
[HACKERS] integrated tsearch has different results than tsearch2
Hello

I am testing fulltext.

1. I am not able to use fulltext with latin2 encoding :( (a note
about utf8-only dictionaries is missing from the docs).

2. with hspell dictionaries (fresh copy from open office) I got
different and wrong results.

Original (old) result
ts=# select * from ts_debug('Příliš žluťoučký kůň se napil žluté vody');
    ts_name    | tok_type | description |   token   |     dict_name      |  tsvector
---------------+----------+-------------+-----------+--------------------+-------------
 default_czech | word     | Word        | Příliš    | {cz_ispell,simple} | 'příliš'
 default_czech | word     | Word        | žluťoučký | {cz_ispell,simple} | 'žluťoučký'
 default_czech | word     | Word        | kůň       | {cz_ispell,simple} | 'kůň'
 default_czech | lword    | Latin word  | se        | {cz_ispell,simple} |
 default_czech | lword    | Latin word  | napil     | {cz_ispell,simple} | 'napít'
 default_czech | word     | Word        | žluté     | {cz_ispell,simple} | 'žlutý'
 default_czech | lword    | Latin word  | vody      | {cz_ispell,simple} | 'voda'
(7 řádek)

New results:
postgres=# create Text search dictionary cspell(template=ispell, dictfile=czech, afffile=czech, stopwords=czech);
CREATE TEXT SEARCH DICTIONARY
postgres=# CREATE text search configuration cs (copy=english);
CREATE TEXT SEARCH CONFIGURATION
postgres=# alter text search configuration cs alter mapping for word, lword with cspell, simple;
ALTER TEXT SEARCH CONFIGURATION
postgres=# select * from ts_debug('cs','Příliš žluťoučký kůň se napil žluté vody');
 Alias | Description   | Token     | Dictionaries    | Lexized token
-------+---------------+-----------+-----------------+---------------------
 word  | Word          | Příliš    | {cspell,simple} | cspell: {příliš}
 blank | Space symbols |           | {}              |
 word  | Word          | žluťoučký | {cspell,simple} | cspell: {žluťoučký}
 blank | Space symbols |           | {}              |
 word  | Word          | kůň       | {cspell,simple} | cspell: {kůň}
 blank | Space symbols |           | {}              |
 lword | Latin word    | se        | {cspell,simple} | cspell: {}
 blank | Space symbols |           | {}              |
 lword | Latin word    | napil     | {cspell,simple} | simple: {napil}
 blank | Space symbols |           | {}              |
 word  | Word          | žluté     | {cspell,simple} | simple: {žluté}
 blank | Space symbols |           | {}              |
 lword | Latin word    | vody      | {cspell,simple} | simple: {vody}
(13 rows)

This query returned true in 8.2, and now it returns false:
postgres=# select to_tsvector('cs','Příliš žlutý kůň se napil žluté vody') @@ to_tsquery('cs','napít');
 ?column?
----------
 f
(1 row)

Regards
Pavel Stehule