Re: [HACKERS] integrated tsearch has different results than tsearch2

2007-09-04 Thread Pavel Stehule
2007/9/4, Heikki Linnakangas <[EMAIL PROTECTED]>:
> Pavel Stehule wrote:
> > I used dictionaries from fedora core packages
> >
> > hunspell-cs-20060303-5.fc7.i386.rpm
> >
> > then I converted it to utf8 with iconv
>
> Ok, thanks.
>
> Apparently it's a bug I introduced when I refactored spell.c to use the
> readline function for reading and recoding the input file. I didn't
> notice that some calls to STRNCMP used the non-lowercased version of the
> input line. Patch attached.
>
> --

It works

Thank you
Pavel

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] integrated tsearch has different results than tsearch2

2007-09-04 Thread Heikki Linnakangas
Pavel Stehule wrote:
> I used dictionaries from fedora core packages
> 
> hunspell-cs-20060303-5.fc7.i386.rpm
> 
> then I converted it to utf8 with iconv

Ok, thanks.

Apparently it's a bug I introduced when I refactored spell.c to use the
readline function for reading and recoding the input file. I didn't
notice that some calls to STRNCMP used the non-lowercased version of the
input line. Patch attached.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
Index: src/backend/tsearch/spell.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/tsearch/spell.c,v
retrieving revision 1.2
diff -c -r1.2 spell.c
*** src/backend/tsearch/spell.c	25 Aug 2007 00:03:59 -	1.2
--- src/backend/tsearch/spell.c	4 Sep 2007 12:31:55 -
***************
*** 733,739 ****
  	while ((recoded = t_readline(affix)) != NULL)
  	{
  		pstr = lowerstr(recoded);
- 		pfree(recoded);
  
  		lineno++;
  
--- 733,738 ----
***************
*** 813,820 ****
  			flag = (unsigned char) *s;
  			goto nextline;
  		}
! 		if (STRNCMP(str, "COMPOUNDFLAG") == 0 || STRNCMP(str, "COMPOUNDMIN") == 0 ||
! 			STRNCMP(str, "PFX") == 0 || STRNCMP(str, "SFX") == 0)
  		{
  			if (oldformat)
  				ereport(ERROR,
--- 812,819 ----
  			flag = (unsigned char) *s;
  			goto nextline;
  		}
! 		if (STRNCMP(recoded, "COMPOUNDFLAG") == 0 || STRNCMP(recoded, "COMPOUNDMIN") == 0 ||
! 			STRNCMP(recoded, "PFX") == 0 || STRNCMP(recoded, "SFX") == 0)
  		{
  			if (oldformat)
  				ereport(ERROR,
***************
*** 834,839 ****
--- 833,839 ----
  		NIAddAffix(Conf, flag, flagflags, mask, find, repl, suffixes ? FF_SUFFIX : FF_PREFIX);
  
  	nextline:
+ 		pfree(recoded);
  		pfree(pstr);
  	}
  	FreeFile(affix);
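
For readers following along, here is a minimal sketch of why the keyword test has to run against the original line rather than the lowercased copy. It assumes that STRNCMP is a case-sensitive prefix comparison. The helpers `has_prefix`, `ascii_lower_copy` and `is_new_format_line` are hypothetical stand-ins written for this illustration, not functions from spell.c, and the sample line in the usage note is made up.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for STRNCMP(s, p) == 0: case-sensitive prefix test. */
static bool
has_prefix(const char *line, const char *keyword)
{
	assert(line != NULL && keyword != NULL);
	return strncmp(line, keyword, strlen(keyword)) == 0;
}

/*
 * Hypothetical, ASCII-only stand-in for lowerstr(): returns a malloc'd
 * lowercased copy of s, or NULL if allocation fails.
 */
static char *
ascii_lower_copy(const char *s)
{
	size_t		len = strlen(s);
	char	   *copy = malloc(len + 1);

	if (copy == NULL)
		return NULL;
	/* i <= len also copies the terminating NUL byte */
	for (size_t i = 0; i <= len; i++)
		copy[i] = (char) tolower((unsigned char) s[i]);
	return copy;
}

/* True if the line starts with one of the uppercase keywords from the patch. */
static bool
is_new_format_line(const char *line)
{
	return has_prefix(line, "COMPOUNDFLAG") || has_prefix(line, "COMPOUNDMIN") ||
		has_prefix(line, "PFX") || has_prefix(line, "SFX");
}
```

With a made-up line such as "SFX A Y 1", `is_new_format_line` matches the original text but not its lowercased copy. That is also why the patch delays pfree(recoded) until the nextline label: the original line is still needed for the keyword test.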



Re: [HACKERS] integrated tsearch has different results than tsearch2

2007-09-04 Thread Pavel Stehule
I used dictionaries from fedora core packages

hunspell-cs-20060303-5.fc7.i386.rpm

then I converted it to utf8 with iconv
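
Since the dictionary files have to be in utf8, it may be worth checking the iconv output before installing it. Below is a minimal sketch of such a check. `is_valid_utf8` is a hypothetical helper written for this illustration; it is not part of PostgreSQL or iconv, and it only validates bytes already read into memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Returns true if buf[0..len) is well-formed UTF-8: no stray or missing
 * continuation bytes, no overlong forms, no surrogates, nothing above U+10FFFF.
 */
static bool
is_valid_utf8(const unsigned char *buf, size_t len)
{
	size_t		i = 0;

	assert(buf != NULL || len == 0);

	while (i < len)
	{
		unsigned char c = buf[i];
		size_t		extra;		/* number of continuation bytes expected */
		unsigned long cp;		/* decoded code point */
		unsigned long min;		/* smallest code point allowed for this length */

		if (c < 0x80)
		{
			i++;				/* plain ASCII */
			continue;
		}
		else if ((c & 0xE0) == 0xC0)
		{
			extra = 1; cp = c & 0x1F; min = 0x80;
		}
		else if ((c & 0xF0) == 0xE0)
		{
			extra = 2; cp = c & 0x0F; min = 0x800;
		}
		else if ((c & 0xF8) == 0xF0)
		{
			extra = 3; cp = c & 0x07; min = 0x10000;
		}
		else
			return false;		/* stray continuation byte or invalid lead byte */

		if (len - i <= extra)
			return false;		/* sequence is cut off at the end of the buffer */

		for (size_t k = 1; k <= extra; k++)
		{
			unsigned char cc = buf[i + k];

			if ((cc & 0xC0) != 0x80)
				return false;	/* not a continuation byte */
			cp = (cp << 6) | (cc & 0x3F);
		}

		if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
			return false;		/* overlong form, out of range, or surrogate */

		i += extra + 1;
	}
	return true;
}
```

For example, the word "kůň" is the bytes 6B C5 AF C5 88 in utf8 and passes, while the latin2 bytes 6B F9 F2 for the same word are rejected, so a file that was never converted is caught early.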

Pavel

2007/9/4, Heikki Linnakangas <[EMAIL PROTECTED]>:
> Pavel Stehule wrote:
> > 2007/9/3, Teodor Sigaev <[EMAIL PROTECTED]>:
> >>> 1. I am not able to use fulltext with latin2 encoding :( (the docs are
> >>> missing a note that only utf8 dictionaries are supported).
> >> You can use any server encoding, but the dictionary files must be in utf8;
> >> the dictionary converts them into the server encoding.
> >>
> >>> 2. with hspell dictionaries (fresh copy from open office) I got
> >>> different and wrong results.
> >>> postgres=# select to_tsvector('cs','Příliš žlutý kůň se napil žluté
> >>> vody') @@ to_tsquery('cs','napít');
> >>>  ?column?
> >>> ----------
> >>>  f
> >>> (1 row)
> >> Pls, output of:
> >> select ts_lexize('cspell','napil');
> >> select to_tsvector('cs','Příliš žlutý kůň se napil žluté
> >> vody');
> >>
> > postgres=# select ts_lexize('cspell','napil');
> >  ts_lexize
> > -----------
> >
> > (1 row)
> > postgres=# select to_tsvector('cs','Příliš žlutý kůň se napil žluté vody');
> >                         to_tsvector
> > -----------------------------------------------------------
> >  'vody':7 'kůň':3 'napil':5 'žluté':6 'žlutý':2 'příliš':1
> > (1 row)
> >
> > There is a difference between the versions.
> >
> > 8.2.x:
> > postgres=# select lexize('cz_ispell','jablka');
> >   lexize
> > ----------
> >  {jablko}
> > (1 row)
> >
> > 8.3:
> > postgres=# select ts_lexize('cspell','jablka');
> >  ts_lexize
> > -----------
> >
> > (1 row)
> > postgres=# select ts_lexize('cspell','jablko');
> >  ts_lexize
> > -----------
> >  {jablko}
> > (1 row)
>
> Can you post a link to the ispell dictionary file you're using so I and
> others can reproduce that?
>
> --
>   Heikki Linnakangas
>   EnterpriseDB   http://www.enterprisedb.com
>



Re: [HACKERS] integrated tsearch has different results than tsearch2

2007-09-04 Thread Heikki Linnakangas
Pavel Stehule wrote:
> 2007/9/3, Teodor Sigaev <[EMAIL PROTECTED]>:
>>> 1. I am not able to use fulltext with latin2 encoding :( (the docs are
>>> missing a note that only utf8 dictionaries are supported).
>> You can use any server encoding, but the dictionary files must be in utf8;
>> the dictionary converts them into the server encoding.
>>
>>> 2. with hspell dictionaries (fresh copy from open office) I got
>>> different and wrong results.
>>> postgres=# select to_tsvector('cs','Příliš žlutý kůň se napil žluté
>>> vody') @@ to_tsquery('cs','napít');
>>>  ?column?
>>> ----------
>>>  f
>>> (1 row)
>> Pls, output of:
>> select ts_lexize('cspell','napil');
>> select to_tsvector('cs','Příliš žlutý kůň se napil žluté
>> vody');
>>
> postgres=# select ts_lexize('cspell','napil');
>  ts_lexize
> -----------
>
> (1 row)
> postgres=# select to_tsvector('cs','Příliš žlutý kůň se napil žluté vody');
>                         to_tsvector
> -----------------------------------------------------------
>  'vody':7 'kůň':3 'napil':5 'žluté':6 'žlutý':2 'příliš':1
> (1 row)
>
> There is a difference between the versions.
>
> 8.2.x:
> postgres=# select lexize('cz_ispell','jablka');
>   lexize
> ----------
>  {jablko}
> (1 row)
>
> 8.3:
> postgres=# select ts_lexize('cspell','jablka');
>  ts_lexize
> -----------
>
> (1 row)
> postgres=# select ts_lexize('cspell','jablko');
>  ts_lexize
> -----------
>  {jablko}
> (1 row)

Can you post a link to the ispell dictionary file you're using so I and
others can reproduce that?

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] integrated tsearch has different results than tsearch2

2007-09-04 Thread Pavel Stehule
2007/9/3, Teodor Sigaev <[EMAIL PROTECTED]>:
> > 1. I am not able to use fulltext with latin2 encoding :( (the docs are
> > missing a note that only utf8 dictionaries are supported).
> You can use any server encoding, but the dictionary files must be in utf8;
> the dictionary converts them into the server encoding.
>
> > 2. with hspell dictionaries (fresh copy from open office) I got
> > different and wrong results.
> > postgres=# select to_tsvector('cs','Příliš žlutý kůň se napil žluté
> > vody') @@ to_tsquery('cs','napít');
> >  ?column?
> > ----------
> >  f
> > (1 row)
>
> Pls, output of:
> select ts_lexize('cspell','napil');
> select to_tsvector('cs','Příliš žlutý kůň se napil žluté
> vody');
>

postgres=# select ts_lexize('cspell','napil');
 ts_lexize
-----------

(1 row)
postgres=# select to_tsvector('cs','Příliš žlutý kůň se napil žluté vody');
                        to_tsvector
-----------------------------------------------------------
 'vody':7 'kůň':3 'napil':5 'žluté':6 'žlutý':2 'příliš':1
(1 row)

There is a difference between the versions.

8.2.x:
postgres=# select lexize('cz_ispell','jablka');
  lexize
----------
 {jablko}
(1 row)

8.3:
postgres=# select ts_lexize('cspell','jablka');
 ts_lexize
-----------

(1 row)
postgres=# select ts_lexize('cspell','jablko');
 ts_lexize
-----------
 {jablko}
(1 row)
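
One possible reading of the output above: the base form "jablko" is still found, but the inflected form "jablka" is not, which is what one would expect if the suffix rules from the affix file were not being applied. The sketch below shows the general idea of undoing a suffix rule to recover a base form. It is a simplified illustration, not the algorithm in spell.c: `SuffixRule`, `in_dictionary` and `normalize` are hypothetical names, the rule "strip o, add a" is invented for the example, and real affix rules also carry flags and conditions that are ignored here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical suffix rule: the base form ends in `strip`, the inflected form in `add`. */
typedef struct
{
	const char *strip;
	const char *add;
} SuffixRule;

/* Linear search in a small word list (a real dictionary would use a tree or hash). */
static bool
in_dictionary(const char *word, const char *const *dict, size_t ndict)
{
	for (size_t i = 0; i < ndict; i++)
		if (strcmp(word, dict[i]) == 0)
			return true;
	return false;
}

/*
 * Try to reduce `word` to a base form found in `dict`. On success the base
 * form is written to `out` and true is returned; otherwise `out` is set to
 * the empty string and false is returned.
 */
static bool
normalize(const char *word, const SuffixRule *rules, size_t nrules,
		  const char *const *dict, size_t ndict, char *out, size_t outsz)
{
	size_t		wlen;

	assert(word != NULL && out != NULL && outsz > 0);
	wlen = strlen(word);

	/* The word may already be a base form. */
	if (wlen < outsz && in_dictionary(word, dict, ndict))
	{
		strcpy(out, word);
		return true;
	}

	for (size_t i = 0; i < nrules; i++)
	{
		size_t		alen = strlen(rules[i].add);
		int			n;

		/* The rule applies only if the word ends with the added suffix. */
		if (alen > wlen || strcmp(word + wlen - alen, rules[i].add) != 0)
			continue;

		/* Undo the rule: drop the added suffix, restore the stripped one. */
		n = snprintf(out, outsz, "%.*s%s",
					 (int) (wlen - alen), word, rules[i].strip);
		if (n < 0 || (size_t) n >= outsz)
			continue;			/* candidate does not fit in the buffer */

		if (in_dictionary(out, dict, ndict))
			return true;
	}

	out[0] = '\0';
	return false;
}
```

With the invented rule loaded, "jablka" reduces to "jablko". With no rules loaded, only "jablko" itself is recognized, which matches the 8.3 output shown above.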

Pavel Stehule



Re: [HACKERS] integrated tsearch has different results than tsearch2

2007-09-03 Thread Teodor Sigaev

> 1. I am not able to use fulltext with latin2 encoding :( (the docs are
> missing a note that only utf8 dictionaries are supported).

You can use any server encoding, but the dictionary files must be in utf8;
the dictionary converts them into the server encoding.

> 2. with hspell dictionaries (fresh copy from open office) I got
> different and wrong results.
> postgres=# select to_tsvector('cs','Příliš žlutý kůň se napil žluté
> vody') @@ to_tsquery('cs','napít');
>  ?column?
> ----------
>  f
> (1 row)

Pls, output of:
select ts_lexize('cspell','napil');
select to_tsvector('cs','Příliš žlutý kůň se napil žluté
vody');




--
Teodor Sigaev   E-mail: [EMAIL PROTECTED]
   WWW: http://www.sigaev.ru/



Re: [HACKERS] integrated tsearch has different results than tsearch2

2007-09-03 Thread Oleg Bartunov

Pavel,

I can't read your posting. Can you use plain text format?

Oleg
On Mon, 3 Sep 2007, Pavel Stehule wrote:


> Hello
>
> I am testing fulltext.
>
> 1. I am not able to use fulltext with latin2 encoding :( (the docs are
> missing a note that only utf8 dictionaries are supported).
>
> 2. with hspell dictionaries (fresh copy from open office) I got
> different and wrong results.
>
> Original (old) result
>
> ts=# select * from ts_debug('Příliš žluťoučký kůň se napil žluté vody');
>     ts_name    | tok_type | description |   token   |     dict_name      |  tsvector
> ---------------+----------+-------------+-----------+--------------------+-------------
>  default_czech | word     | Word        | Příliš    | {cz_ispell,simple} | 'příliš'
>  default_czech | word     | Word        | žluťoučký | {cz_ispell,simple} | 'žluťoučký'
>  default_czech | word     | Word        | kůň       | {cz_ispell,simple} | 'kůň'
>  default_czech | lword    | Latin word  | se        | {cz_ispell,simple} |
>  default_czech | lword    | Latin word  | napil     | {cz_ispell,simple} | 'napít'
>  default_czech | word     | Word        | žluté     | {cz_ispell,simple} | 'žlutý'
>  default_czech | lword    | Latin word  | vody      | {cz_ispell,simple} | 'voda'
> (7 rows)
>
> New results:
> postgres=# create Text search dictionary cspell(template=ispell,
> dictfile=czech, afffile=czech, stopwords=czech);
> CREATE TEXT SEARCH DICTIONARY
> postgres=# CREATE text search configuration cs (copy=english);
> CREATE TEXT SEARCH CONFIGURATION
> postgres=# alter text search configuration cs alter mapping for word,
> lword  with cspell, simple;
> ALTER TEXT SEARCH CONFIGURATION
> postgres=# select * from ts_debug('cs','Příliš žluťoučký kůň se napil žluté vody');
>  Alias |  Description  |   Token   |  Dictionaries   |    Lexized token
> -------+---------------+-----------+-----------------+---------------------
>  word  | Word          | Příliš    | {cspell,simple} | cspell: {příliš}
>  blank | Space symbols |           | {}              |
>  word  | Word          | žluťoučký | {cspell,simple} | cspell: {žluťoučký}
>  blank | Space symbols |           | {}              |
>  word  | Word          | kůň       | {cspell,simple} | cspell: {kůň}
>  blank | Space symbols |           | {}              |
>  lword | Latin word    | se        | {cspell,simple} | cspell: {}
>  blank | Space symbols |           | {}              |
>  lword | Latin word    | napil     | {cspell,simple} | simple: {napil}
>  blank | Space symbols |           | {}              |
>  word  | Word          | žluté     | {cspell,simple} | simple: {žluté}
>  blank | Space symbols |           | {}              |
>  lword | Latin word    | vody      | {cspell,simple} | simple: {vody}
> (13 rows)
>
> This query returned true in 8.2 and now:
>
> postgres=# select to_tsvector('cs','Příliš žlutý kůň se napil žluté
> vody') @@ to_tsquery('cs','napít');
>  ?column?
> ----------
>  f
> (1 row)
>
> Regards
> Pavel Stehule



Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


[HACKERS] integrated tsearch has different results than tsearch2

2007-09-03 Thread Pavel Stehule
Hello

I am testing fulltext.

1. I am not able to use fulltext with latin2 encoding :( (the docs are
missing a note that only utf8 dictionaries are supported).


2. with hspell dictionaries (fresh copy from open office) I got
different and wrong results.

Original (old) result

ts=# select * from ts_debug('Příliš žluťoučký kůň se napil žluté vody');
    ts_name    | tok_type | description |   token   |     dict_name      |  tsvector
---------------+----------+-------------+-----------+--------------------+-------------
 default_czech | word     | Word        | Příliš    | {cz_ispell,simple} | 'příliš'
 default_czech | word     | Word        | žluťoučký | {cz_ispell,simple} | 'žluťoučký'
 default_czech | word     | Word        | kůň       | {cz_ispell,simple} | 'kůň'
 default_czech | lword    | Latin word  | se        | {cz_ispell,simple} |
 default_czech | lword    | Latin word  | napil     | {cz_ispell,simple} | 'napít'
 default_czech | word     | Word        | žluté     | {cz_ispell,simple} | 'žlutý'
 default_czech | lword    | Latin word  | vody      | {cz_ispell,simple} | 'voda'
(7 rows)

New results:
postgres=# create Text search dictionary cspell(template=ispell,
dictfile=czech, afffile=czech, stopwords=czech);
CREATE TEXT SEARCH DICTIONARY
postgres=# CREATE text search configuration cs (copy=english);
CREATE TEXT SEARCH CONFIGURATION

postgres=# alter text search configuration cs alter mapping for word,
lword  with cspell, simple;
ALTER TEXT SEARCH CONFIGURATION
postgres=# select * from ts_debug('cs','Příliš žluťoučký kůň se napil žluté vody');
 Alias |  Description  |   Token   |  Dictionaries   |    Lexized token
-------+---------------+-----------+-----------------+---------------------
 word  | Word          | Příliš    | {cspell,simple} | cspell: {příliš}
 blank | Space symbols |           | {}              |
 word  | Word          | žluťoučký | {cspell,simple} | cspell: {žluťoučký}
 blank | Space symbols |           | {}              |
 word  | Word          | kůň       | {cspell,simple} | cspell: {kůň}
 blank | Space symbols |           | {}              |
 lword | Latin word    | se        | {cspell,simple} | cspell: {}
 blank | Space symbols |           | {}              |
 lword | Latin word    | napil     | {cspell,simple} | simple: {napil}
 blank | Space symbols |           | {}              |
 word  | Word          | žluté     | {cspell,simple} | simple: {žluté}
 blank | Space symbols |           | {}              |
 lword | Latin word    | vody      | {cspell,simple} | simple: {vody}
(13 rows)

This query returned true in 8.2 and now:

postgres=# select to_tsvector('cs','Příliš žlutý kůň se napil žluté
vody') @@ to_tsquery('cs','napít');
 ?column?
----------
 f
(1 row)

Regards
Pavel Stehule
