Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support

2016-02-17 Thread Artur Zakirov

On 16.02.2016 18:14, Artur Zakirov wrote:

I attached a new version of the patch.



Sorry for noise. I attached new version of the patch. I saw mistakes in 
DecodeFlag(). This patch fix them.


--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
*** a/doc/src/sgml/textsearch.sgml
--- b/doc/src/sgml/textsearch.sgml
***
*** 2615,2632  SELECT plainto_tsquery('supernova star');
 
  
 
! To create an Ispell dictionary, use the built-in
! ispell template and specify several parameters:
 
! 
  
! CREATE TEXT SEARCH DICTIONARY english_ispell (
  TEMPLATE = ispell,
! DictFile = english,
! AffFile = english,
! StopWords = english
! );
  
  
 
  Here, DictFile, AffFile, and StopWords
--- 2615,2655 
 
  
 
! To create an Ispell dictionary perform these steps:
 
!
! 
!  
!   download dictionary configuration files. OpenOffice
!   extension files have the .oxt extension. It is necessary
!   to extract .aff and .dic files, change
!   extensions to .affix and .dict. For some
!   dictionary files it is also needed to convert characters to the UTF-8
!   encoding with commands (for example, for norwegian language dictionary):
  
! iconv -f ISO_8859-1 -t UTF-8 -o nn_no.affix nn_NO.aff
! iconv -f ISO_8859-1 -t UTF-8 -o nn_no.dict nn_NO.dic
! 
!  
! 
! 
!  
!   copy files to the $SHAREDIR/tsearch_data directory
!  
! 
! 
!  
!   load files into PostgreSQL with the following command:
! 
! CREATE TEXT SEARCH DICTIONARY english_hunspell (
  TEMPLATE = ispell,
! DictFile = en_us,
! AffFile = en_us,
! Stopwords = english);
  
+  
+ 
+
  
 
  Here, DictFile, AffFile, and StopWords
***
*** 2643,2648  CREATE TEXT SEARCH DICTIONARY english_ispell (
--- 2666,2721 
 
  
 
+ The .affix file of Ispell has the following
+ structure:
+ 
+ prefixes
+ flag *A:
+ .   >   RE  # As in enter > reenter
+ suffixes
+ flag T:
+ E   >   ST  # As in late > latest
+ [^AEIOU]Y   >   -Y,IEST # As in dirty > dirtiest
+ [AEIOU]Y>   EST # As in gray > grayest
+ [^EY]   >   EST # As in small > smallest
+ 
+
+
+ And the .dict file has the following structure:
+ 
+ lapse/ADGRS
+ lard/DGRS
+ large/PRTY
+ lark/MRS
+ 
+
+ 
+
+ Format of the .dict file is:
+ 
+ basic_form/affix_class_name
+ 
+
+ 
+
+ In the .affix file every affix flag is described in the
+ following format:
+ 
+ condition > [-stripping_letters,] adding_affix
+ 
+
+ 
+
+ Here, condition has a format similar to the format of regular expressions.
+ It can use groupings [...] and [^...].
+ For example, [AEIOU]Y means that the last letter of the word
+ is "y" and the penultimate letter is "a",
+ "e", "i", "o" or "u".
+ [^EY] means that the last letter is neither "e"
+ nor "y".
+
+ 
+
  Ispell dictionaries support splitting compound words;
  a useful feature.
  Notice that the affix file should specify a special flag using the
***
*** 2663,2668  SELECT ts_lexize('norwegian_ispell', 'sjokoladefabrikk');
--- 2736,2800 
  
 
  
+
+ MySpell is very similar to Hunspell.
+ The .affix file of Hunspell has the following
+ structure:
+ 
+ PFX A Y 1
+ PFX A   0 re .
+ SFX T N 4
+ SFX T   0 st e
+ SFX T   y iest   [^aeiou]y
+ SFX T   0 est[aeiou]y
+ SFX T   0 est[^ey]
+ 
+
+ 
+
+ The first line of an affix class is the header. Fields of an affix rules are
+ listed after the header:
+
+
+ 
+  
+   parameter name (PFX or SFX)
+  
+ 
+ 
+  
+   flag (name of the affix class)
+  
+ 
+ 
+  
+   stripping characters from beginning (at prefix) or end (at suffix) of the
+   word
+  
+ 
+ 
+  
+   adding affix
+  
+ 
+ 
+  
+   condition that has a format similar to the format of regular expressions.
+  
+ 
+
+ 
+
+ The .dict file looks like the .dict file of
+ Ispell:
+ 
+ larder/M
+ lardy/RT
+ large/RSPMYT
+ largehearted
+ 
+
+ 
 
  
   MySpell does not support compound words.
*** a/src/backend/tsearch/Makefile
--- b/src/backend/tsearch/Makefile
***
*** 13,20  include $(top_builddir)/src/Makefile.global
  
  DICTDIR=tsearch_data
  
! DICTFILES=synonym_sample.syn thesaurus_sample.ths hunspell_sample.affix \
! 	ispell_sample.affix ispell_sample.dict
  
  OBJS = ts_locale.o ts_parse.o wparser.o wparser_def.o dict.o \
  	dict_simple.o dict_synonym.o dict_thesaurus.o \
--- 13,23 
  
  DICTDIR=tsearch_data
  
! DICTFILES=dicts/synonym_sample.syn dicts/thesaurus_sample.ths \
! 	dicts/hunspell_sample.affix \
! 	dicts/ispell_sample.affix dicts

Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support

2016-02-16 Thread Artur Zakirov

I attached a new version of the patch.

On 10.02.2016 19:46, Teodor Sigaev wrote:



I duplicate the patch here.


it's very good thing to update disctionaries to support modern versions.
And thank you for improving documentation. Also I've impressed by long
description in spell.c header.

Som notices about code:

1
  struct SPELL. Why do you remove union p? You leave comment
  about using d struct instead of flag field and as can see
  it's right comment. It increases size of SPELL structure.


Fixed.



2 struct AFFIX. I'm agree with Alvaro taht sum of sizes of bit fields
should be less or equal to size of integer. In opposite case, suppose,
we can get undefined behavior. Please, split  bitfields  to two integers.


Fixed.



3 unsigned char flagval[65000];
   Is it forbidden to use 6 number? In any case, decodeFlag() doesn't
   restrict return value. I suggest to enlarge array to 1<<16 and add limit
   to return value of decodeFlag().


flagval array was enlarged. Added verification of return value of 
DecodeFlag() for for various FLAG parameter (FM_LONG, FM_NUM and FM_CHAR).




4
  I'd like to see a short comment describing at least new functions


Added some comments which describe new functions and old functions for 
loading dictionaries into PostgreSQL. This patch adds new functions and 
modifies functions which is used for loading dictionaries.


At the moment, comments does not describe functions which used for word 
normalization. But I can add more comments.




5
  Pls, add tests for new code.




Added tests. Old sample dictionaries files was moved to the folder 
"dicts". New sample dictionaries files was added:

- hunspell_sample_long.affix
- hunspell_sample_long.dict
- hunspell_sample_num.affix
- hunspell_sample_num.dict

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
*** a/doc/src/sgml/textsearch.sgml
--- b/doc/src/sgml/textsearch.sgml
***
*** 2615,2632  SELECT plainto_tsquery('supernova star');
 
  
 
! To create an Ispell dictionary, use the built-in
! ispell template and specify several parameters:
 
! 
  
! CREATE TEXT SEARCH DICTIONARY english_ispell (
  TEMPLATE = ispell,
! DictFile = english,
! AffFile = english,
! StopWords = english
! );
  
  
 
  Here, DictFile, AffFile, and StopWords
--- 2615,2655 
 
  
 
! To create an Ispell dictionary perform these steps:
 
!
! 
!  
!   download dictionary configuration files. OpenOffice
!   extension files have the .oxt extension. It is necessary
!   to extract .aff and .dic files, change extensions
!   to .affix and .dict. For some dictionary
!   files it is also needed to convert characters to the UTF-8 encoding
!   with commands (for example, for norwegian language dictionary):
  
! iconv -f ISO_8859-1 -t UTF-8 -o nn_no.affix nn_NO.aff
! iconv -f ISO_8859-1 -t UTF-8 -o nn_no.dict nn_NO.dic
! 
!  
! 
! 
!  
!   copy files to the $SHAREDIR/tsearch_data directory
!  
! 
! 
!  
!   load files into PostgreSQL with the following command:
! 
! CREATE TEXT SEARCH DICTIONARY english_hunspell (
  TEMPLATE = ispell,
! DictFile = en_us,
! AffFile = en_us,
! Stopwords = english);
  
+  
+ 
+
  
 
  Here, DictFile, AffFile, and StopWords
***
*** 2643,2648  CREATE TEXT SEARCH DICTIONARY english_ispell (
--- 2666,2720 
 
  
 
+ The .affix file of Ispell has the following structure:
+ 
+ prefixes
+ flag *A:
+ .   >   RE  # As in enter > reenter
+ suffixes
+ flag T:
+ E   >   ST  # As in late > latest
+ [^AEIOU]Y   >   -Y,IEST # As in dirty > dirtiest
+ [AEIOU]Y>   EST # As in gray > grayest
+ [^EY]   >   EST # As in small > smallest
+ 
+
+
+ And the .dict file has the following structure:
+ 
+ lapse/ADGRS
+ lard/DGRS
+ large/PRTY
+ lark/MRS
+ 
+
+ 
+
+ Format of the .dict file is:
+ 
+ basic_form/affix_class_name
+ 
+
+ 
+
+ In the .affix file every affix flag is described in the
+ following format:
+ 
+ condition > [-stripping_letters,] adding_affix
+ 
+
+ 
+
+ Here, condition has a format similar to the format of regular expressions.
+ It can use groupings [...] and [^...].
+ For example, [AEIOU]Y means that the last letter of the word
+ is "y" and the penultimate letter is "a",
+ "e", "i", "o" or "u".
+ [^EY] means that the last letter is neither "e"
+ nor "y".
+
+ 
+
  Ispell dictionaries support splitting compound words;
  a useful feature.
  Notice that the affix file should specify a special flag using the
***
*** 2663,2668  SELECT ts_lexize('norwegian_ispell', 'sjokoladefabrikk');
--- 2735,2796 
  
 
  
+
+ MySpell is very similar to Hunspell.
+ The .affix file of Hunspell has the following

Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support

2016-02-11 Thread Artur Zakirov

Thank you for the review.

On 10.02.2016 19:46, Teodor Sigaev wrote:



I duplicate the patch here.


it's very good thing to update disctionaries to support modern versions.
And thank you for improving documentation. Also I've impressed by long
description in spell.c header.

Som notices about code:

1
  struct SPELL. Why do you remove union p? You leave comment
  about using d struct instead of flag field and as can see
  it's right comment. It increases size of SPELL structure.


I will fix it. I had misunderstood the Alvaro's comment about it.



2 struct AFFIX. I'm agree with Alvaro taht sum of sizes of bit fields
should be less or equal to size of integer. In opposite case, suppose,
we can get undefined behavior. Please, split  bitfields  to two integers.


I will fix it. Here I had misunderstood too.



3 unsigned char flagval[65000];
   Is it forbidden to use 6 number? In any case, decodeFlag() doesn't
   restrict return value. I suggest to enlarge array to 1<<16 and add limit
   to return value of decodeFlag().


I think it can be done.



4
  I'd like to see a short comment describing at least new functions


Now in spell.c there are more comments. I wanted to send fixed patch 
after adding all comments that I want to add. But I can send the patch now.
Also I will merge this commit 
http://www.postgresql.org/message-id/e1atf9o-0001ga...@gemulon.postgresql.org




5
  Pls, add tests for new code.




I will add.

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support

2016-02-10 Thread Teodor Sigaev



I duplicate the patch here.


it's very good thing to update disctionaries to support modern versions. And 
thank you for improving documentation. Also I've impressed by long description 
in spell.c header.


Som notices about code:

1
 struct SPELL. Why do you remove union p? You leave comment
 about using d struct instead of flag field and as can see
 it's right comment. It increases size of SPELL structure.

2 struct AFFIX. I'm agree with Alvaro taht sum of sizes of bit fields should be 
less or equal to size of integer. In opposite case, suppose, we can get 
undefined behavior. Please, split  bitfields  to two integers.


3 unsigned char flagval[65000];
  Is it forbidden to use 6 number? In any case, decodeFlag() doesn't
  restrict return value. I suggest to enlarge array to 1<<16 and add limit
  to return value of decodeFlag().

4
 I'd like to see a short comment describing at least new functions

5
 Pls, add tests for new code.


--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support

2016-01-28 Thread Artur Zakirov

Sorry, I don't know why this thread was moved to another thread.

I duplicate the patch here.


On 28.01.2016 14:19, Alvaro Herrera wrote:

Artur Zakirov wrote:


I undo the changes and the error will be raised. I will update the patch
soon.


I don't think you ever did this. I'm closing it now, but it sounds
useful stuff so please do resubmit for 2016-03.



I'm working on the patch. I wanted to send this changes after all changes.

This version of the patch has a top-level comment. Another comments I will 
provides soon.

Also this patch has some changes with ternary operators.


I don't think you ever did this. I'm closing it now, but it sounds
useful stuff so please do resubmit for 2016-03.


Moved to next CF.




--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
*** a/doc/src/sgml/textsearch.sgml
--- b/doc/src/sgml/textsearch.sgml
***
*** 2615,2632  SELECT plainto_tsquery('supernova star');
 
  
 
! To create an Ispell dictionary, use the built-in
! ispell template and specify several parameters:
 
! 
  
! CREATE TEXT SEARCH DICTIONARY english_ispell (
  TEMPLATE = ispell,
! DictFile = english,
! AffFile = english,
! StopWords = english
! );
  
  
 
  Here, DictFile, AffFile, and StopWords
--- 2615,2655 
 
  
 
! To create an Ispell dictionary perform these steps:
 
!
! 
!  
!   download dictionary configuration files. OpenOffice
!   extension files have the .oxt extension. It is necessary
!   to extract .aff and .dic files, change extensions
!   to .affix and .dict. For some dictionary
!   files it is also needed to convert characters to the UTF-8 encoding
!   with commands (for example, for norwegian language dictionary):
  
! iconv -f ISO_8859-1 -t UTF-8 -o nn_no.affix nn_NO.aff
! iconv -f ISO_8859-1 -t UTF-8 -o nn_no.dict nn_NO.dic
! 
!  
! 
! 
!  
!   copy files to the $SHAREDIR/tsearch_data directory
!  
! 
! 
!  
!   load files into PostgreSQL with the following command:
! 
! CREATE TEXT SEARCH DICTIONARY english_hunspell (
  TEMPLATE = ispell,
! DictFile = en_us,
! AffFile = en_us,
! Stopwords = english);
  
+  
+ 
+
  
 
  Here, DictFile, AffFile, and StopWords
***
*** 2643,2648  CREATE TEXT SEARCH DICTIONARY english_ispell (
--- 2666,2720 
 
  
 
+ The .affix file of Ispell has the following structure:
+ 
+ prefixes
+ flag *A:
+ .   >   RE  # As in enter > reenter
+ suffixes
+ flag T:
+ E   >   ST  # As in late > latest
+ [^AEIOU]Y   >   -Y,IEST # As in dirty > dirtiest
+ [AEIOU]Y>   EST # As in gray > grayest
+ [^EY]   >   EST # As in small > smallest
+ 
+
+
+ And the .dict file has the following structure:
+ 
+ lapse/ADGRS
+ lard/DGRS
+ large/PRTY
+ lark/MRS
+ 
+
+ 
+
+ Format of the .dict file is:
+ 
+ basic_form/affix_class_name
+ 
+
+ 
+
+ In the .affix file every affix flag is described in the
+ following format:
+ 
+ condition > [-stripping_letters,] adding_affix
+ 
+
+ 
+
+ Here, condition has a format similar to the format of regular expressions.
+ It can use groupings [...] and [^...].
+ For example, [AEIOU]Y means that the last letter of the word
+ is "y" and the penultimate letter is "a",
+ "e", "i", "o" or "u".
+ [^EY] means that the last letter is neither "e"
+ nor "y".
+
+ 
+
  Ispell dictionaries support splitting compound words;
  a useful feature.
  Notice that the affix file should specify a special flag using the
***
*** 2663,2668  SELECT ts_lexize('norwegian_ispell', 'sjokoladefabrikk');
--- 2735,2796 
  
 
  
+
+ MySpell is very similar to Hunspell.
+ The .affix file of Hunspell has the following structure:
+ 
+ PFX A Y 1
+ PFX A   0 re .
+ SFX T N 4
+ SFX T   0 st e
+ SFX T   y iest   [^aeiou]y
+ SFX T   0 est[aeiou]y
+ SFX T   0 est[^ey]
+ 
+
+ 
+
+ The first line of an affix class is the header. Fields of an affix rules are listed after the header:
+
+
+ 
+  
+   parameter name (PFX or SFX)
+  
+ 
+ 
+  
+   flag (name of the affix class)
+  
+ 
+ 
+  
+   stripping characters from beginning (at prefix) or end (at suffix) of the word
+  
+ 
+ 
+  
+   adding affix
+  
+ 
+ 
+  
+   condition that has a format similar to the format of regular expressions.
+  
+ 
+
+ 
+
+ The .dict file looks like the .dict file of
+ Ispell:
+ 
+ larder/M
+ lardy/RT
+ large/RSPMYT
+ largehearted
+ 
+
+ 
 
  
   MySpell does not support compound words.
*** a/src/backend/tsearch/spell.c
--- b/src/backend/tsearch/spell.c
*

Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support

2016-01-28 Thread Artur Zakirov

On 28.01.2016 14:19, Alvaro Herrera wrote:

Artur Zakirov wrote:


I undo the changes and the error will be raised. I will update the patch
soon.


I don't think you ever did this.  I'm closing it now, but it sounds
useful stuff so please do resubmit for 2016-03.



I'm working on the patch. I wanted to send this changes after all changes.

This version of the patch has a top-level comment. Another comments I 
will provides soon.


Also this patch has some changes with ternary operators.

> I don't think you ever did this.  I'm closing it now, but it sounds
> useful stuff so please do resubmit for 2016-03.

Moved to next CF.

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
*** a/doc/src/sgml/textsearch.sgml
--- b/doc/src/sgml/textsearch.sgml
***
*** 2615,2632  SELECT plainto_tsquery('supernova star');
 
  
 
! To create an Ispell dictionary, use the built-in
! ispell template and specify several parameters:
 
! 
  
! CREATE TEXT SEARCH DICTIONARY english_ispell (
  TEMPLATE = ispell,
! DictFile = english,
! AffFile = english,
! StopWords = english
! );
  
  
 
  Here, DictFile, AffFile, and StopWords
--- 2615,2655 
 
  
 
! To create an Ispell dictionary perform these steps:
 
!
! 
!  
!   download dictionary configuration files. OpenOffice
!   extension files have the .oxt extension. It is necessary
!   to extract .aff and .dic files, change extensions
!   to .affix and .dict. For some dictionary
!   files it is also needed to convert characters to the UTF-8 encoding
!   with commands (for example, for norwegian language dictionary):
  
! iconv -f ISO_8859-1 -t UTF-8 -o nn_no.affix nn_NO.aff
! iconv -f ISO_8859-1 -t UTF-8 -o nn_no.dict nn_NO.dic
! 
!  
! 
! 
!  
!   copy files to the $SHAREDIR/tsearch_data directory
!  
! 
! 
!  
!   load files into PostgreSQL with the following command:
! 
! CREATE TEXT SEARCH DICTIONARY english_hunspell (
  TEMPLATE = ispell,
! DictFile = en_us,
! AffFile = en_us,
! Stopwords = english);
  
+  
+ 
+
  
 
  Here, DictFile, AffFile, and StopWords
***
*** 2643,2648  CREATE TEXT SEARCH DICTIONARY english_ispell (
--- 2666,2720 
 
  
 
+ The .affix file of Ispell has the following structure:
+ 
+ prefixes
+ flag *A:
+ .   >   RE  # As in enter > reenter
+ suffixes
+ flag T:
+ E   >   ST  # As in late > latest
+ [^AEIOU]Y   >   -Y,IEST # As in dirty > dirtiest
+ [AEIOU]Y>   EST # As in gray > grayest
+ [^EY]   >   EST # As in small > smallest
+ 
+
+
+ And the .dict file has the following structure:
+ 
+ lapse/ADGRS
+ lard/DGRS
+ large/PRTY
+ lark/MRS
+ 
+
+ 
+
+ Format of the .dict file is:
+ 
+ basic_form/affix_class_name
+ 
+
+ 
+
+ In the .affix file every affix flag is described in the
+ following format:
+ 
+ condition > [-stripping_letters,] adding_affix
+ 
+
+ 
+
+ Here, condition has a format similar to the format of regular expressions.
+ It can use groupings [...] and [^...].
+ For example, [AEIOU]Y means that the last letter of the word
+ is "y" and the penultimate letter is "a",
+ "e", "i", "o" or "u".
+ [^EY] means that the last letter is neither "e"
+ nor "y".
+
+ 
+
  Ispell dictionaries support splitting compound words;
  a useful feature.
  Notice that the affix file should specify a special flag using the
***
*** 2663,2668  SELECT ts_lexize('norwegian_ispell', 'sjokoladefabrikk');
--- 2735,2796 
  
 
  
+
+ MySpell is very similar to Hunspell.
+ The .affix file of Hunspell has the following structure:
+ 
+ PFX A Y 1
+ PFX A   0 re .
+ SFX T N 4
+ SFX T   0 st e
+ SFX T   y iest   [^aeiou]y
+ SFX T   0 est[aeiou]y
+ SFX T   0 est[^ey]
+ 
+
+ 
+
+ The first line of an affix class is the header. Fields of an affix rules are listed after the header:
+
+
+ 
+  
+   parameter name (PFX or SFX)
+  
+ 
+ 
+  
+   flag (name of the affix class)
+  
+ 
+ 
+  
+   stripping characters from beginning (at prefix) or end (at suffix) of the word
+  
+ 
+ 
+  
+   adding affix
+  
+ 
+ 
+  
+   condition that has a format similar to the format of regular expressions.
+  
+ 
+
+ 
+
+ The .dict file looks like the .dict file of
+ Ispell:
+ 
+ larder/M
+ lardy/RT
+ large/RSPMYT
+ largehearted
+ 
+
+ 
 
  
   MySpell does not support compound words.
*** a/src/backend/tsearch/spell.c
--- b/src/backend/tsearch/spell.c
***
*** 5,10 
--- 5,54 
   *
   * Portions Copyright (c) 1996-2016, PostgreSQL Global D

Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support

2016-01-28 Thread Alvaro Herrera
Artur Zakirov wrote:

> I undo the changes and the error will be raised. I will update the patch
> soon.

I don't think you ever did this.  I'm closing it now, but it sounds
useful stuff so please do resubmit for 2016-03.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support

2016-01-09 Thread Artur Zakirov

On 09.01.2016 05:38, Alvaro Herrera wrote:

Artur Zakirov wrote:


Now almost all dictionaries are loaded into PostgreSQL. But the da_dk
dictionary does not load. I see the following error:

ERROR: invalid regular expression: quantifier operand invalid
CONTEXT: line 439 of configuration file
"/home/artur/progs/pgsql/share/tsearch_data/da_dk.affix": "SFX 55 0 s
+GENITIV

If you open the affix file in editor you can see that there is incorrect
format of the affix 55 in 439 line (screen1.png):
  
  [ another email ]



I also had implemented a patch that fixes an error from the e-mail
http://www.postgresql.org/message-id/562e1073.8030...@postgrespro.ru
This patch just ignore that error.

I think it's a bad idea to just ignore these syntax errors.  This affix
file is effectively corrupt, after all, so it seems a bad idea that we
need to cope with it.  I think it would be better to raise the error
normally and instruct the user to fix the file; obviously it's better if
the upstream provider of the file fixes it.

Now, if there is proof somewhere that the file is correct, then the code
must cope in some reasonable way.  But in any case I don't think this
change is acceptable ... it can only cause pain, in the long run.
This error is raised in Danish dictionary because of erroneous entry in 
the .affix file. I sent a bug-report to developer. He fixed this bug. 
Corrected dictionary can be downloaded from LibreOffice site.


I undo the changes and the error will be raised. I will update the patch 
soon.


--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support

2016-01-09 Thread Artur Zakirov

Thanks for review.

On 09.01.2016 02:04, Alvaro Herrera wrote:

Artur Zakirov wrote:

--- 74,80 
   
   typedef struct aff_struct

   {
!   uint32  flag:16,
type:1,
flagflags:7,
issimple:1,

By doing this, you're using 40 bits of a 32-bits-wide field.  What does
this mean?  Are the final 8 bits lost?  Does the compiler allocate a
second uint32 member for those additional bits?  I don't know, but I
don't think this is a very clean idea.
No, 8 bits are not lost. This 8 bits are used if a dictionary uses 
double extended ASCII character flag type (Conf->flagMode == FM_LONG) or 
decimal number flag type (Conf->flagMode == FM_NUM). If a dictionary 
uses single extended ASCII character flag type (Conf->flagMode == 
FM_CHAR), then 8 bits lost.


You can see it in decodeFlag function. This function is used in 
NIImportOOAffixes function, decode affix flag from string type and store 
in flag field (flag:16).



   typedef struct spell_struct
   {
!   struct
{
!   int affix;
!   int len;
!   }   d;
!   /*
!* flag is filled in by NIImportDictionary. After NISortDictionary, d
!* is valid and flag is invalid.
!*/
!   char   *flag;
charword[FLEXIBLE_ARRAY_MEMBER];
   } SPELL;

Here you removed the union, with no rationale for doing so.  Why did you
do it?  Can it be avoided?  Because of the comment, I'd imagine that d
and flag are valid at different times, so at any time we care about only
one of them; but you haven't updated the comment stating the reason for
that no longer to be the case.  I suspect you need to keep flag valid
after NISortDictionary has been called, but if so why?  If "flag" is
invalid as the comment says, what's the reason for keeping it?
Union was removed because the flag field need to be dynamically sized. 
It had 16 size in the previous version. In this field flag set are 
stored. For example, if .dict file has the entry:


mitigate/NDSGny

Then the "NDSGny" is stored in the flag field.

But in some cases a flag set can have size bigger than 16. I added this 
changes after this message 
http://www.postgresql.org/message-id/CAE2gYzwom3=11u9g8zxmt5plkzrwb12bwzxh4db3hud89fo...@mail.gmail.com

In that Turkish dictionary there are can be large flag set. For example:

özek/2240,852,749,5026,2242,4455,2594,2597,4963,1608,494,2409

This flag set has 56 size.
This "flag" is valid all the time. It is used in NISortDictionary and it 
is not used after NISortDictionary function has been called. Maybe you 
right and there are no reason for keeping it, and it is necessary to 
store all flags in separate variable, that will be deleted after 
NISortDictionary has been called.



The routines decodeFlag and isAffixFlagInUse could do with more
comments.  Your patch adds zero.  Actually the whole file has not nearly
enough comments; adding some more would be very good.

Actually, after some more reading, I think this code is pretty terrible.
I have a hard time figuring out how the original works, which makes it
even more difficult to figure out whether your changes make sense.  I
would have to take your patch on faith, which doesn't sound so great an
idea.

palloc / cpalloc / tmpalloc make the whole mess even more confusing.
Why does this file have three ways to allocate memory?

Not sure what's a good way to go about this.  I am certainly not going
to commit this as is, because if I do whatever bugs you have are going
to become my problem; and with the severe lack of documentation and
given how fiddly this stuff is, I bet there are going to be a bunch of
bugs.  I suspect most committers are going to be in the same position.
I think you should start by adding a few comments here and there on top
of the original to explain how it works, then your patch on top.  I
suppose it's going to be a lot of work for you but I don't see any other
way.

A top-level overview about it would be good, too.  The current comment
at top of file states:

  * spell.c
  * Normalizing word with ISpell

which is, err, somewhat laconic.

I will provide comments and explain how it works in comments. Maybe I 
will add some explanation about dictionaries structure. I will update 
the patch soon.


--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support

2016-01-08 Thread Alvaro Herrera
Artur Zakirov wrote:

> Now almost all dictionaries are loaded into PostgreSQL. But the da_dk
> dictionary does not load. I see the following error:
> 
> ERROR: invalid regular expression: quantifier operand invalid
> CONTEXT: line 439 of configuration file
> "/home/artur/progs/pgsql/share/tsearch_data/da_dk.affix": "SFX 55 0 s
> +GENITIV
> 
> If you open the affix file in editor you can see that there is incorrect
> format of the affix 55 in 439 line (screen1.png):
 
 [ another email ]

> I also had implemented a patch that fixes an error from the e-mail
> http://www.postgresql.org/message-id/562e1073.8030...@postgrespro.ru
> This patch just ignore that error.

I think it's a bad idea to just ignore these syntax errors.  This affix
file is effectively corrupt, after all, so it seems a bad idea that we
need to cope with it.  I think it would be better to raise the error
normally and instruct the user to fix the file; obviously it's better if
the upstream provider of the file fixes it.

Now, if there is proof somewhere that the file is correct, then the code
must cope in some reasonable way.  But in any case I don't think this
change is acceptable ... it can only cause pain, in the long run.

> *** 429,443  NIAddAffix(IspellDict *Conf, int flag, char flagflags, const 
> char *mask, const c
>   err = pg_regcomp(&(Affix->reg.regex), wmask, wmasklen,
>REG_ADVANCED | REG_NOSUB,
>DEFAULT_COLLATION_OID);
>   if (err)
> ! {
> ! charerrstr[100];
> ! 
> ! pg_regerror(err, &(Affix->reg.regex), errstr, 
> sizeof(errstr));
> ! ereport(ERROR,
> ! 
> (errcode(ERRCODE_INVALID_REGULAR_EXPRESSION),
> !  errmsg("invalid regular expression: 
> %s", errstr)));
> ! }
>   }
>   
>   Affix->flagflags = flagflags;
> --- 429,437 
>   err = pg_regcomp(&(Affix->reg.regex), wmask, wmasklen,
>REG_ADVANCED | REG_NOSUB,
>DEFAULT_COLLATION_OID);
> + /* Ignore regular expression error and do not add wrong affix */
>   if (err)
> ! return;
>   }
>   
>   Affix->flagflags = flagflags;


-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support

2016-01-08 Thread Alvaro Herrera
Artur Zakirov wrote:

> *** 77,83  typedef struct spell_struct
>   
>   typedef struct aff_struct
>   {
> ! uint32  flag:8,
>   type:1,
>   flagflags:7,
>   issimple:1,
> --- 74,80 
>   
>   typedef struct aff_struct
>   {
> ! uint32  flag:16,
>   type:1,
>   flagflags:7,
>   issimple:1,

By doing this, you're using 40 bits of a 32-bits-wide field.  What does
this mean?  Are the final 8 bits lost?  Does the compiler allocate a
second uint32 member for those additional bits?  I don't know, but I
don't think this is a very clean idea.

>   typedef struct spell_struct
>   {
> ! union
>   {
> ! /*
> !  * flag is filled in by NIImportDictionary. After 
> NISortDictionary, d
> !  * is valid and flag is invalid.
> !  */
> ! charflag[MAXFLAGLEN];
> ! struct
> ! {
> ! int affix;
> ! int len;
> ! }   d;
> ! }   p;
>   charword[FLEXIBLE_ARRAY_MEMBER];
>   } SPELL;
>   
> --- 57,72 
>   
>   typedef struct spell_struct
>   {
> ! struct
>   {
> ! int affix;
> ! int len;
> ! }   d;
> ! /*
> !  * flag is filled in by NIImportDictionary. After NISortDictionary, d
> !  * is valid and flag is invalid.
> !  */
> ! char   *flag;
>   charword[FLEXIBLE_ARRAY_MEMBER];
>   } SPELL;

Here you removed the union, with no rationale for doing so.  Why did you
do it?  Can it be avoided?  Because of the comment, I'd imagine that d
and flag are valid at different times, so at any time we care about only
one of them; but you haven't updated the comment stating the reason for
that no longer to be the case.  I suspect you need to keep flag valid
after NISortDictionary has been called, but if so why?  If "flag" is
invalid as the comment says, what's the reason for keeping it?


The routines decodeFlag and isAffixFlagInUse could do with more
comments.  Your patch adds zero.  Actually the whole file has not nearly
enough comments; adding some more would be very good.

Actually, after some more reading, I think this code is pretty terrible.
I have a hard time figuring out how the original works, which makes it
even more difficult to figure out whether your changes make sense.  I
would have to take your patch on faith, which doesn't sound so great an
idea.

palloc / cpalloc / tmpalloc make the whole mess even more confusing.
Why does this file have three ways to allocate memory?

Not sure what's a good way to go about this.  I am certainly not going
to commit this as is, because if I do whatever bugs you have are going
to become my problem; and with the severe lack of documentation and
given how fiddly this stuff is, I bet there are going to be a bunch of
bugs.  I suspect most committers are going to be in the same position.
I think you should start by adding a few comments here and there on top
of the original to explain how it works, then your patch on top.  I
suppose it's going to be a lot of work for you but I don't see any other
way.

A top-level overview about it would be good, too.  The current comment
at top of file states:

 * spell.c
 *  Normalizing word with ISpell

which is, err, somewhat laconic.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support

2015-12-01 Thread Artur Zakirov

On 16.11.2015 15:51, Artur Zakirov wrote:

On 10.11.2015 13:23, Artur Zakirov wrote:


Link to patch in commitfest:
https://commitfest.postgresql.org/8/420/

Link to regression tests:
https://dl.dropboxusercontent.com/u/15423817/HunspellDictTest.tar.gz


I have done some changes in documentation in the section "12.6. 
Dictionaries". I have added some description how to load Ispell and 
Hunspell dictionaries and description about Ispell and Hunspell formats.


Patches for the documentation and for the code are attached separately.

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
*** a/src/backend/tsearch/spell.c
--- b/src/backend/tsearch/spell.c
***
*** 153,159  cmpspell(const void *s1, const void *s2)
  static int
  cmpspellaffix(const void *s1, const void *s2)
  {
! 	return (strncmp((*(SPELL *const *) s1)->p.flag, (*(SPELL *const *) s2)->p.flag, MAXFLAGLEN));
  }
  
  static char *
--- 153,159 
  static int
  cmpspellaffix(const void *s1, const void *s2)
  {
! 	return (strncmp((*(SPELL *const *) s1)->flag, (*(SPELL *const *) s2)->flag, MAXFLAGLEN));
  }
  
  static char *
***
*** 237,242  cmpaffix(const void *s1, const void *s2)
--- 237,309 
  	   (const unsigned char *) a2->repl);
  }
  
+ static unsigned short
+ decodeFlag(IspellDict *Conf, char *sflag, char **sflagnext)
+ {
+ 	unsigned short	s;
+ 	char		   *next;
+ 
+ 	switch (Conf->flagMode)
+ 	{
+ 		case FM_LONG:
+ 			s = (int)sflag[0] << 8 | (int)sflag[1];
+ 			if (sflagnext)
+ *sflagnext = sflag + 2;
+ 			break;
+ 		case FM_NUM:
+ 			s = (unsigned short) strtol(sflag, &next, 10);
+ 			if (sflagnext)
+ 			{
+ if (next)
+ {
+ 	*sflagnext = next;
+ 	while (**sflagnext)
+ 	{
+ 		if (**sflagnext == ',')
+ 		{
+ 			*sflagnext = *sflagnext + 1;
+ 			break;
+ 		}
+ 		*sflagnext = *sflagnext + 1;
+ 	}
+ }
+ else
+ 	*sflagnext = 0;
+ 			}
+ 			break;
+ 		default:
+ 			s = (unsigned short) *((unsigned char *)sflag);
+ 			if (sflagnext)
+ *sflagnext = sflag + 1;
+ 	}
+ 
+ 	return s;
+ }
+ 
+ static bool
+ isAffixFlagInUse(IspellDict *Conf, int affix, unsigned short affixflag)
+ {
+ 	char *flagcur;
+ 	char *flagnext = 0;
+ 
+ 	if (affixflag == 0)
+ 		return true;
+ 
+ 	flagcur = Conf->AffixData[affix];
+ 
+ 	while (*flagcur)
+ 	{
+ 		if (decodeFlag(Conf, flagcur, &flagnext) == affixflag)
+ 			return true;
+ 		if (flagnext)
+ 			flagcur = flagnext;
+ 		else
+ 			break;
+ 	}
+ 
+ 	return false;
+ }
+ 
  static void
  NIAddSpell(IspellDict *Conf, const char *word, const char *flag)
  {
***
*** 255,261  NIAddSpell(IspellDict *Conf, const char *word, const char *flag)
  	}
  	Conf->Spell[Conf->nspell] = (SPELL *) tmpalloc(SPELLHDRSZ + strlen(word) + 1);
  	strcpy(Conf->Spell[Conf->nspell]->word, word);
! 	strlcpy(Conf->Spell[Conf->nspell]->p.flag, flag, MAXFLAGLEN);
  	Conf->nspell++;
  }
  
--- 322,328 
  	}
  	Conf->Spell[Conf->nspell] = (SPELL *) tmpalloc(SPELLHDRSZ + strlen(word) + 1);
  	strcpy(Conf->Spell[Conf->nspell]->word, word);
! 	Conf->Spell[Conf->nspell]->flag = (*flag != '\0') ? cpstrdup(Conf, flag) : VoidString;
  	Conf->nspell++;
  }
  
***
*** 355,361  FindWord(IspellDict *Conf, const char *word, int affixflag, int flag)
  	else if ((flag & StopMiddle->compoundflag) == 0)
  		return 0;
  
! 	if ((affixflag == 0) || (strchr(Conf->AffixData[StopMiddle->affix], affixflag) != NULL))
  		return 1;
  }
  node = StopMiddle->node;
--- 422,428 
  	else if ((flag & StopMiddle->compoundflag) == 0)
  		return 0;
  
! 	if (isAffixFlagInUse(Conf, StopMiddle->affix, affixflag))
  		return 1;
  }
  node = StopMiddle->node;
***
*** 394,400  NIAddAffix(IspellDict *Conf, int flag, char flagflags, const char *mask, const c
  
  	Affix = Conf->Affix + Conf->naffixes;
  
! 	if (strcmp(mask, ".") == 0)
  	{
  		Affix->issimple = 1;
  		Affix->isregis = 0;
--- 461,467 
  
  	Affix = Conf->Affix + Conf->naffixes;
  
! 	if (strcmp(mask, ".") == 0 || *mask == '\0')
  	{
  		Affix->issimple = 1;
  		Affix->isregis = 0;
***
*** 429,443  NIAddAffix(IspellDict *Conf, int flag, char flagflags, const char *mask, const c
  		err = pg_regcomp(&(Affix->reg.regex), wmask, wmasklen,
  		 REG_ADVANCED | REG_NOSUB,
  		 DEFAULT_COLLATION_OID);
  		if (err)
! 		{
! 			char		errstr[100];
! 
! 			pg_regerror(err, &(Affix->reg.regex), errstr, sizeof(errstr));
! 			ereport(ERROR,
! 	(errcode(ERRCODE_INVALID_REGULAR_EXPRESSION),
! 	 errmsg("invalid regular expression: %s", errstr)));
! 		}
  	}
  
  	Affix->flagflags = flagflags;
--- 496,504 
  		err = pg_regcomp(&(Affix->reg.regex), wmask, wmasklen,
  		 REG_ADVANCED | REG_NOSUB,
  		 DEFAULT_COLLATION_OID);
+ 		/* Ignore regular expression error and do not add wrong affix */
  		if (err)
! 			return;
  	}
  
  	Affix->flagflags = flagflags

Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support

2015-11-16 Thread Artur Zakirov

On 10.11.2015 13:23, Artur Zakirov wrote:


Link to patch in commitfest:
https://commitfest.postgresql.org/8/420/

Link to regression tests:
https://dl.dropboxusercontent.com/u/15423817/HunspellDictTest.tar.gz



Hello!

Do you have any remarks or comments about my patch?

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support

2015-11-10 Thread Artur Zakirov

08.11.2015 14:23, Artur Zakirov пишет:

Thank you for reply.

This was because of the flag field size of the SPELL struct. And long
flags were being trancated in the .dict file.

I attached new patch. It is temporary patch, not final. It can be done
better.



I have updated the patch and attached it. Now dynamic memory allocation 
is used to the flag field of the SPELL struct.


I have valued time of a dictionary loading and memory using by a 
dictionary in the new patch. Dictionary is loaded at the first reference 
to it. For example, if we execute ts_lexize function. And first 
ts_lexize executing takes more time than second.


The following table shows performance of some dictionaries before patch 
and after in my computer.


-
|  |  loading time, ms  |  memory, MB   |
|  |  before  |  after  |  before |  after  |
-
|ar| 700  | 300 | 23,7| 15,7|
|br_fr | 410  | 450 | 27,4| 27,5|
|ca| 248  | 245 | 14,7| 15,4|
|en_us | 100  | 100 | 5,4 | 6,2 |
|fr| 160  | 178 | 13,7| 14,1|
|gl_es | 160  | 150 | 9   | 9,4 |
|is| 260  | 202 | 16,1| 16,3|
-

As you can see, substantially loading time and memory using before and 
after the patch are same.


Link to patch in commitfest:
https://commitfest.postgresql.org/8/420/

Link to regression tests:
https://dl.dropboxusercontent.com/u/15423817/HunspellDictTest.tar.gz

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
*** a/src/backend/tsearch/spell.c
--- b/src/backend/tsearch/spell.c
***
*** 153,159  cmpspell(const void *s1, const void *s2)
  static int
  cmpspellaffix(const void *s1, const void *s2)
  {
! 	return (strncmp((*(SPELL *const *) s1)->p.flag, (*(SPELL *const *) s2)->p.flag, MAXFLAGLEN));
  }
  
  static char *
--- 153,159 
  static int
  cmpspellaffix(const void *s1, const void *s2)
  {
! 	return (strncmp((*(SPELL *const *) s1)->flag, (*(SPELL *const *) s2)->flag, MAXFLAGLEN));
  }
  
  static char *
***
*** 237,242  cmpaffix(const void *s1, const void *s2)
--- 237,309 
  	   (const unsigned char *) a2->repl);
  }
  
+ static unsigned short
+ decodeFlag(IspellDict *Conf, char *sflag, char **sflagnext)
+ {
+ 	unsigned short	s;
+ 	char		   *next;
+ 
+ 	switch (Conf->flagMode)
+ 	{
+ 		case FM_LONG:
+ 			s = (int)sflag[0] << 8 | (int)sflag[1];
+ 			if (sflagnext)
+ *sflagnext = sflag + 2;
+ 			break;
+ 		case FM_NUM:
+ 			s = (unsigned short) strtol(sflag, &next, 10);
+ 			if (sflagnext)
+ 			{
+ if (next)
+ {
+ 	*sflagnext = next;
+ 	while (**sflagnext)
+ 	{
+ 		if (**sflagnext == ',')
+ 		{
+ 			*sflagnext = *sflagnext + 1;
+ 			break;
+ 		}
+ 		*sflagnext = *sflagnext + 1;
+ 	}
+ }
+ else
+ 	*sflagnext = 0;
+ 			}
+ 			break;
+ 		default:
+ 			s = (unsigned short) *((unsigned char *)sflag);
+ 			if (sflagnext)
+ *sflagnext = sflag + 1;
+ 	}
+ 
+ 	return s;
+ }
+ 
+ static bool
+ isAffixFlagInUse(IspellDict *Conf, int affix, unsigned short affixflag)
+ {
+ 	char *flagcur;
+ 	char *flagnext = 0;
+ 
+ 	if (affixflag == 0)
+ 		return true;
+ 
+ 	flagcur = Conf->AffixData[affix];
+ 
+ 	while (*flagcur)
+ 	{
+ 		if (decodeFlag(Conf, flagcur, &flagnext) == affixflag)
+ 			return true;
+ 		if (flagnext)
+ 			flagcur = flagnext;
+ 		else
+ 			break;
+ 	}
+ 
+ 	return false;
+ }
+ 
  static void
  NIAddSpell(IspellDict *Conf, const char *word, const char *flag)
  {
***
*** 255,261  NIAddSpell(IspellDict *Conf, const char *word, const char *flag)
  	}
  	Conf->Spell[Conf->nspell] = (SPELL *) tmpalloc(SPELLHDRSZ + strlen(word) + 1);
  	strcpy(Conf->Spell[Conf->nspell]->word, word);
! 	strlcpy(Conf->Spell[Conf->nspell]->p.flag, flag, MAXFLAGLEN);
  	Conf->nspell++;
  }
  
--- 322,328 
  	}
  	Conf->Spell[Conf->nspell] = (SPELL *) tmpalloc(SPELLHDRSZ + strlen(word) + 1);
  	strcpy(Conf->Spell[Conf->nspell]->word, word);
! 	Conf->Spell[Conf->nspell]->flag = (*flag != '\0') ? cpstrdup(Conf, flag) : VoidString;
  	Conf->nspell++;
  }
  
***
*** 355,361  FindWord(IspellDict *Conf, const char *word, int affixflag, int flag)
  	else if ((flag & StopMiddle->compoundflag) == 0)
  		return 0;
  
! 	if ((affixflag == 0) || (strchr(Conf->AffixData[StopMiddle->affix], affixflag) != NULL))
  		return 1;
  }
  node = StopMiddle->node;
--- 422,428 
  	else if ((flag & StopMiddle->compoundflag) == 0)
  		return 0;
  
! 	if (isAffixFlagInUse(Conf, StopMiddle->affix, affixflag))
  		return 1;
  }
  node = StopMiddle->node;
***
*** 394,400  NIAddAffix(IspellDict *Conf, int flag, char flagflags, const char *mask, const c
  
  	Affix = Conf->Affix + Conf->naffix

Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support

2015-11-08 Thread Artur Zakirov

On 07.11.2015 17:20, Emre Hasegeli wrote:

It seems to have something to do with the order of the affixes.  It
works, if I move affix 2646 to the beginning of the list.

[1] https://tr-spell.googlecode.com/files/dict_aff_5000_suffix_113_words.zip

Thank you for reply.

This was because of the flag field size of the SPELL struct. And long 
flags were being trancated in the .dict file.


I attached new patch. It is temporary patch, not final. It can be done 
better.


--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

*** a/src/backend/tsearch/spell.c
--- b/src/backend/tsearch/spell.c
***
*** 153,159  cmpspell(const void *s1, const void *s2)
  static int
  cmpspellaffix(const void *s1, const void *s2)
  {
! 	return (strncmp((*(SPELL *const *) s1)->p.flag, (*(SPELL *const *) s2)->p.flag, MAXFLAGLEN));
  }
  
  static char *
--- 153,159 
  static int
  cmpspellaffix(const void *s1, const void *s2)
  {
! 	return (strcmp((*(SPELL *const *) s1)->flag, (*(SPELL *const *) s2)->flag));
  }
  
  static char *
***
*** 237,242  cmpaffix(const void *s1, const void *s2)
--- 237,309 
  	   (const unsigned char *) a2->repl);
  }
  
+ static unsigned short
+ decodeFlag(IspellDict *Conf, char *sflag, char **sflagnext)
+ {
+ 	unsigned short	s;
+ 	char		   *next;
+ 
+ 	switch (Conf->flagMode)
+ 	{
+ 		case FM_LONG:
+ 			s = (int)sflag[0] << 8 | (int)sflag[1];
+ 			if (sflagnext)
+ *sflagnext = sflag + 2;
+ 			break;
+ 		case FM_NUM:
+ 			s = (unsigned short) strtol(sflag, &next, 10);
+ 			if (sflagnext)
+ 			{
+ if (next)
+ {
+ 	*sflagnext = next;
+ 	while (**sflagnext)
+ 	{
+ 		if (**sflagnext == ',')
+ 		{
+ 			*sflagnext = *sflagnext + 1;
+ 			break;
+ 		}
+ 		*sflagnext = *sflagnext + 1;
+ 	}
+ }
+ else
+ 	*sflagnext = 0;
+ 			}
+ 			break;
+ 		default:
+ 			s = (unsigned short) *((unsigned char *)sflag);
+ 			if (sflagnext)
+ *sflagnext = sflag + 1;
+ 	}
+ 
+ 	return s;
+ }
+ 
+ static bool
+ isAffixFlagInUse(IspellDict *Conf, int affix, unsigned short affixflag)
+ {
+ 	char *flagcur;
+ 	char *flagnext = 0;
+ 
+ 	if (affixflag == 0)
+ 		return true;
+ 
+ 	flagcur = Conf->AffixData[affix];
+ 
+ 	while (*flagcur)
+ 	{
+ 		if (decodeFlag(Conf, flagcur, &flagnext) == affixflag)
+ 			return true;
+ 		if (flagnext)
+ 			flagcur = flagnext;
+ 		else
+ 			break;
+ 	}
+ 
+ 	return false;
+ }
+ 
  static void
  NIAddSpell(IspellDict *Conf, const char *word, const char *flag)
  {
***
*** 255,261  NIAddSpell(IspellDict *Conf, const char *word, const char *flag)
  	}
  	Conf->Spell[Conf->nspell] = (SPELL *) tmpalloc(SPELLHDRSZ + strlen(word) + 1);
  	strcpy(Conf->Spell[Conf->nspell]->word, word);
! 	strlcpy(Conf->Spell[Conf->nspell]->p.flag, flag, MAXFLAGLEN);
  	Conf->nspell++;
  }
  
--- 322,328 
  	}
  	Conf->Spell[Conf->nspell] = (SPELL *) tmpalloc(SPELLHDRSZ + strlen(word) + 1);
  	strcpy(Conf->Spell[Conf->nspell]->word, word);
! 	Conf->Spell[Conf->nspell]->flag = cpstrdup(Conf, flag);
  	Conf->nspell++;
  }
  
***
*** 355,361  FindWord(IspellDict *Conf, const char *word, int affixflag, int flag)
  	else if ((flag & StopMiddle->compoundflag) == 0)
  		return 0;
  
! 	if ((affixflag == 0) || (strchr(Conf->AffixData[StopMiddle->affix], affixflag) != NULL))
  		return 1;
  }
  node = StopMiddle->node;
--- 422,428 
  	else if ((flag & StopMiddle->compoundflag) == 0)
  		return 0;
  
! 	if (isAffixFlagInUse(Conf, StopMiddle->affix, affixflag))
  		return 1;
  }
  node = StopMiddle->node;
***
*** 394,400  NIAddAffix(IspellDict *Conf, int flag, char flagflags, const char *mask, const c
  
  	Affix = Conf->Affix + Conf->naffixes;
  
! 	if (strcmp(mask, ".") == 0)
  	{
  		Affix->issimple = 1;
  		Affix->isregis = 0;
--- 461,467 
  
  	Affix = Conf->Affix + Conf->naffixes;
  
! 	if (strcmp(mask, ".") == 0 || *mask == '\0')
  	{
  		Affix->issimple = 1;
  		Affix->isregis = 0;
***
*** 429,443  NIAddAffix(IspellDict *Conf, int flag, char flagflags, const char *mask, const c
  		err = pg_regcomp(&(Affix->reg.regex), wmask, wmasklen,
  		 REG_ADVANCED | REG_NOSUB,
  		 DEFAULT_COLLATION_OID);
  		if (err)
! 		{
! 			char		errstr[100];
! 
! 			pg_regerror(err, &(Affix->reg.regex), errstr, sizeof(errstr));
! 			ereport(ERROR,
! 	(errcode(ERRCODE_INVALID_REGULAR_EXPRESSION),
! 	 errmsg("invalid regular expression: %s", errstr)));
! 		}
  	}
  
  	Affix->flagflags = flagflags;
--- 496,504 
  		err = pg_regcomp(&(Affix->reg.regex), wmask, wmasklen,
  		 REG_ADVANCED | REG_NOSUB,
  		 DEFAULT_COLLATION_OID);
+ 		/* Ignore regular expression error and do not add wrong affix */
  		if (err)
! 			return;
  	}
  
  	Affix->flagflags = flagflags;
***
*** 595,604  addFlagValue(IspellDict *Conf, char *s, uint32 val)
  (errcode(ERR

Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support

2015-11-07 Thread Emre Hasegeli
Thank you for working on this.

I tried the patch with a Turkish dictionary [1] I could find on the
Internet.  It worked for some words, but not others:

> hasegeli=# create text search dictionary hunspell_tr (template = ispell, 
> dictfile = tr, afffile = tr);
> CREATE TEXT SEARCH DICTIONARY
>
> hasegeli=# select ts_lexize('hunspell_tr', 'tilki'); -- The root "fox"
> ---
> {tilki}
> (1 row)
>
> hasegeli=# select ts_lexize('hunspell_tr', 'tilkinin'); -- Genitive form, 
> affix 3290
> ts_lexize
> ---
> {tilki}
> (1 row)
>
> hasegeli=# select ts_lexize('hunspell_tr', 'tilkiler'); -- Plural form, affix 
> 4371
> ts_lexize
> ---
> {tilki}
> (1 row)
>
> hasegeli=# select ts_lexize('hunspell_tr', 'tilkiyi'); -- Accusative form, 
> affix 2646
> ts_lexize
> ---
>
> (1 row)

It seems to have something to do with the order of the affixes.  It
works, if I move affix 2646 to the beginning of the list.

[1] https://tr-spell.googlecode.com/files/dict_aff_5000_suffix_113_words.zip


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support

2015-11-06 Thread Artur Zakirov

06.11.2015 12:33, Artur Zakirov пишет:

Hello again!

Patches
===


Link to commitfest:
https://commitfest.postgresql.org/8/420/

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support

2015-11-06 Thread Artur Zakirov

Hello again!

Patches
===

I had implemented support for FLAG long, FLAG num and AF parameters. I 
attached patch to the e-mail (hunspell-dict.patch).


This patch allow to use Hunspell dictionaries listed in the previous 
e-mail: ar, br_fr, ca, ca_valencia, en_ca, en_gb, en_us, fr, gl_es, 
hu_hu, is, ne_np, nl_nl, si_lk.


The most part of changes was in spell.c in the affix file parsing code.
The following are dictionary structures changes:
- useFlagAliases and flagMode fields had been added to the IspellDict 
struct;

- flagval array size had been increased from 256 to 65000;
- flag field of the AFFIX struct also had been increased.

I also had implemented a patch that fixes an error from the e-mail
http://www.postgresql.org/message-id/562e1073.8030...@postgrespro.ru
This patch just ignore that error.

Tests
=

Extention test dictionaries for loading into PostgreSQL and for 
normalizing with ts_lexize function can be downloaded from 
https://dl.dropboxusercontent.com/u/15423817/HunspellDictTest.tar.gz


It would be nice if somebody can do additional tests of dictionaries of 
well known languages. Because I do not know many of them.


Other Improvements
==

There are also some parameters for compound words. But I am not sure 
that we want use this parameters.


--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
*** a/src/backend/tsearch/spell.c
--- b/src/backend/tsearch/spell.c
***
*** 237,242  cmpaffix(const void *s1, const void *s2)
--- 237,309 
  	   (const unsigned char *) a2->repl);
  }
  
+ static unsigned short
+ decodeFlag(IspellDict *Conf, char *sflag, char **sflagnext)
+ {
+ 	unsigned short	s;
+ 	char		   *next;
+ 
+ 	switch (Conf->flagMode)
+ 	{
+ 		case FM_LONG:
+ 			s = (int)sflag[0] << 8 | (int)sflag[1];
+ 			if (sflagnext)
+ *sflagnext = sflag + 2;
+ 			break;
+ 		case FM_NUM:
+ 			s = (unsigned short) strtol(sflag, &next, 10);
+ 			if (sflagnext)
+ 			{
+ if (next)
+ {
+ 	*sflagnext = next;
+ 	while (**sflagnext)
+ 	{
+ 		if (**sflagnext == ',')
+ 		{
+ 			*sflagnext = *sflagnext + 1;
+ 			break;
+ 		}
+ 		*sflagnext = *sflagnext + 1;
+ 	}
+ }
+ else
+ 	*sflagnext = 0;
+ 			}
+ 			break;
+ 		default:
+ 			s = (unsigned short) *((unsigned char *)sflag);
+ 			if (sflagnext)
+ *sflagnext = sflag + 1;
+ 	}
+ 
+ 	return s;
+ }
+ 
+ static bool
+ isAffixFlagInUse(IspellDict *Conf, int affix, unsigned short affixflag)
+ {
+ 	char *flagcur;
+ 	char *flagnext = 0;
+ 
+ 	if (affixflag == 0)
+ 		return true;
+ 
+ 	flagcur = Conf->AffixData[affix];
+ 
+ 	while (*flagcur)
+ 	{
+ 		if (decodeFlag(Conf, flagcur, &flagnext) == affixflag)
+ 			return true;
+ 		if (flagnext)
+ 			flagcur = flagnext;
+ 		else
+ 			break;
+ 	}
+ 
+ 	return false;
+ }
+ 
  static void
  NIAddSpell(IspellDict *Conf, const char *word, const char *flag)
  {
***
*** 355,361  FindWord(IspellDict *Conf, const char *word, int affixflag, int flag)
  	else if ((flag & StopMiddle->compoundflag) == 0)
  		return 0;
  
! 	if ((affixflag == 0) || (strchr(Conf->AffixData[StopMiddle->affix], affixflag) != NULL))
  		return 1;
  }
  node = StopMiddle->node;
--- 422,428 
  	else if ((flag & StopMiddle->compoundflag) == 0)
  		return 0;
  
! 	if (isAffixFlagInUse(Conf, StopMiddle->affix, affixflag))
  		return 1;
  }
  node = StopMiddle->node;
***
*** 394,400  NIAddAffix(IspellDict *Conf, int flag, char flagflags, const char *mask, const c
  
  	Affix = Conf->Affix + Conf->naffixes;
  
! 	if (strcmp(mask, ".") == 0)
  	{
  		Affix->issimple = 1;
  		Affix->isregis = 0;
--- 461,467 
  
  	Affix = Conf->Affix + Conf->naffixes;
  
! 	if (strcmp(mask, ".") == 0 || *mask == '\0')
  	{
  		Affix->issimple = 1;
  		Affix->isregis = 0;
***
*** 595,604  addFlagValue(IspellDict *Conf, char *s, uint32 val)
  (errcode(ERRCODE_CONFIG_FILE_ERROR),
   errmsg("multibyte flag character is not allowed")));
  
! 	Conf->flagval[*(unsigned char *) s] = (unsigned char) val;
  	Conf->usecompound = true;
  }
  
  /*
   * Import an affix file that follows MySpell or Hunspell format
   */
--- 662,719 
  (errcode(ERRCODE_CONFIG_FILE_ERROR),
   errmsg("multibyte flag character is not allowed")));
  
! 	Conf->flagval[decodeFlag(Conf, s, (char **)NULL)] = (unsigned char) val;
  	Conf->usecompound = true;
  }
  
+ static int
+ getFlagValues(IspellDict *Conf, char *s)
+ {
+ 	uint32	 flag = 0;
+ 	char	*flagcur;
+ 	char	*flagnext = 0;
+ 
+ 	flagcur = s;
+ 	while (*flagcur)
+ 	{
+ 		flag |= Conf->flagval[decodeFlag(Conf, flagcur, &flagnext)];
+ 		if (flagnext)
+ 			flagcur = flagnext;
+ 		else
+ 			break;
+ 	}
+ 
+ 	return flag;
+ }
+ 
+ /*
+  * Get flag set from "s".
+  *
+  * Returns flag set from AffixData array if AF parameter used (useFlagAliases is true).
+  * In this case "s" is alias for flag 

Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support

2015-10-26 Thread Artur Zakirov

20.10.2015 17:00, Artur Zakirov пишет:

These flag types are used in affix files of such dictionaries as ar,
br_fr, ca, ca_valencia, da_dk, en_ca, en_gb, en_us, fr, gl_es, is,
ne_np, nl_nl, si_lk (from
http://cgit.freedesktop.org/libreoffice/dictionaries/tree/).


Now almost all dictionaries are loaded into PostgreSQL. But the da_dk 
dictionary does not load. I see the following error:


ERROR: invalid regular expression: quantifier operand invalid
CONTEXT: line 439 of configuration file 
"/home/artur/progs/pgsql/share/tsearch_data/da_dk.affix": "SFX 55 0 s 
+GENITIV


If you open the affix file in editor you can see that there is incorrect 
format of the affix 55 in 439 line (screen1.png):


SFX 55 0  s +GENITIV

SFX parameter should have a 5 fields. There are no field between "0" 
digit and "s" symbol. "+GENITIV" is the optional morphological field and 
ignored by PostgreSQL.


I think that it is a error of the affix file. I wrote a e-mail to 
i...@stavekontrolden.dk to the dictionary authors about this error.


What is the right decision in this case? Should PostgreSQL ignore this 
error and do not show it?


--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support

2015-10-21 Thread Artur Zakirov

21.10.2015 01:37, Jim Nasby пишет:

On 10/20/15 9:00 AM, Artur Zakirov wrote:

Internal representation of the dictionary in the PostgreSQL doesn't
impose too strict limits on the number of affix rules. There are a
flagval array, which size must be increased from 256 to 65000.


Is that per dictionary entry, fixed at 64k? That seems pretty excessive,
if that's the case...


This is per dictionary only. flagval array is used for the all 
dictionary. And it is not used for every dictionary word.


There are also flag field of AFFIX structure, wich size must be 
increased from 8 bit to 16 bit. This structure is used for every affix 
in affix file.


--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support

2015-10-20 Thread Jim Nasby

On 10/20/15 9:00 AM, Artur Zakirov wrote:

Internal representation of the dictionary in the PostgreSQL doesn't
impose too strict limits on the number of affix rules. There are a
flagval array, which size must be increased from 256 to 65000.


Is that per dictionary entry, fixed at 64k? That seems pretty excessive, 
if that's the case...

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers