Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support
On 16.02.2016 18:14, Artur Zakirov wrote: I attached a new version of the patch. Sorry for noise. I attached new version of the patch. I saw mistakes in DecodeFlag(). This patch fix them. -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company *** a/doc/src/sgml/textsearch.sgml --- b/doc/src/sgml/textsearch.sgml *** *** 2615,2632 SELECT plainto_tsquery('supernova star'); ! To create an Ispell dictionary, use the built-in ! ispell template and specify several parameters: ! ! CREATE TEXT SEARCH DICTIONARY english_ispell ( TEMPLATE = ispell, ! DictFile = english, ! AffFile = english, ! StopWords = english ! ); Here, DictFile, AffFile, and StopWords --- 2615,2655 ! To create an Ispell dictionary perform these steps: ! ! ! ! download dictionary configuration files. OpenOffice ! extension files have the .oxt extension. It is necessary ! to extract .aff and .dic files, change ! extensions to .affix and .dict. For some ! dictionary files it is also needed to convert characters to the UTF-8 ! encoding with commands (for example, for norwegian language dictionary): ! iconv -f ISO_8859-1 -t UTF-8 -o nn_no.affix nn_NO.aff ! iconv -f ISO_8859-1 -t UTF-8 -o nn_no.dict nn_NO.dic ! ! ! ! ! ! copy files to the $SHAREDIR/tsearch_data directory ! ! ! ! ! load files into PostgreSQL with the following command: ! ! CREATE TEXT SEARCH DICTIONARY english_hunspell ( TEMPLATE = ispell, ! DictFile = en_us, ! AffFile = en_us, ! Stopwords = english); + + + Here, DictFile, AffFile, and StopWords *** *** 2643,2648 CREATE TEXT SEARCH DICTIONARY english_ispell ( --- 2666,2721 + The .affix file of Ispell has the following + structure: + + prefixes + flag *A: + . > RE # As in enter > reenter + suffixes + flag T: + E > ST # As in late > latest + [^AEIOU]Y > -Y,IEST # As in dirty > dirtiest + [AEIOU]Y> EST # As in gray > grayest + [^EY] > EST # As in small > smallest + + + + And the .dict file has the following structure: + + lapse/ADGRS + lard/DGRS + large/PRTY + lark/MRS + + + + + Format of the .dict file is: + + basic_form/affix_class_name + + + + + In the .affix file every affix flag is described in the + following format: + + condition > [-stripping_letters,] adding_affix + + + + + Here, condition has a format similar to the format of regular expressions. + It can use groupings [...] and [^...]. + For example, [AEIOU]Y means that the last letter of the word + is "y" and the penultimate letter is "a", + "e", "i", "o" or "u". + [^EY] means that the last letter is neither "e" + nor "y". + + + Ispell dictionaries support splitting compound words; a useful feature. Notice that the affix file should specify a special flag using the *** *** 2663,2668 SELECT ts_lexize('norwegian_ispell', 'sjokoladefabrikk'); --- 2736,2800 + + MySpell is very similar to Hunspell. + The .affix file of Hunspell has the following + structure: + + PFX A Y 1 + PFX A 0 re . + SFX T N 4 + SFX T 0 st e + SFX T y iest [^aeiou]y + SFX T 0 est[aeiou]y + SFX T 0 est[^ey] + + + + + The first line of an affix class is the header. Fields of an affix rules are + listed after the header: + + + + + parameter name (PFX or SFX) + + + + + flag (name of the affix class) + + + + + stripping characters from beginning (at prefix) or end (at suffix) of the + word + + + + + adding affix + + + + + condition that has a format similar to the format of regular expressions. + + + + + + The .dict file looks like the .dict file of + Ispell: + + larder/M + lardy/RT + large/RSPMYT + largehearted + + + MySpell does not support compound words. *** a/src/backend/tsearch/Makefile --- b/src/backend/tsearch/Makefile *** *** 13,20 include $(top_builddir)/src/Makefile.global DICTDIR=tsearch_data ! DICTFILES=synonym_sample.syn thesaurus_sample.ths hunspell_sample.affix \ ! ispell_sample.affix ispell_sample.dict OBJS = ts_locale.o ts_parse.o wparser.o wparser_def.o dict.o \ dict_simple.o dict_synonym.o dict_thesaurus.o \ --- 13,23 DICTDIR=tsearch_data ! DICTFILES=dicts/synonym_sample.syn dicts/thesaurus_sample.ths \ ! dicts/hunspell_sample.affix \ ! dicts/ispell_sample.affix dicts
Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support
I attached a new version of the patch. On 10.02.2016 19:46, Teodor Sigaev wrote: I duplicate the patch here. it's very good thing to update disctionaries to support modern versions. And thank you for improving documentation. Also I've impressed by long description in spell.c header. Som notices about code: 1 struct SPELL. Why do you remove union p? You leave comment about using d struct instead of flag field and as can see it's right comment. It increases size of SPELL structure. Fixed. 2 struct AFFIX. I'm agree with Alvaro taht sum of sizes of bit fields should be less or equal to size of integer. In opposite case, suppose, we can get undefined behavior. Please, split bitfields to two integers. Fixed. 3 unsigned char flagval[65000]; Is it forbidden to use 6 number? In any case, decodeFlag() doesn't restrict return value. I suggest to enlarge array to 1<<16 and add limit to return value of decodeFlag(). flagval array was enlarged. Added verification of return value of DecodeFlag() for for various FLAG parameter (FM_LONG, FM_NUM and FM_CHAR). 4 I'd like to see a short comment describing at least new functions Added some comments which describe new functions and old functions for loading dictionaries into PostgreSQL. This patch adds new functions and modifies functions which is used for loading dictionaries. At the moment, comments does not describe functions which used for word normalization. But I can add more comments. 5 Pls, add tests for new code. Added tests. Old sample dictionaries files was moved to the folder "dicts". New sample dictionaries files was added: - hunspell_sample_long.affix - hunspell_sample_long.dict - hunspell_sample_num.affix - hunspell_sample_num.dict -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company *** a/doc/src/sgml/textsearch.sgml --- b/doc/src/sgml/textsearch.sgml *** *** 2615,2632 SELECT plainto_tsquery('supernova star'); ! To create an Ispell dictionary, use the built-in ! ispell template and specify several parameters: ! ! CREATE TEXT SEARCH DICTIONARY english_ispell ( TEMPLATE = ispell, ! DictFile = english, ! AffFile = english, ! StopWords = english ! ); Here, DictFile, AffFile, and StopWords --- 2615,2655 ! To create an Ispell dictionary perform these steps: ! ! ! ! download dictionary configuration files. OpenOffice ! extension files have the .oxt extension. It is necessary ! to extract .aff and .dic files, change extensions ! to .affix and .dict. For some dictionary ! files it is also needed to convert characters to the UTF-8 encoding ! with commands (for example, for norwegian language dictionary): ! iconv -f ISO_8859-1 -t UTF-8 -o nn_no.affix nn_NO.aff ! iconv -f ISO_8859-1 -t UTF-8 -o nn_no.dict nn_NO.dic ! ! ! ! ! ! copy files to the $SHAREDIR/tsearch_data directory ! ! ! ! ! load files into PostgreSQL with the following command: ! ! CREATE TEXT SEARCH DICTIONARY english_hunspell ( TEMPLATE = ispell, ! DictFile = en_us, ! AffFile = en_us, ! Stopwords = english); + + + Here, DictFile, AffFile, and StopWords *** *** 2643,2648 CREATE TEXT SEARCH DICTIONARY english_ispell ( --- 2666,2720 + The .affix file of Ispell has the following structure: + + prefixes + flag *A: + . > RE # As in enter > reenter + suffixes + flag T: + E > ST # As in late > latest + [^AEIOU]Y > -Y,IEST # As in dirty > dirtiest + [AEIOU]Y> EST # As in gray > grayest + [^EY] > EST # As in small > smallest + + + + And the .dict file has the following structure: + + lapse/ADGRS + lard/DGRS + large/PRTY + lark/MRS + + + + + Format of the .dict file is: + + basic_form/affix_class_name + + + + + In the .affix file every affix flag is described in the + following format: + + condition > [-stripping_letters,] adding_affix + + + + + Here, condition has a format similar to the format of regular expressions. + It can use groupings [...] and [^...]. + For example, [AEIOU]Y means that the last letter of the word + is "y" and the penultimate letter is "a", + "e", "i", "o" or "u". + [^EY] means that the last letter is neither "e" + nor "y". + + + Ispell dictionaries support splitting compound words; a useful feature. Notice that the affix file should specify a special flag using the *** *** 2663,2668 SELECT ts_lexize('norwegian_ispell', 'sjokoladefabrikk'); --- 2735,2796 + + MySpell is very similar to Hunspell. + The .affix file of Hunspell has the following
Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support
Thank you for the review. On 10.02.2016 19:46, Teodor Sigaev wrote: I duplicate the patch here. it's very good thing to update disctionaries to support modern versions. And thank you for improving documentation. Also I've impressed by long description in spell.c header. Som notices about code: 1 struct SPELL. Why do you remove union p? You leave comment about using d struct instead of flag field and as can see it's right comment. It increases size of SPELL structure. I will fix it. I had misunderstood the Alvaro's comment about it. 2 struct AFFIX. I'm agree with Alvaro taht sum of sizes of bit fields should be less or equal to size of integer. In opposite case, suppose, we can get undefined behavior. Please, split bitfields to two integers. I will fix it. Here I had misunderstood too. 3 unsigned char flagval[65000]; Is it forbidden to use 6 number? In any case, decodeFlag() doesn't restrict return value. I suggest to enlarge array to 1<<16 and add limit to return value of decodeFlag(). I think it can be done. 4 I'd like to see a short comment describing at least new functions Now in spell.c there are more comments. I wanted to send fixed patch after adding all comments that I want to add. But I can send the patch now. Also I will merge this commit http://www.postgresql.org/message-id/e1atf9o-0001ga...@gemulon.postgresql.org 5 Pls, add tests for new code. I will add. -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support
I duplicate the patch here. it's very good thing to update disctionaries to support modern versions. And thank you for improving documentation. Also I've impressed by long description in spell.c header. Som notices about code: 1 struct SPELL. Why do you remove union p? You leave comment about using d struct instead of flag field and as can see it's right comment. It increases size of SPELL structure. 2 struct AFFIX. I'm agree with Alvaro taht sum of sizes of bit fields should be less or equal to size of integer. In opposite case, suppose, we can get undefined behavior. Please, split bitfields to two integers. 3 unsigned char flagval[65000]; Is it forbidden to use 6 number? In any case, decodeFlag() doesn't restrict return value. I suggest to enlarge array to 1<<16 and add limit to return value of decodeFlag(). 4 I'd like to see a short comment describing at least new functions 5 Pls, add tests for new code. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support
Sorry, I don't know why this thread was moved to another thread. I duplicate the patch here. On 28.01.2016 14:19, Alvaro Herrera wrote: Artur Zakirov wrote: I undo the changes and the error will be raised. I will update the patch soon. I don't think you ever did this. I'm closing it now, but it sounds useful stuff so please do resubmit for 2016-03. I'm working on the patch. I wanted to send this changes after all changes. This version of the patch has a top-level comment. Another comments I will provides soon. Also this patch has some changes with ternary operators. I don't think you ever did this. I'm closing it now, but it sounds useful stuff so please do resubmit for 2016-03. Moved to next CF. -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company *** a/doc/src/sgml/textsearch.sgml --- b/doc/src/sgml/textsearch.sgml *** *** 2615,2632 SELECT plainto_tsquery('supernova star'); ! To create an Ispell dictionary, use the built-in ! ispell template and specify several parameters: ! ! CREATE TEXT SEARCH DICTIONARY english_ispell ( TEMPLATE = ispell, ! DictFile = english, ! AffFile = english, ! StopWords = english ! ); Here, DictFile, AffFile, and StopWords --- 2615,2655 ! To create an Ispell dictionary perform these steps: ! ! ! ! download dictionary configuration files. OpenOffice ! extension files have the .oxt extension. It is necessary ! to extract .aff and .dic files, change extensions ! to .affix and .dict. For some dictionary ! files it is also needed to convert characters to the UTF-8 encoding ! with commands (for example, for norwegian language dictionary): ! iconv -f ISO_8859-1 -t UTF-8 -o nn_no.affix nn_NO.aff ! iconv -f ISO_8859-1 -t UTF-8 -o nn_no.dict nn_NO.dic ! ! ! ! ! ! copy files to the $SHAREDIR/tsearch_data directory ! ! ! ! ! load files into PostgreSQL with the following command: ! ! CREATE TEXT SEARCH DICTIONARY english_hunspell ( TEMPLATE = ispell, ! DictFile = en_us, ! AffFile = en_us, ! Stopwords = english); + + + Here, DictFile, AffFile, and StopWords *** *** 2643,2648 CREATE TEXT SEARCH DICTIONARY english_ispell ( --- 2666,2720 + The .affix file of Ispell has the following structure: + + prefixes + flag *A: + . > RE # As in enter > reenter + suffixes + flag T: + E > ST # As in late > latest + [^AEIOU]Y > -Y,IEST # As in dirty > dirtiest + [AEIOU]Y> EST # As in gray > grayest + [^EY] > EST # As in small > smallest + + + + And the .dict file has the following structure: + + lapse/ADGRS + lard/DGRS + large/PRTY + lark/MRS + + + + + Format of the .dict file is: + + basic_form/affix_class_name + + + + + In the .affix file every affix flag is described in the + following format: + + condition > [-stripping_letters,] adding_affix + + + + + Here, condition has a format similar to the format of regular expressions. + It can use groupings [...] and [^...]. + For example, [AEIOU]Y means that the last letter of the word + is "y" and the penultimate letter is "a", + "e", "i", "o" or "u". + [^EY] means that the last letter is neither "e" + nor "y". + + + Ispell dictionaries support splitting compound words; a useful feature. Notice that the affix file should specify a special flag using the *** *** 2663,2668 SELECT ts_lexize('norwegian_ispell', 'sjokoladefabrikk'); --- 2735,2796 + + MySpell is very similar to Hunspell. + The .affix file of Hunspell has the following structure: + + PFX A Y 1 + PFX A 0 re . + SFX T N 4 + SFX T 0 st e + SFX T y iest [^aeiou]y + SFX T 0 est[aeiou]y + SFX T 0 est[^ey] + + + + + The first line of an affix class is the header. Fields of an affix rules are listed after the header: + + + + + parameter name (PFX or SFX) + + + + + flag (name of the affix class) + + + + + stripping characters from beginning (at prefix) or end (at suffix) of the word + + + + + adding affix + + + + + condition that has a format similar to the format of regular expressions. + + + + + + The .dict file looks like the .dict file of + Ispell: + + larder/M + lardy/RT + large/RSPMYT + largehearted + + + MySpell does not support compound words. *** a/src/backend/tsearch/spell.c --- b/src/backend/tsearch/spell.c *
Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support
On 28.01.2016 14:19, Alvaro Herrera wrote: Artur Zakirov wrote: I undo the changes and the error will be raised. I will update the patch soon. I don't think you ever did this. I'm closing it now, but it sounds useful stuff so please do resubmit for 2016-03. I'm working on the patch. I wanted to send this changes after all changes. This version of the patch has a top-level comment. Another comments I will provides soon. Also this patch has some changes with ternary operators. > I don't think you ever did this. I'm closing it now, but it sounds > useful stuff so please do resubmit for 2016-03. Moved to next CF. -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company *** a/doc/src/sgml/textsearch.sgml --- b/doc/src/sgml/textsearch.sgml *** *** 2615,2632 SELECT plainto_tsquery('supernova star'); ! To create an Ispell dictionary, use the built-in ! ispell template and specify several parameters: ! ! CREATE TEXT SEARCH DICTIONARY english_ispell ( TEMPLATE = ispell, ! DictFile = english, ! AffFile = english, ! StopWords = english ! ); Here, DictFile, AffFile, and StopWords --- 2615,2655 ! To create an Ispell dictionary perform these steps: ! ! ! ! download dictionary configuration files. OpenOffice ! extension files have the .oxt extension. It is necessary ! to extract .aff and .dic files, change extensions ! to .affix and .dict. For some dictionary ! files it is also needed to convert characters to the UTF-8 encoding ! with commands (for example, for norwegian language dictionary): ! iconv -f ISO_8859-1 -t UTF-8 -o nn_no.affix nn_NO.aff ! iconv -f ISO_8859-1 -t UTF-8 -o nn_no.dict nn_NO.dic ! ! ! ! ! ! copy files to the $SHAREDIR/tsearch_data directory ! ! ! ! ! load files into PostgreSQL with the following command: ! ! CREATE TEXT SEARCH DICTIONARY english_hunspell ( TEMPLATE = ispell, ! DictFile = en_us, ! AffFile = en_us, ! Stopwords = english); + + + Here, DictFile, AffFile, and StopWords *** *** 2643,2648 CREATE TEXT SEARCH DICTIONARY english_ispell ( --- 2666,2720 + The .affix file of Ispell has the following structure: + + prefixes + flag *A: + . > RE # As in enter > reenter + suffixes + flag T: + E > ST # As in late > latest + [^AEIOU]Y > -Y,IEST # As in dirty > dirtiest + [AEIOU]Y> EST # As in gray > grayest + [^EY] > EST # As in small > smallest + + + + And the .dict file has the following structure: + + lapse/ADGRS + lard/DGRS + large/PRTY + lark/MRS + + + + + Format of the .dict file is: + + basic_form/affix_class_name + + + + + In the .affix file every affix flag is described in the + following format: + + condition > [-stripping_letters,] adding_affix + + + + + Here, condition has a format similar to the format of regular expressions. + It can use groupings [...] and [^...]. + For example, [AEIOU]Y means that the last letter of the word + is "y" and the penultimate letter is "a", + "e", "i", "o" or "u". + [^EY] means that the last letter is neither "e" + nor "y". + + + Ispell dictionaries support splitting compound words; a useful feature. Notice that the affix file should specify a special flag using the *** *** 2663,2668 SELECT ts_lexize('norwegian_ispell', 'sjokoladefabrikk'); --- 2735,2796 + + MySpell is very similar to Hunspell. + The .affix file of Hunspell has the following structure: + + PFX A Y 1 + PFX A 0 re . + SFX T N 4 + SFX T 0 st e + SFX T y iest [^aeiou]y + SFX T 0 est[aeiou]y + SFX T 0 est[^ey] + + + + + The first line of an affix class is the header. Fields of an affix rules are listed after the header: + + + + + parameter name (PFX or SFX) + + + + + flag (name of the affix class) + + + + + stripping characters from beginning (at prefix) or end (at suffix) of the word + + + + + adding affix + + + + + condition that has a format similar to the format of regular expressions. + + + + + + The .dict file looks like the .dict file of + Ispell: + + larder/M + lardy/RT + large/RSPMYT + largehearted + + + MySpell does not support compound words. *** a/src/backend/tsearch/spell.c --- b/src/backend/tsearch/spell.c *** *** 5,10 --- 5,54 * * Portions Copyright (c) 1996-2016, PostgreSQL Global D
Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support
Artur Zakirov wrote: > I undo the changes and the error will be raised. I will update the patch > soon. I don't think you ever did this. I'm closing it now, but it sounds useful stuff so please do resubmit for 2016-03. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support
On 09.01.2016 05:38, Alvaro Herrera wrote: Artur Zakirov wrote: Now almost all dictionaries are loaded into PostgreSQL. But the da_dk dictionary does not load. I see the following error: ERROR: invalid regular expression: quantifier operand invalid CONTEXT: line 439 of configuration file "/home/artur/progs/pgsql/share/tsearch_data/da_dk.affix": "SFX 55 0 s +GENITIV If you open the affix file in editor you can see that there is incorrect format of the affix 55 in 439 line (screen1.png): [ another email ] I also had implemented a patch that fixes an error from the e-mail http://www.postgresql.org/message-id/562e1073.8030...@postgrespro.ru This patch just ignore that error. I think it's a bad idea to just ignore these syntax errors. This affix file is effectively corrupt, after all, so it seems a bad idea that we need to cope with it. I think it would be better to raise the error normally and instruct the user to fix the file; obviously it's better if the upstream provider of the file fixes it. Now, if there is proof somewhere that the file is correct, then the code must cope in some reasonable way. But in any case I don't think this change is acceptable ... it can only cause pain, in the long run. This error is raised in Danish dictionary because of erroneous entry in the .affix file. I sent a bug-report to developer. He fixed this bug. Corrected dictionary can be downloaded from LibreOffice site. I undo the changes and the error will be raised. I will update the patch soon. -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support
Thanks for review. On 09.01.2016 02:04, Alvaro Herrera wrote: Artur Zakirov wrote: --- 74,80 typedef struct aff_struct { ! uint32 flag:16, type:1, flagflags:7, issimple:1, By doing this, you're using 40 bits of a 32-bits-wide field. What does this mean? Are the final 8 bits lost? Does the compiler allocate a second uint32 member for those additional bits? I don't know, but I don't think this is a very clean idea. No, 8 bits are not lost. This 8 bits are used if a dictionary uses double extended ASCII character flag type (Conf->flagMode == FM_LONG) or decimal number flag type (Conf->flagMode == FM_NUM). If a dictionary uses single extended ASCII character flag type (Conf->flagMode == FM_CHAR), then 8 bits lost. You can see it in decodeFlag function. This function is used in NIImportOOAffixes function, decode affix flag from string type and store in flag field (flag:16). typedef struct spell_struct { ! struct { ! int affix; ! int len; ! } d; ! /* !* flag is filled in by NIImportDictionary. After NISortDictionary, d !* is valid and flag is invalid. !*/ ! char *flag; charword[FLEXIBLE_ARRAY_MEMBER]; } SPELL; Here you removed the union, with no rationale for doing so. Why did you do it? Can it be avoided? Because of the comment, I'd imagine that d and flag are valid at different times, so at any time we care about only one of them; but you haven't updated the comment stating the reason for that no longer to be the case. I suspect you need to keep flag valid after NISortDictionary has been called, but if so why? If "flag" is invalid as the comment says, what's the reason for keeping it? Union was removed because the flag field need to be dynamically sized. It had 16 size in the previous version. In this field flag set are stored. For example, if .dict file has the entry: mitigate/NDSGny Then the "NDSGny" is stored in the flag field. But in some cases a flag set can have size bigger than 16. I added this changes after this message http://www.postgresql.org/message-id/CAE2gYzwom3=11u9g8zxmt5plkzrwb12bwzxh4db3hud89fo...@mail.gmail.com In that Turkish dictionary there are can be large flag set. For example: özek/2240,852,749,5026,2242,4455,2594,2597,4963,1608,494,2409 This flag set has 56 size. This "flag" is valid all the time. It is used in NISortDictionary and it is not used after NISortDictionary function has been called. Maybe you right and there are no reason for keeping it, and it is necessary to store all flags in separate variable, that will be deleted after NISortDictionary has been called. The routines decodeFlag and isAffixFlagInUse could do with more comments. Your patch adds zero. Actually the whole file has not nearly enough comments; adding some more would be very good. Actually, after some more reading, I think this code is pretty terrible. I have a hard time figuring out how the original works, which makes it even more difficult to figure out whether your changes make sense. I would have to take your patch on faith, which doesn't sound so great an idea. palloc / cpalloc / tmpalloc make the whole mess even more confusing. Why does this file have three ways to allocate memory? Not sure what's a good way to go about this. I am certainly not going to commit this as is, because if I do whatever bugs you have are going to become my problem; and with the severe lack of documentation and given how fiddly this stuff is, I bet there are going to be a bunch of bugs. I suspect most committers are going to be in the same position. I think you should start by adding a few comments here and there on top of the original to explain how it works, then your patch on top. I suppose it's going to be a lot of work for you but I don't see any other way. A top-level overview about it would be good, too. The current comment at top of file states: * spell.c * Normalizing word with ISpell which is, err, somewhat laconic. I will provide comments and explain how it works in comments. Maybe I will add some explanation about dictionaries structure. I will update the patch soon. -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support
Artur Zakirov wrote: > Now almost all dictionaries are loaded into PostgreSQL. But the da_dk > dictionary does not load. I see the following error: > > ERROR: invalid regular expression: quantifier operand invalid > CONTEXT: line 439 of configuration file > "/home/artur/progs/pgsql/share/tsearch_data/da_dk.affix": "SFX 55 0 s > +GENITIV > > If you open the affix file in editor you can see that there is incorrect > format of the affix 55 in 439 line (screen1.png): [ another email ] > I also had implemented a patch that fixes an error from the e-mail > http://www.postgresql.org/message-id/562e1073.8030...@postgrespro.ru > This patch just ignore that error. I think it's a bad idea to just ignore these syntax errors. This affix file is effectively corrupt, after all, so it seems a bad idea that we need to cope with it. I think it would be better to raise the error normally and instruct the user to fix the file; obviously it's better if the upstream provider of the file fixes it. Now, if there is proof somewhere that the file is correct, then the code must cope in some reasonable way. But in any case I don't think this change is acceptable ... it can only cause pain, in the long run. > *** 429,443 NIAddAffix(IspellDict *Conf, int flag, char flagflags, const > char *mask, const c > err = pg_regcomp(&(Affix->reg.regex), wmask, wmasklen, >REG_ADVANCED | REG_NOSUB, >DEFAULT_COLLATION_OID); > if (err) > ! { > ! charerrstr[100]; > ! > ! pg_regerror(err, &(Affix->reg.regex), errstr, > sizeof(errstr)); > ! ereport(ERROR, > ! > (errcode(ERRCODE_INVALID_REGULAR_EXPRESSION), > ! errmsg("invalid regular expression: > %s", errstr))); > ! } > } > > Affix->flagflags = flagflags; > --- 429,437 > err = pg_regcomp(&(Affix->reg.regex), wmask, wmasklen, >REG_ADVANCED | REG_NOSUB, >DEFAULT_COLLATION_OID); > + /* Ignore regular expression error and do not add wrong affix */ > if (err) > ! return; > } > > Affix->flagflags = flagflags; -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support
Artur Zakirov wrote: > *** 77,83 typedef struct spell_struct > > typedef struct aff_struct > { > ! uint32 flag:8, > type:1, > flagflags:7, > issimple:1, > --- 74,80 > > typedef struct aff_struct > { > ! uint32 flag:16, > type:1, > flagflags:7, > issimple:1, By doing this, you're using 40 bits of a 32-bits-wide field. What does this mean? Are the final 8 bits lost? Does the compiler allocate a second uint32 member for those additional bits? I don't know, but I don't think this is a very clean idea. > typedef struct spell_struct > { > ! union > { > ! /* > ! * flag is filled in by NIImportDictionary. After > NISortDictionary, d > ! * is valid and flag is invalid. > ! */ > ! charflag[MAXFLAGLEN]; > ! struct > ! { > ! int affix; > ! int len; > ! } d; > ! } p; > charword[FLEXIBLE_ARRAY_MEMBER]; > } SPELL; > > --- 57,72 > > typedef struct spell_struct > { > ! struct > { > ! int affix; > ! int len; > ! } d; > ! /* > ! * flag is filled in by NIImportDictionary. After NISortDictionary, d > ! * is valid and flag is invalid. > ! */ > ! char *flag; > charword[FLEXIBLE_ARRAY_MEMBER]; > } SPELL; Here you removed the union, with no rationale for doing so. Why did you do it? Can it be avoided? Because of the comment, I'd imagine that d and flag are valid at different times, so at any time we care about only one of them; but you haven't updated the comment stating the reason for that no longer to be the case. I suspect you need to keep flag valid after NISortDictionary has been called, but if so why? If "flag" is invalid as the comment says, what's the reason for keeping it? The routines decodeFlag and isAffixFlagInUse could do with more comments. Your patch adds zero. Actually the whole file has not nearly enough comments; adding some more would be very good. Actually, after some more reading, I think this code is pretty terrible. I have a hard time figuring out how the original works, which makes it even more difficult to figure out whether your changes make sense. I would have to take your patch on faith, which doesn't sound so great an idea. palloc / cpalloc / tmpalloc make the whole mess even more confusing. Why does this file have three ways to allocate memory? Not sure what's a good way to go about this. I am certainly not going to commit this as is, because if I do whatever bugs you have are going to become my problem; and with the severe lack of documentation and given how fiddly this stuff is, I bet there are going to be a bunch of bugs. I suspect most committers are going to be in the same position. I think you should start by adding a few comments here and there on top of the original to explain how it works, then your patch on top. I suppose it's going to be a lot of work for you but I don't see any other way. A top-level overview about it would be good, too. The current comment at top of file states: * spell.c * Normalizing word with ISpell which is, err, somewhat laconic. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support
On 16.11.2015 15:51, Artur Zakirov wrote: On 10.11.2015 13:23, Artur Zakirov wrote: Link to patch in commitfest: https://commitfest.postgresql.org/8/420/ Link to regression tests: https://dl.dropboxusercontent.com/u/15423817/HunspellDictTest.tar.gz I have done some changes in documentation in the section "12.6. Dictionaries". I have added some description how to load Ispell and Hunspell dictionaries and description about Ispell and Hunspell formats. Patches for the documentation and for the code are attached separately. -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company *** a/src/backend/tsearch/spell.c --- b/src/backend/tsearch/spell.c *** *** 153,159 cmpspell(const void *s1, const void *s2) static int cmpspellaffix(const void *s1, const void *s2) { ! return (strncmp((*(SPELL *const *) s1)->p.flag, (*(SPELL *const *) s2)->p.flag, MAXFLAGLEN)); } static char * --- 153,159 static int cmpspellaffix(const void *s1, const void *s2) { ! return (strncmp((*(SPELL *const *) s1)->flag, (*(SPELL *const *) s2)->flag, MAXFLAGLEN)); } static char * *** *** 237,242 cmpaffix(const void *s1, const void *s2) --- 237,309 (const unsigned char *) a2->repl); } + static unsigned short + decodeFlag(IspellDict *Conf, char *sflag, char **sflagnext) + { + unsigned short s; + char *next; + + switch (Conf->flagMode) + { + case FM_LONG: + s = (int)sflag[0] << 8 | (int)sflag[1]; + if (sflagnext) + *sflagnext = sflag + 2; + break; + case FM_NUM: + s = (unsigned short) strtol(sflag, &next, 10); + if (sflagnext) + { + if (next) + { + *sflagnext = next; + while (**sflagnext) + { + if (**sflagnext == ',') + { + *sflagnext = *sflagnext + 1; + break; + } + *sflagnext = *sflagnext + 1; + } + } + else + *sflagnext = 0; + } + break; + default: + s = (unsigned short) *((unsigned char *)sflag); + if (sflagnext) + *sflagnext = sflag + 1; + } + + return s; + } + + static bool + isAffixFlagInUse(IspellDict *Conf, int affix, unsigned short affixflag) + { + char *flagcur; + char *flagnext = 0; + + if (affixflag == 0) + return true; + + flagcur = Conf->AffixData[affix]; + + while (*flagcur) + { + if (decodeFlag(Conf, flagcur, &flagnext) == affixflag) + return true; + if (flagnext) + flagcur = flagnext; + else + break; + } + + return false; + } + static void NIAddSpell(IspellDict *Conf, const char *word, const char *flag) { *** *** 255,261 NIAddSpell(IspellDict *Conf, const char *word, const char *flag) } Conf->Spell[Conf->nspell] = (SPELL *) tmpalloc(SPELLHDRSZ + strlen(word) + 1); strcpy(Conf->Spell[Conf->nspell]->word, word); ! strlcpy(Conf->Spell[Conf->nspell]->p.flag, flag, MAXFLAGLEN); Conf->nspell++; } --- 322,328 } Conf->Spell[Conf->nspell] = (SPELL *) tmpalloc(SPELLHDRSZ + strlen(word) + 1); strcpy(Conf->Spell[Conf->nspell]->word, word); ! Conf->Spell[Conf->nspell]->flag = (*flag != '\0') ? cpstrdup(Conf, flag) : VoidString; Conf->nspell++; } *** *** 355,361 FindWord(IspellDict *Conf, const char *word, int affixflag, int flag) else if ((flag & StopMiddle->compoundflag) == 0) return 0; ! if ((affixflag == 0) || (strchr(Conf->AffixData[StopMiddle->affix], affixflag) != NULL)) return 1; } node = StopMiddle->node; --- 422,428 else if ((flag & StopMiddle->compoundflag) == 0) return 0; ! if (isAffixFlagInUse(Conf, StopMiddle->affix, affixflag)) return 1; } node = StopMiddle->node; *** *** 394,400 NIAddAffix(IspellDict *Conf, int flag, char flagflags, const char *mask, const c Affix = Conf->Affix + Conf->naffixes; ! if (strcmp(mask, ".") == 0) { Affix->issimple = 1; Affix->isregis = 0; --- 461,467 Affix = Conf->Affix + Conf->naffixes; ! if (strcmp(mask, ".") == 0 || *mask == '\0') { Affix->issimple = 1; Affix->isregis = 0; *** *** 429,443 NIAddAffix(IspellDict *Conf, int flag, char flagflags, const char *mask, const c err = pg_regcomp(&(Affix->reg.regex), wmask, wmasklen, REG_ADVANCED | REG_NOSUB, DEFAULT_COLLATION_OID); if (err) ! { ! char errstr[100]; ! ! pg_regerror(err, &(Affix->reg.regex), errstr, sizeof(errstr)); ! ereport(ERROR, ! (errcode(ERRCODE_INVALID_REGULAR_EXPRESSION), ! errmsg("invalid regular expression: %s", errstr))); ! } } Affix->flagflags = flagflags; --- 496,504 err = pg_regcomp(&(Affix->reg.regex), wmask, wmasklen, REG_ADVANCED | REG_NOSUB, DEFAULT_COLLATION_OID); + /* Ignore regular expression error and do not add wrong affix */ if (err) ! return; } Affix->flagflags = flagflags
Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support
On 10.11.2015 13:23, Artur Zakirov wrote: Link to patch in commitfest: https://commitfest.postgresql.org/8/420/ Link to regression tests: https://dl.dropboxusercontent.com/u/15423817/HunspellDictTest.tar.gz Hello! Do you have any remarks or comments about my patch? -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support
08.11.2015 14:23, Artur Zakirov пишет: Thank you for reply. This was because of the flag field size of the SPELL struct. And long flags were being trancated in the .dict file. I attached new patch. It is temporary patch, not final. It can be done better. I have updated the patch and attached it. Now dynamic memory allocation is used to the flag field of the SPELL struct. I have valued time of a dictionary loading and memory using by a dictionary in the new patch. Dictionary is loaded at the first reference to it. For example, if we execute ts_lexize function. And first ts_lexize executing takes more time than second. The following table shows performance of some dictionaries before patch and after in my computer. - | | loading time, ms | memory, MB | | | before | after | before | after | - |ar| 700 | 300 | 23,7| 15,7| |br_fr | 410 | 450 | 27,4| 27,5| |ca| 248 | 245 | 14,7| 15,4| |en_us | 100 | 100 | 5,4 | 6,2 | |fr| 160 | 178 | 13,7| 14,1| |gl_es | 160 | 150 | 9 | 9,4 | |is| 260 | 202 | 16,1| 16,3| - As you can see, substantially loading time and memory using before and after the patch are same. Link to patch in commitfest: https://commitfest.postgresql.org/8/420/ Link to regression tests: https://dl.dropboxusercontent.com/u/15423817/HunspellDictTest.tar.gz -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company *** a/src/backend/tsearch/spell.c --- b/src/backend/tsearch/spell.c *** *** 153,159 cmpspell(const void *s1, const void *s2) static int cmpspellaffix(const void *s1, const void *s2) { ! return (strncmp((*(SPELL *const *) s1)->p.flag, (*(SPELL *const *) s2)->p.flag, MAXFLAGLEN)); } static char * --- 153,159 static int cmpspellaffix(const void *s1, const void *s2) { ! return (strncmp((*(SPELL *const *) s1)->flag, (*(SPELL *const *) s2)->flag, MAXFLAGLEN)); } static char * *** *** 237,242 cmpaffix(const void *s1, const void *s2) --- 237,309 (const unsigned char *) a2->repl); } + static unsigned short + decodeFlag(IspellDict *Conf, char *sflag, char **sflagnext) + { + unsigned short s; + char *next; + + switch (Conf->flagMode) + { + case FM_LONG: + s = (int)sflag[0] << 8 | (int)sflag[1]; + if (sflagnext) + *sflagnext = sflag + 2; + break; + case FM_NUM: + s = (unsigned short) strtol(sflag, &next, 10); + if (sflagnext) + { + if (next) + { + *sflagnext = next; + while (**sflagnext) + { + if (**sflagnext == ',') + { + *sflagnext = *sflagnext + 1; + break; + } + *sflagnext = *sflagnext + 1; + } + } + else + *sflagnext = 0; + } + break; + default: + s = (unsigned short) *((unsigned char *)sflag); + if (sflagnext) + *sflagnext = sflag + 1; + } + + return s; + } + + static bool + isAffixFlagInUse(IspellDict *Conf, int affix, unsigned short affixflag) + { + char *flagcur; + char *flagnext = 0; + + if (affixflag == 0) + return true; + + flagcur = Conf->AffixData[affix]; + + while (*flagcur) + { + if (decodeFlag(Conf, flagcur, &flagnext) == affixflag) + return true; + if (flagnext) + flagcur = flagnext; + else + break; + } + + return false; + } + static void NIAddSpell(IspellDict *Conf, const char *word, const char *flag) { *** *** 255,261 NIAddSpell(IspellDict *Conf, const char *word, const char *flag) } Conf->Spell[Conf->nspell] = (SPELL *) tmpalloc(SPELLHDRSZ + strlen(word) + 1); strcpy(Conf->Spell[Conf->nspell]->word, word); ! strlcpy(Conf->Spell[Conf->nspell]->p.flag, flag, MAXFLAGLEN); Conf->nspell++; } --- 322,328 } Conf->Spell[Conf->nspell] = (SPELL *) tmpalloc(SPELLHDRSZ + strlen(word) + 1); strcpy(Conf->Spell[Conf->nspell]->word, word); ! Conf->Spell[Conf->nspell]->flag = (*flag != '\0') ? cpstrdup(Conf, flag) : VoidString; Conf->nspell++; } *** *** 355,361 FindWord(IspellDict *Conf, const char *word, int affixflag, int flag) else if ((flag & StopMiddle->compoundflag) == 0) return 0; ! if ((affixflag == 0) || (strchr(Conf->AffixData[StopMiddle->affix], affixflag) != NULL)) return 1; } node = StopMiddle->node; --- 422,428 else if ((flag & StopMiddle->compoundflag) == 0) return 0; ! if (isAffixFlagInUse(Conf, StopMiddle->affix, affixflag)) return 1; } node = StopMiddle->node; *** *** 394,400 NIAddAffix(IspellDict *Conf, int flag, char flagflags, const char *mask, const c Affix = Conf->Affix + Conf->naffix
Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support
On 07.11.2015 17:20, Emre Hasegeli wrote: It seems to have something to do with the order of the affixes. It works, if I move affix 2646 to the beginning of the list. [1] https://tr-spell.googlecode.com/files/dict_aff_5000_suffix_113_words.zip Thank you for reply. This was because of the flag field size of the SPELL struct. And long flags were being trancated in the .dict file. I attached new patch. It is temporary patch, not final. It can be done better. -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company *** a/src/backend/tsearch/spell.c --- b/src/backend/tsearch/spell.c *** *** 153,159 cmpspell(const void *s1, const void *s2) static int cmpspellaffix(const void *s1, const void *s2) { ! return (strncmp((*(SPELL *const *) s1)->p.flag, (*(SPELL *const *) s2)->p.flag, MAXFLAGLEN)); } static char * --- 153,159 static int cmpspellaffix(const void *s1, const void *s2) { ! return (strcmp((*(SPELL *const *) s1)->flag, (*(SPELL *const *) s2)->flag)); } static char * *** *** 237,242 cmpaffix(const void *s1, const void *s2) --- 237,309 (const unsigned char *) a2->repl); } + static unsigned short + decodeFlag(IspellDict *Conf, char *sflag, char **sflagnext) + { + unsigned short s; + char *next; + + switch (Conf->flagMode) + { + case FM_LONG: + s = (int)sflag[0] << 8 | (int)sflag[1]; + if (sflagnext) + *sflagnext = sflag + 2; + break; + case FM_NUM: + s = (unsigned short) strtol(sflag, &next, 10); + if (sflagnext) + { + if (next) + { + *sflagnext = next; + while (**sflagnext) + { + if (**sflagnext == ',') + { + *sflagnext = *sflagnext + 1; + break; + } + *sflagnext = *sflagnext + 1; + } + } + else + *sflagnext = 0; + } + break; + default: + s = (unsigned short) *((unsigned char *)sflag); + if (sflagnext) + *sflagnext = sflag + 1; + } + + return s; + } + + static bool + isAffixFlagInUse(IspellDict *Conf, int affix, unsigned short affixflag) + { + char *flagcur; + char *flagnext = 0; + + if (affixflag == 0) + return true; + + flagcur = Conf->AffixData[affix]; + + while (*flagcur) + { + if (decodeFlag(Conf, flagcur, &flagnext) == affixflag) + return true; + if (flagnext) + flagcur = flagnext; + else + break; + } + + return false; + } + static void NIAddSpell(IspellDict *Conf, const char *word, const char *flag) { *** *** 255,261 NIAddSpell(IspellDict *Conf, const char *word, const char *flag) } Conf->Spell[Conf->nspell] = (SPELL *) tmpalloc(SPELLHDRSZ + strlen(word) + 1); strcpy(Conf->Spell[Conf->nspell]->word, word); ! strlcpy(Conf->Spell[Conf->nspell]->p.flag, flag, MAXFLAGLEN); Conf->nspell++; } --- 322,328 } Conf->Spell[Conf->nspell] = (SPELL *) tmpalloc(SPELLHDRSZ + strlen(word) + 1); strcpy(Conf->Spell[Conf->nspell]->word, word); ! Conf->Spell[Conf->nspell]->flag = cpstrdup(Conf, flag); Conf->nspell++; } *** *** 355,361 FindWord(IspellDict *Conf, const char *word, int affixflag, int flag) else if ((flag & StopMiddle->compoundflag) == 0) return 0; ! if ((affixflag == 0) || (strchr(Conf->AffixData[StopMiddle->affix], affixflag) != NULL)) return 1; } node = StopMiddle->node; --- 422,428 else if ((flag & StopMiddle->compoundflag) == 0) return 0; ! if (isAffixFlagInUse(Conf, StopMiddle->affix, affixflag)) return 1; } node = StopMiddle->node; *** *** 394,400 NIAddAffix(IspellDict *Conf, int flag, char flagflags, const char *mask, const c Affix = Conf->Affix + Conf->naffixes; ! if (strcmp(mask, ".") == 0) { Affix->issimple = 1; Affix->isregis = 0; --- 461,467 Affix = Conf->Affix + Conf->naffixes; ! if (strcmp(mask, ".") == 0 || *mask == '\0') { Affix->issimple = 1; Affix->isregis = 0; *** *** 429,443 NIAddAffix(IspellDict *Conf, int flag, char flagflags, const char *mask, const c err = pg_regcomp(&(Affix->reg.regex), wmask, wmasklen, REG_ADVANCED | REG_NOSUB, DEFAULT_COLLATION_OID); if (err) ! { ! char errstr[100]; ! ! pg_regerror(err, &(Affix->reg.regex), errstr, sizeof(errstr)); ! ereport(ERROR, ! (errcode(ERRCODE_INVALID_REGULAR_EXPRESSION), ! errmsg("invalid regular expression: %s", errstr))); ! } } Affix->flagflags = flagflags; --- 496,504 err = pg_regcomp(&(Affix->reg.regex), wmask, wmasklen, REG_ADVANCED | REG_NOSUB, DEFAULT_COLLATION_OID); + /* Ignore regular expression error and do not add wrong affix */ if (err) ! return; } Affix->flagflags = flagflags; *** *** 595,604 addFlagValue(IspellDict *Conf, char *s, uint32 val) (errcode(ERR
Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support
Thank you for working on this. I tried the patch with a Turkish dictionary [1] I could find on the Internet. It worked for some words, but not others: > hasegeli=# create text search dictionary hunspell_tr (template = ispell, > dictfile = tr, afffile = tr); > CREATE TEXT SEARCH DICTIONARY > > hasegeli=# select ts_lexize('hunspell_tr', 'tilki'); -- The root "fox" > --- > {tilki} > (1 row) > > hasegeli=# select ts_lexize('hunspell_tr', 'tilkinin'); -- Genitive form, > affix 3290 > ts_lexize > --- > {tilki} > (1 row) > > hasegeli=# select ts_lexize('hunspell_tr', 'tilkiler'); -- Plural form, affix > 4371 > ts_lexize > --- > {tilki} > (1 row) > > hasegeli=# select ts_lexize('hunspell_tr', 'tilkiyi'); -- Accusative form, > affix 2646 > ts_lexize > --- > > (1 row) It seems to have something to do with the order of the affixes. It works, if I move affix 2646 to the beginning of the list. [1] https://tr-spell.googlecode.com/files/dict_aff_5000_suffix_113_words.zip -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support
06.11.2015 12:33, Artur Zakirov пишет: Hello again! Patches === Link to commitfest: https://commitfest.postgresql.org/8/420/ -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support
Hello again! Patches === I had implemented support for FLAG long, FLAG num and AF parameters. I attached patch to the e-mail (hunspell-dict.patch). This patch allow to use Hunspell dictionaries listed in the previous e-mail: ar, br_fr, ca, ca_valencia, en_ca, en_gb, en_us, fr, gl_es, hu_hu, is, ne_np, nl_nl, si_lk. The most part of changes was in spell.c in the affix file parsing code. The following are dictionary structures changes: - useFlagAliases and flagMode fields had been added to the IspellDict struct; - flagval array size had been increased from 256 to 65000; - flag field of the AFFIX struct also had been increased. I also had implemented a patch that fixes an error from the e-mail http://www.postgresql.org/message-id/562e1073.8030...@postgrespro.ru This patch just ignore that error. Tests = Extention test dictionaries for loading into PostgreSQL and for normalizing with ts_lexize function can be downloaded from https://dl.dropboxusercontent.com/u/15423817/HunspellDictTest.tar.gz It would be nice if somebody can do additional tests of dictionaries of well known languages. Because I do not know many of them. Other Improvements == There are also some parameters for compound words. But I am not sure that we want use this parameters. -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company *** a/src/backend/tsearch/spell.c --- b/src/backend/tsearch/spell.c *** *** 237,242 cmpaffix(const void *s1, const void *s2) --- 237,309 (const unsigned char *) a2->repl); } + static unsigned short + decodeFlag(IspellDict *Conf, char *sflag, char **sflagnext) + { + unsigned short s; + char *next; + + switch (Conf->flagMode) + { + case FM_LONG: + s = (int)sflag[0] << 8 | (int)sflag[1]; + if (sflagnext) + *sflagnext = sflag + 2; + break; + case FM_NUM: + s = (unsigned short) strtol(sflag, &next, 10); + if (sflagnext) + { + if (next) + { + *sflagnext = next; + while (**sflagnext) + { + if (**sflagnext == ',') + { + *sflagnext = *sflagnext + 1; + break; + } + *sflagnext = *sflagnext + 1; + } + } + else + *sflagnext = 0; + } + break; + default: + s = (unsigned short) *((unsigned char *)sflag); + if (sflagnext) + *sflagnext = sflag + 1; + } + + return s; + } + + static bool + isAffixFlagInUse(IspellDict *Conf, int affix, unsigned short affixflag) + { + char *flagcur; + char *flagnext = 0; + + if (affixflag == 0) + return true; + + flagcur = Conf->AffixData[affix]; + + while (*flagcur) + { + if (decodeFlag(Conf, flagcur, &flagnext) == affixflag) + return true; + if (flagnext) + flagcur = flagnext; + else + break; + } + + return false; + } + static void NIAddSpell(IspellDict *Conf, const char *word, const char *flag) { *** *** 355,361 FindWord(IspellDict *Conf, const char *word, int affixflag, int flag) else if ((flag & StopMiddle->compoundflag) == 0) return 0; ! if ((affixflag == 0) || (strchr(Conf->AffixData[StopMiddle->affix], affixflag) != NULL)) return 1; } node = StopMiddle->node; --- 422,428 else if ((flag & StopMiddle->compoundflag) == 0) return 0; ! if (isAffixFlagInUse(Conf, StopMiddle->affix, affixflag)) return 1; } node = StopMiddle->node; *** *** 394,400 NIAddAffix(IspellDict *Conf, int flag, char flagflags, const char *mask, const c Affix = Conf->Affix + Conf->naffixes; ! if (strcmp(mask, ".") == 0) { Affix->issimple = 1; Affix->isregis = 0; --- 461,467 Affix = Conf->Affix + Conf->naffixes; ! if (strcmp(mask, ".") == 0 || *mask == '\0') { Affix->issimple = 1; Affix->isregis = 0; *** *** 595,604 addFlagValue(IspellDict *Conf, char *s, uint32 val) (errcode(ERRCODE_CONFIG_FILE_ERROR), errmsg("multibyte flag character is not allowed"))); ! Conf->flagval[*(unsigned char *) s] = (unsigned char) val; Conf->usecompound = true; } /* * Import an affix file that follows MySpell or Hunspell format */ --- 662,719 (errcode(ERRCODE_CONFIG_FILE_ERROR), errmsg("multibyte flag character is not allowed"))); ! Conf->flagval[decodeFlag(Conf, s, (char **)NULL)] = (unsigned char) val; Conf->usecompound = true; } + static int + getFlagValues(IspellDict *Conf, char *s) + { + uint32 flag = 0; + char *flagcur; + char *flagnext = 0; + + flagcur = s; + while (*flagcur) + { + flag |= Conf->flagval[decodeFlag(Conf, flagcur, &flagnext)]; + if (flagnext) + flagcur = flagnext; + else + break; + } + + return flag; + } + + /* + * Get flag set from "s". + * + * Returns flag set from AffixData array if AF parameter used (useFlagAliases is true). + * In this case "s" is alias for flag
Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support
20.10.2015 17:00, Artur Zakirov пишет: These flag types are used in affix files of such dictionaries as ar, br_fr, ca, ca_valencia, da_dk, en_ca, en_gb, en_us, fr, gl_es, is, ne_np, nl_nl, si_lk (from http://cgit.freedesktop.org/libreoffice/dictionaries/tree/). Now almost all dictionaries are loaded into PostgreSQL. But the da_dk dictionary does not load. I see the following error: ERROR: invalid regular expression: quantifier operand invalid CONTEXT: line 439 of configuration file "/home/artur/progs/pgsql/share/tsearch_data/da_dk.affix": "SFX 55 0 s +GENITIV If you open the affix file in editor you can see that there is incorrect format of the affix 55 in 439 line (screen1.png): SFX 55 0 s +GENITIV SFX parameter should have a 5 fields. There are no field between "0" digit and "s" symbol. "+GENITIV" is the optional morphological field and ignored by PostgreSQL. I think that it is a error of the affix file. I wrote a e-mail to i...@stavekontrolden.dk to the dictionary authors about this error. What is the right decision in this case? Should PostgreSQL ignore this error and do not show it? -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support
21.10.2015 01:37, Jim Nasby пишет: On 10/20/15 9:00 AM, Artur Zakirov wrote: Internal representation of the dictionary in the PostgreSQL doesn't impose too strict limits on the number of affix rules. There are a flagval array, which size must be increased from 256 to 65000. Is that per dictionary entry, fixed at 64k? That seems pretty excessive, if that's the case... This is per dictionary only. flagval array is used for the all dictionary. And it is not used for every dictionary word. There are also flag field of AFFIX structure, wich size must be increased from 8 bit to 16 bit. This structure is used for every affix in affix file. -- Artur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PROPOSAL] Improvements of Hunspell dictionaries support
On 10/20/15 9:00 AM, Artur Zakirov wrote: Internal representation of the dictionary in the PostgreSQL doesn't impose too strict limits on the number of affix rules. There are a flagval array, which size must be increased from 256 to 65000. Is that per dictionary entry, fixed at 64k? That seems pretty excessive, if that's the case... -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers