Re: daitch_mokotoff module
Dag Lem writes: > I sincerely hope this resolves any blocking issues with copyright / > legalese / credits. > Can this now be considered ready for commiter, so that Paul or someone else can flip the bit? Best regards Dag Lem
Re: daitch_mokotoff module
Dag Lem writes: > Tomas Vondra writes: > >> On 2/7/23 18:08, Paul Ramsey wrote: >>> >>> >>>> On Feb 7, 2023, at 6:47 AM, Dag Lem wrote: >>>> >>>> I just went by to check the status of the patch, and I noticed that >>>> you've added yourself as reviewer earlier - great! >>>> >>>> Please tell me if there is anything I can do to help bring this across >>>> the finish line. >>> >>> Honestly, I had set it to Ready for Committer, but then I went to >>> run regression one more time and my regression blew up. I found I >>> couldn't enable the UTF tests without things failing. And I don't >>> blame you! I think my installation is probably out-of-alignment in >>> some way, but I didn't want to flip the Ready flag without having >>> run everything through to completion, so I flipped it back. Also, >>> are the UTF tests enabled by default? It wasn't clear to me that >>> they were? >>> >> The utf8 tests are enabled depending on the encoding returned by >> getdatabaseencoding(). Systems with other encodings will simply use the >> alternate .out file. And it works perfectly fine for me. >> >> IMHO it's ready for committer. >> >> >> regards > > Yes, the UTF-8 tests follow the current best practice as has been > explained to me earlier. The following patch exemplifies this: > > https://github.com/postgres/postgres/commit/c2e8bd27519f47ff56987b30eb34a01969b9a9e8 > > Can you please have a look at this again? Best regards, Dag Lem
Re: daitch_mokotoff module
I sincerely hope this resolves any blocking issues with copyright / legalese / credits. Best regards Dag Lem diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile index 0704894f88..12baf2d884 100644 --- a/contrib/fuzzystrmatch/Makefile +++ b/contrib/fuzzystrmatch/Makefile @@ -3,14 +3,15 @@ MODULE_big = fuzzystrmatch OBJS = \ $(WIN32RES) \ + daitch_mokotoff.o \ dmetaphone.o \ fuzzystrmatch.o EXTENSION = fuzzystrmatch -DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.0--1.1.sql +DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.1--1.2.sql fuzzystrmatch--1.0--1.1.sql PGFILEDESC = "fuzzystrmatch - similarities and distance between strings" -REGRESS = fuzzystrmatch +REGRESS = fuzzystrmatch fuzzystrmatch_utf8 ifdef USE_PGXS PG_CONFIG = pg_config @@ -22,3 +23,14 @@ top_builddir = ../.. include $(top_builddir)/src/Makefile.global include $(top_srcdir)/contrib/contrib-global.mk endif + +# Force this dependency to be known even without dependency info built: +daitch_mokotoff.o: daitch_mokotoff.h + +daitch_mokotoff.h: daitch_mokotoff_header.pl + perl $< $@ + +distprep: daitch_mokotoff.h + +maintainer-clean: + rm -f daitch_mokotoff.h diff --git a/contrib/fuzzystrmatch/daitch_mokotoff.c b/contrib/fuzzystrmatch/daitch_mokotoff.c new file mode 100644 index 00..9215ab6530 --- /dev/null +++ b/contrib/fuzzystrmatch/daitch_mokotoff.c @@ -0,0 +1,567 @@ +/* + * Daitch-Mokotoff Soundex + * + * Copyright (c) 2023, PostgreSQL Global Development Group + * + * This module was originally sponsored by Finance Norway / + * Trafikkforsikringsforeningen, and implemented by Dag Lem + * + * The implementation of the Daitch-Mokotoff Soundex System aims at correctness + * and high performance, and can be summarized as follows: + * + * - The processing of each phoneme is initiated by an O(1) table lookup. + * - For phonemes containing more than one character, a coding tree is traversed + * to process the complete phoneme. + * - The (alternate) soundex codes are produced digit by digit in-place in + * another tree structure. + * + * References: + * + * https://www.avotaynu.com/soundex.htm + * https://www.jewishgen.org/InfoFiles/Soundex.html + * https://familypedia.fandom.com/wiki/Daitch-Mokotoff_Soundex + * https://stevemorse.org/census/soundex.html (dmlat.php, dmsoundex.php) + * https://github.com/apache/commons-codec/ (dmrules.txt, DaitchMokotoffSoundex.java) + * https://metacpan.org/pod/Text::Phonetic (DaitchMokotoff.pm) + * + * A few notes on other implementations: + * + * - All other known implementations have the same unofficial rules for "UE", + * these are also adapted by this implementation (0, 1, NC). + * - The only other known implementation which is capable of generating all + * correct soundex codes in all cases is the JOS Soundex Calculator at + * https://www.jewishgen.org/jos/jossound.htm + * - "J" is considered (only) a vowel in dmlat.php + * - The official rules for "RS" are commented out in dmlat.php + * - Identical code digits for adjacent letters are not collapsed correctly in + * dmsoundex.php when double digit codes are involved. E.g. "BESST" yields + * 744300 instead of 743000 as for "BEST". + * - "J" is considered (only) a consonant in DaitchMokotoffSoundex.java + * - "Y" is not considered a vowel in DaitchMokotoffSoundex.java +*/ + +#include "postgres.h" + +#include "catalog/pg_type.h" +#include "mb/pg_wchar.h" +#include "utils/array.h" +#include "utils/builtins.h" +#include "utils/memutils.h" + + +/* + * The soundex coding chart table is adapted from + * https://www.jewishgen.org/InfoFiles/Soundex.html + * See daitch_mokotoff_header.pl for details. +*/ + +/* Generated coding chart table */ +#include "daitch_mokotoff.h" + +#define DM_CODE_DIGITS 6 + +/* Node in soundex code tree */ +struct dm_node +{ + int soundex_length; /* Length of generated soundex code */ + char soundex[DM_CODE_DIGITS]; /* Soundex code */ + int is_leaf; /* Candidate for complete soundex code */ + int last_update; /* Letter number for last update of node */ + char code_digit; /* Last code digit, 0 - 9 */ + + /* + * One or two alternate code digits leading to this node. If there are two + * digits, one of them is always an 'X'. Repeated code digits and 'X' lead + * back to the same node. + */ + char prev_code_digits[2]; + /* One or two alternate code digits moving forward. */ + char next_code_digits[2]; + /* ORed together code index(es) used to reach current node. */ + int prev_code_index; + int next_code_index; + /* Possible nodes branching out from this node - digits 0-9. */ + struct dm_node *children[10]; + /* Next node in linked list. Alternating index for each iteration. */ + struct dm_node *next[2]; +}; + +typedef struct dm_node dm_node; + + +/* Internal C implementation */ +static int daitch_mokotoff_c
Re: daitch_mokotoff module
Tomas Vondra writes: > On 2/8/23 15:31, Dag Lem wrote: >> Alvaro Herrera writes: >> >>> On 2023-Jan-17, Dag Lem wrote: >>> >>>> + * Daitch-Mokotoff Soundex >>>> + * >>>> + * Copyright (c) 2021 Finance Norway >>>> + * Author: Dag Lem >>> >>> Hmm, I don't think we accept copyright lines that aren't "PostgreSQL >>> Global Development Group". Is it okay to use that, and update the year >>> to 2023? (Note that answering "no" very likely means your patch is not >>> candidate for inclusion.) Also, we tend not to have "Author:" lines. >>> >> >> You'll have to forgive me for not knowing about this rule: >> >> grep -ER "Copyright.*[0-9]{4}" contrib/ | grep -v PostgreSQL >> >> In any case, I have checked with the copyright owner, and it would be OK >> to assign the copyright to "PostgreSQL Global Development Group". >> > > I'm not entirely sure what's the rule either, and I'm a committer. My > guess is these cases are either old and/or adding a code that already > existed elsewhere (like some of the double metaphone, for example), or > maybe both. But I'd bet we'd prefer not adding more ... > >> To avoid going back and forth with patches, how do you propose that the >> sponsor and the author of the contributed module should be credited? >> Woule something like this be acceptable? >> > > We generally credit contributors in two ways - by mentioning them in the > commit message, and by listing them in the release notes (for individual > features). > I'll ask again, would the proposed credits be acceptable? In this case, the code already existed elsewhere (as in your example for double metaphone) as a separate extension. The copyright owner is OK with copyright assignment, however I find it quite unreasonable that proper credits should not be given. Neither commit messages nor release notes follow the contributed module, which is in its entirety contributed by an external entity. I'll also point out that in addition to credits in code all over the place, PostgreSQL has much more prominent credits in the documentation: grep -ER "Author" doc/ | grep -v PostgreSQL "Author" is even documented as a top level section in the Reference Pages as "Author (only used in the contrib section)", see https://www.postgresql.org/docs/15/docguide-style.html#id-1.11.11.8.2 If there really exists some new rule which says that for new contributions under contrib/, credits should not be allowed in any way in either code or documentation (IANAL, but AFAIU this would be in conflict with laws on author's moral rights in several countries), then one would reasonably expect that you'd be upfront about this, both in documentation, and also as the very first thing when a contribution is first proposed for inclusion. Best regards Dag Lem
Re: daitch_mokotoff module
Alvaro Herrera writes: > On 2023-Jan-17, Dag Lem wrote: > >> + * Daitch-Mokotoff Soundex >> + * >> + * Copyright (c) 2021 Finance Norway >> + * Author: Dag Lem > > Hmm, I don't think we accept copyright lines that aren't "PostgreSQL > Global Development Group". Is it okay to use that, and update the year > to 2023? (Note that answering "no" very likely means your patch is not > candidate for inclusion.) Also, we tend not to have "Author:" lines. > You'll have to forgive me for not knowing about this rule: grep -ER "Copyright.*[0-9]{4}" contrib/ | grep -v PostgreSQL In any case, I have checked with the copyright owner, and it would be OK to assign the copyright to "PostgreSQL Global Development Group". To avoid going back and forth with patches, how do you propose that the sponsor and the author of the contributed module should be credited? Woule something like this be acceptable? /* * Daitch-Mokotoff Soundex * * Copyright (c) 2023, PostgreSQL Global Development Group * * This module was sponsored by Finance Norway / Trafikkforsikringsforeningen * and implemented by Dag Lem * ... [...] > > We don't keep a separate copyright statement in the file; rather we > assume that all files are under the PostgreSQL license, which is in the > COPYRIGHT file at the top of the tree. Changing it thus has the side > effect that these disclaim notes refer to the University of California > rather than "the Author". IANAL. OK, no problem. Note that you will again find counterexamples under contrib/ (and in some other places): grep -R "Permission to use" . > I think we should add SPDX markers to all the files we distribute: > /* SPDX-License-Identifier: PostgreSQL */ > > https://spdx.dev/ids/ > https://spdx.org/licenses/PostgreSQL.html As far as I can tell, this is not included in any file so far, and is thus better left to decide and implement by someone else. Best regards Dag Lem
Re: daitch_mokotoff module
Tomas Vondra writes: > On 2/7/23 18:08, Paul Ramsey wrote: >> >> >>> On Feb 7, 2023, at 6:47 AM, Dag Lem wrote: >>> >>> I just went by to check the status of the patch, and I noticed that >>> you've added yourself as reviewer earlier - great! >>> >>> Please tell me if there is anything I can do to help bring this across >>> the finish line. >> >> Honestly, I had set it to Ready for Committer, but then I went to >> run regression one more time and my regression blew up. I found I >> couldn't enable the UTF tests without things failing. And I don't >> blame you! I think my installation is probably out-of-alignment in >> some way, but I didn't want to flip the Ready flag without having >> run everything through to completion, so I flipped it back. Also, >> are the UTF tests enabled by default? It wasn't clear to me that >> they were? >> > The utf8 tests are enabled depending on the encoding returned by > getdatabaseencoding(). Systems with other encodings will simply use the > alternate .out file. And it works perfectly fine for me. > > IMHO it's ready for committer. > > > regards Yes, the UTF-8 tests follow the current best practice as has been explained to me earlier. The following patch exemplifies this: https://github.com/postgres/postgres/commit/c2e8bd27519f47ff56987b30eb34a01969b9a9e8 Best regards, Dag Lem
Re: daitch_mokotoff module
Hi Paul, I just went by to check the status of the patch, and I noticed that you've added yourself as reviewer earlier - great! Please tell me if there is anything I can do to help bring this across the finish line. Best regards, Dag Lem
Re: daitch_mokotoff module
Alvaro Herrera writes: > On 2023-Jan-05, Dag Lem wrote: > >> Is there anything else I should do here, to avoid the status being >> incorrectly stuck at "Waiting for Author" again. > > Just mark it Needs Review for now. I'll be back from vacation on Jan > 11th and can have a look then (or somebody else can, perhaps.) Paul Ramsey had a few comments in the mean time, and based on this I have produced (yet another) patch, with improved documentation. However it's still not marked as "Ready for Committer" - can you please take a look again? Best regards Dag Lem
Re: daitch_mokotoff module
Paul Ramsey writes: >> On Jan 12, 2023, at 7:30 AM, Dag Lem wrote: >> [...] >> >> Sure, I can do that. You don't think this much example text will be >> TL;DR? > > I can only speak for myself, but examples are the meat of > documentation learning, so as long as they come with enough > explanatory context to be legible it's worth having them, IMO. > I have updated the documentation, hopefully it is more accessible now. I also corrected documentation for the other functions in fuzzystrmatch (function name and argtype in the wrong order). Crossing fingers that someone will eventually change the status to "Ready for Committer" :-) Best regards, Dag Lem diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile index 0704894f88..12baf2d884 100644 --- a/contrib/fuzzystrmatch/Makefile +++ b/contrib/fuzzystrmatch/Makefile @@ -3,14 +3,15 @@ MODULE_big = fuzzystrmatch OBJS = \ $(WIN32RES) \ + daitch_mokotoff.o \ dmetaphone.o \ fuzzystrmatch.o EXTENSION = fuzzystrmatch -DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.0--1.1.sql +DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.1--1.2.sql fuzzystrmatch--1.0--1.1.sql PGFILEDESC = "fuzzystrmatch - similarities and distance between strings" -REGRESS = fuzzystrmatch +REGRESS = fuzzystrmatch fuzzystrmatch_utf8 ifdef USE_PGXS PG_CONFIG = pg_config @@ -22,3 +23,14 @@ top_builddir = ../.. include $(top_builddir)/src/Makefile.global include $(top_srcdir)/contrib/contrib-global.mk endif + +# Force this dependency to be known even without dependency info built: +daitch_mokotoff.o: daitch_mokotoff.h + +daitch_mokotoff.h: daitch_mokotoff_header.pl + perl $< $@ + +distprep: daitch_mokotoff.h + +maintainer-clean: + rm -f daitch_mokotoff.h diff --git a/contrib/fuzzystrmatch/daitch_mokotoff.c b/contrib/fuzzystrmatch/daitch_mokotoff.c new file mode 100644 index 00..2548903770 --- /dev/null +++ b/contrib/fuzzystrmatch/daitch_mokotoff.c @@ -0,0 +1,582 @@ +/* + * Daitch-Mokotoff Soundex + * + * Copyright (c) 2021 Finance Norway + * Author: Dag Lem + * + * This implementation of the Daitch-Mokotoff Soundex System aims at high + * performance. + * + * - The processing of each phoneme is initiated by an O(1) table lookup. + * - For phonemes containing more than one character, a coding tree is traversed + * to process the complete phoneme. + * - The (alternate) soundex codes are produced digit by digit in-place in + * another tree structure. + * + * References: + * + * https://www.avotaynu.com/soundex.htm + * https://www.jewishgen.org/InfoFiles/Soundex.html + * https://familypedia.fandom.com/wiki/Daitch-Mokotoff_Soundex + * https://stevemorse.org/census/soundex.html (dmlat.php, dmsoundex.php) + * https://github.com/apache/commons-codec/ (dmrules.txt, DaitchMokotoffSoundex.java) + * https://metacpan.org/pod/Text::Phonetic (DaitchMokotoff.pm) + * + * A few notes on other implementations: + * + * - All other known implementations have the same unofficial rules for "UE", + * these are also adapted by this implementation (0, 1, NC). + * - The only other known implementation which is capable of generating all + * correct soundex codes in all cases is the JOS Soundex Calculator at + * https://www.jewishgen.org/jos/jossound.htm + * - "J" is considered (only) a vowel in dmlat.php + * - The official rules for "RS" are commented out in dmlat.php + * - Identical code digits for adjacent letters are not collapsed correctly in + * dmsoundex.php when double digit codes are involved. E.g. "BESST" yields + * 744300 instead of 743000 as for "BEST". + * - "J" is considered (only) a consonant in DaitchMokotoffSoundex.java + * - "Y" is not considered a vowel in DaitchMokotoffSoundex.java + * + * Permission to use, copy, modify, and distribute this software and its + * documentation for any purpose, without fee, and without a written agreement + * is hereby granted, provided that the above copyright notice and this + * paragraph and the following two paragraphs appear in all copies. + * + * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR + * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING + * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS + * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. + * + * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES, + * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY + * AND FITNESS FOR A PARTICULAR PURPOSE. THE SOFTWARE PROVIDED HEREUNDER IS + * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO + * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS. +*/ + +#include "postgres.h" + +#include "catalog/pg_type.h" +#include
Re: daitch_mokotoff module
Paul Ramsey writes: > On Mon, Jan 2, 2023 at 2:03 PM Dag Lem wrote: > >> I also improved on the documentation example (using Full Text Search). >> AFAIK you can't make general queries like that using arrays, however in >> any case I must admit that text arrays seem like more natural building >> blocks than space delimited text here. > > This is a fun addition to fuzzystrmatch. I'm glad to hear it! :-) > > While it's a little late in the game, I'll just put it out there: > daitch_mokotoff() is way harder to type than soundex_dm(). Not sure > how you feel about that. I chose the name in order to follow the naming of the other functions in fuzzystrmatch, which as far as I can tell are given the name which each algorithm is known by. Personally I don't think it's worth it to deviate from the naming of the other functions just to avoid typing a few characters, and I certainly don't think daitch_mokotoff is any harder to get right than levenshtein_less_equal ;-) So, if I were to decide, I wouldn't change the name of the function. However I'm obviously not calling the shots on what goes into PostgreSQL - perhaps someone else would like to weigh in on this? > > On the documentation, I found the leap directly into the tsquery > example a bit too big. Maybe start with a very simple example, > > -- > dm=# SELECT daitch_mokotoff('Schwartzenegger'), > daitch_mokotoff('Swartzenegger'); > > daitch_mokotoff | daitch_mokotoff > -+- > {479465}| {479465} > -- > > Then transition into a more complex example that illustrates the GIN > index technique you mention in the text, but do not show: > > -- > CREATE TABLE dm_gin (source text, dm text[]); > > INSERT INTO dm_gin (source) VALUES > ('Swartzenegger'), > ('John'), > ('James'), > ('Steinman'), > ('Steinmetz'); > > UPDATE dm_gin SET dm = daitch_mokotoff(source); > > CREATE INDEX dm_gin_x ON dm_gin USING GIN (dm); > > SELECT * FROM dm_gin WHERE dm && daitch_mokotoff('Schwartzenegger'); > -- Sure, I can do that. You don't think this much example text will be TL;DR? > > And only then go into the tsearch example. Incidentally, what does the > tsearch approach provide that the simple GIN approach does not? The example shows how to do a simultaneous match on first AND last names, where the first and last names (any number of names) are stored in the same indexed column, and the order of the names in the index and the search term does not matter. If you were to use the GIN "&&" operator, you would get a match if either the first OR the last name matches. If you were to use the GIN "@>" operator, you would *not* get a match if the search term contains more soundex codes than the indexed name. E.g. this yields a correct match: SELECT soundex_tsvector('John Yamson') @@ soundex_tsquery('John Jameson'); While this yields a false positive: SELECT (daitch_mokotoff('John') || daitch_mokotoff('Yamson')) && (daitch_mokotoff('John') || daitch_mokotoff('Doe')); And this yields a false negative: SELECT (daitch_mokotoff('John') || daitch_mokotoff('Yamson')) @> (daitch_mokotoff('John') || daitch_mokotoff('Jameson')); This may explained better by simply showing the output of soundex_tsvector and soundex_tsquery: SELECT soundex_tsvector('John Yamson'); soundex_tsvector -- '16':1 '164600':3 '46':2 SELECT soundex_tsquery('John Jameson'); soundex_tsquery --- ( '16' | '46' ) & ( '164600' | '464600' ) > Ideally explain that briefly before launching into the example. With > all the custom functions and so on it's a little involved, so maybe if > there's not a huge win in using that approach drop it entirely? I believe this functionality is quite useful, and that it's actually what's called for in many situations. So, I'd rather not drop this example. > > ATB, > P > Best regards, Dag Lem
Re: daitch_mokotoff module
Alvaro Herrera writes: > On 2023-Jan-05, Dag Lem wrote: > >> Is there anything else I should do here, to avoid the status being >> incorrectly stuck at "Waiting for Author" again. > > Just mark it Needs Review for now. I'll be back from vacation on Jan > 11th and can have a look then (or somebody else can, perhaps.) OK, done. Have a nice vacation! Best regards Dag Lem
Re: daitch_mokotoff module
Is there anything else I should do here, to avoid the status being incorrectly stuck at "Waiting for Author" again. Best regards Dag Lem
Re: daitch_mokotoff module
Sorry about the latest unfinished email - don't know what key combination I managed to hit there. Alvaro Herrera writes: > Hello > > On 2022-Dec-23, Dag Lem wrote: > [...] > > So, yes, I'm proposing that we returns those as array elements and that > @> is used to match them. > Looking into the array operators I guess that to match such arrays directly one would actually use && (overlaps) rather than @> (contains), but I digress. The function is changed to return an array of soundex codes - I hope it is now to your liking :-) I also improved on the documentation example (using Full Text Search). AFAIK you can't make general queries like that using arrays, however in any case I must admit that text arrays seem like more natural building blocks than space delimited text here. [...] >> BTW Vera 79 does not match Veras 794000, because they don't sound >> the same (up to the maximum soundex code length). > > No, and maybe that's okay because they have different codes. But they > are both similar, in Daitch-Mokotoff, to Borja, which has two codes, > 79 and 794000. (Any Spanish speaker will readily tell you that > neither Vera nor Veras are similar in any way to Borja, but D-M has > chosen to say that each of them matches one of Borjas' codes. So they > *are* related, even though indirectly, and as a genealogist you *may* be > interested in getting a match for a person called Vera when looking for > relatives to a person called Veras. And, as a Spanish speaker, that > would make a lot of sense to me.) It is what it is - we can't call it Daitch-Mokotoff Soundex while implementing something else. Having said that, one can always pre- or postprocess to tweak the results. Daitch-Mokotoff Soundex is known to produce false positives, but that is in many cases not a problem. Even though it's clearly tuned for Jewish names, the soundex algorithm seems to work just fine for European names (we use it to match mostly Norwegian names). Best regards Dag Lem diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile index 0704894f88..12baf2d884 100644 --- a/contrib/fuzzystrmatch/Makefile +++ b/contrib/fuzzystrmatch/Makefile @@ -3,14 +3,15 @@ MODULE_big = fuzzystrmatch OBJS = \ $(WIN32RES) \ + daitch_mokotoff.o \ dmetaphone.o \ fuzzystrmatch.o EXTENSION = fuzzystrmatch -DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.0--1.1.sql +DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.1--1.2.sql fuzzystrmatch--1.0--1.1.sql PGFILEDESC = "fuzzystrmatch - similarities and distance between strings" -REGRESS = fuzzystrmatch +REGRESS = fuzzystrmatch fuzzystrmatch_utf8 ifdef USE_PGXS PG_CONFIG = pg_config @@ -22,3 +23,14 @@ top_builddir = ../.. include $(top_builddir)/src/Makefile.global include $(top_srcdir)/contrib/contrib-global.mk endif + +# Force this dependency to be known even without dependency info built: +daitch_mokotoff.o: daitch_mokotoff.h + +daitch_mokotoff.h: daitch_mokotoff_header.pl + perl $< $@ + +distprep: daitch_mokotoff.h + +maintainer-clean: + rm -f daitch_mokotoff.h diff --git a/contrib/fuzzystrmatch/daitch_mokotoff.c b/contrib/fuzzystrmatch/daitch_mokotoff.c new file mode 100644 index 00..2548903770 --- /dev/null +++ b/contrib/fuzzystrmatch/daitch_mokotoff.c @@ -0,0 +1,582 @@ +/* + * Daitch-Mokotoff Soundex + * + * Copyright (c) 2021 Finance Norway + * Author: Dag Lem + * + * This implementation of the Daitch-Mokotoff Soundex System aims at high + * performance. + * + * - The processing of each phoneme is initiated by an O(1) table lookup. + * - For phonemes containing more than one character, a coding tree is traversed + * to process the complete phoneme. + * - The (alternate) soundex codes are produced digit by digit in-place in + * another tree structure. + * + * References: + * + * https://www.avotaynu.com/soundex.htm + * https://www.jewishgen.org/InfoFiles/Soundex.html + * https://familypedia.fandom.com/wiki/Daitch-Mokotoff_Soundex + * https://stevemorse.org/census/soundex.html (dmlat.php, dmsoundex.php) + * https://github.com/apache/commons-codec/ (dmrules.txt, DaitchMokotoffSoundex.java) + * https://metacpan.org/pod/Text::Phonetic (DaitchMokotoff.pm) + * + * A few notes on other implementations: + * + * - All other known implementations have the same unofficial rules for "UE", + * these are also adapted by this implementation (0, 1, NC). + * - The only other known implementation which is capable of generating all + * correct soundex codes in all cases is the JOS Soundex Calculator at + * https://www.jewishgen.org/jos/jossound.htm + * - "J" is considered (only) a vowel in dmlat.php + * - The official rules for "RS" are commented out in dmlat.php + * - Identical code digits for adjacent letters are not collapsed correctly in + * dmsoundex.php when double digit codes are involved. E.g. "BESST" y
Re: daitch_mokotoff module
Alvaro Herrera writes: > Hello > > On 2022-Dec-23, Dag Lem wrote: > [...] > So, yes, I'm proposing that we returns those as array elements and that > @> is used to match them. Looking into the array operators I guess that to match such arrays directly one would actually use && (overlaps) rather than @> (contains), but I digress. The function is changed to return an array of soundex codes - I hope it is now to your liking :-) I also improved on the documentation example (using Full Text Search). AFAIK you can't make general queries like that using arrays, however in any case I must admit that text arrays seem like more natural building blocks than space delimited text here. Search to perform is the best match for Daitch-Mokotoff, however , but in any case I've changed it into return arrays now. I hope it is to your liking. > >> Daitch-Mokotoff Soundex indexes alternative sounds for the same name, >> however if I understand correctly, you want to index names by single >> sounds, linking all alike sounding names to the same soundex code. I >> fail to see how that is useful - if you want to find matches for a name, >> you simply match against all indexed names. If you only consider one >> sound, you won't find all names that match. > > Hmm, I think we're saying the same thing, but from opposite points of > view. No, I want each name to return multiple codes, but that those > multiple codes can be treated as a multiple-value array of codes, rather > than as a single string of space-separated codes. > >> In any case, as explained in the documentation, the implementation is >> intended to be a companion to Full Text Search, thus text is the natural >> representation for the soundex codes. > > Hmm, I don't agree with this point. The numbers are representations of > the strings, but they don't necessarily have to be strings themselves. > > >> BTW Vera 79 does not match Veras 794000, because they don't sound >> the same (up to the maximum soundex code length). > > No, and maybe that's okay because they have different codes. But they > are both similar, in Daitch-Mokotoff, to Borja, which has two codes, > 79 and 794000. (Any Spanish speaker will readily tell you that > neither Vera nor Veras are similar in any way to Borja, but D-M has > chosen to say that each of them matches one of Borjas' codes. So they > *are* related, even though indirectly, and as a genealogist you *may* be > interested in getting a match for a person called Vera when looking for > relatives to a person called Veras. And, as a Spanish speaker, that > would make a lot of sense to me.) > > > Now, it's true that I've chosen to use Spanish names for my silly little > experiment. Maybe this isn't terribly useful as a practical example, > because this algorithm seems to have been designed for Jew surnames and > perhaps not many (or not any) Jews had Spanish surnames. I don't know; > I'm not a Jew myself (though Noah Gordon tells the tale of a Spanish Jew > called Josep Álvarez in his book "The Winemaker", so I guess it's not > impossible). Anyway, I suspect if you repeat the experiment with names > of other origins, you'll find pretty much the same results apply there, > and that is the whole reason D-M returns multiple codes and not just > one. > > > Merry Christmas :-) -- Dag
Re: daitch_mokotoff module
Dag Lem writes: > Alvaro Herrera writes: > >> On 2022-Dec-23, Alvaro Herrera wrote: >> > > [...] > >> I tried downloading a list of surnames from here >> https://www.bibliotecadenombres.com/apellidos/apellidos-espanoles/ >> pasted that in a text file and \copy'ed it into a table. Then I ran >> this query >> >> select string_agg(a, ' ' order by a), daitch_mokotoff(a), count(*) >> from apellidos >> group by daitch_mokotoff(a) >> order by count(*) desc; >> >> so I have a first entry like this >> >> string_agg │ Balasco Balles Belasco Belles Blas Blasco Fallas Feliz >> Palos Pelaez Plaza Valles Vallez Velasco Velez Veliz Veloz Villas >> daitch_mokotoff │ 784000 >> count │ 18 >> >> but then I have a bunch of other entries with the same code 784000 as >> alternative codes, >> >> string_agg │ Velazco >> daitch_mokotoff │ 784500 784000 >> count │ 1 >> >> string_agg │ Palacio >> daitch_mokotoff │ 785000 784000 >> count │ 1 >> >> I suppose I need to group these together somehow, and it would make more >> sense to do that if the values were arrays. >> >> >> If I scroll a bit further down and choose, say, 794000 (a relatively >> popular one), then I have this >> >> string_agg │ Barraza Barrios Barros Bras Ferraz Frias Frisco Parras >> Peraza Peres Perez Porras Varas Veras >> daitch_mokotoff │ 794000 >> count │ 14 >> >> and looking for that code in the result I also get these three >> >> string_agg │ Barca Barco Parco >> daitch_mokotoff │ 795000 794000 >> count │ 3 >> >> string_agg │ Borja >> daitch_mokotoff │ 79 794000 >> count │ 1 >> >> string_agg │ Borjas >> daitch_mokotoff │ 794000 794400 >> count │ 1 >> >> and then I see that I should also search for possible matches in codes >> 795000, 79 and 794400, so that gives me >> >> string_agg │ Baria Baro Barrio Barro Berra Borra Feria Para Parra >> Perea Vera >> daitch_mokotoff │ 79 >> count │ 11 >> >> string_agg │ Barriga Borge Borrego Burgo Fraga >> daitch_mokotoff │ 795000 >> count │ 5 >> >> string_agg │ Borjas >> daitch_mokotoff │ 794000 794400 >> count │ 1 >> >> which look closely related (compare "Veras" in the first to "Vera" in >> the later set. If you ignore that pseudo-match, you're likely to miss >> possible family relationships.) >> >> >> I suppose if I were a genealogy researcher, I would be helped by having >> each of these codes behave as a separate unit, rather than me having to >> split the string into the several possible contained values. > > It seems to me like you're trying to use soundex coding for something it > was never designed for. > > As stated in my previous mail, soundex algorithms are designed to index > names on some representation of sound, so that alike sounding names with > different spellings will match, and as shown in the documentation > example, that is exactly what the implementation facilitates. > > Daitch-Mokotoff Soundex indexes alternative sounds for the same name, > however if I understand correctly, you want to index names by single > sounds, linking all alike sounding names to the same soundex code. I > fail to see how that is useful - if you want to find matches for a name, > you simply match against all indexed names. If you only consider one > sound, you won't find all names that match. > > In any case, as explained in the documentation, the implementation is > intended to be a companion to Full Text Search, thus text is the natural > representation for the soundex codes. > > BTW Vera 79 does not match Veras 794000, because they don't sound > the same (up to the maximum soundex code length). > I've been sleeping on this, and perhaps the normal use case can just as well (or better) be covered by the "@>" array operator? I originally implemented similar functionality using another soundex algorithm more than a decade ago, and either arrays couldn't be GIN indexed back then, or I simply missed it. I'll have to get back to this - now it's Christmas! Merry Christmas! Best regards, Dag Lem
Re: daitch_mokotoff module
Alvaro Herrera writes: > On 2022-Dec-23, Alvaro Herrera wrote: > [...] > I tried downloading a list of surnames from here > https://www.bibliotecadenombres.com/apellidos/apellidos-espanoles/ > pasted that in a text file and \copy'ed it into a table. Then I ran > this query > > select string_agg(a, ' ' order by a), daitch_mokotoff(a), count(*) > from apellidos > group by daitch_mokotoff(a) > order by count(*) desc; > > so I have a first entry like this > > string_agg │ Balasco Balles Belasco Belles Blas Blasco Fallas Feliz > Palos Pelaez Plaza Valles Vallez Velasco Velez Veliz Veloz Villas > daitch_mokotoff │ 784000 > count │ 18 > > but then I have a bunch of other entries with the same code 784000 as > alternative codes, > > string_agg │ Velazco > daitch_mokotoff │ 784500 784000 > count │ 1 > > string_agg │ Palacio > daitch_mokotoff │ 785000 784000 > count │ 1 > > I suppose I need to group these together somehow, and it would make more > sense to do that if the values were arrays. > > > If I scroll a bit further down and choose, say, 794000 (a relatively > popular one), then I have this > > string_agg │ Barraza Barrios Barros Bras Ferraz Frias Frisco Parras > Peraza Peres Perez Porras Varas Veras > daitch_mokotoff │ 794000 > count │ 14 > > and looking for that code in the result I also get these three > > string_agg │ Barca Barco Parco > daitch_mokotoff │ 795000 794000 > count │ 3 > > string_agg │ Borja > daitch_mokotoff │ 79 794000 > count │ 1 > > string_agg │ Borjas > daitch_mokotoff │ 794000 794400 > count │ 1 > > and then I see that I should also search for possible matches in codes > 795000, 79 and 794400, so that gives me > > string_agg │ Baria Baro Barrio Barro Berra Borra Feria Para Parra > Perea Vera > daitch_mokotoff │ 79 > count │ 11 > > string_agg │ Barriga Borge Borrego Burgo Fraga > daitch_mokotoff │ 795000 > count │ 5 > > string_agg │ Borjas > daitch_mokotoff │ 794000 794400 > count │ 1 > > which look closely related (compare "Veras" in the first to "Vera" in > the later set. If you ignore that pseudo-match, you're likely to miss > possible family relationships.) > > > I suppose if I were a genealogy researcher, I would be helped by having > each of these codes behave as a separate unit, rather than me having to > split the string into the several possible contained values. It seems to me like you're trying to use soundex coding for something it was never designed for. As stated in my previous mail, soundex algorithms are designed to index names on some representation of sound, so that alike sounding names with different spellings will match, and as shown in the documentation example, that is exactly what the implementation facilitates. Daitch-Mokotoff Soundex indexes alternative sounds for the same name, however if I understand correctly, you want to index names by single sounds, linking all alike sounding names to the same soundex code. I fail to see how that is useful - if you want to find matches for a name, you simply match against all indexed names. If you only consider one sound, you won't find all names that match. In any case, as explained in the documentation, the implementation is intended to be a companion to Full Text Search, thus text is the natural representation for the soundex codes. BTW Vera 79 does not match Veras 794000, because they don't sound the same (up to the maximum soundex code length). Best regards Dag Lem
Re: daitch_mokotoff module
Alvaro Herrera writes: > I wonder why do you have it return the multiple alternative codes as a > space-separated string. Maybe an array would be more appropriate. Even > on your documented example use, the first thing you do is split it on > spaces. In the example, the *input* is split on whitespace, the returned soundex codes are not. The splitting of the input is done in order to code each word separately. One of the stated rules of the Daitch-Mokotoff Soundex Coding is that "When a name consists of more than one word, it is coded as if one word", and this may not always be desired. See https://www.avotaynu.com/soundex.htm or https://www.jewishgen.org/InfoFiles/soundex.html for the rules. The intended use for the Daitch-Mokotoff soundex, as for any other soundex algorithm, is to index names (or words) on some representation of sound, so that alike sounding names with different spellings will match. In PostgreSQL, the Daitch-Mokotoff Soundex and Full Text Search makes for a powerful combination to match alike sounding names. Full Text Search (as any other free text search engine) works with documents, and thus the Daitch-Mokotoff Soundex implementation produces documents (words separated by space). As stated in the documentation: "Any alternative soundex codes are separated by space, which makes the returned text suited for use in Full Text Search". Best regards, Dag Lem
Re: daitch_mokotoff module
Tom Lane writes: > Alvaro Herrera writes: >> On 2022-Dec-22, Dag Lem wrote: >>> This should hopefully fix the last Cfbot failures, by exclusion of >>> daitch_mokotoff.h from headerscheck and cpluspluscheck. > >> Hmm, maybe it'd be better to move the typedefs to the .h file instead. > > Indeed, that sounds like exactly the wrong way to fix such a problem. > The bar for excluding stuff from headerscheck needs to be very high. > OK, I've moved enough declarations back to the generated header file again so as to avoid excluding it from headerscheck and cpluspluscheck. Best regards, Dag Lem diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile index 0704894f88..12baf2d884 100644 --- a/contrib/fuzzystrmatch/Makefile +++ b/contrib/fuzzystrmatch/Makefile @@ -3,14 +3,15 @@ MODULE_big = fuzzystrmatch OBJS = \ $(WIN32RES) \ + daitch_mokotoff.o \ dmetaphone.o \ fuzzystrmatch.o EXTENSION = fuzzystrmatch -DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.0--1.1.sql +DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.1--1.2.sql fuzzystrmatch--1.0--1.1.sql PGFILEDESC = "fuzzystrmatch - similarities and distance between strings" -REGRESS = fuzzystrmatch +REGRESS = fuzzystrmatch fuzzystrmatch_utf8 ifdef USE_PGXS PG_CONFIG = pg_config @@ -22,3 +23,14 @@ top_builddir = ../.. include $(top_builddir)/src/Makefile.global include $(top_srcdir)/contrib/contrib-global.mk endif + +# Force this dependency to be known even without dependency info built: +daitch_mokotoff.o: daitch_mokotoff.h + +daitch_mokotoff.h: daitch_mokotoff_header.pl + perl $< $@ + +distprep: daitch_mokotoff.h + +maintainer-clean: + rm -f daitch_mokotoff.h diff --git a/contrib/fuzzystrmatch/daitch_mokotoff.c b/contrib/fuzzystrmatch/daitch_mokotoff.c new file mode 100644 index 00..e809d4a39e --- /dev/null +++ b/contrib/fuzzystrmatch/daitch_mokotoff.c @@ -0,0 +1,582 @@ +/* + * Daitch-Mokotoff Soundex + * + * Copyright (c) 2021 Finance Norway + * Author: Dag Lem + * + * This implementation of the Daitch-Mokotoff Soundex System aims at high + * performance. + * + * - The processing of each phoneme is initiated by an O(1) table lookup. + * - For phonemes containing more than one character, a coding tree is traversed + * to process the complete phoneme. + * - The (alternate) soundex codes are produced digit by digit in-place in + * another tree structure. + * + * References: + * + * https://www.avotaynu.com/soundex.htm + * https://www.jewishgen.org/InfoFiles/Soundex.html + * https://familypedia.fandom.com/wiki/Daitch-Mokotoff_Soundex + * https://stevemorse.org/census/soundex.html (dmlat.php, dmsoundex.php) + * https://github.com/apache/commons-codec/ (dmrules.txt, DaitchMokotoffSoundex.java) + * https://metacpan.org/pod/Text::Phonetic (DaitchMokotoff.pm) + * + * A few notes on other implementations: + * + * - All other known implementations have the same unofficial rules for "UE", + * these are also adapted by this implementation (0, 1, NC). + * - The only other known implementation which is capable of generating all + * correct soundex codes in all cases is the JOS Soundex Calculator at + * https://www.jewishgen.org/jos/jossound.htm + * - "J" is considered (only) a vowel in dmlat.php + * - The official rules for "RS" are commented out in dmlat.php + * - Identical code digits for adjacent letters are not collapsed correctly in + * dmsoundex.php when double digit codes are involved. E.g. "BESST" yields + * 744300 instead of 743000 as for "BEST". + * - "J" is considered (only) a consonant in DaitchMokotoffSoundex.java + * - "Y" is not considered a vowel in DaitchMokotoffSoundex.java + * + * Permission to use, copy, modify, and distribute this software and its + * documentation for any purpose, without fee, and without a written agreement + * is hereby granted, provided that the above copyright notice and this + * paragraph and the following two paragraphs appear in all copies. + * + * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR + * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING + * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS + * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. + * + * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES, + * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY + * AND FITNESS FOR A PARTICULAR PURPOSE. THE SOFTWARE PROVIDED HEREUNDER IS + * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO + * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS. +*/ + +#include "postgres.h" + +#include "lib/stringinfo.h" +#include "mb/pg_wchar.h" +#include "utils/builtins.h" +#include "utils/memutils.
Re: daitch_mokotoff module
Andres Freund writes: > On 2022-12-22 14:27:54 +0100, Dag Lem wrote: >> This should hopefully fix the last Cfbot failures, by exclusion of >> daitch_mokotoff.h from headerscheck and cpluspluscheck. > > Btw, you can do the same tests as cfbot in your own repo by enabling CI > in a github repo. See src/tools/ci/README > OK, thanks, I've set it up now. Best regards, Dag Lem
Re: daitch_mokotoff module
Dag Lem writes: > Hi Ian, > > Ian Lawrence Barwick writes: > [...] >> I see you provided some feedback on >> https://commitfest.postgresql.org/36/3468/, >> though the patch seems to have not been accepted (but not >> conclusively rejected >> either). If you still have the chance to review another patch (or >> more) it would >> be much appreciated, as there's quite a few piling up. Things like >> documentation >> or small improvements to client applications are always a good place to >> start. >> Reviews can be provided at any time, there's no need to wait for the next >> CommitFest. >> > > OK, I'll try to find another patch to review. > I have scanned through all the patches in Commitfest 2023-01 with status "Needs review", and it is difficult to find something which I can meaningfully review. The only thing I felt qualified to comment (or nit-pick?) on was https://commitfest.postgresql.org/41/4071/ If something else should turn up which could be reviewed by someone without intimate knowledge of PostgreSQL internals, then don't hesitate to ask. As for the Daitch-Mokotoff patch, the review by Andres Freund was very helpful in order to improve the extension and to make it more idiomatic - hopefully it is now a bit closer to being included. Best regards Dag Lem
Re: daitch_mokotoff module
Dag Lem writes: > I noticed that the Meson builds failed in Cfbot, the updated patch adds > a missing "include_directories" line to meson.build. > This should hopefully fix the last Cfbot failures, by exclusion of daitch_mokotoff.h from headerscheck and cpluspluscheck. Best regards Dag Lem diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile index 0704894f88..12baf2d884 100644 --- a/contrib/fuzzystrmatch/Makefile +++ b/contrib/fuzzystrmatch/Makefile @@ -3,14 +3,15 @@ MODULE_big = fuzzystrmatch OBJS = \ $(WIN32RES) \ + daitch_mokotoff.o \ dmetaphone.o \ fuzzystrmatch.o EXTENSION = fuzzystrmatch -DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.0--1.1.sql +DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.1--1.2.sql fuzzystrmatch--1.0--1.1.sql PGFILEDESC = "fuzzystrmatch - similarities and distance between strings" -REGRESS = fuzzystrmatch +REGRESS = fuzzystrmatch fuzzystrmatch_utf8 ifdef USE_PGXS PG_CONFIG = pg_config @@ -22,3 +23,14 @@ top_builddir = ../.. include $(top_builddir)/src/Makefile.global include $(top_srcdir)/contrib/contrib-global.mk endif + +# Force this dependency to be known even without dependency info built: +daitch_mokotoff.o: daitch_mokotoff.h + +daitch_mokotoff.h: daitch_mokotoff_header.pl + perl $< $@ + +distprep: daitch_mokotoff.h + +maintainer-clean: + rm -f daitch_mokotoff.h diff --git a/contrib/fuzzystrmatch/daitch_mokotoff.c b/contrib/fuzzystrmatch/daitch_mokotoff.c new file mode 100644 index 00..d4ad95c283 --- /dev/null +++ b/contrib/fuzzystrmatch/daitch_mokotoff.c @@ -0,0 +1,596 @@ +/* + * Daitch-Mokotoff Soundex + * + * Copyright (c) 2021 Finance Norway + * Author: Dag Lem + * + * This implementation of the Daitch-Mokotoff Soundex System aims at high + * performance. + * + * - The processing of each phoneme is initiated by an O(1) table lookup. + * - For phonemes containing more than one character, a coding tree is traversed + * to process the complete phoneme. + * - The (alternate) soundex codes are produced digit by digit in-place in + * another tree structure. + * + * References: + * + * https://www.avotaynu.com/soundex.htm + * https://www.jewishgen.org/InfoFiles/Soundex.html + * https://familypedia.fandom.com/wiki/Daitch-Mokotoff_Soundex + * https://stevemorse.org/census/soundex.html (dmlat.php, dmsoundex.php) + * https://github.com/apache/commons-codec/ (dmrules.txt, DaitchMokotoffSoundex.java) + * https://metacpan.org/pod/Text::Phonetic (DaitchMokotoff.pm) + * + * A few notes on other implementations: + * + * - All other known implementations have the same unofficial rules for "UE", + * these are also adapted by this implementation (0, 1, NC). + * - The only other known implementation which is capable of generating all + * correct soundex codes in all cases is the JOS Soundex Calculator at + * https://www.jewishgen.org/jos/jossound.htm + * - "J" is considered (only) a vowel in dmlat.php + * - The official rules for "RS" are commented out in dmlat.php + * - Identical code digits for adjacent letters are not collapsed correctly in + * dmsoundex.php when double digit codes are involved. E.g. "BESST" yields + * 744300 instead of 743000 as for "BEST". + * - "J" is considered (only) a consonant in DaitchMokotoffSoundex.java + * - "Y" is not considered a vowel in DaitchMokotoffSoundex.java + * + * Permission to use, copy, modify, and distribute this software and its + * documentation for any purpose, without fee, and without a written agreement + * is hereby granted, provided that the above copyright notice and this + * paragraph and the following two paragraphs appear in all copies. + * + * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR + * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING + * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS + * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. + * + * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES, + * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY + * AND FITNESS FOR A PARTICULAR PURPOSE. THE SOFTWARE PROVIDED HEREUNDER IS + * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO + * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS. +*/ + +#include "postgres.h" + +#include "lib/stringinfo.h" +#include "mb/pg_wchar.h" +#include "utils/builtins.h" +#include "utils/memutils.h" + + +/* + * The soundex coding chart table is adapted from from + * https://www.jewishgen.org/InfoFiles/Soundex.html + * See daitch_mokotoff_header.pl for details. +*/ + +#define DM_CODE_DIGITS 6 + +/* Coding chart table: Soundex codes */ +typedef char dm_code[2 + 1]; /* One or two sequential code digits + NUL */ +typ
Re: daitch_mokotoff module
I noticed that the Meson builds failed in Cfbot, the updated patch adds a missing "include_directories" line to meson.build. Best regards Dag Lem diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile index 0704894f88..12baf2d884 100644 --- a/contrib/fuzzystrmatch/Makefile +++ b/contrib/fuzzystrmatch/Makefile @@ -3,14 +3,15 @@ MODULE_big = fuzzystrmatch OBJS = \ $(WIN32RES) \ + daitch_mokotoff.o \ dmetaphone.o \ fuzzystrmatch.o EXTENSION = fuzzystrmatch -DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.0--1.1.sql +DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.1--1.2.sql fuzzystrmatch--1.0--1.1.sql PGFILEDESC = "fuzzystrmatch - similarities and distance between strings" -REGRESS = fuzzystrmatch +REGRESS = fuzzystrmatch fuzzystrmatch_utf8 ifdef USE_PGXS PG_CONFIG = pg_config @@ -22,3 +23,14 @@ top_builddir = ../.. include $(top_builddir)/src/Makefile.global include $(top_srcdir)/contrib/contrib-global.mk endif + +# Force this dependency to be known even without dependency info built: +daitch_mokotoff.o: daitch_mokotoff.h + +daitch_mokotoff.h: daitch_mokotoff_header.pl + perl $< $@ + +distprep: daitch_mokotoff.h + +maintainer-clean: + rm -f daitch_mokotoff.h diff --git a/contrib/fuzzystrmatch/daitch_mokotoff.c b/contrib/fuzzystrmatch/daitch_mokotoff.c new file mode 100644 index 00..d4ad95c283 --- /dev/null +++ b/contrib/fuzzystrmatch/daitch_mokotoff.c @@ -0,0 +1,596 @@ +/* + * Daitch-Mokotoff Soundex + * + * Copyright (c) 2021 Finance Norway + * Author: Dag Lem + * + * This implementation of the Daitch-Mokotoff Soundex System aims at high + * performance. + * + * - The processing of each phoneme is initiated by an O(1) table lookup. + * - For phonemes containing more than one character, a coding tree is traversed + * to process the complete phoneme. + * - The (alternate) soundex codes are produced digit by digit in-place in + * another tree structure. + * + * References: + * + * https://www.avotaynu.com/soundex.htm + * https://www.jewishgen.org/InfoFiles/Soundex.html + * https://familypedia.fandom.com/wiki/Daitch-Mokotoff_Soundex + * https://stevemorse.org/census/soundex.html (dmlat.php, dmsoundex.php) + * https://github.com/apache/commons-codec/ (dmrules.txt, DaitchMokotoffSoundex.java) + * https://metacpan.org/pod/Text::Phonetic (DaitchMokotoff.pm) + * + * A few notes on other implementations: + * + * - All other known implementations have the same unofficial rules for "UE", + * these are also adapted by this implementation (0, 1, NC). + * - The only other known implementation which is capable of generating all + * correct soundex codes in all cases is the JOS Soundex Calculator at + * https://www.jewishgen.org/jos/jossound.htm + * - "J" is considered (only) a vowel in dmlat.php + * - The official rules for "RS" are commented out in dmlat.php + * - Identical code digits for adjacent letters are not collapsed correctly in + * dmsoundex.php when double digit codes are involved. E.g. "BESST" yields + * 744300 instead of 743000 as for "BEST". + * - "J" is considered (only) a consonant in DaitchMokotoffSoundex.java + * - "Y" is not considered a vowel in DaitchMokotoffSoundex.java + * + * Permission to use, copy, modify, and distribute this software and its + * documentation for any purpose, without fee, and without a written agreement + * is hereby granted, provided that the above copyright notice and this + * paragraph and the following two paragraphs appear in all copies. + * + * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR + * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING + * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS + * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. + * + * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES, + * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY + * AND FITNESS FOR A PARTICULAR PURPOSE. THE SOFTWARE PROVIDED HEREUNDER IS + * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO + * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS. +*/ + +#include "postgres.h" + +#include "lib/stringinfo.h" +#include "mb/pg_wchar.h" +#include "utils/builtins.h" +#include "utils/memutils.h" + + +/* + * The soundex coding chart table is adapted from from + * https://www.jewishgen.org/InfoFiles/Soundex.html + * See daitch_mokotoff_header.pl for details. +*/ + +#define DM_CODE_DIGITS 6 + +/* Coding chart table: Soundex codes */ +typedef char dm_code[2 + 1]; /* One or two sequential code digits + NUL */ +typedef dm_code dm_codes[3]; /* Start of name, before a vowel, any other */ + +/* Coding chart table: Letter in input sequence */ +struct dm_letter +{ + char
Re: [PATCH] Add function to_oct
The following review has been posted through the commitfest application: make installcheck-world: not tested Implements feature: not tested Spec compliant: not tested Documentation:not tested This is a mini-review of to_oct :-) The function seems useful to me, and is obviously correct. I don't know whether optimization at such a low level is considered in PostgreSQL, but here goes. The calculation of quotient and remainder can be replaced by less costly masking and shifting. Defining #define OCT_DIGIT_BITS 3 #define OCT_DIGIT_BITMASK 0x7 the content of the loop can be replaced by *--ptr = digits[value & OCT_DIGIT_BITMASK]; value >>= OCT_DIGIT_BITS; Also, the check for ptr > buf in the while loop can be removed. The check is superfluous, since buf cannot possibly be exhausted by a 32 bit integer (the maximum octal number being 377). Best regards Dag Lem The new status of this patch is: Waiting on Author
Re: daitch_mokotoff module
Hi Andreas, Thank you for your detailed and constructive review! I have made a conscientuous effort to address all the issues you point out, please see comments below. Andres Freund writes: > Hi, > > On 2022-02-03 15:27:32 +0100, Dag Lem wrote: [...] > [23:43:34.796] contrib/fuzzystrmatch/meson.build:18:0: ERROR: File > fuzzystrmatch--1.1.sql does not exist. > > >> -DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.0--1.1.sql >> +DATA = fuzzystrmatch--1.2.sql fuzzystrmatch--1.1--1.2.sql >> fuzzystrmatch--1.0--1.1.sql >> PGFILEDESC = "fuzzystrmatch - similarities and distance between strings" > > > The patch seems to remove fuzzystrmatch--1.1.sql - I suggest not doing > that. In recent years our approach has been to just keep the "base version" of > the upgrade script, with extension creation running through the upgrade > scripts. > OK, I have now kept fuzzystrmatch--1.1.sql, and omitted fuzzystrmatch--1.2.sql Both the Makefile and meson.build are updated to handle the new files, including the generated header. >> >> + >> +#include "daitch_mokotoff.h" >> + >> +#include "postgres.h" > > Postgres policy is that the include of "postgres.h" has to be the first > include in every .c file. > > OK, fixed. >> +#include "utils/builtins.h" >> +#include "mb/pg_wchar.h" >> + >> +#include >> + >> +/* Internal C implementation */ >> +static char *_daitch_mokotoff(char *word, char *soundex, size_t n); >> + >> + >> +PG_FUNCTION_INFO_V1(daitch_mokotoff); >> +Datum >> +daitch_mokotoff(PG_FUNCTION_ARGS) >> +{ >> +text *arg = PG_GETARG_TEXT_PP(0); >> +char *string, >> + *tmp_soundex; >> +text *soundex; >> + >> +/* >> + * The maximum theoretical soundex size is several KB, however in >> practice >> + * anything but contrived synthetic inputs will yield a soundex size of >> + * less than 100 bytes. We thus allocate and free a temporary work >> buffer, >> + * and return only the actual soundex result. >> + */ >> + string = pg_server_to_any(text_to_cstring(arg), >> VARSIZE_ANY_EXHDR(arg), PG_UTF8); >> +tmp_soundex = palloc(DM_MAX_SOUNDEX_CHARS); > > Seems that just using StringInfo to hold the soundex output would work better > than a static allocation? > OK, fixed. > >> +if (!_daitch_mokotoff(string, tmp_soundex, DM_MAX_SOUNDEX_CHARS)) > > We imo shouldn't introduce new functions starting with _. > OK, fixed. Note that I just followed the existing pattern in fuzzystrmatch.c there. [...] >> +/* Find next node corresponding to code digit, or create a new node. */ >> +static dm_node * find_or_create_node(dm_nodes nodes, int *num_nodes, >> + dm_node * node, char code_digit) > > PG code style is to have a line break between a function defintion's return > type and the function name - like you actually do above. > OK, fixed. Both pgindent and I must have missed that particular function. >> +/* Mapping from ISO8859-1 to upper-case ASCII */ >> +static const char tr_iso8859_1_to_ascii_upper[] = >> +/* >> +"`abcdefghijklmnopqrstuvwxyz{|}~ ¡¢£¤¥¦§¨©ª«¬ >> ®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ" >> +*/ >> +"`ABCDEFGHIJKLMNOPQRSTUVWXYZ{|}~ ! >> ?AAECDNO*OYDSAAECDNO/OYDY"; >> + >> +static char >> +iso8859_1_to_ascii_upper(unsigned char c) >> +{ >> +return c >= 0x60 ? tr_iso8859_1_to_ascii_upper[c - 0x60] : c; >> +} >> + >> + >> +/* Convert an UTF-8 character to ISO-8859-1. >> + * Unconvertable characters are returned as '?'. >> + * NB! Beware of the domain specific conversion of Ą, Ę, and Ţ/Ț. >> + */ >> +static char >> +utf8_to_iso8859_1(char *str, int *ix) > > It seems decidedly not great to have custom encoding conversion routines in a > contrib module. Is there any way we can avoid this? > I have now replaced the custom UTF-8 decode with calls to utf8_to_unicode and pg_utf_mblen, and simplified the subsequent conversion to ASCII. Hopefully this makes the conversion code more palatable. I don't see how the conversion to ASCII could be substantially simplified further. The conversion maps lowercase and 8 bit ISO8859-1 characters to ASCII via uppercasing, removal of accents, and discarding of special characters. In addition to that, it maps (the non-ISO8859-1) Ą, Ę, and Ţ/Ț from the coding chart to [, \, and ]. After this, a simple O(1) table lookup can be used to retrieve the soun
Re: daitch_mokotoff module
Hi Ian, Ian Lawrence Barwick writes: > Hi Dag > > 2022年2月3日(木) 23:27 Dag Lem : >> >> Hi, >> >> Just some minor adjustments to the patch: >> >> * Removed call to locale-dependent toupper() >> * Cleaned up input normalization > > This patch was marked as "Waiting on Author" in the CommitFest entry [1] > but I see you provided an updated version which hasn't received any feedback, > so I've move this to the next CommitFest [2] and set it to "Needs Review". > > [1] https://commitfest.postgresql.org/40/3451/ > [2] https://commitfest.postgresql.org/41/3451/ > >> I have been asked to sign up to review a commitfest patch or patches - >> unfortunately I've been ill with COVID-19 and it's not until now that >> I feel well enough to have a look. >> >> Julien: I'll have a look at https://commitfest.postgresql.org/36/3468/ >> as you suggested (https://commitfest.postgresql.org/36/3379/ seems to >> have been reviewed now). >> >> If there are other suggestions for a patch or patches to review for >> someone new to PostgreSQL internals, I'd be grateful for that. > > I see you provided some feedback on > https://commitfest.postgresql.org/36/3468/, > though the patch seems to have not been accepted (but not conclusively > rejected > either). If you still have the chance to review another patch (or more) it > would > be much appreciated, as there's quite a few piling up. Things like > documentation > or small improvements to client applications are always a good place to start. > Reviews can be provided at any time, there's no need to wait for the next > CommitFest. > OK, I'll try to find another patch to review. Regards Dag Lem
Re: Add psql command to list constraints
"David G. Johnston" writes: > On Monday, November 15, 2021, Tatsuro Yamada > wrote: > > I don't know if this is a good example, but if you look at > StackOverflow, > it seems that people who want to see a list of constraints appear > regularly. > > https://stackoverflow.com/questions/62987794/how-to-list-all-constraints- > of-a-table-in-postgresql > > > Given the questioner restricted their question to “for a given table” > I’d say it supports leaving the status quo, not changing it. > > If, as you suppose, these come up regularly then finding one that asks > for it “in the entire database”, ideally with some stated goal, should > be doable. > > David J. > > This is my review of the patch in https://commitfest.postgresql.org/37/3468/ The patch adds the command "\dco" to list constraints in psql. This seems useful to me. The patch applies cleanly to HEAD, although some hunks have rather large offsets. As far as I can tell, the "\dco" command works as documented. I have however found the following issues with the patch: * A TAB character has been added to doc/src/sgml/ref/psql-ref.sgml - this should be replaced with spaces. * The call to listConstraints in line src/bin/psql/command.c 794 refers to [2], this should rather be [3]. * The patch kills the "\dc" command in src/bin/psql/command.c This can be fixed by adding the following at line 800: else success = listConversions(pattern, show_verbose, show_system); Another comment is that the "\dco" command outputs quite a lot of information, which only fits in a wide terminal window. Would it be an idea to only display the columns "Schema" and "Name" by default, and use "+" to specify inclusion of the columns "Definition" and "Table". Best regards Dag Lem
Re: daitch_mokotoff module
Hi, Just some minor adjustments to the patch: * Removed call to locale-dependent toupper() * Cleaned up input normalization I have been asked to sign up to review a commitfest patch or patches - unfortunately I've been ill with COVID-19 and it's not until now that I feel well enough to have a look. Julien: I'll have a look at https://commitfest.postgresql.org/36/3468/ as you suggested (https://commitfest.postgresql.org/36/3379/ seems to have been reviewed now). If there are other suggestions for a patch or patches to review for someone new to PostgreSQL internals, I'd be grateful for that. Best regards Dag Lem diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile index 0704894f88..1d5bd84be8 100644 --- a/contrib/fuzzystrmatch/Makefile +++ b/contrib/fuzzystrmatch/Makefile @@ -3,14 +3,15 @@ MODULE_big = fuzzystrmatch OBJS = \ $(WIN32RES) \ + daitch_mokotoff.o \ dmetaphone.o \ fuzzystrmatch.o EXTENSION = fuzzystrmatch -DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.0--1.1.sql +DATA = fuzzystrmatch--1.2.sql fuzzystrmatch--1.1--1.2.sql fuzzystrmatch--1.0--1.1.sql PGFILEDESC = "fuzzystrmatch - similarities and distance between strings" -REGRESS = fuzzystrmatch +REGRESS = fuzzystrmatch fuzzystrmatch_utf8 ifdef USE_PGXS PG_CONFIG = pg_config @@ -22,3 +23,8 @@ top_builddir = ../.. include $(top_builddir)/src/Makefile.global include $(top_srcdir)/contrib/contrib-global.mk endif + +daitch_mokotoff.o: daitch_mokotoff.h + +daitch_mokotoff.h: daitch_mokotoff_header.pl + perl $< > $@ diff --git a/contrib/fuzzystrmatch/daitch_mokotoff.c b/contrib/fuzzystrmatch/daitch_mokotoff.c new file mode 100644 index 00..ba87061845 --- /dev/null +++ b/contrib/fuzzystrmatch/daitch_mokotoff.c @@ -0,0 +1,587 @@ +/* + * Daitch-Mokotoff Soundex + * + * Copyright (c) 2021 Finance Norway + * Author: Dag Lem + * + * This implementation of the Daitch-Mokotoff Soundex System aims at high + * performance. + * + * - The processing of each phoneme is initiated by an O(1) table lookup. + * - For phonemes containing more than one character, a coding tree is traversed + * to process the complete phoneme. + * - The (alternate) soundex codes are produced digit by digit in-place in + * another tree structure. + * + * References: + * + * https://www.avotaynu.com/soundex.htm + * https://www.jewishgen.org/InfoFiles/Soundex.html + * https://familypedia.fandom.com/wiki/Daitch-Mokotoff_Soundex + * https://stevemorse.org/census/soundex.html (dmlat.php, dmsoundex.php) + * https://github.com/apache/commons-codec/ (dmrules.txt, DaitchMokotoffSoundex.java) + * https://metacpan.org/pod/Text::Phonetic (DaitchMokotoff.pm) + * + * A few notes on other implementations: + * + * - All other known implementations have the same unofficial rules for "UE", + * these are also adapted by this implementation (0, 1, NC). + * - The only other known implementation which is capable of generating all + * correct soundex codes in all cases is the JOS Soundex Calculator at + * https://www.jewishgen.org/jos/jossound.htm + * - "J" is considered (only) a vowel in dmlat.php + * - The official rules for "RS" are commented out in dmlat.php + * - Identical code digits for adjacent letters are not collapsed correctly in + * dmsoundex.php when double digit codes are involved. E.g. "BESST" yields + * 744300 instead of 743000 as for "BEST". + * - "J" is considered (only) a consonant in DaitchMokotoffSoundex.java + * - "Y" is not considered a vowel in DaitchMokotoffSoundex.java + * + * Permission to use, copy, modify, and distribute this software and its + * documentation for any purpose, without fee, and without a written agreement + * is hereby granted, provided that the above copyright notice and this + * paragraph and the following two paragraphs appear in all copies. + * + * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR + * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING + * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS + * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. + * + * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES, + * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY + * AND FITNESS FOR A PARTICULAR PURPOSE. THE SOFTWARE PROVIDED HEREUNDER IS + * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO + * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS. +*/ + +#include "daitch_mokotoff.h" + +#include "postgres.h" +#include "utils/builtins.h" +#include "mb/pg_wchar.h" + +#include + +/* Internal C implementation */ +static char *_daitch_mokotoff(char *word, char *soundex, size_t n); + + +PG_FUNCTION_INFO_V1(daitch_mokotoff); +Datum +daitch_mokotoff(PG_FUNC
Re: daitch_mokotoff module
Dag Lem writes: > Thomas Munro writes: > >> On Wed, Jan 5, 2022 at 2:49 AM Dag Lem wrote: >>> However I guess this won't make any difference wrt. actually running the >>> tests, as long as there seems to be an encoding problem in the cfbot >> >> Fixed -- I told it to pull down patches as binary, not text. Now it >> makes commits that look healthier, and so far all the Unix systems >> have survived CI: >> >> https://github.com/postgresql-cfbot/postgresql/commit/79700efc61d15c2414b8450a786951fa9308c07f >> http://cfbot.cputube.org/dag-lem.html >> > > Great! > > Dag > > After this I did the mistake of including a patch for citext in this thread, which is now picked up by cfbot instead of the Daitch-Mokotoff patch. Attaching the original patch again in order to hopefully fix my mistake. Best regards Dag Lem diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile index 0704894f88..1d5bd84be8 100644 --- a/contrib/fuzzystrmatch/Makefile +++ b/contrib/fuzzystrmatch/Makefile @@ -3,14 +3,15 @@ MODULE_big = fuzzystrmatch OBJS = \ $(WIN32RES) \ + daitch_mokotoff.o \ dmetaphone.o \ fuzzystrmatch.o EXTENSION = fuzzystrmatch -DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.0--1.1.sql +DATA = fuzzystrmatch--1.2.sql fuzzystrmatch--1.1--1.2.sql fuzzystrmatch--1.0--1.1.sql PGFILEDESC = "fuzzystrmatch - similarities and distance between strings" -REGRESS = fuzzystrmatch +REGRESS = fuzzystrmatch fuzzystrmatch_utf8 ifdef USE_PGXS PG_CONFIG = pg_config @@ -22,3 +23,8 @@ top_builddir = ../.. include $(top_builddir)/src/Makefile.global include $(top_srcdir)/contrib/contrib-global.mk endif + +daitch_mokotoff.o: daitch_mokotoff.h + +daitch_mokotoff.h: daitch_mokotoff_header.pl + perl $< > $@ diff --git a/contrib/fuzzystrmatch/daitch_mokotoff.c b/contrib/fuzzystrmatch/daitch_mokotoff.c new file mode 100644 index 00..1b7263c349 --- /dev/null +++ b/contrib/fuzzystrmatch/daitch_mokotoff.c @@ -0,0 +1,593 @@ +/* + * Daitch-Mokotoff Soundex + * + * Copyright (c) 2021 Finance Norway + * Author: Dag Lem + * + * This implementation of the Daitch-Mokotoff Soundex System aims at high + * performance. + * + * - The processing of each phoneme is initiated by an O(1) table lookup. + * - For phonemes containing more than one character, a coding tree is traversed + * to process the complete phoneme. + * - The (alternate) soundex codes are produced digit by digit in-place in + * another tree structure. + * + * References: + * + * https://www.avotaynu.com/soundex.htm + * https://www.jewishgen.org/InfoFiles/Soundex.html + * https://familypedia.fandom.com/wiki/Daitch-Mokotoff_Soundex + * https://stevemorse.org/census/soundex.html (dmlat.php, dmsoundex.php) + * https://github.com/apache/commons-codec/ (dmrules.txt, DaitchMokotoffSoundex.java) + * https://metacpan.org/pod/Text::Phonetic (DaitchMokotoff.pm) + * + * A few notes on other implementations: + * + * - All other known implementations have the same unofficial rules for "UE", + * these are also adapted by this implementation (0, 1, NC). + * - The only other known implementation which is capable of generating all + * correct soundex codes in all cases is the JOS Soundex Calculator at + * https://www.jewishgen.org/jos/jossound.htm + * - "J" is considered (only) a vowel in dmlat.php + * - The official rules for "RS" are commented out in dmlat.php + * - Identical code digits for adjacent letters are not collapsed correctly in + * dmsoundex.php when double digit codes are involved. E.g. "BESST" yields + * 744300 instead of 743000 as for "BEST". + * - "J" is considered (only) a consonant in DaitchMokotoffSoundex.java + * - "Y" is not considered a vowel in DaitchMokotoffSoundex.java + * + * Permission to use, copy, modify, and distribute this software and its + * documentation for any purpose, without fee, and without a written agreement + * is hereby granted, provided that the above copyright notice and this + * paragraph and the following two paragraphs appear in all copies. + * + * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR + * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING + * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS + * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. + * + * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES, + * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY + * AND FITNESS FOR A PARTICULAR PURPOSE. THE SOFTWARE PROVIDED HEREUNDER IS + * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO + * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS. +*/ + +#include "daitch_mokotoff.h" + +
[PATCH] Run UTF8-dependent tests for citext [Re: daitch_mokotoff module]
Tom Lane writes: > Dag Lem writes: > >> Running "ack -l '[\x80-\xff]'" in the contrib/ directory reveals that >> two other modules are using UTF8 characters in tests - citext and >> unaccent. > > Yeah, neither of those have been upgraded to said best practice. > (If you feel like doing the legwork to improve that situation, > that'd be great.) > Please find attached a patch to run the previously commented-out UTF8-dependent tests for citext, according to best practice. For now I don't dare to touch the unaccent module, which seems to be UTF8-only anyway. Best regards Dag Lem diff --git a/contrib/citext/Makefile b/contrib/citext/Makefile index a7de52928d..789932fe36 100644 --- a/contrib/citext/Makefile +++ b/contrib/citext/Makefile @@ -11,7 +11,7 @@ DATA = citext--1.4.sql \ citext--1.0--1.1.sql PGFILEDESC = "citext - case-insensitive character string data type" -REGRESS = citext +REGRESS = citext citext_utf8 ifdef USE_PGXS PG_CONFIG = pg_config diff --git a/contrib/citext/expected/citext.out b/contrib/citext/expected/citext.out index 3bac0534fb..48b4de8993 100644 --- a/contrib/citext/expected/citext.out +++ b/contrib/citext/expected/citext.out @@ -48,29 +48,6 @@ SELECT 'a'::citext <> 'ab'::citext AS t; t (1 row) --- Multibyte sanity tests. Uncomment to run. --- SELECT 'À'::citext = 'À'::citext AS t; --- SELECT 'À'::citext = 'à'::citext AS t; --- SELECT 'À'::text = 'à'::text AS f; -- text wins. --- SELECT 'À'::citext <> 'B'::citext AS t; --- Test combining characters making up canonically equivalent strings. --- SELECT 'Ä'::text <> 'Ä'::text AS t; --- SELECT 'Ä'::citext <> 'Ä'::citext AS t; --- Test the Turkish dotted I. The lowercase is a single byte while the --- uppercase is multibyte. This is why the comparison code can't be optimized --- to compare string lengths. --- SELECT 'i'::citext = 'İ'::citext AS t; --- Regression. --- SELECT 'láska'::citext <> 'laská'::citext AS t; --- SELECT 'Ask Bjørn Hansen'::citext = 'Ask Bjørn Hansen'::citext AS t; --- SELECT 'Ask Bjørn Hansen'::citext = 'ASK BJØRN HANSEN'::citext AS t; --- SELECT 'Ask Bjørn Hansen'::citext <> 'Ask Bjorn Hansen'::citext AS t; --- SELECT 'Ask Bjørn Hansen'::citext <> 'ASK BJORN HANSEN'::citext AS t; --- SELECT citext_cmp('Ask Bjørn Hansen'::citext, 'Ask Bjørn Hansen'::citext) AS zero; --- SELECT citext_cmp('Ask Bjørn Hansen'::citext, 'ask bjørn hansen'::citext) AS zero; --- SELECT citext_cmp('Ask Bjørn Hansen'::citext, 'ASK BJØRN HANSEN'::citext) AS zero; --- SELECT citext_cmp('Ask Bjørn Hansen'::citext, 'Ask Bjorn Hansen'::citext) AS positive; --- SELECT citext_cmp('Ask Bjorn Hansen'::citext, 'Ask Bjørn Hansen'::citext) AS negative; -- Test > and >= SELECT 'B'::citext > 'a'::citext AS t; t diff --git a/contrib/citext/expected/citext_1.out b/contrib/citext/expected/citext_1.out index 57fc863f7a..8ab4d4224e 100644 --- a/contrib/citext/expected/citext_1.out +++ b/contrib/citext/expected/citext_1.out @@ -48,29 +48,6 @@ SELECT 'a'::citext <> 'ab'::citext AS t; t (1 row) --- Multibyte sanity tests. Uncomment to run. --- SELECT 'À'::citext = 'À'::citext AS t; --- SELECT 'À'::citext = 'à'::citext AS t; --- SELECT 'À'::text = 'à'::text AS f; -- text wins. --- SELECT 'À'::citext <> 'B'::citext AS t; --- Test combining characters making up canonically equivalent strings. --- SELECT 'Ä'::text <> 'Ä'::text AS t; --- SELECT 'Ä'::citext <> 'Ä'::citext AS t; --- Test the Turkish dotted I. The lowercase is a single byte while the --- uppercase is multibyte. This is why the comparison code can't be optimized --- to compare string lengths. --- SELECT 'i'::citext = 'İ'::citext AS t; --- Regression. --- SELECT 'láska'::citext <> 'laská'::citext AS t; --- SELECT 'Ask Bjørn Hansen'::citext = 'Ask Bjørn Hansen'::citext AS t; --- SELECT 'Ask Bjørn Hansen'::citext = 'ASK BJØRN HANSEN'::citext AS t; --- SELECT 'Ask Bjørn Hansen'::citext <> 'Ask Bjorn Hansen'::citext AS t; --- SELECT 'Ask Bjørn Hansen'::citext <> 'ASK BJORN HANSEN'::citext AS t; --- SELECT citext_cmp('Ask Bjørn Hansen'::citext, 'Ask Bjørn Hansen'::citext) AS zero; --- SELECT citext_cmp('Ask Bjørn Hansen'::citext, 'ask bjørn hansen'::citext) AS zero; --- SELECT citext_cmp('Ask Bjørn Hansen'::citext, 'ASK BJØRN HANSEN'::citext) AS zero; --- SELECT citext_cmp('Ask Bjørn Hansen'::citext, 'Ask Bjorn Hansen'::citext) AS positive; --- SELECT citext_cmp('Ask Bjorn Hansen'::citext, 'Ask Bjørn Hansen'::citext) AS negative; -- Test > and >= SELECT 'B'::citext > 'a'::citext AS t; t diff --git a/contrib/citext/expected/citext_utf8.out b/contrib/citext/expected/citext_utf8.out new file mode 100644 index 00..1f4fa79aff --- /dev/null +++ b/contrib/citext/expected/citext_utf8.out @@ -0,0 +1,119 @@ +/* + * This test must be run in a database with UTF-8 encoding, + * because o
Re: daitch_mokotoff module
Thomas Munro writes: > On Wed, Jan 5, 2022 at 2:49 AM Dag Lem wrote: >> However I guess this won't make any difference wrt. actually running the >> tests, as long as there seems to be an encoding problem in the cfbot > > Fixed -- I told it to pull down patches as binary, not text. Now it > makes commits that look healthier, and so far all the Unix systems > have survived CI: > > https://github.com/postgresql-cfbot/postgresql/commit/79700efc61d15c2414b8450a786951fa9308c07f > http://cfbot.cputube.org/dag-lem.html > Great! Dag
Re: daitch_mokotoff module
Andres Freund writes: > Hi, > > On 2022-01-02 21:41:53 -0500, Tom Lane wrote: >> ... so, that test case is guaranteed to fail in non-UTF8 encodings, >> I suppose? I wonder what the LANG environment is in that cfbot >> instance. > > LANG="en_US.UTF-8" > > But it looks to me like the problem is in the commit cfbot creates, rather > than the test run itself: > https://github.com/postgresql-cfbot/postgresql/commit/d5b4ec87cfd65dc08d26e1b789bd254405c90a66#diff-388d4bb360a3b24c425e29a85899315dc02f9c1dd9b9bc9aaa828876bdfea50aR56 > > Greetings, > > Andres Freund > > I have now separated out the UTF8-dependent tests, hopefully according to the current best practice (based on src/test/modules/test_regex/ and https://www.postgresql.org/docs/14/regress-variant.html). However I guess this won't make any difference wrt. actually running the tests, as long as there seems to be an encoding problem in the cfbot pipeline. Is there anything else I can do? Could perhaps fuzzystrmatch_utf8 simply be commented out from the Makefile for the time being? Best regards Dag Lem diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile index 0704894f88..1d5bd84be8 100644 --- a/contrib/fuzzystrmatch/Makefile +++ b/contrib/fuzzystrmatch/Makefile @@ -3,14 +3,15 @@ MODULE_big = fuzzystrmatch OBJS = \ $(WIN32RES) \ + daitch_mokotoff.o \ dmetaphone.o \ fuzzystrmatch.o EXTENSION = fuzzystrmatch -DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.0--1.1.sql +DATA = fuzzystrmatch--1.2.sql fuzzystrmatch--1.1--1.2.sql fuzzystrmatch--1.0--1.1.sql PGFILEDESC = "fuzzystrmatch - similarities and distance between strings" -REGRESS = fuzzystrmatch +REGRESS = fuzzystrmatch fuzzystrmatch_utf8 ifdef USE_PGXS PG_CONFIG = pg_config @@ -22,3 +23,8 @@ top_builddir = ../.. include $(top_builddir)/src/Makefile.global include $(top_srcdir)/contrib/contrib-global.mk endif + +daitch_mokotoff.o: daitch_mokotoff.h + +daitch_mokotoff.h: daitch_mokotoff_header.pl + perl $< > $@ diff --git a/contrib/fuzzystrmatch/daitch_mokotoff.c b/contrib/fuzzystrmatch/daitch_mokotoff.c new file mode 100644 index 00..1b7263c349 --- /dev/null +++ b/contrib/fuzzystrmatch/daitch_mokotoff.c @@ -0,0 +1,593 @@ +/* + * Daitch-Mokotoff Soundex + * + * Copyright (c) 2021 Finance Norway + * Author: Dag Lem + * + * This implementation of the Daitch-Mokotoff Soundex System aims at high + * performance. + * + * - The processing of each phoneme is initiated by an O(1) table lookup. + * - For phonemes containing more than one character, a coding tree is traversed + * to process the complete phoneme. + * - The (alternate) soundex codes are produced digit by digit in-place in + * another tree structure. + * + * References: + * + * https://www.avotaynu.com/soundex.htm + * https://www.jewishgen.org/InfoFiles/Soundex.html + * https://familypedia.fandom.com/wiki/Daitch-Mokotoff_Soundex + * https://stevemorse.org/census/soundex.html (dmlat.php, dmsoundex.php) + * https://github.com/apache/commons-codec/ (dmrules.txt, DaitchMokotoffSoundex.java) + * https://metacpan.org/pod/Text::Phonetic (DaitchMokotoff.pm) + * + * A few notes on other implementations: + * + * - All other known implementations have the same unofficial rules for "UE", + * these are also adapted by this implementation (0, 1, NC). + * - The only other known implementation which is capable of generating all + * correct soundex codes in all cases is the JOS Soundex Calculator at + * https://www.jewishgen.org/jos/jossound.htm + * - "J" is considered (only) a vowel in dmlat.php + * - The official rules for "RS" are commented out in dmlat.php + * - Identical code digits for adjacent letters are not collapsed correctly in + * dmsoundex.php when double digit codes are involved. E.g. "BESST" yields + * 744300 instead of 743000 as for "BEST". + * - "J" is considered (only) a consonant in DaitchMokotoffSoundex.java + * - "Y" is not considered a vowel in DaitchMokotoffSoundex.java + * + * Permission to use, copy, modify, and distribute this software and its + * documentation for any purpose, without fee, and without a written agreement + * is hereby granted, provided that the above copyright notice and this + * paragraph and the following two paragraphs appear in all copies. + * + * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR + * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING + * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS + * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. + * + * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES, + * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY + * AND FITNESS FOR A PARTICULAR PURPOSE. THE SOFTWARE PROVIDED HEREUN
Re: daitch_mokotoff module
Tom Lane writes: > Thomas Munro writes: >> Erm, it looks like something weird is happening somewhere in cfbot's >> pipeline, because Dag's patch says: > >> +SELECT daitch_mokotoff('Straßburg'); >> + daitch_mokotoff >> +- >> + 294795 >> +(1 row) > > ... so, that test case is guaranteed to fail in non-UTF8 encodings, > I suppose? I wonder what the LANG environment is in that cfbot > instance. > > (We do have methods for dealing with non-ASCII test cases, but > I can't see that this patch is using any of them.) > > regards, tom lane > I naively assumed that tests would be run in an UTF8 environment. Running "ack -l '[\x80-\xff]'" in the contrib/ directory reveals that two other modules are using UTF8 characters in tests - citext and unaccent. The citext tests seem to be commented out - "Multibyte sanity tests. Uncomment to run." Looking into the unaccent module, I don't quite understand how it will work with various encodings, since it doesn't seem to decode its input - will it fail if run under anything but ASCII or UTF8? In any case, I see that unaccent.sql starts as follows: CREATE EXTENSION unaccent; -- must have a UTF8 database SELECT getdatabaseencoding(); SET client_encoding TO 'UTF8'; Would doing the same thing in fuzzystrmatch.sql fix the problem with failing tests? Should I prepare a new patch? Best regards Dag Lem
Re: daitch_mokotoff module
Hello again, It turns out that there actually exists an(other) implementation of the Daitch-Mokotoff Soundex System which gets it right; the JOS Soundex Calculator at https://www.jewishgen.org/jos/jossound.htm Other implementations I have been able to find, like the one in Apache Commons Coded used in e.g. Elasticsearch, are far from correct. The source code for the JOS Soundex Calculator is not available, as far as I can tell, however I have run the complete list of 98412 last names at https://raw.githubusercontent.com/philipperemy/name-dataset/master/names_dataset/v1/last_names.all.txt through the calculator, in order to have a good basis for comparison. This revealed a few shortcomings in my implementation. In particular I had to go back to the drawing board in order to handle the dual nature of "J" correctly. "J" can be either a vowel or a consonant in Daitch-Mokotoff soundex, and this complicates encoding of the *previous* character. I have also done a more thorough review and refactoring of the code, which should hopefully make things quite a bit more understandable to others. The changes are summarized as follows: * Returns NULL for input without any encodable characters. * Uses the same "unoffical" rules for "UE" as other implementations. * Correctly considers "J" as either a vowel or a consonant. * Corrected encoding for e.g. "HANNMANN". * Code refactoring and comments for readability. * Better examples in the documentation. The implementation is now in correspondence with the JOS Soundex Calculator for the 98412 last names mentioned above, with only the following exceptions: JOS: cedeño 43 53 PG: cedeño 436000 536000 JOS: sadab(khura) 437000 PG: sadab(khura) 437590 I hope this addition to the fuzzystrmatch extension module will prove to be useful to others as well! This is my very first code contribution to PostgreSQL, and I would be grateful for any advice on how to proceed in order to get the patch accepted. Best regards Dag Lem diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile index 0704894f88..826e529e3e 100644 --- a/contrib/fuzzystrmatch/Makefile +++ b/contrib/fuzzystrmatch/Makefile @@ -3,11 +3,12 @@ MODULE_big = fuzzystrmatch OBJS = \ $(WIN32RES) \ + daitch_mokotoff.o \ dmetaphone.o \ fuzzystrmatch.o EXTENSION = fuzzystrmatch -DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.0--1.1.sql +DATA = fuzzystrmatch--1.2.sql fuzzystrmatch--1.1--1.2.sql fuzzystrmatch--1.0--1.1.sql PGFILEDESC = "fuzzystrmatch - similarities and distance between strings" REGRESS = fuzzystrmatch @@ -22,3 +23,8 @@ top_builddir = ../.. include $(top_builddir)/src/Makefile.global include $(top_srcdir)/contrib/contrib-global.mk endif + +daitch_mokotoff.o: daitch_mokotoff.h + +daitch_mokotoff.h: daitch_mokotoff_header.pl + perl $< > $@ diff --git a/contrib/fuzzystrmatch/daitch_mokotoff.c b/contrib/fuzzystrmatch/daitch_mokotoff.c new file mode 100644 index 00..1b7263c349 --- /dev/null +++ b/contrib/fuzzystrmatch/daitch_mokotoff.c @@ -0,0 +1,593 @@ +/* + * Daitch-Mokotoff Soundex + * + * Copyright (c) 2021 Finance Norway + * Author: Dag Lem + * + * This implementation of the Daitch-Mokotoff Soundex System aims at high + * performance. + * + * - The processing of each phoneme is initiated by an O(1) table lookup. + * - For phonemes containing more than one character, a coding tree is traversed + * to process the complete phoneme. + * - The (alternate) soundex codes are produced digit by digit in-place in + * another tree structure. + * + * References: + * + * https://www.avotaynu.com/soundex.htm + * https://www.jewishgen.org/InfoFiles/Soundex.html + * https://familypedia.fandom.com/wiki/Daitch-Mokotoff_Soundex + * https://stevemorse.org/census/soundex.html (dmlat.php, dmsoundex.php) + * https://github.com/apache/commons-codec/ (dmrules.txt, DaitchMokotoffSoundex.java) + * https://metacpan.org/pod/Text::Phonetic (DaitchMokotoff.pm) + * + * A few notes on other implementations: + * + * - All other known implementations have the same unofficial rules for "UE", + * these are also adapted by this implementation (0, 1, NC). + * - The only other known implementation which is capable of generating all + * correct soundex codes in all cases is the JOS Soundex Calculator at + * https://www.jewishgen.org/jos/jossound.htm + * - "J" is considered (only) a vowel in dmlat.php + * - The official rules for "RS" are commented out in dmlat.php + * - Identical code digits for adjacent letters are not collapsed correctly in + * dmsoundex.php when double digit codes are involved. E.g. "BESST" yields + * 744300 instead of 743000 as for "BEST". + * - "J" is considered (only) a consonant in DaitchMokotoffSoundex.java + * - "Y" is not considered a vowel in DaitchMokotoffSoundex.java + * + * Permission to use, copy,
Re: daitch_mokotoff module
Tomas Vondra writes: [...] > > Thanks, looks interesting. A couple generic comments, based on a quick > code review. Thank you for the constructive review! > > 1) Can the extension be marked as trusted, just like fuzzystrmatch? I have now moved the daitch_mokotoff function into the fuzzystrmatch module, as suggested by Andrew Dunstan. > > 2) The docs really need an explanation of what the extension is for, > not just a link to fuzzystrmatch. Also, a couple examples would be > helpful, I guess - similarly to fuzzystrmatch. The last line in the > docs is annoyingly long. Please see the updated documentation for the fuzzystrmatch module. > > 3) What's daitch_mokotov_header.pl for? I mean, it generates the > header, but when do we need to run it? It only has to be run if the soundex rules are changed. I have now made the dependencies explicit in the fuzzystrmatch Makefile. > > 4) It seems to require perl-open, which is a module we did not need > until now. Not sure how well supported it is, but maybe we can use a > more standard module? I believe Perl I/O layers have been part of Perl core for two decades now :-) > > 5) Do we need to keep DM_MAIN? It seems to be meant for some kind of > testing, but our regression tests certainly don't need it (or the > palloc mockup). I suggest to get rid of it. Done. BTW this was modeled after dmetaphone.c > > 6) I really don't understand some of the comments in > daitch_mokotov.sql, like for example: > > -- testEncodeBasic > -- Tests covered above are omitted. > > Also, comments with names of Java methods seem pretty confusing. It'd > be better to actually explain what rules are the tests checking. The tests were copied from various web sites and implementations. I have cut down on the number of tests and made the comments more to the point. > > 7) There are almost no comments in the .c file (ignoring the comment > on top). Short functions like initialize_node are probably fine > without one, but e.g. update_node would deserve one. More comments are added to both the .h and the .c file. > > 8) Some of the lines are pretty long (e.g. the update_node signature > is almost 170 chars). That should be wrapped. Maybe try running > pgindent on the code, that'll show which parts need better formatting > (so as not to get broken later). Fixed. I did run pgindent earlier, however it didn't catch those long lines. > > 9) I'm sure there's better way to get the number of valid chars than this: > > for (i = 0, ilen = 0; (c = read_char([i], )) && (c < 'A' || > c > ']'); i += ilen) > { > } > > Say, a while loop or something? The code gets to the next encodable character, skipping any other characters. I have now added a comment which should hopefully make this clearer, and broken up the for loop for readability. Please find attached the revised patch. Best regards Dag Lem diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile index 0704894f88..826e529e3e 100644 --- a/contrib/fuzzystrmatch/Makefile +++ b/contrib/fuzzystrmatch/Makefile @@ -3,11 +3,12 @@ MODULE_big = fuzzystrmatch OBJS = \ $(WIN32RES) \ + daitch_mokotoff.o \ dmetaphone.o \ fuzzystrmatch.o EXTENSION = fuzzystrmatch -DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.0--1.1.sql +DATA = fuzzystrmatch--1.2.sql fuzzystrmatch--1.1--1.2.sql fuzzystrmatch--1.0--1.1.sql PGFILEDESC = "fuzzystrmatch - similarities and distance between strings" REGRESS = fuzzystrmatch @@ -22,3 +23,8 @@ top_builddir = ../.. include $(top_builddir)/src/Makefile.global include $(top_srcdir)/contrib/contrib-global.mk endif + +daitch_mokotoff.o: daitch_mokotoff.h + +daitch_mokotoff.h: daitch_mokotoff_header.pl + perl $< > $@ diff --git a/contrib/fuzzystrmatch/daitch_mokotoff.c b/contrib/fuzzystrmatch/daitch_mokotoff.c new file mode 100644 index 00..302e9a6d86 --- /dev/null +++ b/contrib/fuzzystrmatch/daitch_mokotoff.c @@ -0,0 +1,516 @@ +/* + * Daitch-Mokotoff Soundex + * + * Copyright (c) 2021 Finance Norway + * Author: Dag Lem + * + * This implementation of the Daitch-Mokotoff Soundex System aims at high + * performance. + * + * - The processing of each phoneme is initiated by an O(1) table lookup. + * - For phonemes containing more than one character, a coding tree is traversed + * to process the complete phoneme. + * - The (alternate) soundex codes are produced digit by digit in-place in + * another tree structure. + * + * References: + * + * https://www.avotaynu.com/soundex.htm + * https://www.jewishgen.org/InfoFiles/Soundex.html + * https://familypedia.fandom.com/wiki/Daitch-Mokotoff_Soundex + * https://stevemorse.org/census/soundex.html (dmlat.php, dmsoundex.php) + * https://github.com/apache/commons-codec/ (dmrules.txt, DaitchMokotoffSoundex.java) + * https://metacpan.org/pod/Text::Phonetic (Dait
Re: daitch_mokotoff module
Please find attached an updated patch, with the following fixes: * Replaced remaining malloc/free with palloc/pfree. * Made "make check" pass. * Updated notes on other implementations. Best regards Dag Lem diff --git a/contrib/Makefile b/contrib/Makefile index 87bf87ab90..5ea729 100644 --- a/contrib/Makefile +++ b/contrib/Makefile @@ -14,6 +14,7 @@ SUBDIRS = \ btree_gist \ citext \ cube \ + daitch_mokotoff \ dblink \ dict_int \ dict_xsyn \ diff --git a/contrib/daitch_mokotoff/Makefile b/contrib/daitch_mokotoff/Makefile new file mode 100644 index 00..baec5e31d4 --- /dev/null +++ b/contrib/daitch_mokotoff/Makefile @@ -0,0 +1,25 @@ +# contrib/daitch_mokotoff/Makefile + +MODULE_big = daitch_mokotoff +OBJS = \ + $(WIN32RES) \ + daitch_mokotoff.o + +EXTENSION = daitch_mokotoff +DATA = daitch_mokotoff--1.0.sql +PGFILEDESC = "daitch_mokotoff - Daitch-Mokotoff Soundex System" + +HEADERS = daitch_mokotoff.h + +REGRESS = daitch_mokotoff + +ifdef USE_PGXS +PG_CONFIG = pg_config +PGXS := $(shell $(PG_CONFIG) --pgxs) +include $(PGXS) +else +subdir = contrib/daitch_mokotoff +top_builddir = ../.. +include $(top_builddir)/src/Makefile.global +include $(top_srcdir)/contrib/contrib-global.mk +endif diff --git a/contrib/daitch_mokotoff/daitch_mokotoff--1.0.sql b/contrib/daitch_mokotoff/daitch_mokotoff--1.0.sql new file mode 100644 index 00..0b5a643175 --- /dev/null +++ b/contrib/daitch_mokotoff/daitch_mokotoff--1.0.sql @@ -0,0 +1,8 @@ +/* contrib/daitch_mokotoff/daitch_mokotoff--1.0.sql */ + +-- complain if script is sourced in psql, rather than via CREATE EXTENSION +\echo Use "CREATE EXTENSION daitch_mokotoff" to load this file. \quit + +CREATE FUNCTION daitch_mokotoff(text) RETURNS text +AS 'MODULE_PATHNAME', 'daitch_mokotoff' +LANGUAGE C IMMUTABLE STRICT PARALLEL SAFE; diff --git a/contrib/daitch_mokotoff/daitch_mokotoff.c b/contrib/daitch_mokotoff/daitch_mokotoff.c new file mode 100644 index 00..a7f1fd8541 --- /dev/null +++ b/contrib/daitch_mokotoff/daitch_mokotoff.c @@ -0,0 +1,549 @@ +/* + * Daitch-Mokotoff Soundex + * + * Copyright (c) 2021 Finance Norway + * Author: Dag Lem + * + * This implementation of the Daitch-Mokotoff Soundex System aims at high + * performance. + * + * - The processing of each phoneme is initiated by an O(1) table lookup. + * - For phonemes containing more than one character, a coding tree is traversed + * to process the complete phoneme. + * - The (alternate) soundex codes are produced digit by digit in-place in + * another tree structure. + * + * References: + * + * https://www.avotaynu.com/soundex.htm + * https://www.jewishgen.org/InfoFiles/Soundex.html + * https://familypedia.fandom.com/wiki/Daitch-Mokotoff_Soundex + * https://stevemorse.org/census/soundex.html (dmlat.php, dmsoundex.php) + * https://github.com/apache/commons-codec/ (dmrules.txt, DaitchMokotoffSoundex.java) + * https://metacpan.org/pod/Text::Phonetic (DaitchMokotoff.pm) + * + * A few notes on other implementations: + * + * - "J" is considered a vowel in dmlat.php + * - The official rules for "RS" are commented out in dmlat.php + * - Identical code digits for adjacent letters are not collapsed correctly in + * dmsoundex.php when double digit codes are involved. E.g. "BESST" yields + * 744300 instead of 743000 as for "BEST". + * - "Y" is not considered a vowel in DaitchMokotoffSoundex.java + * - Both dmlat.php and dmrules.txt have the same unofficial rules for "UE". + * - Coding of MN/NM + M/N differs between dmsoundex.php and DaitchMokotoffSoundex.java + * - No other known implementation yields the correct set of codes for e.g. + * "CJC" (55 54 545000 45 40 44). + * + * Permission to use, copy, modify, and distribute this software and its + * documentation for any purpose, without fee, and without a written agreement + * is hereby granted, provided that the above copyright notice and this + * paragraph and the following two paragraphs appear in all copies. + * + * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR + * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING + * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS + * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE + * POSSIBILITY OF SUCH DAMAGE. + * + * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES, + * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY + * AND FITNESS FOR A PARTICULAR PURPOSE. THE SOFTWARE PROVIDED HEREUNDER IS + * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO + * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS. +*/ + +#include "daitch_mokotoff.h" + +#ifndef DM_MAIN + +#include "postgres.h" +#include "utils/builtins.h" +#include "mb/pg_wchar.h&qu
daitch_mokotoff module
Hello, Please find attached a patch for the daitch_mokotoff module. This implements the Daitch-Mokotoff Soundex System, as described in https://www.avotaynu.com/soundex.htm The module is used in production at Finance Norway. In order to verify correctness, I have compared generated soundex codes with corresponding results from the implementation by Stephen P. Morse at https://stevemorse.org/census/soundex.html Where soundex codes differ, the daitch_mokotoff module has been found to be correct. The Morse implementation uses a few unofficial rules, and also has an error in the handling of adjacent identical code digits. Please see daitch_mokotoff.c for further references and comments. For reference, detailed instructions for soundex code comparison are attached. Best regards Dag Lem diff --git a/contrib/Makefile b/contrib/Makefile index 87bf87ab90..5ea729 100644 --- a/contrib/Makefile +++ b/contrib/Makefile @@ -14,6 +14,7 @@ SUBDIRS = \ btree_gist \ citext \ cube \ + daitch_mokotoff \ dblink \ dict_int \ dict_xsyn \ diff --git a/contrib/daitch_mokotoff/Makefile b/contrib/daitch_mokotoff/Makefile new file mode 100644 index 00..baec5e31d4 --- /dev/null +++ b/contrib/daitch_mokotoff/Makefile @@ -0,0 +1,25 @@ +# contrib/daitch_mokotoff/Makefile + +MODULE_big = daitch_mokotoff +OBJS = \ + $(WIN32RES) \ + daitch_mokotoff.o + +EXTENSION = daitch_mokotoff +DATA = daitch_mokotoff--1.0.sql +PGFILEDESC = "daitch_mokotoff - Daitch-Mokotoff Soundex System" + +HEADERS = daitch_mokotoff.h + +REGRESS = daitch_mokotoff + +ifdef USE_PGXS +PG_CONFIG = pg_config +PGXS := $(shell $(PG_CONFIG) --pgxs) +include $(PGXS) +else +subdir = contrib/daitch_mokotoff +top_builddir = ../.. +include $(top_builddir)/src/Makefile.global +include $(top_srcdir)/contrib/contrib-global.mk +endif diff --git a/contrib/daitch_mokotoff/daitch_mokotoff--1.0.sql b/contrib/daitch_mokotoff/daitch_mokotoff--1.0.sql new file mode 100644 index 00..0b5a643175 --- /dev/null +++ b/contrib/daitch_mokotoff/daitch_mokotoff--1.0.sql @@ -0,0 +1,8 @@ +/* contrib/daitch_mokotoff/daitch_mokotoff--1.0.sql */ + +-- complain if script is sourced in psql, rather than via CREATE EXTENSION +\echo Use "CREATE EXTENSION daitch_mokotoff" to load this file. \quit + +CREATE FUNCTION daitch_mokotoff(text) RETURNS text +AS 'MODULE_PATHNAME', 'daitch_mokotoff' +LANGUAGE C IMMUTABLE STRICT PARALLEL SAFE; diff --git a/contrib/daitch_mokotoff/daitch_mokotoff.c b/contrib/daitch_mokotoff/daitch_mokotoff.c new file mode 100644 index 00..9e66aee434 --- /dev/null +++ b/contrib/daitch_mokotoff/daitch_mokotoff.c @@ -0,0 +1,551 @@ +/* + * Daitch-Mokotoff Soundex + * + * Copyright (c) 2021 Finance Norway + * Author: Dag Lem + * + * This implementation of the Daitch-Mokotoff Soundex System aims at high + * performance. + * + * - The processing of each phoneme is initiated by an O(1) table lookup. + * - For phonemes containing more than one character, a coding tree is traversed + * to process the complete phoneme. + * - The (alternate) soundex codes are produced digit by digit in-place in + * another tree structure. + * + * References: + * + * https://www.avotaynu.com/soundex.htm + * https://www.jewishgen.org/InfoFiles/Soundex.html + * https://familypedia.fandom.com/wiki/Daitch-Mokotoff_Soundex + * https://stevemorse.org/phoneticinfo.htm (dmlat.php, dmsoundex.php) + * https://github.com/apache/commons-codec/ (dmrules.txt, DaitchMokotoffSoundex.java) + * https://metacpan.org/pod/Text::Phonetic (DaitchMokotoff.pm) + * + * A few notes on other implementations: + * + * - "J" is considered a vowel in dmlat.php + * - The official rules for "RS" are commented out in dmlat.php + * - Adjacent identical code digits are not collapsed correctly in dmsoundex.php + * when double digit codes are involved. E.g. "BESST" yields 744300 instead of + * 743000 as for "BEST". + * - "Y" is not considered a vowel in DaitchMokotoffSoundex.java + * - Both dmlat.php and dmrules.txt have the same unofficial rules for "UE". + * - Coding of MN/NM + M/N differs between dmsoundex.php and DaitchMokotoffSoundex.java + * - Neither dmsoundex.php nor DaitchMokotoffSoundex.java yields all valid codes for e.g. + * "CJC" (55 54 545000 45 40 44). + * + * Permission to use, copy, modify, and distribute this software and its + * documentation for any purpose, without fee, and without a written agreement + * is hereby granted, provided that the above copyright notice and this + * paragraph and the following two paragraphs appear in all copies. + * + * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR + * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING + * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS + * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE + * POSSIBILIT