Re: daitch_mokotoff module

2023-04-03 Thread Dag Lem
Dag Lem  writes:

> I sincerely hope this resolves any blocking issues with copyright /
> legalese / credits.
>

Can this now be considered ready for commiter, so that Paul or someone
else can flip the bit?

Best regards
Dag Lem




Re: daitch_mokotoff module

2023-03-06 Thread Dag Lem
Dag Lem  writes:

> Tomas Vondra  writes:
>
>> On 2/7/23 18:08, Paul Ramsey wrote:
>>> 
>>> 
>>>> On Feb 7, 2023, at 6:47 AM, Dag Lem  wrote:
>>>>
>>>> I just went by to check the status of the patch, and I noticed that
>>>> you've added yourself as reviewer earlier - great!
>>>>
>>>> Please tell me if there is anything I can do to help bring this across
>>>> the finish line.
>>> 
>>> Honestly, I had set it to Ready for Committer, but then I went to
>>> run regression one more time and my regression blew up. I found I
>>> couldn't enable the UTF tests without things failing. And I don't
>>> blame you! I think my installation is probably out-of-alignment in
>>> some way, but I didn't want to flip the Ready flag without having
>>> run everything through to completion, so I flipped it back. Also,
>>> are the UTF tests enabled by default? It wasn't clear to me that
>>> they were?
>>> 
>> The utf8 tests are enabled depending on the encoding returned by
>> getdatabaseencoding(). Systems with other encodings will simply use the
>> alternate .out file. And it works perfectly fine for me.
>>
>> IMHO it's ready for committer.
>>
>>
>> regards
>
> Yes, the UTF-8 tests follow the current best practice as has been
> explained to me earlier. The following patch exemplifies this:
>
> https://github.com/postgres/postgres/commit/c2e8bd27519f47ff56987b30eb34a01969b9a9e8
>
>

Can you please have a look at this again?

Best regards,

Dag Lem




Re: daitch_mokotoff module

2023-02-14 Thread Dag Lem
I sincerely hope this resolves any blocking issues with copyright /
legalese / credits.

Best regards

Dag Lem

diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile
index 0704894f88..12baf2d884 100644
--- a/contrib/fuzzystrmatch/Makefile
+++ b/contrib/fuzzystrmatch/Makefile
@@ -3,14 +3,15 @@
 MODULE_big = fuzzystrmatch
 OBJS = \
 	$(WIN32RES) \
+	daitch_mokotoff.o \
 	dmetaphone.o \
 	fuzzystrmatch.o
 
 EXTENSION = fuzzystrmatch
-DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.0--1.1.sql
+DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.1--1.2.sql fuzzystrmatch--1.0--1.1.sql
 PGFILEDESC = "fuzzystrmatch - similarities and distance between strings"
 
-REGRESS = fuzzystrmatch
+REGRESS = fuzzystrmatch fuzzystrmatch_utf8
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
@@ -22,3 +23,14 @@ top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 include $(top_srcdir)/contrib/contrib-global.mk
 endif
+
+# Force this dependency to be known even without dependency info built:
+daitch_mokotoff.o: daitch_mokotoff.h
+
+daitch_mokotoff.h: daitch_mokotoff_header.pl
+	perl $< $@
+
+distprep: daitch_mokotoff.h
+
+maintainer-clean:
+	rm -f daitch_mokotoff.h
diff --git a/contrib/fuzzystrmatch/daitch_mokotoff.c b/contrib/fuzzystrmatch/daitch_mokotoff.c
new file mode 100644
index 00..9215ab6530
--- /dev/null
+++ b/contrib/fuzzystrmatch/daitch_mokotoff.c
@@ -0,0 +1,567 @@
+/*
+ * Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * This module was originally sponsored by Finance Norway /
+ * Trafikkforsikringsforeningen, and implemented by Dag Lem 
+ *
+ * The implementation of the Daitch-Mokotoff Soundex System aims at correctness
+ * and high performance, and can be summarized as follows:
+ *
+ * - The processing of each phoneme is initiated by an O(1) table lookup.
+ * - For phonemes containing more than one character, a coding tree is traversed
+ *   to process the complete phoneme.
+ * - The (alternate) soundex codes are produced digit by digit in-place in
+ *   another tree structure.
+ *
+ * References:
+ *
+ * https://www.avotaynu.com/soundex.htm
+ * https://www.jewishgen.org/InfoFiles/Soundex.html
+ * https://familypedia.fandom.com/wiki/Daitch-Mokotoff_Soundex
+ * https://stevemorse.org/census/soundex.html (dmlat.php, dmsoundex.php)
+ * https://github.com/apache/commons-codec/ (dmrules.txt, DaitchMokotoffSoundex.java)
+ * https://metacpan.org/pod/Text::Phonetic (DaitchMokotoff.pm)
+ *
+ * A few notes on other implementations:
+ *
+ * - All other known implementations have the same unofficial rules for "UE",
+ *   these are also adapted by this implementation (0, 1, NC).
+ * - The only other known implementation which is capable of generating all
+ *   correct soundex codes in all cases is the JOS Soundex Calculator at
+ *   https://www.jewishgen.org/jos/jossound.htm
+ * - "J" is considered (only) a vowel in dmlat.php
+ * - The official rules for "RS" are commented out in dmlat.php
+ * - Identical code digits for adjacent letters are not collapsed correctly in
+ *   dmsoundex.php when double digit codes are involved. E.g. "BESST" yields
+ *   744300 instead of 743000 as for "BEST".
+ * - "J" is considered (only) a consonant in DaitchMokotoffSoundex.java
+ * - "Y" is not considered a vowel in DaitchMokotoffSoundex.java
+*/
+
+#include "postgres.h"
+
+#include "catalog/pg_type.h"
+#include "mb/pg_wchar.h"
+#include "utils/array.h"
+#include "utils/builtins.h"
+#include "utils/memutils.h"
+
+
+/*
+ * The soundex coding chart table is adapted from
+ * https://www.jewishgen.org/InfoFiles/Soundex.html
+ * See daitch_mokotoff_header.pl for details.
+*/
+
+/* Generated coding chart table */
+#include "daitch_mokotoff.h"
+
+#define DM_CODE_DIGITS 6
+
+/* Node in soundex code tree */
+struct dm_node
+{
+	int			soundex_length; /* Length of generated soundex code */
+	char		soundex[DM_CODE_DIGITS];	/* Soundex code */
+	int			is_leaf;		/* Candidate for complete soundex code */
+	int			last_update;	/* Letter number for last update of node */
+	char		code_digit;		/* Last code digit, 0 - 9 */
+
+	/*
+	 * One or two alternate code digits leading to this node. If there are two
+	 * digits, one of them is always an 'X'. Repeated code digits and 'X' lead
+	 * back to the same node.
+	 */
+	char		prev_code_digits[2];
+	/* One or two alternate code digits moving forward. */
+	char		next_code_digits[2];
+	/* ORed together code index(es) used to reach current node. */
+	int			prev_code_index;
+	int			next_code_index;
+	/* Possible nodes branching out from this node - digits 0-9. */
+	struct dm_node *children[10];
+	/* Next node in linked list. Alternating index for each iteration. */
+	struct dm_node *next[2];
+};
+
+typedef struct dm_node dm_node;
+
+
+/* Internal C implementation */
+static int	daitch_mokotoff_c

Re: daitch_mokotoff module

2023-02-09 Thread Dag Lem
Tomas Vondra  writes:

> On 2/8/23 15:31, Dag Lem wrote:
>> Alvaro Herrera  writes:
>> 
>>> On 2023-Jan-17, Dag Lem wrote:
>>>
>>>> + * Daitch-Mokotoff Soundex
>>>> + *
>>>> + * Copyright (c) 2021 Finance Norway
>>>> + * Author: Dag Lem 
>>>
>>> Hmm, I don't think we accept copyright lines that aren't "PostgreSQL
>>> Global Development Group".  Is it okay to use that, and update the year
>>> to 2023?  (Note that answering "no" very likely means your patch is not
>>> candidate for inclusion.)  Also, we tend not to have "Author:" lines.
>>>
>> 
>> You'll have to forgive me for not knowing about this rule:
>> 
>>   grep -ER "Copyright.*[0-9]{4}" contrib/ | grep -v PostgreSQL
>> 
>> In any case, I have checked with the copyright owner, and it would be OK
>> to assign the copyright to "PostgreSQL Global Development Group".
>> 
>
> I'm not entirely sure what's the rule either, and I'm a committer. My
> guess is these cases are either old and/or adding a code that already
> existed elsewhere (like some of the double metaphone, for example), or
> maybe both. But I'd bet we'd prefer not adding more ...
>
>> To avoid going back and forth with patches, how do you propose that the
>> sponsor and the author of the contributed module should be credited?
>> Woule something like this be acceptable?
>> 
>
> We generally credit contributors in two ways - by mentioning them in the
> commit message, and by listing them in the release notes (for individual
> features).
>

I'll ask again, would the proposed credits be acceptable? In this case,
the code already existed elsewhere (as in your example for double
metaphone) as a separate extension. The copyright owner is OK with
copyright assignment, however I find it quite unreasonable that proper
credits should not be given. Neither commit messages nor release notes
follow the contributed module, which is in its entirety contributed by
an external entity.

I'll also point out that in addition to credits in code all over the
place, PostgreSQL has much more prominent credits in the documentation:

  grep -ER "Author" doc/ | grep -v PostgreSQL

"Author" is even documented as a top level section in the Reference
Pages as "Author (only used in the contrib section)", see

  https://www.postgresql.org/docs/15/docguide-style.html#id-1.11.11.8.2

If there really exists some new rule which says that for new
contributions under contrib/, credits should not be allowed in any way
in either code or documentation (IANAL, but AFAIU this would be in
conflict with laws on author's moral rights in several countries), then
one would reasonably expect that you'd be upfront about this, both in
documentation, and also as the very first thing when a contribution is
first proposed for inclusion.

Best regards

Dag Lem




Re: daitch_mokotoff module

2023-02-08 Thread Dag Lem
Alvaro Herrera  writes:

> On 2023-Jan-17, Dag Lem wrote:
>
>> + * Daitch-Mokotoff Soundex
>> + *
>> + * Copyright (c) 2021 Finance Norway
>> + * Author: Dag Lem 
>
> Hmm, I don't think we accept copyright lines that aren't "PostgreSQL
> Global Development Group".  Is it okay to use that, and update the year
> to 2023?  (Note that answering "no" very likely means your patch is not
> candidate for inclusion.)  Also, we tend not to have "Author:" lines.
>

You'll have to forgive me for not knowing about this rule:

  grep -ER "Copyright.*[0-9]{4}" contrib/ | grep -v PostgreSQL

In any case, I have checked with the copyright owner, and it would be OK
to assign the copyright to "PostgreSQL Global Development Group".

To avoid going back and forth with patches, how do you propose that the
sponsor and the author of the contributed module should be credited?
Woule something like this be acceptable?

/*
 * Daitch-Mokotoff Soundex
 *
 * Copyright (c) 2023, PostgreSQL Global Development Group
 *
 * This module was sponsored by Finance Norway / Trafikkforsikringsforeningen
 * and implemented by Dag Lem 
 *
 ...

[...]

>
> We don't keep a separate copyright statement in the file; rather we
> assume that all files are under the PostgreSQL license, which is in the
> COPYRIGHT file at the top of the tree.  Changing it thus has the side
> effect that these disclaim notes refer to the University of California
> rather than "the Author".  IANAL.

OK, no problem. Note that you will again find counterexamples under
contrib/ (and in some other places):

  grep -R "Permission to use" .

> I think we should add SPDX markers to all the files we distribute:
> /* SPDX-License-Identifier: PostgreSQL */
>
> https://spdx.dev/ids/
> https://spdx.org/licenses/PostgreSQL.html

As far as I can tell, this is not included in any file so far, and is
thus better left to decide and implement by someone else.

Best regards

Dag Lem




Re: daitch_mokotoff module

2023-02-08 Thread Dag Lem
Tomas Vondra  writes:

> On 2/7/23 18:08, Paul Ramsey wrote:
>> 
>> 
>>> On Feb 7, 2023, at 6:47 AM, Dag Lem  wrote:
>>>
>>> I just went by to check the status of the patch, and I noticed that
>>> you've added yourself as reviewer earlier - great!
>>>
>>> Please tell me if there is anything I can do to help bring this across
>>> the finish line.
>> 
>> Honestly, I had set it to Ready for Committer, but then I went to
>> run regression one more time and my regression blew up. I found I
>> couldn't enable the UTF tests without things failing. And I don't
>> blame you! I think my installation is probably out-of-alignment in
>> some way, but I didn't want to flip the Ready flag without having
>> run everything through to completion, so I flipped it back. Also,
>> are the UTF tests enabled by default? It wasn't clear to me that
>> they were?
>> 
> The utf8 tests are enabled depending on the encoding returned by
> getdatabaseencoding(). Systems with other encodings will simply use the
> alternate .out file. And it works perfectly fine for me.
>
> IMHO it's ready for committer.
>
>
> regards

Yes, the UTF-8 tests follow the current best practice as has been
explained to me earlier. The following patch exemplifies this:

https://github.com/postgres/postgres/commit/c2e8bd27519f47ff56987b30eb34a01969b9a9e8


Best regards,

Dag Lem




Re: daitch_mokotoff module

2023-02-07 Thread Dag Lem
Hi Paul,

I just went by to check the status of the patch, and I noticed that
you've added yourself as reviewer earlier - great!

Please tell me if there is anything I can do to help bring this across
the finish line.

Best regards,

Dag Lem




Re: daitch_mokotoff module

2023-01-20 Thread Dag Lem
Alvaro Herrera  writes:

> On 2023-Jan-05, Dag Lem wrote:
>
>> Is there anything else I should do here, to avoid the status being
>> incorrectly stuck at "Waiting for Author" again.
>
> Just mark it Needs Review for now.  I'll be back from vacation on Jan
> 11th and can have a look then (or somebody else can, perhaps.)

Paul Ramsey had a few comments in the mean time, and based on this I
have produced (yet another) patch, with improved documentation.

However it's still not marked as "Ready for Committer" - can you please
take a look again?

Best regards

Dag Lem




Re: daitch_mokotoff module

2023-01-17 Thread Dag Lem
Paul Ramsey  writes:

>> On Jan 12, 2023, at 7:30 AM, Dag Lem  wrote:
>> 

[...]

>> 
>> Sure, I can do that. You don't think this much example text will be
>> TL;DR?
>
> I can only speak for myself, but examples are the meat of
> documentation learning, so as long as they come with enough
> explanatory context to be legible it's worth having them, IMO.
>

I have updated the documentation, hopefully it is more accessible now.

I also corrected documentation for the other functions in fuzzystrmatch
(function name and argtype in the wrong order).

Crossing fingers that someone will eventually change the status to
"Ready for Committer" :-)

Best regards,

Dag Lem

diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile
index 0704894f88..12baf2d884 100644
--- a/contrib/fuzzystrmatch/Makefile
+++ b/contrib/fuzzystrmatch/Makefile
@@ -3,14 +3,15 @@
 MODULE_big = fuzzystrmatch
 OBJS = \
 	$(WIN32RES) \
+	daitch_mokotoff.o \
 	dmetaphone.o \
 	fuzzystrmatch.o
 
 EXTENSION = fuzzystrmatch
-DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.0--1.1.sql
+DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.1--1.2.sql fuzzystrmatch--1.0--1.1.sql
 PGFILEDESC = "fuzzystrmatch - similarities and distance between strings"
 
-REGRESS = fuzzystrmatch
+REGRESS = fuzzystrmatch fuzzystrmatch_utf8
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
@@ -22,3 +23,14 @@ top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 include $(top_srcdir)/contrib/contrib-global.mk
 endif
+
+# Force this dependency to be known even without dependency info built:
+daitch_mokotoff.o: daitch_mokotoff.h
+
+daitch_mokotoff.h: daitch_mokotoff_header.pl
+	perl $< $@
+
+distprep: daitch_mokotoff.h
+
+maintainer-clean:
+	rm -f daitch_mokotoff.h
diff --git a/contrib/fuzzystrmatch/daitch_mokotoff.c b/contrib/fuzzystrmatch/daitch_mokotoff.c
new file mode 100644
index 00..2548903770
--- /dev/null
+++ b/contrib/fuzzystrmatch/daitch_mokotoff.c
@@ -0,0 +1,582 @@
+/*
+ * Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2021 Finance Norway
+ * Author: Dag Lem 
+ *
+ * This implementation of the Daitch-Mokotoff Soundex System aims at high
+ * performance.
+ *
+ * - The processing of each phoneme is initiated by an O(1) table lookup.
+ * - For phonemes containing more than one character, a coding tree is traversed
+ *   to process the complete phoneme.
+ * - The (alternate) soundex codes are produced digit by digit in-place in
+ *   another tree structure.
+ *
+ *  References:
+ *
+ * https://www.avotaynu.com/soundex.htm
+ * https://www.jewishgen.org/InfoFiles/Soundex.html
+ * https://familypedia.fandom.com/wiki/Daitch-Mokotoff_Soundex
+ * https://stevemorse.org/census/soundex.html (dmlat.php, dmsoundex.php)
+ * https://github.com/apache/commons-codec/ (dmrules.txt, DaitchMokotoffSoundex.java)
+ * https://metacpan.org/pod/Text::Phonetic (DaitchMokotoff.pm)
+ *
+ * A few notes on other implementations:
+ *
+ * - All other known implementations have the same unofficial rules for "UE",
+ *   these are also adapted by this implementation (0, 1, NC).
+ * - The only other known implementation which is capable of generating all
+ *   correct soundex codes in all cases is the JOS Soundex Calculator at
+ *   https://www.jewishgen.org/jos/jossound.htm
+ * - "J" is considered (only) a vowel in dmlat.php
+ * - The official rules for "RS" are commented out in dmlat.php
+ * - Identical code digits for adjacent letters are not collapsed correctly in
+ *   dmsoundex.php when double digit codes are involved. E.g. "BESST" yields
+ *   744300 instead of 743000 as for "BEST".
+ * - "J" is considered (only) a consonant in DaitchMokotoffSoundex.java
+ * - "Y" is not considered a vowel in DaitchMokotoffSoundex.java
+ *
+ * Permission to use, copy, modify, and distribute this software and its
+ * documentation for any purpose, without fee, and without a written agreement
+ * is hereby granted, provided that the above copyright notice and this
+ * paragraph and the following two paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+ * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+ * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+ * AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+ * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+ * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+*/
+
+#include "postgres.h"
+
+#include "catalog/pg_type.h"
+#include 

Re: daitch_mokotoff module

2023-01-12 Thread Dag Lem
Paul Ramsey  writes:

> On Mon, Jan 2, 2023 at 2:03 PM Dag Lem  wrote:
>
>> I also improved on the documentation example (using Full Text Search).
>> AFAIK you can't make general queries like that using arrays, however in
>> any case I must admit that text arrays seem like more natural building
>> blocks than space delimited text here.
>
> This is a fun addition to fuzzystrmatch.

I'm glad to hear it! :-)

>
> While it's a little late in the game, I'll just put it out there:
> daitch_mokotoff() is way harder to type than soundex_dm(). Not sure
> how you feel about that.

I chose the name in order to follow the naming of the other functions in
fuzzystrmatch, which as far as I can tell are given the name which each
algorithm is known by.

Personally I don't think it's worth it to deviate from the naming of the
other functions just to avoid typing a few characters, and I certainly
don't think daitch_mokotoff is any harder to get right than
levenshtein_less_equal ;-)

So, if I were to decide, I wouldn't change the name of the function.
However I'm obviously not calling the shots on what goes into PostgreSQL
- perhaps someone else would like to weigh in on this?

>
> On the documentation, I found the leap directly into the tsquery
> example a bit too big. Maybe start with a very simple example,
>
> --
> dm=# SELECT daitch_mokotoff('Schwartzenegger'),
> daitch_mokotoff('Swartzenegger');
>
>  daitch_mokotoff | daitch_mokotoff
> -+-
>  {479465}| {479465}
> --
>
> Then transition into a more complex example that illustrates the GIN
> index technique you mention in the text, but do not show:
>
> --
> CREATE TABLE dm_gin (source text, dm text[]);
>
> INSERT INTO dm_gin (source) VALUES
> ('Swartzenegger'),
> ('John'),
> ('James'),
> ('Steinman'),
> ('Steinmetz');
>
> UPDATE dm_gin SET dm = daitch_mokotoff(source);
>
> CREATE INDEX dm_gin_x ON dm_gin USING GIN (dm);
>
> SELECT * FROM dm_gin WHERE dm && daitch_mokotoff('Schwartzenegger');
> --

Sure, I can do that. You don't think this much example text will be
TL;DR?

>
> And only then go into the tsearch example. Incidentally, what does the
> tsearch approach provide that the simple GIN approach does not?

The example shows how to do a simultaneous match on first AND last
names, where the first and last names (any number of names) are stored
in the same indexed column, and the order of the names in the index and
the search term does not matter.

If you were to use the GIN "&&" operator, you would get a match if
either the first OR the last name matches. If you were to use the GIN
"@>" operator, you would *not* get a match if the search term contains
more soundex codes than the indexed name.

E.g. this yields a correct match:
SELECT soundex_tsvector('John Yamson') @@ soundex_tsquery('John Jameson');

While this yields a false positive:
SELECT (daitch_mokotoff('John') || daitch_mokotoff('Yamson')) && 
(daitch_mokotoff('John') || daitch_mokotoff('Doe'));

And this yields a false negative:
SELECT (daitch_mokotoff('John') || daitch_mokotoff('Yamson')) @> 
(daitch_mokotoff('John') || daitch_mokotoff('Jameson'));

This may explained better by simply showing the output of
soundex_tsvector and soundex_tsquery:

SELECT soundex_tsvector('John Yamson');
 soundex_tsvector 
--
 '16':1 '164600':3 '46':2

SELECT soundex_tsquery('John Jameson');
  soundex_tsquery  
---
 ( '16' | '46' ) & ( '164600' | '464600' )

> Ideally explain that briefly before launching into the example. With
> all the custom functions and so on it's a little involved, so maybe if
> there's not a huge win in using that approach drop it entirely?

I believe this functionality is quite useful, and that it's actually
what's called for in many situations. So, I'd rather not drop this
example.

>
> ATB,
> P
>

Best regards,

Dag Lem




Re: daitch_mokotoff module

2023-01-06 Thread Dag Lem
Alvaro Herrera  writes:

> On 2023-Jan-05, Dag Lem wrote:
>
>> Is there anything else I should do here, to avoid the status being
>> incorrectly stuck at "Waiting for Author" again.
>
> Just mark it Needs Review for now.  I'll be back from vacation on Jan
> 11th and can have a look then (or somebody else can, perhaps.)

OK, done. Have a nice vacation!

Best regards

Dag Lem




Re: daitch_mokotoff module

2023-01-05 Thread Dag Lem
Is there anything else I should do here, to avoid the status being
incorrectly stuck at "Waiting for Author" again.

Best regards

Dag Lem




Re: daitch_mokotoff module

2023-01-02 Thread Dag Lem
Sorry about the latest unfinished email - don't know what key
combination I managed to hit there.

Alvaro Herrera  writes:

> Hello
>
> On 2022-Dec-23, Dag Lem wrote:
>

[...]

>
> So, yes, I'm proposing that we returns those as array elements and that
> @> is used to match them.
>

Looking into the array operators I guess that to match such arrays
directly one would actually use && (overlaps) rather than @> (contains),
but I digress.

The function is changed to return an array of soundex codes - I hope it
is now to your liking :-)

I also improved on the documentation example (using Full Text Search).
AFAIK you can't make general queries like that using arrays, however in
any case I must admit that text arrays seem like more natural building
blocks than space delimited text here.

[...]

>> BTW Vera 79 does not match Veras 794000, because they don't sound
>> the same (up to the maximum soundex code length).
>
> No, and maybe that's okay because they have different codes.  But they
> are both similar, in Daitch-Mokotoff, to Borja, which has two codes,
> 79 and 794000.  (Any Spanish speaker will readily tell you that
> neither Vera nor Veras are similar in any way to Borja, but D-M has
> chosen to say that each of them matches one of Borjas' codes.  So they
> *are* related, even though indirectly, and as a genealogist you *may* be
> interested in getting a match for a person called Vera when looking for
> relatives to a person called Veras.  And, as a Spanish speaker, that
> would make a lot of sense to me.)

It is what it is - we can't call it Daitch-Mokotoff Soundex while
implementing something else. Having said that, one can always pre- or
postprocess to tweak the results.

Daitch-Mokotoff Soundex is known to produce false positives, but that is
in many cases not a problem.

Even though it's clearly tuned for Jewish names, the soundex algorithm
seems to work just fine for European names (we use it to match mostly
Norwegian names).


Best regards

Dag Lem

diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile
index 0704894f88..12baf2d884 100644
--- a/contrib/fuzzystrmatch/Makefile
+++ b/contrib/fuzzystrmatch/Makefile
@@ -3,14 +3,15 @@
 MODULE_big = fuzzystrmatch
 OBJS = \
 	$(WIN32RES) \
+	daitch_mokotoff.o \
 	dmetaphone.o \
 	fuzzystrmatch.o
 
 EXTENSION = fuzzystrmatch
-DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.0--1.1.sql
+DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.1--1.2.sql fuzzystrmatch--1.0--1.1.sql
 PGFILEDESC = "fuzzystrmatch - similarities and distance between strings"
 
-REGRESS = fuzzystrmatch
+REGRESS = fuzzystrmatch fuzzystrmatch_utf8
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
@@ -22,3 +23,14 @@ top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 include $(top_srcdir)/contrib/contrib-global.mk
 endif
+
+# Force this dependency to be known even without dependency info built:
+daitch_mokotoff.o: daitch_mokotoff.h
+
+daitch_mokotoff.h: daitch_mokotoff_header.pl
+	perl $< $@
+
+distprep: daitch_mokotoff.h
+
+maintainer-clean:
+	rm -f daitch_mokotoff.h
diff --git a/contrib/fuzzystrmatch/daitch_mokotoff.c b/contrib/fuzzystrmatch/daitch_mokotoff.c
new file mode 100644
index 00..2548903770
--- /dev/null
+++ b/contrib/fuzzystrmatch/daitch_mokotoff.c
@@ -0,0 +1,582 @@
+/*
+ * Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2021 Finance Norway
+ * Author: Dag Lem 
+ *
+ * This implementation of the Daitch-Mokotoff Soundex System aims at high
+ * performance.
+ *
+ * - The processing of each phoneme is initiated by an O(1) table lookup.
+ * - For phonemes containing more than one character, a coding tree is traversed
+ *   to process the complete phoneme.
+ * - The (alternate) soundex codes are produced digit by digit in-place in
+ *   another tree structure.
+ *
+ *  References:
+ *
+ * https://www.avotaynu.com/soundex.htm
+ * https://www.jewishgen.org/InfoFiles/Soundex.html
+ * https://familypedia.fandom.com/wiki/Daitch-Mokotoff_Soundex
+ * https://stevemorse.org/census/soundex.html (dmlat.php, dmsoundex.php)
+ * https://github.com/apache/commons-codec/ (dmrules.txt, DaitchMokotoffSoundex.java)
+ * https://metacpan.org/pod/Text::Phonetic (DaitchMokotoff.pm)
+ *
+ * A few notes on other implementations:
+ *
+ * - All other known implementations have the same unofficial rules for "UE",
+ *   these are also adapted by this implementation (0, 1, NC).
+ * - The only other known implementation which is capable of generating all
+ *   correct soundex codes in all cases is the JOS Soundex Calculator at
+ *   https://www.jewishgen.org/jos/jossound.htm
+ * - "J" is considered (only) a vowel in dmlat.php
+ * - The official rules for "RS" are commented out in dmlat.php
+ * - Identical code digits for adjacent letters are not collapsed correctly in
+ *   dmsoundex.php when double digit codes are involved. E.g. "BESST" y

Re: daitch_mokotoff module

2023-01-02 Thread Dag Lem
Alvaro Herrera  writes:

> Hello
>
> On 2022-Dec-23, Dag Lem wrote:
>

[...]

> So, yes, I'm proposing that we returns those as array elements and that
> @> is used to match them.

Looking into the array operators I guess that to match such arrays
directly one would actually use && (overlaps) rather than @> (contains),
but I digress.

The function is changed to return an array of soundex codes - I hope it
is now to your liking :-)

I also improved on the documentation example (using Full Text Search).
AFAIK you can't make general queries like that using arrays, however in
any case I must admit that text arrays seem like more natural building
blocks than space delimited text here.

Search to perform

is the best match for Daitch-Mokotoff, however

, but
in any case I've changed it into return arrays now. I hope it is to your
liking.

>
>> Daitch-Mokotoff Soundex indexes alternative sounds for the same name,
>> however if I understand correctly, you want to index names by single
>> sounds, linking all alike sounding names to the same soundex code. I
>> fail to see how that is useful - if you want to find matches for a name,
>> you simply match against all indexed names. If you only consider one
>> sound, you won't find all names that match.
>
> Hmm, I think we're saying the same thing, but from opposite points of
> view.  No, I want each name to return multiple codes, but that those
> multiple codes can be treated as a multiple-value array of codes, rather
> than as a single string of space-separated codes.
>
>> In any case, as explained in the documentation, the implementation is
>> intended to be a companion to Full Text Search, thus text is the natural
>> representation for the soundex codes.
>
> Hmm, I don't agree with this point.  The numbers are representations of
> the strings, but they don't necessarily have to be strings themselves.
>
>
>> BTW Vera 79 does not match Veras 794000, because they don't sound
>> the same (up to the maximum soundex code length).
>
> No, and maybe that's okay because they have different codes.  But they
> are both similar, in Daitch-Mokotoff, to Borja, which has two codes,
> 79 and 794000.  (Any Spanish speaker will readily tell you that
> neither Vera nor Veras are similar in any way to Borja, but D-M has
> chosen to say that each of them matches one of Borjas' codes.  So they
> *are* related, even though indirectly, and as a genealogist you *may* be
> interested in getting a match for a person called Vera when looking for
> relatives to a person called Veras.  And, as a Spanish speaker, that
> would make a lot of sense to me.)
>
>
> Now, it's true that I've chosen to use Spanish names for my silly little
> experiment.  Maybe this isn't terribly useful as a practical example,
> because this algorithm seems to have been designed for Jew surnames and
> perhaps not many (or not any) Jews had Spanish surnames.  I don't know;
> I'm not a Jew myself (though Noah Gordon tells the tale of a Spanish Jew
> called Josep Álvarez in his book "The Winemaker", so I guess it's not
> impossible).  Anyway, I suspect if you repeat the experiment with names
> of other origins, you'll find pretty much the same results apply there,
> and that is the whole reason D-M returns multiple codes and not just
> one.
>
>
> Merry Christmas :-)

-- 
Dag




Re: daitch_mokotoff module

2022-12-24 Thread Dag Lem
Dag Lem  writes:

> Alvaro Herrera  writes:
>
>> On 2022-Dec-23, Alvaro Herrera wrote:
>>
>
> [...]
>
>> I tried downloading a list of surnames from here
>> https://www.bibliotecadenombres.com/apellidos/apellidos-espanoles/
>> pasted that in a text file and \copy'ed it into a table.  Then I ran
>> this query
>>
>> select string_agg(a, ' ' order by a), daitch_mokotoff(a), count(*)
>> from apellidos
>> group by daitch_mokotoff(a)
>> order by count(*) desc;
>>
>> so I have a first entry like this
>>
>> string_agg │ Balasco Balles Belasco Belles Blas Blasco Fallas Feliz
>> Palos Pelaez Plaza Valles Vallez Velasco Velez Veliz Veloz Villas
>> daitch_mokotoff │ 784000
>> count   │ 18
>>
>> but then I have a bunch of other entries with the same code 784000 as
>> alternative codes,
>>
>> string_agg  │ Velazco
>> daitch_mokotoff │ 784500 784000
>> count   │ 1
>>
>> string_agg  │ Palacio
>> daitch_mokotoff │ 785000 784000
>> count   │ 1
>>
>> I suppose I need to group these together somehow, and it would make more
>> sense to do that if the values were arrays.
>>
>>
>> If I scroll a bit further down and choose, say, 794000 (a relatively
>> popular one), then I have this
>>
>> string_agg │ Barraza Barrios Barros Bras Ferraz Frias Frisco Parras
>> Peraza Peres Perez Porras Varas Veras
>> daitch_mokotoff │ 794000
>> count   │ 14
>>
>> and looking for that code in the result I also get these three
>>
>> string_agg  │ Barca Barco Parco
>> daitch_mokotoff │ 795000 794000
>> count   │ 3
>>
>> string_agg  │ Borja
>> daitch_mokotoff │ 79 794000
>> count   │ 1
>>
>> string_agg  │ Borjas
>> daitch_mokotoff │ 794000 794400
>> count   │ 1
>>
>> and then I see that I should also search for possible matches in codes
>> 795000, 79 and 794400, so that gives me
>>
>> string_agg │ Baria Baro Barrio Barro Berra Borra Feria Para Parra
>> Perea Vera
>> daitch_mokotoff │ 79
>> count   │ 11
>>
>> string_agg  │ Barriga Borge Borrego Burgo Fraga
>> daitch_mokotoff │ 795000
>> count   │ 5
>>
>> string_agg  │ Borjas
>> daitch_mokotoff │ 794000 794400
>> count   │ 1
>>
>> which look closely related (compare "Veras" in the first to "Vera" in
>> the later set.  If you ignore that pseudo-match, you're likely to miss
>> possible family relationships.)
>>
>>
>> I suppose if I were a genealogy researcher, I would be helped by having
>> each of these codes behave as a separate unit, rather than me having to
>> split the string into the several possible contained values.
>
> It seems to me like you're trying to use soundex coding for something it
> was never designed for.
>
> As stated in my previous mail, soundex algorithms are designed to index
> names on some representation of sound, so that alike sounding names with
> different spellings will match, and as shown in the documentation
> example, that is exactly what the implementation facilitates.
>
> Daitch-Mokotoff Soundex indexes alternative sounds for the same name,
> however if I understand correctly, you want to index names by single
> sounds, linking all alike sounding names to the same soundex code. I
> fail to see how that is useful - if you want to find matches for a name,
> you simply match against all indexed names. If you only consider one
> sound, you won't find all names that match.
>
> In any case, as explained in the documentation, the implementation is
> intended to be a companion to Full Text Search, thus text is the natural
> representation for the soundex codes.
>
> BTW Vera 79 does not match Veras 794000, because they don't sound
> the same (up to the maximum soundex code length).
>

I've been sleeping on this, and perhaps the normal use case can just as
well (or better) be covered by the "@>" array operator? I originally
implemented similar functionality using another soundex algorithm more
than a decade ago, and either arrays couldn't be GIN indexed back then,
or I simply missed it. I'll have to get back to this - now it's
Christmas!

Merry Christmas!

Best regards,

Dag Lem




Re: daitch_mokotoff module

2022-12-23 Thread Dag Lem
Alvaro Herrera  writes:

> On 2022-Dec-23, Alvaro Herrera wrote:
>

[...]

> I tried downloading a list of surnames from here
> https://www.bibliotecadenombres.com/apellidos/apellidos-espanoles/
> pasted that in a text file and \copy'ed it into a table.  Then I ran
> this query
>
> select string_agg(a, ' ' order by a), daitch_mokotoff(a), count(*)
> from apellidos
> group by daitch_mokotoff(a)
> order by count(*) desc;
>
> so I have a first entry like this
>
> string_agg │ Balasco Balles Belasco Belles Blas Blasco Fallas Feliz
> Palos Pelaez Plaza Valles Vallez Velasco Velez Veliz Veloz Villas
> daitch_mokotoff │ 784000
> count   │ 18
>
> but then I have a bunch of other entries with the same code 784000 as
> alternative codes,
>
> string_agg  │ Velazco
> daitch_mokotoff │ 784500 784000
> count   │ 1
>
> string_agg  │ Palacio
> daitch_mokotoff │ 785000 784000
> count   │ 1
>
> I suppose I need to group these together somehow, and it would make more
> sense to do that if the values were arrays.
>
>
> If I scroll a bit further down and choose, say, 794000 (a relatively
> popular one), then I have this
>
> string_agg │ Barraza Barrios Barros Bras Ferraz Frias Frisco Parras
> Peraza Peres Perez Porras Varas Veras
> daitch_mokotoff │ 794000
> count   │ 14
>
> and looking for that code in the result I also get these three
>
> string_agg  │ Barca Barco Parco
> daitch_mokotoff │ 795000 794000
> count   │ 3
>
> string_agg  │ Borja
> daitch_mokotoff │ 79 794000
> count   │ 1
>
> string_agg  │ Borjas
> daitch_mokotoff │ 794000 794400
> count   │ 1
>
> and then I see that I should also search for possible matches in codes
> 795000, 79 and 794400, so that gives me
>
> string_agg │ Baria Baro Barrio Barro Berra Borra Feria Para Parra
> Perea Vera
> daitch_mokotoff │ 79
> count   │ 11
>
> string_agg  │ Barriga Borge Borrego Burgo Fraga
> daitch_mokotoff │ 795000
> count   │ 5
>
> string_agg  │ Borjas
> daitch_mokotoff │ 794000 794400
> count   │ 1
>
> which look closely related (compare "Veras" in the first to "Vera" in
> the later set.  If you ignore that pseudo-match, you're likely to miss
> possible family relationships.)
>
>
> I suppose if I were a genealogy researcher, I would be helped by having
> each of these codes behave as a separate unit, rather than me having to
> split the string into the several possible contained values.

It seems to me like you're trying to use soundex coding for something it
was never designed for.

As stated in my previous mail, soundex algorithms are designed to index
names on some representation of sound, so that alike sounding names with
different spellings will match, and as shown in the documentation
example, that is exactly what the implementation facilitates.

Daitch-Mokotoff Soundex indexes alternative sounds for the same name,
however if I understand correctly, you want to index names by single
sounds, linking all alike sounding names to the same soundex code. I
fail to see how that is useful - if you want to find matches for a name,
you simply match against all indexed names. If you only consider one
sound, you won't find all names that match.

In any case, as explained in the documentation, the implementation is
intended to be a companion to Full Text Search, thus text is the natural
representation for the soundex codes.

BTW Vera 79 does not match Veras 794000, because they don't sound
the same (up to the maximum soundex code length).

Best regards

Dag Lem




Re: daitch_mokotoff module

2022-12-23 Thread Dag Lem
Alvaro Herrera  writes:

> I wonder why do you have it return the multiple alternative codes as a
> space-separated string.  Maybe an array would be more appropriate.  Even
> on your documented example use, the first thing you do is split it on
> spaces.

In the example, the *input* is split on whitespace, the returned soundex
codes are not. The splitting of the input is done in order to code each
word separately. One of the stated rules of the Daitch-Mokotoff Soundex
Coding is that "When a name consists of more than one word, it is coded
as if one word", and this may not always be desired. See
https://www.avotaynu.com/soundex.htm or
https://www.jewishgen.org/InfoFiles/soundex.html for the rules.

The intended use for the Daitch-Mokotoff soundex, as for any other
soundex algorithm, is to index names (or words) on some representation
of sound, so that alike sounding names with different spellings will
match.

In PostgreSQL, the Daitch-Mokotoff Soundex and Full Text Search makes
for a powerful combination to match alike sounding names. Full Text
Search (as any other free text search engine) works with documents, and
thus the Daitch-Mokotoff Soundex implementation produces documents
(words separated by space). As stated in the documentation: "Any
alternative soundex codes are separated by space, which makes the
returned text suited for use in Full Text Search".

Best regards,

Dag Lem




Re: daitch_mokotoff module

2022-12-23 Thread Dag Lem
Tom Lane  writes:

> Alvaro Herrera  writes:
>> On 2022-Dec-22, Dag Lem wrote:
>>> This should hopefully fix the last Cfbot failures, by exclusion of
>>> daitch_mokotoff.h from headerscheck and cpluspluscheck.
>
>> Hmm, maybe it'd be better to move the typedefs to the .h file instead.
>
> Indeed, that sounds like exactly the wrong way to fix such a problem.
> The bar for excluding stuff from headerscheck needs to be very high.
>

OK, I've moved enough declarations back to the generated header file
again so as to avoid excluding it from headerscheck and cpluspluscheck.

Best regards,

Dag Lem

diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile
index 0704894f88..12baf2d884 100644
--- a/contrib/fuzzystrmatch/Makefile
+++ b/contrib/fuzzystrmatch/Makefile
@@ -3,14 +3,15 @@
 MODULE_big = fuzzystrmatch
 OBJS = \
 	$(WIN32RES) \
+	daitch_mokotoff.o \
 	dmetaphone.o \
 	fuzzystrmatch.o
 
 EXTENSION = fuzzystrmatch
-DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.0--1.1.sql
+DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.1--1.2.sql fuzzystrmatch--1.0--1.1.sql
 PGFILEDESC = "fuzzystrmatch - similarities and distance between strings"
 
-REGRESS = fuzzystrmatch
+REGRESS = fuzzystrmatch fuzzystrmatch_utf8
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
@@ -22,3 +23,14 @@ top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 include $(top_srcdir)/contrib/contrib-global.mk
 endif
+
+# Force this dependency to be known even without dependency info built:
+daitch_mokotoff.o: daitch_mokotoff.h
+
+daitch_mokotoff.h: daitch_mokotoff_header.pl
+	perl $< $@
+
+distprep: daitch_mokotoff.h
+
+maintainer-clean:
+	rm -f daitch_mokotoff.h
diff --git a/contrib/fuzzystrmatch/daitch_mokotoff.c b/contrib/fuzzystrmatch/daitch_mokotoff.c
new file mode 100644
index 00..e809d4a39e
--- /dev/null
+++ b/contrib/fuzzystrmatch/daitch_mokotoff.c
@@ -0,0 +1,582 @@
+/*
+ * Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2021 Finance Norway
+ * Author: Dag Lem 
+ *
+ * This implementation of the Daitch-Mokotoff Soundex System aims at high
+ * performance.
+ *
+ * - The processing of each phoneme is initiated by an O(1) table lookup.
+ * - For phonemes containing more than one character, a coding tree is traversed
+ *   to process the complete phoneme.
+ * - The (alternate) soundex codes are produced digit by digit in-place in
+ *   another tree structure.
+ *
+ *  References:
+ *
+ * https://www.avotaynu.com/soundex.htm
+ * https://www.jewishgen.org/InfoFiles/Soundex.html
+ * https://familypedia.fandom.com/wiki/Daitch-Mokotoff_Soundex
+ * https://stevemorse.org/census/soundex.html (dmlat.php, dmsoundex.php)
+ * https://github.com/apache/commons-codec/ (dmrules.txt, DaitchMokotoffSoundex.java)
+ * https://metacpan.org/pod/Text::Phonetic (DaitchMokotoff.pm)
+ *
+ * A few notes on other implementations:
+ *
+ * - All other known implementations have the same unofficial rules for "UE",
+ *   these are also adapted by this implementation (0, 1, NC).
+ * - The only other known implementation which is capable of generating all
+ *   correct soundex codes in all cases is the JOS Soundex Calculator at
+ *   https://www.jewishgen.org/jos/jossound.htm
+ * - "J" is considered (only) a vowel in dmlat.php
+ * - The official rules for "RS" are commented out in dmlat.php
+ * - Identical code digits for adjacent letters are not collapsed correctly in
+ *   dmsoundex.php when double digit codes are involved. E.g. "BESST" yields
+ *   744300 instead of 743000 as for "BEST".
+ * - "J" is considered (only) a consonant in DaitchMokotoffSoundex.java
+ * - "Y" is not considered a vowel in DaitchMokotoffSoundex.java
+ *
+ * Permission to use, copy, modify, and distribute this software and its
+ * documentation for any purpose, without fee, and without a written agreement
+ * is hereby granted, provided that the above copyright notice and this
+ * paragraph and the following two paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+ * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+ * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+ * AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+ * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+ * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+*/
+
+#include "postgres.h"
+
+#include "lib/stringinfo.h"
+#include "mb/pg_wchar.h"
+#include "utils/builtins.h"
+#include "utils/memutils.

Re: daitch_mokotoff module

2022-12-23 Thread Dag Lem
Andres Freund  writes:

> On 2022-12-22 14:27:54 +0100, Dag Lem wrote:
>> This should hopefully fix the last Cfbot failures, by exclusion of
>> daitch_mokotoff.h from headerscheck and cpluspluscheck.
>
> Btw, you can do the same tests as cfbot in your own repo by enabling CI
> in a github repo. See src/tools/ci/README
>

OK, thanks, I've set it up now.

Best regards,

Dag Lem




Re: daitch_mokotoff module

2022-12-22 Thread Dag Lem
Dag Lem  writes:

> Hi Ian,
>
> Ian Lawrence Barwick  writes:
>

[...]

>> I see you provided some feedback on
>> https://commitfest.postgresql.org/36/3468/,
>> though the patch seems to have not been accepted (but not
>> conclusively rejected
>> either). If you still have the chance to review another patch (or
>> more) it would
>> be much appreciated, as there's quite a few piling up. Things like
>> documentation
>> or small improvements to client applications are always a good place to 
>> start.
>> Reviews can be provided at any time, there's no need to wait for the next
>> CommitFest.
>>
>
> OK, I'll try to find another patch to review.
>

I have scanned through all the patches in Commitfest 2023-01 with status
"Needs review", and it is difficult to find something which I can
meaningfully review.

The only thing I felt qualified to comment (or nit-pick?) on was
https://commitfest.postgresql.org/41/4071/

If something else should turn up which could be reviewed by someone
without intimate knowledge of PostgreSQL internals, then don't hesitate
to ask.

As for the Daitch-Mokotoff patch, the review by Andres Freund was very
helpful in order to improve the extension and to make it more idiomatic
- hopefully it is now a bit closer to being included.


Best regards

Dag Lem




Re: daitch_mokotoff module

2022-12-22 Thread Dag Lem
Dag Lem  writes:

> I noticed that the Meson builds failed in Cfbot, the updated patch adds
> a missing "include_directories" line to meson.build.
>

This should hopefully fix the last Cfbot failures, by exclusion of
daitch_mokotoff.h from headerscheck and cpluspluscheck.

Best regards

Dag Lem

diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile
index 0704894f88..12baf2d884 100644
--- a/contrib/fuzzystrmatch/Makefile
+++ b/contrib/fuzzystrmatch/Makefile
@@ -3,14 +3,15 @@
 MODULE_big = fuzzystrmatch
 OBJS = \
 	$(WIN32RES) \
+	daitch_mokotoff.o \
 	dmetaphone.o \
 	fuzzystrmatch.o
 
 EXTENSION = fuzzystrmatch
-DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.0--1.1.sql
+DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.1--1.2.sql fuzzystrmatch--1.0--1.1.sql
 PGFILEDESC = "fuzzystrmatch - similarities and distance between strings"
 
-REGRESS = fuzzystrmatch
+REGRESS = fuzzystrmatch fuzzystrmatch_utf8
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
@@ -22,3 +23,14 @@ top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 include $(top_srcdir)/contrib/contrib-global.mk
 endif
+
+# Force this dependency to be known even without dependency info built:
+daitch_mokotoff.o: daitch_mokotoff.h
+
+daitch_mokotoff.h: daitch_mokotoff_header.pl
+	perl $< $@
+
+distprep: daitch_mokotoff.h
+
+maintainer-clean:
+	rm -f daitch_mokotoff.h
diff --git a/contrib/fuzzystrmatch/daitch_mokotoff.c b/contrib/fuzzystrmatch/daitch_mokotoff.c
new file mode 100644
index 00..d4ad95c283
--- /dev/null
+++ b/contrib/fuzzystrmatch/daitch_mokotoff.c
@@ -0,0 +1,596 @@
+/*
+ * Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2021 Finance Norway
+ * Author: Dag Lem 
+ *
+ * This implementation of the Daitch-Mokotoff Soundex System aims at high
+ * performance.
+ *
+ * - The processing of each phoneme is initiated by an O(1) table lookup.
+ * - For phonemes containing more than one character, a coding tree is traversed
+ *   to process the complete phoneme.
+ * - The (alternate) soundex codes are produced digit by digit in-place in
+ *   another tree structure.
+ *
+ *  References:
+ *
+ * https://www.avotaynu.com/soundex.htm
+ * https://www.jewishgen.org/InfoFiles/Soundex.html
+ * https://familypedia.fandom.com/wiki/Daitch-Mokotoff_Soundex
+ * https://stevemorse.org/census/soundex.html (dmlat.php, dmsoundex.php)
+ * https://github.com/apache/commons-codec/ (dmrules.txt, DaitchMokotoffSoundex.java)
+ * https://metacpan.org/pod/Text::Phonetic (DaitchMokotoff.pm)
+ *
+ * A few notes on other implementations:
+ *
+ * - All other known implementations have the same unofficial rules for "UE",
+ *   these are also adapted by this implementation (0, 1, NC).
+ * - The only other known implementation which is capable of generating all
+ *   correct soundex codes in all cases is the JOS Soundex Calculator at
+ *   https://www.jewishgen.org/jos/jossound.htm
+ * - "J" is considered (only) a vowel in dmlat.php
+ * - The official rules for "RS" are commented out in dmlat.php
+ * - Identical code digits for adjacent letters are not collapsed correctly in
+ *   dmsoundex.php when double digit codes are involved. E.g. "BESST" yields
+ *   744300 instead of 743000 as for "BEST".
+ * - "J" is considered (only) a consonant in DaitchMokotoffSoundex.java
+ * - "Y" is not considered a vowel in DaitchMokotoffSoundex.java
+ *
+ * Permission to use, copy, modify, and distribute this software and its
+ * documentation for any purpose, without fee, and without a written agreement
+ * is hereby granted, provided that the above copyright notice and this
+ * paragraph and the following two paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+ * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+ * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+ * AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+ * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+ * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+*/
+
+#include "postgres.h"
+
+#include "lib/stringinfo.h"
+#include "mb/pg_wchar.h"
+#include "utils/builtins.h"
+#include "utils/memutils.h"
+
+
+/*
+ * The soundex coding chart table is adapted from from
+ * https://www.jewishgen.org/InfoFiles/Soundex.html
+ * See daitch_mokotoff_header.pl for details.
+*/
+
+#define DM_CODE_DIGITS 6
+
+/* Coding chart table: Soundex codes */
+typedef char dm_code[2 + 1];	/* One or two sequential code digits + NUL */
+typ

Re: daitch_mokotoff module

2022-12-22 Thread Dag Lem
I noticed that the Meson builds failed in Cfbot, the updated patch adds
a missing "include_directories" line to meson.build.

Best regards

Dag Lem

diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile
index 0704894f88..12baf2d884 100644
--- a/contrib/fuzzystrmatch/Makefile
+++ b/contrib/fuzzystrmatch/Makefile
@@ -3,14 +3,15 @@
 MODULE_big = fuzzystrmatch
 OBJS = \
 	$(WIN32RES) \
+	daitch_mokotoff.o \
 	dmetaphone.o \
 	fuzzystrmatch.o
 
 EXTENSION = fuzzystrmatch
-DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.0--1.1.sql
+DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.1--1.2.sql fuzzystrmatch--1.0--1.1.sql
 PGFILEDESC = "fuzzystrmatch - similarities and distance between strings"
 
-REGRESS = fuzzystrmatch
+REGRESS = fuzzystrmatch fuzzystrmatch_utf8
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
@@ -22,3 +23,14 @@ top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 include $(top_srcdir)/contrib/contrib-global.mk
 endif
+
+# Force this dependency to be known even without dependency info built:
+daitch_mokotoff.o: daitch_mokotoff.h
+
+daitch_mokotoff.h: daitch_mokotoff_header.pl
+	perl $< $@
+
+distprep: daitch_mokotoff.h
+
+maintainer-clean:
+	rm -f daitch_mokotoff.h
diff --git a/contrib/fuzzystrmatch/daitch_mokotoff.c b/contrib/fuzzystrmatch/daitch_mokotoff.c
new file mode 100644
index 00..d4ad95c283
--- /dev/null
+++ b/contrib/fuzzystrmatch/daitch_mokotoff.c
@@ -0,0 +1,596 @@
+/*
+ * Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2021 Finance Norway
+ * Author: Dag Lem 
+ *
+ * This implementation of the Daitch-Mokotoff Soundex System aims at high
+ * performance.
+ *
+ * - The processing of each phoneme is initiated by an O(1) table lookup.
+ * - For phonemes containing more than one character, a coding tree is traversed
+ *   to process the complete phoneme.
+ * - The (alternate) soundex codes are produced digit by digit in-place in
+ *   another tree structure.
+ *
+ *  References:
+ *
+ * https://www.avotaynu.com/soundex.htm
+ * https://www.jewishgen.org/InfoFiles/Soundex.html
+ * https://familypedia.fandom.com/wiki/Daitch-Mokotoff_Soundex
+ * https://stevemorse.org/census/soundex.html (dmlat.php, dmsoundex.php)
+ * https://github.com/apache/commons-codec/ (dmrules.txt, DaitchMokotoffSoundex.java)
+ * https://metacpan.org/pod/Text::Phonetic (DaitchMokotoff.pm)
+ *
+ * A few notes on other implementations:
+ *
+ * - All other known implementations have the same unofficial rules for "UE",
+ *   these are also adapted by this implementation (0, 1, NC).
+ * - The only other known implementation which is capable of generating all
+ *   correct soundex codes in all cases is the JOS Soundex Calculator at
+ *   https://www.jewishgen.org/jos/jossound.htm
+ * - "J" is considered (only) a vowel in dmlat.php
+ * - The official rules for "RS" are commented out in dmlat.php
+ * - Identical code digits for adjacent letters are not collapsed correctly in
+ *   dmsoundex.php when double digit codes are involved. E.g. "BESST" yields
+ *   744300 instead of 743000 as for "BEST".
+ * - "J" is considered (only) a consonant in DaitchMokotoffSoundex.java
+ * - "Y" is not considered a vowel in DaitchMokotoffSoundex.java
+ *
+ * Permission to use, copy, modify, and distribute this software and its
+ * documentation for any purpose, without fee, and without a written agreement
+ * is hereby granted, provided that the above copyright notice and this
+ * paragraph and the following two paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+ * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+ * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+ * AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+ * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+ * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+*/
+
+#include "postgres.h"
+
+#include "lib/stringinfo.h"
+#include "mb/pg_wchar.h"
+#include "utils/builtins.h"
+#include "utils/memutils.h"
+
+
+/*
+ * The soundex coding chart table is adapted from from
+ * https://www.jewishgen.org/InfoFiles/Soundex.html
+ * See daitch_mokotoff_header.pl for details.
+*/
+
+#define DM_CODE_DIGITS 6
+
+/* Coding chart table: Soundex codes */
+typedef char dm_code[2 + 1];	/* One or two sequential code digits + NUL */
+typedef dm_code dm_codes[3];	/* Start of name, before a vowel, any other */
+
+/* Coding chart table: Letter in input sequence */
+struct dm_letter
+{
+	char		

Re: [PATCH] Add function to_oct

2022-12-22 Thread Dag Lem
The following review has been posted through the commitfest application:
make installcheck-world:  not tested
Implements feature:   not tested
Spec compliant:   not tested
Documentation:not tested

This is a mini-review of to_oct :-)

The function seems useful to me, and is obviously correct.

I don't know whether optimization at such a low level is considered in 
PostgreSQL, but here goes.

The calculation of quotient and remainder can be replaced by less costly 
masking and shifting.

Defining

#define OCT_DIGIT_BITS 3
#define OCT_DIGIT_BITMASK 0x7

the content of the loop can be replaced by

*--ptr = digits[value & OCT_DIGIT_BITMASK];
value >>= OCT_DIGIT_BITS;

Also, the check for ptr > buf in the while loop can be removed. The check is 
superfluous, since buf cannot possibly be exhausted by a 32 bit integer (the 
maximum octal number being 377).


Best regards

Dag Lem

The new status of this patch is: Waiting on Author


Re: daitch_mokotoff module

2022-12-21 Thread Dag Lem
Hi Andreas,

Thank you for your detailed and constructive review!

I have made a conscientuous effort to address all the issues you point
out, please see comments below.

Andres Freund  writes:

> Hi,
>
> On 2022-02-03 15:27:32 +0100, Dag Lem wrote:

[...]

> [23:43:34.796] contrib/fuzzystrmatch/meson.build:18:0: ERROR: File
> fuzzystrmatch--1.1.sql does not exist.
>
>
>> -DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.0--1.1.sql
>> +DATA = fuzzystrmatch--1.2.sql fuzzystrmatch--1.1--1.2.sql
>> fuzzystrmatch--1.0--1.1.sql
>>  PGFILEDESC = "fuzzystrmatch - similarities and distance between strings"
>
>
> The patch seems to remove fuzzystrmatch--1.1.sql - I suggest not doing
> that. In recent years our approach has been to just keep the "base version" of
> the upgrade script, with extension creation running through the upgrade
> scripts.
>

OK, I have now kept fuzzystrmatch--1.1.sql, and omitted
fuzzystrmatch--1.2.sql

Both the Makefile and meson.build are updated to handle the new files,
including the generated header.

>>  
>> +
>> +#include "daitch_mokotoff.h"
>> +
>> +#include "postgres.h"
>
> Postgres policy is that the include of "postgres.h" has to be the first
> include in every .c file.
>
>

OK, fixed.

>> +#include "utils/builtins.h"
>> +#include "mb/pg_wchar.h"
>> +
>> +#include 
>> +
>> +/* Internal C implementation */
>> +static char *_daitch_mokotoff(char *word, char *soundex, size_t n);
>> +
>> +
>> +PG_FUNCTION_INFO_V1(daitch_mokotoff);
>> +Datum
>> +daitch_mokotoff(PG_FUNCTION_ARGS)
>> +{
>> +text   *arg = PG_GETARG_TEXT_PP(0);
>> +char   *string,
>> +   *tmp_soundex;
>> +text   *soundex;
>> +
>> +/*
>> + * The maximum theoretical soundex size is several KB, however in
>> practice
>> + * anything but contrived synthetic inputs will yield a soundex size of
>> + * less than 100 bytes. We thus allocate and free a temporary work
>> buffer,
>> + * and return only the actual soundex result.
>> + */
>> + string = pg_server_to_any(text_to_cstring(arg),
>> VARSIZE_ANY_EXHDR(arg), PG_UTF8);
>> +tmp_soundex = palloc(DM_MAX_SOUNDEX_CHARS);
>
> Seems that just using StringInfo to hold the soundex output would work better
> than a static allocation?
>

OK, fixed.

>
>> +if (!_daitch_mokotoff(string, tmp_soundex, DM_MAX_SOUNDEX_CHARS))
>
> We imo shouldn't introduce new functions starting with _.
>

OK, fixed. Note that I just followed the existing pattern in
fuzzystrmatch.c there.

[...]

>> +/* Find next node corresponding to code digit, or create a new node. */
>> +static dm_node * find_or_create_node(dm_nodes nodes, int *num_nodes,
>> + dm_node * node, char code_digit)
>
> PG code style is to have a line break between a function defintion's return
> type and the function name - like you actually do above.
>

OK, fixed. Both pgindent and I must have missed that particular
function.

>> +/* Mapping from ISO8859-1 to upper-case ASCII */
>> +static const char tr_iso8859_1_to_ascii_upper[] =
>> +/*
>> +"`abcdefghijklmnopqrstuvwxyz{|}~ ¡¢£¤¥¦§¨©ª«¬
>> ®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ"
>> +*/
>> +"`ABCDEFGHIJKLMNOPQRSTUVWXYZ{|}~ !
>> ?AAECDNO*OYDSAAECDNO/OYDY";
>> +
>> +static char
>> +iso8859_1_to_ascii_upper(unsigned char c)
>> +{
>> +return c >= 0x60 ? tr_iso8859_1_to_ascii_upper[c - 0x60] : c;
>> +}
>> +
>> +
>> +/* Convert an UTF-8 character to ISO-8859-1.
>> + * Unconvertable characters are returned as '?'.
>> + * NB! Beware of the domain specific conversion of Ą, Ę, and Ţ/Ț.
>> + */
>> +static char
>> +utf8_to_iso8859_1(char *str, int *ix)
>
> It seems decidedly not great to have custom encoding conversion routines in a
> contrib module. Is there any way we can avoid this?
>

I have now replaced the custom UTF-8 decode with calls to
utf8_to_unicode and pg_utf_mblen, and simplified the subsequent
conversion to ASCII. Hopefully this makes the conversion code more
palatable.

I don't see how the conversion to ASCII could be substantially
simplified further. The conversion maps lowercase and 8 bit ISO8859-1
characters to ASCII via uppercasing, removal of accents, and discarding
of special characters. In addition to that, it maps (the non-ISO8859-1)
Ą, Ę, and Ţ/Ț from the coding chart to [, \, and ]. After this, a simple
O(1) table lookup can be used to retrieve the soun

Re: daitch_mokotoff module

2022-12-05 Thread Dag Lem
Hi Ian,

Ian Lawrence Barwick  writes:

> Hi Dag
>
> 2022年2月3日(木) 23:27 Dag Lem :
>>
>> Hi,
>>
>> Just some minor adjustments to the patch:
>>
>> * Removed call to locale-dependent toupper()
>> * Cleaned up input normalization
>
> This patch was marked as "Waiting on Author" in the CommitFest entry [1]
> but I see you provided an updated version which hasn't received any feedback,
> so I've move this to the next CommitFest [2] and set it to "Needs Review".
>
> [1] https://commitfest.postgresql.org/40/3451/
> [2] https://commitfest.postgresql.org/41/3451/
>
>> I have been asked to sign up to review a commitfest patch or patches -
>> unfortunately I've been ill with COVID-19 and it's not until now that
>> I feel well enough to have a look.
>>
>> Julien: I'll have a look at https://commitfest.postgresql.org/36/3468/
>> as you suggested (https://commitfest.postgresql.org/36/3379/ seems to
>> have been reviewed now).
>>
>> If there are other suggestions for a patch or patches to review for
>> someone new to PostgreSQL internals, I'd be grateful for that.
>
> I see you provided some feedback on 
> https://commitfest.postgresql.org/36/3468/,
> though the patch seems to have not been accepted (but not conclusively 
> rejected
> either). If you still have the chance to review another patch (or more) it 
> would
> be much appreciated, as there's quite a few piling up. Things like 
> documentation
> or small improvements to client applications are always a good place to start.
> Reviews can be provided at any time, there's no need to wait for the next
> CommitFest.
>

OK, I'll try to find another patch to review.

Regards

Dag Lem




Re: Add psql command to list constraints

2022-02-04 Thread Dag Lem
"David G. Johnston"  writes:

> On Monday, November 15, 2021, Tatsuro Yamada
>  wrote:
>
> I don't know if this is a good example, but if you look at
> StackOverflow,
> it seems that people who want to see a list of constraints appear
> regularly.
>
> https://stackoverflow.com/questions/62987794/how-to-list-all-constraints-
> of-a-table-in-postgresql
> 
>
> Given the questioner restricted their question to “for a given table”
> I’d say it supports leaving the status quo, not changing it.
>
> If, as you suppose, these come up regularly then finding one that asks
> for it “in the entire database”, ideally with some stated goal, should
> be doable.
>
> David J.
>
>

This is my review of the patch in
https://commitfest.postgresql.org/37/3468/

The patch adds the command "\dco" to list constraints in psql. This
seems useful to me.

The patch applies cleanly to HEAD, although some hunks have rather large
offsets.

As far as I can tell, the "\dco" command works as documented.

I have however found the following issues with the patch:

* A TAB character has been added to doc/src/sgml/ref/psql-ref.sgml -
  this should be replaced with spaces.
* The call to listConstraints in line src/bin/psql/command.c 794 refers
  to [2], this should rather be [3].
* The patch kills the "\dc" command in src/bin/psql/command.c
  This can be fixed by adding the following at line 800:
  else
success =
  listConversions(pattern, show_verbose, show_system);

Another comment is that the "\dco" command outputs quite a lot of
information, which only fits in a wide terminal window. Would it be an
idea to only display the columns "Schema" and "Name" by default, and
use "+" to specify inclusion of the columns "Definition" and "Table".


Best regards

Dag Lem




Re: daitch_mokotoff module

2022-02-03 Thread Dag Lem
Hi,

Just some minor adjustments to the patch:

* Removed call to locale-dependent toupper()
* Cleaned up input normalization

I have been asked to sign up to review a commitfest patch or patches -
unfortunately I've been ill with COVID-19 and it's not until now that
I feel well enough to have a look.

Julien: I'll have a look at https://commitfest.postgresql.org/36/3468/
as you suggested (https://commitfest.postgresql.org/36/3379/ seems to
have been reviewed now).

If there are other suggestions for a patch or patches to review for
someone new to PostgreSQL internals, I'd be grateful for that.


Best regards

Dag Lem

diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile
index 0704894f88..1d5bd84be8 100644
--- a/contrib/fuzzystrmatch/Makefile
+++ b/contrib/fuzzystrmatch/Makefile
@@ -3,14 +3,15 @@
 MODULE_big = fuzzystrmatch
 OBJS = \
 	$(WIN32RES) \
+	daitch_mokotoff.o \
 	dmetaphone.o \
 	fuzzystrmatch.o
 
 EXTENSION = fuzzystrmatch
-DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.0--1.1.sql
+DATA = fuzzystrmatch--1.2.sql fuzzystrmatch--1.1--1.2.sql fuzzystrmatch--1.0--1.1.sql
 PGFILEDESC = "fuzzystrmatch - similarities and distance between strings"
 
-REGRESS = fuzzystrmatch
+REGRESS = fuzzystrmatch fuzzystrmatch_utf8
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
@@ -22,3 +23,8 @@ top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 include $(top_srcdir)/contrib/contrib-global.mk
 endif
+
+daitch_mokotoff.o: daitch_mokotoff.h
+
+daitch_mokotoff.h: daitch_mokotoff_header.pl
+	perl $< > $@
diff --git a/contrib/fuzzystrmatch/daitch_mokotoff.c b/contrib/fuzzystrmatch/daitch_mokotoff.c
new file mode 100644
index 00..ba87061845
--- /dev/null
+++ b/contrib/fuzzystrmatch/daitch_mokotoff.c
@@ -0,0 +1,587 @@
+/*
+ * Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2021 Finance Norway
+ * Author: Dag Lem 
+ *
+ * This implementation of the Daitch-Mokotoff Soundex System aims at high
+ * performance.
+ *
+ * - The processing of each phoneme is initiated by an O(1) table lookup.
+ * - For phonemes containing more than one character, a coding tree is traversed
+ *   to process the complete phoneme.
+ * - The (alternate) soundex codes are produced digit by digit in-place in
+ *   another tree structure.
+ *
+ *  References:
+ *
+ * https://www.avotaynu.com/soundex.htm
+ * https://www.jewishgen.org/InfoFiles/Soundex.html
+ * https://familypedia.fandom.com/wiki/Daitch-Mokotoff_Soundex
+ * https://stevemorse.org/census/soundex.html (dmlat.php, dmsoundex.php)
+ * https://github.com/apache/commons-codec/ (dmrules.txt, DaitchMokotoffSoundex.java)
+ * https://metacpan.org/pod/Text::Phonetic (DaitchMokotoff.pm)
+ *
+ * A few notes on other implementations:
+ *
+ * - All other known implementations have the same unofficial rules for "UE",
+ *   these are also adapted by this implementation (0, 1, NC).
+ * - The only other known implementation which is capable of generating all
+ *   correct soundex codes in all cases is the JOS Soundex Calculator at
+ *   https://www.jewishgen.org/jos/jossound.htm
+ * - "J" is considered (only) a vowel in dmlat.php
+ * - The official rules for "RS" are commented out in dmlat.php
+ * - Identical code digits for adjacent letters are not collapsed correctly in
+ *   dmsoundex.php when double digit codes are involved. E.g. "BESST" yields
+ *   744300 instead of 743000 as for "BEST".
+ * - "J" is considered (only) a consonant in DaitchMokotoffSoundex.java
+ * - "Y" is not considered a vowel in DaitchMokotoffSoundex.java
+ *
+ * Permission to use, copy, modify, and distribute this software and its
+ * documentation for any purpose, without fee, and without a written agreement
+ * is hereby granted, provided that the above copyright notice and this
+ * paragraph and the following two paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+ * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+ * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+ * AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+ * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+ * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+*/
+
+#include "daitch_mokotoff.h"
+
+#include "postgres.h"
+#include "utils/builtins.h"
+#include "mb/pg_wchar.h"
+
+#include 
+
+/* Internal C implementation */
+static char *_daitch_mokotoff(char *word, char *soundex, size_t n);
+
+
+PG_FUNCTION_INFO_V1(daitch_mokotoff);
+Datum
+daitch_mokotoff(PG_FUNC

Re: daitch_mokotoff module

2022-01-05 Thread Dag Lem
Dag Lem  writes:

> Thomas Munro  writes:
>
>> On Wed, Jan 5, 2022 at 2:49 AM Dag Lem  wrote:
>>> However I guess this won't make any difference wrt. actually running the
>>> tests, as long as there seems to be an encoding problem in the cfbot
>>
>> Fixed -- I told it to pull down patches as binary, not text.  Now it
>> makes commits that look healthier, and so far all the Unix systems
>> have survived CI:
>>
>> https://github.com/postgresql-cfbot/postgresql/commit/79700efc61d15c2414b8450a786951fa9308c07f
>> http://cfbot.cputube.org/dag-lem.html
>>
>
> Great!
>
> Dag
>
>

After this I did the mistake of including a patch for citext in this
thread, which is now picked up by cfbot instead of the Daitch-Mokotoff
patch.

Attaching the original patch again in order to hopefully fix my mistake.

Best regards

Dag Lem

diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile
index 0704894f88..1d5bd84be8 100644
--- a/contrib/fuzzystrmatch/Makefile
+++ b/contrib/fuzzystrmatch/Makefile
@@ -3,14 +3,15 @@
 MODULE_big = fuzzystrmatch
 OBJS = \
 	$(WIN32RES) \
+	daitch_mokotoff.o \
 	dmetaphone.o \
 	fuzzystrmatch.o
 
 EXTENSION = fuzzystrmatch
-DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.0--1.1.sql
+DATA = fuzzystrmatch--1.2.sql fuzzystrmatch--1.1--1.2.sql fuzzystrmatch--1.0--1.1.sql
 PGFILEDESC = "fuzzystrmatch - similarities and distance between strings"
 
-REGRESS = fuzzystrmatch
+REGRESS = fuzzystrmatch fuzzystrmatch_utf8
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
@@ -22,3 +23,8 @@ top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 include $(top_srcdir)/contrib/contrib-global.mk
 endif
+
+daitch_mokotoff.o: daitch_mokotoff.h
+
+daitch_mokotoff.h: daitch_mokotoff_header.pl
+	perl $< > $@
diff --git a/contrib/fuzzystrmatch/daitch_mokotoff.c b/contrib/fuzzystrmatch/daitch_mokotoff.c
new file mode 100644
index 00..1b7263c349
--- /dev/null
+++ b/contrib/fuzzystrmatch/daitch_mokotoff.c
@@ -0,0 +1,593 @@
+/*
+ * Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2021 Finance Norway
+ * Author: Dag Lem 
+ *
+ * This implementation of the Daitch-Mokotoff Soundex System aims at high
+ * performance.
+ *
+ * - The processing of each phoneme is initiated by an O(1) table lookup.
+ * - For phonemes containing more than one character, a coding tree is traversed
+ *   to process the complete phoneme.
+ * - The (alternate) soundex codes are produced digit by digit in-place in
+ *   another tree structure.
+ *
+ *  References:
+ *
+ * https://www.avotaynu.com/soundex.htm
+ * https://www.jewishgen.org/InfoFiles/Soundex.html
+ * https://familypedia.fandom.com/wiki/Daitch-Mokotoff_Soundex
+ * https://stevemorse.org/census/soundex.html (dmlat.php, dmsoundex.php)
+ * https://github.com/apache/commons-codec/ (dmrules.txt, DaitchMokotoffSoundex.java)
+ * https://metacpan.org/pod/Text::Phonetic (DaitchMokotoff.pm)
+ *
+ * A few notes on other implementations:
+ *
+ * - All other known implementations have the same unofficial rules for "UE",
+ *   these are also adapted by this implementation (0, 1, NC).
+ * - The only other known implementation which is capable of generating all
+ *   correct soundex codes in all cases is the JOS Soundex Calculator at
+ *   https://www.jewishgen.org/jos/jossound.htm
+ * - "J" is considered (only) a vowel in dmlat.php
+ * - The official rules for "RS" are commented out in dmlat.php
+ * - Identical code digits for adjacent letters are not collapsed correctly in
+ *   dmsoundex.php when double digit codes are involved. E.g. "BESST" yields
+ *   744300 instead of 743000 as for "BEST".
+ * - "J" is considered (only) a consonant in DaitchMokotoffSoundex.java
+ * - "Y" is not considered a vowel in DaitchMokotoffSoundex.java
+ *
+ * Permission to use, copy, modify, and distribute this software and its
+ * documentation for any purpose, without fee, and without a written agreement
+ * is hereby granted, provided that the above copyright notice and this
+ * paragraph and the following two paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+ * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+ * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+ * AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+ * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+ * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+*/
+
+#include "daitch_mokotoff.h"
+
+

[PATCH] Run UTF8-dependent tests for citext [Re: daitch_mokotoff module]

2022-01-05 Thread Dag Lem
Tom Lane  writes:

> Dag Lem  writes:
>
>> Running "ack -l '[\x80-\xff]'" in the contrib/ directory reveals that
>> two other modules are using UTF8 characters in tests - citext and
>> unaccent.
>
> Yeah, neither of those have been upgraded to said best practice.
> (If you feel like doing the legwork to improve that situation,
> that'd be great.)
>

Please find attached a patch to run the previously commented-out
UTF8-dependent tests for citext, according to best practice. For now I
don't dare to touch the unaccent module, which seems to be UTF8-only
anyway.


Best regards

Dag Lem

diff --git a/contrib/citext/Makefile b/contrib/citext/Makefile
index a7de52928d..789932fe36 100644
--- a/contrib/citext/Makefile
+++ b/contrib/citext/Makefile
@@ -11,7 +11,7 @@ DATA = citext--1.4.sql \
 	citext--1.0--1.1.sql
 PGFILEDESC = "citext - case-insensitive character string data type"
 
-REGRESS = citext
+REGRESS = citext citext_utf8
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
diff --git a/contrib/citext/expected/citext.out b/contrib/citext/expected/citext.out
index 3bac0534fb..48b4de8993 100644
--- a/contrib/citext/expected/citext.out
+++ b/contrib/citext/expected/citext.out
@@ -48,29 +48,6 @@ SELECT 'a'::citext <> 'ab'::citext AS t;
  t
 (1 row)
 
--- Multibyte sanity tests. Uncomment to run.
--- SELECT 'À'::citext =  'À'::citext AS t;
--- SELECT 'À'::citext =  'à'::citext AS t;
--- SELECT 'À'::text   =  'à'::text   AS f; -- text wins.
--- SELECT 'À'::citext <> 'B'::citext AS t;
--- Test combining characters making up canonically equivalent strings.
--- SELECT 'Ä'::text   <> 'Ä'::text   AS t;
--- SELECT 'Ä'::citext <> 'Ä'::citext AS t;
--- Test the Turkish dotted I. The lowercase is a single byte while the
--- uppercase is multibyte. This is why the comparison code can't be optimized
--- to compare string lengths.
--- SELECT 'i'::citext = 'İ'::citext AS t;
--- Regression.
--- SELECT 'láska'::citext <> 'laská'::citext AS t;
--- SELECT 'Ask Bjørn Hansen'::citext = 'Ask Bjørn Hansen'::citext AS t;
--- SELECT 'Ask Bjørn Hansen'::citext = 'ASK BJØRN HANSEN'::citext AS t;
--- SELECT 'Ask Bjørn Hansen'::citext <> 'Ask Bjorn Hansen'::citext AS t;
--- SELECT 'Ask Bjørn Hansen'::citext <> 'ASK BJORN HANSEN'::citext AS t;
--- SELECT citext_cmp('Ask Bjørn Hansen'::citext, 'Ask Bjørn Hansen'::citext) AS zero;
--- SELECT citext_cmp('Ask Bjørn Hansen'::citext, 'ask bjørn hansen'::citext) AS zero;
--- SELECT citext_cmp('Ask Bjørn Hansen'::citext, 'ASK BJØRN HANSEN'::citext) AS zero;
--- SELECT citext_cmp('Ask Bjørn Hansen'::citext, 'Ask Bjorn Hansen'::citext) AS positive;
--- SELECT citext_cmp('Ask Bjorn Hansen'::citext, 'Ask Bjørn Hansen'::citext) AS negative;
 -- Test > and >=
 SELECT 'B'::citext > 'a'::citext AS t;
  t 
diff --git a/contrib/citext/expected/citext_1.out b/contrib/citext/expected/citext_1.out
index 57fc863f7a..8ab4d4224e 100644
--- a/contrib/citext/expected/citext_1.out
+++ b/contrib/citext/expected/citext_1.out
@@ -48,29 +48,6 @@ SELECT 'a'::citext <> 'ab'::citext AS t;
  t
 (1 row)
 
--- Multibyte sanity tests. Uncomment to run.
--- SELECT 'À'::citext =  'À'::citext AS t;
--- SELECT 'À'::citext =  'à'::citext AS t;
--- SELECT 'À'::text   =  'à'::text   AS f; -- text wins.
--- SELECT 'À'::citext <> 'B'::citext AS t;
--- Test combining characters making up canonically equivalent strings.
--- SELECT 'Ä'::text   <> 'Ä'::text   AS t;
--- SELECT 'Ä'::citext <> 'Ä'::citext AS t;
--- Test the Turkish dotted I. The lowercase is a single byte while the
--- uppercase is multibyte. This is why the comparison code can't be optimized
--- to compare string lengths.
--- SELECT 'i'::citext = 'İ'::citext AS t;
--- Regression.
--- SELECT 'láska'::citext <> 'laská'::citext AS t;
--- SELECT 'Ask Bjørn Hansen'::citext = 'Ask Bjørn Hansen'::citext AS t;
--- SELECT 'Ask Bjørn Hansen'::citext = 'ASK BJØRN HANSEN'::citext AS t;
--- SELECT 'Ask Bjørn Hansen'::citext <> 'Ask Bjorn Hansen'::citext AS t;
--- SELECT 'Ask Bjørn Hansen'::citext <> 'ASK BJORN HANSEN'::citext AS t;
--- SELECT citext_cmp('Ask Bjørn Hansen'::citext, 'Ask Bjørn Hansen'::citext) AS zero;
--- SELECT citext_cmp('Ask Bjørn Hansen'::citext, 'ask bjørn hansen'::citext) AS zero;
--- SELECT citext_cmp('Ask Bjørn Hansen'::citext, 'ASK BJØRN HANSEN'::citext) AS zero;
--- SELECT citext_cmp('Ask Bjørn Hansen'::citext, 'Ask Bjorn Hansen'::citext) AS positive;
--- SELECT citext_cmp('Ask Bjorn Hansen'::citext, 'Ask Bjørn Hansen'::citext) AS negative;
 -- Test > and >=
 SELECT 'B'::citext > 'a'::citext AS t;
  t 
diff --git a/contrib/citext/expected/citext_utf8.out b/contrib/citext/expected/citext_utf8.out
new file mode 100644
index 00..1f4fa79aff
--- /dev/null
+++ b/contrib/citext/expected/citext_utf8.out
@@ -0,0 +1,119 @@
+/*
+ * This test must be run in a database with UTF-8 encoding,
+ * because o

Re: daitch_mokotoff module

2022-01-04 Thread Dag Lem
Thomas Munro  writes:

> On Wed, Jan 5, 2022 at 2:49 AM Dag Lem  wrote:
>> However I guess this won't make any difference wrt. actually running the
>> tests, as long as there seems to be an encoding problem in the cfbot
>
> Fixed -- I told it to pull down patches as binary, not text.  Now it
> makes commits that look healthier, and so far all the Unix systems
> have survived CI:
>
> https://github.com/postgresql-cfbot/postgresql/commit/79700efc61d15c2414b8450a786951fa9308c07f
> http://cfbot.cputube.org/dag-lem.html
>

Great!

Dag




Re: daitch_mokotoff module

2022-01-04 Thread Dag Lem
Andres Freund  writes:

> Hi,
>
> On 2022-01-02 21:41:53 -0500, Tom Lane wrote:
>> ... so, that test case is guaranteed to fail in non-UTF8 encodings,
>> I suppose?  I wonder what the LANG environment is in that cfbot
>> instance.
>
> LANG="en_US.UTF-8"
>
> But it looks to me like the problem is in the commit cfbot creates, rather
> than the test run itself:
> https://github.com/postgresql-cfbot/postgresql/commit/d5b4ec87cfd65dc08d26e1b789bd254405c90a66#diff-388d4bb360a3b24c425e29a85899315dc02f9c1dd9b9bc9aaa828876bdfea50aR56
>
> Greetings,
>
> Andres Freund
>
>

I have now separated out the UTF8-dependent tests, hopefully according
to the current best practice (based on src/test/modules/test_regex/ and
https://www.postgresql.org/docs/14/regress-variant.html).

However I guess this won't make any difference wrt. actually running the
tests, as long as there seems to be an encoding problem in the cfbot
pipeline.

Is there anything else I can do? Could perhaps fuzzystrmatch_utf8 simply
be commented out from the Makefile for the time being?

Best regards

Dag Lem

diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile
index 0704894f88..1d5bd84be8 100644
--- a/contrib/fuzzystrmatch/Makefile
+++ b/contrib/fuzzystrmatch/Makefile
@@ -3,14 +3,15 @@
 MODULE_big = fuzzystrmatch
 OBJS = \
 	$(WIN32RES) \
+	daitch_mokotoff.o \
 	dmetaphone.o \
 	fuzzystrmatch.o
 
 EXTENSION = fuzzystrmatch
-DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.0--1.1.sql
+DATA = fuzzystrmatch--1.2.sql fuzzystrmatch--1.1--1.2.sql fuzzystrmatch--1.0--1.1.sql
 PGFILEDESC = "fuzzystrmatch - similarities and distance between strings"
 
-REGRESS = fuzzystrmatch
+REGRESS = fuzzystrmatch fuzzystrmatch_utf8
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
@@ -22,3 +23,8 @@ top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 include $(top_srcdir)/contrib/contrib-global.mk
 endif
+
+daitch_mokotoff.o: daitch_mokotoff.h
+
+daitch_mokotoff.h: daitch_mokotoff_header.pl
+	perl $< > $@
diff --git a/contrib/fuzzystrmatch/daitch_mokotoff.c b/contrib/fuzzystrmatch/daitch_mokotoff.c
new file mode 100644
index 00..1b7263c349
--- /dev/null
+++ b/contrib/fuzzystrmatch/daitch_mokotoff.c
@@ -0,0 +1,593 @@
+/*
+ * Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2021 Finance Norway
+ * Author: Dag Lem 
+ *
+ * This implementation of the Daitch-Mokotoff Soundex System aims at high
+ * performance.
+ *
+ * - The processing of each phoneme is initiated by an O(1) table lookup.
+ * - For phonemes containing more than one character, a coding tree is traversed
+ *   to process the complete phoneme.
+ * - The (alternate) soundex codes are produced digit by digit in-place in
+ *   another tree structure.
+ *
+ *  References:
+ *
+ * https://www.avotaynu.com/soundex.htm
+ * https://www.jewishgen.org/InfoFiles/Soundex.html
+ * https://familypedia.fandom.com/wiki/Daitch-Mokotoff_Soundex
+ * https://stevemorse.org/census/soundex.html (dmlat.php, dmsoundex.php)
+ * https://github.com/apache/commons-codec/ (dmrules.txt, DaitchMokotoffSoundex.java)
+ * https://metacpan.org/pod/Text::Phonetic (DaitchMokotoff.pm)
+ *
+ * A few notes on other implementations:
+ *
+ * - All other known implementations have the same unofficial rules for "UE",
+ *   these are also adapted by this implementation (0, 1, NC).
+ * - The only other known implementation which is capable of generating all
+ *   correct soundex codes in all cases is the JOS Soundex Calculator at
+ *   https://www.jewishgen.org/jos/jossound.htm
+ * - "J" is considered (only) a vowel in dmlat.php
+ * - The official rules for "RS" are commented out in dmlat.php
+ * - Identical code digits for adjacent letters are not collapsed correctly in
+ *   dmsoundex.php when double digit codes are involved. E.g. "BESST" yields
+ *   744300 instead of 743000 as for "BEST".
+ * - "J" is considered (only) a consonant in DaitchMokotoffSoundex.java
+ * - "Y" is not considered a vowel in DaitchMokotoffSoundex.java
+ *
+ * Permission to use, copy, modify, and distribute this software and its
+ * documentation for any purpose, without fee, and without a written agreement
+ * is hereby granted, provided that the above copyright notice and this
+ * paragraph and the following two paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+ * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+ * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+ * AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUN

Re: daitch_mokotoff module

2022-01-03 Thread Dag Lem
Tom Lane  writes:

> Thomas Munro  writes:
>> Erm, it looks like something weird is happening somewhere in cfbot's
>> pipeline, because Dag's patch says:
>
>> +SELECT daitch_mokotoff('Straßburg');
>> + daitch_mokotoff
>> +-
>> + 294795
>> +(1 row)
>
> ... so, that test case is guaranteed to fail in non-UTF8 encodings,
> I suppose?  I wonder what the LANG environment is in that cfbot
> instance.
>
> (We do have methods for dealing with non-ASCII test cases, but
> I can't see that this patch is using any of them.)
>
>   regards, tom lane
>

I naively assumed that tests would be run in an UTF8 environment.

Running "ack -l '[\x80-\xff]'" in the contrib/ directory reveals that
two other modules are using UTF8 characters in tests - citext and
unaccent.

The citext tests seem to be commented out - "Multibyte sanity
tests. Uncomment to run."

Looking into the unaccent module, I don't quite understand how it will
work with various encodings, since it doesn't seem to decode its input -
will it fail if run under anything but ASCII or UTF8?

In any case, I see that unaccent.sql starts as follows:


CREATE EXTENSION unaccent;

-- must have a UTF8 database
SELECT getdatabaseencoding();

SET client_encoding TO 'UTF8';


Would doing the same thing in fuzzystrmatch.sql fix the problem with
failing tests? Should I prepare a new patch?


Best regards

Dag Lem




Re: daitch_mokotoff module

2021-12-21 Thread Dag Lem
Hello again,

It turns out that there actually exists an(other) implementation of
the Daitch-Mokotoff Soundex System which gets it right; the JOS
Soundex Calculator at https://www.jewishgen.org/jos/jossound.htm
Other implementations I have been able to find, like the one in Apache
Commons Coded used in e.g. Elasticsearch, are far from correct.

The source code for the JOS Soundex Calculator is not available, as
far as I can tell, however I have run the complete list of 98412 last
names at
https://raw.githubusercontent.com/philipperemy/name-dataset/master/names_dataset/v1/last_names.all.txt
through the calculator, in order to have a good basis for comparison.

This revealed a few shortcomings in my implementation. In particular I
had to go back to the drawing board in order to handle the dual nature
of "J" correctly. "J" can be either a vowel or a consonant in
Daitch-Mokotoff soundex, and this complicates encoding of the
*previous* character.

I have also done a more thorough review and refactoring of the code,
which should hopefully make things quite a bit more understandable to
others.

The changes are summarized as follows:

* Returns NULL for input without any encodable characters.
* Uses the same "unoffical" rules for "UE" as other implementations.
* Correctly considers "J" as either a vowel or a consonant.
* Corrected encoding for e.g. "HANNMANN".
* Code refactoring and comments for readability.
* Better examples in the documentation.

The implementation is now in correspondence with the JOS Soundex
Calculator for the 98412 last names mentioned above, with only the
following exceptions:

JOS: cedeño 43 53
PG:  cedeño 436000 536000
JOS: sadab(khura)  437000
PG:  sadab(khura)  437590

I hope this addition to the fuzzystrmatch extension module will prove
to be useful to others as well!

This is my very first code contribution to PostgreSQL, and I would be
grateful for any advice on how to proceed in order to get the patch
accepted.

Best regards

Dag Lem

diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile
index 0704894f88..826e529e3e 100644
--- a/contrib/fuzzystrmatch/Makefile
+++ b/contrib/fuzzystrmatch/Makefile
@@ -3,11 +3,12 @@
 MODULE_big = fuzzystrmatch
 OBJS = \
 	$(WIN32RES) \
+	daitch_mokotoff.o \
 	dmetaphone.o \
 	fuzzystrmatch.o
 
 EXTENSION = fuzzystrmatch
-DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.0--1.1.sql
+DATA = fuzzystrmatch--1.2.sql fuzzystrmatch--1.1--1.2.sql fuzzystrmatch--1.0--1.1.sql
 PGFILEDESC = "fuzzystrmatch - similarities and distance between strings"
 
 REGRESS = fuzzystrmatch
@@ -22,3 +23,8 @@ top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 include $(top_srcdir)/contrib/contrib-global.mk
 endif
+
+daitch_mokotoff.o: daitch_mokotoff.h
+
+daitch_mokotoff.h: daitch_mokotoff_header.pl
+	perl $< > $@
diff --git a/contrib/fuzzystrmatch/daitch_mokotoff.c b/contrib/fuzzystrmatch/daitch_mokotoff.c
new file mode 100644
index 00..1b7263c349
--- /dev/null
+++ b/contrib/fuzzystrmatch/daitch_mokotoff.c
@@ -0,0 +1,593 @@
+/*
+ * Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2021 Finance Norway
+ * Author: Dag Lem 
+ *
+ * This implementation of the Daitch-Mokotoff Soundex System aims at high
+ * performance.
+ *
+ * - The processing of each phoneme is initiated by an O(1) table lookup.
+ * - For phonemes containing more than one character, a coding tree is traversed
+ *   to process the complete phoneme.
+ * - The (alternate) soundex codes are produced digit by digit in-place in
+ *   another tree structure.
+ *
+ *  References:
+ *
+ * https://www.avotaynu.com/soundex.htm
+ * https://www.jewishgen.org/InfoFiles/Soundex.html
+ * https://familypedia.fandom.com/wiki/Daitch-Mokotoff_Soundex
+ * https://stevemorse.org/census/soundex.html (dmlat.php, dmsoundex.php)
+ * https://github.com/apache/commons-codec/ (dmrules.txt, DaitchMokotoffSoundex.java)
+ * https://metacpan.org/pod/Text::Phonetic (DaitchMokotoff.pm)
+ *
+ * A few notes on other implementations:
+ *
+ * - All other known implementations have the same unofficial rules for "UE",
+ *   these are also adapted by this implementation (0, 1, NC).
+ * - The only other known implementation which is capable of generating all
+ *   correct soundex codes in all cases is the JOS Soundex Calculator at
+ *   https://www.jewishgen.org/jos/jossound.htm
+ * - "J" is considered (only) a vowel in dmlat.php
+ * - The official rules for "RS" are commented out in dmlat.php
+ * - Identical code digits for adjacent letters are not collapsed correctly in
+ *   dmsoundex.php when double digit codes are involved. E.g. "BESST" yields
+ *   744300 instead of 743000 as for "BEST".
+ * - "J" is considered (only) a consonant in DaitchMokotoffSoundex.java
+ * - "Y" is not considered a vowel in DaitchMokotoffSoundex.java
+ *
+ * Permission to use, copy, 

Re: daitch_mokotoff module

2021-12-14 Thread Dag Lem
Tomas Vondra  writes:

[...]

>
> Thanks, looks interesting. A couple generic comments, based on a quick
> code review.

Thank you for the constructive review!

>
> 1) Can the extension be marked as trusted, just like fuzzystrmatch?

I have now moved the daitch_mokotoff function into the fuzzystrmatch
module, as suggested by Andrew Dunstan.

>
> 2) The docs really need an explanation of what the extension is for,
> not just a link to fuzzystrmatch. Also, a couple examples would be
> helpful, I guess - similarly to fuzzystrmatch. The last line in the
> docs is annoyingly long.

Please see the updated documentation for the fuzzystrmatch module.

>
> 3) What's daitch_mokotov_header.pl for? I mean, it generates the
> header, but when do we need to run it?

It only has to be run if the soundex rules are changed. I have now
made the dependencies explicit in the fuzzystrmatch Makefile.

>
> 4) It seems to require perl-open, which is a module we did not need
> until now. Not sure how well supported it is, but maybe we can use a
> more standard module?

I believe Perl I/O layers have been part of Perl core for two decades
now :-)

>
> 5) Do we need to keep DM_MAIN? It seems to be meant for some kind of
> testing, but our regression tests certainly don't need it (or the
> palloc mockup). I suggest to get rid of it.

Done. BTW this was modeled after dmetaphone.c

>
> 6) I really don't understand some of the comments in
> daitch_mokotov.sql, like for example:
>
> -- testEncodeBasic
> -- Tests covered above are omitted.
>
> Also, comments with names of Java methods seem pretty confusing. It'd
> be better to actually explain what rules are the tests checking.

The tests were copied from various web sites and implementations. I have
cut down on the number of tests and made the comments more to the point.

>
> 7) There are almost no comments in the .c file (ignoring the comment
> on top). Short functions like initialize_node are probably fine
> without one, but e.g. update_node would deserve one.

More comments are added to both the .h and the .c file.

>
> 8) Some of the lines are pretty long (e.g. the update_node signature
> is almost 170 chars). That should be wrapped. Maybe try running
> pgindent on the code, that'll show which parts need better formatting
> (so as not to get broken later).

Fixed. I did run pgindent earlier, however it didn't catch those long
lines.

>
> 9) I'm sure there's better way to get the number of valid chars than this:
>
>   for (i = 0, ilen = 0; (c = read_char([i], )) && (c < 'A' ||
> c > ']'); i += ilen)
>   {
>   }
>
> Say, a while loop or something?

The code gets to the next encodable character, skipping any other
characters. I have now added a comment which should hopefully make this
clearer, and broken up the for loop for readability.

Please find attached the revised patch.

Best regards

Dag Lem

diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile
index 0704894f88..826e529e3e 100644
--- a/contrib/fuzzystrmatch/Makefile
+++ b/contrib/fuzzystrmatch/Makefile
@@ -3,11 +3,12 @@
 MODULE_big = fuzzystrmatch
 OBJS = \
 	$(WIN32RES) \
+	daitch_mokotoff.o \
 	dmetaphone.o \
 	fuzzystrmatch.o
 
 EXTENSION = fuzzystrmatch
-DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.0--1.1.sql
+DATA = fuzzystrmatch--1.2.sql fuzzystrmatch--1.1--1.2.sql fuzzystrmatch--1.0--1.1.sql
 PGFILEDESC = "fuzzystrmatch - similarities and distance between strings"
 
 REGRESS = fuzzystrmatch
@@ -22,3 +23,8 @@ top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 include $(top_srcdir)/contrib/contrib-global.mk
 endif
+
+daitch_mokotoff.o: daitch_mokotoff.h
+
+daitch_mokotoff.h: daitch_mokotoff_header.pl
+	perl $< > $@
diff --git a/contrib/fuzzystrmatch/daitch_mokotoff.c b/contrib/fuzzystrmatch/daitch_mokotoff.c
new file mode 100644
index 00..302e9a6d86
--- /dev/null
+++ b/contrib/fuzzystrmatch/daitch_mokotoff.c
@@ -0,0 +1,516 @@
+/*
+ * Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2021 Finance Norway
+ * Author: Dag Lem 
+ *
+ * This implementation of the Daitch-Mokotoff Soundex System aims at high
+ * performance.
+ *
+ * - The processing of each phoneme is initiated by an O(1) table lookup.
+ * - For phonemes containing more than one character, a coding tree is traversed
+ *   to process the complete phoneme.
+ * - The (alternate) soundex codes are produced digit by digit in-place in
+ *   another tree structure.
+ *
+ *  References:
+ *
+ * https://www.avotaynu.com/soundex.htm
+ * https://www.jewishgen.org/InfoFiles/Soundex.html
+ * https://familypedia.fandom.com/wiki/Daitch-Mokotoff_Soundex
+ * https://stevemorse.org/census/soundex.html (dmlat.php, dmsoundex.php)
+ * https://github.com/apache/commons-codec/ (dmrules.txt, DaitchMokotoffSoundex.java)
+ * https://metacpan.org/pod/Text::Phonetic (Dait

Re: daitch_mokotoff module

2021-12-13 Thread Dag Lem
Please find attached an updated patch, with the following fixes:

* Replaced remaining malloc/free with palloc/pfree.
* Made "make check" pass.
* Updated notes on other implementations.

Best regards

Dag Lem

diff --git a/contrib/Makefile b/contrib/Makefile
index 87bf87ab90..5ea729 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -14,6 +14,7 @@ SUBDIRS = \
 		btree_gist	\
 		citext		\
 		cube		\
+		daitch_mokotoff	\
 		dblink		\
 		dict_int	\
 		dict_xsyn	\
diff --git a/contrib/daitch_mokotoff/Makefile b/contrib/daitch_mokotoff/Makefile
new file mode 100644
index 00..baec5e31d4
--- /dev/null
+++ b/contrib/daitch_mokotoff/Makefile
@@ -0,0 +1,25 @@
+# contrib/daitch_mokotoff/Makefile
+
+MODULE_big = daitch_mokotoff
+OBJS = \
+	$(WIN32RES) \
+	daitch_mokotoff.o
+
+EXTENSION = daitch_mokotoff
+DATA = daitch_mokotoff--1.0.sql
+PGFILEDESC = "daitch_mokotoff - Daitch-Mokotoff Soundex System"
+
+HEADERS = daitch_mokotoff.h
+
+REGRESS = daitch_mokotoff
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/daitch_mokotoff
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/daitch_mokotoff/daitch_mokotoff--1.0.sql b/contrib/daitch_mokotoff/daitch_mokotoff--1.0.sql
new file mode 100644
index 00..0b5a643175
--- /dev/null
+++ b/contrib/daitch_mokotoff/daitch_mokotoff--1.0.sql
@@ -0,0 +1,8 @@
+/* contrib/daitch_mokotoff/daitch_mokotoff--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION daitch_mokotoff" to load this file. \quit
+
+CREATE FUNCTION daitch_mokotoff(text) RETURNS text
+AS 'MODULE_PATHNAME', 'daitch_mokotoff'
+LANGUAGE C IMMUTABLE STRICT PARALLEL SAFE;
diff --git a/contrib/daitch_mokotoff/daitch_mokotoff.c b/contrib/daitch_mokotoff/daitch_mokotoff.c
new file mode 100644
index 00..a7f1fd8541
--- /dev/null
+++ b/contrib/daitch_mokotoff/daitch_mokotoff.c
@@ -0,0 +1,549 @@
+/*
+ * Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2021 Finance Norway
+ * Author: Dag Lem 
+ *
+ * This implementation of the Daitch-Mokotoff Soundex System aims at high
+ * performance.
+ *
+ * - The processing of each phoneme is initiated by an O(1) table lookup.
+ * - For phonemes containing more than one character, a coding tree is traversed
+ *   to process the complete phoneme.
+ * - The (alternate) soundex codes are produced digit by digit in-place in
+ *   another tree structure.
+ *
+ *  References:
+ *
+ * https://www.avotaynu.com/soundex.htm
+ * https://www.jewishgen.org/InfoFiles/Soundex.html
+ * https://familypedia.fandom.com/wiki/Daitch-Mokotoff_Soundex
+ * https://stevemorse.org/census/soundex.html (dmlat.php, dmsoundex.php)
+ * https://github.com/apache/commons-codec/ (dmrules.txt, DaitchMokotoffSoundex.java)
+ * https://metacpan.org/pod/Text::Phonetic (DaitchMokotoff.pm)
+ *
+ * A few notes on other implementations:
+ *
+ * - "J" is considered a vowel in dmlat.php
+ * - The official rules for "RS" are commented out in dmlat.php
+ * - Identical code digits for adjacent letters are not collapsed correctly in
+ *   dmsoundex.php when double digit codes are involved. E.g. "BESST" yields
+ *   744300 instead of 743000 as for "BEST".
+ * - "Y" is not considered a vowel in DaitchMokotoffSoundex.java
+ * - Both dmlat.php and dmrules.txt have the same unofficial rules for "UE".
+ * - Coding of MN/NM + M/N differs between dmsoundex.php and DaitchMokotoffSoundex.java
+ * - No other known implementation yields the correct set of codes for e.g.
+ *   "CJC" (55 54 545000 45 40 44).
+ *
+ * Permission to use, copy, modify, and distribute this software and its
+ * documentation for any purpose, without fee, and without a written agreement
+ * is hereby granted, provided that the above copyright notice and this
+ * paragraph and the following two paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+ * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+ * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+ * AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+ * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+ * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+*/
+
+#include "daitch_mokotoff.h"
+
+#ifndef DM_MAIN
+
+#include "postgres.h"
+#include "utils/builtins.h"
+#include "mb/pg_wchar.h&qu

daitch_mokotoff module

2021-12-03 Thread Dag Lem
Hello,

Please find attached a patch for the daitch_mokotoff module.

This implements the Daitch-Mokotoff Soundex System, as described in
https://www.avotaynu.com/soundex.htm

The module is used in production at Finance Norway.

In order to verify correctness, I have compared generated soundex codes
with corresponding results from the implementation by Stephen P. Morse
at https://stevemorse.org/census/soundex.html

Where soundex codes differ, the daitch_mokotoff module has been found
to be correct. The Morse implementation uses a few unofficial rules,
and also has an error in the handling of adjacent identical code
digits. Please see daitch_mokotoff.c for further references and
comments.

For reference, detailed instructions for soundex code comparison are
attached.


Best regards

Dag Lem

diff --git a/contrib/Makefile b/contrib/Makefile
index 87bf87ab90..5ea729 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -14,6 +14,7 @@ SUBDIRS = \
 		btree_gist	\
 		citext		\
 		cube		\
+		daitch_mokotoff	\
 		dblink		\
 		dict_int	\
 		dict_xsyn	\
diff --git a/contrib/daitch_mokotoff/Makefile b/contrib/daitch_mokotoff/Makefile
new file mode 100644
index 00..baec5e31d4
--- /dev/null
+++ b/contrib/daitch_mokotoff/Makefile
@@ -0,0 +1,25 @@
+# contrib/daitch_mokotoff/Makefile
+
+MODULE_big = daitch_mokotoff
+OBJS = \
+	$(WIN32RES) \
+	daitch_mokotoff.o
+
+EXTENSION = daitch_mokotoff
+DATA = daitch_mokotoff--1.0.sql
+PGFILEDESC = "daitch_mokotoff - Daitch-Mokotoff Soundex System"
+
+HEADERS = daitch_mokotoff.h
+
+REGRESS = daitch_mokotoff
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/daitch_mokotoff
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/daitch_mokotoff/daitch_mokotoff--1.0.sql b/contrib/daitch_mokotoff/daitch_mokotoff--1.0.sql
new file mode 100644
index 00..0b5a643175
--- /dev/null
+++ b/contrib/daitch_mokotoff/daitch_mokotoff--1.0.sql
@@ -0,0 +1,8 @@
+/* contrib/daitch_mokotoff/daitch_mokotoff--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION daitch_mokotoff" to load this file. \quit
+
+CREATE FUNCTION daitch_mokotoff(text) RETURNS text
+AS 'MODULE_PATHNAME', 'daitch_mokotoff'
+LANGUAGE C IMMUTABLE STRICT PARALLEL SAFE;
diff --git a/contrib/daitch_mokotoff/daitch_mokotoff.c b/contrib/daitch_mokotoff/daitch_mokotoff.c
new file mode 100644
index 00..9e66aee434
--- /dev/null
+++ b/contrib/daitch_mokotoff/daitch_mokotoff.c
@@ -0,0 +1,551 @@
+/*
+ * Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2021 Finance Norway
+ * Author: Dag Lem 
+ *
+ * This implementation of the Daitch-Mokotoff Soundex System aims at high
+ * performance.
+ *
+ * - The processing of each phoneme is initiated by an O(1) table lookup.
+ * - For phonemes containing more than one character, a coding tree is traversed
+ *   to process the complete phoneme.
+ * - The (alternate) soundex codes are produced digit by digit in-place in
+ *   another tree structure.
+ *
+ *  References:
+ *
+ * https://www.avotaynu.com/soundex.htm
+ * https://www.jewishgen.org/InfoFiles/Soundex.html
+ * https://familypedia.fandom.com/wiki/Daitch-Mokotoff_Soundex
+ * https://stevemorse.org/phoneticinfo.htm (dmlat.php, dmsoundex.php)
+ * https://github.com/apache/commons-codec/ (dmrules.txt, DaitchMokotoffSoundex.java)
+ * https://metacpan.org/pod/Text::Phonetic (DaitchMokotoff.pm)
+ *
+ * A few notes on other implementations:
+ *
+ * - "J" is considered a vowel in dmlat.php
+ * - The official rules for "RS" are commented out in dmlat.php
+ * - Adjacent identical code digits are not collapsed correctly in dmsoundex.php
+ *   when double digit codes are involved. E.g. "BESST" yields 744300 instead of
+ *   743000 as for "BEST".
+ * - "Y" is not considered a vowel in DaitchMokotoffSoundex.java
+ * - Both dmlat.php and dmrules.txt have the same unofficial rules for "UE".
+ * - Coding of MN/NM + M/N differs between dmsoundex.php and DaitchMokotoffSoundex.java
+ * - Neither dmsoundex.php nor DaitchMokotoffSoundex.java yields all valid codes for e.g.
+ *   "CJC" (55 54 545000 45 40 44).
+ *
+ * Permission to use, copy, modify, and distribute this software and its
+ * documentation for any purpose, without fee, and without a written agreement
+ * is hereby granted, provided that the above copyright notice and this
+ * paragraph and the following two paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+ * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+ * POSSIBILIT