Re: daitch_mokotoff module

2023-04-14 Thread Tom Lane
Buildfarm member hamerkop has a niggle about this patch:

c:\\build-farm-local\\buildroot\\head\\pgsql.build\\contrib\\fuzzystrmatch\\daitch_mokotoff.c
 : warning C4819: The file contains a character that cannot be represented in 
the current code page (932). Save the file in Unicode format to prevent data 
loss

It's complaining about the comment in

static const char iso8859_1_to_ascii_upper[] =
/*
"`abcdefghijklmnopqrstuvwxyz{|}~  ¡¢£¤¥¦§¨©ª«¬ 
®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ"
*/
"`ABCDEFGHIJKLMNOPQRSTUVWXYZ{|}~  ! 

?AAECDNO*OYDSAAECDNO/OYDY";

There are some other comments with non-ASCII characters elsewhere in the
file, but I think it's mainly just the weird symbols here that might fail
to translate to encodings that are not based on ISO 8859-1.

I think we need to get rid of this warning: it's far from obvious that
it's a non-issue, and because the compiler is not at all specific about
where the issue is, people could waste a lot of time figuring that out.
In fact, it might *not* be a non-issue, if it prevents the source tree
as a whole from being processed by some tool or other.

So I propose to replace those symbols with "... random symbols ..." or
the like and see if the warning goes away.  If not, we might have to
resort to something more drastic like removing this comment altogether.
We do have non-ASCII text in comments and test cases elsewhere in the
tree, and have not had a lot of trouble with that, so I'm hoping the
letters can stay because they are useful to compare to the constant.

regards, tom lane




Re: daitch_mokotoff module

2023-04-08 Thread Andrew Dunstan


On 2023-04-07 Fr 23:25, Tom Lane wrote:

I wrote:

Anyway, I assume this is just syntactic sugar for something
we can do another way?  If it's at all fundamental, I'll have
to back the patch out.

On closer inspection, this script is completely devoid of any
need to deal in non-ASCII data at all.  So I just nuked the
"use" lines.





Yeah.

I just spent a little while staring at the perl code. I have to say it 
seems rather opaque, the data structure seems a bit baroque. I'll try to 
simplify it.



cheers


andrew

--
Andrew Dunstan
EDB:https://www.enterprisedb.com


Re: daitch_mokotoff module

2023-04-07 Thread Tom Lane
I wrote:
> Anyway, I assume this is just syntactic sugar for something
> we can do another way?  If it's at all fundamental, I'll have
> to back the patch out.

On closer inspection, this script is completely devoid of any
need to deal in non-ASCII data at all.  So I just nuked the
"use" lines.

regards, tom lane




Re: daitch_mokotoff module

2023-04-07 Thread Tom Lane
Andrew Dunstan  writes:
> On 2023-04-07 Fr 21:52, Tom Lane wrote:
>> pipit appears to be running a reasonably current system (RHEL8), so
>> the claim that "open" is a Perl core module appears false.  We need
>> to rewrite this to not use that.

> I think it is a core module (See ) but it 
> appears that some packagers have separated it out for reasons that 
> aren't entirely obvious:

Hmm, yeah: on my RHEL8 workstation

$ rpm -qf /usr/share/perl5/open.pm
perl-open-1.11-421.el8.noarch

It's not exactly clear how that came to be installed, because

$ rpm -q perl-open --whatrequires
no package requires perl-open

and indeed another nearby RHEL8 machine doesn't have that package
installed at all, even though I've got it loaded up with enough
stuff for most Postgres work.  (Sadly, I'd not tested on that one.)

Anyway, I assume this is just syntactic sugar for something
we can do another way?  If it's at all fundamental, I'll have
to back the patch out.

regards, tom lane




Re: daitch_mokotoff module

2023-04-07 Thread Andrew Dunstan


On 2023-04-07 Fr 21:52, Tom Lane wrote:

I wrote:

I pushed this after some mostly-cosmetic fiddling.  Most of the
buildfarm seems okay with it,

Spoke too soon [1]:

make[1]: Entering directory 
'/home/linux1/build-farm-16-pipit/buildroot/HEAD/pgsql.build/contrib/fuzzystrmatch'
'/usr/bin/perl' daitch_mokotoff_header.pl daitch_mokotoff.h
Can't locate open.pm in @INC (you may need to install the open module) (@INC 
contains: /usr/local/lib64/perl5 /usr/local/share/perl5 
/usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 
/usr/share/perl5) at daitch_mokotoff_header.pl line 15.
BEGIN failed--compilation aborted at daitch_mokotoff_header.pl line 15.
make[1]: *** [Makefile:33: daitch_mokotoff.h] Error 2

pipit appears to be running a reasonably current system (RHEL8), so
the claim that "open" is a Perl core module appears false.  We need
to rewrite this to not use that.

regards, tom lane

[1]https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=pipit=2023-04-08%2001%3A02%3A39




I think it is a core module (See ) but it 
appears that some packagers have separated it out for reasons that 
aren't entirely obvious:


andrew@emma:~ $ rpm -q -l -f /usr/share/perl5/open.pm
/usr/share/man/man3/open.3pm.gz
/usr/share/perl5/open.pm

cheers


andrew

--
Andrew Dunstan
EDB:https://www.enterprisedb.com


Re: daitch_mokotoff module

2023-04-07 Thread Tom Lane
I wrote:
> I pushed this after some mostly-cosmetic fiddling.  Most of the
> buildfarm seems okay with it,

Spoke too soon [1]:

make[1]: Entering directory 
'/home/linux1/build-farm-16-pipit/buildroot/HEAD/pgsql.build/contrib/fuzzystrmatch'
'/usr/bin/perl' daitch_mokotoff_header.pl daitch_mokotoff.h
Can't locate open.pm in @INC (you may need to install the open module) (@INC 
contains: /usr/local/lib64/perl5 /usr/local/share/perl5 
/usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 
/usr/share/perl5) at daitch_mokotoff_header.pl line 15.
BEGIN failed--compilation aborted at daitch_mokotoff_header.pl line 15.
make[1]: *** [Makefile:33: daitch_mokotoff.h] Error 2

pipit appears to be running a reasonably current system (RHEL8), so
the claim that "open" is a Perl core module appears false.  We need
to rewrite this to not use that.

regards, tom lane

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=pipit=2023-04-08%2001%3A02%3A39




Re: daitch_mokotoff module

2023-04-07 Thread Tom Lane
Andres Freund  writes:
> On 2023-04-07 21:13:43 -0400, Tom Lane wrote:
>> I pushed this after some mostly-cosmetic fiddling.  Most of the
>> buildfarm seems okay with it, but crake's perlcritic run is not:
>> 
>> ./contrib/fuzzystrmatch/daitch_mokotoff_header.pl: I/O layer ":utf8" used at 
>> line 15, column 5.  Use ":encoding(UTF-8)" to get strict validation.  
>> ([InputOutput::RequireEncodingWithUTF8Layer] Severity: 5)

> Unless it's not available with old versions, using :encoding(UTF-8) seems
> sensible?

Yeah, that's the obvious fix, I was just wondering if people with
more perl-fu than I have see a problem with it.  But I'll go ahead
and push that for now.

regards, tom lane




Re: daitch_mokotoff module

2023-04-07 Thread Andres Freund
Hi,

On 2023-04-07 21:13:43 -0400, Tom Lane wrote:
> I wrote:
> > That seems fine to me.  I'll check this over and see if I can get
> > it pushed today.
> 
> I pushed this after some mostly-cosmetic fiddling.  Most of the
> buildfarm seems okay with it, but crake's perlcritic run is not:
> 
> ./contrib/fuzzystrmatch/daitch_mokotoff_header.pl: I/O layer ":utf8" used at 
> line 15, column 5.  Use ":encoding(UTF-8)" to get strict validation.  
> ([InputOutput::RequireEncodingWithUTF8Layer] Severity: 5)
> 
> Any suggestions on exactly how to pacify that?

You could follow it's advise and replace the :utf8 with :encoding(UTF-8), that
works here. Or disable it in that piece of code with ## no critic
(RequireEncodingWithUTF8Layer) Or we could disable the warning in
perlcriticrc for all files?

Unless it's not available with old versions, using :encoding(UTF-8) seems
sensible?

Greetings,

Andres Freund




Re: daitch_mokotoff module

2023-04-07 Thread Tom Lane
I wrote:
> That seems fine to me.  I'll check this over and see if I can get
> it pushed today.

I pushed this after some mostly-cosmetic fiddling.  Most of the
buildfarm seems okay with it, but crake's perlcritic run is not:

./contrib/fuzzystrmatch/daitch_mokotoff_header.pl: I/O layer ":utf8" used at 
line 15, column 5.  Use ":encoding(UTF-8)" to get strict validation.  
([InputOutput::RequireEncodingWithUTF8Layer] Severity: 5)

Any suggestions on exactly how to pacify that?

regards, tom lane




Re: daitch_mokotoff module

2023-04-07 Thread Tom Lane
Tomas Vondra  writes:
> Hi, I think from the technical point of view it's sound and ready for
> commit. The patch stalled on the copyright/credit stuff, which is
> somewhat separate and mostly non-technical aspect of patches. Sorry for
> that, I'm sure it's annoying/frustrating :-(

> I see the current patch has two simple lines:

>  * This module was originally sponsored by Finance Norway /
>  * Trafikkforsikringsforeningen, and implemented by Dag Lem

> Any objections to this level of attribution in commnents?

That seems fine to me.  I'll check this over and see if I can get
it pushed today.

regards, tom lane




Re: daitch_mokotoff module

2023-04-03 Thread Tomas Vondra
On 4/3/23 15:19, Dag Lem wrote:
> Dag Lem  writes:
> 
>> I sincerely hope this resolves any blocking issues with copyright /
>> legalese / credits.
>>
> 
> Can this now be considered ready for commiter, so that Paul or someone
> else can flip the bit?
> 

Hi, I think from the technical point of view it's sound and ready for
commit. The patch stalled on the copyright/credit stuff, which is
somewhat separate and mostly non-technical aspect of patches. Sorry for
that, I'm sure it's annoying/frustrating :-(

I see the current patch has two simple lines:

 * This module was originally sponsored by Finance Norway /
 * Trafikkforsikringsforeningen, and implemented by Dag Lem

Any objections to this level of attribution in commnents?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: daitch_mokotoff module

2023-04-03 Thread Dag Lem
Dag Lem  writes:

> I sincerely hope this resolves any blocking issues with copyright /
> legalese / credits.
>

Can this now be considered ready for commiter, so that Paul or someone
else can flip the bit?

Best regards
Dag Lem




Re: daitch_mokotoff module

2023-03-06 Thread Dag Lem
Dag Lem  writes:

> Tomas Vondra  writes:
>
>> On 2/7/23 18:08, Paul Ramsey wrote:
>>> 
>>> 
 On Feb 7, 2023, at 6:47 AM, Dag Lem  wrote:

 I just went by to check the status of the patch, and I noticed that
 you've added yourself as reviewer earlier - great!

 Please tell me if there is anything I can do to help bring this across
 the finish line.
>>> 
>>> Honestly, I had set it to Ready for Committer, but then I went to
>>> run regression one more time and my regression blew up. I found I
>>> couldn't enable the UTF tests without things failing. And I don't
>>> blame you! I think my installation is probably out-of-alignment in
>>> some way, but I didn't want to flip the Ready flag without having
>>> run everything through to completion, so I flipped it back. Also,
>>> are the UTF tests enabled by default? It wasn't clear to me that
>>> they were?
>>> 
>> The utf8 tests are enabled depending on the encoding returned by
>> getdatabaseencoding(). Systems with other encodings will simply use the
>> alternate .out file. And it works perfectly fine for me.
>>
>> IMHO it's ready for committer.
>>
>>
>> regards
>
> Yes, the UTF-8 tests follow the current best practice as has been
> explained to me earlier. The following patch exemplifies this:
>
> https://github.com/postgres/postgres/commit/c2e8bd27519f47ff56987b30eb34a01969b9a9e8
>
>

Can you please have a look at this again?

Best regards,

Dag Lem




Re: daitch_mokotoff module

2023-02-14 Thread Dag Lem
I sincerely hope this resolves any blocking issues with copyright /
legalese / credits.

Best regards

Dag Lem

diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile
index 0704894f88..12baf2d884 100644
--- a/contrib/fuzzystrmatch/Makefile
+++ b/contrib/fuzzystrmatch/Makefile
@@ -3,14 +3,15 @@
 MODULE_big = fuzzystrmatch
 OBJS = \
 	$(WIN32RES) \
+	daitch_mokotoff.o \
 	dmetaphone.o \
 	fuzzystrmatch.o
 
 EXTENSION = fuzzystrmatch
-DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.0--1.1.sql
+DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.1--1.2.sql fuzzystrmatch--1.0--1.1.sql
 PGFILEDESC = "fuzzystrmatch - similarities and distance between strings"
 
-REGRESS = fuzzystrmatch
+REGRESS = fuzzystrmatch fuzzystrmatch_utf8
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
@@ -22,3 +23,14 @@ top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 include $(top_srcdir)/contrib/contrib-global.mk
 endif
+
+# Force this dependency to be known even without dependency info built:
+daitch_mokotoff.o: daitch_mokotoff.h
+
+daitch_mokotoff.h: daitch_mokotoff_header.pl
+	perl $< $@
+
+distprep: daitch_mokotoff.h
+
+maintainer-clean:
+	rm -f daitch_mokotoff.h
diff --git a/contrib/fuzzystrmatch/daitch_mokotoff.c b/contrib/fuzzystrmatch/daitch_mokotoff.c
new file mode 100644
index 00..9215ab6530
--- /dev/null
+++ b/contrib/fuzzystrmatch/daitch_mokotoff.c
@@ -0,0 +1,567 @@
+/*
+ * Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * This module was originally sponsored by Finance Norway /
+ * Trafikkforsikringsforeningen, and implemented by Dag Lem 
+ *
+ * The implementation of the Daitch-Mokotoff Soundex System aims at correctness
+ * and high performance, and can be summarized as follows:
+ *
+ * - The processing of each phoneme is initiated by an O(1) table lookup.
+ * - For phonemes containing more than one character, a coding tree is traversed
+ *   to process the complete phoneme.
+ * - The (alternate) soundex codes are produced digit by digit in-place in
+ *   another tree structure.
+ *
+ * References:
+ *
+ * https://www.avotaynu.com/soundex.htm
+ * https://www.jewishgen.org/InfoFiles/Soundex.html
+ * https://familypedia.fandom.com/wiki/Daitch-Mokotoff_Soundex
+ * https://stevemorse.org/census/soundex.html (dmlat.php, dmsoundex.php)
+ * https://github.com/apache/commons-codec/ (dmrules.txt, DaitchMokotoffSoundex.java)
+ * https://metacpan.org/pod/Text::Phonetic (DaitchMokotoff.pm)
+ *
+ * A few notes on other implementations:
+ *
+ * - All other known implementations have the same unofficial rules for "UE",
+ *   these are also adapted by this implementation (0, 1, NC).
+ * - The only other known implementation which is capable of generating all
+ *   correct soundex codes in all cases is the JOS Soundex Calculator at
+ *   https://www.jewishgen.org/jos/jossound.htm
+ * - "J" is considered (only) a vowel in dmlat.php
+ * - The official rules for "RS" are commented out in dmlat.php
+ * - Identical code digits for adjacent letters are not collapsed correctly in
+ *   dmsoundex.php when double digit codes are involved. E.g. "BESST" yields
+ *   744300 instead of 743000 as for "BEST".
+ * - "J" is considered (only) a consonant in DaitchMokotoffSoundex.java
+ * - "Y" is not considered a vowel in DaitchMokotoffSoundex.java
+*/
+
+#include "postgres.h"
+
+#include "catalog/pg_type.h"
+#include "mb/pg_wchar.h"
+#include "utils/array.h"
+#include "utils/builtins.h"
+#include "utils/memutils.h"
+
+
+/*
+ * The soundex coding chart table is adapted from
+ * https://www.jewishgen.org/InfoFiles/Soundex.html
+ * See daitch_mokotoff_header.pl for details.
+*/
+
+/* Generated coding chart table */
+#include "daitch_mokotoff.h"
+
+#define DM_CODE_DIGITS 6
+
+/* Node in soundex code tree */
+struct dm_node
+{
+	int			soundex_length; /* Length of generated soundex code */
+	char		soundex[DM_CODE_DIGITS];	/* Soundex code */
+	int			is_leaf;		/* Candidate for complete soundex code */
+	int			last_update;	/* Letter number for last update of node */
+	char		code_digit;		/* Last code digit, 0 - 9 */
+
+	/*
+	 * One or two alternate code digits leading to this node. If there are two
+	 * digits, one of them is always an 'X'. Repeated code digits and 'X' lead
+	 * back to the same node.
+	 */
+	char		prev_code_digits[2];
+	/* One or two alternate code digits moving forward. */
+	char		next_code_digits[2];
+	/* ORed together code index(es) used to reach current node. */
+	int			prev_code_index;
+	int			next_code_index;
+	/* Possible nodes branching out from this node - digits 0-9. */
+	struct dm_node *children[10];
+	/* Next node in linked list. Alternating index for each iteration. */
+	struct dm_node *next[2];
+};
+
+typedef struct dm_node dm_node;
+
+
+/* Internal C implementation */
+static int	daitch_mokotoff_coding(char *word, ArrayBuildState *soundex, MemoryContext tmp_ctx);
+
+
+PG_FUNCTION_INFO_V1(daitch_mokotoff);
+
+Datum
+daitch_mokotoff(PG_FUNCTION_ARGS)
+{

Re: daitch_mokotoff module

2023-02-10 Thread Andres Freund
Hi,

On 2023-02-09 10:28:36 +0100, Dag Lem wrote:
> I'll ask again, would the proposed credits be acceptable? In this case,
> the code already existed elsewhere (as in your example for double
> metaphone) as a separate extension. The copyright owner is OK with
> copyright assignment, however I find it quite unreasonable that proper
> credits should not be given.

You don't need to assign copyright, it needs however be licensed under the
terms of the PostgreSQL License.


> Neither commit messages nor release notes
> follow the contributed module, which is in its entirety contributed by
> an external entity.

The problem with adding credits to source files is that it's hard to maintain
them reasonably over time. At what point has a C file been extended
sufficiently to warrant an additional author?


> I'll also point out that in addition to credits in code all over the
> place, PostgreSQL has much more prominent credits in the documentation:
>
>   grep -ER "Author" doc/ | grep -v PostgreSQL

FWIW, I'd rather remove them. In several of those the credited author has, by
now, only done a small fraction of the overall work.

They don't make much sense to me - you don't get a permanent mention in other
parts of the documentation either. Many of the binaries outside of contrib/
involved a lot more work by one individual than cases in contrib/. Lots of
backend code has a *lot* of work done by one individual, yet we don't add
authorship notes in relevant sections of the documentation.

Greetings,

Andres Freund




Re: daitch_mokotoff module

2023-02-09 Thread Dag Lem
Tomas Vondra  writes:

> On 2/8/23 15:31, Dag Lem wrote:
>> Alvaro Herrera  writes:
>> 
>>> On 2023-Jan-17, Dag Lem wrote:
>>>
 + * Daitch-Mokotoff Soundex
 + *
 + * Copyright (c) 2021 Finance Norway
 + * Author: Dag Lem 
>>>
>>> Hmm, I don't think we accept copyright lines that aren't "PostgreSQL
>>> Global Development Group".  Is it okay to use that, and update the year
>>> to 2023?  (Note that answering "no" very likely means your patch is not
>>> candidate for inclusion.)  Also, we tend not to have "Author:" lines.
>>>
>> 
>> You'll have to forgive me for not knowing about this rule:
>> 
>>   grep -ER "Copyright.*[0-9]{4}" contrib/ | grep -v PostgreSQL
>> 
>> In any case, I have checked with the copyright owner, and it would be OK
>> to assign the copyright to "PostgreSQL Global Development Group".
>> 
>
> I'm not entirely sure what's the rule either, and I'm a committer. My
> guess is these cases are either old and/or adding a code that already
> existed elsewhere (like some of the double metaphone, for example), or
> maybe both. But I'd bet we'd prefer not adding more ...
>
>> To avoid going back and forth with patches, how do you propose that the
>> sponsor and the author of the contributed module should be credited?
>> Woule something like this be acceptable?
>> 
>
> We generally credit contributors in two ways - by mentioning them in the
> commit message, and by listing them in the release notes (for individual
> features).
>

I'll ask again, would the proposed credits be acceptable? In this case,
the code already existed elsewhere (as in your example for double
metaphone) as a separate extension. The copyright owner is OK with
copyright assignment, however I find it quite unreasonable that proper
credits should not be given. Neither commit messages nor release notes
follow the contributed module, which is in its entirety contributed by
an external entity.

I'll also point out that in addition to credits in code all over the
place, PostgreSQL has much more prominent credits in the documentation:

  grep -ER "Author" doc/ | grep -v PostgreSQL

"Author" is even documented as a top level section in the Reference
Pages as "Author (only used in the contrib section)", see

  https://www.postgresql.org/docs/15/docguide-style.html#id-1.11.11.8.2

If there really exists some new rule which says that for new
contributions under contrib/, credits should not be allowed in any way
in either code or documentation (IANAL, but AFAIU this would be in
conflict with laws on author's moral rights in several countries), then
one would reasonably expect that you'd be upfront about this, both in
documentation, and also as the very first thing when a contribution is
first proposed for inclusion.

Best regards

Dag Lem




Re: daitch_mokotoff module

2023-02-08 Thread Tomas Vondra



On 2/8/23 15:31, Dag Lem wrote:
> Alvaro Herrera  writes:
> 
>> On 2023-Jan-17, Dag Lem wrote:
>>
>>> + * Daitch-Mokotoff Soundex
>>> + *
>>> + * Copyright (c) 2021 Finance Norway
>>> + * Author: Dag Lem 
>>
>> Hmm, I don't think we accept copyright lines that aren't "PostgreSQL
>> Global Development Group".  Is it okay to use that, and update the year
>> to 2023?  (Note that answering "no" very likely means your patch is not
>> candidate for inclusion.)  Also, we tend not to have "Author:" lines.
>>
> 
> You'll have to forgive me for not knowing about this rule:
> 
>   grep -ER "Copyright.*[0-9]{4}" contrib/ | grep -v PostgreSQL
> 
> In any case, I have checked with the copyright owner, and it would be OK
> to assign the copyright to "PostgreSQL Global Development Group".
> 

I'm not entirely sure what's the rule either, and I'm a committer. My
guess is these cases are either old and/or adding a code that already
existed elsewhere (like some of the double metaphone, for example), or
maybe both. But I'd bet we'd prefer not adding more ...

> To avoid going back and forth with patches, how do you propose that the
> sponsor and the author of the contributed module should be credited?
> Woule something like this be acceptable?
> 

We generally credit contributors in two ways - by mentioning them in the
commit message, and by listing them in the release notes (for individual
features).

> /*
>  * Daitch-Mokotoff Soundex
>  *
>  * Copyright (c) 2023, PostgreSQL Global Development Group
>  *
>  * This module was sponsored by Finance Norway / Trafikkforsikringsforeningen
>  * and implemented by Dag Lem 
>  *
>  ...
> 
> [...]
> 
>>
>> We don't keep a separate copyright statement in the file; rather we
>> assume that all files are under the PostgreSQL license, which is in the
>> COPYRIGHT file at the top of the tree.  Changing it thus has the side
>> effect that these disclaim notes refer to the University of California
>> rather than "the Author".  IANAL.
> 
> OK, no problem. Note that you will again find counterexamples under
> contrib/ (and in some other places):
> 
>   grep -R "Permission to use" .
> 
>> I think we should add SPDX markers to all the files we distribute:
>> /* SPDX-License-Identifier: PostgreSQL */
>>
>> https://spdx.dev/ids/
>> https://spdx.org/licenses/PostgreSQL.html
> 
> As far as I can tell, this is not included in any file so far, and is
> thus better left to decide and implement by someone else.
> 

I don't think Alvaro was suggesting this patch should do that. It was
more a generic comment about what the project as a whole might do.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: daitch_mokotoff module

2023-02-08 Thread Dag Lem
Alvaro Herrera  writes:

> On 2023-Jan-17, Dag Lem wrote:
>
>> + * Daitch-Mokotoff Soundex
>> + *
>> + * Copyright (c) 2021 Finance Norway
>> + * Author: Dag Lem 
>
> Hmm, I don't think we accept copyright lines that aren't "PostgreSQL
> Global Development Group".  Is it okay to use that, and update the year
> to 2023?  (Note that answering "no" very likely means your patch is not
> candidate for inclusion.)  Also, we tend not to have "Author:" lines.
>

You'll have to forgive me for not knowing about this rule:

  grep -ER "Copyright.*[0-9]{4}" contrib/ | grep -v PostgreSQL

In any case, I have checked with the copyright owner, and it would be OK
to assign the copyright to "PostgreSQL Global Development Group".

To avoid going back and forth with patches, how do you propose that the
sponsor and the author of the contributed module should be credited?
Woule something like this be acceptable?

/*
 * Daitch-Mokotoff Soundex
 *
 * Copyright (c) 2023, PostgreSQL Global Development Group
 *
 * This module was sponsored by Finance Norway / Trafikkforsikringsforeningen
 * and implemented by Dag Lem 
 *
 ...

[...]

>
> We don't keep a separate copyright statement in the file; rather we
> assume that all files are under the PostgreSQL license, which is in the
> COPYRIGHT file at the top of the tree.  Changing it thus has the side
> effect that these disclaim notes refer to the University of California
> rather than "the Author".  IANAL.

OK, no problem. Note that you will again find counterexamples under
contrib/ (and in some other places):

  grep -R "Permission to use" .

> I think we should add SPDX markers to all the files we distribute:
> /* SPDX-License-Identifier: PostgreSQL */
>
> https://spdx.dev/ids/
> https://spdx.org/licenses/PostgreSQL.html

As far as I can tell, this is not included in any file so far, and is
thus better left to decide and implement by someone else.

Best regards

Dag Lem




Re: daitch_mokotoff module

2023-02-08 Thread Dag Lem
Tomas Vondra  writes:

> On 2/7/23 18:08, Paul Ramsey wrote:
>> 
>> 
>>> On Feb 7, 2023, at 6:47 AM, Dag Lem  wrote:
>>>
>>> I just went by to check the status of the patch, and I noticed that
>>> you've added yourself as reviewer earlier - great!
>>>
>>> Please tell me if there is anything I can do to help bring this across
>>> the finish line.
>> 
>> Honestly, I had set it to Ready for Committer, but then I went to
>> run regression one more time and my regression blew up. I found I
>> couldn't enable the UTF tests without things failing. And I don't
>> blame you! I think my installation is probably out-of-alignment in
>> some way, but I didn't want to flip the Ready flag without having
>> run everything through to completion, so I flipped it back. Also,
>> are the UTF tests enabled by default? It wasn't clear to me that
>> they were?
>> 
> The utf8 tests are enabled depending on the encoding returned by
> getdatabaseencoding(). Systems with other encodings will simply use the
> alternate .out file. And it works perfectly fine for me.
>
> IMHO it's ready for committer.
>
>
> regards

Yes, the UTF-8 tests follow the current best practice as has been
explained to me earlier. The following patch exemplifies this:

https://github.com/postgres/postgres/commit/c2e8bd27519f47ff56987b30eb34a01969b9a9e8


Best regards,

Dag Lem




Re: daitch_mokotoff module

2023-02-08 Thread Alvaro Herrera
On 2023-Jan-17, Dag Lem wrote:

> + * Daitch-Mokotoff Soundex
> + *
> + * Copyright (c) 2021 Finance Norway
> + * Author: Dag Lem 

Hmm, I don't think we accept copyright lines that aren't "PostgreSQL
Global Development Group".  Is it okay to use that, and update the year
to 2023?  (Note that answering "no" very likely means your patch is not
candidate for inclusion.)  Also, we tend not to have "Author:" lines.

> + * Permission to use, copy, modify, and distribute this software and its
> + * documentation for any purpose, without fee, and without a written 
> agreement
> + * is hereby granted, provided that the above copyright notice and this
> + * paragraph and the following two paragraphs appear in all copies.
> + *
> + * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
> + * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
> + * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
> + * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
> + * POSSIBILITY OF SUCH DAMAGE.
> + *
> + * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
> + * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
> + * AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
> + * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
> + * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.

We don't keep a separate copyright statement in the file; rather we
assume that all files are under the PostgreSQL license, which is in the
COPYRIGHT file at the top of the tree.  Changing it thus has the side
effect that these disclaim notes refer to the University of California
rather than "the Author".  IANAL.


I think we should add SPDX markers to all the files we distribute:
/* SPDX-License-Identifier: PostgreSQL */

https://spdx.dev/ids/
https://spdx.org/licenses/PostgreSQL.html

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Tiene valor aquel que admite que es un cobarde" (Fernandel)




Re: daitch_mokotoff module

2023-02-07 Thread Tomas Vondra
On 2/7/23 18:08, Paul Ramsey wrote:
> 
> 
>> On Feb 7, 2023, at 6:47 AM, Dag Lem  wrote:
>>
>> I just went by to check the status of the patch, and I noticed that
>> you've added yourself as reviewer earlier - great!
>>
>> Please tell me if there is anything I can do to help bring this across
>> the finish line.
> 
> Honestly, I had set it to Ready for Committer, but then I went to run 
> regression one more time and my regression blew up. I found I couldn't enable 
> the UTF tests without things failing. And I don't blame you! I think my 
> installation is probably out-of-alignment in some way, but I didn't want to 
> flip the Ready flag without having run everything through to completion, so I 
> flipped it back. Also, are the UTF tests enabled by default? It wasn't clear 
> to me that they were?
> 
The utf8 tests are enabled depending on the encoding returned by
getdatabaseencoding(). Systems with other encodings will simply use the
alternate .out file. And it works perfectly fine for me.

IMHO it's ready for committer.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: daitch_mokotoff module

2023-02-07 Thread Paul Ramsey



> On Feb 7, 2023, at 6:47 AM, Dag Lem  wrote:
> 
> I just went by to check the status of the patch, and I noticed that
> you've added yourself as reviewer earlier - great!
> 
> Please tell me if there is anything I can do to help bring this across
> the finish line.

Honestly, I had set it to Ready for Committer, but then I went to run 
regression one more time and my regression blew up. I found I couldn't enable 
the UTF tests without things failing. And I don't blame you! I think my 
installation is probably out-of-alignment in some way, but I didn't want to 
flip the Ready flag without having run everything through to completion, so I 
flipped it back. Also, are the UTF tests enabled by default? It wasn't clear to 
me that they were?

P



Re: daitch_mokotoff module

2023-02-07 Thread Dag Lem
Hi Paul,

I just went by to check the status of the patch, and I noticed that
you've added yourself as reviewer earlier - great!

Please tell me if there is anything I can do to help bring this across
the finish line.

Best regards,

Dag Lem




Re: daitch_mokotoff module

2023-01-20 Thread Dag Lem
Alvaro Herrera  writes:

> On 2023-Jan-05, Dag Lem wrote:
>
>> Is there anything else I should do here, to avoid the status being
>> incorrectly stuck at "Waiting for Author" again.
>
> Just mark it Needs Review for now.  I'll be back from vacation on Jan
> 11th and can have a look then (or somebody else can, perhaps.)

Paul Ramsey had a few comments in the mean time, and based on this I
have produced (yet another) patch, with improved documentation.

However it's still not marked as "Ready for Committer" - can you please
take a look again?

Best regards

Dag Lem




Re: daitch_mokotoff module

2023-01-17 Thread Dag Lem
Paul Ramsey  writes:

>> On Jan 12, 2023, at 7:30 AM, Dag Lem  wrote:
>> 

[...]

>> 
>> Sure, I can do that. You don't think this much example text will be
>> TL;DR?
>
> I can only speak for myself, but examples are the meat of
> documentation learning, so as long as they come with enough
> explanatory context to be legible it's worth having them, IMO.
>

I have updated the documentation, hopefully it is more accessible now.

I also corrected documentation for the other functions in fuzzystrmatch
(function name and argtype in the wrong order).

Crossing fingers that someone will eventually change the status to
"Ready for Committer" :-)

Best regards,

Dag Lem

diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile
index 0704894f88..12baf2d884 100644
--- a/contrib/fuzzystrmatch/Makefile
+++ b/contrib/fuzzystrmatch/Makefile
@@ -3,14 +3,15 @@
 MODULE_big = fuzzystrmatch
 OBJS = \
 	$(WIN32RES) \
+	daitch_mokotoff.o \
 	dmetaphone.o \
 	fuzzystrmatch.o
 
 EXTENSION = fuzzystrmatch
-DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.0--1.1.sql
+DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.1--1.2.sql fuzzystrmatch--1.0--1.1.sql
 PGFILEDESC = "fuzzystrmatch - similarities and distance between strings"
 
-REGRESS = fuzzystrmatch
+REGRESS = fuzzystrmatch fuzzystrmatch_utf8
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
@@ -22,3 +23,14 @@ top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 include $(top_srcdir)/contrib/contrib-global.mk
 endif
+
+# Force this dependency to be known even without dependency info built:
+daitch_mokotoff.o: daitch_mokotoff.h
+
+daitch_mokotoff.h: daitch_mokotoff_header.pl
+	perl $< $@
+
+distprep: daitch_mokotoff.h
+
+maintainer-clean:
+	rm -f daitch_mokotoff.h
diff --git a/contrib/fuzzystrmatch/daitch_mokotoff.c b/contrib/fuzzystrmatch/daitch_mokotoff.c
new file mode 100644
index 00..2548903770
--- /dev/null
+++ b/contrib/fuzzystrmatch/daitch_mokotoff.c
@@ -0,0 +1,582 @@
+/*
+ * Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2021 Finance Norway
+ * Author: Dag Lem 
+ *
+ * This implementation of the Daitch-Mokotoff Soundex System aims at high
+ * performance.
+ *
+ * - The processing of each phoneme is initiated by an O(1) table lookup.
+ * - For phonemes containing more than one character, a coding tree is traversed
+ *   to process the complete phoneme.
+ * - The (alternate) soundex codes are produced digit by digit in-place in
+ *   another tree structure.
+ *
+ *  References:
+ *
+ * https://www.avotaynu.com/soundex.htm
+ * https://www.jewishgen.org/InfoFiles/Soundex.html
+ * https://familypedia.fandom.com/wiki/Daitch-Mokotoff_Soundex
+ * https://stevemorse.org/census/soundex.html (dmlat.php, dmsoundex.php)
+ * https://github.com/apache/commons-codec/ (dmrules.txt, DaitchMokotoffSoundex.java)
+ * https://metacpan.org/pod/Text::Phonetic (DaitchMokotoff.pm)
+ *
+ * A few notes on other implementations:
+ *
+ * - All other known implementations have the same unofficial rules for "UE",
+ *   these are also adapted by this implementation (0, 1, NC).
+ * - The only other known implementation which is capable of generating all
+ *   correct soundex codes in all cases is the JOS Soundex Calculator at
+ *   https://www.jewishgen.org/jos/jossound.htm
+ * - "J" is considered (only) a vowel in dmlat.php
+ * - The official rules for "RS" are commented out in dmlat.php
+ * - Identical code digits for adjacent letters are not collapsed correctly in
+ *   dmsoundex.php when double digit codes are involved. E.g. "BESST" yields
+ *   744300 instead of 743000 as for "BEST".
+ * - "J" is considered (only) a consonant in DaitchMokotoffSoundex.java
+ * - "Y" is not considered a vowel in DaitchMokotoffSoundex.java
+ *
+ * Permission to use, copy, modify, and distribute this software and its
+ * documentation for any purpose, without fee, and without a written agreement
+ * is hereby granted, provided that the above copyright notice and this
+ * paragraph and the following two paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+ * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+ * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+ * AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+ * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+ * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+*/
+
+#include "postgres.h"
+
+#include "catalog/pg_type.h"
+#include "mb/pg_wchar.h"
+#include "utils/array.h"
+#include "utils/builtins.h"
+#include "utils/memutils.h"
+
+
+/*
+ * The soundex coding chart table is adapted from
+ * 

Re: daitch_mokotoff module

2023-01-12 Thread Paul Ramsey



> On Jan 12, 2023, at 7:30 AM, Dag Lem  wrote:
> 
> Paul Ramsey  writes:
> 
>> On Mon, Jan 2, 2023 at 2:03 PM Dag Lem  wrote:
>> 
>>> I also improved on the documentation example (using Full Text Search).
>>> AFAIK you can't make general queries like that using arrays, however in
>>> any case I must admit that text arrays seem like more natural building
>>> blocks than space delimited text here.
>> 
>> This is a fun addition to fuzzystrmatch.
> 
> I'm glad to hear it! :-)
> 
>> 
>> While it's a little late in the game, I'll just put it out there:
>> daitch_mokotoff() is way harder to type than soundex_dm(). Not sure
>> how you feel about that.
> 
> I chose the name in order to follow the naming of the other functions in
> fuzzystrmatch, which as far as I can tell are given the name which each
> algorithm is known by.
> 
> Personally I don't think it's worth it to deviate from the naming of the
> other functions just to avoid typing a few characters, and I certainly
> don't think daitch_mokotoff is any harder to get right than
> levenshtein_less_equal ;-)

Good points :)

> 
>> 
>> On the documentation, I found the leap directly into the tsquery
>> example a bit too big. Maybe start with a very simple example,
>> 
>> --
>> dm=# SELECT daitch_mokotoff('Schwartzenegger'),
>>daitch_mokotoff('Swartzenegger');
>> 
>> daitch_mokotoff | daitch_mokotoff
>> -+-
>> {479465}| {479465}
>> --
>> 
>> Then transition into a more complex example that illustrates the GIN
>> index technique you mention in the text, but do not show:
>> 
>> --
>> CREATE TABLE dm_gin (source text, dm text[]);
>> 
>> INSERT INTO dm_gin (source) VALUES
>>('Swartzenegger'),
>>('John'),
>>('James'),
>>('Steinman'),
>>('Steinmetz');
>> 
>> UPDATE dm_gin SET dm = daitch_mokotoff(source);
>> 
>> CREATE INDEX dm_gin_x ON dm_gin USING GIN (dm);
>> 
>> SELECT * FROM dm_gin WHERE dm && daitch_mokotoff('Schwartzenegger');
>> --
> 
> Sure, I can do that. You don't think this much example text will be
> TL;DR?

I can only speak for myself, but examples are the meat of documentation 
learning, so as long as they come with enough explanatory context to be legible 
it's worth having them, IMO.

> 
>> 
>> And only then go into the tsearch example. Incidentally, what does the
>> tsearch approach provide that the simple GIN approach does not?
> 
> The example shows how to do a simultaneous match on first AND last
> names, where the first and last names (any number of names) are stored
> in the same indexed column, and the order of the names in the index and
> the search term does not matter.
> 
> If you were to use the GIN "&&" operator, you would get a match if
> either the first OR the last name matches. If you were to use the GIN
> "@>" operator, you would *not* get a match if the search term contains
> more soundex codes than the indexed name.
> 
> E.g. this yields a correct match:
> SELECT soundex_tsvector('John Yamson') @@ soundex_tsquery('John Jameson');
> 
> While this yields a false positive:
> SELECT (daitch_mokotoff('John') || daitch_mokotoff('Yamson')) && 
> (daitch_mokotoff('John') || daitch_mokotoff('Doe'));
> 
> And this yields a false negative:
> SELECT (daitch_mokotoff('John') || daitch_mokotoff('Yamson')) @> 
> (daitch_mokotoff('John') || daitch_mokotoff('Jameson'));
> 
> This may explained better by simply showing the output of
> soundex_tsvector and soundex_tsquery:
> 
> SELECT soundex_tsvector('John Yamson');
> soundex_tsvector 
> --
> '16':1 '164600':3 '46':2
> 
> SELECT soundex_tsquery('John Jameson');
>  soundex_tsquery  
> ---
> ( '16' | '46' ) & ( '164600' | '464600' )
> 
>> Ideally explain that briefly before launching into the example. With
>> all the custom functions and so on it's a little involved, so maybe if
>> there's not a huge win in using that approach drop it entirely?
> 
> I believe this functionality is quite useful, and that it's actually
> what's called for in many situations. So, I'd rather not drop this
> example.

Sounds good

P

> 
>> 
>> ATB,
>> P
>> 
> 
> Best regards,
> 
> Dag Lem





Re: daitch_mokotoff module

2023-01-12 Thread Dag Lem
Paul Ramsey  writes:

> On Mon, Jan 2, 2023 at 2:03 PM Dag Lem  wrote:
>
>> I also improved on the documentation example (using Full Text Search).
>> AFAIK you can't make general queries like that using arrays, however in
>> any case I must admit that text arrays seem like more natural building
>> blocks than space delimited text here.
>
> This is a fun addition to fuzzystrmatch.

I'm glad to hear it! :-)

>
> While it's a little late in the game, I'll just put it out there:
> daitch_mokotoff() is way harder to type than soundex_dm(). Not sure
> how you feel about that.

I chose the name in order to follow the naming of the other functions in
fuzzystrmatch, which as far as I can tell are given the name which each
algorithm is known by.

Personally I don't think it's worth it to deviate from the naming of the
other functions just to avoid typing a few characters, and I certainly
don't think daitch_mokotoff is any harder to get right than
levenshtein_less_equal ;-)

So, if I were to decide, I wouldn't change the name of the function.
However I'm obviously not calling the shots on what goes into PostgreSQL
- perhaps someone else would like to weigh in on this?

>
> On the documentation, I found the leap directly into the tsquery
> example a bit too big. Maybe start with a very simple example,
>
> --
> dm=# SELECT daitch_mokotoff('Schwartzenegger'),
> daitch_mokotoff('Swartzenegger');
>
>  daitch_mokotoff | daitch_mokotoff
> -+-
>  {479465}| {479465}
> --
>
> Then transition into a more complex example that illustrates the GIN
> index technique you mention in the text, but do not show:
>
> --
> CREATE TABLE dm_gin (source text, dm text[]);
>
> INSERT INTO dm_gin (source) VALUES
> ('Swartzenegger'),
> ('John'),
> ('James'),
> ('Steinman'),
> ('Steinmetz');
>
> UPDATE dm_gin SET dm = daitch_mokotoff(source);
>
> CREATE INDEX dm_gin_x ON dm_gin USING GIN (dm);
>
> SELECT * FROM dm_gin WHERE dm && daitch_mokotoff('Schwartzenegger');
> --

Sure, I can do that. You don't think this much example text will be
TL;DR?

>
> And only then go into the tsearch example. Incidentally, what does the
> tsearch approach provide that the simple GIN approach does not?

The example shows how to do a simultaneous match on first AND last
names, where the first and last names (any number of names) are stored
in the same indexed column, and the order of the names in the index and
the search term does not matter.

If you were to use the GIN "&&" operator, you would get a match if
either the first OR the last name matches. If you were to use the GIN
"@>" operator, you would *not* get a match if the search term contains
more soundex codes than the indexed name.

E.g. this yields a correct match:
SELECT soundex_tsvector('John Yamson') @@ soundex_tsquery('John Jameson');

While this yields a false positive:
SELECT (daitch_mokotoff('John') || daitch_mokotoff('Yamson')) && 
(daitch_mokotoff('John') || daitch_mokotoff('Doe'));

And this yields a false negative:
SELECT (daitch_mokotoff('John') || daitch_mokotoff('Yamson')) @> 
(daitch_mokotoff('John') || daitch_mokotoff('Jameson'));

This may explained better by simply showing the output of
soundex_tsvector and soundex_tsquery:

SELECT soundex_tsvector('John Yamson');
 soundex_tsvector 
--
 '16':1 '164600':3 '46':2

SELECT soundex_tsquery('John Jameson');
  soundex_tsquery  
---
 ( '16' | '46' ) & ( '164600' | '464600' )

> Ideally explain that briefly before launching into the example. With
> all the custom functions and so on it's a little involved, so maybe if
> there's not a huge win in using that approach drop it entirely?

I believe this functionality is quite useful, and that it's actually
what's called for in many situations. So, I'd rather not drop this
example.

>
> ATB,
> P
>

Best regards,

Dag Lem




Re: daitch_mokotoff module

2023-01-11 Thread Paul Ramsey
On Mon, Jan 2, 2023 at 2:03 PM Dag Lem  wrote:

> I also improved on the documentation example (using Full Text Search).
> AFAIK you can't make general queries like that using arrays, however in
> any case I must admit that text arrays seem like more natural building
> blocks than space delimited text here.

This is a fun addition to fuzzystrmatch.

While it's a little late in the game, I'll just put it out there:
daitch_mokotoff() is way harder to type than soundex_dm(). Not sure
how you feel about that.

On the documentation, I found the leap directly into the tsquery
example a bit too big. Maybe start with a very simple example,

--
dm=# SELECT daitch_mokotoff('Schwartzenegger'),
daitch_mokotoff('Swartzenegger');

 daitch_mokotoff | daitch_mokotoff
-+-
 {479465}| {479465}
--

Then transition into a more complex example that illustrates the GIN
index technique you mention in the text, but do not show:

--
CREATE TABLE dm_gin (source text, dm text[]);

INSERT INTO dm_gin (source) VALUES
('Swartzenegger'),
('John'),
('James'),
('Steinman'),
('Steinmetz');

UPDATE dm_gin SET dm = daitch_mokotoff(source);

CREATE INDEX dm_gin_x ON dm_gin USING GIN (dm);

SELECT * FROM dm_gin WHERE dm && daitch_mokotoff('Schwartzenegger');
--

And only then go into the tsearch example. Incidentally, what does the
tsearch approach provide that the simple GIN approach does not?
Ideally explain that briefly before launching into the example. With
all the custom functions and so on it's a little involved, so maybe if
there's not a huge win in using that approach drop it entirely?

ATB,
P




Re: daitch_mokotoff module

2023-01-06 Thread Dag Lem
Alvaro Herrera  writes:

> On 2023-Jan-05, Dag Lem wrote:
>
>> Is there anything else I should do here, to avoid the status being
>> incorrectly stuck at "Waiting for Author" again.
>
> Just mark it Needs Review for now.  I'll be back from vacation on Jan
> 11th and can have a look then (or somebody else can, perhaps.)

OK, done. Have a nice vacation!

Best regards

Dag Lem




Re: daitch_mokotoff module

2023-01-05 Thread Alvaro Herrera
On 2023-Jan-05, Dag Lem wrote:

> Is there anything else I should do here, to avoid the status being
> incorrectly stuck at "Waiting for Author" again.

Just mark it Needs Review for now.  I'll be back from vacation on Jan
11th and can have a look then (or somebody else can, perhaps.)

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Puedes vivir sólo una vez, pero si lo haces bien, una vez es suficiente"




Re: daitch_mokotoff module

2023-01-05 Thread Dag Lem
Is there anything else I should do here, to avoid the status being
incorrectly stuck at "Waiting for Author" again.

Best regards

Dag Lem




Re: daitch_mokotoff module

2023-01-02 Thread Dag Lem
Sorry about the latest unfinished email - don't know what key
combination I managed to hit there.

Alvaro Herrera  writes:

> Hello
>
> On 2022-Dec-23, Dag Lem wrote:
>

[...]

>
> So, yes, I'm proposing that we returns those as array elements and that
> @> is used to match them.
>

Looking into the array operators I guess that to match such arrays
directly one would actually use && (overlaps) rather than @> (contains),
but I digress.

The function is changed to return an array of soundex codes - I hope it
is now to your liking :-)

I also improved on the documentation example (using Full Text Search).
AFAIK you can't make general queries like that using arrays, however in
any case I must admit that text arrays seem like more natural building
blocks than space delimited text here.

[...]

>> BTW Vera 79 does not match Veras 794000, because they don't sound
>> the same (up to the maximum soundex code length).
>
> No, and maybe that's okay because they have different codes.  But they
> are both similar, in Daitch-Mokotoff, to Borja, which has two codes,
> 79 and 794000.  (Any Spanish speaker will readily tell you that
> neither Vera nor Veras are similar in any way to Borja, but D-M has
> chosen to say that each of them matches one of Borjas' codes.  So they
> *are* related, even though indirectly, and as a genealogist you *may* be
> interested in getting a match for a person called Vera when looking for
> relatives to a person called Veras.  And, as a Spanish speaker, that
> would make a lot of sense to me.)

It is what it is - we can't call it Daitch-Mokotoff Soundex while
implementing something else. Having said that, one can always pre- or
postprocess to tweak the results.

Daitch-Mokotoff Soundex is known to produce false positives, but that is
in many cases not a problem.

Even though it's clearly tuned for Jewish names, the soundex algorithm
seems to work just fine for European names (we use it to match mostly
Norwegian names).


Best regards

Dag Lem

diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile
index 0704894f88..12baf2d884 100644
--- a/contrib/fuzzystrmatch/Makefile
+++ b/contrib/fuzzystrmatch/Makefile
@@ -3,14 +3,15 @@
 MODULE_big = fuzzystrmatch
 OBJS = \
 	$(WIN32RES) \
+	daitch_mokotoff.o \
 	dmetaphone.o \
 	fuzzystrmatch.o
 
 EXTENSION = fuzzystrmatch
-DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.0--1.1.sql
+DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.1--1.2.sql fuzzystrmatch--1.0--1.1.sql
 PGFILEDESC = "fuzzystrmatch - similarities and distance between strings"
 
-REGRESS = fuzzystrmatch
+REGRESS = fuzzystrmatch fuzzystrmatch_utf8
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
@@ -22,3 +23,14 @@ top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 include $(top_srcdir)/contrib/contrib-global.mk
 endif
+
+# Force this dependency to be known even without dependency info built:
+daitch_mokotoff.o: daitch_mokotoff.h
+
+daitch_mokotoff.h: daitch_mokotoff_header.pl
+	perl $< $@
+
+distprep: daitch_mokotoff.h
+
+maintainer-clean:
+	rm -f daitch_mokotoff.h
diff --git a/contrib/fuzzystrmatch/daitch_mokotoff.c b/contrib/fuzzystrmatch/daitch_mokotoff.c
new file mode 100644
index 00..2548903770
--- /dev/null
+++ b/contrib/fuzzystrmatch/daitch_mokotoff.c
@@ -0,0 +1,582 @@
+/*
+ * Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2021 Finance Norway
+ * Author: Dag Lem 
+ *
+ * This implementation of the Daitch-Mokotoff Soundex System aims at high
+ * performance.
+ *
+ * - The processing of each phoneme is initiated by an O(1) table lookup.
+ * - For phonemes containing more than one character, a coding tree is traversed
+ *   to process the complete phoneme.
+ * - The (alternate) soundex codes are produced digit by digit in-place in
+ *   another tree structure.
+ *
+ *  References:
+ *
+ * https://www.avotaynu.com/soundex.htm
+ * https://www.jewishgen.org/InfoFiles/Soundex.html
+ * https://familypedia.fandom.com/wiki/Daitch-Mokotoff_Soundex
+ * https://stevemorse.org/census/soundex.html (dmlat.php, dmsoundex.php)
+ * https://github.com/apache/commons-codec/ (dmrules.txt, DaitchMokotoffSoundex.java)
+ * https://metacpan.org/pod/Text::Phonetic (DaitchMokotoff.pm)
+ *
+ * A few notes on other implementations:
+ *
+ * - All other known implementations have the same unofficial rules for "UE",
+ *   these are also adapted by this implementation (0, 1, NC).
+ * - The only other known implementation which is capable of generating all
+ *   correct soundex codes in all cases is the JOS Soundex Calculator at
+ *   https://www.jewishgen.org/jos/jossound.htm
+ * - "J" is considered (only) a vowel in dmlat.php
+ * - The official rules for "RS" are commented out in dmlat.php
+ * - Identical code digits for adjacent letters are not collapsed correctly in
+ *   dmsoundex.php when double digit codes are involved. E.g. "BESST" yields
+ *   744300 instead of 743000 as for "BEST".
+ * - "J" is considered (only) a consonant in DaitchMokotoffSoundex.java
+ * - "Y" is 

Re: daitch_mokotoff module

2023-01-02 Thread Dag Lem
Alvaro Herrera  writes:

> Hello
>
> On 2022-Dec-23, Dag Lem wrote:
>

[...]

> So, yes, I'm proposing that we returns those as array elements and that
> @> is used to match them.

Looking into the array operators I guess that to match such arrays
directly one would actually use && (overlaps) rather than @> (contains),
but I digress.

The function is changed to return an array of soundex codes - I hope it
is now to your liking :-)

I also improved on the documentation example (using Full Text Search).
AFAIK you can't make general queries like that using arrays, however in
any case I must admit that text arrays seem like more natural building
blocks than space delimited text here.

Search to perform

is the best match for Daitch-Mokotoff, however

, but
in any case I've changed it into return arrays now. I hope it is to your
liking.

>
>> Daitch-Mokotoff Soundex indexes alternative sounds for the same name,
>> however if I understand correctly, you want to index names by single
>> sounds, linking all alike sounding names to the same soundex code. I
>> fail to see how that is useful - if you want to find matches for a name,
>> you simply match against all indexed names. If you only consider one
>> sound, you won't find all names that match.
>
> Hmm, I think we're saying the same thing, but from opposite points of
> view.  No, I want each name to return multiple codes, but that those
> multiple codes can be treated as a multiple-value array of codes, rather
> than as a single string of space-separated codes.
>
>> In any case, as explained in the documentation, the implementation is
>> intended to be a companion to Full Text Search, thus text is the natural
>> representation for the soundex codes.
>
> Hmm, I don't agree with this point.  The numbers are representations of
> the strings, but they don't necessarily have to be strings themselves.
>
>
>> BTW Vera 79 does not match Veras 794000, because they don't sound
>> the same (up to the maximum soundex code length).
>
> No, and maybe that's okay because they have different codes.  But they
> are both similar, in Daitch-Mokotoff, to Borja, which has two codes,
> 79 and 794000.  (Any Spanish speaker will readily tell you that
> neither Vera nor Veras are similar in any way to Borja, but D-M has
> chosen to say that each of them matches one of Borjas' codes.  So they
> *are* related, even though indirectly, and as a genealogist you *may* be
> interested in getting a match for a person called Vera when looking for
> relatives to a person called Veras.  And, as a Spanish speaker, that
> would make a lot of sense to me.)
>
>
> Now, it's true that I've chosen to use Spanish names for my silly little
> experiment.  Maybe this isn't terribly useful as a practical example,
> because this algorithm seems to have been designed for Jew surnames and
> perhaps not many (or not any) Jews had Spanish surnames.  I don't know;
> I'm not a Jew myself (though Noah Gordon tells the tale of a Spanish Jew
> called Josep Álvarez in his book "The Winemaker", so I guess it's not
> impossible).  Anyway, I suspect if you repeat the experiment with names
> of other origins, you'll find pretty much the same results apply there,
> and that is the whole reason D-M returns multiple codes and not just
> one.
>
>
> Merry Christmas :-)

-- 
Dag




Re: daitch_mokotoff module

2022-12-25 Thread Alvaro Herrera
Hello

On 2022-Dec-23, Dag Lem wrote:

> It seems to me like you're trying to use soundex coding for something it
> was never designed for.

I'm not trying to use it for anything, actually.  I'm just reading the
pages your patch links to, to try and understand how this algorithm can
be best implemented in Postgres.

So I got to this page
https://www.avotaynu.com/soundex.htm
which explains that Daitch figured that it would be best if a letter
that can have two possible encodings would be encoded in both ways:

> 5. If a combination of letters could have two possible sounds, then it
> is coded in both manners. For example, the letters ch can have a soft
> sound such as in Chicago or a hard sound as in Christmas.

which I understand as meaning that a single name returns two possible
encodings, which is why these three names
 Barca Barco Parco
have two possible encodings
 795000 and 794000
which is what your algorithm returns.

In fact, using the word Christmas we do get alternative codes for the first
letter (either 4 or 5), precisely as in Daitch's example:

=# select daitch_mokotoff('christmas');
 daitch_mokotoff 
─
 594364 494364
(1 fila)

and if we take out the ambiguous 'ch', we get a single one:

=# select daitch_mokotoff('ristmas');
 daitch_mokotoff 
─
 943640
(1 fila)

and if we add another 'ch', we get the codes for each possibility at each
position of the ambiguous 'ch':

=# select daitch_mokotoff('christmach');
   daitch_mokotoff   
─
 594365 594364 494365 494364
(1 fila)


So, yes, I'm proposing that we returns those as array elements and that
@> is used to match them.

> Daitch-Mokotoff Soundex indexes alternative sounds for the same name,
> however if I understand correctly, you want to index names by single
> sounds, linking all alike sounding names to the same soundex code. I
> fail to see how that is useful - if you want to find matches for a name,
> you simply match against all indexed names. If you only consider one
> sound, you won't find all names that match.

Hmm, I think we're saying the same thing, but from opposite points of
view.  No, I want each name to return multiple codes, but that those
multiple codes can be treated as a multiple-value array of codes, rather
than as a single string of space-separated codes.

> In any case, as explained in the documentation, the implementation is
> intended to be a companion to Full Text Search, thus text is the natural
> representation for the soundex codes.

Hmm, I don't agree with this point.  The numbers are representations of
the strings, but they don't necessarily have to be strings themselves.


> BTW Vera 79 does not match Veras 794000, because they don't sound
> the same (up to the maximum soundex code length).

No, and maybe that's okay because they have different codes.  But they
are both similar, in Daitch-Mokotoff, to Borja, which has two codes,
79 and 794000.  (Any Spanish speaker will readily tell you that
neither Vera nor Veras are similar in any way to Borja, but D-M has
chosen to say that each of them matches one of Borjas' codes.  So they
*are* related, even though indirectly, and as a genealogist you *may* be
interested in getting a match for a person called Vera when looking for
relatives to a person called Veras.  And, as a Spanish speaker, that
would make a lot of sense to me.)


Now, it's true that I've chosen to use Spanish names for my silly little
experiment.  Maybe this isn't terribly useful as a practical example,
because this algorithm seems to have been designed for Jew surnames and
perhaps not many (or not any) Jews had Spanish surnames.  I don't know;
I'm not a Jew myself (though Noah Gordon tells the tale of a Spanish Jew
called Josep Álvarez in his book "The Winemaker", so I guess it's not
impossible).  Anyway, I suspect if you repeat the experiment with names
of other origins, you'll find pretty much the same results apply there,
and that is the whole reason D-M returns multiple codes and not just
one.


Merry Christmas :-)

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/




Re: daitch_mokotoff module

2022-12-24 Thread Dag Lem
Dag Lem  writes:

> Alvaro Herrera  writes:
>
>> On 2022-Dec-23, Alvaro Herrera wrote:
>>
>
> [...]
>
>> I tried downloading a list of surnames from here
>> https://www.bibliotecadenombres.com/apellidos/apellidos-espanoles/
>> pasted that in a text file and \copy'ed it into a table.  Then I ran
>> this query
>>
>> select string_agg(a, ' ' order by a), daitch_mokotoff(a), count(*)
>> from apellidos
>> group by daitch_mokotoff(a)
>> order by count(*) desc;
>>
>> so I have a first entry like this
>>
>> string_agg │ Balasco Balles Belasco Belles Blas Blasco Fallas Feliz
>> Palos Pelaez Plaza Valles Vallez Velasco Velez Veliz Veloz Villas
>> daitch_mokotoff │ 784000
>> count   │ 18
>>
>> but then I have a bunch of other entries with the same code 784000 as
>> alternative codes,
>>
>> string_agg  │ Velazco
>> daitch_mokotoff │ 784500 784000
>> count   │ 1
>>
>> string_agg  │ Palacio
>> daitch_mokotoff │ 785000 784000
>> count   │ 1
>>
>> I suppose I need to group these together somehow, and it would make more
>> sense to do that if the values were arrays.
>>
>>
>> If I scroll a bit further down and choose, say, 794000 (a relatively
>> popular one), then I have this
>>
>> string_agg │ Barraza Barrios Barros Bras Ferraz Frias Frisco Parras
>> Peraza Peres Perez Porras Varas Veras
>> daitch_mokotoff │ 794000
>> count   │ 14
>>
>> and looking for that code in the result I also get these three
>>
>> string_agg  │ Barca Barco Parco
>> daitch_mokotoff │ 795000 794000
>> count   │ 3
>>
>> string_agg  │ Borja
>> daitch_mokotoff │ 79 794000
>> count   │ 1
>>
>> string_agg  │ Borjas
>> daitch_mokotoff │ 794000 794400
>> count   │ 1
>>
>> and then I see that I should also search for possible matches in codes
>> 795000, 79 and 794400, so that gives me
>>
>> string_agg │ Baria Baro Barrio Barro Berra Borra Feria Para Parra
>> Perea Vera
>> daitch_mokotoff │ 79
>> count   │ 11
>>
>> string_agg  │ Barriga Borge Borrego Burgo Fraga
>> daitch_mokotoff │ 795000
>> count   │ 5
>>
>> string_agg  │ Borjas
>> daitch_mokotoff │ 794000 794400
>> count   │ 1
>>
>> which look closely related (compare "Veras" in the first to "Vera" in
>> the later set.  If you ignore that pseudo-match, you're likely to miss
>> possible family relationships.)
>>
>>
>> I suppose if I were a genealogy researcher, I would be helped by having
>> each of these codes behave as a separate unit, rather than me having to
>> split the string into the several possible contained values.
>
> It seems to me like you're trying to use soundex coding for something it
> was never designed for.
>
> As stated in my previous mail, soundex algorithms are designed to index
> names on some representation of sound, so that alike sounding names with
> different spellings will match, and as shown in the documentation
> example, that is exactly what the implementation facilitates.
>
> Daitch-Mokotoff Soundex indexes alternative sounds for the same name,
> however if I understand correctly, you want to index names by single
> sounds, linking all alike sounding names to the same soundex code. I
> fail to see how that is useful - if you want to find matches for a name,
> you simply match against all indexed names. If you only consider one
> sound, you won't find all names that match.
>
> In any case, as explained in the documentation, the implementation is
> intended to be a companion to Full Text Search, thus text is the natural
> representation for the soundex codes.
>
> BTW Vera 79 does not match Veras 794000, because they don't sound
> the same (up to the maximum soundex code length).
>

I've been sleeping on this, and perhaps the normal use case can just as
well (or better) be covered by the "@>" array operator? I originally
implemented similar functionality using another soundex algorithm more
than a decade ago, and either arrays couldn't be GIN indexed back then,
or I simply missed it. I'll have to get back to this - now it's
Christmas!

Merry Christmas!

Best regards,

Dag Lem




Re: daitch_mokotoff module

2022-12-23 Thread Dag Lem
Alvaro Herrera  writes:

> On 2022-Dec-23, Alvaro Herrera wrote:
>

[...]

> I tried downloading a list of surnames from here
> https://www.bibliotecadenombres.com/apellidos/apellidos-espanoles/
> pasted that in a text file and \copy'ed it into a table.  Then I ran
> this query
>
> select string_agg(a, ' ' order by a), daitch_mokotoff(a), count(*)
> from apellidos
> group by daitch_mokotoff(a)
> order by count(*) desc;
>
> so I have a first entry like this
>
> string_agg │ Balasco Balles Belasco Belles Blas Blasco Fallas Feliz
> Palos Pelaez Plaza Valles Vallez Velasco Velez Veliz Veloz Villas
> daitch_mokotoff │ 784000
> count   │ 18
>
> but then I have a bunch of other entries with the same code 784000 as
> alternative codes,
>
> string_agg  │ Velazco
> daitch_mokotoff │ 784500 784000
> count   │ 1
>
> string_agg  │ Palacio
> daitch_mokotoff │ 785000 784000
> count   │ 1
>
> I suppose I need to group these together somehow, and it would make more
> sense to do that if the values were arrays.
>
>
> If I scroll a bit further down and choose, say, 794000 (a relatively
> popular one), then I have this
>
> string_agg │ Barraza Barrios Barros Bras Ferraz Frias Frisco Parras
> Peraza Peres Perez Porras Varas Veras
> daitch_mokotoff │ 794000
> count   │ 14
>
> and looking for that code in the result I also get these three
>
> string_agg  │ Barca Barco Parco
> daitch_mokotoff │ 795000 794000
> count   │ 3
>
> string_agg  │ Borja
> daitch_mokotoff │ 79 794000
> count   │ 1
>
> string_agg  │ Borjas
> daitch_mokotoff │ 794000 794400
> count   │ 1
>
> and then I see that I should also search for possible matches in codes
> 795000, 79 and 794400, so that gives me
>
> string_agg │ Baria Baro Barrio Barro Berra Borra Feria Para Parra
> Perea Vera
> daitch_mokotoff │ 79
> count   │ 11
>
> string_agg  │ Barriga Borge Borrego Burgo Fraga
> daitch_mokotoff │ 795000
> count   │ 5
>
> string_agg  │ Borjas
> daitch_mokotoff │ 794000 794400
> count   │ 1
>
> which look closely related (compare "Veras" in the first to "Vera" in
> the later set.  If you ignore that pseudo-match, you're likely to miss
> possible family relationships.)
>
>
> I suppose if I were a genealogy researcher, I would be helped by having
> each of these codes behave as a separate unit, rather than me having to
> split the string into the several possible contained values.

It seems to me like you're trying to use soundex coding for something it
was never designed for.

As stated in my previous mail, soundex algorithms are designed to index
names on some representation of sound, so that alike sounding names with
different spellings will match, and as shown in the documentation
example, that is exactly what the implementation facilitates.

Daitch-Mokotoff Soundex indexes alternative sounds for the same name,
however if I understand correctly, you want to index names by single
sounds, linking all alike sounding names to the same soundex code. I
fail to see how that is useful - if you want to find matches for a name,
you simply match against all indexed names. If you only consider one
sound, you won't find all names that match.

In any case, as explained in the documentation, the implementation is
intended to be a companion to Full Text Search, thus text is the natural
representation for the soundex codes.

BTW Vera 79 does not match Veras 794000, because they don't sound
the same (up to the maximum soundex code length).

Best regards

Dag Lem




Re: daitch_mokotoff module

2022-12-23 Thread Dag Lem
Alvaro Herrera  writes:

> I wonder why do you have it return the multiple alternative codes as a
> space-separated string.  Maybe an array would be more appropriate.  Even
> on your documented example use, the first thing you do is split it on
> spaces.

In the example, the *input* is split on whitespace, the returned soundex
codes are not. The splitting of the input is done in order to code each
word separately. One of the stated rules of the Daitch-Mokotoff Soundex
Coding is that "When a name consists of more than one word, it is coded
as if one word", and this may not always be desired. See
https://www.avotaynu.com/soundex.htm or
https://www.jewishgen.org/InfoFiles/soundex.html for the rules.

The intended use for the Daitch-Mokotoff soundex, as for any other
soundex algorithm, is to index names (or words) on some representation
of sound, so that alike sounding names with different spellings will
match.

In PostgreSQL, the Daitch-Mokotoff Soundex and Full Text Search makes
for a powerful combination to match alike sounding names. Full Text
Search (as any other free text search engine) works with documents, and
thus the Daitch-Mokotoff Soundex implementation produces documents
(words separated by space). As stated in the documentation: "Any
alternative soundex codes are separated by space, which makes the
returned text suited for use in Full Text Search".

Best regards,

Dag Lem




Re: daitch_mokotoff module

2022-12-23 Thread Dag Lem
Tom Lane  writes:

> Alvaro Herrera  writes:
>> On 2022-Dec-22, Dag Lem wrote:
>>> This should hopefully fix the last Cfbot failures, by exclusion of
>>> daitch_mokotoff.h from headerscheck and cpluspluscheck.
>
>> Hmm, maybe it'd be better to move the typedefs to the .h file instead.
>
> Indeed, that sounds like exactly the wrong way to fix such a problem.
> The bar for excluding stuff from headerscheck needs to be very high.
>

OK, I've moved enough declarations back to the generated header file
again so as to avoid excluding it from headerscheck and cpluspluscheck.

Best regards,

Dag Lem

diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile
index 0704894f88..12baf2d884 100644
--- a/contrib/fuzzystrmatch/Makefile
+++ b/contrib/fuzzystrmatch/Makefile
@@ -3,14 +3,15 @@
 MODULE_big = fuzzystrmatch
 OBJS = \
 	$(WIN32RES) \
+	daitch_mokotoff.o \
 	dmetaphone.o \
 	fuzzystrmatch.o
 
 EXTENSION = fuzzystrmatch
-DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.0--1.1.sql
+DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.1--1.2.sql fuzzystrmatch--1.0--1.1.sql
 PGFILEDESC = "fuzzystrmatch - similarities and distance between strings"
 
-REGRESS = fuzzystrmatch
+REGRESS = fuzzystrmatch fuzzystrmatch_utf8
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
@@ -22,3 +23,14 @@ top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 include $(top_srcdir)/contrib/contrib-global.mk
 endif
+
+# Force this dependency to be known even without dependency info built:
+daitch_mokotoff.o: daitch_mokotoff.h
+
+daitch_mokotoff.h: daitch_mokotoff_header.pl
+	perl $< $@
+
+distprep: daitch_mokotoff.h
+
+maintainer-clean:
+	rm -f daitch_mokotoff.h
diff --git a/contrib/fuzzystrmatch/daitch_mokotoff.c b/contrib/fuzzystrmatch/daitch_mokotoff.c
new file mode 100644
index 00..e809d4a39e
--- /dev/null
+++ b/contrib/fuzzystrmatch/daitch_mokotoff.c
@@ -0,0 +1,582 @@
+/*
+ * Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2021 Finance Norway
+ * Author: Dag Lem 
+ *
+ * This implementation of the Daitch-Mokotoff Soundex System aims at high
+ * performance.
+ *
+ * - The processing of each phoneme is initiated by an O(1) table lookup.
+ * - For phonemes containing more than one character, a coding tree is traversed
+ *   to process the complete phoneme.
+ * - The (alternate) soundex codes are produced digit by digit in-place in
+ *   another tree structure.
+ *
+ *  References:
+ *
+ * https://www.avotaynu.com/soundex.htm
+ * https://www.jewishgen.org/InfoFiles/Soundex.html
+ * https://familypedia.fandom.com/wiki/Daitch-Mokotoff_Soundex
+ * https://stevemorse.org/census/soundex.html (dmlat.php, dmsoundex.php)
+ * https://github.com/apache/commons-codec/ (dmrules.txt, DaitchMokotoffSoundex.java)
+ * https://metacpan.org/pod/Text::Phonetic (DaitchMokotoff.pm)
+ *
+ * A few notes on other implementations:
+ *
+ * - All other known implementations have the same unofficial rules for "UE",
+ *   these are also adapted by this implementation (0, 1, NC).
+ * - The only other known implementation which is capable of generating all
+ *   correct soundex codes in all cases is the JOS Soundex Calculator at
+ *   https://www.jewishgen.org/jos/jossound.htm
+ * - "J" is considered (only) a vowel in dmlat.php
+ * - The official rules for "RS" are commented out in dmlat.php
+ * - Identical code digits for adjacent letters are not collapsed correctly in
+ *   dmsoundex.php when double digit codes are involved. E.g. "BESST" yields
+ *   744300 instead of 743000 as for "BEST".
+ * - "J" is considered (only) a consonant in DaitchMokotoffSoundex.java
+ * - "Y" is not considered a vowel in DaitchMokotoffSoundex.java
+ *
+ * Permission to use, copy, modify, and distribute this software and its
+ * documentation for any purpose, without fee, and without a written agreement
+ * is hereby granted, provided that the above copyright notice and this
+ * paragraph and the following two paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+ * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+ * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+ * AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+ * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+ * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+*/
+
+#include "postgres.h"
+
+#include "lib/stringinfo.h"
+#include "mb/pg_wchar.h"
+#include "utils/builtins.h"
+#include "utils/memutils.h"
+
+
+/*
+ * The soundex coding chart table is adapted from from
+ * https://www.jewishgen.org/InfoFiles/Soundex.html
+ * See daitch_mokotoff_header.pl for details.
+*/
+
+/* Generated 

Re: daitch_mokotoff module

2022-12-23 Thread Dag Lem
Andres Freund  writes:

> On 2022-12-22 14:27:54 +0100, Dag Lem wrote:
>> This should hopefully fix the last Cfbot failures, by exclusion of
>> daitch_mokotoff.h from headerscheck and cpluspluscheck.
>
> Btw, you can do the same tests as cfbot in your own repo by enabling CI
> in a github repo. See src/tools/ci/README
>

OK, thanks, I've set it up now.

Best regards,

Dag Lem




Re: daitch_mokotoff module

2022-12-23 Thread Tom Lane
Alvaro Herrera  writes:
> On 2022-Dec-22, Dag Lem wrote:
>> This should hopefully fix the last Cfbot failures, by exclusion of
>> daitch_mokotoff.h from headerscheck and cpluspluscheck.

> Hmm, maybe it'd be better to move the typedefs to the .h file instead.

Indeed, that sounds like exactly the wrong way to fix such a problem.
The bar for excluding stuff from headerscheck needs to be very high.

regards, tom lane




Re: daitch_mokotoff module

2022-12-23 Thread Alvaro Herrera
On 2022-Dec-23, Alvaro Herrera wrote:

> I wonder why do you have it return the multiple alternative codes as a
> space-separated string.  Maybe an array would be more appropriate.  Even
> on your documented example use, the first thing you do is split it on
> spaces.

I tried downloading a list of surnames from here
https://www.bibliotecadenombres.com/apellidos/apellidos-espanoles/
pasted that in a text file and \copy'ed it into a table.  Then I ran
this query

select string_agg(a, ' ' order by a), daitch_mokotoff(a), count(*)
from apellidos
group by daitch_mokotoff(a)
order by count(*) desc;

so I have a first entry like this

string_agg  │ Balasco Balles Belasco Belles Blas Blasco Fallas Feliz Palos 
Pelaez Plaza Valles Vallez Velasco Velez Veliz Veloz Villas
daitch_mokotoff │ 784000
count   │ 18

but then I have a bunch of other entries with the same code 784000 as
alternative codes,

string_agg  │ Velazco
daitch_mokotoff │ 784500 784000
count   │ 1

string_agg  │ Palacio
daitch_mokotoff │ 785000 784000
count   │ 1

I suppose I need to group these together somehow, and it would make more
sense to do that if the values were arrays.


If I scroll a bit further down and choose, say, 794000 (a relatively
popular one), then I have this

string_agg  │ Barraza Barrios Barros Bras Ferraz Frias Frisco Parras Peraza 
Peres Perez Porras Varas Veras
daitch_mokotoff │ 794000
count   │ 14

and looking for that code in the result I also get these three

string_agg  │ Barca Barco Parco
daitch_mokotoff │ 795000 794000
count   │ 3

string_agg  │ Borja
daitch_mokotoff │ 79 794000
count   │ 1

string_agg  │ Borjas
daitch_mokotoff │ 794000 794400
count   │ 1

and then I see that I should also search for possible matches in codes
795000, 79 and 794400, so that gives me

string_agg  │ Baria Baro Barrio Barro Berra Borra Feria Para Parra Perea 
Vera
daitch_mokotoff │ 79
count   │ 11

string_agg  │ Barriga Borge Borrego Burgo Fraga
daitch_mokotoff │ 795000
count   │ 5

string_agg  │ Borjas
daitch_mokotoff │ 794000 794400
count   │ 1

which look closely related (compare "Veras" in the first to "Vera" in
the later set.  If you ignore that pseudo-match, you're likely to miss
possible family relationships.)


I suppose if I were a genealogy researcher, I would be helped by having
each of these codes behave as a separate unit, rather than me having to
split the string into the several possible contained values.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Industry suffers from the managerial dogma that for the sake of stability
and continuity, the company should be independent of the competence of
individual employees."  (E. Dijkstra)




Re: daitch_mokotoff module

2022-12-23 Thread Alvaro Herrera
I wonder why do you have it return the multiple alternative codes as a
space-separated string.  Maybe an array would be more appropriate.  Even
on your documented example use, the first thing you do is split it on
spaces.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/




Re: daitch_mokotoff module

2022-12-23 Thread Alvaro Herrera
On 2022-Dec-22, Dag Lem wrote:

> This should hopefully fix the last Cfbot failures, by exclusion of
> daitch_mokotoff.h from headerscheck and cpluspluscheck.

Hmm, maybe it'd be better to move the typedefs to the .h file instead.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Pensar que el espectro que vemos es ilusorio no lo despoja de espanto,
sólo le suma el nuevo terror de la locura" (Perelandra, C.S. Lewis)




Re: daitch_mokotoff module

2022-12-23 Thread Andres Freund
On 2022-12-22 14:27:54 +0100, Dag Lem wrote:
> This should hopefully fix the last Cfbot failures, by exclusion of
> daitch_mokotoff.h from headerscheck and cpluspluscheck.

Btw, you can do the same tests as cfbot in your own repo by enabling CI
in a github repo. See src/tools/ci/README




Re: daitch_mokotoff module

2022-12-22 Thread Dag Lem
Dag Lem  writes:

> Hi Ian,
>
> Ian Lawrence Barwick  writes:
>

[...]

>> I see you provided some feedback on
>> https://commitfest.postgresql.org/36/3468/,
>> though the patch seems to have not been accepted (but not
>> conclusively rejected
>> either). If you still have the chance to review another patch (or
>> more) it would
>> be much appreciated, as there's quite a few piling up. Things like
>> documentation
>> or small improvements to client applications are always a good place to 
>> start.
>> Reviews can be provided at any time, there's no need to wait for the next
>> CommitFest.
>>
>
> OK, I'll try to find another patch to review.
>

I have scanned through all the patches in Commitfest 2023-01 with status
"Needs review", and it is difficult to find something which I can
meaningfully review.

The only thing I felt qualified to comment (or nit-pick?) on was
https://commitfest.postgresql.org/41/4071/

If something else should turn up which could be reviewed by someone
without intimate knowledge of PostgreSQL internals, then don't hesitate
to ask.

As for the Daitch-Mokotoff patch, the review by Andres Freund was very
helpful in order to improve the extension and to make it more idiomatic
- hopefully it is now a bit closer to being included.


Best regards

Dag Lem




Re: daitch_mokotoff module

2022-12-22 Thread Dag Lem
Dag Lem  writes:

> I noticed that the Meson builds failed in Cfbot, the updated patch adds
> a missing "include_directories" line to meson.build.
>

This should hopefully fix the last Cfbot failures, by exclusion of
daitch_mokotoff.h from headerscheck and cpluspluscheck.

Best regards

Dag Lem

diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile
index 0704894f88..12baf2d884 100644
--- a/contrib/fuzzystrmatch/Makefile
+++ b/contrib/fuzzystrmatch/Makefile
@@ -3,14 +3,15 @@
 MODULE_big = fuzzystrmatch
 OBJS = \
 	$(WIN32RES) \
+	daitch_mokotoff.o \
 	dmetaphone.o \
 	fuzzystrmatch.o
 
 EXTENSION = fuzzystrmatch
-DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.0--1.1.sql
+DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.1--1.2.sql fuzzystrmatch--1.0--1.1.sql
 PGFILEDESC = "fuzzystrmatch - similarities and distance between strings"
 
-REGRESS = fuzzystrmatch
+REGRESS = fuzzystrmatch fuzzystrmatch_utf8
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
@@ -22,3 +23,14 @@ top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 include $(top_srcdir)/contrib/contrib-global.mk
 endif
+
+# Force this dependency to be known even without dependency info built:
+daitch_mokotoff.o: daitch_mokotoff.h
+
+daitch_mokotoff.h: daitch_mokotoff_header.pl
+	perl $< $@
+
+distprep: daitch_mokotoff.h
+
+maintainer-clean:
+	rm -f daitch_mokotoff.h
diff --git a/contrib/fuzzystrmatch/daitch_mokotoff.c b/contrib/fuzzystrmatch/daitch_mokotoff.c
new file mode 100644
index 00..d4ad95c283
--- /dev/null
+++ b/contrib/fuzzystrmatch/daitch_mokotoff.c
@@ -0,0 +1,596 @@
+/*
+ * Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2021 Finance Norway
+ * Author: Dag Lem 
+ *
+ * This implementation of the Daitch-Mokotoff Soundex System aims at high
+ * performance.
+ *
+ * - The processing of each phoneme is initiated by an O(1) table lookup.
+ * - For phonemes containing more than one character, a coding tree is traversed
+ *   to process the complete phoneme.
+ * - The (alternate) soundex codes are produced digit by digit in-place in
+ *   another tree structure.
+ *
+ *  References:
+ *
+ * https://www.avotaynu.com/soundex.htm
+ * https://www.jewishgen.org/InfoFiles/Soundex.html
+ * https://familypedia.fandom.com/wiki/Daitch-Mokotoff_Soundex
+ * https://stevemorse.org/census/soundex.html (dmlat.php, dmsoundex.php)
+ * https://github.com/apache/commons-codec/ (dmrules.txt, DaitchMokotoffSoundex.java)
+ * https://metacpan.org/pod/Text::Phonetic (DaitchMokotoff.pm)
+ *
+ * A few notes on other implementations:
+ *
+ * - All other known implementations have the same unofficial rules for "UE",
+ *   these are also adapted by this implementation (0, 1, NC).
+ * - The only other known implementation which is capable of generating all
+ *   correct soundex codes in all cases is the JOS Soundex Calculator at
+ *   https://www.jewishgen.org/jos/jossound.htm
+ * - "J" is considered (only) a vowel in dmlat.php
+ * - The official rules for "RS" are commented out in dmlat.php
+ * - Identical code digits for adjacent letters are not collapsed correctly in
+ *   dmsoundex.php when double digit codes are involved. E.g. "BESST" yields
+ *   744300 instead of 743000 as for "BEST".
+ * - "J" is considered (only) a consonant in DaitchMokotoffSoundex.java
+ * - "Y" is not considered a vowel in DaitchMokotoffSoundex.java
+ *
+ * Permission to use, copy, modify, and distribute this software and its
+ * documentation for any purpose, without fee, and without a written agreement
+ * is hereby granted, provided that the above copyright notice and this
+ * paragraph and the following two paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+ * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+ * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+ * AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+ * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+ * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+*/
+
+#include "postgres.h"
+
+#include "lib/stringinfo.h"
+#include "mb/pg_wchar.h"
+#include "utils/builtins.h"
+#include "utils/memutils.h"
+
+
+/*
+ * The soundex coding chart table is adapted from from
+ * https://www.jewishgen.org/InfoFiles/Soundex.html
+ * See daitch_mokotoff_header.pl for details.
+*/
+
+#define DM_CODE_DIGITS 6
+
+/* Coding chart table: Soundex codes */
+typedef char dm_code[2 + 1];	/* One or two sequential code digits + NUL */
+typedef dm_code dm_codes[3];	/* Start of name, before a vowel, any other */
+
+/* Coding chart table: Letter in input sequence */
+struct dm_letter
+{
+	char		letter;			/* 

Re: daitch_mokotoff module

2022-12-22 Thread Dag Lem
I noticed that the Meson builds failed in Cfbot, the updated patch adds
a missing "include_directories" line to meson.build.

Best regards

Dag Lem

diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile
index 0704894f88..12baf2d884 100644
--- a/contrib/fuzzystrmatch/Makefile
+++ b/contrib/fuzzystrmatch/Makefile
@@ -3,14 +3,15 @@
 MODULE_big = fuzzystrmatch
 OBJS = \
 	$(WIN32RES) \
+	daitch_mokotoff.o \
 	dmetaphone.o \
 	fuzzystrmatch.o
 
 EXTENSION = fuzzystrmatch
-DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.0--1.1.sql
+DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.1--1.2.sql fuzzystrmatch--1.0--1.1.sql
 PGFILEDESC = "fuzzystrmatch - similarities and distance between strings"
 
-REGRESS = fuzzystrmatch
+REGRESS = fuzzystrmatch fuzzystrmatch_utf8
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
@@ -22,3 +23,14 @@ top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 include $(top_srcdir)/contrib/contrib-global.mk
 endif
+
+# Force this dependency to be known even without dependency info built:
+daitch_mokotoff.o: daitch_mokotoff.h
+
+daitch_mokotoff.h: daitch_mokotoff_header.pl
+	perl $< $@
+
+distprep: daitch_mokotoff.h
+
+maintainer-clean:
+	rm -f daitch_mokotoff.h
diff --git a/contrib/fuzzystrmatch/daitch_mokotoff.c b/contrib/fuzzystrmatch/daitch_mokotoff.c
new file mode 100644
index 00..d4ad95c283
--- /dev/null
+++ b/contrib/fuzzystrmatch/daitch_mokotoff.c
@@ -0,0 +1,596 @@
+/*
+ * Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2021 Finance Norway
+ * Author: Dag Lem 
+ *
+ * This implementation of the Daitch-Mokotoff Soundex System aims at high
+ * performance.
+ *
+ * - The processing of each phoneme is initiated by an O(1) table lookup.
+ * - For phonemes containing more than one character, a coding tree is traversed
+ *   to process the complete phoneme.
+ * - The (alternate) soundex codes are produced digit by digit in-place in
+ *   another tree structure.
+ *
+ *  References:
+ *
+ * https://www.avotaynu.com/soundex.htm
+ * https://www.jewishgen.org/InfoFiles/Soundex.html
+ * https://familypedia.fandom.com/wiki/Daitch-Mokotoff_Soundex
+ * https://stevemorse.org/census/soundex.html (dmlat.php, dmsoundex.php)
+ * https://github.com/apache/commons-codec/ (dmrules.txt, DaitchMokotoffSoundex.java)
+ * https://metacpan.org/pod/Text::Phonetic (DaitchMokotoff.pm)
+ *
+ * A few notes on other implementations:
+ *
+ * - All other known implementations have the same unofficial rules for "UE",
+ *   these are also adapted by this implementation (0, 1, NC).
+ * - The only other known implementation which is capable of generating all
+ *   correct soundex codes in all cases is the JOS Soundex Calculator at
+ *   https://www.jewishgen.org/jos/jossound.htm
+ * - "J" is considered (only) a vowel in dmlat.php
+ * - The official rules for "RS" are commented out in dmlat.php
+ * - Identical code digits for adjacent letters are not collapsed correctly in
+ *   dmsoundex.php when double digit codes are involved. E.g. "BESST" yields
+ *   744300 instead of 743000 as for "BEST".
+ * - "J" is considered (only) a consonant in DaitchMokotoffSoundex.java
+ * - "Y" is not considered a vowel in DaitchMokotoffSoundex.java
+ *
+ * Permission to use, copy, modify, and distribute this software and its
+ * documentation for any purpose, without fee, and without a written agreement
+ * is hereby granted, provided that the above copyright notice and this
+ * paragraph and the following two paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+ * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+ * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+ * AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+ * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+ * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+*/
+
+#include "postgres.h"
+
+#include "lib/stringinfo.h"
+#include "mb/pg_wchar.h"
+#include "utils/builtins.h"
+#include "utils/memutils.h"
+
+
+/*
+ * The soundex coding chart table is adapted from from
+ * https://www.jewishgen.org/InfoFiles/Soundex.html
+ * See daitch_mokotoff_header.pl for details.
+*/
+
+#define DM_CODE_DIGITS 6
+
+/* Coding chart table: Soundex codes */
+typedef char dm_code[2 + 1];	/* One or two sequential code digits + NUL */
+typedef dm_code dm_codes[3];	/* Start of name, before a vowel, any other */
+
+/* Coding chart table: Letter in input sequence */
+struct dm_letter
+{
+	char		letter;			/* Present letter in sequence */
+	const struct dm_letter *letters;	/* List of possible successive letters */
+	const		dm_codes *codes;	/* Code 

Re: daitch_mokotoff module

2022-12-21 Thread Dag Lem
Hi Andreas,

Thank you for your detailed and constructive review!

I have made a conscientuous effort to address all the issues you point
out, please see comments below.

Andres Freund  writes:

> Hi,
>
> On 2022-02-03 15:27:32 +0100, Dag Lem wrote:

[...]

> [23:43:34.796] contrib/fuzzystrmatch/meson.build:18:0: ERROR: File
> fuzzystrmatch--1.1.sql does not exist.
>
>
>> -DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.0--1.1.sql
>> +DATA = fuzzystrmatch--1.2.sql fuzzystrmatch--1.1--1.2.sql
>> fuzzystrmatch--1.0--1.1.sql
>>  PGFILEDESC = "fuzzystrmatch - similarities and distance between strings"
>
>
> The patch seems to remove fuzzystrmatch--1.1.sql - I suggest not doing
> that. In recent years our approach has been to just keep the "base version" of
> the upgrade script, with extension creation running through the upgrade
> scripts.
>

OK, I have now kept fuzzystrmatch--1.1.sql, and omitted
fuzzystrmatch--1.2.sql

Both the Makefile and meson.build are updated to handle the new files,
including the generated header.

>>  
>> +
>> +#include "daitch_mokotoff.h"
>> +
>> +#include "postgres.h"
>
> Postgres policy is that the include of "postgres.h" has to be the first
> include in every .c file.
>
>

OK, fixed.

>> +#include "utils/builtins.h"
>> +#include "mb/pg_wchar.h"
>> +
>> +#include 
>> +
>> +/* Internal C implementation */
>> +static char *_daitch_mokotoff(char *word, char *soundex, size_t n);
>> +
>> +
>> +PG_FUNCTION_INFO_V1(daitch_mokotoff);
>> +Datum
>> +daitch_mokotoff(PG_FUNCTION_ARGS)
>> +{
>> +text   *arg = PG_GETARG_TEXT_PP(0);
>> +char   *string,
>> +   *tmp_soundex;
>> +text   *soundex;
>> +
>> +/*
>> + * The maximum theoretical soundex size is several KB, however in
>> practice
>> + * anything but contrived synthetic inputs will yield a soundex size of
>> + * less than 100 bytes. We thus allocate and free a temporary work
>> buffer,
>> + * and return only the actual soundex result.
>> + */
>> + string = pg_server_to_any(text_to_cstring(arg),
>> VARSIZE_ANY_EXHDR(arg), PG_UTF8);
>> +tmp_soundex = palloc(DM_MAX_SOUNDEX_CHARS);
>
> Seems that just using StringInfo to hold the soundex output would work better
> than a static allocation?
>

OK, fixed.

>
>> +if (!_daitch_mokotoff(string, tmp_soundex, DM_MAX_SOUNDEX_CHARS))
>
> We imo shouldn't introduce new functions starting with _.
>

OK, fixed. Note that I just followed the existing pattern in
fuzzystrmatch.c there.

[...]

>> +/* Find next node corresponding to code digit, or create a new node. */
>> +static dm_node * find_or_create_node(dm_nodes nodes, int *num_nodes,
>> + dm_node * node, char code_digit)
>
> PG code style is to have a line break between a function defintion's return
> type and the function name - like you actually do above.
>

OK, fixed. Both pgindent and I must have missed that particular
function.

>> +/* Mapping from ISO8859-1 to upper-case ASCII */
>> +static const char tr_iso8859_1_to_ascii_upper[] =
>> +/*
>> +"`abcdefghijklmnopqrstuvwxyz{|}~ ¡¢£¤¥¦§¨©ª«¬
>> ®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ"
>> +*/
>> +"`ABCDEFGHIJKLMNOPQRSTUVWXYZ{|}~ !
>> ?AAECDNO*OYDSAAECDNO/OYDY";
>> +
>> +static char
>> +iso8859_1_to_ascii_upper(unsigned char c)
>> +{
>> +return c >= 0x60 ? tr_iso8859_1_to_ascii_upper[c - 0x60] : c;
>> +}
>> +
>> +
>> +/* Convert an UTF-8 character to ISO-8859-1.
>> + * Unconvertable characters are returned as '?'.
>> + * NB! Beware of the domain specific conversion of Ą, Ę, and Ţ/Ț.
>> + */
>> +static char
>> +utf8_to_iso8859_1(char *str, int *ix)
>
> It seems decidedly not great to have custom encoding conversion routines in a
> contrib module. Is there any way we can avoid this?
>

I have now replaced the custom UTF-8 decode with calls to
utf8_to_unicode and pg_utf_mblen, and simplified the subsequent
conversion to ASCII. Hopefully this makes the conversion code more
palatable.

I don't see how the conversion to ASCII could be substantially
simplified further. The conversion maps lowercase and 8 bit ISO8859-1
characters to ASCII via uppercasing, removal of accents, and discarding
of special characters. In addition to that, it maps (the non-ISO8859-1)
Ą, Ę, and Ţ/Ț from the coding chart to [, \, and ]. After this, a simple
O(1) table lookup can be used to retrieve the soundex code tree for a
letter sequence.

>
>> +/* Generate all Daitch-Mokotoff soundex codes for word, separated
>> by space. */
>> +static char *
>> +_daitch_mokotoff(char *word, char *soundex, size_t n)
>> +{
>> +int i = 0,
>> +j;
>> +int letter_no = 0;
>> +int ix_leaves = 0;
>> +int num_nodes = 0,
>> +num_leaves = 0;
>> +dm_codes   *codes,
>> +   *next_codes;
>> +dm_node*nodes;
>> +dm_leaves  

Re: daitch_mokotoff module

2022-12-07 Thread Andres Freund
Hi,

On 2022-02-03 15:27:32 +0100, Dag Lem wrote:
> Just some minor adjustments to the patch:
> 
> * Removed call to locale-dependent toupper()
> * Cleaned up input normalization

This patch currently fails in cfbot, likely because meson.build needs to be
adjusted (this didn't exist at the time you submitted this version of the
patch):

[23:43:34.796] contrib/fuzzystrmatch/meson.build:18:0: ERROR: File 
fuzzystrmatch--1.1.sql does not exist.


> -DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.0--1.1.sql
> +DATA = fuzzystrmatch--1.2.sql fuzzystrmatch--1.1--1.2.sql 
> fuzzystrmatch--1.0--1.1.sql
>  PGFILEDESC = "fuzzystrmatch - similarities and distance between strings"


The patch seems to remove fuzzystrmatch--1.1.sql - I suggest not doing
that. In recent years our approach has been to just keep the "base version" of
the upgrade script, with extension creation running through the upgrade
scripts.

>  
> +
> +#include "daitch_mokotoff.h"
> +
> +#include "postgres.h"

Postgres policy is that the include of "postgres.h" has to be the first
include in every .c file.


> +#include "utils/builtins.h"
> +#include "mb/pg_wchar.h"
> +
> +#include 
> +
> +/* Internal C implementation */
> +static char *_daitch_mokotoff(char *word, char *soundex, size_t n);
> +
> +
> +PG_FUNCTION_INFO_V1(daitch_mokotoff);
> +Datum
> +daitch_mokotoff(PG_FUNCTION_ARGS)
> +{
> + text   *arg = PG_GETARG_TEXT_PP(0);
> + char   *string,
> +*tmp_soundex;
> + text   *soundex;
> +
> + /*
> +  * The maximum theoretical soundex size is several KB, however in 
> practice
> +  * anything but contrived synthetic inputs will yield a soundex size of
> +  * less than 100 bytes. We thus allocate and free a temporary work 
> buffer,
> +  * and return only the actual soundex result.
> +  */
> + string = pg_server_to_any(text_to_cstring(arg), VARSIZE_ANY_EXHDR(arg), 
> PG_UTF8);
> + tmp_soundex = palloc(DM_MAX_SOUNDEX_CHARS);

Seems that just using StringInfo to hold the soundex output would work better
than a static allocation?


> + if (!_daitch_mokotoff(string, tmp_soundex, DM_MAX_SOUNDEX_CHARS))

We imo shouldn't introduce new functions starting with _.


> +/* Mark soundex code tree node as leaf. */
> +static void
> +set_leaf(dm_leaves leaves_next, int *num_leaves_next, dm_node * node)
> +{
> + if (!node->is_leaf)
> + {
> + node->is_leaf = 1;
> + leaves_next[(*num_leaves_next)++] = node;
> + }
> +}
> +
> +
> +/* Find next node corresponding to code digit, or create a new node. */
> +static dm_node * find_or_create_node(dm_nodes nodes, int *num_nodes,
> +  
> dm_node * node, char code_digit)

PG code style is to have a line break between a function defintion's return
type and the function name - like you actually do above.




> +/* Mapping from ISO8859-1 to upper-case ASCII */
> +static const char tr_iso8859_1_to_ascii_upper[] =
> +/*
> +"`abcdefghijklmnopqrstuvwxyz{|}~  
> ¡¢£¤¥¦§¨©ª«¬ 
> ®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ"
> +*/
> +"`ABCDEFGHIJKLMNOPQRSTUVWXYZ{|}~  !  
>
> ?AAECDNO*OYDSAAECDNO/OYDY";
> +
> +static char
> +iso8859_1_to_ascii_upper(unsigned char c)
> +{
> + return c >= 0x60 ? tr_iso8859_1_to_ascii_upper[c - 0x60] : c;
> +}
> +
> +
> +/* Convert an UTF-8 character to ISO-8859-1.
> + * Unconvertable characters are returned as '?'.
> + * NB! Beware of the domain specific conversion of Ą, Ę, and Ţ/Ț.
> + */
> +static char
> +utf8_to_iso8859_1(char *str, int *ix)

It seems decidedly not great to have custom encoding conversion routines in a
contrib module. Is there any way we can avoid this?


> +/* Generate all Daitch-Mokotoff soundex codes for word, separated by space. 
> */
> +static char *
> +_daitch_mokotoff(char *word, char *soundex, size_t n)
> +{
> + int i = 0,
> + j;
> + int letter_no = 0;
> + int ix_leaves = 0;
> + int num_nodes = 0,
> + num_leaves = 0;
> + dm_codes   *codes,
> +*next_codes;
> + dm_node*nodes;
> + dm_leaves  *leaves;
> +
> + /* First letter. */
> + if (!(codes = read_letter(word, )))
> + {
> + /* No encodable character in input. */
> + return NULL;
> + }
> +
> + /* Allocate memory for node tree. */
> + nodes = palloc(sizeof(dm_nodes));
> + leaves = palloc(2 * sizeof(dm_leaves));

So this allocates the worst case memory usage, is that right? That's quite a
bit of memory. Shouldn't nodes be allocated dynamically?

Instead of carefully freeing individual memory allocations, I think it be
better to create a 

Re: daitch_mokotoff module

2022-12-05 Thread Dag Lem
Hi Ian,

Ian Lawrence Barwick  writes:

> Hi Dag
>
> 2022年2月3日(木) 23:27 Dag Lem :
>>
>> Hi,
>>
>> Just some minor adjustments to the patch:
>>
>> * Removed call to locale-dependent toupper()
>> * Cleaned up input normalization
>
> This patch was marked as "Waiting on Author" in the CommitFest entry [1]
> but I see you provided an updated version which hasn't received any feedback,
> so I've move this to the next CommitFest [2] and set it to "Needs Review".
>
> [1] https://commitfest.postgresql.org/40/3451/
> [2] https://commitfest.postgresql.org/41/3451/
>
>> I have been asked to sign up to review a commitfest patch or patches -
>> unfortunately I've been ill with COVID-19 and it's not until now that
>> I feel well enough to have a look.
>>
>> Julien: I'll have a look at https://commitfest.postgresql.org/36/3468/
>> as you suggested (https://commitfest.postgresql.org/36/3379/ seems to
>> have been reviewed now).
>>
>> If there are other suggestions for a patch or patches to review for
>> someone new to PostgreSQL internals, I'd be grateful for that.
>
> I see you provided some feedback on 
> https://commitfest.postgresql.org/36/3468/,
> though the patch seems to have not been accepted (but not conclusively 
> rejected
> either). If you still have the chance to review another patch (or more) it 
> would
> be much appreciated, as there's quite a few piling up. Things like 
> documentation
> or small improvements to client applications are always a good place to start.
> Reviews can be provided at any time, there's no need to wait for the next
> CommitFest.
>

OK, I'll try to find another patch to review.

Regards

Dag Lem




Re: daitch_mokotoff module

2022-11-30 Thread Ian Lawrence Barwick
Hi Dag

2022年2月3日(木) 23:27 Dag Lem :
>
> Hi,
>
> Just some minor adjustments to the patch:
>
> * Removed call to locale-dependent toupper()
> * Cleaned up input normalization

This patch was marked as "Waiting on Author" in the CommitFest entry [1]
but I see you provided an updated version which hasn't received any feedback,
so I've move this to the next CommitFest [2] and set it to "Needs Review".

[1] https://commitfest.postgresql.org/40/3451/
[2] https://commitfest.postgresql.org/41/3451/

> I have been asked to sign up to review a commitfest patch or patches -
> unfortunately I've been ill with COVID-19 and it's not until now that
> I feel well enough to have a look.
>
> Julien: I'll have a look at https://commitfest.postgresql.org/36/3468/
> as you suggested (https://commitfest.postgresql.org/36/3379/ seems to
> have been reviewed now).
>
> If there are other suggestions for a patch or patches to review for
> someone new to PostgreSQL internals, I'd be grateful for that.

I see you provided some feedback on https://commitfest.postgresql.org/36/3468/,
though the patch seems to have not been accepted (but not conclusively rejected
either). If you still have the chance to review another patch (or more) it would
be much appreciated, as there's quite a few piling up. Things like documentation
or small improvements to client applications are always a good place to start.
Reviews can be provided at any time, there's no need to wait for the next
CommitFest.

Regards

Ian Barwick




Re: daitch_mokotoff module

2022-02-03 Thread Dag Lem
Hi,

Just some minor adjustments to the patch:

* Removed call to locale-dependent toupper()
* Cleaned up input normalization

I have been asked to sign up to review a commitfest patch or patches -
unfortunately I've been ill with COVID-19 and it's not until now that
I feel well enough to have a look.

Julien: I'll have a look at https://commitfest.postgresql.org/36/3468/
as you suggested (https://commitfest.postgresql.org/36/3379/ seems to
have been reviewed now).

If there are other suggestions for a patch or patches to review for
someone new to PostgreSQL internals, I'd be grateful for that.


Best regards

Dag Lem

diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile
index 0704894f88..1d5bd84be8 100644
--- a/contrib/fuzzystrmatch/Makefile
+++ b/contrib/fuzzystrmatch/Makefile
@@ -3,14 +3,15 @@
 MODULE_big = fuzzystrmatch
 OBJS = \
 	$(WIN32RES) \
+	daitch_mokotoff.o \
 	dmetaphone.o \
 	fuzzystrmatch.o
 
 EXTENSION = fuzzystrmatch
-DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.0--1.1.sql
+DATA = fuzzystrmatch--1.2.sql fuzzystrmatch--1.1--1.2.sql fuzzystrmatch--1.0--1.1.sql
 PGFILEDESC = "fuzzystrmatch - similarities and distance between strings"
 
-REGRESS = fuzzystrmatch
+REGRESS = fuzzystrmatch fuzzystrmatch_utf8
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
@@ -22,3 +23,8 @@ top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 include $(top_srcdir)/contrib/contrib-global.mk
 endif
+
+daitch_mokotoff.o: daitch_mokotoff.h
+
+daitch_mokotoff.h: daitch_mokotoff_header.pl
+	perl $< > $@
diff --git a/contrib/fuzzystrmatch/daitch_mokotoff.c b/contrib/fuzzystrmatch/daitch_mokotoff.c
new file mode 100644
index 00..ba87061845
--- /dev/null
+++ b/contrib/fuzzystrmatch/daitch_mokotoff.c
@@ -0,0 +1,587 @@
+/*
+ * Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2021 Finance Norway
+ * Author: Dag Lem 
+ *
+ * This implementation of the Daitch-Mokotoff Soundex System aims at high
+ * performance.
+ *
+ * - The processing of each phoneme is initiated by an O(1) table lookup.
+ * - For phonemes containing more than one character, a coding tree is traversed
+ *   to process the complete phoneme.
+ * - The (alternate) soundex codes are produced digit by digit in-place in
+ *   another tree structure.
+ *
+ *  References:
+ *
+ * https://www.avotaynu.com/soundex.htm
+ * https://www.jewishgen.org/InfoFiles/Soundex.html
+ * https://familypedia.fandom.com/wiki/Daitch-Mokotoff_Soundex
+ * https://stevemorse.org/census/soundex.html (dmlat.php, dmsoundex.php)
+ * https://github.com/apache/commons-codec/ (dmrules.txt, DaitchMokotoffSoundex.java)
+ * https://metacpan.org/pod/Text::Phonetic (DaitchMokotoff.pm)
+ *
+ * A few notes on other implementations:
+ *
+ * - All other known implementations have the same unofficial rules for "UE",
+ *   these are also adapted by this implementation (0, 1, NC).
+ * - The only other known implementation which is capable of generating all
+ *   correct soundex codes in all cases is the JOS Soundex Calculator at
+ *   https://www.jewishgen.org/jos/jossound.htm
+ * - "J" is considered (only) a vowel in dmlat.php
+ * - The official rules for "RS" are commented out in dmlat.php
+ * - Identical code digits for adjacent letters are not collapsed correctly in
+ *   dmsoundex.php when double digit codes are involved. E.g. "BESST" yields
+ *   744300 instead of 743000 as for "BEST".
+ * - "J" is considered (only) a consonant in DaitchMokotoffSoundex.java
+ * - "Y" is not considered a vowel in DaitchMokotoffSoundex.java
+ *
+ * Permission to use, copy, modify, and distribute this software and its
+ * documentation for any purpose, without fee, and without a written agreement
+ * is hereby granted, provided that the above copyright notice and this
+ * paragraph and the following two paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+ * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+ * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+ * AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+ * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+ * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+*/
+
+#include "daitch_mokotoff.h"
+
+#include "postgres.h"
+#include "utils/builtins.h"
+#include "mb/pg_wchar.h"
+
+#include 
+
+/* Internal C implementation */
+static char *_daitch_mokotoff(char *word, char *soundex, size_t n);
+
+
+PG_FUNCTION_INFO_V1(daitch_mokotoff);
+Datum
+daitch_mokotoff(PG_FUNCTION_ARGS)
+{
+	text	   *arg = PG_GETARG_TEXT_PP(0);
+	char	   *string,
+			   *tmp_soundex;
+	text	   *soundex;
+
+	/*
+	 * The maximum 

Re: daitch_mokotoff module

2022-01-05 Thread Dag Lem
Dag Lem  writes:

> Thomas Munro  writes:
>
>> On Wed, Jan 5, 2022 at 2:49 AM Dag Lem  wrote:
>>> However I guess this won't make any difference wrt. actually running the
>>> tests, as long as there seems to be an encoding problem in the cfbot
>>
>> Fixed -- I told it to pull down patches as binary, not text.  Now it
>> makes commits that look healthier, and so far all the Unix systems
>> have survived CI:
>>
>> https://github.com/postgresql-cfbot/postgresql/commit/79700efc61d15c2414b8450a786951fa9308c07f
>> http://cfbot.cputube.org/dag-lem.html
>>
>
> Great!
>
> Dag
>
>

After this I did the mistake of including a patch for citext in this
thread, which is now picked up by cfbot instead of the Daitch-Mokotoff
patch.

Attaching the original patch again in order to hopefully fix my mistake.

Best regards

Dag Lem

diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile
index 0704894f88..1d5bd84be8 100644
--- a/contrib/fuzzystrmatch/Makefile
+++ b/contrib/fuzzystrmatch/Makefile
@@ -3,14 +3,15 @@
 MODULE_big = fuzzystrmatch
 OBJS = \
 	$(WIN32RES) \
+	daitch_mokotoff.o \
 	dmetaphone.o \
 	fuzzystrmatch.o
 
 EXTENSION = fuzzystrmatch
-DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.0--1.1.sql
+DATA = fuzzystrmatch--1.2.sql fuzzystrmatch--1.1--1.2.sql fuzzystrmatch--1.0--1.1.sql
 PGFILEDESC = "fuzzystrmatch - similarities and distance between strings"
 
-REGRESS = fuzzystrmatch
+REGRESS = fuzzystrmatch fuzzystrmatch_utf8
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
@@ -22,3 +23,8 @@ top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 include $(top_srcdir)/contrib/contrib-global.mk
 endif
+
+daitch_mokotoff.o: daitch_mokotoff.h
+
+daitch_mokotoff.h: daitch_mokotoff_header.pl
+	perl $< > $@
diff --git a/contrib/fuzzystrmatch/daitch_mokotoff.c b/contrib/fuzzystrmatch/daitch_mokotoff.c
new file mode 100644
index 00..1b7263c349
--- /dev/null
+++ b/contrib/fuzzystrmatch/daitch_mokotoff.c
@@ -0,0 +1,593 @@
+/*
+ * Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2021 Finance Norway
+ * Author: Dag Lem 
+ *
+ * This implementation of the Daitch-Mokotoff Soundex System aims at high
+ * performance.
+ *
+ * - The processing of each phoneme is initiated by an O(1) table lookup.
+ * - For phonemes containing more than one character, a coding tree is traversed
+ *   to process the complete phoneme.
+ * - The (alternate) soundex codes are produced digit by digit in-place in
+ *   another tree structure.
+ *
+ *  References:
+ *
+ * https://www.avotaynu.com/soundex.htm
+ * https://www.jewishgen.org/InfoFiles/Soundex.html
+ * https://familypedia.fandom.com/wiki/Daitch-Mokotoff_Soundex
+ * https://stevemorse.org/census/soundex.html (dmlat.php, dmsoundex.php)
+ * https://github.com/apache/commons-codec/ (dmrules.txt, DaitchMokotoffSoundex.java)
+ * https://metacpan.org/pod/Text::Phonetic (DaitchMokotoff.pm)
+ *
+ * A few notes on other implementations:
+ *
+ * - All other known implementations have the same unofficial rules for "UE",
+ *   these are also adapted by this implementation (0, 1, NC).
+ * - The only other known implementation which is capable of generating all
+ *   correct soundex codes in all cases is the JOS Soundex Calculator at
+ *   https://www.jewishgen.org/jos/jossound.htm
+ * - "J" is considered (only) a vowel in dmlat.php
+ * - The official rules for "RS" are commented out in dmlat.php
+ * - Identical code digits for adjacent letters are not collapsed correctly in
+ *   dmsoundex.php when double digit codes are involved. E.g. "BESST" yields
+ *   744300 instead of 743000 as for "BEST".
+ * - "J" is considered (only) a consonant in DaitchMokotoffSoundex.java
+ * - "Y" is not considered a vowel in DaitchMokotoffSoundex.java
+ *
+ * Permission to use, copy, modify, and distribute this software and its
+ * documentation for any purpose, without fee, and without a written agreement
+ * is hereby granted, provided that the above copyright notice and this
+ * paragraph and the following two paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+ * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+ * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+ * AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+ * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+ * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
+*/
+
+#include "daitch_mokotoff.h"
+
+#include "postgres.h"
+#include "utils/builtins.h"
+#include "mb/pg_wchar.h"
+
+#include 
+#include 
+
+/* Internal C implementation */
+static char *_daitch_mokotoff(char *word, char *soundex, size_t n);
+
+

Re: [PATCH] Run UTF8-dependent tests for citext [Re: daitch_mokotoff module]

2022-01-05 Thread Tom Lane
Dag Lem  writes:
> Please find attached a patch to run the previously commented-out
> UTF8-dependent tests for citext, according to best practice. For now I
> don't dare to touch the unaccent module, which seems to be UTF8-only
> anyway.

I tried this on a bunch of different locale settings and concluded that
we need to restrict the locale to avoid failures: it falls over with
locale C.  With that, it passes on all UTF8 LANG settings on RHEL8
and FreeBSD 12, and all except am_ET.UTF-8 on current macOS.  I'm not
sure what the deal is with am_ET, but macOS has a long and sad history
of wonky UTF8 locales, so I was actually expecting worse.  If the
buildfarm shows more problems, we can restrict it further --- I won't
be too upset if we end up restricting to just Linux systems, like
collate.linux.utf8.  Anyway, pushed to see what happens.

regards, tom lane




[PATCH] Run UTF8-dependent tests for citext [Re: daitch_mokotoff module]

2022-01-05 Thread Dag Lem
Tom Lane  writes:

> Dag Lem  writes:
>
>> Running "ack -l '[\x80-\xff]'" in the contrib/ directory reveals that
>> two other modules are using UTF8 characters in tests - citext and
>> unaccent.
>
> Yeah, neither of those have been upgraded to said best practice.
> (If you feel like doing the legwork to improve that situation,
> that'd be great.)
>

Please find attached a patch to run the previously commented-out
UTF8-dependent tests for citext, according to best practice. For now I
don't dare to touch the unaccent module, which seems to be UTF8-only
anyway.


Best regards

Dag Lem

diff --git a/contrib/citext/Makefile b/contrib/citext/Makefile
index a7de52928d..789932fe36 100644
--- a/contrib/citext/Makefile
+++ b/contrib/citext/Makefile
@@ -11,7 +11,7 @@ DATA = citext--1.4.sql \
 	citext--1.0--1.1.sql
 PGFILEDESC = "citext - case-insensitive character string data type"
 
-REGRESS = citext
+REGRESS = citext citext_utf8
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
diff --git a/contrib/citext/expected/citext.out b/contrib/citext/expected/citext.out
index 3bac0534fb..48b4de8993 100644
--- a/contrib/citext/expected/citext.out
+++ b/contrib/citext/expected/citext.out
@@ -48,29 +48,6 @@ SELECT 'a'::citext <> 'ab'::citext AS t;
  t
 (1 row)
 
--- Multibyte sanity tests. Uncomment to run.
--- SELECT 'À'::citext =  'À'::citext AS t;
--- SELECT 'À'::citext =  'à'::citext AS t;
--- SELECT 'À'::text   =  'à'::text   AS f; -- text wins.
--- SELECT 'À'::citext <> 'B'::citext AS t;
--- Test combining characters making up canonically equivalent strings.
--- SELECT 'Ä'::text   <> 'Ä'::text   AS t;
--- SELECT 'Ä'::citext <> 'Ä'::citext AS t;
--- Test the Turkish dotted I. The lowercase is a single byte while the
--- uppercase is multibyte. This is why the comparison code can't be optimized
--- to compare string lengths.
--- SELECT 'i'::citext = 'İ'::citext AS t;
--- Regression.
--- SELECT 'láska'::citext <> 'laská'::citext AS t;
--- SELECT 'Ask Bjørn Hansen'::citext = 'Ask Bjørn Hansen'::citext AS t;
--- SELECT 'Ask Bjørn Hansen'::citext = 'ASK BJØRN HANSEN'::citext AS t;
--- SELECT 'Ask Bjørn Hansen'::citext <> 'Ask Bjorn Hansen'::citext AS t;
--- SELECT 'Ask Bjørn Hansen'::citext <> 'ASK BJORN HANSEN'::citext AS t;
--- SELECT citext_cmp('Ask Bjørn Hansen'::citext, 'Ask Bjørn Hansen'::citext) AS zero;
--- SELECT citext_cmp('Ask Bjørn Hansen'::citext, 'ask bjørn hansen'::citext) AS zero;
--- SELECT citext_cmp('Ask Bjørn Hansen'::citext, 'ASK BJØRN HANSEN'::citext) AS zero;
--- SELECT citext_cmp('Ask Bjørn Hansen'::citext, 'Ask Bjorn Hansen'::citext) AS positive;
--- SELECT citext_cmp('Ask Bjorn Hansen'::citext, 'Ask Bjørn Hansen'::citext) AS negative;
 -- Test > and >=
 SELECT 'B'::citext > 'a'::citext AS t;
  t 
diff --git a/contrib/citext/expected/citext_1.out b/contrib/citext/expected/citext_1.out
index 57fc863f7a..8ab4d4224e 100644
--- a/contrib/citext/expected/citext_1.out
+++ b/contrib/citext/expected/citext_1.out
@@ -48,29 +48,6 @@ SELECT 'a'::citext <> 'ab'::citext AS t;
  t
 (1 row)
 
--- Multibyte sanity tests. Uncomment to run.
--- SELECT 'À'::citext =  'À'::citext AS t;
--- SELECT 'À'::citext =  'à'::citext AS t;
--- SELECT 'À'::text   =  'à'::text   AS f; -- text wins.
--- SELECT 'À'::citext <> 'B'::citext AS t;
--- Test combining characters making up canonically equivalent strings.
--- SELECT 'Ä'::text   <> 'Ä'::text   AS t;
--- SELECT 'Ä'::citext <> 'Ä'::citext AS t;
--- Test the Turkish dotted I. The lowercase is a single byte while the
--- uppercase is multibyte. This is why the comparison code can't be optimized
--- to compare string lengths.
--- SELECT 'i'::citext = 'İ'::citext AS t;
--- Regression.
--- SELECT 'láska'::citext <> 'laská'::citext AS t;
--- SELECT 'Ask Bjørn Hansen'::citext = 'Ask Bjørn Hansen'::citext AS t;
--- SELECT 'Ask Bjørn Hansen'::citext = 'ASK BJØRN HANSEN'::citext AS t;
--- SELECT 'Ask Bjørn Hansen'::citext <> 'Ask Bjorn Hansen'::citext AS t;
--- SELECT 'Ask Bjørn Hansen'::citext <> 'ASK BJORN HANSEN'::citext AS t;
--- SELECT citext_cmp('Ask Bjørn Hansen'::citext, 'Ask Bjørn Hansen'::citext) AS zero;
--- SELECT citext_cmp('Ask Bjørn Hansen'::citext, 'ask bjørn hansen'::citext) AS zero;
--- SELECT citext_cmp('Ask Bjørn Hansen'::citext, 'ASK BJØRN HANSEN'::citext) AS zero;
--- SELECT citext_cmp('Ask Bjørn Hansen'::citext, 'Ask Bjorn Hansen'::citext) AS positive;
--- SELECT citext_cmp('Ask Bjorn Hansen'::citext, 'Ask Bjørn Hansen'::citext) AS negative;
 -- Test > and >=
 SELECT 'B'::citext > 'a'::citext AS t;
  t 
diff --git a/contrib/citext/expected/citext_utf8.out b/contrib/citext/expected/citext_utf8.out
new file mode 100644
index 00..1f4fa79aff
--- /dev/null
+++ b/contrib/citext/expected/citext_utf8.out
@@ -0,0 +1,119 @@
+/*
+ * This test must be run in a database with UTF-8 encoding,
+ * because other encodings don't support all the characters used.
+ */
+SELECT getdatabaseencoding() <> 'UTF8'
+   AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+set 

Re: daitch_mokotoff module

2022-01-04 Thread Dag Lem
Thomas Munro  writes:

> On Wed, Jan 5, 2022 at 2:49 AM Dag Lem  wrote:
>> However I guess this won't make any difference wrt. actually running the
>> tests, as long as there seems to be an encoding problem in the cfbot
>
> Fixed -- I told it to pull down patches as binary, not text.  Now it
> makes commits that look healthier, and so far all the Unix systems
> have survived CI:
>
> https://github.com/postgresql-cfbot/postgresql/commit/79700efc61d15c2414b8450a786951fa9308c07f
> http://cfbot.cputube.org/dag-lem.html
>

Great!

Dag




Re: daitch_mokotoff module

2022-01-04 Thread Thomas Munro
On Wed, Jan 5, 2022 at 2:49 AM Dag Lem  wrote:
> However I guess this won't make any difference wrt. actually running the
> tests, as long as there seems to be an encoding problem in the cfbot

Fixed -- I told it to pull down patches as binary, not text.  Now it
makes commits that look healthier, and so far all the Unix systems
have survived CI:

https://github.com/postgresql-cfbot/postgresql/commit/79700efc61d15c2414b8450a786951fa9308c07f
http://cfbot.cputube.org/dag-lem.html




Re: daitch_mokotoff module

2022-01-04 Thread Dag Lem
Andres Freund  writes:

> Hi,
>
> On 2022-01-02 21:41:53 -0500, Tom Lane wrote:
>> ... so, that test case is guaranteed to fail in non-UTF8 encodings,
>> I suppose?  I wonder what the LANG environment is in that cfbot
>> instance.
>
> LANG="en_US.UTF-8"
>
> But it looks to me like the problem is in the commit cfbot creates, rather
> than the test run itself:
> https://github.com/postgresql-cfbot/postgresql/commit/d5b4ec87cfd65dc08d26e1b789bd254405c90a66#diff-388d4bb360a3b24c425e29a85899315dc02f9c1dd9b9bc9aaa828876bdfea50aR56
>
> Greetings,
>
> Andres Freund
>
>

I have now separated out the UTF8-dependent tests, hopefully according
to the current best practice (based on src/test/modules/test_regex/ and
https://www.postgresql.org/docs/14/regress-variant.html).

However I guess this won't make any difference wrt. actually running the
tests, as long as there seems to be an encoding problem in the cfbot
pipeline.

Is there anything else I can do? Could perhaps fuzzystrmatch_utf8 simply
be commented out from the Makefile for the time being?

Best regards

Dag Lem

diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile
index 0704894f88..1d5bd84be8 100644
--- a/contrib/fuzzystrmatch/Makefile
+++ b/contrib/fuzzystrmatch/Makefile
@@ -3,14 +3,15 @@
 MODULE_big = fuzzystrmatch
 OBJS = \
 	$(WIN32RES) \
+	daitch_mokotoff.o \
 	dmetaphone.o \
 	fuzzystrmatch.o
 
 EXTENSION = fuzzystrmatch
-DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.0--1.1.sql
+DATA = fuzzystrmatch--1.2.sql fuzzystrmatch--1.1--1.2.sql fuzzystrmatch--1.0--1.1.sql
 PGFILEDESC = "fuzzystrmatch - similarities and distance between strings"
 
-REGRESS = fuzzystrmatch
+REGRESS = fuzzystrmatch fuzzystrmatch_utf8
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
@@ -22,3 +23,8 @@ top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 include $(top_srcdir)/contrib/contrib-global.mk
 endif
+
+daitch_mokotoff.o: daitch_mokotoff.h
+
+daitch_mokotoff.h: daitch_mokotoff_header.pl
+	perl $< > $@
diff --git a/contrib/fuzzystrmatch/daitch_mokotoff.c b/contrib/fuzzystrmatch/daitch_mokotoff.c
new file mode 100644
index 00..1b7263c349
--- /dev/null
+++ b/contrib/fuzzystrmatch/daitch_mokotoff.c
@@ -0,0 +1,593 @@
+/*
+ * Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2021 Finance Norway
+ * Author: Dag Lem 
+ *
+ * This implementation of the Daitch-Mokotoff Soundex System aims at high
+ * performance.
+ *
+ * - The processing of each phoneme is initiated by an O(1) table lookup.
+ * - For phonemes containing more than one character, a coding tree is traversed
+ *   to process the complete phoneme.
+ * - The (alternate) soundex codes are produced digit by digit in-place in
+ *   another tree structure.
+ *
+ *  References:
+ *
+ * https://www.avotaynu.com/soundex.htm
+ * https://www.jewishgen.org/InfoFiles/Soundex.html
+ * https://familypedia.fandom.com/wiki/Daitch-Mokotoff_Soundex
+ * https://stevemorse.org/census/soundex.html (dmlat.php, dmsoundex.php)
+ * https://github.com/apache/commons-codec/ (dmrules.txt, DaitchMokotoffSoundex.java)
+ * https://metacpan.org/pod/Text::Phonetic (DaitchMokotoff.pm)
+ *
+ * A few notes on other implementations:
+ *
+ * - All other known implementations have the same unofficial rules for "UE",
+ *   these are also adapted by this implementation (0, 1, NC).
+ * - The only other known implementation which is capable of generating all
+ *   correct soundex codes in all cases is the JOS Soundex Calculator at
+ *   https://www.jewishgen.org/jos/jossound.htm
+ * - "J" is considered (only) a vowel in dmlat.php
+ * - The official rules for "RS" are commented out in dmlat.php
+ * - Identical code digits for adjacent letters are not collapsed correctly in
+ *   dmsoundex.php when double digit codes are involved. E.g. "BESST" yields
+ *   744300 instead of 743000 as for "BEST".
+ * - "J" is considered (only) a consonant in DaitchMokotoffSoundex.java
+ * - "Y" is not considered a vowel in DaitchMokotoffSoundex.java
+ *
+ * Permission to use, copy, modify, and distribute this software and its
+ * documentation for any purpose, without fee, and without a written agreement
+ * is hereby granted, provided that the above copyright notice and this
+ * paragraph and the following two paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+ * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ *
+ * THE AUTHOR AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES,
+ * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
+ * AND FITNESS FOR A PARTICULAR PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS
+ * ON AN "AS IS" BASIS, AND THE AUTHOR AND DISTRIBUTORS HAS NO OBLIGATIONS TO
+ * PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.

Re: daitch_mokotoff module

2022-01-03 Thread Andres Freund
Hi,

On 2022-01-02 21:41:53 -0500, Tom Lane wrote:
> ... so, that test case is guaranteed to fail in non-UTF8 encodings,
> I suppose?  I wonder what the LANG environment is in that cfbot
> instance.

LANG="en_US.UTF-8"

But it looks to me like the problem is in the commit cfbot creates, rather
than the test run itself:
https://github.com/postgresql-cfbot/postgresql/commit/d5b4ec87cfd65dc08d26e1b789bd254405c90a66#diff-388d4bb360a3b24c425e29a85899315dc02f9c1dd9b9bc9aaa828876bdfea50aR56

Greetings,

Andres Freund




Re: daitch_mokotoff module

2022-01-03 Thread Tom Lane
Dag Lem  writes:
> Tom Lane  writes:
>> (We do have methods for dealing with non-ASCII test cases, but
>> I can't see that this patch is using any of them.)

> I naively assumed that tests would be run in an UTF8 environment.

Nope, not necessarily.

Our current best practice for this is to separate out encoding-dependent
test cases into their own test script, and guard the script with an
initial test on database encoding.  You can see an example in
src/test/modules/test_regex/sql/test_regex_utf8.sql
and the two associated expected-files.  It's a good idea to also cover
as much as you can with pure-ASCII test cases that will run regardless
of the prevailing encoding.

> Running "ack -l '[\x80-\xff]'" in the contrib/ directory reveals that
> two other modules are using UTF8 characters in tests - citext and
> unaccent.

Yeah, neither of those have been upgraded to said best practice.
(If you feel like doing the legwork to improve that situation,
that'd be great.)

> Looking into the unaccent module, I don't quite understand how it will
> work with various encodings, since it doesn't seem to decode its input -
> will it fail if run under anything but ASCII or UTF8?

Its Makefile seems to be forcing the test database to use UTF8.
I think this is a less-than-best-practice choice, because then
we have zero test coverage for other encodings; but it does
prevent test failures.

regards, tom lane




Re: daitch_mokotoff module

2022-01-03 Thread Dag Lem
Tom Lane  writes:

> Thomas Munro  writes:
>> Erm, it looks like something weird is happening somewhere in cfbot's
>> pipeline, because Dag's patch says:
>
>> +SELECT daitch_mokotoff('Straßburg');
>> + daitch_mokotoff
>> +-
>> + 294795
>> +(1 row)
>
> ... so, that test case is guaranteed to fail in non-UTF8 encodings,
> I suppose?  I wonder what the LANG environment is in that cfbot
> instance.
>
> (We do have methods for dealing with non-ASCII test cases, but
> I can't see that this patch is using any of them.)
>
>   regards, tom lane
>

I naively assumed that tests would be run in an UTF8 environment.

Running "ack -l '[\x80-\xff]'" in the contrib/ directory reveals that
two other modules are using UTF8 characters in tests - citext and
unaccent.

The citext tests seem to be commented out - "Multibyte sanity
tests. Uncomment to run."

Looking into the unaccent module, I don't quite understand how it will
work with various encodings, since it doesn't seem to decode its input -
will it fail if run under anything but ASCII or UTF8?

In any case, I see that unaccent.sql starts as follows:


CREATE EXTENSION unaccent;

-- must have a UTF8 database
SELECT getdatabaseencoding();

SET client_encoding TO 'UTF8';


Would doing the same thing in fuzzystrmatch.sql fix the problem with
failing tests? Should I prepare a new patch?


Best regards

Dag Lem




Re: daitch_mokotoff module

2022-01-02 Thread Tom Lane
Thomas Munro  writes:
> Erm, it looks like something weird is happening somewhere in cfbot's
> pipeline, because Dag's patch says:

> +SELECT daitch_mokotoff('Straßburg');
> + daitch_mokotoff
> +-
> + 294795
> +(1 row)

... so, that test case is guaranteed to fail in non-UTF8 encodings,
I suppose?  I wonder what the LANG environment is in that cfbot
instance.

(We do have methods for dealing with non-ASCII test cases, but
I can't see that this patch is using any of them.)

regards, tom lane




Re: daitch_mokotoff module

2022-01-02 Thread Thomas Munro
On Mon, Jan 3, 2022 at 10:32 AM Andres Freund  wrote:
> On 2021-12-21 22:41:18 +0100, Dag Lem wrote:
> > This is my very first code contribution to PostgreSQL, and I would be
> > grateful for any advice on how to proceed in order to get the patch
> > accepted.
>
> Currently the tests don't seem to pass on any platform:
> https://cirrus-ci.com/task/5941863248035840?logs=test_world#L572
> https://api.cirrus-ci.com/v1/artifact/task/5941863248035840/regress_diffs/contrib/fuzzystrmatch/regression.diffs

Erm, it looks like something weird is happening somewhere in cfbot's
pipeline, because Dag's patch says:

+SELECT daitch_mokotoff('Straßburg');
+ daitch_mokotoff
+-
+ 294795
+(1 row)

... but it's failing like:

 SELECT daitch_mokotoff('Straßburg');
  daitch_mokotoff
 -
- 294795
+ 297950
 (1 row)

It's possible that I broke cfbot when upgrading to Python 3 a few
months back (ie encoding snafu when using the "requests" module to
pull patches down from the archives).  I'll try to fix this soon.




Re: daitch_mokotoff module

2022-01-02 Thread Andres Freund
On 2021-12-21 22:41:18 +0100, Dag Lem wrote:
> This is my very first code contribution to PostgreSQL, and I would be
> grateful for any advice on how to proceed in order to get the patch
> accepted.

Currently the tests don't seem to pass on any platform:
https://cirrus-ci.com/task/5941863248035840?logs=test_world#L572
https://api.cirrus-ci.com/v1/artifact/task/5941863248035840/regress_diffs/contrib/fuzzystrmatch/regression.diffs

Greetings,

Andres Freund




Re: daitch_mokotoff module

2021-12-21 Thread Dag Lem
Hello again,

It turns out that there actually exists an(other) implementation of
the Daitch-Mokotoff Soundex System which gets it right; the JOS
Soundex Calculator at https://www.jewishgen.org/jos/jossound.htm
Other implementations I have been able to find, like the one in Apache
Commons Coded used in e.g. Elasticsearch, are far from correct.

The source code for the JOS Soundex Calculator is not available, as
far as I can tell, however I have run the complete list of 98412 last
names at
https://raw.githubusercontent.com/philipperemy/name-dataset/master/names_dataset/v1/last_names.all.txt
through the calculator, in order to have a good basis for comparison.

This revealed a few shortcomings in my implementation. In particular I
had to go back to the drawing board in order to handle the dual nature
of "J" correctly. "J" can be either a vowel or a consonant in
Daitch-Mokotoff soundex, and this complicates encoding of the
*previous* character.

I have also done a more thorough review and refactoring of the code,
which should hopefully make things quite a bit more understandable to
others.

The changes are summarized as follows:

* Returns NULL for input without any encodable characters.
* Uses the same "unoffical" rules for "UE" as other implementations.
* Correctly considers "J" as either a vowel or a consonant.
* Corrected encoding for e.g. "HANNMANN".
* Code refactoring and comments for readability.
* Better examples in the documentation.

The implementation is now in correspondence with the JOS Soundex
Calculator for the 98412 last names mentioned above, with only the
following exceptions:

JOS: cedeño 43 53
PG:  cedeño 436000 536000
JOS: sadab(khura)  437000
PG:  sadab(khura)  437590

I hope this addition to the fuzzystrmatch extension module will prove
to be useful to others as well!

This is my very first code contribution to PostgreSQL, and I would be
grateful for any advice on how to proceed in order to get the patch
accepted.

Best regards

Dag Lem

diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile
index 0704894f88..826e529e3e 100644
--- a/contrib/fuzzystrmatch/Makefile
+++ b/contrib/fuzzystrmatch/Makefile
@@ -3,11 +3,12 @@
 MODULE_big = fuzzystrmatch
 OBJS = \
 	$(WIN32RES) \
+	daitch_mokotoff.o \
 	dmetaphone.o \
 	fuzzystrmatch.o
 
 EXTENSION = fuzzystrmatch
-DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.0--1.1.sql
+DATA = fuzzystrmatch--1.2.sql fuzzystrmatch--1.1--1.2.sql fuzzystrmatch--1.0--1.1.sql
 PGFILEDESC = "fuzzystrmatch - similarities and distance between strings"
 
 REGRESS = fuzzystrmatch
@@ -22,3 +23,8 @@ top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 include $(top_srcdir)/contrib/contrib-global.mk
 endif
+
+daitch_mokotoff.o: daitch_mokotoff.h
+
+daitch_mokotoff.h: daitch_mokotoff_header.pl
+	perl $< > $@
diff --git a/contrib/fuzzystrmatch/daitch_mokotoff.c b/contrib/fuzzystrmatch/daitch_mokotoff.c
new file mode 100644
index 00..1b7263c349
--- /dev/null
+++ b/contrib/fuzzystrmatch/daitch_mokotoff.c
@@ -0,0 +1,593 @@
+/*
+ * Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2021 Finance Norway
+ * Author: Dag Lem 
+ *
+ * This implementation of the Daitch-Mokotoff Soundex System aims at high
+ * performance.
+ *
+ * - The processing of each phoneme is initiated by an O(1) table lookup.
+ * - For phonemes containing more than one character, a coding tree is traversed
+ *   to process the complete phoneme.
+ * - The (alternate) soundex codes are produced digit by digit in-place in
+ *   another tree structure.
+ *
+ *  References:
+ *
+ * https://www.avotaynu.com/soundex.htm
+ * https://www.jewishgen.org/InfoFiles/Soundex.html
+ * https://familypedia.fandom.com/wiki/Daitch-Mokotoff_Soundex
+ * https://stevemorse.org/census/soundex.html (dmlat.php, dmsoundex.php)
+ * https://github.com/apache/commons-codec/ (dmrules.txt, DaitchMokotoffSoundex.java)
+ * https://metacpan.org/pod/Text::Phonetic (DaitchMokotoff.pm)
+ *
+ * A few notes on other implementations:
+ *
+ * - All other known implementations have the same unofficial rules for "UE",
+ *   these are also adapted by this implementation (0, 1, NC).
+ * - The only other known implementation which is capable of generating all
+ *   correct soundex codes in all cases is the JOS Soundex Calculator at
+ *   https://www.jewishgen.org/jos/jossound.htm
+ * - "J" is considered (only) a vowel in dmlat.php
+ * - The official rules for "RS" are commented out in dmlat.php
+ * - Identical code digits for adjacent letters are not collapsed correctly in
+ *   dmsoundex.php when double digit codes are involved. E.g. "BESST" yields
+ *   744300 instead of 743000 as for "BEST".
+ * - "J" is considered (only) a consonant in DaitchMokotoffSoundex.java
+ * - "Y" is not considered a vowel in DaitchMokotoffSoundex.java
+ *
+ * Permission to use, copy, modify, and distribute this software and its
+ * documentation for any purpose, without fee, and without a written agreement
+ * is hereby granted, 

Re: daitch_mokotoff module

2021-12-14 Thread Dag Lem
Tomas Vondra  writes:

[...]

>
> Thanks, looks interesting. A couple generic comments, based on a quick
> code review.

Thank you for the constructive review!

>
> 1) Can the extension be marked as trusted, just like fuzzystrmatch?

I have now moved the daitch_mokotoff function into the fuzzystrmatch
module, as suggested by Andrew Dunstan.

>
> 2) The docs really need an explanation of what the extension is for,
> not just a link to fuzzystrmatch. Also, a couple examples would be
> helpful, I guess - similarly to fuzzystrmatch. The last line in the
> docs is annoyingly long.

Please see the updated documentation for the fuzzystrmatch module.

>
> 3) What's daitch_mokotov_header.pl for? I mean, it generates the
> header, but when do we need to run it?

It only has to be run if the soundex rules are changed. I have now
made the dependencies explicit in the fuzzystrmatch Makefile.

>
> 4) It seems to require perl-open, which is a module we did not need
> until now. Not sure how well supported it is, but maybe we can use a
> more standard module?

I believe Perl I/O layers have been part of Perl core for two decades
now :-)

>
> 5) Do we need to keep DM_MAIN? It seems to be meant for some kind of
> testing, but our regression tests certainly don't need it (or the
> palloc mockup). I suggest to get rid of it.

Done. BTW this was modeled after dmetaphone.c

>
> 6) I really don't understand some of the comments in
> daitch_mokotov.sql, like for example:
>
> -- testEncodeBasic
> -- Tests covered above are omitted.
>
> Also, comments with names of Java methods seem pretty confusing. It'd
> be better to actually explain what rules are the tests checking.

The tests were copied from various web sites and implementations. I have
cut down on the number of tests and made the comments more to the point.

>
> 7) There are almost no comments in the .c file (ignoring the comment
> on top). Short functions like initialize_node are probably fine
> without one, but e.g. update_node would deserve one.

More comments are added to both the .h and the .c file.

>
> 8) Some of the lines are pretty long (e.g. the update_node signature
> is almost 170 chars). That should be wrapped. Maybe try running
> pgindent on the code, that'll show which parts need better formatting
> (so as not to get broken later).

Fixed. I did run pgindent earlier, however it didn't catch those long
lines.

>
> 9) I'm sure there's better way to get the number of valid chars than this:
>
>   for (i = 0, ilen = 0; (c = read_char([i], )) && (c < 'A' ||
> c > ']'); i += ilen)
>   {
>   }
>
> Say, a while loop or something?

The code gets to the next encodable character, skipping any other
characters. I have now added a comment which should hopefully make this
clearer, and broken up the for loop for readability.

Please find attached the revised patch.

Best regards

Dag Lem

diff --git a/contrib/fuzzystrmatch/Makefile b/contrib/fuzzystrmatch/Makefile
index 0704894f88..826e529e3e 100644
--- a/contrib/fuzzystrmatch/Makefile
+++ b/contrib/fuzzystrmatch/Makefile
@@ -3,11 +3,12 @@
 MODULE_big = fuzzystrmatch
 OBJS = \
 	$(WIN32RES) \
+	daitch_mokotoff.o \
 	dmetaphone.o \
 	fuzzystrmatch.o
 
 EXTENSION = fuzzystrmatch
-DATA = fuzzystrmatch--1.1.sql fuzzystrmatch--1.0--1.1.sql
+DATA = fuzzystrmatch--1.2.sql fuzzystrmatch--1.1--1.2.sql fuzzystrmatch--1.0--1.1.sql
 PGFILEDESC = "fuzzystrmatch - similarities and distance between strings"
 
 REGRESS = fuzzystrmatch
@@ -22,3 +23,8 @@ top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 include $(top_srcdir)/contrib/contrib-global.mk
 endif
+
+daitch_mokotoff.o: daitch_mokotoff.h
+
+daitch_mokotoff.h: daitch_mokotoff_header.pl
+	perl $< > $@
diff --git a/contrib/fuzzystrmatch/daitch_mokotoff.c b/contrib/fuzzystrmatch/daitch_mokotoff.c
new file mode 100644
index 00..302e9a6d86
--- /dev/null
+++ b/contrib/fuzzystrmatch/daitch_mokotoff.c
@@ -0,0 +1,516 @@
+/*
+ * Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2021 Finance Norway
+ * Author: Dag Lem 
+ *
+ * This implementation of the Daitch-Mokotoff Soundex System aims at high
+ * performance.
+ *
+ * - The processing of each phoneme is initiated by an O(1) table lookup.
+ * - For phonemes containing more than one character, a coding tree is traversed
+ *   to process the complete phoneme.
+ * - The (alternate) soundex codes are produced digit by digit in-place in
+ *   another tree structure.
+ *
+ *  References:
+ *
+ * https://www.avotaynu.com/soundex.htm
+ * https://www.jewishgen.org/InfoFiles/Soundex.html
+ * https://familypedia.fandom.com/wiki/Daitch-Mokotoff_Soundex
+ * https://stevemorse.org/census/soundex.html (dmlat.php, dmsoundex.php)
+ * https://github.com/apache/commons-codec/ (dmrules.txt, DaitchMokotoffSoundex.java)
+ * https://metacpan.org/pod/Text::Phonetic (DaitchMokotoff.pm)
+ *
+ * A few notes on other implementations:
+ *
+ * - "J" is considered a vowel in dmlat.php
+ * - The official rules for "RS" are commented out in dmlat.php
+ * 

Re: daitch_mokotoff module

2021-12-13 Thread Tomas Vondra

On 12/13/21 16:05, Andrew Dunstan wrote:


On 12/13/21 09:26, Tomas Vondra wrote:

On 12/13/21 14:38, Dag Lem wrote:

Please find attached an updated patch, with the following fixes:

* Replaced remaining malloc/free with palloc/pfree.
* Made "make check" pass.
* Updated notes on other implementations.



Thanks, looks interesting. A couple generic comments, based on a quick
code review.

1) Can the extension be marked as trusted, just like fuzzystrmatch?

2) The docs really need an explanation of what the extension is for,
not just a link to fuzzystrmatch. Also, a couple examples would be
helpful, I guess - similarly to fuzzystrmatch. The last line in the
docs is annoyingly long.



It's not clear to me why we need a new module for this. Wouldn't it be
better just to add the new function to fuzzystrmatch?



Yeah, that's a valid point. I think we're quite conservative about 
adding more contrib modules, and adding a function to an existing one 
works around a lot of that.


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: daitch_mokotoff module

2021-12-13 Thread Andrew Dunstan


On 12/13/21 09:26, Tomas Vondra wrote:
> On 12/13/21 14:38, Dag Lem wrote:
>> Please find attached an updated patch, with the following fixes:
>>
>> * Replaced remaining malloc/free with palloc/pfree.
>> * Made "make check" pass.
>> * Updated notes on other implementations.
>>
>
> Thanks, looks interesting. A couple generic comments, based on a quick
> code review.
>
> 1) Can the extension be marked as trusted, just like fuzzystrmatch?
>
> 2) The docs really need an explanation of what the extension is for,
> not just a link to fuzzystrmatch. Also, a couple examples would be
> helpful, I guess - similarly to fuzzystrmatch. The last line in the
> docs is annoyingly long.


It's not clear to me why we need a new module for this. Wouldn't it be
better just to add the new function to fuzzystrmatch?


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: daitch_mokotoff module

2021-12-13 Thread Tomas Vondra

On 12/13/21 14:38, Dag Lem wrote:

Please find attached an updated patch, with the following fixes:

* Replaced remaining malloc/free with palloc/pfree.
* Made "make check" pass.
* Updated notes on other implementations.



Thanks, looks interesting. A couple generic comments, based on a quick 
code review.


1) Can the extension be marked as trusted, just like fuzzystrmatch?

2) The docs really need an explanation of what the extension is for, not 
just a link to fuzzystrmatch. Also, a couple examples would be helpful, 
I guess - similarly to fuzzystrmatch. The last line in the docs is 
annoyingly long.


3) What's daitch_mokotov_header.pl for? I mean, it generates the header, 
but when do we need to run it?


4) It seems to require perl-open, which is a module we did not need 
until now. Not sure how well supported it is, but maybe we can use a 
more standard module?


5) Do we need to keep DM_MAIN? It seems to be meant for some kind of 
testing, but our regression tests certainly don't need it (or the palloc 
mockup). I suggest to get rid of it.


6) I really don't understand some of the comments in daitch_mokotov.sql, 
like for example:


-- testEncodeBasic
-- Tests covered above are omitted.

Also, comments with names of Java methods seem pretty confusing. It'd be 
better to actually explain what rules are the tests checking.


7) There are almost no comments in the .c file (ignoring the comment on 
top). Short functions like initialize_node are probably fine without 
one, but e.g. update_node would deserve one.


8) Some of the lines are pretty long (e.g. the update_node signature is 
almost 170 chars). That should be wrapped. Maybe try running pgindent on 
the code, that'll show which parts need better formatting (so as not to 
get broken later).


9) I'm sure there's better way to get the number of valid chars than this:

  for (i = 0, ilen = 0; (c = read_char([i], )) && (c < 'A' || 
c > ']'); i += ilen)

  {
  }

Say, a while loop or something?


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: daitch_mokotoff module

2021-12-13 Thread Dag Lem
aitch_mokotoff 
+-
+ 565463
+(1 row)
+
+SELECT daitch_mokotoff('KI-NGSMITH');
+ daitch_mokotoff 
+-
+ 565463
+(1 row)
+
+SELECT daitch_mokotoff('KIN-GSMITH');
+ daitch_mokotoff 
+-
+ 565463
+(1 row)
+
+SELECT daitch_mokotoff('KING-SMITH');
+ daitch_mokotoff 
+-
+ 565463
+(1 row)
+
+SELECT daitch_mokotoff('KINGS-MITH');
+ daitch_mokotoff 
+-
+ 565463
+(1 row)
+
+SELECT daitch_mokotoff('KINGSM-ITH');
+ daitch_mokotoff 
+-
+ 565463
+(1 row)
+
+SELECT daitch_mokotoff('KINGSMI-TH');
+ daitch_mokotoff 
+-
+ 565463
+(1 row)
+
+SELECT daitch_mokotoff('KINGSMIT-H');
+ daitch_mokotoff 
+-
+ 565463
+(1 row)
+
+SELECT daitch_mokotoff('KINGSMITH-');
+ daitch_mokotoff 
+-
+ 565463
+(1 row)
+
+-- testEncodeIgnoreTrimmable
+SELECT daitch_mokotoff(E' \t\n\r Washington \t\n\r ');
+ daitch_mokotoff 
+-
+ 746536
+(1 row)
+
+SELECT daitch_mokotoff('Washington');
+ daitch_mokotoff 
+-
+ 746536
+(1 row)
+
+-- testSoundexBasic
+-- Tests covered above are omitted.
+-- testSoundexBasic2
+-- Tests covered above are omitted.
+-- testSoundexBasic3
+-- Tests covered above are omitted.
+-- testSpecialRomanianCharacters
+SELECT daitch_mokotoff('ţamas'); -- t-cedila
+ daitch_mokotoff 
+-
+ 364000 464000
+(1 row)
+
+SELECT daitch_mokotoff('țamas'); -- t-comma
+ daitch_mokotoff 
+-
+ 364000 464000
+(1 row)
+
+-- Contrived case which is not handled correctly by other implementations.
+SELECT daitch_mokotoff('CJC');
+  daitch_mokotoff  
+---
+ 55 54 545000 45 40 44
+(1 row)
+
diff --git a/contrib/daitch_mokotoff/sql/daitch_mokotoff.sql b/contrib/daitch_mokotoff/sql/daitch_mokotoff.sql
new file mode 100644
index 00..eafc24ee87
--- /dev/null
+++ b/contrib/daitch_mokotoff/sql/daitch_mokotoff.sql
@@ -0,0 +1,121 @@
+CREATE EXTENSION daitch_mokotoff;
+
+
+-- https://www.jewishgen.org/InfoFiles/Soundex.html
+SELECT daitch_mokotoff('GOLDEN');
+SELECT daitch_mokotoff('Alpert');
+SELECT daitch_mokotoff('Breuer');
+SELECT daitch_mokotoff('Freud');
+SELECT daitch_mokotoff('Haber');
+SELECT daitch_mokotoff('Manheim');
+SELECT daitch_mokotoff('Mintz');
+SELECT daitch_mokotoff('Topf');
+SELECT daitch_mokotoff('Kleinman');
+SELECT daitch_mokotoff('Ben Aron');
+
+SELECT daitch_mokotoff('AUERBACH');
+SELECT daitch_mokotoff('OHRBACH');
+SELECT daitch_mokotoff('LIPSHITZ');
+SELECT daitch_mokotoff('LIPPSZYC');
+SELECT daitch_mokotoff('LEWINSKY');
+SELECT daitch_mokotoff('LEVINSKI');
+SELECT daitch_mokotoff('SZLAMAWICZ');
+SELECT daitch_mokotoff('SHLAMOVITZ');
+
+
+-- https://www.avotaynu.com/soundex.htm
+SELECT daitch_mokotoff('Ceniow');
+SELECT daitch_mokotoff('Tsenyuv');
+SELECT daitch_mokotoff('Holubica');
+SELECT daitch_mokotoff('Golubitsa');
+SELECT daitch_mokotoff('Przemysl');
+SELECT daitch_mokotoff('Pshemeshil');
+SELECT daitch_mokotoff('Rosochowaciec');
+SELECT daitch_mokotoff('Rosokhovatsets');
+
+
+-- https://familypedia.fandom.com/wiki/Daitch-Mokotoff_Soundex
+SELECT daitch_mokotoff('Peters');
+SELECT daitch_mokotoff('Peterson');
+SELECT daitch_mokotoff('Moskowitz');
+SELECT daitch_mokotoff('Moskovitz');
+SELECT daitch_mokotoff('Auerbach');
+SELECT daitch_mokotoff('Uhrbach');
+SELECT daitch_mokotoff('Jackson');
+SELECT daitch_mokotoff('Jackson-Jackson');
+
+
+-- Perl Text::Phonetic::DaitchMokotoff 006_daitchmokotoff.t
+-- Tests covered above are omitted.
+SELECT daitch_mokotoff('Müller');
+SELECT daitch_mokotoff('schmidt');
+SELECT daitch_mokotoff('schneider');
+SELECT daitch_mokotoff('fischer');
+SELECT daitch_mokotoff('weber');
+SELECT daitch_mokotoff('meyer');
+SELECT daitch_mokotoff('wagner');
+SELECT daitch_mokotoff('schulz');
+SELECT daitch_mokotoff('becker');
+SELECT daitch_mokotoff('hoffmann');
+SELECT daitch_mokotoff('schäfer');
+
+
+-- Apache Commons DaitchMokotoffSoundexTest.java
+
+-- testAccentedCharacterFolding
+SELECT daitch_mokotoff('Straßburg');
+SELECT daitch_mokotoff('Strasburg');
+
+SELECT daitch_mokotoff('Éregon');
+SELECT daitch_mokotoff('Eregon');
+
+-- testAdjacentCodes
+SELECT daitch_mokotoff('AKSSOL');
+SELECT daitch_mokotoff('GERSCHFELD');
+
+-- testEncodeBasic
+-- Tests covered above are omitted.
+
+-- testEncodeIgnoreApostrophes
+SELECT daitch_mokotoff('OBrien');
+SELECT daitch_mokotoff('''OBrien');
+SELECT daitch_mokotoff('O''Brien');
+SELECT daitch_mokotoff('OB''rien');
+SELECT daitch_mokotoff('OBr''ien');
+SELECT daitch_mokotoff('OBri''en');
+SELECT daitch_mokotoff('OBrie''n');
+SELECT daitch_mokotoff('OBrien''');
+
+-- testEncodeIgnoreHyphens
+SELECT daitch_mokotoff('KINGSMITH');
+SELECT daitch_mokotoff('-KINGSMITH');
+SELECT daitch_mokotoff('K-INGSMITH');
+SELECT daitch_mokotoff('KI-NGSMITH');
+SELECT daitch_mokotoff('KIN-GSMITH');
+SELECT daitch_mokotoff('KING-SMITH');
+SELECT daitch_mokotoff('KINGS-MITH');
+SELECT daitch_mokotoff('KINGSM-ITH');
+SELECT daitch_mokotoff('KINGSM

daitch_mokotoff module

2021-12-03 Thread Dag Lem
Hello,

Please find attached a patch for the daitch_mokotoff module.

This implements the Daitch-Mokotoff Soundex System, as described in
https://www.avotaynu.com/soundex.htm

The module is used in production at Finance Norway.

In order to verify correctness, I have compared generated soundex codes
with corresponding results from the implementation by Stephen P. Morse
at https://stevemorse.org/census/soundex.html

Where soundex codes differ, the daitch_mokotoff module has been found
to be correct. The Morse implementation uses a few unofficial rules,
and also has an error in the handling of adjacent identical code
digits. Please see daitch_mokotoff.c for further references and
comments.

For reference, detailed instructions for soundex code comparison are
attached.


Best regards

Dag Lem

diff --git a/contrib/Makefile b/contrib/Makefile
index 87bf87ab90..5ea729 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -14,6 +14,7 @@ SUBDIRS = \
 		btree_gist	\
 		citext		\
 		cube		\
+		daitch_mokotoff	\
 		dblink		\
 		dict_int	\
 		dict_xsyn	\
diff --git a/contrib/daitch_mokotoff/Makefile b/contrib/daitch_mokotoff/Makefile
new file mode 100644
index 00..baec5e31d4
--- /dev/null
+++ b/contrib/daitch_mokotoff/Makefile
@@ -0,0 +1,25 @@
+# contrib/daitch_mokotoff/Makefile
+
+MODULE_big = daitch_mokotoff
+OBJS = \
+	$(WIN32RES) \
+	daitch_mokotoff.o
+
+EXTENSION = daitch_mokotoff
+DATA = daitch_mokotoff--1.0.sql
+PGFILEDESC = "daitch_mokotoff - Daitch-Mokotoff Soundex System"
+
+HEADERS = daitch_mokotoff.h
+
+REGRESS = daitch_mokotoff
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/daitch_mokotoff
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/daitch_mokotoff/daitch_mokotoff--1.0.sql b/contrib/daitch_mokotoff/daitch_mokotoff--1.0.sql
new file mode 100644
index 00..0b5a643175
--- /dev/null
+++ b/contrib/daitch_mokotoff/daitch_mokotoff--1.0.sql
@@ -0,0 +1,8 @@
+/* contrib/daitch_mokotoff/daitch_mokotoff--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION daitch_mokotoff" to load this file. \quit
+
+CREATE FUNCTION daitch_mokotoff(text) RETURNS text
+AS 'MODULE_PATHNAME', 'daitch_mokotoff'
+LANGUAGE C IMMUTABLE STRICT PARALLEL SAFE;
diff --git a/contrib/daitch_mokotoff/daitch_mokotoff.c b/contrib/daitch_mokotoff/daitch_mokotoff.c
new file mode 100644
index 00..9e66aee434
--- /dev/null
+++ b/contrib/daitch_mokotoff/daitch_mokotoff.c
@@ -0,0 +1,551 @@
+/*
+ * Daitch-Mokotoff Soundex
+ *
+ * Copyright (c) 2021 Finance Norway
+ * Author: Dag Lem 
+ *
+ * This implementation of the Daitch-Mokotoff Soundex System aims at high
+ * performance.
+ *
+ * - The processing of each phoneme is initiated by an O(1) table lookup.
+ * - For phonemes containing more than one character, a coding tree is traversed
+ *   to process the complete phoneme.
+ * - The (alternate) soundex codes are produced digit by digit in-place in
+ *   another tree structure.
+ *
+ *  References:
+ *
+ * https://www.avotaynu.com/soundex.htm
+ * https://www.jewishgen.org/InfoFiles/Soundex.html
+ * https://familypedia.fandom.com/wiki/Daitch-Mokotoff_Soundex
+ * https://stevemorse.org/phoneticinfo.htm (dmlat.php, dmsoundex.php)
+ * https://github.com/apache/commons-codec/ (dmrules.txt, DaitchMokotoffSoundex.java)
+ * https://metacpan.org/pod/Text::Phonetic (DaitchMokotoff.pm)
+ *
+ * A few notes on other implementations:
+ *
+ * - "J" is considered a vowel in dmlat.php
+ * - The official rules for "RS" are commented out in dmlat.php
+ * - Adjacent identical code digits are not collapsed correctly in dmsoundex.php
+ *   when double digit codes are involved. E.g. "BESST" yields 744300 instead of
+ *   743000 as for "BEST".
+ * - "Y" is not considered a vowel in DaitchMokotoffSoundex.java
+ * - Both dmlat.php and dmrules.txt have the same unofficial rules for "UE".
+ * - Coding of MN/NM + M/N differs between dmsoundex.php and DaitchMokotoffSoundex.java
+ * - Neither dmsoundex.php nor DaitchMokotoffSoundex.java yields all valid codes for e.g.
+ *   "CJC" (55 54 545000 45 40 44).
+ *
+ * Permission to use, copy, modify, and distribute this software and its
+ * documentation for any purpose, without fee, and without a written agreement
+ * is hereby granted, provided that the above copyright notice and this
+ * paragraph and the following two paragraphs appear in all copies.
+ *
+ * IN NO EVENT SHALL THE AUTHOR OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR
+ * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
+ * LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
+ * DOCUMENTATION, EVEN IF THE AUTHOR OR DISTRIBUTORS HAVE BEEN ADVISED OF THE
+ * POSSIBILIT