Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug

2018-04-26 Thread Bruce Momjian
On Mon, Apr 16, 2018 at 07:48:47PM +0300, Liudmila Mantrova wrote:
> Hi everyone,
> 
> When translating doc updates, Alexander Lakhin noticed that trigram examples
> were not quite accurate.
> A small patch fixing this issue is attached.

FYI, this has been applied by Teodor Sigaev:

   
https://git.postgresql.org/pg/commitdiff/9975c128a1d1bd7e7366adf133b21540a2bc2450

---


> 
> 
> On 03/21/2018 03:35 PM, Teodor Sigaev wrote:
> >Thank you, pushed
> >
> >David Steele wrote:
> >>On 3/6/18 7:04 AM, Teodor Sigaev wrote:
> I agree with Teodor (upthread, not quoted here) that the documentation
> could use some editing.
> 
> I started to do it myself, but quickly realized I have no knowledge of
> the content.  I'm afraid I would destroy the meaning while updating
> the
> grammar.
> 
> Anyone understand the subject matter well enough to review the
> documentation?
> >>>
> >>>Liudmila tried to improve docs in Alexander's patchset.
> >>>
> >>>https://www.postgresql.org/message-id/f43b242d-000c-f4c8-cb8b-d37e9752c...@postgrespro.ru
> >>>
> >>
> >>This looks good to me with a few minor exceptions:
> >>
> >>+   word_similarity(text, text) requires further
> >>+   explanation. Consider the following example:
> >>
> >>Maybe too verbose?  I think "word_similarity(text,
> >>text) requires further explanation." can be removed entirely.
> >>
> >>+   string.  However, this function does not add paddings to the
> >>
> >>"add padding"
> >>
> >>>BTW, adding Liudmila's message to commitfest task
> >>>(https://commitfest.postgresql.org/17/1403/) doesn't work
> >>
> >>Doesn't work for me either.
> >>
> >>Alexander, can you post the final patches to the thread so they show up
> >>in the CF app?
> >>
> >>Thanks,
> >>
> >
> 
> -- 
> Liudmila Mantrova
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
> 

> diff --git a/doc/src/sgml/pgtrgm.sgml b/doc/src/sgml/pgtrgm.sgml
> index 8f39529..be43cdf 100644
> --- a/doc/src/sgml/pgtrgm.sgml
> +++ b/doc/src/sgml/pgtrgm.sgml
> @@ -152,9 +152,9 @@
>  
>  
> In the first string, the set of trigrams is
> -   {"  w"," wo","ord","wor","rd "}.
> +   {"  w"," wo","wor","ord","rd "}.
> In the second string, the ordered set of trigrams is
> -   {"  t"," tw",two,"wo ","  w"," wo","wor","ord","rds", ds 
> "}.
> +   {"  t"," tw","two","wo ","  w"," wo","wor","ord","rds","ds 
> "}.
> The most similar extent of an ordered set of trigrams in the second string
> is {"  w"," wo","wor","ord"}, and the similarity is
> 0.8.
> @@ -172,7 +172,7 @@
> At the same time, strict_word_similarity(text, text)
> has to select an extent that matches word boundaries.  In the example 
> above,
> strict_word_similarity(text, text) would select the
> -   extent {"  w"," wo","wor","ord","rds", ds "}, which
> +   extent {"  w"," wo","wor","ord","rds","ds "}, which
> corresponds to the whole word 'words'.
>  
>  


-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug

2018-04-16 Thread Liudmila Mantrova

Hi everyone,

When translating doc updates, Alexander Lakhin noticed that trigram 
examples were not quite accurate.

A small patch fixing this issue is attached.


On 03/21/2018 03:35 PM, Teodor Sigaev wrote:

Thank you, pushed

David Steele wrote:

On 3/6/18 7:04 AM, Teodor Sigaev wrote:

I agree with Teodor (upthread, not quoted here) that the documentation
could use some editing.

I started to do it myself, but quickly realized I have no knowledge of
the content.  I'm afraid I would destroy the meaning while updating 
the

grammar.

Anyone understand the subject matter well enough to review the
documentation?


Liudmila tried to improve docs in Alexander's patchset.

https://www.postgresql.org/message-id/f43b242d-000c-f4c8-cb8b-d37e9752c...@postgrespro.ru 



This looks good to me with a few minor exceptions:

+   word_similarity(text, text) requires further
+   explanation. Consider the following example:

Maybe too verbose?  I think "word_similarity(text,
text) requires further explanation." can be removed entirely.

+   string.  However, this function does not add paddings to the

"add padding"


BTW, adding Liudmila's message to commitfest task
(https://commitfest.postgresql.org/17/1403/) doesn't work


Doesn't work for me either.

Alexander, can you post the final patches to the thread so they show up
in the CF app?

Thanks,





--
Liudmila Mantrova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/doc/src/sgml/pgtrgm.sgml b/doc/src/sgml/pgtrgm.sgml
index 8f39529..be43cdf 100644
--- a/doc/src/sgml/pgtrgm.sgml
+++ b/doc/src/sgml/pgtrgm.sgml
@@ -152,9 +152,9 @@
 
 
In the first string, the set of trigrams is
-   {"  w"," wo","ord","wor","rd "}.
+   {"  w"," wo","wor","ord","rd "}.
In the second string, the ordered set of trigrams is
-   {"  t"," tw",two,"wo ","  w"," wo","wor","ord","rds", ds "}.
+   {"  t"," tw","two","wo ","  w"," wo","wor","ord","rds","ds "}.
The most similar extent of an ordered set of trigrams in the second string
is {"  w"," wo","wor","ord"}, and the similarity is
0.8.
@@ -172,7 +172,7 @@
At the same time, strict_word_similarity(text, text)
has to select an extent that matches word boundaries.  In the example above,
strict_word_similarity(text, text) would select the
-   extent {"  w"," wo","wor","ord","rds", ds "}, which
+   extent {"  w"," wo","wor","ord","rds","ds "}, which
corresponds to the whole word 'words'.
 
 


Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug

2018-03-21 Thread Teodor Sigaev

Thank you, pushed

David Steele wrote:

On 3/6/18 7:04 AM, Teodor Sigaev wrote:

I agree with Teodor (upthread, not quoted here) that the documentation
could use some editing.

I started to do it myself, but quickly realized I have no knowledge of
the content.  I'm afraid I would destroy the meaning while updating the
grammar.

Anyone understand the subject matter well enough to review the
documentation?


Liudmila tried to improve docs in Alexander's patchset.

https://www.postgresql.org/message-id/f43b242d-000c-f4c8-cb8b-d37e9752c...@postgrespro.ru


This looks good to me with a few minor exceptions:

+   word_similarity(text, text) requires further
+   explanation. Consider the following example:

Maybe too verbose?  I think "word_similarity(text,
text) requires further explanation." can be removed entirely.

+   string.  However, this function does not add paddings to the

"add padding"


BTW, adding Liudmila's message to commitfest task
(https://commitfest.postgresql.org/17/1403/) doesn't work


Doesn't work for me either.

Alexander, can you post the final patches to the thread so they show up
in the CF app?

Thanks,



--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/



Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug

2018-03-21 Thread Alexander Korotkov
On Tue, Mar 6, 2018 at 7:59 PM, David Steele  wrote:

> On 3/6/18 7:04 AM, Teodor Sigaev wrote:
> >> I agree with Teodor (upthread, not quoted here) that the documentation
> >> could use some editing.
> >>
> >> I started to do it myself, but quickly realized I have no knowledge of
> >> the content.  I'm afraid I would destroy the meaning while updating the
> >> grammar.
> >>
> >> Anyone understand the subject matter well enough to review the
> >> documentation?
> >
> > Liudmila tried to improve docs in Alexander's patchset.
> >
> > https://www.postgresql.org/message-id/f43b242d-000c-f4c8-cb8
> b-d37e9752c...@postgrespro.ru
>
> This looks good to me with a few minor exceptions:
>
> +   word_similarity(text, text) requires further
> +   explanation. Consider the following example:
>
> Maybe too verbose?  I think "word_similarity(text,
> text) requires further explanation." can be removed entirely.
>
> +   string.  However, this function does not add paddings to the
>
> "add padding"
>
> > BTW, adding Liudmila's message to commitfest task
> > (https://commitfest.postgresql.org/17/1403/) doesn't work
>
> Doesn't work for me either.
>
> Alexander, can you post the final patches to the thread so they show up
> in the CF app?
>

I'm sorry for not updating patches, I've missed this message in the thread.

BTW, Teodor have pushed fix to the documentation up to 9.6.
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=
aea7c17e86e99a7ed4da489b3df2b5493b5e5e95

And new function strict_word_similarity() to PostgreSQL 11.
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=
be8a7a6866276b228b4ffaa3003e1dc2dd1d140a

Could someone put this information to stackoverflow?
https://stackoverflow.com/questions/46966360/postgres-word-similarity-not-comparing-words
I don't have enough of reputation to comment.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug

2018-03-06 Thread David Steele
On 3/6/18 7:04 AM, Teodor Sigaev wrote:
>> I agree with Teodor (upthread, not quoted here) that the documentation
>> could use some editing.
>>
>> I started to do it myself, but quickly realized I have no knowledge of
>> the content.  I'm afraid I would destroy the meaning while updating the
>> grammar.
>>
>> Anyone understand the subject matter well enough to review the
>> documentation?
> 
> Liudmila tried to improve docs in Alexander's patchset.
> 
> https://www.postgresql.org/message-id/f43b242d-000c-f4c8-cb8b-d37e9752c...@postgrespro.ru

This looks good to me with a few minor exceptions:

+   word_similarity(text, text) requires further
+   explanation. Consider the following example:

Maybe too verbose?  I think "word_similarity(text,
text) requires further explanation." can be removed entirely.

+   string.  However, this function does not add paddings to the

"add padding"

> BTW, adding Liudmila's message to commitfest task
> (https://commitfest.postgresql.org/17/1403/) doesn't work

Doesn't work for me either.

Alexander, can you post the final patches to the thread so they show up
in the CF app?

Thanks,
-- 
-David
da...@pgmasters.net



Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug

2018-03-06 Thread Teodor Sigaev

I agree with Teodor (upthread, not quoted here) that the documentation
could use some editing.

I started to do it myself, but quickly realized I have no knowledge of
the content.  I'm afraid I would destroy the meaning while updating the
grammar.

Anyone understand the subject matter well enough to review the
documentation?


Liudmila tried to improve docs in Alexander's patchset.

https://www.postgresql.org/message-id/f43b242d-000c-f4c8-cb8b-d37e9752c...@postgrespro.ru

BTW, adding Liudmila's message to commitfest task 
(https://commitfest.postgresql.org/17/1403/) doesn't work

--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/



Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug

2018-03-01 Thread David Steele

Hi Alexander,

On 3/1/18 4:26 PM, Alexander Korotkov wrote:
On Thu, Mar 1, 2018 at 11:05 PM, David Steele > wrote:


I agree with Teodor (upthread, not quoted here) that the documentation
could use some editing.

I started to do it myself, but quickly realized I have no knowledge of
the content.  I'm afraid I would destroy the meaning while updating the
grammar.

That's to problem.  If you're willing to help you can edit the documentation
and let me review that it's correct.  Also feel free to ask any 
questions and

more explanation from me.  Ultimately, we need to have a documentation
that any average user can understand, not to mention you :)


OK, I'm the CFM so I have my plate full for the next few days but if 
nobody picks this up then I will give it a go.



Anyone understand the subject matter well enough to review the
documentation?

I expect it would be hard to find anybody matching this criteria.


You are probably right, but it never hurts to try.

--
-David
da...@pgmasters.net



Re: Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug

2018-03-01 Thread Alexander Korotkov
On Thu, Mar 1, 2018 at 11:05 PM, David Steele  wrote:

> On 1/4/18 4:25 PM, Alexander Korotkov wrote:
> >
> > I just found that patch apply is failed according to
> > commitfest.cputube.org .  I think it's
> > because I sent only second patch from patchset in last message.
> > Anyway I resend both patches rebased to current master.
>
> I agree with Teodor (upthread, not quoted here) that the documentation
> could use some editing.
>
> I started to do it myself, but quickly realized I have no knowledge of
> the content.  I'm afraid I would destroy the meaning while updating the
> grammar.
>

That's to problem.  If you're willing to help you can edit the documentation
and let me review that it's correct.  Also feel free to ask any questions
and
more explanation from me.  Ultimately, we need to have a documentation
that any average user can understand, not to mention you :)

Anyone understand the subject matter well enough to review the
> documentation?
>

I expect it would be hard to find anybody matching this criteria.  But it
would be nice to find one though.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug

2018-03-01 Thread David Steele
Hi Alexander,

On 1/4/18 4:25 PM, Alexander Korotkov wrote:
> 
> I just found that patch apply is failed according to
> commitfest.cputube.org .  I think it's
> because I sent only second patch from patchset in last message.
> Anyway I resend both patches rebased to current master.

I agree with Teodor (upthread, not quoted here) that the documentation
could use some editing.

I started to do it myself, but quickly realized I have no knowledge of
the content.  I'm afraid I would destroy the meaning while updating the
grammar.

Anyone understand the subject matter well enough to review the
documentation?

Thanks,
-- 
-David
da...@pgmasters.net



Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug

2018-01-19 Thread Liudmila Mantrova

Hello everyone,

I would like to contribute to documentation review of the patches 
discussed in thread 
https://www.postgresql.org/message-id/flat/cy4pr17mb13207ed8310f847cf117eed0d8...@cy4pr17mb1320.namprd17.prod.outlook.com 
(https://commitfest.postgresql.org/16/1403/). Unfortunately, I was not 
subscribed to pgsql-hackers before, so I cannot respond to this thread 
directly.


Please find attached new revisions of the original patches. I hope 
you'll find the changes useful!


--
Liudmila Mantrova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/contrib/pg_trgm/trgm_op.c b/contrib/pg_trgm/trgm_op.c
index f7e96ac..306d60b 100644
--- a/contrib/pg_trgm/trgm_op.c
+++ b/contrib/pg_trgm/trgm_op.c
@@ -456,7 +456,7 @@ iterate_word_similarity(int *trg2indexes,
 			lastpos[trgindex] = i;
 		}
 
-		/* Adjust lower bound if this trigram is present in required substring */
+		/* Adjust upper bound if this trigram is present in required substring */
 		if (found[trgindex])
 		{
 			int			prev_lower,
@@ -473,7 +473,7 @@ iterate_word_similarity(int *trg2indexes,
 
 			smlr_cur = CALCSML(count, ulen1, ulen2);
 
-			/* Also try to adjust upper bound for greater similarity */
+			/* Also try to adjust lower bound for greater similarity */
 			tmp_count = count;
 			tmp_ulen2 = ulen2;
 			prev_lower = lower;
diff --git a/doc/src/sgml/pgtrgm.sgml b/doc/src/sgml/pgtrgm.sgml
index 338ef30..005961c 100644
--- a/doc/src/sgml/pgtrgm.sgml
+++ b/doc/src/sgml/pgtrgm.sgml
@@ -99,12 +99,10 @@
   
   real
   
-   Returns a number that indicates how similar the first string
-   to the most similar word of the second string. The function searches in
-   the second string a most similar word not a most similar substring.  The
-   range of the result is zero (indicating that the two strings are
-   completely dissimilar) to one (indicating that the first string is
-   identical to one of the words of the second string).
+   Returns a number that indicates the greatest similarity between
+   the set of trigrams in the first string and any continuous extent
+   of an ordered set of trigrams in the second string.  For details, see
+   the explanation below.
   
  
  
@@ -131,6 +129,35 @@

   
 
+  
+   word_similarity(text, text) requires further
+   explanation. Consider the following example:
+
+
+# SELECT word_similarity('word', 'two words');
+ word_similarity
+-
+ 0.8
+(1 row)
+
+
+   In the first string, the set of trigrams is
+   {"  w"," wo","ord","wor","rd "}.
+   In the second string, the ordered set of trigrams is
+   {"  t"," tw",two,"wo ","  w"," wo","wor","ord","rds", ds "}.
+   The most similar extent of an ordered set of trigrams in the second string
+   is {"  w"," wo","wor","ord"}, and the similarity is
+   0.8.
+  
+
+  
+   This function returns a value that can be approximately understood as the
+   greatest similarity between the first string and any substring of the second
+   string.  However, this function does not add paddings to the boundaries of
+   the extent.  Thus, a whole word match gets a higher score than a match with
+   a part of the word.
+  
+
   
pg_trgm Operators

@@ -156,10 +183,11 @@
   text % text
   boolean
   
-   Returns true if its first argument has the similar word in
-   the second argument and they have a similarity that is greater than the
-   current word similarity threshold set by
-   pg_trgm.word_similarity_threshold parameter.
+   Returns true if the similarity between the trigram
+   set in the first argument and a continuous extent of an ordered trigram
+   set in the second argument is greater than the current word similarity
+   threshold set by pg_trgm.word_similarity_threshold
+   parameter.
   
  
  
@@ -302,10 +330,11 @@ SELECT t, word_similarity('word', t) AS sml
   WHERE 'word' % t
   ORDER BY sml DESC, t;
 
-   This will return all values in the text column that have a word
-   which sufficiently similar to word, sorted from best
-   match to worst.  The index will be used to make this a fast operation
-   even over very large data sets.
+   This will return all values in the text column for which there is a
+   continuous extent in the corresponding ordered trigram set that is
+   sufficiently similar to the trigram set of word,
+   sorted from best match to worst.  The index will be used to make this
+   a fast operation even over very large data sets.
   
 
   
diff --git a/contrib/pg_trgm/Makefile b/contrib/pg_trgm/Makefile
index 212a890..dfecc2a 100644
--- a/contrib/pg_trgm/Makefile
+++ b/contrib/pg_trgm/Makefile
@@ -4,11 +4,12 @@ MODULE_big = pg_trgm
 OBJS = trgm_op.o trgm_gist.o trgm_gin.o trgm_regexp.o $(WIN32RES)
 
 EXTENSION = pg_trgm
-DATA = pg_trgm--1.3.sql pg_trgm--1.2--1.3.sql pg_trgm--1.1--1.2.sql \
+DATA = pg_trgm--1.3--1.4.sql \
+	pg_trgm--1.3.sql 

Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug

2018-01-04 Thread Alexander Korotkov
On Wed, Dec 13, 2017 at 2:13 PM, Alexander Korotkov <
a.korot...@postgrespro.ru> wrote:

> On Tue, Dec 12, 2017 at 2:33 PM, Teodor Sigaev  wrote:
>
>> 0002-pg-trgm-strict_word-similarity.patch – implementation of
>>> strict_word_similarity() with comments, docs and tests.
>>>
>> After some looking in
>>
>> 1)
>> repeated piece of code:
>> +   if (strategy == SimilarityStrategyNumber)
>> +   nlimit = similarity_threshold;
>> +   else if (strategy == WordSimilarityStrategyNumber)
>> +   nlimit = word_similarity_threshold;
>> +   else /* strategy == StrictWordSimilarityStrategyNumber */
>> +   nlimit = strict_word_similarity_threshold;
>> Isn't it better to move that piece to separate function?
>>
>
> Good point.  Moved to separate function.
>
> 2)
>> calc_word_similarity(char *str1, int slen1, char *str2, int slen2,
>>  bool check_only, bool word_bounds)
>>
>> Seems, two bools args are replaceble to  bitwise-ORed flag. It will
>> simplify adding new options in future.
>
>
> Yep.  I've introduced flags.
>
> Also, I've adjusted tests to make them stable (found example where TOP-8
> distances are unique).
> Please, find revised patch in attachment.
>

I just found that patch apply is failed according to commitfest.cputube.org.
I think it's because I sent only second patch from patchset in last message.
Anyway I resend both patches rebased to current master.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


0001-pg-trgm-word-similarity-docs-improvement-3.patch
Description: Binary data


0002-pg-trgm-strict_word-similarity-3.patch
Description: Binary data


Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug

2017-12-13 Thread Alexander Korotkov
On Tue, Dec 12, 2017 at 2:33 PM, Teodor Sigaev  wrote:

> 0002-pg-trgm-strict_word-similarity.patch – implementation of
>> strict_word_similarity() with comments, docs and tests.
>>
> After some looking in
>
> 1)
> repeated piece of code:
> +   if (strategy == SimilarityStrategyNumber)
> +   nlimit = similarity_threshold;
> +   else if (strategy == WordSimilarityStrategyNumber)
> +   nlimit = word_similarity_threshold;
> +   else /* strategy == StrictWordSimilarityStrategyNumber */
> +   nlimit = strict_word_similarity_threshold;
> Isn't it better to move that piece to separate function?
>

Good point.  Moved to separate function.

2)
> calc_word_similarity(char *str1, int slen1, char *str2, int slen2,
>  bool check_only, bool word_bounds)
>
> Seems, two bools args are replaceble to  bitwise-ORed flag. It will
> simplify adding new options in future.


Yep.  I've introduced flags.

Also, I've adjusted tests to make them stable (found example where TOP-8
distances are unique).
Please, find revised patch in attachment.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


0002-pg-trgm-strict_word-similarity-2.patch
Description: Binary data


Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug

2017-12-12 Thread Teodor Sigaev
0001-pg-trgm-word-similarity-docs-improvement.patch – contains improvement to 
documentation of word_similarity() and related operators.  I decided to give 
formal definition first (what exactly it internally does), and then example and 
some more human-understandable description.  This patch also adjusts two 
comments where lower and upper bounds mess up.


I'm ready for commit that, but I'd like someone from native English speaker to 
check that. Thank you.


And, suppose, this patch should be backpatched to 9.6

--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/



Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug

2017-12-12 Thread Teodor Sigaev
0002-pg-trgm-strict_word-similarity.patch – implementation of 
strict_word_similarity() with comments, docs and tests.

After some looking in

1)
repeated piece of code:
+   if (strategy == SimilarityStrategyNumber)
+   nlimit = similarity_threshold;
+   else if (strategy == WordSimilarityStrategyNumber)
+   nlimit = word_similarity_threshold;
+   else /* strategy == StrictWordSimilarityStrategyNumber */
+   nlimit = strict_word_similarity_threshold;
Isn't it better to move that piece to separate function?


2)
calc_word_similarity(char *str1, int slen1, char *str2, int slen2,
 bool check_only, bool word_bounds)

Seems, two bools args are replaceble to  bitwise-ORed flag. It will simplify 
adding new options in future.



--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/



Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug

2017-12-12 Thread Teodor Sigaev
0002-pg-trgm-strict_word-similarity.patch – implementation of 
strict_word_similarity() with comments, docs and tests.


The patch looks commmitable, but sometime I get
*** ...pgsql/contrib/pg_trgm/expected/pg_strict_word_trgm.out 2017-12-12 
14:16:55.190989000 +0300
--- .../pgsql/contrib/pg_trgm/results/pg_strict_word_trgm.out  2017-12-12 
14:17:27.645639000 +0300

***
*** 153,160 
  --+--
  0 | Kabankala
   0.25 | Kabankalan City Public Plaza
-  0.416667 | Kabakala
   0.416667 | Abankala
   0.538462 | Kabikala
  0.625 | Ntombankala School
   0.642857 | Nehalla Bankalah Reserved Forest
--- 153,160 
  --+--
  0 | Kabankala
   0.25 | Kabankalan City Public Plaza
   0.416667 | Abankala
+  0.416667 | Kabakala
   0.538462 | Kabikala
  0.625 | Ntombankala School
   0.642857 | Nehalla Bankalah Reserved Forest

==


Seems, some stability order should be added to tests

--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/



Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug

2017-12-11 Thread Alexander Korotkov
On Fri, Dec 8, 2017 at 2:50 PM, Alexander Korotkov <
a.korot...@postgrespro.ru> wrote:

> On Thu, Dec 7, 2017 at 8:59 PM, Robert Haas  wrote:
>
>> On Tue, Nov 7, 2017 at 7:51 AM, Jan Przemysław Wójcik
>>  wrote:
>> > I'm afraid that creating a function that implements quite different
>> > algorithms depending on a global parameter seems very hacky and would
>> lead
>> > to misunderstandings. I do understand the need of backward
>> compatibility,
>> > but I'd opt for the lesser evil. Perhaps a good idea would be to change
>> the
>> > name to 'substring_similarity()' and introduce the new function
>> > 'word_similarity()' later, for example in the next major version
>> release.
>>
>> That breaks things for everybody using word_similarity() currently.
>> If the previous discussion of this topic concluded that
>> word_similarity() was an OK name despite being a slight misnomer, I
>> don't think we should change our mind now.  Instead the new function
>> can be called something which makes the difference clear, e.g.
>> strict_word_similarity(), and the old function can remain as it is.
>
>
> +1
> Thank you for pointing this.  Yes, it would be better not to change
> existing names and behavior, but adjust documentation and add alternative
> behavior with another name.
> Therefore, I'm going to provide patchset of two patches:
> 1) Improve word_similarity() documentation.
> 2) Add new function strict_word_similarity() (or whatever better name we
> invent).
>

Please, find patchset attached.

0001-pg-trgm-word-similarity-docs-improvement.patch – contains improvement
to documentation of word_similarity() and related operators.  I decided to
give formal definition first (what exactly it internally does), and then
example and some more human-understandable description.  This patch also
adjusts two comments where lower and upper bounds mess up.

0002-pg-trgm-strict_word-similarity.patch – implementation of
strict_word_similarity() with comments, docs and tests.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


0001-pg-trgm-word-similarity-docs-improvement.patch
Description: Binary data


0002-pg-trgm-strict_word-similarity.patch
Description: Binary data


Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug

2017-12-08 Thread Alexander Korotkov
On Thu, Dec 7, 2017 at 8:59 PM, Robert Haas  wrote:

> On Tue, Nov 7, 2017 at 7:51 AM, Jan Przemysław Wójcik
>  wrote:
> > I'm afraid that creating a function that implements quite different
> > algorithms depending on a global parameter seems very hacky and would
> lead
> > to misunderstandings. I do understand the need of backward compatibility,
> > but I'd opt for the lesser evil. Perhaps a good idea would be to change
> the
> > name to 'substring_similarity()' and introduce the new function
> > 'word_similarity()' later, for example in the next major version release.
>
> That breaks things for everybody using word_similarity() currently.
> If the previous discussion of this topic concluded that
> word_similarity() was an OK name despite being a slight misnomer, I
> don't think we should change our mind now.  Instead the new function
> can be called something which makes the difference clear, e.g.
> strict_word_similarity(), and the old function can remain as it is.


+1
Thank you for pointing this.  Yes, it would be better not to change
existing names and behavior, but adjust documentation and add alternative
behavior with another name.
Therefore, I'm going to provide patchset of two patches:
1) Improve word_similarity() documentation.
2) Add new function strict_word_similarity() (or whatever better name we
invent).

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug

2017-12-07 Thread Robert Haas
On Tue, Nov 7, 2017 at 7:51 AM, Jan Przemysław Wójcik
 wrote:
> I'm afraid that creating a function that implements quite different
> algorithms depending on a global parameter seems very hacky and would lead
> to misunderstandings. I do understand the need of backward compatibility,
> but I'd opt for the lesser evil. Perhaps a good idea would be to change the
> name to 'substring_similarity()' and introduce the new function
> 'word_similarity()' later, for example in the next major version release.

That breaks things for everybody using word_similarity() currently.
If the previous discussion of this topic concluded that
word_similarity() was an OK name despite being a slight misnomer, I
don't think we should change our mind now.  Instead the new function
can be called something which makes the difference clear, e.g.
strict_word_similarity(), and the old function can remain as it is.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Fwd: [BUGS] pg_trgm word_similarity inconsistencies or bug

2017-12-07 Thread Alexander Korotkov
On Tue, Nov 7, 2017 at 7:24 PM, Alexander Korotkov <
a.korot...@postgrespro.ru> wrote:

> On Tue, Nov 7, 2017 at 3:51 PM, Jan Przemysław Wójcik <
> jan.przemyslaw.woj...@gmail.com> wrote:
>
>> my statement about the function usefulness was probably too categorical,
>> though I had in mind the current name of the function.
>>
>> I'm afraid that creating a function that implements quite different
>> algorithms depending on a global parameter seems very hacky and would lead
>> to misunderstandings. I do understand the need of backward compatibility,
>> but I'd opt for the lesser evil. Perhaps a good idea would be to change
>> the
>> name to 'substring_similarity()' and introduce the new function
>> 'word_similarity()' later, for example in the next major version release.
>>
>
> Good point.  I've no complaints about that.  I'm going to propose
> corresponding patch to the next commitfest.
>

I've written a draft patch for fixing this inconsistency.  Please, find it
in attachment.  This patch doesn't contain proper documentation and
comments yet.

I've called existing behavior subset_similarity().  I didn't use name
substring_similarity(), because it doesn't really looking for substring
with appropriate padding, but rather searching for continuous subset of
trigrams.  For index search over subset similarity, %>>, <<%, <->>>, <<<->
operators are provided.  I've added extra arrow sign to denote these
operators look deeper into string.

Simultaneously, word_similarity() now forces extent bounds to be word
bounds.  Now word_similarity() behaves similar to my_word_similarity()
proposed on stackoverlow.

# with data(t) as (
values
('message'),
('message s'),
('message sag'),
('message sag sag'),
('message sag sage')
)
select t, subset_similarity('sage', t), word_similarity('sage', t)
from data;
t | subset_similarity | word_similarity
--+---+-
 message  |   0.6 | 0.3
 message s|   0.8 |0.363636
 message sag  | 1 | 0.5
 message sag sag  | 1 | 0.5
 message sag sage | 1 |   1
(5 rows)

The difference here is only in 'messsage s' row, because word_similarity()
allows matching one word to two or more while my_word_similarity() doesn't
allow that.  In this case word_similarity() returns similarity between
'sage' and 'message s'.

# select similarity('sage', 'message s');
 similarity

   0.363636
(1 row)

I think behavior of word_similarity() appears better here, because typo can
break word into two.

I also wonder if word_similarity() and subset_similarity() should share
same threshold value for indexed search.  subset_similarity() typically
returns higher values than word_similarity().  Thus, it's probably makes
sense to split their threshold values.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


pg-trgm-word-subset-similarity-1.patch
Description: Binary data