Re: [GENERAL] Grouping By Similarity (using pg_trgm)?

2015-05-22 Thread Oleg Bartunov
Have you seen http://www.sai.msu.su/~megera/postgres/talks/pgcon-2012.pdf ?

On Thu, May 14, 2015 at 9:58 PM, Cory Tucker cory.tuc...@gmail.com wrote:

 [pg version 9.3 or 9.4]

 Suppose I have a simple table:

 create table data (
   my_value  TEXT NOT NULL
 );
 CREATE INDEX idx_my_value ON data USING gin(my_value gin_trgm_ops);


 Now I would like to essentially do group by to get a count of all the
 values that are sufficiently similar.  I can do it using something like a
 CROSS JOIN to join the table on itself, but then I still am getting all the
 rows with duplicate counts.

 Is there a way to do a group by query and only return a single my_value
 column and a count of the number of times other values are similar while
 also not returning the included similar values in the output, too?




Re: [GENERAL] Grouping By Similarity (using pg_trgm)?

2015-05-14 Thread David G. Johnston
On Thu, May 14, 2015 at 1:09 PM, Cory Tucker cory.tuc...@gmail.com wrote:

 That produces pretty much the same results as the CROSS JOIN I was using
 before.  Because each my_value in the table are different, if I group on
 just their value then I will always have the full result set and a bunch of
 essentially duplicated results.

 Any other ideas/options?


​how do you want to solve the problem:

A is similar to B
B is similar to C
A IS NOT similar to C

​?

I'm not sure that PostgreSQL is the best tool to solve clustering
problems...though my experience with them is minimal.

David J.


Re: [GENERAL] Grouping By Similarity (using pg_trgm)?

2015-05-14 Thread Cory Tucker
That produces pretty much the same results as the CROSS JOIN I was using
before.  Because each my_value in the table are different, if I group on
just their value then I will always have the full result set and a bunch of
essentially duplicated results.

Any other ideas/options?

On Thu, May 14, 2015 at 12:08 PM David G. Johnston 
david.g.johns...@gmail.com wrote:


 On Thu, May 14, 2015 at 11:58 AM, Cory Tucker cory.tuc...@gmail.com
 wrote:

 [pg version 9.3 or 9.4]

 Suppose I have a simple table:

 create table data (
   my_value  TEXT NOT NULL
 );
 CREATE INDEX idx_my_value ON data USING gin(my_value gin_trgm_ops);


 Now I would like to essentially do group by to get a count of all the
 values that are sufficiently similar.  I can do it using something like a
 CROSS JOIN to join the table on itself, but then I still am getting all the
 rows with duplicate counts.

 Is there a way to do a group by query and only return a single my_value
 column and a count of the number of times other values are similar while
 also not returning the included similar values in the output, too?


 ​Concept below - not bothering to lookup the functions/operators for
 pg_trgm:

 SELECT my_value_src, count(*)
 FROM (SELECT my_value AS my_value_src FROM data) src
 JOIN (SELECT my_value AS my_value_compareto FROM data) comparedto
 ON ( func(my_value_src, my_value_compareto)  # )
 GROUP BY my_value_src

 ​David J.




Re: [GENERAL] Grouping By Similarity (using pg_trgm)?

2015-05-14 Thread David G. Johnston
On Thu, May 14, 2015 at 11:58 AM, Cory Tucker cory.tuc...@gmail.com wrote:

 [pg version 9.3 or 9.4]

 Suppose I have a simple table:

 create table data (
   my_value  TEXT NOT NULL
 );
 CREATE INDEX idx_my_value ON data USING gin(my_value gin_trgm_ops);


 Now I would like to essentially do group by to get a count of all the
 values that are sufficiently similar.  I can do it using something like a
 CROSS JOIN to join the table on itself, but then I still am getting all the
 rows with duplicate counts.

 Is there a way to do a group by query and only return a single my_value
 column and a count of the number of times other values are similar while
 also not returning the included similar values in the output, too?


​Concept below - not bothering to lookup the functions/operators for
pg_trgm:

SELECT my_value_src, count(*)
FROM (SELECT my_value AS my_value_src FROM data) src
JOIN (SELECT my_value AS my_value_compareto FROM data) comparedto
ON ( func(my_value_src, my_value_compareto)  # )
GROUP BY my_value_src

​David J.