Re: [GENERAL] Grouping By Similarity (using pg_trgm)?
Have you seen http://www.sai.msu.su/~megera/postgres/talks/pgcon-2012.pdf ? On Thu, May 14, 2015 at 9:58 PM, Cory Tucker cory.tuc...@gmail.com wrote: [pg version 9.3 or 9.4] Suppose I have a simple table: create table data ( my_value TEXT NOT NULL ); CREATE INDEX idx_my_value ON data USING gin(my_value gin_trgm_ops); Now I would like to essentially do group by to get a count of all the values that are sufficiently similar. I can do it using something like a CROSS JOIN to join the table on itself, but then I still am getting all the rows with duplicate counts. Is there a way to do a group by query and only return a single my_value column and a count of the number of times other values are similar while also not returning the included similar values in the output, too?
Re: [GENERAL] Grouping By Similarity (using pg_trgm)?
On Thu, May 14, 2015 at 1:09 PM, Cory Tucker cory.tuc...@gmail.com wrote: That produces pretty much the same results as the CROSS JOIN I was using before. Because each my_value in the table are different, if I group on just their value then I will always have the full result set and a bunch of essentially duplicated results. Any other ideas/options? how do you want to solve the problem: A is similar to B B is similar to C A IS NOT similar to C ? I'm not sure that PostgreSQL is the best tool to solve clustering problems...though my experience with them is minimal. David J.
Re: [GENERAL] Grouping By Similarity (using pg_trgm)?
That produces pretty much the same results as the CROSS JOIN I was using before. Because each my_value in the table are different, if I group on just their value then I will always have the full result set and a bunch of essentially duplicated results. Any other ideas/options? On Thu, May 14, 2015 at 12:08 PM David G. Johnston david.g.johns...@gmail.com wrote: On Thu, May 14, 2015 at 11:58 AM, Cory Tucker cory.tuc...@gmail.com wrote: [pg version 9.3 or 9.4] Suppose I have a simple table: create table data ( my_value TEXT NOT NULL ); CREATE INDEX idx_my_value ON data USING gin(my_value gin_trgm_ops); Now I would like to essentially do group by to get a count of all the values that are sufficiently similar. I can do it using something like a CROSS JOIN to join the table on itself, but then I still am getting all the rows with duplicate counts. Is there a way to do a group by query and only return a single my_value column and a count of the number of times other values are similar while also not returning the included similar values in the output, too? Concept below - not bothering to lookup the functions/operators for pg_trgm: SELECT my_value_src, count(*) FROM (SELECT my_value AS my_value_src FROM data) src JOIN (SELECT my_value AS my_value_compareto FROM data) comparedto ON ( func(my_value_src, my_value_compareto) # ) GROUP BY my_value_src David J.
Re: [GENERAL] Grouping By Similarity (using pg_trgm)?
On Thu, May 14, 2015 at 11:58 AM, Cory Tucker cory.tuc...@gmail.com wrote: [pg version 9.3 or 9.4] Suppose I have a simple table: create table data ( my_value TEXT NOT NULL ); CREATE INDEX idx_my_value ON data USING gin(my_value gin_trgm_ops); Now I would like to essentially do group by to get a count of all the values that are sufficiently similar. I can do it using something like a CROSS JOIN to join the table on itself, but then I still am getting all the rows with duplicate counts. Is there a way to do a group by query and only return a single my_value column and a count of the number of times other values are similar while also not returning the included similar values in the output, too? Concept below - not bothering to lookup the functions/operators for pg_trgm: SELECT my_value_src, count(*) FROM (SELECT my_value AS my_value_src FROM data) src JOIN (SELECT my_value AS my_value_compareto FROM data) comparedto ON ( func(my_value_src, my_value_compareto) # ) GROUP BY my_value_src David J.