where `similarity` is a measure of overlap of `t` between each pair of `c` and is calculated as the number of `t` common between a given pair divided by the total number of unique `t` in the pair.

Just like Igor, I take "unique `t` in pair" to mean "distinct `t` in pair".


I want the result to be

    source_c target_c similarity
    -------- -------- ----------
1 2 0.2 (= 1 common `t` between the 2 `c` / by total 4 `t`)

Here I take 0.2 to mean 0.25.


If so, here's some more amunition:

select src.c_no "Source c",
       trg.c_no "Target c",
(select count(*) from (select 1 from c_t where c_t.c_no in (src.c_no, trg.c_no) group by c_t.t_no having count(*) > 1))
       * 1.0 /
(select count(*) from (select distinct t_no from c_t where c_t.c_no in (src.c_no, trg.c_no))) Similarity
       from c src join c trg on (src.c_no < trg.c_no);

If I'm not mistaken, both versions should give identical results. Check by yourself on real data and see which is faster for you (most probably Igor's).

_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to