Hey, folks. If I run kmeans-clustering with the -cl option, I get <kmeans_output>/clusteredPoints.
Running clusterdump with -p <kmeans_output>/clusteredPoints, I get output that looks like this for a given cluster (running on the reuters corpus):

    ...
    Top Terms:
        said => 1.421944826722092
        3 => 0.9007495669006188
        reuter => 0.8924866335531932
    ...
    Weight: Point:
    1.0: [srd:7.671, 20.00:8.269, bp:13.510, co:3.989, financial:5.109, under:3.297, activities:5.452, called:4.288, investment:3.831, owns:5.457, interest:3.438, market:2.977, companies:3.894, plan:4.041, 02:4.706, joint:4.573, both:4.004, manage:6.697, oversight:8.364, trading:3.842, also:2.737, venture:7.009, 15:2.866, money:4.178, borrowing:5.718, inc:2.282, committee:4.438, north:6.854, 26-feb-1987:5.496, 55:4.719, form:7.126, management:4.015, standard:10.665, subsidiary:4.040, plc:4.462, america:6.863, 3:1.124, which:2.445, petroleum:4.716, oil:7.377, operated:6.224, british:4.646, reuter:1.133, pct:2.233, said:1.330, unit:3.541]
    1.0: [16:3.076, 35:4.400, 26-feb-1987:5.496]
    1.0: [54.20:9.057, 16:3.076, 36:4.670, 26-feb-1987:5.496]
    ...

I get the bags of words that end up in a given cluster, but I don't see the original document ID from which that bag of words was generated (e.g. reut2-111.sgm-211.txt, etc.).

In the sequence file generated by 'seqdirectory', we get the following:

    [msp...@wowzers mahout-distribution-0.4]$ hadoop dfs -cat examples/bin/work/reuters-out-seqdir/chunk-0 | head
    SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text [binary data] /reut2-020.sgm-237.txt [binary data] 20-OCT-1987 10:29:21.69

    HUTTON <EFH> REITERATES STATEMENT OF SOLVENCY
    ...
In the sparse vectors (which are passed into kmeans), we get:

    [msp...@wowzers mahout-distribution-0.4]$ hadoop dfs -cat examples/bin/work/reuters-out-seqdir-sparse/tf-vectors/part-r-00000 | head
    [binary data] /reut2-000.sgm-0.txt [binary vector data]
    ...

It looks like the document IDs are being passed on through the data wrangling but then unused by kmeans and/or not reported in clusteredPoints. It seems to me like that'd be super useful to have them in the final output. Are they easy to get at?

Thanks,
Matt
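
To make the request concrete, here is a minimal sketch in plain Python (not Mahout code, and not how Mahout implements kmeans) of what keeping the IDs attached would look like: each sparse vector stays keyed by its document ID while it is assigned to the nearest centroid, so the final output lists document IDs per cluster. The function names, the toy vectors, the centroids, and the specific IDs are all hypothetical; only the ID format (reut2-NNN.sgm-N.txt) is taken from the listings above.

```python
# Illustrative sketch only: plain Python, not Mahout's API.
# All names and toy values below are hypothetical.
import math


def distance(a, b):
    """Euclidean distance between two sparse vectors stored as {term: weight}."""
    terms = set(a) | set(b)
    return math.sqrt(sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in terms))


def assign_with_ids(named_vectors, centroids):
    """Map each cluster index to the IDs of the documents nearest to it.

    named_vectors: {doc_id: {term: weight}}
    centroids:     list of {term: weight}
    """
    clusters = {i: [] for i in range(len(centroids))}
    for doc_id, vec in named_vectors.items():
        # Pick the index of the closest centroid for this document.
        nearest = min(range(len(centroids)),
                      key=lambda i: distance(vec, centroids[i]))
        clusters[nearest].append(doc_id)
    return clusters


# Toy data (made up); only the ID format mirrors the seqdirectory keys.
docs = {
    "reut2-000.sgm-0.txt": {"oil": 7.0, "petroleum": 5.0},
    "reut2-000.sgm-1.txt": {"money": 4.0, "borrowing": 6.0},
    "reut2-000.sgm-2.txt": {"oil": 6.0, "british": 4.0},
}
centroids = [{"oil": 6.5}, {"money": 4.0, "borrowing": 5.0}]

result = assign_with_ids(docs, centroids)
for cluster_id, doc_ids in result.items():
    print(cluster_id, doc_ids)
```

The point of the sketch is only that the ID travels with the vector as its key, so reporting it per cluster costs nothing extra once the assignment is done.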
