You're totally right. The short answer is that I'm an idiot: seq2sparse has a -nv option to generate the NamedVectors, and I failed to realize that.
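For anyone who hits the same thing, here is a rough sketch of the check Jeff describes in the quoted message below. It is not Mahout's code: `SimpleVector`, `SimpleNamedVector`, and the document name `/doc-name.txt` are stand-ins made up for illustration, and only the `instanceof` test and the `name = ` prefix follow the quoted `AbstractVector.asFormatString` snippet. The term weights are copied from one of the points in the clusterdump output quoted below.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class NamedVectorFormatSketch {

    // Hypothetical stand-in for a sparse vector: term -> weight.
    static class SimpleVector {
        final Map<String, Double> entries;

        SimpleVector(Map<String, Double> entries) {
            this.entries = entries;
        }
    }

    // Hypothetical stand-in for a wrapper that attaches a name to a vector.
    static class SimpleNamedVector extends SimpleVector {
        final String name;

        SimpleNamedVector(String name, Map<String, Double> entries) {
            super(entries);
            this.name = name;
        }
    }

    // Prefix the name only when the vector carries one, then print the terms.
    static String asFormatString(SimpleVector v) {
        StringBuilder buf = new StringBuilder();
        if (v instanceof SimpleNamedVector) {
            buf.append(((SimpleNamedVector) v).name).append(" = ");
        }
        buf.append('[');
        boolean first = true;
        for (Map.Entry<String, Double> e : v.entries.entrySet()) {
            if (!first) {
                buf.append(", ");
            }
            buf.append(e.getKey()).append(':')
               .append(String.format(Locale.ROOT, "%.3f", e.getValue()));
            first = false;
        }
        return buf.append(']').toString();
    }

    public static void main(String[] args) {
        Map<String, Double> terms = new LinkedHashMap<>();
        terms.put("16", 3.076);
        terms.put("35", 4.400);
        terms.put("26-feb-1987", 5.496);

        // Plain vector: no name is available, so only the terms are printed.
        // prints: [16:3.076, 35:4.400, 26-feb-1987:5.496]
        System.out.println(asFormatString(new SimpleVector(terms)));

        // Named vector: the (placeholder) name is printed before the terms.
        // prints: /doc-name.txt = [16:3.076, 35:4.400, 26-feb-1987:5.496]
        System.out.println(asFormatString(new SimpleNamedVector("/doc-name.txt", terms)));
    }
}
```

As far as I can tell, the first case is what I was seeing: the vectors I fed to kmeans had no name attached, so there was nothing for the dumper to print in front of each point.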
Thanks, Jeff!

On Oct 29, 2010 10:07 PM, "Jeff Eastman" <[email protected]> wrote:
> Hi Matt,
>
> K-means passes NamedVectors transparently through its processing,
> including the clustering output. Looking over the relevant code:
>
> ClusterDumper.print() calls AbstractVector.asFormatString(v,bindings)
> AbstractVector.asFormatString(v,bindings) evaluates
>     if (v instanceof NamedVector) {
>       buf.append(((NamedVector) v).getName()).append(" = ");
>     }
> ...before printing the formatted vector. Given that the sequence files
> generated by seq2sparse and presented to kmeans contain NamedVector
> wrappers (which they appear to do), the output should look like
> <name> = [<vector>]
>
> I don't know why you aren't seeing that. Can you please investigate?
> Jeff
>
> On 10/29/10 11:30 AM, Matt Spitz wrote:
>> Hey, folks.
>>
>> If I run kmeans-clustering with the -cl option, I get
>> <kmeans_output>/clusteredPoints.
>>
>> Running clusterdump with -p <kmeans_output>/clusteredPoints, I get output
>> that looks like this for a given cluster (running on the reuters corpus):
>>
>> ...
>> Top Terms:
>>     said => 1.421944826722092
>>     3 => 0.9007495669006188
>>     reuter => 0.8924866335531932
>> ...
>> Weight: Point:
>> 1.0: [srd:7.671, 20.00:8.269, bp:13.510, co:3.989, financial:5.109,
>> under:3.297, activities:5.452, called:4.288, investment:3.831, owns:5.457,
>> interest:3.438, market:2.977, companies:3.894, plan:4.041, 02:4.706,
>> joint:4.573, both:4.004, manage:6.697, oversight:8.364, trading:3.842,
>> also:2.737, venture:7.009, 15:2.866, money:4.178, borrowing:5.718,
>> inc:2.282, committee:4.438, north:6.854, 26-feb-1987:5.496, 55:4.719,
>> form:7.126, management:4.015, standard:10.665, subsidiary:4.040,
>> plc:4.462, america:6.863, 3:1.124, which:2.445, petroleum:4.716,
>> oil:7.377, operated:6.224, british:4.646, reuter:1.133, pct:2.233,
>> said:1.330, unit:3.541]
>> 1.0: [16:3.076, 35:4.400, 26-feb-1987:5.496]
>> 1.0: [54.20:9.057, 16:3.076, 36:4.670, 26-feb-1987:5.496]
>> ...
>>
>> I get the bags of words that end up in a given cluster, but I don't see the
>> original document ID from which that bag of words was generated (e.g.
>> reut2-111.sgm-211.txt, etc.).
>>
>> In the sequence file generated by 'seqdirectory', we get the following:
>>
>> [msp...@wowzers mahout-distribution-0.4]$ hadoop dfs -cat examples/bin/work/reuters-out-seqdir/chunk-0 | head
>> SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text
>> �$�]Ƥ��...@7a/reut2-020.sgm-237.txt�'20-OCT-1987 10:29:21.69
>>
>> HUTTON<EFH> REITERATES STATEMENT OF SOLVENCY
>> ...
>>
>> In the sparse vectors (which are passed into kmeans), we get:
>> [msp...@wowzers mahout-distribution-0.4]$ hadoop dfs -cat examples/bin/work/reuters-out-seqdir-sparse/tf-vectors/part-r-00000 | head
>> 1�e�>�C/reut2-000.sgm-0.txt [binary vector data]
>> ...
>>
>> It looks like the document IDs are being passed on through the data
>> wrangling but then unused by kmeans and/or not reported in clusteredPoints.
>> It seems to me like that'd be super useful to have them in the final
>> output. Are they easy to get at?
>>
>> Thanks,
>> Matt
>
