Hey, folks.

If I run kmeans-clustering with the -cl option, I get
<kmeans_output>/clusteredPoints.

Running clusterdump with -p <kmeans_output>/clusteredPoints, I get output
that looks like this for a given cluster (running on the Reuters corpus):

...
*        Top Terms:*
*                said                                    => 1.421944826722092*
*                3                                       => 0.9007495669006188*
*                reuter                                  => 0.8924866335531932*
...
*        Weight:  Point:*
*        1.0: [srd:7.671, 20.00:8.269, bp:13.510, co:3.989, financial:5.109,
under:3.297, activities:5.452, called:4.288, investment:3.831, owns:5.457,
interest:3.438, market:2.977, companies:3.894, plan:4.041, 02:4.706,
joint:4.573, both:4.004, manage:6.697, oversight:8.364, trading:3.842,
also:2.737, venture:7.009, 15:2.866, money:4.178, borrowing:5.718,
inc:2.282, committee:4.438, north:6.854, 26-feb-1987:5.496, 55:4.719,
form:7.126, management:4.015, standard:10.665, subsidiary:4.040, plc:4.462,
america:6.863, 3:1.124, which:2.445, petroleum:4.716, oil:7.377,
operated:6.224, british:4.646, reuter:1.133, pct:2.233, said:1.330,
unit:3.541]*
*        1.0: [16:3.076, 35:4.400, 26-feb-1987:5.496]*
*        1.0: [54.20:9.057, 16:3.076, 36:4.670, 26-feb-1987:5.496]*
...

I can see the bags of words that end up in a given cluster, but I don't see
the original document ID from which each bag of words was generated (e.g.
reut2-111.sgm-211.txt).

In the sequence file generated by 'seqdirectory', we get the following:

*[msp...@wowzers mahout-distribution-0.4]$ hadoop dfs -cat
examples/bin/work/reuters-out-seqdir/chunk-0 | head*
*SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text
�$�]Ƥ��...@7a/reut2-020.sgm-237.txt�'20-OCT-1987
10:29:21.69*
*HUTTON <EFH> REITERATES STATEMENT OF SOLVENCY*
...

In the sparse vectors (which are passed into kmeans), we get:
*[msp...@wowzers mahout-distribution-0.4]$ hadoop dfs -cat
examples/bin/work/reuters-out-seqdir-sparse/tf-vectors/part-r-00000 | head*
*1�e�>�C/reut2-000.sgm-0.txt����...@��@�v?��?�?��?�...@�?��?��$?���?���?��@�1@
�6@
�"@�c?��?��?��3?��"?���?��c?...@��?�ľ@�...@?��?��?��~?��u?��u?�؈�@?��?��...@ݗ?��?���?�̜?���@�y?��?��?�?��l?��t@
�y@"��?��...@��?��/?���...@�����3@��?��}?��...@�p@�...@�d?���?��?��m?��?���?��k@
��?�Ņ?��0?��u?�ֹ?��o?��...@�%?�֖?��|@�...@��@���?��?��o?��?��?���?��%?��...@��@�f@
��?��?��T?��f?��@,�)?��?���?���?��?��f?��j?��n@
��?���?���?��?��s?��?�؞?��]?��?��?��*
*?���?���� ?��X?���?���?��?���?��*
...

It looks like the document IDs are carried through the data wrangling (they
show up in both the seqdirectory output and the tf-vectors), but are then
either unused by kmeans or simply not reported in clusteredPoints. It seems
to me like it'd be super useful to have them in the final output. Is there
an easy way to get at them?
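
To make concrete what I'm after, here is a rough sketch (plain Python, not
Mahout code) of the join I'd otherwise have to do by hand: match each
clustered vector back to its document ID by comparing vector contents. It
assumes both sets of vectors have already been loaded into dicts somehow;
the loading step, the document names, and the cluster ID 7 are all made up
for illustration. Only the term weights are taken from the output above.

```python
from collections import defaultdict


def vector_key(vector):
    """Hashable, order-independent fingerprint of a sparse term->weight dict."""
    return frozenset(vector.items())


def attach_doc_ids(doc_vectors, clustered_points):
    """Return {cluster ID: sorted list of document IDs}.

    doc_vectors:      dict of document ID -> sparse vector (term -> weight)
    clustered_points: iterable of (cluster ID, sparse vector) pairs
    """
    # Index documents by vector content; identical vectors share one entry.
    ids_by_vector = defaultdict(set)
    for doc_id, vector in doc_vectors.items():
        ids_by_vector[vector_key(vector)].add(doc_id)

    docs_by_cluster = defaultdict(set)
    for cluster_id, vector in clustered_points:
        # Vectors with no matching document are silently skipped.
        docs_by_cluster[cluster_id] |= ids_by_vector.get(vector_key(vector), set())

    return {cid: sorted(ids) for cid, ids in docs_by_cluster.items()}


# Hypothetical inputs: document names and the cluster ID are invented.
doc_vectors = {
    "doc-a.txt": {"16": 3.076, "35": 4.400, "26-feb-1987": 5.496},
    "doc-b.txt": {"54.20": 9.057, "16": 3.076, "36": 4.670, "26-feb-1987": 5.496},
}
clustered_points = [
    (7, {"26-feb-1987": 5.496, "16": 3.076, "35": 4.400}),
    (7, {"54.20": 9.057, "16": 3.076, "36": 4.670, "26-feb-1987": 5.496}),
]

result = attach_doc_ids(doc_vectors, clustered_points)
print(result)
```

This only works if the weights match exactly on both sides, and documents
with identical vectors can't be told apart, which is why I'd much rather
have the ID carried along in clusteredPoints itself.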

Thanks,
Matt
