clusterdump + document ids?

Matt Spitz Fri, 29 Oct 2010 11:31:17 -0700

Hey, folks.

If I run kmeans-clustering with the -cl option, I get
<kmeans_output>/clusteredPoints.


Running clusterdump with -p <kmeans_output>/clusteredPoints, I get output
that looks like this for a given cluster (running on the reuters corpus):

...
        Top Terms:
                said                                    => 1.421944826722092
                3                                       => 0.9007495669006188
                reuter                                  => 0.8924866335531932
...
        Weight:  Point:
        1.0: [srd:7.671, 20.00:8.269, bp:13.510, co:3.989, financial:5.109, under:3.297, activities:5.452, called:4.288, investment:3.831, owns:5.457, interest:3.438, market:2.977, companies:3.894, plan:4.041, 02:4.706, joint:4.573, both:4.004, manage:6.697, oversight:8.364, trading:3.842, also:2.737, venture:7.009, 15:2.866, money:4.178, borrowing:5.718, inc:2.282, committee:4.438, north:6.854, 26-feb-1987:5.496, 55:4.719, form:7.126, management:4.015, standard:10.665, subsidiary:4.040, plc:4.462, america:6.863, 3:1.124, which:2.445, petroleum:4.716, oil:7.377, operated:6.224, british:4.646, reuter:1.133, pct:2.233, said:1.330, unit:3.541]
        1.0: [16:3.076, 35:4.400, 26-feb-1987:5.496]
        1.0: [54.20:9.057, 16:3.076, 36:4.670, 26-feb-1987:5.496]
...
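
For anyone who wants to post-process this output, here is a minimal Python sketch that splits one point line into its weight and its term/value pairs. It assumes each point sits on a single line of the form `weight: [term:value, ...]`, with any terminal wrapping and markup characters already removed. `parse_point_line` is a hypothetical helper of my own, not part of Mahout, and it does not recover document IDs, since they are not in this output.

```python
def parse_point_line(line):
    """Split one 'weight: [term:value, ...]' line into (weight, {term: value})."""
    # Everything before ': [' is the weight; the rest is the term list.
    weight_text, _, rest = line.strip().partition(": [")
    body = rest.rstrip("]")
    terms = {}
    for item in body.split(", "):
        if not item:
            continue
        # Split on the last colon only, in case a term itself contains one.
        term, _, value = item.rpartition(":")
        terms[term] = float(value)
    return float(weight_text), terms


weight, terms = parse_point_line("1.0: [16:3.076, 35:4.400, 26-feb-1987:5.496]")
print(weight, terms)
```

Splitting on the last colon keeps terms such as `20.00` or `26-feb-1987` intact.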

I get the bag of words for each point that ends up in a given cluster, but I
don't see the ID of the original document from which that bag of words was
generated (e.g. reut2-111.sgm-211.txt).

In the sequence file generated by 'seqdirectory', we get the following:

[msp...@wowzers mahout-distribution-0.4]$ hadoop dfs -cat examples/bin/work/reuters-out-seqdir/chunk-0 | head
SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text
�$�]Ƥ��...@7a/reut2-020.sgm-237.txt�'20-OCT-1987
10:29:21.69


HUTTON <EFH> REITERATES STATEMENT OF SOLVENCY
...

In the sparse vectors (which are passed into kmeans), we get:
[msp...@wowzers mahout-distribution-0.4]$ hadoop dfs -cat examples/bin/work/reuters-out-seqdir-sparse/tf-vectors/part-r-00000 | head
1�e�>�C/reut2-000.sgm-0.txt����...@��@�v?��?�?��?�...@�?��?��$?���?���?��@�1@
�6@
�"@�c?��?��?��3?��"?���?��c?...@��?�ľ@�...@?��?��?��~?��u?��u?�؈�@?��?��...@ݗ?��?���?�̜?���@�y?��?��?�?��l?��t@
�y@"��?��...@��?��/?�ė?��...@�����3@��?��}?��...@�p@�...@�d?���?��?��m?��?���?��k@
��?�Ņ?��0?��u?�ֹ?��o?��...@�%?�֖?��|@�...@��@���?��?��o?��?��?���?��%?��...@��@�f@
��?��?��T?��f?��@,�)?��?���?���?��?��f?��j?��n@
��?���?���?��?��s?��?�؞?��]?��?��?��
?���?���� ?��X?���?���?��?���?��
...

It looks like the document IDs are carried through the data wrangling but
are then unused by kmeans and/or not reported in clusteredPoints. It seems
to me that they'd be super useful to have in the final output. Are they easy
to get at?

Thanks,
Matt
