Hi Matt,

K-means passes NamedVectors transparently through its processing, including the clustering output. Looking over the relevant code:

ClusterDumper.print() calls AbstractVector.asFormatString(v,bindings)
AbstractVector.asFormatString(v,bindings) evaluates
    if (v instanceof NamedVector) {
      buf.append(((NamedVector) v).getName()).append(" = ");
    }
...before printing the formatted vector. Provided the sequence files generated by seq2sparse and presented to kmeans contain NamedVector wrappers (which they appear to), the output should look like <name> = [<vector>].
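
To show that check in isolation, here is a minimal, self-contained sketch. SimpleVector and SimpleNamedVector are hypothetical stand-ins I made up for illustration, not Mahout's classes, and the vector body is printed with Arrays.toString rather than Mahout's own formatting, so treat it as an approximation of the behavior and not the actual implementation:

```java
import java.util.Arrays;

public class NamedVectorFormatSketch {

    // Hypothetical stand-in for a plain (unnamed) vector.
    static class SimpleVector {
        final double[] values;

        SimpleVector(double[] values) {
            this.values = values;
        }
    }

    // Hypothetical stand-in for a vector wrapped with a name (e.g. a document ID).
    static class SimpleNamedVector extends SimpleVector {
        final String name;

        SimpleNamedVector(String name, double[] values) {
            super(values);
            this.name = name;
        }

        String getName() {
            return name;
        }
    }

    // Same shape as the quoted check: prefix "<name> = " only if the vector carries a name.
    static String asFormatString(SimpleVector v) {
        StringBuilder buf = new StringBuilder();
        if (v instanceof SimpleNamedVector) {
            buf.append(((SimpleNamedVector) v).getName()).append(" = ");
        }
        // Assumption: Arrays.toString stands in for the real vector formatting.
        buf.append(Arrays.toString(v.values));
        return buf.toString();
    }

    public static void main(String[] args) {
        double[] weights = {1.0, 2.5};
        System.out.println(asFormatString(new SimpleNamedVector("doc-1", weights)));
        System.out.println(asFormatString(new SimpleVector(weights)));
    }
}
```

If the name prefix is missing from your output, the sketch suggests that the vectors reaching the formatter are the plain kind, i.e. the name was dropped somewhere before that point.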

I don't know why you aren't seeing that. Can you please investigate?
Jeff

On 10/29/10 11:30 AM, Matt Spitz wrote:
Hey, folks.

If I run kmeans-clustering with the -cl option, I get
<kmeans_output>/clusteredPoints.

Running clusterdump with -p <kmeans_output>/clusteredPoints, I get output
that looks like this for a given cluster (running on the Reuters corpus):

...
        Top Terms:
                said                                    => 1.421944826722092
                3                                       => 0.9007495669006188
                reuter                                  => 0.8924866335531932
...
        Weight:  Point:
        1.0: [srd:7.671, 20.00:8.269, bp:13.510, co:3.989, financial:5.109, under:3.297, activities:5.452, called:4.288, investment:3.831, owns:5.457, interest:3.438, market:2.977, companies:3.894, plan:4.041, 02:4.706, joint:4.573, both:4.004, manage:6.697, oversight:8.364, trading:3.842, also:2.737, venture:7.009, 15:2.866, money:4.178, borrowing:5.718, inc:2.282, committee:4.438, north:6.854, 26-feb-1987:5.496, 55:4.719, form:7.126, management:4.015, standard:10.665, subsidiary:4.040, plc:4.462, america:6.863, 3:1.124, which:2.445, petroleum:4.716, oil:7.377, operated:6.224, british:4.646, reuter:1.133, pct:2.233, said:1.330, unit:3.541]
        1.0: [16:3.076, 35:4.400, 26-feb-1987:5.496]
        1.0: [54.20:9.057, 16:3.076, 36:4.670, 26-feb-1987:5.496]
...

I get the bags of words that end up in a given cluster, but I don't see the
original document ID from which that bag of words was generated (e.g.
reut2-111.sgm-211.txt, etc)

In the sequence file generated by 'seqdirectory', we get the following:

[msp...@wowzers mahout-distribution-0.4]$ hadoop dfs -cat examples/bin/work/reuters-out-seqdir/chunk-0 | head
SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text
�$�]Ƥ��...@7a/reut2-020.sgm-237.txt�'20-OCT-1987 10:29:21.69

HUTTON<EFH>  REITERATES STATEMENT OF SOLVENCY
...

In the sparse vectors (which are passed into kmeans), we get:
[msp...@wowzers mahout-distribution-0.4]$ hadoop dfs -cat examples/bin/work/reuters-out-seqdir-sparse/tf-vectors/part-r-00000 | head
1�e�>�C/reut2-000.sgm-0.txt���� [remaining binary vector data omitted]
...

It looks like the document IDs are carried through the preprocessing steps
but are then either unused by kmeans or not reported in clusteredPoints.
It seems to me that it'd be super useful to have them in the final
output. Are they easy to get at?

Thanks,
Matt
