Hi Matt,
K-means passes NamedVectors transparently through its processing,
including the clustering output. Looking over the relevant code:
ClusterDumper.print() calls AbstractVector.asFormatString(v, bindings),
which evaluates
    if (v instanceof NamedVector) {
      buf.append(((NamedVector) v).getName()).append(" = ");
    }
...before printing the formatted vector. Since the sequence files
generated by seq2sparse and presented to kmeans appear to contain
NamedVector wrappers, each point in the output should look like
<name> = [<vector>].
I don't know why you aren't seeing that. Can you please investigate?
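To make the logic above concrete, here is a minimal, self-contained sketch of the same check. SimpleVector and SimpleNamedVector are hypothetical stand-ins written only for this example; they are not Mahout's Vector or NamedVector classes, and the bindings argument is left out. It only illustrates how a named wrapper would cause the name to be prepended to the formatted vector.

```java
import java.util.Arrays;

public class NamedVectorFormatSketch {

    /** Hypothetical stand-in for a plain vector (not Mahout's Vector). */
    static class SimpleVector {
        final double[] values;

        SimpleVector(double... values) {
            this.values = values;
        }
    }

    /** Hypothetical stand-in for a wrapper that carries a name, e.g. a document ID. */
    static class SimpleNamedVector extends SimpleVector {
        final String name;

        SimpleNamedVector(String name, double... values) {
            super(values);
            this.name = name;
        }

        String getName() {
            return name;
        }
    }

    /** Prepends "<name> = " only when the vector is a named wrapper. */
    static String asFormatString(SimpleVector v) {
        StringBuilder buf = new StringBuilder();
        if (v instanceof SimpleNamedVector) {
            buf.append(((SimpleNamedVector) v).getName()).append(" = ");
        }
        buf.append(Arrays.toString(v.values));
        return buf.toString();
    }

    public static void main(String[] args) {
        // Named wrapper: prints "reut2-000.sgm-0.txt = [1.0, 2.5]"
        System.out.println(asFormatString(new SimpleNamedVector("reut2-000.sgm-0.txt", 1.0, 2.5)));
        // Plain vector: prints "[1.0, 2.5]"
        System.out.println(asFormatString(new SimpleVector(1.0, 2.5)));
    }
}
```

If the points in clusteredPoints print without a "<name> = " prefix, that would suggest the vectors are no longer NamedVector instances by the time they reach the dumper.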
Jeff
On 10/29/10 11:30 AM, Matt Spitz wrote:
Hey, folks.
If I run kmeans clustering with the -cl option, I get
<kmeans_output>/clusteredPoints.
Running clusterdump with -p <kmeans_output>/clusteredPoints, I get output
that looks like this for a given cluster (running on the Reuters corpus):
...
Top Terms:
    said => 1.421944826722092
    3 => 0.9007495669006188
    reuter => 0.8924866335531932
...
Weight: Point:
1.0: [srd:7.671, 20.00:8.269, bp:13.510, co:3.989, financial:5.109,
  under:3.297, activities:5.452, called:4.288, investment:3.831,
  owns:5.457, interest:3.438, market:2.977, companies:3.894, plan:4.041,
  02:4.706, joint:4.573, both:4.004, manage:6.697, oversight:8.364,
  trading:3.842, also:2.737, venture:7.009, 15:2.866, money:4.178,
  borrowing:5.718, inc:2.282, committee:4.438, north:6.854,
  26-feb-1987:5.496, 55:4.719, form:7.126, management:4.015,
  standard:10.665, subsidiary:4.040, plc:4.462, america:6.863, 3:1.124,
  which:2.445, petroleum:4.716, oil:7.377, operated:6.224, british:4.646,
  reuter:1.133, pct:2.233, said:1.330, unit:3.541]
1.0: [16:3.076, 35:4.400, 26-feb-1987:5.496]
1.0: [54.20:9.057, 16:3.076, 36:4.670, 26-feb-1987:5.496]
...
I get the bags of words that end up in a given cluster, but I don't see the
original document ID from which each bag of words was generated (e.g.
reut2-111.sgm-211.txt).
In the sequence file generated by 'seqdirectory', we get the following:
[msp...@wowzers mahout-distribution-0.4]$ hadoop dfs -cat examples/bin/work/reuters-out-seqdir/chunk-0 | head
SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.Text
�$�]Ƥ��...@7a/reut2-020.sgm-237.txt�'20-OCT-1987
10:29:21.69
HUTTON<EFH> REITERATES STATEMENT OF SOLVENCY
...
In the sparse vectors (which are passed into kmeans), we get:
[msp...@wowzers mahout-distribution-0.4]$ hadoop dfs -cat examples/bin/work/reuters-out-seqdir-sparse/tf-vectors/part-r-00000 | head
*1�e�>�C/reut2-000.sgm-0.txt����...@��@�v?��?�?��?�...@�?��?��$?���?���?��@�1@
�6@
�"@�c?��?��?��3?��"?���?��c?...@��?�ľ@�...@?��?��?��~?��u?��u?�؈�@?��?��...@ݗ?��?���?�̜?���@�y?��?��?�?��l?��t@
�y@"��?��...@��?��/?���...@�����3@��?��}?��...@�p@�...@�d?���?��?��m?��?���?��k@
��?�Ņ?��0?��u?�ֹ?��o?��...@�%?�֖?��|@�...@��@���?��?��o?��?��?���?��%?��...@��@�f@
��?��?��T?��f?��@,�)?��?���?���?��?��f?��j?��n@
��?���?���?��?��s?��?�؞?��]?��?��?��*
*?���?���� ?��X?���?���?��?���?��*
...
It looks like the document IDs are carried through the data preparation
steps but are then either unused by kmeans or not reported in
clusteredPoints. It seems to me that it'd be super useful to have them in
the final output. Are they easy to get at?
Thanks,
Matt