Thank you for your reply, Robin.

I actually got the sequence file in the clusteredPoints directory like this:

Input Path:
/user/root/item-contents-sample/cluster/out/clusteredPoints/part-m-00000
Key class: class org.apache.hadoop.io.IntWritable Value Class: class
org.apache.mahout.clustering.WeightedVectorWritable
Key: 45: Value: 1.0: [120:3.211]
Key: 35: Value: 1.0: [93:5.394, 120:3.211]
Key: 45: Value: 1.0: [120:3.211]
Key: 35: Value: 1.0: [93:5.394, 120:3.211]
...

The key is the cluster id. The value, I think, is not a mapping to the item
id but the item's vector: each entry pairs a token's index in the dictionary
file with the tf-idf weight calculated during vectorization.
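
To make that reading concrete, here is a minimal Python sketch that splits one
dumped line into cluster id, weight, and vector entries. It assumes the exact
text layout printed above; parse_clustered_point is a hypothetical helper name
for illustration, not a Mahout API.

```python
import re

# Assumed layout, taken from the dump above:
#   Key: <cluster id>: Value: <weight>: [<index>:<tf-idf>, ...]
LINE_RE = re.compile(r"^Key: (\d+): Value: ([0-9.]+): \[(.*)\]$")


def parse_clustered_point(line):
    """Return (cluster_id, weight, {dictionary_index: tf_idf_weight})."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError("unexpected line format: %r" % line)

    cluster_id = int(match.group(1))
    weight = float(match.group(2))

    # Each vector entry looks like "120:3.211"; entries are comma-separated.
    vector = {}
    body = match.group(3).strip()
    if body:
        for entry in body.split(","):
            index, value = entry.strip().split(":")
            vector[int(index)] = float(value)

    return cluster_id, weight, vector


if __name__ == "__main__":
    print(parse_clustered_point("Key: 35: Value: 1.0: [93:5.394, 120:3.211]"))
```

Note that nothing in the parsed result carries the item id, which is why the
extra steps below were needed.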

Since I could not find a simple API in Mahout to get the item ids in a
cluster, I worked around it as follows:

First, I wrote a Hadoop M/R job that parses the vector sequence file and
produces a CSV file (item-id, dic-token-value:tf-idf-weight).
Second, I wrote another Hadoop M/R job that parses the clustered points
sequence file and produces a CSV file (cluster-id,
dic-token-value:tf-idf-weight).
Finally, using Pig, I joined the vector CSV file and the cluster CSV file on
dic-token-value:tf-idf-weight and grouped the result by cluster-id and
item-id, which gave me the pairs of cluster-id and item-id in the output.
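
The join logic can be sketched in a few lines of plain Python. This is only an
in-memory illustration of the idea, not the actual M/R jobs or Pig script; the
function name and the item ids "item-A" and "item-B" are made up for the
example, and the vector strings are assumed to be formatted identically in both
CSV files.

```python
from collections import defaultdict


def join_clusters_to_items(vector_rows, cluster_rows):
    """Join (item_id, vector_text) rows with (cluster_id, vector_text) rows.

    The join key is the vector text itself, so it only matches when both
    files serialize a vector to exactly the same string.
    Returns a sorted list of distinct (cluster_id, item_id) pairs.
    """
    # Index item ids by their vector text.
    items_by_vector = defaultdict(set)
    for item_id, vector_text in vector_rows:
        items_by_vector[vector_text].add(item_id)

    # Look up each clustered point's vector text and collect distinct pairs,
    # which plays the role of the final group-by step.
    pairs = set()
    for cluster_id, vector_text in cluster_rows:
        for item_id in items_by_vector.get(vector_text, ()):
            pairs.add((cluster_id, item_id))
    return sorted(pairs)


if __name__ == "__main__":
    # Hypothetical item ids; vector strings follow the dump format above.
    vector_rows = [
        ("item-A", "120:3.211"),
        ("item-B", "93:5.394,120:3.211"),
    ]
    cluster_rows = [
        ("45", "120:3.211"),
        ("35", "93:5.394,120:3.211"),
        ("45", "120:3.211"),
    ]
    for cluster_id, item_id in join_clusters_to_items(vector_rows, cluster_rows):
        print(cluster_id, item_id)
```

One caveat with joining on the vector content: items with identical vectors
produce duplicate matches, which is why the result has to be grouped (made
distinct) at the end.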

- Kidong.




2011/2/16 Robin Anil <[email protected]>

> The clustering code has a parameter that enables or disables generation
> of the cluster-point assignments. If set, it will create a folder called
> clusteredPoints in the output directory containing a sequence file with
> the mappings.
>
> Robin
>
> On Tue, Feb 15, 2011 at 6:02 AM, Kidong Lee <[email protected]> wrote:
>
> > Hi,
> >
> > My situation is almost like '12.1 Finding similar users on Twitter' in
> > the Mahout in Action book.
> >
> > My document contains a list of item ids and their contents, separated
> > by a comma delimiter, for example this CSV file (itemId, itemContents):
> > 1223, sports
> > 1344, football nike
> > ...
> >
> > First I converted this CSV file to a sequence file, and vectorized the
> > sequence file with SparseVectorsFromSequenceFiles.
> > With k-means clustering, I got the clusters. Up to this point,
> > everything was fine.
> >
> > I wanted to get the list of items which belong to a cluster, but I
> > have no idea how.
> > I have printed the entries using cluster-dumper, but there is no info
> > about the item id.
> >
> > Any idea how to get the list of item ids which belong to a cluster?
> >
> > - Kidong.
> >
>
