+user to get this indexed. It's a typical mistake when you start out with
Mahout.
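
For anyone finding this thread later, here is a rough sketch of how the named-vector dump quoted below could be turned into a cluster-to-item mapping. It only parses the text lines in the format shown in this thread ("Key: ... Value: ... name = [...]"); the function name `items_by_cluster` and the regular expression are my own illustration, not part of Mahout, and the line format is assumed to match the dump exactly.

```python
import re
from collections import defaultdict

# Matches dump lines such as "Key: 45: Value: 1.0: 1102204 = [120:3.211]".
# The "name = " part is optional: it is absent when vectors were built without -nv.
LINE_RE = re.compile(r"^Key: (\d+): Value: ([\d.]+): (?:(\S+) = )?\[(.*)\]$")


def items_by_cluster(lines):
    """Group item ids by cluster id (hypothetical helper, not a Mahout API)."""
    clusters = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip headers such as "Input Path:" and "..." lines
        cluster_id, _weight, item_id, _vector = match.groups()
        if item_id is None:
            # No name in front of the vector: the item id was never stored.
            raise ValueError("unnamed vector; vectorize again with -nv")
        clusters[int(cluster_id)].append(item_id)
    return dict(clusters)


sample = [
    "Key: 45: Value: 1.0: 1102204 = [120:3.211]",
    "Key: 35: Value: 1.0: 1102939 = [93:5.394, 120:3.211]",
    "Key: 45: Value: 1.0: 1102945 = [120:3.211]",
    "Key: 35: Value: 1.0: 1102946 = [93:5.394, 120:3.211]",
]
print(items_by_cluster(sample))
# {45: ['1102204', '1102945'], 35: ['1102939', '1102946']}
```

If the dump lines carry no name before the vector (as in the first dump further down), the sketch raises an error rather than guessing, since the item id cannot be recovered from the vector alone.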

@Kidong You are welcome


Robin

On Mon, Feb 21, 2011 at 6:24 AM, Kidong Lee <[email protected]> wrote:

> To vectorize, I had used SparseVectorsFromSequenceFiles *without using
> the parameter '-nv' (named vector flag)*.
> With '-nv' flag, I got the correct clustered points like this:
> ...
> Input Path:
> /user/root/item-contents-sample/cluster/out/clusteredPoints/part-m-00000
> Key class: class org.apache.hadoop.io.IntWritable Value Class: class
> org.apache.mahout.clustering.WeightedVectorWritable
> Key: 45: Value: 1.0: 1102204 = [120:3.211]
> Key: 35: Value: 1.0: 1102939 = [93:5.394, 120:3.211]
> Key: 45: Value: 1.0: 1102945 = [120:3.211]
> Key: 35: Value: 1.0: 1102946 = [93:5.394, 120:3.211]
> ....
>
> Now the item id is included in the value vector representation.
>
> Thank you Robin for correcting me!
>
> - Kidong.
>
>
> 2011/2/20 Robin Anil <[email protected]>
>
>> Hi Kidong, here the key is the nearest cluster id and the value is the vector.
>> I am guessing the identifier is getting dropped somehow. Looks like a bug,
>> can you confirm that you have created ids for the vectors you used and
>> wrapped them in a named vector?
>>
>>
>> Robin
>>
>>
>> On Wed, Feb 16, 2011 at 6:31 AM, Kidong Lee <[email protected]> wrote:
>>
>>> Thank you for your reply, Robin.
>>>
>>> I actually got the sequence file in the clusteredPoints directory like
>>> this:
>>>
>>> Input Path:
>>> /user/root/item-contents-sample/cluster/out/clusteredPoints/part-m-00000
>>> Key class: class org.apache.hadoop.io.IntWritable Value Class: class
>>> org.apache.mahout.clustering.WeightedVectorWritable
>>> Key: 45: Value: 1.0: [120:3.211]
>>> Key: 35: Value: 1.0: [93:5.394, 120:3.211]
>>> Key: 45: Value: 1.0: [120:3.211]
>>> Key: 35: Value: 1.0: [93:5.394, 120:3.211]
>>> ...
>>>
>>> Key is the cluster id, and I think Value is not the mapping of item id,
>>> but the mapping of the token value in the dictionary file and the tf-idf
>>> weight calculated in vectorization.
>>>
>>> Since I could not find a simple API in Mahout to get the item ids in a
>>> cluster, I did some work for that as follows:
>>>
>>> First, I wrote a Hadoop M/R job to parse the vector sequence file and
>>> produce a CSV file (item-id, dic-token-value:tf-idf-weight).
>>> Second, I wrote another Hadoop M/R job to parse the clustered points
>>> sequence file and produce a CSV file (cluster-id,
>>> dic-token-value:tf-idf-weight).
>>> In the next step, using Pig, the vector CSV file and the cluster CSV file
>>> were joined on dic-token-value:tf-idf-weight and grouped by cluster-id
>>> and item-id, and finally I got the pairs of cluster-id and item-id in the
>>> output.
>>>
>>> - Kidong.
>>>
>>>
>>>
>>>
>>> 2011/2/16 Robin Anil <[email protected]>
>>>
>>> > The clustering code has a parameter that enables or disables generation
>>> > of the cluster-point assignments. If set, it will create a folder called
>>> > clusteredPoints in the output directory containing a sequence file with
>>> > the mappings.
>>> >
>>> > Robin
>>> >
>>> > On Tue, Feb 15, 2011 at 6:02 AM, Kidong Lee <[email protected]> wrote:
>>> >
>>> > > Hi,
>>> > >
>>> > > My situation is almost like '12.1 Finding similar users on Twitter'
>>> > > in the Mahout in Action book.
>>> > >
>>> > > My document contains a list of item ids and their contents, separated
>>> > > by a comma, for example this CSV file (itemId, itemContents):
>>> > > 1223, sports
>>> > > 1344, football nike
>>> > > ...
>>> > >
>>> > > First I converted this CSV file to a sequence file and vectorized the
>>> > > sequence file with SparseVectorsFromSequenceFiles.
>>> > > With k-means clustering, I got the clusters. Up to this point,
>>> > > everything was fine.
>>> > >
>>> > > I wanted to get the list of items which belong to a cluster, but I
>>> > > have no idea how.
>>> > > I printed the entries using cluster-dumper, but there is no info
>>> > > about the item id.
>>> > >
>>> > > Any idea how to get the list of item ids that belong to a cluster?
>>> > >
>>> > > - Kidong.
>>> > >
>>> >
>>>
>>
>>
>
