+user to get this indexed. It's a typical mistake when you start out with Mahout.
@Kidong You are welcome

Robin

On Mon, Feb 21, 2011 at 6:24 AM, Kidong Lee <[email protected]> wrote:

> To vectorize, I had used SparseVectorsFromSequenceFiles *without using
> the parameter '-nv' (named vector flag)*.
> With the '-nv' flag, I got the correct clustered points like this:
> ...
> Input Path: /user/root/item-contents-sample/cluster/out/clusteredPoints/part-m-00000
> Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.clustering.WeightedVectorWritable
> Key: 45: Value: 1.0: 1102204 = [120:3.211]
> Key: 35: Value: 1.0: 1102939 = [93:5.394, 120:3.211]
> Key: 45: Value: 1.0: 1102945 = [120:3.211]
> Key: 35: Value: 1.0: 1102946 = [93:5.394, 120:3.211]
> ....
>
> Now the item id is included in the value vector representation.
>
> Thank you Robin for correcting me!
>
> - Kidong.
>
>
> 2011/2/20 Robin Anil <[email protected]>
>
>> Hi Kidong, here the key is the nearest cluster id and the value is the
>> vector. I am guessing the identifier is getting dropped somehow. Looks
>> like a bug; can you confirm that you have created ids for the vectors
>> you used and wrapped them in a named vector?
>>
>> Robin
>>
>>
>> On Wed, Feb 16, 2011 at 6:31 AM, Kidong Lee <[email protected]> wrote:
>>
>>> Thank you for your reply, Robin.
>>>
>>> I actually got the sequence file in the clusteredPoints directory like
>>> this:
>>>
>>> Input Path: /user/root/item-contents-sample/cluster/out/clusteredPoints/part-m-00000
>>> Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.clustering.WeightedVectorWritable
>>> Key: 45: Value: 1.0: [120:3.211]
>>> Key: 35: Value: 1.0: [93:5.394, 120:3.211]
>>> Key: 45: Value: 1.0: [120:3.211]
>>> Key: 35: Value: 1.0: [93:5.394, 120:3.211]
>>> ...
>>>
>>> The key is the cluster id, and I think the value is not a mapping of
>>> the item id, but a mapping of the token value in the dictionary file
>>> to the tf-idf weight calculated in vectorization.
>>>
>>> Since I could not find a simple API in Mahout to get the item ids in a
>>> cluster, I did some work for that as follows:
>>>
>>> First, I wrote a Hadoop M/R job to parse the vector sequence file and
>>> produce a csv file (item-id, dic-token-value:tf-idf-weight).
>>> Second, I also wrote a Hadoop M/R job to parse the clustered points
>>> sequence file and produce a csv file (cluster-id,
>>> dic-token-value:tf-idf-weight).
>>> In the next step, using Pig, the vector csv file and the cluster csv
>>> file could be joined by dic-token-value:tf-idf-weight and grouped by
>>> cluster-id and item-id, and finally I got the pairs of cluster-id and
>>> item-id in the output.
>>>
>>> - Kidong.
>>>
>>>
>>> 2011/2/16 Robin Anil <[email protected]>
>>>
>>> > The clustering code has a parameter that enables or disables whether
>>> > the cluster-point assignments need to be generated. If set, it will
>>> > create a folder called clusteredPoints in the output directory
>>> > containing a sequence file with the mappings.
>>> >
>>> > Robin
>>> >
>>> > On Tue, Feb 15, 2011 at 6:02 AM, Kidong Lee <[email protected]> wrote:
>>> >
>>> > > Hi,
>>> > >
>>> > > My situation is almost like '12.1 Finding similar users on Twitter'
>>> > > in the Mahout in Action book.
>>> > >
>>> > > In my document, there are lists of item ids and their contents
>>> > > separated by a comma delimiter, for example like this CSV file
>>> > > (itemId, itemContents):
>>> > > 1223, sports
>>> > > 1344, football nike
>>> > > ...
>>> > >
>>> > > First I converted this csv file to a sequence file, and vectorized
>>> > > the sequence file with SparseVectorsFromSequenceFiles.
>>> > > With kmeans clustering, I got the clusters. Up to this point,
>>> > > everything was fine.
>>> > >
>>> > > I wanted to get the list of items which belong to a cluster, but I
>>> > > have no idea how.
>>> > > I have printed the entries using cluster-dumper, but there is no
>>> > > info about the item id.
>>> > >
>>> > > Any idea how to get the list of item ids which belong to a cluster?
>>> > >
>>> > > - Kidong.
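
For readers who land here with the same question, the last step can be sketched roughly as follows. This is not code posted by anyone in the thread. It assumes the vectors were created with '-nv' and that the clusteredPoints dump has been saved as plain text in exactly the shape shown above, `Key: <cluster id>: Value: <weight>: <item id> = [<vector>]`. The names `LINE_RE` and `items_by_cluster` are made up for this illustration, and the sketch parses the printed text only; it does not read the sequence file itself.

```python
import re
from collections import defaultdict

# Matches dumped lines such as "Key: 45: Value: 1.0: 1102204 = [120:3.211]".
# Assumption: the vector name (the item id) contains no whitespace.
LINE_RE = re.compile(r"^Key:\s*(\d+):\s*Value:\s*[\d.]+:\s*(\S+)\s*=\s*\[")


def items_by_cluster(lines):
    """Group item ids by cluster id from dumped clusteredPoints text lines."""
    clusters = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            # Header lines and unnamed vectors (no "<item id> =") are skipped.
            continue
        cluster_id, item_id = match.groups()
        clusters[int(cluster_id)].append(item_id)
    return dict(clusters)


# Sample lines copied from the dump quoted in this thread.
sample = [
    "Key class: class org.apache.hadoop.io.IntWritable Value Class: "
    "class org.apache.mahout.clustering.WeightedVectorWritable",
    "Key: 45: Value: 1.0: 1102204 = [120:3.211]",
    "Key: 35: Value: 1.0: 1102939 = [93:5.394, 120:3.211]",
    "Key: 45: Value: 1.0: 1102945 = [120:3.211]",
    "Key: 35: Value: 1.0: 1102946 = [93:5.394, 120:3.211]",
]

result = items_by_cluster(sample)
```

Lines dumped without '-nv', such as `Key: 45: Value: 1.0: [120:3.211]`, carry no item id and are skipped by this sketch, which matches the problem described earlier in the thread.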
