Thanks for your quick answer!

- I will compare the sequential algorithm with the Hadoop implementation, because yes, I only have a few shops (but many purchases!).
- I didn't know about NamedVectors, thank you for the tip. I think it'll be helpful.

I'll let you know how it goes.

2011/8/3 Jeff Eastman <[email protected]>

> I think you are on the right track but I have some suggestions:
> - How many shops do you have in your DB? Unless you have billions of them,
> you can likely run the sequential (-xm sequential) algorithms which run
> locally and are much faster.
> - You will want to produce NamedVectors from your database, with the
> shop_id as the name and the category vectors as the delegate. I'm not sure
> if the Mahout ARFF converter will do this for you or not. It may be simpler
> to write your own converter using
> org.apache.mahout.clustering.conversion.InputDriver/Mapper as prototypes.
> These will convert space-delimited files to Mahout Vectors but will not
> produce NamedVectors. Nor will they produce a dictionary file but your
> categories seem simple enough to forgo that.
> - Once you have created a directory of NV sequence files you should be able
> to cluster them easily.
>
> Smooth sailing,
> Jeff
>
> -----Original Message-----
> From: Clément Notin [mailto:[email protected]]
> Sent: Wednesday, August 03, 2011 7:03 AM
> To: [email protected]
> Subject: Am I starting right with clustering ?
>
> Hello,
>
> I'm new to the Mahout world and it seems really nice, but it's hard to find
> easy documentation :(
>
> I'm trying to run some clustering. Let me explain what I'm trying to
> achieve.
> I have a DB with columns: shop_id (string), customer_category (string),
> num_of_purchases (integer)
> What I want to do is discover groups of shops which are related because
> they have some customer categories in common.
>
> I think the vectors should be:
> "shop #1" = (1, 10, 0, 20)
> which means that customer category A has bought 1 thing in the shop,
> customer category B has bought 10 things in the shop, and so on.
>
> In my DB for this example I have:
> shop_id   | customer_category | num_of_purchases
> ----------+-------------------+-----------------
> "shop #1" | "A"               | 1
> "shop #1" | "B"               | 10
> "shop #1" | "D"               | 20
>
> I think I must convert this to an ARFF file like:
>
> @RELATION purchases
> @ATTRIBUTE shop_id STRING
> @ATTRIBUTE catA NUMERIC
> @ATTRIBUTE catB NUMERIC
> @ATTRIBUTE catC NUMERIC
> @ATTRIBUTE catD NUMERIC
>
> @DATA
> "shop #1",1,10,0,20
> ...
>
> Why an ARFF file? Because I can use the helpful sparse syntax.
> But it's difficult to build this file. I think I should write a script.
>
> My question is, am I heading in the right direction?
> I would appreciate some help! Thanks :)
>
> Regards,
>
> --
> Clément Notin

--
Clément Notin
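
For reference, the script mentioned in the quoted message (turning the shop_id / customer_category / num_of_purchases rows into one vector per shop and then into ARFF lines) could look roughly like the sketch below. It is only an illustration: the function names `pivot` and `to_arff` are made up here, the category list is assumed to be known up front (category C has no row in the example, so it cannot be inferred from the data), and shop ids are assumed not to contain double quotes. The sparse line uses the ARFF sparse syntax `{index value, ...}` with 0-based attribute indices; whether the Mahout ARFF converter accepts that form is not confirmed in this thread.

```python
def pivot(rows, categories):
    """Build one purchase-count vector per shop.

    rows:       iterable of (shop_id, customer_category, num_of_purchases)
    categories: ordered list of all customer categories (assumed known)
    """
    position = {cat: i for i, cat in enumerate(categories)}
    vectors = {}
    for shop_id, category, count in rows:
        # Start every shop with an all-zero vector, then add its counts.
        vector = vectors.setdefault(shop_id, [0] * len(categories))
        vector[position[category]] += count
    return vectors


def to_arff(categories, vectors, sparse=False):
    """Render the vectors as ARFF text, in dense or sparse form."""
    lines = ["@RELATION purchases", "@ATTRIBUTE shop_id STRING"]
    lines += ["@ATTRIBUTE cat%s NUMERIC" % cat for cat in categories]
    lines += ["", "@DATA"]
    for shop_id, counts in vectors.items():
        if sparse:
            # Attribute 0 is shop_id; categories start at index 1.
            cells = ['0 "%s"' % shop_id]
            cells += ["%d %d" % (i + 1, n) for i, n in enumerate(counts) if n != 0]
            lines.append("{" + ", ".join(cells) + "}")
        else:
            lines.append(",".join(['"%s"' % shop_id] + [str(n) for n in counts]))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # The three example rows from the message above.
    rows = [("shop #1", "A", 1), ("shop #1", "B", 10), ("shop #1", "D", 20)]
    categories = ["A", "B", "C", "D"]
    print(to_arff(categories, pivot(rows, categories), sparse=True))
```

With the three example rows, the dense form gives the `"shop #1",1,10,0,20` line from the message, and the sparse form simply leaves out the zero for category C. The same `pivot` output could also be written as a space-delimited file if the InputDriver route is taken, although that path does not keep the shop name.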
