Re: Am I starting right with clustering ?

Clément Notin Thu, 04 Aug 2011 01:08:17 -0700

Thanks for your quick answer!

- I will compare the sequential algorithm with the Hadoop implementation
because yes, I only have a few shops (but many purchases!).
- I didn't know about NamedVectors. Thank you for the tip, I think it'll be
helpful. I'll let you know how it goes.
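
For reference, the pivot behind the NamedVector suggestion (shop_id as the
name, per-category purchase counts as the vector) could look like the minimal
sketch below. It is plain Python rather than Mahout code; the function name
rows_to_named_vectors and the sample rows are illustrative assumptions taken
from the example in the quoted message.

```python
from collections import defaultdict


def rows_to_named_vectors(rows, categories):
    """Pivot (shop_id, category, count) rows into {shop_id: [count per category]}."""
    # Fixed position of each category inside the vector.
    index = {cat: i for i, cat in enumerate(categories)}
    # Every shop starts with an all-zero vector of the right length.
    vectors = defaultdict(lambda: [0] * len(categories))
    for shop_id, category, purchases in rows:
        vectors[shop_id][index[category]] += purchases
    return dict(vectors)


# Hypothetical rows, mirroring the example table for "shop #1".
rows = [("shop #1", "A", 1), ("shop #1", "B", 10), ("shop #1", "D", 20)]
named = rows_to_named_vectors(rows, ["A", "B", "C", "D"])
print(named)  # {'shop #1': [1, 10, 0, 20]}
```

Each (name, vector) pair would then still have to be written out in whatever
input format the clustering job expects.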


2011/8/3 Jeff Eastman <[email protected]>

> I think you are on the right track but I have some suggestions:
> - How many shops do you have in your DB? Unless you have billions of them,
> you can likely run the sequential (-xm sequential) algorithms which run
> locally and are much faster.
> - You will want to produce NamedVectors from your database, with the
> shop_id as the name and the category vectors as the delegate. I'm not sure
> if the Mahout ARFF converter will do this for you or not. It may be simpler
> to write your own converter using
> org.apache.mahout.clustering.conversion.InputDriver/Mapper as prototypes.
> These will convert space-delimited files to Mahout Vectors but will not
> produce NamedVectors. Nor will they produce a dictionary file but your
> categories seem simple enough to forego that.
> - Once you have created a directory of NV sequence files you should be able
> to cluster them easily.
>
> Smooth sailing,
> Jeff
>
> -----Original Message-----
> From: Clément Notin [mailto:[email protected]]
> Sent: Wednesday, August 03, 2011 7:03 AM
> To: [email protected]
> Subject: Am I starting right with clustering ?
>
> Hello,
>
> I'm new to the Mahout world. It seems really nice, but it's hard to find
> accessible documentation :(
>
> I'm trying to run some clustering. Let me explain what I'm trying to
> achieve.
> I have a DB with columns: shop_id (string), customer_category (string),
> num_of_purchases (integer)
> What I want to do is discover groups of shops which are related because
> they have some customer categories in common.
>
> I think the vectors should be:
> "shop #1" = (1, 10, 0, 20)
> which means that customer category A bought 1 item in the shop,
> customer category B bought 10 items in the shop, and so on.
>
> In my DB for this example I have:
> shop_id   | customer_category | num_of_purchases
> ----------+-------------------+-----------------
> "shop #1" | "A"               | 1
> "shop #1" | "B"               | 10
> "shop #1" | "D"               | 20
>
>
> I think I must convert this to an ARFF file like:
>
> @RELATION purchases
> @ATTRIBUTE shop_id STRING
> @ATTRIBUTE catA NUMERIC
> @ATTRIBUTE catB NUMERIC
> @ATTRIBUTE catC NUMERIC
> @ATTRIBUTE catD NUMERIC
>
> @DATA
> "shop #1",1,10,0,20
> ...
>
> Why an ARFF file? Because I can use its helpful sparse syntax.
> But the file is difficult to build by hand, so I think I should write a
> script.
>
>
> My question is, am I heading in the right direction?
> I would appreciate some help! Thanks :)
>
> Regards,
>
> --
> *Clément **Notin*
>
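
The script mentioned in the quoted message could be sketched roughly as
follows. This is a minimal sketch, assuming the rows have already been read
from the DB as (shop_id, customer_category, num_of_purchases) tuples; the
function name to_sparse_arff is hypothetical. The data lines use the
{index value, ...} sparse form from the Weka ARFF format, with attribute
indices starting at 0. Whether the Mahout ARFF converter accepts a STRING
attribute written this way has not been verified here.

```python
def to_sparse_arff(rows, categories, relation="purchases"):
    """Build ARFF text with one sparse @DATA line per shop."""
    # Attribute 0 is shop_id, so category attributes start at index 1.
    index = {cat: i + 1 for i, cat in enumerate(categories)}
    shops = {}
    for shop_id, category, purchases in rows:
        counts = shops.setdefault(shop_id, {})
        counts[index[category]] = counts.get(index[category], 0) + purchases
    lines = ["@RELATION " + relation, "@ATTRIBUTE shop_id STRING"]
    lines += ["@ATTRIBUTE cat%s NUMERIC" % cat for cat in categories]
    lines += ["", "@DATA"]
    for shop_id, counts in shops.items():
        cells = ['0 "%s"' % shop_id]
        # Zero counts are simply left out of a sparse line.
        cells += ["%d %d" % (i, n) for i, n in sorted(counts.items()) if n != 0]
        lines.append("{" + ", ".join(cells) + "}")
    return "\n".join(lines)


# Hypothetical rows, mirroring the example table for "shop #1".
rows = [("shop #1", "A", 1), ("shop #1", "B", 10), ("shop #1", "D", 20)]
arff = to_sparse_arff(rows, ["A", "B", "C", "D"])
print(arff)
```

For the example rows, category C has no purchases and therefore does not
appear in the data line for "shop #1".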



-- 
*Clément **Notin*
