Re: [Edit] Approach for Clustering Data

Sean Owen Tue, 18 Feb 2014 06:17:23 -0800

FYI, CDH5 includes version 0.8 + patches. But 0.9 should work fine
with CDH4. You do have to build with the Hadoop 2.x profile, as usual.


On Tue, Feb 18, 2014 at 2:06 PM, Ted Dunning <[email protected]> wrote:
> Bikash,
>
> Don't use that version.  Use a more recent release.  We can't help that
> Cloudera has an old version.
>
>
>
>
> On Tue, Feb 18, 2014 at 1:26 AM, Bikash Gupta <[email protected]>wrote:
>
>> Suneel,
>>
>> Thanks for the information.
>>
>> I am using 0.7, as packaged with CDH.
>>
>> On Tue, Feb 18, 2014 at 2:14 PM, Suneel Marthi <[email protected]>
>> wrote:
>> >
>> >
>> >
>> >
>> >
>> >
>> > On Tuesday, February 18, 2014 3:37 AM, Bikash Gupta <
>> [email protected]> wrote:
>> >
>> > Ted/Peter,
>> >
>> > Thanks for the response.
>> >
>> > This is exactly what I am trying to achieve. Maybe I did not state my
>> > questions clearly.
>> >
>> > I am clustering on a few variables of each Customer/User (excluding the
>> > customer_id/user_id) and storing the customer_id/user_id list in a
>> > separate place.
>> >
>> > Question) What is the approach to identify each member in each cluster
>> > by its unique id?
>> > Answer) I have to run a script post-clustering to map the
>> > customer_id/user_id onto the clustered output to identify each member
>> > uniquely.
>> >
>> >>> If you are working off Mahout 0.9 you don't have to do that. The
>> >>> clustered output should display the vectors with the vector ID (user_id
>> >>> in your case) that belong to a specific cluster, along with the distance
>> >>> of each vector from the cluster center.
>> >
>> > Correct me if I am wrong :)
>> >
>> >
>> > On Tue, Feb 18, 2014 at 10:53 AM, Ted Dunning <[email protected]>
>> wrote:
>> >> Bikash,
>> >>
>> >> Peter is just right.
>> >>
>> >> Yes, you can cluster on these few variables that you have.  Probably you
>> >> should translate location to x,y,z coordinates so that you don't have
>> >> strange geometry problems, but location, gender and age are quite
>> >> reasonable characteristics.  You will get a fairly weak clustering since
>> >> these characteristics actually tell very little about people, but it is
>> >> a start.
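
The x,y,z translation mentioned above can be sketched as follows. This is a minimal illustration, assuming location is available as latitude/longitude in degrees and that a unit sphere is an acceptable model; the function name `lat_lon_to_xyz` is hypothetical and not part of Mahout.

```python
import math

def lat_lon_to_xyz(lat_deg, lon_deg):
    """Map latitude/longitude in degrees to a point on the unit sphere."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    x = math.cos(lat) * math.cos(lon)
    y = math.cos(lat) * math.sin(lon)
    z = math.sin(lat)
    return (x, y, z)

# Two points on the equator, one degree either side of the 180-degree meridian.
# Their raw longitudes differ by 358, yet they are close neighbours on the globe.
a = lat_lon_to_xyz(0.0, 179.0)
b = lat_lon_to_xyz(0.0, -179.0)

# Straight-line (chord) distance between the two points in x,y,z space.
chord = math.dist(a, b)
```

With raw longitude a Euclidean measure would treat these two points as almost maximally far apart; in x,y,z they stay close, which is the geometry problem the translation avoids.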
>> >>
>> >> You *don't* want to cluster using user ID for exactly the reasons that
>> >> Peter mentioned.  Another way to put it is that the user ID tells you
>> >> absolutely nothing about the person and thus is not useful for the
>> >> clustering.
>> >>
>> >> You *do* have to retain the assignment of users to clusters, and that
>> >> assignment is usually stored as a list of user IDs for each cluster.
>> >> This does not at all imply, however, that the user ID was used to
>> >> *form* the cluster.
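
A minimal sketch of keeping the IDs beside the features rather than inside them. The records, the `labels` list, and the helper `group_ids_by_cluster` are all hypothetical; `labels` stands in for whatever cluster assignment the clustering step returns, assumed to be in the same order as the input rows.

```python
from collections import defaultdict

# Hypothetical input: (user_id, feature vector). The ID travels with the row
# but is never placed in the vector that the clusterer sees.
records = [
    ("u001", [0.1, 0.2]),
    ("u002", [5.0, 5.1]),
    ("u003", [0.0, 0.3]),
    ("u004", [4.9, 5.3]),
]

user_ids = [uid for uid, _ in records]
features = [vec for _, vec in records]  # only this is passed to the clusterer

# Stand-in for the clusterer's output: one cluster label per row of `features`.
labels = [0, 1, 0, 1]

def group_ids_by_cluster(ids, cluster_labels):
    """Build {cluster label: [user IDs]} by pairing IDs and labels by position."""
    clusters = defaultdict(list)
    for uid, label in zip(ids, cluster_labels):
        clusters[label].append(uid)
    return dict(clusters)

members = group_ids_by_cluster(user_ids, labels)
```

The pairing relies only on row order being preserved, so the ID has no influence on how the clusters are formed.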
>> >>
>> >>
>> >>
>> >>
>> >> On Mon, Feb 17, 2014 at 9:01 PM, Peter Jaumann <
>> [email protected]>wrote:
>> >>
>> >>> Bikash,
>> >>> As Ted pointed out already......
>> >>> You can cluster on all variables except your customer_id. That's your
>> >>> identifier.
>> >>> Customers within a cluster are 'similar'; how similar depends on the
>> >>> fidelity of your clustering.
>> >>> The clustering algorithm uses some kind of distance, or
>> >>> similarity/dissimilarity, measure that you choose (which one to use
>> >>> depends on the type of data you have). This measure will, eventually,
>> >>> determine how separate and how unique your clusters are. The goal is
>> >>> to have your clusters distinct from each other while keeping the
>> >>> members within each cluster as similar as possible.
>> >>>
>> >>> In the output, each member in each cluster is uniquely identified by
>> >>> its customer_id, the cluster it belongs to, and a distance measure
>> >>> that (usually) shows how close the customer is to its cluster center.
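
The shape of that output can be illustrated with a small sketch. It assumes Euclidean distance and already-computed cluster centers; the data, the centers, and the function `assign` are hypothetical and do not reproduce Mahout's actual output format.

```python
import math

def assign(points, centers):
    """Return (customer_id, cluster_index, distance_to_center) for each point."""
    rows = []
    for customer_id, vec in points:
        # Euclidean distance from this customer to every center.
        distances = [math.dist(vec, center) for center in centers]
        # Index of the nearest center.
        best = min(range(len(centers)), key=distances.__getitem__)
        rows.append((customer_id, best, distances[best]))
    return rows

# Hypothetical centers and customers; the customer_id is carried along as a
# label and plays no part in the distance computation.
centers = [(0.0, 0.0), (10.0, 10.0)]
points = [
    ("c1", (1.0, 0.0)),
    ("c2", (9.0, 10.0)),
    ("c3", (0.0, 2.0)),
]

rows = assign(points, centers)
```

Each row carries the three things described above: the identifier, the cluster, and the distance to that cluster's center.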
>> >>>
>> >>> In terms of what you want to do, my assumption is that you'd like to
>> >>> find a structure, or patterns, within your customer base, based on a
>> >>> set of variables that you have. This is often called a segmentation.
>> >>>
>> >>> Hope this helps! What you want to do is a pretty basic and
>> >>> straightforward application of clustering.
>> >>> Good luck,
>> >>> -Peter
>> >>>
>> >>>
>> >>>
>> >>> On Mon, Feb 17, 2014 at 9:48 PM, Bikash Gupta <
>> [email protected]
>> >>> >wrote:
>> >>>
>> >>> > Basically I am trying to achieve customer segmentation.
>> >>> >
>> >>> > Now to measure customer similarity within a cluster I need to
>> >>> > understand which two customers are similar.
>> >>> >
>> >>> > Assumption: To identify these customers uniquely I need to provide
>> >>> > their CustomerId.
>> >>> >
>> >>> > Is my assumption correct? If yes, will the CustomerId affect the
>> >>> > clustering output?
>> >>> >
>> >>> > If no, then how can I identify each customer uniquely?
>> >>> >
>> >>> > On Tue, Feb 18, 2014 at 2:55 AM, Ted Dunning <[email protected]>
>> >>> > wrote:
>> >>> > > That really depends on what you want to do.
>> >>> > >
>> >>> > > What is it that you want?
>> >>> > >
>> >>> > >
>> >>> > > On Mon, Feb 17, 2014 at 12:25 PM, Bikash Gupta <
>> >>> [email protected]
>> >>> > >wrote:
>> >>> > >
>> >>> > >> OK, so UserId is not a good field for this combination, but if I
>> >>> > >> want User Clustering, what should the combination be (just for
>> >>> > >> understanding)?
>> >>> > >>
>> >>> > >> On Tue, Feb 18, 2014 at 1:44 AM, Ted Dunning <
>> [email protected]>
>> >>> > >> wrote:
>> >>> > >> > On Mon, Feb 17, 2014 at 9:00 AM, Bikash Gupta <
>> >>> > [email protected]
>> >>> > >> >wrote:
>> >>> > >> >
>> >>> > >> >> Let's say I am clustering users, and I am providing their profile
>> >>> > >> >> data to discover similarity between two users.
>> >>> > >> >>
>> >>> > >> >> So my input would be [UserId, Location, Age, Gender, Time Created]
>> >>> > >> >>
>> >>> > >> >> Now, my UserId is at least 10 characters long, which makes it a
>> >>> > >> >> very large number compared with the other categorical data.
>> >>> > >> >>
>> >>> > >> >
>> >>> > >> > User id is not a good field for clustering.
>> >>> > >> >
>> >>> > >> > Location is fine if you want geo-graphical clustering.
>> >>> > >> >
>> >>> > >> > Location + age + gender is fine for geo-demo-graphical clustering.
>> >>> > >> >
>> >>> > >> > Adding time created might give a tiny bit of insight.
>> >>> > >> >
>> >>> > >> > But these fields are not going to lead to great insights.
>> >>> > >>
>> >>> > >>
>> >>> > >>
>> >>> > >> --
>> >>> > >> Thanks & Regards
>> >>> > >> Bikash Kumar Gupta
>> >
>> >>> > >>
>> >>> >
>> >>> >
>> >>> >
>> >>> > --
>> >>> > Thanks & Regards
>> >>> > Bikash Kumar Gupta
>> >>> >
>> >>>
>> >
>> >
>> >
>> > --
>> > Thanks & Regards
>> > Bikash Kumar Gupta
>>
>>
>>
>> --
>> Thanks & Regards
>> Bikash Kumar Gupta
>>
