Re: [Edit] Approach for Clustering Data

Bikash Gupta Tue, 18 Feb 2014 01:27:40 -0800

Suneel,

Thanks for the information.


I am using Mahout 0.7, packaged with CDH.
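
Since 0.7 needs the separate mapping step, here is a minimal sketch of it in plain Python rather than Mahout's own API. It assumes the ID list is kept in the same row order as the vectors that were clustered. The function name `map_ids_to_clusters`, the sample IDs, and the label list are all hypothetical.

```python
from collections import defaultdict


def map_ids_to_clusters(ids, assignments):
    """Group record IDs by the cluster label of the row at the same position."""
    if len(ids) != len(assignments):
        raise ValueError("ids and assignments must have the same length")
    members = defaultdict(list)
    for record_id, cluster in zip(ids, assignments):
        members[cluster].append(record_id)
    return dict(members)


# Hypothetical data: IDs stored separately, in the same row order as the
# feature vectors that were clustered (the IDs themselves were never clustered).
customer_ids = ["C0000000001", "C0000000002", "C0000000003", "C0000000004"]
cluster_of_row = [0, 1, 0, 1]  # illustrative clusterer output, one label per row

clusters = map_ids_to_clusters(customer_ids, cluster_of_row)
```

The only thing linking an ID to its cluster here is row position, so the ID never enters the distance computation.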

On Tue, Feb 18, 2014 at 2:14 PM, Suneel Marthi <[email protected]> wrote:
>
> On Tuesday, February 18, 2014 3:37 AM, Bikash Gupta 
> <[email protected]> wrote:
>
> Ted/Peter,
>
> Thanks for the response.
>
> This is exactly what I am trying to achieve. Maybe I did not put my
> questions clearly.
>
> I am clustering on a few variables of each Customer/User (excluding their
> customer_id/user_id) and storing the customer_id/user_id list in a
> separate place.
>
> Question) What is the approach to identify each member in each cluster
> by its unique id?
> Answer) I have to run a script post-clustering to map each
> customer_id/user_id onto the clustered output so that each member is
> uniquely identified.
>
>>> If you are working off of Mahout 0.9 you don't have to do that. The clustered
>>> output should display the vectors with the vector ID (user_id in your case)
>>> that belong to a specific cluster, along with the distance of that vector
>>> from the cluster center.
>
> Correct me if I am wrong :)
>
>
> On Tue, Feb 18, 2014 at 10:53 AM, Ted Dunning <[email protected]> wrote:
>> Bikash,
>>
>> Peter is just right.
>>
>> Yes, you can cluster on these few variables that you have.  Probably you
>> should translate location to x,y,z coordinates so that you don't have
>> strange geometry problems, but location, gender and age are quite
>> reasonable characteristics.  You will get a fairly weak clustering since
>> these characteristics actually tell very little about people, but it is a
>> start.
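
A minimal sketch of the x,y,z translation suggested above, assuming location is available as latitude/longitude in degrees (the thread does not say how location is stored). The function names are hypothetical, and the points are placed on a unit sphere, so distances are unitless chord lengths.

```python
import math


def lat_lon_to_xyz(lat_deg, lon_deg):
    """Map latitude/longitude in degrees to a point on the unit sphere."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    x = math.cos(lat) * math.cos(lon)
    y = math.cos(lat) * math.sin(lon)
    z = math.sin(lat)
    return (x, y, z)


def euclidean(a, b):
    """Straight-line (chord) distance between two 3-D points."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


# Two points on the equator, either side of the 180-degree meridian.
# Raw longitudes differ by 358 degrees, yet the points are physically close.
east = lat_lon_to_xyz(0.0, 179.0)
west = lat_lon_to_xyz(0.0, -179.0)
gap = euclidean(east, west)  # small, unlike the raw longitude difference
```

Clustering on raw latitude/longitude would treat these two points as far apart, which is one of the geometry problems the conversion avoids.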
>>
>> You *don't* want to cluster using user ID for exactly the reasons that
>> Peter mentioned.  Another way to put it is that the user ID tells you
>> absolutely nothing about the person and thus is not useful for the
>> clustering.
>>
>> You *do* have to retain the assignment of users to clusters, and that
>> assignment is usually stored as a list of user IDs for each cluster.  This
>> does not at all imply, however, that the user ID was used to *form* the
>> cluster.
>>
>>
>>
>>
>> On Mon, Feb 17, 2014 at 9:01 PM, Peter Jaumann 
>> <[email protected]> wrote:
>>
>>> Bikash,
>>> As Ted pointed out already...
>>> You can cluster on all variables except your customer_id. That's your
>>> identifier.
>>> Customers within a cluster are 'similar'; how similar depends on the
>>> fidelity of your clustering.
>>> The clustering algorithm uses some kind of distance, or
>>> similarity/dissimilarity, measure that you choose (which one to use
>>> depends on the type of data you have). This measure will, eventually,
>>> determine how separate and how unique your clusters are. The goal is to
>>> have your clusters distinct from each other but have the members within
>>> a cluster as similar as possible.
>>>
>>> In the output, each member in each cluster is uniquely identified by its
>>> customer_id, the cluster it belongs to, and a distance measure that shows
>>> (usually) how close the customer is to its cluster center.
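
The output shape described above can be sketched as follows, in plain Python rather than any particular library's API. It assumes Euclidean distance and cluster centers that have already been computed. The data and the helper `nearest_center` are hypothetical.

```python
import math


def nearest_center(point, centers):
    """Return (index, distance) of the center closest to point (Euclidean)."""
    distances = [math.dist(point, c) for c in centers]
    best = min(range(len(centers)), key=distances.__getitem__)
    return best, distances[best]


# Hypothetical data: the feature rows exclude the ID, which is held alongside.
customer_ids = ["u1", "u2", "u3"]
features = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
centers = [(0.0, 0.5), (10.0, 10.0)]  # illustrative, already-computed centers

# One (customer_id, cluster, distance_to_center) record per customer.
report = [
    (cid, *nearest_center(row, centers))
    for cid, row in zip(customer_ids, features)
]
```

The ID appears only in the report and takes no part in the distance computation.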
>>>
>>> In terms of what you want to do, my assumption is that you'd like to find
>>> a structure, or patterns, within your customer base, based on a set of
>>> variables that you have. This is often called a segmentation.
>>>
>>> Hope this helps! What you want to do is a pretty basic and straightforward
>>> application of clustering.
>>> Good luck,
>>> -Peter
>>>
>>>
>>>
>>> On Mon, Feb 17, 2014 at 9:48 PM, Bikash Gupta <[email protected]>
>>> wrote:
>>>
>>> > Basically I am trying to achieve customer segmentation.
>>> >
>>> > Now, to measure customer similarity within a cluster, I need to
>>> > understand which two customers are similar.
>>> >
>>> > Assumption: To identify these customers uniquely I need to provide
>>> > their CustomerId.
>>> >
>>> > Is my assumption correct? If yes, will CustomerId affect the
>>> > clustering output?
>>> >
>>> > If no, how can I identify customers uniquely?
>>> >
>>> > On Tue, Feb 18, 2014 at 2:55 AM, Ted Dunning <[email protected]>
>>> > wrote:
>>> > > That really depends on what you want to do.
>>> > >
>>> > > What is it that you want?
>>> > >
>>> > >
>>> > > On Mon, Feb 17, 2014 at 12:25 PM, Bikash Gupta <[email protected]>
>>> > > wrote:
>>> > >
>>> > >> OK, so UserId is not a good field for this combination. But if I want
>>> > >> user clustering, what should the combination be (just for my
>>> > >> understanding)?
>>> > >>
>>> > >> On Tue, Feb 18, 2014 at 1:44 AM, Ted Dunning <[email protected]>
>>> > >> wrote:
>>> > >> > On Mon, Feb 17, 2014 at 9:00 AM, Bikash Gupta <[email protected]>
>>> > >> > wrote:
>>> > >> >
>>> > >> >> Let's say I am clustering users, and I am providing their profile data to
>>> > >> >> discover similarity between two users.
>>> > >> >>
>>> > >> >> So my input would be [UserId, Location, Age, Gender, Time Created]
>>> > >> >>
>>> > >> >> Now if my UserId has a minimum length of 10 characters, it is a
>>> > >> >> comparatively very large number next to the other categorical data.
>>> > >> >>
>>> > >> >
>>> > >> > User id is not a good field for clustering.
>>> > >> >
>>> > >> > Location is fine if you want geographical clustering.
>>> > >> >
>>> > >> > Location + age + gender is fine for geo-demographic clustering.
>>> > >> >
>>> > >> > Adding time created might give a tiny bit of insight.
>>> > >> >
>>> > >> > But these fields are not going to lead to great insights.
>>> > >>
>>> > >>
>>> > >>
>>> > >> --
>>> > >> Thanks & Regards
>>> > >> Bikash Kumar Gupta



-- 
Thanks & Regards
Bikash Kumar Gupta
