Ted/Peter, Thanks for the response.

This is exactly what I am trying to achieve. Maybe I did not put my question clearly. I am clustering on a few variables of Customer/User (everything except customer_id/user_id) and storing the customer_id/user_id list in a separate place.

Question) What is the approach to identify each member in each cluster by its unique ID?

Answer) I have to run a script post-clustering that maps customer_id/user_id onto the clustered output, so that each member is identified uniquely.

Correct me if I am wrong :)

On Tue, Feb 18, 2014 at 10:53 AM, Ted Dunning <[email protected]> wrote:

> Bikash,
>
> Peter is just right.
>
> Yes, you can cluster on these few variables that you have. Probably you
> should translate location to x,y,z coordinates so that you don't have
> strange geometry problems, but location, gender and age are quite
> reasonable characteristics. You will get a fairly weak clustering since
> these characteristics actually tell very little about people, but it is a
> start.
>
> You *don't* want to cluster using user ID for exactly the reasons that
> Peter mentioned. Another way to put it is that the user ID tells you
> absolutely nothing about the person and thus is not useful for the
> clustering.
>
> You *do* have to retain the assignment of users to cluster and that
> assignment is usually stored as a list of user IDs for each cluster. This
> does not at all imply, however, that the user ID was used to *form* the
> cluster.
>
>
> On Mon, Feb 17, 2014 at 9:01 PM, Peter Jaumann
> <[email protected]> wrote:
>
>> Bikash,
>> As Ted pointed out already......
>> You can cluster on all variables except your customer_id. That's your
>> identifier.
>> Customers within a cluster are 'similar'; how similar depends on the
>> fidelity of your clustering.
>> The clustering algorithm uses (you'll choose) some kind of distance, or
>> similarity/dissimilarity measure (which one to use depends on the type
>> of data you have).
>> This measure will, eventually, determine how separate/how unique your
>> clusters are. Goal is to have your clusters distinct from each other
>> but have the cluster members, within a cluster, as similar as possible.
>>
>> In the output, each member in each cluster is uniquely identified by its
>> customer_id, the cluster it belongs to, and a distance measure that
>> shows (usually) how close, or not, the customer_id is from its cluster
>> center.
>>
>> In terms of what you want to do, my assumption is that you'd like to find
>> out a structure, or patterns, within your customer base, based on a set
>> of variables that you have. This is often called a segmentation.
>>
>> Hope this helps! What you want to do is a pretty basic and
>> straightforward application of clustering.
>> Good luck,
>> -Peter
>>
>>
>> On Mon, Feb 17, 2014 at 9:48 PM, Bikash Gupta <[email protected]>
>> wrote:
>>
>> > Basically I am trying to achieve customer segmentation.
>> >
>> > Now to measure customer similarity within a cluster I need to
>> > understand which two customers are similar.
>> >
>> > Assumption: To identify these customers uniquely I need to provide
>> > their CustomerId.
>> >
>> > Is my assumption correct? If yes, will CustomerId affect the
>> > clustering output?
>> >
>> > If no, how can I identify a customer uniquely?
>> >
>> > On Tue, Feb 18, 2014 at 2:55 AM, Ted Dunning <[email protected]>
>> > wrote:
>> > > That really depends on what you want to do.
>> > >
>> > > What is it that you want?
>> > >
>> > >
>> > > On Mon, Feb 17, 2014 at 12:25 PM, Bikash Gupta <[email protected]>
>> > > wrote:
>> > >
>> > >> Ok...so UserId is not a good field for this combination, but if I want
>> > >> User Clustering, what should be the combination (just for
>> > >> understanding).....
>> > >>
>> > >> On Tue, Feb 18, 2014 at 1:44 AM, Ted Dunning <[email protected]>
>> > >> wrote:
>> > >> > On Mon, Feb 17, 2014 at 9:00 AM, Bikash Gupta <[email protected]>
>> > >> > wrote:
>> > >> >
>> > >> >> Let's say I am clustering users, and I am providing their profile
>> > >> >> data to discover similarity between two users.
>> > >> >>
>> > >> >> So my input would be [UserId, Location, Age, Gender, Time Created]
>> > >> >>
>> > >> >> Now my UserId has a minimum length of 10 characters, which is
>> > >> >> comparatively a very large number next to the other categorical
>> > >> >> data.
>> > >> >>
>> > >> >
>> > >> > User id is not a good field for clustering.
>> > >> >
>> > >> > Location is fine if you want geographical clustering.
>> > >> >
>> > >> > Location + age + gender is fine for geo-demo-graphical clustering.
>> > >> >
>> > >> > Adding time created might give a tiny bit of insight.
>> > >> >
>> > >> > But these fields are not going to lead to great insights.
>> > >>
>> > >>
>> > >> --
>> > >> Thanks & Regards
>> > >> Bikash Kumar Gupta
>> > >>
>> >
>> >
>> > --
>> > Thanks & Regards
>> > Bikash Kumar Gupta
>> >
>>

--
Thanks & Regards
Bikash Kumar Gupta
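
For reference, the post-clustering mapping discussed in this thread can be sketched roughly as below. This is only an illustration in plain Python, not the output format of any particular clustering tool: the record layout, the sample IDs, the pre-scaled feature values and the tiny `kmeans` helper are all assumptions made up for the example. The point is that the IDs sit in a parallel list, never enter the distance computation, and are joined back to the cluster labels by row position.

```python
from collections import defaultdict
from math import dist  # Python 3.8+

# Hypothetical records: (customer_id, x, y, scaled_age). Values are invented.
records = [
    ("C0000000001", 0.0, 0.1, 0.20),
    ("C0000000002", 0.1, 0.0, 0.25),
    ("C0000000003", 5.0, 5.1, 0.60),
    ("C0000000004", 5.1, 4.9, 0.65),
]

ids = [r[0] for r in records]        # kept aside, never clustered on
features = [r[1:] for r in records]  # only these reach the algorithm


def kmeans(points, centers, iterations=10):
    """Plain Lloyd's k-means; returns (labels, centers)."""
    centers = list(centers)
    labels = []
    for _ in range(iterations):
        # Assign every point to its nearest center.
        labels = [min(range(len(centers)), key=lambda c: dist(p, centers[c]))
                  for p in points]
        # Move every center to the mean of its members.
        for c in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centers


# Initial centers are picked by hand here so the sketch is deterministic.
labels, centers = kmeans(features, centers=[features[0], features[2]])

# Post-clustering step: join the IDs back to the labels by row position.
members_by_cluster = defaultdict(list)
for customer_id, point, label in zip(ids, features, labels):
    distance_to_center = dist(point, centers[label])
    members_by_cluster[label].append((customer_id, distance_to_center))

for cluster, members in sorted(members_by_cluster.items()):
    print(cluster, members)
```

Each entry then carries the three things Peter listed: the customer_id, the cluster it belongs to, and its distance from the cluster center. This only works if the row order of the ID list and the feature list stays identical from input to output.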
