Bikash,

As Ted already pointed out, you can cluster on all variables except your customer_id. That's your identifier, not a feature. Customers within a cluster are 'similar'; how similar depends on the quality of your clustering. The clustering algorithm uses some kind of distance or similarity/dissimilarity measure, which you choose (which one to use depends on the type of data you have). This measure ultimately determines how well separated and how distinct your clusters are. The goal is to have clusters that are distinct from each other while the members within each cluster are as similar as possible.
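To make this concrete, here is a minimal sketch in plain Python, not tied to any particular library. Everything in it is an assumption for illustration: the customer records are made up, the features (age, yearly spend) are hypothetical, and I picked min-max scaling, Euclidean distance, and a simple k-means with deterministic seeding. The point is only that customer_id is set aside before clustering and joined back afterwards, together with the cluster label and the distance to the cluster center.

```python
import math

# Hypothetical records: (customer_id, age, yearly_spend). All values are made up.
customers = [
    ("C001", 23, 200.0),
    ("C002", 25, 220.0),
    ("C003", 24, 215.0),
    ("C004", 61, 900.0),
    ("C005", 58, 950.0),
    ("C006", 60, 920.0),
]

def min_max_scale(rows):
    """Scale every feature column to [0, 1] so no single column dominates the distance."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        tuple((v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(row, lo, hi))
        for row in rows
    ]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(point, centers):
    """Index of the center closest to the given point."""
    return min(range(len(centers)), key=lambda i: euclidean(point, centers[i]))

def kmeans(points, k, iterations=20):
    # Deterministic farthest-first seeding: start from the first point, then
    # repeatedly add the point farthest from the centers chosen so far.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(euclidean(p, c) for c in centers)))
    for _ in range(iterations):
        labels = [nearest(p, centers) for p in points]
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # keep the old center if a cluster ends up empty
                centers[i] = tuple(sum(col) / len(members) for col in zip(*members))
    labels = [nearest(p, centers) for p in points]
    return labels, centers

# The identifier is kept aside and never enters the distance computation.
ids = [row[0] for row in customers]
features = min_max_scale([row[1:] for row in customers])
labels, centers = kmeans(features, k=2)

# Join the identifier back: (customer_id, cluster, distance to cluster center).
result = [
    (cid, lab, euclidean(f, centers[lab]))
    for cid, f, lab in zip(ids, features, labels)
]
for cid, lab, dist in result:
    print(f"{cid}  cluster={lab}  distance_to_center={dist:.3f}")
```

The distances are in scaled units, so they are only meaningful relative to each other. For categorical fields such as gender or location you would need a different encoding or a different distance measure; this sketch only covers numeric features.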
In the output, each member of each cluster is uniquely identified by its customer_id, the cluster it belongs to, and a distance measure that shows (usually) how close the customer is to its cluster center.

In terms of what you want to do, my assumption is that you'd like to find structure, or patterns, within your customer base, based on the set of variables that you have. This is often called segmentation.

Hope this helps! What you want to do is a pretty basic and straightforward application of clustering.

Good luck,

-Peter

On Mon, Feb 17, 2014 at 9:48 PM, Bikash Gupta <[email protected]> wrote:

> Basically I am trying to achieve customer segmentation.
>
> Now to measure customer similarity within a cluster I need to
> understand which two customers are similar.
>
> Assumption: To understand these customers uniquely I need to provide
> their CustomerId
>
> Is my assumption correct? If yes then, will customerId affect the
> clustering output
>
> If no then how can I identify customer uniquely
>
> On Tue, Feb 18, 2014 at 2:55 AM, Ted Dunning <[email protected]>
> wrote:
> > That really depends on what you want to do.
> >
> > What is it that you want?
> >
> >
> > On Mon, Feb 17, 2014 at 12:25 PM, Bikash Gupta <[email protected]>
> > wrote:
> >
> >> Ok...so UserId is not a good field for this combination, but if I want
> >> User Clustering, what should be combination(just for
> >> understanding).....
> >>
> >> On Tue, Feb 18, 2014 at 1:44 AM, Ted Dunning <[email protected]>
> >> wrote:
> >> > On Mon, Feb 17, 2014 at 9:00 AM, Bikash Gupta <[email protected]>
> >> > wrote:
> >> >
> >> >> Let say I am clustering users, I am providing their profile data to
> >> >> discover similarity between two user.
> >> >>
> >> >> So my input would be [UserId, Location, Age, Gender, Time Created ]
> >> >>
> >> >> Now if my UserId length is of minimum 10 characters which is
> >> >> comparative very large number than other categorical data.
> >> >>
> >> >
> >> > User id is not a good field for clustering.
> >> >
> >> > Location is fine if you want geo-graphical clustering.
> >> >
> >> > Location + age + gender is fine for geo-demo-graphical clustering.
> >> >
> >> > Adding time created might give a tiny bit of insight.
> >> >
> >> > But these fields are not going to lead to great insights.
> >>
> >>
> >>
> >> --
> >> Thanks & Regards
> >> Bikash Kumar Gupta
> >>
> >
>
> --
> Thanks & Regards
> Bikash Kumar Gupta
>
