FYI, CDH5 includes Mahout 0.8 plus patches, but 0.9 should work fine with CDH4. You do have to build with the Hadoop 2.x profile, as usual.
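
For what it's worth, here is a rough Python sketch of the approach described in the thread quoted below: keep the user ID out of the feature vector, turn location into x,y,z coordinates, and carry the ID alongside so that each cluster comes back as a list of IDs with their distance to the center. This is not Mahout code. The record fields (user_id, lat, lon, age, gender), the crude age/gender scaling, and the tiny k-means with farthest-first seeding are all assumptions made up for illustration.

```python
import math


def latlon_to_xyz(lat_deg, lon_deg):
    """Map latitude/longitude in degrees to a point on the unit sphere."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))


def build_features(record):
    """Feature vector from location, age and gender; user_id is deliberately left out."""
    x, y, z = latlon_to_xyz(record["lat"], record["lon"])
    age = record["age"] / 100.0                       # assumed scaling, tune for real data
    gender = 1.0 if record["gender"] == "F" else 0.0  # assumed encoding
    return [x, y, z, age, gender]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def kmeans(vectors, k, iterations=20):
    """Plain k-means with deterministic farthest-first seeding (illustrative only)."""
    centers = [list(vectors[0])]
    while len(centers) < k:
        far = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centers))
        centers.append(list(far))

    def nearest(v):
        return min(range(k), key=lambda c: sq_dist(v, centers[c]))

    for _ in range(iterations):
        assignment = [nearest(v) for v in vectors]
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # leave an empty cluster's center where it was
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, [nearest(v) for v in vectors]


def cluster_users(records, k):
    """Cluster on the features only, then attach each user_id to its cluster."""
    ids = [r["user_id"] for r in records]
    vectors = [build_features(r) for r in records]
    centers, assignment = kmeans(vectors, k)
    clusters = {c: [] for c in range(k)}
    for uid, v, c in zip(ids, vectors, assignment):
        clusters[c].append((uid, math.sqrt(sq_dist(v, centers[c]))))
    return clusters
```

Calling cluster_users(records, k) on a list of such dicts returns a mapping from cluster index to (user_id, distance) pairs. The ID travels with the row but never enters the distance computation, which is the point Ted and Peter make below.
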

On Tue, Feb 18, 2014 at 2:06 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> Bikash,
>
> Don't use that version. Use a more recent release. We can't help that
> Cloudera has an old version.
>
> On Tue, Feb 18, 2014 at 1:26 AM, Bikash Gupta <bikash.gupt...@gmail.com> wrote:
>
>> Suneel,
>>
>> Thanks for the information.
>>
>> I am using 0.7 packaged with CDH.
>>
>> On Tue, Feb 18, 2014 at 2:14 PM, Suneel Marthi <suneel_mar...@yahoo.com> wrote:
>> >
>> > On Tuesday, February 18, 2014 3:37 AM, Bikash Gupta <bikash.gupt...@gmail.com> wrote:
>> >
>> > Ted/Peter,
>> >
>> > Thanks for the response.
>> >
>> > This is exactly what I am trying to achieve. Maybe I was not able to
>> > put my questions clearly.
>> >
>> > I am clustering on few variables of Customer/User (except their
>> > customer_id/user_id) and storing customer_id/user_id list in a
>> > separate place.
>> >
>> > Question) What is the approach to identify each member in each cluster
>> > by its unique id.
>> > Answer) I have to run a script post-clustering to map
>> > customer_id/user_id for the clustered output to identify the member
>> > uniquely.
>> >
>> >>> If u r working off of Mahout 0.9 u don't have to do that. The
>> Clustered output should display the vectors with the vectorid (user_id in
>> ur case) that belong to a specific cluster along with the distance of that
>> vector from the cluster center.
>> >
>> > Correct me if I am wrong :)
>> >
>> > On Tue, Feb 18, 2014 at 10:53 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>> >> Bikash,
>> >>
>> >> Peter is just right.
>> >>
>> >> Yes, you can cluster on these few variables that you have. Probably you
>> >> should translate location to x,y,z coordinates so that you don't have
>> >> strange geometry problems, but location, gender and age are quite
>> >> reasonable characteristics. You will get a fairly weak clustering since
>> >> these characteristics actually tell very little about people, but it is
>> >> a start.
>> >>
>> >> You *don't* want to cluster using user ID for exactly the reasons that
>> >> Peter mentioned. Another way to put it is that the user ID tells you
>> >> absolutely nothing about the person and thus is not useful for the
>> >> clustering.
>> >>
>> >> You *do* have to retain the assignment of users to cluster and that
>> >> assignment is usually stored as a list of user ID's for each cluster.
>> >> This does not at all imply, however, that the user ID was used to
>> >> *form* the cluster.
>> >>
>> >> On Mon, Feb 17, 2014 at 9:01 PM, Peter Jaumann <peter.jauma...@gmail.com> wrote:
>> >>
>> >>> Bikash,
>> >>> As Ted pointed out already......
>> >>> You can cluster on all variables except your customer_id. That's your
>> >>> identifier.
>> >>> Customers within a cluster are 'similar'; how similar depends on the
>> >>> fidelity of your clustering.
>> >>> The clustering algorithm uses (you'll choose) some kind of distance, or
>> >>> similarity/dissimilarity measure (which one to use depends on the type
>> >>> of data you have). This measure will, eventually, determine how
>> >>> separate/how unique your clusters are. Goal is to have your clusters
>> >>> distinct from each other but have the cluster members, within a
>> >>> cluster, as similar as possible.
>> >>>
>> >>> In the output, each member in each cluster is uniquely identified by
>> >>> its customer_id, the cluster it belongs to, and a distance measure that
>> >>> shows (usually) how close, or not, the customer_id is from its cluster
>> >>> center.
>> >>>
>> >>> In terms of what you want to do, my assumption is that you'd like to
>> >>> find out a structure, or patterns, within your customer base, based on
>> >>> a set of variables that you have. This is often called a segmentation.
>> >>>
>> >>> Hope this helps! What you want to do is a pretty basic and
>> >>> straight-forward application of clustering.
>> >>> Good luck,
>> >>> -Peter
>> >>>
>> >>> On Mon, Feb 17, 2014 at 9:48 PM, Bikash Gupta <bikash.gupt...@gmail.com> wrote:
>> >>>
>> >>> > Basically I am trying to achieve customer segmentation.
>> >>> >
>> >>> > Now to measure customer similarity within a cluster I need to
>> >>> > understand which two customer are similar.
>> >>> >
>> >>> > Assumption: To understand these customer uniquely I need to provide
>> >>> > their CustomerId
>> >>> >
>> >>> > Is my assumption correct? If yes then, will customerId affect the
>> >>> > clustering output
>> >>> >
>> >>> > If no then how can I identify customer uniquely
>> >>> >
>> >>> > On Tue, Feb 18, 2014 at 2:55 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>> >>> > > That really depends on what you want to do.
>> >>> > >
>> >>> > > What is it that you want?
>> >>> > >
>> >>> > > On Mon, Feb 17, 2014 at 12:25 PM, Bikash Gupta <bikash.gupt...@gmail.com> wrote:
>> >>> > >
>> >>> > >> Ok...so UserId is not a good field for this combination, but if I
>> >>> > >> want User Clustering, what should be combination(just for
>> >>> > >> understanding).....
>> >>> > >>
>> >>> > >> On Tue, Feb 18, 2014 at 1:44 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>> >>> > >> > On Mon, Feb 17, 2014 at 9:00 AM, Bikash Gupta <bikash.gupt...@gmail.com> wrote:
>> >>> > >> >
>> >>> > >> >> Let say I am clustering users, I am providing their profile
>> >>> > >> >> data to discover similarity between two user.
>> >>> > >> >>
>> >>> > >> >> So my input would be [UserId, Location, Age, Gender, Time Created ]
>> >>> > >> >>
>> >>> > >> >> Now if my UserId length is of minimum 10 characters which is
>> >>> > >> >> comparative very large number than other categorical data.
>> >>> > >> >>
>> >>> > >> >
>> >>> > >> > User id is not a good field for clustering.
>> >>> > >> >
>> >>> > >> > Location is fine if you want geo-graphical clustering.
>> >>> > >> >
>> >>> > >> > Location + age + gender is fine for geo-demo-graphical clustering.
>> >>> > >> >
>> >>> > >> > Adding time created might give a tiny bit of insight.
>> >>> > >> >
>> >>> > >> > But these fields are not going to lead to great insights.
>> >>> > >>
>> >>> > >> --
>> >>> > >> Thanks & Regards
>> >>> > >> Bikash Kumar Gupta
>> >>> >
>> >>> > --
>> >>> > Thanks & Regards
>> >>> > Bikash Kumar Gupta
>> >
>> > --
>> > Thanks & Regards
>> > Bikash Kumar Gupta
>>
>> --
>> Thanks & Regards
>> Bikash Kumar Gupta