Check out the recent mailing list post 'Clustering user profiles'. Jeff (Eastman) sums it up clearly:
> Mahout clustering (unsupervised classification) can only deal with
> continuous, homogeneous vector representations of the input data, where each
> vector element is weighted the same as the other elements. Mahout
> (supervised) classification can deal with continuous, categorical, word-like
> and text-like features such as in your problem space.
>
> To address your problem with Mahout clustering, you would need to develop a
> mapping for each of your features to continuous vector elements and use a
> WeightedDistanceMeasure to account for the different element types and
> their relative impacts on the overall distance computation. This would be an
> iterative process which might or might not produce useful results.
>
> An alternative approach would be to train a Mahout classifier with the
> various features using marked training data which classifies similar users
> into a finite number of "clusters" that seem natural to you. With such a
> model, you could then classify new users into those "clusters". This approach
> would not be very useful for discovering new "clusters" in your data, but it
> would leverage the classifier training mechanisms to develop the models as
> more of a black box than above.

A question also for other people reading this: I looked into this and saw that
there are clustering algorithms for categorical data, such as K-modes. Are
these effective for solving these kinds of problems? If so, would they be
interesting to add to Mahout?

Cheers,

Frank

On Thu, Feb 2, 2012 at 12:38 PM, Vikas Pandya <[email protected]> wrote:
> Frank. Thanks.
>
>>> In your case you want to cluster items that have several risk levels
>>> as well as other properties. You have to use your original numerical
>>> data, (I assume probabilities) in a clustering algorithm, not the
>>> labels like low, medium, high. How were these labels assigned?
>
> RiskLevel1, RiskLevel2 and RiskLevel3 all have actual lookup values (High,
> Medium, Low, etc.) in the Solr index (the index is stored flattened).
>
> -Vikas
>
>
> ________________________________
> From: Frank Scholten <[email protected]>
> To: [email protected]
> Sent: Wednesday, February 1, 2012 3:28 AM
> Subject: Re: How to present mahout cluster in combination with Solr results
>
> Vikas,
>
> Please send messages to the mailing list so everyone can benefit.
>
>> Frank,
>>
>> To give further details about the use case:
>>
>> 1) The user searches for free text; this search is served from Solr.
>> 2) The user selects a record from the search results. We then need to display
>> all the items whose risk levels match the risk levels of the selected item
>> (and put them under "Similar items" in the UI).
>>
>> Upon indexing I am copying RiskLevel1, RiskLevel2 and RiskLevel3 into a single
>> field (Solr copyField). A vector is created against that field for Mahout to
>> create clusters on it. Now the issue is (understandably) that when clusters are
>> created it will find the distance between words, and it's very much possible
>> that the following three records get clustered into a single cluster:
>>
>> RiskLevel1  RiskLevel2  RiskLevel3
>> High        High        Low
>> High        High        High
>> High        High        Medium
>
> Just to make sure, in my presentation I talk about using text
> clustering for document tagging. The documents are vectorized and
> weighted with TF/IDF and are fed into a Mahout clustering algorithm.
>
> In your case you want to cluster items that have several risk levels
> as well as other properties. You have to use your original numerical
> data, (I assume probabilities) in a clustering algorithm, not the
> labels like low, medium, high. How were these labels assigned?
>
>>
>> But when clustering on these metadata columns, the requirement is to cluster
>> as below (the sequence of the values DOES matter):
>>
>> Cluster1:
>> RiskLevel1  RiskLevel2  RiskLevel3
>> High        High        Low
>> High        High        Low
>>
>> Cluster2:
>> RiskLevel1  RiskLevel2  RiskLevel3
>> High        High        High
>> High        High        High
>>
>> Cluster3:
>> RiskLevel1  RiskLevel2  RiskLevel3
>> High        High        Medium
>> High        High        Medium
>>
>> I started thinking about using classification over clustering. But while
>> playing with Weka (http://www.cs.waikato.ac.nz/ml/weka/), a Swing-based GUI
>> tool where one can easily play around with different algorithms directly from
>> the UI, I found that DBScan clustering did cluster the results correctly per
>> my requirements; to be precise, it created three different clusters (if you
>> pick the above-mentioned example).
>>
>> Can clustering be done the way I need it to work in Mahout? Or are there any
>> other ideas that can be explored further?
>>
>> Thanks,
>
> On Fri, Jan 20, 2012 at 6:48 PM, Frank Scholten <[email protected]>
> wrote:
>> On Fri, Jan 20, 2012 at 4:01 PM, Vikas Pandya <[email protected]> wrote:
>>> From the example below, Solr search results should be clustered in the
>>> following way: list all the items which have matching RiskLevels, e.g.
>>>
>>> Cluster 1:
>>> Title  RiskLevel1  RiskLevel2  RiskLevel3
>>> abc    High        Medium      Low
>>> xyz    High        Medium      High
>>> def    Low         Medium      High
>>>
>>> Cluster 2:
>>> Title  RiskLevel1  RiskLevel2  RiskLevel3
>>> omn    Low         Medium      Low
>>> yui    Low         Medium      High
>>> bnm    Medium      Medium      High
>>>
>>> Though I have a feeling I don't need to use Mahout clustering for this, I am
>>> still trying to hook in Mahout for this since we have more clustering
>>> requirements in the pipeline to cluster based on other features (attributes
>>> of objects).
>>>
>>
>> You only have 27 unique risk level combinations. You could just sort by
>> one or more risk levels to get a sense of the data.
>>
>> If you have more attributes then you could indeed look into clustering.
>>
>> Cheers,
>>
>> Frank
>>
>>> Any thoughts?
>>>
>>> ________________________________
>>> From: Vikas Pandya <[email protected]>
>>> To: Frank Scholten <[email protected]>; "[email protected]"
>>> <[email protected]>
>>> Sent: Thursday, January 19, 2012 11:05 AM
>>>
>>> Subject: Re: How to present mahout cluster in combination with Solr results
>>>
>>> Hi Frank,
>>>
>>> Thanks for the link. That was useful. It's still a bit unclear how he built
>>> his index. Are we saying we index clusterId, clusterSize and clusterLabel
>>> in the same index (where the other data is indexed)? So one index will have
>>> two sets of Solr documents in it, one containing cluster info?
>>>
>>> My requirement again: I have a bunch of DB columns which are being indexed,
>>> e.g.
>>>
>>> Title   RiskLevel1  RiskLevel2  RiskLevel3  etc.
>>> Title1  High        Medium      Low
>>>
>>> The current requirement is to cluster documents based on their risk levels
>>> and NOT the title.
>>>
>>> Thanks,
>>>
>>>
>>> ________________________________
>>> From: Frank Scholten <[email protected]>
>>> To: [email protected]; Vikas Pandya <[email protected]>
>>> Sent: Thursday, January 19, 2012 4:24 AM
>>> Subject: Re: How to present mahout cluster in combination with Solr results
>>>
>>> Hi Vikas,
>>>
>>> I suggest indexing the cluster label, cluster size and
>>> cluster-document mappings so you can use that information to build a
>>> tag cloud of your data. Check out this presentation:
>>> http://java.dzone.com/videos/configuring-mahout-clustering
>>>
>>> Cheers,
>>>
>>> Frank
>>>
>>> On Thu, Jan 19, 2012 at 4:18 AM, Vikas Pandya <[email protected]> wrote:
>>>> Hello,
>>>>
>>>> I have successfully created vectors from reading my existing Solr index,
>>>> then created a sequenceFile and Mahout clusters from it. As I understand that
>>>> currently Solr and Mahout clustering aren't integrated, what's the best way
>>>> to represent Mahout clusters to the user?
>>>> Mine is a search application which
>>>> renders results by querying the Solr index. Now I need to incorporate
>>>> Mahout-created clusters in the results. While Solr-Mahout integration isn't
>>>> there yet, what's the best alternative way to represent this info?
>>>>
>>>> Thanks,
>>>
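
A note on Jeff's suggestion quoted at the top of this thread: the idea of mapping each feature to a continuous vector element and weighting the elements in the distance computation can be sketched in a few lines. This is plain Python for illustration, not Mahout code. The ordinal encoding (Low=0, Medium=1, High=2), the equal weights and the record names r1..r3 are all assumptions, since the thread never says how the labels map to numbers.

```python
import math

# Assumed ordinal encoding; the thread only gives the labels, not numeric values.
RISK_ORDINAL = {"Low": 0.0, "Medium": 1.0, "High": 2.0}


def encode(levels):
    """Map a sequence of risk labels to a numeric vector."""
    return [RISK_ORDINAL[label] for label in levels]


def weighted_euclidean(a, b, weights):
    """Euclidean distance with each squared difference scaled by a weight."""
    if not (len(a) == len(b) == len(weights)):
        raise ValueError("vectors and weights must have the same length")
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))


# The three records from the example in the thread (names are hypothetical).
records = {
    "r1": ("High", "High", "Low"),
    "r2": ("High", "High", "High"),
    "r3": ("High", "High", "Medium"),
}

# Hypothetical weights: every risk level counts equally.
weights = [1.0, 1.0, 1.0]

vectors = {name: encode(levels) for name, levels in records.items()}

# Pairwise distances between all distinct records.
names = sorted(vectors)
distances = {
    (p, q): weighted_euclidean(vectors[p], vectors[q], weights)
    for i, p in enumerate(names)
    for q in names[i + 1:]
}

for (p, q), d in distances.items():
    print(p, q, d)
```

Under these assumptions the High/High/Low and High/High/High records are 2.0 apart, each is 1.0 away from High/High/Medium, and identical records are at distance 0. Raising the weight of one element increases its influence on the distance. Whether such an ordering is meaningful depends on how the labels were assigned, which is the question Frank raises above.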
