Check out the recent mailing list post 'Clustering user profiles'.

Jeff (Eastman) sums it up clearly.

> Mahout clustering (unsupervised classification) can only deal with 
> continuous, homogeneous vector representations of the input data, where each 
> vector element is weighted the same as the other elements. Mahout
> (supervised) classification can deal with continuous, categorical, word-like 
> and text-like features such as in your problem space.

> To address your problem with Mahout clustering, you would need to develop a 
> mapping for each of your features to continuous vector elements and use a 
> WeightedDistanceMeasure to account for the different element types and 
> their relative impacts on the overall distance computation. This would be an 
> iterative process which might or might not produce useful results.

> An alternative approach would be to train a Mahout classifier with the 
> various features using marked training data which classifies similar users 
> into a finite number of "clusters" that seem natural to you. With such a
> model, you could then classify new users into those "clusters". This approach 
> would not be very useful for discovering new "clusters" in your data, but it 
> would leverage the classifier training mechanisms to develop the models as 
> more of a black box than above.
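
As a rough illustration of the first suggestion above (mapping each feature
to a continuous vector element and weighting the elements in the distance
computation), here is a minimal sketch in plain Python. It does not use
Mahout's API. The numeric encoding of Low/Medium/High, the weights, and the
function names are all assumptions made for illustration only.

```python
import math

# Hypothetical ordinal encoding; the spacing between levels is an assumption.
LEVEL_TO_VALUE = {"Low": 0.0, "Medium": 0.5, "High": 1.0}


def encode(levels):
    """Map a sequence of risk-level labels to a list of floats."""
    return [LEVEL_TO_VALUE[level] for level in levels]


def weighted_euclidean(a, b, weights):
    """Euclidean distance where each squared difference is scaled by a weight."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))


# Illustrative weights: RiskLevel3 counts twice as much as the other two.
weights = [1.0, 1.0, 2.0]

a = encode(["High", "High", "Low"])
b = encode(["High", "High", "High"])
c = encode(["High", "High", "Medium"])

print(weighted_euclidean(a, b, weights))  # distance between Low and High in column 3
print(weighted_euclidean(a, c, weights))  # distance between Low and Medium in column 3
```

Choosing the encoding and the weights is the iterative part: different
choices change which records end up close together.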

A question for the other people reading this as well: I looked into this
and saw that there are clustering algorithms for categorical data, such as
K-modes. Are these effective for solving this kind of problem? If so,
would they be interesting to add to Mahout?
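
For anyone unfamiliar with K-modes: it follows the k-means loop but
compares records by counting mismatching categories and uses per-column
modes instead of means as cluster centres. Below is a naive sketch in plain
Python, not a Mahout implementation. Seeding with the first k distinct
records, the function names, and the sample records are assumptions for
illustration.

```python
from collections import Counter


def mismatch(a, b):
    """Simple matching dissimilarity: number of positions where labels differ."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(records):
    """Most frequent label in each column of the given records."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))


def k_modes(records, k, max_iter=10):
    """Naive k-modes; seeds are the first k distinct records (an assumption)."""
    modes = list(dict.fromkeys(records))[:k]
    assignment = []
    for _ in range(max_iter):
        # Assign each record to the nearest mode (ties go to the lowest index).
        new_assignment = [
            min(range(len(modes)), key=lambda i: mismatch(r, modes[i]))
            for r in records
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Recompute each mode from the records currently assigned to it.
        for i in range(len(modes)):
            members = [r for r, c in zip(records, assignment) if c == i]
            if members:
                modes[i] = column_modes(members)
    return assignment, modes


# Sample records (RiskLevel1, RiskLevel2, RiskLevel3), made up for illustration.
records = [
    ("High", "High", "Low"),
    ("High", "High", "High"),
    ("High", "High", "Medium"),
    ("High", "High", "Low"),
    ("High", "High", "High"),
    ("High", "High", "Medium"),
]
assignment, modes = k_modes(records, k=3)
print(assignment)
print(modes)
```

With k equal to the number of distinct combinations this just groups
identical records; the interesting case is a smaller k or more attributes.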

Cheers,

Frank

On Thu, Feb 2, 2012 at 12:38 PM, Vikas Pandya <[email protected]> wrote:
> Frank. Thanks.
>>> In your case you want to cluster items that have several risk levels
>>> as well as other properties. You have to use your original numerical
>>> data, (I assume probabilities) in a clustering algorithm, not the
>>> labels like low, medium, high. How were these labels assigned?
>
>
> RiskLevel1, RiskLevel2 and RiskLevel3 all hold actual lookup values (High, 
> Medium, Low, etc.) in the Solr index (the index is stored flattened).
>
> -Vikas
>
>
> ________________________________
>  From: Frank Scholten <[email protected]>
> To: [email protected]
> Sent: Wednesday, February 1, 2012 3:28 AM
> Subject: Re: How to present mahout cluster in combination with Solr results
>
> Vikas,
>
> Please send messages to the mailing list so everyone can benefit.
>
>> Frank,
>>
>> To give further details about the use case:
>>
>> 1) The user searches for free text; this search is served from Solr.
>> 2) The user selects a record from the search results. We then need to 
>> display all the items whose RiskLevels match the Risk Levels of the 
>> selected item (and put them under "Similar items" in the UI).
>>
>> Upon indexing I copy RiskLevel1, RiskLevel2 and RiskLevel3 into a single 
>> field (Solr copyField). A vector is created from that field for Mahout to 
>> cluster on. The issue is that (understandably) clustering measures the 
>> distance between words, so it is quite possible that the following three 
>> records get clustered into a single cluster:
>> RiskLevel1    RiskLevel2    RiskLevel3
>> High          High          Low
>> High          High          High
>> High          High          Medium
>
> Just to make sure, in my presentation I talk about using text
> clustering for document tagging. The documents are vectorized and
> weighted with TF/IDF and are fed into a Mahout clustering algorithm.
>
> In your case you want to cluster items that have several risk levels
> as well as other properties. You have to use your original numerical
> data, (I assume probabilities) in a clustering algorithm, not the
> labels like low, medium, high. How were these labels assigned?
>
>>
>> But when clustering on these metadata columns, the requirement is to 
>> cluster as below (the sequence of the values DOES matter):
>>
>> Cluster1:
>> RiskLevel1    RiskLevel2    RiskLevel3
>> High          High          Low
>> High          High          Low
>>
>> Cluster2:
>> RiskLevel1    RiskLevel2    RiskLevel3
>> High          High          High
>> High          High          High
>>
>> Cluster3:
>> RiskLevel1    RiskLevel2    RiskLevel3
>> High          High          Medium
>> High          High          Medium
>>
>> I started thinking about using classification instead of clustering. But 
>> while playing with Weka (http://www.cs.waikato.ac.nz/ml/weka/), a 
>> Swing-based GUI tool where one can easily try different algorithms 
>> directly from the UI, I found that DBScan clustering clustered the results 
>> correctly per my requirements; to be precise, it created three different 
>> clusters (for the example above).
>>
>> Can clustering be done the way I need it to work in Mahout? Or are there 
>> any other ideas that can be explored further?
>>
>> Thanks,
>
> On Fri, Jan 20, 2012 at 6:48 PM, Frank Scholten <[email protected]> wrote:
>> On Fri, Jan 20, 2012 at 4:01 PM, Vikas Pandya <[email protected]> wrote:
>>> From the example below, Solr search results should be clustered in the
>>> following way: list all the items which have matching RiskLevels, e.g.
>>>
>>>
>>> Cluster 1:
>>> Title    RiskLevel1    RiskLevel2    RiskLevel3
>>> abc      High          Medium        Low
>>> xyz      High          Medium        High
>>> def      Low           Medium        High
>>>
>>> Cluster 2:
>>> Title    RiskLevel1    RiskLevel2    RiskLevel3
>>> omn      Low           Medium        Low
>>> yui      Low           Medium        High
>>> bnm      Medium        Medium        High
>>>
>>> Though I have a feeling I don't need to use Mahout clustering for this, I am
>>> still trying to hook in mahout for this since we have more clustering
>>> requirements in the pipeline to cluster based on other features (attributes
>>> of objects).
>>>
>>
>> You only have 27 unique risk level combinations (3 levels for each of 3
>> fields). You could just sort by one or more risk levels to get a sense
>> of the data.
>>
>> If you have more attributes then you could indeed look into clustering.
>>
>> Cheers,
>>
>> Frank
>>
>>> Any thoughts?
>>>
>>> ________________________________
>>> From: Vikas Pandya <[email protected]>
>>> To: Frank Scholten <[email protected]>; "[email protected]"
>>> <[email protected]>
>>> Sent: Thursday, January 19, 2012 11:05 AM
>>>
>>> Subject: Re: How to present mahout cluster in combination with Solr results
>>>
>>> Hi Frank,
>>>
>>> Thanks for the link. That was useful. It's still a bit unclear how he built
>>> his index. Are we saying we index clusterId, clusterSize and clusterLabel
>>> in the same index (where the other data is indexed)? So one index will have
>>> two sets of Solr documents in it, one of them containing cluster info?
>>>
>>> My requirement again: I have a bunch of DB columns which are being indexed,
>>> e.g.
>>> Title     RiskLevel1    RiskLevel2    RiskLevel3    etc.
>>> Title1    High          Medium        Low
>>>
>>> The current requirement is to cluster documents based on their RiskLevels
>>> and NOT the title.
>>>
>>> Thanks,
>>>
>>>
>>> ________________________________
>>> From: Frank Scholten <[email protected]>
>>> To: [email protected]; Vikas Pandya <[email protected]>
>>> Sent: Thursday, January 19, 2012 4:24 AM
>>> Subject: Re: How to present mahout cluster in combination with Solr results
>>>
>>> Hi Vikas,
>>>
>>> I suggest indexing the cluster label, cluster size and
>>> cluster-document mappings so you can use that information to build a
>>> tag cloud of your data. Check out this presentation:
>>> http://java.dzone.com/videos/configuring-mahout-clustering
>>>
>>> Cheers,
>>>
>>> Frank
>>>
>>> On Thu, Jan 19, 2012 at 4:18 AM, Vikas Pandya <[email protected]> wrote:
>>>> Hello,
>>>>
>>>> I have successfully created vectors by reading my existing Solr index,
>>>> then created a sequenceFile and Mahout clusters from them. As I understand
>>>> it, Solr and Mahout clustering aren't currently integrated, so what's the
>>>> best way to present Mahout clusters to the user? Mine is a search
>>>> application which renders results by querying the Solr index. Now I need
>>>> to incorporate the Mahout-created clusters in the results. While
>>>> Solr-Mahout integration isn't there yet, what's the best alternative way
>>>> to represent this info?
>>>>
>>>> Thanks,
>>>
