Ted and Jeff,

Thanks very much for the advice. I'll make a start on NamedVectors.

Thanks very much again,

John

On Thu, Feb 17, 2011 at 5:49 PM, Ted Dunning <[email protected]> wrote:

> This should be fine.
>
> I recommend not doing too much special with the coordinates except for
> translating to unit vector positions. This allows standard Euclidean
> metrics to give you the result you want.
>
> You will also need to scale or translate your third variable so that it
> is in the same scale of reference as the first two. The question you
> should ask yourself is whether a particular amount of change in each
> coordinate represents about the same level of difference in the final
> result.
>
> On Thu, Feb 17, 2011 at 9:15 AM, Jeff Eastman <[email protected]> wrote:
>
> > Let me first translate your problem a little to make it more tangible.
> > Suppose you have data of the following format [Item Longitude Latitude
> > Altitude]:
> >
> > You can convert this into Mahout NamedVectors, where the name=Item and
> > the vector values are [Longitude, Latitude, Altitude]. Look in
> > utils/src/main/java/m/a/o/clustering/conversion for some example jobs
> > for starting points. You will likely need to write your own conversion
> > job to create the NamedVectors the way you want from your input data in
> > its encoding format.
> >
> > Now, you can cluster this data using any of the Mahout algorithms, but
> > the clustering will treat all your vector elements equally. I get that
> > you really want to cluster mostly based on Altitude (cluster all the
> > Lon/Lat items which have similar Altitudes). If this is the case then
> > you can use one of our WeightedDistanceMeasures to minimize (or
> > eliminate) the effects of Lon/Lat and focus mostly (or entirely) on
> > Altitudes. Or, better, you can write your own SphericalDistanceMeasure
> > (to deal with the fact that Lon=001 is quite close to Lon=359, for
> > example).
> >
> > Hope this helps,
> > Jeff
> >
> > -----Original Message-----
> > From: john abbott [mailto:[email protected]]
> > Sent: Thursday, February 17, 2011 8:49 AM
> > To: [email protected]
> > Subject: Clustering assistance, mean shift
> >
> > Hi,
> >
> > I was wondering whether someone might be able to help me out. I'd like
> > to use Mahout via Elastic MapReduce to cluster some datasets but I'm
> > not sure I've got the right use case. I'm hoping someone might be able
> > to comment and perhaps point me in the direction of some further advice.
> >
> > I have a dataset which is stored in a database and structured as follows:
> >
> > Item  Value X  Value Y  Value Z
> > A     2        4        3
> > A     3        5        6
> > A     6        7        9
> > B     5        8        2
> > B     2        4        7
> > ...
> >
> > I would like to create a series of clusters for each item based on the
> > values of X and Y and Z. X and Y are geographic co-ordinates, i.e. real
> > world places, and Z is a value observed in those places. What I'd like
> > to end up with is (for each Item) a series of clusters saying these
> > Values of Z are coincident at this place (represented by Value X and Y).
> > I've looked through and played with the quickstarts and that's all fine
> > but I'm wondering:
> >
> > 1. Is this sort of analysis possible?
> > 2. How do I convert my numeric data into the correct format to be
> >    processed by a Job?
> > 3. Any pointers to how I might configure my job in a way that can be
> >    distributed and create a cluster for each item?
> >
> > Thank you to anyone who might be able to help. I'm really excited to get
> > started with Mahout but I'm struggling to understand whether it's
> > suitable and how to get started.
> >
> > Thanks very much,
> >
> > John

--
John Abbott
Co-Founder
www.oobafit.com
m. 44 (0)7919392754
@scmjea
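
For anyone following the thread, here is a minimal sketch of the kind of distance Jeff describes: a weighted Euclidean distance over [Longitude, Latitude, Altitude] in which longitude differences wrap around at 360 degrees. It is plain Java with no Mahout dependency, and the class and method names are hypothetical, not Mahout API; wiring it into a clustering job would mean implementing Mahout's own distance-measure interface, which is not shown here. It only handles the longitude wraparound and is not a true great-circle (spherical) distance. The weights also stand in for Ted's point about scale: they let you decide how much a unit of change in each coordinate should count.

```java
/**
 * Illustrative only: the names below are hypothetical and are not part of
 * the Mahout API. Vectors are assumed to be [longitude, latitude, altitude]
 * with longitude in degrees.
 */
final class AltitudeWeightedDistance {

    /** Smallest absolute difference between two longitudes, wrapping at 360 degrees. */
    static double lonDelta(double lonA, double lonB) {
        double d = Math.abs(lonA - lonB) % 360.0;
        return d > 180.0 ? 360.0 - d : d;
    }

    /**
     * Weighted Euclidean distance. weights[i] multiplies the squared
     * difference of component i, so a weight of 0 removes that component.
     */
    static double distance(double[] a, double[] b, double[] weights) {
        double dLon = lonDelta(a[0], b[0]);
        double dLat = a[1] - b[1];
        double dAlt = a[2] - b[2];
        return Math.sqrt(weights[0] * dLon * dLon
                       + weights[1] * dLat * dLat
                       + weights[2] * dAlt * dAlt);
    }

    public static void main(String[] args) {
        double[] p = {1.0, 10.0, 100.0};
        double[] q = {359.0, 10.0, 103.0};

        // Lon=1 and Lon=359 are 2 degrees apart, not 358.
        System.out.println(lonDelta(1.0, 359.0));                          // 2.0

        // Zero weights on Lon/Lat: only the altitude difference counts.
        System.out.println(distance(p, q, new double[] {0.0, 0.0, 1.0})); // 3.0

        // Equal weights: all three components contribute.
        System.out.println(distance(p, q, new double[] {1.0, 1.0, 1.0}));
    }
}
```

Setting the Lon/Lat weights to small non-zero values rather than zero keeps some geographic influence while still clustering mostly on altitude.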
