Hi Jonathan,

Here's what you need to do to run RowSimilarity on your CSV-formatted data.
You will have to use the MapReduce version, since the Spark version only
supports LLR (log-likelihood ratio).

1. Convert the CSV to Vectors: use CSVIterator and store the vectors as
SequenceFiles.
2. Run RowIDJob on the SequenceFile output of (1). This should generate a
matrix of <IntWritable, VectorWritable> and a docIndex of <IntWritable,
Text>.
3. Run RowSimilarityJob on the matrix output from (2), specifying cosine
similarity (SIMILARITY_COSINE) as the measure and a cutoff threshold. This
should generate a matrix of rows -> most similar rows, with their
similarity scores.
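
To sanity-check the results on a small sample before running the Hadoop
jobs, the computation that steps 1-3 perform can be sketched locally in
plain Python. This is not Mahout code: load_rows, cosine, and similar_rows
are hypothetical helper names, the sample values are made up, and the CSV is
assumed to have no header row and the columns star_id, wavelength, intensity.

```python
import csv
import io
import math
from collections import defaultdict
from itertools import combinations


def load_rows(csv_file):
    """Group (star_id, wavelength, intensity) triples into one sparse row per star."""
    rows = defaultdict(dict)
    for star_id, wavelength, intensity in csv.reader(csv_file):
        rows[star_id][float(wavelength)] = float(intensity)
    return dict(rows)


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as {column: value} dicts."""
    dot = sum(value * b[col] for col, value in a.items() if col in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similar_rows(rows, threshold):
    """Map each row id to the (other id, similarity) pairs at or above the threshold."""
    result = defaultdict(list)
    for (id_a, vec_a), (id_b, vec_b) in combinations(rows.items(), 2):
        sim = cosine(vec_a, vec_b)
        if sim >= threshold:
            result[id_a].append((id_b, sim))
            result[id_b].append((id_a, sim))
    return dict(result)


# Made-up sample data (assumption): three stars, two wavelengths each.
SAMPLE = """s1,400,1.0
s1,500,2.0
s2,400,2.0
s2,500,4.0
s3,400,3.0
s3,500,-1.0
"""

rows = load_rows(io.StringIO(SAMPLE))
neighbours = similar_rows(rows, threshold=0.9)
for star_id, matches in sorted(neighbours.items()):
    print(star_id, matches)
```

The all-pairs loop is quadratic in the number of stars, so this is only
practical for a small subset; it is meant for comparing against the job
output, not as a replacement for it.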




On Wed, May 13, 2015 at 11:42 PM, Jonathan Seale <jonathanpse...@gmail.com>
wrote:

> Thanks, Charlie,
>
> The data has been through lots of processing, but in an attempt to make it
> more Mahout-friendly, I've converted it into a single csv table with
> columns: star_id, wavelength, intensity. My motivation was to make it like
> a user_id, item_id, rating table you might see in other Mahout uses.
>
> As opposed to using my local machine, I've set up an instance on Amazon with
> hopes of turning this into a remote service. So the install is whatever
> comes with Amazon's default Mahout installation.
>
> Jonathan
>
>
>
> On Wed, May 13, 2015 at 11:29 PM, Charlie Hack <charles.t.h...@gmail.com>
> wrote:
>
> > Hi Jonathan, how do you have the data stored? The more info about your
> > setup, the better.
> >
> >
> > Charlie
> >
> > On Wednesday, May 13, 2015 at 23:16, Jonathan Seale <
> > jonathanpse...@gmail.com>, wrote:
> > Scientists,
> >
> > I have an astrophysical application for Mahout that I need help with.
> >
> > I have 1-dimensional stellar spectra for many, many stars. Each spectrum
> > consists of a series of intensity values, one per wavelength of light. I
> > need to be able to find the cosine similarity between ALL pairs of stars.
> > Seems to me this is simply a user-user similarity problem where I have
> > stars instead of users, wavelengths instead of items, and intensities
> > instead of ratings/clicks.
> >
> > But I'm having difficulty using mahout's row similarity package (I'm new
> > to this, and these days astronomers code pretty exclusively in python). I
> > know that I must have to 1) create a sparse matrix where each row is a
> > star, columns are wavelengths, and the values are intensity, and 2)
> > implement row similarity. But I'm just not sure how to do it. Anyone have
> > a good resource or be willing to help? I could probably offer some
> > compensation to anyone that would be willing to provide a little focussed,
> > personalized assistance.
> >
> > Thanks,
> >
> > Jonathan
> >
>
