Actually, this is probably done more easily using a simple matrix
multiplication.  The reason for not using recommendation code for this is
that your problem is entirely dense.

How exactly you should go about this is a different question.  Up to tens
of thousands of stars, you can probably do this on a single machine using
pretty standard tools like R or MATLAB.
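
Since you mentioned working in Python, here is a minimal single-machine
sketch using NumPy instead of R or MATLAB.  The function name, the array
sizes and the random data are made up for illustration.  Each spectrum is
scaled to unit length first, so that the plain dot products that come out of
the matrix multiplication are cosine similarities.

```python
import numpy as np


def cosine_similarity_matrix(spectra):
    """All-pairs cosine similarity for a dense (n_stars, n_wavelengths) array."""
    spectra = np.asarray(spectra, dtype=np.float64)
    # Scale each row to unit length so a plain dot product equals the cosine.
    norms = np.linalg.norm(spectra, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0  # an all-zero spectrum stays all zero
    unit = spectra / norms
    # A A' with stars as rows: entry (i, j) is the cosine between stars i and j.
    return unit @ unit.T


# Made-up data: 5 stars, 2000 wavelength bins each.
rng = np.random.default_rng(0)
A = rng.random((5, 2000))
S = cosine_similarity_matrix(A)
```

S is symmetric with ones on the diagonal, so only the upper triangle carries
information.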

For larger problems, you will need to parallelize the computation.
Essentially, if A contains your data, this turns into either A A' (if stars
are rows) or A' A (if stars are columns).  The real problem is that your
output is going to be as big as the number of stars, squared.  This will
probably limit the feasibility of this computation.  A million stars will
result in something like 10TB of output (10^12 dot products at roughly 8
bytes each).
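
The arithmetic behind that estimate is easy to check.  This is only a rough
sketch: the helper name is made up, and it assumes every value is stored as
an 8-byte double with no compression and no use of the symmetry of the
result.

```python
def similarity_output_bytes(n_stars, bytes_per_value=8):
    """Size of the full n_stars x n_stars similarity matrix in bytes."""
    return n_stars * n_stars * bytes_per_value


for n_stars in (10_000, 100_000, 1_000_000):
    terabytes = similarity_output_bytes(n_stars) / 1e12
    print(f"{n_stars:>9} stars: {terabytes:g} TB")
```

Storing 4-byte floats or keeping only the upper triangle each halve the
total, but the quadratic growth is what hurts.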

Assuming you have a million stars and each spectrum contains a few thousand
observations, the way I would go about this computation would be to store
each spectrum as a row, and divide your data file into batches of rows.
Call the full matrix A and each batch of rows A_1 ... A_n.  Each batch
should have however many rows it takes to get a matrix product A_i A_j' to
take 30-100 seconds.

Now, all you have to do is schedule the multiplication of every pair of A_i
and A_j.  How you do that and how you store the data won't matter very much
because the computation costs will outweigh the scheduling and I/O costs.
The output will consist of matrices B_ij that each contain the dot products
between all of the stars in A_i and all of the stars in A_j.   To find the
dot product of two arbitrary stars, you first have to find which batches
they are in, and then you need to find their product in the corresponding
B_ij file.  You should probably check out some of the efficient math
packages for doing the local multiplications.
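
A toy version of the batching and lookup described above might look like
this in Python with NumPy.  All of the names here are hypothetical, the data
is random, and the blocks are kept in an in-memory dict.  At real scale each
B_ij would be a separate job that writes its own file, and the batch size
would be tuned to the 30-100 second target rather than set to 4.  The rows
are assumed to be scaled to unit length already, so that the dot products are
cosines.

```python
import numpy as np


def split_into_batches(unit_spectra, batch_rows):
    """Cut the row-normalized matrix A into consecutive row batches A_1 ... A_n."""
    n_rows = unit_spectra.shape[0]
    return [unit_spectra[start:start + batch_rows]
            for start in range(0, n_rows, batch_rows)]


def block_products(batches):
    """Compute B_ij = A_i A_j' for every pair of batches (0-based i, j)."""
    return {(i, j): a_i @ a_j.T
            for i, a_i in enumerate(batches)
            for j, a_j in enumerate(batches)}


def lookup(blocks, batch_rows, star_a, star_b):
    """Dot product of two stars, given their global row numbers."""
    # Which batch each star is in, and its position inside that batch.
    i, row = divmod(star_a, batch_rows)
    j, col = divmod(star_b, batch_rows)
    return blocks[(i, j)][row, col]


# Made-up data: 10 stars, 50 wavelength bins, rows scaled to unit length.
rng = np.random.default_rng(0)
A = rng.random((10, 50))
A /= np.linalg.norm(A, axis=1, keepdims=True)

batches = split_into_batches(A, batch_rows=4)  # batches of 4, 4 and 2 rows
blocks = block_products(batches)
value = lookup(blocks, 4, 7, 9)  # stars 7 and 9 live in batches 1 and 2
```

Because B_ji is just the transpose of B_ij, a real scheduler only needs to
run the pairs with i <= j.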

My guess is that this is very much not what you really want to be doing.

It is much more likely that you want to have an efficient nearest neighbor
search engine so that you can quickly find the, say, thousand most similar
stars given any query star.  That can be done with packages like FLANN [1]
or others [2].  Mahout will not help you with this given the dense nature
of your data.
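
To show what such a query returns, here is a brute-force sketch in Python
with NumPy.  It is not FLANN and does not use FLANN's API.  It is an exact
linear scan with a made-up function name, and it assumes the spectra are
already scaled to unit length.  An approximate index like the ones cited
here answers the same kind of query without touching every star.

```python
import numpy as np


def top_k_similar(unit_spectra, query_index, k):
    """Indices and cosines of the k stars most similar to the query star."""
    # One matrix-vector product gives the cosine to every star.
    sims = unit_spectra @ unit_spectra[query_index]
    sims[query_index] = -np.inf  # never return the query star itself
    k = min(k, sims.size - 1)
    # argpartition picks the k largest without sorting the whole array.
    nearest = np.argpartition(-sims, k - 1)[:k]
    # Sort just those k by decreasing similarity.
    nearest = nearest[np.argsort(-sims[nearest])]
    return nearest, sims[nearest]


# Made-up data: 200 stars, 300 wavelength bins, rows scaled to unit length.
rng = np.random.default_rng(0)
U = rng.random((200, 300))
U /= np.linalg.norm(U, axis=1, keepdims=True)

neighbors, cosines = top_k_similar(U, query_index=0, k=10)
```

This costs one pass over the data per query and never materializes the full
star-by-star matrix.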

[1] http://www.cs.ubc.ca/research/flann/
[2] https://www.cs.umd.edu/~mount/ANN/



On Wed, May 13, 2015 at 11:15 PM, Jonathan Seale <jonathanpse...@gmail.com>
wrote:

> Scientists,
>
> I have an astrophysical application for Mahout that I need help with.
>
> I have 1-dimensional stellar spectra for many, many stars. Each spectrum
> consists of a series of intensity values, one per wavelength of light. I
> need to be able to find the cosine similarity between ALL pairs of stars.
> Seems to me this is simply a user-user similarity problem where I have
> stars instead of users, wavelengths instead of items, and intensities
> instead of ratings/clicks.
>
> But I'm having difficulty using mahout's row similarity package (I'm new to
> this, and these days astronomers code pretty exclusively in python). I know
> that I must have to 1) create a sparse matrix where each row is a star,
> columns are wavelengths, and the values are intensity, and 2) implement row
> similarity. But I'm just not sure how to do it. Anyone have a good resource
> or be willing to help? I could probably offer some compensation to anyone
> that would be willing to provide a little focussed, personalized
> assistance.
>
> Thanks,
> Jonathan
>
