Hi, I was hoping to have some discussion about how sparse matrices are represented in MLlib. I noticed a few commits to add a basic MatrixEntry object:
https://github.com/apache/incubator-spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/MatrixEntry.scala

MatrixEntry is a really nice abstraction, but in my experience, putting all your data in objects can really destroy performance in numerical computation. You either waste a lot of memory with that representation, or you waste a ton of compute time serializing & deserializing. Either way, you also lose cache locality for your data.

I'm thinking we'd want the basic record type to be a sparse vector:

    class SparseVector(colIds: Array[Int], colValues: Array[Double])

where the two arrays are aligned, so colValues(i) is the value in column colIds(i). This is compact & lets you iterate over all the values easily & efficiently.

For really sparse data, you might even want to group multiple rows together into one set of arrays:

    class SparseMatrixBlock(rowStartIdx: Array[Int], colIds: Array[Int], colValues: Array[Double])

which would store a group of rows (or columns), saving space & keeping cache locality by not using a separate pair of arrays for each one.

I'm speaking just from experience with non-distributed implementations, but the savings in both speed & memory from a format like this are pretty humongous, and I'd imagine they apply to Spark as well. (And if not, it probably means Spark is introducing some overhead that we should try to eliminate.) I know the format is not as pleasant to work with, but I think it's definitely worth it, and now is really the time to get it right, before a lot of stuff is written. We could write a bunch of helper methods for SparseMatrixBlock to make it easier to work with.

I know this is a big decision that will affect a lot of code. Not sure if anybody else has some work in progress. I could potentially try to get some performance numbers for comparison and point at some code I've used in the past, though it's all very rough.
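To make the idea concrete, here's a rough sketch of the two classes (the names follow my proposal above; the dot product and row accessor are just illustrative helpers, not a proposed API). SparseMatrixBlock is essentially the standard CSR layout, where rowStartIdx(i) to rowStartIdx(i + 1) delimits row i's entries in the shared arrays:

```scala
// Sketch only: aligned-array sparse vector, as proposed above.
class SparseVector(val colIds: Array[Int], val colValues: Array[Double]) {
  require(colIds.length == colValues.length)

  // Dot product against a dense vector: one pass over the aligned arrays,
  // no object allocation per entry.
  def dot(dense: Array[Double]): Double = {
    var sum = 0.0
    var i = 0
    while (i < colIds.length) {
      sum += colValues(i) * dense(colIds(i))
      i += 1
    }
    sum
  }
}

// Sketch only: CSR-style block of rows sharing one pair of arrays.
// rowStartIdx has numRows + 1 entries; row i's data lives at positions
// rowStartIdx(i) until rowStartIdx(i + 1) in colIds/colValues.
class SparseMatrixBlock(val rowStartIdx: Array[Int],
                        val colIds: Array[Int],
                        val colValues: Array[Double]) {
  def numRows: Int = rowStartIdx.length - 1

  // Materialize row i as a SparseVector (copies; a real implementation
  // would probably iterate in place instead).
  def row(i: Int): SparseVector = {
    val from = rowStartIdx(i)
    val to = rowStartIdx(i + 1)
    new SparseVector(colIds.slice(from, to), colValues.slice(from, to))
  }
}
```

For example, a 2x4 matrix with entries (0,0)=1.0, (0,3)=2.0, (1,1)=3.0 would be stored as `new SparseMatrixBlock(Array(0, 2, 3), Array(0, 3, 1), Array(1.0, 2.0, 3.0))` — three small arrays total, instead of three MatrixEntry objects on the heap.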
thanks,
Imran

On Fri, Jan 31, 2014 at 1:14 PM, Xiangrui Meng <[email protected]> wrote:
> Hi Jason,
>
> Sorry, I didn't see this message before I replied in another thread.
> So the following is copy-and-paste:
>
> We are currently working on the sparse data support, one of the
> highest priority features for MLlib. All existing algorithms will
> support sparse input. We will open a JIRA ticket for progress tracking
> and discussions.
>
> Best,
> Xiangrui
>
> On Fri, Jan 31, 2014 at 10:49 AM, jshao <[email protected]> wrote:
> > Hi,
> >
> > Spark is absolutely amazing for machine learning as its iterative process is
> > super fast. However one big issue that I realized was that the MLLib API
> > isn't suitable for sparse inputs at all because it requires the feature
> > vector to be a dense array.
> >
> > For example, I currently want to run a logistic regression on data that is
> > wide and sparse (each data point might have 3 million fields with most of
> > them being 0). It is impossible to represent each data point as an array of
> > length 3 million.
> >
> > Can I expect/contribute to any changes that might handle sparse inputs?
> >
> > Thanks,
> > Jason
> >
> > --
> > View this message in context:
> > http://apache-spark-user-list.1001560.n3.nabble.com/MLLib-Sparse-Input-tp1085.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
