Hi Imran,

Yes, for better computational performance, we should use CSC or CSR
format. I believe we are moving in that direction. But let
us first discuss the format for sparse input. Do you mind moving our
discussion to the JIRA I created?

https://spark-project.atlassian.net/browse/MLLIB-18
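
For reference, CSR keeps the nonzeros in three flat arrays. Here is a
minimal sketch (all names are illustrative, not MLlib's actual API):

```scala
// Minimal CSR sketch (illustrative only, not an MLlib class).
// `values` and `colIdx` hold the nonzeros row by row; rowPtr(i) marks
// where row i's entries start, so row i spans rowPtr(i) until rowPtr(i + 1).
case class CSRMatrix(numRows: Int, numCols: Int,
                     rowPtr: Array[Int], colIdx: Array[Int],
                     values: Array[Double]) {
  // Sparse matrix-vector product y = A * x, touching only the nonzeros.
  def multiply(x: Array[Double]): Array[Double] = {
    val y = new Array[Double](numRows)
    var i = 0
    while (i < numRows) {
      var k = rowPtr(i)
      while (k < rowPtr(i + 1)) {
        y(i) += values(k) * x(colIdx(k))
        k += 1
      }
      i += 1
    }
    y
  }
}

// 2x3 matrix [[1, 0, 2], [0, 3, 0]] in CSR form:
val a = CSRMatrix(2, 3, Array(0, 2, 3), Array(0, 2, 1), Array(1.0, 2.0, 3.0))
val y = a.multiply(Array(1.0, 1.0, 1.0))
// y = Array(3.0, 3.0)
```

CSC is the same idea with the roles of rows and columns swapped.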

Xiangrui

On Wed, Feb 5, 2014 at 2:11 PM, Imran Rashid <[email protected]> wrote:
> Hi,
>
> I was hoping to have some discussion about how sparse matrices are
> represented in MLLib.  I noticed a few commits to add a basic MatrixEntry
> object:
>
> https://github.com/apache/incubator-spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/MatrixEntry.scala
>
> I know that MatrixEntry is a really nice abstraction, but in my experience,
> putting all your data in objects can really destroy performance in
> numerical computation.  You either waste a lot of memory with that
> representation, or you waste a ton of compute time serializing &
> deserializing.  Either way, you also lose cache-locality for your data.
>
> I'm thinking we'd want the basic record type to be a sparse vector:
>
> class SparseVector(colIds: Array[Int], colValues: Array[Double])
>
> where the two arrays are aligned.  This is compact & lets you iterate over
> all the values easily & efficiently.  Or actually for really sparse data,
> you might even want to group multiple rows together into one set of arrays:
>
> class SparseMatrixBlock(rowStartIdx: Array[Int], colIds: Array[Int],
> colValues: Array[Double])
>
> which would store a group of rows (or cols), saving space & getting
> cache locality by not using a separate array for each one.
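
A rough sketch of how these two types could look and be used (the field
and method names here are made up for illustration, not a proposed API):

```scala
// Sketch of the proposed record types (names/fields illustrative only).
case class SparseVector(colIds: Array[Int], colValues: Array[Double]) {
  // Dot product against a dense vector: touch only the nonzeros.
  def dot(dense: Array[Double]): Double = {
    var s = 0.0
    var k = 0
    while (k < colIds.length) {
      s += colValues(k) * dense(colIds(k))
      k += 1
    }
    s
  }
}

// rowStartIdx(i) is the offset of row i's first entry; one extra sentinel
// entry marks the end, so row i spans rowStartIdx(i) until rowStartIdx(i + 1).
case class SparseMatrixBlock(rowStartIdx: Array[Int],
                             colIds: Array[Int],
                             colValues: Array[Double]) {
  def numRows: Int = rowStartIdx.length - 1
  // Visit every stored (row, col, value) without allocating per-entry objects.
  def foreachEntry(f: (Int, Int, Double) => Unit): Unit = {
    var i = 0
    while (i < numRows) {
      var k = rowStartIdx(i)
      while (k < rowStartIdx(i + 1)) {
        f(i, colIds(k), colValues(k))
        k += 1
      }
      i += 1
    }
  }
}

// Vector with nonzeros 1.0 at column 0 and 2.0 at column 2:
val v = SparseVector(Array(0, 2), Array(1.0, 2.0))
// v.dot(Array(1.0, 1.0, 1.0)) == 3.0

// Two-row block: row 0 holds (col 0 -> 1.0, col 2 -> 2.0), row 1 holds (col 1 -> 3.0).
val block = SparseMatrixBlock(Array(0, 2, 3), Array(0, 2, 1), Array(1.0, 2.0, 3.0))
var total = 0.0
block.foreachEntry((_, _, value) => total += value)
// total == 6.0
```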
>
> I'm speaking just from experience with non-distributed implementations, but
> the savings in both speed & memory from using a format like this are pretty
> humongous; I'd imagine they apply to Spark also.  (And if not, it probably
> means Spark is introducing some overhead that we should try to eliminate.)
>
> I know the format is not as pleasant to work with, but I think it's
> definitely worth it.  And now is really the time to get it right, before a
> lot of stuff is written.  We could write a bunch of helper methods for
> SparseMatrixBlock to make it easier to work with.
>
> I know this is a big decision that will affect a lot of code.  Not sure if
> anybody else has some work in progress.  I could potentially try to get some
> performance numbers for comparison and point at some code I've used in the
> past, though it's all very rough.
>
> thanks,
> Imran
>
>
>
> On Fri, Jan 31, 2014 at 1:14 PM, Xiangrui Meng <[email protected]> wrote:
>>
>> Hi Jason,
>>
>> Sorry, I didn't see this message before I replied in another thread.
>> So the following is copy-and-paste:
>>
>> We are currently working on sparse data support, one of the
>> highest-priority features for MLlib. All existing algorithms will
>> support sparse input. We will open a JIRA ticket for progress tracking
>> and discussions.
>>
>> Best,
>> Xiangrui
>>
>> On Fri, Jan 31, 2014 at 10:49 AM, jshao <[email protected]> wrote:
>> > Hi,
>> >
>> > Spark is absolutely amazing for machine learning as its iterative
>> > process is super fast. However, one big issue that I realized was
>> > that the MLLib API isn't suitable for sparse inputs at all because
>> > it requires the feature vector to be a dense array.
>> >
>> > For example, I currently want to run a logistic regression on data
>> > that is wide and sparse (each data point might have 3 million fields
>> > with most of them being 0). It is impossible to represent each data
>> > point as an array of length 3 million.
>> >
>> > Can I expect/contribute to any changes that might handle sparse inputs?
>> >
>> > Thanks,
>> > Jason
>> >
>> >
>> >
>> > --
>> > View this message in context:
>> > http://apache-spark-user-list.1001560.n3.nabble.com/MLLib-Sparse-Input-tp1085.html
>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>