Hi, I was hoping to have some discussion about how sparse matrices are represented in MLlib. I noticed a few commits to add a basic MatrixEntry object:
https://github.com/apache/incubator-spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/MatrixEntry.scala

MatrixEntry is a really nice abstraction, but in my experience, putting all your data in objects can really destroy performance in numerical computation. You either waste a lot of memory with that representation, or you waste a ton of compute time serializing & deserializing. Either way, you also lose cache locality for your data.

I'm thinking we'd want the basic record type to be a sparse vector:

    class SparseVector(colIds: Array[Int], colValues: Array[Double])

where the two arrays are aligned, so colValues(i) is the value in column colIds(i). This is compact & lets you iterate over all the values easily & efficiently.

For really sparse data, you might even want to group multiple rows together into one set of arrays:

    class SparseMatrixBlock(rowStartIdx: Array[Int], colIds: Array[Int], colValues: Array[Double])

which would store a group of rows (or columns), saving space & keeping cache locality by not using a separate pair of arrays for each one.

I'm speaking just from experience with non-distributed implementations, but the savings in both speed & memory from a format like this are pretty humongous, and I'd imagine they apply to Spark as well. (And if not, it probably means Spark is introducing some overhead that we should try to eliminate.) I know the format is not as pleasant to work with, but I think it's definitely worth it, and now is really the time to get it right, before a lot of stuff is written. We could write a bunch of helper methods for SparseMatrixBlock to make it easier to work with.

I know this is a big decision that will affect a lot of code. Not sure if anybody else has some work in progress. I could potentially try to get some performance numbers for comparison and point at some code I've used in the past, though it's all very rough.
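To make the idea concrete, here's a rough sketch of the two classes (the names follow my proposal above; the dot product and row accessor are just illustrative helpers, not a proposed API). SparseMatrixBlock is essentially the standard CSR layout, where rowStartIdx(i) to rowStartIdx(i + 1) delimits row i's entries in the shared arrays:

```scala
// Sketch only: aligned-array sparse vector, as proposed above.
class SparseVector(val colIds: Array[Int], val colValues: Array[Double]) {
  require(colIds.length == colValues.length)

  // Dot product against a dense vector: one pass over the aligned arrays,
  // no object allocation per entry.
  def dot(dense: Array[Double]): Double = {
    var sum = 0.0
    var i = 0
    while (i < colIds.length) {
      sum += colValues(i) * dense(colIds(i))
      i += 1
    }
    sum
  }
}

// Sketch only: CSR-style block of rows sharing one pair of arrays.
// rowStartIdx has numRows + 1 entries; row i's data lives at positions
// rowStartIdx(i) until rowStartIdx(i + 1) in colIds/colValues.
class SparseMatrixBlock(val rowStartIdx: Array[Int],
                        val colIds: Array[Int],
                        val colValues: Array[Double]) {
  def numRows: Int = rowStartIdx.length - 1

  // Materialize row i as a SparseVector (copies; a real implementation
  // would probably iterate in place instead).
  def row(i: Int): SparseVector = {
    val from = rowStartIdx(i)
    val to = rowStartIdx(i + 1)
    new SparseVector(colIds.slice(from, to), colValues.slice(from, to))
  }
}
```

For example, a 2x4 matrix with entries (0,0)=1.0, (0,3)=2.0, (1,1)=3.0 would be stored as `new SparseMatrixBlock(Array(0, 2, 3), Array(0, 3, 1), Array(1.0, 2.0, 3.0))` — three small arrays total, instead of three MatrixEntry objects on the heap.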
thanks,
Imran

On Fri, Jan 31, 2014 at 1:14 PM, Xiangrui Meng <[email protected]> wrote:
> Hi Jason,
>
> Sorry, I didn't see this message before I replied in another thread.
> So the following is copy-and-paste:
>
> We are currently working on the sparse data support, one of the
> highest priority features for MLlib. All existing algorithms will
> support sparse input. We will open a JIRA ticket for progress tracking
> and discussions.
>
> Best,
> Xiangrui
>
> On Fri, Jan 31, 2014 at 10:49 AM, jshao <[email protected]> wrote:
> > Hi,
> >
> > Spark is absolutely amazing for machine learning as its iterative process is
> > super fast. However one big issue that I realized was that the MLLib API
> > isn't suitable for sparse inputs at all because it requires the feature
> > vector to be a dense array.
> >
> > For example, I currently want to run a logistic regression on data that is
> > wide and sparse (each data point might have 3 million fields with most of
> > them being 0). It is impossible to represent each data point as an array of
> > length 3 million.
> >
> > Can I expect/contribute to any changes that might handle sparse inputs?
> >
> > Thanks,
> > Jason
> >
> > --
> > View this message in context:
> > http://apache-spark-user-list.1001560.n3.nabble.com/MLLib-Sparse-Input-tp1085.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
