Hi Imran,

Yes, for better computational performance we should use the CSC or CSR
format, and I believe we are moving in that direction. But let us first
discuss the format for sparse input. Do you mind moving our discussion to
the JIRA I created?
https://spark-project.atlassian.net/browse/MLLIB-18

Xiangrui

On Wed, Feb 5, 2014 at 2:11 PM, Imran Rashid <[email protected]> wrote:
> Hi,
>
> I was hoping to have some discussion about how sparse matrices are
> represented in MLlib. I noticed a few commits to add a basic MatrixEntry
> object:
>
> https://github.com/apache/incubator-spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/MatrixEntry.scala
>
> I know that MatrixEntry is a really nice abstraction, but in my
> experience, putting all your data in objects can really destroy
> performance in numerical computation. You either waste a lot of memory
> with that representation, or you waste a ton of compute time serializing
> and deserializing. Either way, you also lose cache locality for your
> data.
>
> I'm thinking we'd want the basic record type to be a sparse vector:
>
> class SparseVector(colIds: Array[Int], colValues: Array[Double])
>
> where the two arrays are aligned. This is compact and lets you iterate
> over all the values easily and efficiently. For really sparse data, you
> might even want to group multiple rows together into one set of arrays:
>
> class SparseMatrixBlock(rowStartIdx: Array[Int], colIds: Array[Int],
>                         colValues: Array[Double])
>
> which would store a group of rows (or columns), saving space and gaining
> cache locality by not using a separate array for each one.
>
> I'm speaking just from experience with non-distributed implementations,
> but the savings in both speed and memory from using a format like this
> are pretty humongous; I'd imagine they apply to Spark also. (And if not,
> it probably means Spark is introducing some overhead that we should try
> to eliminate.)
>
> I know the format is not as pleasant to work with, but I think it's
> definitely worth it. And now is really the time to get it right, before
> a lot of stuff is written. We could write a bunch of helper methods for
> SparseMatrixBlock to make it easier to work with.
>
> I know this is a big decision that will affect a lot of code. Not sure
> if anybody else has some work in progress. I could potentially try to
> get some performance numbers for comparison and point at some code I've
> used in the past, though it's all very rough.
>
> thanks,
> Imran
>
>
> On Fri, Jan 31, 2014 at 1:14 PM, Xiangrui Meng <[email protected]> wrote:
>>
>> Hi Jason,
>>
>> Sorry, I didn't see this message before I replied in another thread,
>> so the following is copy-and-pasted:
>>
>> We are currently working on sparse data support, one of the
>> highest-priority features for MLlib. All existing algorithms will
>> support sparse input. We will open a JIRA ticket for progress tracking
>> and discussions.
>>
>> Best,
>> Xiangrui
>>
>> On Fri, Jan 31, 2014 at 10:49 AM, jshao <[email protected]> wrote:
>> > Hi,
>> >
>> > Spark is absolutely amazing for machine learning, as its iterative
>> > process is super fast. However, one big issue I noticed is that the
>> > MLlib API isn't suitable for sparse inputs at all, because it
>> > requires the feature vector to be a dense array.
>> >
>> > For example, I currently want to run a logistic regression on data
>> > that is wide and sparse (each data point might have 3 million fields,
>> > with most of them being 0). It is impractical to represent each data
>> > point as a dense array of length 3 million.
>> >
>> > Can I expect/contribute to any changes that might handle sparse
>> > inputs?
>> >
>> > Thanks,
>> > Jason
>> >
>> >
>> > --
>> > View this message in context:
>> > http://apache-spark-user-list.1001560.n3.nabble.com/MLLib-Sparse-Input-tp1085.html
>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
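The SparseVector and SparseMatrixBlock record types Imran proposes can be made concrete as a small self-contained Scala sketch. The `require` check, the `dot` helper, and the `row` helper are hypothetical conveniences added here for illustration; only the constructor shapes come from the thread:

```scala
// SparseVector: two aligned arrays holding one record's nonzero entries.
class SparseVector(val colIds: Array[Int], val colValues: Array[Double]) {
  require(colIds.length == colValues.length, "arrays must be aligned")

  // Dot product against a dense vector, touching only the nonzeros.
  def dot(dense: Array[Double]): Double = {
    var sum = 0.0
    var i = 0
    while (i < colIds.length) {
      sum += colValues(i) * dense(colIds(i))
      i += 1
    }
    sum
  }
}

// SparseMatrixBlock: a CSR-style layout for a group of rows.
// Row r's entries live at positions rowStartIdx(r) until rowStartIdx(r + 1),
// so rowStartIdx has (numRows + 1) elements.
class SparseMatrixBlock(val rowStartIdx: Array[Int],
                        val colIds: Array[Int],
                        val colValues: Array[Double]) {
  def numRows: Int = rowStartIdx.length - 1

  // Materialize row r as a SparseVector (convenient, but it copies).
  def row(r: Int): SparseVector =
    new SparseVector(colIds.slice(rowStartIdx(r), rowStartIdx(r + 1)),
                     colValues.slice(rowStartIdx(r), rowStartIdx(r + 1)))
}
```

For example, a 2x4 block with row 0 = {0 -> 1.0, 2 -> 3.0} and row 1 = {3 -> 5.0} is `new SparseMatrixBlock(Array(0, 2, 3), Array(0, 2, 3), Array(1.0, 3.0, 5.0))` — three flat arrays regardless of how many rows the block holds, which is where the cache-locality argument comes from.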

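For the wide-and-sparse case Jason describes, the payoff of an aligned-array layout is that a logistic regression prediction only visits a point's nonzero entries. A minimal sketch, assuming a dense weights array; the `predict` function and its names are illustrative, not MLlib API:

```scala
// Predict with logistic regression over a sparse point: the weights
// array is dense (length = number of features), but the margin w . x
// only visits the point's nonzero entries, so a point with a handful
// of nonzeros costs a handful of multiply-adds even in 3 million
// dimensions.
def predict(weights: Array[Double],
            colIds: Array[Int],
            colValues: Array[Double]): Double = {
  var margin = 0.0
  var i = 0
  while (i < colIds.length) {
    margin += colValues(i) * weights(colIds(i))
    i += 1
  }
  1.0 / (1.0 + math.exp(-margin)) // sigmoid of the margin
}
```

A point with two nonzeros in a 3-million-feature space then costs two multiply-adds: `predict(w, Array(7, 1999999), Array(2.0, 4.0))` reads only `w(7)` and `w(1999999)`.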