Re: SVD on larger than taller matrix
The main bottleneck of the current SVD implementation is the memory of the driver node. It requires at least 5*n*k doubles in driver memory, because all right singular vectors are stored on the driver and some working memory is needed on top of that. So it is bounded by the smaller dimension of your matrix and by k. For the worker nodes the memory requirements should be much smaller, as long as you can distribute your sparse matrix across the workers' memory. If possible, ask for more memory on the driver node while keeping the worker nodes' memory small. Meanwhile, we are working on removing this limitation by implementing distributed QR and Lanczos in Spark. (A spark-shell sketch along these lines follows the quoted thread below.)

On Thu, Sep 18, 2014 at 1:26 PM, Xiangrui Meng men...@gmail.com wrote:

Did you cache `features`? Without caching it is slow because we need O(k) iterations. The storage requirement on the driver is about 2 * n * k = 2 * 3 million * 200 ~= 9GB, not counting any overhead. Computing U is also an expensive task in your case. We should use some randomized SVD implementation for your data, but this is not available yet. I would recommend setting --driver-memory 25g, caching `features`, and using a smaller k.

-Xiangrui

On Thu, Sep 18, 2014 at 1:02 PM, Glitch atremb...@datacratic.com wrote:

I have a sparse matrix of about 2 million+ rows and 3 million+ columns in svm format*. As I understand it, running SVD on such a matrix shouldn't be a problem since version 1.1.

I'm using 10 worker nodes on EC2, each with 30G of RAM (r3.xlarge). I was able to compute the SVD for 20 singular values, but it fails with a Java heap space error for 200 singular values. I'm currently trying 100.

So my question is this: what kind of cluster do you need to perform this task? As I don't have any measurable experience with Spark, I can't tell whether this is normal; my test with 100 singular values has been running for over an hour.

I'm using this dataset: http://archive.ics.uci.edu/ml/datasets/URL+Reputation

I'm running the spark-shell with --executor-memory 15G --driver-memory 15G, and the few lines of code are:

    import org.apache.spark.mllib.linalg.distributed.RowMatrix
    import org.apache.spark.mllib.util.MLUtils

    val data = MLUtils.loadLibSVMFile(sc, "all.svm", 3231961)
    val features = data.map(line => line.features)
    val mat = new RowMatrix(features)
    val svd = mat.computeSVD(200, computeU = true)

* svm format: label column number:value

--
Li
@vrilleup
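Here is a minimal spark-shell sketch of the variant Xiangrui suggests (cache `features`, skip U, use a smaller k), assuming the same all.svm file and dimensions as in the quoted code; the choice k = 50 is only an illustration:

    import org.apache.spark.mllib.linalg.distributed.RowMatrix
    import org.apache.spark.mllib.util.MLUtils

    val data = MLUtils.loadLibSVMFile(sc, "all.svm", 3231961)
    // Cache the feature vectors: computeSVD makes O(k) passes over the data,
    // so without caching the RDD is recomputed from disk on every iteration.
    val features = data.map(_.features).cache()
    val mat = new RowMatrix(features)
    // k = 50 keeps the driver-side footprint near 5 * n * k doubles
    // (5 * 3.2e6 * 50 * 8 bytes ~ 6.5 GB), which fits in --driver-memory 25g.
    // computeU = false avoids the expensive computation of U.
    val svd = mat.computeSVD(50, computeU = false)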
Re: How can I implement eigenvalue decomposition in Spark?
@Miles, eigen-decomposition of an asymmetric matrix doesn't always give real-valued solutions, and it doesn't have the nice properties that the symmetric case enjoys. Usually you want to symmetrize your asymmetric matrix in some way, e.g. see http://machinelearning.wustl.edu/mlpapers/paper_files/icml2005_ZhouHS05.pdf. But as Sean mentioned, you can always compute the largest eigenpair with the power method or some variation of it like PageRank, which is already implemented in graphx. (A small power-iteration sketch follows the quoted thread below.)

On Fri, Aug 8, 2014 at 2:50 AM, Sean Owen so...@cloudera.com wrote:

The SVD does not in general give you the eigenvalues of its input. Are you just trying to access the U and V matrices? They are also returned in the API. But they are not the eigenvectors of M, as you note. I don't think MLlib has anything to help with the general eigenvector problem. Maybe you can implement a sort of power iteration algorithm using GraphX to find the largest eigenvector?

On Fri, Aug 8, 2014 at 4:07 AM, Chunnan Yao yaochun...@gmail.com wrote:

Hi there, what you've suggested is all meaningful. But to make myself clearer, my essential problems are:

1. My matrix is asymmetric: it is a probabilistic adjacency matrix whose entries a_ij represent the likelihood that user j will broadcast the information generated by user i. Apparently a_ij and a_ji are different, because "I love you" doesn't necessarily mean "you love me" (what a sad story~). All entries are real.

2. I know I can get eigenvalues through SVD. My problem is that I can't get the corresponding eigenvectors, which requires solving equations, and I also need the eigenvectors in my calculation. In my simulation of this paper I only need the biggest eigenvalue and the corresponding eigenvector. The paper posted by Shivaram Venkataraman is also concerned with symmetric matrices.

Could anyone help me out?

--
Li
@vrilleup
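A minimal power-iteration sketch, assuming the matrix is stored as an RDD of (row index, sparse row) pairs; this layout and the function itself are illustrative, not an MLlib or GraphX API. It finds the dominant eigenpair when the largest eigenvalue is simple, which is the typical setting after symmetrizing the matrix as discussed above:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.mllib.linalg.SparseVector

    def powerIteration(rows: RDD[(Int, SparseVector)], n: Int, iters: Int)
        : (Double, Array[Double]) = {
      val sc = rows.sparkContext
      var b = Array.fill(n)(1.0 / math.sqrt(n)) // initial unit vector
      var lambda = 0.0
      for (_ <- 1 to iters) {
        val bc = sc.broadcast(b)
        // One distributed matrix-vector product: row i yields entry i of A*b.
        val ab = rows.map { case (i, row) =>
          var dot = 0.0
          var j = 0
          while (j < row.indices.length) {
            dot += row.values(j) * bc.value(row.indices(j))
            j += 1
          }
          (i, dot)
        }.collectAsMap()
        val next = Array.tabulate(n)(i => ab.getOrElse(i, 0.0))
        // ||A*b|| converges to |lambda_max|; the sign can be recovered from
        // the Rayleigh quotient b^T A b if needed.
        lambda = math.sqrt(next.map(x => x * x).sum)
        b = next.map(_ / lambda)
        bc.unpersist()
      }
      (lambda, b)
    }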
Re: How can I implement eigenvalue decomposition in Spark?
@Miles, the latest SVD implementation in mllib is partially distributed. The matrix-vector multiplication is computed across all the workers, but the right singular vectors are all stored on the driver. If your symmetric matrix is n x n and you want the first k eigenvalues, you need to fit n * k doubles in the driver's memory. Behind the scenes it calls ARPACK to compute the eigen-decomposition of A^T A. You can look into the source code for the details. (A short sketch of this route follows the quoted thread below.)

@Sean, the SVD++ implementation in graphx is not the canonical definition of SVD; it doesn't have the orthogonality that the SVD guarantees. But we might want to use graphx as the underlying matrix representation for mllib.SVD to address the problem of skewed entry distributions.

On Thu, Aug 7, 2014 at 10:51 AM, Evan R. Sparks evan.spa...@gmail.com wrote:

Reza Zadeh has contributed the distributed implementation of (tall/skinny) SVD (http://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html), which is in MLlib (Spark 1.0), with a distributed sparse SVD coming in Spark 1.1 (https://issues.apache.org/jira/browse/SPARK-1782). If your data is sparse (which it often is in social networks), you may have better luck with this. I haven't tried the GraphX implementation, but those algorithms are often well suited for the power-law distributed graphs you see in social networks. FWIW, I believe you need to square the elements of the sigma matrix from the SVD to get the eigenvalues.

On Thu, Aug 7, 2014 at 10:20 AM, Sean Owen so...@cloudera.com wrote:

(-incubator, +user)

If your matrix is symmetric (and real, I presume), and if my linear algebra isn't too rusty, then its SVD is its eigendecomposition. The SingularValueDecomposition object you get back has U and V, both of which have columns that are the eigenvectors.

There are a few SVDs in the Spark code. The one in mllib is not distributed (right?) and is probably not an efficient means of computing eigenvectors if you really just want a decomposition of a symmetric matrix. The one I see in graphx is distributed? I haven't used it, though. Maybe it could be part of a solution.

On Thu, Aug 7, 2014 at 2:21 PM, yaochunnan yaochun...@gmail.com wrote:

Our lab needs to do some simulations of online social networks. We need to handle a 5000*5000 adjacency matrix, namely, to get its largest eigenvalue and the corresponding eigenvector. Matlab can be used, but it is time-consuming. Is Spark effective for linear algebra calculations and transformations? Later we would have 500*500 matrices to process. It seems urgent that we find some distributed computation platform.

I see SVD has been implemented, and I can get the eigenvalues of a matrix through this API. But when I want to get both eigenvalues and eigenvectors, or at least the biggest eigenvalue and the corresponding eigenvector, it seems that current Spark doesn't have such an API. Is it possible for me to write the eigenvalue decomposition from scratch? What should I do? Thanks a lot!

Miles Yao

--
Li
@vrilleup
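A short sketch of Sean's suggestion in code, assuming a real symmetric matrix already loaded as an RDD[Vector] of rows (the name `rows` is an assumption). For a symmetric positive semidefinite matrix the SVD coincides with the eigendecomposition, so the singular values are the eigenvalues and the columns of V are the eigenvectors; for an indefinite matrix the singular values only give the eigenvalues up to sign:

    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // n = 5000 here, so V (n x k doubles on the driver) is no problem.
    val mat = new RowMatrix(rows)

    // k = 1: only the largest eigenpair is needed.
    val svd = mat.computeSVD(1, computeU = false)
    val topEigenvalue = svd.s(0)
    // V is a local n x 1 matrix stored in column-major order;
    // its first column is the corresponding eigenvector.
    val topEigenvector = svd.V.toArray.slice(0, svd.V.numRows)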
Re: Recommended pipeline automation tool? Oozie?
I like the idea of using Scala to drive the workflow. Spark already comes with a scheduler, so why not write a plugin to schedule other types of tasks (copy a file, send an email, etc.)? Scala could handle any logic required by the pipeline, and passing objects (including RDDs) between tasks is also easier. I don't know whether this is an overuse of the Spark scheduler, but it sounds like a good tool. The only issue would be releasing resources that are no longer used after the intermediate steps. (A small sketch of the idea follows the quoted thread below.)

On Fri, Jul 11, 2014 at 12:05 PM, Wei Tan w...@us.ibm.com wrote:

Just curious: how about using Scala to drive the workflow? I guess if you use other tools (Oozie, etc.) you lose the advantage of reading from an RDD -- you have to read from HDFS.

Best regards,
Wei

---------------------------------
Wei Tan, PhD
Research Staff Member
IBM T. J. Watson Research Center
http://researcher.ibm.com/person/us-wtan

From: k.tham kevins...@gmail.com
To: u...@spark.incubator.apache.org
Date: 07/10/2014 01:20 PM
Subject: Recommended pipeline automation tool? Oozie?

I'm just wondering what the general recommendation is for data pipeline automation. Say I want to run Spark Job A, then B, then invoke script C, then do D, and if D fails, do E, and if Job A fails, send email F, etc. It looks like Oozie might be the best choice, but I'd like some advice/suggestions. Thanks!

--
Li
@vrilleup
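A minimal sketch of what "Scala drives the workflow" could look like, with the branching from k.tham's example expressed as plain control flow; all the task functions here are hypothetical placeholders, not an existing Spark or workflow API:

    import scala.util.{Failure, Success, Try}

    // Placeholder tasks -- stand-ins for real Spark jobs, scripts, etc.
    def jobA(): Unit = println("Spark job A")
    def jobB(): Unit = println("Spark job B")
    def scriptC(): Int = { println("invoking script C"); 0 } // exit code
    def stepD(): Unit = println("step D")
    def stepE(): Unit = println("fallback step E")
    def sendFailureEmail(t: Throwable): Unit = println(s"emailing failure: $t")

    // "Run A, then B, then C, then D; if D fails do E; if A fails email F"
    // becomes ordinary Scala, with no external workflow DSL.
    Try(jobA()) match {
      case Failure(t) => sendFailureEmail(t)
      case Success(_) =>
        jobB()
        require(scriptC() == 0, "script C returned a nonzero exit code")
        Try(stepD()).recover { case _ => stepE() }
    }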
Re: running SparkALS
http://spark.apache.org/docs/0.9.0/mllib-guide.html#collaborative-filtering-1

One thing that is undocumented: the integers representing users and items have to be positive; otherwise it throws exceptions.

Li

On 28 Apr 2014, at 10:30, Diana Carroll dcarr...@cloudera.com wrote:

Hi everyone. I'm trying to run some of the Spark example code, and most of it appears to be undocumented (unless I'm missing something). Can someone help me out?

I'm particularly interested in running SparkALS, which wants these parameters: M U F iter slices

What are these variables? They appear to be integers, and the default values are 100, 500 and 10 respectively, but beyond that... huh?

Thanks!
Diana
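If memory serves, in the SparkALS example source M, U, and F are the number of movies, the number of users, and the number of latent features, with iter the iteration count and slices the parallelism; please check the example's usage string to confirm. For real use, the mllib ALS from the guide linked above is the better entry point. A minimal sketch, where the file name ratings.csv and its "user,item,rating" line format are assumptions (and note the user and item IDs must be positive integers, per the caveat above):

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    // Parse "user,item,rating" lines into Rating(user, item, value).
    val ratings = sc.textFile("ratings.csv").map { line =>
      val Array(user, item, rating) = line.split(',')
      Rating(user.toInt, item.toInt, rating.toDouble)
    }

    val rank = 10        // number of latent features (the "F" of SparkALS)
    val numIterations = 20
    val model = ALS.train(ratings, rank, numIterations, 0.01)

    // Score a single (user, item) pair.
    val score = model.predict(1, 2)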