Re: SVD on larger than taller matrix

2014-09-18 Thread Li Pu
The main bottleneck of the current SVD implementation is the memory of the
driver node. It requires at least 5*n*k doubles in driver memory, because
all right singular vectors are stored on the driver and some working memory
is needed as well. So it is bounded by the smaller dimension of your matrix
and by k. For the worker nodes, the memory requirement should be much
smaller, as long as you can distribute your sparse matrix across the
workers' memory. If possible, you can ask for more memory for the driver
node while keeping worker node memory small.

Meanwhile we are working on removing this limitation by implementing
distributed QR and Lanczos in Spark.
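
As a back-of-the-envelope sketch of the 5*n*k figure above (the numbers are
illustrative, taken from the matrix size quoted below):

val n = 3231961L                     // smaller matrix dimension in this thread
val k = 200L                         // requested singular values
val approxBytes = 5L * n * k * 8L    // 5*n*k doubles, 8 bytes each
println(approxBytes / 1e9)           // ~25.9 GB of driver memory, before JVM overhead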

On Thu, Sep 18, 2014 at 1:26 PM, Xiangrui Meng men...@gmail.com wrote:

 Did you cache `features`? Without caching it is slow because we need
 O(k) iterations. The storage requirement on the driver is about 2 * n * k
 doubles = 2 * 3 million * 200 * 8 bytes ~= 9GB, not considering any
 overhead.
 Computing U is also an expensive task in your case. We should use some
 randomized SVD implementation for your data, but this is not available
 now. I would recommend setting driver-memory 25g, caching `features`,
 and using a smaller k. -Xiangrui
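
 A minimal sketch applying these suggestions (cache `features`, raise driver
 memory, use a smaller k). The file name and dimension come from the code
 quoted below; the flag values follow the advice above and are only a
 starting point:

 // e.g. launch with: ./bin/spark-shell --executor-memory 15g --driver-memory 25g
 import org.apache.spark.mllib.linalg.distributed.RowMatrix
 import org.apache.spark.mllib.util.MLUtils

 val data = MLUtils.loadLibSVMFile(sc, "all.svm", 3231961)
 val features = data.map(_.features).cache()       // reused across the O(k) iterations
 val mat = new RowMatrix(features)
 val svd = mat.computeSVD(100, computeU = false)    // smaller k; skip U unless you need it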

 On Thu, Sep 18, 2014 at 1:02 PM, Glitch atremb...@datacratic.com wrote:
  I have a matrix of about 2 million+ rows and 3 million+ columns in svm
  format*, and it's sparse. As I understand it, running SVD on such a matrix
  shouldn't be a problem since version 1.1.
 
  I'm using 10 worker nodes on EC2, each with 30G of RAM (r3.xlarge). I was
  able to compute the SVD for 20 singular values, but it fails with a Java
  Heap Size error for 200 singular values. I'm currently trying 100.
 
  So my question is this: what kind of cluster do you need to perform this
  task? As I do not have any measurable experience with Spark, I can't say
  if this is normal: my test for 100 singular values has been running for
  over an hour.
 
  I'm using this dataset
  http://archive.ics.uci.edu/ml/datasets/URL+Reputation
 
  I'm using the spark-shell with --executor-memory 15G --driver-memory 15G
 
 
  And the few lines of code are:

  import org.apache.spark.mllib.linalg.distributed.RowMatrix
  import org.apache.spark.mllib.util.MLUtils
  val data = MLUtils.loadLibSVMFile(sc, "all.svm", 3231961)
  val features = data.map(line => line.features)
  val mat = new RowMatrix(features)
  val svd = mat.computeSVD(200, computeU = true)

  * svm format: each line is label index:value index:value ...
 
 
 




-- 
Li
@vrilleup


Re: How can I implement eigenvalue decomposition in Spark?

2014-08-08 Thread Li Pu
@Miles, eigendecomposition of an asymmetric matrix doesn't always give
real-valued solutions, and it doesn't have the nice properties that the
symmetric case does. Usually you want to symmetrize your asymmetric matrix
in some way, e.g. see
http://machinelearning.wustl.edu/mlpapers/paper_files/icml2005_ZhouHS05.pdf
but, as Sean mentioned, you can always compute the largest eigenpair with
the power method or a variation like PageRank, which is already implemented
in GraphX.
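
As a rough illustration of the power-method route, here is a minimal sketch.
It assumes the matrix is available as an RDD of (rowIndex, sparse row) pairs;
the fixed iteration count and all names are illustrative, and for a symmetric
matrix it estimates the magnitude of the dominant eigenvalue:

import org.apache.spark.rdd.RDD

def powerIteration(rows: RDD[(Long, Array[(Int, Double)])], n: Int,
                   iters: Int = 50): (Double, Array[Double]) = {
  val sc = rows.sparkContext
  var v = Array.fill(n)(1.0 / math.sqrt(n))    // start from a unit vector
  var lambda = 0.0
  for (_ <- 1 to iters) {
    val bv = sc.broadcast(v)
    // y = A * v, one dot product per distributed row
    val y = rows.map { case (i, row) =>
      (i, row.map { case (j, a) => a * bv.value(j) }.sum)
    }.collect()
    val w = new Array[Double](n)
    y.foreach { case (i, s) => w(i.toInt) = s }
    lambda = math.sqrt(w.map(x => x * x).sum)  // ||A v||, i.e. an estimate of |lambda_max|
    v = w.map(_ / lambda)
  }
  (lambda, v)
}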


On Fri, Aug 8, 2014 at 2:50 AM, Sean Owen so...@cloudera.com wrote:

 The SVD does not in general give you eigenvalues of its input.

 Are you just trying to access the U and V matrices? They are also
 returned in the API. But they are not the eigenvectors of M, as you
 note.

 I don't think MLlib has anything to help with the general eigenvector
 problem.
 Maybe you can implement a sort of power iteration algorithm using
 GraphX to find the largest eigenvector?

 On Fri, Aug 8, 2014 at 4:07 AM, Chunnan Yao yaochun...@gmail.com wrote:
  Hi there, what you've suggested are all meaningful. But to make myself
  clearer, my essential problems are:
  1. My matrix is asymmetric, and it is a probabilistic adjacency matrix,
  whose entries (a_ij) represent the likelihood that user j will broadcast
  the information generated by user i. Apparently, a_ij and a_ji are
  different, because "I love you" doesn't necessarily mean "you love me"
  (what a sad story~). All entries are real.
  2. I know I can get eigenvalues through SVD. My problem is I can't get the
  corresponding eigenvectors, which requires solving equations, and I also
  need eigenvectors in my calculation. In my simulation of this paper, I
  only need the biggest eigenvalues and the corresponding eigenvectors.
  The paper posted by Shivaram Venkataraman is also concerned with a
  symmetric matrix. Could anyone help me out?





-- 
Li
@vrilleup


Re: How can I implement eigenvalue decomposition in Spark?

2014-08-07 Thread Li Pu
@Miles, the latest SVD implementation in mllib is partially distributed.
Matrix-vector multiplication is computed among all workers, but the right
singular vectors are all stored in the driver. If your symmetric matrix is
n x n and you want the first k eigenvalues, you will need to fit n x k
doubles in the driver's memory. Behind the scenes, it calls ARPACK to
compute the eigendecomposition of A^T A. You can look into the source code
for the details.
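
For the symmetric case in this thread, a minimal sketch of recovering
eigenpairs from computeSVD, assuming `rows` is an RDD[Vector] holding the
rows of a real symmetric positive semi-definite matrix A (for an indefinite
symmetric matrix, the singular values only give the eigenvalue magnitudes):

import org.apache.spark.mllib.linalg.distributed.RowMatrix

val mat = new RowMatrix(rows)                    // rows: RDD[Vector] of the symmetric matrix
val svd = mat.computeSVD(1, computeU = false)    // top eigenpair only
val topEigenvalue = svd.s(0)                     // equals the top singular value for PSD A
val topEigenvector = (0 until svd.V.numRows).map(i => svd.V(i, 0)).toArray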

@Sean, the SVD++ implementation in graphx is not the canonical definition
of SVD. It doesn't have the orthogonality property that the SVD guarantees.
But we might want to use graphx as the underlying matrix representation for
mllib.SVD to address the problem of skewed entry distributions.


On Thu, Aug 7, 2014 at 10:51 AM, Evan R. Sparks evan.spa...@gmail.com
wrote:

 Reza Zadeh has contributed the distributed implementation of (Tall/Skinny)
 SVD (
 http://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html),
 which is in MLlib (Spark 1.0), with a distributed sparse SVD coming in Spark
 1.1 (https://issues.apache.org/jira/browse/SPARK-1782). If your data is
 sparse (which it often is in social networks), you may have better luck
 with this.

 I haven't tried the GraphX implementation, but those algorithms are often
 well-suited for power-law distributed graphs as you might see in social
 networks.

 FWIW, I believe you need to square elements of the sigma matrix from the
 SVD to get the eigenvalues.




 On Thu, Aug 7, 2014 at 10:20 AM, Sean Owen so...@cloudera.com wrote:

 (-incubator, +user)

 If your matrix is symmetric (and real I presume), and if my linear
 algebra isn't too rusty, then its SVD is its eigendecomposition. The
 SingularValueDecomposition object you get back has U and V, both of
 which have columns that are the eigenvectors.

 There are a few SVDs in the Spark code. The one in mllib is not
 distributed (right?) and is probably not an efficient means of
 computing eigenvectors if you really just want a decomposition of a
 symmetric matrix.

 The one I see in graphx is distributed? I haven't used it though.
 Maybe it could be part of a solution.



 On Thu, Aug 7, 2014 at 2:21 PM, yaochunnan yaochun...@gmail.com wrote:
  Our lab needs to do some simulation on online social networks. We need to
  handle a 5000*5000 adjacency matrix, namely, to get its largest eigenvalue
  and the corresponding eigenvector. Matlab can be used but it is
  time-consuming. Is Spark effective in linear algebra calculations and
  transformations? Later we would have a 500*500 matrix processed. It seems
  urgent that we should find some distributed computation platform.

  I see SVD has been implemented and I can get the eigenvalues of a matrix
  through this API. But when I want to get both eigenvalues and
  eigenvectors, or at least the biggest eigenvalue and the corresponding
  eigenvector, it seems that current Spark doesn't have such an API. Is it
  possible that I write eigenvalue decomposition from scratch? What should I
  do? Thanks a lot!
 
 
  Miles Yao
 
  





-- 
Li
@vrilleup


Re: Recommended pipeline automation tool? Oozie?

2014-07-11 Thread Li Pu
I like the idea of using Scala to drive the workflow. Spark already comes
with a scheduler, so why not write a plugin to schedule other types of
tasks (copy a file, send an email, etc.)? Scala could handle any logic
required by the pipeline. Passing objects (including RDDs) between tasks is
also easier. I don't know if this is an overuse of the Spark scheduler, but
it sounds like a good tool. The only issue would be releasing resources
that are no longer needed after intermediate steps.
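
A toy sketch of what driving such a pipeline from plain Scala could look like
(jobA/jobB stand in for Spark jobs; every name here is illustrative):

import scala.sys.process._
import scala.util.{Failure, Success, Try}

def jobA(): Try[Long] = Try { 42L /* e.g. someRdd.count() */ }
def jobB(): Try[Unit] = Try { /* another Spark job */ }
def scriptC(): Try[Int] = Try { "./script_c.sh".! }            // shell out to a script
def sendEmail(msg: String): Unit = println(s"would send email: $msg")

jobA() match {
  case Failure(e) => sendEmail(s"Job A failed: ${e.getMessage}")
  case Success(_) =>
    jobB().flatMap(_ => scriptC()) match {
      case Failure(e)    => sendEmail(s"downstream step failed: ${e.getMessage}")
      case Success(code) => println(s"pipeline done, script exit code $code")
    }
}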

On Fri, Jul 11, 2014 at 12:05 PM, Wei Tan w...@us.ibm.com wrote:

 Just curious: how about using scala to drive the workflow? I guess if you
 use other tools (oozie, etc) you lose the advantage of reading from RDD --
 you have to read from HDFS.

 Best regards,
 Wei

 -
 Wei Tan, PhD
 Research Staff Member
 IBM T. J. Watson Research Center
 *http://researcher.ibm.com/person/us-wtan*
 http://researcher.ibm.com/person/us-wtan



 From: k.tham kevins...@gmail.com
 To: u...@spark.incubator.apache.org
 Date: 07/10/2014 01:20 PM
 Subject: Recommended pipeline automation tool? Oozie?
 --



 I'm just wondering what's the general recommendation for data pipeline
 automation.

 Say, I want to run Spark Job A, then B, then invoke script C, then do D,
 and if D fails, do E, and if Job A fails, send email F, etc...

 It looks like Oozie might be the best choice. But I'd like some
 advice/suggestions.

 Thanks!








-- 
Li
@vrilleup


Re: running SparkALS

2014-04-28 Thread Li Pu
http://spark.apache.org/docs/0.9.0/mllib-guide.html#collaborative-filtering-1

One thing which is undocumented: the integers representing users and
items have to be positive. Otherwise it throws exceptions.
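
A minimal sketch following that guide (all values illustrative; note the
positive user and product ids, per the caveat above):

import org.apache.spark.mllib.recommendation.{ALS, Rating}

val ratings = sc.parallelize(Seq(
  Rating(1, 10, 4.0),          // Rating(user, product, rating); ids kept positive
  Rating(1, 20, 3.0),
  Rating(2, 10, 5.0)
))
val rank = 10                  // number of latent features (presumably the F parameter)
val numIterations = 10
val model = ALS.train(ratings, rank, numIterations, 0.01)
println(model.predict(2, 20))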

Li

On 28 Apr 2014, at 10:30, Diana Carroll dcarr...@cloudera.com wrote:

 Hi everyone.  I'm trying to run some of the Spark example code, and most of 
 it appears to be undocumented (unless I'm missing something).  Can someone 
 help me out?

 I'm particularly interested in running SparkALS, which wants parameters:
 M U F iter slices

 What are these variables?  They appear to be integers and the default values 
 are 100, 500 and 10 respectively but beyond that...huh?

 Thanks!

 Diana