Spark test error

2017-01-03 Thread Yanwei Wayne Zhang
I tried to run the tests in 'GeneralizedLinearRegressionSuite', and all tests passed except for test("read/write"), which yielded the following error message. Any suggestions on why this happened and how to fix it? Thanks. BTW, I ran the test in IntelliJ. The default jsonEncode only supports

Re: Invert large matrix

2016-12-29 Thread Yanwei Wayne Zhang
Thanks for the advice. I figured out a way to solve this problem by avoiding the matrix representation. Wayne From: Sean Owen <so...@cloudera.com> Sent: Thursday, December 29, 2016 1:52 PM To: Yanwei Wayne Zhang; user Subject: Re: Invert large matrix I

Invert large matrix

2016-12-28 Thread Yanwei Wayne Zhang
Hi all, I have a matrix X stored as RDD[SparseVector] that is high-dimensional, say 800 million rows and 2 million columns, and more than 95% of the entries are zero. Is there a way to invert (X'X + eye) efficiently, where X' is the transpose of X and eye is the identity matrix? I am thinking of
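
One matrix-free way to approach a system like this, offered only as a sketch and not as what the thread settled on: rather than forming the inverse, solve (X'X + eye) v = b for a given right-hand side b with conjugate gradient, which only needs products with X and X'. The pure-Python example below uses a hypothetical row format (a dict from column index to value) as a stand-in for RDD[SparseVector]; the names `SparseRow`, `apply_operator` and `solve_cg` are illustrative, and on a cluster the loop over rows would have to become a distributed aggregation.

```python
from typing import Dict, List

# Hypothetical stand-in for a SparseVector: column index -> nonzero value.
SparseRow = Dict[int, float]


def apply_operator(rows: List[SparseRow], u: List[float]) -> List[float]:
    """Return (X'X + I) u without ever materializing X'X."""
    out = list(u)  # the identity term contributes u itself
    for row in rows:
        # (row . u) is one entry of X u; scatter it back through the row for X'(X u).
        dot = sum(val * u[j] for j, val in row.items())
        for j, val in row.items():
            out[j] += val * dot
    return out


def solve_cg(rows: List[SparseRow], b: List[float],
             tol: float = 1e-10, max_iter: int = 1000) -> List[float]:
    """Solve (X'X + I) v = b by conjugate gradient (the operator is symmetric positive definite)."""
    v = [0.0] * len(b)
    r = list(b)  # residual b - A v for the starting guess v = 0
    p = list(r)
    rs_old = sum(x * x for x in r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        ap = apply_operator(rows, p)
        alpha = rs_old / sum(pi * api for pi, api in zip(p, ap))
        v = [vi + alpha * pi for vi, pi in zip(v, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = sum(x * x for x in r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return v


# Tiny example: X = [[1, 2], [0, 1]], so X'X + I = [[2, 2], [2, 6]].
rows = [{0: 1.0, 1: 2.0}, {1: 1.0}]
solution = solve_cg(rows, [4.0, 8.0])
```

Each iteration touches every nonzero of X once, so the cost scales with the number of nonzeros rather than with the 2 million by 2 million size of X'X.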

Use BLAS object for matrix operation

2016-11-03 Thread Yanwei Zhang
I would like to use some matrix operations in the BLAS object defined in ml.linalg. But for some reason, the Spark shell complains that it cannot locate this object. I have constructed an example below to illustrate the issue. Please advise how to fix this. Thanks. import

Use a specific partition of dataframe

2016-11-02 Thread Yanwei Zhang
Is it possible to retrieve a specific partition (e.g., the first partition) of a DataFrame and apply some function there? My data is too large, and I just want to get some approximate measures using the first few partitions in the data. I'll illustrate what I want to accomplish using the

Re: spark log field clarification

2015-05-15 Thread yanwei
Can anybody shed some light on this for me? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-log-field-clarification-tp22892p22904.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

spark log field clarification

2015-05-14 Thread yanwei
I am trying to extract the *output data size* information for *each task*. What *field(s)* should I look for, given the JSON-format log? Also, what does Result Size stand for? Thanks a lot in advance! -Yanwei
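
A rough sketch of where these values tend to sit, assuming the log is a Spark event log with one JSON object per line and that task-end events use the field names "Task Info", "Task Metrics", "Output Metrics", "Bytes Written" and "Result Size". Those names can differ between Spark versions, so check them against an actual log. As far as the TaskMetrics documentation goes, Result Size is the number of bytes of the serialized result a task sends back to the driver, not the amount of data it wrote out. The function name and sample lines below are hypothetical.

```python
import json
from typing import Dict, Iterable, Optional, Tuple


def task_output_sizes(
    lines: Iterable[str],
) -> Dict[int, Tuple[Optional[int], Optional[int]]]:
    """Map task ID -> (bytes written, result size) for each task-end event."""
    sizes = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        # Only task-end events carry per-task metrics; skip everything else.
        if event.get("Event") != "SparkListenerTaskEnd":
            continue
        task_id = event["Task Info"]["Task ID"]
        metrics = event.get("Task Metrics") or {}
        output = metrics.get("Output Metrics") or {}
        # Either value is None when the log does not record it for this task.
        sizes[task_id] = (output.get("Bytes Written"), metrics.get("Result Size"))
    return sizes


# Hypothetical, heavily trimmed log lines for illustration only.
sample_log = [
    '{"Event": "SparkListenerTaskEnd", "Task Info": {"Task ID": 0}, '
    '"Task Metrics": {"Result Size": 1902, "Output Metrics": {"Bytes Written": 52428}}}',
    '{"Event": "SparkListenerJobEnd", "Job ID": 0}',
]
print(task_output_sizes(sample_log))
```

Tasks that write no output may have no "Output Metrics" entry at all, which is why the lookup falls back to None instead of raising.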