How can I do pair-wise computation between RDD feature columns?

2015-05-16 Thread Chunnan Yao
Hi all, Recently I've run into a scenario where I need to conduct two-sample tests between all paired combinations of columns of an RDD. But the network load and the generation of the pair-wise computations are too time-consuming. That has puzzled me for a long time. I want to conduct the Wilcoxon rank-sum test (http://en
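A minimal sketch (not from the original post) of one way to set up the pair-wise computation in Spark/Scala: transpose the RDD[Vector] so each record holds one full column, then use cartesian() to form the column pairs that can each feed a two-sample test. The helper name pairwiseColumns and the assumption that a single column's values fit in executor memory are illustrative.

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Hypothetical helper: turn an RDD of rows into one record per column,
// then pair up the columns. Assumes each column fits in memory on an executor.
def pairwiseColumns(data: RDD[Vector]): RDD[((Int, Array[Double]), (Int, Array[Double]))] = {
  // Emit (columnIndex, (rowIndex, value)) for every cell, keeping row order via zipWithIndex.
  val cells = data.zipWithIndex().flatMap { case (vec, rowIdx) =>
    vec.toArray.zipWithIndex.map { case (value, colIdx) => (colIdx, (rowIdx, value)) }
  }
  // Group each column's values and restore row order.
  val columns: RDD[(Int, Array[Double])] = cells.groupByKey().mapValues { iter =>
    iter.toArray.sortBy(_._1).map(_._2)
  }
  // All unordered pairs of distinct columns; each pair can feed a rank-sum test.
  columns.cartesian(columns).filter { case ((i, _), (j, _)) => i < j }
}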

Possible long lineage issue when using DStream to update a normal RDD

2015-05-07 Thread Chunnan Yao
Hi all, Recently in our project, we need to update an RDD using data regularly received from a DStream. I plan to use the "foreachRDD" API to achieve this: var MyRDD = ... dstream.foreachRDD { rdd => MyRDD = MyRDD.join(rdd)... ... } Is this usage correct? My concern is, as I am repeatedly
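A minimal sketch of the pattern described above, with the usual fix for unbounded lineage: persist the accumulated RDD and checkpoint it every few batches so the lineage gets truncated (requires sc.setCheckpointDir to have been set). The names stateRDD and updates, and the union + reduceByKey combine step, are illustrative rather than the poster's actual code.

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.dstream.DStream

def keepUpdating(initial: RDD[(String, Int)], updates: DStream[(String, Int)]): Unit = {
  var stateRDD: RDD[(String, Int)] = initial
  var batchCount = 0L

  updates.foreachRDD { rdd =>
    // Combine the new batch into the accumulated state (the post used join;
    // union + reduceByKey is shown here only to keep the example self-contained).
    stateRDD = stateRDD.union(rdd).reduceByKey(_ + _)
    stateRDD.persist(StorageLevel.MEMORY_AND_DISK)

    batchCount += 1
    if (batchCount % 10 == 0) {
      // Checkpointing materializes the RDD and truncates its lineage.
      stateRDD.checkpoint()
      stateRDD.count() // force evaluation so the checkpoint actually happens
    }
  }
}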

Indices of SparseVector must be ordered while computing SVD

2015-04-22 Thread Chunnan Yao
Hi all, I am using Spark 1.3.1 to write a Spectral Clustering algorithm. This really confused me today. At first I thought my implementation was wrong. It turns out it's an issue in MLlib. Fortunately, I've figured it out. I suggest adding a hint to the MLlib user documentation (as far as I know, ther
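A minimal sketch of the pitfall and a workaround: MLlib's SparseVector does not sort its indices for you, so build (or re-sort) them in ascending order before passing the rows to RowMatrix.computeSVD. The sortIndices helper below is illustrative, not part of the MLlib API.

import org.apache.spark.mllib.linalg.{SparseVector, Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD

// Re-sort a SparseVector's (index, value) pairs into ascending index order.
def sortIndices(v: SparseVector): Vector = {
  val sorted = v.indices.zip(v.values).sortBy(_._1)
  Vectors.sparse(v.size, sorted.map(_._1), sorted.map(_._2))
}

// SVD over rows whose sparse indices may have been built out of order.
def svdOfSparseRows(rows: RDD[SparseVector], k: Int) = {
  val matrix = new RowMatrix(rows.map(sortIndices))
  matrix.computeSVD(k, computeU = true)
}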

Support parallelized online matrix factorization for Collaborative Filtering

2015-04-05 Thread Chunnan Yao
On-line Collaborative Filtering (CF) has been widely used and studied. Re-training a CF model from scratch every time new data comes in is very inefficient (http://stackoverflow.com/questions/27734329/apache-spark-incremental-training-of-als-model). However, in the Spark community we see few discus
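A minimal sketch (plain Scala, not an existing MLlib API) of the idea behind online matrix factorization: when a new rating arrives, update only the affected user and item factor vectors with a single SGD step instead of retraining from scratch. The stepSize and lambda parameters are illustrative hyper-parameters.

// One stochastic-gradient update for a single observed rating under the usual
// regularized squared-error objective for matrix factorization.
def sgdUpdate(
    userFactors: Array[Double],
    itemFactors: Array[Double],
    rating: Double,
    stepSize: Double = 0.01,
    lambda: Double = 0.1): (Array[Double], Array[Double]) = {
  val prediction = userFactors.zip(itemFactors).map { case (u, i) => u * i }.sum
  val err = rating - prediction
  val newUser = userFactors.zip(itemFactors).map { case (u, i) => u + stepSize * (err * i - lambda * u) }
  val newItem = userFactors.zip(itemFactors).map { case (u, i) => i + stepSize * (err * u - lambda * i) }
  (newUser, newItem)
}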

Is this a bug in MLlib.stat.test ? About the mapPartitions API used in Chi-Squared test

2015-03-12 Thread Chunnan Yao
Hi everyone! I am currently digging into MLlib in Spark 1.2.1. While reading the code of MLlib.stat.test, in the file ChiSqTest.scala under /spark/mllib/src/main/scala/org/apache/spark/mllib/stat/test, I am confused by the usage of the mapPartitions API in the function def chiSquaredFeatures(data: RDD[La
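For readers unfamiliar with the pattern, here is a minimal sketch of the kind of per-partition aggregation that mapPartitions enables in a test like this: counts are accumulated locally inside each partition before any shuffle, so only one count map per partition is exchanged. This is an illustration of the pattern, not the actual chiSquaredFeatures implementation.

import org.apache.spark.SparkContext._ // pair-RDD implicits (needed on Spark 1.2)
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Count (featureIndex, featureValue, label) occurrences, aggregating within
// each partition first and only then reducing across partitions.
def pairCounts(data: RDD[LabeledPoint], numFeatures: Int): RDD[((Int, Double, Double), Long)] = {
  data.mapPartitions { iter =>
    val counts = scala.collection.mutable.HashMap.empty[(Int, Double, Double), Long]
    iter.foreach { lp =>
      var col = 0
      while (col < numFeatures) {
        val key = (col, lp.features(col), lp.label)
        counts(key) = counts.getOrElse(key, 0L) + 1L
        col += 1
      }
    }
    counts.iterator
  }.reduceByKey(_ + _)
}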

Re: Spark development with IntelliJ

2015-01-17 Thread Chunnan Yao
Nice! - Feel the sparking Spark!

Re: Spark development with IntelliJ

2015-01-17 Thread Chunnan Yao
What follows is the discussion between Imran and me. 2015-01-18 4:12 GMT+08:00 Chunnan Yao : > Thank you for your patience! I'm not yet so familiar with the mailing list. > I just clicked "reply" in Gmail, thinking it would be automatically > attached to the list. I will la

Re: Spark development with IntelliJ

2015-01-17 Thread Chunnan Yao
I followed the procedure described at https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-IntelliJ, but problems still occur, which has made me a bit annoyed. My environment settings are: Java 1.7.0, Scala 2.10.4, Spark 1.2.0, IntelliJ IDEA 14.0.2