PySpark RDD with NumpyArray Structure

Mustafa Elbehery Sun, 06 Dec 2015 13:00:23 -0800

Hi All,

I would like to parallelize a Python NumPy array so that I can apply a
scikit-learn algorithm on top of Spark. When I call sc.parallelize(), I
receive an RDD with a different structure.


To be more precise, I would like to have the following:

X = [[ 0.49426097  1.45106697]
 [-1.42808099 -0.83706377]
 [ 0.33855918  1.03875871]
 ...,
 [-0.05713876 -0.90926105]
 [-1.16939407  0.03959692]
 [ 0.26322951 -0.92649949]]


However, what I get when I call sc.parallelize(X) is the following:


[array([ 0.49426097,  1.45106697]), array([-1.42808099, -0.83706377])]
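
A possible explanation, sketched under assumptions: sc.parallelize() appears
to treat the 2-D array as a collection of rows, so each RDD element is one
1-D row array. The helper below (partition_to_matrix is a hypothetical name,
not part of Spark or NumPy) stacks the rows of one partition back into a 2-D
block with np.vstack. It is exercised here on plain Python iterators only;
the Spark calls in the trailing comments are assumed usage and are not run.

```python
import numpy as np


def partition_to_matrix(rows):
    """Stack the 1-D row arrays of one partition into a single 2-D array.

    Hypothetical helper intended for rdd.mapPartitions; it yields nothing
    for an empty partition.
    """
    rows = list(rows)
    if rows:
        yield np.vstack(rows)


# Local stand-in for the behaviour described above: iterating over a
# 2-D array yields its rows as separate 1-D arrays.
X = np.array([[0.49426097, 1.45106697],
              [-1.42808099, -0.83706377],
              [0.33855918, 1.03875871]])
rows = list(X)

# Rebuild one 2-D block from the rows, as a partition function would.
blocks = list(partition_to_matrix(iter(rows)))
X_back = blocks[0]  # same shape as X: (3, 2)

# Assumed Spark usage (not executed in this sketch):
#   rdd = sc.parallelize(X)                         # RDD of 1-D row arrays
#   block_rdd = rdd.mapPartitions(partition_to_matrix)  # RDD of 2-D blocks
```

Each 2-D block could then be passed to code that expects a matrix, such as a
scikit-learn estimator, one partition at a time.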



Has anyone tried this before?
