Hi team,

We are evaluating the performance and capability of Spark for a linear
regression application, with the goal of replacing (at least) sklearn's
linear regression.
We first generated data for model fitting via
sklearn.datasets.make_regression. See the generation code here:
<https://gist.github.com/yh882317/c177a8593e3bfd30e4da63a8fcb95593#file-datageneration-py>

Then we fit the model with both sklearn (Python)
<https://gist.github.com/yh882317/c177a8593e3bfd30e4da63a8fcb95593#file-linearregression-py>
and Spark (Scala)
<https://gist.github.com/yh882317/c177a8593e3bfd30e4da63a8fcb95593#file-linearregression-scala>.
We set up the models according to this Stack Overflow post
<https://stackoverflow.com/questions/42729431/spark-ml-regressions-do-not-calculate-same-models-as-scikit-learn>
to try to get the same model on both sides.
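
For reference, the Spark side looks roughly like this (a minimal sketch:
the file name and the "label" column name are placeholders, and our real
code is in the gist above):

    // Runs in spark-shell, where `spark` and `sc` are predefined.
    import org.apache.spark.ml.feature.VectorAssembler
    import org.apache.spark.ml.regression.LinearRegression

    // "data.csv" and the "label" column are placeholders for our generated data.
    val raw = spark.read.option("header", "true").option("inferSchema", "true").csv("data.csv")
    val featureCols = raw.columns.filter(_ != "label")
    val df = new VectorAssembler()
      .setInputCols(featureCols)
      .setOutputCol("features")
      .transform(raw)

    val lr = new LinearRegression()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setFitIntercept(true)      // sklearn fits an intercept by default
      .setRegParam(0.0)           // plain OLS, no regularization, as in sklearn
      .setStandardization(false)  // per the Stack Overflow answer, to match sklearn's coefficients
      .setSolver("normal")        // normal-equation solver, i.e. the WeightedLeastSquares path

    val model = lr.fit(df)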
The Spark setup is:

- Spark 3.1
- spark-shell -I
- Spark Standalone with 3 executors (8 cores, 128 GB each)
- Driver: spark.driver.maxResultSize=20g; spark.driver.memory=100g;
  spark.executor.memory=128g
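
To rule out the settings silently not taking effect, they can be checked
from inside the shell, e.g. (sketch):

    // Sanity check from inside spark-shell that the session picked up the settings.
    spark.conf.get("spark.driver.memory")
    spark.conf.get("spark.executor.memory")
    sc.getExecutorMemoryStatus.keys.foreach(println)  // one entry per executor, plus the driver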

Fitting speed:
Python: 19 s
Spark: 600+ s

Resource utilization:
Python: 150 GB
Spark: Node 1: 50+ GB; Nodes 2, 3 and the driver node: ~2 GB each

The job gets stuck on this stage:
MapPartitionsRDD [23]
treeAggregate at WeightedLeastSquares.scala:107
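
As far as we understand, this stage aggregates per-partition results with
one task per partition of the training data, so skew or a poor partition
count would show up here. A quick check on the assembled data (sketch,
using the df from the setup above):

    // One task per partition feeds the treeAggregate, so skewed or too-few
    // partitions would surface in this stage.
    println(df.rdd.getNumPartitions)
    df.rdd.mapPartitions(it => Iterator(it.size)).collect().foreach(println)  // rows per partition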

What we tried:
- maxTreeDepth
- repartition
- standardization

None of these had a significant effect on training speed; a sketch of what
we varied follows.
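
In code, roughly (assuming the "maxTreeDepth" above corresponds to
LinearRegression's aggregationDepth parameter, which controls the
treeAggregate fan-in; 24 partitions = 3 executors x 8 cores is just one of
the values we tried):

    // Roughly what we varied, on top of the setup shown earlier.
    val lrTuned = new LinearRegression()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setSolver("normal")
      .setStandardization(true)   // tried standardization both on and off
      .setAggregationDepth(4)     // deeper treeAggregate; the default is 2

    val model2 = lrTuned.fit(df.repartition(24))  // 24 = 3 executors x 8 cores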

Could you please help us figure out where the issue comes from? A 20x
slowdown is not acceptable for us, even though Spark has other features
we value.

Thanks!
You Hu
