Thank you Xiangrui,
Indeed. However, if the computation involves matrix operations, even locally,
as in random forest, then when the data increases 2x, even the local
computation time should increase by more than 2x. But I will test it with
spark-perf and let you know!
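(To illustrate the point above with a toy cost model, not a benchmark: split finding in tree learners typically sorts feature values per node, an O(n log n) step, so doubling the data more than doubles the local compute. The function name and numbers below are hypothetical.)

```python
import math

# Toy cost model (hypothetical, for illustration only): per-node split
# finding in a decision tree sorts n feature values, which costs
# O(n log n), so 2x data means more than 2x local compute.
def split_cost(n):
    return n * math.log2(n)

ratio = split_cost(2_000_000) / split_cost(1_000_000)
print(ratio)  # slightly above 2
```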
On Tue, Apr 7, 2015 at 4:50 PM, Xiangrui Meng wrote:
This could be empirically verified in spark-perf:
https://github.com/databricks/spark-perf. Theoretically, it would be <
2x for k-means and logistic regression, because computation is doubled
but communication cost remains the same. -Xiangrui
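(A toy cost model of that argument, with made-up numbers rather than measurements: if per-iteration compute grows linearly with the number of points while the cost of exchanging cluster centers or gradients stays fixed, doubling the data less than doubles the total time.)

```python
# Toy cost model (hypothetical constants, not measured): one iteration of
# k-means or logistic regression. Compute scales with the number of
# points; communication of centers/gradients is independent of it.
def iter_cost(n_points, compute_per_point=1e-6, comm_overhead=5.0):
    return compute_per_point * n_points + comm_overhead

ratio = iter_cost(2_000_000) / iter_cost(1_000_000)
print(ratio)  # between 1 and 2: the fixed communication cost amortizes
```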
On Tue, Apr 7, 2015 at 7:15 AM, Vasyl Harasymiv
wrote:
Hi Spark Community,
Imagine you have a stable computing cluster (e.g. 5 nodes) with Hadoop that
runs nothing but your Spark jobs.
Now imagine you run simple machine learning on the data (e.g. 100MB):
1. K-means - 5 min
2. Logistic regression - 5 min
Now imagine that the volume