This could be empirically verified with spark-perf: https://github.com/databricks/spark-perf. Theoretically, the runtime should increase by less than 2x for k-means and logistic regression, because the computation doubles while the per-iteration communication cost (broadcasting cluster centers or aggregating gradients) stays the same. -Xiangrui
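For a quick back-of-the-envelope check without running the full spark-perf suite, a minimal sketch like the one below times MLlib's KMeans at 1x and 2x row counts on the same cluster. The row counts, dimensionality, and random data generator are illustrative assumptions, not part of spark-perf:

    // Minimal sketch: time MLlib k-means at 1x and 2x data volume on the
    // same cluster. Sizes and the random generator are assumptions.
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors
    import scala.util.Random

    object ScalingCheck {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("ScalingCheck"))

        // Two runs: a baseline row count, then double the rows.
        for (rows <- Seq(1000000, 2000000)) {
          val data = sc.parallelize(0 until rows, 100)
            .mapPartitionsWithIndex { (i, it) =>
              val rnd = new Random(i)  // per-partition RNG, seeded by index
              it.map(_ => Vectors.dense(Array.fill(10)(rnd.nextDouble())))
            }.cache()
          data.count()  // materialize the cached RDD before timing

          val start = System.nanoTime()
          KMeans.train(data, 10, 20)  // k = 10 clusters, 20 iterations
          val secs = (System.nanoTime() - start) / 1e9
          println(f"rows=$rows%d took $secs%.1f s")

          data.unpersist()
        }
        sc.stop()
      }
    }

If the theory above holds, the second timing should come in somewhat under 2x the first, since the per-iteration shuffle payload (k cluster centers) does not grow with the number of rows.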
On Tue, Apr 7, 2015 at 7:15 AM, Vasyl Harasymiv <vasyl.harasy...@gmail.com> wrote:
> Hi Spark Community,
>
> Imagine you have a stable computing cluster (e.g. 5 nodes) with Hadoop
> that does not run anything other than your Spark jobs.
>
> Now imagine you run simple machine learning on the data (e.g. 100MB):
>
> K-means - 5 min
> Logistic regression - 5 min
>
> Now imagine that the volume of your data has doubled (2x, to 200MB) and
> it is still distributed across those 5 nodes.
>
> How much more time would this computation take now?
>
> I presume more than 2x, e.g. K-means 25 min and logistic regression
> 20 min?
>
> I just want to understand how data growth would impact computational
> performance for ML (any model in your experience is fine), since my gut
> feeling is that if data increases 2x, the computation on the same
> cluster would increase > 2x.
>
> Thank you!
> Vasyl