Thank you, Xiangrui. Indeed; however, if the computation involves matrix operations, even locally, as in random forest, then when the data increases 2x, even the local computation time should increase by more than 2x. But I will test it with spark-perf and let you know!
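For anyone who wants to try the same measurement outside spark-perf, here is a minimal sketch of the kind of timing run I have in mind: MLlib K-means on a 1x sample and a 2x sample (the dataset simply unioned with itself) of the same data. The input path, k, and iteration count are made up for illustration:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    object ScalingTest {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("ScalingTest"))

        // Hypothetical input: one feature vector per line, space-separated.
        val base = sc.textFile("hdfs:///data/features.txt")
          .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
          .cache()

        // 2x data with the same distribution: union the dataset with itself.
        val doubled = base.union(base).cache()

        // Simple wall-clock timer for a block of work.
        def time(label: String)(body: => Unit): Unit = {
          val t0 = System.nanoTime()
          body
          println(s"$label: ${(System.nanoTime() - t0) / 1e9} s")
        }

        // Materialize both cached RDDs before timing, so load time is excluded.
        base.count(); doubled.count()

        time("k-means 1x") { KMeans.train(base, 10, 20) }
        time("k-means 2x") { KMeans.train(doubled, 10, 20) }

        sc.stop()
      }
    }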
On Tue, Apr 7, 2015 at 4:50 PM, Xiangrui Meng <men...@gmail.com> wrote:
> This could be empirically verified in spark-perf:
> https://github.com/databricks/spark-perf. Theoretically, it would be
> < 2x for k-means and logistic regression, because computation is
> doubled but communication cost remains the same. -Xiangrui
>
> On Tue, Apr 7, 2015 at 7:15 AM, Vasyl Harasymiv
> <vasyl.harasy...@gmail.com> wrote:
> > Hi Spark Community,
> >
> > Imagine you have a stable computing cluster (e.g. 5 nodes) with Hadoop
> > that does not run anything other than your Spark jobs.
> >
> > Now imagine you run simple machine learning on the data (e.g. 100MB):
> >
> > K-means - 5 min
> > Logistic regression - 5 min
> >
> > Now imagine that the volume of your data has doubled to 200MB and it
> > is still distributed across those 5 nodes.
> >
> > How much more time would this computation take?
> >
> > I presume more than 2x, e.g. K-means 25 min and logistic regression
> > 20 min.
> >
> > I just want to understand how data growth would impact computational
> > performance for ML (any model in your experience is fine), since my
> > gut feeling is that if data increases 2x, the computation on the same
> > cluster would increase > 2x.
> >
> > Thank you!
> > Vasyl
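To make Xiangrui's "< 2x" argument concrete, a tiny sketch of the cost model, with purely illustrative constants: per-iteration time is a compute term that scales linearly with data size plus a communication term (broadcasting centers / aggregating gradients) that does not depend on data size.

    object CostModel {
      // Hypothetical constants: compute seconds per MB per iteration,
      // and fixed communication seconds per iteration.
      val computePerMb = 0.04
      val commPerIter  = 1.0

      // Per-iteration time: linear compute + size-independent communication.
      def iterTime(mb: Double): Double = computePerMb * mb + commPerIter

      def main(args: Array[String]): Unit = {
        val t1 = iterTime(100) // 5.0 s
        val t2 = iterTime(200) // 9.0 s
        println(f"ratio = ${t2 / t1}%.2f") // 1.80, i.e. < 2x
      }
    }

Because only the compute term doubles while the communication term stays fixed, the ratio is strictly below 2 whenever communication is nonzero; it would exceed 2 only if the local computation were superlinear in the data size, which is the random-forest concern above.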