Thank you, Xiangrui. Indeed; however, if the computation involves matrix operations, even locally, as in random forest, then when the data increases 2x, even the local computation time should increase by more than 2x. But I will test it with spark-perf and let you know!
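For anyone who wants to try the same measurement outside spark-perf, here is a minimal sketch of the kind of timing run I have in mind: MLlib K-means on a 1x sample and a 2x sample (the dataset simply unioned with itself) of the same data. The input path, k, and iteration count are made up for illustration:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    object ScalingTest {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("ScalingTest"))

        // Hypothetical input: one feature vector per line, space-separated.
        val base = sc.textFile("hdfs:///data/features.txt")
          .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
          .cache()

        // 2x data with the same distribution: union the dataset with itself.
        val doubled = base.union(base).cache()

        // Simple wall-clock timer for a block of work.
        def time(label: String)(body: => Unit): Unit = {
          val t0 = System.nanoTime()
          body
          println(s"$label: ${(System.nanoTime() - t0) / 1e9} s")
        }

        // Materialize both cached RDDs before timing, so load time is excluded.
        base.count(); doubled.count()

        time("k-means 1x") { KMeans.train(base, 10, 20) }
        time("k-means 2x") { KMeans.train(doubled, 10, 20) }

        sc.stop()
      }
    }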
On Tue, Apr 7, 2015 at 4:50 PM, Xiangrui Meng <men...@gmail.com> wrote:
> This could be empirically verified in spark-perf:
> https://github.com/databricks/spark-perf. Theoretically, it would be
> < 2x for k-means and logistic regression, because computation is
> doubled but communication cost remains the same. -Xiangrui
>
> On Tue, Apr 7, 2015 at 7:15 AM, Vasyl Harasymiv
> <vasyl.harasy...@gmail.com> wrote:
> > Hi Spark Community,
> >
> > Imagine you have a stable computing cluster (e.g. 5 nodes) with Hadoop
> > that does not run anything other than your Spark jobs.
> >
> > Now imagine you run simple machine learning on the data (e.g. 100MB):
> >
> > K-means - 5 min
> > Logistic regression - 5 min
> >
> > Now imagine that the volume of your data has doubled to 200MB and it
> > is still distributed across those 5 nodes.
> >
> > How much more time would this computation take?
> >
> > I presume more than 2x, e.g. K-means 25 min and logistic regression
> > 20 min.
> >
> > I just want to understand how data growth would impact computational
> > performance for ML (any model in your experience is fine), since my
> > gut feeling is that if data increases 2x, the computation on the same
> > cluster would increase > 2x.
> >
> > Thank you!
> > Vasyl
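To make Xiangrui's "< 2x" argument concrete, a tiny sketch of the cost model, with purely illustrative constants: per-iteration time is a compute term that scales linearly with data size plus a communication term (broadcasting centers / aggregating gradients) that does not depend on data size.

    object CostModel {
      // Hypothetical constants: compute seconds per MB per iteration,
      // and fixed communication seconds per iteration.
      val computePerMb = 0.04
      val commPerIter  = 1.0

      // Per-iteration time: linear compute + size-independent communication.
      def iterTime(mb: Double): Double = computePerMb * mb + commPerIter

      def main(args: Array[String]): Unit = {
        val t1 = iterTime(100) // 5.0 s
        val t2 = iterTime(200) // 9.0 s
        println(f"ratio = ${t2 / t1}%.2f") // 1.80, i.e. < 2x
      }
    }

Because only the compute term doubles while the communication term stays fixed, the ratio is strictly below 2 whenever communication is nonzero; it would exceed 2 only if the local computation were superlinear in the data size, which is the random-forest concern above.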