This could be empirically verified with spark-perf: https://github.com/databricks/spark-perf. Theoretically, the runtime should increase by less than 2x for k-means and logistic regression, because the computation doubles while the per-iteration communication cost (broadcasting cluster centers or aggregating gradients) stays the same. -Xiangrui
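For a quick back-of-the-envelope check without running the full spark-perf suite, a minimal sketch like the one below times MLlib's KMeans at 1x and 2x row counts on the same cluster. The row counts, dimensionality, and random data generator are illustrative assumptions, not part of spark-perf:

    // Minimal sketch: time MLlib k-means at 1x and 2x data volume on the
    // same cluster. Sizes and the random generator are assumptions.
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors
    import scala.util.Random

    object ScalingCheck {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("ScalingCheck"))

        // Two runs: a baseline row count, then double the rows.
        for (rows <- Seq(1000000, 2000000)) {
          val data = sc.parallelize(0 until rows, 100)
            .mapPartitionsWithIndex { (i, it) =>
              val rnd = new Random(i)  // per-partition RNG, seeded by index
              it.map(_ => Vectors.dense(Array.fill(10)(rnd.nextDouble())))
            }.cache()
          data.count()  // materialize the cached RDD before timing

          val start = System.nanoTime()
          KMeans.train(data, 10, 20)  // k = 10 clusters, 20 iterations
          val secs = (System.nanoTime() - start) / 1e9
          println(f"rows=$rows%d took $secs%.1f s")

          data.unpersist()
        }
        sc.stop()
      }
    }

If the theory above holds, the second timing should come in somewhat under 2x the first, since the per-iteration shuffle payload (k cluster centers) does not grow with the number of rows.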
On Tue, Apr 7, 2015 at 7:15 AM, Vasyl Harasymiv <vasyl.harasy...@gmail.com> wrote:
> Hi Spark Community,
>
> Imagine you have a stable computing cluster (e.g. 5 nodes) with Hadoop
> that does not run anything other than your Spark jobs.
>
> Now imagine you run simple machine learning on the data (e.g. 100MB):
>
> K-means - 5 min
> Logistic regression - 5 min
>
> Now imagine that the volume of your data has doubled (2x, to 200MB) and
> it is still distributed across those 5 nodes.
>
> How much more time would this computation take now?
>
> I presume more than 2x, e.g. K-means 25 min and logistic regression
> 20 min?
>
> I just want to understand how data growth would impact computational
> performance for ML (any model in your experience is fine), since my gut
> feeling is that if data increases 2x, the computation on the same
> cluster would increase > 2x.
>
> Thank you!
> Vasyl