Hi Spark Community,

Imagine you have a stable computing cluster (e.g. 5 nodes) with Hadoop that
does not run anything other than your Spark jobs.

Now imagine you run simple machine learning on the data (e.g. 100MB):

   1. K-means -  5 min
   2. Logistic regression - 5 min

Now imagine that the volume of your data has doubled to 200MB and it is
still distributed across those same 5 nodes.

How much more time would this computation take now?

I presume more than 2x, e.g. K-means 25 min and logistic regression 20 min?

I just want to get an understanding of how data growth impacts
computational performance for ML (any model from your experience is fine).
My gut feeling is that if the data increases 2x, the computation on the
same cluster would increase by more than 2x.
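
For what it's worth, here is a minimal sketch of how I would measure this
myself, assuming PySpark with the DataFrame-based ML API, hypothetical HDFS
paths, and input that already has "features"/"label" columns assembled:

import time

from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("ml-scaling-check").getOrCreate()

def timed_fit(estimator, df):
    """Return wall-clock seconds spent fitting the estimator on df."""
    start = time.time()
    estimator.fit(df)
    return time.time() - start

# Hypothetical paths for the 100MB and 200MB versions of the same data set.
for path in ["hdfs:///data/sample_100mb", "hdfs:///data/sample_200mb"]:
    df = spark.read.parquet(path).cache()
    df.count()  # materialize the cache so I/O is not billed to the first fit
    t_km = timed_fit(KMeans(k=10, seed=1), df)
    t_lr = timed_fit(LogisticRegression(maxIter=20), df)
    print(f"{path}: k-means {t_km:.1f}s, logistic regression {t_lr:.1f}s")

Comparing the printed timings for the two paths should show directly
whether the growth is roughly linear or worse on a given cluster.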

Thank you!
Vasyl
