Re: ML consumption time based on data volume - same cluster

2015-04-07 Thread Vasyl Harasymiv
Thank you, Xiangrui. Indeed; however, if the computation involves matrix operations, even locally, as in random forests, then when the data increases 2x, even the local computation time should increase by more than 2x. But I will test it with spark-perf and let you know! On Tue, Apr 7, 2015 at 4:50 PM, Xiangrui Meng wrote:
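One way to test that claim directly, alongside a spark-perf run, is to time random forest training on a cached RDD and on a doubled copy of it. This is only a minimal sketch, assuming the Spark 1.x RDD-based mllib.tree API; the tree parameters and the helper name forestScaling are placeholders, not anything from the thread.

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.rdd.RDD

// Compare RandomForest training time on a cached RDD of labeled points
// and on a doubled copy of it; all tree parameters below are placeholders.
def forestScaling(base: RDD[LabeledPoint], doubled: RDD[LabeledPoint]): Unit = {
  def timeSec[T](body: => T): Double = {
    val t0 = System.nanoTime(); body; (System.nanoTime() - t0) / 1e9
  }
  def train(data: RDD[LabeledPoint]) =
    RandomForest.trainClassifier(
      data,
      2,                  // numClasses
      Map[Int, Int](),    // categoricalFeaturesInfo (all features continuous)
      50,                 // numTrees
      "auto",             // featureSubsetStrategy
      "gini",             // impurity
      8,                  // maxDepth
      32)                 // maxBins

  val t1 = timeSec(train(base))
  val t2 = timeSec(train(doubled))
  println(f"1x: $t1%.1f s, 2x: $t2%.1f s, ratio: ${t2 / t1}%.2f")
}

If the ratio printed at the end stays well above 2, that would support the intuition that the local training cost grows super-linearly with data volume.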

Re: ML consumption time based on data volume - same cluster

2015-04-07 Thread Xiangrui Meng
This could be empirically verified in spark-perf: https://github.com/databricks/spark-perf. Theoretically, it would be < 2x for k-means and logistic regression, because computation is doubled but communication cost remains the same. -Xiangrui On Tue, Apr 7, 2015 at 7:15 AM, Vasyl Harasymiv wrote:
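spark-perf is the more rigorous option, but the "< 2x" expectation can also be checked with a small driver program that times the same MLlib algorithm on a dataset and on a doubled copy of it. This is a minimal sketch, assuming the Spark 1.x RDD-based MLlib API; the synthetic data generator, point counts, and partition count are illustrative, not taken from the thread.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import scala.util.Random

object KMeansScalingCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("kmeans-scaling-check"))

    // Synthetic dense feature vectors; point count, dimensionality and
    // partition count are placeholders.
    def makeData(numPoints: Int) = sc.parallelize(0 until numPoints, 40).map { i =>
      val rnd = new Random(i)
      Vectors.dense(Array.fill(20)(rnd.nextGaussian()))
    }.cache()

    def timeSec[T](body: => T): Double = {
      val t0 = System.nanoTime()
      body
      (System.nanoTime() - t0) / 1e9
    }

    val base = makeData(1000000)
    val doubled = makeData(2000000)
    base.count(); doubled.count()  // materialize the caches before timing

    val tBase = timeSec(KMeans.train(base, 10, 20))       // k = 10, 20 iterations
    val tDoubled = timeSec(KMeans.train(doubled, 10, 20))

    // If per-record computation dominated completely, the ratio would approach 2;
    // fixed per-iteration scheduling and communication cost should keep it below 2.
    println(f"1x: $tBase%.1f s, 2x: $tDoubled%.1f s, ratio: ${tDoubled / tBase}%.2f")
    sc.stop()
  }
}

The same harness works for logistic regression by swapping the KMeans.train calls for a classifier run on labeled points.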

ML consumption time based on data volume - same cluster

2015-04-07 Thread Vasyl Harasymiv
Hi Spark Community,

Imagine you have a stable computing cluster (e.g. 5 nodes) with Hadoop that does not run anything other than your Spark jobs. Now imagine you run simple machine learning on the data (e.g. 100MB):

1. K-means - 5 min
2. Logistic regression - 5 min

Now imagine that the volume of data doubles. Would the run times for these algorithms also double, or scale differently?
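For reference, a baseline measurement like the ones quoted above could be taken with a short driver program. This is only a minimal sketch, assuming the Spark 1.x RDD-based MLlib API; the synthetic data generator and the sizes are placeholders, not the actual 100MB dataset from the question.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import scala.util.Random

object LogisticRegressionBaseline {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lr-baseline"))

    // Synthetic labeled points standing in for the real dataset (placeholder sizing).
    val data = sc.parallelize(0 until 500000, 20).map { i =>
      val rnd = new Random(i)
      val features = Array.fill(25)(rnd.nextGaussian())
      LabeledPoint(if (rnd.nextBoolean()) 1.0 else 0.0, Vectors.dense(features))
    }.cache()
    data.count()  // materialize the cache before timing

    val t0 = System.nanoTime()
    new LogisticRegressionWithLBFGS().setNumClasses(2).run(data)
    println(s"Logistic regression took ${(System.nanoTime() - t0) / 1e9} s")

    sc.stop()
  }
}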