Spark performance on a 32-CPU server cluster

2015-02-20 Thread Dirceu Semighini Filho
Hi all, I'm running Spark 1.2.0 in standalone mode on different cluster and server sizes. All of my data is cached in memory. Basically I have a mass of data, about 8 GB, with about 37k columns, and I'm running different configs of a BinaryLogisticRegressionBFGS. When I put Spark to run on 9

Re: Spark performance on a 32-CPU server cluster

2015-02-20 Thread Sean Owen
It sounds like your computation just isn't CPU-bound, right? Or maybe only some stages are. It's not clear what work you are doing beyond the core LR. Stages don't wait on each other unless one depends on the other. You'd have to clarify what you mean by running stages in parallel, like what

Re: Spark performance on a 32-CPU server cluster

2015-02-20 Thread Sean Owen
Yes, that makes sense, but it doesn't make the jobs CPU-bound. What is the bottleneck: the model building, or other stages? I would think you can get the model building to be CPU-bound, unless you have chopped it up into really small partitions. I think it's best to look further into which stages are

Re: Spark performance on a 32-CPU server cluster

2015-02-20 Thread Dirceu Semighini Filho
Hi Sean, I'm trying to increase CPU usage by running logistic regression on different datasets in parallel; they shouldn't depend on each other. I train several logistic regression models from different column combinations of a main dataset. I processed the combinations in a ParArray in an
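The pattern described above can be sketched as follows. This is a minimal, hedged illustration, not the poster's actual code: `trainModel` is a hypothetical stand-in for one Spark job (e.g. a `LogisticRegressionWithLBFGS` fit on a column subset), and `scala.concurrent.Future` is used in place of a ParArray because it achieves the same concurrent job submission from the driver and also works on Scala 2.13+, where parallel collections were moved to a separate module. The key property is that the jobs are independent, so Spark's scheduler can overlap their stages and raise CPU utilisation.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object ParallelTraining {
  // Hypothetical stand-in for one expensive, independent training job;
  // with Spark this would be something like LogisticRegressionWithLBFGS.run
  // on an RDD projected down to the chosen columns.
  def trainModel(columns: Seq[Int]): Double =
    columns.sum.toDouble / (columns.length max 1) // fake "model quality" score

  def main(args: Array[String]): Unit = {
    val allColumns = (0 until 8).toVector
    // every 3-column combination of the feature set
    val combos = allColumns.combinations(3).toVector

    // Each Future submits one training job from the driver; because the jobs
    // do not depend on each other, they can run concurrently.
    val futures = combos.map(c => Future((c, trainModel(c))))
    val results = Await.result(Future.sequence(futures), Duration.Inf)

    val best = results.maxBy(_._2)
    println(s"trained ${results.length} models; best columns = ${best._1}")
  }
}
```

Note that submitting many jobs concurrently only helps if each job leaves cores idle; as Sean points out above, it is worth confirming in the Spark UI which stages are actually the bottleneck before adding driver-side parallelism.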