- Limited the data to 100,000 records.
- 6 categorical features, which go through imputation, string indexing, and one-hot encoding. The maximum number of classes for a feature is 100. Because the data is imputed, it becomes dense.
- 1 numerical feature.
- Training Logistic Regression through CrossValidator with a parameter grid to optimize its regularization parameter over the values 0.0001, 0.001, 0.005, 0.01, 0.05, 0.1.
- Using Spark's launcher API to launch it on a YARN cluster on AWS.
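
To put a number on the work this grid implies, here is a rough sketch in plain Python (not Spark code). It assumes 3 folds, which is the CrossValidator default; the fold count is an assumption, as it is not stated above. The names `reg_params`, `num_folds` and `fit_tasks` are only for illustration.

```python
from itertools import product

# Regularization values from the grid described above.
reg_params = [0.0001, 0.001, 0.005, 0.01, 0.05, 0.1]

# Assumption: 3 folds (the CrossValidator default); not stated above.
num_folds = 3

# Each (fold, regParam) pair is one model fit that does not depend on the others.
fit_tasks = list(product(range(num_folds), reg_params))

# Once the best regParam is chosen, one more fit runs on the full training data.
total_fits = len(fit_tasks) + 1

print(f"independent fits during the search: {len(fit_tasks)}")
print(f"total fits including the final refit: {total_fits}")
```

Each of those pairs only needs the training split for its own fold, so in principle nothing forces them to run one after another.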
I was thinking that since CrossValidator is searching for the best parameters, it should be able to evaluate them independently. That sounds like something which could be run in parallel.

On Fri, Apr 7, 2017 at 5:20 PM, Nick Pentreath <nick.pentre...@gmail.com> wrote:

> What is the size of training data (number examples, number features)?
> Dense or sparse features? How many classes?
>
> What commands are you using to submit your job via spark-submit?
>
> On Fri, 7 Apr 2017 at 13:12 Aseem Bansal <asmbans...@gmail.com> wrote:
>
>> When using spark ml's LogisticRegression, RandomForest, CrossValidator
>> etc. do we need to give any consideration while coding in making it scale
>> with more CPUs or does it scale automatically?
>>
>> I am reading some data from S3, using a pipeline to train a model. I am
>> running the job on a spark cluster with 36 cores and 60GB RAM and I cannot
>> see much usage. It is running but I was expecting spark to use all RAM
>> available and make it faster. So that's why I was thinking whether we need
>> to take something particular in consideration or wrong expectations?
>>