Re: Spark 2.1 ml library scalability

Aseem Bansal Fri, 07 Apr 2017 05:19:26 -0700

   - Limited the data to 100,000 records.
   - 6 categorical features, which go through imputation, string indexing,
   and one-hot encoding. The maximum number of classes for a feature is 100.
   As the data is imputed, it becomes dense.
   - 1 numerical feature.
   - Training LogisticRegression through CrossValidator with a parameter
   grid to optimize its regularization parameter over the values 0.0001,
   0.001, 0.005, 0.01, 0.05, 0.1.
   - Using Spark's launcher API to launch it on a YARN cluster in Amazon
   AWS.
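
As a rough sanity check on the numbers above, here is a back-of-envelope estimate of how large the training matrix could get. It is only a sketch under stated assumptions that are not from the job itself: every categorical feature reaches the 100-class maximum, every value is stored as an 8-byte double, and the matrix is fully dense with no per-row overhead. The function names are made up for illustration.

```python
# Upper-bound size estimate for the dataset described in the list above.
# Assumptions (not measured): all 6 categorical features hit the 100-class
# maximum, values are 8-byte doubles, storage is fully dense.
N_RECORDS = 100_000
N_CATEGORICAL = 6
MAX_CLASSES = 100
N_NUMERICAL = 1
BYTES_PER_DOUBLE = 8


def dense_feature_count(n_categorical, max_classes, n_numerical):
    """Width of the feature vector after one-hot encoding (upper bound)."""
    return n_categorical * max_classes + n_numerical


def dense_size_bytes(n_records, n_features, bytes_per_value=BYTES_PER_DOUBLE):
    """Bytes needed to hold the whole matrix as dense doubles."""
    return n_records * n_features * bytes_per_value


n_features = dense_feature_count(N_CATEGORICAL, MAX_CLASSES, N_NUMERICAL)
size_bytes = dense_size_bytes(N_RECORDS, n_features)

print(f"features per record: {n_features}")
print(f"dense size: {size_bytes / 1e9:.2f} GB")
```

Under these assumptions the data comes to roughly half a gigabyte, which is small next to 60GB of cluster RAM.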


I was thinking that since CrossValidator is searching for the best
parameters, it should be able to evaluate each candidate independently.
That sounds like something which could be run in parallel.
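
To make the independence argument concrete, here is a small plain-Python sketch (no Spark involved) that enumerates the separate fits a grid search with cross-validation implies. The grid values are the ones listed above; the fold count of 3 is an assumption, since it is not given here, and `fit_and_score` is a hypothetical placeholder rather than a real training call.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

REG_PARAMS = [0.0001, 0.001, 0.005, 0.01, 0.05, 0.1]  # grid from this message
NUM_FOLDS = 3  # assumption: the fold count is not stated in this thread


def independent_fits(reg_params, num_folds):
    """One task per (regularization value, fold) pair."""
    return list(product(reg_params, range(num_folds)))


def fit_and_score(task):
    """Hypothetical placeholder for training on one fold with one value.

    It reads nothing produced by any other task, which is why the tasks
    could in principle run side by side. It returns the task unchanged
    instead of a real metric.
    """
    reg_param, fold = task
    return (reg_param, fold)


tasks = independent_fits(REG_PARAMS, NUM_FOLDS)

# Run the placeholder tasks on a small thread pool; map() keeps input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    finished = list(pool.map(fit_and_score, tasks))

print(f"independent fits: {len(tasks)}")
```

This only shows that the fits do not depend on one another. It does not show how CrossValidator actually schedules them.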


On Fri, Apr 7, 2017 at 5:20 PM, Nick Pentreath <nick.pentre...@gmail.com>
wrote:

> What is the size of the training data (number of examples, number of features)?
> Dense or sparse features? How many classes?
>
> What commands are you using to submit your job via spark-submit?
>
> On Fri, 7 Apr 2017 at 13:12 Aseem Bansal <asmbans...@gmail.com> wrote:
>
>> When using Spark ML's LogisticRegression, RandomForest, CrossValidator,
>> etc., do we need to do anything special in our code to make it scale
>> with more CPUs, or does it scale automatically?
>>
>> I am reading some data from S3 and using a pipeline to train a model. I am
>> running the job on a Spark cluster with 36 cores and 60GB RAM, and I cannot
>> see much usage. It is running, but I was expecting Spark to use all the
>> available RAM and run faster. So I am wondering whether there is something
>> particular we need to take into consideration, or whether my expectations
>> are wrong.
>>
>
