Re: PySpark ML: Get best set of parameters from TrainValidationSplit

2018-04-16 Thread Bryan Cutler
Hi Aakash,

First you will want to get the random forest model stage from the best
pipeline model result. For example, if RF is the first stage:

rfModel = model.bestModel.stages[0]

Then you can check the values of the params you tuned like this:
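
A minimal sketch of one way to do that, assuming Spark 3.0 or later, where
the fitted Python model carries the estimator's params and they can be read
through the generic Params method getOrDefault. The helper name
tuned_param_values is hypothetical, and the param names in the usage comment
are the four from the grid in the question. On older releases the Python
model may not expose these params, so treat this as an illustration only.

```python
def tuned_param_values(stage, param_names):
    """Collect the current value of each named param from a fitted stage.

    `stage` is assumed to implement pyspark.ml.param.Params (for example
    the rfModel stage extracted above). getOrDefault accepts a param name
    and returns the explicitly set value, or the default if none was set.
    """
    return {name: stage.getOrDefault(name) for name in param_names}


# Hypothetical usage with the rfModel stage extracted above:
# values = tuned_param_values(
#     rfModel, ["numTrees", "maxDepth", "minInfoGain", "minInstancesPerNode"])
# for name, value in values.items():
#     print("%s = %s" % (name, value))
```

The helper only relies on getOrDefault, so it works for any stage of the
best pipeline model, not just the random forest.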


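Another way to see which combination won, without going through the model
stages, is to pair each param map in the grid with its validation metric.
This is a sketch, assuming the fitted TrainValidationSplitModel exposes
validationMetrics in the same order as tvs.getEstimatorParamMaps(), and that
a larger metric is better (true for accuracy). best_param_map is a
hypothetical helper, not part of PySpark.

```python
def best_param_map(param_maps, metrics, larger_is_better=True):
    """Return (best_params, best_metric) for a tuned grid.

    param_maps: list of dicts, one per grid point, keyed by Param objects
                (or plain strings); Param keys are reduced to their .name.
    metrics:    list of validation metrics, one per entry in param_maps.
    Raises ValueError if metrics is empty.
    """
    pick = max if larger_is_better else min
    best_index = pick(range(len(metrics)), key=metrics.__getitem__)
    # Use the param's name when the key has one, otherwise the key itself.
    best = {getattr(p, "name", p): v for p, v in param_maps[best_index].items()}
    return best, metrics[best_index]


# Hypothetical usage with the objects from the question:
# best, metric = best_param_map(tvs.getEstimatorParamMaps(),
#                               model.validationMetrics)
# print(best, metric)
```

Note that bestModel and validationMetrics are attributes of the fitted
model, so they are read without parentheses.
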
On Mon, Apr 16, 2018 at 7:52 AM, Aakash Basu wrote:

> Hi,
> I am running a Random Forest model on a dataset using hyperparameter
> tuning with Spark's paramGrid and Train Validation Split.
> Can anyone tell me how to get the best set for all the four parameters?
> I used:
> model.bestModel()
> model.metrics()
> But none of them seem to work.
> Below is the code chunk:
> paramGrid = ParamGridBuilder() \
>     .addGrid(rf.numTrees, [50, 100, 150, 200]) \
>     .addGrid(rf.maxDepth, [5, 10, 15, 20]) \
>     .addGrid(rf.minInfoGain, [0.001, 0.01, 0.1, 0.6]) \
>     .addGrid(rf.minInstancesPerNode, [5, 15, 30, 50, 100]) \
>     .build()
> tvs = TrainValidationSplit(estimator=pipeline,
>     # 80% of the data will be used for training, 20% for validation.
> model =
> predictions = model.transform(testData)
> evaluator = MulticlassClassificationEvaluator(
>     labelCol="label", predictionCol="prediction", metricName="accuracy")
> accuracy = evaluator.evaluate(predictions)
> print("Accuracy = %g" % accuracy)
> print("Test Error = %g" % (1.0 - accuracy))
> Any help?
> Thanks,
> Aakash.

PySpark ML: Get best set of parameters from TrainValidationSplit

2018-04-16 Thread Aakash Basu

I am running a Random Forest model on a dataset using hyperparameter
tuning with Spark's paramGrid and Train Validation Split.

Can anyone tell me how to get the best set for all the four parameters?

I used:

model.bestModel()
model.metrics()

But none of them seem to work.

Below is the code chunk:

paramGrid = ParamGridBuilder() \
    .addGrid(rf.numTrees, [50, 100, 150, 200]) \
    .addGrid(rf.maxDepth, [5, 10, 15, 20]) \
    .addGrid(rf.minInfoGain, [0.001, 0.01, 0.1, 0.6]) \
    .addGrid(rf.minInstancesPerNode, [5, 15, 30, 50, 100]) \
    .build()

tvs = TrainValidationSplit(estimator=pipeline,
    # 80% of the data will be used for training, 20% for validation.

model =

predictions = model.transform(testData)

evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Accuracy = %g" % accuracy)
print("Test Error = %g" % (1.0 - accuracy))

Any help?
