I'm running Spark v1.3.1, and when I run the following against my dataset:

    model = GradientBoostedTrees.trainRegressor(trainingData,
        categoricalFeaturesInfo=catFeatures, maxDepth=6, numIterations=3)
The job fails with the following traceback:

    Traceback (most recent call last):
      File "/Users/drake/fd/spark/mltest.py", line 73, in <module>
        model = GradientBoostedTrees.trainRegressor(trainingData, categoricalFeaturesInfo=catFeatures, maxDepth=6, numIterations=3)
      File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/mllib/tree.py", line 553, in trainRegressor
        loss, numIterations, learningRate, maxDepth)
      File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/mllib/tree.py", line 438, in _train
        loss, numIterations, learningRate, maxDepth)
      File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/mllib/common.py", line 120, in callMLlibFunc
        return callJavaFunc(sc, api, *args)
      File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/mllib/common.py", line 113, in callJavaFunc
        return _java2py(sc, func(*args))
      File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
      File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
    15/05/20 16:40:12 INFO BlockManager: Removing block rdd_32_95
    py4j.protocol.Py4JJavaError: An error occurred while calling o69.trainGradientBoostedTreesModel.
    : java.lang.IllegalArgumentException: requirement failed: DecisionTree requires maxBins (= 32) >= max categories in categorical features (= 1895)
        at scala.Predef$.require(Predef.scala:233)
        at org.apache.spark.mllib.tree.impl.DecisionTreeMetadata$.buildMetadata(DecisionTreeMetadata.scala:128)
        at org.apache.spark.mllib.tree.RandomForest.run(RandomForest.scala:138)
        at org.apache.spark.mllib.tree.DecisionTree.run(DecisionTree.scala:60)
        at org.apache.spark.mllib.tree.GradientBoostedTrees$.org$apache$spark$mllib$tree$GradientBoostedTrees$$boost(GradientBoostedTrees.scala:150)
        at org.apache.spark.mllib.tree.GradientBoostedTrees.run(GradientBoostedTrees.scala:63)
        at org.apache.spark.mllib.tree.GradientBoostedTrees$.train(GradientBoostedTrees.scala:96)
        at org.apache.spark.mllib.api.python.PythonMLLibAPI.trainGradientBoostedTreesModel(PythonMLLibAPI.scala:595)

So it's complaining about maxBins. If I supply maxBins=1900 and re-run:

    model = GradientBoostedTrees.trainRegressor(trainingData,
        categoricalFeaturesInfo=catFeatures, maxDepth=6, numIterations=3,
        maxBins=1900)

I get:

    Traceback (most recent call last):
      File "/Users/drake/fd/spark/mltest.py", line 73, in <module>
        model = GradientBoostedTrees.trainRegressor(trainingData, categoricalFeaturesInfo=catFeatures, maxDepth=6, numIterations=3, maxBins=1900)
    TypeError: trainRegressor() got an unexpected keyword argument 'maxBins'

Now it claims to know nothing of maxBins. The same call against DecisionTree or RandomForest (with maxBins=1900) works just fine, so this looks like a bug in the Python API for GradientBoostedTrees.

Suggestions?

-Don

--
Donald Drake
Drake Consulting
http://www.drakeconsulting.com/
800-733-2143
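P.S. One workaround I'm looking at until maxBins is exposed here: shrink the cardinality of the offending categorical feature(s) so the default maxBins of 32 is enough, e.g. by hashing category ids down into fewer buckets. A rough sketch (the dict values and bucket count below are made-up placeholders, and hashing merges colliding categories, so it trades some signal for trainability):

```python
# Default maxBins in MLlib's tree trainers, per the error message above.
DEFAULT_MAX_BINS = 32

# categoricalFeaturesInfo maps feature index -> number of categories.
# Placeholder values for illustration; the real dict comes from my data.
catFeatures = {0: 4, 3: 1895, 7: 12}

# Find the features that trip the "maxBins >= max categories" check.
offenders = {i: n for i, n in catFeatures.items() if n > DEFAULT_MAX_BINS}
print(offenders)  # -> {3: 1895}

def bucketize(value, n_buckets=DEFAULT_MAX_BINS):
    """Collapse an original category id into one of n_buckets ids."""
    return int(value) % n_buckets

# Applied to the training RDD it would look roughly like:
# from pyspark.mllib.regression import LabeledPoint
# trainingData = trainingData.map(lambda lp: LabeledPoint(
#     lp.label,
#     [bucketize(v) if j in offenders else v
#      for j, v in enumerate(lp.features)]))
# for i in offenders:
#     catFeatures[i] = DEFAULT_MAX_BINS
```

Not ideal, since distinct categories get merged, but it should at least let the GBT job run with the default bins.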