Re: GradientBoostedTrees.trainRegressor with categoricalFeaturesInfo

Don Drake Wed, 20 May 2015 20:55:04 -0700

JIRA created: https://issues.apache.org/jira/browse/SPARK-7781


Joseph, I agree. I'm debating removing this feature altogether, but I'm
putting the model through its paces.
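
For reference, a minimal sketch of the grouping Joseph describes below: keep
the most frequent categories and fold the rest into one shared "other" bucket,
so the feature's category count fits under maxBins (32 by default, per the
error message quoted below). The helper names, the sample values, and the
feature column index 0 are all hypothetical; this is plain Python and does not
touch the Spark API.

```python
from collections import Counter


def build_category_mapping(values, max_categories=32):
    """Give the most frequent categories their own index and reserve one shared
    index for everything else, so at most max_categories indices are used."""
    if max_categories < 2:
        raise ValueError("max_categories must be at least 2")
    counts = Counter(values)
    # Keep max_categories - 1 slots for real categories; the last slot is "other".
    kept = [category for category, _ in counts.most_common(max_categories - 1)]
    mapping = {category: index for index, category in enumerate(kept)}
    other_index = len(mapping)
    return mapping, other_index


def encode(value, mapping, other_index):
    """Return the index for value, or the shared "other" index if it was not kept."""
    return mapping.get(value, other_index)


# Hypothetical raw values of the high-cardinality column.
raw_values = ["a"] * 5 + ["b"] * 3 + ["c"] * 2 + ["d"]
mapping, other_index = build_category_mapping(raw_values, max_categories=3)
encoded = [encode(v, mapping, other_index) for v in raw_values]

# Assumes the grouped feature sits at column index 0 of the feature vector.
catFeatures = {0: other_index + 1}
```

The encoded index would have to replace the raw category index in each
training point's feature vector before catFeatures is passed as
categoricalFeaturesInfo. Whether lumping rare categories together is
acceptable depends on the data.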

Thanks.

-Don

On Wed, May 20, 2015 at 7:52 PM, Joseph Bradley <jos...@databricks.com>
wrote:

> One more comment: That's a lot of categories for a feature.  If it makes
> sense for your data, it will run faster if you can group the categories or
> split the 1895 categories into a few features which have fewer categories.
>
> On Wed, May 20, 2015 at 3:17 PM, Burak Yavuz <brk...@gmail.com> wrote:
>
>> Could you please open a JIRA for it? The maxBins input is missing from the
>> Python API.
>>
>> Is it possible for you to use the current master? In the current master,
>> you should be able to use trees with the Pipeline API and DataFrames.
>>
>> Best,
>> Burak
>>
>> On Wed, May 20, 2015 at 2:44 PM, Don Drake <dondr...@gmail.com> wrote:
>>
>>> I'm running Spark v1.3.1 and when I run the following against my dataset:
>>>
>>> model = GradientBoostedTrees.trainRegressor(trainingData,
>>>     categoricalFeaturesInfo=catFeatures, maxDepth=6, numIterations=3)
>>>
>>> The job will fail with the following message:
>>> Traceback (most recent call last):
>>>   File "/Users/drake/fd/spark/mltest.py", line 73, in <module>
>>>     model = GradientBoostedTrees.trainRegressor(trainingData,
>>> categoricalFeaturesInfo=catFeatures, maxDepth=6, numIterations=3)
>>>   File
>>> "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/mllib/tree.py",
>>> line 553, in trainRegressor
>>>     loss, numIterations, learningRate, maxDepth)
>>>   File
>>> "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/mllib/tree.py",
>>> line 438, in _train
>>>     loss, numIterations, learningRate, maxDepth)
>>>   File
>>> "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/mllib/common.py",
>>> line 120, in callMLlibFunc
>>>     return callJavaFunc(sc, api, *args)
>>>   File
>>> "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/mllib/common.py",
>>> line 113, in callJavaFunc
>>>     return _java2py(sc, func(*args))
>>>   File
>>> "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>>> line 538, in __call__
>>>   File
>>> "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
>>> line 300, in get_return_value
>>> 15/05/20 16:40:12 INFO BlockManager: Removing block rdd_32_95
>>> py4j.protocol.Py4JJavaError: An error occurred while calling
>>> o69.trainGradientBoostedTreesModel.
>>> : java.lang.IllegalArgumentException: requirement failed: DecisionTree
>>> requires maxBins (= 32) >= max categories in categorical features (= 1895)
>>> at scala.Predef$.require(Predef.scala:233)
>>> at
>>> org.apache.spark.mllib.tree.impl.DecisionTreeMetadata$.buildMetadata(DecisionTreeMetadata.scala:128)
>>> at org.apache.spark.mllib.tree.RandomForest.run(RandomForest.scala:138)
>>> at org.apache.spark.mllib.tree.DecisionTree.run(DecisionTree.scala:60)
>>> at
>>> org.apache.spark.mllib.tree.GradientBoostedTrees$.org$apache$spark$mllib$tree$GradientBoostedTrees$$boost(GradientBoostedTrees.scala:150)
>>> at
>>> org.apache.spark.mllib.tree.GradientBoostedTrees.run(GradientBoostedTrees.scala:63)
>>> at
>>> org.apache.spark.mllib.tree.GradientBoostedTrees$.train(GradientBoostedTrees.scala:96)
>>> at
>>> org.apache.spark.mllib.api.python.PythonMLLibAPI.trainGradientBoostedTreesModel(PythonMLLibAPI.scala:595)
>>>
>>> So it's complaining about maxBins. If I provide maxBins=1900 and
>>> re-run it:
>>>
>>> model = GradientBoostedTrees.trainRegressor(trainingData,
>>>     categoricalFeaturesInfo=catFeatures, maxDepth=6, numIterations=3,
>>>     maxBins=1900)
>>>
>>> Traceback (most recent call last):
>>>   File "/Users/drake/fd/spark/mltest.py", line 73, in <module>
>>>     model = GradientBoostedTrees.trainRegressor(trainingData,
>>> categoricalFeaturesInfo=catFeatures, maxDepth=6, numIterations=3, maxBins=1900)
>>> TypeError: trainRegressor() got an unexpected keyword argument 'maxBins'
>>>
>>> It now says it knows nothing of maxBins.
>>>
>>> If I run the same command against DecisionTree or RandomForest (with
>>> maxBins=1900) it works just fine.
>>>
>>> Seems like a bug in GradientBoostedTrees.
>>>
>>> Suggestions?
>>>
>>> -Don
>>>
>>> --
>>> Donald Drake
>>> Drake Consulting
>>> http://www.drakeconsulting.com/
>>> 800-733-2143
>>>
>>
>>
>
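
The requirement in the stack trace quoted above (maxBins >= max categories in
categorical features) can be checked on the driver before a job is submitted.
A minimal sketch, assuming categoricalFeaturesInfo is the usual dict of
feature index to category count; check_max_bins is a hypothetical helper, not
part of MLlib, and the default of 32 is taken from the error message.

```python
def check_max_bins(categorical_features_info, max_bins=32):
    """Raise ValueError if any categorical feature has more categories than max_bins."""
    if not categorical_features_info:
        return
    # Find the categorical feature with the largest number of categories.
    feature, arity = max(categorical_features_info.items(), key=lambda kv: kv[1])
    if arity > max_bins:
        raise ValueError(
            "maxBins (= %d) must be >= max categories in categorical features "
            "(= %d, feature index %d)" % (max_bins, arity, feature))
```

Calling check_max_bins(catFeatures) before training would surface the problem
without waiting for the Py4J call to fail.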


-- 
Donald Drake
Drake Consulting
http://www.drakeconsulting.com/
http://www.MailLaunder.com/
http://www.DrudgeSiren.com/
http://plu.gd/
800-733-2143
