One more comment: 1895 is a lot of categories for a single feature.  If it
makes sense for your data, training will run faster if you group the
categories or split them across a few features with fewer categories each.
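
As a rough sketch of the grouping idea (plain Python, not an MLlib API): keep the most frequent categories and fold the rest into one catch-all bucket, then re-index the values as 0..k-1. The names `OTHER`, `group_rare_categories`, and `index_categories`, the sample data, and the feature index 0 are all made up for illustration; whether lumping rare categories together is reasonable depends on your data.

```python
from collections import Counter

OTHER = "__other__"  # hypothetical label for the catch-all bucket


def group_rare_categories(values, max_categories):
    """Keep the most frequent categories and replace the rest with OTHER.

    The result has at most max_categories distinct values.
    """
    if max_categories < 2:
        raise ValueError("max_categories must be at least 2")
    values = list(values)
    counts = Counter(values)
    if len(counts) <= max_categories:
        return values
    # Reserve one slot for the catch-all bucket.
    keep = {cat for cat, _ in counts.most_common(max_categories - 1)}
    return [v if v in keep else OTHER for v in values]


def index_categories(values):
    """Map each distinct value to an integer index in 0..k-1."""
    mapping = {cat: i for i, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], len(mapping)


# Toy data: three frequent categories and three rare ones.
raw = ["a"] * 5 + ["b"] * 4 + ["c"] * 3 + ["d", "e", "f"]
grouped = group_rare_categories(raw, max_categories=4)
indices, arity = index_categories(grouped)

# The arity is what would go into categoricalFeaturesInfo for this feature
# (feature index 0 is only an example).
cat_features = {0: arity}
```

With the arity kept at or below maxBins (32 by default, per the error message), the requirement check in the stack trace would no longer trip for that feature.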

On Wed, May 20, 2015 at 3:17 PM, Burak Yavuz <brk...@gmail.com> wrote:

> Could you please open a JIRA for it? The maxBins input is missing from the
> Python API.
>
> Is it possible for you to use the current master? In the current master,
> you should be able to use trees with the Pipeline API and DataFrames.
>
> Best,
> Burak
>
> On Wed, May 20, 2015 at 2:44 PM, Don Drake <dondr...@gmail.com> wrote:
>
>> I'm running Spark v1.3.1 and when I run the following against my dataset:
>>
>> model = GradientBoostedTrees.trainRegressor(trainingData,
>>     categoricalFeaturesInfo=catFeatures, maxDepth=6, numIterations=3)
>>
>> The job will fail with the following message:
>> Traceback (most recent call last):
>>   File "/Users/drake/fd/spark/mltest.py", line 73, in <module>
>>     model = GradientBoostedTrees.trainRegressor(trainingData,
>> categoricalFeaturesInfo=catFeatures, maxDepth=6, numIterations=3)
>>   File
>> "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/mllib/tree.py",
>> line 553, in trainRegressor
>>     loss, numIterations, learningRate, maxDepth)
>>   File
>> "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/mllib/tree.py",
>> line 438, in _train
>>     loss, numIterations, learningRate, maxDepth)
>>   File
>> "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/mllib/common.py",
>> line 120, in callMLlibFunc
>>     return callJavaFunc(sc, api, *args)
>>   File
>> "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/mllib/common.py",
>> line 113, in callJavaFunc
>>     return _java2py(sc, func(*args))
>>   File
>> "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>> line 538, in __call__
>>   File
>> "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
>> line 300, in get_return_value
>> 15/05/20 16:40:12 INFO BlockManager: Removing block rdd_32_95
>> py4j.protocol.Py4JJavaError: An error occurred while calling
>> o69.trainGradientBoostedTreesModel.
>> : java.lang.IllegalArgumentException: requirement failed: DecisionTree
>> requires maxBins (= 32) >= max categories in categorical features (= 1895)
>> at scala.Predef$.require(Predef.scala:233)
>> at
>> org.apache.spark.mllib.tree.impl.DecisionTreeMetadata$.buildMetadata(DecisionTreeMetadata.scala:128)
>> at org.apache.spark.mllib.tree.RandomForest.run(RandomForest.scala:138)
>> at org.apache.spark.mllib.tree.DecisionTree.run(DecisionTree.scala:60)
>> at
>> org.apache.spark.mllib.tree.GradientBoostedTrees$.org$apache$spark$mllib$tree$GradientBoostedTrees$$boost(GradientBoostedTrees.scala:150)
>> at
>> org.apache.spark.mllib.tree.GradientBoostedTrees.run(GradientBoostedTrees.scala:63)
>> at
>> org.apache.spark.mllib.tree.GradientBoostedTrees$.train(GradientBoostedTrees.scala:96)
>> at
>> org.apache.spark.mllib.api.python.PythonMLLibAPI.trainGradientBoostedTreesModel(PythonMLLibAPI.scala:595)
>>
>> So it's complaining about maxBins. If I provide maxBins=1900 and
>> re-run it:
>>
>> model = GradientBoostedTrees.trainRegressor(trainingData,
>>     categoricalFeaturesInfo=catFeatures, maxDepth=6, numIterations=3,
>>     maxBins=1900)
>>
>> Traceback (most recent call last):
>>   File "/Users/drake/fd/spark/mltest.py", line 73, in <module>
>>     model = GradientBoostedTrees.trainRegressor(trainingData,
>> categoricalFeaturesInfo=catFeatures, maxDepth=6, numIterations=3, maxBins=1900)
>> TypeError: trainRegressor() got an unexpected keyword argument 'maxBins'
>>
>> It now says it knows nothing of maxBins.
>>
>> If I run the same command against DecisionTree or RandomForest (with
>> maxBins=1900) it works just fine.
>>
>> Seems like a bug in GradientBoostedTrees.
>>
>> Suggestions?
>>
>> -Don
>>
>> --
>> Donald Drake
>> Drake Consulting
>> http://www.drakeconsulting.com/
>> 800-733-2143
>>
>
>
