JIRA created: https://issues.apache.org/jira/browse/SPARK-7781
Joseph,

I agree, I'm debating removing this feature altogether, but I'm putting the model through its paces.

Thanks.

-Don

On Wed, May 20, 2015 at 7:52 PM, Joseph Bradley <jos...@databricks.com> wrote:

> One more comment: That's a lot of categories for a feature. If it makes
> sense for your data, it will run faster if you can group the categories or
> split the 1895 categories into a few features which have fewer categories.
>
> On Wed, May 20, 2015 at 3:17 PM, Burak Yavuz <brk...@gmail.com> wrote:
>
>> Could you please open a JIRA for it? The maxBins input is missing for the
>> Python Api.
>>
>> Is it possible if you can use the current master? In the current master,
>> you should be able to use trees with the Pipeline Api and DataFrames.
>>
>> Best,
>> Burak
>>
>> On Wed, May 20, 2015 at 2:44 PM, Don Drake <dondr...@gmail.com> wrote:
>>
>>> I'm running Spark v1.3.1 and when I run the following against my dataset:
>>>
>>> model = GradientBoostedTrees.trainRegressor(trainingData,
>>>     categoricalFeaturesInfo=catFeatures, maxDepth=6, numIterations=3)
>>>
>>> The job will fail with the following message:
>>>
>>> Traceback (most recent call last):
>>>   File "/Users/drake/fd/spark/mltest.py", line 73, in <module>
>>>     model = GradientBoostedTrees.trainRegressor(trainingData, categoricalFeaturesInfo=catFeatures, maxDepth=6, numIterations=3)
>>>   File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/mllib/tree.py", line 553, in trainRegressor
>>>     loss, numIterations, learningRate, maxDepth)
>>>   File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/mllib/tree.py", line 438, in _train
>>>     loss, numIterations, learningRate, maxDepth)
>>>   File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/mllib/common.py", line 120, in callMLlibFunc
>>>     return callJavaFunc(sc, api, *args)
>>>   File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/mllib/common.py", line 113, in callJavaFunc
>>>     return _java2py(sc, func(*args))
>>>   File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
>>>   File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
>>> 15/05/20 16:40:12 INFO BlockManager: Removing block rdd_32_95
>>> py4j.protocol.Py4JJavaError: An error occurred while calling o69.trainGradientBoostedTreesModel.
>>> : java.lang.IllegalArgumentException: requirement failed: DecisionTree requires maxBins (= 32) >= max categories in categorical features (= 1895)
>>>     at scala.Predef$.require(Predef.scala:233)
>>>     at org.apache.spark.mllib.tree.impl.DecisionTreeMetadata$.buildMetadata(DecisionTreeMetadata.scala:128)
>>>     at org.apache.spark.mllib.tree.RandomForest.run(RandomForest.scala:138)
>>>     at org.apache.spark.mllib.tree.DecisionTree.run(DecisionTree.scala:60)
>>>     at org.apache.spark.mllib.tree.GradientBoostedTrees$.org$apache$spark$mllib$tree$GradientBoostedTrees$$boost(GradientBoostedTrees.scala:150)
>>>     at org.apache.spark.mllib.tree.GradientBoostedTrees.run(GradientBoostedTrees.scala:63)
>>>     at org.apache.spark.mllib.tree.GradientBoostedTrees$.train(GradientBoostedTrees.scala:96)
>>>     at org.apache.spark.mllib.api.python.PythonMLLibAPI.trainGradientBoostedTreesModel(PythonMLLibAPI.scala:595)
>>>
>>> So, it's complaining about the maxBins, if I provide maxBins=1900 and
>>> re-run it:
>>>
>>> model = GradientBoostedTrees.trainRegressor(trainingData,
>>>     categoricalFeaturesInfo=catFeatures, maxDepth=6, numIterations=3, maxBins=1900)
>>>
>>> Traceback (most recent call last):
>>>   File "/Users/drake/fd/spark/mltest.py", line 73, in <module>
>>>     model = GradientBoostedTrees.trainRegressor(trainingData, categoricalFeaturesInfo=catFeatures, maxDepth=6, numIterations=3, maxBins=1900)
>>> TypeError: trainRegressor() got an unexpected keyword argument 'maxBins'
>>>
>>> It now says it knows nothing of maxBins.
>>>
>>> If I run the same command against DecisionTree or RandomForest (with
>>> maxBins=1900) it works just fine.
>>>
>>> Seems like a bug in GradientBoostedTrees.
>>>
>>> Suggestions?
>>>
>>> -Don
>>>
>>> --
>>> Donald Drake
>>> Drake Consulting
>>> http://www.drakeconsulting.com/
>>> 800-733-2143
>>>
>>
>>
>

--
Donald Drake
Drake Consulting
http://www.drakeconsulting.com/
http://www.MailLaunder.com/
http://www.DrudgeSiren.com/
http://plu.gd/
800-733-2143
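
For anyone hitting the same maxBins limit on 1.3.1, Joseph's suggestion above (grouping the categories so the feature has fewer of them) might look roughly like the sketch below. This is plain Python and not part of the Spark API: the helper names `build_category_mapping` and `remap`, and the choice to keep the most frequent categories and lump the rest into a single "other" bucket, are assumptions made for illustration. Whether that kind of lumping makes sense depends on the data.

```python
from collections import Counter


def build_category_mapping(values, max_categories=32):
    """Index the most frequent categories and reserve one 'other' bucket.

    Hypothetical helper, not a Spark function. Returns (mapping, arity),
    where mapping sends each kept category to an index in 0..arity-2 and
    index arity-1 is reserved for every other (or unseen) category.
    """
    if max_categories < 2:
        raise ValueError("max_categories must be at least 2")
    counts = Counter(values)
    # Keep at most max_categories - 1 categories so the bucket still fits.
    kept = [cat for cat, _ in counts.most_common(max_categories - 1)]
    mapping = {cat: index for index, cat in enumerate(kept)}
    arity = len(kept) + 1
    return mapping, arity


def remap(value, mapping, arity):
    """Return the grouped index for a raw category value."""
    return mapping.get(value, arity - 1)


if __name__ == "__main__":
    # Toy data standing in for a feature with 1895 distinct categories.
    raw = [i % 1895 for i in range(5000)]
    mapping, arity = build_category_mapping(raw, max_categories=32)
    grouped = [remap(v, mapping, arity) for v in raw]
    print(arity, max(grouped))
```

The returned arity is the number that would then go into categoricalFeaturesInfo for that feature index, in place of 1895, with the remapped values used as the feature values. With the default of 32 it stays within the maxBins (= 32) limit from the error message.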