[ML] RandomForestRegressor training set size for each tree
We are using RandomForestRegressor from Spark 2.1.1 to train a model. To make sure we have appropriate parameters, we start with a very small dataset, one that has 6024 rows. The regressor is created with this code:

    val rf = new RandomForestRegressor()
      .setLabelCol("MyLabel")
      .setFeaturesCol("MyFeatures")
      .setImpurity("variance")
      .setMaxDepth(3)
      .setMinInstancesPerNode(1)
      .setMinInfoGain(0)
      .setNumTrees(2)
      .setFeatureSubsetStrategy("onethird")
      .setMaxBins(32)
      .setSubsamplingRate(1)

    val model = rf.fit(train)

Using the debugger, I can observe the ImpurityStats for each rootNode on each DecisionTreeModel inside the trees array. The stat I am interested in is the first one in the stats array, because it is the number of rows the node has been trained with. What I find strange is that this value is not always 6024 for each rootNode, but sometimes more and sometimes less.

From my understanding of the method, I was under the impression that each tree would be trained with exactly the same number of rows as the original training set. Looking at the source code, I could not fully figure out where this happens, nor why it was decided to do so.

Are there any resources discussing this behavior?

- To unsubscribe e-mail: user-unsubscr...@spark.apache.org
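One possible explanation, offered as an assumption to check against the sources rather than as a confirmed answer: if the bootstrap sample of each tree is built by giving every row an independent Poisson(subsamplingRate) weight, instead of drawing exactly n rows, then the weighted row count at each root is a sum of 6024 Poisson draws and varies around 6024. The plain Scala sketch below simulates that scheme only; `PoissonBaggingSketch`, `poisson` and `rootCount` are illustrative names, not Spark APIs.

```scala
import scala.util.Random

object PoissonBaggingSketch {
  // Knuth's method for drawing one Poisson(lambda) sample; adequate for small lambda.
  def poisson(lambda: Double, rng: Random): Int = {
    val limit = math.exp(-lambda)
    var k = 0
    var p = rng.nextDouble()
    while (p > limit) {
      k += 1
      p *= rng.nextDouble()
    }
    k
  }

  // Weighted row count seen by the root node of one tree, under the assumed scheme:
  // every row gets its own Poisson(subsamplingRate) weight.
  def rootCount(numRows: Int, subsamplingRate: Double, rng: Random): Long =
    (1 to numRows).iterator.map(_ => poisson(subsamplingRate, rng).toLong).sum

  def main(args: Array[String]): Unit = {
    val rng = new Random(42L)
    // Two trees, 6024 rows, subsampling rate 1: the counts are close to 6024, not equal to it.
    val counts = Seq.fill(2)(rootCount(6024, 1.0, rng))
    println(counts.mkString(", "))
  }
}
```

Under this assumption the expected count is 6024 with a standard deviation of about sqrt(6024), roughly 78 rows, which would be consistent with seeing "sometimes more and sometimes less".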
Getting multiple regression metrics at once
Hello,

I'm working with the ML package for regression purposes and I get good results on my data. I'm now trying to get multiple metrics at once. Right now, I'm doing what is suggested by the examples here:
https://spark.apache.org/docs/2.1.0/ml-classification-regression.html

Basically, the code in the examples is this:

    val evaluator = new RegressionEvaluator()
      .setLabelCol("label")
      .setPredictionCol("prediction")
      .setMetricName("rmse")
    val rmse = evaluator.evaluate(predictions)

This gives me the RMSE for my test data, which is fine, but I'm also interested in MSE, MAE, MAPE, Rsquare and Qsquare.

I thus looked at the documentation here:
https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/ml/evaluation/RegressionEvaluator.html#metricName%28%29
where I see that I can get RMSE, MSE, MAE and Rsquare, but it does not appear that I can get them all computed at once, going over the data rows only once rather than once per metric, as the example code would suggest.

How can I achieve that single-pass computation?

Then, MAPE and Qsquare are missing. How can I get those computed as well, ideally while computing the four others?

Regards
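A minimal sketch of the single-pass idea, in plain Scala and under stated assumptions: all the requested error metrics except Qsquare can be derived from a handful of running sums, so one traversal of the (label, prediction) pairs is enough. The names `RegressionMetricsSketch` and `Acc` are hypothetical. MAPE is returned as a fraction and skips rows whose label is zero, which is one convention among several. Qsquare is left out because its definition (usually a cross-validated R²) needs more than one fitted model.

```scala
object RegressionMetricsSketch {
  // Running sums; `add` folds in one row and `merge` combines two partial results.
  final case class Acc(
      n: Long = 0L,
      sumY: Double = 0.0,
      sumY2: Double = 0.0,
      sumSqErr: Double = 0.0,
      sumAbsErr: Double = 0.0,
      sumAbsPctErr: Double = 0.0,
      nonZeroLabels: Long = 0L) {

    def add(label: Double, prediction: Double): Acc = {
      val err = label - prediction
      val usable = label != 0.0 // MAPE is undefined for a zero label
      copy(
        n = n + 1,
        sumY = sumY + label,
        sumY2 = sumY2 + label * label,
        sumSqErr = sumSqErr + err * err,
        sumAbsErr = sumAbsErr + math.abs(err),
        sumAbsPctErr = if (usable) sumAbsPctErr + math.abs(err / label) else sumAbsPctErr,
        nonZeroLabels = if (usable) nonZeroLabels + 1 else nonZeroLabels)
    }

    def merge(o: Acc): Acc = Acc(
      n + o.n, sumY + o.sumY, sumY2 + o.sumY2, sumSqErr + o.sumSqErr,
      sumAbsErr + o.sumAbsErr, sumAbsPctErr + o.sumAbsPctErr, nonZeroLabels + o.nonZeroLabels)

    def mse: Double = sumSqErr / n
    def rmse: Double = math.sqrt(mse)
    def mae: Double = sumAbsErr / n
    def mape: Double = sumAbsPctErr / nonZeroLabels
    // R² = 1 - SS_res / SS_tot, with SS_tot = sum(y²) - (sum y)² / n.
    // This shortcut is simple but can lose precision on large, badly scaled labels.
    def r2: Double = 1.0 - sumSqErr / (sumY2 - sumY * sumY / n)
  }

  // One pass over (label, prediction) pairs.
  def evaluate(rows: Iterable[(Double, Double)]): Acc =
    rows.foldLeft(Acc()) { case (acc, (label, prediction)) => acc.add(label, prediction) }
}
```

The `add`/`merge` pair has the shape expected by `RDD.aggregate(zeroValue)(seqOp, combOp)`, so the same accumulator could presumably be applied to an RDD of (label, prediction) pairs extracted from the predictions DataFrame. It may also be worth looking at `org.apache.spark.mllib.evaluation.RegressionMetrics`, which exposes MSE, RMSE, MAE and R² from one object, though not MAPE.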
[ML] Performance issues with GBTRegressor
Hello all,

I'm using Spark for regression analysis on medium to large datasets, and its performance is very good when using random forests or decision trees. Continuing my experimentation, I started using GBTRegressor and am finding it extremely slow compared to R, while both other methods were very fast.

Two examples to illustrate:

- On a 300k-row dataset, R takes 3 minutes and GBTRegressor takes 15 minutes to process 2000 iterations, with maxDepth = 1 and minInstancesPerNode = 50.
- On a 3M-row dataset, R takes 3 minutes and GBTRegressor takes 47 minutes to process 10 iterations, with maxDepth = 2 and minInstancesPerNode = 50.

I placed the code for the first example at the end of this message.

For the 300k dataset, I understand that there is a setup cost associated with Spark, which means that small datasets may not be processed as efficiently as in R, even if my testing with DecisionTree and RandomForest shows otherwise.

When I look at CPU usage for the GBT, it spikes at 90% CPU usage (7 out of 8 cores) for relatively short bursts and then goes back to 8-10% (less than one core) for quite a while. Compared to R, which uses one core for its full 3-minute run, this is quite surprising.

What have I missed in my setup?

I've been told that the behavior I'm observing may be related to data skewness, but I'm not sure what is at play here. To my untrained eye, it looks as if there is an issue in the GBTRegressor class, but I can't figure it out.

Any help would be most welcome.
Regards

R code:

    train <- read.table("c:/Path/to/file.csv", header=T, sep=";", dec=".")
    train$X1 <- factor(train$X1)
    train$X2 <- factor(train$X2)
    train$X3 <- factor(train$X3)
    train$X4 <- factor(train$X4)
    train$X5 <- factor(train$X5)
    train$X6 <- factor(train$X6)
    train$X7 <- factor(train$X7)
    train$X8 <- factor(train$X8)
    train$X9 <- factor(train$X9)

    library(gbm)
    boost <- gbm(Freq~X1+X2+X3+X4+X5+X6+X7+X8+X9+Y1,
                 distribution = "gaussian",
                 data = train,
                 n.trees = 2000,
                 bag.fraction = 1,
                 shrinkage = 1,
                 interaction.depth = 1,
                 n.minobsinnode = 50,
                 train.fraction = 1.0,
                 cv.folds = 0,
                 keep.data = TRUE)

Scala code for Spark:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.ml.regression.GBTRegressor

    // Note: the driver JVM is already running here, so spark.driver.memory
    // set this way may have no effect in local mode.
    val conf = new SparkConf()
      .setAppName("GBTExample")
      .set("spark.driver.memory", "8g")
      .set("spark.executor.memory", "8g")
      .set("spark.network.timeout", "120s")
    val sc = SparkContext.getOrCreate(conf.setMaster("local[8]"))
    val spark = SparkSession.builder().getOrCreate()
    import spark.implicits._

    val sourceData = spark.read.format("com.databricks.spark.csv")
      .option("header", "true")
      .option("delimiter", ";")
      .option("inferSchema", "true")
      .load("c:/Path/to/file.csv")

    val data = sourceData.select($"X1", $"X2", $"X3", $"X4", $"X5", $"X6", $"X7", $"X8", $"X9",
      $"Y1".cast("double"), $"Freq".cast("double"))

    // One indexer per categorical column (the equivalent of factor() in the R code)
    val X1Indexer = new StringIndexer().setInputCol("X1").setOutputCol("X1Idx")
    val X2Indexer = new StringIndexer().setInputCol("X2").setOutputCol("X2Idx")
    val X3Indexer = new StringIndexer().setInputCol("X3").setOutputCol("X3Idx")
    val X4Indexer = new StringIndexer().setInputCol("X4").setOutputCol("X4Idx")
    val X5Indexer = new StringIndexer().setInputCol("X5").setOutputCol("X5Idx")
    val X6Indexer = new StringIndexer().setInputCol("X6").setOutputCol("X6Idx")
    val X7Indexer = new StringIndexer().setInputCol("X7").setOutputCol("X7Idx")
    val X8Indexer = new StringIndexer().setInputCol("X8").setOutputCol("X8Idx")
    val X9Indexer = new StringIndexer().setInputCol("X9").setOutputCol("X9Idx")

    val assembler = new VectorAssembler()
      .setInputCols(Array("X1Idx", "X2Idx", "X3Idx", "X4Idx", "X5Idx", "X6Idx", "X7Idx", "X8Idx", "X9Idx", "Y1"))
      .setOutputCol("features")

    val dt = new GBTRegressor()
      .setLabelCol("Freq")
      .setFeaturesCol("features")
      .setImpurity("variance")
      .setMaxIter(2000)
      .setMinInstancesPerNode(50)
      .setMaxDepth(1)
      .setStepSize(1)
      .setSubsamplingRate(1)
      .setMaxBins(32)

    val pipeline = new Pipeline()
      .setStages(Array(X1Indexer, X2Indexer, X3Indexer, X4Indexer, X5Indexer, X6Indexer, X7Indexer, X8Indexer, X9Indexer, assembler, dt))

    val model = pipeline.fit(data)
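The CPU pattern described above may be related to the sequential nature of boosting rather than to the data; this is a guess, not a diagnosis. The plain Scala sketch below shows a simplified squared-loss boosting loop with depth-1 trees on a single numeric feature: tree m can only be fitted once the predictions of trees 1 to m-1 are known, so the iterations cannot overlap, and in a distributed setting each one would carry its own scheduling overhead. `BoostingSketch`, `Stump`, `fitStump` and `boost` are hypothetical names, and the loop is not Spark's GBT implementation.

```scala
object BoostingSketch {
  // A depth-1 tree (stump) on a single numeric feature.
  final case class Stump(threshold: Double, left: Double, right: Double) {
    def predict(x: Double): Double = if (x <= threshold) left else right
  }

  private def mean(xs: Seq[Double]): Double = if (xs.isEmpty) 0.0 else xs.sum / xs.size

  // Pick the threshold whose two child means give the lowest squared error.
  def fitStump(xs: Seq[Double], targets: Seq[Double]): Stump = {
    val rows = xs.zip(targets)
    // Dropping the largest value keeps the right child non-empty.
    val candidates = xs.distinct.sorted.dropRight(1)
    def build(t: Double): (Double, Stump) = {
      val (l, r) = rows.partition(_._1 <= t)
      val stump = Stump(t, mean(l.map(_._2)), mean(r.map(_._2)))
      val sse = rows.map { case (x, y) => math.pow(y - stump.predict(x), 2) }.sum
      (sse, stump)
    }
    if (candidates.isEmpty) Stump(Double.PositiveInfinity, mean(targets), mean(targets))
    else candidates.map(build).minBy(_._1)._2
  }

  // Each iteration fits a stump to the residuals left by all previous stumps,
  // so iteration m cannot start before iteration m-1 has finished.
  def boost(xs: Seq[Double], ys: Seq[Double], iterations: Int, stepSize: Double): Seq[Stump] = {
    var predictions = Seq.fill(xs.size)(0.0)
    val stumps = Seq.newBuilder[Stump]
    for (_ <- 1 to iterations) {
      val residuals = ys.zip(predictions).map { case (y, p) => y - p }
      val stump = fitStump(xs, residuals)
      predictions = xs.zip(predictions).map { case (x, p) => p + stepSize * stump.predict(x) }
      stumps += stump
    }
    stumps.result()
  }

  def predict(stumps: Seq[Stump], stepSize: Double, x: Double): Double =
    stumps.map(s => stepSize * s.predict(x)).sum
}
```

If this is indeed the cause, 2000 iterations of very shallow trees would mean 2000 short bursts of parallel work separated by coordination, which would look like brief CPU spikes followed by idle periods.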
Re: [ML] Stop conditions for RandomForest
To me, they are. Y is used to control whether a split is a valid candidate when deciding which one to follow. X is used to make a node a leaf if it has too few elements to even consider candidate splits.

颜发才(Yan Facai) wrote:
> It seems that splitting will always stop when a node's row count is less than max(X, Y). Hence, are they different?
>
> On Tue, Jun 27, 2017 at 11:07 PM, OBones <obo...@free.fr> wrote:
>> Hello,
>>
>> Reading around on the theory behind tree-based regression, I concluded that there are various reasons to stop exploring the tree when a given node has been reached. Among these, I have those two:
>>
>> 1. When starting to process a node, if its size (row count) is less than X, then consider it a leaf.
>> 2. When a split for a node is considered, if any side of the split has a size less than Y, then ignore it when selecting the best split.
>>
>> As an example, let's consider a node with 45 rows that, for a given split, creates two children containing 5 and 40 rows respectively.
>>
>> If I set X to 50, then the node is a leaf and no split is attempted.
>> If I set X to 10 and Y to 15, then the splits are computed, but because one of the children has fewer than 15 rows, that split is ignored.
>>
>> I'm using DecisionTreeRegressor and RandomForestRegressor on our data and, because the former is implemented using the latter, they both share the same parameters. Going through those parameters, I found minInstancesPerNode, which to me is the Y value, but I could not find any parameter for the X value.
>>
>> Have I missed something? If not, would there be a way to implement this?
>>
>> Regards
[ML] Stop conditions for RandomForest
Hello,

Reading around on the theory behind tree-based regression, I concluded that there are various reasons to stop exploring the tree when a given node has been reached. Among these, I have those two:

1. When starting to process a node, if its size (row count) is less than X, then consider it a leaf.
2. When a split for a node is considered, if any side of the split has a size less than Y, then ignore it when selecting the best split.

As an example, let's consider a node with 45 rows that, for a given split, creates two children containing 5 and 40 rows respectively.

If I set X to 50, then the node is a leaf and no split is attempted.
If I set X to 10 and Y to 15, then the splits are computed, but because one of the children has fewer than 15 rows, that split is ignored.

I'm using DecisionTreeRegressor and RandomForestRegressor on our data and, because the former is implemented using the latter, they both share the same parameters. Going through those parameters, I found minInstancesPerNode, which to me is the Y value, but I could not find any parameter for the X value.

Have I missed something? If not, would there be a way to implement this?

Regards
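The two stop conditions can be sketched as a small decision function. This is plain Scala written only to illustrate the distinction between X and Y; `StopConditionsSketch`, `decide`, `minNodeSize` (X) and `minPerChild` (Y) are hypothetical names and do not correspond to Spark parameters, apart from Y playing the role attributed to minInstancesPerNode in the message.

```scala
object StopConditionsSketch {
  sealed trait Decision
  case object Leaf extends Decision          // node too small: no split is even attempted
  case object SplitRejected extends Decision // split was computed, but one child is too small
  case object SplitAccepted extends Decision // split remains a valid candidate

  // nodeSize:    row count of the node being processed
  // leftSize:    row count of one child of the candidate split (the other gets the rest)
  // minNodeSize: X, the minimum node size required to consider splitting
  // minPerChild: Y, the minimum size required for each child
  def decide(nodeSize: Int, leftSize: Int, minNodeSize: Int, minPerChild: Int): Decision = {
    val rightSize = nodeSize - leftSize
    if (nodeSize < minNodeSize) Leaf
    else if (leftSize < minPerChild || rightSize < minPerChild) SplitRejected
    else SplitAccepted
  }
}
```

For a 45-row node whose candidate split puts 5 rows on one side, `decide(45, 5, 50, 15)` gives `Leaf`, while `decide(45, 5, 10, 15)` gives `SplitRejected`: the first case saves the cost of computing splits at all, the second only filters them.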
Re: [How-To] Migrating from mllib.tree.DecisionTree to ml.regression.DecisionTreeRegressor
OBones wrote:
> So, I tried to rewrite my sample code using the ml package and it is very much easier to use, no need for the LabeledPoint transformation. Here is the code I came up with:
>
>     val dt = new DecisionTreeRegressor()
>       .setPredictionCol("Y")
>       .setImpurity("variance")
>       .setMaxDepth(30)
>       .setMaxBins(32)
>
>     val model = dt.fit(data)
>     println(model.toDebugString)
>     println(model.featureImportances.toString)
>
> However, I cannot find a way to specify which columns are features, which ones are categorical and how many categories they have, like I used to do with the mllib package.

Well, further research led me to add the following code to indicate which columns are categorical:

    val X1Attribute = NominalAttribute.defaultAttr.withName("X1").withValues("0", "1").toMetadata
    val X2Attribute = NominalAttribute.defaultAttr.withName("X2").withValues("0", "1", "2").toMetadata
    val dataWithAttributes = data
      .withColumn("X1", $"X1".as("X1", X1Attribute))
      .withColumn("X2", $"X2".as("X2", X2Attribute))

but when I run this:

    val model = dt.fit(dataWithAttributes)

I get the following error:

    java.lang.IllegalArgumentException: Field "features" does not exist.

It makes sense, because I have yet to find a way to specify which columns are features. I also have to figure out what the label column is and how it differs from the prediction column, as only the latter was used with the mllib package.
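A hedged sketch of one way this could be wired together, assuming Spark 2.x on the classpath: the feature columns are packed into a single vector column with VectorAssembler, which as far as I know carries the nominal attributes of its input columns into the vector metadata, and the target is declared with setLabelCol while setPredictionCol only names the output column. The tiny in-memory dataset, the object name `CategoricalTreeSketch` and the `train` helper are made up for illustration.

```scala
import org.apache.spark.ml.attribute.NominalAttribute
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.{DecisionTreeRegressionModel, DecisionTreeRegressor}
import org.apache.spark.sql.SparkSession

object CategoricalTreeSketch {
  def train(spark: SparkSession): DecisionTreeRegressionModel = {
    import spark.implicits._

    // Made-up rows, only there to keep the sketch self-contained.
    val data = Seq(
      (0.0, 0.0, 1.5, 10.0),
      (0.0, 1.0, 2.5, 12.0),
      (1.0, 2.0, 3.5, 20.0),
      (1.0, 0.0, 4.5, 22.0),
      (0.0, 2.0, 5.5, 11.0),
      (1.0, 1.0, 6.5, 21.0)
    ).toDF("X1", "X2", "X3", "Y")

    // Declare X1 (two categories) and X2 (three categories) as nominal.
    val x1Meta = NominalAttribute.defaultAttr.withName("X1").withValues("0", "1").toMetadata()
    val x2Meta = NominalAttribute.defaultAttr.withName("X2").withValues("0", "1", "2").toMetadata()
    val withAttributes = data
      .withColumn("X1", $"X1".as("X1", x1Meta))
      .withColumn("X2", $"X2".as("X2", x2Meta))

    // Pack the feature columns into the single vector column the regressor expects.
    val assembler = new VectorAssembler()
      .setInputCols(Array("X1", "X2", "X3"))
      .setOutputCol("features")
    val assembled = assembler.transform(withAttributes)

    val dt = new DecisionTreeRegressor()
      .setLabelCol("Y")                // the column to learn from
      .setFeaturesCol("features")      // the assembled vector column
      .setPredictionCol("prediction")  // the column written by model.transform
      .setImpurity("variance")
      .setMaxDepth(30)
      .setMaxBins(32)
    dt.fit(assembled)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("CategoricalTreeSketch").getOrCreate()
    val model = train(spark)
    println(model.toDebugString)
    println(model.featureImportances)
    spark.stop()
  }
}
```

In this arrangement the label column is the input the tree is trained against, and the prediction column is where the fitted model writes its output when transforming a DataFrame.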
[How-To] Migrating from mllib.tree.DecisionTree to ml.regression.DecisionTreeRegressor
Hello,

I have written the following Scala code to train a regression tree, based on mllib:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.tree.DecisionTree
    import org.apache.spark.sql.SparkSession

    val conf = new SparkConf().setAppName("DecisionTreeRegressionExample")
    val sc = new SparkContext(conf)
    val spark = SparkSession.builder().getOrCreate()
    import spark.implicits._ // needed for the $"..." column syntax and for data.map

    val sourceData = spark.read.format("com.databricks.spark.csv")
      .option("header", "true")
      .option("delimiter", ";")
      .load("C:\\Data\\source_file.csv")

    val data = sourceData.select($"X3".cast("double"), $"Y".cast("double"), $"X1".cast("double"), $"X2".cast("double"))

    val featureIndices = List("X1", "X2", "X3").map(data.columns.indexOf(_))
    val targetIndex = data.columns.indexOf("Y")

    // WARNING: Indices in categoricalFeaturesInfo are those inside the vector we build from the featureIndices list
    // Column 0 has two modalities, Column 1 has three
    val categoricalFeaturesInfo = Map[Int, Int]((0, 2), (1, 3))
    val impurity = "variance"
    val maxDepth = 30
    val maxBins = 32

    val labeled = data.map(row => LabeledPoint(row.getDouble(targetIndex), Vectors.dense(featureIndices.map(row.getDouble(_)).toArray)))

    val model = DecisionTree.trainRegressor(labeled.rdd, categoricalFeaturesInfo, impurity, maxDepth, maxBins)

    println(model.toDebugString)

This works quite well, but I want some information from the model, including the feature importance values. As it turns out, this is not available on DecisionTreeModel but is available on DecisionTreeRegressionModel from the ml package.

I then discovered that the ml package is more recent than the mllib package, which explains why it gives me more control over the trees I'm building.

So, I tried to rewrite my sample code using the ml package and it is very much easier to use, no need for the LabeledPoint transformation.
Here is the code I came up with:

    val dt = new DecisionTreeRegressor()
      .setPredictionCol("Y")
      .setImpurity("variance")
      .setMaxDepth(30)
      .setMaxBins(32)

    val model = dt.fit(data)
    println(model.toDebugString)
    println(model.featureImportances.toString)

However, I cannot find a way to specify which columns are features, which ones are categorical and how many categories they have, like I used to do with the mllib package.

I did look at the DecisionTreeRegressionExample.scala example found in the source package, but it uses a VectorIndexer to automatically discover the above information, which is an unnecessary step in my case because I already have the information at hand.

The documentation found online (http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.regression.DecisionTreeRegressor) did not help either, because it does not indicate the format of the featuresCol string property.

Thanks in advance for your help.
Re: [How-To] Custom file format as source
Thanks to both of you, this should get me started.
[How-To] Custom file format as source
Hello,

I have an application here that generates data files in a custom binary format that provides the following information:

- A column list, where each column has a data type (64-bit integer, 32-bit string index, 64-bit IEEE float, 1-byte boolean)
- Catalogs that give the modalities for some columns (i.e., column 1 contains only the following values: A, B, C, D)
- An array for the actual data, where each row has a fixed size determined by the columns

Here is an example:

    Col1, 64bit integer
    Col2, 32bit string index
    Col3, 64bit integer
    Col4, 64bit float

    Catalog for Col1 = 10, 20, 30, 40, 50
    Catalog for Col2 = Big, Small, Large, Tall
    Catalog for Col3 = 101, 102, 103, 500, 5000
    Catalog for Col4 = (no catalog)

    Data array =
      8 bytes, 4 bytes, 8 bytes, 8 bytes,
      8 bytes, 4 bytes, 8 bytes, 8 bytes,
      8 bytes, 4 bytes, 8 bytes, 8 bytes,
      8 bytes, 4 bytes, 8 bytes, 8 bytes,
      8 bytes, 4 bytes, 8 bytes, 8 bytes,
      ...

I would like to use this kind of file as a source for various ML-related computations (CART, RandomForest, gradient boosting...) and Spark is very interesting in this area.

However, I'm a bit lost as to what I should write to have Spark use that file format as a source for its computations. Considering that those files are quite big (100 million rows, hundreds of gigabytes on disk), I'd rather not create something that writes a new file in a built-in format; I'd rather write some code that makes Spark accept the file as it is.

I looked around and saw the textFile method, but it is not applicable to my case. I also saw the spark.read.format("libsvm") syntax, which tells me that there is a list of supported formats known to Spark, which I believe are called Dataframes, but I could not find any tutorial on this subject.

Would you have any suggestions or links to documentation that would get me started?

Regards,
Olivier
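Whatever mechanism ends up feeding the bytes to Spark, the per-row decoding step could look like the plain Scala sketch below. It assumes the example layout above (28 bytes per row: 64-bit integer, 32-bit string index, 64-bit integer, 64-bit float), little-endian byte order, and that the 32-bit index points into the Col2 catalog; all three are assumptions, since the message does not specify them. `FixedRowSketch`, `Row` and `decodeRow` are hypothetical names.

```scala
import java.nio.{ByteBuffer, ByteOrder}

object FixedRowSketch {
  // One decoded row of the example layout.
  final case class Row(col1: Long, col2: String, col3: Long, col4: Double)

  // 8 (int64) + 4 (string index) + 8 (int64) + 8 (float64) bytes per row.
  val RowSize: Int = 8 + 4 + 8 + 8

  def decodeRow(bytes: Array[Byte], col2Catalog: IndexedSeq[String]): Row = {
    require(bytes.length == RowSize, s"expected $RowSize bytes, got ${bytes.length}")
    // Byte order is an assumption; switch to BIG_ENDIAN if the writer uses it.
    val buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)
    val col1 = buf.getLong()
    val col2 = col2Catalog(buf.getInt()) // resolve the string index through the catalog
    val col3 = buf.getLong()
    val col4 = buf.getDouble()
    Row(col1, col2, col3, col4)
  }
}
```

On the Spark side, `SparkContext.binaryRecords(path, recordLength)` returns an RDD of fixed-length byte arrays that a function like `decodeRow` could be mapped over, but it expects the file to contain nothing except fixed-length records. Since this format also stores a column list and catalogs, a custom Hadoop InputFormat or a data source implementation that skips the header may be needed instead.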