[ML] RandomForestRegressor training set size for each trees

2018-03-05 Thread OBones

We are using |RandomForestRegressor| from Spark 2.1.1 to train a model.

To make sure we have the appropriate parameters we start with a very 
small dataset, one that has 6024 lines. The regressor is created with 
this code:


|val rf = new RandomForestRegressor() .setLabelCol("MyLabel") 
.setFeaturesCol("MyFeatures") .setImpurity("variance") .setMaxDepth(3.) 
.setMinInstancesPerNode(1) .setMinInfoGain(0) .setNumTrees(2) 
.setFeatureSubsetStrategy("onethird") .setMaxBins(32) 
.setSubsamplingRate(1) val model = rf.fit(train) |


Using the debugger I can observe the |ImpurityStats| for each |rootNode| 
on each |DecisionTreeModel| inside the |trees| array. The stat that I am 
interested in is the first one in the |stats| array because it is the 
number of rows that the node has been trained with.


What I find strange is that this value for each |rootNode| is not always 
6024 but sometimes more and sometimes less.
From my understanding of the method I was under the impression that 
each tree would be trained with exactly the same number of rows than the 
original training set.


Looking at the source code, I could not fully figure out where this 
happens, nor why it was decided to do so.


Are there any resources discussing this behavior?


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Getting multiple regression metrics at once

2017-12-18 Thread OBones

Hello,

I'm working with the ML package for regression purposes and I get good 
results on my data.
I'm now trying to get multiple metrics at once, as right now, I'm doing 
what is suggested by the examples here:

https://spark.apache.org/docs/2.1.0/ml-classification-regression.html

Basically the code in the examples is this:

val  evaluator  =  new  RegressionEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("rmse")
val  rmse  =  evaluator.evaluate(predictions)

This gives me the RMSE for my test data which is fine, but I'm also 
interested in MSE, MAE, MAPE, Rsquare and Qsquare

I thus looked at the documentation here:
https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/ml/evaluation/RegressionEvaluator.html#metricName%28%29

where I see that I can get RMSE, MSE, MAE and Rsquare but it does not 
appear that I can get them computed all at once, going over the data 
rows only once and not 5 times as the example code would suggest it is 
needed to do so.


How can I achieve that single pass computation?

Then, there are MAPE and Qsquare missing, how can I get those computed 
as well, ideally while computing the 4 others?


Regards

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



[ML] Performance issues with GBTRegressor

2017-07-12 Thread OBones

Hello all,

I'm using Spark for medium to large datasets regression analysis and its 
performance are very great when using random forest or decision trees.
Continuing my experimentation, I started using GBTRegressor and am 
finding it extremely slow when compared to R while both other methods 
were very fast.

Two examples to illustrate :
 - on a 300k lines dataset, R takes 3 minutes and GBTRegressor 15 to 
process 2000 iterations, maxdepth = 1, MinInstancesPerNode = 50
 - on a 3M lines dataset, R takes 3 minutes and GBTRegressor 47 to 
process 10 iterations, maxdepth = 2, MinInstancesPerNode = 50


I placed the code for the first example at the end of this message.

For the 300k dataset, I understand that there is a setup cost associated 
to Spark which means that small datasets may not be processed as 
efficiently as in R, even if my testing with DecisionTree and 
RandomForest shows otherwise.
When I look at CPU usage for the GBT, it has spikes at 90% CPU usage (7 
out of 8 cores) for relatively short bursts and then goes back to 8/10% 
(less than one core) for quite a while.
Comparing to R that takes 1 core for its full 3 minutes run, it's quite 
surprising

What have I missed in my setup?

I've been told that the behavior I'm observing may be related to data 
skewness, but I'm not sure what's at hand here.


From my untrained eye, it looks as if there was an issue in the 
GBTRegressor class, but I can't figure it out.


Any help would be most welcome.

Regards

=
R code

train <- read.table("c:/Path/to/file.csv", header=T, sep=";",dec=".")
train$X1 <- factor(train$X1)
train$X2 <- factor(train$X2)
train$X3 <- factor(train$X3)
train$X4 <- factor(train$X4)
train$X5 <- factor(train$X5)
train$X6 <- factor(train$X6)
train$X7 <- factor(train$X7)
train$X8 <- factor(train$X8)
train$X9 <- factor(train$X9)

library(gbm)
boost <- gbm(Freq~X1+X2+X3+X4+X5+X6+X7+X8+X9+Y1, distribution = 
"gaussian", data = train, n.trees = 2000, bag.fraction = 1, shrinkY1 = 
1, interaction.depth = 1, n.minobsinnode = 50, train.fraction = 1.0, 
cv.folds = 0, keep.data = TRUE)


=
scala code for Spark

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.regression.GBTRegressor

val conf = new SparkConf()
  .setAppName("GBTExample")
  .set("spark.driver.memory", "8g")
  .set("spark.executor.memory", "8g")
  .set("spark.network.timeout", "120s")
val sc = SparkContext.getOrCreate(conf.setMaster("local[8]"))
val spark = new SparkSession.Builder().getOrCreate()
import spark.implicits._

val sourceData = spark.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("delimiter", ";")
  .option("inferSchema", "true")
  .load("c:/Path/to/file.csv")

val data = sourceData.select($"X1", $"X2", $"X3", $"X4", $"X5", $"X6", 
$"X7", $"X8", $"X9", $"Y1".cast("double"), $"Freq".cast("double"))


val X1Indexer = new StringIndexer().setInputCol("X1").setOutputCol("X1Idx")
val X2Indexer = new StringIndexer().setInputCol("X2").setOutputCol("X2Idx")
val X3Indexer = new StringIndexer().setInputCol("X3").setOutputCol("X3Idx")
val X4Indexer = new StringIndexer().setInputCol("X4").setOutputCol("X4Idx")
val X5Indexer = new StringIndexer().setInputCol("X5").setOutputCol("X5Idx")
val X6Indexer = new StringIndexer().setInputCol("X6").setOutputCol("X6Idx")
val X7Indexer = new StringIndexer().setInputCol("X7").setOutputCol("X7Idx")
val X8Indexer = new StringIndexer().setInputCol("X8").setOutputCol("X8Idx")
val X9Indexer = new StringIndexer().setInputCol("X9").setOutputCol("X9Idx")

val assembler = new VectorAssembler()
  .setInputCols(Array("X1Idx", "X2Idx", "X3Idx", "X4Idx", "X5Idx", 
"X6Idx", "X7Idx", "X8Idx", "X9Idx", "Y1"))

  .setOutputCol("features")

val dt = new GBTRegressor()
  .setLabelCol("Freq")
  .setFeaturesCol("features")
  .setImpurity("variance")
  .setMaxIter(2000)
  .setMinInstancesPerNode(50)
  .setMaxDepth(1)
  .setStepSize(1)
  .setSubsamplingRate(1)
  .setMaxBins(32)

val pipeline = new Pipeline()
  .setStages(Array(X1Indexer, X2Indexer, X3Indexer, X4Indexer, 
X5Indexer, X6Indexer, X7Indexer, X8Indexer, X9Indexer, assembler, dt))


val model = pipeline.fit(data)

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: [ML] Stop conditions for RandomForest

2017-06-28 Thread OBones

To me, they are.
Y is used to control if a split is a valid candidate when deciding which 
one to follow.
X is used to make a node a leaf if it has too few elements to even 
consider candidate splits.


颜发才(Yan Facai) wrote:
It seems that split will always stop when count of nodes is less than 
max(X, Y).

Hence, are they different?



On Tue, Jun 27, 2017 at 11:07 PM, OBones <mailto:obo...@free.fr>> wrote:


Hello,

Reading around on the theory behind tree based regression, I
concluded that there are various reasons to stop exploring the
tree when a given node has been reached. Among these, I have those
two:

1. When starting to process a node, if its size (row count) is
less than X then consider it a leaf
2. When a split for a node is considered, if any side of the split
has its size less than Y, then ignore it when selecting the best split

As an example, let's consider a node with 45 rows, that for a
given split creates two children, containing 5 and 35 rows
respectively.

If I set X to 50, then the node is a leaf and no split is attempted
if I set X to 10 and Y to 15, then the splits are computed but
because one of them has less than 15 rows, that split is ignored.

I'm using DecisionTreeRegressor and RandomForestRegressor on our
data and because the former is implemented using the latter, they
both share the same parameters.
Going through those parameters, I found minInstancesPerNode which
to me is the Y value, but I could not find any parameter for the X
value.
Have I missed something?
If not, would there be a way to implement this?

Regards



-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
<mailto:user-unsubscr...@spark.apache.org>





-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



[ML] Stop conditions for RandomForest

2017-06-27 Thread OBones

Hello,

Reading around on the theory behind tree based regression, I concluded 
that there are various reasons to stop exploring the tree when a given 
node has been reached. Among these, I have those two:


1. When starting to process a node, if its size (row count) is less than 
X then consider it a leaf
2. When a split for a node is considered, if any side of the split has 
its size less than Y, then ignore it when selecting the best split


As an example, let's consider a node with 45 rows, that for a given 
split creates two children, containing 5 and 35 rows respectively.


If I set X to 50, then the node is a leaf and no split is attempted
if I set X to 10 and Y to 15, then the splits are computed but because 
one of them has less than 15 rows, that split is ignored.


I'm using DecisionTreeRegressor and RandomForestRegressor on our data 
and because the former is implemented using the latter, they both share 
the same parameters.
Going through those parameters, I found minInstancesPerNode which to me 
is the Y value, but I could not find any parameter for the X value.

Have I missed something?
If not, would there be a way to implement this?

Regards



-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: [How-To] Migrating from mllib.tree.DecisionTree to ml.regression.DecisionTreeRegressor

2017-06-15 Thread OBones

OBones wrote:
So, I tried to rewrite my sample code using the ml package and it is 
very much easier to use, no need for the LabeledPoint transformation. 
Here is the code I came up with:


val dt = new DecisionTreeRegressor()
  .setPredictionCol("Y")
  .setImpurity("variance")
  .setMaxDepth(30)
  .setMaxBins(32)

val model = dt.fit(data)

println(model.toDebugString)
println(model.featureImportances.toString)

However, I cannot find a way to specify which columns are features, 
which ones are categorical and how many categories they have, like I 
used to do with the mllib package.
Well, further research led me to adding the following code to indicate 
which columns are categorical:


val X1Attribute = 
NominalAttribute.defaultAttr.withName("X1").withValues("0", "1").toMetadata
val X2Attribute = 
NominalAttribute.defaultAttr.withName("X2").withValues("0", "1", 
"2").toMetadata


val dataWithAttributes = data.withColumn("X1", $"X1".as("X1", 
X1Attribute)).withColumn("X2", $"X2".as("X2", X2Attribute))


but when I run this:

  val model = dt.fit(dataWithAttributes )

I get the following error:
java.lang.IllegalArgumentException: Field "features" does not exist.

It makes sense because I am yet to find a way to specify which columns 
are features.
I also have to figure out what the label column is and what differences 
it has from the prediction column as only the latter was used with the 
mllib package.



-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



[How-To] Migrating from mllib.tree.DecisionTree to ml.regression.DecisionTreeRegressor

2017-06-15 Thread OBones

Hello,

I have written the following scala code to train a regression tree, 
based on mllib:


val conf = new SparkConf().setAppName("DecisionTreeRegressionExample")
val sc = new SparkContext(conf)
val spark = new SparkSession.Builder().getOrCreate()

val sourceData = 
spark.read.format("com.databricks.spark.csv").option("header", 
"true").option("delimiter", ";").load("C:\\Data\\source_file.csv")


val data = sourceData.select($"X3".cast("double"), 
$"Y".cast("double"), $"X1".cast("double"), $"X2".cast("double"))


val featureIndices = List("X1", "X2", 
"X3").map(data.columns.indexOf(_))

val targetIndex = data.columns.indexOf("Y")

// WARNING: Indices in categoricalFeatures info are those inside 
the vector we build from the featureIndices list

// Column 0 has two modalities, Column 1 has three
val categoricalFeaturesInfo = Map[Int, Int]((0, 2), (1, 3))
val impurity = "variance"
val maxDepth = 30
val maxBins = 32

val labeled = data.map(row => 
LabeledPoint(row.getDouble(targetIndex), 
Vectors.dense(featureIndices.map(row.getDouble(_)).toArray)))


val model = DecisionTree.trainRegressor(labeled.rdd, 
categoricalFeaturesInfo, impurity, maxDepth, maxBins)


println(model.toDebugString)

This works quite well, but I want some information from the model, one 
of them being the features importance values. As it turns out, this is 
not available on DecisionTreeModel but is available on 
DecisionTreeRegressionModel from the ml package.
I then discovered that the ml package is more recent than the mllib 
package which explains why it gives me more control over the trees I'm 
building.
So, I tried to rewrite my sample code using the ml package and it is 
very much easier to use, no need for the LabeledPoint transformation. 
Here is the code I came up with:


val dt = new DecisionTreeRegressor()
  .setPredictionCol("Y")
  .setImpurity("variance")
  .setMaxDepth(30)
  .setMaxBins(32)

val model = dt.fit(data)

println(model.toDebugString)
println(model.featureImportances.toString)

However, I cannot find a way to specify which columns are features, 
which ones are categorical and how many categories they have, like I 
used to do with the mllib package.
I did look at the DecisionTreeRegressionExample.scala example found in 
the source package, but it uses a VectorIndexer to automatically 
discover the above information which is an unnecessary step in my case 
because I already have the information at hand.


The documentation found online 
(http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.regression.DecisionTreeRegressor) 
did not help either because it does not indicate the format for the 
featuresCol string property.


Thanks in advance for your help.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: [How-To] Custom file format as source

2017-06-15 Thread OBones

Thanks to both of you, this should get me started.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



[How-To] Custom file format as source

2017-06-12 Thread OBones

Hello,

I have an application here that generates data files in a custom binary 
format that provides the following information:


Column list, each column has a data type (64 bit integer, 32 bit string 
index, 64 bit IEEE float, 1 byte boolean)
Catalogs that give modalities for some columns (ie, column 1 contains 
only the following values: A, B, C, D)

Array for actual data, each row has a fixed size according to the columns.

Here is an example:

Col1, 64bit integer
Col2, 32bit string index
Col3, 64bit integer
Col4, 64bit float

Catalog for Col1 = 10, 20, 30, 40, 50
Catalog for Col2 = Big, Small, Large, Tall
Catalog for Col3 = 101, 102, 103, 500, 5000
Catalog for Col4 = (no catalog)

Data array =
8 bytes, 4 bytes, 8 bytes, 8 bytes,
8 bytes, 4 bytes, 8 bytes, 8 bytes,
8 bytes, 4 bytes, 8 bytes, 8 bytes,
8 bytes, 4 bytes, 8 bytes, 8 bytes,
8 bytes, 4 bytes, 8 bytes, 8 bytes,
...

I would like to use this kind of file as a source for various ML related 
computations (CART, RandomForrest, Gradient boosting...) and Spark is 
very interesting in this area.
However, I'm a bit lost as to what I should write to have Spark use that 
file format as a source for its computation. Considering that those 
files are quite big (100 million lines, hundreds of gigs on disk), I'd 
rather not create something that writes a new file in a built-in format, 
but I'd rather write some code that makes Spark accept the file as it is.


I looked around and saw the textfile method but it is not applicable to 
my case. I also saw the spark.read.format("libsvm") syntax which tells 
me that there is a list of supported formats known to spark, which I 
believe are called Dataframes, but I could not find any tutorial on this 
subject.


Would you have any suggestion or links to documentation that would get 
me started?


Regards,
Olivier

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org