Re: Input size increasing every iteration of gradient boosted trees [1.4]
Since it sounds like this has been encountered 3 times, and I've personally seen it and mostly verified it, I think it's legit enough for a JIRA: SPARK-10433. I am sorry to say I don't know what is going on here, though.

On Thu, Sep 3, 2015 at 1:56 PM, Peter Rudenko wrote:
> Confirmed, having the same issue (1.4.1 mllib package). For a smaller
> dataset, accuracy degraded as well. Haven't tested yet in 1.5 with the
> ml package implementation.
>
> val boostingStrategy = BoostingStrategy.defaultParams("Classification")
> boostingStrategy.setNumIterations(30)
> boostingStrategy.setLearningRate(1.0)
> boostingStrategy.treeStrategy.setMaxDepth(3)
> boostingStrategy.treeStrategy.setMaxBins(128)
> boostingStrategy.treeStrategy.setSubsamplingRate(1.0)
> boostingStrategy.treeStrategy.setMinInstancesPerNode(1)
> boostingStrategy.treeStrategy.setUseNodeIdCache(true)
> boostingStrategy.treeStrategy.setCategoricalFeaturesInfo(
>   mapAsJavaMap(categoricalFeatures).asInstanceOf[java.util.Map[java.lang.Integer, java.lang.Integer]])
>
> val model = GradientBoostedTrees.train(instances, boostingStrategy)
>
> Thanks,
> Peter Rudenko
>
> On 2015-08-14 00:33, Sean Owen wrote:
>> Not that I have any answer at this point, but I was discussing this
>> exact same problem with Johannes today. An input size of ~20K records
>> was growing each iteration by ~15M records. I could not see why on a
>> first look.
>>
>> @jkbradley I know it's not much info but does that ring any bells? I
>> think Johannes even has an instance of this up and running for
>> examination.
>>
>> On Thu, Aug 13, 2015 at 10:04 PM, Matt Forbes wrote:
>>> I am training a boosted trees model on a couple million input samples
>>> (with around 300 features) and am noticing that the input size of each
>>> stage is increasing each iteration. For each new tree, the first step
>>> seems to be building the decision tree metadata, which does a .count()
>>> on the input data, so this is the step I've been using to track the
>>> input size changing. Here is what I'm seeing:
>>>
>>> count at DecisionTreeMetadata.scala:111
>>> 1. Input Size / Records: 726.1 MB / 1295620
>>> 2. Input Size / Records: 106.9 GB / 64780816
>>> 3. Input Size / Records: 160.3 GB / 97171224
>>> 4. Input Size / Records: 214.8 GB / 129680959
>>> 5. Input Size / Records: 268.5 GB / 162533424
>>>
>>> Input Size / Records: 1912.6 GB / 1382017686
>>>
>>> This step goes from taking less than 10s up to 5 minutes by the 15th
>>> or so iteration. I'm not quite sure what could be causing this. I am
>>> passing a memory-only cached RDD[LabeledPoint] to
>>> GradientBoostedTrees.train.
>>>
>>> Does anybody have some insight? Is this a bug or could it be an error
>>> on my part?

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
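If the growing input traces back to an ever-lengthening RDD lineage across boosting iterations, one workaround that may be worth trying is to configure checkpointing so intermediate RDDs can be truncated. This is only a sketch, not a verified fix for SPARK-10433; the checkpoint path is hypothetical, and it assumes a live SparkContext `sc`:

```scala
// Sketch: give Spark a reliable checkpoint directory so long lineage
// chains built up across boosting iterations can be cut periodically.
sc.setCheckpointDir("hdfs:///tmp/gbt-checkpoints")  // hypothetical path

val boostingStrategy = BoostingStrategy.defaultParams("Classification")
boostingStrategy.setNumIterations(30)
// Strategy exposes a checkpointInterval bean property used by the
// tree-training internals; a small value trades checkpoint I/O for
// shorter lineage chains per iteration.
boostingStrategy.treeStrategy.setCheckpointInterval(5)

val model = GradientBoostedTrees.train(instances, boostingStrategy)
```

If the reported "Input Size / Records" stops growing linearly once checkpointing is on, that would point at lineage accumulation rather than a genuine data-size bug.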
Re: Input size increasing every iteration of gradient boosted trees [1.4]
Confirmed, having the same issue (1.4.1 mllib package). For a smaller dataset, accuracy degraded as well. Haven't tested yet in 1.5 with the ml package implementation.

val boostingStrategy = BoostingStrategy.defaultParams("Classification")
boostingStrategy.setNumIterations(30)
boostingStrategy.setLearningRate(1.0)
boostingStrategy.treeStrategy.setMaxDepth(3)
boostingStrategy.treeStrategy.setMaxBins(128)
boostingStrategy.treeStrategy.setSubsamplingRate(1.0)
boostingStrategy.treeStrategy.setMinInstancesPerNode(1)
boostingStrategy.treeStrategy.setUseNodeIdCache(true)
boostingStrategy.treeStrategy.setCategoricalFeaturesInfo(
  mapAsJavaMap(categoricalFeatures).asInstanceOf[java.util.Map[java.lang.Integer, java.lang.Integer]])

val model = GradientBoostedTrees.train(instances, boostingStrategy)

Thanks,
Peter Rudenko

On 2015-08-14 00:33, Sean Owen wrote:
> Not that I have any answer at this point, but I was discussing this
> exact same problem with Johannes today. An input size of ~20K records
> was growing each iteration by ~15M records. I could not see why on a
> first look.
>
> @jkbradley I know it's not much info but does that ring any bells? I
> think Johannes even has an instance of this up and running for
> examination.
>
> On Thu, Aug 13, 2015 at 10:04 PM, Matt Forbes wrote:
>> I am training a boosted trees model on a couple million input samples
>> (with around 300 features) and am noticing that the input size of each
>> stage is increasing each iteration. For each new tree, the first step
>> seems to be building the decision tree metadata, which does a .count()
>> on the input data, so this is the step I've been using to track the
>> input size changing. Here is what I'm seeing:
>>
>> count at DecisionTreeMetadata.scala:111
>> 1. Input Size / Records: 726.1 MB / 1295620
>> 2. Input Size / Records: 106.9 GB / 64780816
>> 3. Input Size / Records: 160.3 GB / 97171224
>> 4. Input Size / Records: 214.8 GB / 129680959
>> 5. Input Size / Records: 268.5 GB / 162533424
>>
>> Input Size / Records: 1912.6 GB / 1382017686
>>
>> This step goes from taking less than 10s up to 5 minutes by the 15th
>> or so iteration. I'm not quite sure what could be causing this. I am
>> passing a memory-only cached RDD[LabeledPoint] to
>> GradientBoostedTrees.train.
>>
>> Does anybody have some insight? Is this a bug or could it be an error
>> on my part?
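One detail in the configuration above: setUseNodeIdCache(true) maintains an extra RDD of node indices that is updated every iteration, and that chain can only be truncated if checkpointing is configured. A minimal sketch of pairing the two (the checkpoint path is hypothetical, and this is not a confirmed cause of the growth):

```scala
// Sketch: the node-id cache produces a new RDD per iteration that chains
// onto the previous one; without a checkpoint dir its lineage is unbounded.
// Assumes a live SparkContext `sc`; the directory below is hypothetical.
sc.setCheckpointDir("hdfs:///tmp/gbt-checkpoints")

boostingStrategy.treeStrategy.setUseNodeIdCache(true)
// checkpointInterval controls how often the node-id cache RDD is
// checkpointed; smaller values mean more I/O but shorter lineage chains.
boostingStrategy.treeStrategy.setCheckpointInterval(5)
```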
Re: Input size increasing every iteration of gradient boosted trees [1.4]
Is this an artifact of a recent change? Does this not show up in any of the tests or benchmarks?

On Thu, Aug 13, 2015 at 2:33 PM, Sean Owen wrote:
> Not that I have any answer at this point, but I was discussing this
> exact same problem with Johannes today. An input size of ~20K records
> was growing each iteration by ~15M records. I could not see why on a
> first look.
>
> @jkbradley I know it's not much info but does that ring any bells? I
> think Johannes even has an instance of this up and running for
> examination.
>
> On Thu, Aug 13, 2015 at 10:04 PM, Matt Forbes wrote:
>> I am training a boosted trees model on a couple million input samples
>> (with around 300 features) and am noticing that the input size of each
>> stage is increasing each iteration. For each new tree, the first step
>> seems to be building the decision tree metadata, which does a .count()
>> on the input data, so this is the step I've been using to track the
>> input size changing. Here is what I'm seeing:
>>
>> count at DecisionTreeMetadata.scala:111
>> 1. Input Size / Records: 726.1 MB / 1295620
>> 2. Input Size / Records: 106.9 GB / 64780816
>> 3. Input Size / Records: 160.3 GB / 97171224
>> 4. Input Size / Records: 214.8 GB / 129680959
>> 5. Input Size / Records: 268.5 GB / 162533424
>>
>> Input Size / Records: 1912.6 GB / 1382017686
>>
>> This step goes from taking less than 10s up to 5 minutes by the 15th
>> or so iteration. I'm not quite sure what could be causing this. I am
>> passing a memory-only cached RDD[LabeledPoint] to
>> GradientBoostedTrees.train.
>>
>> Does anybody have some insight? Is this a bug or could it be an error
>> on my part?
Re: Input size increasing every iteration of gradient boosted trees [1.4]
Not that I have any answer at this point, but I was discussing this exact same problem with Johannes today. An input size of ~20K records was growing each iteration by ~15M records. I could not see why on a first look.

@jkbradley I know it's not much info but does that ring any bells? I think Johannes even has an instance of this up and running for examination.

On Thu, Aug 13, 2015 at 10:04 PM, Matt Forbes wrote:
> I am training a boosted trees model on a couple million input samples
> (with around 300 features) and am noticing that the input size of each
> stage is increasing each iteration. For each new tree, the first step
> seems to be building the decision tree metadata, which does a .count()
> on the input data, so this is the step I've been using to track the
> input size changing. Here is what I'm seeing:
>
> count at DecisionTreeMetadata.scala:111
> 1. Input Size / Records: 726.1 MB / 1295620
> 2. Input Size / Records: 106.9 GB / 64780816
> 3. Input Size / Records: 160.3 GB / 97171224
> 4. Input Size / Records: 214.8 GB / 129680959
> 5. Input Size / Records: 268.5 GB / 162533424
>
> Input Size / Records: 1912.6 GB / 1382017686
>
> This step goes from taking less than 10s up to 5 minutes by the 15th
> or so iteration. I'm not quite sure what could be causing this. I am
> passing a memory-only cached RDD[LabeledPoint] to
> GradientBoostedTrees.train.
>
> Does anybody have some insight? Is this a bug or could it be an error
> on my part?
Input size increasing every iteration of gradient boosted trees [1.4]
I am training a boosted trees model on a couple million input samples (with around 300 features) and am noticing that the input size of each stage is increasing each iteration. For each new tree, the first step seems to be building the decision tree metadata, which does a .count() on the input data, so this is the step I've been using to track the input size changing. Here is what I'm seeing:

count at DecisionTreeMetadata.scala:111
1. Input Size / Records: 726.1 MB / 1295620
2. Input Size / Records: 106.9 GB / 64780816
3. Input Size / Records: 160.3 GB / 97171224
4. Input Size / Records: 214.8 GB / 129680959
5. Input Size / Records: 268.5 GB / 162533424

Input Size / Records: 1912.6 GB / 1382017686

This step goes from taking less than 10s up to 5 minutes by the 15th or so iteration. I'm not quite sure what could be causing this. I am passing a memory-only cached RDD[LabeledPoint] to GradientBoostedTrees.train.

Does anybody have some insight? Is this a bug or could it be an error on my part?
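One way to narrow this down is to look at the lineage of the RDDs involved, and to rule out cache eviction, since a memory-only cache silently recomputes evicted partitions through the full lineage, which shows up as extra input. A diagnostic sketch only, assuming the names from the setup above (`rawInstances` and `boostingStrategy` are placeholders):

```scala
import org.apache.spark.storage.StorageLevel

// Persist with a disk fallback so evicted partitions are reloaded from
// disk rather than recomputed through the full lineage (which is what
// MEMORY_ONLY does under memory pressure).
val instances = rawInstances.persist(StorageLevel.MEMORY_AND_DISK)

// Print the input's DAG before training. If GBT's internal per-iteration
// RDDs keep chaining onto earlier iterations' results, the stages shown
// in the web UI will read more and more ancestors each round, matching
// the growing "Input Size / Records" numbers.
println(instances.toDebugString)

val model = GradientBoostedTrees.train(instances, boostingStrategy)
```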