Re: SparkML RandomForest java.lang.StackOverflowError

2016-04-01 Thread Joseph Bradley
Can you try reducing maxBins?  That reduces communication (at the cost of
coarser discretization of continuous features).
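For reference, a minimal sketch of what that change might look like, reusing the Strategy construction from the original post later in this thread (the value of 8 for maxBins is only an illustration, not a recommendation):

import org.apache.spark.mllib.tree.configuration.Algo;
import org.apache.spark.mllib.tree.configuration.QuantileStrategy;
import org.apache.spark.mllib.tree.configuration.Strategy;
import org.apache.spark.mllib.tree.impurity.Variance;

// Same Strategy as in the original post, but with maxBins lowered from 16 to 8.
// Fewer bins means coarser splits on continuous features, but smaller
// aggregation statistics to communicate per tree level.
Strategy strategy = new Strategy(Algo.Regression(), Variance.instance(),
        30,                                          // maxDepth
        0,                                           // numClasses (unused for regression)
        8,                                           // maxBins, reduced
        QuantileStrategy.Sort(),
        new scala.collection.immutable.HashMap<>(),  // no categorical features
        5,                                           // minInstancesPerNode
        0.0,                                         // minInfoGain
        256,                                         // maxMemoryInMB
        1.0,                                         // subsamplingRate
        true,                                        // useNodeIdCache
        10);                                         // checkpointInterval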

Re: SparkML RandomForest java.lang.StackOverflowError

2016-04-01 Thread Joseph Bradley
In my experience, 20K is a lot but often doable; 2K is easy; 200 is small.
Communication scales linearly in the number of features.

Re: SparkML RandomForest java.lang.StackOverflowError

2016-03-31 Thread Eugene Morozov
Joseph,

Correction: there are 20k features. Is that still a lot?
What number of features would be considered normal?

--
Be well!
Jean Morozov

Re: SparkML RandomForest java.lang.StackOverflowError

2016-03-30 Thread Eugene Morozov
One more thing.

With the increased stack size, training has completed two more times, but now
I see the following in the log:

[dispatcher-event-loop-1] WARN  o.a.spark.scheduler.TaskSetManager - Stage
24860 contains a task of very large size (157 KB). The maximum recommended
task size is 100 KB.

The task size increases over time; when the warning first appeared, it was
around 100 KB.

The time to complete collectAsMap at DecisionTree.scala:651 has also increased,
from 8 seconds at the beginning of training to 20-24 seconds now.
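
One possible explanation for the growing task size (an assumption on my side,
not something confirmed in this thread): with useNodeIdCache and
checkpointInterval set, the node-ID cache only checkpoints when the
SparkContext has a checkpoint directory configured; without one, the cached
RDD's lineage keeps growing, which would also fit the increasing
deserialization times. A minimal sketch of setting it:

// sc is the JavaSparkContext the web service already holds (assumption);
// the HDFS path below is hypothetical. With a checkpoint directory in place,
// the periodic node-ID-cache checkpointing can actually truncate the lineage.
sc.setCheckpointDir("hdfs:///tmp/spark-rf-checkpoints");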

--
Be well!
Jean Morozov

Re: SparkML RandomForest java.lang.StackOverflowError

2016-03-29 Thread Eugene Morozov
Joseph,

I'm using 1.6.0.

--
Be well!
Jean Morozov

Re: SparkML RandomForest java.lang.StackOverflowError

2016-03-29 Thread Joseph Bradley
First thought: 70K features is *a lot* for the MLlib implementation (and
any PLANET-like implementation)

Using fewer partitions is a good idea.

Which Spark version was this on?
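
For what it's worth, a minimal sketch of the "fewer partitions" suggestion,
assuming labeledPoints, strategy, numTrees and RANDOM_SEED are the ones from
the original post (the target of 60 partitions is only an illustration,
roughly a few times the 20 available cores):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.tree.RandomForest;
import org.apache.spark.mllib.tree.model.RandomForestModel;

// Coalesce the cached training data into fewer, larger partitions before
// training; this cuts per-task overhead at the cost of less parallelism.
JavaRDD<LabeledPoint> coalesced = labeledPoints.coalesce(60);
RandomForestModel model = RandomForest.trainRegressor(
        coalesced.rdd(), strategy, numTrees, "auto", RANDOM_SEED);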

Re: SparkML RandomForest java.lang.StackOverflowError

2016-03-29 Thread Eugene Morozov
The questions I have in mind:

Is this something one might expect? From the stack trace itself it's not
clear where it comes from.
Is it an already known bug? I haven't found anything like it.
Is it possible to configure something to work around / avoid this?

I'm not sure it's the right thing to do, but I've:
- increased the thread stack size 10x (to 80 MB);
- reduced the default parallelism 10x (only 20 cores are available).

Thank you in advance.

--
Be well!
Jean Morozov

SparkML RandomForest java.lang.StackOverflowError

2016-03-29 Thread Eugene Morozov
Hi,

I have a web service that provides a REST API to train a random forest.
I train the random forest on a 5-node Spark cluster with enough memory;
everything is cached (~22 GB).
On small datasets of up to 100k samples everything is fine, but with the
biggest one (400k samples and ~70k features) I'm stuck with a
StackOverflowError.

Additional options for my web service:
spark.executor.extraJavaOptions="-XX:ThreadStackSize=8192"
spark.default.parallelism = 200.
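
For context, a minimal sketch of how these two options might be applied where
the service builds its SparkContext (the app name and master URL are
hypothetical; only the two settings above are taken from this thread):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf()
        .setAppName("rf-training-service")           // hypothetical
        .setMaster("spark://master:7077")            // hypothetical
        // ThreadStackSize is given in KB, so 8192 = 8 MB per executor thread.
        .set("spark.executor.extraJavaOptions", "-XX:ThreadStackSize=8192")
        .set("spark.default.parallelism", "200");
JavaSparkContext sc = new JavaSparkContext(conf);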

On the 400k-sample dataset:
- with the default thread stack size, it took 4 hours of training to hit the
error;
- with the increased stack size, it took 60 hours to hit it.
I can increase it further, but it's hard to say how much memory it needs, and
the setting applies to all threads, so it might waste a lot of memory.

I'm looking at different stages in the event timeline now and see that task
deserialization time gradually increases. Towards the end, task
deserialization time is roughly the same as executor computing time.

The code I use to train the model:

import org.apache.spark.mllib.tree.RandomForest;
import org.apache.spark.mllib.tree.configuration.Algo;
import org.apache.spark.mllib.tree.configuration.QuantileStrategy;
import org.apache.spark.mllib.tree.configuration.Strategy;
import org.apache.spark.mllib.tree.impurity.Variance;
import org.apache.spark.mllib.tree.model.RandomForestModel;

int MAX_BINS = 16;                 // bins for discretizing continuous features
int NUM_CLASSES = 0;               // unused for regression
double MIN_INFO_GAIN = 0.0;
int MAX_MEMORY_IN_MB = 256;
double SUBSAMPLING_RATE = 1.0;
boolean USE_NODEID_CACHE = true;
int CHECKPOINT_INTERVAL = 10;
int RANDOM_SEED = 12345;

int NODE_SIZE = 5;                 // minimum number of instances per node
int maxDepth = 30;
int numTrees = 50;

// labeledPoints is the JavaRDD<LabeledPoint> of training data prepared elsewhere.
Strategy strategy = new Strategy(Algo.Regression(), Variance.instance(),
        maxDepth, NUM_CLASSES, MAX_BINS, QuantileStrategy.Sort(),
        new scala.collection.immutable.HashMap<>(),  // no categorical features
        NODE_SIZE, MIN_INFO_GAIN, MAX_MEMORY_IN_MB, SUBSAMPLING_RATE,
        USE_NODEID_CACHE, CHECKPOINT_INTERVAL);
RandomForestModel model = RandomForest.trainRegressor(
        labeledPoints.rdd(), strategy, numTrees, "auto", RANDOM_SEED);


Any advice would be highly appreciated.

The exception (~3000 lines long):
 java.lang.StackOverflowError
at
java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2320)
at
java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2333)
at
java.io.ObjectInputStream$BlockDataInputStream.readInt(ObjectInputStream.java:2828)
at java.io.ObjectInputStream.readHandle(ObjectInputStream.java:1453)
at
java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1512)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
at
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
at
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
at
scala.collection.immutable.$colon$colon.readObject(List.scala:366)
at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at
java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
at
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1900)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
at
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
at
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
at
scala.collection.immutable.$colon$colon.readObject(List.scala:362)
at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at
java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
at
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1900)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
at
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)

--
Be well!
Jean Morozov