In my experience, 20K is a lot but often doable; 2K is easy; 200 is small. Communication scales linearly in the number of features.
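
To make that concrete, here's a rough back-of-envelope sketch (the 3-stats-per-bin figure matches variance impurity; treat the node-grouping arithmetic as an approximation of the implementation, not the exact internals):

// Approximate size of the per-iteration aggregation of split statistics.
long numFeatures = 20_000L;
long numBins = 16L;          // MAX_BINS from the code below
long statsPerBin = 3L;       // count, sum, sum of squares (variance impurity)
long maxMemoryInMB = 256L;   // MAX_MEMORY_IN_MB from the code below
long bytesPerNode = numFeatures * numBins * statsPerBin * 8L;      // ~7.3 MB
long nodesPerGroup = (maxMemoryInMB * 1024 * 1024) / bytesPerNode; // ~34
System.out.printf("~%d MB aggregated per iteration%n",
    bytesPerNode * nodesPerGroup / (1024 * 1024));                 // ~250 MB

At 70K features the same arithmetic gives ~26 MB per node, so each pass still moves a couple hundred MB but can only grow a handful of nodes at a time, which is why training slows down so much.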
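On the StackOverflowError itself: the trace below is a scala List being deserialized recursively (note the repeating $colon$colon.readObject frames), which usually points at a very long RDD lineage rather than at the data itself; that would also explain the steadily growing task deserialization time. Here's a sketch combining the two workarounds already tried in this thread with one more: the USE_NODEID_CACHE / CHECKPOINT_INTERVAL settings in the code below only take effect once a checkpoint directory is set. The values and the path are placeholders, not recommendations:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf()
    .setAppName("rf-training") // placeholder name
    // Bigger executor thread stacks; units are KB, so 81920 = 80 MB,
    // the 10x increase mentioned below:
    .set("spark.executor.extraJavaOptions", "-XX:ThreadStackSize=81920")
    // Fewer partitions, matching the 20 available cores:
    .set("spark.default.parallelism", "20");
JavaSparkContext sc = new JavaSparkContext(conf);

// Checkpointing truncates the lineage that otherwise has to be
// deserialized recursively; without a checkpoint dir it never happens.
sc.setCheckpointDir("hdfs:///tmp/rf-checkpoints"); // placeholder path

Note the stack-size increase only postpones the error (as the 4h -> 60h observation below suggests); truncating the lineage addresses the likely cause.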
On Thu, Mar 31, 2016 at 6:12 AM, Eugene Morozov <evgeny.a.moro...@gmail.com> wrote:

> Joseph,
>
> Correction: there are 20k features. Is it still a lot?
> What number of features would be considered normal?
>
> --
> Be well!
> Jean Morozov
>
> On Tue, Mar 29, 2016 at 10:09 PM, Joseph Bradley <jos...@databricks.com> wrote:
>
>> First thought: 70K features is *a lot* for the MLlib implementation (and
>> any PLANET-like implementation).
>>
>> Using fewer partitions is a good idea.
>>
>> Which Spark version was this on?
>>
>> On Tue, Mar 29, 2016 at 5:21 AM, Eugene Morozov <evgeny.a.moro...@gmail.com> wrote:
>>
>>> The questions I have in mind:
>>>
>>> - Is it something one might expect? From the stack trace itself it's
>>>   not clear where it comes from.
>>> - Is it an already known bug? I haven't found anything like it, though.
>>> - Is it possible to configure something to work around / avoid this?
>>>
>>> I'm not sure it's the right thing to do, but I've
>>> - increased the thread stack size 10 times (to 80MB);
>>> - reduced default parallelism 10 times (only 20 cores are available).
>>>
>>> Thank you in advance.
>>>
>>> --
>>> Be well!
>>> Jean Morozov
>>>
>>> On Tue, Mar 29, 2016 at 1:12 PM, Eugene Morozov <evgeny.a.moro...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have a web service that provides a REST API to train the random
>>>> forest algorithm.
>>>> I train the random forest on a 5-node Spark cluster with enough memory;
>>>> everything is cached (~22 GB).
>>>> On small datasets of up to 100k samples everything is fine, but with
>>>> the biggest one (400k samples and ~70k features) I'm stuck with a
>>>> StackOverflowError.
>>>>
>>>> Additional options for my web service:
>>>> spark.executor.extraJavaOptions="-XX:ThreadStackSize=8192"
>>>> spark.default.parallelism = 200
>>>>
>>>> On the 400k-sample dataset:
>>>> - with the default thread stack size, it took 4 hours of training to
>>>>   hit the error;
>>>> - with the increased stack size, it took 60 hours to hit it.
>>>> I can increase it further, but it's hard to say how much memory it
>>>> needs, and since it applies to all threads it might waste a lot of
>>>> memory.
>>>>
>>>> I'm looking at different stages in the event timeline now and see that
>>>> task deserialization time gradually increases. By the end, task
>>>> deserialization time is roughly the same as executor computing time.
>>>>
>>>> Code I use to train the model:
>>>>
>>>> int MAX_BINS = 16;
>>>> int NUM_CLASSES = 0;
>>>> double MIN_INFO_GAIN = 0.0;
>>>> int MAX_MEMORY_IN_MB = 256;
>>>> double SUBSAMPLING_RATE = 1.0;
>>>> boolean USE_NODEID_CACHE = true;
>>>> int CHECKPOINT_INTERVAL = 10;
>>>> int RANDOM_SEED = 12345;
>>>>
>>>> int NODE_SIZE = 5;
>>>> int maxDepth = 30;
>>>> int numTrees = 50;
>>>>
>>>> Strategy strategy = new Strategy(Algo.Regression(), Variance.instance(),
>>>>         maxDepth, NUM_CLASSES, MAX_BINS, QuantileStrategy.Sort(),
>>>>         new scala.collection.immutable.HashMap<>(), NODE_SIZE,
>>>>         MIN_INFO_GAIN, MAX_MEMORY_IN_MB, SUBSAMPLING_RATE,
>>>>         USE_NODEID_CACHE, CHECKPOINT_INTERVAL);
>>>> RandomForestModel model = RandomForest.trainRegressor(
>>>>         labeledPoints.rdd(), strategy, numTrees, "auto", RANDOM_SEED);
>>>>
>>>> Any advice would be highly appreciated.
>>>>
>>>> The exception (~3000 lines long):
>>>> java.lang.StackOverflowError
>>>>         at java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2320)
>>>>         at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2333)
>>>>         at java.io.ObjectInputStream$BlockDataInputStream.readInt(ObjectInputStream.java:2828)
>>>>         at java.io.ObjectInputStream.readHandle(ObjectInputStream.java:1453)
>>>>         at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1512)
>>>>         at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
>>>>         at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>>>>         at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
>>>>         at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
>>>>         at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>>>>         at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>>>>         at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>>>>         at scala.collection.immutable.$colon$colon.readObject(List.scala:366)
>>>>         at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
>>>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>         at java.lang.reflect.Method.invoke(Method.java:497)
>>>>         at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
>>>>         at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1900)
>>>>         at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>>>>         at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>>>>         at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
>>>>         at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
>>>>         at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>>>>         at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>>>>         at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
>>>>         at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
>>>>         at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>>>>         at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>>>>         at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>>>>         at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
>>>>         at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
>>>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>         at java.lang.reflect.Method.invoke(Method.java:497)
>>>>         at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
>>>>         at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1900)
>>>>         at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>>>>         at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>>>>         at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
>>>>         at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
>>>>
>>>> --
>>>> Be well!
>>>> Jean Morozov