Re: why decision trees do binary split?

2014-11-06 Thread Carlos J. Gil Bellosta
Hello,

There is a compelling reason for binary splits in trees in general: a
split is made if the difference between the two resulting branches is
significant. You also want to compare the significance of a candidate
split against that of all the other candidate splits. There are many
statistical tests for comparing two groups, and you can even generate
something like p-values that, according to some, let you compare
different candidate splits.
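
To make that concrete, here is a small R sketch (purely illustrative,
not what any particular tree implementation does) that scores two
candidate binary splits of the same data with a chi-squared test on
the target and keeps the one with the smaller p-value:

    # score a candidate binary split by how strongly its two branches
    # differ in the distribution of a categorical target y
    score_split <- function(in_left, y) {
      chisq.test(table(in_left, y))$p.value  # smaller = stronger evidence
    }

    set.seed(1)
    x1 <- runif(500)
    x2 <- runif(500)
    y  <- factor(ifelse(x1 + rnorm(500, sd = 0.3) > 0.6, "A", "B"))

    p1 <- score_split(x1 > 0.6, y)  # candidate split on x1
    p2 <- score_split(x2 > 0.5, y)  # candidate split on x2
    if (p1 < p2) "split on x1 > 0.6" else "split on x2 > 0.5"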

If you introduce multibranch splits, things become much messier.

Also, mind that breaking a categorical variable into as many groups as
it has levels would in some cases separate subgroups that are not that
different from each other. Successive binary splits can potentially
give you the required homogeneous subsets while keeping similar levels
together, as sketched below.
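
For instance, you can enumerate the two-group partitions of the levels
of a categorical predictor and pick the binary split with the lowest
size-weighted Gini impurity. A rough R sketch, brute force and purely
illustrative (this is not how Spark MLlib selects categorical splits):

    # Gini impurity of a vector of class labels
    gini <- function(y) {
      p <- prop.table(table(y))
      1 - sum(p^2)
    }

    # best binary grouping of the levels of a categorical predictor x
    # with respect to a target y, by brute force over subsets of levels
    best_binary_split <- function(x, y) {
      levs <- levels(x)
      best <- list(score = Inf, left = NULL)
      for (k in 1:(2^(length(levs) - 1) - 1)) {
        left_levs <- levs[as.logical(intToBits(k))[seq_along(levs)]]
        in_left <- x %in% left_levs
        if (!any(in_left) || all(in_left)) next  # skip degenerate splits
        # size-weighted impurity of the two children
        score <- mean(in_left) * gini(y[in_left]) +
                 mean(!in_left) * gini(y[!in_left])
        if (score < best$score) best <- list(score = score, left = left_levs)
      }
      best
    }

    set.seed(1)
    x <- factor(sample(c("a", "b", "c", "d"), 300, replace = TRUE))
    y <- factor(ifelse(x %in% c("a", "b"), "yes", "no"))
    best_binary_split(x, y)  # groups {"a", "b"} against {"c", "d"}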

Best,

Carlos J. Gil Bellosta
http://www.datanalytics.com



2014-11-06 10:46 GMT+01:00 Sean Owen so...@cloudera.com:
 I haven't seen that done before, which may be most of the reason - I am not
 sure that is common practice.

 I can see upsides: you need not pick candidate splits to test, since there
 is only one N-way rule possible. The binary-split equivalent needs N levels
 of splits instead of one.

 The big problem is that you are always segregating the data set entirely,
 making the equivalent of those N binary rules even when you would not
 otherwise bother, because they don't add information about the target. The
 subsets matching each child are therefore unnecessarily small, which makes
 learning on each independent subset weaker.

 On Nov 6, 2014 9:36 AM, jamborta jambo...@gmail.com wrote:

 I meant above that, in the case of categorical variables, it might be more
 efficient to create a node for each categorical value. Is there a reason why
 Spark went down the binary route?

 thanks,






Re: SparkR: split, apply, combine strategy for dataframes?

2014-08-15 Thread Carlos J. Gil Bellosta
Thanks for your reply.

I think the problem was that SparkR tried to serialize the whole
environment. Note that the large dataframe was part of it, so every
worker received its slice / partition (which is very small) plus the
whole thing!

So I deleted the large dataframe and the list before parallelizing,
and the cluster ran without memory issues.
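
For reference, a rough sketch of what this looks like (not the exact
code; parallelize, lapply and collect are from the old SparkR RDD API,
so check the names against your version, and df and foo stand in for
the large dataframe and the per-group function):

    library(SparkR)
    sc <- sparkR.init("spark://...")   # hypothetical master URL

    # df is the large dataframe, foo the per-group function (placeholders)
    # 1) split the big dataframe into small per-id pieces
    pieces <- split(df, df$id)

    # 2) drop anything large that the mapped function does not need,
    #    so it is not dragged into the serialized environment
    rm(df)
    gc()

    # 3) parallelize only the small pieces, with a few thousand slices,
    #    and apply the per-group function to each piece
    rdd <- parallelize(sc, pieces, 2000)
    results <- collect(lapply(rdd, function(piece) foo(piece)))

    # 4) recombine
    out <- do.call(rbind, results)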

Best,

Carlos J. Gil Bellosta
http://www.datanalytics.com

2014-08-15 3:53 GMT+02:00 Shivaram Venkataraman shiva...@eecs.berkeley.edu:
 Could you try increasing the number of slices with the large data set?
 SparkR assumes that each slice (or partition, in Spark terminology) can fit
 in the memory of a single machine. Also, is the error happening when you do
 the map function, or does it happen when you combine the results?

 Thanks
 Shivaram


 On Thu, Aug 14, 2014 at 3:53 PM, Carlos J. Gil Bellosta
 gilbello...@gmail.com wrote:

 Hello,

 I am having problems trying to apply the split-apply-combine strategy
 for dataframes using SparkR.

 I have a largish dataframe and I would like to achieve something similar
 to what

 ddply(df, .(id), foo)

 would do, only using SparkR as the computing engine. My df has a few
 million records and I would like to split it by id and operate on the
 pieces. These pieces are quite small: just a few hundred records each.

 I do something along the following lines:

 1) Use split to transform df into a list of dfs.
 2) parallelize the resulting list as an RDD (using a few thousand slices)
 3) map my function on the pieces using Spark.
 4) recombine the results (do.call, rbind, etc.)

 My cluster works and I can run medium-sized batch jobs.

 However, it fails with my full df: I get a heap-space out-of-memory
 error, which is odd, as the individual slices are very small.

 Should I send smaller batches to my cluster? Is there a recommended
 general approach to this kind of split-apply-combine problem?

 Best,

 Carlos J. Gil Bellosta
 http://www.datanalytics.com
