[GitHub] spark issue #19666: [SPARK-22451][ML] Reduce decision tree aggregate size fo...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19666 OK. I will wait for @smurching's split-out parts of #19433 to get merged first, and then I will update this PR. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user smurching commented on the issue: https://github.com/apache/spark/pull/19666 After discussing with @jkbradley, I'll split up #19433 so that the parts of it that could conflict with this PR (refactoring RandomForest.scala into utility classes) can be merged first.
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/19666 By the way, this is going to conflict with https://github.com/apache/spark/pull/19433 a lot. @WeichenXu123 and @smurching, have you planned which one to merge first?
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/19666 Thank you, @WeichenXu123. You could also use the condition "include the first bin" to filter the left splits; perhaps that is simpler.
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19666 @facaiy Your idea also looks reasonable. So we can use the condition "exclude the first bin" to do the pruning (filtering out the symmetric other half of the splits). This condition looks simpler than `1 <= combNumber <= numSplits`. Good idea! Your code uses a different traversal order; my current PR is also backtracking, just with another traversal order. I think both of them work, and both have `O(2^n)` complexity.
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/19666 In fact, I'm not sure whether the idea is right, so don't hesitate to correct me. I assume the algorithm requires O(2^N) complexity.
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/19666 Hi, I wrote a demo in Python; I'll be happy if it's useful. For N bins, say `[x_1, x_2, ..., x_N]`, every split either contains `x_1` or not, so we can choose the half of the splits that don't contain `x_1` as the left splits. If I understand correctly, the left splits are then exactly the non-empty combinations of the remaining bins, `[x_2, x_3, ..., x_N]`. The problem can be solved by the [backtracking algorithm](https://en.wikipedia.org/wiki/Backtracking). Please correct me if I'm wrong. Thanks very much.

```python
#!/usr/bin/env python


def gen_splits(bins):
    if len(bins) == 1:
        return bins
    results = []
    partial_res = []
    gen_splits_iter(1, bins, partial_res, results)
    return results


def gen_splits_iter(dep, bins, partial_res, results):
    if partial_res:
        left_splits = partial_res[:]
        right_splits = [x for x in bins if x not in left_splits]
        results.append("left: {:20}, right: {}".format(str(left_splits), right_splits))
    for m in range(dep, len(bins)):
        partial_res.append(bins[m])
        gen_splits_iter(m + 1, bins, partial_res, results)
        partial_res.pop()


if __name__ == "__main__":
    print("first example:")
    bins = ["a", "b", "c"]
    print("bins: {}\n-".format(bins))
    splits = gen_splits(bins)
    for s in splits:
        print(s)
    print("\n\n=")
    print("second example:")
    bins = ["a", "b", "c", "d", "e"]
    print("bins: {}\n-".format(bins))
    splits = gen_splits(bins)
    for s in splits:
        print(s)
```

logs:

```bash
~/Downloads ❯❯❯ python test.py
first example:
bins: ['a', 'b', 'c']
-
left: ['b']               , right: ['a', 'c']
left: ['b', 'c']          , right: ['a']
left: ['c']               , right: ['a', 'b']


=
second example:
bins: ['a', 'b', 'c', 'd', 'e']
-
left: ['b']               , right: ['a', 'c', 'd', 'e']
left: ['b', 'c']          , right: ['a', 'd', 'e']
left: ['b', 'c', 'd']     , right: ['a', 'e']
left: ['b', 'c', 'd', 'e'], right: ['a']
left: ['b', 'c', 'e']     , right: ['a', 'd']
left: ['b', 'd']          , right: ['a', 'c', 'e']
left: ['b', 'd', 'e']     , right: ['a', 'c']
left: ['b', 'e']          , right: ['a', 'c', 'd']
left: ['c']               , right: ['a', 'b', 'd', 'e']
left: ['c', 'd']          , right: ['a', 'b', 'e']
left: ['c', 'd', 'e']     , right: ['a', 'b']
left: ['c', 'e']          , right: ['a', 'b', 'd']
left: ['d']               , right: ['a', 'b', 'c', 'e']
left: ['d', 'e']          , right: ['a', 'b', 'c']
left: ['e']               , right: ['a', 'b', 'c', 'd']
```
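[Editor's note] As a sanity check on the demo above, the number of splits it prints can be computed directly: each subset of the N bins is a candidate left side, the empty and full subsets are excluded, and symmetry halves the count. A minimal sketch (the helper name is hypothetical):

```python
def num_unordered_splits(num_bins):
    # 2**num_bins subsets, minus the empty set and the full set,
    # halved because (left, right) and (right, left) are the same split.
    return (2 ** num_bins - 2) // 2  # == 2**(num_bins - 1) - 1


print(num_unordered_splits(3))  # 3, matching the first example's output
print(num_unordered_splits(5))  # 15, matching the second example's output
```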
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19666 Also cc @smurching. Thanks!
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19666 @facaiy Thanks for your review! I added more explanation of the design purpose of `traverseUnorderedSplits`. But if you have a better solution, don't hesitate to tell me!
Github user facaiy commented on the issue: https://github.com/apache/spark/pull/19666 I believe unordered features will benefit a lot from this idea; however, I have two questions:

1. I'm a little confused by line 964 in `traverseUnorderedSplits`. Is it a backtracking algorithm?

```scala
dfs(binIndex + 1, combNumber, stats)
```

2. `traverseUnorderedSplits` succeeds in decoupling the abstraction from its implementation. Cool! However, I wonder whether we could write a simpler function, say, removing `seqOp`, `finalizer`, and `T` and collecting all the logic in one place?

Anyway, thanks for your good work, @WeichenXu123.
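[Editor's note] For readers following the question above: a call of the shape `dfs(binIndex + 1, combNumber, stats)` is the "skip this bin" branch of a two-branch backtracking traversal. A minimal Python sketch of that shape (hypothetical names, not the PR's actual Scala code):

```python
def enumerate_left_splits(bins):
    """Backtracking sketch: at each bin index, either leave the bin on the
    right (the dfs(binIndex + 1, ...) branch) or move it to the left split.
    Bin 0 always stays on the right, which prunes the symmetric half."""
    results = []

    def dfs(i, left):
        if i == len(bins):
            if left:                  # skip the empty left split
                results.append(list(left))
            return
        dfs(i + 1, left)              # branch 1: bin i goes right
        left.append(bins[i])
        dfs(i + 1, left)              # branch 2: bin i goes left
        left.pop()

    dfs(1, [])                        # start at index 1: bin 0 is fixed right
    return results
```

For three bins this yields 2^(3-1) - 1 = 3 left splits, and fixing bin 0 on the right is what makes each (left, right) pair appear exactly once.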
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19666 @smurching I guess iterating over Gray codes will have a higher time complexity, O(n * 2^n) (not very sure; maybe there are more efficient algorithms?), while the recursive traversal in my PR only needs O(2^n). Also, recursive traversal has an advantage in pruning: with the pruning condition I mentioned above, `1 <= combNumber <= numSplits`, recursive traversal can prune the whole subtree and save about half the time. Can iterating over Gray codes also do this?
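[Editor's note] For context on the alternative being discussed: a standard Gray-code enumeration visits all bitmasks so that consecutive masks differ in exactly one bit, which lets per-split statistics be updated incrementally. A minimal sketch (hypothetical, not code from either PR):

```python
def gray_code_masks(num_bits):
    """Yield (mask, flipped_bit) pairs: mask runs over all non-zero bitmasks
    of num_bits bits in Gray-code order, and consecutive masks differ only in
    flipped_bit, so aggregate stats can be updated in O(1) per step."""
    prev = 0
    for i in range(1, 1 << num_bits):
        g = i ^ (i >> 1)                     # the i-th Gray code
        flipped = (g ^ prev).bit_length() - 1  # index of the single changed bit
        yield g, flipped
        prev = g
```

Restricting the masks to the last `numBins - 1` bins reproduces the "exclude the first bin" pruning discussed above, so the symmetric half of the splits never has to be visited.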
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19666 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83481/ Test FAILed.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19666 Merged build finished. Test FAILed.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19666 **[Test build #83481 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83481/testReport)** for PR 19666 at commit [`e79abfd`](https://github.com/apache/spark/commit/e79abfd2a231ff29abc4be1723002542c85189df). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `test("Multiclass classification with unordered categorical features: split calculations")`
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19666 **[Test build #83481 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83481/testReport)** for PR 19666 at commit [`e79abfd`](https://github.com/apache/spark/commit/e79abfd2a231ff29abc4be1723002542c85189df).