[GitHub] spark issue #19666: [SPARK-22451][ML] Reduce decision tree aggregate size fo...

2017-11-13 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/19666
  
OK. I will wait for @smurching's split parts of #19433 to get merged
first, and then I will update this PR.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19666: [SPARK-22451][ML] Reduce decision tree aggregate size fo...

2017-11-13 Thread smurching
Github user smurching commented on the issue:

https://github.com/apache/spark/pull/19666
  
Discussed with @jkbradley, I'll split up #19433 so that the parts of it 
that'd potentially conflict with this PR (refactoring RandomForest.scala into 
utility classes) can be merged first.


---




[GitHub] spark issue #19666: [SPARK-22451][ML] Reduce decision tree aggregate size fo...

2017-11-13 Thread jkbradley
Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/19666
  
Btw, this is going to conflict a lot with
https://github.com/apache/spark/pull/19433. @WeichenXu123 and @smurching,
have you planned which one to merge first?


---




[GitHub] spark issue #19666: [SPARK-22451][ML] Reduce decision tree aggregate size fo...

2017-11-09 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/19666
  
Thank you, @WeichenXu123. You could also use the condition "includes the
first bin" to filter left splits. Perhaps that is better.


---




[GitHub] spark issue #19666: [SPARK-22451][ML] Reduce decision tree aggregate size fo...

2017-11-09 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/19666
  
@facaiy Your idea also looks reasonable. So we can use the condition
"exclude the first bin" to do the pruning (filter out the other half of the
symmetric splits). This condition looks simpler than `1 <= combNumber <= numSplits`.
Good idea!
Also, your code uses a different traversal order; my current PR is also
backtracking, just with another traversal order. I think both of them work,
and both of their complexities will be `O(2^n)`.
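For illustration, the "exclude the first bin" pruning can be sketched in Python with a direct enumeration (the names below are illustrative, not from the PR):

```python
from itertools import combinations

def unordered_splits(bins):
    # Left splits never contain bins[0]; this picks exactly one
    # representative from each symmetric (left, right) pair, so only
    # half of the 2^N - 2 nonempty proper subsets are enumerated.
    rest = bins[1:]
    for size in range(1, len(rest) + 1):
        for left in combinations(rest, size):
            right = tuple(b for b in bins if b not in left)
            yield left, right

splits = list(unordered_splits(["a", "b", "c"]))
# 3 distinct splits for 3 bins: 2^(3-1) - 1
```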


---




[GitHub] spark issue #19666: [SPARK-22451][ML] Reduce decision tree aggregate size fo...

2017-11-09 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/19666
  
In fact, I'm not sure whether the idea is right, so don't hesitate to correct
me. I assume the algorithm requires O(N^2) complexity.


---




[GitHub] spark issue #19666: [SPARK-22451][ML] Reduce decision tree aggregate size fo...

2017-11-09 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/19666
  
Hi, I wrote a demo in Python. I'll be happy if it's useful.

For N bins, say `[x_1, x_2, ..., x_N]`, since every split either contains
`x_1` or not, we can choose the half of the splits that don't contain `x_1` as
the left splits.
If I understand correctly, the left splits are then exactly all combinations
of the remaining bins, `[x_2, x_3, ..., x_N]`. The problem can be solved by the
[backtracking algorithm](https://en.wikipedia.org/wiki/Backtracking).

Please correct me if I'm wrong. Thanks very much.

```python
#!/usr/bin/env python

def gen_splits(bins):
    if len(bins) == 1:
        return []  # a single bin admits no split
    results = []
    partial_res = []
    gen_splits_iter(1, bins, partial_res, results)
    return results


def gen_splits_iter(dep, bins, partial_res, results):
    if partial_res:
        left_splits = partial_res[:]
        right_splits = [x for x in bins if x not in left_splits]
        results.append("left: {:20}, right: {}".format(str(left_splits), right_splits))

    # Backtracking: extend the left split with each remaining bin in turn.
    for m in range(dep, len(bins)):
        partial_res.append(bins[m])
        gen_splits_iter(m + 1, bins, partial_res, results)
        partial_res.pop()


if __name__ == "__main__":
    print("first example:")
    bins = ["a", "b", "c"]
    print("bins: {}\n-".format(bins))
    splits = gen_splits(bins)
    for s in splits:
        print(s)

    print("\n\n=")
    print("second example:")
    bins = ["a", "b", "c", "d", "e"]
    print("bins: {}\n-".format(bins))
    splits = gen_splits(bins)
    for s in splits:
        print(s)
```

logs:
```bash
~/Downloads ❯❯❯ python test.py
first example:
bins: ['a', 'b', 'c']
-
left: ['b']   , right: ['a', 'c']
left: ['b', 'c']  , right: ['a']
left: ['c']   , right: ['a', 'b']


=
second example:
bins: ['a', 'b', 'c', 'd', 'e']
-
left: ['b']   , right: ['a', 'c', 'd', 'e']
left: ['b', 'c']  , right: ['a', 'd', 'e']
left: ['b', 'c', 'd'] , right: ['a', 'e']
left: ['b', 'c', 'd', 'e'], right: ['a']
left: ['b', 'c', 'e'] , right: ['a', 'd']
left: ['b', 'd']  , right: ['a', 'c', 'e']
left: ['b', 'd', 'e'] , right: ['a', 'c']
left: ['b', 'e']  , right: ['a', 'c', 'd']
left: ['c']   , right: ['a', 'b', 'd', 'e']
left: ['c', 'd']  , right: ['a', 'b', 'e']
left: ['c', 'd', 'e'] , right: ['a', 'b']
left: ['c', 'e']  , right: ['a', 'b', 'd']
left: ['d']   , right: ['a', 'b', 'c', 'e']
left: ['d', 'e']  , right: ['a', 'b', 'c']
left: ['e']   , right: ['a', 'b', 'c', 'd']
```
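As a sanity check on the output above (my own observation, not from the thread): for N bins there are 2^(N-1) - 1 distinct unordered splits once symmetric pairs are collapsed, matching the 3 and 15 lines printed in the two examples:

```python
def num_unordered_splits(n_bins):
    # Each split is a nonempty proper subset of the bins; each
    # symmetric (left, right) pair is counted once by pinning the
    # first bin to the right side.
    return 2 ** (n_bins - 1) - 1

print(num_unordered_splits(3))  # 3, as in the first example
print(num_unordered_splits(5))  # 15, as in the second example
```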


---




[GitHub] spark issue #19666: [SPARK-22451][ML] Reduce decision tree aggregate size fo...

2017-11-07 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/19666
  
Also cc @smurching Thanks!


---




[GitHub] spark issue #19666: [SPARK-22451][ML] Reduce decision tree aggregate size fo...

2017-11-07 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/19666
  
@facaiy Thanks for your review! I added more explanation of the design
purpose of `traverseUnorderedSplits`. But if you have a better solution, don't
hesitate to tell me!


---




[GitHub] spark issue #19666: [SPARK-22451][ML] Reduce decision tree aggregate size fo...

2017-11-07 Thread facaiy
Github user facaiy commented on the issue:

https://github.com/apache/spark/pull/19666
  
I believe that unordered features will benefit a lot from this idea; however,
I have two questions:
1. I'm a little confused by line 964 in `traverseUnorderedSplits`. Is it a
backtracking algorithm?
```scala
dfs(binIndex + 1, combNumber, stats)
```
2. `traverseUnorderedSplits` succeeds in decoupling an abstraction from its
implementation. Cool! However, I wonder whether we could write a simpler
function, say, by removing `seqOp`, `finalizer`, and `T` and collecting all the
logic in one place?

Anyway, thanks for your good work, @WeichenXu123.
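To make the design question concrete, here is a hypothetical Python analogue of the shape being discussed: a backtracking traversal that delegates per-split work to a `seq_op` callback and post-processing to a `finalizer` (these names mirror the discussion; the PR's actual Scala signature may differ):

```python
def traverse_unordered_splits(bins, seq_op, finalizer):
    # Backtracking enumeration of left splits; the left side never
    # includes bins[0], so each symmetric pair is visited once.
    results = []

    def dfs(start, left):
        if left:
            results.append(seq_op(list(left)))
        for i in range(start, len(bins)):
            left.append(bins[i])
            dfs(i + 1, left)
            left.pop()

    dfs(1, [])
    return finalizer(results)

n = traverse_unordered_splits(["a", "b", "c"], tuple, len)
# n == 3
```

Collapsing `seq_op`/`finalizer` into the loop body would indeed be simpler, at the cost of reusing the traversal for different aggregations.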


---




[GitHub] spark issue #19666: [SPARK-22451][ML] Reduce decision tree aggregate size fo...

2017-11-06 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/19666
  
@smurching I guess iterating over Gray codes will have a higher time
complexity, O(n * 2^n) (not entirely sure, maybe there are more efficient
algorithms?), while the recursive traversal in my PR only needs O(2^n).
Also, recursive traversal has an advantage for pruning: with the pruning
condition I mentioned above, `1 <= combNumber <= numSplits`, recursive
traversal can prune a whole subtree and save about half the time. Can
iterating over Gray codes also do this?
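For reference, a minimal sketch of Gray-code subset enumeration (my own illustration, not the PR's code): consecutive subsets differ by exactly one bin, so the per-step stats update can be O(1) and the whole iteration O(2^n), though pruning an entire subtree is less natural than in the recursive version:

```python
def gray_code_subsets(n):
    # Yield subsets of range(n) in Gray-code order; consecutive
    # subsets differ by exactly one element.
    subset = set()
    yield frozenset(subset)
    for i in range(1, 2 ** n):
        # The single bit that differs between Gray codes g(i-1) and
        # g(i), where g(k) = k ^ (k >> 1):
        g_prev = (i - 1) ^ ((i - 1) >> 1)
        g_curr = i ^ (i >> 1)
        bit = (g_prev ^ g_curr).bit_length() - 1
        subset ^= {bit}  # flip one element: an O(1) incremental update
        yield frozenset(subset)

subsets = list(gray_code_subsets(3))  # all 8 subsets, one flip per step
```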


---




[GitHub] spark issue #19666: [SPARK-22451][ML] Reduce decision tree aggregate size fo...

2017-11-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19666
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83481/
Test FAILed.


---




[GitHub] spark issue #19666: [SPARK-22451][ML] Reduce decision tree aggregate size fo...

2017-11-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19666
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #19666: [SPARK-22451][ML] Reduce decision tree aggregate size fo...

2017-11-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19666
  
**[Test build #83481 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83481/testReport)**
 for PR 19666 at commit 
[`e79abfd`](https://github.com/apache/spark/commit/e79abfd2a231ff29abc4be1723002542c85189df).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `  test(\"Multiclass classification with unordered categorical 
features: split calculations\") `


---




[GitHub] spark issue #19666: [SPARK-22451][ML] Reduce decision tree aggregate size fo...

2017-11-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19666
  
**[Test build #83481 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83481/testReport)**
 for PR 19666 at commit 
[`e79abfd`](https://github.com/apache/spark/commit/e79abfd2a231ff29abc4be1723002542c85189df).


---
