[GitHub] spark issue #17000: [SPARK-18946][ML] sliceAggregate which is a new aggregat...

2018-06-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17000
  
Can one of the admins verify this patch?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17000: [SPARK-18946][ML] sliceAggregate which is a new aggregat...

2018-05-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17000
  
Can one of the admins verify this patch?





[GitHub] spark issue #17000: [SPARK-18946][ML] sliceAggregate which is a new aggregat...

2018-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17000
  
Can one of the admins verify this patch?





[GitHub] spark issue #17000: [SPARK-18946][ML] sliceAggregate which is a new aggregat...

2017-12-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17000
  
Can one of the admins verify this patch?





[GitHub] spark issue #17000: [SPARK-18946][ML] sliceAggregate which is a new aggregat...

2017-11-06 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/17000
  
@MLnick It looks like VF-LBFGS targets a different scenario. In VF algorithms, the 
vectors are too large to store in driver memory, so we slice them across 
different machines (stored as `RDD[Vector]`, using the partition ID as the 
slice key).
Also, in VF-LBFGS only a very few large vectors (usually 4-10) need to be 
aggregated together, so what this PR does looks different from VF-LBFGS.
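The VF-style layout described above can be sketched in plain Python (no Spark; `slice_vector` and `add_sliced` are hypothetical helpers for illustration, not Spark APIs): a vector too large for one machine is kept as (slice id, chunk) pairs, and element-wise operations combine matching slices independently.

```python
# Hypothetical sketch (plain Python, no Spark) of the VF-style layout:
# a huge vector lives as (slice_id, chunk) pairs, the way an
# RDD[Vector] keyed by partition id would hold it.
def slice_vector(vec, num_slices):
    """Split a flat vector into (slice_id, chunk) pairs."""
    size = (len(vec) + num_slices - 1) // num_slices
    return [(i, vec[i * size:(i + 1) * size]) for i in range(num_slices)]

def add_sliced(a, b):
    """Element-wise add two sliced vectors; matching slices are combined
    independently, so no single machine ever holds the full vector."""
    return [(i, [x + y for x, y in zip(ca, cb)])
            for (i, ca), (_, cb) in zip(a, b)]
```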





[GitHub] spark issue #17000: [SPARK-18946][ML] sliceAggregate which is a new aggregat...

2017-03-12 Thread ZunwenYou
Github user ZunwenYou commented on the issue:

https://github.com/apache/spark/pull/17000
  
ping @yanboliang, please have a look at this improvement.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---




[GitHub] spark issue #17000: [SPARK-18946][ML] sliceAggregate which is a new aggregat...

2017-02-23 Thread MLnick
Github user MLnick commented on the issue:

https://github.com/apache/spark/pull/17000
  
cc @yanboliang - it seems actually similar in effect to the VL-BFGS work 
with RDD-based coefficients?





[GitHub] spark issue #17000: [SPARK-18946][ML] sliceAggregate which is a new aggregat...

2017-02-23 Thread MLnick
Github user MLnick commented on the issue:

https://github.com/apache/spark/pull/17000
  
I'm not totally certain there will be some huge benefit from porting the vector 
summary to the UDAF framework, but there are API-level benefits to doing so. 
Perhaps there is a way to incorporate the `sliceAggregate` idea into the 
summarizer or into Catalyst operations that work with arrays...





[GitHub] spark issue #17000: [SPARK-18946][ML] sliceAggregate which is a new aggregat...

2017-02-23 Thread MLnick
Github user MLnick commented on the issue:

https://github.com/apache/spark/pull/17000
  
@ZunwenYou yes, I understand that `sliceAggregate` is different from 
SPARK-19634 and more comparable to `treeAggregate`. But I'm not sure whether, if 
we plan to port the vector summary to a `DataFrame`-based UDAF, we can still 
incorporate the benefit of `sliceAggregate`.

So my point would be to see how much benefit accrues from (a) using the UDAF 
mechanism and (b) not computing unnecessary things. Then we can compare that to 
the benefit here and decide.





[GitHub] spark issue #17000: [SPARK-18946][ML] sliceAggregate which is a new aggregat...

2017-02-21 Thread ZunwenYou
Github user ZunwenYou commented on the issue:

https://github.com/apache/spark/pull/17000
  
Hi @MLnick,
Firstly, `sliceAggregate` is a general-purpose aggregate for array-like data. 
Beyond the `MultivariateOnlineSummarizer` case, it can be used in many 
large-scale machine learning settings. I chose `MultivariateOnlineSummarizer` 
for our experiment simply because it is a real bottleneck of 
`LogisticRegression` in the ml package.

[This](https://issues.apache.org/jira/browse/SPARK-19634) is a good 
improvement for `MultivariateOnlineSummarizer`, but I do not think it makes 
sense to compare these two improvements directly. In my opinion, the reasonable 
comparison is `sliceAggregate` versus `treeAggregate`.





[GitHub] spark issue #17000: [SPARK-18946][ML] sliceAggregate which is a new aggregat...

2017-02-20 Thread MLnick
Github user MLnick commented on the issue:

https://github.com/apache/spark/pull/17000
  
Is the speedup coming mostly from the `MultivariateOnlineSummarizer` stage?

See https://issues.apache.org/jira/browse/SPARK-19634, which proposes porting 
this operation to a DataFrame UDAF and computing only the required metrics 
(instead of forcing computation of all of them, as is done currently). I wonder 
how that will compare?





[GitHub] spark issue #17000: [SPARK-18946][ML] sliceAggregate which is a new aggregat...

2017-02-20 Thread ZunwenYou
Github user ZunwenYou commented on the issue:

https://github.com/apache/spark/pull/17000
  
Hi @hhbyyh,

In our experiment, the class **_MultivariateOnlineSummarizer_** contains 8 
arrays; when the dimension reaches 20 million, its memory footprint is about 
1280 MB (8 arrays * 20M entries * 8 bytes per double).

The experiment configuration was as follows:
spark.driver.maxResultSize 6g
spark.kryoserializer.buffer.max 2047m
driver-memory 20g 
num-executors 100 
executor-cores 2 
executor-memory 15g

RDD and aggregate parameters:
RDD partition number 300
treeAggregate depth 5

With this configuration, treeAggregate runs in four stages, with 300, 75, 18, 
and 4 tasks respectively.
At the last stage of treeAggregate, tasks are killed because executors throw 
_**java.lang.OutOfMemoryError: Requested array size exceeds VM limit**_. 
Even with treeAggregate depth=7 and executor-memory=30g, the last stage still 
failed.
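The numbers in this comment can be reproduced with a small sketch (plain Python, no Spark). `tree_stages` is a hypothetical helper that mirrors treeAggregate's integer-division fan-in; the exact scheme is an assumption for illustration, not Spark's verbatim code:

```python
import math

# Memory held by one MultivariateOnlineSummarizer at 20M dimensions:
# 8 double arrays x 20e6 entries x 8 bytes per double = 1280 MB.
dims, arrays, bytes_per_double = 20_000_000, 8, 8
mem_mb = arrays * dims * bytes_per_double / 1e6  # 1280.0

# Hypothetical sketch of treeAggregate's fan-in: scale is roughly the
# depth-th root of the partition count, and each level divides the
# number of reduce tasks by it until only a handful remain.
def tree_stages(partitions, depth):
    scale = max(math.ceil(partitions ** (1.0 / depth)), 2)
    stages = [partitions]
    while partitions > scale + math.ceil(partitions / scale):
        partitions //= scale
        stages.append(partitions)
    return stages

# 300 partitions at depth 5 give the four stage sizes reported above.
```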





[GitHub] spark issue #17000: [SPARK-18946][ML] sliceAggregate which is a new aggregat...

2017-02-20 Thread ZunwenYou
Github user ZunwenYou commented on the issue:

https://github.com/apache/spark/pull/17000
  
Hi @MLnick,
You are right: sliceAggregate splits an array into smaller chunks before the 
shuffle.
It has three advantages.
Firstly, less data is shuffled than with treeAggregate over the whole 
transformation.
Secondly, as you describe, it allows more concurrency, not only in the driver's 
collect operation but also while running **_seqOp_** and **_combOp_**.
Thirdly, as I have observed, when a record is larger than about 1 GB (an array 
of 100 million dimensions), shuffling among executors becomes less efficient 
while the remaining executors sit idle. I am not sure of the reason for this.
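The idea can be illustrated with a hypothetical pure-Python sketch (no Spark; `slice_aggregate` and its signature are illustrative, not the PR's actual API): each partition's full result array is cut into slices, and slices with the same index are combined independently, so every shuffled record stays small and the per-slice combines could run as concurrent tasks.

```python
def slice_aggregate(partition_results, num_slices, comb_op):
    """Combine per-partition arrays slice-by-slice.

    partition_results: one full result array per partition.
    comb_op: combines two equally-sized slices (like treeAggregate's combOp).
    """
    n = len(partition_results[0])
    size = (n + num_slices - 1) // num_slices
    out = []
    for s in range(num_slices):  # each slice could be an independent task
        lo, hi = s * size, min((s + 1) * size, n)
        acc = partition_results[0][lo:hi]
        for part in partition_results[1:]:
            acc = comb_op(acc, part[lo:hi])
        out.extend(acc)
    return out
```

With an element-wise add as `comb_op`, this produces the same result as aggregating the full arrays, but no single shuffled record is ever larger than one slice.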







[GitHub] spark issue #17000: [SPARK-18946][ML] sliceAggregate which is a new aggregat...

2017-02-20 Thread hhbyyh
Github user hhbyyh commented on the issue:

https://github.com/apache/spark/pull/17000
  
Hi @ZunwenYou, do you know why treeAggregate fails when the feature dimension 
reaches 20 million?
I think this can potentially help with the 2G disk shuffle spill limit (to be 
verified).
Also, we should evaluate the extra memory consumption due to the slice and copy.






[GitHub] spark issue #17000: [SPARK-18946][ML] sliceAggregate which is a new aggregat...

2017-02-20 Thread MLnick
Github user MLnick commented on the issue:

https://github.com/apache/spark/pull/17000
  
Just to be clear - this is essentially just splitting an array into smaller 
chunks so that the overall communication is more efficient? It would be good to 
look at why Spark is not doing a good job with one big array. Is the bottleneck 
really the executor communication (the shuffle part)? Or is it collecting the 
big array back at the driver at the end of tree aggregation (i.e. this patch in 
effect allows more concurrency in the `collect` operation)?

cc @dbtsai @sethah @yanboliang  who were looking at linear model 
scalability recently.





[GitHub] spark issue #17000: [SPARK-18946][ML] sliceAggregate which is a new aggregat...

2017-02-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17000
  
Can one of the admins verify this patch?

