[GitHub] spark issue #17000: [SPARK-18946][ML] sliceAggregate which is a new aggregat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17000 Can one of the admins verify this patch?
[GitHub] spark issue #17000: [SPARK-18946][ML] sliceAggregate which is a new aggregat...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/17000 @MLnick It looks like VF-LBFGS has a different scenario. In VF algorithms the vectors are too large to store in driver memory, so we slice each vector across different machines (stored as `RDD[Vector]`, using the partition ID as the slice key). And in VF-LBFGS only a very few large vectors (usually 4-10) need to be aggregated together. So what this PR does looks different from VF-LBFGS.
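(A minimal sketch of the VF layout described above, assuming the huge vector is held as an `RDD` of slices keyed by partition ID; the helper name `sliceVector` and its parameters are illustrative, not taken from the PR or the VF-LBFGS work.)

```scala
import org.apache.spark.SparkContext
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Split one huge vector into numSlices dense slices, one per partition,
// so the slice key doubles as the partition ID.
def sliceVector(sc: SparkContext, v: Array[Double], numSlices: Int): RDD[(Int, Vector)] = {
  val sliceSize = math.ceil(v.length.toDouble / numSlices).toInt
  val slices = v.grouped(sliceSize).zipWithIndex
    .map { case (chunk, id) => (id, Vectors.dense(chunk)) }
    .toSeq
  sc.parallelize(slices, numSlices)
}
```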
[GitHub] spark issue #17000: [SPARK-18946][ML] sliceAggregate which is a new aggregat...
Github user ZunwenYou commented on the issue: https://github.com/apache/spark/pull/17000 ping @yanboliang, please have a look at this improvement.
[GitHub] spark issue #17000: [SPARK-18946][ML] sliceAggregate which is a new aggregat...
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/17000 cc @yanboliang - it seems actually similar in effect to the VL-BFGS work with RDD-based coefficients?
[GitHub] spark issue #17000: [SPARK-18946][ML] sliceAggregate which is a new aggregat...
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/17000 I'm not totally certain there will be some huge benefit from porting the vector summary to the UDAF framework. But there are API-level benefits to doing so. Perhaps there is a way to incorporate the `sliceAggregate` idea into the summarizer or into Catalyst operations that work with arrays...
[GitHub] spark issue #17000: [SPARK-18946][ML] sliceAggregate which is a new aggregat...
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/17000 @ZunwenYou yes, I understand that `sliceAggregate` is different from SPARK-19634 and more comparable to `treeAggregate`. But I'm not sure, if we plan to port the vector summary to a `DataFrame`-based UDAF, whether we can still incorporate the benefit of `sliceAggregate`. So my point would be to see how much benefit accrues from (a) using the UDAF mechanism and (b) not computing unnecessary things. Then we can compare that to the benefit here and decide.
[GitHub] spark issue #17000: [SPARK-18946][ML] sliceAggregate which is a new aggregat...
Github user ZunwenYou commented on the issue: https://github.com/apache/spark/pull/17000 Hi, @MLnick Firstly, `sliceAggregate` is a general aggregate for array-like data. Besides the `MultivariateOnlineSummarizer` case, it can be used in many large-scale machine learning cases. I chose `MultivariateOnlineSummarizer` for our experiment because it is a real bottleneck of `LogisticRegression` in the ml package. [SPARK-19634](https://issues.apache.org/jira/browse/SPARK-19634) is a good improvement for `MultivariateOnlineSummarizer`, but I do not think it's a good idea to compare these two improvements. In my opinion, it is more reasonable to compare `sliceAggregate` with `treeAggregate`.
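(For reference, a sketch of the existing `treeAggregate` path that `sliceAggregate` is being compared against. `treeAggregate` and `MultivariateOnlineSummarizer` are the real Spark APIs; the exact `sliceAggregate` signature is defined in the PR and may differ from the comment at the end.)

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
import org.apache.spark.rdd.RDD

// treeAggregate merges whole summarizers at every level of the tree, so each
// shuffle record carries all eight internal arrays of one summarizer.
def summarize(data: RDD[Vector], depth: Int = 2): MultivariateOnlineSummarizer =
  data.treeAggregate(new MultivariateOnlineSummarizer)(
    seqOp = (s, v) => s.add(v),
    combOp = (s1, s2) => s1.merge(s2),
    depth = depth)

// A sliceAggregate-style call would instead shuffle each partial result as k
// keyed slices and merge slice i only with the other copies of slice i.
```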
[GitHub] spark issue #17000: [SPARK-18946][ML] sliceAggregate which is a new aggregat...
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/17000 Is the speedup coming mostly from the `MultivariateOnlineSummarizer` stage? See https://issues.apache.org/jira/browse/SPARK-19634, which is about porting this operation to a DataFrame UDAF and computing only the required metrics (instead of forcing computation of all of them, as is done currently). I wonder how that will compare?
[GitHub] spark issue #17000: [SPARK-18946][ML] sliceAggregate which is a new aggregat...
Github user ZunwenYou commented on the issue: https://github.com/apache/spark/pull/17000 Hi, @hhbyyh In our experiment the class **_MultivariateOnlineSummarizer_** contains 8 arrays, so when the dimension reaches 20 million, one summarizer occupies 1280 MB (8 bytes per double * 20M dimensions * 8 arrays). The experiment configuration was as follows:

spark.driver.maxResultSize 6g
spark.kryoserializer.buffer.max 2047m
driver-memory 20g
num-executors 100
executor-cores 2
executor-memory 15g

RDD and aggregate parameters:

RDD partition number 300
treeAggregate depth 5

With this configuration, treeAggregate runs in four stages of 300, 75, 18, and 4 tasks respectively. In the last stage of treeAggregate, tasks get killed because executors throw _**java.lang.OutOfMemoryError: Requested array size exceeds VM limit**_. Even after setting treeAggregate depth=7 and executor-memory=30g, the last stage still failed.
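(A back-of-envelope check of the 1280 MB figure quoted above; plain Scala arithmetic, not code from the PR.)

```scala
// 8 internal double arrays * 8 bytes per double * 20 million dimensions
val numArrays = 8L
val bytesPerDouble = 8L
val dims = 20L * 1000 * 1000
val bytes = numArrays * bytesPerDouble * dims
println(s"${bytes / 1000000} MB per summarizer")  // prints "1280 MB per summarizer"
```

A serialized record of that size is plausibly what pushes the final tree levels toward the JVM's ~2 GB array limit and the _Requested array size exceeds VM limit_ error reported above.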
[GitHub] spark issue #17000: [SPARK-18946][ML] sliceAggregate which is a new aggregat...
Github user ZunwenYou commented on the issue: https://github.com/apache/spark/pull/17000 Hi, @MLnick You are right, sliceAggregate splits an array into smaller chunks before the shuffle. It has three advantages. Firstly, less data is shuffled than with treeAggregate over the whole transformation. Secondly, as you describe, it allows more concurrency, not only in the driver's collect operation but also while running **_seqOp_** and **_combOp_**. Thirdly, as I observed, when a record is larger than about 1 GB (an array of 100 million dimensions), the shuffle among executors becomes less efficient while the rest of the executors sit idle. I am not clear on the reason for this.
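(A minimal sketch of the slice-then-shuffle idea described above, specialized to plain `Array[Double]` aggregates. The name `sliceAggregateArrays` and the hard-coded elementwise-sum merge are illustrative only; the PR's actual `sliceAggregate` may expose this differently.)

```scala
import org.apache.spark.rdd.RDD

// Each partition emits its local aggregate as k keyed slices; slices are then
// merged per key, so no single shuffle record carries the whole array.
def sliceAggregateArrays(partials: RDD[Array[Double]], k: Int): Array[Double] = {
  val sliced = partials.flatMap { arr =>
    val size = math.ceil(arr.length.toDouble / k).toInt
    arr.grouped(size).zipWithIndex.map { case (chunk, i) => (i, chunk) }
  }
  val merged = sliced.reduceByKey { (a, b) =>
    var i = 0
    while (i < a.length) { a(i) += b(i); i += 1 }  // elementwise combOp per slice
    a
  }
  merged.collect().sortBy(_._1).flatMap(_._2)      // reassemble on the driver
}
```

With k slices the largest shuffle record shrinks by roughly a factor of k, and the k per-slice merges can run on different executors concurrently.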
[GitHub] spark issue #17000: [SPARK-18946][ML] sliceAggregate which is a new aggregat...
Github user hhbyyh commented on the issue: https://github.com/apache/spark/pull/17000 Hi @ZunwenYou Do you know the reason that treeAggregate fails when the feature dimension reaches 20 million? I think this can potentially help with the 2G disk shuffle spill limit (to be verified). Also, we should evaluate the extra memory consumption due to the slice and copy.
[GitHub] spark issue #17000: [SPARK-18946][ML] sliceAggregate which is a new aggregat...
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/17000 Just to be clear - this is essentially just splitting an array up into smaller chunks so that the overall communication is more efficient? It would be good to look at why Spark is not doing a good job with one big array. Is the bottleneck really the executor communication (the shuffle part)? Or is it collecting the big array back at the end of the tree aggregation (i.e. this patch sort of allows more concurrency in the `collect` operation)? cc @dbtsai @sethah @yanboliang who were looking at linear model scalability recently.