[GitHub] spark issue #16038: [SPARK-18471][CORE] New treeAggregate overload for big l...
Github user AnthonyTruchet commented on the issue: https://github.com/apache/spark/pull/16038 Agreed, this is too ML-specific to deserve a fix in core. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user srowen commented on the issue: https://github.com/apache/spark/pull/16038 As in https://github.com/apache/spark/pull/16078 I do not think we should merge a change like this. Let's fix the problem directly in https://github.com/apache/spark/pull/16037
Github user AnthonyTruchet commented on the issue: https://github.com/apache/spark/pull/16038 @mridulm we don't know how to monitor the size of the serialized task. Sure, it would not be 10MB due to all those zeros. But we nonetheless measure a significant increase in performance and (more importantly) in stability when using the workaround and our custom TreeAggregatorWithZeroGenerator. @srowen when the density of a SparseVector increases it becomes *very* inefficient, not only because it uses almost twice the memory, but mainly because it fragments memory accesses and overloads the GC :( See #16078 for a generic wrapper around treeAggregate which uses Option[U] to denote the zero of U by None and generates a full representation on demand when calling seqOp or combOp.
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/16038 Without understanding the specifics of the ML part here - won't the actual impact of a large dense vector on the task's serialized bytes be minimal at best? We do compress the task binary, and 1B zeros should have a fairly low compressed footprint, no?
Github user srowen commented on the issue: https://github.com/apache/spark/pull/16038 I see, you are saying that encoding an actually-dense vector as sparse is somewhat less efficient, and if the implementations happened to make that choice given sparse input, it could be bad. In that case, just short-cut the problem and use `.toDense` to force a dense representation immediately.
Github user AnthonyTruchet commented on the issue: https://github.com/apache/spark/pull/16038 Actually, aggregating (and thus sending over the network) quite dense SparseVectors with tens of millions of elements is not to be taken lightly. This would require serious benchmarking. What I am saying is that we cannot do that lightly, because we know that aggregating `acc += partial` when `partial` is sparse and `acc` dense is efficient locally (i.e. within a partition), and that after a partition has been aggregated the vector is already quite dense (enough so that the dense and sparse representations have similar size). Anyway, if you consider this is not worth a new API and an optimization in core, the discussion should probably continue on the MLlib-only PR #16037, for which we already have a specific workaround. I just naively assumed that aggregating over big accumulators was typical enough to get some support from core; if this is not the case, I'll fall back on an MLlib-only workaround. Feel free to close this PR if you deem it irrelevant.
Github user srowen commented on the issue: https://github.com/apache/spark/pull/16038 Right now, everything is dense here, right? That's the worst case. Your goal is to avoid serializing a dense zero vector, and I say it can just be sparse, which solves the immediate problem. From there, some operations may or may not result in sparse vectors, but that's not relevant -- it can only be better than being all dense, which is the current case. It matters at the end because the return type is dense, but making it dense is easy. Paying the cost of copying to a dense representation is the only downside, but that's small compared to the saving in serialization (I presume). I don't understand the problem you're suggesting. Are you saying that you _don't_ want _any_ operations to be on sparse vectors? I'd leave that choice to the implementations, but, if you're worried about it, simply force the argument of the seqOp to become dense. Then, the only change is the smaller serialization and everything should be identical afterwards.
Github user AnthonyTruchet commented on the issue: https://github.com/apache/spark/pull/16038 The big doubt I have is about *Of course, the first call to axpy should produce a dense vector, but, that's already on the executor*: the other operand is sparse and has to be sparse, and this axpy call between two sparse vectors has to return a sparse vector. And we do not want to cast at the end, because the performance penalties of using sparse instead of dense will already have been paid. Fully exploring the trade-off between sparse and dense aggregation is out of scope and likely very application-specific. I'll have a new look at the problem, trying to manage a workaround along the lines of sparse vectors, but this will not solve the typing issue as far as I can see for now.
Github user srowen commented on the issue: https://github.com/apache/spark/pull/16038 It might be as simple as writing `Vectors.sparse(n, Seq())` as the zero value. Everything else appears to operate on a Vector that's either sparse or dense then. (Of course, the first call to axpy should produce a dense vector, but, that's already on the executor.) The method does ultimately want to return a dense vector, and does insert a cast that assumes it's already a dense vector. If that's a problem, it's easy to insert a call to `.toDense` on the `Vector` at the end, which is a no-op if it's already dense, which definitely produces a dense vector.
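[Editor's note: the suggestion above might be sketched roughly as follows. This is an illustrative sketch, not the actual MLlib code; `add` and `gradient` are assumed helpers returning new vectors, and MLlib's real `BLAS.axpy` mutates a dense accumulator in place, which is the typing concern debated in this thread.]

```scala
// Sketch: ship a sparse zero (tiny when serialized), densify on the executor.
// `add` and `gradient` are hypothetical helpers, not real MLlib functions.
val zero = Vectors.sparse(n, Seq())                 // serializes to almost nothing
val gradientSum = data.treeAggregate(zero: Vector)(
  seqOp  = (acc, point) => add(acc, gradient(point)).toDense, // densify on executor
  combOp = (a, b) => add(a, b)
).toDense                                           // no-op if already dense
```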
Github user AnthonyTruchet commented on the issue: https://github.com/apache/spark/pull/16038 Yes, sure, the zero vector is very sparse. But I do not get your suggestion. I see no way to pass a sparse vector as zero and have the type of the vector change along the way to a dense vector with only operations on sparse vectors. What do you propose? We do want to accumulate the sparse vector built from each datapoint into a dense vector. And because the type of the zero is the type of the return, we need to use a dense vector as the zero, don't we? You might have seen a way out of this I haven't, have you?
Github user srowen commented on the issue: https://github.com/apache/spark/pull/16038 Yes, but the result is inherently dense -- there are not going to be many zeroes in the gradient, if any. There's no way around that. The discussion has been about the initial value, the zero vector, right? Which you agree is sparse. I don't see what the problem is then. A huge sparse 0 vector is _not_ huge when serialized.
Github user AnthonyTruchet commented on the issue: https://github.com/apache/spark/pull/16038 No, this does not do the trick, as the result of the aggregation IS dense, and the zero in (tree)aggregate has the same type as the result. Said otherwise, in L-BFGS we aggregate vectors that are each pretty sparse but whose aggregation is dense, as they have different supports. So taking a dense vector as the zero of the aggregator makes perfect sense: adding a sparse contribution to a dense aggregator yields a dense aggregator, and this is what is desired. We just need to not waste bandwidth by sending this huge zero to the executors. If you do not want to change or add a public API in core, we might contribute a treeAggregateWithZeroGenerator to MLlib using a third way to solve the issue: use an Option[DenseVector] as the aggregate type and unwrap it as a dense zero by wrapping the seqOp and combOp; we have a draft of that locally...
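[Editor's note: the Option-based trick described above can be sketched in plain Scala. Names such as `withZeroGenerator` are illustrative, not the PR's actual code; the idea is that None stands for the zero, so the real (huge) zero is built by a generator only where seqOp/combOp actually run, and is never serialized into the task closure.]

```scala
// Lift (zeroGen, seqOp, combOp) into Option[U]-space, so None is the zero.
// The real zero is materialized lazily on the executor by zeroGen().
def withZeroGenerator[T, U](
    zeroGen: () => U,
    seqOp: (U, T) => U,
    combOp: (U, U) => U)
  : ((Option[U], T) => Option[U], (Option[U], Option[U]) => Option[U]) = {
  val liftedSeq: (Option[U], T) => Option[U] =
    (acc, t) => Some(seqOp(acc.getOrElse(zeroGen()), t))
  val liftedComb: (Option[U], Option[U]) => Option[U] =
    (a, b) => (a, b) match {
      case (Some(x), Some(y)) => Some(combOp(x, y))
      case (x, None)          => x   // None is an identity for combOp
      case (None, y)          => y
    }
  (liftedSeq, liftedComb)
}

// Local simulation of two partitions (no Spark needed): count occurrences
// of indices into a "big" dense accumulator that is never shipped as zero.
val (seq, comb) = withZeroGenerator[Int, Array[Double]](
  () => new Array[Double](5),                         // the "huge" zero, made lazily
  (acc, i) => { acc(i) += 1.0; acc },
  (a, b) => { b.indices.foreach(i => a(i) += b(i)); a })
val parts = Seq(Seq(0, 1, 1), Seq(2, 2, 4))
  .map(_.foldLeft(None: Option[Array[Double]])(seq))  // per-partition seqOp folds
val merged = parts.foldLeft(None: Option[Array[Double]])(comb)
// merged.get contains Array(1.0, 2.0, 2.0, 0.0, 1.0)
```

On Spark, the same pair would be passed to `rdd.treeAggregate(None: Option[U])(seq, comb)`, followed by a `.getOrElse(zeroGen())` on the driver for the empty-RDD case.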
Github user srowen commented on the issue: https://github.com/apache/spark/pull/16038 (I am in the UK, but let's keep the discussion online here.) OK, I had assumed that `Vectors.zeros` makes a sparse vector. It does not; it makes a dense vector. If it were sparse, then it would be tiny to serialize. Hence I didn't understand this to be the issue, and thought you were saying that another copy of a dense representation was accidentally in the closure. However, this means there's a more direct solution without changing or adding to the public API: pass a sparse vector (of zeroes) as the zero value. It looks like this should work, and may even be a little faster on the first pass when the gradient is 0. If for some reason something doesn't like the sparse vector, it's easy enough to call `.toDense` such that it's only expanded once and not on the driver. In that case, I would not make the change in this PR, and would change https://github.com/apache/spark/pull/16037 to make the initial vector sparse only.
Github user AnthonyTruchet commented on the issue: https://github.com/apache/spark/pull/16038 Maybe we could have a live talk/chat this evening Paris time / morning West Coast time? If I'm not mistaken, both #16037 and this change fix the issue at two different levels:
* #16037 considers that the issue is specific to ML and that we can & should work around it in MLlib.
* This change considers that the root cause is an inefficiency of treeAggregate in core, whose fix would enable a better fix in MLlib.
I updated this PR to reflect this. I might have misunderstood your comment https://github.com/apache/spark/pull/15905#pullrequestreview-8844854
Github user srowen commented on the issue: https://github.com/apache/spark/pull/16038 If you're right here, then https://github.com/apache/spark/pull/16037 does nothing, right? In which case my understanding at https://github.com/apache/spark/pull/15905#pullrequestreview-8844854 is wrong. But if your other change does work, and that would be explained by my understanding, then this doesn't help. At https://issues.apache.org/jira/browse/SPARK-18471?focusedCommentId=15670795&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15670795 you describe aggregating a dense vector, but this is different from having a dense vector zero value.
Github user AnthonyTruchet commented on the issue: https://github.com/apache/spark/pull/16038 It is related, as if this PR gets accepted, then the fix for #16037 becomes trivial: replace treeAggregate with a call to `treeAggregateWithZeroGenerator(() => (Vectors.zeros(n), 0.0))`, in which case only n gets captured in the closure, if I'm not mistaken. This fix then becomes much easier to port to other similar use cases in MLlib.
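[Editor's note: a sketch of what that trivial fix might look like in the L-BFGS cost function, assuming the `treeAggregateWithZeroGenerator` API proposed by this PR; `seqOp` and `combOp` stand for the existing gradient/loss operators and are not spelled out here.]

```scala
// Hypothetical usage, assuming this PR's proposed API were merged:
// only `n` is captured in the closure; the (Vectors.zeros(n), 0.0) zero
// is generated on each executor rather than serialized with the task.
val (gradientSum, lossSum) = data.treeAggregateWithZeroGenerator(
  () => (Vectors.zeros(n), 0.0)
)(seqOp, combOp)
```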
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16038 Can one of the admins verify this patch?
Github user srowen commented on the issue: https://github.com/apache/spark/pull/16038 This is not really related to the other change or the JIRA though. It also doesn't seem to address the problem as you describe it elsewhere. It's not that the zero value is big (a sparse 0 vector serializes into something tiny) but that some other representation gets into the closure?
Github user AnthonyTruchet commented on the issue: https://github.com/apache/spark/pull/16038 @srowen here is the companion PR to #16037.