[GitHub] spark issue #16038: [SPARK-18471][CORE] New treeAggregate overload for big l...

2016-11-30 Thread AnthonyTruchet
Github user AnthonyTruchet commented on the issue: https://github.com/apache/spark/pull/16038 Agreed, this is too ML-specific to deserve a fix in core.

2016-11-30 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/16038 As in https://github.com/apache/spark/pull/16078 I do not think we should merge a change like this. Let's fix the problem directly in https://github.com/apache/spark/pull/16037

2016-11-30 Thread AnthonyTruchet
Github user AnthonyTruchet commented on the issue: https://github.com/apache/spark/pull/16038 @mridulm we don't know how to monitor the size of the serialized task. Sure, it would not be 10MB, due to all those zeros. But we nonetheless measure a significant increase in performance and

2016-11-29 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/16038 Without understanding the specifics of the ML part here - won't the actual impact of a large dense vector on Task 'bytes' be minimal at best? We do compress the task binary; and 1B zeros should
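
mridulm's compression point is easy to sanity-check outside Spark. A minimal, self-contained sketch (GZIP here is just for illustration; Spark compresses the broadcast task binary with its configured codec, LZ4 or Snappy depending on version and config):

```scala
import java.io.ByteArrayOutputStream
import java.util.zip.GZIPOutputStream

object ZeroCompressionCheck {
  def main(args: Array[String]): Unit = {
    // 10 million zero doubles serialize to roughly 80 MB of zero bytes...
    val raw = new Array[Byte](80 * 1000 * 1000)
    val out = new ByteArrayOutputStream()
    val gz  = new GZIPOutputStream(out)
    gz.write(raw)
    gz.close()
    // ...but a single long run of zeros compresses to a few tens of KB.
    println(s"raw: ${raw.length} bytes, compressed: ${out.size} bytes")
  }
}
```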

2016-11-29 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/16038 I see, you are saying that encoding an actually-dense vector as sparse is somewhat less efficient, and if the implementations happened to make that choice given sparse input, it could be bad. In that

2016-11-29 Thread AnthonyTruchet
Github user AnthonyTruchet commented on the issue: https://github.com/apache/spark/pull/16038 Actually, aggregating (and thus sending over the network) quite dense SparseVectors with tens of millions of elements is not to be taken lightly. This would require serious benchmarking. What I
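
For scale, a back-of-the-envelope comparison (my numbers, not from the thread): a dense vector costs 8 bytes per slot, while a SparseVector costs a 4-byte Int index plus an 8-byte Double per nonzero, so a 90%-dense "sparse" vector is actually bigger than its dense encoding:

```scala
object SparseVsDenseFootprint {
  def main(args: Array[String]): Unit = {
    val n       = 10 * 1000 * 1000           // tens of millions of elements
    val density = 0.9                         // a "quite dense" gradient
    val dense   = 8L * n                      // one 8-byte Double per slot
    val sparse  = (12L * density * n).toLong  // 4-byte index + 8-byte value per nonzero
    println(f"dense: ${dense / 1e6}%.0f MB, sparse at 90%% density: ${sparse / 1e6}%.0f MB")
  }
}
```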

2016-11-29 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/16038 Right now, everything is dense here, right? That's the worst case. Your goal is to avoid serializing a dense zero vector and I say it can just be sparse, which solves the immediate problem. From

2016-11-29 Thread AnthonyTruchet
Github user AnthonyTruchet commented on the issue: https://github.com/apache/spark/pull/16038 The big doubt I have is about *Of course, the first call to axpy should produce a dense vector, but, that's already on the executor*: the other operand is sparse and has to stay sparse, and this

2016-11-29 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/16038 It might be as simple as writing `Vectors.sparse(n, Seq())` as the zero value. Everything else appears to operate on a Vector that's either sparse or dense then. (Of course, the first call to axpy
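
A minimal sketch of that idea (my reconstruction, not MLlib's actual code; note `BLAS` is Spark's package-private `org.apache.spark.mllib.linalg.BLAS`, and its `axpy` needs a dense destination, hence the `toDense` on first use):

```scala
import org.apache.spark.mllib.linalg.{BLAS, Vector, Vectors}
import org.apache.spark.rdd.RDD

// Accumulate gradients starting from a sparse zero: the zero serializes as a
// few bytes, and the accumulator densifies on the executor at the first axpy.
def gradientSum(points: RDD[Vector], n: Int): Vector =
  points.treeAggregate[Vector](Vectors.sparse(n, Seq()))(
    seqOp = (acc, p) => {
      val dense = acc.toDense   // densifies the sparse zero on first use
      BLAS.axpy(1.0, p, dense)  // dense += p; p may be sparse or dense
      dense
    },
    combOp = (a, b) => {
      val dense = a.toDense
      BLAS.axpy(1.0, b, dense)
      dense
    }
  )
```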

2016-11-29 Thread AnthonyTruchet
Github user AnthonyTruchet commented on the issue: https://github.com/apache/spark/pull/16038 Yes, sure, the zero vector is very sparse. But I do not get your suggestion: I see no way to pass a SparseVector as zero and get the type of vector to change along the way to a DenseVector with

2016-11-29 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/16038 Yes, but the result is inherently dense -- not going to be many zeroes in the gradient if any. There's no way around that. The discussion has been about the initial value, the zero vector, right?

2016-11-29 Thread AnthonyTruchet
Github user AnthonyTruchet commented on the issue: https://github.com/apache/spark/pull/16038 No, this does not do the trick, as the result of the aggregation IS dense. And the zero in (tree)aggregate has the same type as the result. Said otherwise, in L-BFGS we do aggregate vectors
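
For reference, the signature in question on RDD, which is the constraint being described: the zero value and the result share the single type parameter U, so a sparse zero forces `U = Vector` rather than `U = DenseVector`:

```scala
// org.apache.spark.rdd.RDD (excerpt)
def treeAggregate[U: ClassTag](zeroValue: U)(
    seqOp: (U, T) => U,
    combOp: (U, U) => U,
    depth: Int = 2): U
```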

2016-11-29 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/16038 (I am in the UK, but let's keep the discussion online here) OK, I had assumed that `Vectors.zeros` makes a sparse vector. It does not; it makes a dense vector. If it were sparse, then it
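
A quick spark-shell check of that point:

```scala
import org.apache.spark.mllib.linalg.Vectors

val z = Vectors.zeros(1000000)
println(z.getClass.getSimpleName)  // DenseVector: a full Array[Double] of zeros
val s = Vectors.sparse(1000000, Seq())
println(s.getClass.getSimpleName)  // SparseVector: empty indices/values, tiny
```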

2016-11-29 Thread AnthonyTruchet
Github user AnthonyTruchet commented on the issue: https://github.com/apache/spark/pull/16038 Maybe we could have a live talk/chat this evening Paris time / morning West Coast time? If I'm not mistaken, both #16037 and this change fix the issue at two different levels. *

2016-11-28 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/16038 If you're right here, then https://github.com/apache/spark/pull/16037 does nothing, right? In which case my understanding at

2016-11-28 Thread AnthonyTruchet
Github user AnthonyTruchet commented on the issue: https://github.com/apache/spark/pull/16038 It is related, because if this PR gets accepted the fix for #16037 becomes trivial: replace treeAggregate with a call to `treeAggregateWithZeroGenerator( () => (Vectors.zeros(n),
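
For readers of the archive: the PR itself is not quoted here, so this is only a sketch of what such an overload could look like. The Option wrapping (serialize a None plus the small generator closure instead of the big zero value) is my assumption about the approach, not necessarily the PR's code:

```scala
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Build the (large) zero on each executor via zeroGenerator, so only a None
// and the generator closure travel with the task, never the zero itself.
def treeAggregateWithZeroGenerator[T, U: ClassTag](rdd: RDD[T])(
    zeroGenerator: () => U)(
    seqOp: (U, T) => U,
    combOp: (U, U) => U,
    depth: Int = 2): U =
  rdd.treeAggregate(Option.empty[U])(
    (acc, t) => Some(seqOp(acc.getOrElse(zeroGenerator()), t)),
    (a, b) => (a, b) match {
      case (Some(x), Some(y)) => Some(combOp(x, y))
      case (x, None)          => x
      case (None, y)          => y
    },
    depth
  ).getOrElse(zeroGenerator()) // empty RDD: fall back to a fresh zero
```

With this in place, the #16037 fix would indeed reduce to passing something like `() => Vectors.zeros(n)` as the generator, as the comment above describes.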

2016-11-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16038 Can one of the admins verify this patch?

2016-11-28 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/16038 This is not really related to the other change or the JIRA though. It also doesn't seem to address the problem as you describe it elsewhere. It's not that the zero value is big (a sparse 0 vector

2016-11-28 Thread AnthonyTruchet
Github user AnthonyTruchet commented on the issue: https://github.com/apache/spark/pull/16038 @srowen here is the companion PR to #16037.