Github user AnthonyTruchet commented on the issue:
https://github.com/apache/spark/pull/16038
Agreed, this is too ML-specific to deserve a fix in core.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/16038
As in https://github.com/apache/spark/pull/16078 I do not think we should
merge a change like this. Let's fix the problem directly in
https://github.com/apache/spark/pull/16037
Github user AnthonyTruchet commented on the issue:
https://github.com/apache/spark/pull/16038
@mridulm we don't know how to monitor the size of the serialized task. Sure,
it would not be 10MB, due to all those zeros. But we nonetheless measured a
significant increase in performance and
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/16038
Without understanding the specifics of the ML part here - won't the actual
impact of a large dense vector on the task 'bytes' be minimal at best?
We do compress the task binary; and 1B zeros should
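mridulm's compression point can be sanity-checked outside Spark with plain `java.util.zip` (a sketch only; Spark's actual task-binary codec may differ from raw deflate):

```scala
// A large all-zeros buffer compresses to almost nothing, so its
// contribution to a compressed task binary is small even when the
// uncompressed closure is many megabytes.
import java.util.zip.Deflater

val zeros = new Array[Byte](8 * 1000000)  // ~1M doubles' worth of zero bytes
val deflater = new Deflater()
deflater.setInput(zeros)
deflater.finish()
val buf = new Array[Byte](1 << 20)
var compressed = 0
while (!deflater.finished()) compressed += deflater.deflate(buf)
println(s"$compressed bytes after compression")  // only a few KB for 8MB of zeros
```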
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/16038
I see, you are saying that encoding an actually-dense vector as sparse is
somewhat less efficient, and if the implementations happened to make that
choice given sparse input, could be bad. In that
Github user AnthonyTruchet commented on the issue:
https://github.com/apache/spark/pull/16038
Actually aggregating (and thus sending over the network) quite dense
SparseVectors with tens of millions of elements is not to be taken lightly. This
would require serious benchmarking. What I
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/16038
Right now, everything is dense here, right? That's the worst case. Your
goal is to avoid serializing a dense zero vector and I say it can just be
sparse, which solves the immediate problem. From
Github user AnthonyTruchet commented on the issue:
https://github.com/apache/spark/pull/16038
The big doubt I have is about *Of course, the first call to axpy should
produce a dense vector, but that's already on the executor*: the other operand
is sparse and has to stay sparse, and this
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/16038
It might be as simple as writing `Vectors.sparse(n, Seq())` as the zero
value. Everything else appears to operate on a Vector that's either sparse or
dense then. (Of course, the first call to axpy
Github user AnthonyTruchet commented on the issue:
https://github.com/apache/spark/pull/16038
Yes, sure, the zero vector is very sparse. But I do not get your suggestion?
I see no way to pass a SparseVector as the zero and have the vector type
change along the way to a DenseVector with
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/16038
Yes, but the result is inherently dense -- not going to be many zeroes in
the gradient if any. There's no way around that. The discussion has been about
the initial value, the zero vector, right?
Github user AnthonyTruchet commented on the issue:
https://github.com/apache/spark/pull/16038
No, this does not do the trick, as the result of the aggregation IS dense,
and the zero in (tree)aggregate has the same type as the result. Put
otherwise, in L-BFGS, we do aggregate vectors
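For context, the constraint being discussed is treeAggregate's signature, where the zero value and the result share one type parameter (simplified from Spark's `RDD` API):

```scala
// Simplified shape of RDD[T].treeAggregate: zeroValue and the result are
// both U. A sparse zero therefore only type-checks if U is the common
// supertype (e.g. mllib.linalg.Vector), and seqOp/combOp must then
// accept either representation at runtime.
def treeAggregate[U: ClassTag](zeroValue: U)(
    seqOp: (U, T) => U,
    combOp: (U, U) => U,
    depth: Int = 2): U
```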
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/16038
(I am in the UK, but let's keep the discussion online here)
OK, I had assumed that `Vectors.zeros` makes a sparse vector. It does not;
it makes a dense vector. If it were sparse, then it
Github user AnthonyTruchet commented on the issue:
https://github.com/apache/spark/pull/16038
Maybe we could have a live talk/chat this evening Paris time / morning West
Coast time?
If I'm not mistaken both #16037 and this change fix the issue at two
different levels.
*
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/16038
If you're right here, then https://github.com/apache/spark/pull/16037 does
nothing, right?
In which case my understanding at
Github user AnthonyTruchet commented on the issue:
https://github.com/apache/spark/pull/16038
It is related: if this PR gets accepted, then the fix for #16037
becomes trivial: replace treeAggregate with a call to
`treeAggregateWithZeroGenerator( () => (Vectors.zeros(n),
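One possible shape for such a helper (an assumed sketch, not necessarily the PR's implementation): wrap the accumulator in `Option` so that the serialized zero is a tiny `None`, and materialize the real zero lazily on each executor via the generator.

```scala
// Hypothetical sketch: aggregate with a lazily generated zero, so a large
// zero value (e.g. a dense gradient vector) is built on the executors
// instead of being serialized into the task binary with the closure.
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

def treeAggregateWithZeroGenerator[T, U: ClassTag](rdd: RDD[T])(
    zeroGenerator: () => U)(
    seqOp: (U, T) => U,
    combOp: (U, U) => U,
    depth: Int = 2): U = {
  rdd.treeAggregate(Option.empty[U])(
    (acc, t) => Some(seqOp(acc.getOrElse(zeroGenerator()), t)),
    (a, b) => (a, b) match {
      case (Some(x), Some(y)) => Some(combOp(x, y))
      case (x, None)          => x
      case (None, y)          => y
    },
    depth
  ).getOrElse(zeroGenerator())  // empty-RDD case
}
```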
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/16038
Can one of the admins verify this patch?
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/16038
This is not really related to the other change or the JIRA though. It also
doesn't seem to address the problem as you describe it elsewhere. It's not that
the zero value is big (a sparse 0 vector
Github user AnthonyTruchet commented on the issue:
https://github.com/apache/spark/pull/16038
@srowen here is the companion PR to #16037.