[GitHub] spark issue #16038: [SPARK-18471][CORE] New treeAggregate overload for big l...

2016-11-30 Thread AnthonyTruchet
Github user AnthonyTruchet commented on the issue:

https://github.com/apache/spark/pull/16038
  
Agreed, this is too ML-specific to deserve a fix in core.





[GitHub] spark issue #16038: [SPARK-18471][CORE] New treeAggregate overload for big l...

2016-11-30 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/16038
  
As in https://github.com/apache/spark/pull/16078 I do not think we should 
merge a change like this. Let's fix the problem directly in 
https://github.com/apache/spark/pull/16037





[GitHub] spark issue #16038: [SPARK-18471][CORE] New treeAggregate overload for big l...

2016-11-30 Thread AnthonyTruchet
Github user AnthonyTruchet commented on the issue:

https://github.com/apache/spark/pull/16038
  
@mridulm we don't know how to monitor the size of the serialized task. Sure, it would not be 10MB, thanks to all those zeros. But we nonetheless measure a significant increase in performance and (more importantly) in stability when using the workaround and our custom TreeAggregatorWithZeroGenerator.

@srowen when the density of a SparseVector increases it becomes *very* inefficient, not only because it uses almost twice the memory, but mainly because it fragments memory accesses and overloads the GC :(

See #16078 for a generic wrapper around treeAggregate which uses Option[U] to denote the zero of U by None and generates the full representation on demand when calling seqOp or combOp.
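
To illustrate, a minimal sketch of that wrapper idea (illustrative only, not the actual #16078 code; the helper name and signature are assumptions):

```scala
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// None stands for the (possibly huge) zero of U, so only an Option is shipped
// with the task; the real zero is generated lazily on the executors.
def treeAggregateWithZeroGenerator[T, U: ClassTag](rdd: RDD[T])(
    zeroGenerator: () => U)(
    seqOp: (U, T) => U,
    combOp: (U, U) => U,
    depth: Int = 2): U = {
  val wrappedSeqOp = (acc: Option[U], elem: T) =>
    Some(seqOp(acc.getOrElse(zeroGenerator()), elem))
  val wrappedCombOp = (a: Option[U], b: Option[U]) => (a, b) match {
    case (Some(x), Some(y)) => Some(combOp(x, y))
    case (x, None)          => x
    case (None, y)          => y
  }
  rdd.treeAggregate(Option.empty[U])(wrappedSeqOp, wrappedCombOp, depth)
    .getOrElse(zeroGenerator())
}
```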





[GitHub] spark issue #16038: [SPARK-18471][CORE] New treeAggregate overload for big l...

2016-11-29 Thread mridulm
Github user mridulm commented on the issue:

https://github.com/apache/spark/pull/16038
  
Without understanding the specifics of the ML part here - won't the actual impact of a large dense vector on the task bytes be minimal at best?
We do compress the task binary, and 1B zeros should have a fairly low compressed footprint, no?





[GitHub] spark issue #16038: [SPARK-18471][CORE] New treeAggregate overload for big l...

2016-11-29 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/16038
  
I see, you are saying that encoding an actually-dense vector as sparse is somewhat less efficient, and if the implementations happened to make that choice given sparse input, it could be bad. In that case, just short-cut the problem and use `.toDense` to force a dense representation immediately.





[GitHub] spark issue #16038: [SPARK-18471][CORE] New treeAggregate overload for big l...

2016-11-29 Thread AnthonyTruchet
Github user AnthonyTruchet commented on the issue:

https://github.com/apache/spark/pull/16038
  
Actually, aggregating (and thus sending over the network) quite dense SparseVectors with tens of millions of elements is not to be taken lightly. This would require serious benchmarking. What I am saying is that we cannot do that lightly, because we know that aggregating `acc += partial` when `partial` is sparse and `acc` is dense is efficient locally (i.e. within a partition), and that after a partition has been aggregated the vector is already quite dense (enough so that the dense and sparse representations have a similar size).

Anyway, if you consider this is not worth a new API and an optimization in core, the discussion should probably continue on the MLlib-only PR #16037, for which we already have a specific workaround.
I just naively assumed that aggregating over a big accumulator was typical enough to get some support from core; if this is not the case, I'll fall back on an MLlib-only workaround.

Feel free to close this PR if you deem it irrelevant.





[GitHub] spark issue #16038: [SPARK-18471][CORE] New treeAggregate overload for big l...

2016-11-29 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/16038
  
Right now, everything is dense here, right? That's the worst case. Your 
goal is to avoid serializing a dense zero vector and I say it can just be 
sparse, which solves the immediate problem. From there, some operations may or 
may not result in sparse vectors, but that's not relevant -- it can only be 
better than being all dense, which is the current case. It matters at the end 
because the return type is dense, but, making it dense is easy. Paying the cost 
of copying to a dense representation is the only downside, but that's small 
compared to the saving in serialization (I presume). 

I don't understand the problem you're suggesting. Are you saying that you _don't_ want _any_ operations to be on sparse vectors? I'd leave that choice to the implementations, but, if you're worried about it, simply force the argument of the seqOp to become dense. Then, the only change is the smaller serialization and everything should be identical afterwards.





[GitHub] spark issue #16038: [SPARK-18471][CORE] New treeAggregate overload for big l...

2016-11-29 Thread AnthonyTruchet
Github user AnthonyTruchet commented on the issue:

https://github.com/apache/spark/pull/16038
  
The big doubt I have is about *Of course, the first call to axpy should produce a dense vector, but, that's already on the executor*: the other operand is sparse and has to be sparse, and this axpy call between two sparse vectors has to return a sparse vector. And we do not want to cast at the end, because the performance penalties of using sparse instead of dense will already have been paid.

Fully exploring the trade-off between sparse and dense aggregation is out of scope and likely very application-specific. I'll have a new look at the problem, trying to manage a workaround along the lines of sparse vectors, but this will not solve the typing issue as far as I can see for now.





[GitHub] spark issue #16038: [SPARK-18471][CORE] New treeAggregate overload for big l...

2016-11-29 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/16038
  
It might be as simple as writing `Vectors.sparse(n, Seq())` as the zero 
value. Everything else appears to operate on a Vector that's either sparse or 
dense then. (Of course, the first call to axpy should produce a dense vector, 
but, that's already on the executor.)

The method does ultimately want to return a dense vector, and does insert a 
cast that assumes it's already a dense vector. If that's a problem, it's easy 
to insert a call to `.toDense` on the `Vector` at the end, which is a no-op if 
it's already dense, which definitely produces a dense vector.
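
To make that concrete, here is a rough sketch of the sparse-zero approach in an L-BFGS-style cost function (illustrative only, not the actual #16037 diff; `computeGradient`, `data`, `n`, `weights` and `localGradient` are stand-ins, and the element-wise sum stands in for MLlib's internal BLAS.axpy):

```scala
import org.apache.spark.mllib.linalg.{DenseVector, Vector, Vectors}
import org.apache.spark.mllib.optimization.Gradient
import org.apache.spark.rdd.RDD

// data: RDD of (label, features); n: number of features;
// weights: current weights; localGradient: a Gradient instance.
def computeGradient(
    data: RDD[(Double, Vector)],
    n: Int,
    weights: Vector,
    localGradient: Gradient): (DenseVector, Double) = {

  // An empty sparse vector: essentially free to serialize with the task.
  val zero = (Vectors.sparse(n, Seq()): Vector, 0.0)

  val (gradientSum, lossSum) = data.treeAggregate(zero)(
    seqOp = { case ((grad, loss), (label, features)) =>
      val denseGrad = grad.toDense   // expanded at most once, on the executor
      val l = localGradient.compute(features, label, weights, denseGrad)
      (denseGrad, loss + l)
    },
    combOp = { case ((grad1, loss1), (grad2, loss2)) =>
      // Plain element-wise sum on the dense representations.
      val v1 = grad1.toDense.values
      val v2 = grad2.toDense.values
      val summed = Array.tabulate(v1.length)(i => v1(i) + v2(i))
      (Vectors.dense(summed), loss1 + loss2)
    })

  (gradientSum.toDense, lossSum)     // toDense is a no-op if already dense
}
```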





[GitHub] spark issue #16038: [SPARK-18471][CORE] New treeAggregate overload for big l...

2016-11-29 Thread AnthonyTruchet
Github user AnthonyTruchet commented on the issue:

https://github.com/apache/spark/pull/16038
  
Yes, sure, the zero vector is very sparse. But I do not get your suggestion. I see no way to pass a SparseVector as zero and have the type of vector change along the way to DenseVector with only operations on sparse vectors.

What do you propose? We do want to accumulate the sparse vector built from each data point into a dense vector. And because the type of the zero is the type of the return, we need to use a dense vector as the zero, don't we? You might have seen a way out of this that I haven't, have you?





[GitHub] spark issue #16038: [SPARK-18471][CORE] New treeAggregate overload for big l...

2016-11-29 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/16038
  
Yes, but the result is inherently dense -- not going to be many zeroes in 
the gradient if any. There's no way around that. The discussion has been about 
the initial value, the zero vector, right? That you agree is sparse. I don't 
see what the problem is then. A huge sparse 0 vector is _not_ huge when 
serialized.





[GitHub] spark issue #16038: [SPARK-18471][CORE] New treeAggregate overload for big l...

2016-11-29 Thread AnthonyTruchet
Github user AnthonyTruchet commented on the issue:

https://github.com/apache/spark/pull/16038
  
No, this does not do the trick, as the result of the aggregation IS dense. And the zero in (tree)aggregate has the same type as the result. Said otherwise, in L-BFGS we aggregate vectors that are each pretty sparse but whose aggregation is dense, as they have different supports. So taking a dense vector as the zero of the aggregator makes perfect sense: adding a sparse contribution to a dense aggregator yields a dense aggregator, and this is what is desired. We just need not to waste bandwidth by sending this huge zero to the executors.

If you do not want to change or add a public API in core, we might contribute a treeAggregateWithZeroGenerator to MLlib using a third way to solve the issue: use an Option[DenseVector] as the aggregate type and unwrap it into a dense zero by wrapping the seqOp and combOp; we have a draft of that locally...





[GitHub] spark issue #16038: [SPARK-18471][CORE] New treeAggregate overload for big l...

2016-11-29 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/16038
  
(I am in the UK, but let's keep the discussion online here)

OK, I had assumed that `Vectors.zeros` makes a sparse vector. It does not; 
it makes a dense vector. If it were sparse, then it would be tiny to serialize. 
Hence I didn't understand this to be the issue, and thought you were saying that 
another copy of a dense representation was accidentally in the closure.

However, this means there's a more direct solution without changing or 
adding to the public API: pass a sparse vector (of zeroes) as the zero value. 
It looks like this should work, even be a little faster on the first pass when 
the gradient is 0. If for some reason something doesn't like the sparse vector, 
it's easy enough to call `.toDense` such that it's only expanded once and not 
on the driver.

In that case, I would not make the change in this PR, and change 
https://github.com/apache/spark/pull/16037 to make the initial vector sparse 
only.







[GitHub] spark issue #16038: [SPARK-18471][CORE] New treeAggregate overload for big l...

2016-11-29 Thread AnthonyTruchet
Github user AnthonyTruchet commented on the issue:

https://github.com/apache/spark/pull/16038
  
Maybe we could have a live talk/chat this evening Paris time / morning West Coast time?

If I'm not mistaken, both #16037 and this change fix the issue at two different levels.
* #16037 considers that the issue is specific to ML and that we can & should work around it in MLlib.
* This change considers that the root cause is an inefficiency of treeAggregate in core, whose fix would enable a better fix in MLlib. I updated this PR to reflect this.

I might have misunderstood your comment 
https://github.com/apache/spark/pull/15905#pullrequestreview-8844854





[GitHub] spark issue #16038: [SPARK-18471][CORE] New treeAggregate overload for big l...

2016-11-28 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/16038
  
If you're right here, then https://github.com/apache/spark/pull/16037 does 
nothing, right?

In which case my understanding at 
https://github.com/apache/spark/pull/15905#pullrequestreview-8844854 is wrong.

But if your other change does work, and that would be explained by my 
understanding, then this doesn't help.

At 
https://issues.apache.org/jira/browse/SPARK-18471?focusedCommentId=15670795&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15670795
 you describe aggregating a dense vector, but this is different from having a 
dense vector zero value.







[GitHub] spark issue #16038: [SPARK-18471][CORE] New treeAggregate overload for big l...

2016-11-28 Thread AnthonyTruchet
Github user AnthonyTruchet commented on the issue:

https://github.com/apache/spark/pull/16038
  
It is related: if this PR gets accepted, then the fix for PR #16037 becomes trivial: replace treeAggregate with a call to `treeAggregateWithZeroGenerator(() => (Vectors.zeros(n), 0.0))`, in which case only `n` gets caught in the closure, if I'm not mistaken. This fix then becomes much easier to port to other similar use cases in MLlib.
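
Sketched out, the call-site change would look roughly like this (hypothetical API; `treeAggregateWithZeroGenerator` is the proposed overload and `seqOp`/`combOp` stand for the existing L-BFGS operators):

```scala
// Before: the n-dimensional dense zero is serialized along with every task.
val (gradientSum, lossSum) =
  data.treeAggregate((Vectors.zeros(n), 0.0))(seqOp, combOp)

// After (proposed): only the generator closure, capturing the Int `n`,
// is shipped; each executor builds its own dense zero locally.
val (gradientSum2, lossSum2) =
  data.treeAggregateWithZeroGenerator(() => (Vectors.zeros(n), 0.0))(seqOp, combOp)
```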







[GitHub] spark issue #16038: [SPARK-18471][CORE] New treeAggregate overload for big l...

2016-11-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16038
  
Can one of the admins verify this patch?





[GitHub] spark issue #16038: [SPARK-18471][CORE] New treeAggregate overload for big l...

2016-11-28 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/16038
  
This is not really related to the other change or the JIRA though. It also 
doesn't seem to address the problem as you describe it elsewhere. It's not that 
the zero value is big (a sparse 0 vector serializes into something tiny) but 
that some other representation gets into the closure?





[GitHub] spark issue #16038: [SPARK-18471][CORE] New treeAggregate overload for big l...

2016-11-28 Thread AnthonyTruchet
Github user AnthonyTruchet commented on the issue:

https://github.com/apache/spark/pull/16038
  
@srowen here is the companion PR to #16037.

