[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...

2018-02-03 Thread akopich
Github user akopich commented on the issue:

https://github.com/apache/spark/pull/19565
  
@hhbyyh who shall we ping? 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org




2018-02-03 Thread hhbyyh
Github user hhbyyh commented on the issue:

https://github.com/apache/spark/pull/19565
  
It's probably better to wait for the opinion from a committer. 


---


2018-02-01 Thread akopich
Github user akopich commented on the issue:

https://github.com/apache/spark/pull/19565
  
@hhbyyh So, I guess, I should just roll the refactoring back, right?


---


2017-11-13 Thread akopich
Github user akopich commented on the issue:

https://github.com/apache/spark/pull/19565
  
@hhbyyh, is there a cluster I can use for this? 


---


2017-11-12 Thread akopich
Github user akopich commented on the issue:

https://github.com/apache/spark/pull/19565
  
@WeichenXu123 
@jkbradley said that pings on GitHub don't work for him...


---


2017-11-07 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/19565
  
OK, I agree with this change. @jkbradley, can you take a look?


---


2017-11-07 Thread akopich
Github user akopich commented on the issue:

https://github.com/apache/spark/pull/19565
  
ping @WeichenXu123 , @srowen , @hhbyyh 
Further comments?


---


2017-11-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19565
  
Merged build finished. Test PASSed.


---


2017-11-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19565
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83330/
Test PASSed.


---


2017-11-02 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19565
  
**[Test build #83330 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83330/testReport)**
 for PR 19565 at commit 
[`8bf6f6d`](https://github.com/apache/spark/commit/8bf6f6d9d594120ed190b167c19511bcd3abf453).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


2017-11-02 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19565
  
**[Test build #83330 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83330/testReport)**
 for PR 19565 at commit 
[`8bf6f6d`](https://github.com/apache/spark/commit/8bf6f6d9d594120ed190b167c19511bcd3abf453).


---


2017-11-01 Thread akopich
Github user akopich commented on the issue:

https://github.com/apache/spark/pull/19565
  
Okay... any idea why the tests failed? It says 
```ERROR: Step 'Publish JUnit test result report' failed: No test report 
files were found. Configuration error?```



---


2017-11-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19565
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83301/
Test FAILed.


---


2017-11-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19565
  
Merged build finished. Test FAILed.


---


2017-11-01 Thread akopich
Github user akopich commented on the issue:

https://github.com/apache/spark/pull/19565
  
@WeichenXu123, for a large dataset this "adjustment" would have a 
negligible effect. (IMO, no adjustment is needed: the expected number of 
non-empty docs is the same in both cases, does not depend on the order of 
filter and sample, and equals 
`docs.size * miniBatchFraction * fractionOfNonEmptyDocs`.) 
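
A short derivation of that expectation, under the assumption (not stated in the thread) that `sample` keeps each document independently with probability $p$ = `miniBatchFraction`. Here $n$ = `docs.size`, $q$ = `fractionOfNonEmptyDocs`, $N$ is the number of non-empty docs in the mini-batch, and $X_i$ is the indicator that document $i$ is sampled:

$$
\mathbb{E}[N] = \sum_{i \,:\, \text{doc}_i \text{ non-empty}} \mathbb{E}[X_i] = n\,q\,p,
\qquad
\operatorname{Var}[N] = n\,q\,p\,(1-p).
$$

Filtering does not touch the $X_i$, so under this assumption neither quantity depends on whether the filter runs before or after the sample.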

So I believe, we all agree that sampling should go before filtering. I'll 
send a commit shortly. 


---


2017-11-01 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/19565
  
Yes, I think that when the dataset is large enough and the same 
`miniBatchFraction` is used, the resulting RDD sizes of "filter before sample" 
and "filter after sample" will be asymptotically equal, no matter how many 
empty elements the dataset contains. (Correct me if I am wrong; I am a little 
confused about "miniBatchFraction should be adjusted proportionally". If it is 
adjusted, isn't the asymptotic equality broken?) If so, does the order really 
affect the algorithm?


---


2017-10-31 Thread akopich
Github user akopich commented on the issue:

https://github.com/apache/spark/pull/19565
  
@hhbyyh OK, but it returns almost the same number of elements. Anyway, the 
variance is going to be much smaller than in the case with sample before filter.


---


2017-10-31 Thread hhbyyh
Github user hhbyyh commented on the issue:

https://github.com/apache/spark/pull/19565
  
Let me know if I missed anything, but I don't quite catch the part 

> all the batches have the same length

 IMO
`docs.sample(withReplacement = sampleWithReplacement, miniBatchFraction, 
randomGenerator.nextLong())` does not return the same number of documents 
during multiple function calls. 


---


2017-10-31 Thread akopich
Github user akopich commented on the issue:

https://github.com/apache/spark/pull/19565
  
@hhbyyh 

Yes, in this way we don't change the semantics of `miniBatchFraction`. But 
is the way it is defined now actually correct? As I mentioned above, in 
`upstream/master` the number of non-empty documents in the mini-batch is 
asymptotically normally distributed. Hence, the size of the RDD fed to 
`treeAggregate` differs from one batch to another, while in this PR (filtering 
before sampling) all the batches have the same length.

But then again, for large datasets this should make no difference. If 
nobody thinks this disparity of batch sizes is an issue, I won't object to 
sampling before filtering.

@WeichenXu123, I believe you were advocating for filter before sample. Do 
you still have that preference?


---


2017-10-31 Thread hhbyyh
Github user hhbyyh commented on the issue:

https://github.com/apache/spark/pull/19565
  
@akopich I'm actually leaning towards "filter after sample". 

1. so we don't need to change `miniBatchFraction` in 
` docs.sample(withReplacement = sampleWithReplacement, miniBatchFraction,
randomGenerator.nextLong())`. 

2. Minor: I'm not sure corpusSize = nonEmptyDoc is always good. I'd prefer 
to keep docs and corpusSize referring to the entire dataset. 





---


2017-10-31 Thread akopich
Github user akopich commented on the issue:

https://github.com/apache/spark/pull/19565
  
Ping @hhbyyh, @WeichenXu123, @srowen. 

Seems like the discussion is stuck. Does anybody think that the general 
approach implemented in this PR should be changed? Currently it is filtering 
before sampling with no caching. 


---


2017-10-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19565
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83088/
Test PASSed.


---


2017-10-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19565
  
Merged build finished. Test PASSed.


---


2017-10-26 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19565
  
**[Test build #83088 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83088/testReport)**
 for PR 19565 at commit 
[`f376a7b`](https://github.com/apache/spark/commit/f376a7b97aefa700868c14da24fb6410358826df).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


2017-10-26 Thread akopich
Github user akopich commented on the issue:

https://github.com/apache/spark/pull/19565
  
@hhbyyh, in the case of "filter before sample", the overhead in a local 
test is negligible. 

Regarding "sample before filter", you are right: strictly speaking, 
`miniBatchFraction` should be adjusted. That is why I prefer "filter 
before sample".

Also note that the "sample before filter" version is logically equivalent 
to the current upstream/master.


---


2017-10-26 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19565
  
**[Test build #83088 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83088/testReport)**
 for PR 19565 at commit 
[`f376a7b`](https://github.com/apache/spark/commit/f376a7b97aefa700868c14da24fb6410358826df).


---


2017-10-26 Thread hhbyyh
Github user hhbyyh commented on the issue:

https://github.com/apache/spark/pull/19565
  
I'm curious about the performance comparison, if "filter before sample" 
triggers a filter over the whole dataset for each `submitMiniBatch`, then 
there'll be some performance impact.

And if "filter before sample" is used, IMO `miniBatchFraction`  should be 
adjusted proportionally.



---


2017-10-26 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/19565
  
Yes, it does change the sampling probabilities compared with the current 
code.
But according to @jkbradley's comment in #18924, "in order 
to make **corpusSize**, batchSize, and nonEmptyDocsN all refer to the same 
filtered corpus", I think @jkbradley means that empty docs should be filtered 
out before sampling, so the relationship should be:
`batchSize = miniBatchFraction * corpusSize` and 
`batchSize == nonEmptyDocsN`


---


2017-10-26 Thread akopich
Github user akopich commented on the issue:

https://github.com/apache/spark/pull/19565
  
And the empty docs were not explicitly filtered out. They've just been 
ignored in `submitMiniBatch`.


---


2017-10-26 Thread akopich
Github user akopich commented on the issue:

https://github.com/apache/spark/pull/19565
  
I'm saying they are not the same, but for larger datasets this should not 
matter.

There is a change in logic. The hack with 
`val batchSize = (miniBatchFraction * corpusSize).ceil.toInt` 
is not used anymore. The function `updateLambda` uses the real number of 
non-empty docs. 
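
To illustrate the gap between the nominal batch size and the real count, here is a minimal plain-Python sketch (not the Spark code). The half-empty corpus, the seed, and the per-document sampling rule are assumptions made up for the example; only the `ceil` formula mirrors the line quoted above.

```python
import math
import random

corpus_size = 2000            # all documents, empty or not (assumed)
mini_batch_fraction = 0.05

# Nominal size, mirroring `(miniBatchFraction * corpusSize).ceil.toInt`.
nominal_batch_size = math.ceil(mini_batch_fraction * corpus_size)

# Hypothetical corpus: every even-indexed document is empty.
docs = [[] if i % 2 == 0 else [i] for i in range(corpus_size)]

# Assumed sampling rule: keep each document independently with
# probability mini_batch_fraction.
rng = random.Random(0)
batch = [d for d in docs if rng.random() < mini_batch_fraction]

# The real number of non-empty documents in this particular batch.
non_empty_docs_n = sum(1 for d in batch if d)

print(nominal_batch_size, len(batch), non_empty_docs_n)
```

The nominal value ignores both the randomness of the sample and the empty documents, which is why the two numbers generally disagree.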


---


2017-10-26 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/19565
  
I agree, they're the same. You said at 
https://github.com/apache/spark/pull/19565#issuecomment-339638791 that they 
weren't.

But if you're saying the code already filters out empty docs further 
upstream anyway, then there is no change in logic, just where the filtering 
happens. Or did I misunderstand that part?


---


2017-10-26 Thread akopich
Github user akopich commented on the issue:

https://github.com/apache/spark/pull/19565
  
Consider the following scenario. Let `docs` be an RDD containing 1000 empty 
documents and 1000 non-empty documents and let `miniBatchFraction = 0.05`.

Assume we use `filter(...).sample(...)`. Then the resulting RDD will have 
around `50` elements. 

If we use `sample(...).filter(...)` instead, `sample` returns around 
`100` elements. Now the number of elements in the RDD returned by `filter` is 
normally distributed. The expectation is `50` again, though. 
Am I missing something?

However, for larger samples this shouldn't make any difference. 

On the purpose of the issue: there were two different variables, 
`batchSize` and `nonEmptyDocsN`, which could not be used interchangeably. The 
purpose is to submit a batch containing no empty docs, which makes the two 
variables refer to the same value. 
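
The scenario can be checked with a small plain-Python simulation (no Spark involved). It rests on one assumption: that sampling without replacement keeps each element independently with probability `miniBatchFraction` (Bernoulli sampling). Under that assumption, both orderings produce a mini-batch whose non-empty count has mean 50 and the same spread; `bernoulli_sample` and `batch_sizes` are hypothetical helper names.

```python
import random
from statistics import mean, pvariance

def bernoulli_sample(items, fraction, rng):
    """Keep each item independently with probability `fraction`."""
    return [x for x in items if rng.random() < fraction]

def non_empty(doc):
    return len(doc) > 0

def batch_sizes(docs, fraction, filter_first, trials, seed):
    """Number of non-empty docs per mini-batch over `trials` draws."""
    rng = random.Random(seed)
    sizes = []
    for _ in range(trials):
        if filter_first:
            kept = [d for d in docs if non_empty(d)]
            batch = bernoulli_sample(kept, fraction, rng)
        else:
            sampled = bernoulli_sample(docs, fraction, rng)
            batch = [d for d in sampled if non_empty(d)]
        sizes.append(len(batch))
    return sizes

# 1000 empty and 1000 non-empty documents, as in the scenario above.
docs = [[] for _ in range(1000)] + [[1, 2, 3] for _ in range(1000)]
filter_then_sample = batch_sizes(docs, 0.05, True, trials=1000, seed=1)
sample_then_filter = batch_sizes(docs, 0.05, False, trials=1000, seed=2)

print(mean(filter_then_sample), pvariance(filter_then_sample))
print(mean(sample_then_filter), pvariance(sample_then_filter))
```

If the sampler instead returned an exact number of elements, the filter-first variant would have zero variance, so the conclusion depends on the sampling scheme actually used.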


---


2017-10-26 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/19565
  
I assume not-selecting a record in a sample is cheaper than just about any 
other operation, including filtering on a predicate. All else equal, I'd rather 
sample, then evaluate a predicate on only the sample.

The two versions produce the same result though. Either way, every x% 
sample of non-empty docs appears with equal probability.

Yes, I doubt one is meaningfully faster than the other overall though. If 
that's not the motive though, and there's not a functional change, what's the 
purpose here?


---


2017-10-26 Thread akopich
Github user akopich commented on the issue:

https://github.com/apache/spark/pull/19565
  
@WeichenXu123, yes, there is indeed a difference in logic. Ultimately it 
boils down to the semantics of `miniBatchFraction`. If it is the fraction of 
non-empty documents being sampled, the version with `filter` going first is 
correct. If it is the fraction of all documents (empty and non-empty) being 
sampled, then the version with `sample` going first is correct. To me the first 
version seems more reasonable (who cares about empty docs anyway). @srowen, if 
I get it right, you would prefer the second option. Why? 

@WeichenXu123, I agree with you: filtering introduces a minimal overhead. 

@srowen, regarding performance... I don't actually think it makes any 
difference unless complexity of `sample` depends on the length of the parent 
RDD. In all the subsequent computations empty documents can be handled 
effectively. 




---


2017-10-26 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/19565
  
Filtering after sampling makes more sense. Though sampling isn't 
deterministic, it doesn't change the probability that any particular sample is 
produced.


---


2017-10-26 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/19565
  
@akopich IMO the filter won't cost too much, so don't worry about the 
performance. (Or you can run a test to make sure.)


---


2017-10-26 Thread akopich
Github user akopich commented on the issue:

https://github.com/apache/spark/pull/19565
  
I am sure that caching can be avoided here. Hence, it should not be used. 

@srowen, maybe I don't get something, but I'm afraid, that currently 
lineage for a single mini-batch submission looks like this 
`docs.filter(nonEmpty).sample(minibatchFraction).treeAggregate(...)`. 

And I'm afraid that for each mini-batch `filter` will be performed all over 
again. But if we have something like 
`docs.sample(minibatchFraction).filter(nonEmpty).treeAggregate(...)`,
this will be avoided, and no caching is needed. 
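
A rough way to see the difference between the two lineages is to count how often the emptiness predicate runs per mini-batch. The sketch below is plain Python, not Spark: `run_minibatch` is a hypothetical helper, sampling is assumed to keep each document independently with probability `fraction`, and the count covers predicate evaluations only (either way, an uncached parent collection is still scanned once per mini-batch).

```python
import random

def run_minibatch(docs, fraction, filter_first, rng):
    """Return (batch, number of times the emptiness predicate ran)."""
    calls = 0

    def non_empty(doc):
        nonlocal calls
        calls += 1
        return len(doc) > 0

    def sample(items):
        # Assumed sampling rule: independent keep/drop per element.
        return [x for x in items if rng.random() < fraction]

    if filter_first:
        batch = sample([d for d in docs if non_empty(d)])
    else:
        batch = [d for d in sample(docs) if non_empty(d)]
    return batch, calls

# Hypothetical corpus of 2000 documents, half of them empty.
docs = [[] if i % 2 == 0 else [i] for i in range(2000)]
rng = random.Random(42)
_, calls_filter_first = run_minibatch(docs, 0.05, True, rng)
_, calls_sample_first = run_minibatch(docs, 0.05, False, rng)

print(calls_filter_first, calls_sample_first)
```

With the filter first, the predicate touches every document on every mini-batch; with the sample first, it only touches the sampled ones.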


---


2017-10-26 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/19565
  
Regarding caching: I think that can be ignored for purposes of this change. 
All this does is add a filter, and it doesn't cause an RDD to be computed more 
often than it was before.

The only question is whether the filtering is worth it; does it filter out 
enough that it makes later work faster?


---


2017-10-25 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/19565
  
@akopich If you want to cache the input dataset, please create a JIRA to 
discuss it first; I think it's a separate issue. This JIRA is also related to 
input caching: https://issues.apache.org/jira/browse/SPARK-19422


---


2017-10-25 Thread hhbyyh
Github user hhbyyh commented on the issue:

https://github.com/apache/spark/pull/19565
  
I wonder if we should add cache() for the LDA training data, even if not 
for this feature. 

@srowen I'm not sure where we stand on caching the training data for 
different algorithms. I'd appreciate your advice.


---


2017-10-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19565
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83058/
Test PASSed.


---


2017-10-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19565
  
**[Test build #83058 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83058/testReport)**
 for PR 19565 at commit 
[`8b7f30b`](https://github.com/apache/spark/commit/8b7f30bdb11abcd3efe2204c614695b286c0ae95).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


2017-10-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19565
  
Merged build finished. Test PASSed.


---


2017-10-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19565
  
**[Test build #83058 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83058/testReport)**
 for PR 19565 at commit 
[`8b7f30b`](https://github.com/apache/spark/commit/8b7f30bdb11abcd3efe2204c614695b286c0ae95).


---


2017-10-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19565
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83045/
Test PASSed.


---


2017-10-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19565
  
Merged build finished. Test PASSed.


---


2017-10-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19565
  
**[Test build #83045 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83045/testReport)**
 for PR 19565 at commit 
[`1d923f5`](https://github.com/apache/spark/commit/1d923f5908fd3a7a2ba8fa284d909aeb914ebc0a).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


2017-10-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19565
  
**[Test build #83045 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83045/testReport)**
 for PR 19565 at commit 
[`1d923f5`](https://github.com/apache/spark/commit/1d923f5908fd3a7a2ba8fa284d909aeb914ebc0a).


---


2017-10-25 Thread akopich
Github user akopich commented on the issue:

https://github.com/apache/spark/pull/19565
  
Or (and I think it would be the most efficient approach) we can just put 
the emptiness check into the `seqOp` of `treeAggregate`. However, that 
doesn't look like "filtering out beforehand". So, would this be OK?
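
As a sketch of that idea in plain Python (not the Spark API), the fold below skips empty documents inside the sequential operator and counts the non-empty ones as a by-product. The accumulator layout `(token_total, non_empty_count)`, the two partitions, and the names `seq_op` and `comb_op` are assumptions for illustration; `functools.reduce` stands in for the per-partition and cross-partition steps of an aggregate.

```python
from functools import reduce

def seq_op(acc, doc):
    """Fold one document into (token_total, non_empty_count)."""
    if not doc:              # emptiness check lives inside the fold
        return acc
    total, count = acc
    return total + sum(doc), count + 1

def comb_op(a, b):
    """Merge two partial accumulators."""
    return a[0] + b[0], a[1] + b[1]

# Two hypothetical partitions of a sampled mini-batch;
# each document is a list of token counts, and [] is an empty document.
partitions = [[[1, 2], [], [3]], [[], [4, 5, 6]]]

partials = [reduce(seq_op, part, (0, 0)) for part in partitions]
stats, non_empty_docs_n = reduce(comb_op, partials, (0, 0))

print(stats, non_empty_docs_n)
```

This avoids a separate filter pass, at the cost of mixing the emptiness check into the aggregation logic.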




---


2017-10-25 Thread akopich
Github user akopich commented on the issue:

https://github.com/apache/spark/pull/19565
  
Now I feel that filtering empty docs out in `initialize` is not a good 
idea, because the filter will be re-executed every time `sample` in `next` 
gets called. Right?

Alternatively, we can cache `this.docs`, but it's a waste of space...


---


2017-10-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19565
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83040/
Test PASSed.


---


2017-10-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19565
  
**[Test build #83040 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83040/testReport)**
 for PR 19565 at commit 
[`40685ee`](https://github.com/apache/spark/commit/40685ee960ef2c0e26b31ef96d1f3c8c974d3851).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


2017-10-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19565
  
Merged build finished. Test PASSed.


---


2017-10-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19565
  
**[Test build #83040 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83040/testReport)**
 for PR 19565 at commit 
[`40685ee`](https://github.com/apache/spark/commit/40685ee960ef2c0e26b31ef96d1f3c8c974d3851).


---


2017-10-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19565
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83015/
Test PASSed.


---


2017-10-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19565
  
Merged build finished. Test PASSed.


---


2017-10-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19565
  
**[Test build #83015 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83015/testReport)**
 for PR 19565 at commit 
[`721f235`](https://github.com/apache/spark/commit/721f235934f26e75172d39f0398365606616267f).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


2017-10-24 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19565
  
**[Test build #83015 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83015/testReport)**
 for PR 19565 at commit 
[`721f235`](https://github.com/apache/spark/commit/721f235934f26e75172d39f0398365606616267f).


---


2017-10-24 Thread akopich
Github user akopich commented on the issue:

https://github.com/apache/spark/pull/19565
  
@WeichenXu123, @hhbyyh, looking forward to your opinion.


---
