[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...
Github user akopich commented on the issue: https://github.com/apache/spark/pull/19565 @hhbyyh who shall we ping? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...
Github user hhbyyh commented on the issue: https://github.com/apache/spark/pull/19565 It's probably better to wait for the opinion from a committer. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...
Github user akopich commented on the issue: https://github.com/apache/spark/pull/19565 @hhbyyh So, I guess, I should just roll the refactoring back, right? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...
Github user akopich commented on the issue: https://github.com/apache/spark/pull/19565 @hhbyyh, is there a cluster I can use for this? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...
Github user akopich commented on the issue: https://github.com/apache/spark/pull/19565 @WeichenXu123 @jkbradley said, pings on Git don't work for him... --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19565 ok I agree this change. @jkbradley Can you take a look ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...
Github user akopich commented on the issue: https://github.com/apache/spark/pull/19565 ping @WeichenXu123 , @srowen , @hhbyyh Further comments? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19565 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19565 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83330/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19565 **[Test build #83330 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83330/testReport)** for PR 19565 at commit [`8bf6f6d`](https://github.com/apache/spark/commit/8bf6f6d9d594120ed190b167c19511bcd3abf453). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19565 **[Test build #83330 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83330/testReport)** for PR 19565 at commit [`8bf6f6d`](https://github.com/apache/spark/commit/8bf6f6d9d594120ed190b167c19511bcd3abf453). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...
Github user akopich commented on the issue: https://github.com/apache/spark/pull/19565 Okay... any idea why tests failed? It says ```ERROR: Step ?Publish JUnit test result report? failed: No test report files were found. Configuration error?``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19565 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83301/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19565 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...
Github user akopich commented on the issue: https://github.com/apache/spark/pull/19565 @WeichenXu123, in a case of large dataset this "adjustment" would have infinitesimal effect. (IMO, no adjustment is needed -- the expected number of non-empty docs in the same and does not depend on the order of filter and sample and equals to `docs.size * miniBatchFraction * fractionOfNonEmptyDocs`). So I believe, we all agree that sampling should go before filtering. I'll send a commit shortly. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19565 Yes I think when dataset is large enough, using the same `miniBatchFraction`, the result RDD size of "filter before sample" and "filter after sample" will be asymptotically equal, no matter how many empty elements in dataset. (correct me if I am wrong, I am a little confused about "miniBatchFraction should be adjusted proportionally", if adjusted, then asymptotically equality is broken?) If So, does it really effect the algo ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...
Github user akopich commented on the issue: https://github.com/apache/spark/pull/19565 @hhbyyh OK, but it returns almost the same number of elements. Anyway, the variance is going to be much smaller that in the case with sample before filter. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...
Github user hhbyyh commented on the issue: https://github.com/apache/spark/pull/19565 Let me know if I missed anything, but I don't quite catch the part > all the batches have the same length IMO `docs.sample(withReplacement = sampleWithReplacement, miniBatchFraction, randomGenerator.nextLong())` does not return the same number of documents during multiple function calls. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...
Github user akopich commented on the issue: https://github.com/apache/spark/pull/19565 @hhbyyh Yes, in this way we don't change semantics of `miniBatchFraction`. But is the way it is defined now actually correct? As I mentioned above, in the `upstram/master` the number of non-empty documents in the mini-batch is asymptotically normally distributed. Hence, the size of RDD fed to `treeAggregate` differs from one batch to another. While in this PR (filtering before sampling) all the batches have the same length. But then again, for large datasets this should make no difference. If nobody thinks this disparity of batch sizes is an issue, I won't object against sampling before filtering. @WeichenXu123, I believe, you were advocating for filter before sample. Do you still have the preference? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...
Github user hhbyyh commented on the issue: https://github.com/apache/spark/pull/19565 @akopich I'm actually leaning towards "filter after sample". 1. so we don't need to change `miniBatchFraction` in ` docs.sample(withReplacement = sampleWithReplacement, miniBatchFraction, randomGenerator.nextLong())`. 2. Minor: I'm not sure corpusSize = nonEmptyDoc is always good. I'd prefer to keep docs and corpusSize referring to the entire dataset. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...
Github user akopich commented on the issue: https://github.com/apache/spark/pull/19565 Ping @hhbyyh, @WeichenXu123, @srowen. Seems like the discussion is stuck. Does anybody think that the general approach implemented in this PR should be changed? Currently it is filtering before sampling with no caching. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19565 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83088/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19565 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19565 **[Test build #83088 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83088/testReport)** for PR 19565 at commit [`f376a7b`](https://github.com/apache/spark/commit/f376a7b97aefa700868c14da24fb6410358826df). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...
Github user akopich commented on the issue: https://github.com/apache/spark/pull/19565 @hhbyyh, in case of "filter before sample" in a local test the overhead is negligible. Regarding "sample before filter", you are right. There (strictly speaking) should be adjustment of `miniBatchFraction`. Which is why I do prefer "filter before sample". Also note, version "sample before filter" is logically equivalent to the current upstream/master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19565 **[Test build #83088 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83088/testReport)** for PR 19565 at commit [`f376a7b`](https://github.com/apache/spark/commit/f376a7b97aefa700868c14da24fb6410358826df). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...
Github user hhbyyh commented on the issue: https://github.com/apache/spark/pull/19565 I'm curious about the performance comparison, if "filter before sample" triggers a filter over the whole dataset for each `submitMiniBatch`, then there'll be some performance impact. And if "filter before sample" is used, IMO `miniBatchFraction` should be adjusted proportionally. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19565 Yes, it changed the probability of samples indeed compared with current code. But according to the comments coming from @jkbradley in #18924 , "in order to make **corpusSize**, batchSize, and nonEmptyDocsN all refer to the same filtered corpus", I think @jkbradley 's meaning is to filter out empty docs before sampling, the relationship should be: `batchSize = miniBatchFraction * corpusSize` and `batchSize == nonEmptyDocsN` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...
Github user akopich commented on the issue: https://github.com/apache/spark/pull/19565 And the empty docs were not explicitly filtered out. They've just been ignored in `submitMiniBatch`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...
Github user akopich commented on the issue: https://github.com/apache/spark/pull/19565 I'm saying they are not the same, but for larger datasets this should not matter. There is a change in logic. The hack with `val batchSize = (miniBatchFraction * corpusSize).ceil.toInt` is not used anymore. The function `updateLambda` uses the real number of non-empty docs. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/19565 I agree, they're the same. You said at https://github.com/apache/spark/pull/19565#issuecomment-339638791 that they weren't. But if you're saying the code already filters out empty docs further upstream anyway, then there is no change in logic, just where the filtering happens. Or did I misunderstand that part? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...
Github user akopich commented on the issue: https://github.com/apache/spark/pull/19565 Consider the following scenario. Let `docs` be an RDD containing 1000 empty documents and 1000 non-empty documents and let `miniBatchFraction = 0.05`. Assume, we use `filter(...).sample(...)`. Then the resulted RDD will have around `50` elements. If we use `sample(...).filter(...)` instead, the `sample` returns around `100` elements. Now the number of elements in the RDD returned by `filter` is normally distributed. The expectation is `50` again though. Do I miss smth? However, for larger samples this shouldn't make any difference. On the purpose of the issue: there were two different variables `batchSize`, and `nonEmptyDocsN` which could not be used interchangeably. The purpose is to submit a batch containing no empty docs which makes the two variables refer the same value. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/19565 I assume not-selecting a record in a sample is cheaper than just about any other operation, including filtering on a predicate. All else equal, I'd rather sample, then evaluate a predicate on only the sample. The two versions produce the same result though. Either way, every x% sample of non-empty docs appears with equal probability. Yes, I doubt one is meaningfully faster than the other overall though. If that's not the motive though, and there's not a functional change, what's the purpose here? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...
Github user akopich commented on the issue: https://github.com/apache/spark/pull/19565 @WeichenXu123, yes there indeed is a difference in logic. Eventually it boils down to semantics of `miniBatchFraction`. If it is a fraction of non-empty documents being sampled, the version with `filter` going first is correct. If it's a fraction of documents (empty and non-empty) being sampled, then the version with `sample` going first is correct. To me the first version seems more reasonable (who cares about empty docs anyway). @srowen, if I get it right, you would prefer the second option. Why? @WeichenXu123, I agree with you: filtering introduces a minimal overhead. @srowen, regarding performance... I don't actually think it makes any difference unless complexity of `sample` depends on the length of the parent RDD. In all the subsequent computations empty documents can be handled effectively. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/19565 Filtering after sampling makes more sense. Though sampling isn't deterministic, it doesn't change the probability that any particular sample is produced. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19565 @akopich IMO the filter won't cost too much, don't worry about the performance. (Or you can make a test to make sure) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...
Github user akopich commented on the issue: https://github.com/apache/spark/pull/19565 I am sure that caching may by avoided here. Hence, it should not be used. @srowen, maybe I don't get something, but I'm afraid, that currently lineage for a single mini-batch submission looks like this `docs.filter(nonEmpty).sample(minibatchFraction).treeAggregate(...)`. And I'm afraid that for each mini-batch `filter` will be performed all over again. But if we have smth like `docs.sample(minibatchFraction).filter(nonEmpty).treeAggregate(...)`, this will be avoided. And no caching is needed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/19565 Regarding caching: I think that can be ignored for purposes of this change. All this does is add a filter, and it doesn't cause an RDD to computed more than it was before. The only question is whether the filtering is worth it; does it filter out enough that it makes later work faster? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19565 @akopich If you want to cache the input dataset, create JIAR to discuss it first. It's another issue I think. This JIAR also related to input caching issues: https://issues.apache.org/jira/browse/SPARK-19422 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...
Github user hhbyyh commented on the issue: https://github.com/apache/spark/pull/19565 I wonder if we should add cache() for lda training data, even not for this feature. @srowen Not sure where we're on caching the training data or not for different algorithms. Appreciate your advice. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19565 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83058/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19565 **[Test build #83058 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83058/testReport)** for PR 19565 at commit [`8b7f30b`](https://github.com/apache/spark/commit/8b7f30bdb11abcd3efe2204c614695b286c0ae95). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19565 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19565 **[Test build #83058 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83058/testReport)** for PR 19565 at commit [`8b7f30b`](https://github.com/apache/spark/commit/8b7f30bdb11abcd3efe2204c614695b286c0ae95). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19565 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83045/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19565 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19565 **[Test build #83045 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83045/testReport)** for PR 19565 at commit [`1d923f5`](https://github.com/apache/spark/commit/1d923f5908fd3a7a2ba8fa284d909aeb914ebc0a). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19565 **[Test build #83045 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83045/testReport)** for PR 19565 at commit [`1d923f5`](https://github.com/apache/spark/commit/1d923f5908fd3a7a2ba8fa284d909aeb914ebc0a). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...
Github user akopich commented on the issue: https://github.com/apache/spark/pull/19565 Or (and I think it would be the most efficient approach) we can just stick in the check for emptiness of the document to the `seqOp` of `treeAggregate`. However, it doesn't look like "filtering out beforehand". So, would this be OK? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...
Github user akopich commented on the issue: https://github.com/apache/spark/pull/19565 Now I feel that filtering empty docs out in the `initialize` is not a good idea, because it will be performed as many times, as the number of times `sample` in `next` gets called. Right? Alternatively we can cache `this.docs`, but it's a waste of space... --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19565 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83040/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19565 **[Test build #83040 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83040/testReport)** for PR 19565 at commit [`40685ee`](https://github.com/apache/spark/commit/40685ee960ef2c0e26b31ef96d1f3c8c974d3851). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19565 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19565 **[Test build #83040 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83040/testReport)** for PR 19565 at commit [`40685ee`](https://github.com/apache/spark/commit/40685ee960ef2c0e26b31ef96d1f3c8c974d3851). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19565 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83015/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19565 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19565 **[Test build #83015 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83015/testReport)** for PR 19565 at commit [`721f235`](https://github.com/apache/spark/commit/721f235934f26e75172d39f0398365606616267f). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19565 **[Test build #83015 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83015/testReport)** for PR 19565 at commit [`721f235`](https://github.com/apache/spark/commit/721f235934f26e75172d39f0398365606616267f). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19565: [SPARK-22111][MLLIB] OnlineLDAOptimizer should filter ou...
Github user akopich commented on the issue: https://github.com/apache/spark/pull/19565 @WeichenXu123, @hhbyyh, looking forward to your opinion. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org