[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...
Github user willb commented on the issue: https://github.com/apache/spark/pull/13440 @srowen @thunterdb this PR passes all tests and merges cleanly -- can you take another look? It's been open for quite a while now. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...
Github user willb commented on the issue: https://github.com/apache/spark/pull/13440 I agree with @felixcheung -- @srowen or @thunterdb, can you take a look at this?
[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...
Github user willb commented on the issue: https://github.com/apache/spark/pull/13440 @HyukjinKwon @thunterdb can you all take a look at this? It's been under review for quite a long time! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...
Github user willb commented on the issue: https://github.com/apache/spark/pull/13440 @thunterdb can you take a look at this now that 2.2 is out?
[GitHub] spark pull request #15150: [SPARK-17595] [MLLib] Use a bounded priority queu...
Github user willb commented on a diff in the pull request: https://github.com/apache/spark/pull/15150#discussion_r79641840

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -580,10 +581,14 @@ class Word2VecModel private[spark] (
       ind += 1
     }
-    val scored = wordList.zip(cosVec).toSeq.sortBy(-_._2)
+    val pq = new BoundedPriorityQueue[(String, Double)](num + 1)(Ordering.by(_._2))
+
+    pq ++= wordList.zip(cosVec)
--- End diff --

Agreed. I'll push as soon as I finish running tests locally.
[GitHub] spark pull request #15150: [SPARK-17595] [MLLib] Use a bounded priority queu...
Github user willb commented on a diff in the pull request: https://github.com/apache/spark/pull/15150#discussion_r79637489

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -580,10 +581,14 @@ class Word2VecModel private[spark] (
       ind += 1
     }
-    val scored = wordList.zip(cosVec).toSeq.sortBy(-_._2)
+    val pq = new BoundedPriorityQueue[(String, Double)](num + 1)(Ordering.by(_._2))
+
+    pq ++= wordList.zip(cosVec)
--- End diff --

Yeah, I guess I figured that since we were allocating the tuples anyway, a single copy of the array wasn't a lot of extra overhead vs. having slightly cleaner code. But I'm happy to make the change if you think it's a good idea. I agree that allocating an array just to iterate through it isn't ideal.

(I'm ambivalent, partially because I don't have a great sense for the vocabulary sizes people typically use this code for in the wild. For my example corpus, my patch as-is, zipping collection iterators, and explicit iteration over indices are all more or less equivalent in time performance. My intuition is that allocating even the single array from `zip` is a bad deal if we're dealing with a very large vocabulary, but probably not if the typical case is on the order of 10^5 words or less.)
[GitHub] spark pull request #15150: [SPARK-17595] [MLLib] Use a bounded priority queu...
Github user willb commented on a diff in the pull request: https://github.com/apache/spark/pull/15150#discussion_r79600882

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -580,7 +581,11 @@ class Word2VecModel private[spark] (
       ind += 1
     }
-    val scored = wordList.zip(cosVec).toSeq.sortBy(-_._2)
+    val pq = new BoundedPriorityQueue[(String, Double)](num + 1)(Ordering.by(_._2))
+
+    pq ++= wordList.zip(cosVec)
+
+    val scored = pq.toSeq.sortBy(-_._2)
     val filtered = wordOpt match {
       case Some(w) => scored.take(num + 1).filter(tup => w != tup._1)
--- End diff --

Thanks, @hhbyyh!
[GitHub] spark issue #15150: [SPARK-17595] [MLLib] Use a bounded priority queue to fi...
Github user willb commented on the issue: https://github.com/apache/spark/pull/15150 Thanks for the feedback, @srowen! I've made the changes.
[GitHub] spark pull request #15150: Use a bounded priority queue to find synonyms in ...
GitHub user willb opened a pull request: https://github.com/apache/spark/pull/15150

Use a bounded priority queue to find synonyms in Word2VecModel

## What changes were proposed in this pull request?

The code in `Word2VecModel.findSynonyms` to choose the vocabulary elements with the highest similarity to the query vector currently sorts the collection of similarities for every vocabulary element. This involves making multiple copies of the collection of similarities while doing a (relatively) expensive sort. It would be more efficient to find the best matches by maintaining a bounded priority queue and populating it with a single pass over the vocabulary, and that is exactly what this patch does.

## How was this patch tested?

This patch adds no user-visible functionality and its correctness should be exercised by existing tests. To ensure that this approach is actually faster, I made a microbenchmark for `findSynonyms`:

```
object W2VTiming {
  import org.apache.spark.{SparkContext, SparkConf}
  import org.apache.spark.mllib.feature.Word2VecModel

  def run(modelPath: String, scOpt: Option[SparkContext] = None) {
    val sc = scOpt.getOrElse(
      new SparkContext(new SparkConf(true).setMaster("local[*]").setAppName("test")))
    val model = Word2VecModel.load(sc, modelPath)
    val keys = model.getVectors.keys
    val start = System.currentTimeMillis
    for (key <- keys) {
      model.findSynonyms(key, 5)
      model.findSynonyms(key, 10)
      model.findSynonyms(key, 25)
      model.findSynonyms(key, 50)
    }
    val finish = System.currentTimeMillis
    println("run completed in " + (finish - start) + "ms")
  }
}
```

I ran this test on a model generated from the complete works of Jane Austen and found that the new approach was over 3x faster than the old approach. (As the `num` argument to `findSynonyms` approaches the vocabulary size, the new approach will have less of an advantage over the old one.)
You can merge this pull request into a Git repository by running: $ git pull https://github.com/willb/spark SPARK-17595 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15150.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15150 commit ddba6577a6ece67f77b3757da859c88fa9065c04 Author: William Benton <wi...@redhat.com> Date: 2016-09-19T14:35:57Z Use a bounded priority queue to find synonyms in Word2VecModel
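The core idea in this PR — keep only the best `num + 1` candidates in a bounded priority queue during a single pass, instead of sorting every similarity — can be illustrated outside Spark. Below is a minimal Python sketch using a plain `heapq` min-heap in place of Spark's `BoundedPriorityQueue`; the function name and data are illustrative only, not the actual Spark code:

```python
import heapq

def top_k_synonyms(scored, k):
    """Return the k highest-scoring (word, similarity) pairs in descending
    order, using a bounded min-heap of size k instead of sorting the whole
    collection (O(n log k) rather than O(n log n))."""
    heap = []  # min-heap on similarity; the weakest kept candidate is at heap[0]
    for word, sim in scored:
        if len(heap) < k:
            heapq.heappush(heap, (sim, word))
        elif sim > heap[0][0]:
            # evict the weakest candidate and insert the better one in one step
            heapq.heapreplace(heap, (sim, word))
    # only the k survivors need sorting at the end
    return [(w, s) for s, w in sorted(heap, reverse=True)]

scored = [("china", 0.98), ("japan", 0.75), ("taiwan", 0.91), ("korea", 0.62)]
print(top_k_synonyms(scored, 2))  # [('china', 0.98), ('taiwan', 0.91)]
```

As the PR description notes, the advantage shrinks as `k` approaches the vocabulary size, since the heap then holds nearly everything and the final sort approaches a full sort.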
[GitHub] spark issue #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms no long...
Github user willb commented on the issue: https://github.com/apache/spark/pull/15105 Thanks for the feedback, @srowen! I think 18e6bfe addresses everything from a code perspective, but it missed removing the comment about assuming that distinct words have distinct vector representations, so I've pushed another commit that removes that comment.
[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...
Github user willb commented on a diff in the pull request: https://github.com/apache/spark/pull/15105#discussion_r79157601

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala ---
@@ -227,7 +227,7 @@ class Word2VecModel private[ml] (
    */
   @Since("1.5.0")
   def findSynonyms(word: String, num: Int): DataFrame = {
-    findSynonyms(wordVectors.transform(word), num)
+    findSynonyms(wordVectors.transform(word), num, Some(word))
--- End diff --

In this case (and similarly in [`Word2VecModelWrapper`](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/Word2VecModelWrapper.scala)) I opted to call the three-argument version because the wrappers both explicitly convert their argument to a vector before calling `findSynonyms` on the underlying model (and so `wordOpt` would not be defined if the wrapper were invoked with a word). If we were to make the three-argument `findSynonyms` private we wouldn't be able to share a code path in the wrapper classes and would need to duplicate the code to tidy and reformat results in both methods (data frame creation in this case, unzipping and `asJava` in the Python model wrapper) or factor it out to a separate method. Let me know how you want me to proceed here. I agree that updating the docs makes sense and will make it clearer to future maintainers as well.
[GitHub] spark issue #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms no long...
Github user willb commented on the issue: https://github.com/apache/spark/pull/15105 @srowen Yes. But if we're querying for words similar to a given vector, then there's no word to filter out (and, indeed, no way to know which word we might want to filter out if multiple words might map to the same vector representation).
[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...
Github user willb commented on a diff in the pull request: https://github.com/apache/spark/pull/15105#discussion_r79005872

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -566,8 +582,8 @@ class Word2VecModel private[spark] (
     wordList.zip(cosVec)
       .toSeq
       .sortBy(-_._2)
-      .take(num + 1)
-      .tail
+      .filter(tup => wordOpt.map(w => !w.equals(tup._1)).getOrElse(true) && tup._2 != 1.0d)
--- End diff --

Consider the case where you're passing in a vector that is extremely similar to the representation of a word in the vocabulary but that is not itself the representation of a word in the vocabulary. (A concrete example is [in this test I added](https://github.com/willb/spark/blob/fix/findSynonyms/mllib/src/test/scala/org/apache/spark/mllib/feature/Word2VecSuite.scala#L72).) In this case, `wordOpt` is not defined (because you are querying for words whose vector representations are similar to a given vector), but you nonetheless are not interested in discarding the best match, because it is not a trivial match (that is, it is not going to be your query term).

Related to your other comment (and discussion elsewhere on this PR), I think we could make a case for changing the documented behavior (especially since it is only documented as such in pyspark) in the case where `findSynonyms` is invoked with the vector representation of a word that is _in_ the vocabulary. Instead of rejecting the best match in that case, we could return it. The argument there is that such a result is telling you something you didn't necessarily know (otherwise, you'd probably be querying for a word and not a vector) and that there is an easy way to identify that such a match is trivially the best match. I recognize that changing documented behavior is a big deal, but in this case it seems like it could be the best way to address a minor bug.
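The filtering behavior under discussion — drop the best match only when it is trivially the query itself — can be sketched as follows. This is a hedged Python illustration, not the actual Spark code (the real patch also discards maximally similar matches via the `tup._2 != 1.0d` check):

```python
def filter_trivial_matches(scored, query_word=None):
    """Drop a match only when it is trivially the query itself: the query
    word, when searching by word. When searching by a raw vector there is
    no query word, so the best match is kept rather than blindly discarded."""
    if query_word is None:
        # vector query: no word to filter out, and no reliable way to know
        # which word to drop if several words shared a vector representation
        return scored
    return [(w, s) for w, s in scored if w != query_word]

scored = [("china", 0.99), ("taiwan", 0.91)]
print(filter_trivial_matches(scored, query_word="china"))  # [('taiwan', 0.91)]
print(filter_trivial_matches(scored))  # vector query: best match survives
```

This makes the asymmetry in the thread concrete: word queries can identify and drop their own trivial match, while vector queries cannot, which is why the old `take(num + 1).tail` approach spuriously rejected the best match for vectors.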
[GitHub] spark issue #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms no long...
Github user willb commented on the issue: https://github.com/apache/spark/pull/15105 Can we assume in general that distinct words will not have identical (viz., by `.equals`) vector representations? I ask because the behavior in the current documentation (in that pyspark test I linked above) assumes that you don't want a word to turn up as a synonym for its own vector representation. But this is impossible to enforce if we aren't guaranteed that distinct words won't have identical vector representations. This seems like a mostly reasonable assumption but it is definitely a corner case that we might want to be robust to (or at least take note of). If we can't accept this assumption, then we should figure out whether or not changing the expected behavior in the documentation is acceptable.
[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...
Github user willb commented on a diff in the pull request: https://github.com/apache/spark/pull/15105#discussion_r78995496

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -566,8 +582,8 @@ class Word2VecModel private[spark] (
     wordList.zip(cosVec)
       .toSeq
       .sortBy(-_._2)
-      .take(num + 1)
-      .tail
+      .filter(tup => wordOpt.map(w => !w.equals(tup._1)).getOrElse(true) && tup._2 != 1.0d)
--- End diff --

@srowen So actually reverting to the old code but filtering only if `wordOpt` is defined doesn't handle the original case I was considering here, where you pass in a vector that is very similar to the representation of a word in the vocabulary but that is not itself the representation of a word in the vocabulary.
[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...
Github user willb commented on a diff in the pull request: https://github.com/apache/spark/pull/15105#discussion_r78983451

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -566,8 +582,8 @@ class Word2VecModel private[spark] (
     wordList.zip(cosVec)
       .toSeq
       .sortBy(-_._2)
-      .take(num + 1)
-      .tail
+      .filter(tup => wordOpt.map(w => !w.equals(tup._1)).getOrElse(true) && tup._2 != 1.0d)
--- End diff --

I wasn't sure if the todo was referring to the sorting or to earlier optimizations. I'll get it started as a separate issue; thanks!
[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...
Github user willb commented on a diff in the pull request: https://github.com/apache/spark/pull/15105#discussion_r78981374

--- Diff: mllib/src/test/scala/org/apache/spark/mllib/feature/Word2VecSuite.scala ---
@@ -68,6 +69,21 @@ class Word2VecSuite extends SparkFunSuite with MLlibTestSparkContext {
     assert(syms(1)._1 == "japan")
   }
+  test("findSynonyms doesn't reject similar word vectors when called with a vector") {
+    val num = 2
+    val word2VecMap = Map(
+      ("china", Array(0.50f, 0.50f, 0.50f, 0.50f)),
+      ("japan", Array(0.40f, 0.50f, 0.50f, 0.50f)),
+      ("taiwan", Array(0.60f, 0.50f, 0.50f, 0.50f)),
+      ("korea", Array(0.45f, 0.60f, 0.60f, 0.60f))
+    )
+    val model = new Word2VecModel(word2VecMap)
+    val syms = model.findSynonyms(Vectors.dense(Array(0.52d, 0.50d, 0.50d, 0.50d)), num)
--- End diff --

Yes to both (the two significant digits was simply following the style earlier in the test).
[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...
Github user willb commented on a diff in the pull request: https://github.com/apache/spark/pull/15105#discussion_r78981066

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -566,8 +582,8 @@ class Word2VecModel private[spark] (
     wordList.zip(cosVec)
       .toSeq
       .sortBy(-_._2)
-      .take(num + 1)
-      .tail
+      .filter(tup => wordOpt.map(w => !w.equals(tup._1)).getOrElse(true) && tup._2 != 1.0d)
--- End diff --

@srowen This code in general kind of bothers me (I'd rather see a single pass through the tuples with a bounded priority queue keeping track of the `num + 1` candidates than converting to a sequence and then allocating an array to sort in place). But I'm inclined to get some numbers showing that that is a good idea and make it a separate PR unless this is a good time to fold it in (so to speak).
[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...
Github user willb commented on a diff in the pull request: https://github.com/apache/spark/pull/15105#discussion_r78973934

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -566,8 +582,8 @@ class Word2VecModel private[spark] (
     wordList.zip(cosVec)
       .toSeq
       .sortBy(-_._2)
-      .take(num + 1)
-      .tail
+      .filter(tup => wordOpt.map(w => !w.equals(tup._1)).getOrElse(true) && tup._2 != 1.0d)
--- End diff --

@hhbyyh So [one of the python doctests](https://github.com/apache/spark/blob/branch-2.0/python/pyspark/mllib/feature.py#L575) depends on a word not being a synonym of its vector representation. I think since this is the documented behavior now, that's the direction the fix should go as well, but I'll use `Vector.equals` instead of checking similarity in any case.
[GitHub] spark issue #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms no long...
Github user willb commented on the issue: https://github.com/apache/spark/pull/15105 Thanks, @hhbyyh. This is what I get for running `sbt "testOnly ..."`. I'll push a fix.
[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...
GitHub user willb opened a pull request: https://github.com/apache/spark/pull/15105

[SPARK-17548] [MLlib] Word2VecModel.findSynonyms no longer spuriously rejects the best match when invoked with a vector

## What changes were proposed in this pull request?

This pull request changes the behavior of `Word2VecModel.findSynonyms` so that it will not spuriously reject the best match when invoked with a vector that does not correspond to a word in the model's vocabulary. Instead of blindly discarding the best match, the changed implementation discards a match that corresponds to the query word (in cases where `findSynonyms` is invoked with a word) or that has an identical angle to the query vector.

## How was this patch tested?

I added a test to `Word2VecSuite` to ensure that the word with the most similar vector from a supplied vector would not be spuriously rejected.

You can merge this pull request into a Git repository by running: $ git pull https://github.com/willb/spark fix/findSynonyms Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15105.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15105 commit 5a7cf31e1c76c2fe4e172c2889fa23deb2152af9 Author: William Benton <wi...@redhat.com> Date: 2016-09-14T17:07:14Z test for spurious rejection of similar vectors in findSynonyms commit 757ce7c73e76f751395d55de3b5b31fa4a05cde8 Author: William Benton <wi...@redhat.com> Date: 2016-09-14T21:35:29Z `Word2VecModel.findSynonyms` no longer spuriously rejects the best match Previously, the `findSynonyms` method in `Word2VecModel` rejected the closest-matching vector. This was typically correct in cases where we were searching for synonyms of a word, but was incorrect in cases where we were searching for words most similar to a given vector, since the given vector might not correspond to a word in the model's vocabulary.
With this commit, `findSynonyms` will not discard the best matching term unless the best matching word is also the query word (or is maximally similar to the query vector).
[GitHub] spark pull request: SPARK-2863: [SQL] Add facilities for function-...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/2768#issuecomment-60934954 I've rebased this to fix merge conflicts.
[GitHub] spark pull request: SPARK-2863: [SQL] Add facilities for function-...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/2768#issuecomment-58893902 Two open questions, and the latter is more relevant: is requiring that actuals are cast to the types of formals too restrictive? Is it likely to lead to type-coercion rules oscillating? (Obviously, it should be possible to pass, e.g., a value of a narrow numeric type where a wider one is expected. But if all the type-coercion rules we can anticipate ultimately widen types or convert from strings into other values, then the rules will still make progress.) (cc @marmbrus)
[GitHub] spark pull request: SPARK-2863: [SQL] Add facilities for function-...
GitHub user willb opened a pull request: https://github.com/apache/spark/pull/2768

SPARK-2863: [SQL] Add facilities for function-argument coercion

This commit adds the `SignedFunction` trait and modifies the `Sqrt` expression class to use it for coercing its argument to `DoubleType`. `SignedFunction` represents a fixed-arity function whose arguments should be cast to particular types. Expression classes extending `SignedFunction` must provide `formalTypes`, a list of expected types for formal parameters; `actualParams`, a list of Expressions corresponding to actual parameters; and `create`, which creates an instance of that expression class from a list of expressions corresponding to actuals. The type parameter for `SignedFunction` should be the expression class extending it. See the `Sqrt` class for a concrete example. This trait (or one or several abstract classes extending this trait) could be exposed to code outside `sql` in the future.

You can merge this pull request into a Git repository by running: $ git pull https://github.com/willb/spark spark-2863 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2768.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2768 commit 4f9517a2c11d13f439f3ed7ea447a4559f9e9088 Author: William Benton wi...@redhat.com Date: 2014-10-11T12:40:10Z Adds SignedFunction trait and type coercion rules SignedFunction represents a fixed-arity function whose arguments should be cast to particular types. Expression classes extending SignedFunction must provide `formalTypes`, a List of expected types for formal parameters, `actualParams`, a list of Expressions corresponding to actual parameters, and `create`, which creates an instance of that expression class from a list of expressions corresponding to actuals. The type parameter for SignedFunction should be the expression class extending it.
See the Sqrt class for a concrete example. This trait (or one or several abstract classes extending this trait) could be exposed to code outside `sql` in the future.
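The coercion scheme described above — compare each actual argument against the expected formal parameter types and insert a cast only where they differ — might be sketched like this. This is an illustrative Python analogue under stated assumptions, not the Scala `SignedFunction` trait itself; the function and parameter names are hypothetical:

```python
def coerce_args(formal_types, actuals):
    """Cast each actual argument to its formal parameter type, leaving
    already well-typed arguments untouched. A fixed-arity signature means
    the lists must have the same length."""
    if len(formal_types) != len(actuals):
        raise TypeError("arity mismatch: expected %d arguments, got %d"
                        % (len(formal_types), len(actuals)))
    return [t(a) if not isinstance(a, t) else a
            for t, a in zip(formal_types, actuals)]

# e.g. a sqrt-like function whose single formal parameter is a double:
# an integer actual gets widened, a float actual passes through unchanged.
print(coerce_args([float], [9]))  # [9.0]
```

The "rules still make progress" concern from the review thread corresponds to the fact that casts here only ever widen toward the formal type, so repeated application reaches a fixed point rather than oscillating.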
[GitHub] spark pull request: SPARK-3699: SQL and Hive console tasks now cle...
GitHub user willb opened a pull request: https://github.com/apache/spark/pull/2547 SPARK-3699: SQL and Hive console tasks now clean up appropriately The sbt tasks sql/console and hive/console will now `stop()` the `SparkContext` upon exit. Previously, they left an ugly stack trace when quitting. You can merge this pull request into a Git repository by running: $ git pull https://github.com/willb/spark consoleCleanup Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2547.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2547 commit d5e431f0a1b9047a5afc27cb371dbfb7014fb6e0 Author: William Benton wi...@redhat.com Date: 2014-09-26T18:45:20Z SQL and Hive console tasks now clean up. The sbt tasks sql/console and hive/console will now clean up the SparkContext upon exit. Previously, they left an ugly stack trace when quitting.
[GitHub] spark pull request: SPARK-3699: SQL and Hive console tasks now cle...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/2547#issuecomment-57006526 Looks like Jenkins and GitHub are having a minor disagreement: FATAL: Failed to fetch from https://github.com/apache/spark.git
[GitHub] spark pull request: SPARK-3699: SQL and Hive console tasks now cle...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/2547#issuecomment-57032615 This patch only changes the sbt build file; it adds no classes and should not have caused any tests to fail.
[GitHub] spark pull request: [SPARK-3535][Mesos] Add 15% task memory overhe...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/2401#issuecomment-56121260 @andrewor14 This bothers me too, but in a slightly different way: calling the parameter “overhead” when it really refers to how to scale requested memory to accommodate anticipated overhead seems wrong. (115% overhead would be appalling indeed!)
[GitHub] spark pull request: [SPARK-3535][Mesos] Add 15% task memory overhe...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/2401#issuecomment-55902476 Here's a somewhat-related concern: it seems like JVM overhead is unlikely to scale linearly with requested heap size, so maybe a straight-up 15% isn't a great default? (If you have hard data on how heap requirements grow with job size, I'd be interested in seeing it.) Perhaps it would make more sense to reserve whichever is smaller of 15% or some fixed but reasonable amount.
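The suggestion above can be sketched in a few lines. The 15% fraction comes from the PR under discussion; the 512 MB cap and all names here are hypothetical.

```scala
// Sketch of the comment's proposal: instead of always scaling requested
// memory by a flat 15%, reserve whichever is smaller of 15% of the
// request or a fixed cap (512 MB here, an assumed value).
object MemoryOverhead {
  val overheadFraction = 0.15 // default discussed in the PR
  val overheadCapMB    = 512  // hypothetical fixed cap

  def totalWithOverhead(requestedMB: Int): Int =
    requestedMB + math.min((requestedMB * overheadFraction).toInt, overheadCapMB)
}
```

Under this scheme a small 1024 MB request reserves the full 15% (153 MB), while an 8192 MB request is capped at 512 MB of overhead instead of 1228 MB.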
[GitHub] spark pull request: Fix postfixOps warnings in the test suite
Github user willb commented on the pull request: https://github.com/apache/spark/pull/1323#issuecomment-55345856 Hi @jkbradley; thanks for taking a look. Here are the warnings as I see them when compiling tests on the immediate ancestor of my branch, which is 56e009d (I'm running on OS X 10.9 in this case): ``` [warn] /Users/willb/Documents/src/scala/alternate-spark/core/src/test/scala/org/apache/spark/ContextCleanerSuite.scala:38: postfix operator millis should be enabled [warn] by making the implicit value scala.language.postfixOps visible. [warn] This can be achieved by adding the import clause 'import scala.language.postfixOps' [warn] or by setting the compiler option -language:postfixOps. [warn] See the Scala docs for value scala.language.postfixOps for a discussion [warn] why the feature should be explicitly enabled. [warn] implicit val defaultTimeout = timeout(1 millis) [warn] ^ [warn] /Users/willb/Documents/src/scala/alternate-spark/core/src/test/scala/org/apache/spark/ContextCleanerSuite.scala:99: postfix operator millis should be enabled [warn] by making the implicit value scala.language.postfixOps visible. [warn] preGCTester.assertCleanup()(timeout(1000 millis)) [warn]^ [warn] /Users/willb/Documents/src/scala/alternate-spark/core/src/test/scala/org/apache/spark/ContextCleanerSuite.scala:117: postfix operator millis should be enabled [warn] by making the implicit value scala.language.postfixOps visible. [warn] preGCTester.assertCleanup()(timeout(1000 millis)) [warn]^ [warn] /Users/willb/Documents/src/scala/alternate-spark/core/src/test/scala/org/apache/spark/ContextCleanerSuite.scala:134: postfix operator millis should be enabled [warn] by making the implicit value scala.language.postfixOps visible. 
[warn] preGCTester.assertCleanup()(timeout(1000 millis)) [warn]^ [warn] /Users/willb/Documents/src/scala/alternate-spark/core/src/test/scala/org/apache/spark/ContextCleanerSuite.scala:156: postfix operator millis should be enabled [warn] by making the implicit value scala.language.postfixOps visible. [warn] preGCTester.assertCleanup()(timeout(1000 millis)) [warn]^ [warn] /Users/willb/Documents/src/scala/alternate-spark/core/src/test/scala/org/apache/spark/ContextCleanerSuite.scala:187: postfix operator millis should be enabled [warn] by making the implicit value scala.language.postfixOps visible. [warn] preGCTester.assertCleanup()(timeout(1000 millis)) [warn]^ [warn] /Users/willb/Documents/src/scala/alternate-spark/core/src/test/scala/org/apache/spark/ContextCleanerSuite.scala:290: postfix operator millis should be enabled [warn] by making the implicit value scala.language.postfixOps visible. [warn] eventually(waitTimeout, interval(100 millis)) { [warn]^ [warn] /Users/willb/Documents/src/scala/alternate-spark/core/src/test/scala/org/apache/spark/rdd/AsyncRDDActionsSuite.scala:138: postfix operator seconds should be enabled [warn] by making the implicit value scala.language.postfixOps visible. [warn] failAfter(10 seconds) { [warn] ^ [warn] /Users/willb/Documents/src/scala/alternate-spark/core/src/test/scala/org/apache/spark/rdd/AsyncRDDActionsSuite.scala:174: postfix operator seconds should be enabled [warn] by making the implicit value scala.language.postfixOps visible. [warn] failAfter(10 seconds) { [warn] ^ [warn] /Users/willb/Documents/src/scala/alternate-spark/core/src/test/scala/org/apache/spark/ui/UISuite.scala:42: postfix operator seconds should be enabled [warn] by making the implicit value scala.language.postfixOps visible. 
[warn] eventually(timeout(10 seconds), interval(50 milliseconds)) { [warn] ^ [warn] /Users/willb/Documents/src/scala/alternate-spark/core/src/test/scala/org/apache/spark/ui/UISuite.scala:42: postfix operator milliseconds should be enabled [warn] by making the implicit value scala.language.postfixOps visible. [warn] eventually(timeout(10 seconds), interval(50 milliseconds)) { [warn] ^ [warn] /Users/willb/Documents/src/scala/alternate-spark/core/src/test/scala/org/apache/spark/ui/UISuite.scala:56: postfix operator seconds should be enabled [warn] by making the implicit value scala.language.postfixOps visible. [warn] eventually(timeout(10 seconds), interval(50 milliseconds)) { [warn
```
[GitHub] spark pull request: Fix postfixOps warnings in the test suite
Github user willb commented on the pull request: https://github.com/apache/spark/pull/1323#issuecomment-55351655 Hey @andrewor14, thanks for the reply. First off, I absolutely agree with @srowen's comment on #1330 that `import`s (not compiler flags) are the right way to handle enabling these language features. It looks to me like `SpanSugar` pulls in `postfixOps` -- and that it's the only thing in those files that uses `postfixOps`. (I guess having both `SpanSugar` and `postfixOps` imported at toplevel was causing some implicit resolution confusion?) In any case, the approach in #1330 is probably the way to go since explicitly importing `postfixOps` seems unnecessary and removing the compiler flag is a good idea. Thanks again for taking a look!
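The language feature in question is easy to demonstrate in isolation; this minimal sketch (names are illustrative only) shows the import that silences the warnings quoted earlier in the thread:

```scala
// Postfix calls like `1000 millis` emit a feature warning under -feature
// unless scala.language.postfixOps is in scope, or the compiler is run
// with -language:postfixOps. The comment above suggests ScalaTest's
// SpanSugar brings the implicit in on its own, which would make this
// explicit import redundant in those test suites.
import scala.concurrent.duration._
import scala.language.postfixOps

object PostfixDemo {
  val timeout = 1000 millis // compiles without a warning, given the imports
}
```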
[GitHub] spark pull request: Fix postfixOps warnings in the test suite
Github user willb closed the pull request at: https://github.com/apache/spark/pull/1323
[GitHub] spark pull request: SPARK-3423: [SQL] Implement BETWEEN for SQLPar...
GitHub user willb opened a pull request: https://github.com/apache/spark/pull/2295 SPARK-3423: [SQL] Implement BETWEEN for SQLParser You can merge this pull request into a Git repository by running: $ git pull https://github.com/willb/spark sql-between Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2295.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2295 commit 0016d300a77e8ef80899a679f54e7451544361dc Author: William Benton wi...@redhat.com Date: 2014-09-05T23:00:29Z Implement BETWEEN for SQLParser
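For context, SQL `BETWEEN` is conventionally desugared into a pair of inclusive comparisons; a minimal sketch of the semantics (not the actual parser change in this PR):

```scala
// x BETWEEN low AND high is equivalent to low <= x AND x <= high,
// with both bounds inclusive; a SQL parser typically rewrites the
// former predicate into the latter.
def between(x: Double, low: Double, high: Double): Boolean =
  low <= x && x <= high
```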
[GitHub] spark pull request: SPARK-3329: [SQL] Don't depend on Hive SET pai...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/2220#issuecomment-54303443 This failure (in `SparkSubmitSuite`) appears unrelated to my patch.
[GitHub] spark pull request: SPARK-3329: [SQL] Don't depend on Hive SET pai...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/2220#issuecomment-54165469 @concretevitamin I cherry-picked @aarondav's fix (and added a very simple fix to handle cases that it didn't).
[GitHub] spark pull request: SPARK-3329: [SQL] Don't depend on Hive SET pai...
GitHub user willb opened a pull request: https://github.com/apache/spark/pull/2220 SPARK-3329: [SQL] Don't depend on Hive SET pair ordering. This fixes some possible spurious test failures in `HiveQuerySuite` by comparing sets of key-value pairs as sets, rather than as lists. You can merge this pull request into a Git repository by running: $ git pull https://github.com/willb/spark spark-3329 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2220.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2220 commit 36ff52aee8ac2d8da336a56df195e6093e8f7807 Author: William Benton wi...@redhat.com Date: 2014-08-31T17:08:00Z Don't depend on Hive SET pair ordering.
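The idea behind the fix can be illustrated in a few lines; the pairs and names here are illustrative only:

```scala
// Hive can emit SET key=value pairs in any order, so comparing raw
// output lists is flaky. Comparing the pairs as sets ignores ordering.
object SetPairComparison {
  val expected = Seq("a=1", "b=2")
  val actual   = Seq("b=2", "a=1") // same pairs, different order

  val listsEqual = expected == actual             // order-sensitive: false
  val setsEqual  = expected.toSet == actual.toSet // order-insensitive: true
}
```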
[GitHub] spark pull request: SPARK-3329: [SQL] Don't depend on Hive SET pai...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/2220#issuecomment-53996991 Thanks, @concretevitamin! I'll close this one then.
[GitHub] spark pull request: spark-729: predictable closure capture
Github user willb closed the pull request at: https://github.com/apache/spark/pull/1322
[GitHub] spark pull request: spark-729: predictable closure capture
Github user willb commented on the pull request: https://github.com/apache/spark/pull/1322#issuecomment-53903366 @mateiz sure; I've tracked down the problem but am a bit stumped by how to fix it. I'll reopen when I have a solution.
[GitHub] spark pull request: SPARK-2813: [SQL] Implement SQRT() directly in...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/1750#issuecomment-53753423 @marmbrus I've rebased this atop the current master and added support for casting string-valued SQRT arguments to double and a test case in SQLQuerySuite for the same. (These will be unnecessary when I finish general support for string coercion in numeric functions, but they make this patch mergeable now.)
[GitHub] spark pull request: SPARK-2813: [SQL] Implement SQRT() directly in...
Github user willb commented on a diff in the pull request: https://github.com/apache/spark/pull/1750#discussion_r16815625 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala --- @@ -935,6 +936,7 @@ private[hive] object HiveQl { case Token(DIV(), left :: right:: Nil) => Cast(Divide(nodeToExpr(left), nodeToExpr(right)), LongType) case Token("%", left :: right:: Nil) => Remainder(nodeToExpr(left), nodeToExpr(right)) +case Token("TOK_FUNCTION", Token(SQRT(), Nil) :: arg :: Nil) => Sqrt(nodeToExpr(arg)) --- End diff -- Hey @liancheng, I agree that it is possible to special-case this function (and can certainly do so for this PR). I recently [wrote up what Hive does](http://chapeau.freevariable.com/2014/08/existing-system-coercion.html) and have some work in progress on a more general solution that should capture what Hive does.
[GitHub] spark pull request: SPARK-2813: [SQL] Implement SQRT() directly in...
Github user willb commented on a diff in the pull request: https://github.com/apache/spark/pull/1750#discussion_r15894307 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala --- @@ -935,6 +936,7 @@ private[hive] object HiveQl { case Token(DIV(), left :: right:: Nil) => Cast(Divide(nodeToExpr(left), nodeToExpr(right)), LongType) case Token("%", left :: right:: Nil) => Remainder(nodeToExpr(left), nodeToExpr(right)) +case Token("TOK_FUNCTION", Token(SQRT(), Nil) :: arg :: Nil) => Sqrt(nodeToExpr(arg)) --- End diff -- @marmbrus to clarify, should I have a solution for the new type-coercion JIRA before doing any other reimplementations, or can I keep submitting reimplementations as well?
[GitHub] spark pull request: Fix postfixOps warnings in the test suite
Github user willb commented on the pull request: https://github.com/apache/spark/pull/1323#issuecomment-51378794 Can someone take a look at this again?
[GitHub] spark pull request: SPARK-2813: [SQL] Implement SQRT() directly in...
Github user willb commented on a diff in the pull request: https://github.com/apache/spark/pull/1750#discussion_r15791823 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala --- @@ -935,6 +936,7 @@ private[hive] object HiveQl { case Token(DIV(), left :: right:: Nil) => Cast(Divide(nodeToExpr(left), nodeToExpr(right)), LongType) case Token("%", left :: right:: Nil) => Remainder(nodeToExpr(left), nodeToExpr(right)) +case Token("TOK_FUNCTION", Token(SQRT(), Nil) :: arg :: Nil) => Sqrt(nodeToExpr(arg)) --- End diff -- It does. What's the best way to handle this in general?
[GitHub] spark pull request: SPARK-2813: [SQL] Implement SQRT() directly in...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/1750#issuecomment-51143752 @marmbrus I'll file a JIRA for that and am happy to put it at the front of my plate; sounds like a fun problem!
[GitHub] spark pull request: SPARK-2813: [SQL] Implement SQRT() directly in...
GitHub user willb opened a pull request: https://github.com/apache/spark/pull/1750 SPARK-2813: [SQL] Implement SQRT() directly in Catalyst This PR adds a native implementation for SQL SQRT() and thus avoids delegating this function to Hive. You can merge this pull request into a Git repository by running: $ git pull https://github.com/willb/spark spark-2813 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1750.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1750 commit 18d63f93316e56b9f0e137e272981b5a2eb84074 Author: William Benton wi...@redhat.com Date: 2014-08-02T15:30:26Z Added native SQRT implementation commit bb8022612c468ae99531fbcc9ddff8a5f45bcf36 Author: William Benton wi...@redhat.com Date: 2014-08-02T16:22:40Z added SQRT test to SqlQuerySuite
[GitHub] spark pull request: SPARK-2226: [SQL] transform HAVING clauses wit...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/1497#issuecomment-49876099 Thanks, @marmbrus; this works now and I understand where things were going wrong (and also, hopefully, how to use Catalyst more idiomatically).
[GitHub] spark pull request: SPARK-2226: [SQL] transform HAVING clauses wit...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/1497#issuecomment-49912686 @marmbrus, thanks again; sorry to have missed those earlier!
[GitHub] spark pull request: spark-729: predictable closure capture
Github user willb commented on the pull request: https://github.com/apache/spark/pull/1322#issuecomment-49913436 @mateiz, I think there's a memory blowup somewhere in this patch as it is and am trying to track it down. Coincidentally, it's what I was switching context back to when I saw this comment. @rxin, can you point me to the broadcast change so I can track it?
[GitHub] spark pull request: spark-729: predictable closure capture
Github user willb commented on the pull request: https://github.com/apache/spark/pull/1322#issuecomment-49926246 Thanks, @rxin
[GitHub] spark pull request: SPARK-2226: [SQL] transform HAVING clauses wit...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/1497#issuecomment-49726353 @pwendell Sure, and I'll do this in the future.
[GitHub] spark pull request: SPARK-2226: [SQL] transform HAVING clauses wit...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/1497#issuecomment-49726591 @marmbrus I've made the changes you suggested and moved the rule to the Resolution batch but now the newly-inserted attribute references remain unresolved after analysis. I'm going to try and figure out where things are going wrong.
[GitHub] spark pull request: SPARK-2226: [SQL] transform HAVING clauses wit...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/1497#issuecomment-49783600 @marmbrus I've pushed it to https://github.com/willb/spark/tree/spark-2226-broken -- if I split out the `UnresolvedHavingClauseAttributes` to a separate batch and add a `ResolveReferences` (as in willb@16b3dfc), things seem to work, but if I put `UnresolvedHavingClauseAttributes` in Resolution (as in willb@76ae5c1), things fail with No function to evaluate expression. type: AttributeReference I'm sure I'm missing something simple here. Thanks for taking a look!
[GitHub] spark pull request: SPARK-2226: transform HAVING clauses with aggr...
Github user willb commented on a diff in the pull request: https://github.com/apache/spark/pull/1497#discussion_r15172123 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -152,6 +155,34 @@ class Analyzer(catalog: Catalog, registry: FunctionRegistry, caseSensitive: Bool } /** + * This rule finds expressions in HAVING clause filters that depend on + * unresolved attributes. It pushes these expressions down to the underlying + * aggregates and then projects them away above the filter. + */ + object UnresolvedHavingClauseAttributes extends Rule[LogicalPlan] { +def apply(plan: LogicalPlan): LogicalPlan = plan transform { + case pl @ Filter(fexp, agg @ Aggregate(_, ae, _)) if !fexp.childrenResolved => { +val alias = Alias(fexp, makeTmp())() +val aggExprs = Seq(alias) ++ ae + +val newCond = EqualTo(Cast(alias.toAttribute, BooleanType), Literal(true, BooleanType)) + +val newFilter = ResolveReferences(pl.copy(condition = newCond, + child = agg.copy(aggregateExpressions = aggExprs))) + +Project(pl.output, newFilter) + } +} + +private val curId = new java.util.concurrent.atomic.AtomicLong() + +private def makeTmp() = { + val id = curId.getAndIncrement() + s"tmp_cond_$id" --- End diff -- @marmbrus Thanks!
[GitHub] spark pull request: SPARK-2226: transform HAVING clauses with aggr...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/1497#issuecomment-49542704 Thanks, @rxin! I've made these changes.
[GitHub] spark pull request: SPARK-2226: transform HAVING clauses with unre...
GitHub user willb opened a pull request: https://github.com/apache/spark/pull/1497 SPARK-2226: transform HAVING clauses with unresolvable attributes This commit adds an analyzer rule to 1. find expressions in `HAVING` clause filters that depend on unresolved attributes, 2. push these expressions down to the underlying aggregates, and then 3. project them away above the filter. It also enables the `HAVING` queries in the Hive compatibility suite. You can merge this pull request into a Git repository by running: $ git pull https://github.com/willb/spark spark-2226 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1497.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1497 commit 29a26e3ab6a21e6619f003d905bc7aa7d1cb2976 Author: William Benton wi...@redhat.com Date: 2014-07-17T15:36:37Z Added rule to handle unresolved attributes in HAVING clauses (SPARK-2226) commit c7f2b2c8a19b09ec095a316cb965f18d474d7144 Author: William Benton wi...@redhat.com Date: 2014-07-17T17:16:18Z Whitelist HAVING queries. Also adds golden outputs for HAVING tests. commit 5a12647c169ee06bba5355c3956a158699247e43 Author: William Benton wi...@redhat.com Date: 2014-07-19T17:08:17Z Explanatory comments and stylistic cleanups.
[GitHub] spark pull request: SPARK-2407: Added Parser of SQL SUBSTR()
Github user willb commented on the pull request: https://github.com/apache/spark/pull/1442#issuecomment-49159849 That was my thought as well, @egraldlo. Thanks for submitting this, @chutium!
[GitHub] spark pull request: SPARK-2407: Added Parser of SQL SUBSTR()
Github user willb commented on the pull request: https://github.com/apache/spark/pull/1442#issuecomment-49160982 @egraldlo, couldn't it be `(SUBSTR | SUBSTRING) ~ // ... ` in that case? ---
[GitHub] spark pull request: SPARK-2486: Utils.getCallSite is now resilient...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/1413#issuecomment-49023679 @aarondav @pwendell Yes, with this patch I'm able to enable the YourKit features that were causing crashes before. I'll submit an update to fix the bracket style and cc you both. Thanks for the quick review! ---
[GitHub] spark pull request: Reformat multi-line closure argument.
GitHub user willb opened a pull request: https://github.com/apache/spark/pull/1419 Reformat multi-line closure argument. You can merge this pull request into a Git repository by running: $ git pull https://github.com/willb/spark reformat-2486 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1419.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1419 commit 26762310ddf0ea88a418c506ed0e86892fe6e4d5 Author: William Benton wi...@redhat.com Date: 2014-07-15T12:35:13Z Reformat multi-line closure argument. ---
[GitHub] spark pull request: Reformat multi-line closure argument.
Github user willb commented on the pull request: https://github.com/apache/spark/pull/1419#issuecomment-49024982 (See discussion on #1413; cc @aarondav and @pwendell.) ---
[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/1359#issuecomment-48971559 Thanks, Michael. I've fixed the conflict. ---
[GitHub] spark pull request: SPARK-2486: Utils.getCallSite is now resilient...
GitHub user willb opened a pull request: https://github.com/apache/spark/pull/1413 SPARK-2486: Utils.getCallSite is now resilient to bogus frames When running Spark under certain instrumenting profilers, Utils.getCallSite could crash with an NPE. This commit makes it more resilient to failures occurring while inspecting stack frames. You can merge this pull request into a Git repository by running: $ git pull https://github.com/willb/spark spark-2486 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1413.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1413 commit 0f0c1ae825d53d773ef12a02d74072b88f65d91a Author: William Benton wi...@redhat.com Date: 2014-07-15T03:40:51Z Utils.getCallSite is now resilient to bogus frames When running Spark under certain instrumenting profilers, Utils.getCallSite could crash with an NPE. This commit makes it more resilient to failures occurring while inspecting stack frames. ---
[GitHub] spark pull request: SPARK-2486: Utils.getCallSite is now resilient...
Github user willb commented on a diff in the pull request: https://github.com/apache/spark/pull/1413#discussion_r14916362 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -809,7 +809,11 @@ private[spark] object Utils extends Logging { */ def getCallSite: CallSite = { val trace = Thread.currentThread.getStackTrace() - .filterNot(_.getMethodName.contains("getStackTrace")) + .filterNot((ste: StackTraceElement) => + // When running under some profilers, the current stack trace might contain some bogus + // frames. This Try is intended to ensure that we don't crash in these situations by + // ignoring any frames that we can't examine. + Try(ste.getMethodName.contains("getStackTrace")).getOrElse(true)) --- End diff -- So I was thinking that either `ste` or the result of `ste.getMethodName` could be `null` when running under instrumentation. I'll rewrite to use checks instead of `Try`, though. ---
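The null-check rewrite mentioned at the end of that comment might look roughly like this (an illustrative sketch, not the code that landed in `Utils.getCallSite`):

```scala
// Treat a frame as uninspectable (and therefore not a match) if either the
// element or its method name is null, which can happen under instrumentation,
// instead of throwing an NPE mid-filter.
def mentionsGetStackTrace(ste: StackTraceElement): Boolean =
  ste != null && ste.getMethodName != null &&
    ste.getMethodName.contains("getStackTrace")
```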
[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...
Github user willb commented on a diff in the pull request: https://github.com/apache/spark/pull/1359#discussion_r14782197 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala --- @@ -207,3 +210,64 @@ case class StartsWith(left: Expression, right: Expression) extends StringCompari case class EndsWith(left: Expression, right: Expression) extends StringComparison { def compare(l: String, r: String) = l.endsWith(r) } + +/** + * A function that takes a substring of its first argument starting at a given position. + * Defined for String and Binary types. + */ +case class Substring(str: Expression, pos: Expression, len: Expression) extends Expression { + + type EvaluatedType = Any + + def nullable: Boolean = true + def dataType: DataType = { +if (str.dataType == BinaryType) str.dataType else StringType + } + + def references = children.flatMap(_.references).toSet + + override def children = str :: pos :: len :: Nil + + def slice[T, C <% IndexedSeqOptimized[T,_]](str: C, startPos: Int, sliceLen: Int): Any = { +val len = str.length +// Hive and SQL use one-based indexing for SUBSTR arguments but also accept zero and +// negative indices for start positions. If a start index i is greater than 0, it +// refers to element i-1 in the sequence. If a start index i is less than 0, it refers +// to the -ith element before the end of the sequence. If a start index i is 0, it +// refers to the first element. + +val start = startPos match { + case pos if pos > 0 => pos - 1 + case neg if neg < 0 => len + neg + case _ => 0 +} + +val end = sliceLen match { + case max if max == Integer.MAX_VALUE => max + case x => start + x +} + +str.slice(start, end) + } + + override def eval(input: Row): Any = { +val string = str.eval(input) + +val po = pos.eval(input) +val ln = len.eval(input) + +if ((string == null) || (po == null) || (ln == null)) { + null +} else { + val start = po.asInstanceOf[Int] + val length = ln.asInstanceOf[Int] + + string match { +case ba: Array[Byte] => slice(ba, start, length) +case other => slice(other.toString, start, length) +} +} + } + + override def toString = s"SUBSTR($str, $pos, $len)" --- End diff -- Makes sense! ---
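The indexing convention spelled out in that comment can be exercised on plain `String`s (a standalone sketch mirroring the described Hive/SQL semantics; `substr` here is my name for illustration, not the patch's API):

```scala
// One-based positive start positions, negative positions counting back from
// the end, and zero behaving like one -- as described in the diff comment.
def substr(s: String, startPos: Int, sliceLen: Int = Int.MaxValue): String = {
  val start = startPos match {
    case p if p > 0 => p - 1         // one-based positive start
    case n if n < 0 => s.length + n  // counts back from the end
    case _          => 0             // zero behaves like 1
  }
  val end = if (sliceLen == Int.MaxValue) Int.MaxValue else start + sliceLen
  s.slice(start, end)
}
```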
[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...
Github user willb commented on a diff in the pull request: https://github.com/apache/spark/pull/1359#discussion_r14782431 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala --- @@ -860,6 +860,7 @@ private[hive] object HiveQl { val BETWEEN = "(?i)BETWEEN".r val WHEN = "(?i)WHEN".r val CASE = "(?i)CASE".r + val SUBSTR = "(?i)I_SUBSTR(?:ING)?".r --- End diff -- Good catch -- the `I_` was spuriously committed. (I had that in my working tree so that I could easily compare Catalyst and Hive `substr` from the console.) ---
[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/1359#issuecomment-48643276 Thanks for the comments, @concretevitamin. I've made the changes from your comments and will think about reducing branch overhead before pushing an update. ---
[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...
Github user willb commented on a diff in the pull request: https://github.com/apache/spark/pull/1359#discussion_r14783468 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala --- @@ -207,3 +210,64 @@ case class StartsWith(left: Expression, right: Expression) extends StringCompari case class EndsWith(left: Expression, right: Expression) extends StringComparison { def compare(l: String, r: String) = l.endsWith(r) } + +/** + * A function that takes a substring of its first argument starting at a given position. + * Defined for String and Binary types. + */ +case class Substring(str: Expression, pos: Expression, len: Expression) extends Expression { + + type EvaluatedType = Any + + def nullable: Boolean = true + def dataType: DataType = { +if (str.dataType == BinaryType) str.dataType else StringType + } + + def references = children.flatMap(_.references).toSet + + override def children = str :: pos :: len :: Nil + + def slice[T, C <% IndexedSeqOptimized[T,_]](str: C, startPos: Int, sliceLen: Int): Any = { +val len = str.length +// Hive and SQL use one-based indexing for SUBSTR arguments but also accept zero and --- End diff -- Hive supports 0-based indexing in the same way as this patch. I agree that supporting both in this way is ugly (both from an interface and from an implementation perspective), but it seems likely that people are depending on this behavior in the wild, doesn't it? ---
[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...
Github user willb commented on a diff in the pull request: https://github.com/apache/spark/pull/1359#discussion_r14783626 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala --- @@ -207,3 +210,64 @@ case class StartsWith(left: Expression, right: Expression) extends StringCompari case class EndsWith(left: Expression, right: Expression) extends StringComparison { def compare(l: String, r: String) = l.endsWith(r) } + +/** + * A function that takes a substring of its first argument starting at a given position. + * Defined for String and Binary types. + */ +case class Substring(str: Expression, pos: Expression, len: Expression) extends Expression { + + type EvaluatedType = Any + + def nullable: Boolean = true + def dataType: DataType = { +if (str.dataType == BinaryType) str.dataType else StringType + } + + def references = children.flatMap(_.references).toSet + + override def children = str :: pos :: len :: Nil + + def slice[T, C <% IndexedSeqOptimized[T,_]](str: C, startPos: Int, sliceLen: Int): Any = { +val len = str.length +// Hive and SQL use one-based indexing for SUBSTR arguments but also accept zero and +// negative indices for start positions. If a start index i is greater than 0, it +// refers to element i-1 in the sequence. If a start index i is less than 0, it refers +// to the -ith element before the end of the sequence. If a start index i is 0, it +// refers to the first element. + +val start = startPos match { + case pos if pos > 0 => pos - 1 + case neg if neg < 0 => len + neg + case _ => 0 +} + +val end = sliceLen match { --- End diff -- This is the behavior of `IndexedSeqOptimized[A,B].slice` (and thus this patch) as well as of Hive, too. ---
[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/1359#issuecomment-48648670 @concretevitamin Inlining `Substring.slice` seems to make a nontrivial difference (~11.5% speedup) on a very simple `Substring.eval()` microbenchmark. ---
[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...
Github user willb commented on a diff in the pull request: https://github.com/apache/spark/pull/1359#discussion_r14802394 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala --- @@ -207,3 +207,71 @@ case class StartsWith(left: Expression, right: Expression) extends StringCompari case class EndsWith(left: Expression, right: Expression) extends StringComparison { def compare(l: String, r: String) = l.endsWith(r) } + +/** + * A function that takes a substring of its first argument starting at a given position. + * Defined for String and Binary types. + */ +case class Substring(str: Expression, pos: Expression, len: Expression) extends Expression { + + type EvaluatedType = Any + + def nullable: Boolean = true + def dataType: DataType = { +if (!resolved) { + throw new UnresolvedException(this, s"Cannot resolve since $children are not resolved") +} +if (str.dataType == BinaryType) str.dataType else StringType + } + + def references = children.flatMap(_.references).toSet + + override def children = str :: pos :: len :: Nil + + @inline + def slice[T, C <% IndexedSeqOptimized[T,_]](str: C, startPos: Int, sliceLen: Int): Any = { --- End diff -- Yes, I think I can do it with implicit parameters. I'll push an update once I've run the Catalyst suite against my change. ---
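For reference, a view bound like `C <% IndexedSeqOptimized[T, _]` desugars to an implicit conversion parameter, which can be written out explicitly along these lines (a sketch using the standard-library `IndexedSeq` so it is self-contained; the names are illustrative, not the patch's):

```scala
// The implicit `asIndexed` parameter is what the view bound generated
// implicitly; Predef supplies conversions for String and Array arguments.
def slice[T, C](seq: C, startPos: Int, sliceLen: Int)
               (implicit asIndexed: C => IndexedSeq[T]): IndexedSeq[T] = {
  val s = asIndexed(seq)
  val start = startPos match {
    case p if p > 0 => p - 1        // one-based positive index
    case n if n < 0 => s.length + n // counts back from the end
    case _          => 0
  }
  val end = if (sliceLen == Int.MaxValue) Int.MaxValue else start + sliceLen
  s.slice(start, end)
}
```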
[GitHub] spark pull request: Fix (some of the) warnings in the test suite
Github user willb commented on the pull request: https://github.com/apache/spark/pull/1323#issuecomment-48326403 @rxin Done; I also updated the comment to reflect the narrower focus after eliminating overlap with #1153. Thanks! ---
[GitHub] spark pull request: spark-729: predictable closure capture
GitHub user willb opened a pull request: https://github.com/apache/spark/pull/1322 spark-729: predictable closure capture SPARK-729 concerns when free variables in closure arguments to transformations are captured. Currently, it is possible for closures to get the environment in which they are serialized (not the environment in which they are created). This PR causes free variables in closure arguments to RDD transformations to be captured at closure creation time by modifying `ClosureCleaner` to serialize and deserialize its argument. This PR is based on #189 (which is closed) but has fixes to work with some changes in 1.0. In particular, it ensures that the cloned `Broadcast` objects produced by closure capture are registered with `ContextCleaner` so that broadcast variables won't become invalid simply because variable capture (implemented this way) causes strong references to the original broadcast variables to go away. (See #189 for additional discussion and background.) You can merge this pull request into a Git repository by running: $ git pull https://github.com/willb/spark spark-729 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1322.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1322 commit c24b7c8e92c035cd5b48acf11fb29dc060f68fca Author: William Benton wi...@redhat.com Date: 2014-07-04T16:26:59Z Added reference counting for Broadcasts. commit b10f613b6debe78ec1a1d53dd7f28168f8adb359 Author: William Benton wi...@redhat.com Date: 2014-07-07T17:52:26Z Added ContextCleaner.withCurrentCleaner This method allows code that needs access to the currently-active ContextCleaner to access it via a DynamicVariable. 
commit 14e074a59616732d422454c1ce881bc9b29060cd Author: William Benton wi...@redhat.com Date: 2014-03-18T14:55:57Z Added tests for variable capture in closures The two tests added to ClosureCleanerSuite ensure that variable values are captured at RDD definition time, not at job-execution time. commit 4e026a9d130c6fc92ce4cd73a68de013ed7aee5d Author: William Benton wi...@redhat.com Date: 2014-03-20T15:48:17Z Predictable closure environment capture The environments of serializable closures are now captured as part of closure cleaning. Since we already proactively check most closures for serializability, ClosureCleaner.clean now returns the result of deserializing the serialized version of the cleaned closure. Conflicts: core/src/main/scala/org/apache/spark/SparkContext.scala commit 39960620eb9c37f92ade2c35e2d5402cad6dd686 Author: William Benton wi...@redhat.com Date: 2014-03-26T04:45:45Z Skip proactive closure capture for runJob There are two possible cases for runJob calls: either they are called by RDD action methods from inside Spark or they are called from client code. There's no need to proactively check the closure argument to runJob for serializability or force variable capture in either case: 1. if they are called by RDD actions, their closure arguments consist of mapping an already-serializable closure (with an already-frozen environment) to each element in the RDD; 2. in both cases, the closure is about to execute and thus the benefit of proactively checking for serializability (or ensuring immediate variable capture) is nonexistent. (Note that ensuring capture via serializability on closure arguments to runJob also causes pyspark accumulators to fail to update.) 
Conflicts: core/src/main/scala/org/apache/spark/SparkContext.scala commit f4ed7535a1a1af0a518d4b7c9585073026a745c1 Author: William Benton wi...@redhat.com Date: 2014-03-26T16:31:56Z Split closure-serializability failure tests This splits the test identifying expected failures due to closure serializability into three cases. commit b507dd85b0ea1595ed216ffd8226964ba671676c Author: William Benton wi...@redhat.com Date: 2014-04-04T21:39:55Z Fixed style issues in tests Conflicts: core/src/test/scala/org/apache/spark/serializer/ProactiveClosureSerializationSuite.scala commit 5284569120cff51b3d2253e03b435047258798a0 Author: William Benton wi...@redhat.com Date: 2014-04-04T22:15:50Z Stylistic changes and cleanups Conflicts: core/src/main/scala/org/apache/spark/SparkContext.scala commit d6d49304f543b5435b3062558f80ea61d2e1a757 Author: William Benton wi...@redhat.com Date: 2014-05-02T14:23:49Z Removed proactive closure serialization from DStream Conflicts: streaming/src/main/scala/org/apache/spark/streaming/dstream/DStream.scala commit c052a63dba60b62325abb7ad541fe3d3f1a6e7d0 Author: William Benton wi...@redhat.com Date: 2014-07
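The capture-at-creation mechanism the PR describes can be demonstrated in miniature with a serialize/deserialize round trip (a sketch of the idea only, not ClosureCleaner's code; it assumes the closure and everything it captures are serializable):

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Round-trip a closure through Java serialization so that its free variables
// are copied (frozen) at this moment rather than shared with the caller.
def captureNow[A, B](f: A => B): A => B = {
  val buf = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(buf)
  out.writeObject(f); out.close()
  val in = new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray))
  in.readObject().asInstanceOf[A => B]
}

def demo(): (Int, Int) = {
  var factor = 2
  val frozen = captureNow((x: Int) => x * factor) // copies factor's current value
  val live   = (x: Int) => x * factor             // still shares factor
  factor = 10
  (frozen(3), live(3))
}
```

The frozen closure keeps computing with the value `factor` had at capture time, while the unfrozen one observes the later mutation.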
[GitHub] spark pull request: Fix (some of the) warnings in the test suite
GitHub user willb opened a pull request: https://github.com/apache/spark/pull/1323 Fix (some of the) warnings in the test suite This PR fixes three classes of compiler and deprecation warnings in the test suite: * `expectResult` is currently deprecated and has been replaced with `assertResult`; * assertions of the form `should be ===` are deprecated in favor of the form `shouldEqual`; and * `scala.language.postfixOps` was not in scope within the test classes in which postfix operations were actually used. The fixes for these issues were almost entirely mechanical, but they eliminate many lines of warning output. You can merge this pull request into a Git repository by running: $ git pull https://github.com/willb/spark testCleanups Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1323.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1323 commit b05304666eb61d7d1d1849772c27cf9711699dae Author: William Benton wi...@redhat.com Date: 2014-07-07T20:31:27Z Replace expectResult with assertResult Assertions.expectResult is deprecated. This commit replaces expectResult invocations with the assertResult invocations. commit 5377534fdfe741aab7d6d539573ab78636bfd554 Author: William Benton wi...@redhat.com Date: 2014-07-07T20:37:53Z Replaced deprecated 'be ===' assertions should be === assertions have been deprecated in favor of shouldEqual assertions. This commit replaces the deprecated form with the supported form. commit a672a9b5d579c4946a20fe4c6f88c4582949ef3f Author: William Benton wi...@redhat.com Date: 2014-07-07T21:00:58Z Ensure language.postfixOps is in scope where used Previously, language.postfixOps was imported at toplevel, which meant compiler warnings since it wasn't visible inside the classes that used postfix operations. This commit moves the import to suppress these warnings. 
---
[GitHub] spark pull request: Fix (some of the) warnings in the test suite
Github user willb commented on the pull request: https://github.com/apache/spark/pull/1323#issuecomment-48243538 @srowen sorry, I hadn't noticed the other PR! I think the `postfixOps` change in mine is disjoint, though. ---
[GitHub] spark pull request: SPARK-897: preemptively serialize closures
Github user willb commented on a diff in the pull request: https://github.com/apache/spark/pull/143#discussion_r14329726 --- Diff: core/src/main/scala/org/apache/spark/util/ClosureCleaner.scala --- @@ -153,6 +153,18 @@ private[spark] object ClosureCleaner extends Logging { field.setAccessible(true) field.set(func, outer) } + +if (checkSerializable) { + ensureSerializable(func) +} + } + + private def ensureSerializable(func: AnyRef) { +try { + SparkEnv.get.closureSerializer.newInstance().serialize(func) +} catch { + case ex: Exception => throw new SparkException("Task not serializable: " + ex.toString) --- End diff -- I agree that it is better to wrap the underlying exception but was following the style of this error in DAGScheduler. I'll make the change and update that as well. ---
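Wrapping the underlying exception rather than flattening it into the message might look like this (a sketch; `TaskNotSerializable` is a hypothetical stand-in so the example is self-contained, and plain Java serialization stands in for the closure serializer):

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Hypothetical exception type: chaining the cause preserves the original
// stack trace instead of reducing it to a string in the message.
class TaskNotSerializable(msg: String, cause: Throwable)
  extends RuntimeException(msg, cause)

def ensureSerializable(func: AnyRef): Unit = {
  try {
    new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(func)
  } catch {
    case ex: Exception =>
      throw new TaskNotSerializable("Task not serializable: " + ex, ex)
  }
}
```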
[GitHub] spark pull request: SPARK-897: preemptively serialize closures
Github user willb commented on the pull request: https://github.com/apache/spark/pull/143#issuecomment-47459912 Sorry, I missed FailureSuite. I have a fix but ran out of battery before I could push. ---
[GitHub] spark pull request: SPARK-897: preemptively serialize closures
Github user willb commented on the pull request: https://github.com/apache/spark/pull/143#issuecomment-46860741 Can someone take another look at this PR? ---
[GitHub] spark pull request: SPARK-2180: support HAVING clauses in Hive que...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/1136#issuecomment-46677366 Thanks for the quick review and patch, @rxin! ---
[GitHub] spark pull request: SPARK-2180: support HAVING clauses in Hive que...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/1136#issuecomment-46724443 @rxin, re: the former, seems like most implementations signal this as an error. ---
[GitHub] spark pull request: SPARK-2180: support HAVING clauses in Hive que...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/1136#issuecomment-46725581 OK, I wasn't sure if strict Hive compatibility was the goal. I'm happy to take these tickets. Thanks again! ---
[GitHub] spark pull request: [SPARK-2225] Turn HAVING without GROUP BY into...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/1161#issuecomment-46727408 LGTM; this is basically exactly what I did (https://github.com/willb/spark/commit/b272f6be925ba50741e0a5093244926ea4a7a9a8) ---
[GitHub] spark pull request: SPARK-2180: support HAVING clauses in Hive que...
GitHub user willb opened a pull request: https://github.com/apache/spark/pull/1136 SPARK-2180: support HAVING clauses in Hive queries This PR extends Spark's HiveQL support to handle HAVING clauses in aggregations. The HAVING test from the Hive compatibility suite doesn't appear to be runnable from within Spark, so I added a simple comparable test to `HiveQuerySuite`. You can merge this pull request into a Git repository by running: $ git pull https://github.com/willb/spark SPARK-2180 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1136.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1136 commit f43787b155f6dcf306fe3df3395785598dc9d6ca Author: William Benton wi...@redhat.com Date: 2014-06-18T20:20:41Z Add support for HAVING clauses in Hive queries. commit 33b12220a0d8437ec821a8ee166112f519a2d1bc Author: William Benton wi...@redhat.com Date: 2014-06-19T18:59:54Z Added test coverage for SPARK-2180. This is a simple test for HAVING clauses. ---
[GitHub] spark pull request: SPARK-2180: support HAVING clauses in Hive que...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/1136#issuecomment-46615691 @rxin, I'm not 100% sure but I think it's a problem with local map/reduce (the stack trace isn't too informative, but it's the same as the one for tests that are blacklisted due to missing local map/reduce). I have another commit to push here (adding a semantic exception when `HAVING` is specified without `GROUP BY` and test coverage for same). ---
[GitHub] spark pull request: SPARK-2180: support HAVING clauses in Hive que...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/1136#issuecomment-46635597 Thanks for the catch, @rxin! I'll make the change and add tests for it. ---
[GitHub] spark pull request: SPARK-2180: support HAVING clauses in Hive que...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/1136#issuecomment-46640150 So I've added a cast in cases in which non-boolean expressions are supplied as `HAVING` conditions. It appears that `Cast(_, BooleanType)` isn't idempotent, though -- if you apply it to a Boolean (say, `x > 4`), it will translate that to `NOT ((x > 4) = 0)`. This seems like a bug, but it's possible that I'm missing the reason why it should work that way. Should I change `Cast` so that casting an _X_ to _X_ is a no-op? (Checking the type of a variable during parse doesn't work, so I wind up with a different exception in examples like the one you posted. I'll either need to fix the behavior of `Cast` or delay adding the cast until I have type information.) ---
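The "delay the cast until type information is available" option amounts to a type-aware guard. A toy model of that idea (hypothetical types and a hypothetical `castTo` helper, not Catalyst's actual `Cast` class):

```scala
sealed trait DataType
case object BooleanType extends DataType
case object IntegerType extends DataType

case class Expr(sql: String, dataType: DataType)

// Once the expression's type is known, only insert a cast when it differs
// from the target type -- casting X to X is a no-op.
def castTo(e: Expr, target: DataType): Expr =
  if (e.dataType == target) e
  else Expr(s"CAST(${e.sql} AS $target)", target)
```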
[GitHub] spark pull request: SPARK-2180: support HAVING clauses in Hive que...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/1136#issuecomment-46642661 Thanks! I'm happy to put together a preliminary patch as well, but probably won't be able to take a look until tomorrow morning. ---
[GitHub] spark pull request: SPARK-897: preemptively serialize closures
Github user willb commented on the pull request: https://github.com/apache/spark/pull/143#issuecomment-46498345 I just rebased this branch atop master so it could be tested again. I see that it failed under Jenkins. However, I am unable to reproduce the local metrics failure in my own environment; is this an intermittent issue in CI or is there something I should look at? ---
[GitHub] spark pull request: SPARK-897: preemptively serialize closures
Github user willb commented on the pull request: https://github.com/apache/spark/pull/143#issuecomment-46500775 Thanks @rxin! ---
[GitHub] spark pull request: SPARK-897: preemptively serialize closures
Github user willb commented on the pull request: https://github.com/apache/spark/pull/143#issuecomment-42712135 I'm not able to reproduce the above failure locally (either on OS X or Linux). ---
[GitHub] spark pull request: SPARK-897: preemptively serialize closures
GitHub user willb reopened a pull request: https://github.com/apache/spark/pull/143 SPARK-897: preemptively serialize closures These commits cause `ClosureCleaner.clean` to attempt to serialize the cleaned closure with the default closure serializer and throw a `SparkException` if doing so fails. This behavior is enabled by default but can be disabled at individual callsites of `SparkContext.clean`. Commit 98e01ae8 fixes some no-op assertions in `GraphSuite` that this work exposed; I'm happy to put that in a separate PR if that would be more appropriate. You can merge this pull request into a Git repository by running: $ git pull https://github.com/willb/spark spark-897 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/143.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #143 commit 5cd11d51c19321981a6234a7765c7a5be6913433 Author: Ivan Wick ivanwick+git...@gmail.com Date: 2014-04-11T00:49:30Z Set spark.executor.uri from environment variable (needed by Mesos) The Mesos backend uses this property when setting up a slave process. It is similarly set in the Scala repl (org.apache.spark.repl.SparkILoop), but I couldn't find any analogous for pyspark. Author: Ivan Wick ivanwick+git...@gmail.com This patch had conflicts when merged, resolved by Committer: Matei Zaharia ma...@databricks.com Closes #311 from ivanwick/master and squashes the following commits: da0c3e4 [Ivan Wick] Set spark.executor.uri from environment variable (needed by Mesos) commit 7b4203ab4c640f7875ae3536228ed4d791062017 Author: Harvey Feng hyfeng...@gmail.com Date: 2014-04-11T01:25:54Z Add Spark v0.9.1 to ec2 launch script and use it as the default Mainly ported from branch-0.9. 
Author: Harvey Feng hyfeng...@gmail.com Closes #385 from harveyfeng/0.9.1-ec2 and squashes the following commits: 769ac2f [Harvey Feng] Add Spark v0.9.1 to ec2 launch script and use it as the default commit 44f654eecd3c181f2aeaff3871acf7f00eacc6b9 Author: Patrick Wendell pwend...@gmail.com Date: 2014-04-11T03:43:56Z SPARK-1202: Improvements to task killing in the UI. 1. Adds a separate endpoint for the killing logic that is outside of a page. 2. Narrows the scope of the killingEnabled tracking. 3. Some style improvements. Author: Patrick Wendell pwend...@gmail.com Closes #386 from pwendell/kill-link and squashes the following commits: 8efe02b [Patrick Wendell] Improvements to task killing in the UI. commit 446bb3417a2855a194d49acc0ac316a021eced9d Author: Thomas Graves tgra...@apache.org Date: 2014-04-11T07:47:48Z SPARK-1417: Spark on Yarn - spark UI link from resourcemanager is broken Author: Thomas Graves tgra...@apache.org Closes #344 from tgravescs/SPARK-1417 and squashes the following commits: c450b5f [Thomas Graves] fix test e1c1d7e [Thomas Graves] add missing $ to appUIAddress e982ddb [Thomas Graves] use appUIHostPort in appUIAddress 0803ec2 [Thomas Graves] Review comment updates - remove extra newline, simplify assert in test 658a8ec [Thomas Graves] Add a appUIHostPort routine 0614208 [Thomas Graves] Fix test 2a6b1b7 [Thomas Graves] SPARK-1417: Spark on Yarn - spark UI link from resourcemanager is broken commit 98225a6effd077a1b97c7e485d45ffd89b2c5b7f Author: Patrick Wendell pwend...@gmail.com Date: 2014-04-11T17:45:27Z Some clean up in build/docs (a) Deleted an outdated line from the docs (b) Removed a work around that is no longer necessary given the mesos version bump. 
Author: Patrick Wendell pwend...@gmail.com Closes #382 from pwendell/maven-clean and squashes the following commits: f0447fa [Patrick Wendell] Minor doc clean-up commit f5ace8da34c58d1005c7c377cfe3df21102c1dd6 Author: Xiangrui Meng m...@databricks.com Date: 2014-04-11T19:06:13Z [SPARK-1225, 1241] [MLLIB] Add AreaUnderCurve and BinaryClassificationMetrics This PR implements a generic version of `AreaUnderCurve` using the `RDD.sliding` implementation from https://github.com/apache/spark/pull/136 . It also contains refactoring of https://github.com/apache/spark/pull/160 for binary classification evaluation. Author: Xiangrui Meng m...@databricks.com Closes #364 from mengxr/auc and squashes the following commits: a05941d [Xiangrui Meng] replace TP/FP/TN/FN by their full names 3f42e98 [Xiangrui Meng] add (0, 0), (1, 1) to roc, and (0, 1) to pr fb4b6d2 [Xiangrui Meng] rename Evaluator to Metrics and add more metrics b1b7dab [Xiangrui Meng] fix code styles 9dc3518 [Xiangrui Meng] add tests for BinaryClassificationEvaluator ca31da5 [Xiangrui Meng] remove
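The preemptive check this PR adds to `ClosureCleaner.clean` can be sketched in a few lines, using plain Java serialization for illustration (Spark's actual check goes through its configured closure serializer and throws `SparkException`):

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Fail fast: attempt to serialize the cleaned closure immediately, so that
// a non-serializable closure is reported at the callsite rather than when
// a task is later shipped to an executor.
def ensureSerializable(closure: AnyRef): Unit =
  try new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(closure)
  catch {
    case e: NotSerializableException =>
      throw new IllegalArgumentException(s"Closure is not serializable: $e")
  }
```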
[GitHub] spark pull request: SPARK-571: forbid return statements in cleaned...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/717#issuecomment-43009698 Thanks, @rxin! ---
[GitHub] spark pull request: SPARK-571: forbid return statements in cleaned...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/717#issuecomment-42911366 @rxin, yes! (I was firing up a debugger in case you wanted to know exactly where it was returning to.) ---
[GitHub] spark pull request: SPARK-897: preemptively serialize closures
Github user willb commented on the pull request: https://github.com/apache/spark/pull/143#issuecomment-42703799 I'd like to reopen this PR, since #189 had to be reverted and is still in flight, but these fixes are independently useful. I've rebased the branch atop the current master. ---
[GitHub] spark pull request: SPARK-571: forbid return statements in cleaned...
GitHub user willb opened a pull request: https://github.com/apache/spark/pull/717 SPARK-571: forbid return statements in cleaned closures This patch checks top-level closure arguments to `ClosureCleaner.clean` for `return` statements and raises an exception if it finds any. This is mainly a user-friendliness addition, since programs with return statements in closure arguments will currently fail upon RDD actions with a less-than-intuitive error message. You can merge this pull request into a Git repository by running: $ git pull https://github.com/willb/spark spark-571 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/717.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #717 commit b017c4779a0bfc54f215bc1135d9d23ee9761c20 Author: William Benton wi...@redhat.com Date: 2014-05-09T20:05:49Z Added a test for SPARK-571 commit 295b6a5960fd9a4509e64b225b0e4b479704978b Author: William Benton wi...@redhat.com Date: 2014-05-09T21:14:37Z Forbid return statements in closure arguments. This commit ensures that ClosureCleaner.clean will identify toplevel return statements in closures and raise an exception if they are encountered. This is intended to fix SPARK-571. ---
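For context, the reason a `return` inside a closure argument fails confusingly: in Scala, `return` inside a closure is a non-local return from the enclosing method, implemented by throwing `scala.runtime.NonLocalReturnControl`. Once the closure is serialized and executed elsewhere, the enclosing method frame no longer exists, so an RDD action fails with that control-flow exception. A minimal sketch (hypothetical method, plain Scala):

```scala
def firstNegative(xs: Seq[Int]): Option[Int] = {
  xs.foreach { x =>
    // Compiles, but this is a non-local return that escapes the closure;
    // if the closure ran on a remote executor, there would be no enclosing
    // firstNegative frame to return to.
    if (x < 0) return Some(x)
  }
  None
}
```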
[GitHub] spark pull request: SPARK-571: forbid return statements in cleaned...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/717#issuecomment-42730126 Thanks for the quick feedback, @andrewor14! I'll make those changes. I'll also look into why that repartition test is failing; oddly, it's working for me locally. ---
[GitHub] spark pull request: SPARK-571: forbid return statements in cleaned...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/717#issuecomment-42760025 Thanks for the clarification, @pwendell. @andrewor14, I've grouped my imports and collapsed that function. Thanks again. ---
[GitHub] spark pull request: SPARK-1501: Ensure assertions in Graph.apply a...
GitHub user willb opened a pull request: https://github.com/apache/spark/pull/415 SPARK-1501: Ensure assertions in Graph.apply are asserted. The Graph.apply test in GraphSuite had some assertions in a closure in a graph transformation. As a consequence, these assertions never actually executed. Furthermore, these closures had a reference to (non-serializable) test harness classes because they called assert(), which could be a problem if we proactively check closure serializability in the future. This commit simply changes the Graph.apply test to collect the graph triplets so it can assert about each triplet from a map method. You can merge this pull request into a Git repository by running: $ git pull https://github.com/willb/spark graphsuite-nop-fix Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/415.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #415 commit 0b636586b797546ce0cf78dbbfbe7462712aeaa4 Author: William Benton wi...@redhat.com Date: 2014-03-14T16:40:56Z Ensure assertions in Graph.apply are asserted. The Graph.apply test in GraphSuite had some assertions in a closure in a graph transformation. As a consequence, these assertions never actually executed. Furthermore, these closures had a reference to (non-serializable) test harness classes because they called assert(), which could be a problem if we proactively check closure serializability in the future. This commit simply changes the Graph.apply test to collect the graph triplets so it can assert about each triplet from a map method. ---
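The pitfall this PR fixes reduces to lazy evaluation: assertions placed inside a lazy transformation never run until something forces it. A plain-Scala sketch using a collection view in place of an RDD/GraphX transformation (hypothetical data):

```scala
val data = Seq(1, 2, 3)
// Views are lazy, like RDD transformations: no assertion runs here.
val lazyChecked = data.view.map { x => assert(x > 0); x }
// Only an eager step (the analogue of collect()) makes them execute.
val forced = lazyChecked.toList
```

Collecting first and asserting on the materialized results, as the PR does, also keeps the non-serializable test harness out of the shipped closure.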
[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...
Github user willb commented on the pull request: https://github.com/apache/spark/pull/189#issuecomment-40070735 @pwendell That's odd, since I didn't see any failures locally either. I haven't rebased this branch in a while, though, so I can check again. Can you send me a link to the failing tests? ---
[GitHub] spark pull request: SPARK-897: preemptively serialize closures
Github user willb commented on the pull request: https://github.com/apache/spark/pull/143#issuecomment-40035997 This is subsumed by #189. ---