[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

2018-09-17 Thread willb
Github user willb commented on the issue:

https://github.com/apache/spark/pull/13440
  
@srowen @thunterdb this PR passes all tests and merges cleanly -- can you 
take another look?  It's been open for quite a while now.





[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

2017-10-18 Thread willb
Github user willb commented on the issue:

https://github.com/apache/spark/pull/13440
  
I agree with @felixcheung -- @srowen or @thunterdb, can you take a look at 
this?





[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

2017-08-23 Thread willb
Github user willb commented on the issue:

https://github.com/apache/spark/pull/13440
  
@HyukjinKwon @thunterdb can you all take a look at this?  It's been under 
review for quite a long time!





[GitHub] spark issue #13440: [SPARK-15699] [ML] Implement a Chi-Squared test statisti...

2017-07-21 Thread willb
Github user willb commented on the issue:

https://github.com/apache/spark/pull/13440
  
@thunterdb can you take a look at this now that 2.2 is out?





[GitHub] spark pull request #15150: [SPARK-17595] [MLLib] Use a bounded priority queu...

2016-09-20 Thread willb
Github user willb commented on a diff in the pull request:

https://github.com/apache/spark/pull/15150#discussion_r79641840
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -580,10 +581,14 @@ class Word2VecModel private[spark] (
   ind += 1
 }
 
-val scored = wordList.zip(cosVec).toSeq.sortBy(-_._2)
+val pq = new BoundedPriorityQueue[(String, Double)](num + 1)(Ordering.by(_._2))
+
+pq ++= wordList.zip(cosVec)
--- End diff --

Agreed.  I'll push as soon as I finish running tests locally.





[GitHub] spark pull request #15150: [SPARK-17595] [MLLib] Use a bounded priority queu...

2016-09-20 Thread willb
Github user willb commented on a diff in the pull request:

https://github.com/apache/spark/pull/15150#discussion_r79637489
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -580,10 +581,14 @@ class Word2VecModel private[spark] (
   ind += 1
 }
 
-val scored = wordList.zip(cosVec).toSeq.sortBy(-_._2)
+val pq = new BoundedPriorityQueue[(String, Double)](num + 1)(Ordering.by(_._2))
+
+pq ++= wordList.zip(cosVec)
--- End diff --

Yeah, I guess I figured that since we were allocating the tuples anyway a 
single copy of the array wasn't a lot of extra overhead vs. having slightly 
cleaner code.  But I'm happy to make the change if you think it's a good idea.  
I agree that allocating an array just to iterate through it isn't ideal.

(I'm ambivalent, partially because I don't have a great sense for the 
vocabulary sizes people typically use this code for in the wild.  For my 
example corpus, my patch as-is, zipping collection iterators, and explicit 
iteration over indices are all more or less equivalent in time performance.  My 
intuition is that allocating even the single array from `zip` is a bad deal if 
we're dealing with a very large vocabulary but probably not if the typical case 
is on the order of 10^5 words or less.)
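For concreteness, here is a minimal standalone sketch of the two variants being weighed, using `scala.collection.mutable.PriorityQueue` and made-up data in place of Spark's internal `BoundedPriorityQueue` and the real `wordList`/`cosVec` fields (in practice you would pick one variant or the other):

```
import scala.collection.mutable

// Hypothetical stand-ins for the wordList/cosVec fields in findSynonyms.
val wordList: Array[String] = Array("china", "japan", "taiwan", "korea")
val cosVec: Array[Double]   = Array(0.98, 0.91, 0.95, 0.80)

// Stand-in for Spark's private BoundedPriorityQueue, ordered by similarity.
val pq = mutable.PriorityQueue.empty[(String, Double)](Ordering.by[(String, Double), Double](_._2))

// Variant A (the patch as-is): zip allocates one intermediate array of pairs.
pq ++= wordList.zip(cosVec)

// Variant B (the suggested alternative): explicit iteration over indices, no intermediate array.
var i = 0
while (i < wordList.length) {
  pq += ((wordList(i), cosVec(i)))
  i += 1
}
```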





[GitHub] spark pull request #15150: [SPARK-17595] [MLLib] Use a bounded priority queu...

2016-09-20 Thread willb
Github user willb commented on a diff in the pull request:

https://github.com/apache/spark/pull/15150#discussion_r79600882
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -580,7 +581,11 @@ class Word2VecModel private[spark] (
   ind += 1
 }
 
-val scored = wordList.zip(cosVec).toSeq.sortBy(-_._2)
+val pq = new BoundedPriorityQueue[(String, Double)](num + 1)(Ordering.by(_._2))
+
+pq ++= wordList.zip(cosVec)
+
+val scored = pq.toSeq.sortBy(-_._2)
 
 val filtered = wordOpt match {
   case Some(w) => scored.take(num + 1).filter(tup => w != tup._1)
--- End diff --

Thanks, @hhbyyh!





[GitHub] spark issue #15150: [SPARK-17595] [MLLib] Use a bounded priority queue to fi...

2016-09-19 Thread willb
Github user willb commented on the issue:

https://github.com/apache/spark/pull/15150
  
Thanks for the feedback, @srowen!  I've made the changes.





[GitHub] spark pull request #15150: Use a bounded priority queue to find synonyms in ...

2016-09-19 Thread willb
GitHub user willb opened a pull request:

https://github.com/apache/spark/pull/15150

Use a bounded priority queue to find synonyms in Word2VecModel

## What changes were proposed in this pull request?

The code in `Word2VecModel.findSynonyms` to choose the vocabulary elements 
with the highest similarity to the query vector currently sorts the collection 
of similarities for every vocabulary element. This involves making multiple 
copies of the collection of similarities while doing a (relatively) expensive 
sort. It would be more efficient to find the best matches by maintaining a 
bounded priority queue and populating it with a single pass over the 
vocabulary, and that is exactly what this patch does.
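As a sketch of the idea (standalone and illustrative only, with a plain `scala.collection.mutable.PriorityQueue` capped at `num + 1` elements standing in for Spark's internal `BoundedPriorityQueue`):

```
import scala.collection.mutable

// Illustrative (word, cosine similarity) pairs for the whole vocabulary.
val similarities: Seq[(String, Double)] =
  Seq(("china", 0.98), ("taiwan", 0.95), ("japan", 0.91), ("korea", 0.80))
val num = 2

// Old approach: sort the entire collection, then keep the top matches.
val topBySort = similarities.sortBy(-_._2).take(num + 1)

// New approach: one pass, retaining only the best num + 1 seen so far.
// Ordering by negated similarity makes dequeue() evict the current worst element.
val heap = mutable.PriorityQueue.empty[(String, Double)](Ordering.by[(String, Double), Double](p => -p._2))
similarities.foreach { pair =>
  heap += pair
  if (heap.size > num + 1) heap.dequeue()
}
val topByQueue = heap.toSeq.sortBy(-_._2)   // only num + 1 elements left to sort
```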

## How was this patch tested?

This patch adds no user-visible functionality and its correctness should be 
exercised by existing tests.  To ensure that this approach is actually faster, 
I made a microbenchmark for `findSynonyms`:

```
object W2VTiming {
  import org.apache.spark.{SparkContext, SparkConf}
  import org.apache.spark.mllib.feature.Word2VecModel

  def run(modelPath: String, scOpt: Option[SparkContext] = None) {
    val sc = scOpt.getOrElse(
      new SparkContext(new SparkConf(true).setMaster("local[*]").setAppName("test")))
    val model = Word2VecModel.load(sc, modelPath)
    val keys = model.getVectors.keys
    val start = System.currentTimeMillis
    for (key <- keys) {
      model.findSynonyms(key, 5)
      model.findSynonyms(key, 10)
      model.findSynonyms(key, 25)
      model.findSynonyms(key, 50)
    }
    val finish = System.currentTimeMillis
    println("run completed in " + (finish - start) + "ms")
  }
}
```

I ran this test on a model generated from the complete works of Jane Austen 
and found that the new approach was over 3x faster than the old approach.  (As 
the `num` argument to `findSynonyms` approaches the vocabulary size, the new 
approach will have less of an advantage over the old one.)



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/willb/spark SPARK-17595

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15150.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15150


commit ddba6577a6ece67f77b3757da859c88fa9065c04
Author: William Benton <wi...@redhat.com>
Date:   2016-09-19T14:35:57Z

Use a bounded priority queue to find synonyms in Word2VecModel







[GitHub] spark issue #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms no long...

2016-09-16 Thread willb
Github user willb commented on the issue:

https://github.com/apache/spark/pull/15105
  
Thanks for the feedback @srowen!  I think 18e6bfe addresses everything from 
a code perspective, but it missed removing the comment about assuming that 
distinct words have distinct vector representations, so I've just pushed 
another commit that just removes that comment.





[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...

2016-09-16 Thread willb
Github user willb commented on a diff in the pull request:

https://github.com/apache/spark/pull/15105#discussion_r79157601
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala ---
@@ -227,7 +227,7 @@ class Word2VecModel private[ml] (
*/
   @Since("1.5.0")
   def findSynonyms(word: String, num: Int): DataFrame = {
-findSynonyms(wordVectors.transform(word), num)
+findSynonyms(wordVectors.transform(word), num, Some(word))
--- End diff --

In this case (and similarly in 
[`Word2VecModelWrapper`](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/Word2VecModelWrapper.scala))
 I opted to call the three-argument version because the wrappers both 
explicitly convert their argument to a vector before calling `findSynonyms` on 
the underlying model (and so `wordOpt` would not be defined if the wrapper were 
invoked with a word).  If we were to make the three-argument `findSynonyms` 
private we wouldn't be able to share a code path in the wrapper classes and 
would need to duplicate the code to tidy and reformat results in both methods 
(data frame creation in this case, unzipping and `asJava` in the Python model 
wrapper) or factor it out to a separate method.  Let me know how you want me to 
proceed here.

I agree that updating the docs makes sense and will make it clearer to 
future maintainers as well.





[GitHub] spark issue #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms no long...

2016-09-15 Thread willb
Github user willb commented on the issue:

https://github.com/apache/spark/pull/15105
  
@srowen Yes.  But if we're querying for words similar to a given vector, 
then there's no word to filter out (and, indeed, no way to know which word we 
might want to filter out if multiple words might map to the same vector 
representation).





[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...

2016-09-15 Thread willb
Github user willb commented on a diff in the pull request:

https://github.com/apache/spark/pull/15105#discussion_r79005872
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -566,8 +582,8 @@ class Word2VecModel private[spark] (
 wordList.zip(cosVec)
   .toSeq
   .sortBy(-_._2)
-  .take(num + 1)
-  .tail
+  .filter(tup => wordOpt.map(w => !w.equals(tup._1)).getOrElse(true) && tup._2 != 1.0d)
--- End diff --

Consider the case where you're passing in a vector that is extremely 
similar to the representation of a word in the vocabulary but that is not 
itself the representation of a word in the vocabulary.  (A concrete example is in 
[this test I added](https://github.com/willb/spark/blob/fix/findSynonyms/mllib/src/test/scala/org/apache/spark/mllib/feature/Word2VecSuite.scala#L72).)
  In this case, `wordOpt` is not defined (because you are querying for words 
whose vector representations are similar to a given vector) but you nonetheless 
are not interested in discarding the best match because it is not a trivial 
match (that is, it is not going to be your query term).

Related to your other comment (and discussion elsewhere on this PR), I 
think we could make a case for changing the documented behavior (especially 
since it is only documented as such in pyspark) in the case where 
`findSynonyms` is invoked with the vector representation of a word that is _in_ 
the vocabulary.  Instead of rejecting the best match in that case, we could 
return it. The argument there is that such a result is telling you something 
you didn't necessarily know (otherwise, you'd probably be querying for a word 
and not a vector) and that there is an easy way to identify that such a match 
is trivially the best match.  I recognize that changing documented behavior is 
a big deal, but in this case it seems like it could be the best way to address 
a minor bug.
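A small illustration of the two behaviors with made-up similarities (not taken from the PR's tests):

```
// A query vector close to, but not equal to, the vector for "china",
// so there is no query word to filter out.
val num = 2
val scored = Seq(("china", 0.998), ("taiwan", 0.942), ("japan", 0.917), ("korea", 0.861))
val wordOpt: Option[String] = None   // the query was a vector, not a word

// Old behavior: unconditionally drop the best match -- "china" is rejected spuriously.
val oldResult = scored.take(num + 1).tail.take(num)

// Changed behavior: drop a candidate only if it is the query word or a trivial (cos == 1.0) match.
val newResult = scored
  .filter { case (w, sim) => wordOpt.forall(_ != w) && sim != 1.0 }
  .take(num)
```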





[GitHub] spark issue #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms no long...

2016-09-15 Thread willb
Github user willb commented on the issue:

https://github.com/apache/spark/pull/15105
  
Can we assume in general that distinct words will not have identical (viz., 
by `.equals`) vector representations?  I ask because the behavior in the 
current documentation (in that pyspark test I linked above) assumes that you 
don't want a word to turn up as a synonym for its own vector representation.  
But this is impossible to enforce if we aren't guaranteed that distinct words 
won't have identical vector representations.

This seems like a mostly reasonable assumption but it is definitely a 
corner case that we might want to be robust to (or at least take note of).  If 
we can't accept this assumption, then we should figure out whether or not 
changing the expected behavior in the documentation is acceptable.
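A toy example of the corner case in question (hypothetical words, using MLlib vectors, which compare element-wise):

```
import org.apache.spark.mllib.linalg.Vectors

// Two distinct vocabulary entries with .equals-identical vector representations:
// filtering out "the word whose vector equals the query" is ambiguous here.
val colour = Vectors.dense(0.5, 0.5, 0.5)
val color  = Vectors.dense(0.5, 0.5, 0.5)
assert(colour.equals(color))
```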





[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...

2016-09-15 Thread willb
Github user willb commented on a diff in the pull request:

https://github.com/apache/spark/pull/15105#discussion_r78995496
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -566,8 +582,8 @@ class Word2VecModel private[spark] (
 wordList.zip(cosVec)
   .toSeq
   .sortBy(-_._2)
-  .take(num + 1)
-  .tail
+  .filter(tup => wordOpt.map(w => !w.equals(tup._1)).getOrElse(true) && tup._2 != 1.0d)
--- End diff --

@srowen So actually reverting to the old code but filtering only if 
`wordOpt` is defined doesn't handle the original case I was considering here, 
where you pass in a vector that is very similar to the representation of a word 
in the vocabulary but that is not itself the representation of a word in the 
vocabulary.





[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...

2016-09-15 Thread willb
Github user willb commented on a diff in the pull request:

https://github.com/apache/spark/pull/15105#discussion_r78983451
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -566,8 +582,8 @@ class Word2VecModel private[spark] (
 wordList.zip(cosVec)
   .toSeq
   .sortBy(-_._2)
-  .take(num + 1)
-  .tail
+  .filter(tup => wordOpt.map(w => !w.equals(tup._1)).getOrElse(true) && tup._2 != 1.0d)
--- End diff --

I wasn't sure if the todo was referring to the sorting or to earlier 
optimizations.  I'll get it started as a separate issue; thanks!





[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...

2016-09-15 Thread willb
Github user willb commented on a diff in the pull request:

https://github.com/apache/spark/pull/15105#discussion_r78981374
  
--- Diff: mllib/src/test/scala/org/apache/spark/mllib/feature/Word2VecSuite.scala ---
@@ -68,6 +69,21 @@ class Word2VecSuite extends SparkFunSuite with MLlibTestSparkContext {
 assert(syms(1)._1 == "japan")
   }
 
+  test("findSynonyms doesn't reject similar word vectors when called with a vector") {
+val num = 2
+val word2VecMap = Map(
+  ("china", Array(0.50f, 0.50f, 0.50f, 0.50f)),
+  ("japan", Array(0.40f, 0.50f, 0.50f, 0.50f)),
+  ("taiwan", Array(0.60f, 0.50f, 0.50f, 0.50f)),
+  ("korea", Array(0.45f, 0.60f, 0.60f, 0.60f))
+)
+val model = new Word2VecModel(word2VecMap)
+val syms = model.findSynonyms(Vectors.dense(Array(0.52d, 0.50d, 0.50d, 0.50d)), num)
--- End diff --

Yes to both (the two significant digits was simply following the style 
earlier in the test).





[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...

2016-09-15 Thread willb
Github user willb commented on a diff in the pull request:

https://github.com/apache/spark/pull/15105#discussion_r78981066
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -566,8 +582,8 @@ class Word2VecModel private[spark] (
 wordList.zip(cosVec)
   .toSeq
   .sortBy(-_._2)
-  .take(num + 1)
-  .tail
+  .filter(tup => wordOpt.map(w => !w.equals(tup._1)).getOrElse(true) && tup._2 != 1.0d)
--- End diff --

@srowen This code in general kind of bothers me (I'd rather see a single 
pass through the tuples with a bounded priority queue keeping track of the `num 
+ 1` candidates than converting to a sequence and then allocating an array to 
sort in place).  But I'm inclined to get some numbers showing that that is a 
good idea and make it a separate PR unless this is a good time to fold it in 
(so to speak).





[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...

2016-09-15 Thread willb
Github user willb commented on a diff in the pull request:

https://github.com/apache/spark/pull/15105#discussion_r78973934
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -566,8 +582,8 @@ class Word2VecModel private[spark] (
 wordList.zip(cosVec)
   .toSeq
   .sortBy(-_._2)
-  .take(num + 1)
-  .tail
+  .filter(tup => wordOpt.map(w => !w.equals(tup._1)).getOrElse(true) && tup._2 != 1.0d)
--- End diff --

@hhbyyh So [one of the python 
doctests](https://github.com/apache/spark/blob/branch-2.0/python/pyspark/mllib/feature.py#L575)
 depends on a word not being a synonym of its vector representation.  I think 
since this is the documented behavior now, that's the direction the fix should 
go as well, but I'll use `Vector.equals` instead of checking similarity in any 
case.





[GitHub] spark issue #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms no long...

2016-09-14 Thread willb
Github user willb commented on the issue:

https://github.com/apache/spark/pull/15105
  
Thanks, @hhbyyh.  This is what I get for running `sbt "testOnly ..."`  I'll 
push a fix.





[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...

2016-09-14 Thread willb
GitHub user willb opened a pull request:

https://github.com/apache/spark/pull/15105

[SPARK-17548] [MLlib] Word2VecModel.findSynonyms no longer spuriously 
rejects the best match when invoked with a vector

## What changes were proposed in this pull request?

This pull request changes the behavior of `Word2VecModel.findSynonyms` so 
that it will not spuriously reject the best match when invoked with a vector 
that does not correspond to a word in the model's vocabulary.  Instead of 
blindly discarding the best match, the changed implementation discards a match 
that corresponds to the query word (in cases where `findSynonyms` is invoked 
with a word) or that has an identical angle to the query vector.

## How was this patch tested?

I added a test to `Word2VecSuite` to ensure that the word with the most 
similar vector from a supplied vector would not be spuriously rejected.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/willb/spark fix/findSynonyms

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15105.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15105


commit 5a7cf31e1c76c2fe4e172c2889fa23deb2152af9
Author: William Benton <wi...@redhat.com>
Date:   2016-09-14T17:07:14Z

test for spurious rejection of similar vectors in findSynonyms

commit 757ce7c73e76f751395d55de3b5b31fa4a05cde8
Author: William Benton <wi...@redhat.com>
Date:   2016-09-14T21:35:29Z

`Word2VecModel.findSynonyms` no longer spuriously rejects the best match

Previously, the `findSynonyms` method in `Word2VecModel` rejected the
closest-matching vector.  This was typically correct in cases where we
were searching for synonyms of a word, but was incorrect in cases
where we were searching for words most similar to a given vector,
since the given vector might not correspond to a word in the model's
vocabulary.

With this commit, `findSynonyms` will not discard the best matching
term unless the best matching word is also the query word (or is
maximally similar to the query vector).







[GitHub] spark pull request: SPARK-2863: [SQL] Add facilities for function-...

2014-10-29 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/2768#issuecomment-60934954
  
I've rebased this to fix merge conflicts.





[GitHub] spark pull request: SPARK-2863: [SQL] Add facilities for function-...

2014-10-13 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/2768#issuecomment-58893902
  
Two open questions, and the latter is more relevant:  is requiring that 
actuals be cast to the types of formals too restrictive?  Is it likely to 
lead to type-coercion rules oscillating?  (Obviously, it should be possible to 
pass, e.g., a value of a narrow numeric type where a wider one is expected.  
But if all the type-coercion rules we can anticipate ultimately widen types or 
convert from strings into other values, then the rules will still make 
progress.)

(cc @marmbrus)
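A toy sketch of the progress argument (hypothetical types, not Catalyst's actual `DataType` hierarchy): as long as every allowed cast only widens a numeric type or converts a string, repeated application of the rule cannot oscillate.

```
// Toy type lattice standing in for the SQL types.
sealed trait SqlType
case object IntType extends SqlType
case object LongType extends SqlType
case object DoubleType extends SqlType
case object StringType extends SqlType

// A cast "makes progress" if it only widens or converts from a string.
def widensTo(from: SqlType, to: SqlType): Boolean = (from, to) match {
  case (IntType, LongType) | (IntType, DoubleType) | (LongType, DoubleType) => true
  case (StringType, _)                                                      => true
  case (a, b)                                                               => a == b
}

// Casting an actual to its formal's type is allowed only when it makes progress.
def coerce(actual: SqlType, formal: SqlType): Option[SqlType] =
  if (widensTo(actual, formal)) Some(formal) else None
```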





[GitHub] spark pull request: SPARK-2863: [SQL] Add facilities for function-...

2014-10-11 Thread willb
GitHub user willb opened a pull request:

https://github.com/apache/spark/pull/2768

SPARK-2863: [SQL] Add facilities for function-argument coercion

This commit adds the `SignedFunction` trait and modifies the `Sqrt` expression
class to use it for coercing its argument to `DoubleType`.

`SignedFunction` represents a fixed-arity function whose arguments should be
cast to particular types. Expression classes extending `SignedFunction`
must provide `formalTypes`, a List of expected types for formal parameters,
`actualParams`, a list of Expressions corresponding to actual parameters,
and `create`, which creates an instance of that expression class from a list
of expressions corresponding to actuals. The type parameter for
`SignedFunction` should be the expression class extending it.

See the Sqrt class for a concrete example.

This trait (or one or several abstract classes extending this trait) could
be exposed to code outside `sql` in the future.
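A hedged sketch of the shape described above, with toy stand-ins for Catalyst's `Expression` and `DataType` (this is not the code in the PR):

```
// Toy stand-ins for the Catalyst classes.
trait DataType
case object DoubleType extends DataType
trait Expression { def dataType: DataType }

trait SignedFunction[T <: Expression] { self: Expression =>
  // expected types of the formal parameters, in order
  def formalTypes: List[DataType]
  // the expressions supplied as actual parameters
  def actualParams: List[Expression]
  // rebuild this expression from a list of (possibly newly cast) actuals
  def create(actuals: List[Expression]): T
}

// e.g. a Sqrt extending SignedFunction[Sqrt] would declare
// formalTypes = List(DoubleType) and actualParams = List(child); a
// type-coercion rule could then call create with each actual cast to its formal type.
```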

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/willb/spark spark-2863

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2768.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2768


commit 4f9517a2c11d13f439f3ed7ea447a4559f9e9088
Author: William Benton wi...@redhat.com
Date:   2014-10-11T12:40:10Z

Adds SignedFunction trait and type coercion rules

SignedFunction represents a fixed-arity function whose arguments should be
cast to particular types. Expression classes extending SignedFunction
must provide `formalTypes`, a List of expected types for formal parameters,
`actualParams`, a list of Expressions corresponding to actual parameters,
and create, which creates an instance of that expression class from a list
of expressions corresponding to actuals. The type parameter for
SignedFunction should be the expression class extending it.

See the Sqrt class for a concrete example.

This trait (or one or several abstract classes extending this trait) could
be exposed to code outside `sql` in the future.







[GitHub] spark pull request: SPARK-3699: SQL and Hive console tasks now cle...

2014-09-26 Thread willb
GitHub user willb opened a pull request:

https://github.com/apache/spark/pull/2547

SPARK-3699:  SQL and Hive console tasks now clean up appropriately

The sbt tasks sql/console and hive/console will now `stop()`
the `SparkContext` upon exit.  Previously, they left an ugly stack
trace when quitting.
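A hedged sketch of the sbt mechanism involved (the setting bodies below are illustrative, not copied from the Spark build): `initialCommands` creates the context when the console starts, and `cleanupCommands` runs when the REPL exits, so the context can be stopped instead of dying with a stack trace.

```
// Illustrative sbt settings (sbt 0.13-style syntax).
initialCommands in console :=
  """import org.apache.spark.{SparkConf, SparkContext}
    |val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("console"))
  """.stripMargin

cleanupCommands in console := "sc.stop()"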

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/willb/spark consoleCleanup

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2547.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2547


commit d5e431f0a1b9047a5afc27cb371dbfb7014fb6e0
Author: William Benton wi...@redhat.com
Date:   2014-09-26T18:45:20Z

SQL and Hive console tasks now clean up.

The sbt tasks sql/console and hive/console will now clean up
the SparkContext upon exit.  Previously, they left an ugly stack
trace when quitting.







[GitHub] spark pull request: SPARK-3699: SQL and Hive console tasks now cle...

2014-09-26 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/2547#issuecomment-57006526
  
Looks like Jenkins and GitHub are having a minor disagreement:

FATAL: Failed to fetch from https://github.com/apache/spark.git





[GitHub] spark pull request: SPARK-3699: SQL and Hive console tasks now cle...

2014-09-26 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/2547#issuecomment-57032615
  
This patch only changes the sbt build file; it adds no classes and should 
not have caused any tests to fail. 





[GitHub] spark pull request: [SPARK-3535][Mesos] Add 15% task memory overhe...

2014-09-18 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/2401#issuecomment-56121260
  
@andrewor14 This bothers me too, but in a slightly different way:  calling 
the parameter “overhead” when it really refers to how to scale requested 
memory to accommodate anticipated overhead seems wrong.  (115% overhead would 
be appalling indeed!)





[GitHub] spark pull request: [SPARK-3535][Mesos] Add 15% task memory overhe...

2014-09-17 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/2401#issuecomment-55902476
  
Here's a somewhat-related concern:  it seems like JVM overhead is unlikely 
to scale linearly with requested heap size, so maybe a straight-up 15% isn't a 
great default?  (If you have hard data on how heap requirements grow with job 
size, I'd be interested in seeing it.)  Perhaps it would make more sense to 
reserve whichever is smaller of 15% or some fixed but reasonable amount.
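For illustration only (the 384 MB floor below is a made-up placeholder, not a proposed value):

```
// A flat 15% grows linearly with the executor's requested memory,
// while the alternative suggested above caps the reservation at a fixed amount.
val requestedMemMB   = 8192
val linearOverheadMB = (requestedMemMB * 0.15).toInt      // 1228 MB at 8 GB
val cappedOverheadMB = math.min(linearOverheadMB, 384)    // 384 MB under the suggested policy
```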





[GitHub] spark pull request: Fix postfixOps warnings in the test suite

2014-09-11 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1323#issuecomment-55345856
  
Hi @jkbradley; thanks for taking a look.  Here are the warnings as I see 
them when compiling tests on the immediate ancestor of my branch, which is 
56e009d (I'm running on OS X 10.9 in this case):

```
[warn] /Users/willb/Documents/src/scala/alternate-spark/core/src/test/scala/org/apache/spark/ContextCleanerSuite.scala:38: postfix operator millis should be enabled
[warn] by making the implicit value scala.language.postfixOps visible.
[warn] This can be achieved by adding the import clause 'import scala.language.postfixOps'
[warn] or by setting the compiler option -language:postfixOps.
[warn] See the Scala docs for value scala.language.postfixOps for a discussion
[warn] why the feature should be explicitly enabled.
[warn]   implicit val defaultTimeout = timeout(1 millis)
[warn]   ^
[warn] /Users/willb/Documents/src/scala/alternate-spark/core/src/test/scala/org/apache/spark/ContextCleanerSuite.scala:99: postfix operator millis should be enabled
[warn] by making the implicit value scala.language.postfixOps visible.
[warn]   preGCTester.assertCleanup()(timeout(1000 millis))
[warn]^
[warn] /Users/willb/Documents/src/scala/alternate-spark/core/src/test/scala/org/apache/spark/ContextCleanerSuite.scala:117: postfix operator millis should be enabled
[warn] by making the implicit value scala.language.postfixOps visible.
[warn]   preGCTester.assertCleanup()(timeout(1000 millis))
[warn]^
[warn] /Users/willb/Documents/src/scala/alternate-spark/core/src/test/scala/org/apache/spark/ContextCleanerSuite.scala:134: postfix operator millis should be enabled
[warn] by making the implicit value scala.language.postfixOps visible.
[warn]   preGCTester.assertCleanup()(timeout(1000 millis))
[warn]^
[warn] /Users/willb/Documents/src/scala/alternate-spark/core/src/test/scala/org/apache/spark/ContextCleanerSuite.scala:156: postfix operator millis should be enabled
[warn] by making the implicit value scala.language.postfixOps visible.
[warn]   preGCTester.assertCleanup()(timeout(1000 millis))
[warn]^
[warn] /Users/willb/Documents/src/scala/alternate-spark/core/src/test/scala/org/apache/spark/ContextCleanerSuite.scala:187: postfix operator millis should be enabled
[warn] by making the implicit value scala.language.postfixOps visible.
[warn]   preGCTester.assertCleanup()(timeout(1000 millis))
[warn]^
[warn] /Users/willb/Documents/src/scala/alternate-spark/core/src/test/scala/org/apache/spark/ContextCleanerSuite.scala:290: postfix operator millis should be enabled
[warn] by making the implicit value scala.language.postfixOps visible.
[warn]   eventually(waitTimeout, interval(100 millis)) {
[warn]^
[warn] /Users/willb/Documents/src/scala/alternate-spark/core/src/test/scala/org/apache/spark/rdd/AsyncRDDActionsSuite.scala:138: postfix operator seconds should be enabled
[warn] by making the implicit value scala.language.postfixOps visible.
[warn] failAfter(10 seconds) {
[warn]  ^
[warn] /Users/willb/Documents/src/scala/alternate-spark/core/src/test/scala/org/apache/spark/rdd/AsyncRDDActionsSuite.scala:174: postfix operator seconds should be enabled
[warn] by making the implicit value scala.language.postfixOps visible.
[warn] failAfter(10 seconds) {
[warn]  ^
[warn] /Users/willb/Documents/src/scala/alternate-spark/core/src/test/scala/org/apache/spark/ui/UISuite.scala:42: postfix operator seconds should be enabled
[warn] by making the implicit value scala.language.postfixOps visible.
[warn]   eventually(timeout(10 seconds), interval(50 milliseconds)) {
[warn] ^
[warn] /Users/willb/Documents/src/scala/alternate-spark/core/src/test/scala/org/apache/spark/ui/UISuite.scala:42: postfix operator milliseconds should be enabled
[warn] by making the implicit value scala.language.postfixOps visible.
[warn]   eventually(timeout(10 seconds), interval(50 milliseconds)) {
[warn]   ^
[warn] /Users/willb/Documents/src/scala/alternate-spark/core/src/test/scala/org/apache/spark/ui/UISuite.scala:56: postfix operator seconds should be enabled
[warn] by making the implicit value scala.language.postfixOps visible.
[warn]   eventually(timeout(10 seconds), interval(50 milliseconds)) {
[warn

[GitHub] spark pull request: Fix postfixOps warnings in the test suite

2014-09-11 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1323#issuecomment-55351655
  
Hey @andrewor14, thanks for the reply.  First off, I absolutely agree with 
@srowen's comment on #1330 that `import`s (not compiler flags) are the right 
way to handle enabling these language features.  It looks to me like 
`SpanSugar` pulls in `postfixOps` -- and that it's the only thing in those 
files that uses `postfixOps`.  (I guess having both `SpanSugar` and 
`postfixOps` imported at toplevel was causing some implicit resolution 
confusion?)

In any case, the approach in #1330 is probably the way to go since 
explicitly importing `postfixOps` seems unnecessary and removing the compiler 
flag is a good idea.  Thanks again for taking a look!
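A minimal sketch of the import-based fix under discussion (using `scala.concurrent.duration` here rather than ScalaTest's `SpanSugar`, purely to keep the example self-contained):

```
import scala.concurrent.duration._
import scala.language.postfixOps   // silences the "postfix operator should be enabled" warning

val t = 1000 millis                // postfix call that otherwise triggers the warning
```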





[GitHub] spark pull request: Fix postfixOps warnings in the test suite

2014-09-11 Thread willb
Github user willb closed the pull request at:

https://github.com/apache/spark/pull/1323





[GitHub] spark pull request: SPARK-3423: [SQL] Implement BETWEEN for SQLPar...

2014-09-05 Thread willb
GitHub user willb opened a pull request:

https://github.com/apache/spark/pull/2295

SPARK-3423:  [SQL] Implement BETWEEN for SQLParser



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/willb/spark sql-between

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2295.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2295


commit 0016d300a77e8ef80899a679f54e7451544361dc
Author: William Benton wi...@redhat.com
Date:   2014-09-05T23:00:29Z

Implement BETWEEN for SQLParser







[GitHub] spark pull request: SPARK-3329: [SQL] Don't depend on Hive SET pai...

2014-09-03 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/2220#issuecomment-54303443
  
This failure (in `SparkSubmitSuite`) appears unrelated to my patch.





[GitHub] spark pull request: SPARK-3329: [SQL] Don't depend on Hive SET pai...

2014-09-02 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/2220#issuecomment-54165469
  
@concretevitamin I cherry-picked @aarondav's fix (and added a very simple 
fix to handle cases that it didn't).





[GitHub] spark pull request: SPARK-3329: [SQL] Don't depend on Hive SET pai...

2014-08-31 Thread willb
GitHub user willb opened a pull request:

https://github.com/apache/spark/pull/2220

SPARK-3329:  [SQL] Don't depend on Hive SET pair ordering.

This fixes some possible spurious test failures in `HiveQuerySuite` by 
comparing sets of key-value pairs as sets, rather than as lists.
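A minimal sketch of the comparison change (the SET output below is hypothetical): Hive does not guarantee pair ordering, so an order-sensitive comparison can fail spuriously.

```
val expected = Seq("mapred.reduce.tasks=-1", "hive.exec.dynamic.partition=true")
val actual   = Seq("hive.exec.dynamic.partition=true", "mapred.reduce.tasks=-1")

assert(expected != actual)               // list comparison is order-sensitive
assert(expected.toSet == actual.toSet)   // comparing as sets ignores the ordering
```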

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/willb/spark spark-3329

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2220.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2220


commit 36ff52aee8ac2d8da336a56df195e6093e8f7807
Author: William Benton wi...@redhat.com
Date:   2014-08-31T17:08:00Z

Don't depend on Hive SET pair ordering.







[GitHub] spark pull request: SPARK-3329: [SQL] Don't depend on Hive SET pai...

2014-08-31 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/2220#issuecomment-53996991
  
Thanks, @concretevitamin!  I'll close this one then.





[GitHub] spark pull request: spark-729: predictable closure capture

2014-08-29 Thread willb
Github user willb closed the pull request at:

https://github.com/apache/spark/pull/1322





[GitHub] spark pull request: spark-729: predictable closure capture

2014-08-29 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1322#issuecomment-53903366
  
@mateiz sure; I've tracked down the problem but am a bit stumped by how to 
fix it.  I'll reopen when I have a solution.





[GitHub] spark pull request: SPARK-2813: [SQL] Implement SQRT() directly in...

2014-08-28 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1750#issuecomment-53753423
  
@marmbrus I've rebased this atop the current master and added support for 
casting string-valued SQRT arguments to double and a test case in SQLQuerySuite 
for the same.  (These will be unnecessary when I finish general support for 
string coercion in numeric functions, but they make this patch mergeable now.)





[GitHub] spark pull request: SPARK-2813: [SQL] Implement SQRT() directly in...

2014-08-27 Thread willb
Github user willb commented on a diff in the pull request:

https://github.com/apache/spark/pull/1750#discussion_r16815625
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala ---
@@ -935,6 +936,7 @@ private[hive] object HiveQl {
 case Token(DIV(), left :: right:: Nil) =>
   Cast(Divide(nodeToExpr(left), nodeToExpr(right)), LongType)
 case Token("%", left :: right:: Nil) => Remainder(nodeToExpr(left), nodeToExpr(right))
+case Token("TOK_FUNCTION", Token(SQRT(), Nil) :: arg :: Nil) => Sqrt(nodeToExpr(arg))
--- End diff --

Hey @liancheng, I agree that it is possible to special-case this function 
(and can certainly do so for this PR).  I recently [wrote up what Hive 
does](http://chapeau.freevariable.com/2014/08/existing-system-coercion.html) 
and have some work in progress on a more general solution that should capture 
what Hive does.





[GitHub] spark pull request: SPARK-2813: [SQL] Implement SQRT() directly in...

2014-08-06 Thread willb
Github user willb commented on a diff in the pull request:

https://github.com/apache/spark/pull/1750#discussion_r15894307
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala ---
@@ -935,6 +936,7 @@ private[hive] object HiveQl {
 case Token(DIV(), left :: right:: Nil) =>
   Cast(Divide(nodeToExpr(left), nodeToExpr(right)), LongType)
 case Token("%", left :: right:: Nil) => Remainder(nodeToExpr(left), nodeToExpr(right))
+case Token("TOK_FUNCTION", Token(SQRT(), Nil) :: arg :: Nil) => Sqrt(nodeToExpr(arg))
--- End diff --

@marmbrus to clarify, should I have a solution for the new type-coercion 
JIRA before doing any other reimplementations, or can I keep submitting 
reimplementations as well?





[GitHub] spark pull request: Fix postfixOps warnings in the test suite

2014-08-06 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1323#issuecomment-51378794
  
Can someone take a look at this again?





[GitHub] spark pull request: SPARK-2813: [SQL] Implement SQRT() directly in...

2014-08-04 Thread willb
Github user willb commented on a diff in the pull request:

https://github.com/apache/spark/pull/1750#discussion_r15791823
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala ---
@@ -935,6 +936,7 @@ private[hive] object HiveQl {
 case Token(DIV(), left :: right:: Nil) =>
   Cast(Divide(nodeToExpr(left), nodeToExpr(right)), LongType)
 case Token("%", left :: right:: Nil) => Remainder(nodeToExpr(left), nodeToExpr(right))
+case Token("TOK_FUNCTION", Token(SQRT(), Nil) :: arg :: Nil) => Sqrt(nodeToExpr(arg))
--- End diff --

It does.  What's the best way to handle this in general?





[GitHub] spark pull request: SPARK-2813: [SQL] Implement SQRT() directly in...

2014-08-04 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1750#issuecomment-51143752
  
@marmbrus I'll file a JIRA for that and am happy to put it at the front of 
my plate; sounds like a fun problem!





[GitHub] spark pull request: SPARK-2813: [SQL] Implement SQRT() directly in...

2014-08-02 Thread willb
GitHub user willb opened a pull request:

https://github.com/apache/spark/pull/1750

SPARK-2813:  [SQL] Implement SQRT() directly in Catalyst

This PR adds a native implementation for SQL SQRT() and thus avoids 
delegating this function to Hive.
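
Roughly, the native expression has this shape (an illustrative sketch assuming Catalyst's Expression/UnaryExpression traits and Row are in scope; the eval helpers and numeric widening in the committed code differ):

    case class Sqrt(child: Expression) extends UnaryExpression {
      type EvaluatedType = Any

      def dataType = DoubleType
      def nullable = child.nullable
      def references = child.references
      override def toString = s"SQRT($child)"

      override def eval(input: Row): Any = {
        val value = child.eval(input)
        // Widening of other numeric types is elided; assume the argument evaluates to a Double.
        if (value == null) null else math.sqrt(value.asInstanceOf[Double])
      }
    }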

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/willb/spark spark-2813

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1750.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1750


commit 18d63f93316e56b9f0e137e272981b5a2eb84074
Author: William Benton wi...@redhat.com
Date:   2014-08-02T15:30:26Z

Added native SQRT implementation

commit bb8022612c468ae99531fbcc9ddff8a5f45bcf36
Author: William Benton wi...@redhat.com
Date:   2014-08-02T16:22:40Z

added SQRT test to SqlQuerySuite




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2226: [SQL] transform HAVING clauses wit...

2014-07-23 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1497#issuecomment-49876099
  
Thanks, @marmbrus; this works now and I understand where things were going 
wrong (and also, hopefully, how to use Catalyst more idiomatically).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2226: [SQL] transform HAVING clauses wit...

2014-07-23 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1497#issuecomment-49912686
  
@marmbrus, thanks again; sorry to have missed those earlier!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: spark-729: predictable closure capture

2014-07-23 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1322#issuecomment-49913436
  
@mateiz, I think there's a memory blowup somewhere in this patch as it is 
and am trying to track it down.  Coincidentally, it's what I was switching 
context back to when I saw this comment.

@rxin, can you point me to the broadcast change so I can track it?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: spark-729: predictable closure capture

2014-07-23 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1322#issuecomment-49926246
  
Thanks, @rxin


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2226: [SQL] transform HAVING clauses wit...

2014-07-22 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1497#issuecomment-49726353
  
@pwendell Sure, and I'll do this in the future.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2226: [SQL] transform HAVING clauses wit...

2014-07-22 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1497#issuecomment-49726591
  
@marmbrus I've made the changes you suggested and moved the rule to the 
Resolution batch but now the newly-inserted attribute references remain 
unresolved after analysis.  I'm going to try and figure out where things are 
going wrong.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2226: [SQL] transform HAVING clauses wit...

2014-07-22 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1497#issuecomment-49783600
  
@marmbrus I've pushed it to 
https://github.com/willb/spark/tree/spark-2226-broken -- if I split out the 
`UnresolvedHavingClauseAttributes` to a separate batch and add a 
`ResolveReferences` (as in willb@16b3dfc), things seem to work, but if I put 
`UnresolvedHavingClauseAttributes` in Resolution (as in willb@76ae5c1), things 
fail with

 No function to evaluate expression. type: AttributeReference

I'm sure I'm missing something simple here.  Thanks for taking a look!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2226: transform HAVING clauses with aggr...

2014-07-21 Thread willb
Github user willb commented on a diff in the pull request:

https://github.com/apache/spark/pull/1497#discussion_r15172123
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -152,6 +155,34 @@ class Analyzer(catalog: Catalog, registry: 
FunctionRegistry, caseSensitive: Bool
   }
 
   /**
+   * This rule finds expressions in HAVING clause filters that depend on
+   * unresolved attributes.  It pushes these expressions down to the 
underlying
+   * aggregates and then projects them away above the filter.
+   */
+  object UnresolvedHavingClauseAttributes extends Rule[LogicalPlan] {
+    def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+      case pl @ Filter(fexp, agg @ Aggregate(_, ae, _)) if !fexp.childrenResolved => {
+        val alias = Alias(fexp, makeTmp())()
+        val aggExprs = Seq(alias) ++ ae
+
+        val newCond = EqualTo(Cast(alias.toAttribute, BooleanType), Literal(true, BooleanType))
+
+        val newFilter = ResolveReferences(pl.copy(condition = newCond,
+          child = agg.copy(aggregateExpressions = aggExprs)))
+
+        Project(pl.output, newFilter)
+      }
+    }
+
+    private val curId = new java.util.concurrent.atomic.AtomicLong()
+
+    private def makeTmp() = {
+      val id = curId.getAndIncrement()
+      s"tmp_cond_$id"
--- End diff --

@marmbrus Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2226: transform HAVING clauses with aggr...

2014-07-20 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1497#issuecomment-49542704
  
Thanks, @rxin!  I've made these changes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2226: transform HAVING clauses with unre...

2014-07-19 Thread willb
GitHub user willb opened a pull request:

https://github.com/apache/spark/pull/1497

SPARK-2226:  transform HAVING clauses with unresolvable attributes

This commit adds an analyzer rule to
  1. find expressions in `HAVING` clause filters that depend on unresolved 
attributes, 
  2. push these expressions down to the underlying aggregates, and then
  3. project them away above the filter.

It also enables the `HAVING` queries in the Hive compatibility suite.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/willb/spark spark-2226

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1497.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1497


commit 29a26e3ab6a21e6619f003d905bc7aa7d1cb2976
Author: William Benton wi...@redhat.com
Date:   2014-07-17T15:36:37Z

Added rule to handle unresolved attributes in HAVING clauses (SPARK-2226)

commit c7f2b2c8a19b09ec095a316cb965f18d474d7144
Author: William Benton wi...@redhat.com
Date:   2014-07-17T17:16:18Z

Whitelist HAVING queries.

Also adds golden outputs for HAVING tests.

commit 5a12647c169ee06bba5355c3956a158699247e43
Author: William Benton wi...@redhat.com
Date:   2014-07-19T17:08:17Z

Explanatory comments and stylistic cleanups.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2407: Added Parser of SQL SUBSTR()

2014-07-16 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1442#issuecomment-49159849
  
That was my thought as well, @egraldlo.  Thanks for submitting this, 
@chutium!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2407: Added Parser of SQL SUBSTR()

2014-07-16 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1442#issuecomment-49160982
  
@egraldlo, couldn't it be `(SUBSTR | SUBSTRING) ~ // ... ` in that case?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2486: Utils.getCallSite is now resilient...

2014-07-15 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1413#issuecomment-49023679
  
@aarondav @pwendell Yes, with this patch I'm able to enable the YourKit 
features that were causing crashes before.  I'll submit an update to fix the 
bracket style and cc you both.  Thanks for the quick review!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Reformat multi-line closure argument.

2014-07-15 Thread willb
GitHub user willb opened a pull request:

https://github.com/apache/spark/pull/1419

Reformat multi-line closure argument.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/willb/spark reformat-2486

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1419.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1419


commit 26762310ddf0ea88a418c506ed0e86892fe6e4d5
Author: William Benton wi...@redhat.com
Date:   2014-07-15T12:35:13Z

Reformat multi-line closure argument.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Reformat multi-line closure argument.

2014-07-15 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1419#issuecomment-49024982
  
(See discussion on #1413; cc @aarondav and @pwendell.)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-14 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1359#issuecomment-48971559
  
Thanks, Michael.  I've fixed the conflict.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2486: Utils.getCallSite is now resilient...

2014-07-14 Thread willb
GitHub user willb opened a pull request:

https://github.com/apache/spark/pull/1413

SPARK-2486: Utils.getCallSite is now resilient to bogus frames

When running Spark under certain instrumenting profilers,
Utils.getCallSite could crash with an NPE.  This commit
makes it more resilient to failures occurring while inspecting
stack frames.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/willb/spark spark-2486

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1413.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1413


commit 0f0c1ae825d53d773ef12a02d74072b88f65d91a
Author: William Benton wi...@redhat.com
Date:   2014-07-15T03:40:51Z

Utils.getCallSite is now resilient to bogus frames

When running Spark under certain instrumenting profilers,
Utils.getCallSite could crash with an NPE.  This commit
makes it more resilient to failures occurring while inspecting
stack frames.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2486: Utils.getCallSite is now resilient...

2014-07-14 Thread willb
Github user willb commented on a diff in the pull request:

https://github.com/apache/spark/pull/1413#discussion_r14916362
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -809,7 +809,11 @@ private[spark] object Utils extends Logging {
*/
   def getCallSite: CallSite = {
 val trace = Thread.currentThread.getStackTrace()
-      .filterNot(_.getMethodName.contains("getStackTrace"))
+      .filterNot((ste: StackTraceElement) =>
+        // When running under some profilers, the current stack trace might contain some bogus
+        // frames. This Try is intended to ensure that we don't crash in these situations by
+        // ignoring any frames that we can't examine.
+        Try(ste.getMethodName.contains("getStackTrace")).getOrElse(true))
--- End diff --

So I was thinking that either `ste` or the result of `ste.getMethodName` 
could be `null` when running under instrumentation.  I'll rewrite to use checks 
instead of `Try`, though.
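
A sketch of that null-check variant (how the filter might look, not necessarily the exact code that gets pushed):

    val trace = Thread.currentThread.getStackTrace()
      .filterNot { ste: StackTraceElement =>
        // When running under some profilers, the current stack trace might contain
        // bogus frames; ignore any frame that we can't safely examine.
        ste == null || ste.getMethodName == null ||
          ste.getMethodName.contains("getStackTrace")
      }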


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-10 Thread willb
Github user willb commented on a diff in the pull request:

https://github.com/apache/spark/pull/1359#discussion_r14782197
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala
 ---
@@ -207,3 +210,64 @@ case class StartsWith(left: Expression, right: 
Expression) extends StringCompari
 case class EndsWith(left: Expression, right: Expression) extends 
StringComparison {
   def compare(l: String, r: String) = l.endsWith(r)
 }
+
+/**
+ * A function that takes a substring of its first argument starting at a 
given position.
+ * Defined for String and Binary types.
+ */
+case class Substring(str: Expression, pos: Expression, len: Expression) 
extends Expression {
+  
+  type EvaluatedType = Any
+  
+  def nullable: Boolean = true
+  def dataType: DataType = {
+if (str.dataType == BinaryType) str.dataType else StringType
+  }
+  
+  def references = children.flatMap(_.references).toSet
+  
+  override def children = str :: pos :: len :: Nil
+  
+  def slice[T, C <% IndexedSeqOptimized[T,_]](str: C, startPos: Int, sliceLen: Int): Any = {
+val len = str.length
+// Hive and SQL use one-based indexing for SUBSTR arguments but also 
accept zero and
+// negative indices for start positions. If a start index i is greater 
than 0, it 
+// refers to element i-1 in the sequence. If a start index i is less 
than 0, it refers
+// to the -ith element before the end of the sequence. If a start 
index i is 0, it
+// refers to the first element.
+
+val start = startPos match {
+      case pos if pos > 0 => pos - 1
+      case neg if neg < 0 => len + neg
+      case _ => 0
+}
+
+val end = sliceLen match {
+      case max if max == Integer.MAX_VALUE => max
+      case x => start + x
+}
+  
+str.slice(start, end)
+  }
+  
+  override def eval(input: Row): Any = {
+val string = str.eval(input)
+
+val po = pos.eval(input)
+val ln = len.eval(input)
+
+if ((string == null) || (po == null) || (ln == null)) {
+  null
+} else {
+  val start = po.asInstanceOf[Int]
+  val length = ln.asInstanceOf[Int] 
+  
+  string match {
+        case ba: Array[Byte] => slice(ba, start, length)
+        case other => slice(other.toString, start, length)
+  }
+}
+  }
+  
+  override def toString = s"SUBSTR($str, $pos, $len)"
--- End diff --

Makes sense!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-10 Thread willb
Github user willb commented on a diff in the pull request:

https://github.com/apache/spark/pull/1359#discussion_r14782431
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala ---
@@ -860,6 +860,7 @@ private[hive] object HiveQl {
   val BETWEEN = "(?i)BETWEEN".r
   val WHEN = "(?i)WHEN".r
   val CASE = "(?i)CASE".r
+  val SUBSTR = "(?i)I_SUBSTR(?:ING)?".r
--- End diff --

Good catch -- the `I_` was spuriously committed.  (I had that in my working 
tree so that I could easily compare Catalyst and Hive `substr` from the 
console.)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-10 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1359#issuecomment-48643276
  
Thanks for the comments, @concretevitamin.  I've made the changes from your 
comments and will think about reducing branch overhead before pushing an update.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-10 Thread willb
Github user willb commented on a diff in the pull request:

https://github.com/apache/spark/pull/1359#discussion_r14783468
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala
 ---
@@ -207,3 +210,64 @@ case class StartsWith(left: Expression, right: 
Expression) extends StringCompari
 case class EndsWith(left: Expression, right: Expression) extends 
StringComparison {
   def compare(l: String, r: String) = l.endsWith(r)
 }
+
+/**
+ * A function that takes a substring of its first argument starting at a 
given position.
+ * Defined for String and Binary types.
+ */
+case class Substring(str: Expression, pos: Expression, len: Expression) 
extends Expression {
+  
+  type EvaluatedType = Any
+  
+  def nullable: Boolean = true
+  def dataType: DataType = {
+if (str.dataType == BinaryType) str.dataType else StringType
+  }
+  
+  def references = children.flatMap(_.references).toSet
+  
+  override def children = str :: pos :: len :: Nil
+  
+  def slice[T, C % IndexedSeqOptimized[T,_]](str: C, startPos: Int, 
sliceLen: Int): Any = {
+val len = str.length
+// Hive and SQL use one-based indexing for SUBSTR arguments but also 
accept zero and
--- End diff --

Hive supports 0-based indexing in the same way as this patch.  I agree that 
supporting both in this way is ugly (both from an interface and from an 
implementation perspective), but it seems likely that people are depending on 
this behavior in the wild, doesn't it?
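
For reference, the convention described in the patch maps start positions like this (illustrative helper, not part of the patch):

    def normalizeStart(startPos: Int, len: Int): Int = startPos match {
      case pos if pos > 0 => pos - 1    // SUBSTR("hello", 2)  starts at index 1 -> "ello"
      case neg if neg < 0 => len + neg  // SUBSTR("hello", -3) starts at index 2 -> "llo"
      case _ => 0                       // SUBSTR("hello", 0)  behaves like SUBSTR("hello", 1)
    }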


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-10 Thread willb
Github user willb commented on a diff in the pull request:

https://github.com/apache/spark/pull/1359#discussion_r14783626
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala
 ---
@@ -207,3 +210,64 @@ case class StartsWith(left: Expression, right: 
Expression) extends StringCompari
 case class EndsWith(left: Expression, right: Expression) extends 
StringComparison {
   def compare(l: String, r: String) = l.endsWith(r)
 }
+
+/**
+ * A function that takes a substring of its first argument starting at a 
given position.
+ * Defined for String and Binary types.
+ */
+case class Substring(str: Expression, pos: Expression, len: Expression) 
extends Expression {
+  
+  type EvaluatedType = Any
+  
+  def nullable: Boolean = true
+  def dataType: DataType = {
+if (str.dataType == BinaryType) str.dataType else StringType
+  }
+  
+  def references = children.flatMap(_.references).toSet
+  
+  override def children = str :: pos :: len :: Nil
+  
+  def slice[T, C <% IndexedSeqOptimized[T,_]](str: C, startPos: Int, sliceLen: Int): Any = {
+val len = str.length
+// Hive and SQL use one-based indexing for SUBSTR arguments but also 
accept zero and
+// negative indices for start positions. If a start index i is greater 
than 0, it 
+// refers to element i-1 in the sequence. If a start index i is less 
than 0, it refers
+// to the -ith element before the end of the sequence. If a start 
index i is 0, it
+// refers to the first element.
+
+val start = startPos match {
+      case pos if pos > 0 => pos - 1
+      case neg if neg < 0 => len + neg
+      case _ => 0
+}
+
+val end = sliceLen match {
--- End diff --

This is the behavior of `IndexedSeqOptimized[A,B].slice` (and thus this 
patch) as well as of Hive, too.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-10 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1359#issuecomment-48648670
  
@concretevitamin Inlining `Substring.slice` seems to make a nontrivial 
difference (~11.5% speedup) on a very simple `Substring.eval()` microbenchmark.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-10 Thread willb
Github user willb commented on a diff in the pull request:

https://github.com/apache/spark/pull/1359#discussion_r14802394
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala
 ---
@@ -207,3 +207,71 @@ case class StartsWith(left: Expression, right: 
Expression) extends StringCompari
 case class EndsWith(left: Expression, right: Expression) extends 
StringComparison {
   def compare(l: String, r: String) = l.endsWith(r)
 }
+
+/**
+ * A function that takes a substring of its first argument starting at a 
given position.
+ * Defined for String and Binary types.
+ */
+case class Substring(str: Expression, pos: Expression, len: Expression) 
extends Expression {
+  
+  type EvaluatedType = Any
+  
+  def nullable: Boolean = true
+  def dataType: DataType = {
+if (!resolved) {
+      throw new UnresolvedException(this, s"Cannot resolve since $children are not resolved")
+}
+if (str.dataType == BinaryType) str.dataType else StringType
+  }
+  
+  def references = children.flatMap(_.references).toSet
+  
+  override def children = str :: pos :: len :: Nil
+  
+  @inline
+  def slice[T, C <% IndexedSeqOptimized[T,_]](str: C, startPos: Int, sliceLen: Int): Any = {
--- End diff --

Yes, I think I can do it with implicit parameters.  I'll push an update 
once I've run the Catalyst suite against my change.
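
The idea, sketched with an explicit implicit-conversion parameter in place of the view bound (signature and body are illustrative, not the final patch):

    import scala.collection.IndexedSeqOptimized

    def slice[T, C](str: C, startPos: Int, sliceLen: Int)
                   (implicit ev: C => IndexedSeqOptimized[T, _]): Any = {
      val seq = ev(str)  // apply the conversion explicitly instead of via C <% ...
      val len = seq.length
      // ... same start/end normalization as above ...
      seq.slice(0, len)
    }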


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Fix (some of the) warnings in the test suite

2014-07-08 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1323#issuecomment-48326403
  
@rxin Done; I also updated the comment to reflect the narrower focus after 
eliminating overlap with #1153.  Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: spark-729: predictable closure capture

2014-07-07 Thread willb
GitHub user willb opened a pull request:

https://github.com/apache/spark/pull/1322

spark-729:  predictable closure capture

SPARK-729 concerns when free variables in closure arguments to 
transformations are captured. Currently, it is possible for closures to get the 
environment in which they are serialized (not the environment in which they are 
created).  This PR causes free variables in closure arguments to RDD 
transformations to be captured at closure creation time by modifying 
`ClosureCleaner` to serialize and deserialize its argument.

This PR is based on #189 (which is closed) but has fixes to work with some 
changes in 1.0.  In particular, it ensures that the cloned `Broadcast` objects 
produced by closure capture are registered with `ContextCleaner` so that 
broadcast variables won't become invalid simply because variable capture 
(implemented this way) causes strong references to the original broadcast 
variables to go away.

(See #189 for additional discussion and background.)
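
The capture-by-serialization idea, in minimal form (a hypothetical helper, not the actual ClosureCleaner code in this PR):

    import scala.reflect.ClassTag
    import org.apache.spark.SparkEnv

    def capture[T <: AnyRef : ClassTag](closure: T): T = {
      val ser = SparkEnv.get.closureSerializer.newInstance()
      // Round-tripping through the serializer forces free variables to be
      // captured with their values at definition time.
      ser.deserialize[T](ser.serialize(closure))
    }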

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/willb/spark spark-729

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1322.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1322


commit c24b7c8e92c035cd5b48acf11fb29dc060f68fca
Author: William Benton wi...@redhat.com
Date:   2014-07-04T16:26:59Z

Added reference counting for Broadcasts.

commit b10f613b6debe78ec1a1d53dd7f28168f8adb359
Author: William Benton wi...@redhat.com
Date:   2014-07-07T17:52:26Z

Added ContextCleaner.withCurrentCleaner

This method allows code that needs access to the currently-active
ContextCleaner to access it via a DynamicVariable.

commit 14e074a59616732d422454c1ce881bc9b29060cd
Author: William Benton wi...@redhat.com
Date:   2014-03-18T14:55:57Z

Added tests for variable capture in closures

The two tests added to ClosureCleanerSuite ensure that variable values
are captured at RDD definition time, not at job-execution time.

commit 4e026a9d130c6fc92ce4cd73a68de013ed7aee5d
Author: William Benton wi...@redhat.com
Date:   2014-03-20T15:48:17Z

Predictable closure environment capture

The environments of serializable closures are now captured as
part of closure cleaning. Since we already proactively check most
closures for serializability, ClosureCleaner.clean now returns
the result of deserializing the serialized version of the cleaned
closure.

Conflicts:
core/src/main/scala/org/apache/spark/SparkContext.scala

commit 39960620eb9c37f92ade2c35e2d5402cad6dd686
Author: William Benton wi...@redhat.com
Date:   2014-03-26T04:45:45Z

Skip proactive closure capture for runJob

There are two possible cases for runJob calls: either they are called
by RDD action methods from inside Spark or they are called from client
code. There's no need to proactively check the closure argument to
runJob for serializability or force variable capture in either case:

1. if they are called by RDD actions, their closure arguments consist
of mapping an already-serializable closure (with an already-frozen
environment) to each element in the RDD;

2. in both cases, the closure is about to execute and thus the benefit
of proactively checking for serializability (or ensuring immediate
variable capture) is nonexistent.

(Note that ensuring capture via serializability on closure arguments to
runJob also causes pyspark accumulators to fail to update.)

Conflicts:
core/src/main/scala/org/apache/spark/SparkContext.scala

commit f4ed7535a1a1af0a518d4b7c9585073026a745c1
Author: William Benton wi...@redhat.com
Date:   2014-03-26T16:31:56Z

Split closure-serializability failure tests

This splits the test identifying expected failures due to
closure serializability into three cases.

commit b507dd85b0ea1595ed216ffd8226964ba671676c
Author: William Benton wi...@redhat.com
Date:   2014-04-04T21:39:55Z

Fixed style issues in tests

Conflicts:

core/src/test/scala/org/apache/spark/serializer/ProactiveClosureSerializationSuite.scala

commit 5284569120cff51b3d2253e03b435047258798a0
Author: William Benton wi...@redhat.com
Date:   2014-04-04T22:15:50Z

Stylistic changes and cleanups

Conflicts:
core/src/main/scala/org/apache/spark/SparkContext.scala

commit d6d49304f543b5435b3062558f80ea61d2e1a757
Author: William Benton wi...@redhat.com
Date:   2014-05-02T14:23:49Z

Removed proactive closure serialization from DStream

Conflicts:

streaming/src/main/scala/org/apache/spark/streaming/dstream/DStream.scala

commit c052a63dba60b62325abb7ad541fe3d3f1a6e7d0
Author: William Benton wi...@redhat.com
Date:   2014-07

[GitHub] spark pull request: Fix (some of the) warnings in the test suite

2014-07-07 Thread willb
GitHub user willb opened a pull request:

https://github.com/apache/spark/pull/1323

Fix (some of the) warnings in the test suite

This PR fixes three classes of compiler and deprecation warnings in the 
test suite:

* `expectResult` is currently deprecated and has been replaced with 
`assertResult`;
* assertions of the form `should be ===` are deprecated in favor of the 
form `shouldEqual`; and
* `scala.language.postfixOps` was not in scope within the test classes in 
which postfix operations were actually used.

The fixes for these issues were almost entirely mechanical, but they 
eliminate many lines of warning output.
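
Illustrative before/after for the warning classes listed above (a hypothetical suite, not one of the files touched by this PR):

    import org.scalatest.{FunSuite, Matchers}

    class WarningFreeSuite extends FunSuite with Matchers {
      test("deprecated assertion forms") {
        // was: expectResult(4) { 2 + 2 }   -- Assertions.expectResult is deprecated
        assertResult(4) { 2 + 2 }
        // was: (2 + 2) should be === 4     -- deprecated matcher form
        (2 + 2) shouldEqual 4
      }
    }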

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/willb/spark testCleanups

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1323.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1323


commit b05304666eb61d7d1d1849772c27cf9711699dae
Author: William Benton wi...@redhat.com
Date:   2014-07-07T20:31:27Z

Replace expectResult with assertResult

Assertions.expectResult is deprecated.  This commit replaces
expectResult invocations with the assertResult invocations.

commit 5377534fdfe741aab7d6d539573ab78636bfd554
Author: William Benton wi...@redhat.com
Date:   2014-07-07T20:37:53Z

Replaced deprecated 'be ===' assertions

should be === assertions have been deprecated in favor of
shouldEqual assertions.  This commit replaces the deprecated
form with the supported form.

commit a672a9b5d579c4946a20fe4c6f88c4582949ef3f
Author: William Benton wi...@redhat.com
Date:   2014-07-07T21:00:58Z

Ensure language.postfixOps is in scope where used

Previously, language.postfixOps was imported at toplevel, which meant
compiler warnings since it wasn't visible inside the classes that used
postfix operations.  This commit moves the import to suppress these
warnings.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Fix (some of the) warnings in the test suite

2014-07-07 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1323#issuecomment-48243538
  
@srowen sorry, I hadn't noticed the other PR!  I think the `postfixOps` 
change in mine is disjoint, though.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-897: preemptively serialize closures

2014-06-29 Thread willb
Github user willb commented on a diff in the pull request:

https://github.com/apache/spark/pull/143#discussion_r14329726
  
--- Diff: core/src/main/scala/org/apache/spark/util/ClosureCleaner.scala ---
@@ -153,6 +153,18 @@ private[spark] object ClosureCleaner extends Logging {
   field.setAccessible(true)
   field.set(func, outer)
 }
+
+if (checkSerializable) {
+  ensureSerializable(func)
+}
+  }
+
+  private def ensureSerializable(func: AnyRef) {
+try {
+  SparkEnv.get.closureSerializer.newInstance().serialize(func)
+} catch {
+      case ex: Exception => throw new SparkException("Task not serializable: " + ex.toString)
--- End diff --

I agree that it is better to wrap the underlying exception but was 
following the style of this error in DAGScheduler.  I'll make the change and 
update that as well.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-897: preemptively serialize closures

2014-06-29 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/143#issuecomment-47459912
  
Sorry, I missed FailureSuite.  I have a fix but ran out of battery before I 
could push.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-897: preemptively serialize closures

2014-06-23 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/143#issuecomment-46860741
  
Can someone take another look at this PR?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2180: support HAVING clauses in Hive que...

2014-06-20 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1136#issuecomment-46677366
  
Thanks for the quick review and patch, @rxin!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2180: support HAVING clauses in Hive que...

2014-06-20 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1136#issuecomment-46724443
  
@rxin, re: the former, seems like most implementations signal this as an 
error.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2180: support HAVING clauses in Hive que...

2014-06-20 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1136#issuecomment-46725581
  
OK, I wasn't sure if strict Hive compatibility was the goal.  I'm happy to 
take these tickets.  Thanks again!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [SPARK-2225] Turn HAVING without GROUP BY into...

2014-06-20 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1161#issuecomment-46727408
  
LGTM; this is basically exactly what I did 
(https://github.com/willb/spark/commit/b272f6be925ba50741e0a5093244926ea4a7a9a8)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2180: support HAVING clauses in Hive que...

2014-06-19 Thread willb
GitHub user willb opened a pull request:

https://github.com/apache/spark/pull/1136

SPARK-2180:  support HAVING clauses in Hive queries

This PR extends Spark's HiveQL support to handle HAVING clauses in 
aggregations.  The HAVING test from the Hive compatibility suite doesn't appear 
to be runnable from within Spark, so I added a simple comparable test to 
`HiveQuerySuite`.
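
A sketch of the kind of check added (the query and assertion are illustrative, not the exact test):

    test("SPARK-2180: HAVING in a GROUP BY query") {
      val rows = hql(
        "SELECT key, count(*) FROM src GROUP BY key HAVING count(*) > 1").collect()
      assert(rows.forall(_.getLong(1) > 1))
    }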

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/willb/spark SPARK-2180

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1136.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1136


commit f43787b155f6dcf306fe3df3395785598dc9d6ca
Author: William Benton wi...@redhat.com
Date:   2014-06-18T20:20:41Z

Add support for HAVING clauses in Hive queries.

commit 33b12220a0d8437ec821a8ee166112f519a2d1bc
Author: William Benton wi...@redhat.com
Date:   2014-06-19T18:59:54Z

Added test coverage for SPARK-2180.

This is a simple test for HAVING clauses.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2180: support HAVING clauses in Hive que...

2014-06-19 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1136#issuecomment-46615691
  
@rxin, I'm not 100% sure but I think it's a problem with local map/reduce 
(the stack trace isn't too informative, but it's the same as the one for tests 
that are blacklisted due to missing local map/reduce).

I have another commit to push here (adding a semantic exception when 
`HAVING` is specified without `GROUP BY` and test coverage for same).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2180: support HAVING clauses in Hive que...

2014-06-19 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1136#issuecomment-46635597
  
Thanks for the catch, @rxin!  I'll make the change and add tests for it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2180: support HAVING clauses in Hive que...

2014-06-19 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1136#issuecomment-46640150
  
So I've added a cast for cases in which non-boolean expressions are supplied 
as HAVING conditions.  It appears that `Cast(_, BooleanType)` isn't 
idempotent, though -- if you apply it to an expression that is already Boolean (say, 
`x > 4`), it will translate that to `NOT ((x > 4) = 0)`.  This seems like a bug, but it's 
possible that I'm missing the reason why it should work that way.  Should I 
change `Cast` so that casting an _X_ to _X_ is a no-op?

(Checking the type of a variable during parse doesn't work, so I wind up 
with a different exception in examples like the one you posted.  I'll either 
need to fix the behavior of `Cast` or delay adding the cast until I have type 
information.)
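
One way to sidestep the double translation, assuming type information is available when the condition is rewritten (a hypothetical sketch, not part of this PR):

    def asHavingCondition(expr: Expression): Expression = expr.dataType match {
      case BooleanType => expr                     // already boolean: leave it untouched
      case _           => Cast(expr, BooleanType)  // coerce everything else
    }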


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2180: support HAVING clauses in Hive que...

2014-06-19 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1136#issuecomment-46642661
  
Thanks!  I'm happy to put together a preliminary patch as well, but 
probably won't be able to take a look until tomorrow morning.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-897: preemptively serialize closures

2014-06-18 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/143#issuecomment-46498345
  
I just rebased this branch atop master so it could be tested again.  I see 
that it failed under Jenkins.  However, I am unable to reproduce the local 
metrics failure in my own environment; is this an intermittent issue in CI or 
is there something I should look at?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-897: preemptively serialize closures

2014-06-18 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/143#issuecomment-46500775
  
Thanks @rxin!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-897: preemptively serialize closures

2014-05-15 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/143#issuecomment-42712135
  
I'm not able to reproduce the above failure locally (either on OS X or 
Linux).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-897: preemptively serialize closures

2014-05-15 Thread willb
GitHub user willb reopened a pull request:

https://github.com/apache/spark/pull/143

SPARK-897:  preemptively serialize closures

These commits cause `ClosureCleaner.clean` to attempt to serialize the 
cleaned closure with the default closure serializer and throw a 
`SparkException` if doing so fails.  This behavior is enabled by default but 
can be disabled at individual callsites of `SparkContext.clean`.

Commit 98e01ae8 fixes some no-op assertions in `GraphSuite` that this work 
exposed; I'm happy to put that in a separate PR if that would be more 
appropriate.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/willb/spark spark-897

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/143.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #143


commit 5cd11d51c19321981a6234a7765c7a5be6913433
Author: Ivan Wick ivanwick+git...@gmail.com
Date:   2014-04-11T00:49:30Z

Set spark.executor.uri from environment variable (needed by Mesos)

The Mesos backend uses this property when setting up a slave process.  It 
is similarly set in the Scala repl (org.apache.spark.repl.SparkILoop), but I 
couldn't find any analogous for pyspark.

Author: Ivan Wick ivanwick+git...@gmail.com

This patch had conflicts when merged, resolved by
Committer: Matei Zaharia ma...@databricks.com

Closes #311 from ivanwick/master and squashes the following commits:

da0c3e4 [Ivan Wick] Set spark.executor.uri from environment variable 
(needed by Mesos)

commit 7b4203ab4c640f7875ae3536228ed4d791062017
Author: Harvey Feng hyfeng...@gmail.com
Date:   2014-04-11T01:25:54Z

Add Spark v0.9.1 to ec2 launch script and use it as the default

Mainly ported from branch-0.9.

Author: Harvey Feng hyfeng...@gmail.com

Closes #385 from harveyfeng/0.9.1-ec2 and squashes the following commits:

769ac2f [Harvey Feng] Add Spark v0.9.1 to ec2 launch script and use it as 
the default

commit 44f654eecd3c181f2aeaff3871acf7f00eacc6b9
Author: Patrick Wendell pwend...@gmail.com
Date:   2014-04-11T03:43:56Z

SPARK-1202: Improvements to task killing in the UI.

1. Adds a separate endpoint for the killing logic that is outside of a page.
2. Narrows the scope of the killingEnabled tracking.
3. Some style improvements.

Author: Patrick Wendell pwend...@gmail.com

Closes #386 from pwendell/kill-link and squashes the following commits:

8efe02b [Patrick Wendell] Improvements to task killing in the UI.

commit 446bb3417a2855a194d49acc0ac316a021eced9d
Author: Thomas Graves tgra...@apache.org
Date:   2014-04-11T07:47:48Z

SPARK-1417: Spark on Yarn - spark UI link from resourcemanager is broken

Author: Thomas Graves tgra...@apache.org

Closes #344 from tgravescs/SPARK-1417 and squashes the following commits:

c450b5f [Thomas Graves] fix test
e1c1d7e [Thomas Graves] add missing $ to appUIAddress
e982ddb [Thomas Graves] use appUIHostPort in appUIAddress
0803ec2 [Thomas Graves] Review comment updates - remove extra newline, 
simplify assert in test
658a8ec [Thomas Graves] Add a appUIHostPort routine
0614208 [Thomas Graves] Fix test
2a6b1b7 [Thomas Graves] SPARK-1417: Spark on Yarn - spark UI link from 
resourcemanager is broken

commit 98225a6effd077a1b97c7e485d45ffd89b2c5b7f
Author: Patrick Wendell pwend...@gmail.com
Date:   2014-04-11T17:45:27Z

Some clean up in build/docs

(a) Deleted an outdated line from the docs
(b) Removed a work around that is no longer necessary given the mesos 
version bump.

Author: Patrick Wendell pwend...@gmail.com

Closes #382 from pwendell/maven-clean and squashes the following commits:

f0447fa [Patrick Wendell] Minor doc clean-up

commit f5ace8da34c58d1005c7c377cfe3df21102c1dd6
Author: Xiangrui Meng m...@databricks.com
Date:   2014-04-11T19:06:13Z

[SPARK-1225, 1241] [MLLIB] Add AreaUnderCurve and 
BinaryClassificationMetrics

This PR implements a generic version of `AreaUnderCurve` using the 
`RDD.sliding` implementation from https://github.com/apache/spark/pull/136 . It 
also contains refactoring of https://github.com/apache/spark/pull/160 for 
binary classification evaluation.

Author: Xiangrui Meng m...@databricks.com

Closes #364 from mengxr/auc and squashes the following commits:

a05941d [Xiangrui Meng] replace TP/FP/TN/FN by their full names
3f42e98 [Xiangrui Meng] add (0, 0), (1, 1) to roc, and (0, 1) to pr
fb4b6d2 [Xiangrui Meng] rename Evaluator to Metrics and add more metrics
b1b7dab [Xiangrui Meng] fix code styles
9dc3518 [Xiangrui Meng] add tests for BinaryClassificationEvaluator
ca31da5 [Xiangrui Meng] remove

[GitHub] spark pull request: SPARK-571: forbid return statements in cleaned...

2014-05-13 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/717#issuecomment-43009698
  
Thanks, @rxin!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-571: forbid return statements in cleaned...

2014-05-12 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/717#issuecomment-42911366
  
@rxin, yes!  (I was firing up a debugger in case you wanted to know exactly 
where it was returning to.)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-897: preemptively serialize closures

2014-05-11 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/143#issuecomment-42703799
  
I'd like to reopen this PR, since #189 had to be reverted and is still in 
flight, but these fixes are independently useful.  I've rebased the branch atop 
the current master.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-571: forbid return statements in cleaned...

2014-05-10 Thread willb
GitHub user willb opened a pull request:

https://github.com/apache/spark/pull/717

SPARK-571: forbid return statements in cleaned closures

This patch checks top-level closure arguments to `ClosureCleaner.clean` for 
`return` statements and raises an exception if it finds any.  This is mainly a 
user-friendliness addition, since programs with return statements in closure 
arguments will currently fail upon RDD actions with a less-than-intuitive error 
message.
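
An example of the failure mode (hypothetical user code): the return compiles, but escapes as a NonLocalReturnControl when the closure runs on an executor, which surfaces as a confusing task failure.

    import org.apache.spark.SparkContext

    def countEvens(sc: SparkContext): Long = {
      sc.parallelize(1 to 10).filter { x =>
        if (x > 5) return 0L  // a return inside the closure: this is what the new check rejects
        x % 2 == 0
      }.count()
    }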

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/willb/spark spark-571

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/717.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #717


commit b017c4779a0bfc54f215bc1135d9d23ee9761c20
Author: William Benton wi...@redhat.com
Date:   2014-05-09T20:05:49Z

Added a test for SPARK-571

commit 295b6a5960fd9a4509e64b225b0e4b479704978b
Author: William Benton wi...@redhat.com
Date:   2014-05-09T21:14:37Z

Forbid return statements in closure arguments.

This commit ensures that ClosureCleaner.clean will identify
top-level return statements in closures and raise an exception
if they are encountered.

This is intended to fix SPARK-571.




---


[GitHub] spark pull request: SPARK-571: forbid return statements in cleaned...

2014-05-10 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/717#issuecomment-42730126
  
Thanks for the quick feedback, @andrewor14!  I'll make those changes.  I'll 
also look into why that repartition test is failing; oddly, it passes for me 
locally.


---


[GitHub] spark pull request: SPARK-571: forbid return statements in cleaned...

2014-05-10 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/717#issuecomment-42760025
  
Thanks for the clarification, @pwendell.

@andrewor14, I've grouped my imports and collapsed that function.  Thanks 
again.


---


[GitHub] spark pull request: SPARK-1501: Ensure assertions in Graph.apply a...

2014-04-15 Thread willb
GitHub user willb opened a pull request:

https://github.com/apache/spark/pull/415

SPARK-1501: Ensure assertions in Graph.apply are asserted.

The Graph.apply test in GraphSuite had some assertions in a closure in
a graph transformation. As a consequence, these assertions never
actually executed.  Furthermore, these closures had a reference to
(non-serializable) test harness classes because they called assert(),
which could be a problem if we proactively check closure serializability
in the future.

This commit simply changes the Graph.apply test to collect the graph
triplets so it can assert about each triplet from a map method.
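
For context, a sketch of the pattern being adopted; the vertex and edge data
below are made up, and the Graph.apply, Edge, and triplets calls are standard
GraphX API rather than the actual GraphSuite code:

    import org.apache.spark.SparkContext
    import org.apache.spark.graphx.{Edge, Graph}

    object TripletAssertionSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local", "triplet-assertion-sketch")

        val vertices = sc.parallelize(Seq((1L, "a"), (2L, "b")))
        val edges = sc.parallelize(Seq(Edge(1L, 2L, "a-b")))
        val graph = Graph(vertices, edges)

        // Anti-pattern: an assertion inside a lazy transformation never
        // runs unless something forces the result, and in the original
        // test assert() came from the (non-serializable) test harness, so
        // the harness was captured by the closure:
        //   graph.mapTriplets { et => assert(et.attr == "a-b"); et.attr }

        // Pattern used by this change: collect the triplets, then assert
        // on the driver, outside any closure that must be serialized.
        graph.triplets.collect().foreach { et =>
          assert(et.attr == s"${et.srcAttr}-${et.dstAttr}")
        }

        sc.stop()
      }
    }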

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/willb/spark graphsuite-nop-fix

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/415.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #415


commit 0b636586b797546ce0cf78dbbfbe7462712aeaa4
Author: William Benton wi...@redhat.com
Date:   2014-03-14T16:40:56Z

Ensure assertions in Graph.apply are asserted.

The Graph.apply test in GraphSuite had some assertions in a closure in
a graph transformation. As a consequence, these assertions never
actually executed.  Furthermore, these closures had a reference to
(non-serializable) test harness classes because they called assert(),
which could be a problem if we proactively check closure serializability
in the future.

This commit simply changes the Graph.apply test to collect the graph
triplets so it can assert about each triplet from a map method.




---


[GitHub] spark pull request: SPARK-729: Closures not always serialized at c...

2014-04-10 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/189#issuecomment-40070735
  
@pwendell That's odd, since I didn't see any failures locally either.  I 
haven't rebased this branch in a while, though, so I can check again.  Can you 
send me a link to the failing tests?


---


[GitHub] spark pull request: SPARK-897: preemptively serialize closures

2014-04-09 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/143#issuecomment-40035997
  
This is subsumed by #189.


---

