[GitHub] spark pull request #15513: [SPARK-17963][SQL][Documentation] Add examples (e...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/15513#discussion_r84590609 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/CallMethodViaReflection.scala ---
@@ -43,11 +43,20 @@ import org.apache.spark.util.Utils
 * and the second element should be a literal string for the method name,
 * and the remaining are input arguments to the Java method.
 */
-// scalastyle:off line.size.limit
 @ExpressionDescription(
-  usage = "_FUNC_(class,method[,arg1[,arg2..]]) calls method with reflection",
-  extended = "> SELECT _FUNC_('java.util.UUID', 'randomUUID');\n c33fb387-8500-4bfa-81d2-6e0e3e930df2")
-// scalastyle:on line.size.limit
+  usage = "_FUNC_(class, method[, arg1[, arg2 ..]]) - Calls method with reflection.",
+  extended = """
+    Arguments:
+      class - a string literal that represents a fully-qualified class name.
+      method - a string literal that represents a method name.
+      arg - a string literal that represents arguments for the method.
--- End diff --
Oh, it seems `arg` is not. Let me try to find such cases here.
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
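For context, the function being documented can be exercised from Spark SQL roughly like this (an illustrative sketch only; it assumes an already-created `SparkSession` named `spark`, and uses `reflect`, the SQL name registered for `CallMethodViaReflection`):

```scala
// Sketch: invoking the reflection function whose docs are edited above.
// Assumes `spark` is an existing SparkSession.
val df = spark.sql("SELECT reflect('java.util.UUID', 'randomUUID')")
df.show(truncate = false)
```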
[GitHub] spark pull request #15513: [SPARK-17963][SQL][Documentation] Add examples (e...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/15513#discussion_r84590562 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/functions.scala ---
@@ -125,7 +129,7 @@ case class DescribeFunctionCommand(
       if (isExtended) {
         result :+
-          Row(s"Extended Usage:\n${replaceFunctionName(info.getExtended, info.getName)}")
+          Row(s"Extended Usage:${replaceFunctionName(info.getExtended, info.getName)}")
--- End diff --
I don't think stripMargin works in annotations (at least in one version of the Scala we support, perhaps 2.10).
[GitHub] spark pull request #15595: [SPARK-18058][SQL] Comparing column types ignorin...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/15595#discussion_r84590334 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisSuite.scala ---
@@ -377,4 +377,14 @@ class AnalysisSuite extends AnalysisTest {
     assertExpressionType(sum(Divide(Decimal(1), 2.0)), DoubleType)
     assertExpressionType(sum(Divide(1.0, Decimal(2.0))), DoubleType)
   }
+
+  test("SPARK-18058: union operations shall not care about the nullability of columns") {
--- End diff --
+1 (actually, it'd be nicer if it had both a unit test and an end-to-end test).
[GitHub] spark issue #15219: [SPARK-14098][SQL] Generate Java code to build CachedCol...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/15219 @davies @rxin, would it be possible to review this?
[GitHub] spark issue #15588: [SPARK-18039][Scheduler] fix bug maxRegisteredWaitingTim...
Github user Astralidea commented on the issue: https://github.com/apache/spark/pull/15588 @lw-lin Thanks for your reply. Running Spark in my private cluster is a little different (I start the driver & executors myself). I have tried maxRegisteredWaitingTime, but I have not tried minRegisteredResourcesRatio; I assumed minRegisteredResourcesRatio would not work if maxRegisteredWaitingTime doesn't. Maybe it works, though. I will try spark.scheduler.minRegisteredResourcesRatio tomorrow.
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15575 LGTM, since the scope of this PR is just refactoring. Let me first post the existing code for `outputPartitioning` in `ExpandExec`:
```Scala
// The GroupExpressions can output data with arbitrary partitioning, so set it
// as UNKNOWN partitioning
override def outputPartitioning: Partitioning = UnknownPartitioning(0)
```
It makes sense to set it to either `UnknownPartitioning` or `child.outputPartitioning`. However, the above code sets a wrong number of partitions. We need to correct it regardless of whether we use the number or not.
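A minimal sketch of the kind of correction being suggested, assuming one keeps `UnknownPartitioning` but propagates the child's partition count (hypothetical, not the actual patch):

```scala
// Hypothetical fix sketch: keep UnknownPartitioning but report the child's
// number of partitions instead of the hard-coded 0.
override def outputPartitioning: Partitioning =
  UnknownPartitioning(child.outputPartitioning.numPartitions)
```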
[GitHub] spark pull request #15361: [SPARK-17765][SQL] Support for writing out user-d...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/15361#discussion_r84590191 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcQuerySuite.scala ---
@@ -91,6 +91,16 @@ class OrcQuerySuite extends QueryTest with BeforeAndAfterAll with OrcTest {
     }
   }
+  test("Read/write UserDefinedType") {
+    withTempPath { path =>
+      val data = Seq((1, new UDT.MyDenseVector(Array(0.25, 2.25, 4.25))))
+      val udtDF = data.toDF("id", "vectors")
+      udtDF.write.orc(path.getAbsolutePath)
+      val readBack = spark.read.schema(udtDF.schema).orc(path.getAbsolutePath)
--- End diff --
It seems fine for reading because it refers to the schema from ORC (detecting the fields via field names).
[GitHub] spark pull request #15361: [SPARK-17765][SQL] Support for writing out user-d...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/15361#discussion_r84590147 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveInspectors.scala ---
@@ -246,6 +246,9 @@ private[hive] trait HiveInspectors {
   * Wraps with Hive types based on object inspector.
   */
  protected def wrapperFor(oi: ObjectInspector, dataType: DataType): Any => Any = oi match {
+    case _ if dataType.isInstanceOf[UserDefinedType[_]] =>
--- End diff --
> This codepath is shared by many things apart from ORC. Won't those be affected?

It seems this path is being used in `hiveUDFs.scala` and `hiveWriterContainers.scala`. Actually, it'd be fine for a value converter for a UDT to use the converter for the equivalent type (its inner SQL type). It is a common pattern for other data sources as well.

> I would put this case at the very end. The reason being UserDefinedType are not that common compared to other types (esp. primitive types). So putting it below in the switch case will be better for perf.

Cool :)
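The "UDT delegates to its inner SQL type" pattern described here might look roughly like the following sketch (an assumption about the shape of the change, not the actual patch; `wrapperFor`, `ObjectInspector`, and `UserDefinedType` are the Spark/Hive types named in the diff, but the body below is illustrative):

```scala
// Sketch: a UDT value is first serialized to its underlying SQL type,
// then handed to the wrapper for that inner type. Illustrative only.
case _ if dataType.isInstanceOf[UserDefinedType[_]] =>
  val udt = dataType.asInstanceOf[UserDefinedType[Any]]
  val innerWrapper = wrapperFor(oi, udt.sqlType)
  (value: Any) => innerWrapper(udt.serialize(value))
```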
[GitHub] spark issue #15441: [SPARK-4411] [Web UI] Add "kill" link for jobs in the UI
Github user ajbozarth commented on the issue: https://github.com/apache/spark/pull/15441 @srowen I addressed most of your comments except the one about the try-finally I commented on above
[GitHub] spark issue #15441: [SPARK-4411] [Web UI] Add "kill" link for jobs in the UI
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15441 **[Test build #67406 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67406/consoleFull)** for PR 15441 at commit [`b1e77ba`](https://github.com/apache/spark/commit/b1e77baaff2bae12e745d623ea27e7cb2ad5e2be).
[GitHub] spark issue #15588: [SPARK-18039][Scheduler] fix bug maxRegisteredWaitingTim...
Github user lw-lin commented on the issue: https://github.com/apache/spark/pull/15588 Spark Streaming would run a very simple dummy job to ensure that all slaves have registered before scheduling the `Receiver`s; please see https://github.com/apache/spark/blob/v2.0.0/streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceiverTracker.scala#L436-L447. @Astralidea, `spark.scheduler.minRegisteredResourcesRatio` is the minimum ratio of registered resources to wait for before the dummy job begins. In our private clusters, configuring it to `0.9` or even `1.0` helps a lot to balance our 100+ `Receiver`s. Maybe you could also give it a try.
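The two registration gates mentioned in this thread can be set together; a hedged example (the config keys are real Spark settings, the values are illustrative):

```scala
import org.apache.spark.SparkConf

// Wait until 90% of requested resources have registered, but give up
// after 60 seconds, whichever comes first. Values are illustrative.
val conf = new SparkConf()
  .set("spark.scheduler.minRegisteredResourcesRatio", "0.9")
  .set("spark.scheduler.maxRegisteredResourcesWaitingTime", "60s")
```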
[GitHub] spark pull request #15441: [SPARK-4411] [Web UI] Add "kill" link for jobs in...
Github user ajbozarth commented on a diff in the pull request: https://github.com/apache/spark/pull/15441#discussion_r84589829 --- Diff: core/src/test/scala/org/apache/spark/ui/UISeleniumSuite.scala ---
@@ -651,6 +671,15 @@ class UISeleniumSuite extends SparkFunSuite with WebBrowser with Matchers with B
     }
   }
+  def getResponseCode(url: URL, method: String): Int = {
+    val connection = url.openConnection().asInstanceOf[HttpURLConnection]
+    connection.setRequestMethod(method)
+    connection.connect()
+    val code = connection.getResponseCode()
+    connection.disconnect()
--- End diff --
It might just be because it's late and I'm tired, but I'm not quite sure where you think the try-finally should be.
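One plausible placement of the try-finally being discussed: open the connection outside the `try`, then guarantee `disconnect()` runs even if the request throws (a sketch, not necessarily what the reviewer intended):

```scala
import java.net.{HttpURLConnection, URL}

// Sketch: disconnect() runs whether or not the request succeeds.
def getResponseCode(url: URL, method: String): Int = {
  val connection = url.openConnection().asInstanceOf[HttpURLConnection]
  try {
    connection.setRequestMethod(method)
    connection.connect()
    connection.getResponseCode()
  } finally {
    connection.disconnect()
  }
}
```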
[GitHub] spark issue #15484: [SPARK-17868][SQL] Do not use bitmasks during parsing an...
Github user jiangxb1987 commented on the issue: https://github.com/apache/spark/pull/15484 @tejasapatil @rxin I've addressed most of your comments, thanks for reviewing this!
[GitHub] spark pull request #15484: [SPARK-17868][SQL] Do not use bitmasks during par...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/15484#discussion_r84589691 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -216,10 +216,16 @@ class Analyzer(
      * Group Count: N + 1 (N is the number of group expressions)
      *
      * We need to get all of its subsets for the rule described above, the subset is
-     * represented as the bit masks.
+     * represented as sequence of expressions.
      */
-    def bitmasks(r: Rollup): Seq[Int] = {
-      Seq.tabulate(r.groupByExprs.length + 1)(idx => (1 << idx) - 1)
+    def rollupExprs(exprs: Seq[Expression]): Seq[Seq[Expression]] = {
+      val buffer = ArrayBuffer.empty[Seq[Expression]]
--- End diff --
The use of `ArrayBuffer` makes this piece of code more concise, and since `exprs` is not usually very long, performance is probably not the major concern here. I'd prefer to keep this one, is it OK? @hvanhovell
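Since the rollup grouping sets are exactly the prefixes of the group-by expressions (the old bitmasks `(1 << idx) - 1` encode precisely those prefixes), the same result can also be produced without a mutable buffer; a small generic sketch:

```scala
// Sketch: rollup grouping sets are the prefixes of the input sequence,
// from the empty prefix (grand total) up to the full group-by list.
def rollupExprs[A](exprs: Seq[A]): Seq[Seq[A]] =
  exprs.inits.toSeq.reverse
```

For example, `rollupExprs(Seq("a", "b"))` yields `Seq(Seq(), Seq("a"), Seq("a", "b"))`.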
[GitHub] spark issue #15582: [SPARK-18045][SQL][TESTS] Move `HiveDataFrameAnalyticsSu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15582 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67403/ Test PASSed.
[GitHub] spark issue #15582: [SPARK-18045][SQL][TESTS] Move `HiveDataFrameAnalyticsSu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15582 Merged build finished. Test PASSed.
[GitHub] spark issue #15484: [SPARK-17868][SQL] Do not use bitmasks during parsing an...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15484 **[Test build #67405 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67405/consoleFull)** for PR 15484 at commit [`a47cc68`](https://github.com/apache/spark/commit/a47cc687d9606d8a22d0de9d9c9762fef44f897d).
[GitHub] spark issue #15582: [SPARK-18045][SQL][TESTS] Move `HiveDataFrameAnalyticsSu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15582 **[Test build #67403 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67403/consoleFull)** for PR 15582 at commit [`5acbd6c`](https://github.com/apache/spark/commit/5acbd6ce3a1d8becc84c4e53b7f175b13bb8b7bf).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #13705: [SPARK-15472][SQL] Add support for writing in `csv` form...
Github user lw-lin commented on the issue: https://github.com/apache/spark/pull/13705 closing this in favor of SPARK-17924
[GitHub] spark pull request #13705: [SPARK-15472][SQL] Add support for writing in `cs...
Github user lw-lin closed the pull request at: https://github.com/apache/spark/pull/13705
[GitHub] spark issue #15582: [SPARK-18045][SQL][TESTS] Move `HiveDataFrameAnalyticsSu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15582 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67404/ Test PASSed.
[GitHub] spark issue #15582: [SPARK-18045][SQL][TESTS] Move `HiveDataFrameAnalyticsSu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15582 Merged build finished. Test PASSed.
[GitHub] spark issue #15582: [SPARK-18045][SQL][TESTS] Move `HiveDataFrameAnalyticsSu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15582 **[Test build #67404 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67404/consoleFull)** for PR 15582 at commit [`3066efc`](https://github.com/apache/spark/commit/3066efc6b54111e0ec69dcd6110f32b8e7f56dbf).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #15573: [SPARK-18035] [SQL] Introduce performant and memo...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/15573
[GitHub] spark issue #15573: [SPARK-18035] [SQL] Introduce performant and memory effi...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15573 Merging to master! Thanks!
[GitHub] spark issue #15573: [SPARK-18035] [SQL] Introduce performant and memory effi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15573 Merged build finished. Test PASSed.
[GitHub] spark issue #15573: [SPARK-18035] [SQL] Introduce performant and memory effi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15573 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67402/ Test PASSed.
[GitHub] spark issue #15573: [SPARK-18035] [SQL] Introduce performant and memory effi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15573 **[Test build #67402 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67402/consoleFull)** for PR 15573 at commit [`b263278`](https://github.com/apache/spark/commit/b263278573adc00fcc3f9fc72604b573936a5516).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r84589297 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RandomProjectionSuite.scala --- @@ -0,0 +1,148 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.feature + +import breeze.numerics.{cos, sin} +import breeze.numerics.constants.Pi + +import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.linalg.Vectors +import org.apache.spark.mllib.util.MLlibTestSparkContext + +class RandomProjectionSuite extends SparkFunSuite with MLlibTestSparkContext { + test("RandomProjection") { +val data = { + for (i <- -5 until 5; j <- -5 until 5) yield Vectors.dense(i.toDouble, j.toDouble) +} +val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("keys") + +// Project from 2 dimensional Euclidean Space to 1 dimensions +val rp = new RandomProjection() + .setOutputDim(1) + .setInputCol("keys") + .setOutputCol("values") + .setBucketLength(1.0) + .setSeed(12345) + +val (falsePositive, falseNegative) = LSHTest.calculateLSHProperty(df, rp, 8.0, 2.0) +assert(falsePositive < 0.05) +assert(falseNegative < 0.06) + } + + test("RandomProjection with high dimension data") { +val numDim = 100 +val data = { + for (i <- 0 until numDim; j <- Seq(-2, -1, 1, 2)) +yield Vectors.sparse(numDim, Seq((i, j.toDouble))) +} +val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("keys") + +// Project from 100 dimensional Euclidean Space to 10 dimensions +val rp = new RandomProjection() + .setOutputDim(10) + .setInputCol("keys") + .setOutputCol("values") + .setBucketLength(2.5) + .setSeed(12345) + +val (falsePositive, falseNegative) = LSHTest.calculateLSHProperty(df, rp, 3.0, 2.0) +assert(falsePositive == 0.0) +assert(falseNegative < 0.05) + } + + test("approxNearestNeighbors for random projection") { +val data = { + for (i <- -10 until 10; j <- -10 until 10) yield Vectors.dense(i.toDouble, j.toDouble) +} +val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("keys") +val key = Vectors.dense(1.2, 3.4) + +val rp = new RandomProjection() + .setOutputDim(2) + .setInputCol("keys") + .setOutputCol("values") + .setBucketLength(4.0) + .setSeed(12345) + +val (precision, recall) = 
LSHTest.calculateApproxNearestNeighbors(rp, df, key, 100, + singleProbing = true) +assert(precision >= 0.6) +assert(recall >= 0.6) + } + + test("approxNearestNeighbors with multiple probing") { --- End diff -- If the goal here is to ensure multiple probing is a strict improvement, then I'd combine the unit tests to ensure that the data and Param settings remain the same. I see the Params are already different, but perhaps they should be made identical.
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r84588285

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinHash.scala ---
@@ -0,0 +1,118 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.util.Random
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.linalg.{Vector, Vectors, VectorUDT}
+import org.apache.spark.ml.param.shared.HasSeed
+import org.apache.spark.ml.util.{Identifiable, SchemaUtils}
+import org.apache.spark.sql.types.StructType
+
+/**
+ * :: Experimental ::
+ * Model produced by [[MinHash]]
+ * @param hashFunctions An array of hash functions, mapping elements to their hash values.
+ */
+@Experimental
+@Since("2.1.0")
+class MinHashModel private[ml] (override val uid: String, hashFunctions: Array[Int => Long])
+  extends LSHModel[MinHashModel] {
+
+  @Since("2.1.0")
+  override protected[this] val hashFunction: Vector => Vector = {
+    elems: Vector =>
+      require(elems.numNonzeros > 0, "Must have at least 1 non-zero entry.")
+      val elemsList = elems.toSparse.indices.toList
+      Vectors.dense(hashFunctions.map(func => elemsList.map(func).min.toDouble))
+  }
+
+  @Since("2.1.0")
+  override protected[ml] def keyDistance(x: Vector, y: Vector): Double = {
+    val xSet = x.toSparse.indices.toSet
+    val ySet = y.toSparse.indices.toSet
+    val intersectionSize = xSet.intersect(ySet).size.toDouble
+    val unionSize = xSet.size + ySet.size - intersectionSize
+    assert(unionSize > 0, "The union of two input sets must have at least 1 element")
+    1 - intersectionSize / unionSize
+  }
+
+  @Since("2.1.0")
+  override protected[ml] def hashDistance(x: Vector, y: Vector): Double = {
+    // Since it's generated by hashing, it will be a pair of dense vectors.
+    x.toDense.values.zip(y.toDense.values).map(pair => math.abs(pair._1 - pair._2)).min
+  }
+}
+
+/**
+ * :: Experimental ::
+ * LSH class for Jaccard distance.
+ *

--- End diff --

Could you please link to Wikipedia? That tends to be useful: [https://en.wikipedia.org/wiki/MinHash]
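The `hashFunction` and `keyDistance` under review implement classic MinHash over Jaccard distance. The idea can be sketched outside Spark; below is a minimal Python illustration (not the Spark API — the affine hash form `(a*x + b) mod p` and the helper names are assumptions for the sketch):

```python
import random

PRIME = 2038074743  # an arbitrary large prime used as the hash modulus


def make_hash_functions(num_hashes, seed=0):
    """Generate (a, b) pairs for random affine hashes h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(num_hashes)]


def minhash_signature(indices, hash_functions):
    """Signature entry i = minimum of hash function i over the set's element indices,
    mirroring `elemsList.map(func).min` in the PR."""
    return [min((a * i + b) % PRIME for i in indices) for (a, b) in hash_functions]


def jaccard_distance(x, y):
    """keyDistance in the PR: 1 - |x intersect y| / |x union y|."""
    x, y = set(x), set(y)
    union = len(x) + len(y) - len(x & y)
    assert union > 0, "The union of two input sets must have at least 1 element"
    return 1.0 - len(x & y) / union
```

The LSH property being tested is that the probability two sets share a min-hash value equals their Jaccard similarity, so close sets collide often.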
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r84588545

--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/MinHashSuite.scala ---
@@ -0,0 +1,84 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.linalg.{Vector, Vectors}
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+
+class MinHashSuite extends SparkFunSuite with MLlibTestSparkContext {
+  test("MinHash") {

--- End diff --

Name the test more specifically: "MinHash: test of LSH property"
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r84589114

--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/MinHashSuite.scala ---
@@ -0,0 +1,84 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.linalg.{Vector, Vectors}
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+
+class MinHashSuite extends SparkFunSuite with MLlibTestSparkContext {
+  test("MinHash") {
+    val data = {
+      for (i <- 0 to 95) yield Vectors.sparse(100, (i until i + 5).map((_, 1.0)))

--- End diff --

If you're reusing data across tests, then I'd put it in a class member val. See example: [https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/classification/NaiveBayesSuite.scala#L40]
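The reviewer's suggestion — hoist data shared across tests into a class-level member instead of rebuilding it in each test body — can be sketched outside ScalaTest. A hypothetical Python `unittest` analogue (the suite name and assertions are illustrative, not part of the PR):

```python
import unittest


class MinHashStyleSuite(unittest.TestCase):
    """Illustrates the review suggestion: shared fixture data lives on the
    class, built once, rather than inside each test method."""

    @classmethod
    def setUpClass(cls):
        # Analogue of `val data = for (i <- 0 to 95) yield ...` as a class member:
        # 96 overlapping 5-element sets over a universe of 100 indices.
        cls.data = [set(range(i, i + 5)) for i in range(96)]

    def test_data_shape(self):
        self.assertEqual(len(self.data), 96)

    def test_each_set_has_five_elements(self):
        self.assertTrue(all(len(s) == 5 for s in self.data))
```

This keeps the fixture identical across tests, which also addresses the earlier comment about making Params and data the same when comparing single and multiple probing.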
[GitHub] spark issue #15582: [SPARK-18045][SQL][TESTS] Move `HiveDataFrameAnalyticsSu...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15582

It'd be great to move those as well!
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/15575

In practice, setting the `outputPartitioning` of a physical plan like `ExpandExec` to `child.outputPartitioning` doesn't cause any real problem, even though the physical plan doesn't preserve the row distribution of its child. That is because if the physical plan changes its output, it will have different output attributes, e.g., `col` becomes `col'` as @tejasapatil pointed out. If its parent plan requires a distribution, say `HashPartition`, that distribution will be bound to the physical plan's output `col'`, not to its child plan's `col`. So even though the physical plan uses `child.outputPartitioning`, `EnsureRequirements` will step in and inject an extra shuffle exchange of `HashPartition(col')` to satisfy the requirement. That is how it works, as per my understanding. However, it doesn't mean the physical plan's output partitioning is exactly the same as its child's, i.e., `HashPartition(col)`, because it doesn't have the output `col`. This part might be confusing to some people, so I think it would be better to explain it more. That is my understanding; if I am wrong, please kindly point it out.
[GitHub] spark pull request #15588: [SPARK-18039][Scheduler] fix bug maxRegisteredWai...
Github user Astralidea commented on a diff in the pull request: https://github.com/apache/spark/pull/15588#discussion_r84588811

--- Diff: streaming/src/main/scala/org/apache/spark/streaming/scheduler/ReceiverTracker.scala ---
@@ -440,7 +430,10 @@ class ReceiverTracker(ssc: StreamingContext, skipReceiverLaunch: Boolean = false
       rcvr
     }

-    runDummySparkJob()
+    while ((System.currentTimeMillis() - createTime) < maxRegisteredWaitingTimeMs) {}

--- End diff --

You're right, but I think it only wastes a little time. How can I write this more gracefully? I'd like to improve it but don't know how.
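The diff above busy-waits in a tight loop, which burns a CPU core for the whole waiting period. One common, more graceful pattern is to sleep between checks of a readiness condition. A minimal sketch (in Python rather than Scala; `predicate` is a hypothetical readiness check, not anything from `ReceiverTracker`):

```python
import time


def wait_until(predicate, timeout_s, poll_interval_s=0.05):
    """Wait up to `timeout_s` seconds for `predicate()` to become true,
    sleeping between checks instead of spinning. Returns the final
    predicate value, so callers can distinguish success from timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(poll_interval_s)
    return predicate()  # one last check at the deadline
```

In JVM code the equivalent tools would be `Thread.sleep` in the loop, or better, a latch/condition variable signalled when executors register, which avoids polling entirely.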
[GitHub] spark issue #15588: [SPARK-18039][Scheduler] fix bug maxRegisteredWaitingTim...
Github user Astralidea commented on the issue: https://github.com/apache/spark/pull/15588

@srowen But in my cluster I tested this 10 times: 9 succeeded and 1 failed. Why is it not necessary? Receiver scheduling balance affects performance: if a new executor is registered with the driver too late, the receivers won't be scheduled again. Or is there another solution?
[GitHub] spark issue #15354: [SPARK-17764][SQL] Add `to_json` supporting to convert n...
Github user marmbrus commented on the issue: https://github.com/apache/spark/pull/15354

It would be really nice to fail in analysis rather than execution. What if it only fails after hours of computation? As a user I'd be upset. I'm also concerned they will think it's a Spark bug.
[GitHub] spark pull request #15361: [SPARK-17765][SQL] Support for writing out user-d...
Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/15361#discussion_r84588700

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcQuerySuite.scala ---
@@ -91,6 +91,16 @@ class OrcQuerySuite extends QueryTest with BeforeAndAfterAll with OrcTest {
     }
   }

+  test("Read/write UserDefinedType") {
+    withTempPath { path =>
+      val data = Seq((1, new UDT.MyDenseVector(Array(0.25, 2.25, 4.25))))
+      val udtDF = data.toDF("id", "vectors")
+      udtDF.write.orc(path.getAbsolutePath)
+      val readBack = spark.read.schema(udtDF.schema).orc(path.getAbsolutePath)

--- End diff --

Curious: how does this work? You added support for `UserDefinedType` on the `wrapper` side, but on the `unwrapper` side I don't see `UserDefinedType` being handled.
[GitHub] spark pull request #15361: [SPARK-17765][SQL] Support for writing out user-d...
Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/15361#discussion_r84588678

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveInspectors.scala ---
@@ -246,6 +246,9 @@ private[hive] trait HiveInspectors {
    * Wraps with Hive types based on object inspector.
    */
  protected def wrapperFor(oi: ObjectInspector, dataType: DataType): Any => Any = oi match {
+    case _ if dataType.isInstanceOf[UserDefinedType[_]] =>

--- End diff --

- This codepath is shared by many things apart from ORC. Won't those be affected?
- I would put this case at the very end. `UserDefinedType`s are not as common as the other types (esp. primitive types), so putting them lower in the match will be better for perf.
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/15575

@tejasapatil Yeah, that is correct. However, I am wondering whether we can say this `ExpandExec` has the same distribution of rows as its child, because it doesn't even have the `col`...
[GitHub] spark issue #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15148

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67401/

Test PASSed.
[GitHub] spark issue #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15148

Merged build finished. Test PASSed.
[GitHub] spark issue #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15148

**[Test build #67401 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67401/consoleFull)** for PR 15148 at commit [`e14f73e`](https://github.com/apache/spark/commit/e14f73e8a49d409e09a6ed541d4b40f07dc81013).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #15582: [SPARK-18045][SQL][TESTS] Move `HiveDataFrameAnalyticsSu...
Github user jiangxb1987 commented on the issue: https://github.com/apache/spark/pull/15582

@rxin I've moved the test cases added in this PR to a query file test. Do we need to move the other test cases for `ROLLUP/CUBE/GROUPING-SETS` too? Currently in `SQLQuerySuite` we have the following:
```
test("rollup")
test("grouping sets when aggregate functions containing groupBy columns")
test("cube")
test("grouping sets")
test("grouping and grouping_id")
test("grouping and grouping_id in having")
test("grouping and grouping_id in sort")
```
[GitHub] spark issue #15582: [SPARK-18045][SQL][TESTS] Move `HiveDataFrameAnalyticsSu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15582

**[Test build #67404 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67404/consoleFull)** for PR 15582 at commit [`3066efc`](https://github.com/apache/spark/commit/3066efc6b54111e0ec69dcd6110f32b8e7f56dbf).
[GitHub] spark issue #15463: [SPARK-17894] [CORE] Ensure uniqueness of TaskSetManager...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15463

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67399/

Test PASSed.
[GitHub] spark issue #15463: [SPARK-17894] [CORE] Ensure uniqueness of TaskSetManager...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15463

Merged build finished. Test PASSed.
[GitHub] spark issue #15463: [SPARK-17894] [CORE] Ensure uniqueness of TaskSetManager...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15463

**[Test build #67399 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67399/consoleFull)** for PR 15463 at commit [`cd6d240`](https://github.com/apache/spark/commit/cd6d240c8972e843a1abf586c6d324bff8beefd5).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #15595: [SPARK-18058][SQL] Comparing column types ignorin...
Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/15595#discussion_r84588354

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisSuite.scala ---
@@ -377,4 +377,14 @@ class AnalysisSuite extends AnalysisTest {
     assertExpressionType(sum(Divide(Decimal(1), 2.0)), DoubleType)
     assertExpressionType(sum(Divide(1.0, Decimal(2.0))), DoubleType)
   }
+
+
+  test("SPARK-18058: union operations shall not care about the nullability of columns") {

--- End diff --

This PR also affects `SetOperation`. Could you please also add tests for that?
[GitHub] spark issue #15582: [SPARK-18045][SQL][TESTS] Move `HiveDataFrameAnalyticsSu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15582

**[Test build #67403 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67403/consoleFull)** for PR 15582 at commit [`5acbd6c`](https://github.com/apache/spark/commit/5acbd6ce3a1d8becc84c4e53b7f175b13bb8b7bf).
[GitHub] spark pull request #15595: [SPARK-18058][SQL] Comparing column types ignorin...
Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/15595#discussion_r84588308

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisSuite.scala ---
@@ -377,4 +377,14 @@ class AnalysisSuite extends AnalysisTest {
     assertExpressionType(sum(Divide(Decimal(1), 2.0)), DoubleType)
     assertExpressionType(sum(Divide(1.0, Decimal(2.0))), DoubleType)
   }
+

--- End diff --

nit: delete extra newline
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user tejasapatil commented on the issue: https://github.com/apache/spark/pull/15575

@viirya: As per my understanding, if the child operator emits `col`, then after applying `ExpandExec` the output is `col'`. The original child partitioning is over `col`, and `ExpandExec` does not seem to alter that. The table above was to summarise the state of things before this PR; I did not change any semantics in this PR since it's pure refactoring.
[GitHub] spark issue #15573: [SPARK-18035] [SQL] Unwrapping java maps in HiveInspecto...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15573

**[Test build #67402 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67402/consoleFull)** for PR 15573 at commit [`b263278`](https://github.com/apache/spark/commit/b263278573adc00fcc3f9fc72604b573936a5516).
[GitHub] spark issue #15573: [SPARK-18035] [SQL] Unwrapping java maps in HiveInspecto...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/15573

LGTM pending test
[GitHub] spark pull request #15573: [SPARK-18035] [SQL] Unwrapping java maps in HiveI...
Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/15573#discussion_r84588059

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala ---
@@ -433,18 +413,12 @@ object CatalystTypeConverters {
     case seq: Seq[Any] => new GenericArrayData(seq.map(convertToCatalyst).toArray)
     case r: Row => InternalRow(r.toSeq.map(convertToCatalyst): _*)
     case arr: Array[Any] => new GenericArrayData(arr.map(convertToCatalyst))
-    case m: Map[_, _] =>
-      val length = m.size
-      val convertedKeys = new Array[Any](length)
-      val convertedValues = new Array[Any](length)
-
-      var i = 0
-      for ((key, value) <- m) {
-        convertedKeys(i) = convertToCatalyst(key)
-        convertedValues(i) = convertToCatalyst(value)
-        i += 1
-      }
-      ArrayBasedMapData(convertedKeys, convertedValues)
+    case map: Map[_, _] =>
+      ArrayBasedMapData(
+        map.iterator,
+        map.size,
+        (key) => convertToCatalyst(key),
+        (value) => convertToCatalyst(value))

--- End diff --

changed
[GitHub] spark issue #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15148

**[Test build #67401 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67401/consoleFull)** for PR 15148 at commit [`e14f73e`](https://github.com/apache/spark/commit/e14f73e8a49d409e09a6ed541d4b40f07dc81013).
[GitHub] spark pull request #15573: [SPARK-18035] [SQL] Unwrapping java maps in HiveI...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/15573#discussion_r84587828

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala ---
@@ -433,18 +413,12 @@ object CatalystTypeConverters {
     case seq: Seq[Any] => new GenericArrayData(seq.map(convertToCatalyst).toArray)
     case r: Row => InternalRow(r.toSeq.map(convertToCatalyst): _*)
     case arr: Array[Any] => new GenericArrayData(arr.map(convertToCatalyst))
-    case m: Map[_, _] =>
-      val length = m.size
-      val convertedKeys = new Array[Any](length)
-      val convertedValues = new Array[Any](length)
-
-      var i = 0
-      for ((key, value) <- m) {
-        convertedKeys(i) = convertToCatalyst(key)
-        convertedValues(i) = convertToCatalyst(value)
-        i += 1
-      }
-      ArrayBasedMapData(convertedKeys, convertedValues)
+    case map: Map[_, _] =>
+      ArrayBasedMapData(
+        map.iterator,
+        map.size,
+        (key) => convertToCatalyst(key),
+        (value) => convertToCatalyst(value))

--- End diff --

It just looks weird to use different apply functions in the same file. How about this?
```Scala
ArrayBasedMapData(
  map,
  (key: Any) => convertToCatalyst(key),
  (value: Any) => convertToCatalyst(value))
```
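Both variants discussed above share one shape: walk a map once, applying a key converter and a value converter, and accumulate parallel key/value arrays. A hedged Python sketch of that shape (the function name is hypothetical; it only mirrors the structure of `ArrayBasedMapData(map.iterator, map.size, keyConverter, valueConverter)`):

```python
def convert_map(m, key_converter, value_converter):
    """Build parallel key/value lists from a map, converting each side.
    A single pass, like the for-comprehension the PR replaces."""
    keys, values = [], []
    for k, v in m.items():
        keys.append(key_converter(k))
        values.append(value_converter(v))
    return keys, values
```

Keeping keys and values in index-aligned arrays is what lets the catalyst side reconstruct entry i as (keys[i], values[i]) without storing pair objects.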
[GitHub] spark issue #15595: [SPARK-18058][SQL] Comparing column types ignoring Nulla...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15595

Merged build finished. Test PASSed.
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14547

**[Test build #67400 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67400/consoleFull)** for PR 14547 at commit [`66d3396`](https://github.com/apache/spark/commit/66d33963fcba05b4303d34891635607f54e10364).

* This patch **fails MiMa tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14547 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67400/ Test FAILed.
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/14547 Merged build finished. Test FAILed.
[GitHub] spark issue #15595: [SPARK-18058][SQL] Comparing column types ignoring Nulla...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15595 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67397/ Test PASSed.
[GitHub] spark issue #15595: [SPARK-18058][SQL] Comparing column types ignoring Nulla...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15595 **[Test build #67397 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67397/consoleFull)** for PR 15595 at commit [`e7b5a9b`](https://github.com/apache/spark/commit/e7b5a9b32328c5896e676284db1638819530b6dc). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/15575 @rxin yeah, I am curious why `ExpandExec` and `GenerateExec` have different `outputPartitioning`...
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/15575 @tejasapatil I see there is a 1:1 mapping between the output partitions of the child operator and the output partitions of `ExpandExec`. For example, suppose we apply an Expand to a data set with col: [1, 2, 3], the projections are col, col + 1, col + 2, and the data set is partitioned by HashPartition(col). We have three partitions: p1: [1] p2: [2] p3: [3] After the Expand, the data set becomes: p1: [1, 2, 3] p2: [2, 3, 4] p3: [3, 4, 5] Is it still valid for HashPartition(col)? It looks like it isn't. I think that is why there is a comment on ExpandExec at the code position you link to. BTW, in your table `ExpandExec`'s `outputPartitioning` is `UnknownPartitioning`, right? If it doesn't change the child's partitioning, why don't we set it to the child's outputPartitioning?
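viirya's example can be checked with a tiny standalone sketch. This is plain Python, not Spark; `hash_partition`, `expand`, and `satisfies_hash_partitioning` are hypothetical helpers used only to illustrate why the expanded output no longer satisfies the hash-partitioning invariant, even though no rows move between partitions:

```python
def hash_partition(rows, num_partitions):
    """Assign each value to a partition by hashing it (like HashPartitioning)."""
    parts = [[] for _ in range(num_partitions)]
    for v in rows:
        parts[hash(v) % num_partitions].append(v)
    return parts

def expand(parts, projections):
    """Apply every projection to every row, within each partition (no shuffle)."""
    return [[proj(v) for v in part for proj in projections] for part in parts]

def satisfies_hash_partitioning(parts):
    """Check the invariant: every value lives in the partition its hash selects."""
    n = len(parts)
    return all(hash(v) % n == i for i, part in enumerate(parts) for v in part)

parts = hash_partition([1, 2, 3], 3)
projections = [lambda c: c, lambda c: c + 1, lambda c: c + 2]
expanded = expand(parts, projections)
# The input satisfies the hash-partitioning invariant; the expanded output
# does not, because projected values like col + 1 hash to other partitions.
```

This matches both sides of the thread: the 1:1 mapping of partitions is preserved (no shuffle happens), yet reporting `HashPartitioning(col)` downstream would be unsound, so `UnknownPartitioning` is the safe answer.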
[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/14547 **[Test build #67400 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67400/consoleFull)** for PR 14547 at commit [`66d3396`](https://github.com/apache/spark/commit/66d33963fcba05b4303d34891635607f54e10364).
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user tejasapatil commented on the issue: https://github.com/apache/spark/pull/15575 @viirya >> However, if its child has certain partition such as HashPartition, after ExpandExec it becomes a UnknownPartitioning The notion of `Partitioning` in Spark is the distribution of rows across tasks. Even if the child's output has `HashPartitioning`, there is a 1:1 mapping between the output partitions of the child operator and the output partitions of `ExpandExec`. So, applying `ExpandExec` does not alter the partitioning of the child's output.
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15575 The current thing LGTM. cc @yhuai do you have any other feedback?
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r84587197 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/MinHashSuite.scala --- @@ -0,0 +1,84 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.linalg.{Vector, Vectors} +import org.apache.spark.mllib.util.MLlibTestSparkContext + +class MinHashSuite extends SparkFunSuite with MLlibTestSparkContext { + test("MinHash") { +val data = { + for (i <- 0 to 95) yield Vectors.sparse(100, (i until i + 5).map((_, 1.0))) +} +val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("keys") + +val mh = new MinHash() + .setOutputDim(1) + .setInputCol("keys") + .setOutputCol("values") + .setSeed(0) --- End diff -- Here and elsewhere --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r84587191 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala --- @@ -0,0 +1,340 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import scala.util.Random + +import org.apache.spark.annotation.Since +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.linalg.{Vector, VectorUDT} +import org.apache.spark.ml.param.{IntParam, ParamMap, ParamValidators} +import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} +import org.apache.spark.ml.util.SchemaUtils +import org.apache.spark.sql._ +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ + +/** + * Params for [[LSH]]. + */ +@Since("2.1.0") +private[ml] trait LSHParams extends HasInputCol with HasOutputCol { + /** + * Param for the dimension of LSH OR-amplification. + * + * In this implementation, we use LSH OR-amplification to reduce the false negative rate. The + * higher the dimension is, the lower the false negative rate. 
+ * @group param + */ + @Since("2.1.0") + final val outputDim: IntParam = new IntParam(this, "outputDim", "output dimension, where" + +"increasing dimensionality lowers the false negative rate, and decreasing dimensionality" + +" improves the running performance", ParamValidators.gt(0)) + + /** @group getParam */ + @Since("2.1.0") + final def getOutputDim: Int = $(outputDim) + + setDefault(outputDim -> 1, outputCol -> "lshFeatures") + + /** + * Transform the Schema for LSH + * @param schema The schema of the input dataset without [[outputCol]] + * @return A derived schema with [[outputCol]] added + */ + @Since("2.1.0") + protected[this] final def validateAndTransformSchema(schema: StructType): StructType = { +SchemaUtils.appendColumn(schema, $(outputCol), new VectorUDT) + } +} + +/** + * Model produced by [[LSH]]. + */ +@Since("2.1.0") +private[ml] abstract class LSHModel[T <: LSHModel[T]] extends Model[T] with LSHParams { + self: T => + + @Since("2.1.0") + override def copy(extra: ParamMap): T = defaultCopy(extra) + + /** + * The hash function of LSH, mapping a predefined KeyType to a Vector + * @return The mapping of LSH function. + */ + @Since("2.1.0") + protected[this] val hashFunction: Vector => Vector + + /** + * Calculate the distance between two different keys using the distance metric corresponding + * to the hashFunction + * @param x One input vector in the metric space + * @param y One input vector in the metric space + * @return The distance between x and y + */ + @Since("2.1.0") + protected[ml] def keyDistance(x: Vector, y: Vector): Double + + /** + * Calculate the distance between two different hash Vectors. 
+ * + * @param x One of the hash vector + * @param y Another hash vector + * @return The distance between hash vectors x and y + */ + @Since("2.1.0") + protected[ml] def hashDistance(x: Vector, y: Vector): Double + + @Since("2.1.0") + override def transform(dataset: Dataset[_]): DataFrame = { +transformSchema(dataset.schema, logging = true) +val transformUDF = udf(hashFunction, new VectorUDT) +dataset.withColumn($(outputCol), transformUDF(dataset($(inputCol + } + + @Since("2.1.0") + override def transformSchema(schema: StructType): StructType = { +validateAndTransformSchema(schema) + } + + /** + * Given a large dataset and an item, approximately find at most k items which have the closest + * distance to the item. If the [[outputCol]] is missing, the method will transform the data; if + * the [[outputCol]] exists, it will use the
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r84587195 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/MinHashSuite.scala --- @@ -0,0 +1,84 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.linalg.{Vector, Vectors} +import org.apache.spark.mllib.util.MLlibTestSparkContext + +class MinHashSuite extends SparkFunSuite with MLlibTestSparkContext { + test("MinHash") { +val data = { + for (i <- 0 to 95) yield Vectors.sparse(100, (i until i + 5).map((_, 1.0))) +} +val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("keys") + +val mh = new MinHash() + .setOutputDim(1) + .setInputCol("keys") + .setOutputCol("values") + .setSeed(0) --- End diff -- Use seed != 0 as a habit --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
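For context on the suite under review: MinHash hashes a set of active indices through a random affine function and keeps the minimum, so sets with high Jaccard similarity tend to produce the same hash. A minimal standalone sketch (plain Python, not the Spark ML implementation; the helper name and the choice of prime are illustrative assumptions, and it uses a non-zero seed, per the review comment):

```python
import random

def min_hash(active_indices, seed, prime=2038074743):
    """One MinHash function: h(x) = (a*x + b) % prime, minimized over the set."""
    rng = random.Random(seed)  # seed != 0, as the review suggests as a habit
    a = rng.randrange(1, prime)
    b = rng.randrange(0, prime)
    return min((a * x + b) % prime for x in active_indices)

# Overlapping index sets, analogous to the suite's sparse vectors with 5
# consecutive active positions starting at i and i + 1 (4 of 5 indices shared).
s1 = set(range(0, 5))
s2 = set(range(1, 6))
```

Over many independent seeds, `min_hash(s1, seed) == min_hash(s2, seed)` holds with probability approximately equal to the Jaccard similarity of the two sets (here 4/6), which is the property the estimator relies on.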
[GitHub] spark issue #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15148 Merged build finished. Test PASSed.
[GitHub] spark issue #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15148 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67398/ Test PASSed.
[GitHub] spark issue #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15148 **[Test build #67398 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67398/consoleFull)** for PR 15148 at commit [`cad4ecb`](https://github.com/apache/spark/commit/cad4ecb3cea47e16b9c1073d30d8fd57bc397621). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user tejasapatil commented on the issue: https://github.com/apache/spark/pull/15575 @viirya >> In the table in the description, CoalesceExec output UnknownPartitioning Yes. Since partitions == 1 is a corner case, I did not put that in the table. If you look at the code, it's doing the right thing: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala#L490
[GitHub] spark pull request #15600: [SPARK-17698] [SQL] Join predicates should not co...
Github user tejasapatil closed the pull request at: https://github.com/apache/spark/pull/15600
[GitHub] spark issue #15463: [SPARK-17894] [CORE] Ensure uniqueness of TaskSetManager...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15463 **[Test build #67399 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67399/consoleFull)** for PR 15463 at commit [`cd6d240`](https://github.com/apache/spark/commit/cd6d240c8972e843a1abf586c6d324bff8beefd5).
[GitHub] spark issue #15600: [SPARK-17698] [SQL] Join predicates should not contain f...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15600 Thanks - merging in. Can you close this?
[GitHub] spark issue #15463: [SPARK-17894] [CORE] Ensure uniqueness of TaskSetManager...
Github user shivaram commented on the issue: https://github.com/apache/spark/pull/15463 Jenkins, retest this please
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15575 @viirya of course if you say coalesce(1) it is a single partition -- any operator that reduces the output to one partition yields a single partition. For Expand, isn't it just the same as Generate?
[GitHub] spark issue #15575: [SPARK-18038] [SQL] Move output partitioning definition ...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/15575 @rxin In the table in the description, `CoalesceExec` outputs `UnknownPartitioning`; actually it can be `SinglePartition` if what you do is `coalesce(1)`. `ExpandExec` doesn't actually move rows across partitions, as @tejasapatil pointed out. However, if its child has a certain partitioning such as `HashPartition`, after `ExpandExec` it becomes an `UnknownPartitioning`. I am not sure whether it changes the partitioning or not. From the perspective of the physical plan's output partitioning, it is changed indeed.
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r84586831 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala --- @@ -0,0 +1,343 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import scala.util.Random + +import org.apache.spark.annotation.Since +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.linalg.{Vector, VectorUDT} +import org.apache.spark.ml.param.{IntParam, ParamMap, ParamValidators} +import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} +import org.apache.spark.ml.util.SchemaUtils +import org.apache.spark.sql._ +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ + +/** + * Params for [[LSH]]. + */ +@Since("2.1.0") +private[ml] trait LSHParams extends HasInputCol with HasOutputCol { + /** + * Param for the dimension of LSH OR-amplification. + * + * In this implementation, we use LSH OR-amplification to reduce the false negative rate. The + * higher the dimension is, the lower the false negative rate. 
+ * @group param + */ + @Since("2.1.0") + final val outputDim: IntParam = new IntParam(this, "outputDim", "output dimension, where" + +"increasing dimensionality lowers the false negative rate, and decreasing dimensionality" + +" improves the running performance", ParamValidators.gt(0)) + + /** @group getParam */ + @Since("2.1.0") + final def getOutputDim: Int = $(outputDim) + + setDefault(outputDim -> 1, outputCol -> "lshFeatures") + + /** + * Transform the Schema for LSH + * @param schema The schema of the input dataset without [[outputCol]] + * @return A derived schema with [[outputCol]] added + */ + @Since("2.1.0") + protected[this] final def validateAndTransformSchema(schema: StructType): StructType = { +SchemaUtils.appendColumn(schema, $(outputCol), new VectorUDT) + } +} + +/** + * Model produced by [[LSH]]. + */ +@Since("2.1.0") +private[ml] abstract class LSHModel[T <: LSHModel[T]] extends Model[T] with LSHParams { + self: T => + + @Since("2.1.0") + override def copy(extra: ParamMap): T = defaultCopy(extra) + + /** + * The hash function of LSH, mapping a predefined KeyType to a Vector + * @return The mapping of LSH function. + */ + @Since("2.1.0") + protected[this] val hashFunction: Vector => Vector + + /** + * Calculate the distance between two different keys using the distance metric corresponding + * to the hashFunction + * @param x One input vector in the metric space + * @param y One input vector in the metric space + * @return The distance between x and y + */ + @Since("2.1.0") + protected[ml] def keyDistance(x: Vector, y: Vector): Double + + /** + * Calculate the distance between two different hash Vectors. 
+ * + * @param x One of the hash vector + * @param y Another hash vector + * @return The distance between hash vectors x and y + */ + @Since("2.1.0") + protected[ml] def hashDistance(x: Vector, y: Vector): Double + + @Since("2.1.0") + override def transform(dataset: Dataset[_]): DataFrame = { +transformSchema(dataset.schema, logging = true) +val transformUDF = udf(hashFunction, new VectorUDT) +dataset.withColumn($(outputCol), transformUDF(dataset($(inputCol + } + + @Since("2.1.0") + override def transformSchema(schema: StructType): StructType = { +validateAndTransformSchema(schema) + } + + /** + * Given a large dataset and an item, approximately find at most k items which have the closest + * distance to the item. If the [[outputCol]] is missing, the method will transform the data; if + * the [[outputCol]] exists, it will use the
[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user Yunni commented on a diff in the pull request: https://github.com/apache/spark/pull/15148#discussion_r84586829

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala ---
@@ -0,0 +1,343 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import scala.util.Random
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.linalg.{Vector, VectorUDT}
+import org.apache.spark.ml.param.{IntParam, ParamMap, ParamValidators}
+import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.ml.util.SchemaUtils
+import org.apache.spark.sql._
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * Params for [[LSH]].
+ */
+@Since("2.1.0")
+private[ml] trait LSHParams extends HasInputCol with HasOutputCol {
+  /**
+   * Param for the dimension of LSH OR-amplification.
+   *
+   * In this implementation, we use LSH OR-amplification to reduce the false negative rate. The
+   * higher the dimension is, the lower the false negative rate.
+   * @group param
+   */
+  @Since("2.1.0")
+  final val outputDim: IntParam = new IntParam(this, "outputDim", "output dimension, where" +
+    " increasing dimensionality lowers the false negative rate, and decreasing dimensionality" +
+    " improves the running performance", ParamValidators.gt(0))
+
+  /** @group getParam */
+  @Since("2.1.0")
+  final def getOutputDim: Int = $(outputDim)
+
+  setDefault(outputDim -> 1, outputCol -> "lshFeatures")
+
+  /**
+   * Transform the Schema for LSH
+   * @param schema The schema of the input dataset without [[outputCol]]
+   * @return A derived schema with [[outputCol]] added
+   */
+  @Since("2.1.0")
+  protected[this] final def validateAndTransformSchema(schema: StructType): StructType = {
+    SchemaUtils.appendColumn(schema, $(outputCol), new VectorUDT)
+  }
+}
+
+/**
+ * Model produced by [[LSH]].
+ */
+@Since("2.1.0")
+private[ml] abstract class LSHModel[T <: LSHModel[T]] extends Model[T] with LSHParams {
+  self: T =>
+
+  @Since("2.1.0")
+  override def copy(extra: ParamMap): T = defaultCopy(extra)
+
+  /**
+   * The hash function of LSH, mapping a predefined KeyType to a Vector
+   * @return The mapping of LSH function.
+   */
+  @Since("2.1.0")
+  protected[this] val hashFunction: Vector => Vector
+
+  /**
+   * Calculate the distance between two different keys using the distance metric corresponding
+   * to the hashFunction
+   * @param x One input vector in the metric space
+   * @param y One input vector in the metric space
+   * @return The distance between x and y
+   */
+  @Since("2.1.0")
+  protected[ml] def keyDistance(x: Vector, y: Vector): Double
+
+  /**
+   * Calculate the distance between two different hash vectors.
+   *
+   * @param x One of the hash vectors
+   * @param y Another hash vector
+   * @return The distance between hash vectors x and y
+   */
+  @Since("2.1.0")
+  protected[ml] def hashDistance(x: Vector, y: Vector): Double
+
+  @Since("2.1.0")
+  override def transform(dataset: Dataset[_]): DataFrame = {
+    transformSchema(dataset.schema, logging = true)
+    val transformUDF = udf(hashFunction, new VectorUDT)
+    dataset.withColumn($(outputCol), transformUDF(dataset($(inputCol))))
+  }
+
+  @Since("2.1.0")
+  override def transformSchema(schema: StructType): StructType = {
+    validateAndTransformSchema(schema)
+  }
+
+  /**
+   * Given a large dataset and an item, approximately find at most k items which have the closest
+   * distance to the item. If the [[outputCol]] is missing, the method will transform the data; if
+   * the [[outputCol]] exists, it will use the
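The OR-amplification described in the `outputDim` doc above can be sketched outside Spark with a toy random-hyperplane hash family. All names below (`LshSketch`, `isCandidatePair`) are hypothetical and not the PR's API; the point is only the mechanism: two items are candidate neighbors if *any* of the `outputDim` hash bits agree, so a larger `outputDim` lowers the false negative rate at the cost of more false positives.

```scala
import scala.util.Random

object LshSketch {
  // Toy LSH family: sign-of-dot-product ("random hyperplane") hashing.
  // Returns a hash function mapping a dim-dimensional vector to outputDim bits.
  def hashFunction(dim: Int, outputDim: Int, seed: Long): Array[Double] => Array[Int] = {
    val rng = new Random(seed)
    // One random hyperplane per output dimension.
    val planes = Array.fill(outputDim, dim)(rng.nextGaussian())
    (v: Array[Double]) =>
      planes.map { plane =>
        val dot = plane.zip(v).map { case (p, x) => p * x }.sum
        if (dot >= 0) 1 else 0
      }
  }

  // OR-amplification: a candidate pair if ANY of the hash bits agree.
  def isCandidatePair(hx: Array[Int], hy: Array[Int]): Boolean =
    hx.zip(hy).exists { case (a, b) => a == b }
}
```

Identical inputs always hash identically, so they are always a candidate pair; dissimilar vectors only need to miss on *every* bit to be pruned, which becomes harder as `outputDim` grows.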
[GitHub] spark issue #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user Yunni commented on the issue: https://github.com/apache/spark/pull/15148 Thanks @jkbradley. I have removed BitSampling and SignRandomProjection for a follow-up PR. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15148: [SPARK-5992][ML] Locality Sensitive Hashing
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15148 **[Test build #67398 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67398/consoleFull)** for PR 15148 at commit [`cad4ecb`](https://github.com/apache/spark/commit/cad4ecb3cea47e16b9c1073d30d8fd57bc397621).
[GitHub] spark issue #14529: [TRIVIAL][SQL] Match the name of OrcRelation companion o...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/14529 Thanks, I am closing this!
[GitHub] spark pull request #14529: [TRIVIAL][SQL] Match the name of OrcRelation comp...
Github user HyukjinKwon closed the pull request at: https://github.com/apache/spark/pull/14529
[GitHub] spark issue #15600: [SPARK-17698] [SQL] Join predicates should not contain f...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15600 Merged build finished. Test PASSed.
[GitHub] spark issue #15600: [SPARK-17698] [SQL] Join predicates should not contain f...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15600 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67396/ Test PASSed.
[GitHub] spark issue #15600: [SPARK-17698] [SQL] Join predicates should not contain f...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15600 **[Test build #67396 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67396/consoleFull)** for PR 15600 at commit [`df50838`](https://github.com/apache/spark/commit/df5083894198e1a85fb17544fc596a3869a9e1b6). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15595: [SPARK-18058][SQL] Comparing column types ignoring Nulla...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15595 **[Test build #67397 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67397/consoleFull)** for PR 15595 at commit [`e7b5a9b`](https://github.com/apache/spark/commit/e7b5a9b32328c5896e676284db1638819530b6dc).
[GitHub] spark issue #15541: [SPARK-17637][Scheduler]Packed scheduling for Spark task...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15541 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67395/ Test PASSed.
[GitHub] spark issue #15541: [SPARK-17637][Scheduler]Packed scheduling for Spark task...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15541 Merged build finished. Test PASSed.
[GitHub] spark issue #15541: [SPARK-17637][Scheduler]Packed scheduling for Spark task...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15541 **[Test build #67395 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67395/consoleFull)** for PR 15541 at commit [`dd2b207`](https://github.com/apache/spark/commit/dd2b2077430bbb07047e928d20c1ad8fe940827a). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #15484: [SPARK-17868][SQL] Do not use bitmasks during par...
Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/15484#discussion_r84590609

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -255,98 +265,125 @@ class Analyzer(
       expr transform {
         case e: GroupingID =>
           if (e.groupByExprs.isEmpty || e.groupByExprs == groupByExprs) {
-            gid
+            Alias(gid, toPrettySQL(e))()
           } else {
             throw new AnalysisException(
               s"Columns of grouping_id (${e.groupByExprs.mkString(",")}) does not match " +
                 s"grouping columns (${groupByExprs.mkString(",")})")
           }
-        case Grouping(col: Expression) =>
+        case e @ Grouping(col: Expression) =>
           val idx = groupByExprs.indexOf(col)
           if (idx >= 0) {
-            Cast(BitwiseAnd(ShiftRight(gid, Literal(groupByExprs.length - 1 - idx)),
-              Literal(1)), ByteType)
+            Alias(Cast(BitwiseAnd(ShiftRight(gid, Literal(groupByExprs.length - 1 - idx)),
+              Literal(1)), ByteType), toPrettySQL(e))()
           } else {
             throw new AnalysisException(s"Column of grouping ($col) can't be found " +
               s"in grouping columns ${groupByExprs.mkString(",")}")
           }
       }
     }

-    // This require transformUp to replace grouping()/grouping_id() in resolved Filter/Sort
-    def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
-      case a if !a.childrenResolved => a // be sure all of the children are resolved.
-      case p if p.expressions.exists(hasGroupingAttribute) =>
-        failAnalysis(
-          s"${VirtualColumn.hiveGroupingIdName} is deprecated; use grouping_id() instead")
-
-      case Aggregate(Seq(c @ Cube(groupByExprs)), aggregateExpressions, child) =>
-        GroupingSets(bitmasks(c), groupByExprs, child, aggregateExpressions)
-      case Aggregate(Seq(r @ Rollup(groupByExprs)), aggregateExpressions, child) =>
-        GroupingSets(bitmasks(r), groupByExprs, child, aggregateExpressions)
+    /*
+     * Create new alias for all group by expressions for `Expand` operator.
+     */
+    private def constructGroupByAlias(groupByExprs: Seq[Expression]): Seq[Alias] = {
+      groupByExprs.map {
+        case e: NamedExpression => Alias(e, e.name)()
+        case other => Alias(other, other.toString)()
+      }
+    }

-      // Ensure all the expressions have been resolved.
-      case x: GroupingSets if x.expressions.forall(_.resolved) =>
-        val gid = AttributeReference(VirtualColumn.groupingIdName, IntegerType, false)()
-
-        // Expand works by setting grouping expressions to null as determined by the bitmasks. To
-        // prevent these null values from being used in an aggregate instead of the original value
-        // we need to create new aliases for all group by expressions that will only be used for
-        // the intended purpose.
-        val groupByAliases: Seq[Alias] = x.groupByExprs.map {
-          case e: NamedExpression => Alias(e, e.name)()
-          case other => Alias(other, other.toString)()
+    /*
+     * Construct [[Expand]] operator with grouping sets.
+     */
+    private def constructExpand(
+        selectedGroupByExprs: Seq[Seq[Expression]],
+        child: LogicalPlan,
+        groupByAliases: Seq[Alias],
+        gid: Attribute): LogicalPlan = {
+      // Change the nullability of group by aliases if necessary. For example, if we have
+      // GROUPING SETS ((a,b), a), we do not need to change the nullability of a, but we
+      // should change the nullability of b to be TRUE.
+      // TODO: For Cube/Rollup just set nullability to be `true`.
+      val expandedAttributes = groupByAliases.zipWithIndex.map { case (a, idx) =>
--- End diff --

+1. Looking at it more, I feel `zipWithIndex` is not needed at all and the `map` would suffice.
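The reviewer's point can be illustrated with toy stand-ins for Catalyst's `Alias` and `Attribute` (hypothetical types, just to show the shape of the change): since `idx` is never used in the body, `zipWithIndex` can be dropped in favor of a plain `map`.

```scala
// Hypothetical stand-ins for Catalyst's Alias and Attribute:
case class Attr(name: String, nullable: Boolean)
case class GroupAlias(child: String, name: String) {
  def toAttribute: Attr = Attr(name, nullable = false)
}

// Plain map: the index from zipWithIndex was never used in the body.
def expandedAttributes(groupByAliases: Seq[GroupAlias],
                       selectedGroupByExprs: Seq[Seq[String]]): Seq[Attr] =
  groupByAliases.map { a =>
    // Nullable iff some grouping set omits this expression
    // (Expand null-fills it for that set).
    if (selectedGroupByExprs.exists(!_.contains(a.child)))
      a.toAttribute.copy(nullable = true)
    else
      a.toAttribute
  }
```

For `GROUPING SETS ((a, b), (a))`, `a` appears in every set and stays non-nullable, while `b` is missing from `(a)` and becomes nullable, matching the comment in the diff.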
[GitHub] spark pull request #15484: [SPARK-17868][SQL] Do not use bitmasks during par...
Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/15484#discussion_r84585577

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -216,10 +216,16 @@ class Analyzer(
      * Group Count: N + 1 (N is the number of group expressions)
      *
      * We need to get all of its subsets for the rule described above, the subset is
-     * represented as the bit masks.
+     * represented as sequence of expressions.
      */
-    def bitmasks(r: Rollup): Seq[Int] = {
-      Seq.tabulate(r.groupByExprs.length + 1)(idx => (1 << idx) - 1)
+    def rollupExprs(exprs: Seq[Expression]): Seq[Seq[Expression]] = {
+      val buffer = ArrayBuffer.empty[Seq[Expression]]
--- End diff --

Avoid using `ArrayBuffer` as insertions would lead to expansion of underlying array and copying of data to the new one. Since you know the size upfront, you could create an `Array` of required size.
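The reviewer's suggestion — size the result upfront instead of growing an `ArrayBuffer` — can be sketched generically (an illustration, not the patch's final code). A rollup over n expressions always yields exactly n + 1 prefixes, so `Seq.tabulate` fits naturally:

```scala
// Rollup over (a, b, c) produces the prefixes (), (a), (a, b), (a, b, c).
// Seq.tabulate allocates all n + 1 slots upfront, mirroring the old
// bitmask version Seq.tabulate(n + 1)(idx => (1 << idx) - 1),
// with no intermediate buffer growth or copying.
def rollupExprs[T](exprs: Seq[T]): Seq[Seq[T]] =
  Seq.tabulate(exprs.length + 1)(i => exprs.take(i))
```

The known-size allocation is the whole point of the review comment: `ArrayBuffer` doubles and copies its backing array as it grows, which is wasted work when the length is fixed in advance.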
[GitHub] spark pull request #15484: [SPARK-17868][SQL] Do not use bitmasks during par...
Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/15484#discussion_r84585881

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -255,98 +265,125 @@ class Analyzer(
       expr transform {
         case e: GroupingID =>
           if (e.groupByExprs.isEmpty || e.groupByExprs == groupByExprs) {
-            gid
+            Alias(gid, toPrettySQL(e))()
           } else {
             throw new AnalysisException(
               s"Columns of grouping_id (${e.groupByExprs.mkString(",")}) does not match " +
                 s"grouping columns (${groupByExprs.mkString(",")})")
           }
-        case Grouping(col: Expression) =>
+        case e @ Grouping(col: Expression) =>
           val idx = groupByExprs.indexOf(col)
           if (idx >= 0) {
-            Cast(BitwiseAnd(ShiftRight(gid, Literal(groupByExprs.length - 1 - idx)),
-              Literal(1)), ByteType)
+            Alias(Cast(BitwiseAnd(ShiftRight(gid, Literal(groupByExprs.length - 1 - idx)),
+              Literal(1)), ByteType), toPrettySQL(e))()
           } else {
             throw new AnalysisException(s"Column of grouping ($col) can't be found " +
               s"in grouping columns ${groupByExprs.mkString(",")}")
           }
       }
     }

-    // This require transformUp to replace grouping()/grouping_id() in resolved Filter/Sort
-    def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
-      case a if !a.childrenResolved => a // be sure all of the children are resolved.
-      case p if p.expressions.exists(hasGroupingAttribute) =>
-        failAnalysis(
-          s"${VirtualColumn.hiveGroupingIdName} is deprecated; use grouping_id() instead")
-
-      case Aggregate(Seq(c @ Cube(groupByExprs)), aggregateExpressions, child) =>
-        GroupingSets(bitmasks(c), groupByExprs, child, aggregateExpressions)
-      case Aggregate(Seq(r @ Rollup(groupByExprs)), aggregateExpressions, child) =>
-        GroupingSets(bitmasks(r), groupByExprs, child, aggregateExpressions)
+    /*
+     * Create new alias for all group by expressions for `Expand` operator.
+     */
+    private def constructGroupByAlias(groupByExprs: Seq[Expression]): Seq[Alias] = {
+      groupByExprs.map {
+        case e: NamedExpression => Alias(e, e.name)()
+        case other => Alias(other, other.toString)()
+      }
+    }

-      // Ensure all the expressions have been resolved.
-      case x: GroupingSets if x.expressions.forall(_.resolved) =>
-        val gid = AttributeReference(VirtualColumn.groupingIdName, IntegerType, false)()
-
-        // Expand works by setting grouping expressions to null as determined by the bitmasks. To
-        // prevent these null values from being used in an aggregate instead of the original value
-        // we need to create new aliases for all group by expressions that will only be used for
-        // the intended purpose.
-        val groupByAliases: Seq[Alias] = x.groupByExprs.map {
-          case e: NamedExpression => Alias(e, e.name)()
-          case other => Alias(other, other.toString)()
+    /*
+     * Construct [[Expand]] operator with grouping sets.
+     */
+    private def constructExpand(
+        selectedGroupByExprs: Seq[Seq[Expression]],
+        child: LogicalPlan,
+        groupByAliases: Seq[Alias],
+        gid: Attribute): LogicalPlan = {
+      // Change the nullability of group by aliases if necessary. For example, if we have
+      // GROUPING SETS ((a,b), a), we do not need to change the nullability of a, but we
+      // should change the nullability of b to be TRUE.
+      // TODO: For Cube/Rollup just set nullability to be `true`.
+      val expandedAttributes = groupByAliases.zipWithIndex.map { case (a, idx) =>
+        if (selectedGroupByExprs.exists(!_.contains(a.child))) {
+          a.toAttribute.withNullability(true)
+        } else {
+          a.toAttribute
+        }
+      }

-        val nonNullBitmask = x.bitmasks.reduce(_ & _)
-
-        val expandedAttributes = groupByAliases.zipWithIndex.map { case (a, idx) =>
-          a.toAttribute.withNullability((nonNullBitmask & 1 << idx) == 0)
+      val groupingSetsAttributes = selectedGroupByExprs.map { groupingSetExprs =>
+        groupingSetExprs.map { expr =>
+          val alias = groupByAliases.find(_.child.semanticEquals(expr)).getOrElse(
+            failAnalysis(s"$expr doesn't show up in the GROUP BY list"))
--- End diff --

can you also display the GROUP BY list in the message ?
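A minimal sketch of the message change the reviewer asks for (the function name is hypothetical; in the real code this string is passed to `failAnalysis`):

```scala
// Include the full GROUP BY list in the error, not just the missing expression,
// so the user can see what the expression was compared against.
def missingExprMessage(expr: String, groupByExprs: Seq[String]): String =
  s"$expr doesn't show up in the GROUP BY list (${groupByExprs.mkString(", ")})"
```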
[GitHub] spark issue #15600: [SPARK-17698] [SQL] Join predicates should not contain f...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15600 **[Test build #67396 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67396/consoleFull)** for PR 15600 at commit [`df50838`](https://github.com/apache/spark/commit/df5083894198e1a85fb17544fc596a3869a9e1b6).
[GitHub] spark issue #15272: [SPARK-17698] [SQL] Join predicates should not contain f...
Github user tejasapatil commented on the issue: https://github.com/apache/spark/pull/15272 @rxin : Here is the backport for the 2.0 branch: https://github.com/apache/spark/pull/15600
[GitHub] spark pull request #15600: [SPARK-17698] [SQL] Join predicates should not co...
GitHub user tejasapatil opened a pull request: https://github.com/apache/spark/pull/15600

[SPARK-17698] [SQL] Join predicates should not contain filter clauses

## What changes were proposed in this pull request?

This is a backport of https://github.com/apache/spark/pull/15272 to the 2.0 branch.

Jira: https://issues.apache.org/jira/browse/SPARK-17698

`ExtractEquiJoinKeys` is incorrectly using filter predicates as the join condition for joins. `canEvaluate` [0] tries to see if an `Expression` can be evaluated using the output of a given `Plan`. In the case of filter predicates (e.g. `a.id='1'`), the `Expression` passed for the right hand side (i.e. `'1'`) is a `Literal` which does not have any attribute references. Thus `expr.references` is an empty set, which theoretically is a subset of any set. This leads to `canEvaluate` returning `true`, so `a.id='1'` is treated as a join predicate. While this does not lead to incorrect results, in the case of bucketed + sorted tables we might miss out on avoiding an unnecessary shuffle + sort. See the example below:

[0]: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala#L91

e.g.

```
val df = (1 until 10).toDF("id").coalesce(1)
hc.sql("DROP TABLE IF EXISTS table1").collect
df.write.bucketBy(8, "id").sortBy("id").saveAsTable("table1")
hc.sql("DROP TABLE IF EXISTS table2").collect
df.write.bucketBy(8, "id").sortBy("id").saveAsTable("table2")

sqlContext.sql("""
  SELECT a.id, b.id
  FROM table1 a
  FULL OUTER JOIN table2 b
  ON a.id = b.id AND a.id='1' AND b.id='1'
""").explain(true)
```

BEFORE: This is doing shuffle + sort over the table scan outputs, which is not needed as both tables are bucketed and sorted on the same columns and have the same number of buckets. This should be a single-stage job.

```
SortMergeJoin [id#38, cast(id#38 as double), 1.0], [id#39, 1.0, cast(id#39 as double)], FullOuter
:- *Sort [id#38 ASC NULLS FIRST, cast(id#38 as double) ASC NULLS FIRST, 1.0 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(id#38, cast(id#38 as double), 1.0, 200)
:     +- *FileScan parquet default.table1[id#38] Batched: true, Format: ParquetFormat, InputPaths: file:spark-warehouse/table1, PartitionFilters: [], PushedFilters: [], ReadSchema: struct
+- *Sort [id#39 ASC NULLS FIRST, 1.0 ASC NULLS FIRST, cast(id#39 as double) ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(id#39, 1.0, cast(id#39 as double), 200)
      +- *FileScan parquet default.table2[id#39] Batched: true, Format: ParquetFormat, InputPaths: file:spark-warehouse/table2, PartitionFilters: [], PushedFilters: [], ReadSchema: struct
```

AFTER:

```
SortMergeJoin [id#32], [id#33], FullOuter, ((cast(id#32 as double) = 1.0) && (cast(id#33 as double) = 1.0))
:- *FileScan parquet default.table1[id#32] Batched: true, Format: ParquetFormat, InputPaths: file:spark-warehouse/table1, PartitionFilters: [], PushedFilters: [], ReadSchema: struct
+- *FileScan parquet default.table2[id#33] Batched: true, Format: ParquetFormat, InputPaths: file:spark-warehouse/table2, PartitionFilters: [], PushedFilters: [], ReadSchema: struct
```

## How was this patch tested?

- Added a new test case for this scenario: `SPARK-17698 Join predicates should not contain filter clauses`
- Ran all the tests in `BucketedReadSuite`

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tejasapatil/spark SPARK-17698_2.0_backport

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15600.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #15600

commit df5083894198e1a85fb17544fc596a3869a9e1b6
Author: Tejas Patil
Date: 2016-10-22T20:16:40Z

    Backport to 2.0 : [SPARK-17698] [SQL] Join predicates should not contain filter clauses
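The root cause described in the PR above — `canEvaluate` reduces to a subset check on attribute references, which is vacuously true for literal-only predicates — can be sketched with a simplified model (these are not Catalyst's actual classes, just the shape of the check):

```scala
// Simplified model of an expression carrying its attribute references.
case class Expr(references: Set[String])

// Catalyst's canEvaluate boils down to: expr.references subsetOf plan.outputSet.
def canEvaluate(expr: Expr, planOutput: Set[String]): Boolean =
  expr.references.subsetOf(planOutput)

// a.id = '1' compares a left-side attribute with a literal. The literal side
// has no references, so the empty set is "evaluable" against ANY plan --
// which is why the filter clause was misclassified as an equi-join key.
val leftSide = Expr(Set("a.id"))
val literal  = Expr(Set.empty)
```

With this model, `canEvaluate(literal, rightOutput)` is true for any right-side output, while `canEvaluate(leftSide, rightOutput)` is correctly false, which is exactly the asymmetry the fix exploits.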