[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-09-25 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/15148#discussion_r80411490
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/lsh/LSH.scala ---
@@ -0,0 +1,290 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature.lsh
+
+import scala.util.Random
+
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.linalg.{Vector, VectorUDT}
+import org.apache.spark.ml.param.{IntParam, ParamMap, ParamValidators}
+import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.sql._
+import org.apache.spark.sql.expressions.UserDefinedFunction
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * Params for [[LSH]].
+ */
+private[ml] trait LSHParams extends HasInputCol with HasOutputCol {
+  /**
+   * Param for output dimension.
+   *
+   * @group param
+   */
+  final val outputDim: IntParam = new IntParam(this, "outputDim", "output dimension",
+    ParamValidators.gt(0))
+
+  /** @group getParam */
+  final def getOutputDim: Int = $(outputDim)
+
+  setDefault(outputDim -> 1)
+
+  setDefault(outputCol -> "lsh_output")
+
+  /**
+   * Transform the Schema for LSH
+   * @param schema The schema of the input dataset without outputCol
+   * @return A derived schema with outputCol added
+   */
+  final def transformLSHSchema(schema: StructType): StructType = {
+val outputFields = schema.fields :+
+  StructField($(outputCol), new VectorUDT, nullable = false)
+StructType(outputFields)
+  }
+}
+
+/**
+ * Model produced by [[LSH]].
+ */
+abstract class LSHModel[KeyType, T <: LSHModel[KeyType, T]] private[ml]
+  extends Model[T] with LSHParams {
+  override def copy(extra: ParamMap): T = defaultCopy(extra)
+  /**
+   * :: DeveloperApi ::
+   *
+   * The hash function of LSH, mapping a predefined KeyType to a Vector
+   * @return The mapping of LSH function.
+   */
+  protected[this] val hashFunction: KeyType => Vector
+
+  /**
+   * :: DeveloperApi ::
+   *
+   * Calculate the distance between two different keys using the distance metric corresponding
+   * to the hashFunction
+   * @param x One of the point in the metric space
+   * @param y Another the point in the metric space
+   * @return The distance between x and y in double
+   */
+  protected[ml] def keyDistance(x: KeyType, y: KeyType): Double
+
+  /**
+   * :: DeveloperApi ::
+   *
+   * Calculate the distance between two different hash Vectors. By default, the distance is the
+   * minimum distance of two hash values in any dimension.
+   *
+   * @param x One of the hash vector
+   * @param y Another hash vector
+   * @return The distance between hash vectors x and y in double
+   */
+  protected[ml] def hashDistance(x: Vector, y: Vector): Double = {
+// Since it's generated by hashing, it will be a pair of dense vectors.
+    x.toDense.values.zip(y.toDense.values).map(x => math.abs(x._1 - x._2)).min
--- End diff --

Yes, I am planning to override it for BitSampling (LSH for Hamming distance)
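
For readers following along, a rough sketch of what such an override could look like (purely illustrative, not code from this PR; it assumes the hash vector stores the sampled bits as 0/1 values):

```scala
// Hypothetical BitSampling override (not from the PR): with bit sampling the
// hash vector holds sampled 0/1 bits, so a Hamming-style hash distance can be
// the fraction of sampled positions where the two vectors disagree.
override protected[ml] def hashDistance(x: Vector, y: Vector): Double = {
  val xs = x.toDense.values
  val ys = y.toDense.values
  xs.zip(ys).count { case (a, b) => a != b }.toDouble / xs.length
}
```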





[GitHub] spark issue #15224: [SPARK-17650] malformed url's throw exceptions before br...

2016-09-25 Thread zsxwing
Github user zsxwing commented on the issue:

https://github.com/apache/spark/pull/15224
  
Thanks. Merging to master and 2.0.





[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-09-25 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/15148#discussion_r80411374
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/lsh/LSH.scala ---
@@ -0,0 +1,290 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature.lsh
+
+import scala.util.Random
+
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.linalg.{Vector, VectorUDT}
+import org.apache.spark.ml.param.{IntParam, ParamMap, ParamValidators}
+import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.sql._
+import org.apache.spark.sql.expressions.UserDefinedFunction
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * Params for [[LSH]].
+ */
+private[ml] trait LSHParams extends HasInputCol with HasOutputCol {
+  /**
+   * Param for output dimension.
+   *
+   * @group param
+   */
+  final val outputDim: IntParam = new IntParam(this, "outputDim", "output dimension",
+    ParamValidators.gt(0))
+
+  /** @group getParam */
+  final def getOutputDim: Int = $(outputDim)
+
+  setDefault(outputDim -> 1)
+
+  setDefault(outputCol -> "lsh_output")
+
+  /**
+   * Transform the Schema for LSH
+   * @param schema The schema of the input dataset without outputCol
+   * @return A derived schema with outputCol added
+   */
+  final def transformLSHSchema(schema: StructType): StructType = {
+val outputFields = schema.fields :+
+  StructField($(outputCol), new VectorUDT, nullable = false)
+StructType(outputFields)
+  }
+}
+
+/**
+ * Model produced by [[LSH]].
+ */
+abstract class LSHModel[KeyType, T <: LSHModel[KeyType, T]] private[ml]
+  extends Model[T] with LSHParams {
+  override def copy(extra: ParamMap): T = defaultCopy(extra)
+  /**
+   * :: DeveloperApi ::
+   *
+   * The hash function of LSH, mapping a predefined KeyType to a Vector
+   * @return The mapping of LSH function.
+   */
+  protected[this] val hashFunction: KeyType => Vector
+
+  /**
+   * :: DeveloperApi ::
+   *
+   * Calculate the distance between two different keys using the distance metric corresponding
+   * to the hashFunction
+   * @param x One of the point in the metric space
+   * @param y Another the point in the metric space
+   * @return The distance between x and y in double
+   */
+  protected[ml] def keyDistance(x: KeyType, y: KeyType): Double
+
+  /**
+   * :: DeveloperApi ::
+   *
+   * Calculate the distance between two different hash Vectors. By default, the distance is the
+   * minimum distance of two hash values in any dimension.
+   *
+   * @param x One of the hash vector
+   * @param y Another hash vector
+   * @return The distance between hash vectors x and y in double
+   */
+  protected[ml] def hashDistance(x: Vector, y: Vector): Double = {
+// Since it's generated by hashing, it will be a pair of dense vectors.
+    x.toDense.values.zip(y.toDense.values).map(x => math.abs(x._1 - x._2)).min
+  }
+
+  /**
+   * Transforms the input dataset.
+   */
+  override def transform(dataset: Dataset[_]): DataFrame = {
+    transformSchema(dataset.schema, logging = true)
+    val transformUDF = udf(hashFunction, new VectorUDT)
+    dataset.withColumn($(outputCol), transformUDF(dataset($(inputCol))))
+  }
+
+  /**
+   * :: DeveloperApi ::
+   *
+   * Check transform validity and derive the output schema from the input schema.
+   *
+   * Typical implementation should first conduct verification on schema change and parameter
+   * validity, including complex parameter interaction checks.
+   */
+  override def transformSchema(schema: StructType): StructType = {
+transformLSHSchema(schema)
+  }
+
+  /**
+   * Given a 

[GitHub] spark issue #15238: [SPARK-17653][SQL] Remove unnecessary distincts in multi...

2016-09-25 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/15238
  
cc @rxin 





[GitHub] spark pull request #15233: [SPARK-17659] [SQL] Partitioned View is Not Suppo...

2016-09-25 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/15233#discussion_r8044
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala 
---
@@ -376,6 +376,10 @@ private[hive] class HiveClientImpl(
 unsupportedFeatures += "bucketing"
   }
 
+  if (h.getTableType == HiveTableType.VIRTUAL_VIEW && partCols.nonEmpty) {
+    unsupportedFeatures += "partitioned view"
--- End diff --

After digging deeper and deeper, I really doubt that the initial 
motivation for partitioned views makes sense... 

First, see the Hive design link: 
https://cwiki.apache.org/confluence/display/Hive/ViewDev
> Update 30-Dec-2009: Prasad pointed out that even without supporting 
materialized views, it may be necessary to provide users with metadata about 
data dependencies between views and underlying table partitions so that users 
can avoid seeing inconsistent results during the window when not all partitions 
have been refreshed with the latest data. One option is to attempt to derive 
this information automatically (using an overconservative guess in cases where 
the dependency analysis can't be made smart enough); another is to allow view 
creators to declare the dependency rules in some fashion as part of the view 
definition. Based on a design review meeting, we will probably go with the 
automatic analysis approach once dependency tracking is implemented. The 
analysis will be performed on-demand, perhaps as part of describing the view or 
submitting a query job against it. Until this becomes available, users may be 
able to do their own analysis either via empirical lineage tools or via 
 view->table dependency tracking metadata once it is implemented. See HIVE-1079.
> Update 1-Feb-2011: For the latest on this, see PartitionedViews.

Basically, this feature just affects the metadata of views. It does not 
affect the query execution.

To add the partition info into the views, users have to manually issue the 
SQL:
```SQL
ALTER VIEW view_name ADD [IF NOT EXISTS] partition_spec partition_spec ...
ALTER VIEW view_name DROP [IF EXISTS] partition_spec, partition_spec, ...
```

I read the code changes and test cases in the Hive JIRA: 
https://issues.apache.org/jira/browse/HIVE-1079. I think we do not need to 
worry about this Hive-specific feature; the usage scenario is very limited. 
Maybe the code changes in the existing PR are enough.

If you think we should support it, we might also need code changes 
in `SHOW PARTITIONS` and [`DESC table 
PARTITIONS`](https://github.com/apache/spark/pull/15168). Then, we would need to 
change the [`fromHivePartition` 
function](https://github.com/apache/spark/blob/6b156e2fcf9c0c1ed0770a7ad9c54fa374760e17/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L834-L842),
because `getSD` will be `NULL` for partitioned views and we will get a 
`NullPointerException`.
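
As a side note, a minimal sketch of the kind of guard that would be needed there (hypothetical names and types, not the actual `HiveClientImpl` code):

```scala
// Hypothetical guard (not HiveClientImpl code): a partitioned view has no
// storage descriptor, so the value read by fromHivePartition can be null.
// Wrapping it in Option fails fast instead of throwing a NullPointerException.
case class StorageDescriptor(location: String)  // stand-in for the Hive type

def partitionLocation(sd: StorageDescriptor): String =
  Option(sd) match {
    case Some(s) => s.location
    case None =>
      throw new UnsupportedOperationException(
        "partitioned views have no storage descriptor")
  }
```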







[GitHub] spark issue #15238: [SPARK-17653][SQL] Remove unnecessary distincts in multi...

2016-09-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15238
  
Merged build finished. Test PASSed.





[GitHub] spark issue #15238: [SPARK-17653][SQL] Remove unnecessary distincts in multi...

2016-09-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15238
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65892/
Test PASSed.





[GitHub] spark issue #15107: [SPARK-17551][SQL] complete the NULL ordering support in...

2016-09-25 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/15107
  
@xwu0226 Can you please close this? @hvanhovell already added you 
as a contributor in another PR, which has been merged. Thanks!





[GitHub] spark issue #15238: [SPARK-17653][SQL] Remove unnecessary distincts in multi...

2016-09-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15238
  
**[Test build #65892 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65892/consoleFull)**
 for PR 15238 at commit 
[`c770a9a`](https://github.com/apache/spark/commit/c770a9a9948c301a831daa555360702c73542aa2).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15216: [SPARK-17577][Follow-up][SparkR] SparkR spark.addFile su...

2016-09-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15216
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65895/
Test PASSed.





[GitHub] spark issue #15216: [SPARK-17577][Follow-up][SparkR] SparkR spark.addFile su...

2016-09-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15216
  
Merged build finished. Test PASSed.





[GitHub] spark issue #15216: [SPARK-17577][Follow-up][SparkR] SparkR spark.addFile su...

2016-09-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15216
  
**[Test build #65895 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65895/consoleFull)**
 for PR 15216 at commit 
[`2b6e2af`](https://github.com/apache/spark/commit/2b6e2af9457e1c99c64e7c15b656e433a85e5f17).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14897: [SPARK-17338][SQL] add global temp view

2016-09-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14897
  
**[Test build #65896 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65896/consoleFull)**
 for PR 14897 at commit 
[`67e459a`](https://github.com/apache/spark/commit/67e459a48c82ef2b13ffedbc23f3921db0721204).





[GitHub] spark issue #12601: [SPARK-14525][SQL] Make DataFrameWrite.save work for jdb...

2016-09-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/12601
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65891/
Test PASSed.





[GitHub] spark issue #12601: [SPARK-14525][SQL] Make DataFrameWrite.save work for jdb...

2016-09-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/12601
  
Merged build finished. Test PASSed.





[GitHub] spark issue #12601: [SPARK-14525][SQL] Make DataFrameWrite.save work for jdb...

2016-09-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/12601
  
**[Test build #65891 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65891/consoleFull)**
 for PR 12601 at commit 
[`724bbe2`](https://github.com/apache/spark/commit/724bbe22b23050f3bdbf6d1bf14d4dabc52113b2).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15168: [SPARK-17612][SQL] Support `DESCRIBE table PARTITION` SQ...

2016-09-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15168
  
Merged build finished. Test PASSed.





[GitHub] spark issue #15168: [SPARK-17612][SQL] Support `DESCRIBE table PARTITION` SQ...

2016-09-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15168
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65889/
Test PASSed.





[GitHub] spark issue #15168: [SPARK-17612][SQL] Support `DESCRIBE table PARTITION` SQ...

2016-09-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15168
  
**[Test build #65889 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65889/consoleFull)**
 for PR 15168 at commit 
[`c78ea7c`](https://github.com/apache/spark/commit/c78ea7c5bd82652195c8987d21843227b905241d).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15216: [SPARK-17577][Follow-up][SparkR] SparkR spark.addFile su...

2016-09-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15216
  
**[Test build #65895 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65895/consoleFull)**
 for PR 15216 at commit 
[`2b6e2af`](https://github.com/apache/spark/commit/2b6e2af9457e1c99c64e7c15b656e433a85e5f17).





[GitHub] spark issue #15231: [SPARK-17658][SPARKR] read.df/write.df API taking path o...

2016-09-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15231
  
Merged build finished. Test PASSed.





[GitHub] spark issue #15231: [SPARK-17658][SPARKR] read.df/write.df API taking path o...

2016-09-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15231
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65894/
Test PASSed.





[GitHub] spark issue #15231: [SPARK-17658][SPARKR] read.df/write.df API taking path o...

2016-09-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15231
  
**[Test build #65894 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65894/consoleFull)**
 for PR 15231 at commit 
[`07dca5d`](https://github.com/apache/spark/commit/07dca5d3d76b9e58dd1b2cdef8036c937a01f51b).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14912: [SPARK-17357][SQL] Fix current predicate pushdown

2016-09-25 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/14912
  
ping @cloud-fan @hvanhovell @srinathshankar again, please take a look if you 
have time. Thanks!





[GitHub] spark pull request #14897: [SPARK-17338][SQL] add global temp view

2016-09-25 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14897#discussion_r80407704
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/views.scala ---
@@ -60,20 +90,21 @@ case class CreateViewCommand(
 child: LogicalPlan,
 allowExisting: Boolean,
 replace: Boolean,
-isTemporary: Boolean)
+viewType: ViewType)
   extends RunnableCommand {
 
   override protected def innerChildren: Seq[QueryPlan[_]] = Seq(child)
 
-  if (!isTemporary) {
-require(originalText.isDefined,
-  "The table to created with CREATE VIEW must have 'originalText'.")
+  if (viewType == PermanentView) {
+    require(originalText.isDefined, "'originalText' must be provided to create permanent view")
   }
 
   if (allowExisting && replace) {
     throw new AnalysisException("CREATE VIEW with both IF NOT EXISTS and REPLACE is not allowed.")
   }
 
+  private def isTemporary = viewType == LocalTempView || viewType == GlobalTempView
+
--- End diff --

it's only used here, maybe it's OK?





[GitHub] spark issue #15231: [SPARK-17658][SPARKR] read.df/write.df API taking path o...

2016-09-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15231
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65893/
Test PASSed.





[GitHub] spark issue #15231: [SPARK-17658][SPARKR] read.df/write.df API taking path o...

2016-09-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15231
  
Merged build finished. Test PASSed.





[GitHub] spark issue #15231: [SPARK-17658][SPARKR] read.df/write.df API taking path o...

2016-09-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15231
  
**[Test build #65893 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65893/consoleFull)**
 for PR 15231 at commit 
[`1440195`](https://github.com/apache/spark/commit/1440195f49b346bde24843bf5c97360dc0488daf).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15231: [SPARK-17658][SPARKR] read.df/write.df API taking path o...

2016-09-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15231
  
**[Test build #65894 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65894/consoleFull)**
 for PR 15231 at commit 
[`07dca5d`](https://github.com/apache/spark/commit/07dca5d3d76b9e58dd1b2cdef8036c937a01f51b).





[GitHub] spark pull request #15189: [SPARK-17549][sql] Coalesce cached relation stats...

2016-09-25 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/15189#discussion_r80406877
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/columnar/InMemoryColumnarQuerySuite.scala
 ---
@@ -232,4 +232,29 @@ class InMemoryColumnarQuerySuite extends QueryTest with SharedSQLContext {
 val columnTypes2 = List.fill(length2)(IntegerType)
 val columnarIterator2 = GenerateColumnAccessor.generate(columnTypes2)
   }
+
+  test("SPARK-17549: cached table size should be correctly calculated") {
+    val data = spark.sparkContext.parallelize(1 to 10, 5).map { i => (i, i.toLong) }
+      .toDF("col1", "col2")
+    val plan = spark.sessionState.executePlan(data.logicalPlan).sparkPlan
+    val cached = InMemoryRelation(true, 5, MEMORY_ONLY, plan, None)
+
+    // Materialize the data.
+    val expectedAnswer = data.collect()
+    checkAnswer(cached, expectedAnswer)
+
+    // Check that the right size was calculated.
+    val expectedColSizes = expectedAnswer.size * (INT.defaultSize + LONG.defaultSize)
+    assert(cached.statistics.sizeInBytes === expectedColSizes)
+
+    // Create a projection of the cached data and make sure the statistics are correct.
+    val projected = cached.withOutput(Seq(plan.output.last))
+    assert(projected.statistics.sizeInBytes === expectedAnswer.size * LONG.defaultSize)
--- End diff --

I am not sure if I understand the last two parts. After we cache the 
dataset, I am not sure if we can change the number of output columns (this 
test) or the data types (the next one).

If we do a project on the cached dataset, we will see a project operator on 
top of the InMemoryRelation. 

I am wondering what kinds of queries can cause this problem?
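
For reference, a quick sanity check of the sizes the test expects (assuming the columnar defaults of 4 bytes for `INT` and 8 bytes for `LONG`):

```scala
// Hypothetical arithmetic matching the test above, under the assumption that
// INT.defaultSize == 4 and LONG.defaultSize == 8 in the columnar ColumnType.
val rows = 10                          // parallelize(1 to 10, 5)
val expectedColSizes = rows * (4 + 8)  // (Int, Long) rows -> 120 bytes
val projectedSize = rows * 8           // Long-only projection -> 80 bytes
```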





[GitHub] spark pull request #15189: [SPARK-17549][sql] Coalesce cached relation stats...

2016-09-25 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/15189#discussion_r80406261
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala
 ---
@@ -44,6 +44,70 @@ object InMemoryRelation {
 new InMemoryRelation(child.output, useCompression, batchSize, storageLevel, child, tableName)()
 }
 
+/**
+ * Accumulator for storing column stats. Summarizes the data in the driver to curb the amount of
+ * memory being used. Only "sizeInBytes" for each column is kept.
+ */
+class ColStatsAccumulator(originalOutput: Seq[Attribute])
--- End diff --

Should we make the class name explicitly say that it is for sizeInBytes?





[GitHub] spark issue #15231: [SPARK-17658][SPARKR] read.df/write.df API taking path o...

2016-09-25 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/15231
  
(I just updated the PR description too)





[GitHub] spark issue #15231: [SPARK-17658][SPARKR] read.df/write.df API taking path o...

2016-09-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15231
  
**[Test build #65893 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65893/consoleFull)**
 for PR 15231 at commit 
[`1440195`](https://github.com/apache/spark/commit/1440195f49b346bde24843bf5c97360dc0488daf).





[GitHub] spark issue #15090: [SPARK-17073] [SQL] generate column-level statistics

2016-09-25 Thread wzhfy
Github user wzhfy commented on the issue:

https://github.com/apache/spark/pull/15090
  
To help us choose a better design, we need to first clarify the usage of 
column stats.
A simple example may look like this (e.g. predicate: col < 5):
```scala
  filter.condition match {
case LessThan(ar: AttributeReference, Literal(value, _)) =>
  if (filter.statistics.colStats.contains(ar.name)) {
val colStat = filter.statistics.colStats(ar.name)
val estimatedRowCount = ar.dataType match {
  case _: IntegralType =>
val longColStat = colStat.forNumeric[Long]
val longValue = value.toString.toLong
if (longColStat.max < longValue) {
  // all records satisfy the filter condition
  filter.child.statistics.rowCount  
} else if (longColStat.min >= longValue) {
  // none of the records satisfy the filter condition
  0
} else {
  // do detailed estimation (using histogram)
  ...
}
  case FloatType | DoubleType =>
...
  case DecimalType() =>
...
}
  }
  }
```





[GitHub] spark pull request #15053: [Doc] improve python API docstrings

2016-09-25 Thread mortada
Github user mortada commented on a diff in the pull request:

https://github.com/apache/spark/pull/15053#discussion_r80405104
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -411,7 +415,7 @@ def monotonically_increasing_id():
 
 The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
 The current implementation puts the partition ID in the upper 31 bits, and the record number
-within each partition in the lower 33 bits. The assumption is that the data frame has
+within each partition in the lower 33 bits. The assumption is that the DataFrame has
--- End diff --

@HyukjinKwon great idea, will update 





[GitHub] spark issue #15238: [SPARK-17653][SQL] Remove unnecessary distincts in multi...

2016-09-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15238
  
**[Test build #65892 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65892/consoleFull)**
 for PR 15238 at commit 
[`c770a9a`](https://github.com/apache/spark/commit/c770a9a9948c301a831daa555360702c73542aa2).





[GitHub] spark issue #15231: [SPARK-17658][SPARKR] read.df/write.df API taking path o...

2016-09-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15231
  
**[Test build #65890 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65890/consoleFull)**
 for PR 15231 at commit 
[`5c3d222`](https://github.com/apache/spark/commit/5c3d2228aeddf3a43888c170e00729ecbfe6e4bd).
 * This patch **fails R style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15231: [SPARK-17658][SPARKR] read.df/write.df API taking path o...

2016-09-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15231
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65890/
Test FAILed.





[GitHub] spark issue #15231: [SPARK-17658][SPARKR] read.df/write.df API taking path o...

2016-09-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15231
  
Merged build finished. Test FAILed.





[GitHub] spark pull request #15238: [SPARK-17653][SQL] Remove unnecessary distincts i...

2016-09-25 Thread viirya
GitHub user viirya opened a pull request:

https://github.com/apache/spark/pull/15238

[SPARK-17653][SQL] Remove unnecessary distincts in multiple unions

## What changes were proposed in this pull request?

Currently, for `Union [Distinct]`, a `Distinct` operator has to sit on top of 
the `Union`. When there are adjacent `Union [Distinct]` operations, there will 
be multiple `Distinct` operators in the query plan.

For example, for a query like `select 1 a union select 2 b union select 3 c`:

Before this patch, its physical plan looks like:

*HashAggregate(keys=[a#13], functions=[])
+- Exchange hashpartitioning(a#13, 200)
   +- *HashAggregate(keys=[a#13], functions=[])
      +- Union
         :- *HashAggregate(keys=[a#13], functions=[])
         :  +- Exchange hashpartitioning(a#13, 200)
         :     +- *HashAggregate(keys=[a#13], functions=[])
         :        +- Union
         :           :- *Project [1 AS a#13]
         :           :  +- Scan OneRowRelation[]
         :           +- *Project [2 AS b#14]
         :              +- Scan OneRowRelation[]
         +- *Project [3 AS c#15]
            +- Scan OneRowRelation[]

Only the top distinct should be necessary.

After this patch, the physical plan looks like:

*HashAggregate(keys=[a#221], functions=[], output=[a#221])
+- Exchange hashpartitioning(a#221, 5)
   +- *HashAggregate(keys=[a#221], functions=[], output=[a#221])
      +- Union
         :- *Project [1 AS a#221]
         :  +- Scan OneRowRelation[]
         :- *Project [2 AS b#222]
         :  +- Scan OneRowRelation[]
         +- *Project [3 AS c#223]
            +- Scan OneRowRelation[]
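
The idea can be sketched as a small Catalyst-style rewrite (a minimal illustration of the intent, not necessarily the exact rule added in this PR):

```scala
import org.apache.spark.sql.catalyst.plans.logical.{Distinct, LogicalPlan, Union}
import org.apache.spark.sql.catalyst.rules.Rule

// Minimal sketch of the intent (not necessarily this PR's implementation):
// one Distinct on top of a Union already removes all duplicates, so any
// Distinct(Union(...)) sitting directly under it is redundant and can be inlined.
object CollapseDistinctUnions extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case Distinct(u: Union) =>
      val flattenedChildren = u.children.flatMap {
        case Distinct(nested: Union) => nested.children
        case other => Seq(other)
      }
      Distinct(Union(flattenedChildren))
  }
}
```

A single pass only collapses one level of nesting, but optimizer batches run to a fixed point, so longer chains of unions also end up with just the one top-level `Distinct`.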

## How was this patch tested?

Jenkins tests.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/viirya/spark-1 remove-extra-distinct-union

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15238.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15238


commit c770a9a9948c301a831daa555360702c73542aa2
Author: Liang-Chi Hsieh 
Date:   2016-09-26T03:37:46Z

Remove unnecessary distincts in multiple unions.







[GitHub] spark issue #12601: [SPARK-14525][SQL] Make DataFrameWrite.save work for jdb...

2016-09-25 Thread JustinPihony
Github user JustinPihony commented on the issue:

https://github.com/apache/spark/pull/12601
  
@srowen The doc changes have been reviewed, so this should be good to go





[GitHub] spark issue #15231: [SPARK-17658][SPARKR] read.df/write.df API taking path o...

2016-09-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15231
  
**[Test build #65890 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65890/consoleFull)**
 for PR 15231 at commit 
[`5c3d222`](https://github.com/apache/spark/commit/5c3d2228aeddf3a43888c170e00729ecbfe6e4bd).





[GitHub] spark issue #12601: [SPARK-14525][SQL] Make DataFrameWrite.save work for jdb...

2016-09-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/12601
  
**[Test build #65891 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65891/consoleFull)**
 for PR 12601 at commit 
[`724bbe2`](https://github.com/apache/spark/commit/724bbe22b23050f3bdbf6d1bf14d4dabc52113b2).





[GitHub] spark pull request #12601: [SPARK-14525][SQL] Make DataFrameWrite.save work ...

2016-09-25 Thread JustinPihony
Github user JustinPihony commented on a diff in the pull request:

https://github.com/apache/spark/pull/12601#discussion_r80404639
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCWriteSuite.scala ---
@@ -208,4 +210,84 @@ class JDBCWriteSuite extends SharedSQLContext with BeforeAndAfter {
 assert(2 === spark.read.jdbc(url1, "TEST.PEOPLE1", properties).count())
 assert(2 === spark.read.jdbc(url1, "TEST.PEOPLE1", properties).collect()(0).length)
   }
+
+  test("save works for format(\"jdbc\") if url and dbtable are set") {
+    val df = sqlContext.createDataFrame(sparkContext.parallelize(arr2x2), schema2)
+
+    df.write.format("jdbc")
+      .options(Map("url" -> url, "dbtable" -> "TEST.SAVETEST"))
+      .save
--- End diff --

Done





[GitHub] spark pull request #12601: [SPARK-14525][SQL] Make DataFrameWrite.save work ...

2016-09-25 Thread JustinPihony
Github user JustinPihony commented on a diff in the pull request:

https://github.com/apache/spark/pull/12601#discussion_r80404577
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1096,13 +1096,17 @@ the Data Sources API. The following options are supported:
 
 {% highlight sql %}
 
-CREATE TEMPORARY VIEW jdbcTable
+CREATE TEMPORARY TABLE jdbcTable
--- End diff --

Done, thanks. I had been going off of the tests





[GitHub] spark pull request #14897: [SPARK-17338][SQL] add global temp view

2016-09-25 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14897#discussion_r80403654
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
 ---
@@ -393,21 +459,25 @@ class SessionCatalog(
*/
   def renameTable(oldName: TableIdentifier, newName: String): Unit = synchronized {
 val db = formatDatabaseName(oldName.database.getOrElse(currentDb))
-requireDbExists(db)
 val oldTableName = formatTableName(oldName.table)
 val newTableName = formatTableName(newName)
-if (oldName.database.isDefined || !tempTables.contains(oldTableName)) {
-  requireTableExists(TableIdentifier(oldTableName, Some(db)))
-  requireTableNotExists(TableIdentifier(newTableName, Some(db)))
-  externalCatalog.renameTable(db, oldTableName, newTableName)
+if (db == globalTempDB) {
+  globalTempViews.rename(oldTableName, newTableName)
--- End diff --

we do support it, see 
https://github.com/apache/spark/pull/14897/files#diff-b3f9800839b9b9a1df9da9cbfc01adf8L410





[GitHub] spark pull request #14897: [SPARK-17338][SQL] add global temp view

2016-09-25 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14897#discussion_r80403612
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
 ---
@@ -371,16 +431,24 @@ class SessionCatalog(
*/
   def getTempViewOrPermanentTableMetadata(name: TableIdentifier): CatalogTable = synchronized {
--- End diff --

yea, I'd like to rename them to `getPersistedTableMetadataOption`, but 
probably not in this PR





[GitHub] spark pull request #14897: [SPARK-17338][SQL] add global temp view

2016-09-25 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14897#discussion_r80403509
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
 ---
@@ -142,8 +149,12 @@ class SessionCatalog(
   // 

 
   def createDatabase(dbDefinition: CatalogDatabase, ignoreIfExists: Boolean): Unit = {
-    val qualifiedPath = makeQualifiedPath(dbDefinition.locationUri).toString
 val dbName = formatDatabaseName(dbDefinition.name)
+if (dbName == globalTempDB) {
--- End diff --

See 
https://github.com/apache/spark/pull/14897/files#diff-42e78d37f5dcb2a1576f83b53bbf4b55R40

`globalTempDB` is always lower-cased, so here we respect the case-sensitivity 
config, e.g. `gLobAl_TEmp.viewName` can also refer to a global temp view in a 
case-insensitive context.
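
For illustration, a minimal sketch of that lookup behavior, assuming a helper that 
formats identifiers according to the case-sensitivity setting (names and values are 
illustrative, not the PR's exact code):

```scala
// The global temp database name is stored lower-cased; user input is formatted
// w.r.t. the case-sensitivity config before comparison.
def formatDatabaseName(name: String, caseSensitive: Boolean): String =
  if (caseSensitive) name else name.toLowerCase

val globalTempDB = "global_temp"   // always lower-cased
val userInput = "gLobAl_TEmp"

formatDatabaseName(userInput, caseSensitive = false) == globalTempDB  // true
formatDatabaseName(userInput, caseSensitive = true)  == globalTempDB  // false
```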





[GitHub] spark pull request #14897: [SPARK-17338][SQL] add global temp view

2016-09-25 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14897#discussion_r80403416
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala ---
@@ -37,6 +37,20 @@ import org.apache.spark.util.{MutableURLClassLoader, 
Utils}
  */
 private[sql] class SharedState(val sparkContext: SparkContext) extends 
Logging {
 
+  // System preserved database should not exists in metastore. However 
it's hard to guarantee it
+  // for every session, because case-sensitivity differs. Here we always 
lowercase it to make our
+  // life easier.
+  val globalTempDB = 
sparkContext.conf.get(GLOBAL_TEMP_DATABASE).toLowerCase
--- End diff --

yea, we do: 
https://github.com/apache/spark/pull/14897/files#diff-42e78d37f5dcb2a1576f83b53bbf4b55R46
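
For readers following along, a hedged sketch of the kind of check being referenced, 
given the SharedState's `sparkContext` and an `externalCatalog` (the config key, 
default value and message are assumptions for illustration, not quoted from the PR):

```scala
// Fail fast if the metastore already contains a database whose name collides with
// the (lower-cased) global temp database.
val globalTempDB = sparkContext.conf.get("spark.sql.globalTempDatabase", "global_temp").toLowerCase
if (externalCatalog.databaseExists(globalTempDB)) {
  throw new org.apache.spark.SparkException(
    s"$globalTempDB is a system preserved database, please rename your existing database " +
      "to resolve the name conflict.")
}
```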





[GitHub] spark pull request #14897: [SPARK-17338][SQL] add global temp view

2016-09-25 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14897#discussion_r80403386
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/views.scala ---
@@ -222,8 +265,8 @@ case class AlterViewAsCommand(
 qe.assertAnalyzed()
 val analyzedPlan = qe.analyzed
 
-if (session.sessionState.catalog.isTemporaryTable(name)) {
-  session.sessionState.catalog.createTempView(name.table, 
analyzedPlan, overrideIfExists = true)
+if (session.sessionState.catalog.alterTempViewDefinition(name, 
analyzedPlan)) {
+  // a local/global temp view has been altered, we are done.
--- End diff --

The previous one is not atomic, and here we would need a 
`createLocalOrGlobalTempView` if we followed the existing style, so I went ahead 
and made it an atomic operation: `alterTempViewDefinition`.
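
A rough before/after sketch of that difference, using the method names discussed 
above (the surrounding code is illustrative only):

```scala
// Before: check-then-create across two catalog calls, so not atomic.
if (session.sessionState.catalog.isTemporaryTable(name)) {
  session.sessionState.catalog.createTempView(name.table, analyzedPlan, overrideIfExists = true)
}

// After: a single call that updates the definition only if the local/global temp
// view still exists, and returns whether it did.
if (session.sessionState.catalog.alterTempViewDefinition(name, analyzedPlan)) {
  // a local/global temp view has been altered, we are done
}
```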





[GitHub] spark issue #15168: [SPARK-17612][SQL] Support `DESCRIBE table PARTITION` SQ...

2016-09-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15168
  
**[Test build #65889 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65889/consoleFull)**
 for PR 15168 at commit 
[`c78ea7c`](https://github.com/apache/spark/commit/c78ea7c5bd82652195c8987d21843227b905241d).





[GitHub] spark issue #15168: [SPARK-17612][SQL] Support `DESCRIBE table PARTITION` SQ...

2016-09-25 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/15168
  
Definitely! Thank you, @gatorsmile. I added the logic to cover that.





[GitHub] spark pull request #14897: [SPARK-17338][SQL] add global temp view

2016-09-25 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14897#discussion_r80402604
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/GlobalTempViewManager.scala
 ---
@@ -0,0 +1,96 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.catalog
+
+import javax.annotation.concurrent.GuardedBy
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.AnalysisException
+import 
org.apache.spark.sql.catalyst.analysis.TempTableAlreadyExistsException
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.catalyst.util.StringUtils
+
+
+/**
+ * A thread-safe manager for global temporary views, providing atomic 
operations to manage them,
+ * e.g. create, update, remove, etc.
+ *
+ * Note that, the view name is always case-sensitive here, callers are 
responsible to format the
+ * view name w.r.t. case-sensitive config.
+ */
+class GlobalTempViewManager {
+
+  /** List of view definitions, mapping from view name to logical plan. */
+  @GuardedBy("this")
+  private val viewDefinitions = new mutable.HashMap[String, LogicalPlan]
+
+  def get(name: String): Option[LogicalPlan] = synchronized {
+viewDefinitions.get(name)
+  }
+
+  def create(
+  name: String,
+  viewDefinition: LogicalPlan,
+  overrideIfExists: Boolean): Unit = synchronized {
+if (!overrideIfExists && viewDefinitions.contains(name)) {
+  throw new TempTableAlreadyExistsException(name)
+}
+viewDefinitions.put(name, viewDefinition)
+  }
+
+  def update(
+  name: String,
+  viewDefinition: LogicalPlan): Boolean = synchronized {
+// Only update it when the view with the given name exits.
+if (viewDefinitions.contains(name)) {
+  viewDefinitions.put(name, viewDefinition)
+  true
+} else {
+  false
+}
+  }
--- End diff --

`create` is used by CREATE VIEW and `update` by ALTER VIEW.

We could have a single API for both, but we would need to introduce a write mode: 
errorIfExists, overrideIfExists, updateOnlyIfExists.
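
A minimal, hypothetical sketch of what such a unified API could look like (the 
write-mode enum, method name and exception are illustrative, not part of this PR):

```scala
import scala.collection.mutable

// Hypothetical write modes replacing separate create/update methods.
object ViewWriteMode extends Enumeration {
  val ErrorIfExists, OverrideIfExists, UpdateOnlyIfExists = Value
}

class GlobalTempViewManagerSketch {
  // AnyRef stands in for LogicalPlan to keep the sketch self-contained.
  private val viewDefinitions = new mutable.HashMap[String, AnyRef]

  def put(name: String, viewDefinition: AnyRef, mode: ViewWriteMode.Value): Boolean = synchronized {
    val exists = viewDefinitions.contains(name)
    mode match {
      case ViewWriteMode.ErrorIfExists if exists =>
        throw new IllegalStateException(s"Temporary view '$name' already exists")
      case ViewWriteMode.UpdateOnlyIfExists if !exists =>
        false
      case _ =>
        viewDefinitions.put(name, viewDefinition)
        true
    }
  }
}
```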






[GitHub] spark pull request #15195: [SPARK-17632][SQL]make console sink and other sin...

2016-09-25 Thread chuanlei
Github user chuanlei commented on a diff in the pull request:

https://github.com/apache/spark/pull/15195#discussion_r80401708
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamWriter.scala 
---
@@ -290,8 +284,8 @@ final class DataStreamWriter[T] private[sql](ds: 
Dataset[T]) {
 df,
 dataSource.createSink(outputMode),
 outputMode,
-useTempCheckpointLocation = useTempCheckpointLocation,
-recoverFromCheckpointLocation = recoverFromCheckpointLocation,
+useTempCheckpointLocation = true,
--- End diff --

Actually, I am implementing a Kafka source for Kafka 0.8. When I ran the tests 
for my implementation, I found that the `ConsoleSink` cannot **continue** to 
process data.

So I think that if we provide a `checkPointLocation` for the **sinks**, they can 
continue to process data; if we do not provide this option, it is OK to start 
from the beginning. And not only the **other sinks**: I think all the sinks should 
follow the pattern above.
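
For reference, this is how an explicit checkpoint location is supplied today so a 
query can resume across restarts (assuming `df` is a streaming Dataset; the path is 
illustrative):

```scala
// With an explicit checkpoint location the query records its progress and can
// continue from its previous offsets after a restart, regardless of the sink.
val query = df.writeStream
  .format("console")
  .outputMode("append")
  .option("checkpointLocation", "/tmp/console-sink-checkpoint")
  .start()
```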





[GitHub] spark issue #15168: [SPARK-17612][SQL] Support `DESCRIBE table PARTITION` SQ...

2016-09-25 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/15168
  
This PR needs to cover more negative cases. Below is an example:
```Scala
spark.range(10).select('id as 'a, 'id as 'b).createTempView("view1")
sql("DESC view1 PARTITION (c='Us', d=1)").show()
```

We should issue an exception, right?
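
A hedged sketch of the kind of negative-case test this suggests, following the 
`intercept` pattern used in the SQL test suites (the exact exception type and 
message are assumptions, and `testImplicits` is assumed to be imported):

```scala
// Expect an analysis error when DESC ... PARTITION targets a temp view, which has
// no partitions.
spark.range(10).select('id as 'a, 'id as 'b).createTempView("view1")
val e = intercept[AnalysisException] {
  sql("DESC view1 PARTITION (c='Us', d=1)")
}
assert(e.getMessage.contains("PARTITION"))
```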





[GitHub] spark pull request #15172: [SPARK-13331] AES support for over-the-wire encry...

2016-09-25 Thread cjjnjust
Github user cjjnjust commented on a diff in the pull request:

https://github.com/apache/spark/pull/15172#discussion_r80399872
  
--- Diff: 
common/network-common/src/main/java/org/apache/spark/network/sasl/aes/SparkAesCipher.java
 ---
@@ -0,0 +1,270 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.network.sasl.aes;
+
+import java.io.IOException;
+import java.security.InvalidAlgorithmParameterException;
+import java.security.InvalidKeyException;
+import java.security.NoSuchAlgorithmException;
+import java.util.Arrays;
+import java.util.Properties;
+import javax.crypto.Cipher;
+import javax.crypto.Mac;
+import javax.crypto.SecretKey;
+import javax.crypto.ShortBufferException;
+import javax.crypto.spec.SecretKeySpec;
+import javax.crypto.spec.IvParameterSpec;
+import javax.security.sasl.SaslException;
+
+import org.apache.commons.crypto.cipher.CryptoCipher;
+import org.apache.commons.crypto.utils.Utils;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * AES cipher for encryption and decryption.
+ */
+public class SparkAesCipher {
+  private static final Logger logger = 
LoggerFactory.getLogger(SparkAesCipher.class);
+  private static final byte[] EMPTY_BYTE_ARRAY = new byte[0];
+  public final static String supportedTransformation[] = {
+"AES/CBC/NoPadding", "AES/CTR/NoPadding"
+  };
+
+  private final CryptoCipher encryptor;
+  private final CryptoCipher decryptor;
+
+  private final Integrity integrity;
+
+  public SparkAesCipher(
+  String cipherTransformation,
+  Properties properties,
+  byte[] inKey,
+  byte[] outKey,
+  byte[] inIv,
+  byte[] outIv) throws IOException {
+if 
(!Arrays.asList(supportedTransformation).contains(cipherTransformation)) {
+  logger.warn("AES cipher transformation is not supported: " + 
cipherTransformation);
+  cipherTransformation = "AES/CTR/NoPadding";
+  logger.warn("Use default AES/CTR/NoPadding");
+}
+
+final SecretKeySpec inKeySpec = new SecretKeySpec(inKey, "AES");
+final IvParameterSpec inIvSpec = new IvParameterSpec(inIv);
+final SecretKeySpec outKeySpec = new SecretKeySpec(outKey, "AES");
+final IvParameterSpec outIvSpec = new IvParameterSpec(outIv);
+
+// Encryptor
+encryptor = Utils.getCipherInstance(cipherTransformation, properties);
+try {
+  logger.debug("Initialize encryptor");
+  encryptor.init(Cipher.ENCRYPT_MODE, outKeySpec, outIvSpec);
+} catch (InvalidKeyException | InvalidAlgorithmParameterException e) {
+  throw new IOException("Failed to initialize encryptor", e);
+}
+
+// Decryptor
+decryptor = Utils.getCipherInstance(cipherTransformation, properties);
+try {
+  logger.debug("Initialize decryptor");
+  decryptor.init(Cipher.DECRYPT_MODE, inKeySpec, inIvSpec);
+} catch (InvalidKeyException | InvalidAlgorithmParameterException e) {
+  throw new IOException("Failed to initialize decryptor", e);
+}
+
+integrity = new Integrity(outKey, inKey);
+  }
+
+  /**
+   * Encrypts input data. The result composes of (msg, padding if needed, 
mac) and sequence num.
+   * @param data the input byte array
+   * @param offset the offset in input where the input starts
+   * @param len the input length
+   * @return the new encrypted byte array.
+   * @throws SaslException if error happens
+   */
+  public byte[] wrap(byte[] data, int offset, int len) throws 
SaslException {
--- End diff --

I suppose you were talking about `SaslEncryptionBackend`. Actually, the real 
underlying interfaces for wrap/unwrap are defined by `SaslClient`, which is 
provided by Java's SASL security API.
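
For context, a small sketch of those `javax.security.sasl.SaslClient` wrap/unwrap 
calls as seen from Scala (error handling omitted; the helper names are illustrative):

```scala
import javax.security.sasl.SaslClient

// wrap produces the protected (encrypted/integrity-checked) bytes to send;
// unwrap recovers the original bytes on the receiving side.
def protect(client: SaslClient, outgoing: Array[Byte]): Array[Byte] =
  client.wrap(outgoing, 0, outgoing.length)

def recover(client: SaslClient, incoming: Array[Byte]): Array[Byte] =
  client.unwrap(incoming, 0, incoming.length)
```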



[GitHub] spark pull request #15195: [SPARK-17632][SQL]make console sink and other sin...

2016-09-25 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/15195#discussion_r80396970
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamWriter.scala 
---
@@ -290,8 +284,8 @@ final class DataStreamWriter[T] private[sql](ds: 
Dataset[T]) {
 df,
 dataSource.createSink(outputMode),
 outputMode,
-useTempCheckpointLocation = useTempCheckpointLocation,
-recoverFromCheckpointLocation = recoverFromCheckpointLocation,
+useTempCheckpointLocation = true,
--- End diff --

If you want "other sinks" to use temporary checkpoint dir, you could 
achieve this out of box. Modifying here will change the behavior of the current 
code and lose the restriction for user who miss the setting of checkpoint 
location unintentionally.





[GitHub] spark pull request #14897: [SPARK-17338][SQL] add global temp view

2016-09-25 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14897#discussion_r80396050
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala ---
@@ -37,6 +37,20 @@ import org.apache.spark.util.{MutableURLClassLoader, 
Utils}
  */
 private[sql] class SharedState(val sparkContext: SparkContext) extends 
Logging {
 
+  // System preserved database should not exists in metastore. However 
it's hard to guarantee it
+  // for every session, because case-sensitivity differs. Here we always 
lowercase it to make our
+  // life easier.
+  val globalTempDB = 
sparkContext.conf.get(GLOBAL_TEMP_DATABASE).toLowerCase
--- End diff --

BTW, do we check whether there is an existing database with the same name as 
`GLOBAL_TEMP_DATABASE`?





[GitHub] spark pull request #14897: [SPARK-17338][SQL] add global temp view

2016-09-25 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14897#discussion_r80395764
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
 ---
@@ -453,7 +532,11 @@ class SessionCatalog(
   val db = formatDatabaseName(name.database.getOrElse(currentDb))
   val table = formatTableName(name.table)
   val relationAlias = alias.getOrElse(table)
-  if (name.database.isDefined || !tempTables.contains(table)) {
+  if (db == globalTempDB) {
+globalTempViews.get(table).map { viewDef =>
+  SubqueryAlias(relationAlias, viewDef, Some(name))
+}.getOrElse(throw new NoSuchTableException(db, table))
+  } else if (name.database.isDefined || !tempTables.contains(table)) {
--- End diff --

+1





[GitHub] spark pull request #14897: [SPARK-17338][SQL] add global temp view

2016-09-25 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14897#discussion_r80396017
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala ---
@@ -277,7 +275,7 @@ class CatalogImpl(sparkSession: SparkSession) extends 
Catalog {
   }
 
   /**
-   * Drops the temporary view with the given view name in the catalog.
+   * Drops the local temporary view with the given view name in the 
catalog.
--- End diff --

(It is probably good to explain the meaning of "local" in the doc for creating a 
temp view.)
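
For example, the local/global distinction the doc could spell out (a sketch using 
the Dataset APIs touched in this PR; `global_temp` stands in for the configurable 
global temp database name):

```scala
// A local temp view is tied to the SparkSession that created it; a global temp
// view lives in the global temp database and is visible to all sessions.
df.createOrReplaceTempView("local_view")
df.createGlobalTempView("global_view")

spark.newSession().sql("SELECT * FROM global_temp.global_view")   // resolves
// spark.newSession().sql("SELECT * FROM local_view")             // would fail: not found
```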





[GitHub] spark pull request #14897: [SPARK-17338][SQL] add global temp view

2016-09-25 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14897#discussion_r80395565
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
 ---
@@ -142,8 +149,12 @@ class SessionCatalog(
   // 

 
   def createDatabase(dbDefinition: CatalogDatabase, ignoreIfExists: 
Boolean): Unit = {
-val qualifiedPath = 
makeQualifiedPath(dbDefinition.locationUri).toString
 val dbName = formatDatabaseName(dbDefinition.name)
+if (dbName == globalTempDB) {
--- End diff --

yea, I have the same question.





[GitHub] spark pull request #14897: [SPARK-17338][SQL] add global temp view

2016-09-25 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14897#discussion_r80395798
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/views.scala ---
@@ -222,8 +265,8 @@ case class AlterViewAsCommand(
 qe.assertAnalyzed()
 val analyzedPlan = qe.analyzed
 
-if (session.sessionState.catalog.isTemporaryTable(name)) {
-  session.sessionState.catalog.createTempView(name.table, 
analyzedPlan, overrideIfExists = true)
+if (session.sessionState.catalog.alterTempViewDefinition(name, 
analyzedPlan)) {
+  // a local/global temp view has been altered, we are done.
--- End diff --

Is this change needed?





[GitHub] spark pull request #14897: [SPARK-17338][SQL] add global temp view

2016-09-25 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14897#discussion_r80395752
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
 ---
@@ -393,21 +459,25 @@ class SessionCatalog(
*/
   def renameTable(oldName: TableIdentifier, newName: String): Unit = 
synchronized {
 val db = formatDatabaseName(oldName.database.getOrElse(currentDb))
-requireDbExists(db)
 val oldTableName = formatTableName(oldName.table)
 val newTableName = formatTableName(newName)
-if (oldName.database.isDefined || !tempTables.contains(oldTableName)) {
-  requireTableExists(TableIdentifier(oldTableName, Some(db)))
-  requireTableNotExists(TableIdentifier(newTableName, Some(db)))
-  externalCatalog.renameTable(db, oldTableName, newTableName)
+if (db == globalTempDB) {
+  globalTempViews.rename(oldTableName, newTableName)
--- End diff --

Do we support rename for local temp view?





[GitHub] spark pull request #14897: [SPARK-17338][SQL] add global temp view

2016-09-25 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14897#discussion_r80395517
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/GlobalTempViewManager.scala
 ---
@@ -0,0 +1,96 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.catalog
+
+import javax.annotation.concurrent.GuardedBy
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.AnalysisException
+import 
org.apache.spark.sql.catalyst.analysis.TempTableAlreadyExistsException
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.catalyst.util.StringUtils
+
+
+/**
+ * A thread-safe manager for global temporary views, providing atomic 
operations to manage them,
+ * e.g. create, update, remove, etc.
+ *
+ * Note that, the view name is always case-sensitive here, callers are 
responsible to format the
+ * view name w.r.t. case-sensitive config.
+ */
+class GlobalTempViewManager {
--- End diff --

It seems the methods in this class need docs.





[GitHub] spark pull request #14897: [SPARK-17338][SQL] add global temp view

2016-09-25 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14897#discussion_r80395993
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala ---
@@ -277,7 +275,7 @@ class CatalogImpl(sparkSession: SparkSession) extends 
Catalog {
   }
 
   /**
-   * Drops the temporary view with the given view name in the catalog.
+   * Drops the local temporary view with the given view name in the 
catalog.
--- End diff --

It seems we can also explain what "local" means.





[GitHub] spark pull request #14897: [SPARK-17338][SQL] add global temp view

2016-09-25 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14897#discussion_r80395571
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
 ---
@@ -47,6 +50,8 @@ object SessionCatalog {
  */
 class SessionCatalog(
 externalCatalog: ExternalCatalog,
+globalTempDB: String,
+globalTempViews: GlobalTempViewManager,
--- End diff --

Same question





[GitHub] spark pull request #14897: [SPARK-17338][SQL] add global temp view

2016-09-25 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14897#discussion_r80395605
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
 ---
@@ -329,33 +343,77 @@ class SessionCatalog(
   // --
 
   /**
-   * Create a temporary table.
+   * Create a local temporary view.
*/
   def createTempView(
   name: String,
-  tableDefinition: LogicalPlan,
+  viewDefinition: LogicalPlan,
   overrideIfExists: Boolean): Unit = synchronized {
-val table = formatTableName(name)
-if (tempTables.contains(table) && !overrideIfExists) {
+val viewName = formatTableName(name)
+if (tempTables.contains(viewName) && !overrideIfExists) {
   throw new TempTableAlreadyExistsException(name)
--- End diff --

Maybe it is better to avoid having this kind of change in this PR? We can 
submit a follow-up one to make the variable names consistent.





[GitHub] spark pull request #14897: [SPARK-17338][SQL] add global temp view

2016-09-25 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14897#discussion_r80395754
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
 ---
@@ -393,21 +459,25 @@ class SessionCatalog(
*/
   def renameTable(oldName: TableIdentifier, newName: String): Unit = 
synchronized {
 val db = formatDatabaseName(oldName.database.getOrElse(currentDb))
-requireDbExists(db)
 val oldTableName = formatTableName(oldName.table)
 val newTableName = formatTableName(newName)
-if (oldName.database.isDefined || !tempTables.contains(oldTableName)) {
-  requireTableExists(TableIdentifier(oldTableName, Some(db)))
-  requireTableNotExists(TableIdentifier(newTableName, Some(db)))
-  externalCatalog.renameTable(db, oldTableName, newTableName)
+if (db == globalTempDB) {
+  globalTempViews.rename(oldTableName, newTableName)
--- End diff --

If not, it seems we do not need to support rename for global temp views either?





[GitHub] spark pull request #14897: [SPARK-17338][SQL] add global temp view

2016-09-25 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14897#discussion_r80396066
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/GlobalTempViewSuite.scala
 ---
@@ -0,0 +1,107 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution
+
+import org.apache.spark.sql.{AnalysisException, QueryTest, Row}
+import org.apache.spark.sql.catalyst.TableIdentifier
+import org.apache.spark.sql.catalyst.analysis.NoSuchTableException
+import org.apache.spark.sql.test.SharedSQLContext
+import org.apache.spark.sql.types.StructType
+
+class GlobalTempViewSuite extends QueryTest with SharedSQLContext {
+  import testImplicits._
+
+  override protected def beforeAll(): Unit = {
+super.beforeAll()
+globalTempDB = spark.sharedState.globalTempDB
+  }
+
+  private var globalTempDB: String = _
+
+  test("basic semantic") {
+sql("CREATE GLOBAL TEMP VIEW src AS SELECT 1, 'a'")
+
+// If there is no database in table name, we should try local temp 
view first, if not found,
+// try table/view in current database, which is "default" in this 
case. So we expect
+// NoSuchTableException here.
+intercept[NoSuchTableException](spark.table("src"))
+
+// Use qualified name to refer to the global temp view explicitly.
+checkAnswer(spark.table(s"$globalTempDB.src"), Row(1, "a"))
+
+// Table name without database will never refer to a global temp view.
+intercept[NoSuchTableException](sql("DROP VIEW src"))
+
+sql(s"DROP VIEW $globalTempDB.src")
+// The global temp view should be dropped successfully.
+intercept[NoSuchTableException](spark.table(s"$globalTempDB.src"))
+
+// We can also use Dataset API to create global temp view
+Seq(1 -> "a").toDF("i", "j").createGlobalTempView("src")
+checkAnswer(spark.table(s"$globalTempDB.src"), Row(1, "a"))
+
+// Use qualified name to rename a global temp view.
+sql(s"ALTER VIEW $globalTempDB.src RENAME TO src2")
+intercept[NoSuchTableException](spark.table(s"$globalTempDB.src"))
+checkAnswer(spark.table(s"$globalTempDB.src2"), Row(1, "a"))
+
+// Use qualified name to alter a global temp view.
+sql(s"ALTER VIEW $globalTempDB.src2 AS SELECT 2, 'b'")
+checkAnswer(spark.table(s"$globalTempDB.src2"), Row(2, "b"))
+
+// We can also use Catalog API to drop global temp view
+spark.catalog.dropGlobalTempView("src2")
+intercept[NoSuchTableException](spark.table(s"$globalTempDB.src2"))
+  }
+
+  test("global temp view database should be preserved") {
+val e = intercept[AnalysisException](sql(s"CREATE DATABASE 
$globalTempDB"))
+assert(e.message.contains("system preserved database"))
+
+val e2 = intercept[AnalysisException](sql(s"USE $globalTempDB"))
+assert(e2.message.contains("system preserved database"))
+  }
+
+  test("CREATE TABLE LIKE should work for global temp view") {
+try {
+  sql("CREATE GLOBAL TEMP VIEW src AS SELECT 1 AS a, '2' AS b")
+  sql(s"CREATE TABLE cloned LIKE ${globalTempDB}.src")
+  val tableMeta = 
spark.sessionState.catalog.getTableMetadata(TableIdentifier("cloned"))
+  assert(tableMeta.schema == new StructType().add("a", "int", 
false).add("b", "string", false))
+} finally {
+  spark.catalog.dropGlobalTempView("src")
+  sql("DROP TABLE default.cloned")
+}
+  }
+
+  test("list global temp views") {
+try {
+  sql("CREATE GLOBAL TEMP VIEW v1 AS SELECT 3, 4")
+  sql("CREATE TEMP VIEW v2 AS SELECT 1, 2")
+
+  checkAnswer(sql(s"SHOW TABLES IN $globalTempDB"),
+Row(globalTempDB, "v1", true) ::
+Row("", "v2", true) :: Nil)
+
+  
assert(spark.catalog.listTables(globalTempDB).collect().toSeq.map(_.name) == 
Seq("v1", "v2"))
+} finally {
+   

[GitHub] spark pull request #14897: [SPARK-17338][SQL] add global temp view

2016-09-25 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14897#discussion_r80395510
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/GlobalTempViewManager.scala
 ---
@@ -0,0 +1,96 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.catalog
+
+import javax.annotation.concurrent.GuardedBy
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.AnalysisException
+import 
org.apache.spark.sql.catalyst.analysis.TempTableAlreadyExistsException
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.catalyst.util.StringUtils
+
+
+/**
+ * A thread-safe manager for global temporary views, providing atomic 
operations to manage them,
+ * e.g. create, update, remove, etc.
+ *
+ * Note that, the view name is always case-sensitive here, callers are 
responsible to format the
+ * view name w.r.t. case-sensitive config.
+ */
+class GlobalTempViewManager {
+
+  /** List of view definitions, mapping from view name to logical plan. */
+  @GuardedBy("this")
+  private val viewDefinitions = new mutable.HashMap[String, LogicalPlan]
+
+  def get(name: String): Option[LogicalPlan] = synchronized {
+viewDefinitions.get(name)
+  }
+
+  def create(
+  name: String,
+  viewDefinition: LogicalPlan,
+  overrideIfExists: Boolean): Unit = synchronized {
+if (!overrideIfExists && viewDefinitions.contains(name)) {
+  throw new TempTableAlreadyExistsException(name)
+}
+viewDefinitions.put(name, viewDefinition)
+  }
+
+  def update(
+  name: String,
+  viewDefinition: LogicalPlan): Boolean = synchronized {
+// Only update it when the view with the given name exits.
+if (viewDefinitions.contains(name)) {
+  viewDefinitions.put(name, viewDefinition)
+  true
+} else {
+  false
+}
+  }
--- End diff --

When do we use `update` and when do we use `create`?





[GitHub] spark pull request #14897: [SPARK-17338][SQL] add global temp view

2016-09-25 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14897#discussion_r80395895
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/views.scala ---
@@ -197,6 +201,45 @@ case class CreateViewCommand(
   }
 }
 
+
+/**
+ * Create or replace a local/global temporary view with given data source.
+ */
+case class CreateTempViewUsing(
+tableIdent: TableIdentifier,
+userSpecifiedSchema: Option[StructType],
+replace: Boolean,
+global: Boolean,
+provider: String,
+options: Map[String, String]) extends RunnableCommand {
+
+  if (tableIdent.database.isDefined) {
+throw new AnalysisException(
+  s"Temporary view '$tableIdent' should not have specified a database")
+  }
+
+  def run(sparkSession: SparkSession): Seq[Row] = {
+val dataSource = DataSource(
+  sparkSession,
+  userSpecifiedSchema = userSpecifiedSchema,
+  className = provider,
+  options = options)
+
+val catalog = sparkSession.sessionState.catalog
+val viewDefinition = Dataset.ofRows(
+  sparkSession, 
LogicalRelation(dataSource.resolveRelation())).logicalPlan
+
+if (global) {
+  catalog.createGlobalTempView(tableIdent.table, viewDefinition, 
replace)
+} else {
+  catalog.createTempView(tableIdent.table, viewDefinition, replace)
+}
+
+Seq.empty[Row]
+  }
+}
--- End diff --

Is this command moved from somewhere? If so, is it possible to move it in a 
follow-up PR?





[GitHub] spark pull request #15226: [SPARK-17649][CORE] Log how many Spark events got...

2016-09-25 Thread zsxwing
Github user zsxwing commented on a diff in the pull request:

https://github.com/apache/spark/pull/15226#discussion_r80395860
  
--- Diff: 
core/src/main/scala/org/apache/spark/util/AsynchronousListenerBus.scala ---
@@ -117,6 +124,24 @@ private[spark] abstract class 
AsynchronousListenerBus[L <: AnyRef, E](name: Stri
   eventLock.release()
 } else {
   onDropEvent(event)
+  droppedEventsCounter.incrementAndGet()
+}
+
+val droppedEvents = droppedEventsCounter.get
+if (droppedEvents > 0) {
+  // Don't log too frequently
+  if (System.currentTimeMillis() - lastReportTimestamp >= 60 * 1000) {
--- End diff --

@rxin this is not for measuring elapsed time, and we also want to log it as a 
Date. According to the `System.nanoTime` javadoc:

> This method can only be used to measure elapsed time and is not related 
to any other notion of system or wall-clock time. The value returned represents 
nanoseconds since some fixed but arbitrary origin time (perhaps in the future, 
so values may be negative). The same origin is used by all invocations of this 
method in an instance of a Java virtual machine; other virtual machine 
instances are likely to use a different origin.

It's better to use `System.currentTimeMillis()` since we want to report the 
timestamp in the log.
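
A small sketch of the distinction (values are illustrative; not the PR's code):

```scala
import java.util.Date

// nanoTime is only meaningful as a difference, so it suits duration measurement.
val start = System.nanoTime()
Thread.sleep(5)
val elapsedNanos = System.nanoTime() - start

// currentTimeMillis is wall-clock time, so it can be rendered as a Date in a log line.
val droppedEvents = 42   // placeholder count
val lastReportTimestamp = System.currentTimeMillis()
println(s"Dropped $droppedEvents events since ${new Date(lastReportTimestamp)}")
```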







[GitHub] spark pull request #15123: [SPARK-17551][SQL] Add DataFrame API for null ord...

2016-09-25 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/15123





[GitHub] spark issue #15123: [SPARK-17551][SQL] Add DataFrame API for null ordering

2016-09-25 Thread hvanhovell
Github user hvanhovell commented on the issue:

https://github.com/apache/spark/pull/15123
  
LGTM - merging to master. Thanks.





[GitHub] spark issue #15235: [SPARK-17661][SQL] Consolidate various listLeafFiles imp...

2016-09-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15235
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65888/
Test PASSed.





[GitHub] spark issue #15235: [SPARK-17661][SQL] Consolidate various listLeafFiles imp...

2016-09-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15235
  
Merged build finished. Test PASSed.





[GitHub] spark issue #15235: [SPARK-17661][SQL] Consolidate various listLeafFiles imp...

2016-09-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15235
  
**[Test build #65888 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65888/consoleFull)**
 for PR 15235 at commit 
[`5c6a640`](https://github.com/apache/spark/commit/5c6a6402253b6dbdd8561ca108babfcb8cc4fe36).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15231: [SPARK-17658][SPARKR] write.df API taking path optionall...

2016-09-25 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/15231
  
Oh, BTW, it seems `read.df` also does not allow this? I will try to test 
and fix it here together if so.





[GitHub] spark issue #15231: [SPARK-17658][SPARKR] write.df API taking path optionall...

2016-09-25 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/15231
  
@felixcheung, I usually don't like to answer by quoting, but let me do so 
just to clarify.

> Hmm, should we hold till 12601 is merged then? Seems like we shouldn't 
> allow this unless internal datasources are supporting this more broadly.

Since omitting `path` is what the datasource interface allows, maybe it'd be 
okay to just test that it goes through the JVM fine. Also, I am not sure I can 
easily add a test for a JDBC datasource within SparkR; if it can be done easily, 
I am also happy to hold this.

> Also, before the path parameter type is in the signature, ie.
> 
> ```
> write.df(df, c(1, 2))
> ```
> 
> Would error with some descriptive error, with this change it would get 
> some JVM exception which seems to degrade the experience a bit.



Yep, I could add some type checks.

> Similarly for the path not specified case 
> java.lang.IllegalArgumentException - we generally try to avoid JVM exception 
> showing up if possible.

Also, yes. Maybe we could catch this and avoid showing the direct JVM message, 
making it pretty within R just like PySpark does [1] (although I am not sure 
whether that sounds good in R).

> Could you add checks to path for these cases and give more descriptive 
messages?

Sure, I will try to address the points.


[1]https://github.com/apache/spark/blob/9a5071996b968148f6b9aba12e0d3fe888d9acd8/python/pyspark/sql/utils.py#L64-L80






[GitHub] spark pull request #14897: [SPARK-17338][SQL] add global temp view

2016-09-25 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14897#discussion_r80393533
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -459,7 +459,8 @@ class Analyzer(
   case u: UnresolvedRelation =>
 val table = u.tableIdentifier
 if (table.database.isDefined && conf.runSQLonFile &&
-(!catalog.databaseExists(table.database.get) || 
!catalog.tableExists(table))) {
+(!(catalog.databaseExists(table.database.get) || 
catalog.isTemporaryTable(table)) ||
+!catalog.tableExists(table))) {
--- End diff --

Do we need to update the comment?





[GitHub] spark issue #15237: [SPARK-17663] [CORE] SchedulableBuilder should handle in...

2016-09-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15237
  
Can one of the admins verify this patch?





[GitHub] spark pull request #15237: [SPARK-17663] [CORE] SchedulableBuilder should ha...

2016-09-25 Thread erenavsarogullari
GitHub user erenavsarogullari opened a pull request:

https://github.com/apache/spark/pull/15237

[SPARK-17663] [CORE] SchedulableBuilder should handle invalid data access 
via scheduler.al…

## What changes were proposed in this pull request?
If `spark.scheduler.allocation.file` has invalid `minShare` and/or `weight` 
values, these cause a `NumberFormatException` (from the `toInt` call) and 
`SparkContext` cannot be initialized.

Currently, if `schedulingMode` does not have a valid value, a warning 
message is logged and the default value is set to `FIFO`. The same pattern can be 
used for `minShare` (default: 0) and `weight` (default: 1) as well.

This PR offers:
- `schedulingMode` currently falls back to the default only for empty values. It 
also needs to fall back to the default (`FIFO`) for **whitespace**, 
**non-uppercase** (fair, FaIr, etc.) and `SchedulingMode.NONE` values.
- `minShare` and `weight` currently fall back to their defaults only for empty 
values. They also need to fall back for **non-integer** values (see the sketch 
after this list).
- Some refactoring of `PoolSuite`.
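
A minimal sketch of the fallback pattern described above (the helper name and log 
wording are illustrative, not the PR's code):

```scala
import scala.util.Try

// Parse an XML field as Int, falling back to a default (with a warning) on bad
// input, mirroring how an invalid schedulingMode already falls back to FIFO.
def intValueOrDefault(raw: String, default: Int, field: String): Int =
  Try(raw.trim.toInt).getOrElse {
    println(s"Invalid value '$raw' for $field, using the default $default")
    default
  }

val weight   = intValueOrDefault("invalid_weight", default = 1, field = "weight")   // 1
val minShare = intValueOrDefault("2", default = 0, field = "minShare")              // 2
```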

**Code to Reproduce:**
```
val conf = new SparkConf().setAppName("spark-fairscheduler").setMaster("local")
conf.set("spark.scheduler.mode", "FAIR")
conf.set("spark.scheduler.allocation.file", "src/main/resources/fairscheduler-invalid-data.xml")
val sc = new SparkContext(conf)
```
**fairscheduler-invalid-data.xml:**
```
<allocations>
  <pool name="...">
    <schedulingMode>FIFO</schedulingMode>
    <weight>invalid_weight</weight>
    <minShare>2</minShare>
  </pool>
</allocations>
```
**Stacktrace:**
```
Exception in thread "main" java.lang.NumberFormatException: For input string: "invalid_weight"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Integer.parseInt(Integer.java:580)
    at java.lang.Integer.parseInt(Integer.java:615)
    at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
    at scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
    at org.apache.spark.scheduler.FairSchedulableBuilder$$anonfun$org$apache$spark$scheduler$FairSchedulableBuilder$$buildFairSchedulerPool$1.apply(SchedulableBuilder.scala:127)
    at org.apache.spark.scheduler.FairSchedulableBuilder$$anonfun$org$apache$spark$scheduler$FairSchedulableBuilder$$buildFairSchedulerPool$1.apply(SchedulableBuilder.scala:102)
```

## How was this patch tested?
Added a unit test case.




You can merge this pull request into a Git repository by running:

$ git pull https://github.com/erenavsarogullari/spark SPARK-17663

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15237.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15237


commit 46e63f1cfa545e38d060929ca916fdcf6d53e4d5
Author: erenavsarogullari 
Date:   2016-09-25T21:42:20Z

SchedulableBuilder should handle invalid data access via 
scheduler.allocation.file







[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-09-25 Thread karlhigley
Github user karlhigley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15148#discussion_r80393070
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/lsh/LSH.scala ---
@@ -0,0 +1,290 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature.lsh
+
+import scala.util.Random
+
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.linalg.{Vector, VectorUDT}
+import org.apache.spark.ml.param.{IntParam, ParamMap, ParamValidators}
+import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.sql._
+import org.apache.spark.sql.expressions.UserDefinedFunction
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * Params for [[LSH]].
+ */
+private[ml] trait LSHParams extends HasInputCol with HasOutputCol {
+  /**
+   * Param for output dimension.
+   *
+   * @group param
+   */
+  final val outputDim: IntParam = new IntParam(this, "outputDim", "output 
dimension",
+ParamValidators.gt(0))
+
+  /** @group getParam */
+  final def getOutputDim: Int = $(outputDim)
+
+  setDefault(outputDim -> 1)
+
+  setDefault(outputCol -> "lsh_output")
+
+  /**
+   * Transform the Schema for LSH
+   * @param schema The schema of the input dataset without outputCol
+   * @return A derived schema with outputCol added
+   */
+  final def transformLSHSchema(schema: StructType): StructType = {
+val outputFields = schema.fields :+
+  StructField($(outputCol), new VectorUDT, nullable = false)
+StructType(outputFields)
+  }
+}
+
+/**
+ * Model produced by [[LSH]].
+ */
+abstract class LSHModel[KeyType, T <: LSHModel[KeyType, T]] private[ml]
+  extends Model[T] with LSHParams {
+  override def copy(extra: ParamMap): T = defaultCopy(extra)
+  /**
+   * :: DeveloperApi ::
+   *
+   * The hash function of LSH, mapping a predefined KeyType to a Vector
+   * @return The mapping of LSH function.
+   */
+  protected[this] val hashFunction: KeyType => Vector
+
+  /**
+   * :: DeveloperApi ::
+   *
+   * Calculate the distance between two different keys using the distance 
metric corresponding
+   * to the hashFunction
+   * @param x One of the point in the metric space
+   * @param y Another the point in the metric space
+   * @return The distance between x and y in double
+   */
+  protected[ml] def keyDistance(x: KeyType, y: KeyType): Double
+
+  /**
+   * :: DeveloperApi ::
+   *
+   * Calculate the distance between two different hash Vectors. By 
default, the distance is the
+   * minimum distance of two hash values in any dimension.
+   *
+   * @param x One of the hash vector
+   * @param y Another hash vector
+   * @return The distance between hash vectors x and y in double
+   */
+  protected[ml] def hashDistance(x: Vector, y: Vector): Double = {
+// Since it's generated by hashing, it will be a pair of dense vectors.
+x.toDense.values.zip(y.toDense.values).map(x => math.abs(x._1 - 
x._2)).min
+  }
+
+  /**
+   * Transforms the input dataset.
+   */
+  override def transform(dataset: Dataset[_]): DataFrame = {
+transformSchema(dataset.schema, logging = true)
+val transformUDF = udf(hashFunction, new VectorUDT)
+dataset.withColumn($(outputCol), transformUDF(dataset($(inputCol
+  }
+
+  /**
+   * :: DeveloperApi ::
+   *
+   * Check transform validity and derive the output schema from the input 
schema.
+   *
+   * Typical implementation should first conduct verification on schema 
change and parameter
+   * validity, including complex parameter interaction checks.
+   */
+  override def transformSchema(schema: StructType): StructType = {
+transformLSHSchema(schema)
+  }
+
+  /**
+   * Given a 

[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-09-25 Thread karlhigley
Github user karlhigley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15148#discussion_r80392464
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/lsh/LSH.scala ---
@@ -0,0 +1,290 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature.lsh
+
+import scala.util.Random
+
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.linalg.{Vector, VectorUDT}
+import org.apache.spark.ml.param.{IntParam, ParamMap, ParamValidators}
+import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.sql._
+import org.apache.spark.sql.expressions.UserDefinedFunction
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * Params for [[LSH]].
+ */
+private[ml] trait LSHParams extends HasInputCol with HasOutputCol {
+  /**
+   * Param for output dimension.
+   *
+   * @group param
+   */
+  final val outputDim: IntParam = new IntParam(this, "outputDim", "output 
dimension",
+ParamValidators.gt(0))
+
+  /** @group getParam */
+  final def getOutputDim: Int = $(outputDim)
+
+  setDefault(outputDim -> 1)
+
+  setDefault(outputCol -> "lsh_output")
+
+  /**
+   * Transform the Schema for LSH
+   * @param schema The schema of the input dataset without outputCol
+   * @return A derived schema with outputCol added
+   */
+  final def transformLSHSchema(schema: StructType): StructType = {
+val outputFields = schema.fields :+
+  StructField($(outputCol), new VectorUDT, nullable = false)
+StructType(outputFields)
+  }
+}
+
+/**
+ * Model produced by [[LSH]].
+ */
+abstract class LSHModel[KeyType, T <: LSHModel[KeyType, T]] private[ml]
+  extends Model[T] with LSHParams {
+  override def copy(extra: ParamMap): T = defaultCopy(extra)
+  /**
+   * :: DeveloperApi ::
+   *
+   * The hash function of LSH, mapping a predefined KeyType to a Vector
+   * @return The mapping of LSH function.
+   */
+  protected[this] val hashFunction: KeyType => Vector
+
+  /**
+   * :: DeveloperApi ::
+   *
+   * Calculate the distance between two different keys using the distance 
metric corresponding
+   * to the hashFunction
+   * @param x One of the point in the metric space
+   * @param y Another the point in the metric space
+   * @return The distance between x and y in double
+   */
+  protected[ml] def keyDistance(x: KeyType, y: KeyType): Double
+
+  /**
+   * :: DeveloperApi ::
+   *
+   * Calculate the distance between two different hash Vectors. By 
default, the distance is the
+   * minimum distance of two hash values in any dimension.
+   *
+   * @param x One of the hash vector
+   * @param y Another hash vector
+   * @return The distance between hash vectors x and y in double
+   */
+  protected[ml] def hashDistance(x: Vector, y: Vector): Double = {
+// Since it's generated by hashing, it will be a pair of dense vectors.
+x.toDense.values.zip(y.toDense.values).map(x => math.abs(x._1 - 
x._2)).min
--- End diff --

By default, this is computing the Manhattan distance between hash values, 
which probably works as a proxy for the distance between hash buckets when 
using LSH based on p-stable distributions and any other approach that produces 
vectors of integers/doubles as hash signatures (e.g. MinHash).

However, the default won't work for approaches that produce vectors of 
booleans as hash signatures (e.g. sign random projection for cosine distance). 
It could be overridden to compute Hamming distance in that case, though.
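
For illustration, a Hamming-style distance over 0/1 signature vectors could look like the sketch below (a standalone Scala REPL example, not code from this PR; `hammingDistance` is a hypothetical name):

```
import org.apache.spark.ml.linalg.{Vector, Vectors}

// Hamming distance between two equal-length 0/1 hash signatures: the number of
// positions in which they disagree. A subclass could override hashDistance with
// something along these lines for sign-random-projection LSH.
def hammingDistance(x: Vector, y: Vector): Double =
  x.toDense.values.zip(y.toDense.values).count { case (a, b) => a != b }.toDouble

hammingDistance(Vectors.dense(1.0, 0.0, 1.0, 1.0), Vectors.dense(1.0, 1.0, 1.0, 0.0))  // 2.0
```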



[GitHub] spark pull request #15148: [SPARK-5992][ML] Locality Sensitive Hashing

2016-09-25 Thread karlhigley
Github user karlhigley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15148#discussion_r80392692
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/lsh/LSH.scala ---
@@ -0,0 +1,290 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature.lsh
+
+import scala.util.Random
+
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.linalg.{Vector, VectorUDT}
+import org.apache.spark.ml.param.{IntParam, ParamMap, ParamValidators}
+import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.sql._
+import org.apache.spark.sql.expressions.UserDefinedFunction
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * Params for [[LSH]].
+ */
+private[ml] trait LSHParams extends HasInputCol with HasOutputCol {
+  /**
+   * Param for output dimension.
+   *
+   * @group param
+   */
+  final val outputDim: IntParam = new IntParam(this, "outputDim", "output 
dimension",
+ParamValidators.gt(0))
+
+  /** @group getParam */
+  final def getOutputDim: Int = $(outputDim)
+
+  setDefault(outputDim -> 1)
+
+  setDefault(outputCol -> "lsh_output")
+
+  /**
+   * Transform the Schema for LSH
+   * @param schema The schema of the input dataset without outputCol
+   * @return A derived schema with outputCol added
+   */
+  final def transformLSHSchema(schema: StructType): StructType = {
+val outputFields = schema.fields :+
+  StructField($(outputCol), new VectorUDT, nullable = false)
+StructType(outputFields)
+  }
+}
+
+/**
+ * Model produced by [[LSH]].
+ */
+abstract class LSHModel[KeyType, T <: LSHModel[KeyType, T]] private[ml]
+  extends Model[T] with LSHParams {
+  override def copy(extra: ParamMap): T = defaultCopy(extra)
+  /**
+   * :: DeveloperApi ::
+   *
+   * The hash function of LSH, mapping a predefined KeyType to a Vector
+   * @return The mapping of LSH function.
+   */
+  protected[this] val hashFunction: KeyType => Vector
+
+  /**
+   * :: DeveloperApi ::
+   *
+   * Calculate the distance between two different keys using the distance 
metric corresponding
+   * to the hashFunction
+   * @param x One of the point in the metric space
+   * @param y Another the point in the metric space
+   * @return The distance between x and y in double
+   */
+  protected[ml] def keyDistance(x: KeyType, y: KeyType): Double
+
+  /**
+   * :: DeveloperApi ::
+   *
+   * Calculate the distance between two different hash Vectors. By 
default, the distance is the
+   * minimum distance of two hash values in any dimension.
+   *
+   * @param x One of the hash vector
+   * @param y Another hash vector
+   * @return The distance between hash vectors x and y in double
+   */
+  protected[ml] def hashDistance(x: Vector, y: Vector): Double = {
+// Since it's generated by hashing, it will be a pair of dense vectors.
+x.toDense.values.zip(y.toDense.values).map(x => math.abs(x._1 - 
x._2)).min
+  }
+
+  /**
+   * Transforms the input dataset.
+   */
+  override def transform(dataset: Dataset[_]): DataFrame = {
+transformSchema(dataset.schema, logging = true)
+val transformUDF = udf(hashFunction, new VectorUDT)
+dataset.withColumn($(outputCol), transformUDF(dataset($(inputCol
+  }
+
+  /**
+   * :: DeveloperApi ::
+   *
+   * Check transform validity and derive the output schema from the input 
schema.
+   *
+   * Typical implementation should first conduct verification on schema 
change and parameter
+   * validity, including complex parameter interaction checks.
+   */
+  override def transformSchema(schema: StructType): StructType = {
+transformLSHSchema(schema)
+  }
+
+  /**
+   * Given a 

[GitHub] spark issue #15235: [SPARK-17661][SQL] Consolidate various listLeafFiles imp...

2016-09-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15235
  
**[Test build #65888 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65888/consoleFull)**
 for PR 15235 at commit 
[`5c6a640`](https://github.com/apache/spark/commit/5c6a6402253b6dbdd8561ca108babfcb8cc4fe36).





[GitHub] spark issue #15235: [SPARK-17661][SQL] Consolidate various listLeafFiles imp...

2016-09-25 Thread petermaxlee
Github user petermaxlee commented on the issue:

https://github.com/apache/spark/pull/15235
  
@brkyvz I think this also impacts the change you just did in 
https://github.com/apache/spark/pull/15153. This change makes both code paths 
consistent.






[GitHub] spark issue #15153: [SPARK-17599] Prevent ListingFileCatalog from failing if...

2016-09-25 Thread brkyvz
Github user brkyvz commented on the issue:

https://github.com/apache/spark/pull/15153
  
@petermaxlee It is true that the parallel version can fail as well; the same 
kind of race condition can bite people there.
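
For context, the general pattern under discussion (a sketch only, not the actual patch; `listLeafFilesSafely` is a hypothetical helper) is to treat a path that disappears mid-listing as empty rather than failing the whole query:

```
import java.io.FileNotFoundException
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// If another process deletes a file or directory between enumeration and listing,
// swallow the FileNotFoundException and return no entries instead of failing.
def listLeafFilesSafely(fs: FileSystem, path: Path): Seq[FileStatus] =
  try fs.listStatus(path).toSeq
  catch {
    case _: FileNotFoundException => Seq.empty
  }
```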





[GitHub] spark issue #15153: [SPARK-17599] Prevent ListingFileCatalog from failing if...

2016-09-25 Thread petermaxlee
Github user petermaxlee commented on the issue:

https://github.com/apache/spark/pull/15153
  
@brkyvz the change here only affects the serial version, and not the parallel 
version, doesn't it?

Wouldn't that be a problem?






[GitHub] spark issue #15219: [WIP][SPARK-14098][SQL] Generate Java code to build Cach...

2016-09-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15219
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65887/
Test FAILed.





[GitHub] spark issue #15219: [WIP][SPARK-14098][SQL] Generate Java code to build Cach...

2016-09-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15219
  
Merged build finished. Test FAILed.





[GitHub] spark issue #15219: [WIP][SPARK-14098][SQL] Generate Java code to build Cach...

2016-09-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15219
  
**[Test build #65887 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65887/consoleFull)**
 for PR 15219 at commit 
[`cd5ade5`](https://github.com/apache/spark/commit/cd5ade594be50a16160aa4e9de6467b25e1c6f1e).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15232: [SPARK-17499][SPARKR][FOLLOWUP] Check null first for lay...

2016-09-25 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/15232
  
Oh I meant 244.





[GitHub] spark issue #15219: [WIP][SPARK-14098][SQL] Generate Java code to build Cach...

2016-09-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15219
  
**[Test build #65887 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65887/consoleFull)**
 for PR 15219 at commit 
[`cd5ade5`](https://github.com/apache/spark/commit/cd5ade594be50a16160aa4e9de6467b25e1c6f1e).





[GitHub] spark issue #15097: [SPARK-17540][SparkR][Spark Core] fix SparkR array serde...

2016-09-25 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/15097
  
@WeichenXu123 do you have the user code and sample data that, when run with 
SparkR, cause this issue? I think that would help us better understand how this 
happens.






[GitHub] spark issue #15232: [SPARK-17499][SPARKR][FOLLOWUP] Check null first for lay...

2016-09-25 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/15232
  
The change LGTM.





[GitHub] spark issue #15232: [SPARK-17499][SPARKR][FOLLOWUP] Check null first for lay...

2016-09-25 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/15232
  
You mean issue 224 of testthat on GitHub? It doesn't seem like it's related, though.





[GitHub] spark issue #15231: [SPARK-17658][SPARKR] write.df API taking path optionall...

2016-09-25 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/15231
  
Hmm, should we hold off until 12601 is merged, then? It seems like we shouldn't 
allow this unless the internal data sources support it more broadly.

Also, previously the path parameter type was part of the signature, i.e.
```
write.df(df, c(1, 2))
```
would fail with a descriptive error; with this change it gets a JVM exception, 
which seems to degrade the experience a bit.
Could you add checks for `path` and give a more descriptive message?





[GitHub] spark pull request #15216: [SPARK-17577][Follow-up][SparkR] SparkR spark.add...

2016-09-25 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/15216#discussion_r80388454
  
--- Diff: R/pkg/R/context.R ---
@@ -231,17 +231,21 @@ setCheckpointDir <- function(sc, dirName) {
 #' filesystems), or an HTTP, HTTPS or FTP URI. To access the file in Spark 
jobs,
 #' use spark.getSparkFiles(fileName) to find its download location.
 #'
+#' A directory can be given if the recursive option is set to true.
+#' Currently directories are only supported for Hadoop-supported 
filesystems.
--- End diff --

It depends. Recently someone on the user list was asking why SparkR uses Hadoop 
filesystem classes to read NFS, local files, etc. - it might not be obvious to 
users.
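
For what it's worth, the Hadoop `FileSystem` API is a generic abstraction rather than anything HDFS-specific; a quick spark-shell check (illustrative only) shows a plain local path resolving to the local filesystem implementation:

```
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// A file:// (or plain local) path resolves to Hadoop's LocalFileSystem, which is why
// "Hadoop-supported filesystems" also covers local disks and NFS mounts.
val fs = new Path("file:///tmp").getFileSystem(new Configuration())
println(fs.getClass.getName)  // typically org.apache.hadoop.fs.LocalFileSystem
```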





[GitHub] spark issue #15168: [SPARK-17612][SQL] Support `DESCRIBE table PARTITION` SQ...

2016-09-25 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/15168
  
Thank you for the review, @gatorsmile.

Hi, @hvanhovell.
Could you review this again?




