[GitHub] spark issue #14014: [SPARK-16344][SQL] Decoding Parquet array of struct with...

2016-07-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14014
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14014: [SPARK-16344][SQL] Decoding Parquet array of struct with...

2016-07-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14014
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61811/
Test FAILed.





[GitHub] spark issue #14014: [SPARK-16344][SQL] Decoding Parquet array of struct with...

2016-07-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14014
  
**[Test build #61811 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61811/consoleFull)**
 for PR 14014 at commit 
[`5d3fdd4`](https://github.com/apache/spark/commit/5d3fdd4d31a5ddb91d4b4fd30a8a267fbe519952).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #13950: [SPARK-15487] [Web UI] Spark Master UI to reverse...

2016-07-05 Thread gurvindersingh
Github user gurvindersingh commented on a diff in the pull request:

https://github.com/apache/spark/pull/13950#discussion_r69678617
  
--- Diff: core/src/main/scala/org/apache/spark/ui/JettyUtils.scala ---
@@ -186,6 +188,67 @@ private[spark] object JettyUtils extends Logging {
 contextHandler
   }
 
+  /** Create a handler for proxying request to Workers and Application 
Drivers */
+  def createProxyHandler(
+  prefix: String,
+  target: String): ServletContextHandler = {
+val servlet = new ProxyServlet {
+  override def rewriteTarget(request: HttpServletRequest): String = {
+val path = request.getRequestURI();
+if (!path.startsWith(prefix)) return null
+
+val uri = new StringBuilder(target)
+if (target.endsWith("/")) uri.setLength(uri.length() - 1)
+val rest = path.substring(prefix.length())
+if (!rest.isEmpty())
+{
--- End diff --

Both are fixed now





[GitHub] spark pull request #13990: [SPARK-16287][SQL] Implement str_to_map SQL funct...

2016-07-05 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/13990#discussion_r69678507
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ComplexTypeSuite.scala
 ---
@@ -246,4 +246,17 @@ class ComplexTypeSuite extends SparkFunSuite with 
ExpressionEvalHelper {
 checkMetadata(CreateStructUnsafe(Seq(a, b)))
 checkMetadata(CreateNamedStructUnsafe(Seq("a", a, "b", b)))
   }
+
+  test("StringToMap") {
--- End diff --

Could you add a boundary testcase for Hive compatibility?
```
hive> select str_to_map('');
OK
_c0
{"":null}
```
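
For illustration, such a boundary test in ComplexTypeSuite might look like the sketch below (hypothetical, using the suite's existing `checkEvaluation` helper; the expected value mirrors Hive's `{"":null}`, and the final assertion may need to differ since the quoted `dataType` declares `valueContainsNull = false`):
```scala
  test("StringToMap - empty input (Hive boundary case)") {
    // Hypothetical expectation mirroring Hive's `{"":null}`; adjust if the
    // implementation picks a different convention for a pair that has no
    // key/value delimiter.
    checkEvaluation(new StringToMap(Literal("")), Map("" -> null))
  }
```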





[GitHub] spark issue #14004: [SPARK-16285][SQL] Implement sentences SQL functions

2016-07-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14004
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14004: [SPARK-16285][SQL] Implement sentences SQL functions

2016-07-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14004
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61807/
Test PASSed.





[GitHub] spark issue #14004: [SPARK-16285][SQL] Implement sentences SQL functions

2016-07-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14004
  
**[Test build #61807 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61807/consoleFull)**
 for PR 14004 at commit 
[`f1a5c1b`](https://github.com/apache/spark/commit/f1a5c1b645840bc8bc5db3cc9dbfe1642eb109a0).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #13990: [SPARK-16287][SQL] Implement str_to_map SQL funct...

2016-07-05 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/13990#discussion_r69678271
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
 ---
@@ -393,3 +393,64 @@ case class CreateNamedStructUnsafe(children: 
Seq[Expression]) extends Expression
 
   override def prettyName: String = "named_struct_unsafe"
 }
+
+/**
+ * Creates a map after splitting the input text into key/value pairs using 
delimeters
+ */
+@ExpressionDescription(
+  usage = """_FUNC_(text[, pairDelim, keyValueDelim]) - Creates a map 
after splitting the text into
+key/value pairs using delimiters.
+Default delimiters are ',' for pairDelim and '=' for keyValueDelim.""",
+  extended = """ > SELECT _FUNC_('a:1,b:2,c:3',',',':');\n 
map("a":"1","b":"2","c":"3") """)
+case class StringToMap(text: Expression, pairDelim: Expression, 
keyValueDelim: Expression)
+  extends TernaryExpression with ExpectsInputTypes {
+
+  def this(child: Expression) = {
+this(child, Literal(","), Literal("="))
+  }
+
+  override def children: Seq[Expression] = Seq(text, pairDelim, 
keyValueDelim)
+
+  override def inputTypes: Seq[AbstractDataType] = Seq(StringType, 
StringType, StringType)
+
+  override def dataType: DataType = MapType(StringType, StringType, 
valueContainsNull = false)
+
+  override def nullSafeEval(str: Any, delim1: Any, delim2: Any): Any = {
+val array = str.asInstanceOf[UTF8String]
+  .split(delim1.asInstanceOf[UTF8String], -1)
+  .map{_.split(delim2.asInstanceOf[UTF8String], 2)}
+
+ArrayBasedMapData(array.map(_(0)), array.map(_(1)))
+  }
+
+  override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
+
+nullSafeCodeGen(ctx, ev, (text, pairDelim, keyValueDelim) => {
+  val arrayClass = classOf[GenericArrayData].getName
+  val mapClass = classOf[ArrayBasedMapData].getName
+
+  val keyArray = ctx.freshName("keyArray")
+  val valueArray = ctx.freshName("valueArray")
+  val tempArray = ctx.freshName("tempArray")
+  val keyValue = ctx.freshName("keyValue")
+  val i = ctx.freshName("i")
+
+  s"""
+UTF8String[] $tempArray = ($text).split($pairDelim, -1);
+
+UTF8String[] $keyArray = new UTF8String[$tempArray.length];
+UTF8String[] $valueArray = new UTF8String[$tempArray.length];
+
+for (int $i = 0; $i < $tempArray.length; $i ++) {
--- End diff --

`$i++` ?





[GitHub] spark issue #14045: [SPARK-16362][SQL][WIP] Support ArrayType and StructType...

2016-07-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14045
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61814/
Test FAILed.





[GitHub] spark issue #14045: [SPARK-16362][SQL][WIP] Support ArrayType and StructType...

2016-07-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14045
  
**[Test build #61814 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61814/consoleFull)**
 for PR 14045 at commit 
[`ded41b2`](https://github.com/apache/spark/commit/ded41b2726a0905fded1a74c5005732d636e7d20).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14045: [SPARK-16362][SQL][WIP] Support ArrayType and StructType...

2016-07-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14045
  
Merged build finished. Test FAILed.





[GitHub] spark issue #14063: [MINOR][PySpark][DOC] Fix code examples of SparkSession ...

2016-07-05 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/14063
  
Can you verify the output after all the follow-up commits?






[GitHub] spark issue #13509: [SPARK-15740] [MLLIB] Word2VecSuite "big model load / sa...

2016-07-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13509
  
**[Test build #3166 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3166/consoleFull)**
 for PR 13509 at commit 
[`909b6e1`](https://github.com/apache/spark/commit/909b6e16cbca29a2eaaecfd4151c5ac0af546cae).





[GitHub] spark pull request #13990: [SPARK-16287][SQL] Implement str_to_map SQL funct...

2016-07-05 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/13990#discussion_r69677978
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
 ---
@@ -393,3 +393,64 @@ case class CreateNamedStructUnsafe(children: 
Seq[Expression]) extends Expression
 
   override def prettyName: String = "named_struct_unsafe"
 }
+
+/**
+ * Creates a map after splitting the input text into key/value pairs using 
delimeters
+ */
+@ExpressionDescription(
+  usage = """_FUNC_(text[, pairDelim, keyValueDelim]) - Creates a map 
after splitting the text into
+key/value pairs using delimiters.
+Default delimiters are ',' for pairDelim and '=' for keyValueDelim.""",
+  extended = """ > SELECT _FUNC_('a:1,b:2,c:3',',',':');\n 
map("a":"1","b":"2","c":"3") """)
+case class StringToMap(text: Expression, pairDelim: Expression, 
keyValueDelim: Expression)
+  extends TernaryExpression with ExpectsInputTypes {
+
+  def this(child: Expression) = {
+this(child, Literal(","), Literal("="))
+  }
+
+  override def children: Seq[Expression] = Seq(text, pairDelim, 
keyValueDelim)
+
+  override def inputTypes: Seq[AbstractDataType] = Seq(StringType, 
StringType, StringType)
+
+  override def dataType: DataType = MapType(StringType, StringType, 
valueContainsNull = false)
+
+  override def nullSafeEval(str: Any, delim1: Any, delim2: Any): Any = {
+val array = str.asInstanceOf[UTF8String]
+  .split(delim1.asInstanceOf[UTF8String], -1)
+  .map{_.split(delim2.asInstanceOf[UTF8String], 2)}
--- End diff --

minor: `.map(_.split(delim2.asInstanceOf[UTF8String], 2))`?





[GitHub] spark pull request #13990: [SPARK-16287][SQL] Implement str_to_map SQL funct...

2016-07-05 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/13990#discussion_r69677596
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
 ---
@@ -393,3 +393,64 @@ case class CreateNamedStructUnsafe(children: 
Seq[Expression]) extends Expression
 
   override def prettyName: String = "named_struct_unsafe"
 }
+
+/**
+ * Creates a map after splitting the input text into key/value pairs using 
delimeters
+ */
+@ExpressionDescription(
+  usage = """_FUNC_(text[, pairDelim, keyValueDelim]) - Creates a map 
after splitting the text into
+key/value pairs using delimiters.
+Default delimiters are ',' for pairDelim and '=' for keyValueDelim.""",
+  extended = """ > SELECT _FUNC_('a:1,b:2,c:3',',',':');\n 
map("a":"1","b":"2","c":"3") """)
+case class StringToMap(text: Expression, pairDelim: Expression, 
keyValueDelim: Expression)
+  extends TernaryExpression with ExpectsInputTypes {
+
+  def this(child: Expression) = {
+this(child, Literal(","), Literal("="))
+  }
--- End diff --

Hi, @techaddict .
Could you add one more constructor, `this(child: Expression, pairDelim: 
Expression)`?
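
For illustration, the requested auxiliary constructor could be a one-liner along these lines (a sketch, assuming it reuses the default '=' keyValueDelim from the existing single-argument constructor):
```scala
  def this(child: Expression, pairDelim: Expression) = {
    this(child, pairDelim, Literal("="))
  }
```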





[GitHub] spark issue #14045: [SPARK-16362][SQL][WIP] Support ArrayType and StructType...

2016-07-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14045
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61813/
Test FAILed.





[GitHub] spark issue #14045: [SPARK-16362][SQL][WIP] Support ArrayType and StructType...

2016-07-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14045
  
Merged build finished. Test FAILed.





[GitHub] spark issue #14045: [SPARK-16362][SQL][WIP] Support ArrayType and StructType...

2016-07-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14045
  
**[Test build #61813 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61813/consoleFull)**
 for PR 14045 at commit 
[`4dca939`](https://github.com/apache/spark/commit/4dca939b131cdab042f437236b108686d12f9d94).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #13494: [SPARK-15752] [SQL] support optimization for meta...

2016-07-05 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13494#discussion_r69677532
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/OptimizeMetadataOnlySuite.scala
 ---
@@ -0,0 +1,95 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution
+
+import org.apache.spark.sql._
+import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.test.SharedSQLContext
+
+class OptimizeMetadataOnlySuite extends QueryTest with SharedSQLContext {
+  import testImplicits._
+
+  override def beforeAll(): Unit = {
+super.beforeAll()
+val data = (1 to 10).map(i => (i, s"data-$i", i % 2, if ((i % 2) == 0) 
"even" else "odd"))
+  .toDF("id", "data", "partId", "part")
+data.write.partitionBy("partId", 
"part").mode("append").saveAsTable("srcpart_15752")
+  }
+
+  override protected def afterAll(): Unit = {
+try {
+  sql("DROP TABLE IF EXISTS srcpart_15752")
+} finally {
+  super.afterAll()
+}
+  }
+
+  private def checkWithMetadataOnly(df: DataFrame): Unit = {
+val localRelations = df.queryExecution.optimizedPlan.collect {
+  case l @ LocalRelation(_, _) => l
+}
+assert(localRelations.size == 1)
+  }
+
+  private def checkWithoutMetadataOnly(df: DataFrame): Unit = {
+val localRelations = df.queryExecution.optimizedPlan.collect{
+  case l @ LocalRelation(_, _) => l
+}
+assert(localRelations.size == 0)
+  }
+
+  test("spark-15752 metadata only optimizer for partition table") {
+withSQLConf(SQLConf.OPTIMIZER_METADATA_ONLY.key -> "true") {
+  checkWithMetadataOnly(sql("select part from srcpart_15752 where part 
= 0 group by part"))
+  checkWithMetadataOnly(sql("select max(part) from srcpart_15752"))
+  checkWithMetadataOnly(sql("select max(part) from srcpart_15752 where 
part = 0"))
+  checkWithMetadataOnly(
+sql("select part, min(partId) from srcpart_15752 where part = 0 
group by part"))
+  checkWithMetadataOnly(
+sql("select max(x) from (select part + 1 as x from srcpart_15752 
where part = 1) t"))
+  checkWithMetadataOnly(sql("select distinct part from srcpart_15752"))
+  checkWithMetadataOnly(sql("select distinct part, partId from 
srcpart_15752"))
+  checkWithMetadataOnly(
+sql("select distinct x from (select part + 1 as x from 
srcpart_15752 where part = 0) t"))
+
+  // Now donot support metadata only optimizer
+  checkWithoutMetadataOnly(sql("select part, max(id) from 
srcpart_15752 group by part"))
+  checkWithoutMetadataOnly(sql("select distinct part, id from 
srcpart_15752"))
+  checkWithoutMetadataOnly(sql("select part, sum(partId) from 
srcpart_15752 group by part"))
+  checkWithoutMetadataOnly(
+sql("select part from srcpart_15752 where part = 1 group by 
rollup(part)"))
+  checkWithoutMetadataOnly(
+sql("select part from (select part from srcpart_15752 where part = 
0 union all " +
+  "select part from srcpart_15752 where part= 1)t group by part"))
+}
+  }
+
+  test("spark-15752 without metadata only optimizer for partition table") {
+withSQLConf(SQLConf.OPTIMIZER_METADATA_ONLY.key -> "false") {
+  checkWithoutMetadataOnly(sql("select part from srcpart_15752 where 
part = 0 group by part"))
+  checkWithoutMetadataOnly(sql("select max(part) from srcpart_15752"))
--- End diff --

ah ok - this is getting confusing. we should just have 
checkWithMetadataOnly test the behavior when the flag is off.
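
One possible reading of that suggestion, sketched below (not the final shape of the suite): fold the two helpers into a single assertion that is parameterized by the expected outcome, so each call states explicitly whether the metadata-only rewrite should have kicked in.
```scala
  // Hypothetical replacement for checkWithMetadataOnly / checkWithoutMetadataOnly.
  private def assertMetadataOnly(df: DataFrame, expectMetadataOnly: Boolean): Unit = {
    val localRelations = df.queryExecution.optimizedPlan.collect {
      case l @ LocalRelation(_, _) => l
    }
    assert(localRelations.nonEmpty == expectMetadataOnly)
  }
```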




[GitHub] spark issue #14053: [SPARK-16374] [SQL] Remove Alias from MetastoreRelation ...

2016-07-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14053
  
**[Test build #61820 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61820/consoleFull)**
 for PR 14053 at commit 
[`d54b605`](https://github.com/apache/spark/commit/d54b605bc7ef6ba4d1ed14bfcc65fef075bb5e63).





[GitHub] spark issue #13990: [SPARK-16287][SQL] Implement str_to_map SQL function

2016-07-05 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/13990
  
Sure, @rxin .





[GitHub] spark issue #14004: [SPARK-16285][SQL] Implement sentences SQL functions

2016-07-05 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/14004
  
If so, I will reposition that.
Do you mean `making a new java file containing not-really-UTF8String` 
function?





[GitHub] spark issue #14004: [SPARK-16285][SQL] Implement sentences SQL functions

2016-07-05 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/14004
  
Oh, @cloud-fan .
Is there some misunderstanding?





[GitHub] spark issue #13990: [SPARK-16287][SQL] Implement str_to_map SQL function

2016-07-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13990
  
**[Test build #61819 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61819/consoleFull)**
 for PR 13990 at commit 
[`94c18ff`](https://github.com/apache/spark/commit/94c18ff013fbd610832dab187d1ba63408251e3d).





[GitHub] spark pull request #13950: [SPARK-15487] [Web UI] Spark Master UI to reverse...

2016-07-05 Thread tnachen
Github user tnachen commented on a diff in the pull request:

https://github.com/apache/spark/pull/13950#discussion_r69676891
  
--- Diff: core/src/main/scala/org/apache/spark/ui/JettyUtils.scala ---
@@ -186,6 +188,67 @@ private[spark] object JettyUtils extends Logging {
 contextHandler
   }
 
+  /** Create a handler for proxying request to Workers and Application 
Drivers */
+  def createProxyHandler(
+  prefix: String,
+  target: String): ServletContextHandler = {
+val servlet = new ProxyServlet {
+  override def rewriteTarget(request: HttpServletRequest): String = {
+val path = request.getRequestURI();
+if (!path.startsWith(prefix)) return null
+
+val uri = new StringBuilder(target)
+if (target.endsWith("/")) uri.setLength(uri.length() - 1)
+val rest = path.substring(prefix.length())
+if (!rest.isEmpty())
+{
--- End diff --

Move { to previous line 





[GitHub] spark pull request #13950: [SPARK-15487] [Web UI] Spark Master UI to reverse...

2016-07-05 Thread tnachen
Github user tnachen commented on a diff in the pull request:

https://github.com/apache/spark/pull/13950#discussion_r69676869
  
--- Diff: docs/configuration.md ---
@@ -598,6 +598,20 @@ Apart from these, the following properties are also 
available, and may be useful
   
 
 
+  spark.ui.reverseProxy
+  false
+  
+To enable running Spark Master, worker and application UI behined a 
reverse proxy. In this mode, Spark master will reverse proxy the worker and 
application UIs to enable access.
+  
+
+
+  spark.ui.reverseProxyUrl
+  http://localhost:8080
+  
+This is the URL where your proxy is running. Make sure this is a 
complete URL includeing scheme (http/https) and port to reach your proxy.
--- End diff --

includeing -> including





[GitHub] spark issue #14004: [SPARK-16285][SQL] Implement sentences SQL functions

2016-07-05 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/14004
  
Now the PR is more concise. Thank you for the decision, @cloud-fan.





[GitHub] spark pull request #13990: [SPARK-16287][SQL] Implement str_to_map SQL funct...

2016-07-05 Thread techaddict
Github user techaddict commented on a diff in the pull request:

https://github.com/apache/spark/pull/13990#discussion_r69676815
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
 ---
@@ -393,3 +393,71 @@ case class CreateNamedStructUnsafe(children: 
Seq[Expression]) extends Expression
 
   override def prettyName: String = "named_struct_unsafe"
 }
+
+/**
+ * Creates a map after splitting the input text into key/value pairs using 
delimeters
+ */
+@ExpressionDescription(
+  usage = """_FUNC_(text[, pairDelim, keyValueDelim]) - Creates a map 
after splitting the text into
--- End diff --

Not sure about the display: ```[Usage: str_to_map(text[, pairDelim, 
keyValueDelim]) - Creates a map after splitting the text into
key/value pairs using delimiters.
Default delimiters are ',' for pairDelim and '=' for keyValueDelim.]```
Added an example.





[GitHub] spark issue #13926: [SPARK-16229] [SQL] Drop Empty Table After CREATE TABLE ...

2016-07-05 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/13926
  
@cloud-fan `CreateDataSourceTableAsSelectCommand` creates a DataFrame 
first, and then creates a data source table. The order is different from 
`CreateHiveTableAsSelectCommand`. If we hit any issue when creating the 
DataFrame, we will stop before trying to create the data source table. Thus, it 
should be fine based on my understanding.





[GitHub] spark issue #14060: [SPARK-16340][SQL] Support column arguments for `regexp_...

2016-07-05 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/14060
  
It's my pleasure. Thank you for reviewing, @MukulSoul and @rxin .





[GitHub] spark pull request #13494: [SPARK-15752] [SQL] support optimization for meta...

2016-07-05 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/13494#discussion_r69676774
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/OptimizeMetadataOnlySuite.scala
 ---
@@ -0,0 +1,95 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution
+
+import org.apache.spark.sql._
+import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.test.SharedSQLContext
+
+class OptimizeMetadataOnlySuite extends QueryTest with SharedSQLContext {
+  import testImplicits._
+
+  override def beforeAll(): Unit = {
+super.beforeAll()
+val data = (1 to 10).map(i => (i, s"data-$i", i % 2, if ((i % 2) == 0) 
"even" else "odd"))
+  .toDF("id", "data", "partId", "part")
+data.write.partitionBy("partId", 
"part").mode("append").saveAsTable("srcpart_15752")
+  }
+
+  override protected def afterAll(): Unit = {
+try {
+  sql("DROP TABLE IF EXISTS srcpart_15752")
+} finally {
+  super.afterAll()
+}
+  }
+
+  private def checkWithMetadataOnly(df: DataFrame): Unit = {
+val localRelations = df.queryExecution.optimizedPlan.collect {
+  case l @ LocalRelation(_, _) => l
+}
+assert(localRelations.size == 1)
+  }
+
+  private def checkWithoutMetadataOnly(df: DataFrame): Unit = {
+val localRelations = df.queryExecution.optimizedPlan.collect{
+  case l @ LocalRelation(_, _) => l
+}
+assert(localRelations.size == 0)
+  }
+
+  test("spark-15752 metadata only optimizer for partition table") {
+withSQLConf(SQLConf.OPTIMIZER_METADATA_ONLY.key -> "true") {
+  checkWithMetadataOnly(sql("select part from srcpart_15752 where part 
= 0 group by part"))
+  checkWithMetadataOnly(sql("select max(part) from srcpart_15752"))
+  checkWithMetadataOnly(sql("select max(part) from srcpart_15752 where 
part = 0"))
+  checkWithMetadataOnly(
+sql("select part, min(partId) from srcpart_15752 where part = 0 
group by part"))
+  checkWithMetadataOnly(
+sql("select max(x) from (select part + 1 as x from srcpart_15752 
where part = 1) t"))
+  checkWithMetadataOnly(sql("select distinct part from srcpart_15752"))
+  checkWithMetadataOnly(sql("select distinct part, partId from 
srcpart_15752"))
+  checkWithMetadataOnly(
+sql("select distinct x from (select part + 1 as x from 
srcpart_15752 where part = 0) t"))
+
+  // Now donot support metadata only optimizer
+  checkWithoutMetadataOnly(sql("select part, max(id) from 
srcpart_15752 group by part"))
+  checkWithoutMetadataOnly(sql("select distinct part, id from 
srcpart_15752"))
+  checkWithoutMetadataOnly(sql("select part, sum(partId) from 
srcpart_15752 group by part"))
+  checkWithoutMetadataOnly(
+sql("select part from srcpart_15752 where part = 1 group by 
rollup(part)"))
+  checkWithoutMetadataOnly(
+sql("select part from (select part from srcpart_15752 where part = 
0 union all " +
+  "select part from srcpart_15752 where part= 1)t group by part"))
+}
+  }
+
+  test("spark-15752 without metadata only optimizer for partition table") {
+withSQLConf(SQLConf.OPTIMIZER_METADATA_ONLY.key -> "false") {
+  checkWithoutMetadataOnly(sql("select part from srcpart_15752 where 
part = 0 group by part"))
+  checkWithoutMetadataOnly(sql("select max(part) from srcpart_15752"))
--- End diff --

because we turn off the flag for this test




[GitHub] spark issue #14004: [SPARK-16285][SQL] Implement sentences SQL functions

2016-07-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14004
  
**[Test build #61818 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61818/consoleFull)**
 for PR 14004 at commit 
[`a98c05e`](https://github.com/apache/spark/commit/a98c05e736d1638b8d9c1475bb658206bada314b).





[GitHub] spark pull request #14004: [SPARK-16285][SQL] Implement sentences SQL functi...

2016-07-05 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14004#discussion_r6967
  
--- Diff: 
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java ---
@@ -801,6 +804,49 @@ public static UTF8String concatWs(UTF8String 
separator, UTF8String... inputs) {
 return res;
   }
 
+  /**
+   * Return a locale of the given language and country, or a default 
locale when failures occur.
+   */
+  private Locale getLocale(UTF8String language, UTF8String country) {
--- End diff --

Sorry I mean the `sentences` method, not the `Sentences` expression...





[GitHub] spark issue #14054: [SPARK-16226] [SQL] Weaken JDBC isolation level to avoid...

2016-07-05 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/14054
  
hm it does seem to me this should be an option in the data source itself.






[GitHub] spark issue #14059: [Mesos] expand coarse-grained mode docs

2016-07-05 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/14059
  
LGTM except the minor comment.






[GitHub] spark issue #14004: [SPARK-16285][SQL] Implement sentences SQL functions

2016-07-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14004
  
**[Test build #61816 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61816/consoleFull)**
 for PR 14004 at commit 
[`50629b5`](https://github.com/apache/spark/commit/50629b573a02c4833358905f7a4f9ac5833bafdd).





[GitHub] spark issue #13708: [SPARK-15591] [WEBUI] Paginate Stage Table in Stages tab

2016-07-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13708
  
**[Test build #61817 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61817/consoleFull)**
 for PR 13708 at commit 
[`3a98c4d`](https://github.com/apache/spark/commit/3a98c4d9946b765bd5a36db648af8a177749143a).





[GitHub] spark pull request #14059: [Mesos] expand coarse-grained mode docs

2016-07-05 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14059#discussion_r69676444
  
--- Diff: docs/running-on-mesos.md ---
@@ -180,30 +180,47 @@ Note that jars or python files that are passed to 
spark-submit should be URIs re
 
 # Mesos Run Modes
 
-Spark can run over Mesos in two modes: "coarse-grained" (default) and 
"fine-grained".
-
-The "coarse-grained" mode will launch only *one* long-running Spark task 
on each Mesos
-machine, and dynamically schedule its own "mini-tasks" within it. The 
benefit is much lower startup
-overhead, but at the cost of reserving the Mesos resources for the 
complete duration of the
-application.
-
-Coarse-grained is the default mode. You can also set `spark.mesos.coarse` 
property to true
-to turn it on explicitly in 
[SparkConf](configuration.html#spark-properties):
-
-{% highlight scala %}
-conf.set("spark.mesos.coarse", "true")
-{% endhighlight %}
-
-In addition, for coarse-grained mode, you can control the maximum number 
of resources Spark will
-acquire. By default, it will acquire *all* cores in the cluster (that get 
offered by Mesos), which
-only makes sense if you run just one application at a time. You can cap 
the maximum number of cores
-using `conf.set("spark.cores.max", "10")` (for example).
-
-In "fine-grained" mode, each Spark task runs as a separate Mesos task. 
This allows
-multiple instances of Spark (and other frameworks) to share machines at a 
very fine granularity,
-where each application gets more or fewer machines as it ramps up and 
down, but it comes with an
-additional overhead in launching each task. This mode may be inappropriate 
for low-latency
-requirements like interactive queries or serving web requests.
+Spark can run over Mesos in two modes: "coarse-grained" (default) and
+"fine-grained".
+
+In "coarse-grained" mode, each Spark executor runs as a single Mesos
+task.  Spark executors are sized according to the following
+configuration variables:
+
+* Executor memory: `spark.executor.memory`
+* Executor cores: `spark.executor.cores`
+* Number of executors: `spark.cores.max`/`spark.executor.cores`
+
+Please see the [Spark Configuration](configuration.html) page for
+details and default values.
+
+Executors are brought up eagerly when the application starts, until
+`spark.cores.max` is reached.  If you don't set `spark.cores.max`, the
+Spark application will reserve all resources offered to it by Mesos,
+so we of course urge you to set this variable in any sort of
+multi-tenant cluster, including one which runs multiple concurrent
+Spark applications.
+
+The scheduler will start executors round-robin on the offers Mesos
+gives it, but there are no spread guarantees, as Mesos does not
+provide such guarantees on the offer stream.
+
+The benefit of coarse-grained mode is much lower startup overhead, but
+at the cost of reserving Mesos resources for the complete duration of
+the application.
--- End diff --

say a word about dynamic allocation?
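
For reference, the three sizing knobs called out in the quoted doc can be combined in a `SparkConf` like the illustrative sketch below (the master URL and values are placeholders, not recommendations):
```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setMaster("mesos://zk://zk1:2181/mesos")  // placeholder Mesos master URL
  .set("spark.executor.memory", "4g")        // memory per executor
  .set("spark.executor.cores", "2")          // cores per executor
  .set("spark.cores.max", "8")               // total cores: 8 / 2 = 4 executors
```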





[GitHub] spark issue #13990: [SPARK-16287][SQL] Implement str_to_map SQL function

2016-07-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13990
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61806/
Test PASSed.





[GitHub] spark issue #13494: [SPARK-15752] [SQL] support optimization for metadata on...

2016-07-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13494
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61805/
Test PASSed.





[GitHub] spark issue #13990: [SPARK-16287][SQL] Implement str_to_map SQL function

2016-07-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13990
  
Merged build finished. Test PASSed.





[GitHub] spark issue #13494: [SPARK-15752] [SQL] support optimization for metadata on...

2016-07-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13494
  
Merged build finished. Test PASSed.





[GitHub] spark issue #13990: [SPARK-16287][SQL] Implement str_to_map SQL function

2016-07-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13990
  
**[Test build #61806 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61806/consoleFull)**
 for PR 13990 at commit 
[`f7c03c5`](https://github.com/apache/spark/commit/f7c03c5a44011949e6c8b444334560c0b61c766e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #13494: [SPARK-15752] [SQL] support optimization for metadata on...

2016-07-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13494
  
**[Test build #61805 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61805/consoleFull)**
 for PR 13494 at commit 
[`4297f9f`](https://github.com/apache/spark/commit/4297f9f7f32fc2ea59bba43569b17da94ed84fce).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #13494: [SPARK-15752] [SQL] support optimization for meta...

2016-07-05 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13494#discussion_r69676173
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/OptimizeMetadataOnly.scala
 ---
@@ -0,0 +1,133 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.{CatalystConf, InternalRow}
+import org.apache.spark.sql.catalyst.catalog.{CatalogRelation, 
SessionCatalog}
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.aggregate._
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.execution.datasources.{HadoopFsRelation, 
LogicalRelation}
+
+/**
+ * When scanning only partition columns, get results based on partition 
data without scanning files.
+ * It's used for operators that only need distinct values. Currently only 
[[Aggregate]] operator
+ * which satisfy the following conditions are supported:
+ * 1. aggregate expression is partition columns.
+ *  e.g. SELECT col FROM tbl GROUP BY col.
+ * 2. aggregate function on partition columns with DISTINCT.
+ *  e.g. SELECT count(DISTINCT col1) FROM tbl GROUP BY col2.
+ * 3. aggregate function on partition columns which have same result w or 
w/o DISTINCT keyword.
+ *  e.g. SELECT Max(col2) FROM tbl GROUP BY col1.
+ */
+case class OptimizeMetadataOnly(
+catalog: SessionCatalog,
+conf: CatalystConf) extends Rule[LogicalPlan] {
+
+  def apply(plan: LogicalPlan): LogicalPlan = {
+if (!conf.optimizerMetadataOnly) {
+  return plan
+}
+
+plan.transform {
+  case a @ Aggregate(_, aggExprs, child @ 
PartitionedRelation(partAttrs, relation)) =>
+if (a.references.subsetOf(partAttrs)) {
+  val aggFunctions = aggExprs.flatMap(_.collect {
+case agg: AggregateExpression => agg
+  })
+  val isPartitionDataOnly = aggFunctions.isEmpty || 
aggFunctions.forall { agg =>
+agg.isDistinct || (agg.aggregateFunction match {
--- End diff --

```
select col1, max(col2) from table group by col1
```





[GitHub] spark issue #13990: [SPARK-16287][SQL] Implement str_to_map SQL function

2016-07-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13990
  
**[Test build #61815 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61815/consoleFull)**
 for PR 13990 at commit 
[`d1573b6`](https://github.com/apache/spark/commit/d1573b6c3d4585f8258531571eacf12d4fad1dbc).





[GitHub] spark issue #14060: [SPARK-16340][SQL] Support column arguments for `regexp_...

2016-07-05 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/14060
  
Merging in master.






[GitHub] spark issue #13990: [SPARK-16287][SQL] Implement str_to_map SQL function

2016-07-05 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/13990
  
cc @dongjoon-hyun, can you help review this?





[GitHub] spark pull request #13990: [SPARK-16287][SQL] Implement str_to_map SQL funct...

2016-07-05 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13990#discussion_r69676026
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
 ---
@@ -393,3 +393,71 @@ case class CreateNamedStructUnsafe(children: 
Seq[Expression]) extends Expression
 
   override def prettyName: String = "named_struct_unsafe"
 }
+
+/**
+ * Creates a map after splitting the input text into key/value pairs using 
delimeters
+ */
+@ExpressionDescription(
+  usage = """_FUNC_(text[, pairDelim, keyValueDelim]) - Creates a map 
after splitting the text into
--- End diff --

also we really need an example here
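
For reference, one hypothetical shape for such an example, using the annotation's `extended` field (the wording below is an assumption, not the PR's final documentation):

```scala
// Hypothetical sketch only: an example block added to the annotation quoted above.
@ExpressionDescription(
  usage = "_FUNC_(text[, pairDelim, keyValueDelim]) - Creates a map by splitting text into key/value pairs.",
  extended = """
    > SELECT _FUNC_('a=1,b=2', ',', '=');
     {"a":"1","b":"2"}
  """)
case class StringToMap(text: Expression, pairDelim: Expression, keyValueDelim: Expression)
  extends TernaryExpression with ExpectsInputTypes {
  // ... rest of the implementation as in the diff above ...
}
```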





[GitHub] spark pull request #13990: [SPARK-16287][SQL] Implement str_to_map SQL funct...

2016-07-05 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13990#discussion_r69676018
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
 ---
@@ -393,3 +393,71 @@ case class CreateNamedStructUnsafe(children: 
Seq[Expression]) extends Expression
 
   override def prettyName: String = "named_struct_unsafe"
 }
+
+/**
+ * Creates a map after splitting the input text into key/value pairs using 
delimeters
+ */
+@ExpressionDescription(
+  usage = """_FUNC_(text[, pairDelim, keyValueDelim]) - Creates a map 
after splitting the text into
--- End diff --

this will mess up the display, I think?





[GitHub] spark pull request #13990: [SPARK-16287][SQL] Implement str_to_map SQL funct...

2016-07-05 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13990#discussion_r69676002
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
 ---
@@ -393,3 +393,71 @@ case class CreateNamedStructUnsafe(children: 
Seq[Expression]) extends Expression
 
   override def prettyName: String = "named_struct_unsafe"
 }
+
+/**
+ * Creates a map after splitting the input text into key/value pairs using 
delimeters
+ */
+@ExpressionDescription(
+  usage = """_FUNC_(text[, pairDelim, keyValueDelim]) - Creates a map 
after splitting the text into
+key/value pairs using delimiters.
+Default delimiters are ',' for pairDelim and '=' for keyValueDelim.""")
+case class StringToMap(text: Expression, pairDelim: Expression, 
keyValueDelim: Expression)
+  extends TernaryExpression with ExpectsInputTypes {
+
+  def this(child: Expression) = {
+this(child, Literal(","), Literal("="))
+  }
+
+  override def children: Seq[Expression] = Seq(text, pairDelim, 
keyValueDelim)
+
+  override def inputTypes: Seq[AbstractDataType] = Seq(StringType, 
StringType, StringType)
+
+  override def dataType: DataType = MapType(StringType, StringType, 
valueContainsNull = false)
+
+  override def nullSafeEval(str: Any, delim1: Any, delim2: Any): Any = {
+val array = str.asInstanceOf[UTF8String]
+  .split(delim1.asInstanceOf[UTF8String], -1)
+  .map{_.split(delim2.asInstanceOf[UTF8String], 2)}
+
+ArrayBasedMapData(array.map(_(0)), 
array.map(_(1))).asInstanceOf[MapData]
+  }
+
+  override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
+
+nullSafeCodeGen(ctx, ev, (text, delim1, delim2) => {
+  val arrayClass = classOf[GenericArrayData].getName
+  val mapClass = classOf[ArrayBasedMapData].getName
+  val keyArray = ctx.freshName("keyArray")
+  val valueArray = ctx.freshName("valueArray")
+  ctx.addMutableState("UTF8String[]", keyArray, s"this.$keyArray = 
null;")
+  ctx.addMutableState("UTF8String[]", valueArray, s"this.$valueArray = 
null;")
--- End diff --

Why don't we remove codegen from this? It is making the pull request more complicated, and in general I'm not sure how often these functions will be called, so the value of codegen isn't great. Plus, str_to_map is a pretty expensive function anyway.
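
If codegen were dropped, a minimal sketch (assuming the Spark 2.0 catalyst APIs quoted in the diff above; not the PR's final code) would mix in `CodegenFallback`, so the interpreted `nullSafeEval` path is used and `doGenCode` disappears entirely:

```scala
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback
import org.apache.spark.sql.catalyst.util.ArrayBasedMapData
import org.apache.spark.sql.types._
import org.apache.spark.unsafe.types.UTF8String

// Sketch only: same evaluation logic as the diff above, but with no hand-written
// codegen; CodegenFallback makes the generated code call back into eval().
case class StringToMap(text: Expression, pairDelim: Expression, keyValueDelim: Expression)
  extends TernaryExpression with ExpectsInputTypes with CodegenFallback {

  def this(child: Expression) = this(child, Literal(","), Literal("="))

  override def children: Seq[Expression] = Seq(text, pairDelim, keyValueDelim)
  override def inputTypes: Seq[AbstractDataType] = Seq(StringType, StringType, StringType)
  override def dataType: DataType = MapType(StringType, StringType, valueContainsNull = false)

  override def nullSafeEval(str: Any, delim1: Any, delim2: Any): Any = {
    // Split "k1=v1,k2=v2"-style input into pairs, then each pair into key/value.
    val pairs = str.asInstanceOf[UTF8String]
      .split(delim1.asInstanceOf[UTF8String], -1)
      .map(_.split(delim2.asInstanceOf[UTF8String], 2))
    ArrayBasedMapData(pairs.map(_(0)), pairs.map(_(1)))
  }

  override def prettyName: String = "str_to_map"
}
```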






[GitHub] spark pull request #13990: [SPARK-16287][SQL] Implement str_to_map SQL funct...

2016-07-05 Thread techaddict
Github user techaddict commented on a diff in the pull request:

https://github.com/apache/spark/pull/13990#discussion_r69675997
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
 ---
@@ -393,3 +393,71 @@ case class CreateNamedStructUnsafe(children: 
Seq[Expression]) extends Expression
 
   override def prettyName: String = "named_struct_unsafe"
 }
+
+/**
+ * Creates a map after splitting the input text into key/value pairs using 
delimeters
+ */
+@ExpressionDescription(
+  usage = """_FUNC_(text[, pairDelim, keyValueDelim]) - Creates a map 
after splitting the text into
+key/value pairs using delimiters.
+Default delimiters are ',' for pairDelim and '=' for keyValueDelim.""")
+case class StringToMap(text: Expression, pairDelim: Expression, 
keyValueDelim: Expression)
+  extends TernaryExpression with ExpectsInputTypes {
+
+  def this(child: Expression) = {
+this(child, Literal(","), Literal("="))
+  }
+
+  override def children: Seq[Expression] = Seq(text, pairDelim, 
keyValueDelim)
+
+  override def inputTypes: Seq[AbstractDataType] = Seq(StringType, 
StringType, StringType)
+
+  override def dataType: DataType = MapType(StringType, StringType, 
valueContainsNull = false)
+
+  override def nullSafeEval(str: Any, delim1: Any, delim2: Any): Any = {
+val array = str.asInstanceOf[UTF8String]
+  .split(delim1.asInstanceOf[UTF8String], -1)
+  .map{_.split(delim2.asInstanceOf[UTF8String], 2)}
+
+ArrayBasedMapData(array.map(_(0)), 
array.map(_(1))).asInstanceOf[MapData]
+  }
+
+  override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
+
+nullSafeCodeGen(ctx, ev, (text, delim1, delim2) => {
+  val arrayClass = classOf[GenericArrayData].getName
+  val mapClass = classOf[ArrayBasedMapData].getName
+  val keyArray = ctx.freshName("keyArray")
+  val valueArray = ctx.freshName("valueArray")
+  ctx.addMutableState("UTF8String[]", keyArray, s"this.$keyArray = 
null;")
+  ctx.addMutableState("UTF8String[]", valueArray, s"this.$valueArray = 
null;")
--- End diff --

ohh yes, makes sense. Made the change.





[GitHub] spark pull request #13494: [SPARK-15752] [SQL] support optimization for meta...

2016-07-05 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13494#discussion_r69675862
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/OptimizeMetadataOnlySuite.scala
 ---
@@ -0,0 +1,95 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution
+
+import org.apache.spark.sql._
+import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.test.SharedSQLContext
+
+class OptimizeMetadataOnlySuite extends QueryTest with SharedSQLContext {
+  import testImplicits._
+
+  override def beforeAll(): Unit = {
+super.beforeAll()
+val data = (1 to 10).map(i => (i, s"data-$i", i % 2, if ((i % 2) == 0) 
"even" else "odd"))
+  .toDF("id", "data", "partId", "part")
+data.write.partitionBy("partId", 
"part").mode("append").saveAsTable("srcpart_15752")
+  }
+
+  override protected def afterAll(): Unit = {
+try {
+  sql("DROP TABLE IF EXISTS srcpart_15752")
+} finally {
+  super.afterAll()
+}
+  }
+
+  private def checkWithMetadataOnly(df: DataFrame): Unit = {
+val localRelations = df.queryExecution.optimizedPlan.collect {
+  case l @ LocalRelation(_, _) => l
+}
+assert(localRelations.size == 1)
+  }
+
+  private def checkWithoutMetadataOnly(df: DataFrame): Unit = {
+val localRelations = df.queryExecution.optimizedPlan.collect{
+  case l @ LocalRelation(_, _) => l
+}
+assert(localRelations.size == 0)
+  }
+
+  test("spark-15752 metadata only optimizer for partition table") {
+withSQLConf(SQLConf.OPTIMIZER_METADATA_ONLY.key -> "true") {
+  checkWithMetadataOnly(sql("select part from srcpart_15752 where part 
= 0 group by part"))
+  checkWithMetadataOnly(sql("select max(part) from srcpart_15752"))
+  checkWithMetadataOnly(sql("select max(part) from srcpart_15752 where 
part = 0"))
+  checkWithMetadataOnly(
+sql("select part, min(partId) from srcpart_15752 where part = 0 
group by part"))
+  checkWithMetadataOnly(
+sql("select max(x) from (select part + 1 as x from srcpart_15752 
where part = 1) t"))
+  checkWithMetadataOnly(sql("select distinct part from srcpart_15752"))
+  checkWithMetadataOnly(sql("select distinct part, partId from 
srcpart_15752"))
+  checkWithMetadataOnly(
+sql("select distinct x from (select part + 1 as x from 
srcpart_15752 where part = 0) t"))
+
+  // Now donot support metadata only optimizer
+  checkWithoutMetadataOnly(sql("select part, max(id) from 
srcpart_15752 group by part"))
+  checkWithoutMetadataOnly(sql("select distinct part, id from 
srcpart_15752"))
+  checkWithoutMetadataOnly(sql("select part, sum(partId) from 
srcpart_15752 group by part"))
+  checkWithoutMetadataOnly(
+sql("select part from srcpart_15752 where part = 1 group by 
rollup(part)"))
+  checkWithoutMetadataOnly(
+sql("select part from (select part from srcpart_15752 where part = 
0 union all " +
+  "select part from srcpart_15752 where part= 1)t group by part"))
+}
+  }
+
+  test("spark-15752 without metadata only optimizer for partition table") {
+withSQLConf(SQLConf.OPTIMIZER_METADATA_ONLY.key -> "false") {
+  checkWithoutMetadataOnly(sql("select part from srcpart_15752 where 
part = 0 group by part"))
+  checkWithoutMetadataOnly(sql("select max(part) from srcpart_15752"))
--- End diff --

why isn't this one supported?





[GitHub] spark issue #14045: [SPARK-16362][SQL][WIP] Support ArrayType and StructType...

2016-07-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14045
  
**[Test build #61814 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61814/consoleFull)**
 for PR 14045 at commit 
[`ded41b2`](https://github.com/apache/spark/commit/ded41b2726a0905fded1a74c5005732d636e7d20).





[GitHub] spark pull request #13494: [SPARK-15752] [SQL] support optimization for meta...

2016-07-05 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13494#discussion_r69675796
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/OptimizeMetadataOnlySuite.scala
 ---
@@ -0,0 +1,95 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution
+
+import org.apache.spark.sql._
+import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.test.SharedSQLContext
+
+class OptimizeMetadataOnlySuite extends QueryTest with SharedSQLContext {
+  import testImplicits._
+
+  override def beforeAll(): Unit = {
+super.beforeAll()
+val data = (1 to 10).map(i => (i, s"data-$i", i % 2, if ((i % 2) == 0) 
"even" else "odd"))
+  .toDF("id", "data", "partId", "part")
+data.write.partitionBy("partId", 
"part").mode("append").saveAsTable("srcpart_15752")
+  }
+
+  override protected def afterAll(): Unit = {
+try {
+  sql("DROP TABLE IF EXISTS srcpart_15752")
+} finally {
+  super.afterAll()
+}
+  }
+
+  private def checkWithMetadataOnly(df: DataFrame): Unit = {
+val localRelations = df.queryExecution.optimizedPlan.collect {
+  case l @ LocalRelation(_, _) => l
+}
+assert(localRelations.size == 1)
+  }
+
+  private def checkWithoutMetadataOnly(df: DataFrame): Unit = {
+val localRelations = df.queryExecution.optimizedPlan.collect{
+  case l @ LocalRelation(_, _) => l
+}
+assert(localRelations.size == 0)
+  }
+
+  test("spark-15752 metadata only optimizer for partition table") {
--- End diff --

For example, you can have a test case for each of the categories you documented in the optimizer classdoc comment, and also have separate test cases for filter, project, etc.
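
A minimal sketch of that split, assuming the same `srcpart_15752` table and the `assertMetadataOnlyQuery` helper name suggested later in this review (illustrative only, not the final test code):

```scala
// Sketch only: one test per documented category instead of a single large case.
test("metadata-only: grouping by partition columns") {
  withSQLConf(SQLConf.OPTIMIZER_METADATA_ONLY.key -> "true") {
    assertMetadataOnlyQuery(sql("SELECT part FROM srcpart_15752 GROUP BY part"))
  }
}

test("metadata-only: distinct aggregate on partition columns") {
  withSQLConf(SQLConf.OPTIMIZER_METADATA_ONLY.key -> "true") {
    assertMetadataOnlyQuery(sql("SELECT count(DISTINCT partId) FROM srcpart_15752"))
  }
}

test("metadata-only: max/min on partition columns") {
  withSQLConf(SQLConf.OPTIMIZER_METADATA_ONLY.key -> "true") {
    assertMetadataOnlyQuery(sql("SELECT max(part) FROM srcpart_15752"))
  }
}
```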






[GitHub] spark pull request #13494: [SPARK-15752] [SQL] support optimization for meta...

2016-07-05 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13494#discussion_r69675767
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/OptimizeMetadataOnlySuite.scala
 ---
@@ -0,0 +1,95 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution
+
+import org.apache.spark.sql._
+import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.test.SharedSQLContext
+
+class OptimizeMetadataOnlySuite extends QueryTest with SharedSQLContext {
+  import testImplicits._
+
+  override def beforeAll(): Unit = {
+super.beforeAll()
+val data = (1 to 10).map(i => (i, s"data-$i", i % 2, if ((i % 2) == 0) 
"even" else "odd"))
+  .toDF("id", "data", "partId", "part")
+data.write.partitionBy("partId", 
"part").mode("append").saveAsTable("srcpart_15752")
+  }
+
+  override protected def afterAll(): Unit = {
+try {
+  sql("DROP TABLE IF EXISTS srcpart_15752")
+} finally {
+  super.afterAll()
+}
+  }
+
+  private def checkWithMetadataOnly(df: DataFrame): Unit = {
+val localRelations = df.queryExecution.optimizedPlan.collect {
+  case l @ LocalRelation(_, _) => l
+}
+assert(localRelations.size == 1)
+  }
+
+  private def checkWithoutMetadataOnly(df: DataFrame): Unit = {
+val localRelations = df.queryExecution.optimizedPlan.collect{
+  case l @ LocalRelation(_, _) => l
+}
+assert(localRelations.size == 0)
+  }
+
+  test("spark-15752 metadata only optimizer for partition table") {
--- End diff --

it'd be great to break this into multiple cases, rather than simply "yes" 
vs "no".






[GitHub] spark pull request #13990: [SPARK-16287][SQL] Implement str_to_map SQL funct...

2016-07-05 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/13990#discussion_r69675764
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
 ---
@@ -393,3 +393,71 @@ case class CreateNamedStructUnsafe(children: 
Seq[Expression]) extends Expression
 
   override def prettyName: String = "named_struct_unsafe"
 }
+
+/**
+ * Creates a map after splitting the input text into key/value pairs using 
delimeters
+ */
+@ExpressionDescription(
+  usage = """_FUNC_(text[, pairDelim, keyValueDelim]) - Creates a map 
after splitting the text into
+key/value pairs using delimiters.
+Default delimiters are ',' for pairDelim and '=' for keyValueDelim.""")
+case class StringToMap(text: Expression, pairDelim: Expression, 
keyValueDelim: Expression)
+  extends TernaryExpression with ExpectsInputTypes {
+
+  def this(child: Expression) = {
+this(child, Literal(","), Literal("="))
+  }
+
+  override def children: Seq[Expression] = Seq(text, pairDelim, 
keyValueDelim)
+
+  override def inputTypes: Seq[AbstractDataType] = Seq(StringType, 
StringType, StringType)
+
+  override def dataType: DataType = MapType(StringType, StringType, 
valueContainsNull = false)
+
+  override def nullSafeEval(str: Any, delim1: Any, delim2: Any): Any = {
+val array = str.asInstanceOf[UTF8String]
+  .split(delim1.asInstanceOf[UTF8String], -1)
+  .map{_.split(delim2.asInstanceOf[UTF8String], 2)}
+
+ArrayBasedMapData(array.map(_(0)), 
array.map(_(1))).asInstanceOf[MapData]
+  }
+
+  override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
+
+nullSafeCodeGen(ctx, ev, (text, delim1, delim2) => {
+  val arrayClass = classOf[GenericArrayData].getName
+  val mapClass = classOf[ArrayBasedMapData].getName
+  val keyArray = ctx.freshName("keyArray")
+  val valueArray = ctx.freshName("valueArray")
+  ctx.addMutableState("UTF8String[]", keyArray, s"this.$keyArray = 
null;")
+  ctx.addMutableState("UTF8String[]", valueArray, s"this.$valueArray = 
null;")
--- End diff --

It's used to split the code into some small blocks. BTW, why would you do `this.keyArray = null;`? We can declare `keyArray` as a local variable.
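
A rough sketch of that suggestion (an assumption about the shape, not the PR's final code): the arrays become locals emitted inside the generated block, replacing the `addMutableState` calls in `StringToMap.doGenCode` from the diff above.

```scala
override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
  nullSafeCodeGen(ctx, ev, (text, pd, kvd) => {
    val keyArray = ctx.freshName("keyArray")     // local in the generated Java
    val valueArray = ctx.freshName("valueArray") // local in the generated Java
    s"""
       |UTF8String[] $keyArray = null;
       |UTF8String[] $valueArray = null;
       |// ... split $text on $pd, split each pair on $kvd, fill the two locals,
       |// then build the map value for ${ev.value} ...
     """.stripMargin
  })
}
```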





[GitHub] spark pull request #13494: [SPARK-15752] [SQL] support optimization for meta...

2016-07-05 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13494#discussion_r69675675
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/OptimizeMetadataOnlySuite.scala
 ---
@@ -0,0 +1,95 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution
+
+import org.apache.spark.sql._
+import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.test.SharedSQLContext
+
+class OptimizeMetadataOnlySuite extends QueryTest with SharedSQLContext {
+  import testImplicits._
+
+  override def beforeAll(): Unit = {
+super.beforeAll()
+val data = (1 to 10).map(i => (i, s"data-$i", i % 2, if ((i % 2) == 0) 
"even" else "odd"))
+  .toDF("id", "data", "partId", "part")
+data.write.partitionBy("partId", 
"part").mode("append").saveAsTable("srcpart_15752")
+  }
+
+  override protected def afterAll(): Unit = {
+try {
+  sql("DROP TABLE IF EXISTS srcpart_15752")
+} finally {
+  super.afterAll()
+}
+  }
+
+  private def checkWithMetadataOnly(df: DataFrame): Unit = {
+val localRelations = df.queryExecution.optimizedPlan.collect {
+  case l @ LocalRelation(_, _) => l
+}
+assert(localRelations.size == 1)
+  }
+
+  private def checkWithoutMetadataOnly(df: DataFrame): Unit = {
+val localRelations = df.queryExecution.optimizedPlan.collect{
--- End diff --

add a space after collect





[GitHub] spark pull request #13494: [SPARK-15752] [SQL] support optimization for meta...

2016-07-05 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13494#discussion_r69675663
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/OptimizeMetadataOnlySuite.scala
 ---
@@ -0,0 +1,95 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution
+
+import org.apache.spark.sql._
+import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.test.SharedSQLContext
+
+class OptimizeMetadataOnlySuite extends QueryTest with SharedSQLContext {
+  import testImplicits._
+
+  override def beforeAll(): Unit = {
+super.beforeAll()
+val data = (1 to 10).map(i => (i, s"data-$i", i % 2, if ((i % 2) == 0) 
"even" else "odd"))
+  .toDF("id", "data", "partId", "part")
+data.write.partitionBy("partId", 
"part").mode("append").saveAsTable("srcpart_15752")
+  }
+
+  override protected def afterAll(): Unit = {
+try {
+  sql("DROP TABLE IF EXISTS srcpart_15752")
+} finally {
+  super.afterAll()
+}
+  }
+
+  private def checkWithMetadataOnly(df: DataFrame): Unit = {
+val localRelations = df.queryExecution.optimizedPlan.collect {
+  case l @ LocalRelation(_, _) => l
+}
+assert(localRelations.size == 1)
+  }
+
+  private def checkWithoutMetadataOnly(df: DataFrame): Unit = {
--- End diff --

assertNotMetadataOnlyQuery





[GitHub] spark pull request #13494: [SPARK-15752] [SQL] support optimization for meta...

2016-07-05 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13494#discussion_r69675655
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/OptimizeMetadataOnlySuite.scala
 ---
@@ -0,0 +1,95 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution
+
+import org.apache.spark.sql._
+import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.test.SharedSQLContext
+
+class OptimizeMetadataOnlySuite extends QueryTest with SharedSQLContext {
+  import testImplicits._
+
+  override def beforeAll(): Unit = {
+super.beforeAll()
+val data = (1 to 10).map(i => (i, s"data-$i", i % 2, if ((i % 2) == 0) 
"even" else "odd"))
+  .toDF("id", "data", "partId", "part")
+data.write.partitionBy("partId", 
"part").mode("append").saveAsTable("srcpart_15752")
+  }
+
+  override protected def afterAll(): Unit = {
+try {
+  sql("DROP TABLE IF EXISTS srcpart_15752")
+} finally {
+  super.afterAll()
+}
+  }
+
+  private def checkWithMetadataOnly(df: DataFrame): Unit = {
--- End diff --

assertMetadataOnlyQuery





[GitHub] spark pull request #13494: [SPARK-15752] [SQL] support optimization for meta...

2016-07-05 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13494#discussion_r69675558
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/OptimizeMetadataOnly.scala
 ---
@@ -0,0 +1,133 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.{CatalystConf, InternalRow}
+import org.apache.spark.sql.catalyst.catalog.{CatalogRelation, 
SessionCatalog}
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.aggregate._
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.execution.datasources.{HadoopFsRelation, 
LogicalRelation}
+
+/**
+ * When scanning only partition columns, get results based on partition 
data without scanning files.
+ * It's used for operators that only need distinct values. Currently only 
[[Aggregate]] operator
+ * which satisfy the following conditions are supported:
+ * 1. aggregate expression is partition columns.
+ *  e.g. SELECT col FROM tbl GROUP BY col.
+ * 2. aggregate function on partition columns with DISTINCT.
+ *  e.g. SELECT count(DISTINCT col1) FROM tbl GROUP BY col2.
+ * 3. aggregate function on partition columns which have same result w or 
w/o DISTINCT keyword.
+ *  e.g. SELECT Max(col2) FROM tbl GROUP BY col1.
+ */
+case class OptimizeMetadataOnly(
+catalog: SessionCatalog,
+conf: CatalystConf) extends Rule[LogicalPlan] {
+
+  def apply(plan: LogicalPlan): LogicalPlan = {
+if (!conf.optimizerMetadataOnly) {
+  return plan
+}
+
+plan.transform {
+  case a @ Aggregate(_, aggExprs, child @ 
PartitionedRelation(partAttrs, relation)) =>
+if (a.references.subsetOf(partAttrs)) {
+  val aggFunctions = aggExprs.flatMap(_.collect {
+case agg: AggregateExpression => agg
+  })
+  val isPartitionDataOnly = aggFunctions.isEmpty || 
aggFunctions.forall { agg =>
+agg.isDistinct || (agg.aggregateFunction match {
+  case _: Max => true
+  case _: Min => true
+  case _ => false
+})
+  }
+  if (isPartitionDataOnly) {
+a.withNewChildren(Seq(usePartitionData(child, relation)))
+  } else {
+a
+  }
+} else {
+  a
+}
+}
+  }
+
+  private def usePartitionData(child: LogicalPlan, relation: LogicalPlan): 
LogicalPlan = {
+child transform {
+  case plan if plan eq relation =>
+relation match {
+  case l @ LogicalRelation(fsRelation: HadoopFsRelation, _, _) =>
+val partColumns = 
fsRelation.partitionSchema.map(_.name.toLowerCase).toSet
+val partAttrs = l.output.filter(a => 
partColumns.contains(a.name.toLowerCase))
+val partitionData = fsRelation.location.listFiles(Nil)
+LocalRelation(partAttrs, partitionData.map(_.values))
+
+  case relation: CatalogRelation =>
+val partColumns = 
relation.catalogTable.partitionColumnNames.map(_.toLowerCase).toSet
+val partAttrs = relation.output.filter(a => 
partColumns.contains(a.name.toLowerCase))
+val partitionData = 
catalog.listPartitions(relation.catalogTable.identifier).map { p =>
+  InternalRow.fromSeq(partAttrs.map { attr =>
+Cast(Literal(p.spec(attr.name)), attr.dataType).eval()
+  })
+}
+LocalRelation(partAttrs, partitionData)
+
+  case _ => throw new IllegalStateException()
+}
+}
+  }
+
+  object PartitionedRelation {
+def unapply(plan: LogicalPlan): Option[(AttributeSet, LogicalPlan)] = 

[GitHub] spark issue #13926: [SPARK-16229] [SQL] Drop Empty Table After CREATE TABLE ...

2016-07-05 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/13926
  
does CreateDataSourceTableAsSelectCommand have this problem?





[GitHub] spark pull request #13494: [SPARK-15752] [SQL] support optimization for meta...

2016-07-05 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13494#discussion_r69675543
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/OptimizeMetadataOnly.scala
 ---
@@ -0,0 +1,133 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.{CatalystConf, InternalRow}
+import org.apache.spark.sql.catalyst.catalog.{CatalogRelation, 
SessionCatalog}
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.aggregate._
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.execution.datasources.{HadoopFsRelation, 
LogicalRelation}
+
+/**
+ * When scanning only partition columns, get results based on partition 
data without scanning files.
+ * It's used for operators that only need distinct values. Currently only 
[[Aggregate]] operator
+ * which satisfy the following conditions are supported:
+ * 1. aggregate expression is partition columns.
+ *  e.g. SELECT col FROM tbl GROUP BY col.
+ * 2. aggregate function on partition columns with DISTINCT.
+ *  e.g. SELECT count(DISTINCT col1) FROM tbl GROUP BY col2.
+ * 3. aggregate function on partition columns which have same result w or 
w/o DISTINCT keyword.
+ *  e.g. SELECT Max(col2) FROM tbl GROUP BY col1.
+ */
+case class OptimizeMetadataOnly(
+catalog: SessionCatalog,
+conf: CatalystConf) extends Rule[LogicalPlan] {
+
+  def apply(plan: LogicalPlan): LogicalPlan = {
+if (!conf.optimizerMetadataOnly) {
+  return plan
+}
+
+plan.transform {
+  case a @ Aggregate(_, aggExprs, child @ 
PartitionedRelation(partAttrs, relation)) =>
+if (a.references.subsetOf(partAttrs)) {
+  val aggFunctions = aggExprs.flatMap(_.collect {
+case agg: AggregateExpression => agg
+  })
+  val isPartitionDataOnly = aggFunctions.isEmpty || 
aggFunctions.forall { agg =>
+agg.isDistinct || (agg.aggregateFunction match {
+  case _: Max => true
+  case _: Min => true
+  case _ => false
+})
+  }
+  if (isPartitionDataOnly) {
+a.withNewChildren(Seq(usePartitionData(child, relation)))
+  } else {
+a
+  }
+} else {
+  a
+}
+}
+  }
+
+  private def usePartitionData(child: LogicalPlan, relation: LogicalPlan): 
LogicalPlan = {
+child transform {
+  case plan if plan eq relation =>
+relation match {
+  case l @ LogicalRelation(fsRelation: HadoopFsRelation, _, _) =>
+val partColumns = 
fsRelation.partitionSchema.map(_.name.toLowerCase).toSet
+val partAttrs = l.output.filter(a => 
partColumns.contains(a.name.toLowerCase))
+val partitionData = fsRelation.location.listFiles(Nil)
+LocalRelation(partAttrs, partitionData.map(_.values))
+
+  case relation: CatalogRelation =>
+val partColumns = 
relation.catalogTable.partitionColumnNames.map(_.toLowerCase).toSet
+val partAttrs = relation.output.filter(a => 
partColumns.contains(a.name.toLowerCase))
+val partitionData = 
catalog.listPartitions(relation.catalogTable.identifier).map { p =>
+  InternalRow.fromSeq(partAttrs.map { attr =>
+Cast(Literal(p.spec(attr.name)), attr.dataType).eval()
+  })
+}
+LocalRelation(partAttrs, partitionData)
+
+  case _ => throw new IllegalStateException()
+}
+}
+  }
+
+  object PartitionedRelation {
+def unapply(plan: LogicalPlan): Option[(AttributeSet, LogicalPlan)] = 

[GitHub] spark pull request #14004: [SPARK-16285][SQL] Implement sentences SQL functi...

2016-07-05 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14004#discussion_r69675532
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
 ---
@@ -198,6 +200,35 @@ case class StringSplit(str: Expression, pattern: 
Expression)
   override def prettyName: String = "split"
 }
 
+/**
+ * Splits a string into arrays of sentences, where each sentence is an 
array of words.
+ * The 'lang' and 'country' arguments are optional, and if omitted, the 
default locale is used.
--- End diff --

```
hive> SELECT sentences('hi there', 'x', '');
OK
_c0
[["hi","there"]]
```





[GitHub] spark issue #14063: [MINOR][PySpark][DOC] Fix code examples of SparkSession ...

2016-07-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14063
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61810/
Test PASSed.





[GitHub] spark issue #14060: [SPARK-16340][SQL] Support column arguments for `regexp_...

2016-07-05 Thread MukulSoul
Github user MukulSoul commented on the issue:

https://github.com/apache/spark/pull/14060
  
@dongjoon-hyun: Thanks for taking the case. It looks good.





[GitHub] spark pull request #14004: [SPARK-16285][SQL] Implement sentences SQL functi...

2016-07-05 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14004#discussion_r69675519
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
 ---
@@ -198,6 +200,35 @@ case class StringSplit(str: Expression, pattern: 
Expression)
   override def prettyName: String = "split"
 }
 
+/**
+ * Splits a string into arrays of sentences, where each sentence is an 
array of words.
+ * The 'lang' and 'country' arguments are optional, and if omitted, the 
default locale is used.
--- End diff --

Yep. It's Hive behavior. Hive supports `sentences(str)`, `sentences(str,lang)`, and `sentences(str,lang,country)`.
And Hive does not raise an exception for illegal `lang` and `country` values.
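
For illustration, the three call shapes look like this (a sketch assuming `spark` is a `SparkSession` with this PR's `sentences` function available; output omitted):

```scala
// Sketch only: the three Hive-compatible argument shapes described above.
// An illegal lang/country pair falls back to the default locale instead of failing.
spark.sql("SELECT sentences('Hi there! How are you?')").show(false)
spark.sql("SELECT sentences('Hi there! How are you?', 'en')").show(false)
spark.sql("SELECT sentences('Hi there! How are you?', 'en', 'US')").show(false)
```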





[GitHub] spark issue #14063: [MINOR][PySpark][DOC] Fix code examples of SparkSession ...

2016-07-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14063
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14035: [SPARK-16356][ML] Add testImplicits for ML unit tests an...

2016-07-05 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/14035
  
@mengxr, @yanboliang, Could you review this?





[GitHub] spark issue #14063: [MINOR][PySpark][DOC] Fix code examples of SparkSession ...

2016-07-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14063
  
**[Test build #61810 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61810/consoleFull)**
 for PR 14063 at commit 
[`9edca4f`](https://github.com/apache/spark/commit/9edca4f060cc3fac304e89234b0ba6c49291145a).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #13990: [SPARK-16287][SQL] Implement str_to_map SQL funct...

2016-07-05 Thread techaddict
Github user techaddict commented on a diff in the pull request:

https://github.com/apache/spark/pull/13990#discussion_r69675457
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
 ---
@@ -393,3 +393,71 @@ case class CreateNamedStructUnsafe(children: 
Seq[Expression]) extends Expression
 
   override def prettyName: String = "named_struct_unsafe"
 }
+
+/**
+ * Creates a map after splitting the input text into key/value pairs using 
delimeters
+ */
+@ExpressionDescription(
+  usage = """_FUNC_(text[, pairDelim, keyValueDelim]) - Creates a map 
after splitting the text into
+key/value pairs using delimiters.
+Default delimiters are ',' for pairDelim and '=' for keyValueDelim.""")
+case class StringToMap(text: Expression, pairDelim: Expression, 
keyValueDelim: Expression)
+  extends TernaryExpression with ExpectsInputTypes {
+
+  def this(child: Expression) = {
+this(child, Literal(","), Literal("="))
+  }
+
+  override def children: Seq[Expression] = Seq(text, pairDelim, 
keyValueDelim)
+
+  override def inputTypes: Seq[AbstractDataType] = Seq(StringType, 
StringType, StringType)
+
+  override def dataType: DataType = MapType(StringType, StringType, 
valueContainsNull = false)
+
+  override def nullSafeEval(str: Any, delim1: Any, delim2: Any): Any = {
+val array = str.asInstanceOf[UTF8String]
+  .split(delim1.asInstanceOf[UTF8String], -1)
+  .map{_.split(delim2.asInstanceOf[UTF8String], 2)}
+
+ArrayBasedMapData(array.map(_(0)), 
array.map(_(1))).asInstanceOf[MapData]
+  }
+
+  override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
+
+nullSafeCodeGen(ctx, ev, (text, delim1, delim2) => {
+  val arrayClass = classOf[GenericArrayData].getName
+  val mapClass = classOf[ArrayBasedMapData].getName
+  val keyArray = ctx.freshName("keyArray")
+  val valueArray = ctx.freshName("valueArray")
+  ctx.addMutableState("UTF8String[]", keyArray, s"this.$keyArray = 
null;")
+  ctx.addMutableState("UTF8String[]", valueArray, s"this.$valueArray = 
null;")
--- End diff --

And we are doing similar stuff in `CreateMap`, which does the same thing:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala#L129





[GitHub] spark issue #14045: [SPARK-16362][SQL][WIP] Support ArrayType and StructType...

2016-07-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14045
  
**[Test build #61813 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61813/consoleFull)**
 for PR 14045 at commit 
[`4dca939`](https://github.com/apache/spark/commit/4dca939b131cdab042f437236b108686d12f9d94).





[GitHub] spark pull request #13990: [SPARK-16287][SQL] Implement str_to_map SQL funct...

2016-07-05 Thread techaddict
Github user techaddict commented on a diff in the pull request:

https://github.com/apache/spark/pull/13990#discussion_r69675325
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
 ---
@@ -393,3 +393,71 @@ case class CreateNamedStructUnsafe(children: 
Seq[Expression]) extends Expression
 
   override def prettyName: String = "named_struct_unsafe"
 }
+
+/**
+ * Creates a map after splitting the input text into key/value pairs using 
delimeters
+ */
+@ExpressionDescription(
+  usage = """_FUNC_(text[, pairDelim, keyValueDelim]) - Creates a map 
after splitting the text into
+key/value pairs using delimiters.
+Default delimiters are ',' for pairDelim and '=' for keyValueDelim.""")
+case class StringToMap(text: Expression, pairDelim: Expression, 
keyValueDelim: Expression)
+  extends TernaryExpression with ExpectsInputTypes {
+
+  def this(child: Expression) = {
+this(child, Literal(","), Literal("="))
+  }
+
+  override def children: Seq[Expression] = Seq(text, pairDelim, 
keyValueDelim)
+
+  override def inputTypes: Seq[AbstractDataType] = Seq(StringType, 
StringType, StringType)
+
+  override def dataType: DataType = MapType(StringType, StringType, 
valueContainsNull = false)
+
+  override def nullSafeEval(str: Any, delim1: Any, delim2: Any): Any = {
+val array = str.asInstanceOf[UTF8String]
+  .split(delim1.asInstanceOf[UTF8String], -1)
+  .map{_.split(delim2.asInstanceOf[UTF8String], 2)}
+
+ArrayBasedMapData(array.map(_(0)), 
array.map(_(1))).asInstanceOf[MapData]
+  }
+
+  override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
+
+nullSafeCodeGen(ctx, ev, (text, delim1, delim2) => {
+  val arrayClass = classOf[GenericArrayData].getName
+  val mapClass = classOf[ArrayBasedMapData].getName
+  val keyArray = ctx.freshName("keyArray")
+  val valueArray = ctx.freshName("valueArray")
+  ctx.addMutableState("UTF8String[]", keyArray, s"this.$keyArray = 
null;")
+  ctx.addMutableState("UTF8String[]", valueArray, s"this.$valueArray = 
null;")
--- End diff --

I get

```
Caused by: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 56, Column 16:
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection" has no field "keyArray"
```

for

```java
this.keyArray = null;
```





[GitHub] spark pull request #13494: [SPARK-15752] [SQL] support optimization for meta...

2016-07-05 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/13494#discussion_r69675141
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/OptimizeMetadataOnly.scala
 ---
@@ -0,0 +1,133 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.{CatalystConf, InternalRow}
+import org.apache.spark.sql.catalyst.catalog.{CatalogRelation, 
SessionCatalog}
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.aggregate._
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.execution.datasources.{HadoopFsRelation, 
LogicalRelation}
+
+/**
+ * When scanning only partition columns, get results based on partition 
data without scanning files.
+ * It's used for operators that only need distinct values. Currently only 
[[Aggregate]] operator
+ * which satisfy the following conditions are supported:
+ * 1. aggregate expression is partition columns.
+ *  e.g. SELECT col FROM tbl GROUP BY col.
+ * 2. aggregate function on partition columns with DISTINCT.
+ *  e.g. SELECT count(DISTINCT col1) FROM tbl GROUP BY col2.
+ * 3. aggregate function on partition columns which have same result w or 
w/o DISTINCT keyword.
+ *  e.g. SELECT Max(col2) FROM tbl GROUP BY col1.
+ */
+case class OptimizeMetadataOnly(
+catalog: SessionCatalog,
+conf: CatalystConf) extends Rule[LogicalPlan] {
+
+  def apply(plan: LogicalPlan): LogicalPlan = {
+if (!conf.optimizerMetadataOnly) {
+  return plan
+}
+
+plan.transform {
+  case a @ Aggregate(_, aggExprs, child @ 
PartitionedRelation(partAttrs, relation)) =>
+if (a.references.subsetOf(partAttrs)) {
+  val aggFunctions = aggExprs.flatMap(_.collect {
+case agg: AggregateExpression => agg
+  })
+  val isPartitionDataOnly = aggFunctions.isEmpty || 
aggFunctions.forall { agg =>
+agg.isDistinct || (agg.aggregateFunction match {
--- End diff --

It can't pass analysis... `col2` is not a grouping column and must be put inside an agg function.
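
For illustration, here is a minimal standalone sketch of the analysis behaviour described above; the local session and the toy `tbl` view are assumptions for the example, not part of this PR:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: a non-grouping column has to appear inside an aggregate function.
val spark = SparkSession.builder().master("local[*]").appName("group-by-demo").getOrCreate()
import spark.implicits._
Seq((1, "a"), (2, "b")).toDF("col1", "col2").createOrReplaceTempView("tbl")

// Fails analysis: col2 is neither a grouping column nor wrapped in an aggregate.
// spark.sql("SELECT col2 FROM tbl GROUP BY col1")

// Passes analysis: col2 is inside an aggregate function.
spark.sql("SELECT max(col2) FROM tbl GROUP BY col1").show()
```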





[GitHub] spark pull request #14014: [SPARK-16344][SQL] Decoding Parquet array of stru...

2016-07-05 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/14014#discussion_r69675124
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala
 ---
@@ -482,13 +482,106 @@ private[parquet] class ParquetRowConverter(
  */
 // scalastyle:on
 private def isElementType(
-parquetRepeatedType: Type, catalystElementType: DataType, 
parentName: String): Boolean = {
+parquetRepeatedType: Type, catalystElementType: DataType, parent: 
GroupType): Boolean = {
+
+  def isStandardListLayout(t: GroupType): Boolean =
--- End diff --

Done.





[GitHub] spark pull request #13990: [SPARK-16287][SQL] Implement str_to_map SQL funct...

2016-07-05 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/13990#discussion_r69675090
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
 ---
@@ -393,3 +393,71 @@ case class CreateNamedStructUnsafe(children: 
Seq[Expression]) extends Expression
 
   override def prettyName: String = "named_struct_unsafe"
 }
+
+/**
+ * Creates a map after splitting the input text into key/value pairs using 
delimeters
+ */
+@ExpressionDescription(
+  usage = """_FUNC_(text[, pairDelim, keyValueDelim]) - Creates a map 
after splitting the text into
+key/value pairs using delimiters.
+Default delimiters are ',' for pairDelim and '=' for keyValueDelim.""")
+case class StringToMap(text: Expression, pairDelim: Expression, 
keyValueDelim: Expression)
+  extends TernaryExpression with ExpectsInputTypes {
+
+  def this(child: Expression) = {
+this(child, Literal(","), Literal("="))
+  }
+
+  override def children: Seq[Expression] = Seq(text, pairDelim, 
keyValueDelim)
+
+  override def inputTypes: Seq[AbstractDataType] = Seq(StringType, 
StringType, StringType)
+
+  override def dataType: DataType = MapType(StringType, StringType, 
valueContainsNull = false)
+
+  override def nullSafeEval(str: Any, delim1: Any, delim2: Any): Any = {
+val array = str.asInstanceOf[UTF8String]
+  .split(delim1.asInstanceOf[UTF8String], -1)
+  .map{_.split(delim2.asInstanceOf[UTF8String], 2)}
+
+ArrayBasedMapData(array.map(_(0)), 
array.map(_(1))).asInstanceOf[MapData]
+  }
+
+  override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
+
+nullSafeCodeGen(ctx, ev, (text, delim1, delim2) => {
+  val arrayClass = classOf[GenericArrayData].getName
+  val mapClass = classOf[ArrayBasedMapData].getName
+  val keyArray = ctx.freshName("keyArray")
+  val valueArray = ctx.freshName("valueArray")
+  ctx.addMutableState("UTF8String[]", keyArray, s"this.$keyArray = 
null;")
+  ctx.addMutableState("UTF8String[]", valueArray, s"this.$valueArray = 
null;")
--- End diff --

Really? Aren't they in the same code block? Java variables are always mutable.





[GitHub] spark pull request #13494: [SPARK-15752] [SQL] support optimization for meta...

2016-07-05 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13494#discussion_r69675086
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/OptimizeMetadataOnly.scala
 ---
@@ -0,0 +1,133 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.{CatalystConf, InternalRow}
+import org.apache.spark.sql.catalyst.catalog.{CatalogRelation, 
SessionCatalog}
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.aggregate._
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.execution.datasources.{HadoopFsRelation, 
LogicalRelation}
+
+/**
+ * When scanning only partition columns, get results based on partition 
data without scanning files.
+ * It's used for operators that only need distinct values. Currently only 
[[Aggregate]] operator
+ * which satisfy the following conditions are supported:
+ * 1. aggregate expression is partition columns.
+ *  e.g. SELECT col FROM tbl GROUP BY col.
+ * 2. aggregate function on partition columns with DISTINCT.
+ *  e.g. SELECT count(DISTINCT col1) FROM tbl GROUP BY col2.
+ * 3. aggregate function on partition columns which have same result w or 
w/o DISTINCT keyword.
+ *  e.g. SELECT Max(col2) FROM tbl GROUP BY col1.
+ */
+case class OptimizeMetadataOnly(
+catalog: SessionCatalog,
+conf: CatalystConf) extends Rule[LogicalPlan] {
+
+  def apply(plan: LogicalPlan): LogicalPlan = {
+if (!conf.optimizerMetadataOnly) {
+  return plan
+}
+
+plan.transform {
+  case a @ Aggregate(_, aggExprs, child @ 
PartitionedRelation(partAttrs, relation)) =>
+if (a.references.subsetOf(partAttrs)) {
+  val aggFunctions = aggExprs.flatMap(_.collect {
+case agg: AggregateExpression => agg
+  })
+  val isPartitionDataOnly = aggFunctions.isEmpty || 
aggFunctions.forall { agg =>
+agg.isDistinct || (agg.aggregateFunction match {
+  case _: Max => true
+  case _: Min => true
+  case _ => false
+})
+  }
+  if (isPartitionDataOnly) {
+a.withNewChildren(Seq(usePartitionData(child, relation)))
+  } else {
+a
+  }
+} else {
+  a
+}
+}
+  }
+
+  private def usePartitionData(child: LogicalPlan, relation: LogicalPlan): 
LogicalPlan = {
+child transform {
+  case plan if plan eq relation =>
+relation match {
+  case l @ LogicalRelation(fsRelation: HadoopFsRelation, _, _) =>
+val partColumns = 
fsRelation.partitionSchema.map(_.name.toLowerCase).toSet
+val partAttrs = l.output.filter(a => 
partColumns.contains(a.name.toLowerCase))
+val partitionData = fsRelation.location.listFiles(Nil)
+LocalRelation(partAttrs, partitionData.map(_.values))
+
+  case relation: CatalogRelation =>
+val partColumns = 
relation.catalogTable.partitionColumnNames.map(_.toLowerCase).toSet
+val partAttrs = relation.output.filter(a => 
partColumns.contains(a.name.toLowerCase))
+val partitionData = 
catalog.listPartitions(relation.catalogTable.identifier).map { p =>
+  InternalRow.fromSeq(partAttrs.map { attr =>
+Cast(Literal(p.spec(attr.name)), attr.dataType).eval()
+  })
+}
+LocalRelation(partAttrs, partitionData)
+
+  case _ => throw new IllegalStateException()
+}
+}
+  }
+
+  object PartitionedRelation {
--- End diff --

we should also explain what patterns are 

[GitHub] spark pull request #14004: [SPARK-16285][SQL] Implement sentences SQL functi...

2016-07-05 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14004#discussion_r69674992
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
 ---
@@ -198,6 +200,35 @@ case class StringSplit(str: Expression, pattern: 
Expression)
   override def prettyName: String = "split"
 }
 
+/**
+ * Splits a string into arrays of sentences, where each sentence is an 
array of words.
+ * The 'lang' and 'country' arguments are optional, and if omitted, the 
default locale is used.
--- End diff --

Can you check Hive's behaviour? Is it legal to omit only one of them? And will Hive throw an exception if 'lang' or 'country' is illegal?
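
Not an answer for Hive itself, but a small sketch of how `java.util.Locale` behaves when parts are omitted or invalid, which is the behaviour the `getLocale` helper above wraps; the `localeOf` name is made up for the example:

```scala
import java.util.Locale

// Hypothetical helper, for illustration only: fall back step by step when
// language and/or country are missing. Note that Locale itself does not
// validate codes, so "illegal" values do not throw by themselves.
def localeOf(language: String, country: String): Locale =
  if (language == null) Locale.getDefault
  else if (country == null) new Locale(language)
  else new Locale(language, country)

println(localeOf("en", "US"))  // en_US
println(localeOf("en", null))  // en
println(localeOf(null, null))  // the JVM default locale
```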





[GitHub] spark pull request #13494: [SPARK-15752] [SQL] support optimization for meta...

2016-07-05 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13494#discussion_r69674974
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/OptimizeMetadataOnly.scala
 ---
@@ -0,0 +1,133 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.{CatalystConf, InternalRow}
+import org.apache.spark.sql.catalyst.catalog.{CatalogRelation, 
SessionCatalog}
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.aggregate._
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.execution.datasources.{HadoopFsRelation, 
LogicalRelation}
+
+/**
+ * When scanning only partition columns, get results based on partition 
data without scanning files.
+ * It's used for operators that only need distinct values. Currently only 
[[Aggregate]] operator
+ * which satisfy the following conditions are supported:
+ * 1. aggregate expression is partition columns.
+ *  e.g. SELECT col FROM tbl GROUP BY col.
+ * 2. aggregate function on partition columns with DISTINCT.
+ *  e.g. SELECT count(DISTINCT col1) FROM tbl GROUP BY col2.
+ * 3. aggregate function on partition columns which have same result w or 
w/o DISTINCT keyword.
+ *  e.g. SELECT Max(col2) FROM tbl GROUP BY col1.
+ */
+case class OptimizeMetadataOnly(
+catalog: SessionCatalog,
+conf: CatalystConf) extends Rule[LogicalPlan] {
+
+  def apply(plan: LogicalPlan): LogicalPlan = {
+if (!conf.optimizerMetadataOnly) {
+  return plan
+}
+
+plan.transform {
+  case a @ Aggregate(_, aggExprs, child @ 
PartitionedRelation(partAttrs, relation)) =>
+if (a.references.subsetOf(partAttrs)) {
+  val aggFunctions = aggExprs.flatMap(_.collect {
+case agg: AggregateExpression => agg
+  })
+  val isPartitionDataOnly = aggFunctions.isEmpty || 
aggFunctions.forall { agg =>
+agg.isDistinct || (agg.aggregateFunction match {
+  case _: Max => true
+  case _: Min => true
+  case _ => false
+})
+  }
+  if (isPartitionDataOnly) {
+a.withNewChildren(Seq(usePartitionData(child, relation)))
+  } else {
+a
+  }
+} else {
+  a
+}
+}
+  }
+
+  private def usePartitionData(child: LogicalPlan, relation: LogicalPlan): 
LogicalPlan = {
+child transform {
+  case plan if plan eq relation =>
+relation match {
+  case l @ LogicalRelation(fsRelation: HadoopFsRelation, _, _) =>
+val partColumns = 
fsRelation.partitionSchema.map(_.name.toLowerCase).toSet
+val partAttrs = l.output.filter(a => 
partColumns.contains(a.name.toLowerCase))
+val partitionData = fsRelation.location.listFiles(Nil)
+LocalRelation(partAttrs, partitionData.map(_.values))
+
+  case relation: CatalogRelation =>
+val partColumns = 
relation.catalogTable.partitionColumnNames.map(_.toLowerCase).toSet
+val partAttrs = relation.output.filter(a => 
partColumns.contains(a.name.toLowerCase))
+val partitionData = 
catalog.listPartitions(relation.catalogTable.identifier).map { p =>
+  InternalRow.fromSeq(partAttrs.map { attr =>
+Cast(Literal(p.spec(attr.name)), attr.dataType).eval()
+  })
+}
+LocalRelation(partAttrs, partitionData)
+
+  case _ => throw new IllegalStateException()
+}
+}
+  }
+
+  object PartitionedRelation {
--- End diff --

We need a comment explaining what this does.



[GitHub] spark pull request #13494: [SPARK-15752] [SQL] support optimization for meta...

2016-07-05 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13494#discussion_r69674988
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/OptimizeMetadataOnly.scala
 ---
@@ -0,0 +1,133 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.{CatalystConf, InternalRow}
+import org.apache.spark.sql.catalyst.catalog.{CatalogRelation, 
SessionCatalog}
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.aggregate._
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.execution.datasources.{HadoopFsRelation, 
LogicalRelation}
+
+/**
+ * When scanning only partition columns, get results based on partition 
data without scanning files.
+ * It's used for operators that only need distinct values. Currently only 
[[Aggregate]] operator
+ * which satisfy the following conditions are supported:
+ * 1. aggregate expression is partition columns.
+ *  e.g. SELECT col FROM tbl GROUP BY col.
+ * 2. aggregate function on partition columns with DISTINCT.
+ *  e.g. SELECT count(DISTINCT col1) FROM tbl GROUP BY col2.
+ * 3. aggregate function on partition columns which have same result w or 
w/o DISTINCT keyword.
+ *  e.g. SELECT Max(col2) FROM tbl GROUP BY col1.
+ */
+case class OptimizeMetadataOnly(
+catalog: SessionCatalog,
+conf: CatalystConf) extends Rule[LogicalPlan] {
+
+  def apply(plan: LogicalPlan): LogicalPlan = {
+if (!conf.optimizerMetadataOnly) {
+  return plan
+}
+
+plan.transform {
+  case a @ Aggregate(_, aggExprs, child @ 
PartitionedRelation(partAttrs, relation)) =>
+if (a.references.subsetOf(partAttrs)) {
+  val aggFunctions = aggExprs.flatMap(_.collect {
+case agg: AggregateExpression => agg
+  })
+  val isPartitionDataOnly = aggFunctions.isEmpty || 
aggFunctions.forall { agg =>
+agg.isDistinct || (agg.aggregateFunction match {
+  case _: Max => true
+  case _: Min => true
+  case _ => false
+})
+  }
+  if (isPartitionDataOnly) {
+a.withNewChildren(Seq(usePartitionData(child, relation)))
+  } else {
+a
+  }
+} else {
+  a
+}
+}
+  }
+
+  private def usePartitionData(child: LogicalPlan, relation: LogicalPlan): 
LogicalPlan = {
+child transform {
+  case plan if plan eq relation =>
+relation match {
+  case l @ LogicalRelation(fsRelation: HadoopFsRelation, _, _) =>
+val partColumns = 
fsRelation.partitionSchema.map(_.name.toLowerCase).toSet
+val partAttrs = l.output.filter(a => 
partColumns.contains(a.name.toLowerCase))
+val partitionData = fsRelation.location.listFiles(Nil)
+LocalRelation(partAttrs, partitionData.map(_.values))
+
+  case relation: CatalogRelation =>
+val partColumns = 
relation.catalogTable.partitionColumnNames.map(_.toLowerCase).toSet
+val partAttrs = relation.output.filter(a => 
partColumns.contains(a.name.toLowerCase))
+val partitionData = 
catalog.listPartitions(relation.catalogTable.identifier).map { p =>
+  InternalRow.fromSeq(partAttrs.map { attr =>
+Cast(Literal(p.spec(attr.name)), attr.dataType).eval()
+  })
+}
+LocalRelation(partAttrs, partitionData)
+
+  case _ => throw new IllegalStateException()
+}
+}
+  }
+
+  object PartitionedRelation {
--- End diff --

Also, we need to document what the returned tuple means.
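
For readers following the thread, the "returned tuple" is whatever the extractor's `unapply` yields; a generic sketch of the pattern, with invented names unrelated to Spark:

```scala
// Toy extractor: an `unapply` returning Option[(A, B)] is what allows
// `case SplitAtColon(k, v) => ...` in a match, analogous to how
// `case PartitionedRelation(partAttrs, relation)` is used in the rule above.
object SplitAtColon {
  def unapply(s: String): Option[(String, String)] = {
    val i = s.indexOf(':')
    if (i < 0) None else Some((s.take(i), s.drop(i + 1)))
  }
}

"key:value" match {
  case SplitAtColon(k, v) => println(s"$k -> $v")  // prints "key -> value"
  case _                  => println("no colon")
}
```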

[GitHub] spark pull request #14004: [SPARK-16285][SQL] Implement sentences SQL functi...

2016-07-05 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14004#discussion_r69674949
  
--- Diff: 
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java ---
@@ -801,6 +804,49 @@ public static UTF8String concatWs(UTF8String 
separator, UTF8String... inputs) {
 return res;
   }
 
+  /**
+   * Return a locale of the given language and country, or a default 
locale when failures occur.
+   */
+  private Locale getLocale(UTF8String language, UTF8String country) {
--- End diff --

Okay. No problem. I'll move this.





[GitHub] spark pull request #13494: [SPARK-15752] [SQL] support optimization for meta...

2016-07-05 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13494#discussion_r69674955
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/OptimizeMetadataOnly.scala
 ---
@@ -0,0 +1,133 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.{CatalystConf, InternalRow}
+import org.apache.spark.sql.catalyst.catalog.{CatalogRelation, 
SessionCatalog}
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.aggregate._
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.execution.datasources.{HadoopFsRelation, 
LogicalRelation}
+
+/**
+ * When scanning only partition columns, get results based on partition 
data without scanning files.
+ * It's used for operators that only need distinct values. Currently only 
[[Aggregate]] operator
+ * which satisfy the following conditions are supported:
+ * 1. aggregate expression is partition columns.
+ *  e.g. SELECT col FROM tbl GROUP BY col.
+ * 2. aggregate function on partition columns with DISTINCT.
+ *  e.g. SELECT count(DISTINCT col1) FROM tbl GROUP BY col2.
+ * 3. aggregate function on partition columns which have same result w or 
w/o DISTINCT keyword.
+ *  e.g. SELECT Max(col2) FROM tbl GROUP BY col1.
+ */
+case class OptimizeMetadataOnly(
+catalog: SessionCatalog,
+conf: CatalystConf) extends Rule[LogicalPlan] {
+
+  def apply(plan: LogicalPlan): LogicalPlan = {
+if (!conf.optimizerMetadataOnly) {
+  return plan
+}
+
+plan.transform {
+  case a @ Aggregate(_, aggExprs, child @ 
PartitionedRelation(partAttrs, relation)) =>
+if (a.references.subsetOf(partAttrs)) {
+  val aggFunctions = aggExprs.flatMap(_.collect {
+case agg: AggregateExpression => agg
+  })
+  val isPartitionDataOnly = aggFunctions.isEmpty || 
aggFunctions.forall { agg =>
+agg.isDistinct || (agg.aggregateFunction match {
+  case _: Max => true
+  case _: Min => true
+  case _ => false
+})
+  }
+  if (isPartitionDataOnly) {
+a.withNewChildren(Seq(usePartitionData(child, relation)))
+  } else {
+a
+  }
+} else {
+  a
+}
+}
+  }
+
+  private def usePartitionData(child: LogicalPlan, relation: LogicalPlan): 
LogicalPlan = {
--- End diff --

Actually, we should probably just rename this function to make it more self-evident, e.g. `replaceTableScanWithPartitionMetadata`.






[GitHub] spark pull request #13494: [SPARK-15752] [SQL] support optimization for meta...

2016-07-05 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13494#discussion_r69674923
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/OptimizeMetadataOnly.scala
 ---
@@ -0,0 +1,133 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.{CatalystConf, InternalRow}
+import org.apache.spark.sql.catalyst.catalog.{CatalogRelation, 
SessionCatalog}
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.aggregate._
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.execution.datasources.{HadoopFsRelation, 
LogicalRelation}
+
+/**
+ * When scanning only partition columns, get results based on partition 
data without scanning files.
+ * It's used for operators that only need distinct values. Currently only 
[[Aggregate]] operator
+ * which satisfy the following conditions are supported:
+ * 1. aggregate expression is partition columns.
+ *  e.g. SELECT col FROM tbl GROUP BY col.
+ * 2. aggregate function on partition columns with DISTINCT.
+ *  e.g. SELECT count(DISTINCT col1) FROM tbl GROUP BY col2.
+ * 3. aggregate function on partition columns which have same result w or 
w/o DISTINCT keyword.
+ *  e.g. SELECT Max(col2) FROM tbl GROUP BY col1.
+ */
+case class OptimizeMetadataOnly(
+catalog: SessionCatalog,
+conf: CatalystConf) extends Rule[LogicalPlan] {
+
+  def apply(plan: LogicalPlan): LogicalPlan = {
+if (!conf.optimizerMetadataOnly) {
+  return plan
+}
+
+plan.transform {
+  case a @ Aggregate(_, aggExprs, child @ 
PartitionedRelation(partAttrs, relation)) =>
+if (a.references.subsetOf(partAttrs)) {
+  val aggFunctions = aggExprs.flatMap(_.collect {
+case agg: AggregateExpression => agg
+  })
+  val isPartitionDataOnly = aggFunctions.isEmpty || 
aggFunctions.forall { agg =>
+agg.isDistinct || (agg.aggregateFunction match {
--- End diff --

does this work for
```
select col1, col2 from table group by col1
```
?

It'd be good to handle that too.





[GitHub] spark pull request #14004: [SPARK-16285][SQL] Implement sentences SQL functi...

2016-07-05 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14004#discussion_r69674859
  
--- Diff: 
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java ---
@@ -801,6 +804,49 @@ public static UTF8String concatWs(UTF8String 
separator, UTF8String... inputs) {
 return res;
   }
 
+  /**
+   * Return a locale of the given language and country, or a default 
locale when failures occur.
+   */
+  private Locale getLocale(UTF8String language, UTF8String country) {
--- End diff --

I think it's ok to inline this method into `sentences`





[GitHub] spark issue #14036: [SPARK-16323] [SQL] Add IntegerDivide to avoid unnecessa...

2016-07-05 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/14036
  
OK I think it's fine to add it in SQL for compatibility but I wouldn't add 
functions in the DataFrame API for that.





[GitHub] spark issue #14014: [SPARK-16344][SQL] Decoding Parquet array of struct with...

2016-07-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14014
  
**[Test build #61811 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61811/consoleFull)**
 for PR 14014 at commit 
[`5d3fdd4`](https://github.com/apache/spark/commit/5d3fdd4d31a5ddb91d4b4fd30a8a267fbe519952).





[GitHub] spark issue #14013: [SPARK-16344][SQL][BRANCH-1.6] Decoding Parquet array of...

2016-07-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14013
  
**[Test build #61812 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61812/consoleFull)**
 for PR 14013 at commit 
[`e44451e`](https://github.com/apache/spark/commit/e44451e60856602a151dac2a2f696e12b27999cc).





[GitHub] spark pull request #13494: [SPARK-15752] [SQL] support optimization for meta...

2016-07-05 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13494#discussion_r69674624
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/OptimizeMetadataOnly.scala
 ---
@@ -0,0 +1,133 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.{CatalystConf, InternalRow}
+import org.apache.spark.sql.catalyst.catalog.{CatalogRelation, 
SessionCatalog}
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.aggregate._
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.execution.datasources.{HadoopFsRelation, 
LogicalRelation}
+
+/**
+ * When scanning only partition columns, get results based on partition 
data without scanning files.
+ * It's used for operators that only need distinct values. Currently only 
[[Aggregate]] operator
+ * which satisfy the following conditions are supported:
+ * 1. aggregate expression is partition columns.
+ *  e.g. SELECT col FROM tbl GROUP BY col.
+ * 2. aggregate function on partition columns with DISTINCT.
+ *  e.g. SELECT count(DISTINCT col1) FROM tbl GROUP BY col2.
+ * 3. aggregate function on partition columns which have same result w or 
w/o DISTINCT keyword.
+ *  e.g. SELECT Max(col2) FROM tbl GROUP BY col1.
+ */
+case class OptimizeMetadataOnly(
+catalog: SessionCatalog,
+conf: CatalystConf) extends Rule[LogicalPlan] {
+
+  def apply(plan: LogicalPlan): LogicalPlan = {
+if (!conf.optimizerMetadataOnly) {
+  return plan
+}
+
+plan.transform {
+  case a @ Aggregate(_, aggExprs, child @ 
PartitionedRelation(partAttrs, relation)) =>
+if (a.references.subsetOf(partAttrs)) {
+  val aggFunctions = aggExprs.flatMap(_.collect {
+case agg: AggregateExpression => agg
+  })
+  val isPartitionDataOnly = aggFunctions.isEmpty || 
aggFunctions.forall { agg =>
--- End diff --

I think `aggFunctions.isEmpty` is not necessary, since `forall` returns true if `aggFunctions` is empty.
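
A quick standalone check of that `forall` behaviour (plain Scala, nothing Spark-specific):

```scala
// forall is vacuously true on an empty collection, so the isEmpty guard adds nothing.
val empty = Seq.empty[Int]
println(empty.forall(_ > 0))                   // true
println(empty.isEmpty || empty.forall(_ > 0))  // also true, the guard is redundant
```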





[GitHub] spark pull request #13494: [SPARK-15752] [SQL] support optimization for meta...

2016-07-05 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13494#discussion_r69674487
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/OptimizeMetadataOnly.scala
 ---
@@ -0,0 +1,133 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.{CatalystConf, InternalRow}
+import org.apache.spark.sql.catalyst.catalog.{CatalogRelation, 
SessionCatalog}
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.aggregate._
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.execution.datasources.{HadoopFsRelation, 
LogicalRelation}
+
+/**
+ * When scanning only partition columns, get results based on partition 
data without scanning files.
+ * It's used for operators that only need distinct values. Currently only 
[[Aggregate]] operator
+ * which satisfy the following conditions are supported:
+ * 1. aggregate expression is partition columns.
+ *  e.g. SELECT col FROM tbl GROUP BY col.
+ * 2. aggregate function on partition columns with DISTINCT.
+ *  e.g. SELECT count(DISTINCT col1) FROM tbl GROUP BY col2.
+ * 3. aggregate function on partition columns which have same result w or 
w/o DISTINCT keyword.
+ *  e.g. SELECT Max(col2) FROM tbl GROUP BY col1.
+ */
+case class OptimizeMetadataOnly(
+catalog: SessionCatalog,
+conf: CatalystConf) extends Rule[LogicalPlan] {
+
+  def apply(plan: LogicalPlan): LogicalPlan = {
+if (!conf.optimizerMetadataOnly) {
+  return plan
+}
+
+plan.transform {
+  case a @ Aggregate(_, aggExprs, child @ 
PartitionedRelation(partAttrs, relation)) =>
+if (a.references.subsetOf(partAttrs)) {
+  val aggFunctions = aggExprs.flatMap(_.collect {
+case agg: AggregateExpression => agg
+  })
+  val isPartitionDataOnly = aggFunctions.isEmpty || 
aggFunctions.forall { agg =>
+agg.isDistinct || (agg.aggregateFunction match {
+  case _: Max => true
+  case _: Min => true
+  case _ => false
+})
+  }
+  if (isPartitionDataOnly) {
+a.withNewChildren(Seq(usePartitionData(child, relation)))
+  } else {
+a
+  }
+} else {
+  a
+}
+}
+  }
+
+  private def usePartitionData(child: LogicalPlan, relation: LogicalPlan): 
LogicalPlan = {
+child transform {
+  case plan if plan eq relation =>
+relation match {
+  case l @ LogicalRelation(fsRelation: HadoopFsRelation, _, _) =>
+val partColumns = 
fsRelation.partitionSchema.map(_.name.toLowerCase).toSet
+val partAttrs = l.output.filter(a => 
partColumns.contains(a.name.toLowerCase))
+val partitionData = fsRelation.location.listFiles(Nil)
+LocalRelation(partAttrs, partitionData.map(_.values))
+
+  case relation: CatalogRelation =>
+val partColumns = 
relation.catalogTable.partitionColumnNames.map(_.toLowerCase).toSet
+val partAttrs = relation.output.filter(a => 
partColumns.contains(a.name.toLowerCase))
+val partitionData = 
catalog.listPartitions(relation.catalogTable.identifier).map { p =>
+  InternalRow.fromSeq(partAttrs.map { attr =>
+Cast(Literal(p.spec(attr.name)), attr.dataType).eval()
+  })
+}
+LocalRelation(partAttrs, partitionData)
+
+  case _ => throw new IllegalStateException()
--- End diff --

This cannot be hit unless there is a bug in Spark, right? If that's the case, add a comment saying that.



[GitHub] spark issue #13890: [SPARK-16189][SQL] Add ExistingRDD logical plan for inpu...

2016-07-05 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/13890
  
it's a pretty good optimization! should we also apply it to `LocalRelation`?





[GitHub] spark pull request #13494: [SPARK-15752] [SQL] support optimization for meta...

2016-07-05 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13494#discussion_r69674388
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/OptimizeMetadataOnly.scala
 ---
@@ -0,0 +1,133 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.{CatalystConf, InternalRow}
+import org.apache.spark.sql.catalyst.catalog.{CatalogRelation, 
SessionCatalog}
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.aggregate._
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.execution.datasources.{HadoopFsRelation, 
LogicalRelation}
+
+/**
+ * When scanning only partition columns, get results based on partition 
data without scanning files.
+ * It's used for operators that only need distinct values. Currently only 
[[Aggregate]] operator
+ * which satisfy the following conditions are supported:
+ * 1. aggregate expression is partition columns.
+ *  e.g. SELECT col FROM tbl GROUP BY col.
+ * 2. aggregate function on partition columns with DISTINCT.
+ *  e.g. SELECT count(DISTINCT col1) FROM tbl GROUP BY col2.
+ * 3. aggregate function on partition columns which have same result w or 
w/o DISTINCT keyword.
+ *  e.g. SELECT Max(col2) FROM tbl GROUP BY col1.
+ */
+case class OptimizeMetadataOnly(
+catalog: SessionCatalog,
+conf: CatalystConf) extends Rule[LogicalPlan] {
+
+  def apply(plan: LogicalPlan): LogicalPlan = {
+if (!conf.optimizerMetadataOnly) {
+  return plan
+}
+
+plan.transform {
+  case a @ Aggregate(_, aggExprs, child @ 
PartitionedRelation(partAttrs, relation)) =>
+if (a.references.subsetOf(partAttrs)) {
+  val aggFunctions = aggExprs.flatMap(_.collect {
+case agg: AggregateExpression => agg
+  })
+  val isPartitionDataOnly = aggFunctions.isEmpty || 
aggFunctions.forall { agg =>
+agg.isDistinct || (agg.aggregateFunction match {
+  case _: Max => true
+  case _: Min => true
+  case _ => false
+})
+  }
+  if (isPartitionDataOnly) {
+a.withNewChildren(Seq(usePartitionData(child, relation)))
+  } else {
+a
+  }
+} else {
+  a
+}
+}
+  }
+
+  private def usePartitionData(child: LogicalPlan, relation: LogicalPlan): 
LogicalPlan = {
+child transform {
+  case plan if plan eq relation =>
+relation match {
+  case l @ LogicalRelation(fsRelation: HadoopFsRelation, _, _) =>
+val partColumns = 
fsRelation.partitionSchema.map(_.name.toLowerCase).toSet
+val partAttrs = l.output.filter(a => 
partColumns.contains(a.name.toLowerCase))
+val partitionData = fsRelation.location.listFiles(Nil)
--- End diff --

Use a named argument, e.g.
```
val partitionData = fsRelation.location.listFiles(filters = Nil)
```

for a while I was wondering what that argument does.





[GitHub] spark pull request #13890: [SPARK-16189][SQL] Add ExistingRDD logical plan f...

2016-07-05 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/13890#discussion_r69674281
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala ---
@@ -74,13 +74,71 @@ object RDDConversions {
   }
 }
 
+private[sql] object ExistingRDD {
+
+  def apply[T: Encoder](rdd: RDD[T])(session: SparkSession): LogicalPlan = 
{
+val exisitingRdd = ExistingRDD(CatalystSerde.generateObjAttr[T], 
rdd)(session)
+CatalystSerde.serialize[T](exisitingRdd)
+  }
+}
+
 /** Logical plan node for scanning data from an RDD. */
+private[sql] case class ExistingRDD[T](
+outputObjAttr: Attribute,
+rdd: RDD[T])(session: SparkSession)
+  extends LeafNode with ObjectProducer with MultiInstanceRelation {
+
+  override protected final def otherCopyArgs: Seq[AnyRef] = session :: Nil
+
+  override def newInstance(): ExistingRDD.this.type =
+ExistingRDD(outputObjAttr.newInstance(), 
rdd)(session).asInstanceOf[this.type]
+
+  override def sameResult(plan: LogicalPlan): Boolean = {
+plan.canonicalized match {
+  case ExistingRDD(_, otherRDD) => rdd.id == otherRDD.id
+  case _ => false
+}
+  }
+
+  override protected def stringArgs: Iterator[Any] = Iterator(output)
+
+  @transient override lazy val statistics: Statistics = Statistics(
+// TODO: Instead of returning a default value here, find a way to 
return a meaningful size
+// estimate for RDDs. See PR 1238 for more discussions.
+sizeInBytes = BigInt(session.sessionState.conf.defaultSizeInBytes)
+  )
+}
+
+/** Physical plan node for scanning data from an RDD. */
+private[sql] case class ExistingRDDScanExec[T](
--- End diff --

From the name it's hard to tell what the difference is between this one and `RDDScanExec`...





[GitHub] spark pull request #13494: [SPARK-15752] [SQL] support optimization for meta...

2016-07-05 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13494#discussion_r69674280
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/OptimizeMetadataOnly.scala
 ---
@@ -0,0 +1,133 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.{CatalystConf, InternalRow}
+import org.apache.spark.sql.catalyst.catalog.{CatalogRelation, 
SessionCatalog}
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.aggregate._
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.execution.datasources.{HadoopFsRelation, 
LogicalRelation}
+
+/**
+ * When scanning only partition columns, get results based on partition 
data without scanning files.
+ * It's used for operators that only need distinct values. Currently only 
[[Aggregate]] operator
+ * which satisfy the following conditions are supported:
+ * 1. aggregate expression is partition columns.
+ *  e.g. SELECT col FROM tbl GROUP BY col.
+ * 2. aggregate function on partition columns with DISTINCT.
+ *  e.g. SELECT count(DISTINCT col1) FROM tbl GROUP BY col2.
+ * 3. aggregate function on partition columns which have same result w or 
w/o DISTINCT keyword.
+ *  e.g. SELECT Max(col2) FROM tbl GROUP BY col1.
+ */
+case class OptimizeMetadataOnly(
+catalog: SessionCatalog,
+conf: CatalystConf) extends Rule[LogicalPlan] {
+
+  def apply(plan: LogicalPlan): LogicalPlan = {
+if (!conf.optimizerMetadataOnly) {
+  return plan
+}
+
+plan.transform {
+  case a @ Aggregate(_, aggExprs, child @ 
PartitionedRelation(partAttrs, relation)) =>
+if (a.references.subsetOf(partAttrs)) {
+  val aggFunctions = aggExprs.flatMap(_.collect {
+case agg: AggregateExpression => agg
+  })
+  val isPartitionDataOnly = aggFunctions.isEmpty || 
aggFunctions.forall { agg =>
+agg.isDistinct || (agg.aggregateFunction match {
+  case _: Max => true
+  case _: Min => true
+  case _ => false
+})
+  }
+  if (isPartitionDataOnly) {
+a.withNewChildren(Seq(usePartitionData(child, relation)))
+  } else {
+a
+  }
+} else {
+  a
+}
+}
+  }
+
+  private def usePartitionData(child: LogicalPlan, relation: LogicalPlan): 
LogicalPlan = {
--- End diff --

We need a comment explaining what this function does.





[GitHub] spark pull request #13494: [SPARK-15752] [SQL] support optimization for meta...

2016-07-05 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/13494#discussion_r69674209
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/OptimizeMetadataOnly.scala
 ---
@@ -0,0 +1,133 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution
+
+import org.apache.spark.sql.AnalysisException
+import org.apache.spark.sql.catalyst.{CatalystConf, InternalRow}
+import org.apache.spark.sql.catalyst.catalog.{CatalogRelation, 
SessionCatalog}
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.aggregate._
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.execution.datasources.{HadoopFsRelation, 
LogicalRelation}
+
+/**
+ * When scanning only partition columns, get results based on partition 
data without scanning files.
+ * It's used for operators that only need distinct values. Currently only 
[[Aggregate]] operator
+ * which satisfy the following conditions are supported:
+ * 1. aggregate expression is partition columns.
+ *  e.g. SELECT col FROM tbl GROUP BY col.
+ * 2. aggregate function on partition columns with DISTINCT.
+ *  e.g. SELECT count(DISTINCT col1) FROM tbl GROUP BY col2.
--- End diff --

OK I would say something like the following to make it very explicit.

```
This rule optimizes the execution of queries that can be answered by 
looking only at partition-level metadata.
This applies when all the columns scanned are partition columns, and the 
query has an aggregate operator that satisfies the following conditions:
1.
2.
3.
```





[GitHub] spark pull request #13890: [SPARK-16189][SQL] Add ExistingRDD logical plan f...

2016-07-05 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/13890#discussion_r69674173
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala ---
@@ -74,13 +74,71 @@ object RDDConversions {
   }
 }
 
+private[sql] object ExistingRDD {
+
+  def apply[T: Encoder](rdd: RDD[T])(session: SparkSession): LogicalPlan = 
{
+val exisitingRdd = ExistingRDD(CatalystSerde.generateObjAttr[T], 
rdd)(session)
+CatalystSerde.serialize[T](exisitingRdd)
+  }
+}
+
 /** Logical plan node for scanning data from an RDD. */
+private[sql] case class ExistingRDD[T](
+outputObjAttr: Attribute,
+rdd: RDD[T])(session: SparkSession)
--- End diff --

why curry here?
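
For context, one concrete effect of the extra parameter list on a case class, sketched with toy names outside Spark:

```scala
// Only the first parameter list of a case class takes part in the generated
// equals/hashCode/toString, so a value carried in a second list is ignored
// for structural equality (and has to be handed back via otherCopyArgs when
// the tree node is copied, as the diff above does).
case class Node(id: Int)(val session: String)

val a = Node(1)("sessionA")
val b = Node(1)("sessionB")
println(a == b)     // true: equality only looks at `id`
println(a.session)  // sessionA
```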





[GitHub] spark issue #13873: [SPARK-16167][SQL] RowEncoder should preserve array/map ...

2016-07-05 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/13873
  
I tried to do this before, but couldn't find much value in it. Currently the schema of the encoder is passed in as a parameter, not retrieved from the serializer expression. What do you think?

The null check for arrays seems reasonable.





[GitHub] spark issue #14063: [MINOR][PySpark][DOC] Fix code examples of SparkSession ...

2016-07-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14063
  
**[Test build #61810 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61810/consoleFull)**
 for PR 14063 at commit 
[`9edca4f`](https://github.com/apache/spark/commit/9edca4f060cc3fac304e89234b0ba6c49291145a).





[GitHub] spark pull request #14053: [SPARK-16374] [SQL] Remove Alias from MetastoreRe...

2016-07-05 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/14053#discussion_r69673940
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/parquetSuites.scala ---
@@ -716,6 +717,39 @@ class ParquetSourceSuite extends 
ParquetPartitioningTest {
 }
   }
 
+  test("Alias used in converted data source tables") {
--- End diff --

Yeah, I can remove it. 

You are right. That example is just for showing the outcome of the code 
changes. Thanks!





[GitHub] spark issue #13494: [SPARK-15752] [SQL] support optimization for metadata on...

2016-07-05 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/13494
  
@lianhuiwang please update the pull request description to include the 
actual cases that are supported in this pull request.





