[GitHub] spark issue #23206: [SPARK-26249][SQL] Add ability to inject a rule in order...

2018-12-05 Thread skambha
Github user skambha commented on the issue:

https://github.com/apache/spark/pull/23206
  
Today in Spark, the extension points API's `injectOptimizerRule` method only 
allows rules to be appended at the end of `extendedOperatorOptimizationRules`, 
and those rules end up in the two batches that were split apart by pulling the 
InferFiltersFromConstraints rule out on its own, i.e. "Operator Optimization 
before Inferring Filters" and "Operator Optimization after Inferring Filters".  As 
you can see, even here we have a use case for ordering, where we want the rules 
to kick in before the InferFiltersFromConstraints rule.  

What this PR proposes is a method to inject a rule at a specific position in a 
batch.   For our use case, we have optimization rules that need to kick in after a 
certain rule, much like the case above.  At the position where injected rules run 
by default today, earlier rules have already altered the plan, so our optimization 
rule never gets a chance to fire.   

The other proposed method adds a whole batch.  This is similar to what already 
exists today in postHocOptimizationBatches and preHocOptimizationBatches, but 
those are not exposed through the SparkSessionExtensions API.   The proposed 
`injectOptimizerBatch` method simply exposes this capability as part of the 
extension points so we can make use of it.  

I agree this adds logic to the Optimizer to compute the batches.  The new code 
is structured so that if the new inject methods are not used, none of the new 
computation is exercised: there is a single check for whether any new rules or 
batches need to be injected, and if not, the code behaves exactly as before. 
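
To make this concrete, here is a rough sketch of how an extension might use the existing injection point versus the proposed ones. Only `injectOptimizerRule` exists in Spark today; the two commented calls are hypothetical shapes for what this PR proposes (`injectOptimizerBatch` is the name used in this thread, while the ordered-injection method name and all argument shapes are illustrative guesses, not the actual signatures):

```scala
import org.apache.spark.sql.{SparkSession, SparkSessionExtensions}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Placeholder optimizer rule; the real rewrite logic would go in apply().
case class MyRule(session: SparkSession) extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}

object MyExtensionSketch {
  def register(extensions: SparkSessionExtensions): Unit = {
    // Today: the only option is appending to extendedOperatorOptimizationRules.
    extensions.injectOptimizerRule(session => MyRule(session))

    // Proposed in this PR (left as comments because the API is not in Spark yet):
    // inject the rule at a chosen position inside a named batch, and inject a
    // whole new batch of rules. Argument shapes below are invented for illustration.
    //
    // extensions.injectOptimizerRuleInOrder(session => MyRule(session),
    //   batchName = "Operator Optimization before Inferring Filters",
    //   after = "org.apache.spark.sql.catalyst.optimizer.PushDownPredicate")
    //
    // extensions.injectOptimizerBatch("My Custom Batch", maxIterations = 1,
    //   Seq(session => MyRule(session)))
  }
}
```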

Hope this helps some. 

The SparkSessionExtensions API is experimental and is meant for third-party 
developers who want to add extensions to Spark without having to get their 
code merged into Spark (a minimal registration example is sketched below).   If 
there are other ways to achieve this without changing Spark code, please share 
your thoughts.  Thanks. 
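
For completeness, this is how a third party wires such an extension into Spark today without modifying Spark code (class and package names are placeholders; `MyExtensionSketch.register` is from the sketch above):

```scala
import org.apache.spark.sql.{SparkSession, SparkSessionExtensions}

// Placeholder entry point, e.g. shipped in a third-party jar.
class MyExtensions extends (SparkSessionExtensions => Unit) {
  override def apply(extensions: SparkSessionExtensions): Unit =
    MyExtensionSketch.register(extensions)
}

// Register programmatically ...
val spark = SparkSession.builder()
  .master("local[*]")
  .withExtensions(new MyExtensions)
  .getOrCreate()

// ... or purely via configuration when submitting the application:
//   spark-submit --conf spark.sql.extensions=com.example.MyExtensions ...
```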


---




[GitHub] spark issue #23206: [SPARK-26249][SQL] Add ability to inject a rule in order...

2018-12-05 Thread skambha
Github user skambha commented on the issue:

https://github.com/apache/spark/pull/23206
  
@maropu, thanks for your question. Yes, that's correct. 


---




[GitHub] spark pull request #23206: [SPARK-26249][SQL] Add ability to inject a rule i...

2018-12-03 Thread skambha
GitHub user skambha opened a pull request:

https://github.com/apache/spark/pull/23206

[SPARK-26249][SQL] Add ability to inject a rule in order and to add a batch 
via the Spark Extension Points API

## What changes were proposed in this pull request?
Add two new APIs to SparkSessionExtensions: 
- Inject optimizer rule in a specific order
- Inject optimizer batch. 

Design details are here: 

https://drive.google.com/file/d/1m7rQZ9OZFl0MH5KS12CiIg3upLJSYfsA/view?usp=sharing

## How was this patch tested?
- New unit tests have been added to the SparkSessionExtensionSuite
- Ran the sql, hive, catalyst unit tests without any issues. 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/skambha/spark ext1

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23206.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23206


commit b25c31666c0136e3d98962b51a8637a21e63278a
Author: Sunitha Kambhampati 
Date:   2018-11-30T11:42:59Z

Add 2 api to the extension points  and add unit tests




---




[GitHub] spark pull request #17185: [SPARK-19602][SQL] Support column resolution of f...

2018-08-06 Thread skambha
Github user skambha commented on a diff in the pull request:

https://github.com/apache/spark/pull/17185#discussion_r208089990
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala
 ---
@@ -169,25 +181,50 @@ package object expressions  {
 })
   }
 
-  // Find matches for the given name assuming that the 1st part is a qualifier (i.e. table name,
-  // alias, or subquery alias) and the 2nd part is the actual name. This returns a tuple of
+  // Find matches for the given name assuming that the 1st two parts are qualifier
+  // (i.e. database name and table name) and the 3rd part is the actual column name.
+  //
+  // For example, consider an example where "db1" is the database name, "a" is the table name
+  // and "b" is the column name and "c" is the struct field name.
+  // If the name parts is db1.a.b.c, then Attribute will match
+  // Attribute(b, qualifier("db1,"a")) and List("c") will be the second element
+  var matches: (Seq[Attribute], Seq[String]) = nameParts match {
+    case dbPart +: tblPart +: name +: nestedFields =>
+      val key = (dbPart.toLowerCase(Locale.ROOT),
+        tblPart.toLowerCase(Locale.ROOT), name.toLowerCase(Locale.ROOT))
+      val attributes = collectMatches(name, qualified3Part.get(key)).filter {
+        a => (resolver(dbPart, a.qualifier.head) && resolver(tblPart, a.qualifier.last))
+      }
+      (attributes, nestedFields)
+    case all =>
+      (Seq.empty, Seq.empty)
+  }
+
+  // If there are no matches, then find matches for the given name assuming that
+  // the 1st part is a qualifier (i.e. table name, alias, or subquery alias) and the
+  // 2nd part is the actual name. This returns a tuple of
   // matched attributes and a list of parts that are to be resolved.
   //
   // For example, consider an example where "a" is the table name, "b" is the column name,
   // and "c" is the struct field name, i.e. "a.b.c". In this case, Attribute will be "a.b",
   // and the second element will be List("c").
-  val matches = nameParts match {
-    case qualifier +: name +: nestedFields =>
-      val key = (qualifier.toLowerCase(Locale.ROOT), name.toLowerCase(Locale.ROOT))
-      val attributes = collectMatches(name, qualified.get(key)).filter { a =>
-        resolver(qualifier, a.qualifier.get)
+  matches = matches match {
--- End diff --

done.


---




[GitHub] spark issue #17185: [SPARK-19602][SQL] Support column resolution of fully qu...

2018-08-06 Thread skambha
Github user skambha commented on the issue:

https://github.com/apache/spark/pull/17185
  
Thanks for the review.  I have addressed your comments and pushed the 
changes.  
@cloud-fan, Please take a look.  


---




[GitHub] spark pull request #17185: [SPARK-19602][SQL] Support column resolution of f...

2018-08-06 Thread skambha
Github user skambha commented on a diff in the pull request:

https://github.com/apache/spark/pull/17185#discussion_r208079529
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala
 ---
@@ -201,7 +204,7 @@ case class Alias(child: Expression, name: String)(
   }
 
   override def sql: String = {
-val qualifierPrefix = qualifier.map(_ + ".").getOrElse("")
+val qualifierPrefix = if (qualifier.nonEmpty) qualifier.mkString(".") 
+ "." else ""
--- End diff --

ok, sounds good.  


---




[GitHub] spark pull request #17185: [SPARK-19602][SQL] Support column resolution of f...

2018-08-06 Thread skambha
Github user skambha commented on a diff in the pull request:

https://github.com/apache/spark/pull/17185#discussion_r208078928
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
 ---
@@ -794,19 +795,37 @@ case class LocalLimit(limitExpr: Expression, child: 
LogicalPlan) extends OrderPr
 /**
  * Aliased subquery.
  *
- * @param alias the alias name for this subquery.
+ * @param name the alias identifier for this subquery.
  * @param child the logical plan of this subquery.
  */
 case class SubqueryAlias(
-alias: String,
+name: AliasIdentifier,
 child: LogicalPlan)
   extends OrderPreservingUnaryNode {
 
-  override def doCanonicalize(): LogicalPlan = child.canonicalized
+  def alias: String = name.identifier
 
-  override def output: Seq[Attribute] = 
child.output.map(_.withQualifier(Some(alias)))
+  override def output: Seq[Attribute] = {
+val qualifierList = name.database.map(Seq(_, 
alias)).getOrElse(Seq(alias))
+child.output.map(_.withQualifier(qualifierList))
+  }
+  override def doCanonicalize(): LogicalPlan = child.canonicalized
 }
 
+object SubqueryAlias {
+  def apply(
+  identifier: String,
+  child: LogicalPlan): SubqueryAlias = {
+SubqueryAlias(AliasIdentifier(identifier), child)
+  }
+
+  def apply(
+  identifier: String,
+  database: Option[String],
--- End diff --

good point! I'll take care of this in the next push.  


---




[GitHub] spark pull request #17185: [SPARK-19602][SQL] Support column resolution of f...

2018-08-06 Thread skambha
Github user skambha commented on a diff in the pull request:

https://github.com/apache/spark/pull/17185#discussion_r208078754
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala
 ---
@@ -201,7 +204,7 @@ case class Alias(child: Expression, name: String)(
   }
 
   override def sql: String = {
-val qualifierPrefix = qualifier.map(_ + ".").getOrElse("")
+val qualifierPrefix = if (qualifier.nonEmpty) qualifier.mkString(".") 
+ "." else ""
--- End diff --

This won't work for the case when we have Seq.empty: the "." suffix gets 
returned even for an empty sequence. For a non-empty Seq, the above call is fine. 

Shall we leave the 'if' as is, or is there an equivalent preferred style 
that would work? 
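
A quick illustration of the corner case being discussed (plain Scala, independent of the patch):

```scala
Seq.empty[String].mkString(".") + "."   // "."        -> a bogus qualifier prefix
Seq("db1", "t1").mkString(".") + "."    // "db1.t1."  -> fine for a non-empty Seq

// hence the explicit guard:
val qualifier = Seq.empty[String]
val qualifierPrefix = if (qualifier.nonEmpty) qualifier.mkString(".") + "." else ""   // ""
```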


---




[GitHub] spark pull request #17185: [SPARK-19602][SQL] Support column resolution of f...

2018-08-06 Thread skambha
Github user skambha commented on a diff in the pull request:

https://github.com/apache/spark/pull/17185#discussion_r208059884
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala
 ---
@@ -169,25 +181,50 @@ package object expressions  {
 })
   }
 
-  // Find matches for the given name assuming that the 1st part is a 
qualifier (i.e. table name,
-  // alias, or subquery alias) and the 2nd part is the actual name. 
This returns a tuple of
+  // Find matches for the given name assuming that the 1st two parts 
are qualifier
+  // (i.e. database name and table name) and the 3rd part is the 
actual column name.
+  //
+  // For example, consider an example where "db1" is the database 
name, "a" is the table name
+  // and "b" is the column name and "c" is the struct field name.
+  // If the name parts is db1.a.b.c, then Attribute will match
--- End diff --

@cloud-fan , Thank you for your suggestion and question. 

Existing Spark behavior follows precedence rules in the column resolution 
logic, and in this patch we are following the same pattern/rule.   

I am looking into the SQL standard to see if it defines any column 
resolution rules, but I have not found any yet.  However, when I researched 
existing databases I observed different behaviors among them; they are listed 
in Section 2/Table A in the design doc 
[here](https://drive.google.com/file/d/1zKm3aNZ3DpsqIuoMvRsf0kkDkXsAasxH/view).

I agree we can improve the checks in the existing precedence to go all 
the way and ensure there is a nested field, although the user can always 
qualify the field to resolve the ambiguity.  Shall we open another issue to 
discuss and improve the existing resolution logic?  


---




[GitHub] spark pull request #17185: [SPARK-19602][SQL] Support column resolution of f...

2018-08-05 Thread skambha
Github user skambha commented on a diff in the pull request:

https://github.com/apache/spark/pull/17185#discussion_r207726267
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala
 ---
@@ -169,25 +181,50 @@ package object expressions  {
 })
   }
 
-  // Find matches for the given name assuming that the 1st part is a 
qualifier (i.e. table name,
-  // alias, or subquery alias) and the 2nd part is the actual name. 
This returns a tuple of
+  // Find matches for the given name assuming that the 1st two parts 
are qualifier
+  // (i.e. database name and table name) and the 3rd part is the 
actual column name.
+  //
+  // For example, consider an example where "db1" is the database 
name, "a" is the table name
+  // and "b" is the column name and "c" is the struct field name.
+  // If the name parts is db1.a.b.c, then Attribute will match
--- End diff --

Actually, it won't fail without this patch either: col.innerField1.innerField2 
will resolve correctly to the struct field if there is a table named col and a 
column named innerField1 that has an innerField2 field.   

Please see an example of such a scenario in this test output on master 
Spark: 

https://github.com/apache/spark/blob/master/sql/core/src/test/resources/sql-tests/results/columnresolution.sql.out#L360

Let me look into the SQL standard and get back on that. 


---




[GitHub] spark issue #17185: [SPARK-19602][SQL] Support column resolution of fully qu...

2018-08-04 Thread skambha
Github user skambha commented on the issue:

https://github.com/apache/spark/pull/17185
  
I have addressed the  review comments in this commit 
[here](https://github.com/apache/spark/pull/17185/commits/065687f3b987e254f41279d45f0cced6e42e)

@cloud-fan, please take a look.  Thanks. 


---




[GitHub] spark pull request #17185: [SPARK-19602][SQL] Support column resolution of f...

2018-08-04 Thread skambha
Github user skambha commented on a diff in the pull request:

https://github.com/apache/spark/pull/17185#discussion_r207718090
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala
 ---
@@ -169,25 +181,50 @@ package object expressions  {
 })
   }
 
-  // Find matches for the given name assuming that the 1st part is a 
qualifier (i.e. table name,
-  // alias, or subquery alias) and the 2nd part is the actual name. 
This returns a tuple of
+  // Find matches for the given name assuming that the 1st two parts 
are qualifier
+  // (i.e. database name and table name) and the 3rd part is the 
actual column name.
+  //
+  // For example, consider an example where "db1" is the database 
name, "a" is the table name
+  // and "b" is the column name and "c" is the struct field name.
+  // If the name parts is db1.a.b.c, then Attribute will match
--- End diff --

In this case, if a.b.c fails to resolve as db.table.column, then we check 
whether there is a table and column that match a.b, and then whether c is a 
nested field name; if it exists, it resolves to the nested field. 

Tests with struct nested fields are 
[here](https://github.com/skambha/spark/blob/90cd6d33f59fdae16a3a386ed14cefb3f28d35a8/sql/core/src/test/resources/sql-tests/inputs/columnresolution.sql#L65-L73)




---




[GitHub] spark pull request #17185: [SPARK-19602][SQL] Support column resolution of f...

2018-08-04 Thread skambha
Github user skambha commented on a diff in the pull request:

https://github.com/apache/spark/pull/17185#discussion_r207717569
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala
 ---
@@ -316,8 +345,8 @@ case class UnresolvedRegex(regexPattern: String, table: 
Option[String], caseSens
   // If there is no table specified, use all input attributes that 
match expr
   case None => input.output.filter(_.name.matches(pattern))
   // If there is a table, pick out attributes that are part of this 
table that match expr
-  case Some(t) => input.output.filter(_.qualifier.exists(resolver(_, 
t)))
-.filter(_.name.matches(pattern))
+  case Some(t) => input.output.filter(a => resolver(a.qualifier.last, 
t)).
--- End diff --

Sure. Will add a defensive check. 


---




[GitHub] spark pull request #17185: [SPARK-19602][SQL] Support column resolution of f...

2018-08-04 Thread skambha
Github user skambha commented on a diff in the pull request:

https://github.com/apache/spark/pull/17185#discussion_r207717536
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalogSuite.scala
 ---
@@ -536,12 +536,13 @@ abstract class SessionCatalogSuite extends 
AnalysisTest {
   assert(metadata.viewText.isDefined)
   val view = View(desc = metadata, output = 
metadata.schema.toAttributes,
 child = CatalystSqlParser.parsePlan(metadata.viewText.get))
+  val alias = AliasIdentifier("view1", Some("db3"))
--- End diff --

Sure, let me do that if this style is preferred. Thanks.


---




[GitHub] spark issue #17185: [SPARK-19602][SQL] Support column resolution of fully qu...

2018-08-03 Thread skambha
Github user skambha commented on the issue:

https://github.com/apache/spark/pull/17185
  
The last push has changes that add the map lookup for the 3-part name.  It 
implements one solution for the lookup, although I think there are other ways 
we could go about it, and we could possibly improve on it in the future. 

@cloud-fan, please take a look.  Thanks. 


---




[GitHub] spark issue #17185: [SPARK-19602][SQL] Support column resolution of fully qu...

2018-08-03 Thread skambha
Github user skambha commented on the issue:

https://github.com/apache/spark/pull/17185
  
I rebased and ran only the catalyst unit test suite as a sanity test.  
FWIW, I had run the sql, hive, and catalyst suites prior to this last 
rebase.  


---




[GitHub] spark pull request #17185: [SPARK-19602][SQL] Support column resolution of f...

2018-08-03 Thread skambha
Github user skambha commented on a diff in the pull request:

https://github.com/apache/spark/pull/17185#discussion_r207640501
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala
 ---
@@ -262,17 +262,47 @@ abstract class Star extends LeafExpression with 
NamedExpression {
  */
 case class UnresolvedStar(target: Option[Seq[String]]) extends Star with 
Unevaluable {
 
-  override def expand(input: LogicalPlan, resolver: Resolver): 
Seq[NamedExpression] = {
+  /**
+   * Returns true if the nameParts match the qualifier of the attribute
+   *
+   * There are two checks: i) Check if the nameParts match the qualifier 
fully.
+   * E.g. SELECT db.t1.* FROM db1.t1   In this case, the nameParts is 
Seq("db1", "t1") and
+   * qualifier of the attribute is Seq("db1","t1")
+   * ii) If (i) is not true, then check if nameParts is only a single 
element and it
+   * matches the table portion of the qualifier
+   *
+   * E.g. SELECT t1.* FROM db1.t1  In this case nameParts is Seq("t1") and
+   * qualifier is Seq("db1","t1")
+   * SELECT a.* FROM db1.t1 AS a
+   * In this case nameParts is Seq("a") and qualifier for
+   * attribute is Seq("a")
+   */
+  private def matchedQualifier(
+  attribute: Attribute,
+  nameParts: Seq[String],
+  resolver: Resolver): Boolean = {
+val qualifierList = attribute.qualifier.getOrElse(Seq.empty)
+
+// match the qualifiers and nameParts
+val matched = nameParts.corresponds(qualifierList)(resolver) match {
--- End diff --

done. Thanks.


---




[GitHub] spark pull request #17185: [SPARK-19602][SQL] Support column resolution of f...

2018-08-03 Thread skambha
Github user skambha commented on a diff in the pull request:

https://github.com/apache/spark/pull/17185#discussion_r207635526
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala
 ---
@@ -262,17 +262,47 @@ abstract class Star extends LeafExpression with 
NamedExpression {
  */
 case class UnresolvedStar(target: Option[Seq[String]]) extends Star with 
Unevaluable {
 
-  override def expand(input: LogicalPlan, resolver: Resolver): 
Seq[NamedExpression] = {
+  /**
+   * Returns true if the nameParts match the qualifier of the attribute
+   *
+   * There are two checks: i) Check if the nameParts match the qualifier 
fully.
+   * E.g. SELECT db.t1.* FROM db1.t1   In this case, the nameParts is 
Seq("db1", "t1") and
+   * qualifier of the attribute is Seq("db1","t1")
+   * ii) If (i) is not true, then check if nameParts is only a single 
element and it
+   * matches the table portion of the qualifier
+   *
+   * E.g. SELECT t1.* FROM db1.t1  In this case nameParts is Seq("t1") and
+   * qualifier is Seq("db1","t1")
+   * SELECT a.* FROM db1.t1 AS a
+   * In this case nameParts is Seq("a") and qualifier for
+   * attribute is Seq("a")
+   */
+  private def matchedQualifier(
+  attribute: Attribute,
--- End diff --

The database information gets carried over based on the lookupRelation 
logic.  Details are in Section 3.3.3 in the design doc 
[here](https://drive.google.com/file/d/1zKm3aNZ3DpsqIuoMvRsf0kkDkXsAasxH/view)


---




[GitHub] spark issue #17185: [SPARK-19602][SQL] Support column resolution of fully qu...

2018-08-03 Thread skambha
Github user skambha commented on the issue:

https://github.com/apache/spark/pull/17185
  
Looks like another change has gone in that this PR conflicts with, so the 
build fails. :( 
I will rebase again and push.



---




[GitHub] spark pull request #17185: [SPARK-19602][SQL] Support column resolution of f...

2018-08-03 Thread skambha
Github user skambha commented on a diff in the pull request:

https://github.com/apache/spark/pull/17185#discussion_r207624865
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala
 ---
@@ -121,14 +129,14 @@ abstract class Attribute extends LeafExpression with 
NamedExpression with NullIn
  * @param name The name to be associated with the result of computing 
[[child]].
  * @param exprId A globally unique id used to check if an 
[[AttributeReference]] refers to this
  *   alias. Auto-assigned if left blank.
- * @param qualifier An optional string that can be used to referred to 
this attribute in a fully
- *   qualified way. Consider the examples tableName.name, 
subQueryAlias.name.
- *   tableName and subQueryAlias are possible qualifiers.
+ * @param qualifier An optional Seq of string that can be used to refer to 
this attribute in a
+ *  fully qualified way. Consider the examples 
tableName.name, subQueryAlias.name.
+ *  tableName and subQueryAlias are possible qualifiers.
  * @param explicitMetadata Explicit metadata associated with this alias 
that overwrites child's.
  */
 case class Alias(child: Expression, name: String)(
 val exprId: ExprId = NamedExpression.newExprId,
-val qualifier: Option[String] = None,
+val qualifier: Option[Seq[String]] = None,
--- End diff --

I took care of this change in the recent push.  Thanks.


---




[GitHub] spark pull request #17185: [SPARK-19602][SQL] Support column resolution of f...

2018-08-03 Thread skambha
Github user skambha commented on a diff in the pull request:

https://github.com/apache/spark/pull/17185#discussion_r207624726
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala
 ---
@@ -71,19 +71,27 @@ trait NamedExpression extends Expression {
* multiple qualifiers, it is possible that there are other possible way 
to refer to this
* attribute.
*/
-  def qualifiedName: String = (qualifier.toSeq :+ name).mkString(".")
+  def qualifiedName: String = {
+if (qualifier.isDefined) {
+  (qualifier.get :+ name).mkString(".")
+} else {
+  name
+}
+  }
 
   /**
* Optional qualifier for the expression.
+   * Qualifier can also contain the fully qualified information, for e.g, 
Sequence of string
+   * containing the database and the table name
*
* For now, since we do not allow using original table name to qualify a 
column name once the
* table is aliased, this can only be:
*
* 1. Empty Seq: when an attribute doesn't have a qualifier,
*e.g. top level attributes aliased in the SELECT clause, or column 
from a LocalRelation.
-   * 2. Single element: either the table name or the alias name of the 
table.
+   * 2. Seq with a Single element: either the table name or the alias name 
of the table.
--- End diff --

done.


---




[GitHub] spark issue #17185: [SPARK-19602][SQL] Support column resolution of fully qu...

2018-08-03 Thread skambha
Github user skambha commented on the issue:

https://github.com/apache/spark/pull/17185
  
retest this please


---




[GitHub] spark issue #17185: [SPARK-19602][SQL] Support column resolution of fully qu...

2018-08-02 Thread skambha
Github user skambha commented on the issue:

https://github.com/apache/spark/pull/17185
  
Thanks @cloud-fan for the review. 

I am working on implementing an idea to get an optimized lookup for the 3-part 
name.


---




[GitHub] spark issue #17185: [SPARK-19602][SQL] Support column resolution of fully qu...

2018-08-01 Thread skambha
Github user skambha commented on the issue:

https://github.com/apache/spark/pull/17185
  
I rebased and found that the resolution code in LogicalPlan has changed: it 
now uses a map lookup to do the matching. I have some ideas on how to 
incorporate the 3-part name into the map lookup logic.  

For now, I have synced up, bypassed the new map logic, and pushed the code so 
it is up to date and I can get a test cycle from GitHub; I have added TODOs for 
myself to incorporate the optimized lookup logic. 

@cloud-fan,  In the meantime, if you have any other comments please let me 
know.  Thanks.


---




[GitHub] spark pull request #17185: [SPARK-19602][SQL] Support column resolution of f...

2018-08-01 Thread skambha
Github user skambha commented on a diff in the pull request:

https://github.com/apache/spark/pull/17185#discussion_r207027392
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala
 ---
@@ -121,14 +129,14 @@ abstract class Attribute extends LeafExpression with 
NamedExpression with NullIn
  * @param name The name to be associated with the result of computing 
[[child]].
  * @param exprId A globally unique id used to check if an 
[[AttributeReference]] refers to this
  *   alias. Auto-assigned if left blank.
- * @param qualifier An optional string that can be used to referred to 
this attribute in a fully
- *   qualified way. Consider the examples tableName.name, 
subQueryAlias.name.
- *   tableName and subQueryAlias are possible qualifiers.
+ * @param qualifier An optional Seq of string that can be used to refer to 
this attribute in a
+ *  fully qualified way. Consider the examples 
tableName.name, subQueryAlias.name.
+ *  tableName and subQueryAlias are possible qualifiers.
  * @param explicitMetadata Explicit metadata associated with this alias 
that overwrites child's.
  */
 case class Alias(child: Expression, name: String)(
 val exprId: ExprId = NamedExpression.newExprId,
-val qualifier: Option[String] = None,
+val qualifier: Option[Seq[String]] = None,
--- End diff --

I'll look into this. 


---




[GitHub] spark pull request #17185: [SPARK-19602][SQL] Support column resolution of f...

2018-08-01 Thread skambha
Github user skambha commented on a diff in the pull request:

https://github.com/apache/spark/pull/17185#discussion_r206988504
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
 ---
@@ -120,22 +120,54 @@ abstract class LogicalPlan
   /**
* Resolve the given `name` string against the given attribute, 
returning either 0 or 1 match.
--- End diff --

At a high level the changes include: 
* The column resolution needs to account for the qualifier, which is now a 
sequence of strings. 
* Enhance lookupRelation to add the qualifier appropriately. Right now, the 
LogicalPlan nodes representing the relation leaf nodes use only the table name 
as the qualifier when no alias is provided. This needs to be updated to use the 
database name and the table name for permanent views, tables, and global 
temporary views (see the sketch below). 
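
To make the second bullet concrete, here is a condensed sketch pieced together from the SubqueryAlias/lookupRelation hunks quoted elsewhere in this thread (simplified, and it assumes the patched withQualifier/qualifier types, so it is not standalone against stock Spark):

```scala
import org.apache.spark.sql.catalyst.AliasIdentifier
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, OrderPreservingUnaryNode}

// SubqueryAlias now carries an AliasIdentifier (table name + optional database)
// and stamps the full qualifier onto its output attributes.
case class SubqueryAlias(name: AliasIdentifier, child: LogicalPlan)
  extends OrderPreservingUnaryNode {

  def alias: String = name.identifier

  override def output: Seq[Attribute] = {
    // Seq("db", "table") for tables/permanent views, Seq("alias") when an alias is given.
    val qualifierList = name.database.map(Seq(_, alias)).getOrElse(Seq(alias))
    child.output.map(_.withQualifier(qualifierList))
  }
  override def doCanonicalize(): LogicalPlan = child.canonicalized
}

// And in SessionCatalog.lookupRelation (sketch): remember both the table and its database.
//   val alias = AliasIdentifier(table, Some(db))
//   ... SubqueryAlias(alias, viewDef) ...
```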


Full details of the design and the resolution logic are in the [doc 
here](https://drive.google.com/file/d/1zKm3aNZ3DpsqIuoMvRsf0kkDkXsAasxH/view), 
in Sections 3.3 and 3.4. 

Section 2 in the design doc lists a lot of scenarios that were tested 
against different databases and Spark, along with the expected behavior under 
the design here.   

If there is a specific scenario you have in mind that would be valid but 
gets flagged as ambiguous, please let me know; I'd be glad to look into it. 



---




[GitHub] spark pull request #17185: [SPARK-19602][SQL] Support column resolution of f...

2018-08-01 Thread skambha
Github user skambha commented on a diff in the pull request:

https://github.com/apache/spark/pull/17185#discussion_r206985225
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
 ---
@@ -120,22 +120,54 @@ abstract class LogicalPlan
   /**
* Resolve the given `name` string against the given attribute, 
returning either 0 or 1 match.
--- End diff --

The resolution logic needs to compare against the full qualifier.   If 
you are trying to resolve db1.t1.i1, the match needs to happen against the 
entire qualifier of the attribute.  

For example: 
```SQL
select db1.t1.i1 from t1   -- current database is db1, and say t1 has columns i1, i2
```
When resolving db1.t1.i1 against the input of this node, we will have the 
following attribute references:
- AttributeReference (qualifier Seq(db1, t1), name i1)
- AttributeReference (qualifier Seq(db1, t1), name i2) 

So the new resolution logic first matches the entire qualifier sequence and 
then matches the column (i.e. the attribute name).
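
A minimal sketch of that matching step (names are illustrative; in the patch this is threaded through AttributeSeq.resolve and a 3-part lookup map):

```scala
object QualifierMatchSketch {
  // Catalyst's Resolver is just (String, String) => Boolean (case sensitive or not).
  type Resolver = (String, String) => Boolean

  // Illustrative only: does `nameParts` (e.g. Seq("db1", "t1", "i1")) refer to an attribute
  // whose qualifier is `attrQualifier` (e.g. Seq("db1", "t1")) and whose name is `attrName`?
  def matchesFullyQualified(
      nameParts: Seq[String],
      attrQualifier: Seq[String],
      attrName: String,
      resolver: Resolver): Boolean = nameParts match {
    case qualifierParts :+ column =>
      resolver(column, attrName) &&
        qualifierParts.corresponds(attrQualifier)(resolver)   // the whole qualifier must line up
    case _ => false
  }
}
```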



---




[GitHub] spark pull request #17185: [SPARK-19602][SQL] Support column resolution of f...

2018-08-01 Thread skambha
Github user skambha commented on a diff in the pull request:

https://github.com/apache/spark/pull/17185#discussion_r206978220
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
 ---
@@ -654,16 +654,19 @@ class SessionCatalog(
*
* If the relation is a view, we generate a [[View]] operator from the 
view description, and
* wrap the logical plan in a [[SubqueryAlias]] which will track the 
name of the view.
+   * [[SubqueryAlias]] will also keep track of the name and 
database(optional) of the table/view
*
* @param name The name of the table/view that we look up.
*/
   def lookupRelation(name: TableIdentifier): LogicalPlan = {
 synchronized {
   val db = formatDatabaseName(name.database.getOrElse(currentDb))
   val table = formatTableName(name.table)
+  // To keep track of the name and database of the table/view
+  val alias = AliasIdentifier(table, Some(db))
   if (db == globalTempViewManager.database) {
 globalTempViewManager.get(table).map { viewDef =>
-  SubqueryAlias(table, viewDef)
+  SubqueryAlias(alias, viewDef)
--- End diff --

Here, SubqueryAlias takes an AliasIdentifier and not a string. 

I also want to mention that we considered something similar to your 
suggestion, i.e. simply adding a qualifier as a Seq[String] member of 
SubqueryAlias, but that change would require touching more files.  

I had implemented that first (a long time back now), and after some 
offline discussions we favored this approach, as the change is more localized 
and encapsulated.  @gatorsmile @cloud-fan 



---




[GitHub] spark pull request #17185: [SPARK-19602][SQL] Support column resolution of f...

2018-08-01 Thread skambha
Github user skambha commented on a diff in the pull request:

https://github.com/apache/spark/pull/17185#discussion_r206975896
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala
 ---
@@ -262,17 +262,47 @@ abstract class Star extends LeafExpression with 
NamedExpression {
  */
 case class UnresolvedStar(target: Option[Seq[String]]) extends Star with 
Unevaluable {
 
-  override def expand(input: LogicalPlan, resolver: Resolver): 
Seq[NamedExpression] = {
+  /**
+   * Returns true if the nameParts match the qualifier of the attribute
+   *
+   * There are two checks: i) Check if the nameParts match the qualifier 
fully.
+   * E.g. SELECT db.t1.* FROM db1.t1   In this case, the nameParts is 
Seq("db1", "t1") and
--- End diff --

This (SELECT db1.t1.* FROM t1) would resolve correctly.   

A similar scenario (details in Section 2, Table A, #5 in the design doc), 
select db1.t1.i1 from t1, also resolves correctly when the current database 
is db1. 

I will explicitly add a test case for SELECT db1.t1.* FROM t1 as well.  
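
A sketch of the kind of test case mentioned above, assuming a QueryTest-style suite with a SQL context (table and column names are illustrative):

```scala
// Illustrative only; the actual tests live in the columnresolution*.sql golden files.
sql("CREATE DATABASE IF NOT EXISTS db1")
sql("CREATE TABLE db1.t1(i1 INT, i2 INT) USING parquet")
sql("USE db1")
// Star expansion with a fully qualified db.table prefix should cover all of t1's columns.
checkAnswer(sql("SELECT db1.t1.* FROM t1"), sql("SELECT i1, i2 FROM t1"))
```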


---




[GitHub] spark issue #17185: [SPARK-19602][SQL] Support column resolution of fully qu...

2018-08-01 Thread skambha
Github user skambha commented on the issue:

https://github.com/apache/spark/pull/17185
  
@gatorsmile , @cloud-fan,  just a quick comment,  I have been working on 
this and will respond soon. 


---




[GitHub] spark pull request #17185: [SPARK-19602][SQL] Support column resolution of f...

2018-01-31 Thread skambha
Github user skambha commented on a diff in the pull request:

https://github.com/apache/spark/pull/17185#discussion_r165230391
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala
 ---
@@ -88,12 +88,12 @@ case class UnresolvedAttribute(nameParts: Seq[String]) 
extends Attribute with Un
   override def exprId: ExprId = throw new UnresolvedException(this, 
"exprId")
   override def dataType: DataType = throw new UnresolvedException(this, 
"dataType")
   override def nullable: Boolean = throw new UnresolvedException(this, 
"nullable")
-  override def qualifier: Option[String] = throw new 
UnresolvedException(this, "qualifier")
+  override def qualifier: Option[Seq[String]] = throw new 
UnresolvedException(this, "qualifier")
--- End diff --

I followed the style that exists today. If we change it here, then there are 
other places where it needs to change as well.  For example, in AttributeReference 
we initialize it with None today, and it would become Seq.empty; I am not sure 
if we want to create these objects or just leave it as None by keeping the 
Option[].  

If you have a strong preference, I can update it.   I do have the changes 
locally but have not pushed that version.   Thanks. 


---




[GitHub] spark issue #17185: [SPARK-19602][SQL] Support column resolution of fully qu...

2018-01-31 Thread skambha
Github user skambha commented on the issue:

https://github.com/apache/spark/pull/17185
  
I have rebased and pushed the changes.  I ran the unit tests ( sql, 
catalyst and hive). 
Earlier, I was having issues running the hive test suite locally but that 
is resolved with the fix from SPARK-23275.  

Please take a look.   Thanks. 


---




[GitHub] spark issue #17185: [SPARK-19602][SQL] Support column resolution of fully qu...

2018-01-22 Thread skambha
Github user skambha commented on the issue:

https://github.com/apache/spark/pull/17185
  
sure.  Let me look into it.  Thanks. 


---




[GitHub] spark issue #19921: [SPARK-22452][SQL] Add getDouble to DataSourceV2Options

2017-12-08 Thread skambha
Github user skambha commented on the issue:

https://github.com/apache/spark/pull/19921
  
Thanks @cloud-fan, @gatorsmile


---




[GitHub] spark issue #19902: [SPARK-22452][SQL]Add getInt, getLong, getBoolean to Dat...

2017-12-07 Thread skambha
Github user skambha commented on the issue:

https://github.com/apache/spark/pull/19902
  
Opened a new PR to add the getDouble method.  
https://github.com/apache/spark/pull/19921 


---




[GitHub] spark pull request #19921: [SPARK-22452][SQL] Add getDouble to DataSourceV2O...

2017-12-07 Thread skambha
GitHub user skambha opened a pull request:

https://github.com/apache/spark/pull/19921

[SPARK-22452][SQL] Add getDouble to DataSourceV2Options

- Implemented getDouble method in DataSourceV2Options
- Add unit test


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/skambha/spark ds2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19921.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19921


commit 2d4c2b0bdf89badf25f2d1d98903125e48e7cd5c
Author: Sunitha Kambhampati <skam...@us.ibm.com>
Date:   2017-12-07T17:59:11Z

Add getDouble to DataSourceV2Options and add unit test




---




[GitHub] spark issue #19902: [SPARK-22452][SQL]Add getInt, getLong, getBoolean to Dat...

2017-12-07 Thread skambha
Github user skambha commented on the issue:

https://github.com/apache/spark/pull/19902
  
cool!  I will add that and submit a new PR.  


---




[GitHub] spark issue #19902: [SPARK-22452][SQL]Add getInt, getLong, getBoolean to Dat...

2017-12-07 Thread skambha
Github user skambha commented on the issue:

https://github.com/apache/spark/pull/19902
  
great!  Thanks @gatorsmile , @cloud-fan.

Yes. I would be happy to open a new PR to add the rest of them. 

If I compare with the getXXX types in SparkConf, it looks like the only 
missing one is to add getDouble.  Do we want to add only that or any of the 
other datatypes or any other getXXX?  Let me know and I would be happy to open 
a new PR to address it.  Thank you!


---




[GitHub] spark issue #19902: [SPARK-22452][SQL]Add getInt, getLong, getBoolean to Dat...

2017-12-06 Thread skambha
Github user skambha commented on the issue:

https://github.com/apache/spark/pull/19902
  
great!  Thanks @gatorsmile for your comments.  

 I have updated with code comments, please take a look.   Thanks. 


---




[GitHub] spark pull request #19902: [SPARK-22452][SQL]Add getInt, getLong, getBoolean...

2017-12-06 Thread skambha
Github user skambha commented on a diff in the pull request:

https://github.com/apache/spark/pull/19902#discussion_r155394195
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/DataSourceV2Options.java 
---
@@ -49,4 +49,23 @@ public DataSourceV2Options(Map<String, String> 
originalMap) {
   public Optional get(String key) {
 return Optional.ofNullable(keyLowerCasedMap.get(toLowerCase(key)));
   }
+
--- End diff --

Thanks.  Is it ok if I modify the comment a bit to this: 

  /**
   * Returns the integer value to which the specified key is mapped, 
   * or defaultValue if this map contains no mapping for the key.
   * The key match is case-insensitive
   */


---




[GitHub] spark pull request #19902: [SPARK-22452][SQL]Add getInt, getLong, getBoolean...

2017-12-06 Thread skambha
Github user skambha commented on a diff in the pull request:

https://github.com/apache/spark/pull/19902#discussion_r155389138
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/DataSourceV2Options.java 
---
@@ -49,4 +49,22 @@ public DataSourceV2Options(Map<String, String> 
originalMap) {
   public Optional get(String key) {
 return Optional.ofNullable(keyLowerCasedMap.get(toLowerCase(key)));
   }
+
+  public Optional getBoolean(String key) {
--- End diff --

Per @gatorsmile's suggestion, I have removed the versions of the methods that 
don't take a default value.  


---




[GitHub] spark pull request #19902: [SPARK-22452][SQL]Add getInt, getLong, getBoolean...

2017-12-06 Thread skambha
Github user skambha commented on a diff in the pull request:

https://github.com/apache/spark/pull/19902#discussion_r155388109
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/DataSourceV2Options.java 
---
@@ -49,4 +49,41 @@ public DataSourceV2Options(Map<String, String> 
originalMap) {
   public Optional get(String key) {
 return Optional.ofNullable(keyLowerCasedMap.get(toLowerCase(key)));
   }
+
+  public Optional getBoolean(String key) {
+String lcaseKey = toLowerCase(key);
+return Optional.ofNullable(keyLowerCasedMap.containsKey(lcaseKey) ?
+  Boolean.parseBoolean(keyLowerCasedMap.get(lcaseKey)) : null);
+  }
+
+  public Optional getInt(String key) {
+String lcaseKey = toLowerCase(key);
+return Optional.ofNullable(keyLowerCasedMap.containsKey(lcaseKey) ?
+  Integer.parseInt(keyLowerCasedMap.get(lcaseKey)) : null);
+  }
+
+  public Optional getLong(String key) {
--- End diff --

Sure. Let me post another commit.  Thanks.



---




[GitHub] spark pull request #19902: [SPARK-22452][SQL]Add getInt, getLong, getBoolean...

2017-12-06 Thread skambha
Github user skambha commented on a diff in the pull request:

https://github.com/apache/spark/pull/19902#discussion_r155387973
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/DataSourceV2Options.java 
---
@@ -49,4 +49,22 @@ public DataSourceV2Options(Map<String, String> 
originalMap) {
   public Optional get(String key) {
 return Optional.ofNullable(keyLowerCasedMap.get(toLowerCase(key)));
   }
+
+  public Optional getBoolean(String key) {
--- End diff --

good idea.   

1. I have implemented getBoolean, getInt, and getLong with a defaultValue 
parameter and added unit tests for them (see the usage sketch below). 
 
2. For now, I have also kept the other methods that return Optional<>, in line 
with the existing getString, as they may still have some value.

Please take a look. 

If we don't want (2), I can remove the versions of getBoolean, getLong, and 
getInt that don't take a defaultValue.  

Thanks for your comments.
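
A small usage sketch of the defaultValue-style getters described in (1), assuming SparkConf-like signatures (the exact merged signatures may differ slightly):

```scala
import org.apache.spark.sql.sources.v2.DataSourceV2Options
import scala.collection.JavaConverters._

// The constructor takes a java.util.Map<String, String>, per the class quoted above.
val options = new DataSourceV2Options(Map("path" -> "/tmp/data", "numPartitions" -> "8").asJava)

options.getInt("numPartitions", 4)      // 8 -- parsed from the map
options.getInt("NUMPARTITIONS", 4)      // 8 -- key match is case-insensitive
options.getInt("missingKey", 4)         // 4 -- falls back to the supplied default
options.getBoolean("overwrite", false)  // false
```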


---




[GitHub] spark pull request #19902: [SPARK-22452][SQL]Add getInt, getLong, getBoolean...

2017-12-05 Thread skambha
GitHub user skambha opened a pull request:

https://github.com/apache/spark/pull/19902

[SPARK-22452][SQL]Add getInt, getLong, getBoolean to DataSourceV2Options

- Implemented methods getInt, getLong, getBoolean for DataSourceV2Options
- Added new unit tests to exercise these methods

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/skambha/spark spark22452

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19902.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19902


commit 566b7a5274bef77d1a786663bfe908f3e8dc9892
Author: Sunitha Kambhampati <skam...@us.ibm.com>
Date:   2017-12-05T22:29:52Z

Add getInt, getLong, getBoolean to DataSourceV2Options and add unit tests

commit bb01ff5ad214356a6bfdbebc36f1a7b2f37a46f1
Author: Sunitha Kambhampati <skam...@us.ibm.com>
Date:   2017-12-05T22:40:23Z

Fix formatting




---




[GitHub] spark issue #19747: [Spark-22431][SQL] Ensure that the datatype in the schem...

2017-11-28 Thread skambha
Github user skambha commented on the issue:

https://github.com/apache/spark/pull/19747
  
great!  Thank you @gatorsmile, @hvanhovell, @wzhfy


---




[GitHub] spark issue #19747: [Spark-22431][SQL] Ensure that the datatype in the schem...

2017-11-28 Thread skambha
Github user skambha commented on the issue:

https://github.com/apache/spark/pull/19747
  
Thanks @gatorsmile. 
I have addressed your comments in the latest commit.  Please take a look. 
Thanks.


---




[GitHub] spark pull request #19747: [Spark-22431][SQL] Ensure that the datatype in th...

2017-11-27 Thread skambha
Github user skambha commented on a diff in the pull request:

https://github.com/apache/spark/pull/19747#discussion_r153377691
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala 
---
@@ -895,6 +898,19 @@ private[hive] object HiveClientImpl {
 Option(hc.getComment).map(field.withComment).getOrElse(field)
   }
 
+  private def verifyColumnDataType(schema: StructType): Unit = {
+schema.foreach(field => {
+  val typeString = field.dataType.catalogString
--- End diff --

I have taken your change and incorporated it in the latest commit.  Thanks. 


---




[GitHub] spark issue #19747: [Spark-22431][SQL] Ensure that the datatype in the schem...

2017-11-27 Thread skambha
Github user skambha commented on the issue:

https://github.com/apache/spark/pull/19747
  
Thanks @gatorsmile for your comments. 
I have incorporated them in the latest commit: 
https://github.com/apache/spark/pull/19747/commits/a1c8a6d308b62f3439f07dbf3257b51855cb09d8

Please take a look. Thanks. 


---




[GitHub] spark issue #19747: [Spark-22431][SQL] Ensure that the datatype in the schem...

2017-11-17 Thread skambha
Github user skambha commented on the issue:

https://github.com/apache/spark/pull/19747
  
I have taken care of adding the check in the new 
HiveClientImpl.alterTableDataSchema as well and have added some new tests. 


---




[GitHub] spark pull request #19747: [Spark-22431][SQL] Ensure that the datatype in th...

2017-11-17 Thread skambha
Github user skambha commented on a diff in the pull request:

https://github.com/apache/spark/pull/19747#discussion_r151689272
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala 
---
@@ -40,6 +40,22 @@ class SQLQuerySuite extends QueryTest with 
SharedSQLContext {
 
   setupTestData()
 
+  test("SPARK-22431: table with nested type col with special char") {
--- End diff --

Thanks @gatorsmile for your comments. I have addressed them in the latest 
commit. 


---




[GitHub] spark pull request #19747: [Spark-22431][SQL] Ensure that the datatype in th...

2017-11-15 Thread skambha
Github user skambha commented on a diff in the pull request:

https://github.com/apache/spark/pull/19747#discussion_r151331207
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala 
---
@@ -507,6 +508,7 @@ private[hive] class HiveClientImpl(
 // these properties are still available to the others that share the 
same Hive metastore.
 // If users explicitly alter these Hive-specific properties through 
ALTER TABLE DDL, we respect
 // these user-specified values.
+verifyColumnDataType(table.dataSchema)
--- End diff --

Thanks @gatorsmile for the review.   I'll incorporate your other comments 
in my next commit.  

In the current codeline, another recent PR changed verifyColumnNames to 
verifyDataSchema.  

The reason I could not put the check in verifyDataSchema (or the old 
verifyColumnNames):
- verifyDataSchema is called at the beginning of the doCreateTable method, but 
we cannot error out that early because later in that method we create the 
datasource table.  If the datasource table cannot be stored in a Hive-compatible 
format, it falls back to the Spark SQL specific format, which works fine. 
- For example, if the check were there, the following CREATE of a datasource 
table would throw an exception right away, which we do not want: 

```CREATE TABLE t(q STRUCT<`$a`:INT, col2:STRING>, i1 INT) USING PARQUET```


---




[GitHub] spark issue #19747: [Spark-22431][SQL] Ensure that the datatype in the schem...

2017-11-15 Thread skambha
Github user skambha commented on the issue:

https://github.com/apache/spark/pull/19747
  
I synced up and noticed that some recent changes have gone in that modify the 
alter-table schema codepath in HiveExternalCatalog.  I'll take a look and see 
what changes might be needed for that. 


---




[GitHub] spark issue #19747: [Spark-22431][SQL] Ensure that the datatype in the schem...

2017-11-15 Thread skambha
Github user skambha commented on the issue:

https://github.com/apache/spark/pull/19747
  
Thanks @wzhfy for your comments.  I have addressed them in the latest 
commit.


---




[GitHub] spark pull request #19747: [Spark-22431][SQL] Ensure that the datatype in th...

2017-11-14 Thread skambha
GitHub user skambha opened a pull request:

https://github.com/apache/spark/pull/19747

[Spark-22431][SQL]  Ensure that the datatype in the schema for the 
table/view metadata is parseable by Spark before persisting it

## What changes were proposed in this pull request?
* JIRA:  [SPARK-22431](https://issues.apache.org/jira/browse/SPARK-22431)  
: Creating Permanent view with illegal type

**Description:** 
- It is possible in Spark SQL to create a permanent view that uses a nested 
field with an illegal name.
- For example if we create the following view:
```create view x as select struct('a' as `$q`, 1 as b) q```
- A simple select fails with the following exception:

```
select * from x;

org.apache.spark.SparkException: Cannot recognize hive type string: 
struct<$q:string,b:int>
  at 
org.apache.spark.sql.hive.client.HiveClientImpl$.fromHiveColumn(HiveClientImpl.scala:812)
  at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:378)
  at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:378)
...
```
**Issue/Analysis**: Right now, we can create a view with a schema that 
cannot be read back by Spark from the Hive metastore.  For more details, please 
see the discussion about the analysis and proposed fix options in comment 1 and 
comment 2 in the 
[SPARK-22431](https://issues.apache.org/jira/browse/SPARK-22431) 

**Proposed changes**: 
- Fix the Hive table/view codepath to check whether the schema datatype is 
parseable by Spark before persisting it in the metastore. This change is 
localized to HiveClientImpl and does a check similar to the one in 
fromHiveColumn. It is fail-fast, so we avoid the scenario where we write 
something to the metastore that we are unable to read back.  
- Added new unit tests
- Ran the sql related unit test suites ( hive/test, sql/test, 
catalyst/test) OK

With the fix: 
```
create view x as select struct('a' as `$q`, 1 as b) q;
17/11/14 19:16:03 ERROR SparkSQLDriver: Failed in [create view x as select 
struct('a' as `$q`, 1 as b) q]
org.apache.spark.SparkException: Cannot recognize the data type: 
struct<$q:string,b:int>
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$org$apache$spark$sql$hive$client$HiveClientImpl$$verifyColumnDataType$1.apply(HiveClientImpl.scala:907)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$org$apache$spark$sql$hive$client$HiveClientImpl$$verifyColumnDataType$1.apply(HiveClientImpl.scala:901)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
```
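
A minimal sketch of the kind of fail-fast check described above (the actual patch lives inside HiveClientImpl and may differ in naming and error wrapping):

```scala
import org.apache.spark.SparkException
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
import org.apache.spark.sql.types.StructType
import scala.util.control.NonFatal

object HiveSchemaCheckSketch {
  // Verify every column's data type can be parsed back by Spark
  // before the table/view schema is persisted to the Hive metastore.
  def verifyColumnDataType(schema: StructType): Unit = {
    schema.foreach { field =>
      val typeString = field.dataType.catalogString
      try {
        CatalystSqlParser.parseDataType(typeString)
      } catch {
        case NonFatal(_) =>
          throw new SparkException(s"Cannot recognize the data type: $typeString")
      }
    }
  }
}
```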
## How was this patch tested?
- New unit tests have been added. 

@hvanhovell, Please review and share your thoughts/comments.  Thank you so 
much.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/skambha/spark spark22431

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19747.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19747


commit c5824feb40af633ab480b311495ecb7737705c3a
Author: Sunitha Kambhampati <skam...@us.ibm.com>
Date:   2017-11-14T12:38:17Z

Add check to ensure that the schema col datatype is parseable before 
persisting to metastore, and add unit tests

commit ce474b7b028bba45c8bd29c31308503626baafbc
Author: Sunitha Kambhampati <skam...@us.ibm.com>
Date:   2017-11-14T16:02:00Z

Add : in error message

commit d5b553438d8740716e402c0210e3d121a48c2c64
Author: Sunitha Kambhampati <skam...@us.ibm.com>
Date:   2017-11-14T16:07:28Z

Remove empty line

commit 626703310aa269a9351a2cf7b6ce23f8e4ab095a
Author: Sunitha Kambhampati <skam...@us.ibm.com>
Date:   2017-11-14T16:20:06Z

remove empty line




---




[GitHub] spark pull request #17185: [SPARK-19602][SQL] Support column resolution of f...

2017-03-15 Thread skambha
Github user skambha commented on a diff in the pull request:

https://github.com/apache/spark/pull/17185#discussion_r106253035
  
--- Diff: 
sql/core/src/test/resources/sql-tests/inputs/columnresolution-views.sql ---
@@ -13,10 +13,8 @@ DROP VIEW view1;
 -- Test scenario with Global Temp view
 CREATE OR REPLACE GLOBAL TEMPORARY VIEW view1 as SELECT 1 as i1;
 SELECT * FROM global_temp.view1;
--- TODO: Support this scenario
 SELECT global_temp.view1.* FROM global_temp.view1;
--- End diff --

Thanks @gatorsmile and @cloud-fan.  

Hi @cloud-fan,  
The use case is to support fully qualified column names in queries, to remove 
ambiguity in naming.  

For e.g. `select db2.t1.i1 from db1.t1, db2.t1 where db1.t1.i2 = db2.t1.i3`

Today Spark does not support this.   The workaround is to create an alias for 
the table, as sketched below.
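
For illustration, a hedged sketch of the workaround with the tables from the 
example above (assuming a SparkSession named spark):

```
// Without fully qualified column support, alias the tables and qualify
// the columns through the aliases to disambiguate.
spark.sql("""
  SELECT b.i1
  FROM db1.t1 AS a, db2.t1 AS b
  WHERE a.i2 = b.i3
""").show()
```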
 
Yes, the lookup for the table comes from the ‘from’ clause. At a 
high level, the changes include keeping track of the qualifier (including the 
database name) when the lookup happens, and enhancing the column resolution 
logic in the analyzer to resolve fully qualified column names from a logical 
plan node’s children.  
There are more details in the design doc attached to 
[SPARK-19602](https://issues.apache.org/jira/browse/SPARK-19602).

Thanks.


---



[GitHub] spark issue #17185: [SPARK-19602][SQL] Support column resolution of fully qu...

2017-03-07 Thread skambha
Github user skambha commented on the issue:

https://github.com/apache/spark/pull/17185
  
cc @gatorsmile, @cloud-fan   I'd really appreciate your review and 
comments. Thanks much.


---



[GitHub] spark pull request #17185: [SPARK-19602][SQL] Support column resolution of f...

2017-03-06 Thread skambha
GitHub user skambha opened a pull request:

https://github.com/apache/spark/pull/17185

[SPARK-19602][SQL] Support column resolution of fully qualified column name 
( 3 part name)

## What changes were proposed in this pull request?
The design details is attached to the JIRA issue 
[here](https://issues.apache.org/jira/secure/attachment/12854681/Design_ColResolution_JIRA19602.pdf)

High level overview of the changes: 
- Enhance the qualifier to be more than one string
- Add support to store the qualifier. Enhance lookupRelation to keep 
the qualifier appropriately.
- Enhance the table matching column resolution algorithm to account for 
the qualifier being more than a single string.
- Enhance the table matching algorithm in UnresolvedStar.expand
- Ensure that we continue to support `select t1.i1 from db1.t1` (an example 
follows below)
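
For illustration, a hedged example (table db1.t1 is hypothetical) of what the 
change is meant to enable, alongside the 2-part form that keeps working:

```
// 3-part (database.table.column) reference, resolvable with this change
spark.sql("SELECT db1.t1.i1 FROM db1.t1").show()
// 2-part reference, which continues to work as before
spark.sql("SELECT t1.i1 FROM db1.t1").show()
```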

## How was this patch tested?
- New tests are added.
- Several test scenarios were added in a separate  [test pr 
17067](https://github.com/apache/spark/pull/17067).  The tests that were not 
supported earlier are marked with TODO markers and those are now supported with 
the code changes here. 
- Existing unit tests ( hive, catalyst and sql) were run successfully.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/skambha/spark colResolution

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17185.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17185


commit 9b27c0bb832444775f2e6f5488d66b8b4415350e
Author: Sunitha Kambhampati <skam...@us.ibm.com>
Date:   2016-11-01T23:33:45Z

Changes to resolve fully qualified column attribute references from a 
logical plan node children

commit 66c004d46e6fa837b940c88384ff0bc11e34482c
Author: Sunitha Kambhampati <skam...@us.ibm.com>
Date:   2017-03-03T23:07:16Z

Add AliasIdentifier, cleanups, merge and rebases

commit cd2e18bac541dc26b77af219e5120b29fbbcb87a
Author: Sunitha Kambhampati <skam...@us.ibm.com>
Date:   2017-03-04T00:21:56Z

Tests now support qualified column name (3 part) and rebase changes

commit e23407fbb2e472d0dfc470afdb12c09cd4ca4d9f
Author: Sunitha Kambhampati <skam...@us.ibm.com>
Date:   2017-03-05T01:14:10Z

merge changes

commit 0b3e6f8eb8f019404d292984a16a63a6c75ffe88
Author: Sunitha Kambhampati <skam...@us.ibm.com>
Date:   2017-03-06T23:14:10Z

Cleanup

commit 8d6bd3f74a509614f1de81803ef207b1440eaa17
Author: Sunitha Kambhampati <skam...@us.ibm.com>
Date:   2017-03-06T23:52:10Z

test cleanup

commit fb72118ce1b780deb0613578cf69320e16f12f5e
Author: Sunitha Kambhampati <skam...@us.ibm.com>
Date:   2017-03-07T00:12:35Z

formatting




---



[GitHub] spark issue #17067: [SPARK-19602][SQL][TESTS] Add tests for qualified column...

2017-03-02 Thread skambha
Github user skambha commented on the issue:

https://github.com/apache/spark/pull/17067
  
Thanks a lot Xiao. 


---



[GitHub] spark issue #17067: [SPARK-19602][SQL][TESTS] Add tests for qualified column...

2017-03-01 Thread skambha
Github user skambha commented on the issue:

https://github.com/apache/spark/pull/17067
  
- Changes to the SQLQueryTestSuite framework to mask the exprId so that the 
negative cases can also be added using this framework (a sketch of the masking 
idea follows below).
- Added negative test cases to the SQLQueryTestSuite framework and removed 
the Hive-specific test suite.  For the Hive table testcase, I will add that 
test as part of the actual code changes PR.
- I synced up the codeline and there was one test output, inner-join.sql.out, 
that needed a comment to be updated, so I have updated that as well. 
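
A minimal sketch of the masking idea, assuming exprIds surface as 
`name#<number>` in the captured output (the actual framework change may differ):

```
// Normalize expression IDs so golden files stay stable across runs,
// e.g. "i1#42" and "i1#87" both become "i1#x".
def maskExprIds(output: String): String = output.replaceAll("#\\d+", "#x")
```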


---



[GitHub] spark pull request #17067: [SPARK-19602][SQL][TESTS] Add tests for qualified...

2017-03-01 Thread skambha
Github user skambha commented on a diff in the pull request:

https://github.com/apache/spark/pull/17067#discussion_r103756083
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/SQLViewSuite.scala ---
@@ -52,6 +52,19 @@ abstract class SQLViewSuite extends QueryTest with 
SQLTestUtils {
 }
   }
 
+  test("column resolution scenarios with local temp view") {
+val df = Seq(2).toDF("i1")
+df.createOrReplaceTempView("table1")
+withTempView("table1") {
+  checkAnswer(spark.sql("SELECT table1.* FROM table1"), Row(2))
+  checkAnswer(spark.sql("SELECT * FROM table1"), Row(2))
+  checkAnswer(spark.sql("SELECT i1 FROM table1"), Row(2))
+  checkAnswer(spark.sql("SELECT table1.i1 FROM table1"), Row(2))
+  checkAnswer(spark.sql("SELECT a.i1 FROM table1 AS a"), Row(2))
+  checkAnswer(spark.sql("SELECT i1 FROM table1 AS a"), Row(2))
--- End diff --

Hi Xiao,  I have moved my new local temp view tests and the global temp 
view tests to the SQLQueryTestSuite framework as well.  Please take a look. 
Thanks. 


---



[GitHub] spark pull request #17067: [SPARK-19602][SQL][TESTS] Add tests for qualified...

2017-02-27 Thread skambha
Github user skambha commented on a diff in the pull request:

https://github.com/apache/spark/pull/17067#discussion_r103379909
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/ColumnResolutionSuite.scala
 ---
@@ -0,0 +1,173 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive.execution
+
+import org.apache.spark.sql.{AnalysisException, QueryTest, Row}
+import org.apache.spark.sql.hive.test.TestHiveSingleton
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.test.SQLTestUtils
+
+class ColumnResolutionSuite extends QueryTest with SQLTestUtils with 
TestHiveSingleton {
--- End diff --

The logic to resolve the column in the LogicalPlan is the same - there is no 
change there.  I wanted to test the Hive table to make sure that the qualifier 
information is correctly set. We update the qualifier info in MetastoreRelation, 
so I wanted to have coverage for a Hive table. 


---



[GitHub] spark pull request #17067: [SPARK-19602][SQL][TESTS] Add tests for qualified...

2017-02-27 Thread skambha
Github user skambha commented on a diff in the pull request:

https://github.com/apache/spark/pull/17067#discussion_r103378392
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/SQLViewSuite.scala ---
@@ -52,6 +52,19 @@ abstract class SQLViewSuite extends QueryTest with 
SQLTestUtils {
 }
   }
 
+  test("column resolution scenarios with local temp view") {
+val df = Seq(2).toDF("i1")
+df.createOrReplaceTempView("table1")
+withTempView("table1") {
+  checkAnswer(spark.sql("SELECT table1.* FROM table1"), Row(2))
+  checkAnswer(spark.sql("SELECT * FROM table1"), Row(2))
+  checkAnswer(spark.sql("SELECT i1 FROM table1"), Row(2))
+  checkAnswer(spark.sql("SELECT table1.i1 FROM table1"), Row(2))
+  checkAnswer(spark.sql("SELECT a.i1 FROM table1 AS a"), Row(2))
+  checkAnswer(spark.sql("SELECT i1 FROM table1 AS a"), Row(2))
--- End diff --

Sure, let me look at converting these too. Thanks.


---



[GitHub] spark issue #17067: [SPARK-19602][SQL][TESTS] Add tests for qualified column...

2017-02-27 Thread skambha
Github user skambha commented on the issue:

https://github.com/apache/spark/pull/17067
  
Thanks much Xiao for the review and comments. 

I have made the following changes: 
- Separated out the negative cases from the positive cases. 
- Moved the positive tests, and also the cases that should be supported, into 
the SQLQueryTestSuite framework.  A new test file columnresolution.sql and the 
corresponding master output file are added. 
- Cleaned up the ColumnResolutionSuite to remove cases that are covered in 
the SQLQueryTestSuite. 
- I have kept the negative cases in the ColumnResolutionSuite because the 
exprId shows up in the exception. 
- I also wanted to cover a case against a Hive serde table, so I have kept 
those tests in the ColumnResolutionSuite. 

Please advise if we should move any others.  Thanks. 


---



[GitHub] spark pull request #17067: [SPARK-19602][SQL][TESTS] Add tests for qualified...

2017-02-25 Thread skambha
GitHub user skambha opened a pull request:

https://github.com/apache/spark/pull/17067

[SPARK-19602][SQL][TESTS] Add tests for qualified column names

## What changes were proposed in this pull request?
- Add tests covering different scenarios with qualified column names 
without any code changes. 
- Please see Section 2 in the design doc for the various test scenarios 
[here](https://issues.apache.org/jira/secure/attachment/12854681/Design_ColResolution_JIRA19602.pdf)
- As part of SPARK-19602, changes are made to support three part column 
name. In order to aid in the review and to reduce the diff, the test scenarios 
are separated out into this PR.

## How was this patch tested?
- This is a test only change. The individual test suites were run 
successfully.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/skambha/spark colResolutionTests

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17067.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17067


commit 9c2e24cc2fe7c194e3cde7c5ff623234a347774b
Author: Sunitha Kambhampati <skam...@us.ibm.com>
Date:   2017-02-24T21:54:28Z

Add tests to cover scenarios for the column resolution

commit 778e3d6aa52b112c9126b9cc0492c49efac6ff4c
Author: Sunitha Kambhampati <skam...@us.ibm.com>
Date:   2017-02-24T22:01:16Z

Add tests for global temp views

commit addb441c3579816d33cb996f073bb616679c6790
Author: Sunitha Kambhampati <skam...@us.ibm.com>
Date:   2017-02-24T22:15:41Z

Add test with local view

commit 02cc8ed47989b84030e8f57fefd33a43b9cb4820
Author: Sunitha Kambhampati <skam...@us.ibm.com>
Date:   2017-02-25T01:14:52Z

newline in end

commit 32d55e6d521047a9ba7b3fe0e7c75cd0a24cbab0
Author: Sunitha Kambhampati <skam...@us.ibm.com>
Date:   2017-02-25T02:05:30Z

uppercase struct

commit 2f9937e5dc0fc02218c644083432f1bff241409f
Author: Sunitha Kambhampati <skam...@us.ibm.com>
Date:   2017-02-25T22:23:18Z

fix style and use ctas




---



[GitHub] spark issue #16919: [SPARK-19585][DOC][SQL] Fix the cacheTable and uncacheTa...

2017-02-14 Thread skambha
Github user skambha commented on the issue:

https://github.com/apache/spark/pull/16919
  
Thanks Xiao!


---



[GitHub] spark pull request #16919: [SPARK-19585][DOC][SQL] Fix the cacheTable and un...

2017-02-13 Thread skambha
GitHub user skambha opened a pull request:

https://github.com/apache/spark/pull/16919

[SPARK-19585][DOC][SQL] Fix the cacheTable and uncacheTable api call in the 
doc

## What changes were proposed in this pull request?


https://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory
In the doc, the calls spark.cacheTable("tableName") and 
spark.uncacheTable("tableName") actually need to be 
spark.catalog.cacheTable and spark.catalog.uncacheTable (shown below).
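
The corrected calls, for quick reference (spark is the SparkSession):

```
// Caching APIs hang off the catalog in Spark 2.x
spark.catalog.cacheTable("tableName")
spark.catalog.uncacheTable("tableName")
```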

## How was this patch tested?
Built the docs and verified the change shows up fine. 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/skambha/spark docChange

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16919.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16919






---



[GitHub] spark issue #15649: [SPARK-18121][SQL] Unable to query global temp views whe...

2016-10-27 Thread skambha
Github user skambha commented on the issue:

https://github.com/apache/spark/pull/15649
  
Thanks @cloud-fan, @gatorsmile, @viirya


---



[GitHub] spark pull request #15649: [SPARK-18121][SQL] Unable to query global temp vi...

2016-10-27 Thread skambha
Github user skambha commented on a diff in the pull request:

https://github.com/apache/spark/pull/15649#discussion_r85377599
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala 
---
@@ -65,6 +65,20 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils 
with TestHiveSingleton {
   import hiveContext._
   import spark.implicits._
 
+  test("query global temp view") {
+val df = Seq(1).toDF("i1")
+df.createGlobalTempView("tbl1")
+checkAnswer(spark.sql("select * from global_temp.tbl1"), Row(1))
--- End diff --

Thanks @cloud-fan.  I have changed the test to get the global_temp db from 
the conf: spark.conf.get("spark.sql.globalTempDatabase").

It looks like there are two ways to get it: 
- One is using the conf
- The other is accessing spark.sharedState.globalTempViewManager.database

In the test, I am now getting it from the conf (both options are sketched 
below).   Please see if this is ok.  Thanks. 
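
A hedged sketch of the two options (the table name tbl1 and the 
checkAnswer/Row helpers come from the test diff above):

```
// Option 1: read the global temp database name from the conf (used in the test)
val globalTempDb = spark.conf.get("spark.sql.globalTempDatabase")
checkAnswer(spark.sql(s"SELECT * FROM $globalTempDb.tbl1"), Row(1))

// Option 2: read it from the shared state (internal API)
val dbFromSharedState = spark.sharedState.globalTempViewManager.database
```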


---



[GitHub] spark pull request #15649: [SPARK-18121][SQL] Unable to query global temp vi...

2016-10-26 Thread skambha
GitHub user skambha opened a pull request:

https://github.com/apache/spark/pull/15649

[SPARK-18121][SQL] Unable to query global temp views when hive support is 
enabled

## What changes were proposed in this pull request?

Issue: 
Querying on a global temp view throws Table or view not found exception. 

Fix:
Update the lookupRelation in HiveSessionCatalog to check for global temp 
views similar to the SessionCatalog.lookupRelation.

Before fix: 
Querying a global temp view (for example, `select * from global_temp.v1`) 
throws a "Table or view not found" exception.

After fix: 
The query succeeds and returns the right result (a repro sketch follows below). 
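
A hedged repro sketch with Hive support enabled (view name taken from the 
example above; assumes a SparkSession named spark):

```
import spark.implicits._

// Before the fix this failed with "Table or view not found" when Hive
// support was enabled; after the fix it returns Row(1).
val df = Seq(1).toDF("i1")
df.createGlobalTempView("v1")
spark.sql("SELECT * FROM global_temp.v1").show()
```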

## How was this patch tested?
- Two unit tests are added to check for global temp view for the code path 
when hive support is enabled.
- Regression unit tests were run successfully. ( build/sbt -Phive 
hive/test, build/sbt sql/test, build/sbt catalyst/test)

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/skambha/spark lookuprelationChanges

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15649.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15649


commit 95312dd16c0d5b70996fcdc3a49bcdab29fd99d2
Author: Sunitha Kambhampati <skam...@us.ibm.com>
Date:   2016-10-26T20:11:56Z

Add check for global temp tables in HiveSessionCatalog lookupRelation




---



[GitHub] spark pull request #15168: [SPARK-17612][SQL] Support `DESCRIBE table PARTIT...

2016-09-20 Thread skambha
Github user skambha commented on a diff in the pull request:

https://github.com/apache/spark/pull/15168#discussion_r79746339
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala 
---
@@ -341,6 +342,25 @@ class SQLQuerySuite extends QueryTest with 
SQLTestUtils with TestHiveSingleton {
 }
   }
 
+  test("describe partition") {
+withTable("partitioned_table") {
+  sql("CREATE TABLE partitioned_table (a STRING, b INT) PARTITIONED BY 
(c STRING, d STRING)")
+  sql("ALTER TABLE partitioned_table ADD PARTITION (c='Us', d=1)")
+
+  sql("DESC partitioned_table PARTITION (c='Us', d=1)")
--- End diff --

Hive supports DESCRIBE with a partition spec and will list the details for 
that particular partition when used with the FORMATTED or EXTENDED option, for 
example: 
DESCRIBE FORMATTED part_table PARTITION (d='abc') 

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Describe

@dongjoon-hyun, this might be beyond the scope of this PR but this would be 
useful if there are a lot of partitions and we want to find details for a given 
partition.  What do you think? Thanks.



---



[GitHub] spark pull request #15168: [SPARK-17612][SQL] Support `DESCRIBE table PARTIT...

2016-09-20 Thread skambha
Github user skambha commented on a diff in the pull request:

https://github.com/apache/spark/pull/15168#discussion_r79716333
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala 
---
@@ -341,6 +342,25 @@ class SQLQuerySuite extends QueryTest with 
SQLTestUtils with TestHiveSingleton {
 }
   }
 
+  test("describe partition") {
+withTable("partitioned_table") {
+  sql("CREATE TABLE partitioned_table (a STRING, b INT) PARTITIONED BY 
(c STRING, d STRING)")
+  sql("ALTER TABLE partitioned_table ADD PARTITION (c='Us', d=1)")
+
+  sql("DESC partitioned_table PARTITION (c='Us', d=1)")
--- End diff --

I was thinking of verifying that the output correctly reports the 
partition columns; one possible check is sketched below. 
Is there a testcase that covers this already? If so, that is fine. 
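
A hedged sketch of one such check, assuming the usual col_name first column in 
the DESC output:

```
// Verify the partition columns appear in the DESC ... PARTITION output
val colNames = sql("DESC partitioned_table PARTITION (c='Us', d=1)")
  .collect().map(_.getString(0))
assert(colNames.contains("c") && colNames.contains("d"))
```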


---



[GitHub] spark pull request #15168: [SPARK-17612][SQL] Support `DESCRIBE table PARTIT...

2016-09-20 Thread skambha
Github user skambha commented on a diff in the pull request:

https://github.com/apache/spark/pull/15168#discussion_r79710335
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala 
---
@@ -341,6 +342,25 @@ class SQLQuerySuite extends QueryTest with 
SQLTestUtils with TestHiveSingleton {
 }
   }
 
+  test("describe partition") {
+withTable("partitioned_table") {
+  sql("CREATE TABLE partitioned_table (a STRING, b INT) PARTITIONED BY 
(c STRING, d STRING)")
+  sql("ALTER TABLE partitioned_table ADD PARTITION (c='Us', d=1)")
+
+  sql("DESC partitioned_table PARTITION (c='Us', d=1)")
--- End diff --

Can you add some validation of the output returned?


---



[GitHub] spark pull request #13822: [SPARK-16115][SQL] Improve error message on non-e...

2016-09-20 Thread skambha
Github user skambha closed the pull request at:

https://github.com/apache/spark/pull/13822


---



[GitHub] spark issue #13822: [SPARK-16115][SQL] Improve error message on non-existent...

2016-09-20 Thread skambha
Github user skambha commented on the issue:

https://github.com/apache/spark/pull/13822
  
@cloud-fan, PR 15054 was merged this morning and it has resolved the 
issue this PR was trying to address.  That said, the changes that went in there 
will throw an error saying the table does not exist in the database even if it 
is a temporary view, so the error message is not as appropriate. 

In any case, I think it is fine to close this PR.  But the tests are valid. 
Let me know if there is interest, and I can submit just the tests.  Thanks.


---



[GitHub] spark pull request #13822: [SPARK-16115][SQL] Change output schema to be par...

2016-09-16 Thread skambha
Github user skambha commented on a diff in the pull request:

https://github.com/apache/spark/pull/13822#discussion_r79265065
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---
@@ -660,6 +662,10 @@ case class ShowPartitionsCommand(
   override def run(sparkSession: SparkSession): Seq[Row] = {
 val catalog = sparkSession.sessionState.catalog
 
+if (!catalog.tableExists(table)) {
+  throw new AnalysisException(s" Table does not exist")
--- End diff --

Changes done. I used table.unquotedString, similar to what is used in the 
nearby code (a sketch follows below). 
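
A hedged sketch of the revised check with the table name in the message (the 
exact wording in the commit may differ):

```
// Include the table identifier in the error instead of a generic message
if (!catalog.tableExists(table)) {
  throw new AnalysisException(s"Table ${table.unquotedString} does not exist")
}
```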


---



[GitHub] spark pull request #13822: [SPARK-16115][SQL] Change output schema to be par...

2016-09-16 Thread skambha
Github user skambha commented on a diff in the pull request:

https://github.com/apache/spark/pull/13822#discussion_r79265007
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala 
---
@@ -595,6 +595,19 @@ class HiveDDLSuite
 }
   }
 
+  test("test show partitions") {
+val message = intercept[AnalysisException] {
+  sql("SHOW PARTITIONS default.nonexistentTable")
+}.getMessage
+assert(message.contains("Table does not exist"))
+
+withTable("t1") {
+  sql("CREATE TABLE t1 (key STRING, value STRING) PARTITIONED BY (ds 
STRING)")
+  sql("ALTER TABLE t1 ADD PARTITION (ds = '1')")
+  assert(sql(" SHOW PARTITIONS t1").schema.getFieldIndex("partition") 
== Some(0))
--- End diff --

My original reasoning for adding this test was to exercise the code change 
that makes the output attribute reference name for the SHOW PARTITIONS command 
‘partition’.  This test covers that scenario. Without the fix, it was 
‘result’; with the fix it is ‘partition’.

But it turns out another PR already made the code change for it, although no 
tests were added in that PR, so I have kept this as a separate test for now.  

If you prefer not to have this test, we can remove it.   


---



[GitHub] spark pull request #13822: [SPARK-16115][SQL] Change output schema to be par...

2016-09-16 Thread skambha
Github user skambha commented on a diff in the pull request:

https://github.com/apache/spark/pull/13822#discussion_r79264648
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala 
---
@@ -595,6 +595,19 @@ class HiveDDLSuite
 }
   }
 
+  test("test show partitions") {
--- End diff --

Done. I renamed it per your suggestion. 


---



[GitHub] spark issue #13822: [SPARK-16115][SQL] Change output schema to be partition ...

2016-09-16 Thread skambha
Github user skambha commented on the issue:

https://github.com/apache/spark/pull/13822
  
Thank you so much Andrew for reviewing.  I appreciate it. 
I have taken care of your comments and also rebased.  Please take a look.  
Thanks. 

My changes do two things: 
1) Rename the output column for the SHOW PARTITIONS command to ‘partition’
2) Add an error message for a non-existent table in the SHOW PARTITIONS command
and add unit tests to cover both scenarios.

But another [PR for 
SPARK-16543](https://github.com/apache/spark/commit/56183b84fb64ea13977d89ec55a9dd3997b4dacf)
 is already merged that takes care of (1) above.   There were no tests added in 
that one, so I am keeping the test that checks (1) here.   Is that ok?

cc @yhuai, @cloud-fan, @andrewor14


---



[GitHub] spark pull request #14897: [SPARK-17338][SQL] add global temp view

2016-09-07 Thread skambha
Github user skambha commented on a diff in the pull request:

https://github.com/apache/spark/pull/14897#discussion_r77879554
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala
 ---
@@ -451,7 +464,7 @@ class SessionCatalog(
 if (isTemporaryTable(name)) {
   true
 } else {
-  externalCatalog.tableExists(db, table)
+  globalTempViews.get(table).isDefined || 
externalCatalog.tableExists(db, table)
--- End diff --

I think tableExists should follow the same semantics as lookupRelation, 
i.e. if the database in the TableIdentifier is defined, then table existence 
should be checked against that database and table.

The code comment is correct, but the code is not enforcing it in 
tableExists. 

The table/view resolution in lookupRelation is: 
- if the database is defined, then fetch the table/view from it. 
- if the database is not defined, then the following checks happen in this 
particular order:  
   a) check if a local temp view with the name exists
   b) check if a global temp view with the name exists 
   c) if not, then check if there is a table/view in the current db 

Currently, if I do tableExists(db1.t1), it will return true if a global 
temp view t1 exists even when there is no db1.t1 in the externalCatalog. 
   
In the tableExists method, if the database is defined we should check against 
the externalCatalog and not the global temp views, and enhance it to follow the 
resolution semantics in lookupRelation (a sketch follows below). 
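
A hedged sketch of the semantics being proposed (identifiers such as 
globalTempViews, isTemporaryTable and currentDb come from the surrounding 
SessionCatalog context; the actual code may differ):

```
// Consult temp and global temp views only when no database is specified.
def tableExists(name: TableIdentifier): Boolean = {
  if (name.database.isDefined) {
    externalCatalog.tableExists(name.database.get, name.table)
  } else {
    isTemporaryTable(name) ||
      globalTempViews.get(name.table).isDefined ||
      externalCatalog.tableExists(currentDb, name.table)
  }
}
```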


---



[GitHub] spark issue #13822: [SPARK-16115][SQL] Change output schema to be partition ...

2016-08-02 Thread skambha
Github user skambha commented on the issue:

https://github.com/apache/spark/pull/13822
  
gentle ping @andrewor14  

I have rebased it to master.   When you get a chance, can you please 
review?  Thanks. 


---



[GitHub] spark pull request #13822: [SPARK-16115][SQL] Change output schema to be par...

2016-06-21 Thread skambha
GitHub user skambha opened a pull request:

https://github.com/apache/spark/pull/13822

[SPARK-16115][SQL] Change output schema to be partition for SHOW PARTITIONS 
command and …

## What changes were proposed in this pull request?

Changes include: 
1. For the SHOW PARTITIONS command, the column name in the output is 
changed from 'result' to 'partition' as it is more descriptive.  This is now 
similar to Hive.

2. Corner case: Improve the error message when calling SHOW PARTITIONS on a 
non-existent table. 
Add a check for whether the table exists and throw a clearer error message if 
it does not.

Without the fix: 
scala> spark.sql("show partitions t1");
org.apache.spark.sql.AnalysisException: SHOW PARTITIONS is not allowed on a 
table that is not partitioned: default.t1;

With the fix: 
scala> spark.sql("SHOW PARTITIONS T1").show;
org.apache.spark.sql.AnalysisException:  Table does not exist;

## How was this patch tested?
- Added unit tests to cover these two scenarios
- Following unit tests were run successfully:  hive/test, sql/test, 
catalyst/test

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/skambha/spark showpart16115

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/13822.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13822






---



[GitHub] spark pull request: [SPARK-13630][SQL] Adds optimizer rule collaps...

2016-04-19 Thread skambha
Github user skambha closed the pull request at:

https://github.com/apache/spark/pull/11480


---



[GitHub] spark pull request: [SPARK-13774][SQL] - Improve error message for...

2016-03-22 Thread skambha
Github user skambha commented on a diff in the pull request:

https://github.com/apache/spark/pull/11775#discussion_r57043148
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/MetastoreDataSourcesSuite.scala
 ---
@@ -693,23 +693,25 @@ class MetastoreDataSourcesSuite extends QueryTest 
with SQLTestUtils with TestHiv
   test("SPARK-6024 wide schema support") {
 withSQLConf(SQLConf.SCHEMA_STRING_LENGTH_THRESHOLD.key -> "4000") {
   withTable("wide_schema") {
-// We will need 80 splits for this schema if the threshold is 4000.
-val schema = StructType((1 to 5000).map(i => StructField(s"c_$i", 
StringType, true)))
-
-// Manually create a metastore data source table.
-sessionState.catalog.createDataSourceTable(
-  tableIdent = TableIdentifier("wide_schema"),
-  userSpecifiedSchema = Some(schema),
-  partitionColumns = Array.empty[String],
-  bucketSpec = None,
-  provider = "json",
-  options = Map("path" -> "just a dummy path"),
--- End diff --

The test had a dummy path, which is a non-existent path, and with this fix 
we now throw an error.   So the test is modified to use a temp dir and allow 
it to move past the check so it can continue to test the scenario it was meant 
to test.   There are 2 tests in this suite that were modified.

Please let me know if there are any concerns.  Thanks.


---



[GitHub] spark pull request: [SPARK-13774][SQL] - Improve error message for...

2016-03-22 Thread skambha
Github user skambha commented on the pull request:

https://github.com/apache/spark/pull/11775#issuecomment-199904383
  
Thanks @liancheng for the review, for merging the fix, and for adding me to 
the Spark contributor group in Apache JIRA. 


---



[GitHub] spark pull request: [SPARK-13774][SQL] - Improve error message for...

2016-03-19 Thread skambha
GitHub user skambha opened a pull request:

https://github.com/apache/spark/pull/11775

[SPARK-13774][SQL] - Improve error message for non-existent paths and add 
tests

SPARK-13774: IllegalArgumentException: Can not create a Path from an empty 
string for incorrect file path

**Overview:**
-   If a non-existent path is given in this call
``
scala> sqlContext.read.format("csv").load("file-path-is-incorrect.csv")
``
it throws the following error:
`java.lang.IllegalArgumentException: Can not create a Path from an empty 
string` ….. 
`It gets called from inferSchema call in 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation`

-   The purpose of this JIRA is to throw a better error message. 
-   With the fix, you will now get a _Path does not exist_ error message. 
```
scala> sqlContext.read.format("csv").load("file-path-is-incorrect.csv")
org.apache.spark.sql.AnalysisException: Path does not exist: 
file:/Users/ksunitha/trunk/spark/file-path-is-incorrect.csv;
  at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:215)
  at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:204)
  ...
  at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:204)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:131)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:141)
  ... 49 elided
```

**Details**
_Changes include:_
-   Check if path exists or not in resolveRelation in DataSource, and throw 
an AnalysisException with message like “Path does not exist: $path”
-   AnalysisException is thrown similar to the exceptions thrown in 
resolveRelation.
-   The glob path and the non-glob path are checked with minimal calls to 
pathExists. If the globPath is empty, then it is a non-existent glob pattern and 
an error will be thrown. If it is not a glob path, it is sufficient to check 
only whether the first element in the Seq is valid. 
-   A new method pathExists is added to SparkHadoopUtil to check if path 
exists or not.

_Test modifications:_
-   Changes went in for 3 tests to account for this error checking.
-   SQLQuerySuite:test("run sql directly on files") – Error message 
needed to be updated.
-   2 tests failed in MetastoreDataSourcesSuite because they had a dummy 
path and so test is modified to give a tempdir and allow it to move past so it 
can continue to test the codepath it meant to test

_New Tests:_
2 new tests are added to DataFrameSuite to validate that glob and non-glob 
path will throw the new error message.  

_Testing:_
Unit tests were run with the fix.

**Notes/Questions to reviewers:**
-   There is some code duplication in DataSource.scala between the 
resolveRelation method and createSource with respect to getting the paths.  I 
have not made any changes to the createSource codepath.  Should we make the 
change there as well? 

-   From other JIRAs, I know there is restructuring and there are changes going 
on in this area; I am not sure how that will affect these changes, but since 
this seemed like a starter issue, I looked into it.  If we prefer not to add 
the overhead of the checks, or if there is a better place to do so, let me 
know.  

I would appreciate your review. Thanks for your time and comments.





You can merge this pull request into a Git repository by running:

$ git pull https://github.com/skambha/spark improve_errmsg

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/11775.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #11775


commit e03e7761679502a98d11e3c254cc62efcc0eb36b
Author: Sunitha Kambhampati <skam...@us.ibm.com>
Date:   2016-03-16T21:44:03Z

SPARK-13774 - Improve error message for non-existent paths and add tests




---



[GitHub] spark pull request: [SPARK-13774][SQL] - Improve error message for...

2016-03-19 Thread skambha
Github user skambha commented on a diff in the pull request:

https://github.com/apache/spark/pull/11775#discussion_r56549702
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala 
---
@@ -237,6 +237,8 @@ class SparkHadoopUtil extends Logging {
 }
   }
 
+  def pathExists( p: Path, config: Configuration): Boolean = 
p.getFileSystem(config).exists(p)
--- End diff --

 I have removed this helper method. 


---



[GitHub] spark pull request: [SPARK-13774][SQL] - Improve error message for...

2016-03-19 Thread skambha
Github user skambha commented on a diff in the pull request:

https://github.com/apache/spark/pull/11775#discussion_r56689434
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -205,7 +205,16 @@ case class DataSource(
   val hdfsPath = new Path(path)
   val fs = 
hdfsPath.getFileSystem(sqlContext.sparkContext.hadoopConfiguration)
   val qualified = hdfsPath.makeQualified(fs.getUri, 
fs.getWorkingDirectory)
-  SparkHadoopUtil.get.globPathIfNecessary(qualified)
+  val globPath = SparkHadoopUtil.get.globPathIfNecessary(qualified)
+
+  if (globPath.isEmpty) {
+throw new AnalysisException(s"Path does not exist: $qualified")
+  }
+  // Sufficient to check head of the globPath seq for non-glob 
scenario
+  if (! fs.exists(globPath.head)) {
--- End diff --

Thanks @srowen.  I have taken care of it and will pay attention to the 
style. 


---



[GitHub] spark pull request: [SPARK-13774][SQL] - Improve error message for...

2016-03-19 Thread skambha
Github user skambha commented on the pull request:

https://github.com/apache/spark/pull/11775#issuecomment-198001758
  
@srowen Thanks for your comments.  I have addressed them in the last 
commit.  


---



[GitHub] spark pull request: [SPARK-13774][SQL] - Improve error message for...

2016-03-19 Thread skambha
Github user skambha commented on a diff in the pull request:

https://github.com/apache/spark/pull/11775#discussion_r56551299
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -205,7 +205,17 @@ case class DataSource(
   val hdfsPath = new Path(path)
   val fs = 
hdfsPath.getFileSystem(sqlContext.sparkContext.hadoopConfiguration)
   val qualified = hdfsPath.makeQualified(fs.getUri, 
fs.getWorkingDirectory)
-  SparkHadoopUtil.get.globPathIfNecessary(qualified)
+  val globPath = SparkHadoopUtil.get.globPathIfNecessary(qualified)
+
+  if (globPath.isEmpty) {
+throw new AnalysisException(s"Path does not exist: $qualified")
+  }
+  // Sufficient to check head of the globPath seq for non-glob 
scenario
+  if (! (SparkHadoopUtil.
--- End diff --

I have used string interpolation for the exception message (both forms are 
shown below for reference). 
Is string interpolation generally preferred over string concatenation (+) in 
Scala? Is it for performance reasons or more for style? Thanks.
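
For reference, the two forms being discussed side by side (message and 
variable taken from the diff above):

```
// String interpolation (used in the change)
throw new AnalysisException(s"Path does not exist: $qualified")
// String concatenation (the alternative)
throw new AnalysisException("Path does not exist: " + qualified)
```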


---



[GitHub] spark pull request: [SPARK-13630][SQL] Adds optimizer rule collaps...

2016-03-02 Thread skambha
GitHub user skambha opened a pull request:

https://github.com/apache/spark/pull/11480

[SPARK-13630][SQL] Adds optimizer rule collapsesorts to collapse adja…

## What changes were proposed in this pull request?

This patch does the following: 
I) Adds a new optimizer rule, collapsesorts, that does the following if the 
`global` flag is the same for the adjacent sorts (a sketch follows below):
a)  Collapse adjacent sorts and keep the last sort
b) Collapse adjacent sorts if there is a project, a limit, or a filter in 
between, and keep the last sort. 

II) A new test suite, CollapseSortsSuite, is added with tests. 
Also note, one of the testcases (test("collapsesorts: test collapsesorts in 
sort <- limit <- sort scenario")) does not compare against an expected plan 
because the unapply in Limit will actually remove the LocalLimit from the plan. 
Hence the test just checks that the collapsesorts rule was exercised by 
checking the number of Sort nodes in the plan.
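
A minimal sketch of case (a), assuming the usual Catalyst Sort(order, global, 
child) node; the actual rule in the patch also handles an intervening project, 
limit, or filter:

```
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Sort}
import org.apache.spark.sql.catalyst.rules.Rule

// Keep only the outer Sort when two Sorts with the same `global` flag are
// directly adjacent; the inner ordering is overwritten by the outer one anyway.
object CollapseSorts extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case outer @ Sort(_, outerGlobal, Sort(_, innerGlobal, grandChild))
        if outerGlobal == innerGlobal =>
      outer.copy(child = grandChild)
  }
}
```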

## How was this patch tested?
A)
Following test suites were run and the lint checking was done. No new test 
failures: 
build/sbt -Phive hive/test
build/sbt sql/test
build/sbt catalyst/test
dev/lint-scala

B) A new test suite CollapseSortsSuite is added with new tests to exercise 
the collapsesorts rule. 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/skambha/spark SPARK-13630

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/11480.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #11480


commit 7e2450efbf7b33cc33bb0b2946ba790cf27e1bac
Author: Sunitha Kambhampati <skam...@us.ibm.com>
Date:   2016-03-02T23:59:38Z

[SPARK-13630][SQL] Adds optimizer rule collapsesorts to collapse adjacent 
sorts and add new testsuite




---