[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-16 Thread chutium
Github user chutium commented on the pull request:

https://github.com/apache/spark/pull/1359#issuecomment-49154812
  
hi, it is really very useful for us, i tried this implementation from 
@willb , in spark-shell, i still got java.lang.UnsupportedOperationException by 
Query Plan, i made some change in SqlParser: 
https://github.com/chutium/spark/commit/1de83a7560f85cd347bca6dde256d551da63a144



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-16 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/1359#issuecomment-49155240
  
Awesome! Please submit a pull request with that addition.
On Jul 16, 2014 7:53 AM, Teng Qiu notificati...@github.com wrote:

 hi, it is really very useful for us, i tried this implementation from
 @willb https://github.com/willb , in spark-shell, i still got
 java.lang.UnsupportedOperationException by Query Plan, i made some change
 in SqlParser: chutium@1de83a7
 
https://github.com/chutium/spark/commit/1de83a7560f85cd347bca6dde256d551da63a144

 —
 Reply to this email directly or view it on GitHub
 https://github.com/apache/spark/pull/1359#issuecomment-49154812.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-16 Thread chutium
Github user chutium commented on the pull request:

https://github.com/apache/spark/pull/1359#issuecomment-49159677
  
PR submitted #1442 



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-15 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/1359#issuecomment-49093280
  
Thanks!  Merged into master and 1.0.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-15 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1359


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-14 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/1359#issuecomment-48969318
  
Hey Will, this looks great.  Thanks for implementing!  Mind fixing the 
conflict with master so I can merge?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-14 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1359#issuecomment-48971559
  
Thanks, Michael.  I've fixed the conflict.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-14 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/1359#issuecomment-48971609
  
Thanks!  Will merge as soon as Jenkins comes back happy.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-14 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1359#discussion_r14915562
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala
 ---
@@ -205,3 +207,72 @@ case class EndsWith(left: Expression, right: 
Expression)
 extends BinaryExpression with StringComparison {
   def compare(l: String, r: String) = l.endsWith(r)
 }
+
+/**
+ * A function that takes a substring of its first argument starting at a 
given position.
+ * Defined for String and Binary types.
+ */
+case class Substring(str: Expression, pos: Expression, len: Expression) 
extends Expression {
+  
+  type EvaluatedType = Any
+  
+  def nullable: Boolean = true
--- End diff --

`nullable` could be `str.nullable || pos.nullable || len.nullable` ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-14 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/1359#issuecomment-48986821
  
add to whitelist


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1359#issuecomment-48987029
  
QA tests have started for PR 1359. This patch merges cleanly. brView 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16652/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1359#issuecomment-48991417
  
QA results for PR 1359:br- This patch PASSES unit tests.br- This patch 
merges cleanlybr- This patch adds the following public classes 
(experimental):brcase class Substring(str: Expression, pos: Expression, len: 
Expression) extends Expression {brbrFor more information see test 
ouptut:brhttps://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16652/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1359#issuecomment-48639097
  
QA tests have started for PR 1359. This patch merges cleanly. brView 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16506/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-10 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1359#discussion_r14781579
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala
 ---
@@ -19,8 +19,11 @@ package org.apache.spark.sql.catalyst.expressions
 
 import java.util.regex.Pattern
 
+import scala.collection.IndexedSeqOptimized
+
 import org.apache.spark.sql.catalyst.types.DataType
 import org.apache.spark.sql.catalyst.types.StringType
+import org.apache.spark.sql.catalyst.types.BinaryType
--- End diff --

alphabetization or combine the imports


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-10 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1359#discussion_r14781690
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala
 ---
@@ -207,3 +210,64 @@ case class StartsWith(left: Expression, right: 
Expression) extends StringCompari
 case class EndsWith(left: Expression, right: Expression) extends 
StringComparison {
   def compare(l: String, r: String) = l.endsWith(r)
 }
+
+/**
+ * A function that takes a substring of its first argument starting at a 
given position.
+ * Defined for String and Binary types.
+ */
+case class Substring(str: Expression, pos: Expression, len: Expression) 
extends Expression {
+  
+  type EvaluatedType = Any
+  
+  def nullable: Boolean = true
+  def dataType: DataType = {
+if (str.dataType == BinaryType) str.dataType else StringType
+  }
+  
+  def references = children.flatMap(_.references).toSet
+  
+  override def children = str :: pos :: len :: Nil
+  
+  def slice[T, C % IndexedSeqOptimized[T,_]](str: C, startPos: Int, 
sliceLen: Int): Any = {
+val len = str.length
+// Hive and SQL use one-based indexing for SUBSTR arguments but also 
accept zero and
+// negative indices for start positions. If a start index i is greater 
than 0, it 
+// refers to element i-1 in the sequence. If a start index i is less 
than 0, it refers
+// to the -ith element before the end of the sequence. If a start 
index i is 0, it
+// refers to the first element.
+
+val start = startPos match {
+  case pos if pos  0 = pos - 1
+  case neg if neg  0 = len + neg
+  case _ = 0
+}
+
+val end = sliceLen match {
+  case max if max == Integer.MAX_VALUE = max
+  case x = start + x
+}
+  
+str.slice(start, end)
+  }
+  
+  override def eval(input: Row): Any = {
+val string = str.eval(input)
+
+val po = pos.eval(input)
+val ln = len.eval(input)
+
+if ((string == null) || (po == null) || (ln == null)) {
+  null
+} else {
+  val start = po.asInstanceOf[Int]
+  val length = ln.asInstanceOf[Int] 
+  
+  string match {
+case ba: Array[Byte] = slice(ba, start, length)
+case other = slice(other.toString, start, length)
+  }
+}
+  }
+  
+  override def toString = sSUBSTR($str, $pos, $len)
--- End diff --

Minor: what do you think if we display only two args for the 2-arg case 
instead of displaying a Integer.MAX_VALUE?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-10 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1359#discussion_r14781542
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala
 ---
@@ -207,3 +210,64 @@ case class StartsWith(left: Expression, right: 
Expression) extends StringCompari
 case class EndsWith(left: Expression, right: Expression) extends 
StringComparison {
   def compare(l: String, r: String) = l.endsWith(r)
 }
+
+/**
+ * A function that takes a substring of its first argument starting at a 
given position.
+ * Defined for String and Binary types.
+ */
+case class Substring(str: Expression, pos: Expression, len: Expression) 
extends Expression {
+  
+  type EvaluatedType = Any
+  
+  def nullable: Boolean = true
+  def dataType: DataType = {
+if (str.dataType == BinaryType) str.dataType else StringType
--- End diff --

Hey Will -- I think we need to check if `resolved` is true here to not 
break Catalyst's contract. Similar to 
[here](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala#L183).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-10 Thread willb
Github user willb commented on a diff in the pull request:

https://github.com/apache/spark/pull/1359#discussion_r14782197
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala
 ---
@@ -207,3 +210,64 @@ case class StartsWith(left: Expression, right: 
Expression) extends StringCompari
 case class EndsWith(left: Expression, right: Expression) extends 
StringComparison {
   def compare(l: String, r: String) = l.endsWith(r)
 }
+
+/**
+ * A function that takes a substring of its first argument starting at a 
given position.
+ * Defined for String and Binary types.
+ */
+case class Substring(str: Expression, pos: Expression, len: Expression) 
extends Expression {
+  
+  type EvaluatedType = Any
+  
+  def nullable: Boolean = true
+  def dataType: DataType = {
+if (str.dataType == BinaryType) str.dataType else StringType
+  }
+  
+  def references = children.flatMap(_.references).toSet
+  
+  override def children = str :: pos :: len :: Nil
+  
+  def slice[T, C % IndexedSeqOptimized[T,_]](str: C, startPos: Int, 
sliceLen: Int): Any = {
+val len = str.length
+// Hive and SQL use one-based indexing for SUBSTR arguments but also 
accept zero and
+// negative indices for start positions. If a start index i is greater 
than 0, it 
+// refers to element i-1 in the sequence. If a start index i is less 
than 0, it refers
+// to the -ith element before the end of the sequence. If a start 
index i is 0, it
+// refers to the first element.
+
+val start = startPos match {
+  case pos if pos  0 = pos - 1
+  case neg if neg  0 = len + neg
+  case _ = 0
+}
+
+val end = sliceLen match {
+  case max if max == Integer.MAX_VALUE = max
+  case x = start + x
+}
+  
+str.slice(start, end)
+  }
+  
+  override def eval(input: Row): Any = {
+val string = str.eval(input)
+
+val po = pos.eval(input)
+val ln = len.eval(input)
+
+if ((string == null) || (po == null) || (ln == null)) {
+  null
+} else {
+  val start = po.asInstanceOf[Int]
+  val length = ln.asInstanceOf[Int] 
+  
+  string match {
+case ba: Array[Byte] = slice(ba, start, length)
+case other = slice(other.toString, start, length)
+  }
+}
+  }
+  
+  override def toString = sSUBSTR($str, $pos, $len)
--- End diff --

Makes sense!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-10 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1359#discussion_r14782267
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala ---
@@ -860,6 +860,7 @@ private[hive] object HiveQl {
   val BETWEEN = (?i)BETWEEN.r
   val WHEN = (?i)WHEN.r
   val CASE = (?i)CASE.r
+  val SUBSTR = (?i)I_SUBSTR(?:ING)?.r
--- End diff --

Why I_ here?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-10 Thread willb
Github user willb commented on a diff in the pull request:

https://github.com/apache/spark/pull/1359#discussion_r14782431
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala ---
@@ -860,6 +860,7 @@ private[hive] object HiveQl {
   val BETWEEN = (?i)BETWEEN.r
   val WHEN = (?i)WHEN.r
   val CASE = (?i)CASE.r
+  val SUBSTR = (?i)I_SUBSTR(?:ING)?.r
--- End diff --

Good catch -- the `I_` was spuriously committed.  (I had that in my working 
tree so that I could easily compare Catalyst and Hive `substr` from the 
console.)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-10 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1359#discussion_r14782583
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala
 ---
@@ -207,3 +210,64 @@ case class StartsWith(left: Expression, right: 
Expression) extends StringCompari
 case class EndsWith(left: Expression, right: Expression) extends 
StringComparison {
   def compare(l: String, r: String) = l.endsWith(r)
 }
+
+/**
+ * A function that takes a substring of its first argument starting at a 
given position.
+ * Defined for String and Binary types.
+ */
+case class Substring(str: Expression, pos: Expression, len: Expression) 
extends Expression {
+  
+  type EvaluatedType = Any
+  
+  def nullable: Boolean = true
+  def dataType: DataType = {
+if (str.dataType == BinaryType) str.dataType else StringType
+  }
+  
+  def references = children.flatMap(_.references).toSet
+  
+  override def children = str :: pos :: len :: Nil
+  
+  def slice[T, C % IndexedSeqOptimized[T,_]](str: C, startPos: Int, 
sliceLen: Int): Any = {
+val len = str.length
+// Hive and SQL use one-based indexing for SUBSTR arguments but also 
accept zero and
--- End diff --

It looks like from [1] and [2] that Hive and SQL by default refer users to 
use 1-based indexing (and the links don't seem to mention 0-based at all). Even 
though they do support so, this subtle fact necessitates a couple branchings in 
the implementation which might cause us some performance penalty. What do you  
people think about supporting 1-based indexing only? +@marmbrus @rxin


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-10 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1359#discussion_r14782682
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala
 ---
@@ -207,3 +210,64 @@ case class StartsWith(left: Expression, right: 
Expression) extends StringCompari
 case class EndsWith(left: Expression, right: Expression) extends 
StringComparison {
   def compare(l: String, r: String) = l.endsWith(r)
 }
+
+/**
+ * A function that takes a substring of its first argument starting at a 
given position.
+ * Defined for String and Binary types.
+ */
+case class Substring(str: Expression, pos: Expression, len: Expression) 
extends Expression {
+  
+  type EvaluatedType = Any
+  
+  def nullable: Boolean = true
+  def dataType: DataType = {
+if (str.dataType == BinaryType) str.dataType else StringType
+  }
+  
+  def references = children.flatMap(_.references).toSet
+  
+  override def children = str :: pos :: len :: Nil
+  
+  def slice[T, C % IndexedSeqOptimized[T,_]](str: C, startPos: Int, 
sliceLen: Int): Any = {
+val len = str.length
+// Hive and SQL use one-based indexing for SUBSTR arguments but also 
accept zero and
+// negative indices for start positions. If a start index i is greater 
than 0, it 
+// refers to element i-1 in the sequence. If a start index i is less 
than 0, it refers
+// to the -ith element before the end of the sequence. If a start 
index i is 0, it
+// refers to the first element.
+
+val start = startPos match {
+  case pos if pos  0 = pos - 1
+  case neg if neg  0 = len + neg
+  case _ = 0
+}
+
+val end = sliceLen match {
--- End diff --

The MySQL doc mentions that If len is less than 1, the result is the empty 
string.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-10 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/1359#issuecomment-48642397
  
Hey @willb - thanks for working on this, which is going to be very useful 
for Spark SQL. I left a couple minor comments. Another general concern is the 
performance of `eval()`. If there are ways to reduce branchings, or reduce the 
function call (not sure if @inline will help), we should probably do it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-10 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1359#issuecomment-48643276
  
Thanks for the comments, @concretevitamin.  I've made the changes from your 
comments and will think about reducing branch overhead before pushing an update.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-10 Thread willb
Github user willb commented on a diff in the pull request:

https://github.com/apache/spark/pull/1359#discussion_r14783468
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala
 ---
@@ -207,3 +210,64 @@ case class StartsWith(left: Expression, right: 
Expression) extends StringCompari
 case class EndsWith(left: Expression, right: Expression) extends 
StringComparison {
   def compare(l: String, r: String) = l.endsWith(r)
 }
+
+/**
+ * A function that takes a substring of its first argument starting at a 
given position.
+ * Defined for String and Binary types.
+ */
+case class Substring(str: Expression, pos: Expression, len: Expression) 
extends Expression {
+  
+  type EvaluatedType = Any
+  
+  def nullable: Boolean = true
+  def dataType: DataType = {
+if (str.dataType == BinaryType) str.dataType else StringType
+  }
+  
+  def references = children.flatMap(_.references).toSet
+  
+  override def children = str :: pos :: len :: Nil
+  
+  def slice[T, C % IndexedSeqOptimized[T,_]](str: C, startPos: Int, 
sliceLen: Int): Any = {
+val len = str.length
+// Hive and SQL use one-based indexing for SUBSTR arguments but also 
accept zero and
--- End diff --

Hive supports 0-based indexing in the same way as this patch.  I agree that 
supporting both in this way is ugly (both from an interface and from an 
implementation perspective), but it seems likely that people are depending on 
this behavior in the wild, doesn't it?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-10 Thread willb
Github user willb commented on a diff in the pull request:

https://github.com/apache/spark/pull/1359#discussion_r14783626
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala
 ---
@@ -207,3 +210,64 @@ case class StartsWith(left: Expression, right: 
Expression) extends StringCompari
 case class EndsWith(left: Expression, right: Expression) extends 
StringComparison {
   def compare(l: String, r: String) = l.endsWith(r)
 }
+
+/**
+ * A function that takes a substring of its first argument starting at a 
given position.
+ * Defined for String and Binary types.
+ */
+case class Substring(str: Expression, pos: Expression, len: Expression) 
extends Expression {
+  
+  type EvaluatedType = Any
+  
+  def nullable: Boolean = true
+  def dataType: DataType = {
+if (str.dataType == BinaryType) str.dataType else StringType
+  }
+  
+  def references = children.flatMap(_.references).toSet
+  
+  override def children = str :: pos :: len :: Nil
+  
+  def slice[T, C % IndexedSeqOptimized[T,_]](str: C, startPos: Int, 
sliceLen: Int): Any = {
+val len = str.length
+// Hive and SQL use one-based indexing for SUBSTR arguments but also 
accept zero and
+// negative indices for start positions. If a start index i is greater 
than 0, it 
+// refers to element i-1 in the sequence. If a start index i is less 
than 0, it refers
+// to the -ith element before the end of the sequence. If a start 
index i is 0, it
+// refers to the first element.
+
+val start = startPos match {
+  case pos if pos  0 = pos - 1
+  case neg if neg  0 = len + neg
+  case _ = 0
+}
+
+val end = sliceLen match {
--- End diff --

This is the behavior of `IndexedSeqOptimized[A,B].slice` (and thus this 
patch) as well as of Hive, too.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-10 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1359#discussion_r14783724
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala
 ---
@@ -207,3 +210,64 @@ case class StartsWith(left: Expression, right: 
Expression) extends StringCompari
 case class EndsWith(left: Expression, right: Expression) extends 
StringComparison {
   def compare(l: String, r: String) = l.endsWith(r)
 }
+
+/**
+ * A function that takes a substring of its first argument starting at a 
given position.
+ * Defined for String and Binary types.
+ */
+case class Substring(str: Expression, pos: Expression, len: Expression) 
extends Expression {
+  
+  type EvaluatedType = Any
+  
+  def nullable: Boolean = true
+  def dataType: DataType = {
+if (str.dataType == BinaryType) str.dataType else StringType
+  }
+  
+  def references = children.flatMap(_.references).toSet
+  
+  override def children = str :: pos :: len :: Nil
+  
+  def slice[T, C % IndexedSeqOptimized[T,_]](str: C, startPos: Int, 
sliceLen: Int): Any = {
+val len = str.length
+// Hive and SQL use one-based indexing for SUBSTR arguments but also 
accept zero and
+// negative indices for start positions. If a start index i is greater 
than 0, it 
+// refers to element i-1 in the sequence. If a start index i is less 
than 0, it refers
+// to the -ith element before the end of the sequence. If a start 
index i is 0, it
+// refers to the first element.
+
+val start = startPos match {
+  case pos if pos  0 = pos - 1
+  case neg if neg  0 = len + neg
+  case _ = 0
+}
+
+val end = sliceLen match {
--- End diff --

Right, missed this before, sorry.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-10 Thread willb
Github user willb commented on the pull request:

https://github.com/apache/spark/pull/1359#issuecomment-48648670
  
@concretevitamin Inlining `Substring.slice` seems to make a nontrivial 
difference (~11.5% speedup) on a very simple `Substring.eval()` microbenchmark.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-10 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/1359#issuecomment-48648864
  
This sounds pretty cool!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1359#issuecomment-48651882
  
QA results for PR 1359:br- This patch PASSES unit tests.br- This patch 
merges cleanlybr- This patch adds the following public classes 
(experimental):brcase class Substring(str: Expression, pos: Expression, len: 
Expression) extends Expression {brbrFor more information see test 
ouptut:brhttps://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16506/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1359#issuecomment-48651895
  
Merged build finished. All automated tests passed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1359#issuecomment-48651897
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16506/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1359#issuecomment-48653871
  
 Merged build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1359#issuecomment-48653881
  
Merged build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1359#issuecomment-48654045
  
QA tests have started for PR 1359. This patch merges cleanly. brView 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16513/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1359#issuecomment-48665810
  
Merged build finished. All automated tests passed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1359#issuecomment-48665803
  
QA results for PR 1359:br- This patch PASSES unit tests.br- This patch 
merges cleanlybr- This patch adds the following public classes 
(experimental):brcase class Substring(str: Expression, pos: Expression, len: 
Expression) extends Expression {brbrFor more information see test 
ouptut:brhttps://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16513/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-10 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/1359#discussion_r14801757
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala
 ---
@@ -207,3 +207,71 @@ case class StartsWith(left: Expression, right: 
Expression) extends StringCompari
 case class EndsWith(left: Expression, right: Expression) extends 
StringComparison {
   def compare(l: String, r: String) = l.endsWith(r)
 }
+
+/**
+ * A function that takes a substring of its first argument starting at a 
given position.
+ * Defined for String and Binary types.
+ */
+case class Substring(str: Expression, pos: Expression, len: Expression) 
extends Expression {
+  
+  type EvaluatedType = Any
+  
+  def nullable: Boolean = true
+  def dataType: DataType = {
+if (!resolved) {
+  throw new UnresolvedException(this, sCannot resolve since $children 
are not resolved)
+}
+if (str.dataType == BinaryType) str.dataType else StringType
+  }
+  
+  def references = children.flatMap(_.references).toSet
+  
+  override def children = str :: pos :: len :: Nil
+  
+  @inline
+  def slice[T, C % IndexedSeqOptimized[T,_]](str: C, startPos: Int, 
sliceLen: Int): Any = {
--- End diff --

Would it be possible to do this without view bounds?  I think those are 
going to disappear soon: https://issues.scala-lang.org/browse/SI-7629


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-10 Thread willb
Github user willb commented on a diff in the pull request:

https://github.com/apache/spark/pull/1359#discussion_r14802394
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala
 ---
@@ -207,3 +207,71 @@ case class StartsWith(left: Expression, right: 
Expression) extends StringCompari
 case class EndsWith(left: Expression, right: Expression) extends 
StringComparison {
   def compare(l: String, r: String) = l.endsWith(r)
 }
+
+/**
+ * A function that takes a substring of its first argument starting at a 
given position.
+ * Defined for String and Binary types.
+ */
+case class Substring(str: Expression, pos: Expression, len: Expression) 
extends Expression {
+  
+  type EvaluatedType = Any
+  
+  def nullable: Boolean = true
+  def dataType: DataType = {
+if (!resolved) {
+  throw new UnresolvedException(this, sCannot resolve since $children 
are not resolved)
+}
+if (str.dataType == BinaryType) str.dataType else StringType
+  }
+  
+  def references = children.flatMap(_.references).toSet
+  
+  override def children = str :: pos :: len :: Nil
+  
+  @inline
+  def slice[T, C % IndexedSeqOptimized[T,_]](str: C, startPos: Int, 
sliceLen: Int): Any = {
--- End diff --

Yes, I think I can do it with implicit parameters.  I'll push an update 
once I've run the Catalyst suite against my change.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1359#issuecomment-48685301
  
QA tests have started for PR 1359. This patch merges cleanly. brView 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16543/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1359#issuecomment-48685347
  
QA results for PR 1359:br- This patch FAILED unit tests.br- This patch 
merges cleanlybr- This patch adds the following public classes 
(experimental):brcase class Substring(str: Expression, pos: Expression, len: 
Expression) extends Expression {brbrFor more information see test 
ouptut:brhttps://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16543/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1359#issuecomment-48690227
  
QA results for PR 1359:br- This patch PASSES unit tests.br- This patch 
merges cleanlybr- This patch adds the following public classes 
(experimental):brcase class Substring(str: Expression, pos: Expression, len: 
Expression) extends Expression {brbrFor more information see test 
ouptut:brhttps://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16544/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---