[GitHub] spark pull request: [SPARK-6806] [SPARKR] [DOCS] Add a new SparkR ...

2015-05-29 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/6490#discussion_r31304628
  
--- Diff: docs/sparkr.md ---
@@ -0,0 +1,198 @@
+---
+layout: global
+displayTitle: SparkR (R on Spark)
+title: SparkR (R on Spark)
+---
+
+* This will become a table of contents (this text will be scraped).
+{:toc}
+
+# Overview
+SparkR is an R package that provides a light-weight frontend to use Apache Spark from R.
+In Spark {{site.SPARK_VERSION}}, SparkR provides a distributed data frame implementation that
+supports operations similar to R data frames, [dplyr](https://github.com/hadley/dplyr) but on large
--- End diff --

add etc/e.g.?





[GitHub] spark pull request: [SPARK-7512] [SPARKR] Fix RDD's show method to...

2015-05-09 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/6035#issuecomment-100564609
  
LGTM!





[GitHub] spark pull request: [SPARK-6991] [SparkR] Adds support for zipPart...

2015-04-27 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/5568#issuecomment-96814334
  
LGTM. /cc @shivaram 





[GitHub] spark pull request: [SPARK-6991] [SparkR] Adds support for zipPart...

2015-04-27 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/5568#issuecomment-96751034
  
@hlin09 - cool, I don't have a strong opinion towards either approach for 
now. Did a pass and left some minor style comments. 





[GitHub] spark pull request: [SPARK-6991] [SparkR] Adds support for zipPart...

2015-04-27 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/5568#discussion_r29169055
  
--- Diff: R/pkg/R/RDD.R ---
@@ -1529,3 +1529,50 @@ setMethod("zipRDD",
 
             PipelinedRDD(zippedRDD, partitionFunc)
           })
+
+#' Zip an RDD's partitions with one (or more) RDD(s).
+#'
+#' Return a new RDD by applying a function to the zipped partitions.
+#' Assumes that all the RDDs have the same number of partitions*, but does *not*
+#' require them to have the same number of elements in each partition.
+#'
+#' @param x An RDD to be zipped.
+#' @param other Another RDD to be zipped.
+#' @return An RDD zipped from the two RDDs.
+#' @examples
+#'\dontrun{
+#' sc <- sparkR.init()
+#' rdd1 <- parallelize(sc, 1:2, 2L)  # 1, 2
+#' rdd2 <- parallelize(sc, 1:4, 2L)  # 1:2, 3:4
+#' rdd3 <- parallelize(sc, 1:6, 2L)  # 1:3, 4:6
+#' collect(zipPartitions(rdd1, rdd2, rdd3,
+#'   func = function(x, y, z) { list(list(x, y, z)) }))
+#' # list(list(1, c(1,2), c(1,2,3)), list(2, c(3,4), c(4,5,6)))
+#'}
+#' @rdname zipRDD
+#' @aliases zipPartitions,RDD
+setMethod("zipPartitions",
+          "RDD",
+          function(..., func) {
+            rrdds <- list(...)
+            if (length(rrdds) == 1) {
+              return(rrdds[[1]])
+            }
+            nPart <- sapply(rrdds, numPartitions)
+            if (length(unique(nPart)) != 1) {
+              stop("Can only zipPartitions RDDs which have the same number of partitions.")
+            }
+
+            rrdds <- lapply(rrdds, function(rdd) {
+              mapPartitionsWithIndex(rdd, function(split, part) {
--- End diff --

We've recently changed all usage of `split` to `partIndex`; something like `idx` is fine too.
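
For reference, a minimal sketch of the suggested rename (behavior unchanged, only the parameter name differs from the diff above):

```
rrdds <- lapply(rrdds, function(rdd) {
  mapPartitionsWithIndex(rdd, function(partIndex, part) {
    # ... same body as before, with `split` renamed to `partIndex`
  })
})
```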





[GitHub] spark pull request: [SPARK-6991] [SparkR] Adds support for zipPart...

2015-04-27 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/5568#discussion_r29168961
  
--- Diff: R/pkg/R/RDD.R ---
@@ -1529,3 +1529,50 @@ setMethod("zipRDD",
 
             PipelinedRDD(zippedRDD, partitionFunc)
           })
+
+#' Zip an RDD's partitions with one (or more) RDD(s).
+#'
+#' Return a new RDD by applying a function to the zipped partitions.
+#' Assumes that all the RDDs have the same number of partitions*, but does *not*
--- End diff --

nit: unmatched *





[GitHub] spark pull request: [SPARK-6991] [SparkR] Adds support for zipPart...

2015-04-27 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/5568#discussion_r29169169
  
--- Diff: R/pkg/R/RDD.R ---
@@ -1529,3 +1529,50 @@ setMethod("zipRDD",
 
             PipelinedRDD(zippedRDD, partitionFunc)
           })
+
+#' Zip an RDD's partitions with one (or more) RDD(s).
+#'
+#' Return a new RDD by applying a function to the zipped partitions.
+#' Assumes that all the RDDs have the same number of partitions*, but does *not*
+#' require them to have the same number of elements in each partition.
+#'
+#' @param x An RDD to be zipped.
+#' @param other Another RDD to be zipped.
--- End diff --

The doc here needs to be updated: (1) var-arg arity (2) document `func`.
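
A sketch of what the updated roxygen block might say (wording illustrative, not the final doc):

```
#' @param ... Two or more RDDs to be zipped; all must have the same number of partitions.
#' @param func A function to apply to the zipped partitions; it takes one argument
#'   per input RDD, bound to that RDD's partition contents.
#' @return An RDD zipped from the input RDDs.
```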





[GitHub] spark pull request: [SPARK-7033][SPARKR] Clean usage of split. Use...

2015-04-23 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/5628#issuecomment-95651218
  
Thanks, @sun-rui - doing a grep, could you also update test_rdd.R's line 
124?





[GitHub] spark pull request: [SPARK-7033][SPARKR] Clean usage of split. Use...

2015-04-23 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/5628#issuecomment-95775125
  
LGTM





[GitHub] spark pull request: [SPARK-6991] [SparkR] Adds support for zipPart...

2015-04-20 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/5568#issuecomment-94624336
  
Thanks @hlin09 - one high-level question: do you have an intended use case in mind for arbitrary / large-arity functions? In Spark's Scala API, up to 4 RDDs are supported, which has performance benefits (it avoids a union + shuffle).
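
For comparison, a minimal usage sketch of the fixed-arity Scala API (assuming two RDDs with equal partition counts):

```
// Scala side: zipPartitions has overloads for 2, 3, and 4 RDDs.
val zipped = rdd1.zipPartitions(rdd2) { (iter1, iter2) =>
  iter1.zip(iter2)  // pair up elements partition-by-partition, no shuffle involved
}
```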





[GitHub] spark pull request: [Minor][SparkR] Minor refactor and removes red...

2015-04-13 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/5495#issuecomment-92477984
  
Jenkins, test this please.





[GitHub] spark pull request: [Minor][SparkR] Minor refactor and removes red...

2015-04-13 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/5495#issuecomment-92479065
  
Hey @pwendell I can't seem to summon Jenkins now :(





[GitHub] spark pull request: [Minor][SparkR] Minor refactor and removes red...

2015-04-13 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/5495#issuecomment-92589456
  
LGTM





[GitHub] spark pull request: [SPARK-5654] Integrate SparkR

2015-04-09 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/5096#issuecomment-91123979
  
:+1: 





[GitHub] spark pull request: [SPARK-5654] Integrate SparkR into Apache Spar...

2015-03-18 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/5077#issuecomment-83271348
  
@shivaram What's the timeline we are looking at? 

If possible I'd like to take a close look at this next week / after next 
week.





[GitHub] spark pull request: [SPARK-3609][SQL] Adds sizeInBytes statistics ...

2014-09-19 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/2468#discussion_r17817992
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/dataTypes.scala ---
@@ -122,6 +122,16 @@ object NativeType {
     IntegerType, BooleanType, LongType, DoubleType, FloatType, ShortType, ByteType, StringType)
 
   def unapply(dt: DataType): Boolean = all.contains(dt)
+
+  val defaultSizeOf: Map[NativeType, Int] = Map(
--- End diff --

Does the JVM always use these conventional sizes on every platform?





[GitHub] spark pull request: SPARK-3329: [SQL] Don't depend on Hive SET pai...

2014-08-31 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/2220#issuecomment-53995569
  
ok to test





[GitHub] spark pull request: SPARK-3329: [SQL] Don't depend on Hive SET pai...

2014-08-31 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/2220#issuecomment-53995880
  
Ah, so this problem *was* fixed in PR #1514, but it seems like this 
[PR](https://github.com/apache/spark/commit/a7a9d14479ea6421513a962ff0f45cb969368bab#diff-48)
 reverted the change accidentally. I think it'd be good to re-adapt 1514's 
solution.

/cc @aarondav @liancheng





[GitHub] spark pull request: [SPARK-3252][SQL] Add missing condition for te...

2014-08-27 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/2159#issuecomment-53635984
  
This is not harmful, yet it doesn't do anything either, since in the test query the two relations should be the same.

On Wednesday, August 27, 2014, Michael Armbrust notificati...@github.com
wrote:

 /cc @concretevitamin https://github.com/concretevitamin

 —
 Reply to this email directly or view it on GitHub
 https://github.com/apache/spark/pull/2159#issuecomment-53632213.






[GitHub] spark pull request: [SPARK-2315] Implement drop, dropRight and dro...

2014-08-07 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/1839#issuecomment-51520540
  
Jenkins, this is okay to test.





[GitHub] spark pull request: [SPARK-2678][Core][SQL] A workaround for SPARK...

2014-08-07 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/1801#issuecomment-51533012
  
@liancheng @andrewor14 @pwendell  With this patch, things like `./bin/spark-shell --master local[2]` error out (bad options: `--master`). I had to work around this with an extra `--` before the flag. Is this intended?
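
For the record, the failing invocation and the workaround look like this (a reconstruction of the session described above, not verified output):

```
./bin/spark-shell --master local[2]      # reported to fail with: bad options: --master
./bin/spark-shell -- --master local[2]   # workaround: an extra -- before the flag
```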





[GitHub] spark pull request: [SPARK-2678][Core][SQL] A workaround for SPARK...

2014-08-07 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/1801#issuecomment-51533306
  
Oh, it's been reported by #1825.





[GitHub] spark pull request: [SPARK-2406][SQL] Initial support for using Pa...

2014-08-06 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1819#discussion_r15913270
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala ---
@@ -78,6 +78,14 @@ class HiveContext(sc: SparkContext) extends SQLContext(sc) {
   // Change the default SQL dialect to HiveQL
   override private[spark] def dialect: String = getConf(SQLConf.DIALECT, "hiveql")
 
+  /**
+   * When true, enables an experimental feature where metastore tables that use the parquet SerDe
+   * are automatically converted to use the Spark SQL parquet table scan, instead of the Hive
+   * SerDe.
+   */
+  private[spark] def convertMetastoreParquet: Boolean =
+    getConf("spark.sql.hive.convertMetastoreParquet", "false") == "true"
--- End diff --

I am going to test this PR soon. In the meantime, would it make sense to put this only in `SQLConf` (along with a field holding the key string in the singleton object), making that class the central place that stores SQL configs?
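
A sketch of that centralization, assuming a hypothetical `CONVERT_METASTORE_PARQUET` field (the key string itself comes from the diff):

```
// In the SQLConf singleton object: the key string lives in one place.
object SQLConf {
  val CONVERT_METASTORE_PARQUET = "spark.sql.hive.convertMetastoreParquet"
}

// HiveContext then refers to the shared field, mirroring SQLConf.DIALECT above.
private[spark] def convertMetastoreParquet: Boolean =
  getConf(SQLConf.CONVERT_METASTORE_PARQUET, "false") == "true"
```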





[GitHub] spark pull request: [SPARK-2406][SQL] Initial support for using Pa...

2014-08-06 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1819#discussion_r15914067
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala ---
@@ -78,6 +78,14 @@ class HiveContext(sc: SparkContext) extends SQLContext(sc) {
   // Change the default SQL dialect to HiveQL
   override private[spark] def dialect: String = getConf(SQLConf.DIALECT, "hiveql")
 
+  /**
+   * When true, enables an experimental feature where metastore tables that use the parquet SerDe
+   * are automatically converted to use the Spark SQL parquet table scan, instead of the Hive
+   * SerDe.
+   */
+  private[spark] def convertMetastoreParquet: Boolean =
+    getConf("spark.sql.hive.convertMetastoreParquet", "false") == "true"
--- End diff --

When in doubt, make up longer names: `SQLConfigOpts`, `HiveConfigOpts`. But 
this is only possibly relevant in the future and should not block this PR.





[GitHub] spark pull request: [SPARK-2860][SQL] Fix coercion of CASE WHEN.

2014-08-05 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1785#discussion_r15828435
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala ---
@@ -336,28 +338,33 @@ trait HiveTypeCoercion {
   }
 
   /**
-   * Ensures that NullType gets casted to some other types under certain circumstances.
+   * Coerces the type of different branches of a CASE WHEN statement to a common type.
    */
-  object CastNulls extends Rule[LogicalPlan] {
+  object CaseWhenCoercion extends Rule[LogicalPlan] with TypeWidening {
     def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
-      case cw @ CaseWhen(branches) =>
+      case cw @ CaseWhen(branches) if !cw.resolved && !branches.exists(!_.resolved) =>
         val valueTypes = branches.sliding(2, 2).map {
-          case Seq(_, value) if value.resolved => Some(value.dataType)
-          case Seq(elseVal) if elseVal.resolved => Some(elseVal.dataType)
-          case _ => None
+          case Seq(_, value) => value.dataType
+          case Seq(elseVal) => elseVal.dataType
         }.toSeq
-        if (valueTypes.distinct.size == 2 && valueTypes.exists(_ == Some(NullType))) {
-          val otherType = valueTypes.filterNot(_ == Some(NullType))(0).get
+
+        logDebug(s"Input values for null casting ${valueTypes.mkString(",")}")
--- End diff --

Do we want to leave this in?





[GitHub] spark pull request: [SPARK-2860][SQL] Fix coercion of CASE WHEN.

2014-08-05 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1785#discussion_r15828596
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala ---
@@ -336,28 +338,33 @@ trait HiveTypeCoercion {
   }
 
   /**
-   * Ensures that NullType gets casted to some other types under certain circumstances.
+   * Coerces the type of different branches of a CASE WHEN statement to a common type.
    */
-  object CastNulls extends Rule[LogicalPlan] {
+  object CaseWhenCoercion extends Rule[LogicalPlan] with TypeWidening {
     def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
-      case cw @ CaseWhen(branches) =>
+      case cw @ CaseWhen(branches) if !cw.resolved && !branches.exists(!_.resolved) =>
         val valueTypes = branches.sliding(2, 2).map {
-          case Seq(_, value) if value.resolved => Some(value.dataType)
-          case Seq(elseVal) if elseVal.resolved => Some(elseVal.dataType)
-          case _ => None
+          case Seq(_, value) => value.dataType
+          case Seq(elseVal) => elseVal.dataType
         }.toSeq
-        if (valueTypes.distinct.size == 2 && valueTypes.exists(_ == Some(NullType))) {
-          val otherType = valueTypes.filterNot(_ == Some(NullType))(0).get
+
+        logDebug(s"Input values for null casting ${valueTypes.mkString(",")}")
+
+        if (valueTypes.distinct.size > 1) {
+          val commonType = valueTypes.reduce { (v1, v2) =>
+            findTightestCommonType(v1, v2)
+              .getOrElse(sys.error(s"Invalid types in CASE WHEN. $v1, $v2"))
--- End diff --

Perhaps "Value types in CASE WHEN must be the same, or can be coerced to a tightest upper bound." to be more precise about the semantics, in case users run into this.
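
In code, the suggested message would look something like this (a sketch of the wording above, not the merged text):

```
.getOrElse(sys.error(
  s"Value types in CASE WHEN must be the same, or can be coerced " +
  s"to a tightest upper bound. Got $v1, $v2"))
```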





[GitHub] spark pull request: [SPARK-2860][SQL] Fix coercion of CASE WHEN.

2014-08-05 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/1785#issuecomment-51232808
  
A few minor comments, otherwise LGTM!





[GitHub] spark pull request: Set Spark SQL Hive compatibility test shuffle ...

2014-08-05 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/1784#issuecomment-51234268
  
Can we reset the original value in afterAll()? There's a test in `SQLConfSuite` that depends on that option, and in the future people might easily add tests that use the param as well.
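
A sketch of the save-and-restore pattern being suggested, assuming a ScalaTest `BeforeAndAfterAll` suite (the conf key and values are illustrative):

```
// Remember the original value so other suites (e.g. SQLConfSuite) see it restored.
private var originalShufflePartitions: String = _

override def beforeAll() {
  originalShufflePartitions = TestHive.getConf(SQLConf.SHUFFLE_PARTITIONS, "200")
  TestHive.setConf(SQLConf.SHUFFLE_PARTITIONS, "5")  // value used by these tests
}

override def afterAll() {
  TestHive.setConf(SQLConf.SHUFFLE_PARTITIONS, originalShufflePartitions)
}
```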





[GitHub] spark pull request: [SQL] Tighten the visibility of various SQLCon...

2014-08-05 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1794#discussion_r15856595
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala ---
@@ -17,17 +17,17 @@
 
 package org.apache.spark.sql
 
+import scala.collection.immutable
+import scala.collection.JavaConversions._
--- End diff --

Nit: put these under the java import & switch their order.
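
That is, something like the following, assuming the file's existing java import is `java.util.Properties`:

```
import java.util.Properties

import scala.collection.JavaConversions._
import scala.collection.immutable
```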





[GitHub] spark pull request: [SQL] Tighten the visibility of various SQLCon...

2014-08-05 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/1794#issuecomment-51295283
  
Hey @rxin -- I think this is good to go.





[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...

2014-08-04 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1774#discussion_r15790429
  
--- Diff: docs/sql-programming-guide.md ---
@@ -152,6 +152,41 @@ val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age
 teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
 {% endhighlight %}
 
+Another way to turns an RDD to table is to use `applySchema`. Here is an example.
--- End diff --

to turn





[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...

2014-08-04 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1774#discussion_r15790441
  
--- Diff: docs/sql-programming-guide.md ---
@@ -152,6 +152,41 @@ val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age
 teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
 {% endhighlight %}
 
+Another way to turns an RDD to table is to use `applySchema`. Here is an example.
+{% highlight scala %}
+// sc is an existing SparkContext.
+val sqlContext = new org.apache.spark.sql.SQLContext(sc)
+
+// Create an RDD
+val people = sc.textFile("examples/src/main/resources/people.txt")
+
+// Import Spark SQL data types and Row.
+import org.apache.spark.sql._
+
+// Define the schema that will be applied to the RDD.
+val schema =
+  StructType(
+    StructField("name", StringType, true) ::
+    StructField("age", IntegerType, true) :: Nil)
+
+// Convert records of the RDD (people) to rows.
--- End diff --

to Rows?





[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...

2014-08-04 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1774#discussion_r15790528
  
--- Diff: docs/sql-programming-guide.md ---
@@ -225,6 +260,54 @@ List<String> teenagerNames = teenagers.map(new Function<Row, String>() {
 
 {% endhighlight %}
 
+Another way to turns an RDD to table is to use `applySchema`. Here is an example.
--- End diff --

to turn; to a table





[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...

2014-08-04 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1774#discussion_r15790538
  
--- Diff: docs/sql-programming-guide.md ---
@@ -259,6 +342,40 @@ for teenName in teenNames.collect():
   print teenName
 {% endhighlight %}
 
+Another way to turns an RDD to table is to use `applySchema`. Here is an example.
--- End diff --

Same - maybe do a replaceAll





[GitHub] spark pull request: [SPARK-2783][SQL] Basic support for analyze in...

2014-08-03 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1741#discussion_r15733255
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala ---
@@ -21,12 +21,15 @@ import java.io.{BufferedReader, File, InputStreamReader, PrintStream}
 import java.sql.Timestamp
 import java.util.{ArrayList => JArrayList}
 
+import org.apache.hadoop.hive.ql.stats.StatsSetupConst
--- End diff --

Alphabetize imports





[GitHub] spark pull request: [SPARK-2783][SQL] Basic support for analyze in...

2014-08-03 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1741#discussion_r15733260
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
@@ -280,7 +281,7 @@ private[hive] case class MetastoreRelation
   // of RPCs are involved.  Besides `totalSize`, there are also `numFiles`, `numRows`,
   // `rawDataSize` keys that we can look at in the future.
   BigInt(
-    Option(hiveQlTable.getParameters.get("totalSize"))
+    Option(hiveQlTable.getParameters.get(StatsSetupConst.TOTAL_SIZE))
--- End diff --

Oh wow, this is a hard-to-find class!





[GitHub] spark pull request: [SPARK-2783][SQL] Basic support for analyze in...

2014-08-03 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1741#discussion_r15733265
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
@@ -280,7 +281,7 @@ private[hive] case class MetastoreRelation
   // of RPCs are involved.  Besides `totalSize`, there are also `numFiles`, `numRows`,
--- End diff --

Perhaps update the comments here to say other fields in `StatsSetupConst` 
might be useful.





[GitHub] spark pull request: [SPARK-2783][SQL] Basic support for analyze in...

2014-08-03 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1741#discussion_r15733274
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala ---
@@ -92,6 +95,64 @@ class HiveContext(sc: SparkContext) extends SQLContext(sc) {
     catalog.createTable("default", tableName, ScalaReflection.attributesFor[A], allowExisting)
   }
 
+  /**
+   * Analyzes the given table in the current database to generate statistics, which will be
+   * used in query optimizations.
+   *
+   * Right now, it only supports Hive tables and it only updates the size of a Hive table
+   * in the Hive metastore.
+   */
+  def analyze(tableName: String) {
+    val relation = catalog.lookupRelation(None, tableName) match {
+      case LowerCaseSchema(r) => r
+      case o => o
+    }
+
+    relation match {
+      case relation: MetastoreRelation => {
+        // This method is borrowed from
+        // org.apache.hadoop.hive.ql.stats.StatsUtils.getFileSizeForTable(HiveConf, Table)
+        // in Hive 0.13.
+        // TODO: Generalize statistics collection.
+        def getFileSizeForTable(conf: HiveConf, table: Table): Long = {
+          val path = table.getPath()
+          var size: Long = 0L
+          try {
+            val fs = path.getFileSystem(conf)
+            size = fs.getContentSummary(path).getLength()
+          } catch {
+            case e: Exception =>
+              logWarning(
+                s"Failed to get the size of table ${table.getTableName} in the " +
+                s"database ${table.getDbName} because of ${e.toString}", e)
+              size = 0L
+          }
+
+          size
+        }
+
+        val tableParameters = relation.hiveQlTable.getParameters
+        val oldTotalSize =
+          Option(tableParameters.get(StatsSetupConst.TOTAL_SIZE)).map(_.toLong).getOrElse(0L)
+        val newTotalSize = getFileSizeForTable(hiveconf, relation.hiveQlTable)
+        // Update the Hive metastore if the total size of the table is different than the size
+        // recorded in the Hive metastore.
+        // This logic is based on org.apache.hadoop.hive.ql.exec.StatsTask.aggregateStats().
+        if (newTotalSize > 0 && newTotalSize != oldTotalSize) {
+          tableParameters.put(StatsSetupConst.TOTAL_SIZE, newTotalSize.toString)
--- End diff --

Do we need to be concerned about concurrent accesses to `tableParameters`? 
More generally do we need to somehow synchronize on 
`MetastoreRelation#hiveQlTable` in various places?





[GitHub] spark pull request: [SPARK-2783][SQL] Basic support for analyze in...

2014-08-03 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1741#discussion_r15733277
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala ---
@@ -92,6 +95,64 @@ class HiveContext(sc: SparkContext) extends SQLContext(sc) {
     catalog.createTable("default", tableName, ScalaReflection.attributesFor[A], allowExisting)
   }
 
+  /**
+   * Analyzes the given table in the current database to generate statistics, which will be
+   * used in query optimizations.
+   *
+   * Right now, it only supports Hive tables and it only updates the size of a Hive table
+   * in the Hive metastore.
+   */
+  def analyze(tableName: String) {
+    val relation = catalog.lookupRelation(None, tableName) match {
+      case LowerCaseSchema(r) => r
+      case o => o
+    }
+
+    relation match {
+      case relation: MetastoreRelation => {
+        // This method is borrowed from
+        // org.apache.hadoop.hive.ql.stats.StatsUtils.getFileSizeForTable(HiveConf, Table)
+        // in Hive 0.13.
+        // TODO: Generalize statistics collection.
+        def getFileSizeForTable(conf: HiveConf, table: Table): Long = {
+          val path = table.getPath()
+          var size: Long = 0L
+          try {
+            val fs = path.getFileSystem(conf)
+            size = fs.getContentSummary(path).getLength()
+          } catch {
+            case e: Exception =>
+              logWarning(
+                s"Failed to get the size of table ${table.getTableName} in the " +
+                s"database ${table.getDbName} because of ${e.toString}", e)
+              size = 0L
+          }
+
+          size
+        }
+
+        val tableParameters = relation.hiveQlTable.getParameters
+        val oldTotalSize =
+          Option(tableParameters.get(StatsSetupConst.TOTAL_SIZE)).map(_.toLong).getOrElse(0L)
+        val newTotalSize = getFileSizeForTable(hiveconf, relation.hiveQlTable)
+        // Update the Hive metastore if the total size of the table is different than the size
+        // recorded in the Hive metastore.
+        // This logic is based on org.apache.hadoop.hive.ql.exec.StatsTask.aggregateStats().
+        if (newTotalSize > 0 && newTotalSize != oldTotalSize) {
+          tableParameters.put(StatsSetupConst.TOTAL_SIZE, newTotalSize.toString)
+          val hiveTTable = relation.hiveQlTable.getTTable
+          hiveTTable.setParameters(tableParameters)
+          val tableFullName =
+            relation.hiveQlTable.getDbName() + "." + relation.hiveQlTable.getTableName()
+
+          catalog.client.alterTable(tableFullName, new Table(hiveTTable))
+        }
+      }
+      case otherRelation =>
+        throw new NotImplementedError(s"Analyzing a ${otherRelation} has not been implemented")
--- End diff --

We probably don't want the result of a general `.toString` here. Perhaps just say "Analyzing relations other than MetastoreRelation has not been implemented" instead.





[GitHub] spark pull request: [SPARK-2675] Increase EVENT_QUEUE_CAPACITY by ...

2014-08-02 Thread concretevitamin
Github user concretevitamin closed the pull request at:

https://github.com/apache/spark/pull/1579





[GitHub] spark pull request: [SPARK-2316] Avoid O(blocks) operations in lis...

2014-08-01 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/1679#issuecomment-50946639
  
How many listeners are used in these benchmarks?




[GitHub] spark pull request: [SPARK-2531 SPARK-2436] [SQL] Optimize the B...

2014-07-31 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/1448#issuecomment-50807987
  
I have rebased & made changes according to the previous review comments. Also updated the title and description of the PR, combining two JIRA tickets.

@chenghao-intel The following tests cover different join types. I ran it 
and the tests passed for both `BuildLeft` and `BuildRight` being hardcoded into 
the planner.

```
SPARK_HIVE=true sbt/sbt -Dspark.hive.whitelist=mapjoin_test_outer,semijoin,smb_mapjoin_2,smb_mapjoin_3,union22 "test-only org.apache.spark.sql.hive.execution.HiveCompatibilitySuite"
```




[GitHub] spark pull request: [SPARK-2316] Avoid O(blocks) operations in lis...

2014-07-31 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/1679#issuecomment-50827351
  
This patch is much appreciated -- thanks for working on this!


On Thu, Jul 31, 2014 at 3:02 PM, Apache Spark QA notificati...@github.com
wrote:

 QA results for PR 1679:
 - This patch PASSES unit tests.
 - This patch merges cleanly
 - This patch adds the following public classes (experimental):
 class StorageStatus(val blockManagerId: BlockManagerId, val maxMem: Long) 
{

 For more information see test output:

 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17607/consoleFull

 —
 Reply to this email directly or view it on GitHub
 https://github.com/apache/spark/pull/1679#issuecomment-50824930.





[GitHub] spark pull request: [SPARK-2531] [SQL] Make BroadcastNestedLoopJoi...

2014-07-30 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1448#discussion_r15599752
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins.scala ---
@@ -332,71 +342,88 @@ case class BroadcastNestedLoopJoin(
     }
   }
 
-  /** The Streamed Relation */
-  def left = streamed
-  /** The Broadcast relation */
-  def right = broadcast
-
   @transient lazy val boundCondition =
     InterpretedPredicate(
       condition
         .map(c => BindReferences.bindReference(c, left.output ++ right.output))
         .getOrElse(Literal(true)))
 
+  // TODO: low-hanging fruits for performance? (Reduce branching / pattern matches?)
   def execute() = {
     val broadcastedRelation =
       sqlContext.sparkContext.broadcast(broadcast.execute().map(_.copy()).collect().toIndexedSeq)
 
-    val streamedPlusMatches = streamed.execute().mapPartitions { streamedIter =>
+    /** All rows that either match both-way, or rows from streamed joined with nulls. */
+    val matchesOrStreamedRowsWithNulls = streamed.execute().mapPartitions { streamedIter =>
       val matchedRows = new ArrayBuffer[Row]
       // TODO: Use Spark's BitSet.
-      val includedBroadcastTuples = new BitSet(broadcastedRelation.value.size)
+      val includedBroadcastTuples =
+        new scala.collection.mutable.BitSet(broadcastedRelation.value.size)
       val joinedRow = new JoinedRow
 
       streamedIter.foreach { streamedRow =>
         var i = 0
-        var matched = false
+        var streamRowMatched = false
 
        while (i < broadcastedRelation.value.size) {
          // TODO: One bitset per partition instead of per row.
          val broadcastedRow = broadcastedRelation.value(i)
-          if (boundCondition(joinedRow(streamedRow, broadcastedRow))) {
-            matchedRows += buildRow(streamedRow ++ broadcastedRow)
-            matched = true
+          val jr = buildSide match {
+            case BuildRight => joinedRow(streamedRow, broadcastedRow)
+            case BuildLeft => joinedRow(broadcastedRow, streamedRow)
+          }
+          if (boundCondition(jr)) {
+            // Putting this branching inside this conditional: assume ++ has a
+            // much higher cost than another branch & pattern matching.
+            val br = buildSide match {
+              case BuildRight => streamedRow ++ broadcastedRow
+              case BuildLeft => broadcastedRow ++ streamedRow
+            }
+            matchedRows += buildRow(br)
+            streamRowMatched = true
             includedBroadcastTuples += i
          }
          i += 1
        }
 
-        if (!matched && (joinType == LeftOuter || joinType == FullOuter)) {
-          matchedRows += buildRow(streamedRow ++ Array.fill(right.output.size)(null))
+        (streamRowMatched, joinType, buildSide) match {
+          case (false, LeftOuter | FullOuter, BuildRight) =>
+            matchedRows += buildRow(streamedRow ++ Array.fill(right.output.size)(null))
+          case (false, RightOuter | FullOuter, BuildLeft) =>
+            matchedRows += buildRow(Array.fill(left.output.size)(null) ++ streamedRow)
+          case _ =>
        }
      }
      Iterator((matchedRows, includedBroadcastTuples))
    }
 
-    val includedBroadcastTuples = streamedPlusMatches.map(_._2)
+    val includedBroadcastTuples = matchesOrStreamedRowsWithNulls.map(_._2)
    val allIncludedBroadcastTuples =
      if (includedBroadcastTuples.count == 0) {
        new scala.collection.mutable.BitSet(broadcastedRelation.value.size)
      } else {
-        streamedPlusMatches.map(_._2).reduce(_ ++ _)
+        includedBroadcastTuples.reduce(_ ++ _)
      }
 
-    val rightOuterMatches: Seq[Row] =
-      if (joinType == RightOuter || joinType == FullOuter) {
-        broadcastedRelation.value.zipWithIndex.filter {
-          case (row, i) => !allIncludedBroadcastTuples.contains(i)
-        }.map {
-          // TODO: Use projection.
-          case (row, _) => buildRow(Vector.fill(left.output.size)(null) ++ row)
+    /** Rows from broadcasted joined with nulls. */
+    val broadcastRowsWithNulls: Seq[Row] =
+      broadcastedRelation.value.zipWithIndex.filter { case (row, i) =>
+        // keep all rows in the broadcast relation that are not matched before
+        !allIncludedBroadcastTuples.contains(i)
+      }.map { case (row, _) =>
+        // TODO: Use projection.
+        val br = (joinType, buildSide) match {
+          case (RightOuter | FullOuter, BuildRight) => Vector.fill(left.output.size)(null) ++ row
--- End diff --

Hey

[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...

2014-07-29 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/1238#issuecomment-50522265
  
Rebased & addressed review comments.




[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...

2014-07-29 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/1238#issuecomment-50522286
  
Jenkins test this please




[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-28 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/1346#issuecomment-50423690
  
@yhuai @marmbrus I am not sure if this has been discussed before, but what 
do you guys think about adding a version of `applySchema(RDD[Array[String]], 
StructType)`? 

The use case I have in mind is TPC-DS data preparation. Currently I have a 
bunch of text files, from which I can easily create an `RDD[String]`; by 
splitting each line on some separator I get an `RDD[Array[String]]`. Now, in 
TPC-DS the tables easily have 15+ columns, and I don't want to manually create 
a `Row` for each `Array[String]`. 




[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...

2014-07-28 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/1346#issuecomment-50423851
  
To add to this: for my own purpose, I can certainly hack something together based off this branch in a custom Spark build, but just want to throw this thought out there as I think it does have some generality (large number of columns, avoid writing `.map(p => Row(p(0), p(1), ..., p(LARGE_NUM)))`).
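
A sketch of the boilerplate in question versus the proposed overload (the overload itself and the file path are hypothetical):

```
// Today: a Row has to be built by hand for every line.
val rows = sc.textFile("store_sales.dat")   // path illustrative
  .map(_.split('|'))                        // RDD[Array[String]]
  .map(p => Row(p(0), p(1), p(2) /* ... through p(LARGE_NUM) */))
val table = sqlContext.applySchema(rows, schema)

// Proposed: hand the string arrays straight to applySchema.
// val table = sqlContext.applySchema(arrays, schema)
```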




[GitHub] spark pull request: [SPARK-2410][SQL] Merging Hive Thrift/JDBC ser...

2014-07-26 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/1600#issuecomment-50224519
  
Jenkins, retest this please.




[GitHub] spark pull request: [SPARK-2674] [SQL] [PySpark] support datetime ...

2014-07-26 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1601#discussion_r15434945
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala ---
@@ -395,6 +395,11 @@ class SchemaRDD(
       arr.asInstanceOf[Array[Any]].map {
         element => rowToMap(element.asInstanceOf[Row], struct)
       }
+    case t: java.sql.Timestamp => {
--- End diff --

This block can be removed.




[GitHub] spark pull request: [SPARK-2675] Increase EVENT_QUEUE_CAPACITY by ...

2014-07-25 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/1579#issuecomment-50183155
  
Unfortunately I may not be able to find time to run an experiment soon. If 
anyone is interested, I think `org.apache.spark.util.SizeEstimator` is 
reasonable to try. Let's also see if 
[SPARK-2316](https://issues.apache.org/jira/browse/SPARK-2316) gets fixed in 
the near future, which may remove the need to increase the queue capacity.
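For anyone who picks this up, roughly the measurement I have in mind -- a
sketch only: the payload below is a stand-in (substitute a real
SparkListenerEvent), 10000 is my reading of the pre-PR capacity (check
LiveListenerBus for the actual constant), and SizeEstimator is private[spark],
so this has to run from spark scope or a test:

    import org.apache.spark.util.SizeEstimator

    val event: AnyRef = Seq.fill(100)("stage info")  // stand-in payload only
    val perEventBytes = SizeEstimator.estimate(event)
    // Worst-case queue footprint: per-event size x capacity x the 20x bump.
    println(s"~${perEventBytes * 10000L * 20L / (1024 * 1024)} MB if the queue fills up")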




[GitHub] spark pull request: [SPARK-2675] Increase EVENT_QUEUE_CAPACITY by ...

2014-07-24 Thread concretevitamin
GitHub user concretevitamin opened a pull request:

https://github.com/apache/spark/pull/1579

[SPARK-2675] Increase EVENT_QUEUE_CAPACITY by 20x.

JIRA ticket: https://issues.apache.org/jira/browse/SPARK-2675

@pwendell @andrewor

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/concretevitamin/spark increase-event-queue

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1579.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1579


commit 8ef1d0785221c2621aca2667c8acf0241c4394ef
Author: Zongheng Yang zonghen...@gmail.com
Date:   2014-07-24T19:19:25Z

Increase EVENT_QUEUE_CAPACITY by 20x.






[GitHub] spark pull request: [SPARK-2675] Increase EVENT_QUEUE_CAPACITY by ...

2014-07-24 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/1579#issuecomment-50065677
  
Jenkins, test this please.




[GitHub] spark pull request: [SPARK-2531] [SQL] Make BroadcastNestedLoopJoi...

2014-07-23 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/1448#issuecomment-49911126
  
Thanks for the comments @chenghao-intel and @marmbrus. As Michael said I'll 
revisit this after the codegen PR.




[GitHub] spark pull request: [WIP][SPARK-2054][SQL] Code Generation for Exp...

2014-07-23 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/993#discussion_r15311508
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateProjection.scala
 ---
@@ -0,0 +1,218 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.codegen
+
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.types._
+
+
+/**
+ * Generates bytecode that produces a new [[Row]] object based on a fixed 
set of input
+ * [[Expression Expressions]] and a given input [[Row]].  The returned 
[[Row]] object is custom
+ * generated based on the output types of the [[Expression]] to avoid 
boxing of primitive values.
+ */
+object GenerateProjection extends CodeGenerator[Seq[Expression], 
Projection] {
+  import scala.reflect.runtime.{universe => ru}
+  import scala.reflect.runtime.universe._
+
+  protected def canonicalize(in: Seq[Expression]): Seq[Expression] =
+in.map(ExpressionCanonicalizer(_))
+
+  protected def bind(in: Seq[Expression], inputSchema: Seq[Attribute]): 
Seq[Expression] =
+in.map(BindReferences.bindReference(_, inputSchema))
+
+  // Make Mutability optional...
+  protected def create(expressions: Seq[Expression]): Projection = {
+    val tupleLength = ru.Literal(Constant(expressions.length))
+    val lengthDef = q"final val length = $tupleLength"
+
+    /* TODO: Configurable...
+    val nullFunctions =
+      q"""
+        private final val nullSet = new org.apache.spark.util.collection.BitSet(length)
+        final def setNullAt(i: Int) = nullSet.set(i)
--- End diff --

Could inlining help these defs?




[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...

2014-07-22 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1238#discussion_r15243150
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala ---
@@ -66,8 +66,8 @@ abstract class SparkPlan extends QueryPlan[SparkPlan] 
with Logging {
  * linking.
  */
 @DeveloperApi
-case class SparkLogicalPlan(alreadyPlanned: SparkPlan, tableName: String = "SparkLogicalPlan")
-  extends BaseRelation with MultiInstanceRelation {
+case class SparkLogicalPlan(alreadyPlanned: SparkPlan)(@transient 
sqlContext: SQLContext)
--- End diff --

That will be pretty awesome & reduce much boilerplate code.




[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...

2014-07-22 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1238#discussion_r15243796
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
 ---
@@ -26,6 +26,26 @@ import org.apache.spark.sql.catalyst.trees
 abstract class LogicalPlan extends QueryPlan[LogicalPlan] {
   self: Product =>
 
+  // TODO: handle overflow?
--- End diff --

BigInt somehow completely eluded me before. Let me try to make the changes 
now.




[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...

2014-07-22 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/1238#issuecomment-49785200
  
I have addressed the latest round of review comments & rebased onto latest 
master.




[GitHub] spark pull request: Fix flakey HiveQuerySuite test

2014-07-21 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/1514#issuecomment-49667765
  
Thanks for the fix. Looks good to me.




[GitHub] spark pull request: [SPARK-2561][SQL] Fix apply schema

2014-07-21 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/1470#issuecomment-49686947
  
LGTM.




[GitHub] spark pull request: [SPARK-2523] [SQL] [WIP] Hadoop table scan bug...

2014-07-16 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1439#discussion_r15013611
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -241,4 +252,37 @@ private[hive] object HadoopTableReader {
 val bufferSize = System.getProperty("spark.buffer.size", "65536")
 jobConf.set("io.file.buffer.size", bufferSize)
   }
+
+  /**
+   * Transform the raw data (Writable objects) into Row objects for an iterable input
+   * @param iter Iterable input, represented as Writable objects
+   * @param deserializer Deserializer associated with the input Writable objects
+   * @param attrs The row attribute names and their zero-based positions in the MutableRow
+   * @param row Reusable MutableRow object
+   * 
+   * @return Iterator of Row objects transformed from the given iterable input.
+   */
+  def fillObject(iter: Iterator[Writable], deserializer: Deserializer, 
+  attrs: Seq[(Attribute, Int)], row: GenericMutableRow): Iterator[Row] 
= {
+val soi = 
deserializer.getObjectInspector().asInstanceOf[StructObjectInspector]
+// get the field references according to the attributes(output of the 
reader) required
+val fieldRefs = attrs.map { case (attr, idx) => (soi.getStructFieldRef(attr.name), idx) }
+  
+// Map each tuple to a row object
+iter.map { value =>
+  val raw = deserializer.deserialize(value)
+  var idx = 0;
+  while(idx < fieldRefs.length) {
--- End diff --

nit: space after while




[GitHub] spark pull request: [SPARK-2190][SQL] Specialized ColumnType for T...

2014-07-16 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1440#discussion_r15018072
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnStats.scala ---
@@ -344,21 +344,52 @@ private[sql] class StringColumnStats extends 
BasicColumnStats(STRING) {
   }
 
   override def contains(row: Row, ordinal: Int) = {
-!(upperBound eq null) && {
+(upperBound ne null) && {
--- End diff --

I think I read somewhere `ne` has better performance in certain cases...




[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...

2014-07-16 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1238#discussion_r15019589
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
 ---
@@ -26,6 +26,28 @@ import org.apache.spark.sql.catalyst.trees
 abstract class LogicalPlan extends QueryPlan[LogicalPlan] {
   self: Product =>
 
+  // TODO: handle overflow?
+  /**
+   * Estimates of various statistics.  The default estimation logic simply 
sums up the corresponding
+   * statistic produced by the children.  To override this behavior, 
override `statistics` and
+   * assign it an overridden version of `Statistics`.
+   */
+  case class Statistics(
+/**
+ * Number of output tuples. For leaf operators this defaults to 1, 
otherwise it is set to the
+ * product of children's `numTuples`.
+ */
+numTuples: Long = childrenStats.map(_.numTuples).product,
+
+/**
+ * Physical size in bytes. For leaf operators this defaults to 1, 
otherwise it is set to the
+ * product of children's `sizeInBytes`.
+ */
+sizeInBytes: Long = childrenStats.map(_.sizeInBytes).product
+  )
+  lazy val statistics: Statistics = new Statistics
+  lazy val childrenStats = children.map(_.statistics)
--- End diff --

I think we need a lazy marker for this, otherwise during the initialization 
of `childrenStats`, `statistics` in all children will get evaluated too.




[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...

2014-07-16 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/1238#issuecomment-49210466
  
Jenkins, test this please.

I think I have addressed the latest round of review comments, with the 
biggest changes being:

- Remove statistics estimates from ParquetRelation for now, due to 
performance considerations.
- When estimating sizes for MetastoreRelation, instead of making 
potentially expensive Hadoop FileSystem calls, we peek into the Hive Metastore 
for its populated information (see the sketch below).
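The metastore lookup, in miniature -- a toy stand-in for
`hiveQlTable.getParameters`, where "totalSize" is the key Hive's
StatsSetupConst.TOTAL_SIZE points at and the default's name is a placeholder:

    import java.util.{HashMap => JHashMap}

    // The metastore exposes table properties as a java.util.Map[String, String].
    val params = new JHashMap[String, String]()
    params.put("totalSize", "123456789")

    val statsDefaultSizeInBytes = Long.MaxValue / 4  // placeholder default
    val sizeInBytes =
      Option(params.get("totalSize")).map(_.toLong).getOrElse(statsDefaultSizeInBytes)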




[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...

2014-07-16 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/1238#issuecomment-49219047
  
Jenkins, test this please.




[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...

2014-07-16 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1238#discussion_r15024063
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
 ---
@@ -26,6 +26,28 @@ import org.apache.spark.sql.catalyst.trees
 abstract class LogicalPlan extends QueryPlan[LogicalPlan] {
   self: Product =>
 
+  // TODO: handle overflow?
+  /**
+   * Estimates of various statistics.  The default estimation logic simply 
sums up the corresponding
+   * statistic produced by the children.  To override this behavior, 
override `statistics` and
+   * assign it an overridden version of `Statistics`.
+   */
+  case class Statistics(
+/**
+ * Number of output tuples. For leaf operators this defaults to 1, 
otherwise it is set to the
+ * product of children's `numTuples`.
+ */
+numTuples: Long = childrenStats.map(_.numTuples).product,
+
+/**
+ * Physical size in bytes. For leaf operators this defaults to 1, 
otherwise it is set to the
+ * product of children's `sizeInBytes`.
+ */
+sizeInBytes: Long = childrenStats.map(_.sizeInBytes).product
+  )
+  lazy val statistics: Statistics = new Statistics
+  lazy val childrenStats = children.map(_.statistics)
--- End diff --

Thanks for the note -- I wasn't well aware of the memory/time overhead of 
lazy vals. Since we have one statistic at the moment, `childrenStats` is used 
only once per logical node, so I went ahead and removed it.

However, memoizing `statistics` itself probably would save some planning 
time, as it will get called multiple times during planning (e.g. different 
cases in the HashJoin pattern match). If we were to use a `def`, it would be 
re-evaluated a quadratic number of times.

Using a plain `val` would fire off the computation of the stats for all 
logical nodes right away. I think this is less desirable than only evaluating 
it for everything under a Join when needed by a Strategy. Also, we'd have very 
few operators in a typical query anyway.
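For anyone following along, a toy illustration of the trade-off (standalone,
not the planner code):

    // A def would recompute on every call (quadratic across repeated planning);
    // a lazy val computes once per node, on first use, and caches.
    class Node(children: Seq[Node]) {
      lazy val statistics: Long = {
        println("computing")  // fires at most once per node
        1L + children.map(_.statistics).sum
      }
    }
    val leaf = new Node(Nil)
    val root = new Node(Seq(leaf, leaf))
    root.statistics
    root.statistics  // "computing" prints twice in total (root + shared leaf), never again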




[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...

2014-07-16 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/1238#issuecomment-49222667
  
Jenkins, test this please




[GitHub] spark pull request: [SPARK-2531] [SQL] Make BroadcastNestedLoopJoi...

2014-07-16 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/1448#issuecomment-49227975
  
Jenkins, this is okay to test.




[GitHub] spark pull request: [SQL] Add HiveDecimal HiveVarchar support in...

2014-07-16 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/1436#issuecomment-49228593
  
Hey @chenghao-intel -- can you create a JIRA ticket for this?




[GitHub] spark pull request: [SPARK-2531] [SQL] Make BroadcastNestedLoopJoi...

2014-07-16 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/1448#issuecomment-49239568
  
Jenkins, test this please...




[GitHub] spark pull request: [SQL] Synchronize on a lock when using scala r...

2014-07-15 Thread concretevitamin
GitHub user concretevitamin opened a pull request:

https://github.com/apache/spark/pull/1423

[SQL] Synchronize on a lock when using scala reflection inside data type 
objects.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/concretevitamin/spark scala-ref-catalyst

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1423.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1423


commit 325a14944cf5ef26b66f770f92ad225c4b3610a6
Author: Zongheng Yang zonghen...@gmail.com
Date:   2014-07-15T23:11:51Z

Synchronize on a lock when initializing data type objects in Catalyst.






[GitHub] spark pull request: [SQL] Synchronize on a lock when using scala r...

2014-07-15 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/1423#issuecomment-49105217
  
Jenkins, add to whitelist please.




[GitHub] spark pull request: [SQL] Synchronize on a lock when using scala r...

2014-07-15 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/1423#issuecomment-49105361
  
Jenkins, test this please.




[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-14 Thread concretevitamin
Github user concretevitamin closed the pull request at:

https://github.com/apache/spark/pull/1390




[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-14 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/1390#issuecomment-48935494
  
@yhuai suggested a much simpler fix -- I benchmarked this and it gave the 
same performance boost. I am closing this and opening a new PR.




[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-14 Thread concretevitamin
GitHub user concretevitamin opened a pull request:

https://github.com/apache/spark/pull/1408

[SPARK-2443][SQL] Fix slow read from partitioned tables

This fix obtains a performance boost comparable to [PR 
#1390](https://github.com/apache/spark/pull/1390) by moving an array update and 
a deserializer initialization out of a potentially very long loop (the essence 
of the change is sketched after the benchmarks below). Suggested by @yhuai. 
The results below are updated for this fix.

## Benchmarks
Generated a local text file with 10M rows of simple key-value pairs. The 
data is loaded as a table through Hive. Results are obtained on my local 
machine using hive/console.

Without the fix:

Type | Non-partitioned | Partitioned (1 part)
---- | --------------- | --------------------
First run | 9.52s end-to-end (1.64s Spark job) | 36.6s (28.3s)
Stabilized runs | 1.21s (1.18s) | 27.6s (27.5s)

With this fix:

Type | Non-partitioned | Partitioned (1 part)
---- | --------------- | --------------------
First run | 9.57s (1.46s) | 11.0s (1.69s)
Stabilized runs | 1.13s (1.10s) | 1.23s (1.19s)
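The essence of the fix, as a standalone sketch -- `makeDeserializer` is a
stand-in for the real per-partition setup in TableReader, not the actual code:

    // Expensive setup we do not want inside the hot loop.
    def makeDeserializer(): String => String = { println("init"); s => s.toUpperCase }

    // Before: setup redone per row -- "init" prints once per element.
    def slow(iter: Iterator[String]) = iter.map(value => makeDeserializer()(value))

    // After: loop-invariant setup hoisted out -- "init" prints once per partition.
    def fast(iter: Iterator[String]) = {
      val deserialize = makeDeserializer()
      iter.map(deserialize)
    }

    println(slow(Iterator("a", "b")).toList)  // init, init, List(A, B)
    println(fast(Iterator("a", "b")).toList)  // init, List(A, B)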



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/concretevitamin/spark slow-read-2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1408.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1408


commit d86e437218f99179934ccd9b4d5d89c02b09459d
Author: Zongheng Yang zonghen...@gmail.com
Date:   2014-07-14T18:03:07Z

Move update & initialization out of potentially long loop.






[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-14 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/1390#issuecomment-48936743
  
New PR here: #1408 




[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-14 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/1408#issuecomment-48936856
  
Jenkins, test this please.




[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-14 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1390#discussion_r14894946
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -157,21 +161,60 @@ class HadoopTableReader(@transient _tableDesc: 
TableDesc, @transient sc: HiveCon
 
   // Create local references so that the outer object isn't serialized.
   val tableDesc = _tableDesc
+  val tableSerDeClass = tableDesc.getDeserializerClass
+
   val broadcastedHiveConf = _broadcastedHiveConf
   val localDeserializer = partDeserializer
 
   val hivePartitionRDD = createHadoopRdd(tableDesc, inputPathStr, ifc)
-  hivePartitionRDD.mapPartitions { iter =>
+  hivePartitionRDD.mapPartitions { case iter =>
--- End diff --

I initially thought that in a function context, `{ case x => ... }` would be 
optimized to `{ x => ... }`. I did a `scalac -print` on a simple program to 
confirm that this is not the case.
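Concretely, the two forms I compared (try `scalac -print` on each):

    // Plain function literal: compiles to a straight Function1.apply.
    val f: Iterator[Int] => Iterator[Int] = iter => iter.map(_ + 1)

    // `case` form: compiles to a pattern match inside apply, adding a
    // (trivially succeeding) match step on every call.
    val g: Iterator[Int] => Iterator[Int] = { case iter => iter.map(_ + 1) }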




[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...

2014-07-14 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/1238#issuecomment-48941276
  
Jenkins, test this please.




[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-14 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/1408#issuecomment-48954674
  
I think we should ask the users who reported the performance issue if this 
fix solves their problems. Otherwise the comments in the previous PR seem to 
only apply to that implementation. 




[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-14 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1390#discussion_r14902569
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -157,21 +161,60 @@ class HadoopTableReader(@transient _tableDesc: 
TableDesc, @transient sc: HiveCon
 
   // Create local references so that the outer object isn't serialized.
   val tableDesc = _tableDesc
+  val tableSerDeClass = tableDesc.getDeserializerClass
+
   val broadcastedHiveConf = _broadcastedHiveConf
   val localDeserializer = partDeserializer
 
   val hivePartitionRDD = createHadoopRdd(tableDesc, inputPathStr, ifc)
-  hivePartitionRDD.mapPartitions { iter =>
+  hivePartitionRDD.mapPartitions { case iter =>
--- End diff --

https://gist.github.com/concretevitamin/272fe413dcc06b8cbe9c

It seems the with-`case` version does have more instructions to execute.






[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...

2014-07-14 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1238#discussion_r14902843
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
@@ -263,6 +268,19 @@ private[hive] case class MetastoreRelation
 new Partition(hiveQlTable, p)
   }
 
+  // TODO: are there any stats in hiveQlTable.getSkewedInfo that we can 
use?
+  @transient override lazy val statistics = new Statistics {
--- End diff --

We could certainly move it out of the Statistics class to reside in 
`MetastoreRelation`, but does `new Statistics` incur extra overheads? Otherwise 
I think the copy constructor may not be the good style here, as we have a long 
comment and relatively long implementation for the field `sizeInBytes`.
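The two styles side by side, on toy types (not the PR's actual `Statistics`):

    class Stats { lazy val sizeInBytes: Long = 1L }

    // Style 1: anonymous subclass -- room for a long comment and a long body.
    val viaSubclass = new Stats {
      override lazy val sizeInBytes: Long = {
        // ... a long metastore-based computation would live here ...
        42L
      }
    }

    // Style 2: copy-style constructor call -- terser, but a long body reads worse.
    case class Stats2(sizeInBytes: Long = 1L)
    val viaCopy = Stats2(sizeInBytes = 42L)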




[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-14 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1390#discussion_r14902570
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -157,21 +161,60 @@ class HadoopTableReader(@transient _tableDesc: 
TableDesc, @transient sc: HiveCon
 
   // Create local references so that the outer object isn't serialized.
   val tableDesc = _tableDesc
+  val tableSerDeClass = tableDesc.getDeserializerClass
+
   val broadcastedHiveConf = _broadcastedHiveConf
   val localDeserializer = partDeserializer
 
   val hivePartitionRDD = createHadoopRdd(tableDesc, inputPathStr, ifc)
-  hivePartitionRDD.mapPartitions { iter =>
+  hivePartitionRDD.mapPartitions { case iter =>
--- End diff --

https://gist.github.com/concretevitamin/272fe413dcc06b8cbe9c

It seems the with-`case` version does have more instructions to execute.






[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...

2014-07-14 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1238#discussion_r14904711
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala ---
@@ -47,6 +47,13 @@ private[sql] abstract class SparkStrategies extends 
QueryPlanner[SparkPlan] {
   /**
* Uses the ExtractEquiJoinKeys pattern to find joins where at least 
some of the predicates can be
* evaluated by matching hash keys.
+   *
+   * This strategy applies a simple optimization based on the estimates of 
the physical sizes of
+   * the two join sides.  When planning a [[execution.BroadcastHashJoin]], 
if one side has an
+   * estimated physical size smaller than the user-settable threshold
+   * `spark.sql.auto.convert.join.size`, the planner would mark it as the 
''build'' relation and
--- End diff --

Done.




[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...

2014-07-14 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1238#discussion_r14906952
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
 ---
@@ -92,6 +114,8 @@ abstract class LogicalPlan extends 
QueryPlan[LogicalPlan] {
 abstract class LeafNode extends LogicalPlan with 
trees.LeafNode[LogicalPlan] {
   self: Product =>
 
+  override lazy val statistics = Statistics(numTuples = 1L, sizeInBytes = 
1L)
--- End diff --

Setting large default values here (1) doesn't seem to make intuitive sense, 
and (2) can cause superfluous overflows when things get multiplied. 

This 
[commit](https://github.com/concretevitamin/spark/commit/f41ecf22d516bfeabc4168f2a1e6518169d85c29)
 introduced this change, which makes 1L a special default value that gets 
preserved even when a bunch of operators lack meaningful estimates -- because 
the default estimates just multiply a bunch of 1's together, producing the 
special value 1L. With this preservation, we can test for the absence of 
meaningful estimates and do special handling accordingly, as shown in that 
commit. (One example in that commit: if RDDs don't have meaningful estimates, 
the field defaults to 
[SQLConf#statsDefaultSizeInBytes](https://github.com/apache/spark/pull/1238/files#diff-41ef65b9ef5b518f77e2a03559893f4dR61),
 which is set to be greater than the auto join conversion threshold.)
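The sentinel behavior in miniature:

    // With 1L as the default, multiplying defaults preserves the sentinel,
    // so "no meaningful estimate anywhere below" stays detectable.
    val childEstimates = Seq(1L, 1L, 1L)
    assert(childEstimates.product == 1L)  // still the "unknown" marker

    // A large default would compound instead: three children at 10^6 each
    // already gives 10^18, one step from overflowing Long.
    println(Seq(1000000L, 1000000L, 1000000L).product)  // 1000000000000000000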






[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...

2014-07-13 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1238#discussion_r14857017
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala ---
@@ -21,16 +21,27 @@ import java.util.Properties
 
 import scala.collection.JavaConverters._
 
+private object SQLConf {
+  @transient protected[spark] val confSettings = 
java.util.Collections.synchronizedMap(
--- End diff --

In one of the earlier commits, I tried the approach of making an operator 
mix in SQLConf -- hence the need for a singleton object to hold the settings. 
Now I just have that operator take a SQLContext in order to get access to the 
conf, so if you want I could remove this singleton.




[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...

2014-07-12 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/1390#issuecomment-48830080
  
Jenkins, retest this please.




[GitHub] spark pull request: [SPARK-2393][SQL] Prototype implementation of ...

2014-07-10 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/1238#issuecomment-48570761
  
I have addressed most of the review comments & rebased. Please review again 
& take a look at the TODOs.

It might be better to leave the optimization for BroadcastNestedLoopJoin 
for a future PR (although I have some WIP locally), as it requires a couple 
changes to the class (mainly to support a BuildSide). This is the reason the 
tests were failing before.




[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...

2014-07-10 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/1238#issuecomment-48630836
  
@rxin sure, done.




[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-10 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1359#discussion_r14781579
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala
 ---
@@ -19,8 +19,11 @@ package org.apache.spark.sql.catalyst.expressions
 
 import java.util.regex.Pattern
 
+import scala.collection.IndexedSeqOptimized
+
 import org.apache.spark.sql.catalyst.types.DataType
 import org.apache.spark.sql.catalyst.types.StringType
+import org.apache.spark.sql.catalyst.types.BinaryType
--- End diff --

Alphabetize these, or combine the imports.




[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-10 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1359#discussion_r14781690
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala
 ---
@@ -207,3 +210,64 @@ case class StartsWith(left: Expression, right: 
Expression) extends StringCompari
 case class EndsWith(left: Expression, right: Expression) extends 
StringComparison {
   def compare(l: String, r: String) = l.endsWith(r)
 }
+
+/**
+ * A function that takes a substring of its first argument starting at a 
given position.
+ * Defined for String and Binary types.
+ */
+case class Substring(str: Expression, pos: Expression, len: Expression) 
extends Expression {
+  
+  type EvaluatedType = Any
+  
+  def nullable: Boolean = true
+  def dataType: DataType = {
+if (str.dataType == BinaryType) str.dataType else StringType
+  }
+  
+  def references = children.flatMap(_.references).toSet
+  
+  override def children = str :: pos :: len :: Nil
+  
+  def slice[T, C <% IndexedSeqOptimized[T, _]](str: C, startPos: Int, sliceLen: Int): Any = {
+val len = str.length
+// Hive and SQL use one-based indexing for SUBSTR arguments but also 
accept zero and
+// negative indices for start positions. If a start index i is greater 
than 0, it 
+// refers to element i-1 in the sequence. If a start index i is less 
than 0, it refers
+// to the -ith element before the end of the sequence. If a start 
index i is 0, it
+// refers to the first element.
+
+val start = startPos match {
+  case pos if pos > 0 => pos - 1
+  case neg if neg < 0 => len + neg
+  case _ => 0
+}
+
+val end = sliceLen match {
+  case max if max == Integer.MAX_VALUE => max
+  case x => start + x
+}
+  
+str.slice(start, end)
+  }
+  
+  override def eval(input: Row): Any = {
+val string = str.eval(input)
+
+val po = pos.eval(input)
+val ln = len.eval(input)
+
+if ((string == null) || (po == null) || (ln == null)) {
+  null
+} else {
+  val start = po.asInstanceOf[Int]
+  val length = ln.asInstanceOf[Int] 
+  
+  string match {
+case ba: Array[Byte] => slice(ba, start, length)
+case other => slice(other.toString, start, length)
+  }
+}
+  }
+  
+  override def toString = s"SUBSTR($str, $pos, $len)"
--- End diff --

Minor: what do you think about displaying only two args for the 2-arg case, 
instead of showing an Integer.MAX_VALUE?




[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-10 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1359#discussion_r14781542
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala
 ---
@@ -207,3 +210,64 @@ case class StartsWith(left: Expression, right: 
Expression) extends StringCompari
 case class EndsWith(left: Expression, right: Expression) extends 
StringComparison {
   def compare(l: String, r: String) = l.endsWith(r)
 }
+
+/**
+ * A function that takes a substring of its first argument starting at a 
given position.
+ * Defined for String and Binary types.
+ */
+case class Substring(str: Expression, pos: Expression, len: Expression) 
extends Expression {
+  
+  type EvaluatedType = Any
+  
+  def nullable: Boolean = true
+  def dataType: DataType = {
+if (str.dataType == BinaryType) str.dataType else StringType
--- End diff --

Hey Will -- I think we need to check if `resolved` is true here to not 
break Catalyst's contract. Similar to 
[here](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala#L183).
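In miniature, the contract looks like this (toy types, not Catalyst's; the
error stands in for UnresolvedException):

    // dataType must not be consulted before the expression tree is resolved.
    trait Expr {
      def resolved: Boolean
      protected def computeTypeName: String
      def dataTypeName: String =
        if (resolved) computeTypeName
        else sys.error("dataType called on unresolved expression")
    }
    case class Lit(resolved: Boolean) extends Expr {
      protected def computeTypeName = "StringType"
    }
    assert(Lit(resolved = true).dataTypeName == "StringType")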




[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-10 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1359#discussion_r14782267
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala ---
@@ -860,6 +860,7 @@ private[hive] object HiveQl {
   val BETWEEN = "(?i)BETWEEN".r
   val WHEN = "(?i)WHEN".r
   val CASE = "(?i)CASE".r
+  val SUBSTR = "(?i)I_SUBSTR(?:ING)?".r
--- End diff --

Why I_ here?




[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-10 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1359#discussion_r14782583
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala
 ---
@@ -207,3 +210,64 @@ case class StartsWith(left: Expression, right: 
Expression) extends StringCompari
 case class EndsWith(left: Expression, right: Expression) extends 
StringComparison {
   def compare(l: String, r: String) = l.endsWith(r)
 }
+
+/**
+ * A function that takes a substring of its first argument starting at a 
given position.
+ * Defined for String and Binary types.
+ */
+case class Substring(str: Expression, pos: Expression, len: Expression) 
extends Expression {
+  
+  type EvaluatedType = Any
+  
+  def nullable: Boolean = true
+  def dataType: DataType = {
+if (str.dataType == BinaryType) str.dataType else StringType
+  }
+  
+  def references = children.flatMap(_.references).toSet
+  
+  override def children = str :: pos :: len :: Nil
+  
+  def slice[T, C <% IndexedSeqOptimized[T, _]](str: C, startPos: Int, sliceLen: Int): Any = {
+val len = str.length
+// Hive and SQL use one-based indexing for SUBSTR arguments but also 
accept zero and
--- End diff --

It looks like from [1] and [2] that Hive and SQL by default refer users to 
1-based indexing (and the links don't seem to mention 0-based at all). Even 
though they do support it, this subtle fact necessitates a couple of 
branchings in the implementation, which might cost us some performance. What 
do you people think about supporting 1-based indexing only? +@marmbrus @rxin
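For reference, the 1-based semantics in plain Scala (simplified: this ignores
the len-less-than-1 and Integer.MAX_VALUE handling discussed in this PR):

    def substr(s: String, pos: Int, len: Int): String = {
      val start = if (pos > 0) pos - 1 else if (pos < 0) s.length + pos else 0
      s.slice(start, start + len)
    }
    assert(substr("Spark", 1, 2) == "Sp")   // 1-based: index 1 is the first char
    assert(substr("Spark", 0, 2) == "Sp")   // 0 behaves like 1
    assert(substr("Spark", -3, 2) == "ar")  // negative counts from the end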




[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-10 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1359#discussion_r14782682
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala
 ---
@@ -207,3 +210,64 @@ case class StartsWith(left: Expression, right: 
Expression) extends StringCompari
 case class EndsWith(left: Expression, right: Expression) extends 
StringComparison {
   def compare(l: String, r: String) = l.endsWith(r)
 }
+
+/**
+ * A function that takes a substring of its first argument starting at a 
given position.
+ * Defined for String and Binary types.
+ */
+case class Substring(str: Expression, pos: Expression, len: Expression) 
extends Expression {
+  
+  type EvaluatedType = Any
+  
+  def nullable: Boolean = true
+  def dataType: DataType = {
+if (str.dataType == BinaryType) str.dataType else StringType
+  }
+  
+  def references = children.flatMap(_.references).toSet
+  
+  override def children = str :: pos :: len :: Nil
+  
+  def slice[T, C <% IndexedSeqOptimized[T, _]](str: C, startPos: Int, sliceLen: Int): Any = {
+val len = str.length
+// Hive and SQL use one-based indexing for SUBSTR arguments but also 
accept zero and
+// negative indices for start positions. If a start index i is greater 
than 0, it 
+// refers to element i-1 in the sequence. If a start index i is less 
than 0, it refers
+// to the -ith element before the end of the sequence. If a start 
index i is 0, it
+// refers to the first element.
+
+val start = startPos match {
+  case pos if pos > 0 => pos - 1
+  case neg if neg < 0 => len + neg
+  case _ => 0
+}
+
+val end = sliceLen match {
--- End diff --

The MySQL doc mentions that "If len is less than 1, the result is the empty 
string."




[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-10 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/1359#issuecomment-48642397
  
Hey @willb - thanks for working on this; it is going to be very useful 
for Spark SQL. I left a couple of minor comments. Another general concern is 
the performance of `eval()`. If there are ways to reduce branching or function 
calls (not sure if @inline will help), we should probably do it.




[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-10 Thread concretevitamin
Github user concretevitamin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1359#discussion_r14783724
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala
 ---
@@ -207,3 +210,64 @@ case class StartsWith(left: Expression, right: 
Expression) extends StringCompari
 case class EndsWith(left: Expression, right: Expression) extends 
StringComparison {
   def compare(l: String, r: String) = l.endsWith(r)
 }
+
+/**
+ * A function that takes a substring of its first argument starting at a 
given position.
+ * Defined for String and Binary types.
+ */
+case class Substring(str: Expression, pos: Expression, len: Expression) 
extends Expression {
+  
+  type EvaluatedType = Any
+  
+  def nullable: Boolean = true
+  def dataType: DataType = {
+if (str.dataType == BinaryType) str.dataType else StringType
+  }
+  
+  def references = children.flatMap(_.references).toSet
+  
+  override def children = str :: pos :: len :: Nil
+  
+  def slice[T, C <% IndexedSeqOptimized[T, _]](str: C, startPos: Int, sliceLen: Int): Any = {
+val len = str.length
+// Hive and SQL use one-based indexing for SUBSTR arguments but also 
accept zero and
+// negative indices for start positions. If a start index i is greater 
than 0, it 
+// refers to element i-1 in the sequence. If a start index i is less 
than 0, it refers
+// to the -ith element before the end of the sequence. If a start 
index i is 0, it
+// refers to the first element.
+
+val start = startPos match {
+  case pos if pos > 0 => pos - 1
+  case neg if neg < 0 => len + neg
+  case _ => 0
+}
+
+val end = sliceLen match {
--- End diff --

Right, missed this before, sorry.




[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...

2014-07-10 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/1238#issuecomment-48645064
  
To handle potential overflow (one last TODO), I think there are a couple of 
alternatives:

- A: Throw exceptions for overflowing operations. Similar to [1].
- B: Use [1], but replace the overflow situations with a Top and a Bottom 
that absorb/saturate things correspondingly. Similar concepts here [2].
- C: Use [3] (or reimplement parts of it), which just carries out an 
overflowing operation in the lifted BigInt counterparts.

I think Approach A is bad: when dealing with big data, we'd almost certainly 
run into this case eventually. Approach B is reasonable in that whenever we 
see a Top/Bottom, we can just disable or special-case the cost estimation 
(a sketch of B follows the references below). Approach C looks okay too but 
may be too heavy.

Let me know what you guys think should go into this PR.

[1] 
https://github.com/twitter/util/blob/master/util-core/src/main/scala/com/twitter/util/LongOverflowArith.scala
[2] 
https://github.com/twitter/util/blob/master/util-core/src/main/scala/com/twitter/util/Duration.scala
[3] 
https://github.com/non/spire/blob/master/core/src/main/scala/spire/math/SafeLong.scala
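To make Approach B concrete, a minimal saturating multiply -- sketch only,
positive operands assumed, and it routes through BigInt purely for brevity:

    // Instead of throwing on overflow (Approach A), clamp to Long.MaxValue,
    // which then absorbs any further multiplications -- a poor man's Top.
    def satMul(a: Long, b: Long): Long = {
      val product = BigInt(a) * BigInt(b)
      if (product > Long.MaxValue) Long.MaxValue else product.toLong
    }
    assert(satMul(1L << 40, 1L << 40) == Long.MaxValue)
    assert(satMul(3L, 4L) == 12L)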




[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...

2014-07-10 Thread concretevitamin
Github user concretevitamin commented on the pull request:

https://github.com/apache/spark/pull/1359#issuecomment-48648864
  
This sounds pretty cool!



