[GitHub] spark pull request: [SPARK-6806] [SPARKR] [DOCS] Add a new SparkR ...
Github user concretevitamin commented on a diff in the pull request: https://github.com/apache/spark/pull/6490#discussion_r31304628 --- Diff: docs/sparkr.md --- @@ -0,0 +1,198 @@ +--- +layout: global +displayTitle: SparkR (R on Spark) +title: SparkR (R on Spark) +--- + +* This will become a table of contents (this text will be scraped). +{:toc} + +# Overview +SparkR is an R package that provides a light-weight frontend to use Apache Spark from R. +In Spark {{site.SPARK_VERSION}}, SparkR provides a distributed data frame implementation that +supports operations similar to R data frames, [dplyr](https://github.com/hadley/dplyr) but on large --- End diff -- add etc/e.g.? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-7512] [SPARKR] Fix RDD's show method to...
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/6035#issuecomment-100564609 LGTM!
[GitHub] spark pull request: [SPARK-6991] [SparkR] Adds support for zipPart...
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/5568#issuecomment-96814334 LGTM. /cc @shivaram
[GitHub] spark pull request: [SPARK-6991] [SparkR] Adds support for zipPart...
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/5568#issuecomment-96751034 @hlin09 - cool, I don't have a strong opinion towards either approach for now. Did a pass and left some minor style comments.
[GitHub] spark pull request: [SPARK-6991] [SparkR] Adds support for zipPart...
Github user concretevitamin commented on a diff in the pull request: https://github.com/apache/spark/pull/5568#discussion_r29169055 --- Diff: R/pkg/R/RDD.R ---

@@ -1529,3 +1529,50 @@ setMethod("zipRDD",
     PipelinedRDD(zippedRDD, partitionFunc)
   })
+
+#' Zip an RDD's partitions with one (or more) RDD(s).
+#'
+#' Return a new RDD by applying a function to the zipped partitions.
+#' Assumes that all the RDDs have the same number of partitions*, but does *not*
+#' require them to have the same number of elements in each partition.
+#'
+#' @param x An RDD to be zipped.
+#' @param other Another RDD to be zipped.
+#' @return An RDD zipped from the two RDDs.
+#' @examples
+#'\dontrun{
+#' sc <- sparkR.init()
+#' rdd1 <- parallelize(sc, 1:2, 2L)  # 1, 2
+#' rdd2 <- parallelize(sc, 1:4, 2L)  # 1:2, 3:4
+#' rdd3 <- parallelize(sc, 1:6, 2L)  # 1:3, 4:6
+#' collect(zipPartitions(rdd1, rdd2, rdd3,
+#'                       func = function(x, y, z) { list(list(x, y, z)) }))
+#' # list(list(1, c(1,2), c(1,2,3)), list(2, c(3,4), c(4,5,6)))
+#'}
+#' @rdname zipRDD
+#' @aliases zipPartitions,RDD
+setMethod("zipPartitions",
+          "RDD",
+          function(..., func) {
+            rrdds <- list(...)
+            if (length(rrdds) == 1) {
+              return(rrdds[[1]])
+            }
+            nPart <- sapply(rrdds, numPartitions)
+            if (length(unique(nPart)) != 1) {
+              stop("Can only zipPartitions RDDs which have the same number of partitions.")
+            }
+
+            rrdds <- lapply(rrdds, function(rdd) {
+              mapPartitionsWithIndex(rdd, function(split, part) {

--- End diff -- We've recently changed all usage of `split` to `partIndex`; something like `idx` is fine too.
[GitHub] spark pull request: [SPARK-6991] [SparkR] Adds support for zipPart...
Github user concretevitamin commented on a diff in the pull request: https://github.com/apache/spark/pull/5568#discussion_r29168961 --- Diff: R/pkg/R/RDD.R --- @@ -1529,3 +1529,50 @@ setMethod(zipRDD, PipelinedRDD(zippedRDD, partitionFunc) }) + +#' Zip an RDD's partitions with one (or more) RDD(s). +#' +#' Return a new RDD by applying a function to the zipped partitions. +#' Assumes that all the RDDs have the same number of partitions*, but does *not* --- End diff -- nit: unmatched *
[GitHub] spark pull request: [SPARK-6991] [SparkR] Adds support for zipPart...
Github user concretevitamin commented on a diff in the pull request: https://github.com/apache/spark/pull/5568#discussion_r29169169 --- Diff: R/pkg/R/RDD.R --- @@ -1529,3 +1529,50 @@ setMethod(zipRDD, PipelinedRDD(zippedRDD, partitionFunc) }) + +#' Zip an RDD's partitions with one (or more) RDD(s). +#' +#' Return a new RDD by applying a function to the zipped partitions. +#' Assumes that all the RDDs have the same number of partitions*, but does *not* +#' require them to have the same number of elements in each partition. +#' +#' @param x An RDD to be zipped. +#' @param other Another RDD to be zipped. --- End diff -- The doc here needs to be updated: (1) var-arg arity (2) document `func`.
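The zipPartitions semantics discussed in the comments above (equal partition counts required, equal per-partition element counts not required) can be illustrated outside Spark. The following is a hypothetical plain-Scala sketch, not SparkR's actual implementation: each "RDD" is modeled as a list of partitions, and a transpose stands in for the per-index zipping that `mapPartitionsWithIndex` achieves.

```scala
// Hypothetical sketch of the zipPartitions logic under review, modeled in plain
// Scala: each "RDD" is a list of partitions, each partition a list of elements.
// Partition counts must match; per-partition element counts need not.
object ZipPartitionsSketch {
  type Partition = List[Int]

  def zipPartitions(rdds: List[List[Partition]])(
      func: List[Partition] => Any): List[Any] = {
    require(rdds.map(_.length).distinct.size == 1,
      "Can only zipPartitions RDDs which have the same number of partitions.")
    // Transpose groups the i-th partition of every RDD together, mirroring
    // what zipping by partition index achieves in the real implementation.
    rdds.transpose.map(func)
  }

  def main(args: Array[String]): Unit = {
    val rdd1 = List(List(1), List(2))        // 2 partitions, 1 element each
    val rdd2 = List(List(1, 2), List(3, 4))  // 2 partitions, 2 elements each
    val zipped = zipPartitions(List(rdd1, rdd2)) { parts => parts }
    println(zipped) // List(List(List(1), List(1, 2)), List(List(2), List(3, 4)))
  }
}
```

Mismatched partition counts fail the `require` check, matching the `stop(...)` branch in the R code above.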
[GitHub] spark pull request: [SPARK-7033][SPARKR] Clean usage of split. Use...
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/5628#issuecomment-95651218 Thanks, @sun-rui - doing a grep, could you also update test_rdd.R's line 124?
[GitHub] spark pull request: [SPARK-7033][SPARKR] Clean usage of split. Use...
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/5628#issuecomment-95775125 LGTM
[GitHub] spark pull request: [SPARK-6991] [SparkR] Adds support for zipPart...
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/5568#issuecomment-94624336 Thanks @hlin09 - one high-level question: do you have an intended use case in mind for arbitrary / large-arity functions? In Spark's Scala API, up to 4 RDDs are supported, which has performance benefits (avoids a union + shuffle).
[GitHub] spark pull request: [Minor][SparkR] Minor refactor and removes red...
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/5495#issuecomment-92477984 Jenkins, test this please.
[GitHub] spark pull request: [Minor][SparkR] Minor refactor and removes red...
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/5495#issuecomment-92479065 Hey @pwendell I can't seem to summon Jenkins now :(
[GitHub] spark pull request: [Minor][SparkR] Minor refactor and removes red...
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/5495#issuecomment-92589456 LGTM
[GitHub] spark pull request: [SPARK-5654] Integrate SparkR
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/5096#issuecomment-91123979 :+1:
[GitHub] spark pull request: [SPARK-5654] Integrate SparkR into Apache Spar...
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/5077#issuecomment-83271348 @shivaram What's the timeline we are looking at? If possible I'd like to take a close look at this next week / after next week.
[GitHub] spark pull request: [SPARK-3609][SQL] Adds sizeInBytes statistics ...
Github user concretevitamin commented on a diff in the pull request: https://github.com/apache/spark/pull/2468#discussion_r17817992 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/dataTypes.scala ---

@@ -122,6 +122,16 @@ object NativeType {
   IntegerType, BooleanType, LongType, DoubleType, FloatType, ShortType, ByteType, StringType)
   def unapply(dt: DataType): Boolean = all.contains(dt)
+
+  val defaultSizeOf: Map[NativeType, Int] = Map(

--- End diff -- Does the JVM always use these conventional sizes on every platform?
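To the question above: the widths of Java primitive types are fixed by the Java language specification and do not vary by platform; since Java 8 they are exposed as `BYTES` constants on the boxed classes. The following standalone sketch (not Spark code; the map keys here are plain strings, unlike the `NativeType` keys in the diff) mirrors the kind of table under review.

```scala
// The primitive widths are spec-defined and platform-independent; Java 8+
// exposes them as *.BYTES. This is a standalone sketch, not Spark's map.
object PrimitiveSizes {
  val defaultSizeOf: Map[String, Int] = Map(
    "ByteType"    -> java.lang.Byte.BYTES,    // 1
    "ShortType"   -> java.lang.Short.BYTES,   // 2
    "IntegerType" -> java.lang.Integer.BYTES, // 4
    "LongType"    -> java.lang.Long.BYTES,    // 8
    "FloatType"   -> java.lang.Float.BYTES,   // 4
    "DoubleType"  -> java.lang.Double.BYTES,  // 8
    // boolean has no BYTES constant; its in-memory size is JVM-internal,
    // so 1 here is a conventional estimate, not a spec guarantee.
    "BooleanType" -> 1
  )

  def main(args: Array[String]): Unit =
    defaultSizeOf.toSeq.sortBy(_._1).foreach { case (t, n) => println(s"$t = $n") }
}
```

The one genuinely JVM-dependent case is `boolean`, whose storage size the spec leaves to the implementation.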
[GitHub] spark pull request: SPARK-3329: [SQL] Don't depend on Hive SET pai...
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/2220#issuecomment-53995569 ok to test
[GitHub] spark pull request: SPARK-3329: [SQL] Don't depend on Hive SET pai...
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/2220#issuecomment-53995880 Ah, so this problem *was* fixed in PR #1514, but it seems like this [PR](https://github.com/apache/spark/commit/a7a9d14479ea6421513a962ff0f45cb969368bab#diff-48) reverted the change accidentally. I think it'd be good to re-adapt 1514's solution. /cc @aarondav @liancheng
[GitHub] spark pull request: [SPARK-3252][SQL] Add missing condition for te...
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/2159#issuecomment-53635984 This is not harmful, yet it doesn't do anything either, since in the test query the two relations should be the same. On Wednesday, August 27, 2014, Michael Armbrust notificati...@github.com wrote: > /cc @concretevitamin
[GitHub] spark pull request: [SPARK-2315] Implement drop, dropRight and dro...
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/1839#issuecomment-51520540 Jenkins, this is okay to test.
[GitHub] spark pull request: [SPARK-2678][Core][SQL] A workaround for SPARK...
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/1801#issuecomment-51533012 @liancheng @andrewor14 @pwendell With this patch, things like `./bin/spark-shell --master local[2]` error out (bad options: --master). I had to work around this with an extra `--` before the flag. Is this intended?
[GitHub] spark pull request: [SPARK-2678][Core][SQL] A workaround for SPARK...
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/1801#issuecomment-51533306 Oh, it's been reported by #1825.
[GitHub] spark pull request: [SPARK-2406][SQL] Initial support for using Pa...
Github user concretevitamin commented on a diff in the pull request: https://github.com/apache/spark/pull/1819#discussion_r15913270 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala ---

@@ -78,6 +78,14 @@ class HiveContext(sc: SparkContext) extends SQLContext(sc) {
   // Change the default SQL dialect to HiveQL
   override private[spark] def dialect: String = getConf(SQLConf.DIALECT, "hiveql")
+  /**
+   * When true, enables an experimental feature where metastore tables that use the parquet SerDe
+   * are automatically converted to use the Spark SQL parquet table scan, instead of the Hive
+   * SerDe.
+   */
+  private[spark] def convertMetastoreParquet: Boolean =
+    getConf("spark.sql.hive.convertMetastoreParquet", "false") == "true"

--- End diff -- I am going to test this PR soon. In the meantime, would it make sense to put this only in `SQLConf` (along with a field for the key string in the singleton object), making that class the central place that stores SQL configs?
[GitHub] spark pull request: [SPARK-2406][SQL] Initial support for using Pa...
Github user concretevitamin commented on a diff in the pull request: https://github.com/apache/spark/pull/1819#discussion_r15914067 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala ---

@@ -78,6 +78,14 @@ class HiveContext(sc: SparkContext) extends SQLContext(sc) {
   // Change the default SQL dialect to HiveQL
   override private[spark] def dialect: String = getConf(SQLConf.DIALECT, "hiveql")
+  /**
+   * When true, enables an experimental feature where metastore tables that use the parquet SerDe
+   * are automatically converted to use the Spark SQL parquet table scan, instead of the Hive
+   * SerDe.
+   */
+  private[spark] def convertMetastoreParquet: Boolean =
+    getConf("spark.sql.hive.convertMetastoreParquet", "false") == "true"

--- End diff -- When in doubt, make up longer names: `SQLConfigOpts`, `HiveConfigOpts`. But this is only possibly relevant in the future and should not block this PR.
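The centralization suggested in the two comments above can be sketched in miniature. This is a hypothetical standalone sketch, not Spark's actual `SQLConf` API: the names `ConfKeys` and `SQLConfSketch` are made up for illustration.

```scala
// Sketch of the suggestion: keep config key strings and their typed accessors
// in one place, so callers never retype raw string literals. All names here
// are hypothetical, not Spark's real SQLConf.
object ConfKeys {
  val ConvertMetastoreParquet = "spark.sql.hive.convertMetastoreParquet"
}

class SQLConfSketch {
  private val settings = scala.collection.mutable.Map[String, String]()

  def setConf(key: String, value: String): Unit = settings(key) = value
  def getConf(key: String, default: String): String = settings.getOrElse(key, default)

  // Typed accessor living next to its key constant.
  def convertMetastoreParquet: Boolean =
    getConf(ConfKeys.ConvertMetastoreParquet, "false").toBoolean
}
```

With this shape, renaming a key or changing its default touches exactly one file.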
[GitHub] spark pull request: [SPARK-2860][SQL] Fix coercion of CASE WHEN.
Github user concretevitamin commented on a diff in the pull request: https://github.com/apache/spark/pull/1785#discussion_r15828435 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala ---

@@ -336,28 +338,33 @@ trait HiveTypeCoercion {
   }
   /**
-   * Ensures that NullType gets casted to some other types under certain circumstances.
+   * Coerces the type of different branches of a CASE WHEN statement to a common type.
    */
-  object CastNulls extends Rule[LogicalPlan] {
+  object CaseWhenCoercion extends Rule[LogicalPlan] with TypeWidening {
     def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
-      case cw @ CaseWhen(branches) =>
+      case cw @ CaseWhen(branches) if !cw.resolved && !branches.exists(!_.resolved) =>
         val valueTypes = branches.sliding(2, 2).map {
-          case Seq(_, value) if value.resolved => Some(value.dataType)
-          case Seq(elseVal) if elseVal.resolved => Some(elseVal.dataType)
-          case _ => None
+          case Seq(_, value) => value.dataType
+          case Seq(elseVal) => elseVal.dataType
         }.toSeq
-        if (valueTypes.distinct.size == 2 && valueTypes.exists(_ == Some(NullType))) {
-          val otherType = valueTypes.filterNot(_ == Some(NullType))(0).get
+
+        logDebug(s"Input values for null casting ${valueTypes.mkString(",")}")

--- End diff -- Do we want to leave this in?
[GitHub] spark pull request: [SPARK-2860][SQL] Fix coercion of CASE WHEN.
Github user concretevitamin commented on a diff in the pull request: https://github.com/apache/spark/pull/1785#discussion_r15828596 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala ---

@@ -336,28 +338,33 @@ trait HiveTypeCoercion {
   }
   /**
-   * Ensures that NullType gets casted to some other types under certain circumstances.
+   * Coerces the type of different branches of a CASE WHEN statement to a common type.
    */
-  object CastNulls extends Rule[LogicalPlan] {
+  object CaseWhenCoercion extends Rule[LogicalPlan] with TypeWidening {
     def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
-      case cw @ CaseWhen(branches) =>
+      case cw @ CaseWhen(branches) if !cw.resolved && !branches.exists(!_.resolved) =>
         val valueTypes = branches.sliding(2, 2).map {
-          case Seq(_, value) if value.resolved => Some(value.dataType)
-          case Seq(elseVal) if elseVal.resolved => Some(elseVal.dataType)
-          case _ => None
+          case Seq(_, value) => value.dataType
+          case Seq(elseVal) => elseVal.dataType
         }.toSeq
-        if (valueTypes.distinct.size == 2 && valueTypes.exists(_ == Some(NullType))) {
-          val otherType = valueTypes.filterNot(_ == Some(NullType))(0).get
+
+        logDebug(s"Input values for null casting ${valueTypes.mkString(",")}")
+
+        if (valueTypes.distinct.size > 1) {
+          val commonType = valueTypes.reduce { (v1, v2) =>
+            findTightestCommonType(v1, v2)
+              .getOrElse(sys.error(s"Invalid types in CASE WHEN. $v1, $v2"))

--- End diff -- Perhaps "Value types in CASE WHEN must be the same, or can be coerced to a tightest upper bound." to be more precise about the semantics, in case users run into this.
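The reduce-with-tightest-common-type pattern from the diff above can be illustrated standalone. This is a hypothetical sketch, not Catalyst's actual `findTightestCommonType`: types are plain strings and the widening order is a simplified numeric precedence list.

```scala
// Illustrative sketch of coercing CASE WHEN branch value types to a common
// type via a numeric widening order: Byte < Short < Int < Long < Float < Double.
// Not Spark's actual HiveTypeCoercion; types are modeled as strings.
object CaseWhenCoercionSketch {
  val precedence = List("ByteType", "ShortType", "IntegerType", "LongType",
                        "FloatType", "DoubleType")

  def findTightestCommonType(t1: String, t2: String): Option[String] =
    if (t1 == t2) Some(t1)
    else if (precedence.contains(t1) && precedence.contains(t2))
      // The wider of the two types is the tightest upper bound.
      Some(precedence(math.max(precedence.indexOf(t1), precedence.indexOf(t2))))
    else None // e.g. StringType vs IntegerType: no upper bound in this sketch

  def coerceBranches(valueTypes: Seq[String]): String =
    valueTypes.reduce { (v1, v2) =>
      findTightestCommonType(v1, v2)
        .getOrElse(sys.error(s"Invalid types in CASE WHEN: $v1, $v2"))
    }
}
```

The reduce mirrors the diff: pairwise widening across all branch types, erroring out when two types have no common upper bound, which is exactly the situation the suggested error message describes.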
[GitHub] spark pull request: [SPARK-2860][SQL] Fix coercion of CASE WHEN.
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/1785#issuecomment-51232808 A few minor comments, otherwise LGTM!
[GitHub] spark pull request: Set Spark SQL Hive compatibility test shuffle ...
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/1784#issuecomment-51234268 Can we reset the original value in `afterAll()`? There's a test in `SQLConfSuite` that depends on that option, and in the future people might easily add tests that use the param as well.
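The save-and-restore discipline requested above can be sketched without a test framework. This is a hypothetical plain-Scala sketch (no ScalaTest, made-up key name and default) of the pattern a `beforeAll`/`afterAll` pair would implement.

```scala
// Sketch of the suggested fix: capture the original value of a shared config
// before mutating it, and restore it afterwards so later suites still see the
// default. Plain Scala stand-in for a beforeAll/afterAll pair; the key name
// and default value here are made up for illustration.
object SaveRestoreConf {
  private val settings =
    scala.collection.mutable.Map("shuffle.partitions" -> "200")

  def set(k: String, v: String): Unit = settings(k) = v
  def get(k: String): String = settings(k)

  /** Runs `body` with `key` temporarily set to `value`, then restores it. */
  def withConf[T](key: String, value: String)(body: => T): T = {
    val original = get(key) // what beforeAll would capture
    set(key, value)
    try body
    finally set(key, original) // what afterAll would do
  }
}
```

The `try`/`finally` guarantees the restore runs even when a test inside `body` throws, which is the property `afterAll` provides.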
[GitHub] spark pull request: [SQL] Tighten the visibility of various SQLCon...
Github user concretevitamin commented on a diff in the pull request: https://github.com/apache/spark/pull/1794#discussion_r15856595 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala ---

@@ -17,17 +17,17 @@ package org.apache.spark.sql
+import scala.collection.immutable
+import scala.collection.JavaConversions._

--- End diff -- Nit: these should go under the java imports; switch the order.
[GitHub] spark pull request: [SQL] Tighten the visibility of various SQLCon...
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/1794#issuecomment-51295283 Hey @rxin -- I think this is good to go.
[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...
Github user concretevitamin commented on a diff in the pull request: https://github.com/apache/spark/pull/1774#discussion_r15790429 --- Diff: docs/sql-programming-guide.md ---

@@ -152,6 +152,41 @@ val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
{% endhighlight %}
+Another way to turns an RDD to table is to use `applySchema`. Here is an example.

--- End diff -- to turn
[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...
Github user concretevitamin commented on a diff in the pull request: https://github.com/apache/spark/pull/1774#discussion_r15790441 --- Diff: docs/sql-programming-guide.md ---

@@ -152,6 +152,41 @@ val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
{% endhighlight %}
+Another way to turns an RDD to table is to use `applySchema`. Here is an example.
+{% highlight scala %}
+// sc is an existing SparkContext.
+val sqlContext = new org.apache.spark.sql.SQLContext(sc)
+
+// Create an RDD
+val people = sc.textFile("examples/src/main/resources/people.txt")
+
+// Import Spark SQL data types and Row.
+import org.apache.spark.sql._
+
+// Define the schema that will be applied to the RDD.
+val schema =
+  StructType(
+    StructField("name", StringType, true) ::
+    StructField("age", IntegerType, true) :: Nil)
+
+// Convert records of the RDD (people) to rows.

--- End diff -- to Rows?
[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...
Github user concretevitamin commented on a diff in the pull request: https://github.com/apache/spark/pull/1774#discussion_r15790528 --- Diff: docs/sql-programming-guide.md ---

@@ -225,6 +260,54 @@ List<String> teenagerNames = teenagers.map(new Function<Row, String>() {
{% endhighlight %}
+Another way to turns an RDD to table is to use `applySchema`. Here is an example.

--- End diff -- to turn; to a table
[GitHub] spark pull request: [SPARK-2179] [SQL] Public API for DataTypes an...
Github user concretevitamin commented on a diff in the pull request: https://github.com/apache/spark/pull/1774#discussion_r15790538 --- Diff: docs/sql-programming-guide.md --- @@ -259,6 +342,40 @@ for teenName in teenNames.collect(): print teenName {% endhighlight %} +Another way to turns an RDD to table is to use `applySchema`. Here is an example. --- End diff -- Same - maybe do a replaceAll
[GitHub] spark pull request: [SPARK-2783][SQL] Basic support for analyze in...
Github user concretevitamin commented on a diff in the pull request: https://github.com/apache/spark/pull/1741#discussion_r15733255 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala ---
@@ -21,12 +21,15 @@
 import java.io.{BufferedReader, File, InputStreamReader, PrintStream}
 import java.sql.Timestamp
 import java.util.{ArrayList => JArrayList}
+import org.apache.hadoop.hive.ql.stats.StatsSetupConst
--- End diff -- Alphabetize imports
[GitHub] spark pull request: [SPARK-2783][SQL] Basic support for analyze in...
Github user concretevitamin commented on a diff in the pull request: https://github.com/apache/spark/pull/1741#discussion_r15733260 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
@@ -280,7 +281,7 @@ private[hive] case class MetastoreRelation
 // of RPCs are involved. Besides `totalSize`, there are also `numFiles`, `numRows`,
 // `rawDataSize` keys that we can look at in the future.
 BigInt(
-  Option(hiveQlTable.getParameters.get("totalSize"))
+  Option(hiveQlTable.getParameters.get(StatsSetupConst.TOTAL_SIZE))
--- End diff -- Oh wow, this is a hard-to-find class!
[GitHub] spark pull request: [SPARK-2783][SQL] Basic support for analyze in...
Github user concretevitamin commented on a diff in the pull request: https://github.com/apache/spark/pull/1741#discussion_r15733265 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala --- @@ -280,7 +281,7 @@ private[hive] case class MetastoreRelation // of RPCs are involved. Besides `totalSize`, there are also `numFiles`, `numRows`, --- End diff -- Perhaps update the comments here to say other fields in `StatsSetupConst` might be useful.
[GitHub] spark pull request: [SPARK-2783][SQL] Basic support for analyze in...
Github user concretevitamin commented on a diff in the pull request: https://github.com/apache/spark/pull/1741#discussion_r15733274 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala ---
@@ -92,6 +95,64 @@ class HiveContext(sc: SparkContext) extends SQLContext(sc) {
   catalog.createTable("default", tableName, ScalaReflection.attributesFor[A], allowExisting)
 }

+  /**
+   * Analyzes the given table in the current database to generate statistics, which will be
+   * used in query optimizations.
+   *
+   * Right now, it only supports Hive tables and it only updates the size of a Hive table
+   * in the Hive metastore.
+   */
+  def analyze(tableName: String) {
+    val relation = catalog.lookupRelation(None, tableName) match {
+      case LowerCaseSchema(r) => r
+      case o => o
+    }
+
+    relation match {
+      case relation: MetastoreRelation => {
+        // This method is borrowed from
+        // org.apache.hadoop.hive.ql.stats.StatsUtils.getFileSizeForTable(HiveConf, Table)
+        // in Hive 0.13.
+        // TODO: Generalize statistics collection.
+        def getFileSizeForTable(conf: HiveConf, table: Table): Long = {
+          val path = table.getPath()
+          var size: Long = 0L
+          try {
+            val fs = path.getFileSystem(conf)
+            size = fs.getContentSummary(path).getLength()
+          } catch {
+            case e: Exception =>
+              logWarning(
+                s"Failed to get the size of table ${table.getTableName} in the " +
+                s"database ${table.getDbName} because of ${e.toString}", e)
+              size = 0L
+          }
+
+          size
+        }
+
+        val tableParameters = relation.hiveQlTable.getParameters
+        val oldTotalSize =
+          Option(tableParameters.get(StatsSetupConst.TOTAL_SIZE)).map(_.toLong).getOrElse(0L)
+        val newTotalSize = getFileSizeForTable(hiveconf, relation.hiveQlTable)
+        // Update the Hive metastore if the total size of the table is different than the size
+        // recorded in the Hive metastore.
+        // This logic is based on org.apache.hadoop.hive.ql.exec.StatsTask.aggregateStats().
+        if (newTotalSize > 0 && newTotalSize != oldTotalSize) {
+          tableParameters.put(StatsSetupConst.TOTAL_SIZE, newTotalSize.toString)
--- End diff -- Do we need to be concerned about concurrent accesses to `tableParameters`? More generally do we need to somehow synchronize on `MetastoreRelation#hiveQlTable` in various places?
[GitHub] spark pull request: [SPARK-2783][SQL] Basic support for analyze in...
Github user concretevitamin commented on a diff in the pull request: https://github.com/apache/spark/pull/1741#discussion_r15733277 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala ---
@@ -92,6 +95,64 @@ class HiveContext(sc: SparkContext) extends SQLContext(sc) {
   catalog.createTable("default", tableName, ScalaReflection.attributesFor[A], allowExisting)
 }

+  /**
+   * Analyzes the given table in the current database to generate statistics, which will be
+   * used in query optimizations.
+   *
+   * Right now, it only supports Hive tables and it only updates the size of a Hive table
+   * in the Hive metastore.
+   */
+  def analyze(tableName: String) {
+    val relation = catalog.lookupRelation(None, tableName) match {
+      case LowerCaseSchema(r) => r
+      case o => o
+    }
+
+    relation match {
+      case relation: MetastoreRelation => {
+        // This method is borrowed from
+        // org.apache.hadoop.hive.ql.stats.StatsUtils.getFileSizeForTable(HiveConf, Table)
+        // in Hive 0.13.
+        // TODO: Generalize statistics collection.
+        def getFileSizeForTable(conf: HiveConf, table: Table): Long = {
+          val path = table.getPath()
+          var size: Long = 0L
+          try {
+            val fs = path.getFileSystem(conf)
+            size = fs.getContentSummary(path).getLength()
+          } catch {
+            case e: Exception =>
+              logWarning(
+                s"Failed to get the size of table ${table.getTableName} in the " +
+                s"database ${table.getDbName} because of ${e.toString}", e)
+              size = 0L
+          }
+
+          size
+        }
+
+        val tableParameters = relation.hiveQlTable.getParameters
+        val oldTotalSize =
+          Option(tableParameters.get(StatsSetupConst.TOTAL_SIZE)).map(_.toLong).getOrElse(0L)
+        val newTotalSize = getFileSizeForTable(hiveconf, relation.hiveQlTable)
+        // Update the Hive metastore if the total size of the table is different than the size
+        // recorded in the Hive metastore.
+        // This logic is based on org.apache.hadoop.hive.ql.exec.StatsTask.aggregateStats().
+        if (newTotalSize > 0 && newTotalSize != oldTotalSize) {
+          tableParameters.put(StatsSetupConst.TOTAL_SIZE, newTotalSize.toString)
+          val hiveTTable = relation.hiveQlTable.getTTable
+          hiveTTable.setParameters(tableParameters)
+          val tableFullName =
+            relation.hiveQlTable.getDbName() + "." + relation.hiveQlTable.getTableName()
+
+          catalog.client.alterTable(tableFullName, new Table(hiveTTable))
+        }
+      }
+      case otherRelation =>
+        throw new NotImplementedError(s"Analyzing a ${otherRelation} has not been implemented")
--- End diff -- We probably don't want the result of a general `.toString`. Perhaps just say "Analyzing relations other than MetastoreRelations has not been implemented" instead.
[GitHub] spark pull request: [SPARK-2675] Increase EVENT_QUEUE_CAPACITY by ...
Github user concretevitamin closed the pull request at: https://github.com/apache/spark/pull/1579
[GitHub] spark pull request: [SPARK-2316] Avoid O(blocks) operations in lis...
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/1679#issuecomment-50946639 How many listeners are used in these benchmarks?
[GitHub] spark pull request: [SPARK-2531 SPARK-2436] [SQL] Optimize the B...
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/1448#issuecomment-50807987 I have rebased and made changes according to the previous review comments. Also updated the title and description of the PR, combining two JIRA tickets. @chenghao-intel The following tests cover different join types. I ran it and the tests passed for both `BuildLeft` and `BuildRight` being hardcoded into the planner.
```
SPARK_HIVE=true sbt/sbt -Dspark.hive.whitelist=mapjoin_test_outer,semijoin,smb_mapjoin_2,smb_mapjoin_3,union22 "test-only org.apache.spark.sql.hive.execution.HiveCompatibilitySuite"
```
[GitHub] spark pull request: [SPARK-2316] Avoid O(blocks) operations in lis...
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/1679#issuecomment-50827351 This patch is much appreciated -- thanks for working on this! On Thu, Jul 31, 2014 at 3:02 PM, Apache Spark QA notificati...@github.com wrote: QA results for PR 1679: - This patch PASSES unit tests. - This patch merges cleanly - This patch adds the following public classes (experimental): class StorageStatus(val blockManagerId: BlockManagerId, val maxMem: Long) { For more information see test ouptut: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17607/consoleFull Reply to this email directly or view it on GitHub https://github.com/apache/spark/pull/1679#issuecomment-50824930.
[GitHub] spark pull request: [SPARK-2531] [SQL] Make BroadcastNestedLoopJoi...
Github user concretevitamin commented on a diff in the pull request: https://github.com/apache/spark/pull/1448#discussion_r15599752 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins.scala ---
@@ -332,71 +342,88 @@ case class BroadcastNestedLoopJoin(
     }
   }

-  /** The Streamed Relation */
-  def left = streamed
-  /** The Broadcast relation */
-  def right = broadcast
-
   @transient lazy val boundCondition =
     InterpretedPredicate(
       condition
         .map(c => BindReferences.bindReference(c, left.output ++ right.output))
         .getOrElse(Literal(true)))

+  // TODO: low-hanging fruits for performance? (Reduce branching / pattern matches?)
   def execute() = {
     val broadcastedRelation =
       sqlContext.sparkContext.broadcast(broadcast.execute().map(_.copy()).collect().toIndexedSeq)

-    val streamedPlusMatches = streamed.execute().mapPartitions { streamedIter =>
+    /** All rows that either match both-way, or rows from streamed joined with nulls. */
+    val matchesOrStreamedRowsWithNulls = streamed.execute().mapPartitions { streamedIter =>
       val matchedRows = new ArrayBuffer[Row]
       // TODO: Use Spark's BitSet.
-      val includedBroadcastTuples = new BitSet(broadcastedRelation.value.size)
+      val includedBroadcastTuples =
+        new scala.collection.mutable.BitSet(broadcastedRelation.value.size)
       val joinedRow = new JoinedRow

       streamedIter.foreach { streamedRow =>
         var i = 0
-        var matched = false
+        var streamRowMatched = false

         while (i < broadcastedRelation.value.size) {
           // TODO: One bitset per partition instead of per row.
           val broadcastedRow = broadcastedRelation.value(i)
-          if (boundCondition(joinedRow(streamedRow, broadcastedRow))) {
-            matchedRows += buildRow(streamedRow ++ broadcastedRow)
-            matched = true
+          val jr = buildSide match {
+            case BuildRight => joinedRow(streamedRow, broadcastedRow)
+            case BuildLeft => joinedRow(broadcastedRow, streamedRow)
+          }
+          if (boundCondition(jr)) {
+            // Putting this branching inside this conditional: assume ++ has a
+            // much higher cost than another branch pattern matching.
+            val br = buildSide match {
+              case BuildRight => streamedRow ++ broadcastedRow
+              case BuildLeft => broadcastedRow ++ streamedRow
+            }
+            matchedRows += buildRow(br)
+            streamRowMatched = true
             includedBroadcastTuples += i
           }
           i += 1
         }

-        if (!matched && (joinType == LeftOuter || joinType == FullOuter)) {
-          matchedRows += buildRow(streamedRow ++ Array.fill(right.output.size)(null))
+        (streamRowMatched, joinType, buildSide) match {
+          case (false, LeftOuter | FullOuter, BuildRight) =>
+            matchedRows += buildRow(streamedRow ++ Array.fill(right.output.size)(null))
+          case (false, RightOuter | FullOuter, BuildLeft) =>
+            matchedRows += buildRow(Array.fill(left.output.size)(null) ++ streamedRow)
+          case _ =>
         }
       }
       Iterator((matchedRows, includedBroadcastTuples))
     }

-    val includedBroadcastTuples = streamedPlusMatches.map(_._2)
+    val includedBroadcastTuples = matchesOrStreamedRowsWithNulls.map(_._2)
     val allIncludedBroadcastTuples =
       if (includedBroadcastTuples.count == 0) {
         new scala.collection.mutable.BitSet(broadcastedRelation.value.size)
       } else {
-        streamedPlusMatches.map(_._2).reduce(_ ++ _)
+        includedBroadcastTuples.reduce(_ ++ _)
       }

-    val rightOuterMatches: Seq[Row] =
-      if (joinType == RightOuter || joinType == FullOuter) {
-        broadcastedRelation.value.zipWithIndex.filter {
-          case (row, i) => !allIncludedBroadcastTuples.contains(i)
-        }.map {
-          // TODO: Use projection.
-          case (row, _) => buildRow(Vector.fill(left.output.size)(null) ++ row)
+    /** Rows from broadcasted joined with nulls. */
+    val broadcastRowsWithNulls: Seq[Row] =
+      broadcastedRelation.value.zipWithIndex.filter { case (row, i) =>
+        // keep all rows in the broadcast relation that are not matched before
+        !allIncludedBroadcastTuples.contains(i)
+      }.map { case (row, _) =>
+        // TODO: Use projection.
+        val br = (joinType, buildSide) match {
+          case (RightOuter | FullOuter, BuildRight) => Vector.fill(left.output.size)(null) ++ row
--- End diff -- Hey
[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/1238#issuecomment-50522265 Rebased and addressed review comments.
[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/1238#issuecomment-50522286 Jenkins test this please
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/1346#issuecomment-50423690 @yhuai @marmbrus I am not sure if this has been discussed before, but what do you guys think about adding a version of `applySchema(RDD[Array[String]], StructType)`? The use case I have in mind is TPC-DS data preparation. Currently I have a bunch of text files, from which I can easily create an `RDD[String]`; by splitting each line on some separator I get an `RDD[Array[String]]`. Now, in TPC-DS the tables easily have 15+ columns, and I don't want to manually create a `Row` for each `Array[String]`.
[GitHub] spark pull request: [SPARK-2179][SQL] Public API for DataTypes and...
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/1346#issuecomment-50423851 To add to this: for my own purpose, I can certainly hack something together based off this branch in a custom Spark build, but just want to throw this thought out there as I think it does have some generality (large number of columns, avoid writing `.map(p = Row(p(0), p(1), ..., p(LARGE_NUM)))`).
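The generic conversion being requested here can be sketched in plain Scala. This is a hedged illustration, not the proposed API: the `Row` case class and its `fromSeq` helper below are stand-ins defined locally so the sketch runs without Spark (later Spark versions do expose a `Row.fromSeq` with this shape), and the sample lines stand in for an `RDD[String]`.

```scala
// Stand-in for Spark SQL's Row, so the sketch is self-contained.
case class Row(values: Seq[Any])
object Row {
  // Build a Row from any Seq, avoiding Row(p(0), p(1), ..., p(LARGE_NUM)).
  def fromSeq(values: Seq[Any]): Row = Row(values)
}

// Stand-in for an RDD[String] of '|'-separated TPC-DS-style lines.
val lines = Seq("Michael|29", "Andy|30")

val rows = lines
  .map(_.split('|').toSeq) // the RDD[Array[String]] step from the comment
  .map(Row.fromSeq)        // one generic call, no per-column boilerplate

assert(rows.head == Row(Seq("Michael", "29")))
```

The same two `map` calls would apply unchanged to a real `RDD[String]`, which is why a built-in `applySchema(RDD[Array[String]], StructType)` overload would remove the remaining boilerplate entirely.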
[GitHub] spark pull request: [SPARK-2410][SQL] Merging Hive Thrift/JDBC ser...
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/1600#issuecomment-50224519 Jenkins, retest this please.
[GitHub] spark pull request: [SPARK-2674] [SQL] [PySpark] support datetime ...
Github user concretevitamin commented on a diff in the pull request: https://github.com/apache/spark/pull/1601#discussion_r15434945 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala ---
@@ -395,6 +395,11 @@ class SchemaRDD(
 arr.asInstanceOf[Array[Any]].map { element =>
   rowToMap(element.asInstanceOf[Row], struct)
 }
+case t: java.sql.Timestamp => {
--- End diff -- This block can be removed.
[GitHub] spark pull request: [SPARK-2675] Increase EVENT_QUEUE_CAPACITY by ...
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/1579#issuecomment-50183155 Unfortunately I may not be able to find time to run an experiment soon. If anyone is interested, I think `org.apache.spark.util.SizeEstimator` is reasonable to try. Let's also see if [SPARK-2316](https://issues.apache.org/jira/browse/SPARK-2316) gets fixed in the near future, which may remove the need to increase the queue capacity.
[GitHub] spark pull request: [SPARK-2675] Increase EVENT_QUEUE_CAPACITY by ...
GitHub user concretevitamin opened a pull request: https://github.com/apache/spark/pull/1579 [SPARK-2675] Increase EVENT_QUEUE_CAPACITY by 20x. JIRA ticket: https://issues.apache.org/jira/browse/SPARK-2675 @pwendell @andrewor You can merge this pull request into a Git repository by running: $ git pull https://github.com/concretevitamin/spark increase-event-queue Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1579.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1579 commit 8ef1d0785221c2621aca2667c8acf0241c4394ef Author: Zongheng Yang zonghen...@gmail.com Date: 2014-07-24T19:19:25Z Increase EVENT_QUEUE_CAPACITY by 20x.
[GitHub] spark pull request: [SPARK-2675] Increase EVENT_QUEUE_CAPACITY by ...
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/1579#issuecomment-50065677 Jenkins, test this please.
[GitHub] spark pull request: [SPARK-2531] [SQL] Make BroadcastNestedLoopJoi...
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/1448#issuecomment-49911126 Thanks for the comments @chenghao-intel and @marmbrus. As Michael said I'll revisit this after the codegen PR.
[GitHub] spark pull request: [WIP][SPARK-2054][SQL] Code Generation for Exp...
Github user concretevitamin commented on a diff in the pull request: https://github.com/apache/spark/pull/993#discussion_r15311508 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateProjection.scala ---
@@ -0,0 +1,218 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.codegen
+
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.types._
+
+
+/**
+ * Generates bytecode that produces a new [[Row]] object based on a fixed set of input
+ * [[Expression Expressions]] and a given input [[Row]]. The returned [[Row]] object is custom
+ * generated based on the output types of the [[Expression]] to avoid boxing of primitive values.
+ */
+object GenerateProjection extends CodeGenerator[Seq[Expression], Projection] {
+  import scala.reflect.runtime.{universe => ru}
+  import scala.reflect.runtime.universe._
+
+  protected def canonicalize(in: Seq[Expression]): Seq[Expression] =
+    in.map(ExpressionCanonicalizer(_))
+
+  protected def bind(in: Seq[Expression], inputSchema: Seq[Attribute]): Seq[Expression] =
+    in.map(BindReferences.bindReference(_, inputSchema))
+
+  // Make Mutablility optional...
+  protected def create(expressions: Seq[Expression]): Projection = {
+    val tupleLength = ru.Literal(Constant(expressions.length))
+    val lengthDef = q"final val length = $tupleLength"
+
+    /* TODO: Configurable...
+    val nullFunctions =
+      q"""
+        private final val nullSet = new org.apache.spark.util.collection.BitSet(length)
+        final def setNullAt(i: Int) = nullSet.set(i)
--- End diff -- Could inlining help these defs?
[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...
Github user concretevitamin commented on a diff in the pull request: https://github.com/apache/spark/pull/1238#discussion_r15243150 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala ---
@@ -66,8 +66,8 @@ abstract class SparkPlan extends QueryPlan[SparkPlan] with Logging {
  * linking.
  */
 @DeveloperApi
-case class SparkLogicalPlan(alreadyPlanned: SparkPlan, tableName: String = "SparkLogicalPlan")
-  extends BaseRelation with MultiInstanceRelation {
+case class SparkLogicalPlan(alreadyPlanned: SparkPlan)(@transient sqlContext: SQLContext)
--- End diff -- That will be pretty awesome and reduce much boilerplate code.
[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...
Github user concretevitamin commented on a diff in the pull request: https://github.com/apache/spark/pull/1238#discussion_r15243796 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala ---
@@ -26,6 +26,26 @@ import org.apache.spark.sql.catalyst.trees
 abstract class LogicalPlan extends QueryPlan[LogicalPlan] {
   self: Product =>
+  // TODO: handle overflow?
--- End diff -- BigInt somehow completely eluded me before. Let me try to make the changes now.
[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/1238#issuecomment-49785200 I have addressed the latest round of review comments and rebased onto latest master.
[GitHub] spark pull request: Fix flakey HiveQuerySuite test
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/1514#issuecomment-49667765 Thanks for the fix. Looks good to me.
[GitHub] spark pull request: [SPARK-2561][SQL] Fix apply schema
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/1470#issuecomment-49686947 LGTM.
[GitHub] spark pull request: [SPARK-2523] [SQL] [WIP] Hadoop table scan bug...
Github user concretevitamin commented on a diff in the pull request: https://github.com/apache/spark/pull/1439#discussion_r15013611 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala ---
@@ -241,4 +252,37 @@ private[hive] object HadoopTableReader {
   val bufferSize = System.getProperty("spark.buffer.size", "65536")
   jobConf.set("io.file.buffer.size", bufferSize)
 }
+
+  /**
+   * Transform the raw data(Writable object) into the Row object for an iterable input
+   * @param iter Iterable input which represented as Writable object
+   * @param deserializer Deserializer associated with the input writable object
+   * @param attrs Represents the row attribute names and its zero-based position in the MutableRow
+   * @param row reusable MutableRow object
+   *
+   * @return Iterable Row object that transformed from the given iterable input.
+   */
+  def fillObject(iter: Iterator[Writable], deserializer: Deserializer,
+      attrs: Seq[(Attribute, Int)], row: GenericMutableRow): Iterator[Row] = {
+    val soi = deserializer.getObjectInspector().asInstanceOf[StructObjectInspector]
+    // get the field references according to the attributes(output of the reader) required
+    val fieldRefs = attrs.map { case (attr, idx) => (soi.getStructFieldRef(attr.name), idx) }
+
+    // Map each tuple to a row object
+    iter.map { value =>
+      val raw = deserializer.deserialize(value)
+      var idx = 0;
+      while(idx < fieldRefs.length) {
--- End diff -- nit: space after while
[GitHub] spark pull request: [SPARK-2190][SQL] Specialized ColumnType for T...
Github user concretevitamin commented on a diff in the pull request: https://github.com/apache/spark/pull/1440#discussion_r15018072 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnStats.scala --- @@ -344,21 +344,52 @@ private[sql] class StringColumnStats extends BasicColumnStats(STRING) { } override def contains(row: Row, ordinal: Int) = { -!(upperBound eq null) { +(upperBound ne null) { --- End diff -- I think I read somewhere that `ne` has better performance in certain cases...
[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...
Github user concretevitamin commented on a diff in the pull request: https://github.com/apache/spark/pull/1238#discussion_r15019589 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala --- @@ -26,6 +26,28 @@ import org.apache.spark.sql.catalyst.trees abstract class LogicalPlan extends QueryPlan[LogicalPlan] { self: Product => + // TODO: handle overflow? + /** + * Estimates of various statistics. The default estimation logic simply sums up the corresponding + * statistic produced by the children. To override this behavior, override `statistics` and + * assign it an overridden version of `Statistics`. + */ + case class Statistics( +/** + * Number of output tuples. For leaf operators this defaults to 1, otherwise it is set to the + * product of children's `numTuples`. + */ +numTuples: Long = childrenStats.map(_.numTuples).product, + +/** + * Physical size in bytes. For leaf operators this defaults to 1, otherwise it is set to the + * product of children's `sizeInBytes`. + */ +sizeInBytes: Long = childrenStats.map(_.sizeInBytes).product + ) + lazy val statistics: Statistics = new Statistics + lazy val childrenStats = children.map(_.statistics) --- End diff -- I think we need a `lazy` marker for this; otherwise, during the initialization of `childrenStats`, `statistics` in all children will get evaluated too.
[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/1238#issuecomment-49210466 Jenkins, test this please. I think I have addressed the latest round of review comments; the biggest changes are: - Removed statistics estimates from ParquetRelation for now due to performance considerations. - When estimating sizes for MetastoreRelation, instead of making potentially expensive Hadoop FileSystem calls, we peek into the Hive Metastore for the populated information.
[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/1238#issuecomment-49219047 Jenkins, test this please.
[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...
Github user concretevitamin commented on a diff in the pull request: https://github.com/apache/spark/pull/1238#discussion_r15024063 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala --- @@ -26,6 +26,28 @@ import org.apache.spark.sql.catalyst.trees abstract class LogicalPlan extends QueryPlan[LogicalPlan] { self: Product => + // TODO: handle overflow? + /** + * Estimates of various statistics. The default estimation logic simply sums up the corresponding + * statistic produced by the children. To override this behavior, override `statistics` and + * assign it an overridden version of `Statistics`. + */ + case class Statistics( +/** + * Number of output tuples. For leaf operators this defaults to 1, otherwise it is set to the + * product of children's `numTuples`. + */ +numTuples: Long = childrenStats.map(_.numTuples).product, + +/** + * Physical size in bytes. For leaf operators this defaults to 1, otherwise it is set to the + * product of children's `sizeInBytes`. + */ +sizeInBytes: Long = childrenStats.map(_.sizeInBytes).product + ) + lazy val statistics: Statistics = new Statistics + lazy val childrenStats = children.map(_.statistics) --- End diff -- Thanks for the note -- I wasn't fully aware of the memory/time overhead of lazy vals. Since we have one statistic at the moment, `childrenStats` is used only once per logical node, so I went ahead and removed it. However, memoizing `statistics` itself probably saves some planning time, as it gets called multiple times during planning (e.g. different cases in the HashJoin pattern match). If we were to use a `def`, it would be re-evaluated a quadratic number of times. Using a plain `val` would fire off the computation of the stats for all logical nodes right away; I think this is less desirable than evaluating it only for everything under a Join when needed by a Strategy. Also, we'd have very few operators in a typical query anyway.
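The `lazy val` versus `def` versus `val` trade-off discussed in this thread can be sketched in a few lines. This is an illustrative toy (the class and field names are stand-ins, not the actual Catalyst `LogicalPlan`/`Statistics` types):

```scala
// Illustrative sketch only -- not the actual Catalyst classes.
abstract class Plan {
  def children: Seq[Plan]
  var evals = 0 // counts how many times the statistics body actually ran
  protected def computeStats: Long = {
    evals += 1
    children.map(_.statistics).product max 1L
  }
  // `lazy val`: memoized -- computed at most once, and only on first access.
  lazy val statistics: Long = computeStats
}
class Leaf extends Plan { def children = Nil }
class Node(val children: Seq[Plan]) extends Plan

object LazyStatsDemo extends App {
  val root = new Node(Seq(new Leaf, new Leaf))
  root.statistics
  root.statistics // second access returns the cached value
  assert(root.evals == 1) // with `def statistics`, this would be 2
  println(s"statistics body ran ${root.evals} time(s)")
}
```

With `def statistics` every caller recomputes the subtree's stats; with a plain `val` every node computes eagerly at construction; `lazy val` defers until a Strategy first asks, then caches.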
[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/1238#issuecomment-49222667 Jenkins, test this please
[GitHub] spark pull request: [SPARK-2531] [SQL] Make BroadcastNestedLoopJoi...
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/1448#issuecomment-49227975 Jenkins, this is okay to test.
[GitHub] spark pull request: [SQL] Add HiveDecimal HiveVarchar support in...
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/1436#issuecomment-49228593 Hey @chenghao-intel -- can you create a JIRA ticket for this?
[GitHub] spark pull request: [SPARK-2531] [SQL] Make BroadcastNestedLoopJoi...
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/1448#issuecomment-49239568 Jenkins, test this please...
[GitHub] spark pull request: [SQL] Synchronize on a lock when using scala r...
GitHub user concretevitamin opened a pull request: https://github.com/apache/spark/pull/1423 [SQL] Synchronize on a lock when using scala reflection inside data type objects. You can merge this pull request into a Git repository by running: $ git pull https://github.com/concretevitamin/spark scala-ref-catalyst Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1423.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1423 commit 325a14944cf5ef26b66f770f92ad225c4b3610a6 Author: Zongheng Yang zonghen...@gmail.com Date: 2014-07-15T23:11:51Z Synchronize on a lock when initializing data type objects in Catalyst.
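The motivation for this PR is that Scala 2.10 runtime reflection is not thread-safe, so concurrent reflective initialization can corrupt state. A minimal sketch of the locking pattern, with illustrative names (not the PR's actual code):

```scala
import scala.reflect.runtime.universe._

// Illustrative pattern (not the actual Catalyst code): Scala 2.10 runtime
// reflection is not thread-safe, so funnel all reflective accesses through
// one global lock object.
object ReflectionDemo extends App {
  object ScalaReflectionLock

  def typeNameOf[T: TypeTag]: String = ScalaReflectionLock.synchronized {
    typeOf[T].toString // reflective call performed only while holding the lock
  }

  println(typeNameOf[Seq[Int]]) // safe to call from multiple threads
}
```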
[GitHub] spark pull request: [SQL] Synchronize on a lock when using scala r...
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/1423#issuecomment-49105217 Jenkins, add to whitelist please.
[GitHub] spark pull request: [SQL] Synchronize on a lock when using scala r...
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/1423#issuecomment-49105361 Jenkins, test this please.
[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...
Github user concretevitamin closed the pull request at: https://github.com/apache/spark/pull/1390
[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/1390#issuecomment-48935494 @yhuai suggested a much simpler fix -- I benchmarked this and it gave the same performance boost. I am closing this and opening a new PR.
[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...
GitHub user concretevitamin opened a pull request: https://github.com/apache/spark/pull/1408 [SPARK-2443][SQL] Fix slow read from partitioned tables This fix obtains a performance boost comparable to [PR #1390](https://github.com/apache/spark/pull/1390) by moving an array update and deserializer initialization out of a potentially very long loop. Suggested by @yhuai. The results below are updated for this fix.

## Benchmarks

Generated a local text file with 10M rows of simple key-value pairs. The data is loaded as a table through Hive. Results are obtained on my local machine using hive/console.

Without the fix:

Type | Non-partitioned | Partitioned (1 part)
---- | --------------- | --------------------
First run | 9.52s end-to-end (1.64s Spark job) | 36.6s (28.3s)
Stabilized runs | 1.21s (1.18s) | 27.6s (27.5s)

With this fix:

Type | Non-partitioned | Partitioned (1 part)
---- | --------------- | --------------------
First run | 9.57s (1.46s) | 11.0s (1.69s)
Stabilized runs | 1.13s (1.10s) | 1.23s (1.19s)

You can merge this pull request into a Git repository by running: $ git pull https://github.com/concretevitamin/spark slow-read-2 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1408.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1408 commit d86e437218f99179934ccd9b4d5d89c02b09459d Author: Zongheng Yang zonghen...@gmail.com Date: 2014-07-14T18:03:07Z Move update initialization out of potentially long loop.
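The fix described above is classic loop-invariant hoisting: per-partition setup had been re-running once per row. A generic, hypothetical illustration of the idea (not the actual TableReader code):

```scala
// Hypothetical stand-in for expensive per-partition setup
// (e.g. constructing and initializing a deserializer).
object HoistDemo extends App {
  def expensiveSetup(): Int => Int = { Thread.sleep(1); x => x * 2 }

  // Slow: the setup re-runs for every element of the (potentially long) loop.
  def slow(xs: Seq[Int]): Seq[Int] = xs.map(x => expensiveSetup()(x))

  // Fast: the setup is hoisted out of the loop, as in this PR.
  def fast(xs: Seq[Int]): Seq[Int] = { val f = expensiveSetup(); xs.map(f) }

  assert(slow(1 to 3) == fast(1 to 3)) // same results, far fewer setups
}
```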
[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/1390#issuecomment-48936743 New PR here: #1408
[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/1408#issuecomment-48936856 Jenkins, test this please.
[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...
Github user concretevitamin commented on a diff in the pull request: https://github.com/apache/spark/pull/1390#discussion_r14894946 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala --- @@ -157,21 +161,60 @@ class HadoopTableReader(@transient _tableDesc: TableDesc, @transient sc: HiveCon // Create local references so that the outer object isn't serialized. val tableDesc = _tableDesc + val tableSerDeClass = tableDesc.getDeserializerClass + val broadcastedHiveConf = _broadcastedHiveConf val localDeserializer = partDeserializer val hivePartitionRDD = createHadoopRdd(tableDesc, inputPathStr, ifc) - hivePartitionRDD.mapPartitions { iter => + hivePartitionRDD.mapPartitions { case iter => --- End diff -- I initially thought that in a function context, `{ case x => ... }` would be optimized to `{ x => ... }`. I did a `scalac -print` on a simple program to confirm that this is not the case.
[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/1238#issuecomment-48941276 Jenkins, test this please.
[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/1408#issuecomment-48954674 I think we should ask the users who reported the performance issue whether this fix solves their problems. Otherwise the comments in the previous PR seem to apply only to that implementation.
[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...
Github user concretevitamin commented on a diff in the pull request: https://github.com/apache/spark/pull/1390#discussion_r14902569 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala --- @@ -157,21 +161,60 @@ class HadoopTableReader(@transient _tableDesc: TableDesc, @transient sc: HiveCon // Create local references so that the outer object isn't serialized. val tableDesc = _tableDesc + val tableSerDeClass = tableDesc.getDeserializerClass + val broadcastedHiveConf = _broadcastedHiveConf val localDeserializer = partDeserializer val hivePartitionRDD = createHadoopRdd(tableDesc, inputPathStr, ifc) - hivePartitionRDD.mapPartitions { iter => + hivePartitionRDD.mapPartitions { case iter => --- End diff -- https://gist.github.com/concretevitamin/272fe413dcc06b8cbe9c It seems the with-`case` version does emit more instructions.
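The distinction being measured above: `{ case x => ... }` desugars to a pattern match on the argument, while `{ x => ... }` applies the body directly. A minimal pair one can compare with `scalac -print`:

```scala
object CaseLambdaDemo extends App {
  // Plain lambda: the body is applied directly.
  val plain: Int => Int = x => x + 1

  // `case` lambda: desugars to a match on the argument, emitting extra
  // instructions even though the pattern here is irrefutable.
  val withCase: Int => Int = { case x => x + 1 }

  assert(plain(1) == withCase(1)) // same result, different bytecode
}
```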
[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...
Github user concretevitamin commented on a diff in the pull request: https://github.com/apache/spark/pull/1238#discussion_r14902843 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala --- @@ -263,6 +268,19 @@ private[hive] case class MetastoreRelation new Partition(hiveQlTable, p) } + // TODO: are there any stats in hiveQlTable.getSkewedInfo that we can use? + @transient override lazy val statistics = new Statistics { --- End diff -- We could certainly move it out of the Statistics class to reside in `MetastoreRelation`, but does `new Statistics` incur extra overhead? Otherwise I think the copy constructor may not be good style here, as we have a long comment and a relatively long implementation for the field `sizeInBytes`.
[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...
Github user concretevitamin commented on a diff in the pull request: https://github.com/apache/spark/pull/1238#discussion_r14904711 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala --- @@ -47,6 +47,13 @@ private[sql] abstract class SparkStrategies extends QueryPlanner[SparkPlan] { /** * Uses the ExtractEquiJoinKeys pattern to find joins where at least some of the predicates can be * evaluated by matching hash keys. + * + * This strategy applies a simple optimization based on the estimates of the physical sizes of + * the two join sides. When planning a [[execution.BroadcastHashJoin]], if one side has an + * estimated physical size smaller than the user-settable threshold + * `spark.sql.auto.convert.join.size`, the planner would mark it as the ''build'' relation and --- End diff -- Done.
[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...
Github user concretevitamin commented on a diff in the pull request: https://github.com/apache/spark/pull/1238#discussion_r14906952 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala --- @@ -92,6 +114,8 @@ abstract class LogicalPlan extends QueryPlan[LogicalPlan] { abstract class LeafNode extends LogicalPlan with trees.LeafNode[LogicalPlan] { self: Product => + override lazy val statistics = Statistics(numTuples = 1L, sizeInBytes = 1L) --- End diff -- Setting large default values here (1) doesn't seem to make intuitive sense, and (2) can cause superfluous overflows when things get multiplied. This [commit](https://github.com/concretevitamin/spark/commit/f41ecf22d516bfeabc4168f2a1e6518169d85c29) introduced this change, which makes 1L a special default value that gets preserved even when a bunch of operators lack meaningful estimates -- because the default estimates will just multiply a bunch of 1's together, producing the special value 1L. With this preservation, we can test for the absence of meaningful estimates and correspondingly do special handling, as shown in that commit. (One example in that commit: if RDDs don't have meaningful estimates, the field defaults to [SQLConf#statsDefaultSizeInBytes](https://github.com/apache/spark/pull/1238/files#diff-41ef65b9ef5b518f77e2a03559893f4dR61), which is set to be greater than the auto join conversion threshold.)
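The property the comment above relies on is that 1L is a fixed point of multiplication: if every child carries only the default, the combined estimate stays 1L and can be recognized as "no estimate". A simplified illustration (names are illustrative, not the Catalyst API):

```scala
object DefaultStatsDemo extends App {
  // Combining children's size estimates by product: if every child carries
  // the default 1L, the result stays 1L, which callers can treat as
  // "no meaningful estimate available".
  def combined(childSizes: Seq[Long]): Long = childSizes.product

  assert(combined(Seq(1L, 1L, 1L)) == 1L) // the special default is preserved
  assert(combined(Seq(1L, 4096L)) != 1L)  // any real estimate breaks it
}
```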
[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...
Github user concretevitamin commented on a diff in the pull request: https://github.com/apache/spark/pull/1238#discussion_r14857017 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala --- @@ -21,16 +21,27 @@ import java.util.Properties import scala.collection.JavaConverters._ +private object SQLConf { + @transient protected[spark] val confSettings = java.util.Collections.synchronizedMap( --- End diff -- In one of the earlier commits, I tried the approach of making an operator mix in SQLConf -- hence the need for a singleton object to hold the settings. Now I just have that operator take a SQLContext in order to get access to the conf, so if you want I could remove this singleton.
[GitHub] spark pull request: [SPARK-2443][SQL] Fix slow read from partition...
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/1390#issuecomment-48830080 Jenkins, retest this please.
[GitHub] spark pull request: [SPARK-2393][SQL] Prototype implementation of ...
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/1238#issuecomment-48570761 I have addressed most of the review comments and rebased. Please review again and take a look at the TODOs. It might be better to leave the optimization for BroadcastNestedLoopJoin for a future PR (although I have some WIP locally), as it requires a couple of changes to the class (mainly to support a BuildSide). This is the reason the tests were failing before.
[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/1238#issuecomment-48630836 @rxin sure, done.
[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...
Github user concretevitamin commented on a diff in the pull request: https://github.com/apache/spark/pull/1359#discussion_r14781579 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala --- @@ -19,8 +19,11 @@ package org.apache.spark.sql.catalyst.expressions import java.util.regex.Pattern +import scala.collection.IndexedSeqOptimized + import org.apache.spark.sql.catalyst.types.DataType import org.apache.spark.sql.catalyst.types.StringType +import org.apache.spark.sql.catalyst.types.BinaryType --- End diff -- nit: alphabetize the imports, or combine them
[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...
Github user concretevitamin commented on a diff in the pull request: https://github.com/apache/spark/pull/1359#discussion_r14781690 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala --- @@ -207,3 +210,64 @@ case class StartsWith(left: Expression, right: Expression) extends StringCompari case class EndsWith(left: Expression, right: Expression) extends StringComparison { def compare(l: String, r: String) = l.endsWith(r) } + +/** + * A function that takes a substring of its first argument starting at a given position. + * Defined for String and Binary types. + */ +case class Substring(str: Expression, pos: Expression, len: Expression) extends Expression { + + type EvaluatedType = Any + + def nullable: Boolean = true + def dataType: DataType = { +if (str.dataType == BinaryType) str.dataType else StringType + } + + def references = children.flatMap(_.references).toSet + + override def children = str :: pos :: len :: Nil + + def slice[T, C <% IndexedSeqOptimized[T,_]](str: C, startPos: Int, sliceLen: Int): Any = { +val len = str.length +// Hive and SQL use one-based indexing for SUBSTR arguments but also accept zero and +// negative indices for start positions. If a start index i is greater than 0, it +// refers to element i-1 in the sequence. If a start index i is less than 0, it refers +// to the -ith element before the end of the sequence. If a start index i is 0, it +// refers to the first element.
+ +val start = startPos match { + case pos if pos > 0 => pos - 1 + case neg if neg < 0 => len + neg + case _ => 0 +} + +val end = sliceLen match { + case max if max == Integer.MAX_VALUE => max + case x => start + x +} + +str.slice(start, end) + } + + override def eval(input: Row): Any = { +val string = str.eval(input) + +val po = pos.eval(input) +val ln = len.eval(input) + +if ((string == null) || (po == null) || (ln == null)) { + null +} else { + val start = po.asInstanceOf[Int] + val length = ln.asInstanceOf[Int] + + string match { +case ba: Array[Byte] => slice(ba, start, length) +case other => slice(other.toString, start, length) + } +} + } + + override def toString = s"SUBSTR($str, $pos, $len)" --- End diff -- Minor: what do you think if we display only two args for the 2-arg case instead of displaying an Integer.MAX_VALUE?
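The index conversion under review can be exercised in isolation. This is a self-contained sketch simplified to `String` (hypothetical helper, not the PR's exact code):

```scala
object SubstrDemo extends App {
  // SUBSTR semantics: 1-based start, 0 treated as 1, negative counts from the end.
  def substr(s: String, startPos: Int, len: Int): String = {
    val start = startPos match {
      case p if p > 0 => p - 1        // 1-based -> 0-based
      case n if n < 0 => s.length + n // offset from the end of the string
      case _          => 0            // 0 behaves like 1
    }
    s.slice(start, if (len == Int.MaxValue) len else start + len)
  }

  assert(substr("hello", 2, 3) == "ell")
  assert(substr("hello", -3, 2) == "ll")
  assert(substr("hello", 0, 2) == "he")
}
```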
[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...
Github user concretevitamin commented on a diff in the pull request: https://github.com/apache/spark/pull/1359#discussion_r14781542 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala --- @@ -207,3 +210,64 @@ case class StartsWith(left: Expression, right: Expression) extends StringCompari case class EndsWith(left: Expression, right: Expression) extends StringComparison { def compare(l: String, r: String) = l.endsWith(r) } + +/** + * A function that takes a substring of its first argument starting at a given position. + * Defined for String and Binary types. + */ +case class Substring(str: Expression, pos: Expression, len: Expression) extends Expression { + + type EvaluatedType = Any + + def nullable: Boolean = true + def dataType: DataType = { +if (str.dataType == BinaryType) str.dataType else StringType --- End diff -- Hey Will -- I think we need to check if `resolved` is true here to not break Catalyst's contract. Similar to [here](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala#L183).
[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...
Github user concretevitamin commented on a diff in the pull request: https://github.com/apache/spark/pull/1359#discussion_r14782267

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala ---

@@ -860,6 +860,7 @@ private[hive] object HiveQl {
   val BETWEEN = "(?i)BETWEEN".r
   val WHEN = "(?i)WHEN".r
   val CASE = "(?i)CASE".r
+  val SUBSTR = "(?i)I_SUBSTR(?:ING)?".r

--- End diff --

Why the `I_` here?
[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...
Github user concretevitamin commented on a diff in the pull request: https://github.com/apache/spark/pull/1359#discussion_r14782583

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala ---

@@ -207,3 +210,64 @@ case class StartsWith(left: Expression, right: Expression) extends StringCompari
 case class EndsWith(left: Expression, right: Expression) extends StringComparison {
   def compare(l: String, r: String) = l.endsWith(r)
 }
+
+/**
+ * A function that takes a substring of its first argument starting at a given position.
+ * Defined for String and Binary types.
+ */
+case class Substring(str: Expression, pos: Expression, len: Expression) extends Expression {
+
+  type EvaluatedType = Any
+
+  def nullable: Boolean = true
+  def dataType: DataType = {
+    if (str.dataType == BinaryType) str.dataType else StringType
+  }
+
+  def references = children.flatMap(_.references).toSet
+
+  override def children = str :: pos :: len :: Nil
+
+  def slice[T, C <% IndexedSeqOptimized[T,_]](str: C, startPos: Int, sliceLen: Int): Any = {
+    val len = str.length
+    // Hive and SQL use one-based indexing for SUBSTR arguments but also accept zero and

--- End diff --

It looks like from [1] and [2] that Hive and SQL by default point users to 1-based indexing (and the links don't seem to mention 0-based indexing at all). Even though they do support it, this subtlety necessitates a couple of branches in the implementation, which might cost us some performance. What do you all think about supporting 1-based indexing only? +@marmbrus @rxin
[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...
Github user concretevitamin commented on a diff in the pull request: https://github.com/apache/spark/pull/1359#discussion_r14782682

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala ---

@@ -207,3 +210,64 @@ case class StartsWith(left: Expression, right: Expression) extends StringCompari
 case class EndsWith(left: Expression, right: Expression) extends StringComparison {
   def compare(l: String, r: String) = l.endsWith(r)
 }
+
+/**
+ * A function that takes a substring of its first argument starting at a given position.
+ * Defined for String and Binary types.
+ */
+case class Substring(str: Expression, pos: Expression, len: Expression) extends Expression {
+
+  type EvaluatedType = Any
+
+  def nullable: Boolean = true
+  def dataType: DataType = {
+    if (str.dataType == BinaryType) str.dataType else StringType
+  }
+
+  def references = children.flatMap(_.references).toSet
+
+  override def children = str :: pos :: len :: Nil
+
+  def slice[T, C <% IndexedSeqOptimized[T,_]](str: C, startPos: Int, sliceLen: Int): Any = {
+    val len = str.length
+    // Hive and SQL use one-based indexing for SUBSTR arguments but also accept zero and
+    // negative indices for start positions. If a start index i is greater than 0, it
+    // refers to element i-1 in the sequence. If a start index i is less than 0, it refers
+    // to the -ith element before the end of the sequence. If a start index i is 0, it
+    // refers to the first element.
+
+    val start = startPos match {
+      case pos if pos > 0 => pos - 1
+      case neg if neg < 0 => len + neg
+      case _ => 0
+    }
+
+    val end = sliceLen match {

--- End diff --

The MySQL doc mentions that "If len is less than 1, the result is the empty string."
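The MySQL rule quoted above happens to fall out of `slice`'s semantics with no extra branch. A minimal sketch with an illustrative name (`substr3` is not from the PR): when `len < 1`, the computed `end` is at or before `start`, and `slice` over an empty or inverted range yields the empty string.

```scala
// If len < 1, then end = start + len <= start, and slice(start, end) with
// end <= start returns "" — matching MySQL's documented behavior for free.
def substr3(s: String, pos: Int, len: Int): String = {
  val start =
    if (pos > 0) pos - 1        // one-based -> zero-based
    else if (pos < 0) s.length + pos // count back from the end
    else 0                      // position 0 behaves like position 1
  s.slice(start, start + len)
}
```

For example, `substr3("hello", 2, 0)` and `substr3("hello", 2, -5)` both return `""`.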
[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/1359#issuecomment-48642397

Hey @willb - thanks for working on this, which is going to be very useful for Spark SQL. I left a couple of minor comments. Another general concern is the performance of `eval()`. If there are ways to reduce branching, or the function-call overhead (not sure if @inline will help), we should probably do it.
[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...
Github user concretevitamin commented on a diff in the pull request: https://github.com/apache/spark/pull/1359#discussion_r14783724

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala ---

@@ -207,3 +210,64 @@ case class StartsWith(left: Expression, right: Expression) extends StringCompari
 case class EndsWith(left: Expression, right: Expression) extends StringComparison {
   def compare(l: String, r: String) = l.endsWith(r)
 }
+
+/**
+ * A function that takes a substring of its first argument starting at a given position.
+ * Defined for String and Binary types.
+ */
+case class Substring(str: Expression, pos: Expression, len: Expression) extends Expression {
+
+  type EvaluatedType = Any
+
+  def nullable: Boolean = true
+  def dataType: DataType = {
+    if (str.dataType == BinaryType) str.dataType else StringType
+  }
+
+  def references = children.flatMap(_.references).toSet
+
+  override def children = str :: pos :: len :: Nil
+
+  def slice[T, C <% IndexedSeqOptimized[T,_]](str: C, startPos: Int, sliceLen: Int): Any = {
+    val len = str.length
+    // Hive and SQL use one-based indexing for SUBSTR arguments but also accept zero and
+    // negative indices for start positions. If a start index i is greater than 0, it
+    // refers to element i-1 in the sequence. If a start index i is less than 0, it refers
+    // to the -ith element before the end of the sequence. If a start index i is 0, it
+    // refers to the first element.
+
+    val start = startPos match {
+      case pos if pos > 0 => pos - 1
+      case neg if neg < 0 => len + neg
+      case _ => 0
+    }
+
+    val end = sliceLen match {

--- End diff --

Right, missed this before, sorry.
[GitHub] spark pull request: [SPARK-2393][SQL] Cost estimation optimization...
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/1238#issuecomment-48645064

To handle potential overflow (one last TODO), I think there are a couple of alternatives:

- A: Throw exceptions for overflowing operations. Similar to [1].
- B: Use [1], but replace the overflow situations with a Top and a Bottom that absorb/saturate things correspondingly. Similar concepts here: [2].
- C: Use [3] (or reimplement parts of it), which carries out an overflowing operation in the lifted BigInt counterparts.

I think approach A is bad, as when dealing with big data we'd almost certainly run into this case eventually. Approach B is reasonable in that whenever we see a Top/Bottom, we can just disable or special-case the cost estimation. Approach C looks okay too but may be too heavyweight. Let me know what you think should go into this PR.

[1] https://github.com/twitter/util/blob/master/util-core/src/main/scala/com/twitter/util/LongOverflowArith.scala
[2] https://github.com/twitter/util/blob/master/util-core/src/main/scala/com/twitter/util/Duration.scala
[3] https://github.com/non/spire/blob/master/core/src/main/scala/spire/math/SafeLong.scala
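Approach B can be sketched for multiplication as follows. This is a hand-rolled illustration, not code from any of the linked libraries, and the name `satMul` is invented; it assumes the non-negative cardinalities a cost model deals in, saturating to `Long.MaxValue` as the "Top" value instead of throwing.

```scala
// Saturating multiply for non-negative cost estimates: instead of silently
// overflowing or throwing, clamp to Long.MaxValue ("Top"). A consumer that
// sees Top can treat the estimate as "unknown/huge" and disable or
// special-case the cost-based decision.
def satMul(a: Long, b: Long): Long = {
  require(a >= 0 && b >= 0, "cost estimates are assumed non-negative")
  if (a == 0 || b == 0) 0L
  else if (a > Long.MaxValue / b) Long.MaxValue // would overflow: saturate
  else a * b
}
```

For example, `satMul(Long.MaxValue, 2L)` returns `Long.MaxValue` rather than wrapping negative.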
[GitHub] spark pull request: SPARK-2407: Added internal implementation of S...
Github user concretevitamin commented on the pull request: https://github.com/apache/spark/pull/1359#issuecomment-48648864

This sounds pretty cool!