[GitHub] spark pull request: Minor fix: made EXPLAIN output to play well ...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1097#issuecomment-46251700 Thanks. I'm merging this one. The test that failed was a flume test that is sometimes flaky. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: Follow up of PR #1071 for Java API
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1085#issuecomment-46252146 FYI This didn't get merged into branch-1.0. I did a manual cherry pick.
[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/369#issuecomment-46343296 I will test this today.
[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/369#issuecomment-46345169 This looks good to me. I will merge it.
[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/360#discussion_r13876519 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetConverter.scala --- @@ -0,0 +1,667 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.parquet + +import scala.collection.mutable.{Buffer, ArrayBuffer, HashMap} + +import parquet.io.api.{PrimitiveConverter, GroupConverter, Binary, Converter} +import parquet.schema.MessageType + +import org.apache.spark.sql.catalyst.types._ +import org.apache.spark.sql.catalyst.expressions.{GenericRow, Row, Attribute} +import org.apache.spark.sql.parquet.CatalystConverter.FieldType + +/** + * Collection of converters of Parquet types (group and primitive types) that + * model arrays and maps. The conversions are partly based on the AvroParquet + * converters that are part of Parquet in order to be able to process these + * types. 
+ * + * There are several types of converters: + * ul + * li[[org.apache.spark.sql.parquet.CatalystPrimitiveConverter]] for primitive + * (numeric, boolean and String) types/li + * li[[org.apache.spark.sql.parquet.CatalystNativeArrayConverter]] for arrays + * of native JVM element types; note: currently null values are not supported!/li + * li[[org.apache.spark.sql.parquet.CatalystArrayConverter]] for arrays of + * arbitrary element types (including nested element types); note: currently + * null values are not supported!/li + * li[[org.apache.spark.sql.parquet.CatalystStructConverter]] for structs/li + * li[[org.apache.spark.sql.parquet.CatalystMapConverter]] for maps; note: + * currently null values are not supported!/li + * li[[org.apache.spark.sql.parquet.CatalystPrimitiveRowConverter]] for rows + * of only primitive element types/li + * li[[org.apache.spark.sql.parquet.CatalystGroupConverter]] for other nested + * records, including the top-level row record/li + * /ul + */ + +private[sql] object CatalystConverter { + // The type internally used for fields + type FieldType = StructField + + // This is mostly Parquet convention (see, e.g., `ConversionPatterns`). + // Note that array for the array elements is chosen by ParquetAvro. + // Using a different value will result in Parquet silently dropping columns. 
+ val ARRAY_ELEMENTS_SCHEMA_NAME = array + val MAP_KEY_SCHEMA_NAME = key + val MAP_VALUE_SCHEMA_NAME = value + val MAP_SCHEMA_NAME = map + + // TODO: consider using Array[T] for arrays to avoid boxing of primitive types + type ArrayScalaType[T] = Seq[T] + type StructScalaType[T] = Seq[T] + type MapScalaType[K, V] = Map[K, V] + + protected[parquet] def createConverter( + field: FieldType, + fieldIndex: Int, + parent: CatalystConverter): Converter = { +val fieldType: DataType = field.dataType +fieldType match { + // For native JVM types we use a converter with native arrays + case ArrayType(elementType: NativeType) = { +new CatalystNativeArrayConverter(elementType, fieldIndex, parent) + } + // This is for other types of arrays, including those with nested fields + case ArrayType(elementType: DataType) = { +new CatalystArrayConverter(elementType, fieldIndex, parent) + } + case StructType(fields: Seq[StructField]) = { +new CatalystStructConverter(fields, fieldIndex, parent) + } + case MapType(keyType: DataType, valueType: DataType) = { +new CatalystMapConverter( + Seq( +new FieldType(MAP_KEY_SCHEMA_NAME, keyType, false), +new FieldType(MAP_VALUE_SCHEMA_NAME, valueType, true)), +fieldIndex, +parent) + } + // Strings, Shorts and Bytes do not have a corresponding type in Parquet + // so we need to treat
[GitHub] spark pull request: SPARK-1063 Add .sortBy(f) method on RDD
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/369#issuecomment-46348842 There was a conflict that I had to merge manually. Take a look at master to make sure everything is ok. I compiled it and ran a couple of things.
[GitHub] spark pull request: spark-submit: add exec at the end of the scrip...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/858#issuecomment-46353884 Done.
[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/999#issuecomment-46363656 Jenkins, retest this please.
[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/999#discussion_r13891473
--- Diff: docs/sql-programming-guide.md ---
@@ -91,14 +91,33 @@ of its decedents. To create a basic SQLContext, all you need is a SparkContext.
 {% highlight python %}
 from pyspark.sql import SQLContext
-sqlCtx = SQLContext(sc)
+sqlContext = SQLContext(sc)
 {% endhighlight %}
 </div>
 </div>
-## Running SQL on RDDs
+# Data Sources
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+Spark SQL supports operating on a variety of data sources though the SchemaRDD interface.
--- End diff --
best to put <code> ... </code> around SchemaRDD
[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/999#discussion_r13891482
--- Diff: docs/sql-programming-guide.md ---
@@ -91,14 +91,33 @@ of its decedents. To create a basic SQLContext, all you need is a SparkContext.
 {% highlight python %}
 from pyspark.sql import SQLContext
-sqlCtx = SQLContext(sc)
+sqlContext = SQLContext(sc)
 {% endhighlight %}
 </div>
 </div>
-## Running SQL on RDDs
+# Data Sources
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+Spark SQL supports operating on a variety of data sources though the SchemaRDD interface.
--- End diff --
and for Python/Java too
[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/999#discussion_r13891733
--- Diff: docs/sql-programming-guide.md ---
@@ -297,50 +328,152 @@ JavaSchemaRDD teenagers = sqlCtx.sql("SELECT name FROM parquetFile WHERE age >=
 <div data-lang="python" markdown="1">
 {% highlight python %}
+# sqlContext from the previous example is used in this example.
-peopleTable # The SchemaRDD from the previous example.
+schemaPeople # The SchemaRDD from the previous example.
 # SchemaRDDs can be saved as Parquet files, maintaining the schema information.
-peopleTable.saveAsParquetFile("people.parquet")
+schemaPeople.saveAsParquetFile("people.parquet")
 # Read in the Parquet file created above. Parquet files are self-describing so the schema is preserved.
 # The result of loading a parquet file is also a SchemaRDD.
-parquetFile = sqlCtx.parquetFile("people.parquet")
+parquetFile = sqlContext.parquetFile("people.parquet")
 # Parquet files can also be registered as tables and then used in SQL statements.
 parquetFile.registerAsTable("parquetFile");
-teenagers = sqlCtx.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
-
+teenagers = sqlContext.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
+teenNames = teenagers.map(lambda p: "Name: " + p.name)
+for teenName in teenNames.collect():
+  print teenName
 {% endhighlight %}
 </div>
 </div>
-## Writing Language-Integrated Relational Queries
+## JSON Datasets
+<div class="codetabs">
-**Language-Integrated queries are currently only supported in Scala.**
+<div data-lang="scala" markdown="1">
+Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD.
+This conversion can be done using one of two methods in a SQLContext:
-Spark SQL also supports a domain specific language for writing queries. Once again,
-using the data from the above examples:
+* `jsonFile` - loads data from a directory of JSON files where each line of the files is a JSON object.
+* `jsonRdd` - loads data from an existing RDD where each element of the RDD is a string containing a JSON object.
 {% highlight scala %}
+// sc is an existing SparkContext.
 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
-import sqlContext._
-val people: RDD[Person] = ... // An RDD of case class objects, from the first example.
-// The following is the same as 'SELECT name FROM people WHERE age >= 10 AND age <= 19'
-val teenagers = people.where('age >= 10).where('age <= 19).select('name)
+// A JSON dataset is pointed to by path.
+// The path can be either a single text file or a directory storing text files.
+val path = "examples/src/main/resources/people.json"
+// Create a SchemaRDD from the file(s) pointed to by path
+val people = sqlContext.jsonFile(path)
+
+// The inferred schema can be visualized using the printSchema() method.
+people.printSchema()
+// The schema of people is ...
--- End diff --
i'd remove this line
[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/360#discussion_r13892570 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetConverter.scala --- @@ -0,0 +1,667 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.parquet + +import scala.collection.mutable.{Buffer, ArrayBuffer, HashMap} + +import parquet.io.api.{PrimitiveConverter, GroupConverter, Binary, Converter} +import parquet.schema.MessageType + +import org.apache.spark.sql.catalyst.types._ +import org.apache.spark.sql.catalyst.expressions.{GenericRow, Row, Attribute} +import org.apache.spark.sql.parquet.CatalystConverter.FieldType + +/** + * Collection of converters of Parquet types (group and primitive types) that + * model arrays and maps. The conversions are partly based on the AvroParquet + * converters that are part of Parquet in order to be able to process these + * types. 
+ * + * There are several types of converters: + * ul + * li[[org.apache.spark.sql.parquet.CatalystPrimitiveConverter]] for primitive + * (numeric, boolean and String) types/li + * li[[org.apache.spark.sql.parquet.CatalystNativeArrayConverter]] for arrays + * of native JVM element types; note: currently null values are not supported!/li + * li[[org.apache.spark.sql.parquet.CatalystArrayConverter]] for arrays of + * arbitrary element types (including nested element types); note: currently + * null values are not supported!/li + * li[[org.apache.spark.sql.parquet.CatalystStructConverter]] for structs/li + * li[[org.apache.spark.sql.parquet.CatalystMapConverter]] for maps; note: + * currently null values are not supported!/li + * li[[org.apache.spark.sql.parquet.CatalystPrimitiveRowConverter]] for rows + * of only primitive element types/li + * li[[org.apache.spark.sql.parquet.CatalystGroupConverter]] for other nested + * records, including the top-level row record/li + * /ul + */ + +private[sql] object CatalystConverter { + // The type internally used for fields + type FieldType = StructField + + // This is mostly Parquet convention (see, e.g., `ConversionPatterns`). + // Note that array for the array elements is chosen by ParquetAvro. + // Using a different value will result in Parquet silently dropping columns. 
+ val ARRAY_ELEMENTS_SCHEMA_NAME = array + val MAP_KEY_SCHEMA_NAME = key + val MAP_VALUE_SCHEMA_NAME = value + val MAP_SCHEMA_NAME = map + + // TODO: consider using Array[T] for arrays to avoid boxing of primitive types + type ArrayScalaType[T] = Seq[T] + type StructScalaType[T] = Seq[T] + type MapScalaType[K, V] = Map[K, V] + + protected[parquet] def createConverter( + field: FieldType, + fieldIndex: Int, + parent: CatalystConverter): Converter = { +val fieldType: DataType = field.dataType +fieldType match { + // For native JVM types we use a converter with native arrays + case ArrayType(elementType: NativeType) = { +new CatalystNativeArrayConverter(elementType, fieldIndex, parent) + } + // This is for other types of arrays, including those with nested fields + case ArrayType(elementType: DataType) = { +new CatalystArrayConverter(elementType, fieldIndex, parent) + } + case StructType(fields: Seq[StructField]) = { +new CatalystStructConverter(fields, fieldIndex, parent) + } + case MapType(keyType: DataType, valueType: DataType) = { +new CatalystMapConverter( + Seq( +new FieldType(MAP_KEY_SCHEMA_NAME, keyType, false), +new FieldType(MAP_VALUE_SCHEMA_NAME, valueType, true)), +fieldIndex, +parent) + } + // Strings, Shorts and Bytes do not have a corresponding type in Parquet + // so we need to treat
[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/999#discussion_r13892635 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala --- @@ -123,4 +125,53 @@ abstract class QueryPlan[PlanType : TreeNode[PlanType]] extends TreeNode[PlanTy case other = Nil }.toSeq } + + protected def generateSchemaTreeString(schema: Seq[Attribute]): String = { +val builder = new StringBuilder +builder.append(root\n) +val prefix = | +schema.foreach { attribute = + val name = attribute.name + val dataType = attribute.dataType + dataType match { +case fields: StructType = + builder.append(s$prefix-- $name: $StructType\n) + generateSchemaTreeString(fields, s$prefix|, builder) +case ArrayType(fields: StructType) = + builder.append(s$prefix-- $name: $ArrayType[$StructType]\n) + generateSchemaTreeString(fields, s$prefix|, builder) +case ArrayType(elementType: DataType) = + builder.append(s$prefix-- $name: $ArrayType[$elementType]\n) +case _ = builder.append(s$prefix-- $name: $dataType\n) + } +} + +builder.toString() + } + + protected def generateSchemaTreeString( + schema: StructType, + prefix: String, + builder: StringBuilder): StringBuilder = { +schema.fields.foreach { + case StructField(name, fields: StructType, _) = +builder.append(s$prefix-- $name: $StructType\n) +generateSchemaTreeString(fields, s$prefix|, builder) + case StructField(name, ArrayType(fields: StructType), _) = +builder.append(s$prefix-- $name: $ArrayType[$StructType]\n) +generateSchemaTreeString(fields, s$prefix|, builder) + case StructField(name, ArrayType(elementType: DataType), _) = +builder.append(s$prefix-- $name: $ArrayType[$elementType]\n) + case StructField(name, fieldType: DataType, _) = +builder.append(s$prefix-- $name: $fieldType\n) +} + +builder + } + + /** Returns the output schema in the tree format. 
*/ + def schemaTreeString: String = generateSchemaTreeString(output) --- End diff -- maybe just schemaString
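The recursive rendering in the quoted diff (a `root` line, then one `|-- name: type` line per field, with nested structs indented one level deeper) can be sketched in plain Python. This is only an illustration: the nested-dict schema representation, the function names, and the exact indentation widths are assumptions and are not taken from Spark.

```python
def _render_fields(fields, prefix, lines):
    """Append one line per field, recursing into structs and arrays of structs."""
    for name, data_type in fields.items():
        if isinstance(data_type, dict):
            # A nested dict stands in for a struct type (assumption).
            lines.append(f"{prefix}-- {name}: struct")
            _render_fields(data_type, prefix + "    |", lines)
        elif isinstance(data_type, list) and isinstance(data_type[0], dict):
            # A one-element list holding a dict stands in for array[struct].
            lines.append(f"{prefix}-- {name}: array[struct]")
            _render_fields(data_type[0], prefix + "    |", lines)
        elif isinstance(data_type, list):
            lines.append(f"{prefix}-- {name}: array[{data_type[0]}]")
        else:
            lines.append(f"{prefix}-- {name}: {data_type}")
    return lines


def schema_tree_string(schema):
    """Return the whole schema as a single tree-formatted string."""
    return "\n".join(_render_fields(schema, " |", ["root"]))


if __name__ == "__main__":
    print(schema_tree_string(
        {"name": "string", "address": {"city": "string"}, "tags": ["string"]}))
```

Passing the accumulator (`lines`) down the recursion mirrors the `StringBuilder` argument in the quoted Scala overload.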
[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/999#discussion_r13892709
--- Diff: sql/core/pom.xml ---
@@ -54,6 +61,11 @@
       <version>${parquet.version}</version>
     </dependency>
     <dependency>
+      <groupId>com.fasterxml.jackson.core</groupId>
+      <artifactId>jackson-core</artifactId>
+      <version>2.3.2</version>
--- End diff --
@pwendell I think in general sub project pom files don't specify dependency versions. Can you verify?
[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/999#discussion_r13892874
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -99,6 +97,37 @@ class SQLContext(@transient val sparkContext: SparkContext)
     new SchemaRDD(this, parquet.ParquetRelation(path))
   /**
+   * Loads a JSON file (one object per line), returning the result as a [[SchemaRDD]].
--- End diff --
Maybe add a line explaining this goes through the data once to infer the schema ...
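The point of this comment is that schema inference needs a pass over the data before any query can run. A minimal pure-Python sketch of that idea follows; `infer_schema` is a hypothetical helper, and recording the set of Python type names per field is an assumption that does not reflect the type-coercion rules in `JsonRDD`.

```python
import json


def infer_schema(json_lines):
    """Scan every record once and collect the type names seen for each field."""
    schema = {}
    for line in json_lines:
        record = json.loads(line)
        for field, value in record.items():
            # A field may appear with different types in different records.
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema


if __name__ == "__main__":
    lines = ['{"name": "Ann", "age": 30}', '{"name": "Bob", "age": null}']
    print(infer_schema(lines))
```

Because every record is visited, the cost of inference grows with the dataset, which is why a doc line mentioning the extra pass is useful to callers.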
[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/999#discussion_r13892881
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -99,6 +97,35 @@ class SQLContext(@transient val sparkContext: SparkContext)
     new SchemaRDD(this, parquet.ParquetRelation(path))
   /**
+   * Loads a JSON file (one object per line), returning the result as a [[SchemaRDD]].
+   *
+   * @group userf
+   */
+  def jsonFile(path: String): SchemaRDD = jsonFile(path, 1.0)
+
+  /**
+   * :: Experimental ::
+   */
--- End diff --
here too, although with sampling
[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/999#discussion_r13893161
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala ---
@@ -342,13 +344,34 @@ class SchemaRDD(
   def toJavaSchemaRDD: JavaSchemaRDD = new JavaSchemaRDD(sqlContext, logicalPlan)
   private[sql] def javaToPython: JavaRDD[Array[Byte]] = {
--- End diff --
add some inline doc explaining this is used for the Python API.
[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/999#discussion_r13893257
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala ---
@@ -0,0 +1,399 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.json
+
+import scala.collection.JavaConversions._
+import scala.math.BigDecimal
+
+import com.fasterxml.jackson.databind.ObjectMapper
+
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.catalyst.types._
+import org.apache.spark.sql.execution.{ExistingRdd, SparkLogicalPlan}
+import org.apache.spark.sql.Logging
+
+private[sql] object JsonRDD extends Logging {
+
+  private[sql] def inferSchema(
+      json: RDD[String],
+      samplingRatio: Double = 1.0): LogicalPlan = {
+    require(samplingRatio > 0)
--- End diff --
add a more meaningful exception message, i.e.
```
require(samplingRatio > 0, s"samplingRatio ($samplingRatio) should be greater than 0")
```
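A rough Python analogue of the suggested check is sketched below, to show why the message matters: the caller sees the offending value instead of a bare "requirement failed". The function name `validate_sampling_ratio` is hypothetical, and raising `ValueError` is an assumption standing in for the exception Scala's `require` throws.

```python
def validate_sampling_ratio(sampling_ratio):
    """Return the ratio unchanged, or fail with a message naming the bad value."""
    if not sampling_ratio > 0:
        raise ValueError(
            f"samplingRatio ({sampling_ratio}) should be greater than 0")
    return sampling_ratio


if __name__ == "__main__":
    print(validate_sampling_ratio(0.5))
    try:
        validate_sampling_ratio(0)
    except ValueError as err:
        print(err)
```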
[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/999#discussion_r13893273
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala ---
@@ -0,0 +1,399 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.json
+
+import scala.collection.JavaConversions._
+import scala.math.BigDecimal
+
+import com.fasterxml.jackson.databind.ObjectMapper
+
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.catalyst.types._
+import org.apache.spark.sql.execution.{ExistingRdd, SparkLogicalPlan}
+import org.apache.spark.sql.Logging
+
+private[sql] object JsonRDD extends Logging {
+
+  private[sql] def inferSchema(
+      json: RDD[String],
+      samplingRatio: Double = 1.0): LogicalPlan = {
+    require(samplingRatio > 0)
+    val schemaData = if (samplingRatio > 0.99) json else json.sample(false, samplingRatio, 1)
+
--- End diff --
probably no need to have a blank line for each statement ...
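The `schemaData` line in the quoted diff uses the full dataset when the ratio is close to 1 and a seeded sample otherwise. The sketch below shows that branching on an in-memory list; `sample_for_inference` is a hypothetical name, and keeping each record independently with probability `sampling_ratio` is an assumption about what sampling without replacement means here, not a description of Spark's sampler.

```python
import random


def sample_for_inference(records, sampling_ratio=1.0, seed=1):
    """Pick the records that schema inference will look at."""
    if sampling_ratio > 0.99:
        # Close enough to 1.0: skip sampling and use everything.
        return list(records)
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return [r for r in records if rng.random() < sampling_ratio]


if __name__ == "__main__":
    data = list(range(1000))
    print(len(sample_for_inference(data, 1.0)))
    print(len(sample_for_inference(data, 0.1)))
```

A fixed seed trades randomness for repeatability: the same input yields the same inferred schema on every run.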
[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/999#issuecomment-46380653 This looks good to me overall. Only a few nitpicks. I think we should merge it after you address the couple of comments I had.
[GitHub] spark pull request: SPARK-2170: Fix for global name 'PIPE' is not ...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1109#issuecomment-46383100 Thanks. Merging this in master and branch-1.0.
[GitHub] spark pull request: SPARK-2170: Fix for global name 'PIPE' is not ...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1109#issuecomment-46383668 Actually the merge script failed for this pull request. @pwendell any idea? ``` ./merge_spark_pr.py Which pull request would you like to merge? (e.g. 34): 1109 === Pull Request #1109 === title SPARK-2170: Fix for global name 'PIPE' is not defined source gregakespret/spark-ec2-subprocess-script target master url https://api.github.com/repos/apache/spark/pulls/1109 Proceed with merging pull request #1109? (y/n): y From github.com:apache/spark * [new ref] refs/pull/1109/head - PR_TOOL_MERGE_PR_1109 From https://git-wip-us.apache.org/repos/asf/spark * [new branch] master - PR_TOOL_MERGE_PR_1109_MASTER Switched to branch 'PR_TOOL_MERGE_PR_1109_MASTER' Automatic merge went well; stopped before committing as requested Traceback (most recent call last): File ./merge_spark_pr.py, line 316, in module merge_hash = merge_pr(pr_num, target_ref) File ./merge_spark_pr.py, line 152, in merge_pr run_cmd(['git', 'commit', '--author=%s' % primary_author] + merge_message_flags) File ./merge_spark_pr.py, line 78, in run_cmd return subprocess.check_output(cmd) File /Users/rxin/anaconda/lib/python2.7/subprocess.py, line 573, in check_output raise CalledProcessError(retcode, cmd, output=output) subprocess.CalledProcessError: Command '['git', 'commit', '--author=Grega Kespret gr...@celtra.com', '-m', uSPARK-2170: Fix for global name 'PIPE' is not defined, '-m', u'https://issues.apache.org/jira/browse/SPARK-2170\r\n\r\nBefore this fix, when running ./spark-ec2 script:\r\n\r\n```\r\nTraceback (most recent call last):\r\n File ./spark_ec2.py, line 894, in module\r\n main()\r\n File ./spark_ec2.py, line 886, in main\r\n real_main()\r\n File ./spark_ec2.py, line 770, in real_main\r\n setup_cluster(conn, master_nodes, slave_nodes, opts, True)\r\n File ./spark_ec2.py, line 475, in setup_cluster\r\ndot_ssh_tar = ssh_read(master, opts, [\'tar\', \'c\', \'.ssh\'])\r\n File ./spark_ec2.py, 
line 709, in ssh_read\r\nssh_command(opts) + [\'%s@%s\' % (opts.user, host), stringify_command(command)])\r\n File ./spark_ec2.py, line 696, in _check_output\r\nprocess = subprocess.Popen(stdout=PIPE, *popenargs, **kwargs)\r\nNameError: global name \'PIPE\' is not defined\r\n```', '-m', 'Author: Grega Kespret gr...@celtra.com', '-m', u'Closes #1109 from gregakespret/spark-ec2-subprocess-script and squashes the following commits:', '-m', 4168dc6 [Grega Kespret] Fix for global name 'PIPE' is not defined]' returned non-zero exit status 1 ```
[GitHub] spark pull request: [SPARK-2060][SQL] Querying JSON Datasets with ...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/999#issuecomment-46389105 Thanks. I'm merging this into master and branch-1.0.
[GitHub] spark pull request: SPARK-2170: Fix for global name 'PIPE' is not ...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1109#issuecomment-46398768 @gregakespret, since this has already been fixed in master, do you mind closing this PR?
[GitHub] spark pull request: SPARK-2170: Fix for global name 'PIPE' is not ...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1109#issuecomment-46399205 Yup, looks like a race condition (in a good way). Thanks a lot for catching this!
[GitHub] spark pull request: Compression should be a setting for individual...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1091#issuecomment-46405754 Thanks for working on this, @ScrapCodes. I talked with Matei, and while we both agree that compression would be better set on a per-RDD basis, adding another boolean flag to StorageLevel is not ideal. Matei suggested deferring this until we come up with a proper design. ``` We should come up with a proper design for this. I think one viable design is to make StorageLevel get constructed via a builder pattern. More generally in the future I'd like to have something called StorageStrategy that can also convert the data into a format (e.g. columnar or something). Kind of like including a serializer in the storage level. ```
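To make the builder idea in the quoted note concrete, here is a minimal sketch of what constructing a storage level through a builder could look like. Every name below (`StorageLevelBuilderSketch`, `Level`, `Builder`, `withCompression`, and so on) is hypothetical and chosen only for illustration; none of it is Spark's actual API or the design that was eventually adopted.

```scala
// Hypothetical sketch only: these types do not exist in Spark.
object StorageLevelBuilderSketch {
  // Immutable description of how a block should be stored.
  final case class Level(
      useDisk: Boolean,
      useMemory: Boolean,
      deserialized: Boolean,
      compressed: Boolean,
      replication: Int)

  // Each method returns a new builder, so adding an option such as
  // compression does not require another constructor flag.
  final case class Builder(level: Level = Level(false, false, false, false, 1)) {
    def withDisk: Builder = Builder(level.copy(useDisk = true))
    def withMemory: Builder = Builder(level.copy(useMemory = true))
    def asDeserialized: Builder = Builder(level.copy(deserialized = true))
    def withCompression: Builder = Builder(level.copy(compressed = true))
    def replicas(n: Int): Builder = {
      require(n >= 1, "replication must be at least 1")
      Builder(level.copy(replication = n))
    }
    def build: Level = level
  }
}
```

A caller would then write something like `StorageLevelBuilderSketch.Builder().withMemory.withCompression.replicas(2).build`, which keeps per-RDD compression out of the constructor signature.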
[GitHub] spark pull request: Minor fix
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1105#discussion_r13903361 --- Diff: core/src/main/scala/org/apache/spark/util/MetadataCleaner.scala --- @@ -91,8 +91,13 @@ private[spark] object MetadataCleaner { conf.set(MetadataCleanerType.systemProperty(cleanerType), delay.toString) } + /** + * Set the default delay time( in seconds). --- End diff -- Can you put the space before the ( instead of after it? Thanks!
[GitHub] spark pull request: [SPARK-2162] Double check in doGetLocal to avo...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1103#discussion_r13903517 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -363,6 +363,12 @@ private[spark] class BlockManager( val info = blockInfo.get(blockId).orNull if (info != null) { info.synchronized { +// Double check to make sure the block is still there, since it +// might has been removed when we actually come here. --- End diff -- has -> have. Also, can you point out in the comment that this only works because removeBlock also synchronizes on the block info object?
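The point about removeBlock can be illustrated with a small standalone sketch. This is not the BlockManager code; `DoubleCheckSketch`, `Info`, `put`, `remove`, and `getLocal` are hypothetical names, and the sketch assumes only that readers and the remover lock the same per-block object.

```scala
import java.util.concurrent.ConcurrentHashMap

// Illustrative stand-in for the pattern under review; all names are hypothetical.
object DoubleCheckSketch {
  final class Info(val data: String)

  private val blockInfo = new ConcurrentHashMap[String, Info]()

  def put(id: String, data: String): Unit = {
    blockInfo.put(id, new Info(data))
  }

  // Removal holds the same per-block lock that readers take.
  def remove(id: String): Unit = {
    val info = blockInfo.get(id)
    if (info != null) {
      info.synchronized {
        blockInfo.remove(id)
      }
    }
  }

  def getLocal(id: String): Option[String] = {
    val info = blockInfo.get(id)
    if (info == null) {
      None
    } else {
      info.synchronized {
        // Double check: the block may have been removed between the lookup
        // above and acquiring the lock. The check is only meaningful because
        // remove() synchronizes on the same Info object.
        if (blockInfo.containsKey(id)) Some(info.data) else None
      }
    }
  }
}
```

If `remove` did not take the `info` lock, the second lookup could still succeed while another thread was halfway through removing the block, so the check would not protect the read.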
[GitHub] spark pull request: [SPARK-2162] Double check in doGetLocal to avo...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1103#issuecomment-46406863 This LGTM actually. Makes sense to do another check within the synchronized block in case a block is being removed by another thread.
[GitHub] spark pull request: Fix for Spark-2151
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1095#issuecomment-46407129 Jenkins, test this please.
[GitHub] spark pull request: Fix for Spark-2151
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1095#issuecomment-46407125 Do you mind updating the pull request title to say something like "[SPARK-2151] Recognize memory format for spark-submit"?
[GitHub] spark pull request: SPARK-2038: rename conf parameters in the sa...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1087#issuecomment-46408239 Just leaving a note that this PR has been reverted, because changing a parameter name in Scala can break source compatibility for callers that pass the argument by name.
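As a rough illustration of why a parameter rename can break callers, consider the sketch below. The method and parameter names (`saveV1`, `saveV2`, `conf`, `hadoopConf`) are hypothetical and are not the ones touched by the reverted PR; the two methods stand in for the same API before and after a rename.

```scala
object NamedArgSketch {
  // "Version 1" of a hypothetical API.
  def saveV1(path: String, conf: Map[String, String]): String =
    path + " with " + conf.size + " setting(s)"

  // "Version 2" renames only the second parameter; the types are unchanged.
  def saveV2(path: String, hadoopConf: Map[String, String]): String =
    path + " with " + hadoopConf.size + " setting(s)"

  val settings: Map[String, String] = Map("k" -> "v")

  // Positional calls are unaffected by the rename.
  val positional: String = saveV2("/tmp/out", settings)

  // A caller written against version 1 with a named argument:
  val named: String = saveV1("/tmp/out", conf = settings)

  // The same call against version 2 would no longer compile,
  // because no parameter called `conf` exists there:
  // saveV2("/tmp/out", conf = settings)
}
```

Already compiled callers keep working, since parameter names are not part of the JVM method signature; the breakage shows up only when such callers are recompiled.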
[GitHub] spark pull request: [SPARK-2176][SQL] Extra unnecessary exchange o...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1116#issuecomment-46469732 That's not a bad idea. Also we should add more documentation. While Spark SQL code in general is extremely concise, it can be hard to understand (especially the optimizer rules) for people less familiar with Scala and the tree library itself.
[GitHub] spark pull request: [SPARK-2176][SQL] Extra unnecessary exchange o...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1116#issuecomment-46469830 Thanks. Merging this into master and branch-1.0.
[GitHub] spark pull request: [SPARK-2162] Double check in doGetLocal to avo...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1103#issuecomment-46470613 Thanks. I'm merging this in master.
[GitHub] spark pull request: Updated the comment for SPARK-2162.
GitHub user rxin opened a pull request: https://github.com/apache/spark/pull/1117 Updated the comment for SPARK-2162. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rxin/spark SPARK-2162 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1117.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1117 commit a4231deb2a480196194fe2b0a819cff60354e3cf Author: Reynold Xin r...@apache.org Date: 2014-06-18T18:03:48Z Updated the comment for SPARK-2162.
[GitHub] spark pull request: SPARK-2038: rename conf parameters in the sa...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1087#issuecomment-46477205 That's a very good idea. We should probably have an API-breaking label on JIRA.
[GitHub] spark pull request: [SPARK-2177][SQL] describe table result contai...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1118#discussion_r13936033 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/hiveOperators.scala --- @@ -445,7 +445,19 @@ case class NativeCommand( if (sideEffectResult.size == 0) { context.emptyResult } else { - val rows = sideEffectResult.map(r => new GenericRow(Array[Any](r))) + // TODO: Need a better way to handle the result of a native command. + // We may want to consider to use JsonMetaDataFormatter in Hive. + val isDescribe = sql.trim.startsWith("describe") + val rows = if (isDescribe) { +// TODO: If we upgrade Hive to 0.13, we need to check the results of +// context.sessionState.isHiveServerQuery() to determine how to split the result. +// This method is introduced by https://issues.apache.org/jira/browse/HIVE-4545. +// Right now, we split every string by any number of consecutive spaces. +sideEffectResult.map( + r => r.split("\\s+")).map(r => new GenericRow(r.asInstanceOf[Array[Any]])) --- End diff -- Actually, for describe, can we split into at most 3 columns? ```scala scala> "a b c d e".split("\\s+", 3) res2: Array[String] = Array(a, b, c d e) ```
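The suggestion above relies on the two-argument form of `String.split`, where a positive limit caps the number of resulting fields and leaves the remainder of the string in the last one. A minimal sketch follows; `DescribeSplitSketch` and `splitRow` are hypothetical helper names, and the assumption that a describe row has the shape name, type, comment is taken from the review comment, not verified against Hive.

```scala
object DescribeSplitSketch {
  // Split a row into at most three whitespace-separated fields.
  // Whitespace inside the third field (for example a comment) is preserved.
  def splitRow(row: String): Array[String] = row.trim.split("\\s+", 3)
}
```

With this, a row such as `key int the key column` would yield `key`, `int`, and `the key column`, whereas an unlimited split would break the comment into separate columns.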
[GitHub] spark pull request: [SPARK-2151] Recognize memory format for spark...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1095#issuecomment-46484800 Jenkins, retest this please.
[GitHub] spark pull request: Updated the comment for SPARK-2162.
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1117#issuecomment-46484824 Merged in master.
[GitHub] spark pull request: [SPARK-2177][SQL] describe table result contai...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1118#discussion_r13937999 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/hiveOperators.scala --- @@ -445,7 +445,19 @@ case class NativeCommand( if (sideEffectResult.size == 0) { context.emptyResult } else { - val rows = sideEffectResult.map(r => new GenericRow(Array[Any](r))) + // TODO: Need a better way to handle the result of a native command. + // We may want to consider to use JsonMetaDataFormatter in Hive. --- End diff -- That sounds good. Let's merge this first and submit another PR for that. (The reason is that this should make it into 1.0.1.)
[GitHub] spark pull request: Remove unicode operator from RDD.scala
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1119#issuecomment-46492402 @ash211 ...
[GitHub] spark pull request: Remove unicode operator from RDD.scala
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1119#issuecomment-46500211 Thanks. I've merged this.
[GitHub] spark pull request: SPARK-2038: rename conf parameters in the sa...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1087#issuecomment-46500185 Yup, I added the api-breaking label to the ticket.
[GitHub] spark pull request: SPARK-897: preemptively serialize closures
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/143#issuecomment-46500308 That test is flaky and being fixed right now.
[GitHub] spark pull request: [SPARK-2184][SQL] AddExchange isn't idempotent
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1122#issuecomment-46512587 I'm merging this into master and branch-1.0. Thanks.
[GitHub] spark pull request: [SPARK-2187] Explain should not run the optimi...
GitHub user rxin opened a pull request: https://github.com/apache/spark/pull/1123 [SPARK-2187] Explain should not run the optimizer twice. @yhuai @marmbrus You can merge this pull request into a Git repository by running: $ git pull https://github.com/rxin/spark explain Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1123.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1123 commit a9d3ba877ce8920f0c4a348c6b32b1f4f7f39427 Author: Reynold Xin r...@apache.org Date: 2014-06-19T01:19:27Z [SPARK-2187] Explain should not run the optimizer twice.
[GitHub] spark pull request: [SPARK-2187] Explain should not run the optimi...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1123#discussion_r13949881 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/commands.scala --- @@ -71,16 +72,24 @@ case class SetCommand( } /** + * An explain command for users to see how a command will be executed. + * + * Note that this command takes in a logical plan, runs the optimizer on the logical plan + * (but do NOT actually execute it). + * * :: DeveloperApi :: */ @DeveloperApi case class ExplainCommand( -child: SparkPlan, output: Seq[Attribute])( +logicalPlan: LogicalPlan, output: Seq[Attribute])( @transient context: SQLContext) - extends UnaryNode with Command { + extends LeafNode with Command { - // Actually EXPLAIN command doesn't cause any side effect. - override protected[sql] lazy val sideEffectResult: Seq[String] = this.toString.split("\n") + // Run through the optimizer to generate the physical plan. + // This is really side effect free but we follow the infrastructure anyway... --- End diff -- I hope so. That was an old comment I just rewrote anyway ...
[GitHub] spark pull request: [SPARK-2187] Explain should not run the optimi...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1123#issuecomment-46525610 OK, I am merging this into master and branch-1.0.
[GitHub] spark pull request: [SPARK-2177][SQL] describe table result contai...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1118#discussion_r13954580 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala --- @@ -17,7 +17,7 @@ package org.apache.spark.sql.hive -import org.apache.spark.sql.SQLContext +import org.apache.spark.sql.{SQLContext} --- End diff -- no need to change this
[GitHub] spark pull request: [SPARK-2177][SQL] describe table result contai...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1118#discussion_r13954638 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/commands.scala --- @@ -60,3 +60,16 @@ case class ExplainCommand(plan: LogicalPlan) extends Command { * Returned for the CACHE TABLE tableName and UNCACHE TABLE tableName command. */ case class CacheCommand(tableName: String, doCache: Boolean) extends Command + +/** + * Returned for the Describe tableName command. + */ +case class DescribeCommand( --- End diff -- would be great to explain isFormatted / isExtended in @param.
[GitHub] spark pull request: [SPARK-2177][SQL] describe table result contai...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1118#discussion_r13954626 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala --- @@ -257,6 +250,88 @@ class HiveQuerySuite extends HiveComparisonTest { assert(Try(q0.count()).isSuccess) } + test("Describe commands") { --- End diff -- To be consistent, either lowercase the D or uppercase the whole word (DESCRIBE).
[GitHub] spark pull request: [SPARK-2177][SQL] describe table result contai...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1118#discussion_r13954661 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala --- @@ -81,6 +81,20 @@ private[hive] trait HiveStrategies { def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match { case logical.NativeCommand(sql) => NativeCommand(sql, plan.output)(context) :: Nil + case describe: logical.DescribeCommand => { +val resolvedTable = context.executePlan(describe.table).analyzed +resolvedTable match { + case t: MetastoreRelation => +Seq(DescribeHiveTableCommand( + t, describe.output, describe.isFormatted, describe.isExtended)(context)) + case o: LogicalPlan => +if (describe.isFormatted) --- End diff -- Maybe for non-metastore tables, we can just add some formatted/extended information saying they are registered as temporary tables? Then we can get rid of the extra lines here ...
[GitHub] spark pull request: [SPARK-2177][SQL] describe table result contai...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1118#discussion_r13954704 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala --- @@ -362,13 +367,19 @@ private[hive] object HiveQl { } } + protected def extractDbNameTableName(tableNameParts: Node): (Option[String], String) = { +val (db, tableName) = + tableNameParts.getChildren.map{ case Token(part, Nil) => cleanIdentifier(part)} match { +case Seq(tableOnly) => (None, tableOnly) +case Seq(databaseName, table) => (Some(databaseName), table) + } + +(db, tableName) + } + protected def nodeToPlan(node: Node): LogicalPlan = node match { // Just fake explain for any of the native commands. -case Token("TOK_EXPLAIN", explainArgs) if nativeCommands contains explainArgs.head.getText => - ExplainCommand(NoRelation) -// Create tables aren't native commands due to CTAS queries, but we still don't need to -// explain them. -case Token("TOK_EXPLAIN", explainArgs) if explainArgs.head.getText == "TOK_CREATETABLE" => +case Token("TOK_EXPLAIN", explainArgs) if noExplainCommands contains explainArgs.head.getText => --- End diff -- avoid infix contains here, i.e. ```scala noExplainCommands.contains(explainArgs.head.getText) ```
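For readers less used to Scala's infix notation, the sketch below shows that the two spellings of the guard are equivalent and differ only in readability. `ContainsSketch`, `skipInfix`, and `skipDot` are hypothetical names, and the contents of the set are placeholders rather than the real `noExplainCommands` from HiveQl.

```scala
object ContainsSketch {
  // Placeholder contents; the real set in HiveQl is not reproduced here.
  val noExplainCommands: Set[String] = Set("TOK_CREATETABLE", "TOK_EXAMPLE")

  // Infix form, as written in the diff under review.
  def skipInfix(token: String): Boolean = token match {
    case t if noExplainCommands contains t => true
    case _ => false
  }

  // Dot form, as suggested in the review: the guard reads as a plain method call.
  def skipDot(token: String): Boolean = token match {
    case t if noExplainCommands.contains(t) => true
    case _ => false
  }
}
```

Both functions return the same result for every input; the dot form simply makes it easier to see where the guard expression ends when the argument is itself a chained call such as `explainArgs.head.getText`.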
[GitHub] spark pull request: [SPARK-2177][SQL] describe table result contai...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1118#discussion_r13954716 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala --- @@ -362,13 +367,19 @@ private[hive] object HiveQl { } } + protected def extractDbNameTableName(tableNameParts: Node): (Option[String], String) = { +val (db, tableName) = + tableNameParts.getChildren.map{ case Token(part, Nil) => cleanIdentifier(part)} match { --- End diff -- space after map, and before the closing }
[GitHub] spark pull request: [SPARK-2177][SQL] describe table result contai...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1118#issuecomment-46534190 Hmm, a lot of tests are failing because the output doesn't exactly match Hive's ...
[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-46593227 That test has been flaky. We are fixing it.
[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-46593240 Jenkins, retest this please.
[GitHub] spark pull request: [SPARK-2196] [SQL] Fix nullability of CaseWhen...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1133#issuecomment-46611219 @concretevitamin
[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-46612027 @AndreSchumacher do you mind removing the [WIP] tag from the pull request? Unfortunately, due to the Avro version bump, we can't include this in 1.0.1.
[GitHub] spark pull request: SPARK-2180: support HAVING clauses in Hive que...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1136#issuecomment-46612533 Any idea why the HAVING test from Hive is not runnable?
[GitHub] spark pull request: A few minor Spark SQL Scaladoc fixes.
GitHub user rxin opened a pull request: https://github.com/apache/spark/pull/1139 A few minor Spark SQL Scaladoc fixes. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rxin/spark sparksqldoc Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1139.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1139 commit 66dc72c49afd8f68e960e6e940b340ac29075fd7 Author: Reynold Xin r...@apache.org Date: 2014-06-19T21:11:37Z A few minor Spark SQL Scaladoc fixes.
[GitHub] spark pull request: [SPARK-2191][SQL] Make sure InsertIntoHiveTabl...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1129#issuecomment-46617536 I've merged this in master and branch-1.0.
[GitHub] spark pull request: A few minor Spark SQL Scaladoc fixes.
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1139#issuecomment-46619076 Jenkins, retest this please.
[GitHub] spark pull request: [SPARK-2177][SQL] describe table result contai...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1118#discussion_r14001780
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/commands.scala ---
@@ -60,3 +60,23 @@ case class ExplainCommand(plan: LogicalPlan) extends Command {
  * Returned for the CACHE TABLE tableName and UNCACHE TABLE tableName command.
  */
 case class CacheCommand(tableName: String, doCache: Boolean) extends Command
+
+/**
--- End diff --
remove this block
[GitHub] spark pull request: [SPARK-2177][SQL] describe table result contai...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1118#discussion_r14002250
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveComparisonTest.scala ---
@@ -144,6 +144,10 @@ abstract class HiveComparisonTest
           case _: SetCommand => Seq(0)
           case _: LogicalNativeCommand => answer.filterNot(nonDeterministicLine).filterNot(_ == "")
           case _: ExplainCommand => answer
+          case _: DescribeCommand =>
--- End diff --
add some inline comment explaining what you are filtering
[GitHub] spark pull request: SPARK-2180: support HAVING clauses in Hive que...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1136#issuecomment-46634535 Thanks, @willb. There is at least one problem I found. - I think you'd need to add a cast to the having expression. Otherwise, try running the following: ```select key, count(*) c from src group by key having c``` In Hive this returns nothing, but in Spark SQL with this patch it throws a runtime exception, failing to cast integer to boolean.
[GitHub] spark pull request: SPARK-2180: support HAVING clauses in Hive que...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1136#issuecomment-46635173 To be more specific, I think you can always add a cast that casts the having expression to boolean, and then we have SimplifyCasts in the optimizer that would remove unnecessary casts.
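The suggestion above (always wrap the HAVING condition in a cast to boolean and let a simplification rule drop the redundant ones) can be illustrated with a toy expression tree. This is a minimal sketch, not Catalyst code: `HavingCastSketch`, `Attribute`, `havingCondition` and `simplifyCasts` are hypothetical names made up for the illustration, and the only rule modelled is "a cast to the child's own type is a no-op".

```scala
object HavingCastSketch {
  // Toy type system; only the two types needed for the example.
  sealed trait DataType
  case object IntegerType extends DataType
  case object BooleanType extends DataType

  // Toy expression tree (hypothetical, not the Catalyst classes).
  sealed trait Expression { def dataType: DataType }
  case class Attribute(name: String, dataType: DataType) extends Expression
  case class Cast(child: Expression, dataType: DataType) extends Expression

  // Always wrap the HAVING condition, without inspecting its type first.
  def havingCondition(e: Expression): Expression = Cast(e, BooleanType)

  // Toy cast simplification: drop a cast whose target equals the child's type.
  def simplifyCasts(e: Expression): Expression = e match {
    case Cast(child, target) =>
      val simplified = simplifyCasts(child)
      if (simplified.dataType == target) simplified else Cast(simplified, target)
    case other => other
  }
}
```

With this shape the caller never needs to check whether the condition is already boolean: a boolean condition comes back unchanged after simplification, while an integer condition keeps exactly one cast.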
[GitHub] spark pull request: A few minor Spark SQL Scaladoc fixes.
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1139#issuecomment-46636241 Ok merging this in master and branch-1.0.
[GitHub] spark pull request: More minor scaladoc cleanup for Spark SQL.
GitHub user rxin opened a pull request: https://github.com/apache/spark/pull/1142 More minor scaladoc cleanup for Spark SQL. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rxin/spark sqlclean Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1142.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1142 commit 67a789e9dd8277b5ea3697af4bf07667084ad88a Author: Reynold Xin r...@apache.org Date: 2014-06-20T02:21:29Z More minor scaladoc cleanup for Spark SQL.
[GitHub] spark pull request: [SPARK-2209][SQL] Cast shouldn't do null check...
GitHub user rxin opened a pull request: https://github.com/apache/spark/pull/1143 [SPARK-2209][SQL] Cast shouldn't do null check twice. Also took the chance to clean up cast a little bit. Too many arrows on each line before! You can merge this pull request into a Git repository by running: $ git pull https://github.com/rxin/spark cast Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1143.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1143 commit c2b88aee347edab3d36475ef75b30a1d2f15b1c1 Author: Reynold Xin r...@apache.org Date: 2014-06-20T02:43:06Z [SPARK-2209][SQL] Cast shouldn't do null check twice.
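The idea in the title (check for null once at the evaluation entry point instead of inside every per-type conversion) can be sketched in isolation. The helper names `nullOrCast` and `buildCast` appear in the diff for this pull request; the surrounding `NullCheckSketch` object, the `intToBoolean` value and the simplified `eval` are assumptions made for the illustration and do not reproduce the real `Cast` expression.

```scala
object NullCheckSketch {
  // Older shape: every conversion re-checks for null.
  def nullOrCast[T](a: Any, func: T => Any): Any =
    if (a == null) null else func(a.asInstanceOf[T])

  // Newer shape: the caller has already ruled out null, so only the
  // conversion itself remains.
  def buildCast[T](a: Any, func: T => Any): Any = func(a.asInstanceOf[T])

  // One conversion built with the helper: non-zero integers become true.
  val intToBoolean: Any => Any = buildCast[Int](_, _ != 0)

  // Single null check at the entry point (simplified stand-in for eval).
  def eval(input: Any): Any = if (input == null) null else intToBoolean(input)
}
```

Because `eval` is the only caller of the conversion in this sketch, a null input never reaches `buildCast`, which is the assumption the helper relies on.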
[GitHub] spark pull request: SPARK-2180: support HAVING clauses in Hive que...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1136#issuecomment-46642243 That's definitely a bug - I will take a look at it later.
[GitHub] spark pull request: SPARK-2180: support HAVING clauses in Hive que...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1136#issuecomment-46644761 I found the issue and fixed it. Will push out a pull request soon. If you can just add the boolean cast (always add it - no need to check if the type is already boolean since once I fix the bug, the extra cast on boolean value will be removed), that'd be great.
[GitHub] spark pull request: SPARK-1293 [SQL] [WIP] Parquet support for nes...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-46644798 That sounds good. If you can just comment that test out for now, that'd be great.
[GitHub] spark pull request: More minor scaladoc cleanup for Spark SQL.
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1142#issuecomment-46646143 Ok merging this in master and branch-1.0.
[GitHub] spark pull request: SPARK-2180: support HAVING clauses in Hive que...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1136#issuecomment-46646244 Here's the patch: https://github.com/apache/spark/pull/1144
[GitHub] spark pull request: [SPARK-2210] boolean cast on boolean value sho...
GitHub user rxin opened a pull request: https://github.com/apache/spark/pull/1144 [SPARK-2210] boolean cast on boolean value should be removed.
```
explain select cast(cast(key=0 as boolean) as boolean) aaa from src
```
should be
```
[Physical execution plan:]
[Project [(key#10:0 = 0) AS aaa#7]]
[ HiveTableScan [key#10], (MetastoreRelation default, src, None), None]
```
However, it is currently
```
[Physical execution plan:]
[Project [NOT((key#10=0) = 0) AS aaa#7]]
[ HiveTableScan [key#10], (MetastoreRelation default, src, None), None]
```
You can merge this pull request into a Git repository by running: $ git pull https://github.com/rxin/spark booleancast Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1144.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1144 commit c4e543d9802641e4f7ddb2cc2ae08c05962a5b44 Author: Reynold Xin r...@apache.org Date: 2014-06-20T05:35:23Z [SPARK-2210] boolean cast on boolean value should be removed.
[GitHub] spark pull request: [SPARK-2209][SQL] Cast shouldn't do null check...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1143#discussion_r14007468
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala ---
@@ -104,85 +121,118 @@ case class Cast(child: Expression, dataType: DataType) extends UnaryExpression {
   }
 
   // Timestamp to long, converting milliseconds to seconds
-  private def timestampToLong(ts: Timestamp) = ts.getTime / 1000
+  private[this] def timestampToLong(ts: Timestamp) = ts.getTime / 1000
 
-  private def timestampToDouble(ts: Timestamp) = {
+  private[this] def timestampToDouble(ts: Timestamp) = {
     // First part is the seconds since the beginning of time, followed by nanosecs.
     ts.getTime / 1000 + ts.getNanos.toDouble / 10
   }
 
-  def castToLong: Any => Any = child.dataType match {
-    case StringType => nullOrCast[String](_, s => try s.toLong catch {
-      case _: NumberFormatException => null
-    })
-    case BooleanType => nullOrCast[Boolean](_, b => if(b) 1L else 0L)
-    case TimestampType => nullOrCast[Timestamp](_, t => timestampToLong(t))
-    case DecimalType => nullOrCast[BigDecimal](_, _.toLong)
-    case x: NumericType => b => x.numeric.asInstanceOf[Numeric[Any]].toLong(b)
-  }
-
-  def castToInt: Any => Any = child.dataType match {
-    case StringType => nullOrCast[String](_, s => try s.toInt catch {
-      case _: NumberFormatException => null
-    })
-    case BooleanType => nullOrCast[Boolean](_, b => if(b) 1 else 0)
-    case TimestampType => nullOrCast[Timestamp](_, t => timestampToLong(t).toInt)
-    case DecimalType => nullOrCast[BigDecimal](_, _.toInt)
-    case x: NumericType => b => x.numeric.asInstanceOf[Numeric[Any]].toInt(b)
-  }
-
-  def castToShort: Any => Any = child.dataType match {
-    case StringType => nullOrCast[String](_, s => try s.toShort catch {
-      case _: NumberFormatException => null
-    })
-    case BooleanType => nullOrCast[Boolean](_, b => if(b) 1.toShort else 0.toShort)
-    case TimestampType => nullOrCast[Timestamp](_, t => timestampToLong(t).toShort)
-    case DecimalType => nullOrCast[BigDecimal](_, _.toShort)
-    case x: NumericType => b => x.numeric.asInstanceOf[Numeric[Any]].toInt(b).toShort
-  }
-
-  def castToByte: Any => Any = child.dataType match {
-    case StringType => nullOrCast[String](_, s => try s.toByte catch {
-      case _: NumberFormatException => null
-    })
-    case BooleanType => nullOrCast[Boolean](_, b => if(b) 1.toByte else 0.toByte)
-    case TimestampType => nullOrCast[Timestamp](_, t => timestampToLong(t).toByte)
-    case DecimalType => nullOrCast[BigDecimal](_, _.toByte)
-    case x: NumericType => b => x.numeric.asInstanceOf[Numeric[Any]].toInt(b).toByte
-  }
-
-  def castToDecimal: Any => Any = child.dataType match {
-    case StringType => nullOrCast[String](_, s => try BigDecimal(s.toDouble) catch {
-      case _: NumberFormatException => null
-    })
-    case BooleanType => nullOrCast[Boolean](_, b => if(b) BigDecimal(1) else BigDecimal(0))
+  private[this] def castToLong: Any => Any = child.dataType match {
+    case StringType =>
+      buildCast[String](_, s => try s.toLong catch {
+        case _: NumberFormatException => null
+      })
--- End diff --
Try is really slow though.
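The review comment above does not spell out which construct it refers to, so the following is only a side-by-side sketch of two ways to turn a bad numeric string into null: the plain try/catch used in the diff, and a variant built on `scala.util.Try`. The object and method names are hypothetical, and no timing is measured or claimed here; the sketch only shows that the two shapes return the same values, so either can be dropped into a benchmark.

```scala
import scala.util.Try

object ParseLongSketch {
  // Shape used in the diff: plain try/catch, bad input maps to null.
  def parseWithCatch(s: String): Any =
    try s.toLong catch { case _: NumberFormatException => null }

  // Variant using scala.util.Try. Note that Try catches every non-fatal
  // exception, not only NumberFormatException, and allocates a wrapper
  // object for each call.
  def parseWithTry(s: String): Any =
    Try[Any](s.toLong).getOrElse(null)
}
```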
[GitHub] spark pull request: [SPARK-2209][SQL] Cast shouldn't do null check...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1143#discussion_r14007471
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala ---
@@ -24,72 +24,89 @@ import org.apache.spark.sql.catalyst.types._
 /** Cast the child expression to the target data type. */
 case class Cast(child: Expression, dataType: DataType) extends UnaryExpression {
   override def foldable = child.foldable
-  def nullable = (child.dataType, dataType) match {
+
+  override def nullable = (child.dataType, dataType) match {
     case (StringType, _: NumericType) => true
     case (StringType, TimestampType) => true
     case _ => child.nullable
   }
+
   override def toString = s"CAST($child, $dataType)"
 
   type EvaluatedType = Any
 
-  def nullOrCast[T](a: Any, func: T => Any): Any = if(a == null) {
-    null
-  } else {
-    func(a.asInstanceOf[T])
-  }
+  // [[func]] assumes the input is no longer null because eval already does the null check.
+  @inline private[this] def buildCast[T](a: Any, func: T => Any): Any = func(a.asInstanceOf[T])
 
   // UDFToString
-  def castToString: Any => Any = child.dataType match {
-    case BinaryType => nullOrCast[Array[Byte]](_, new String(_, "UTF-8"))
-    case _ => nullOrCast[Any](_, _.toString)
+  private[this] def castToString: Any => Any = child.dataType match {
+    case BinaryType => buildCast[Array[Byte]](_, new String(_, "UTF-8"))
+    case _ => buildCast[Any](_, _.toString)
   }
 
   // BinaryConverter
-  def castToBinary: Any => Any = child.dataType match {
-    case StringType => nullOrCast[String](_, _.getBytes("UTF-8"))
+  private[this] def castToBinary: Any => Any = child.dataType match {
+    case StringType => buildCast[String](_, _.getBytes("UTF-8"))
   }
 
   // UDFToBoolean
-  def castToBoolean: Any => Any = child.dataType match {
-    case StringType => nullOrCast[String](_, _.length() != 0)
-    case TimestampType => nullOrCast[Timestamp](_, b => {(b.getTime() != 0 || b.getNanos() != 0)})
-    case LongType => nullOrCast[Long](_, _ != 0)
-    case IntegerType => nullOrCast[Int](_, _ != 0)
-    case ShortType => nullOrCast[Short](_, _ != 0)
-    case ByteType => nullOrCast[Byte](_, _ != 0)
-    case DecimalType => nullOrCast[BigDecimal](_, _ != 0)
-    case DoubleType => nullOrCast[Double](_, _ != 0)
-    case FloatType => nullOrCast[Float](_, _ != 0)
+  private[this] def castToBoolean: Any => Any = child.dataType match {
+    case StringType =>
+      buildCast[String](_, _.length() != 0)
+    case TimestampType =>
+      buildCast[Timestamp](_, b => b.getTime() != 0 || b.getNanos() != 0)
+    case LongType =>
+      buildCast[Long](_, _ != 0)
+    case IntegerType =>
+      buildCast[Int](_, _ != 0)
+    case ShortType =>
+      buildCast[Short](_, _ != 0)
+    case ByteType =>
+      buildCast[Byte](_, _ != 0)
+    case DecimalType =>
+      buildCast[BigDecimal](_, _ != 0)
+    case DoubleType =>
+      buildCast[Double](_, _ != 0)
+    case FloatType =>
+      buildCast[Float](_, _ != 0)
   }
 
   // TimestampConverter
-  def castToTimestamp: Any => Any = child.dataType match {
-    case StringType => nullOrCast[String](_, s => {
-      // Throw away extra if more than 9 decimal places
-      val periodIdx = s.indexOf(".");
-      var n = s
-      if (periodIdx != -1) {
-        if (n.length() - periodIdx > 9) {
-          n = n.substring(0, periodIdx + 10)
+  private[this] def castToTimestamp: Any => Any = child.dataType match {
+    case StringType =>
+      buildCast[String](_, s => {
+        // Throw away extra if more than 9 decimal places
+        val periodIdx = s.indexOf(".")
+        var n = s
+        if (periodIdx != -1) {
+          if (n.length() - periodIdx > 9) {
+            n = n.substring(0, periodIdx + 10)
+          }
         }
-      }
-      try Timestamp.valueOf(n) catch { case _: java.lang.IllegalArgumentException => null}
-    })
-    case BooleanType => nullOrCast[Boolean](_, b => new Timestamp((if(b) 1 else 0) * 1000))
-    case LongType => nullOrCast[Long](_, l => new Timestamp(l * 1000))
-    case IntegerType => nullOrCast[Int](_, i => new Timestamp(i * 1000))
-    case ShortType => nullOrCast[Short](_, s => new Timestamp(s * 1000))
-    case ByteType => nullOrCast[Byte](_, b => new Timestamp(b * 1000))
+        try Timestamp.valueOf(n) catch { case _: java.lang.IllegalArgumentException => null }
+      })
+    case BooleanType =>
+      buildCast[Boolean](_, b => new Timestamp((if(b) 1 else 0) * 1000))
+    case LongType =>
+      buildCast[Long](_, l => new Timestamp
[GitHub] spark pull request: [SPARK-2218] rename Equals to EqualsTo in Spar...
GitHub user rxin opened a pull request: https://github.com/apache/spark/pull/1146 [SPARK-2218] rename Equals to EqualsTo in Spark SQL expressions. Due to the existence of scala.Equals, it is very error-prone to name the expression Equals, especially because we use a lot of partial functions and pattern matching in the optimizer. Note that this sits on top of #1144. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rxin/spark equals Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1146.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1146 commit c4e543d9802641e4f7ddb2cc2ae08c05962a5b44 Author: Reynold Xin r...@apache.org Date: 2014-06-20T05:35:23Z [SPARK-2210] boolean cast on boolean value should be removed. commit 81148d16e97d535d9a13927c1be2a9778c6e7ae5 Author: Reynold Xin r...@apache.org Date: 2014-06-20T05:52:36Z [SPARK-2218] rename Equals to EqualsTo in Spark SQL expressions.
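The hazard described above can be reproduced without any Spark classes. In the sketch below (all names hypothetical), no expression class called `Equals` is in scope, so the bare name silently resolves to `scala.Equals`. Every case class implements that trait through `Product`, so a pattern that was meant to pick out equality expressions compiles cleanly and matches every node.

```scala
object EqualsPitfall {
  // Toy expression tree; note there is deliberately no class named Equals here.
  trait Expression
  case class Literal(value: Int) extends Expression
  case class Add(left: Expression, right: Expression) extends Expression

  // Intended to detect equality comparisons. Because `Equals` resolves to
  // scala.Equals, the first case matches any case class instance.
  def isEqualityTest(e: Expression): Boolean = e match {
    case _: Equals => true
    case _         => false
  }
}
```

A missing import is enough to trigger this, and the compiler gives no error, which is one way to read the motivation for choosing a name that does not collide with the standard library.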
[GitHub] spark pull request: SparkSQL add SkewJoin
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1134#issuecomment-46648420 Do you mind reformatting the code to match the Spark coding style? https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide
[GitHub] spark pull request: [SQL] Improve Speed of InsertIntoHiveTable
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1130#issuecomment-46648984 Merging this in master and branch-1.0.
[GitHub] spark pull request: [SPARK-2177][SQL] describe table result contai...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1118#issuecomment-46649113 Ok I'm merging this in master and branch-1.0. Thanks!
[GitHub] spark pull request: SPARK-1293 [SQL] Parquet support for nested ty...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/360#issuecomment-46649473 Ok I'm going to merge this in master and branch-1.0 now. Kinda scary but the change is very isolated.
[GitHub] spark pull request: [SPARK-2185] Emit warning when task size excee...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1149#issuecomment-46649825
```
error file=/home/jenkins/workspace/SparkPullRequestBuilder/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala message=File must end with newline character
```
[GitHub] spark pull request: [SPARK-2210] cast to boolean on boolean value ...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1144#issuecomment-46650094 Ok merging this in master and branch-1.0.
[GitHub] spark pull request: SPARK-2203: PySpark defaults to use same num r...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1138#issuecomment-46650603 Merging this in master
[GitHub] spark pull request: SPARK-2180: support HAVING clauses in Hive que...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1136#issuecomment-46650863 BTW I really want this to go into 1.0.1, which will probably have a release candidate soon. So if you have a chance to rebase your PR and add the cast, please do. Thanks a lot, @willb!
[GitHub] spark pull request: [SPARK-2196] [SQL] Fix nullability of CaseWhen...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1133#issuecomment-46650936 Thanks. I'm merging this in master and branch-1.0. The test failure is not related to this change.
[GitHub] spark pull request: [SPARK-2209][SQL] Cast shouldn't do null check...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1143#issuecomment-46650243 Ok merging this in master and branch-1.0.
[GitHub] spark pull request: [SPARK-2218] rename Equals to EqualTo in Spark...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1146#issuecomment-46651438 Jenkins, retest this please.
[GitHub] spark pull request: [SPARK-2218] rename Equals to EqualTo in Spark...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1146#issuecomment-46652247 Ok merging this in master and branch-1.0.
[GitHub] spark pull request: [SPARK-1412][SQL] Disable partial aggregation ...
GitHub user rxin opened a pull request: https://github.com/apache/spark/pull/1152 [SPARK-1412][SQL] Disable partial aggregation automatically when reduction factor is low - WIP This is just a prototype. Kinda ugly, doesn't properly connect with the config system yet, and has no tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rxin/spark partialAggDisable Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1152.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1152 commit 6360e117de927b77a61b5e4f03ac52eb400c1825 Author: Reynold Xin r...@apache.org Date: 2014-06-20T08:05:22Z Prototype for disable partial aggregation when we don't see reduction.
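The pull request text does not describe how the prototype measures reduction, so the following is only one plausible reading, sketched under stated assumptions: sample the grouping keys of the first rows, compare the number of distinct keys with the number of rows, and skip map-side (partial) aggregation when almost every row carries a new key. `PartialAggSketch`, `outputToInputRatio`, `usePartialAggregation` and the `maxRatio` threshold are hypothetical names and values, not taken from the patch.

```scala
object PartialAggSketch {
  // Rows that would survive map-side combining, as a fraction of rows seen:
  // distinct keys / sampled rows. A value near 1.0 means almost no reduction.
  def outputToInputRatio[K](sampleKeys: Seq[K]): Double =
    if (sampleKeys.isEmpty) 1.0
    else sampleKeys.distinct.size.toDouble / sampleKeys.size

  // Keep partial aggregation only when the sample shrinks enough.
  // maxRatio is an illustrative threshold, not a Spark configuration value.
  def usePartialAggregation[K](sampleKeys: Seq[K], maxRatio: Double = 0.5): Boolean =
    outputToInputRatio(sampleKeys) <= maxRatio
}
```

For example, a sample of keys `a, a, a, b` shrinks to two groups (ratio 0.5) and keeps partial aggregation under the default threshold, while four distinct keys give a ratio of 1.0 and disable it.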
[GitHub] spark pull request: [SPARK-1412][SQL] Disable partial aggregation ...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1152#issuecomment-46654388 @concretevitamin I find it hard to actually use config options in a physical operator. Any suggestions?
[GitHub] spark pull request: [SPARK-1412][SQL] Disable partial aggregation ...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1152#issuecomment-46654585 @pwendell / @mateiz should we actually build this into Spark directly (i.e. in Aggregator)?
[GitHub] spark pull request: [SPARK-2219][SQL] Fix add jar to execute with ...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1154#issuecomment-46706270 This needs to call Spark's addJar, doesn't it?
[GitHub] spark pull request: SPARK-2180: support HAVING clauses in Hive que...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1136#issuecomment-46724012 I'm going to merge this in master and branch-1.0. I will create a separate ticket to track progress on HAVING. Basically there are two things missing: 1. HAVING without GROUP BY should just become a normal WHERE 2. HAVING should be able to contain aggregate expressions that don't appear in the aggregation list. This test contains that: https://github.com/apache/hive/blob/trunk/ql/src/test/queries/clientpositive/having.q
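The first missing item (HAVING without GROUP BY behaving like a WHERE) amounts to a small planning decision, sketched here on a toy logical plan. Everything in the block is hypothetical: `Table`, `Filter`, `Aggregate` and `planHaving` are stand-ins invented for the illustration, conditions are kept as plain strings, and the sketch follows the comment's statement literally without handling aggregate functions in the select list.

```scala
object HavingRewriteSketch {
  // Toy logical plan nodes (not the Catalyst classes).
  sealed trait Plan
  case class Table(name: String) extends Plan
  case class Filter(condition: String, child: Plan) extends Plan
  case class Aggregate(groupBy: Seq[String], child: Plan) extends Plan

  // With no grouping columns the HAVING condition filters the source rows
  // directly, like WHERE; otherwise it filters the groups after aggregation.
  def planHaving(groupBy: Seq[String], having: String, source: Plan): Plan =
    if (groupBy.isEmpty) Filter(having, source)
    else Filter(having, Aggregate(groupBy, source))
}
```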
[GitHub] spark pull request: SPARK-2180: support HAVING clauses in Hive que...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1136#issuecomment-46725494 BTW two follow up tickets created: https://issues.apache.org/jira/browse/SPARK-2225 https://issues.apache.org/jira/browse/SPARK-2226 Let me know if you'd like to work on them.
[GitHub] spark pull request: SPARK-2180: support HAVING clauses in Hive que...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1136#issuecomment-46725451 There are databases that support that, and it seems to me a very simple change (actually just removing the check code you added is probably enough).
[GitHub] spark pull request: SPARK-2180: support HAVING clauses in Hive que...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1136#issuecomment-46726272 I actually did 2225 already. I will assign 2226 to you. Thanks!