[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array
Github user mgaido91 closed the pull request at: https://github.com/apache/spark/pull/19492 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144837336 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala --- @@ -343,6 +367,25 @@ class JacksonParser( record: T, createParser: (JsonFactory, T) => JsonParser, recordLiteral: T => UTF8String): Seq[InternalRow] = { +parseWithArrayOfPrimitiveSupport(record, createParser, recordLiteral) match { + case rows: Seq[InternalRow] => rows + case _: Seq[_] => throw BadRecordException(() => recordLiteral(record), () => None, +new RuntimeException("Conversion of array of primitive data is not yet supported here.")) --- End diff -- yes, you are right, since we have only two constructors which enforce this pattern. I will then edit the exception message according to your suggestion, thanks.
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144837359 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala --- @@ -343,6 +368,25 @@ class JacksonParser( record: T, createParser: (JsonFactory, T) => JsonParser, recordLiteral: T => UTF8String): Seq[InternalRow] = { +parseWithArrayOfPrimitiveSupport(record, createParser, recordLiteral) match { + case rows: Seq[InternalRow] => rows + case _: Seq[_] => throw BadRecordException(() => recordLiteral(record), () => None, +new RuntimeException("Conversion of array of primitive data is not yet supported here.")) +} + } + + /** + * Parse the JSON input. This function can return a set of [[InternalRow]]s + * if a [[StructType]] is defined as schema, otherwise it returns a set of + * objects. --- End diff -- Btw, please also add a comment to the existing `parse` to clarify its usage. It might not be easy to know which one to call at first glance.
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144837080 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala --- @@ -343,6 +368,25 @@ class JacksonParser( record: T, createParser: (JsonFactory, T) => JsonParser, recordLiteral: T => UTF8String): Seq[InternalRow] = { +parseWithArrayOfPrimitiveSupport(record, createParser, recordLiteral) match { + case rows: Seq[InternalRow] => rows + case _: Seq[_] => throw BadRecordException(() => recordLiteral(record), () => None, +new RuntimeException("Conversion of array of primitive data is not yet supported here.")) +} + } + + /** + * Parse the JSON input. This function can return a set of [[InternalRow]]s + * if a [[StructType]] is defined as schema, otherwise it returns a set of + * objects. --- End diff -- Please add a comment that this is used when passing an `ArrayType` of primitive types.
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144837088 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala --- @@ -343,6 +367,25 @@ class JacksonParser( record: T, createParser: (JsonFactory, T) => JsonParser, recordLiteral: T => UTF8String): Seq[InternalRow] = { +parseWithArrayOfPrimitiveSupport(record, createParser, recordLiteral) match { + case rows: Seq[InternalRow] => rows + case _: Seq[_] => throw BadRecordException(() => recordLiteral(record), () => None, +new RuntimeException("Conversion of array of primitive data is not yet supported here.")) --- End diff -- Anyway, we can keep it as it is. I don't feel strongly about changing it.
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144836485 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala --- @@ -35,19 +35,25 @@ import org.apache.spark.util.Utils /** * Constructs a parser for a given schema that translates a json string to an [[InternalRow]]. */ -class JacksonParser( -schema: StructType, +private[sql] class JacksonParser( +schema: DataType, val options: JSONOptions) extends Logging { import JacksonUtils._ import com.fasterxml.jackson.core.JsonToken._ + def this(schema: StructType, options: JSONOptions) = this(schema: DataType, options) + def this(schema: ArrayType, options: JSONOptions) = this(schema: DataType, options) --- End diff -- you are right, I am fixing it, thanks.
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144835433 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala --- @@ -343,6 +367,25 @@ class JacksonParser( record: T, createParser: (JsonFactory, T) => JsonParser, recordLiteral: T => UTF8String): Seq[InternalRow] = { +parseWithArrayOfPrimitiveSupport(record, createParser, recordLiteral) match { + case rows: Seq[InternalRow] => rows + case _: Seq[_] => throw BadRecordException(() => recordLiteral(record), () => None, +new RuntimeException("Conversion of array of primitive data is not yet supported here.")) --- End diff -- To clarify: I think the only way we can throw this exception is to pass an `ArrayType` into the `JacksonParser` constructor and then call `parse` instead of `parseWithArrayOfPrimitiveSupport`. Because `JacksonParser` is internally used, I assume this usage is intentional and the developer will get the exception right away. So I don't think this exception can be seen by an end user, unless we ship such broken code to users in Spark releases.
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144834116 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala --- @@ -35,19 +35,25 @@ import org.apache.spark.util.Utils /** * Constructs a parser for a given schema that translates a json string to an [[InternalRow]]. */ -class JacksonParser( -schema: StructType, +private[sql] class JacksonParser( -schema: StructType, +schema: DataType, val options: JSONOptions) extends Logging { import JacksonUtils._ import com.fasterxml.jackson.core.JsonToken._ + def this(schema: StructType, options: JSONOptions) = this(schema: DataType, options) + def this(schema: ArrayType, options: JSONOptions) = this(schema: DataType, options) --- End diff -- If so, then I think the default constructor should be private?
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144792285 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala --- @@ -536,26 +536,31 @@ case class JsonToStructs( timeZoneId = None) override def checkInputDataTypes(): TypeCheckResult = schema match { -case _: StructType | ArrayType(_: StructType, _) => +case _: StructType | ArrayType(_: StructType | _: AtomicType, _) => super.checkInputDataTypes() case _ => TypeCheckResult.TypeCheckFailure( - s"Input schema ${schema.simpleString} must be a struct or an array of structs.") + s"Input schema ${schema.simpleString} must be a struct or " + +s"an array of structs or primitive types.") } @transient - lazy val rowSchema = schema match { + lazy val rowSchema: DataType = schema match { --- End diff -- I can't understand the difference. Anyway, if you think `dataSchema` is more appropriate, I can rename it.
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144790680 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala --- @@ -536,26 +536,31 @@ case class JsonToStructs( timeZoneId = None) override def checkInputDataTypes(): TypeCheckResult = schema match { -case _: StructType | ArrayType(_: StructType, _) => +case _: StructType | ArrayType(_: StructType | _: AtomicType, _) => super.checkInputDataTypes() case _ => TypeCheckResult.TypeCheckFailure( - s"Input schema ${schema.simpleString} must be a struct or an array of structs.") + s"Input schema ${schema.simpleString} must be a struct or " + +s"an array of structs or primitive types.") } @transient - lazy val rowSchema = schema match { + lazy val rowSchema: DataType = schema match { --- End diff -- I think it was named `rowSchema` because it could only be a `StructType` before. This is not the input/output row's schema.
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144789703 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala --- @@ -343,6 +367,25 @@ class JacksonParser( record: T, createParser: (JsonFactory, T) => JsonParser, recordLiteral: T => UTF8String): Seq[InternalRow] = { +parseWithArrayOfPrimitiveSupport(record, createParser, recordLiteral) match { + case rows: Seq[InternalRow] => rows + case _: Seq[_] => throw BadRecordException(() => recordLiteral(record), () => None, +new RuntimeException("Conversion of array of primitive data is not yet supported here.")) --- End diff -- Yes, it is internally used, but if it throws an `Exception` a user might see it. I think that if this is not clear for internal usage we can add comments, but the text of the exception should be meaningful to the end user IMHO. If you have any suggestion about how to improve this message while keeping it meaningful to a user, I am happy to change it. Thanks.
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144785808 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala --- @@ -343,6 +367,25 @@ class JacksonParser( record: T, createParser: (JsonFactory, T) => JsonParser, recordLiteral: T => UTF8String): Seq[InternalRow] = { +parseWithArrayOfPrimitiveSupport(record, createParser, recordLiteral) match { + case rows: Seq[InternalRow] => rows + case _: Seq[_] => throw BadRecordException(() => recordLiteral(record), () => None, +new RuntimeException("Conversion of array of primitive data is not yet supported here.")) --- End diff -- What I thought is that `JacksonParser` is internally used in Spark SQL. It is hard to imagine an end user directly using `parse` and seeing this exception. Actually `parse` is supposed to return `InternalRow`s; the only case where we get something else is when the schema given to `JacksonParser` is wrong. So I expect this exception to be seen only during internal SQL development. Btw, I have no strong opinion at this point; if you think it is ok, I'm fine with it.
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144784037 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala --- @@ -536,26 +536,31 @@ case class JsonToStructs( timeZoneId = None) override def checkInputDataTypes(): TypeCheckResult = schema match { -case _: StructType | ArrayType(_: StructType, _) => +case _: StructType | ArrayType(_: StructType | _: AtomicType, _) => super.checkInputDataTypes() case _ => TypeCheckResult.TypeCheckFailure( - s"Input schema ${schema.simpleString} must be a struct or an array of structs.") + s"Input schema ${schema.simpleString} must be a struct or " + +s"an array of structs or primitive types.") } @transient - lazy val rowSchema = schema match { + lazy val rowSchema: DataType = schema match { --- End diff -- Why is it not a row schema? It is, but sometimes the schema of a row is an array.
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144783859 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala --- @@ -35,19 +35,25 @@ import org.apache.spark.util.Utils /** * Constructs a parser for a given schema that translates a json string to an [[InternalRow]]. */ -class JacksonParser( -schema: StructType, +private[sql] class JacksonParser( +schema: DataType, val options: JSONOptions) extends Logging { import JacksonUtils._ import com.fasterxml.jackson.core.JsonToken._ + def this(schema: StructType, options: JSONOptions) = this(schema: DataType, options) + def this(schema: ArrayType, options: JSONOptions) = this(schema: DataType, options) --- End diff -- These are to prevent someone from using the constructor with an invalid `DataType`, i.e. anything which is not a `StructType` or an `ArrayType`.
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144783499 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala --- @@ -170,6 +160,31 @@ class JsonFunctionsSuite extends QueryTest with SharedSQLContext { Row(Row(1, "haa")) :: Nil) } + test("SPARK-22228: from_json should support also arrays of primitive types") { +val dfInt = Seq("[1]", "[2, 3]").toDS() +checkAnswer( + dfInt.select(from_json($"value", ArrayType(IntegerType))), + Row(Seq(1)) :: Row(Seq(2, 3)) :: Nil) + +val dfString = Seq("""["hello", "world", ""]""").toDS() +checkAnswer( + dfString.select(from_json($"value", ArrayType(StringType))), + Row(Seq("hello", "world", "")) :: Nil) + +val dfTimestamp = Seq("""["26/08/2015 18:00"]""").toDS() +val schema = ArrayType(TimestampType) +val options = Map("timestampFormat" -> "dd/MM/yyyy HH:mm") + +checkAnswer( + dfTimestamp.select(from_json($"value", schema, options)), + Row(Seq(java.sql.Timestamp.valueOf("2015-08-26 18:00:00.0"))) :: Nil) + +val dfEmpty = Seq("""[]""").toDS() +checkAnswer( + dfEmpty.select(from_json($"value", ArrayType(StringType))), + Row(Nil) :: Nil) --- End diff -- I will, thanks for the suggestion.
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144783459 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala --- @@ -170,6 +160,31 @@ class JsonFunctionsSuite extends QueryTest with SharedSQLContext { Row(Row(1, "haa")) :: Nil) } + test("SPARK-22228: from_json should support also arrays of primitive types") { +val dfInt = Seq("[1]", "[2, 3]").toDS() --- End diff -- This case is tested a few lines below. I preferred to treat each specific case separately, so that a failure points out very easily which case is not handled properly.
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144783193 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala --- @@ -343,6 +367,25 @@ class JacksonParser( record: T, createParser: (JsonFactory, T) => JsonParser, recordLiteral: T => UTF8String): Seq[InternalRow] = { +parseWithArrayOfPrimitiveSupport(record, createParser, recordLiteral) match { + case rows: Seq[InternalRow] => rows + case _: Seq[_] => throw BadRecordException(() => recordLiteral(record), () => None, +new RuntimeException("Conversion of array of primitive data is not yet supported here.")) --- End diff -- What about a user seeing this exception? With the current description (which I am very open to improving), he/she is aware of trying to do something which is not allowed (at least at the moment), i.e. we might hit this exception when using `sqlContext.read.json(...)` on arrays of primitives. Your suggested description would be a bit weird to a user: he/she might feel he/she is doing something wrong to achieve something which can actually be done, but of course he/she knows nothing about these internal functions, so he/she would be lost IMHO.
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144782180 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala --- @@ -89,6 +95,24 @@ class JacksonParser( /** * Create a converter which converts the JSON documents held by the `JsonParser` + * to a value according to a desired schema. This is an overloaded method to the + * previous one which allows to handle array of primitive types. + */ + private def makeRootConverter(at: ArrayType): JsonParser => Seq[Any] = { +(parser: JsonParser) => parseJsonToken[Seq[Any]](parser, at) { + case START_ARRAY => +val array = convertArray(parser, makeConverter(at.elementType)) +if (array.numElements() == 0) { + Nil +} else { + array.toArray(at.elementType).toSeq +} + case _ => Nil --- End diff -- You are right, I will remove this.
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144757908 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala --- @@ -170,6 +160,31 @@ class JsonFunctionsSuite extends QueryTest with SharedSQLContext { Row(Row(1, "haa")) :: Nil) } + test("SPARK-22228: from_json should support also arrays of primitive types") { +val dfInt = Seq("[1]", "[2, 3]").toDS() +checkAnswer( + dfInt.select(from_json($"value", ArrayType(IntegerType))), + Row(Seq(1)) :: Row(Seq(2, 3)) :: Nil) + +val dfString = Seq("""["hello", "world", ""]""").toDS() +checkAnswer( + dfString.select(from_json($"value", ArrayType(StringType))), + Row(Seq("hello", "world", "")) :: Nil) + +val dfTimestamp = Seq("""["26/08/2015 18:00"]""").toDS() +val schema = ArrayType(TimestampType) +val options = Map("timestampFormat" -> "dd/MM/yyyy HH:mm") + +checkAnswer( + dfTimestamp.select(from_json($"value", schema, options)), + Row(Seq(java.sql.Timestamp.valueOf("2015-08-26 18:00:00.0"))) :: Nil) + +val dfEmpty = Seq("""[]""").toDS() +checkAnswer( + dfEmpty.select(from_json($"value", ArrayType(StringType))), + Row(Nil) :: Nil) --- End diff -- And could you also add a case that fails the parsing, like `Seq("[1]", "[2, 3]", "[string]").toDS()`?
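The failure case viirya suggests adding — an array whose element cannot be converted to the int schema — can be sketched in a self-contained Python toy (the per-record `None` result is an assumption, mimicking permissive-style bad-record handling; this is not the Spark API):

```python
import json

def from_json_int_array(text):
    """Toy from_json with an array-of-ints schema: parse a JSON string
    to a list of ints, or None when the record cannot be converted."""
    try:
        values = json.loads(text)
        return [int(v) for v in values]
    except (ValueError, TypeError):
        return None  # bad record: invalid JSON or non-int element

records = ['[1]', '[2, 3]', '["string"]']
print([from_json_int_array(r) for r in records])  # [[1], [2, 3], None]
```

The last record parses as JSON but fails the int conversion, which is exactly the kind of case the suggested test would pin down.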
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144757530 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala --- @@ -170,6 +160,31 @@ class JsonFunctionsSuite extends QueryTest with SharedSQLContext { Row(Row(1, "haa")) :: Nil) } + test("SPARK-22228: from_json should support also arrays of primitive types") { +val dfInt = Seq("[1]", "[2, 3]").toDS() --- End diff -- Add a `[]` into the data?
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144757046 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala --- @@ -343,6 +367,25 @@ class JacksonParser( record: T, createParser: (JsonFactory, T) => JsonParser, recordLiteral: T => UTF8String): Seq[InternalRow] = { +parseWithArrayOfPrimitiveSupport(record, createParser, recordLiteral) match { + case rows: Seq[InternalRow] => rows + case _: Seq[_] => throw BadRecordException(() => recordLiteral(record), () => None, +new RuntimeException("Conversion of array of primitive data is not yet supported here.")) --- End diff -- This exception looks a bit weird. How about `` `parse` is only used to parse the JSON input to a set of `InternalRow`s. Use `parseWithArrayOfPrimitiveSupport` when parsing an array of primitive data is needed``?
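The distinction the reviewers debate here — `parse` must yield rows, while the new entry point may yield bare values — can be sketched with a small, self-contained Python toy (hypothetical names mirroring the Scala methods; not the actual Spark API):

```python
import json

def parse_with_array_support(record, schema_is_struct):
    """Toy parser: returns 'rows' (tuples of field items) for a struct
    schema, or bare primitive values for an array-of-primitives schema."""
    data = json.loads(record)
    if schema_is_struct:
        objs = data if isinstance(data, list) else [data]
        return [tuple(sorted(o.items())) for o in objs]
    return data  # list of primitives

def parse(record, schema_is_struct):
    """Row-only entry point: fails fast when the schema would produce
    primitives instead of rows, as the reviewed code does."""
    result = parse_with_array_support(record, schema_is_struct)
    if not all(isinstance(r, tuple) for r in result):
        raise RuntimeError(
            "parse only supports struct schemas; use "
            "parse_with_array_support for arrays of primitive data")
    return result
```

With this split, a caller that constructs the parser with an array schema but still calls `parse` gets an immediate, descriptive error instead of a silently wrong result — which is the behavior the two reviewers are converging on.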
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144755348 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala --- @@ -89,6 +95,24 @@ class JacksonParser( /** * Create a converter which converts the JSON documents held by the `JsonParser` + * to a value according to a desired schema. This is an overloaded method to the + * previous one which allows to handle array of primitive types. + */ + private def makeRootConverter(at: ArrayType): JsonParser => Seq[Any] = { +(parser: JsonParser) => parseJsonToken[Seq[Any]](parser, at) { + case START_ARRAY => +val array = convertArray(parser, makeConverter(at.elementType)) +if (array.numElements() == 0) { + Nil +} else { + array.toArray(at.elementType).toSeq +} + case _ => Nil --- End diff -- Should we return `Nil` when the input is not parsed to an array? The original `makeRootConverter` didn't do this.
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144754632 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala --- @@ -35,19 +35,25 @@ import org.apache.spark.util.Utils /** * Constructs a parser for a given schema that translates a json string to an [[InternalRow]]. */ -class JacksonParser( -schema: StructType, +private[sql] class JacksonParser( +schema: DataType, val options: JSONOptions) extends Logging { import JacksonUtils._ import com.fasterxml.jackson.core.JsonToken._ + def this(schema: StructType, options: JSONOptions) = this(schema: DataType, options) + def this(schema: ArrayType, options: JSONOptions) = this(schema: DataType, options) + // A `ValueConverter` is responsible for converting a value from `JsonParser` // to a value in a field for `InternalRow`. private type ValueConverter = JsonParser => AnyRef // `ValueConverter`s for the root schema for all fields in the schema - private val rootConverter = makeRootConverter(schema) + private val rootConverter = schema match { +case s: StructType => makeRootConverter(s) --- End diff -- It is kind of easy to get confused. Please add a comment to each case, like: ```scala private val rootConverter = schema match { case s: StructType => makeRootConverter(s) // For struct or array of struct. case a: ArrayType => makeRootConverter(a) // For array of primitive types. } ``` ---
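The schema-based dispatch viirya asks to document — one root converter for struct schemas, another for arrays of primitives — can be illustrated with a self-contained Python sketch (toy schema encoding and names are assumptions; the real code is Scala inside `JacksonParser`):

```python
import json

def make_root_converter(schema):
    """Pick a root converter based on the schema kind, mirroring the
    Scala match on StructType vs ArrayType."""
    kind = schema["type"]
    if kind == "struct":  # struct, or array of structs: produce rows
        fields = schema["fields"]
        def convert_struct(text):
            obj = json.loads(text)
            rows = obj if isinstance(obj, list) else [obj]
            return [tuple(r.get(f) for f in fields) for r in rows]
        return convert_struct
    elif kind == "array":  # array of primitive types: produce bare values
        def convert_array(text):
            values = json.loads(text)
            if not isinstance(values, list):
                raise ValueError("expected a JSON array at the root")
            return values
        return convert_array
    raise TypeError("root schema must be a struct or an array")

conv = make_root_converter({"type": "array"})
print(conv("[1, 2, 3]"))  # [1, 2, 3]
```

The early `raise TypeError` in the toy plays the same role as the two overloaded Scala constructors: an invalid root schema is rejected at construction time rather than surfacing later as a confusing parse failure.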
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144754775 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala --- @@ -89,6 +95,24 @@ class JacksonParser( /** * Create a converter which converts the JSON documents held by the `JsonParser` + * to a value according to a desired schema. This is an overloaded method to the + * previous one which allows to handle array of primitive types. + */ + private def makeRootConverter(at: ArrayType): JsonParser => Seq[Any] = { +(parser: JsonParser) => parseJsonToken[Seq[Any]](parser, at) { + case START_ARRAY => +val array = convertArray(parser, makeConverter(at.elementType)) --- End diff -- Move `makeConverter` outside the inner function so we can reuse it.
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144754134 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala --- @@ -35,19 +35,25 @@ import org.apache.spark.util.Utils /** * Constructs a parser for a given schema that translates a json string to an [[InternalRow]]. */ -class JacksonParser( -schema: StructType, +private[sql] class JacksonParser( +schema: DataType, val options: JSONOptions) extends Logging { import JacksonUtils._ import com.fasterxml.jackson.core.JsonToken._ + def this(schema: StructType, options: JSONOptions) = this(schema: DataType, options) + def this(schema: ArrayType, options: JSONOptions) = this(schema: DataType, options) --- End diff -- Are those necessary?
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144754004 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala --- @@ -35,19 +35,25 @@ import org.apache.spark.util.Utils /** * Constructs a parser for a given schema that translates a json string to an [[InternalRow]]. --- End diff -- After this change, it doesn't always return an `InternalRow`.
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144753618 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala --- @@ -536,26 +536,31 @@ case class JsonToStructs( timeZoneId = None) override def checkInputDataTypes(): TypeCheckResult = schema match { -case _: StructType | ArrayType(_: StructType, _) => +case _: StructType | ArrayType(_: StructType | _: AtomicType, _) => super.checkInputDataTypes() case _ => TypeCheckResult.TypeCheckFailure( - s"Input schema ${schema.simpleString} must be a struct or an array of structs.") + s"Input schema ${schema.simpleString} must be a struct or " + +s"an array of structs or primitive types.") } @transient - lazy val rowSchema = schema match { + lazy val rowSchema: DataType = schema match { --- End diff -- Not a schema for a row anymore. Maybe `dataSchema`?
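The widened type check being reviewed — struct, array of structs, and now array of atomic types are accepted, everything else fails — can be mimicked with a toy Python check (the tuple-based schema encoding is an assumption for illustration only):

```python
def check_input_data_type(schema):
    """Return None if the schema is acceptable for from_json after this
    change, else an error message. Toy schemas are tuples such as
    ("struct",), ("array", ("struct",)), ("array", ("atomic", "int"))."""
    if schema[0] == "struct":
        return None  # plain struct
    if schema[0] == "array" and schema[1][0] in ("struct", "atomic"):
        return None  # array of structs, or array of primitive types
    return ("Input schema must be a struct or an array of structs "
            "or primitive types.")
```

Before this PR the second branch would only have accepted `("array", ("struct",))`; the `"atomic"` alternative is exactly the relaxation `checkInputDataTypes` gains.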
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144753534 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala --- @@ -536,26 +536,31 @@ case class JsonToStructs( timeZoneId = None) override def checkInputDataTypes(): TypeCheckResult = schema match { -case _: StructType | ArrayType(_: StructType, _) => +case _: StructType | ArrayType(_: StructType | _: AtomicType, _) => super.checkInputDataTypes() case _ => TypeCheckResult.TypeCheckFailure( - s"Input schema ${schema.simpleString} must be a struct or " + +s"an array of structs or primitive types.") } @transient - lazy val rowSchema = schema match { + lazy val rowSchema: DataType = schema match { case st: StructType => st case ArrayType(st: StructType, _) => st +case ArrayType(at: AtomicType, _) => ArrayType(at) } // This converts parsed rows to the desired output by the given schema. @transient - lazy val converter = schema match { -case _: StructType => - (rows: Seq[InternalRow]) => if (rows.length == 1) rows.head else null -case ArrayType(_: StructType, _) => - (rows: Seq[InternalRow]) => new GenericArrayData(rows) - } + lazy val converter = (rows: Seq[Any]) => --- End diff -- This adds extra pattern-matching cost at runtime. Can we move the matching outside?
GitHub user mgaido91 opened a pull request: https://github.com/apache/spark/pull/19492 [SPARK-22228][SQL] Add support for array to from_json ## What changes were proposed in this pull request? The fix introduces support for parsing arrays of primitive types in the `from_json` function. Currently, it supports only arrays of structs, which are equivalent to arrays of JSON objects, but it doesn't allow parsing arrays of simple types, like arrays of strings (eg. `["this is a valid JSON", "we cannot parse before the PR"]`). ## How was this patch tested? Added UTs. You can merge this pull request into a Git repository by running: $ git pull https://github.com/mgaido91/spark SPARK-22228 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19492.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19492 commit cba1f4accaf5dbe305e366eab33879577d7f9a93 Author: Marco Gaido Date: 2017-10-11T10:29:26Z [SPARK-22228] Add support for array to from_json
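The end-to-end behavior the PR adds — `from_json` over arrays of ints, strings, timestamps (with a `timestampFormat` option), and the empty array — can be approximated with a stand-in function (a toy, not the Spark API; real usage is `from_json($"value", ArrayType(...), options)`):

```python
import json
from datetime import datetime

def from_json_array(text, element_type, timestamp_format="%d/%m/%Y %H:%M"):
    """Toy stand-in for from_json with an ArrayType schema whose element
    type is 'int', 'string', or 'timestamp' (names are assumptions)."""
    values = json.loads(text)
    if element_type == "int":
        return [int(v) for v in values]
    if element_type == "timestamp":
        # strptime plays the role of the timestampFormat option
        return [datetime.strptime(v, timestamp_format) for v in values]
    return [str(v) for v in values]

print(from_json_array('["hello", "world", ""]', "string"))
print(from_json_array('["26/08/2015 18:00"]', "timestamp"))
print(from_json_array('[]', "string"))  # an empty array yields []
```

These three calls correspond one-to-one with the `dfString`, `dfTimestamp`, and `dfEmpty` cases in the suite added by the PR.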