[GitHub] spark pull request #19492: [SPARK-22228][SQL] Add support for array
Github user mgaido91 closed the pull request at: https://github.com/apache/spark/pull/19492 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144837336 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala --- @@ -343,6 +367,25 @@ class JacksonParser( record: T, createParser: (JsonFactory, T) => JsonParser, recordLiteral: T => UTF8String): Seq[InternalRow] = { +parseWithArrayOfPrimitiveSupport(record, createParser, recordLiteral) match { + case rows: Seq[InternalRow] => rows + case _: Seq[_] => throw BadRecordException(() => recordLiteral(record), () => None, +new RuntimeException("Conversion of array of primitive data is not yet supported here.")) --- End diff -- yes, you are right, since we have only two constructors which enforce this pattern. I will then edit the exception message according to your suggestion, thanks.
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144837359 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala --- @@ -343,6 +368,25 @@ class JacksonParser( record: T, createParser: (JsonFactory, T) => JsonParser, recordLiteral: T => UTF8String): Seq[InternalRow] = { +parseWithArrayOfPrimitiveSupport(record, createParser, recordLiteral) match { + case rows: Seq[InternalRow] => rows + case _: Seq[_] => throw BadRecordException(() => recordLiteral(record), () => None, +new RuntimeException("Conversion of array of primitive data is not yet supported here.")) +} + } + + /** + * Parse the JSON input. This function can return a set of [[InternalRow]]s + * if a [[StructType]] is defined as schema, otherwise it returns a set of + * objects. --- End diff -- Btw, please also add a comment to the existing `parse` to clarify its usage. It might not be easy to know which one to call at first glance.
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144837080 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala --- @@ -343,6 +368,25 @@ class JacksonParser( record: T, createParser: (JsonFactory, T) => JsonParser, recordLiteral: T => UTF8String): Seq[InternalRow] = { +parseWithArrayOfPrimitiveSupport(record, createParser, recordLiteral) match { + case rows: Seq[InternalRow] => rows + case _: Seq[_] => throw BadRecordException(() => recordLiteral(record), () => None, +new RuntimeException("Conversion of array of primitive data is not yet supported here.")) +} + } + + /** + * Parse the JSON input. This function can return a set of [[InternalRow]]s + * if a [[StructType]] is defined as schema, otherwise it returns a set of + * objects. --- End diff -- Please add a comment that this is used when passing an `ArrayType` of primitive types.
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144837088 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala --- @@ -343,6 +367,25 @@ class JacksonParser( record: T, createParser: (JsonFactory, T) => JsonParser, recordLiteral: T => UTF8String): Seq[InternalRow] = { +parseWithArrayOfPrimitiveSupport(record, createParser, recordLiteral) match { + case rows: Seq[InternalRow] => rows + case _: Seq[_] => throw BadRecordException(() => recordLiteral(record), () => None, +new RuntimeException("Conversion of array of primitive data is not yet supported here.")) --- End diff -- Anyway, we can keep it as it is. I don't feel strongly about changing it.
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144836485 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala --- @@ -35,19 +35,25 @@ import org.apache.spark.util.Utils /** * Constructs a parser for a given schema that translates a json string to an [[InternalRow]]. */ -class JacksonParser( -schema: StructType, +private[sql] class JacksonParser( +schema: DataType, val options: JSONOptions) extends Logging { import JacksonUtils._ import com.fasterxml.jackson.core.JsonToken._ + def this(schema: StructType, options: JSONOptions) = this(schema: DataType, options) + def this(schema: ArrayType, options: JSONOptions) = this(schema: DataType, options) --- End diff -- you are right, I am fixing it, thanks.
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144835433 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala --- @@ -343,6 +367,25 @@ class JacksonParser( record: T, createParser: (JsonFactory, T) => JsonParser, recordLiteral: T => UTF8String): Seq[InternalRow] = { +parseWithArrayOfPrimitiveSupport(record, createParser, recordLiteral) match { + case rows: Seq[InternalRow] => rows + case _: Seq[_] => throw BadRecordException(() => recordLiteral(record), () => None, +new RuntimeException("Conversion of array of primitive data is not yet supported here.")) --- End diff -- To clarify: I think the only way we can throw this exception is to pass an `ArrayType` into the `JacksonParser` constructor and then call `parse` instead of `parseWithArrayOfPrimitiveSupport`. Because `JacksonParser` is internally used, I assume this usage is intentional and the developer will get the exception right away. So I don't think this exception can be seen by an end user, unless we ship such broken code to users in Spark releases.
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144834116 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala --- @@ -35,19 +35,25 @@ import org.apache.spark.util.Utils /** * Constructs a parser for a given schema that translates a json string to an [[InternalRow]]. */ -class JacksonParser( -schema: StructType, +private[sql] class JacksonParser( -schema: StructType, +schema: DataType, val options: JSONOptions) extends Logging { import JacksonUtils._ import com.fasterxml.jackson.core.JsonToken._ + def this(schema: StructType, options: JSONOptions) = this(schema: DataType, options) + def this(schema: ArrayType, options: JSONOptions) = this(schema: DataType, options) --- End diff -- If so, then I think the default constructor should be private?
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144792285 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala --- @@ -536,26 +536,31 @@ case class JsonToStructs( timeZoneId = None) override def checkInputDataTypes(): TypeCheckResult = schema match { -case _: StructType | ArrayType(_: StructType, _) => +case _: StructType | ArrayType(_: StructType | _: AtomicType, _) => super.checkInputDataTypes() case _ => TypeCheckResult.TypeCheckFailure( - s"Input schema ${schema.simpleString} must be a struct or an array of structs.") + s"Input schema ${schema.simpleString} must be a struct or " + +s"an array of structs or primitive types.") } @transient - lazy val rowSchema = schema match { + lazy val rowSchema: DataType = schema match { --- End diff -- I can't understand the difference. Anyway, if you think `dataSchema` is more appropriate, I can rename it.
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144790680 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala --- @@ -536,26 +536,31 @@ case class JsonToStructs( timeZoneId = None) override def checkInputDataTypes(): TypeCheckResult = schema match { -case _: StructType | ArrayType(_: StructType, _) => +case _: StructType | ArrayType(_: StructType | _: AtomicType, _) => super.checkInputDataTypes() case _ => TypeCheckResult.TypeCheckFailure( - s"Input schema ${schema.simpleString} must be a struct or an array of structs.") + s"Input schema ${schema.simpleString} must be a struct or " + +s"an array of structs or primitive types.") } @transient - lazy val rowSchema = schema match { + lazy val rowSchema: DataType = schema match { --- End diff -- I think it was named `rowSchema` because it could only be a `StructType` before. This is not the input/output row's schema.
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144789703 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala --- @@ -343,6 +367,25 @@ class JacksonParser( record: T, createParser: (JsonFactory, T) => JsonParser, recordLiteral: T => UTF8String): Seq[InternalRow] = { +parseWithArrayOfPrimitiveSupport(record, createParser, recordLiteral) match { + case rows: Seq[InternalRow] => rows + case _: Seq[_] => throw BadRecordException(() => recordLiteral(record), () => None, +new RuntimeException("Conversion of array of primitive data is not yet supported here.")) --- End diff -- Yes, it is internally used, but if it throws an `Exception` a user might see it. I think that if this is not clear for internal usage we can add comments, but the text of the exception should be meaningful to the end user IMHO. If you have any suggestion about how to improve this message while keeping it meaningful to a user, I am happy to change it. Thanks.
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144785808 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala --- @@ -343,6 +367,25 @@ class JacksonParser( record: T, createParser: (JsonFactory, T) => JsonParser, recordLiteral: T => UTF8String): Seq[InternalRow] = { +parseWithArrayOfPrimitiveSupport(record, createParser, recordLiteral) match { + case rows: Seq[InternalRow] => rows + case _: Seq[_] => throw BadRecordException(() => recordLiteral(record), () => None, +new RuntimeException("Conversion of array of primitive data is not yet supported here.")) --- End diff -- What I thought is that `JacksonParser` is internally used in Spark SQL. It is hard to imagine an end user directly using `parse` and seeing this exception. Actually `parse` is supposed to return `InternalRow`s; the only case where we get something else is when the schema given to `JacksonParser` is wrong. So I expect this exception to be seen only during internal SQL development. Btw, I have no strong opinion at this point; if you think it is ok, I'm fine with it.
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144784037 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala --- @@ -536,26 +536,31 @@ case class JsonToStructs( timeZoneId = None) override def checkInputDataTypes(): TypeCheckResult = schema match { -case _: StructType | ArrayType(_: StructType, _) => +case _: StructType | ArrayType(_: StructType | _: AtomicType, _) => super.checkInputDataTypes() case _ => TypeCheckResult.TypeCheckFailure( - s"Input schema ${schema.simpleString} must be a struct or an array of structs.") + s"Input schema ${schema.simpleString} must be a struct or " + +s"an array of structs or primitive types.") } @transient - lazy val rowSchema = schema match { + lazy val rowSchema: DataType = schema match { --- End diff -- Why is it not a row schema? It is, but sometimes the schema of a row is an array.
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144783859 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala --- @@ -35,19 +35,25 @@ import org.apache.spark.util.Utils /** * Constructs a parser for a given schema that translates a json string to an [[InternalRow]]. */ -class JacksonParser( -schema: StructType, +private[sql] class JacksonParser( +schema: DataType, val options: JSONOptions) extends Logging { import JacksonUtils._ import com.fasterxml.jackson.core.JsonToken._ + def this(schema: StructType, options: JSONOptions) = this(schema: DataType, options) + def this(schema: ArrayType, options: JSONOptions) = this(schema: DataType, options) --- End diff -- These are to prevent someone from using the constructor with an invalid `DataType`, i.e. anything which is not a `StructType` or an `ArrayType`.
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144783499 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala --- @@ -170,6 +160,31 @@ class JsonFunctionsSuite extends QueryTest with SharedSQLContext { Row(Row(1, "haa")) :: Nil) } + test("SPARK-22228: from_json should support also arrays of primitive types") { +val dfInt = Seq("[1]", "[2, 3]").toDS() +checkAnswer( + dfInt.select(from_json($"value", ArrayType(IntegerType))), + Row(Seq(1)) :: Row(Seq(2, 3)) :: Nil) + +val dfString = Seq("""["hello", "world", ""]""").toDS() +checkAnswer( + dfString.select(from_json($"value", ArrayType(StringType))), + Row(Seq("hello", "world", "")) :: Nil) + +val dfTimestamp = Seq("""["26/08/2015 18:00"]""").toDS() +val schema = ArrayType(TimestampType) +val options = Map("timestampFormat" -> "dd/MM/yyyy HH:mm") + +checkAnswer( + dfTimestamp.select(from_json($"value", schema, options)), + Row(Seq(java.sql.Timestamp.valueOf("2015-08-26 18:00:00.0"))) :: Nil) + +val dfEmpty = Seq("""[]""").toDS() +checkAnswer( + dfEmpty.select(from_json($"value", ArrayType(StringType))), + Row(Nil) :: Nil) --- End diff -- I will, thanks for the suggestion.
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144783459 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala --- @@ -170,6 +160,31 @@ class JsonFunctionsSuite extends QueryTest with SharedSQLContext { Row(Row(1, "haa")) :: Nil) } + test("SPARK-22228: from_json should support also arrays of primitive types") { +val dfInt = Seq("[1]", "[2, 3]").toDS() --- End diff -- This case is tested a few lines below. I preferred to treat each specific case separately, so that a failure points out very easily which case is not handled properly.
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144783193 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala --- @@ -343,6 +367,25 @@ class JacksonParser( record: T, createParser: (JsonFactory, T) => JsonParser, recordLiteral: T => UTF8String): Seq[InternalRow] = { +parseWithArrayOfPrimitiveSupport(record, createParser, recordLiteral) match { + case rows: Seq[InternalRow] => rows + case _: Seq[_] => throw BadRecordException(() => recordLiteral(record), () => None, +new RuntimeException("Conversion of array of primitive data is not yet supported here.")) --- End diff -- What about a user seeing this exception? With the current description (which I am very open to improving), he/she is aware of trying to do something which is not allowed (at least at the moment), i.e. we might hit this exception when using `sqlContext.read.json(...)` on arrays of primitives. Your suggested description would be a bit weird to a user: he/she might feel he/she is doing something wrong to achieve something which can actually be done, but of course he/she knows nothing about these internal functions, so he/she would be lost IMHO.
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144782180 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala --- @@ -89,6 +95,24 @@ class JacksonParser( /** * Create a converter which converts the JSON documents held by the `JsonParser` + * to a value according to a desired schema. This is an overloaded method to the + * previous one which allows to handle array of primitive types. + */ + private def makeRootConverter(at: ArrayType): JsonParser => Seq[Any] = { +(parser: JsonParser) => parseJsonToken[Seq[Any]](parser, at) { + case START_ARRAY => +val array = convertArray(parser, makeConverter(at.elementType)) +if (array.numElements() == 0) { + Nil +} else { + array.toArray(at.elementType).toSeq +} + case _ => Nil --- End diff -- You are right, I will remove this.
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144757908 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala --- @@ -170,6 +160,31 @@ class JsonFunctionsSuite extends QueryTest with SharedSQLContext { Row(Row(1, "haa")) :: Nil) } + test("SPARK-22228: from_json should support also arrays of primitive types") { +val dfInt = Seq("[1]", "[2, 3]").toDS() +checkAnswer( + dfInt.select(from_json($"value", ArrayType(IntegerType))), + Row(Seq(1)) :: Row(Seq(2, 3)) :: Nil) + +val dfString = Seq("""["hello", "world", ""]""").toDS() +checkAnswer( + dfString.select(from_json($"value", ArrayType(StringType))), + Row(Seq("hello", "world", "")) :: Nil) + +val dfTimestamp = Seq("""["26/08/2015 18:00"]""").toDS() +val schema = ArrayType(TimestampType) +val options = Map("timestampFormat" -> "dd/MM/yyyy HH:mm") + +checkAnswer( + dfTimestamp.select(from_json($"value", schema, options)), + Row(Seq(java.sql.Timestamp.valueOf("2015-08-26 18:00:00.0"))) :: Nil) + +val dfEmpty = Seq("""[]""").toDS() +checkAnswer( + dfEmpty.select(from_json($"value", ArrayType(StringType))), + Row(Nil) :: Nil) --- End diff -- And could you also add a case that fails the parsing, like `Seq("[1]", "[2, 3]", "[string]").toDS()`?
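The failure case viirya suggests adding — an array whose element cannot be converted to the int schema — can be sketched in a self-contained Python toy (the per-record `None` result is an assumption, mimicking permissive-style bad-record handling; this is not the Spark API):

```python
import json

def from_json_int_array(text):
    """Toy from_json with an array-of-ints schema: parse a JSON string
    to a list of ints, or None when the record cannot be converted."""
    try:
        values = json.loads(text)
        return [int(v) for v in values]
    except (ValueError, TypeError):
        return None  # bad record: invalid JSON or non-int element

records = ['[1]', '[2, 3]', '["string"]']
print([from_json_int_array(r) for r in records])  # [[1], [2, 3], None]
```

The last record parses as JSON but fails the int conversion, which is exactly the kind of case the suggested test would pin down.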
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144757530 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala --- @@ -170,6 +160,31 @@ class JsonFunctionsSuite extends QueryTest with SharedSQLContext { Row(Row(1, "haa")) :: Nil) } + test("SPARK-22228: from_json should support also arrays of primitive types") { +val dfInt = Seq("[1]", "[2, 3]").toDS() --- End diff -- Add a `[]` into the data?
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144757046 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala --- @@ -343,6 +367,25 @@ class JacksonParser( record: T, createParser: (JsonFactory, T) => JsonParser, recordLiteral: T => UTF8String): Seq[InternalRow] = { +parseWithArrayOfPrimitiveSupport(record, createParser, recordLiteral) match { + case rows: Seq[InternalRow] => rows + case _: Seq[_] => throw BadRecordException(() => recordLiteral(record), () => None, +new RuntimeException("Conversion of array of primitive data is not yet supported here.")) --- End diff -- This exception looks a bit weird. How about `` `parse` is only used to parse the JSON input to a set of `InternalRow`s. Use `parseWithArrayOfPrimitiveSupport` when parsing an array of primitive data is needed``?
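The distinction the reviewers debate here — `parse` must yield rows, while the new entry point may yield bare values — can be sketched with a small, self-contained Python toy (hypothetical names mirroring the Scala methods; not the actual Spark API):

```python
import json

def parse_with_array_support(record, schema_is_struct):
    """Toy parser: returns 'rows' (tuples of field items) for a struct
    schema, or bare primitive values for an array-of-primitives schema."""
    data = json.loads(record)
    if schema_is_struct:
        objs = data if isinstance(data, list) else [data]
        return [tuple(sorted(o.items())) for o in objs]
    return data  # list of primitives

def parse(record, schema_is_struct):
    """Row-only entry point: fails fast when the schema would produce
    primitives instead of rows, as the reviewed code does."""
    result = parse_with_array_support(record, schema_is_struct)
    if not all(isinstance(r, tuple) for r in result):
        raise RuntimeError(
            "parse only supports struct schemas; use "
            "parse_with_array_support for arrays of primitive data")
    return result
```

With this split, a caller that constructs the parser with an array schema but still calls `parse` gets an immediate, descriptive error instead of a silently wrong result — which is the behavior the two reviewers are converging on.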
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144755348 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala --- @@ -89,6 +95,24 @@ class JacksonParser( /** * Create a converter which converts the JSON documents held by the `JsonParser` + * to a value according to a desired schema. This is an overloaded method to the + * previous one which allows to handle array of primitive types. + */ + private def makeRootConverter(at: ArrayType): JsonParser => Seq[Any] = { +(parser: JsonParser) => parseJsonToken[Seq[Any]](parser, at) { + case START_ARRAY => +val array = convertArray(parser, makeConverter(at.elementType)) +if (array.numElements() == 0) { + Nil +} else { + array.toArray(at.elementType).toSeq +} + case _ => Nil --- End diff -- Should we return `Nil` when the input is not parsed to an array? The original `makeRootConverter` didn't do this.
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144754632 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala --- @@ -35,19 +35,25 @@ import org.apache.spark.util.Utils /** * Constructs a parser for a given schema that translates a json string to an [[InternalRow]]. */ -class JacksonParser( -schema: StructType, +private[sql] class JacksonParser( +schema: DataType, val options: JSONOptions) extends Logging { import JacksonUtils._ import com.fasterxml.jackson.core.JsonToken._ + def this(schema: StructType, options: JSONOptions) = this(schema: DataType, options) + def this(schema: ArrayType, options: JSONOptions) = this(schema: DataType, options) + // A `ValueConverter` is responsible for converting a value from `JsonParser` // to a value in a field for `InternalRow`. private type ValueConverter = JsonParser => AnyRef // `ValueConverter`s for the root schema for all fields in the schema - private val rootConverter = makeRootConverter(schema) + private val rootConverter = schema match { +case s: StructType => makeRootConverter(s) --- End diff -- It is kind of easy to get confused. Please add a comment to each case, like: ```scala private val rootConverter = schema match { case s: StructType => makeRootConverter(s) // For struct or array of struct. case a: ArrayType => makeRootConverter(a) // For array of primitive types. } ``` ---
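The schema-based dispatch viirya asks to document — one root converter for struct schemas, another for arrays of primitives — can be illustrated with a self-contained Python sketch (toy schema encoding and names are assumptions; the real code is Scala inside `JacksonParser`):

```python
import json

def make_root_converter(schema):
    """Pick a root converter based on the schema kind, mirroring the
    Scala match on StructType vs ArrayType."""
    kind = schema["type"]
    if kind == "struct":  # struct, or array of structs: produce rows
        fields = schema["fields"]
        def convert_struct(text):
            obj = json.loads(text)
            rows = obj if isinstance(obj, list) else [obj]
            return [tuple(r.get(f) for f in fields) for r in rows]
        return convert_struct
    elif kind == "array":  # array of primitive types: produce bare values
        def convert_array(text):
            values = json.loads(text)
            if not isinstance(values, list):
                raise ValueError("expected a JSON array at the root")
            return values
        return convert_array
    raise TypeError("root schema must be a struct or an array")

conv = make_root_converter({"type": "array"})
print(conv("[1, 2, 3]"))  # [1, 2, 3]
```

The early `raise TypeError` in the toy plays the same role as the two overloaded Scala constructors: an invalid root schema is rejected at construction time rather than surfacing later as a confusing parse failure.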
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144754775 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala --- @@ -89,6 +95,24 @@ class JacksonParser( /** * Create a converter which converts the JSON documents held by the `JsonParser` + * to a value according to a desired schema. This is an overloaded method to the + * previous one which allows to handle array of primitive types. + */ + private def makeRootConverter(at: ArrayType): JsonParser => Seq[Any] = { +(parser: JsonParser) => parseJsonToken[Seq[Any]](parser, at) { + case START_ARRAY => +val array = convertArray(parser, makeConverter(at.elementType)) --- End diff -- Move `makeConverter` outside the inner function so we can reuse it.
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144754134 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala --- @@ -35,19 +35,25 @@ import org.apache.spark.util.Utils /** * Constructs a parser for a given schema that translates a json string to an [[InternalRow]]. */ -class JacksonParser( -schema: StructType, +private[sql] class JacksonParser( +schema: DataType, val options: JSONOptions) extends Logging { import JacksonUtils._ import com.fasterxml.jackson.core.JsonToken._ + def this(schema: StructType, options: JSONOptions) = this(schema: DataType, options) + def this(schema: ArrayType, options: JSONOptions) = this(schema: DataType, options) --- End diff -- Are those necessary?
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144754004 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala --- @@ -35,19 +35,25 @@ import org.apache.spark.util.Utils /** * Constructs a parser for a given schema that translates a json string to an [[InternalRow]]. --- End diff -- After this change, it doesn't always return an `InternalRow`.
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144753618 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala --- @@ -536,26 +536,31 @@ case class JsonToStructs( timeZoneId = None) override def checkInputDataTypes(): TypeCheckResult = schema match { -case _: StructType | ArrayType(_: StructType, _) => +case _: StructType | ArrayType(_: StructType | _: AtomicType, _) => super.checkInputDataTypes() case _ => TypeCheckResult.TypeCheckFailure( - s"Input schema ${schema.simpleString} must be a struct or an array of structs.") + s"Input schema ${schema.simpleString} must be a struct or " + +s"an array of structs or primitive types.") } @transient - lazy val rowSchema = schema match { + lazy val rowSchema: DataType = schema match { --- End diff -- Not a schema for a row anymore. Maybe `dataSchema`?
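The widened type check being reviewed — struct, array of structs, and now array of atomic types are accepted, everything else fails — can be mimicked with a toy Python check (the tuple-based schema encoding is an assumption for illustration only):

```python
def check_input_data_type(schema):
    """Return None if the schema is acceptable for from_json after this
    change, else an error message. Toy schemas are tuples such as
    ("struct",), ("array", ("struct",)), ("array", ("atomic", "int"))."""
    if schema[0] == "struct":
        return None  # plain struct
    if schema[0] == "array" and schema[1][0] in ("struct", "atomic"):
        return None  # array of structs, or array of primitive types
    return ("Input schema must be a struct or an array of structs "
            "or primitive types.")
```

Before this PR the second branch would only have accepted `("array", ("struct",))`; the `"atomic"` alternative is exactly the relaxation `checkInputDataTypes` gains.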
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19492#discussion_r144753534 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala --- @@ -536,26 +536,31 @@ case class JsonToStructs( timeZoneId = None) override def checkInputDataTypes(): TypeCheckResult = schema match { -case _: StructType | ArrayType(_: StructType, _) => +case _: StructType | ArrayType(_: StructType | _: AtomicType, _) => super.checkInputDataTypes() case _ => TypeCheckResult.TypeCheckFailure( - s"Input schema ${schema.simpleString} must be a struct or " + +s"an array of structs or primitive types.") } @transient - lazy val rowSchema = schema match { + lazy val rowSchema: DataType = schema match { case st: StructType => st case ArrayType(st: StructType, _) => st +case ArrayType(at: AtomicType, _) => ArrayType(at) } // This converts parsed rows to the desired output by the given schema. @transient - lazy val converter = schema match { -case _: StructType => - (rows: Seq[InternalRow]) => if (rows.length == 1) rows.head else null -case ArrayType(_: StructType, _) => - (rows: Seq[InternalRow]) => new GenericArrayData(rows) - } + lazy val converter = (rows: Seq[Any]) => --- End diff -- This adds extra pattern-matching cost at runtime. Can we move the matching outside?
GitHub user mgaido91 opened a pull request: https://github.com/apache/spark/pull/19492 [SPARK-22228][SQL] Add support for array to from_json ## What changes were proposed in this pull request? The fix introduces support for parsing arrays of primitive types in the `from_json` function. Currently, it supports only arrays of structs, which are equivalent to arrays of JSON objects, but it doesn't allow parsing arrays of simple types, like arrays of strings (eg. `["this is a valid JSON", "we cannot parse before the PR"]`). ## How was this patch tested? Added UTs. You can merge this pull request into a Git repository by running: $ git pull https://github.com/mgaido91/spark SPARK-22228 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19492.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19492 commit cba1f4accaf5dbe305e366eab33879577d7f9a93 Author: Marco Gaido Date: 2017-10-11T10:29:26Z [SPARK-22228] Add support for array to from_json
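The end-to-end behavior the PR adds — `from_json` over arrays of ints, strings, timestamps (with a `timestampFormat` option), and the empty array — can be approximated with a stand-in function (a toy, not the Spark API; real usage is `from_json($"value", ArrayType(...), options)`):

```python
import json
from datetime import datetime

def from_json_array(text, element_type, timestamp_format="%d/%m/%Y %H:%M"):
    """Toy stand-in for from_json with an ArrayType schema whose element
    type is 'int', 'string', or 'timestamp' (names are assumptions)."""
    values = json.loads(text)
    if element_type == "int":
        return [int(v) for v in values]
    if element_type == "timestamp":
        # strptime plays the role of the timestampFormat option
        return [datetime.strptime(v, timestamp_format) for v in values]
    return [str(v) for v in values]

print(from_json_array('["hello", "world", ""]', "string"))
print(from_json_array('["26/08/2015 18:00"]', "timestamp"))
print(from_json_array('[]', "string"))  # an empty array yields []
```

These three calls correspond one-to-one with the `dfString`, `dfTimestamp`, and `dfEmpty` cases in the suite added by the PR.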