[GitHub] spark pull request #16929: [SPARK-19595][SQL] Support json array in from_jso...

2017-03-05 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/16929





[GitHub] spark pull request #16929: [SPARK-19595][SQL] Support json array in from_jso...

2017-03-05 Thread brkyvz
Github user brkyvz commented on a diff in the pull request:

https://github.com/apache/spark/pull/16929#discussion_r104326605
  
--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala ---
@@ -372,6 +372,62 @@ class JsonExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper {
     )
   }
 
+  test("from_json - input=array, schema=array, output=array") {
--- End diff --

these are great! thanks!





[GitHub] spark pull request #16929: [SPARK-19595][SQL] Support json array in from_jso...

2017-03-03 Thread brkyvz
Github user brkyvz commented on a diff in the pull request:

https://github.com/apache/spark/pull/16929#discussion_r104278223
  
--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala ---
@@ -372,6 +372,58 @@ class JsonExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper {
     )
   }
 
+  test("from_json - array") {
+    val schema = ArrayType(StructType(StructField("a", IntegerType) :: Nil))
+
+    // json array: `Array(Row(...), ...)`
+    val jsonData1 = """[{"a": 1}, {"a": 2}]"""
+    val expected =
+      InternalRow.fromSeq(1 :: Nil) ::
+      InternalRow.fromSeq(2 :: Nil) :: Nil
+    checkEvaluation(JsonToStruct(
+      schema, Map.empty, Literal(jsonData1), gmtId), expected)
+
+    // json object: `Array(Row(...))`
+    val jsonData2 = """{"a": 1}"""
--- End diff --

I would make each example a separate test. This way it's easier to figure 
out what breaks later.

e.g.
  1. `from_json - input=array, schema=array, output=array`
  2. `from_json - input=object, schema=array, output=array of single object`
  3. `from_json - input=empty json array, schema=array, output=empty array`
...
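
For instance, a rough sketch of how the first two split-out tests might look (reusing the `schema`, `gmtId`, and `checkEvaluation` helpers already in this suite):

```scala
test("from_json - input=array, schema=array, output=array") {
  val schema = ArrayType(StructType(StructField("a", IntegerType) :: Nil))
  val input = """[{"a": 1}, {"a": 2}]"""
  val output = InternalRow.fromSeq(1 :: Nil) :: InternalRow.fromSeq(2 :: Nil) :: Nil
  checkEvaluation(JsonToStruct(schema, Map.empty, Literal(input), gmtId), output)
}

test("from_json - input=object, schema=array, output=array of single object") {
  val schema = ArrayType(StructType(StructField("a", IntegerType) :: Nil))
  val input = """{"a": 1}"""
  val output = InternalRow.fromSeq(1 :: Nil) :: Nil
  checkEvaluation(JsonToStruct(schema, Map.empty, Literal(input), gmtId), output)
}
```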






[GitHub] spark pull request #16929: [SPARK-19595][SQL] Support json array in from_jso...

2017-03-03 Thread brkyvz
Github user brkyvz commented on a diff in the pull request:

https://github.com/apache/spark/pull/16929#discussion_r104278176
  
--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala ---
@@ -372,6 +372,58 @@ class JsonExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper {
     )
   }
 
+  test("from_json - array") {
+    val schema = ArrayType(StructType(StructField("a", IntegerType) :: Nil))
+
+    // json array: `Array(Row(...), ...)`
+    val jsonData1 = """[{"a": 1}, {"a": 2}]"""
+    val expected =
+      InternalRow.fromSeq(1 :: Nil) ::
+      InternalRow.fromSeq(2 :: Nil) :: Nil
+    checkEvaluation(JsonToStruct(
+      schema, Map.empty, Literal(jsonData1), gmtId), expected)
--- End diff --

Could you put the input and the expected output on separate lines for readability, please?
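
Something like this, for instance:

```scala
checkEvaluation(
  JsonToStruct(schema, Map.empty, Literal(jsonData1), gmtId),
  expected)
```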





[GitHub] spark pull request #16929: [SPARK-19595][SQL] Support json array in from_jso...

2017-03-03 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/16929#discussion_r104253528
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
@@ -480,23 +480,45 @@ case class JsonTuple(children: Seq[Expression])
 }
 
 /**
- * Converts an json input string to a [[StructType]] with the specified schema.
+ * Converts an json input string to a [[StructType]] or [[ArrayType]] with the specified schema.
  */
 case class JsonToStruct(
-    schema: StructType,
+    schema: DataType,
     options: Map[String, String],
     child: Expression,
     timeZoneId: Option[String] = None)
   extends UnaryExpression with TimeZoneAwareExpression with CodegenFallback with ExpectsInputTypes {
   override def nullable: Boolean = true
 
-  def this(schema: StructType, options: Map[String, String], child: Expression) =
+  def this(schema: DataType, options: Map[String, String], child: Expression) =
     this(schema, options, child, None)
 
+  override def checkInputDataTypes(): TypeCheckResult = schema match {
+    case _: StructType | ArrayType(_: StructType, _) =>
+      super.checkInputDataTypes()
+    case _ => TypeCheckResult.TypeCheckFailure(
+      s"Input schema ${schema.simpleString} must be a struct or an array of structs.")
+  }
+
+  @transient
+  lazy val rowSchema = schema match {
+    case st: StructType => st
+    case ArrayType(st: StructType, _) => st
+  }
+
+  // This converts parsed rows to the desired output by the given schema.
+  @transient
+  lazy val converter = schema match {
+    case _: StructType =>
+      (rows: Seq[InternalRow]) => if (rows.length == 1) rows.head else null
--- End diff --

We should list this in the release notes though (i.e. go tag the JIRA).





[GitHub] spark pull request #16929: [SPARK-19595][SQL] Support json array in from_jso...

2017-03-03 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/16929#discussion_r104253484
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
@@ -480,23 +480,45 @@ case class JsonTuple(children: Seq[Expression])
 }
 
 /**
- * Converts an json input string to a [[StructType]] with the specified schema.
+ * Converts an json input string to a [[StructType]] or [[ArrayType]] with the specified schema.
  */
 case class JsonToStruct(
-    schema: StructType,
+    schema: DataType,
     options: Map[String, String],
     child: Expression,
     timeZoneId: Option[String] = None)
   extends UnaryExpression with TimeZoneAwareExpression with CodegenFallback with ExpectsInputTypes {
   override def nullable: Boolean = true
 
-  def this(schema: StructType, options: Map[String, String], child: Expression) =
+  def this(schema: DataType, options: Map[String, String], child: Expression) =
     this(schema, options, child, None)
 
+  override def checkInputDataTypes(): TypeCheckResult = schema match {
+    case _: StructType | ArrayType(_: StructType, _) =>
+      super.checkInputDataTypes()
+    case _ => TypeCheckResult.TypeCheckFailure(
+      s"Input schema ${schema.simpleString} must be a struct or an array of structs.")
+  }
+
+  @transient
+  lazy val rowSchema = schema match {
+    case st: StructType => st
+    case ArrayType(st: StructType, _) => st
+  }
+
+  // This converts parsed rows to the desired output by the given schema.
+  @transient
+  lazy val converter = schema match {
+    case _: StructType =>
+      (rows: Seq[InternalRow]) => if (rows.length == 1) rows.head else null
--- End diff --

I'm okay breaking previous behavior because I'd call truncating an array a 
bug.
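
To illustrate the change (a hypothetical session; the column name and data are made up):

```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._

val schema = new StructType().add("a", IntegerType)
val df = Seq("""[{"a": 1}, {"a": 2}]""").toDF("json")

// 2.1.x: from_json parsed the array and silently returned only the first element.
// With this change: a struct schema against a multi-element JSON array yields null.
df.select(from_json($"json", schema)).show()
```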





[GitHub] spark pull request #16929: [SPARK-19595][SQL] Support json array in from_jso...

2017-03-03 Thread brkyvz
Github user brkyvz commented on a diff in the pull request:

https://github.com/apache/spark/pull/16929#discussion_r104253014
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
@@ -480,23 +480,45 @@ case class JsonTuple(children: Seq[Expression])
 }
 
 /**
- * Converts an json input string to a [[StructType]] with the specified schema.
+ * Converts an json input string to a [[StructType]] or [[ArrayType]] with the specified schema.
  */
 case class JsonToStruct(
-    schema: StructType,
+    schema: DataType,
     options: Map[String, String],
     child: Expression,
     timeZoneId: Option[String] = None)
   extends UnaryExpression with TimeZoneAwareExpression with CodegenFallback with ExpectsInputTypes {
   override def nullable: Boolean = true
 
-  def this(schema: StructType, options: Map[String, String], child: Expression) =
+  def this(schema: DataType, options: Map[String, String], child: Expression) =
     this(schema, options, child, None)
 
+  override def checkInputDataTypes(): TypeCheckResult = schema match {
+    case _: StructType | ArrayType(_: StructType, _) =>
+      super.checkInputDataTypes()
+    case _ => TypeCheckResult.TypeCheckFailure(
+      s"Input schema ${schema.simpleString} must be a struct or an array of structs.")
+  }
+
+  @transient
+  lazy val rowSchema = schema match {
+    case st: StructType => st
+    case ArrayType(st: StructType, _) => st
+  }
+
+  // This converts parsed rows to the desired output by the given schema.
+  @transient
+  lazy val converter = schema match {
+    case _: StructType =>
+      (rows: Seq[InternalRow]) => if (rows.length == 1) rows.head else null
--- End diff --

This breaks previous behavior; I would still return the first element. Feel free to push back. Also wondering what @marmbrus thinks.
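
For reference, the two candidate semantics for a struct schema, side by side (the names are mine; the bodies are taken from the diff above):

```scala
import org.apache.spark.sql.catalyst.InternalRow

// 2.1.x behavior: keep the first parsed row, silently truncating array inputs.
val keepFirst: Seq[InternalRow] => Any =
  rows => rows.head

// Proposed here: null unless exactly one row was parsed.
val nullOnMany: Seq[InternalRow] => Any =
  rows => if (rows.length == 1) rows.head else null
```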





[GitHub] spark pull request #16929: [SPARK-19595][SQL] Support json array in from_jso...

2017-02-27 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16929#discussion_r103386481
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
@@ -480,36 +480,79 @@ case class JsonTuple(children: Seq[Expression])
 }
 
 /**
- * Converts an json input string to a [[StructType]] with the specified schema.
+ * Converts an json input string to a [[StructType]] or [[ArrayType]] with the specified schema.
  */
 case class JsonToStruct(
-    schema: StructType,
+    schema: DataType,
     options: Map[String, String],
     child: Expression,
     timeZoneId: Option[String] = None)
   extends UnaryExpression with TimeZoneAwareExpression with CodegenFallback with ExpectsInputTypes {
   override def nullable: Boolean = true
 
-  def this(schema: StructType, options: Map[String, String], child: Expression) =
+  def this(schema: DataType, options: Map[String, String], child: Expression) =
     this(schema, options, child, None)
 
+  override def checkInputDataTypes(): TypeCheckResult = schema match {
--- End diff --

I tried several combinations with `TypeCollection`, but it doesn't seem to work.
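
My understanding of why (a sketch): `inputTypes` from `ExpectsInputTypes` only constrains the *children* of the expression, and `schema` is a constructor parameter, not a child (the only child is the JSON string column). So the schema has to be validated by hand, which is what this override does:

```scala
// The child (the JSON string) is still constrained through inputTypes...
override def inputTypes: Seq[AbstractDataType] = StringType :: Nil

// ...while the schema parameter is validated separately:
override def checkInputDataTypes(): TypeCheckResult = schema match {
  case _: StructType | ArrayType(_: StructType, _) => super.checkInputDataTypes()
  case _ => TypeCheckResult.TypeCheckFailure(
    s"Input schema ${schema.simpleString} must be a struct or an array of structs.")
}
```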





[GitHub] spark pull request #16929: [SPARK-19595][SQL] Support json array in from_jso...

2017-02-27 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16929#discussion_r103338371
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
@@ -480,36 +480,79 @@ case class JsonTuple(children: Seq[Expression])
 }
 
 /**
- * Converts an json input string to a [[StructType]] with the specified schema.
+ * Converts an json input string to a [[StructType]] or [[ArrayType]] with the specified schema.
  */
 case class JsonToStruct(
-    schema: StructType,
+    schema: DataType,
     options: Map[String, String],
     child: Expression,
     timeZoneId: Option[String] = None)
   extends UnaryExpression with TimeZoneAwareExpression with CodegenFallback with ExpectsInputTypes {
   override def nullable: Boolean = true
 
-  def this(schema: StructType, options: Map[String, String], child: Expression) =
+  def this(schema: DataType, options: Map[String, String], child: Expression) =
     this(schema, options, child, None)
 
+  override def checkInputDataTypes(): TypeCheckResult = schema match {
--- End diff --

Uh.. I thought `schema` is not the child of the expression. Let me check 
again!





[GitHub] spark pull request #16929: [SPARK-19595][SQL] Support json array in from_jso...

2017-02-27 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16929#discussion_r103337914
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
@@ -480,36 +480,79 @@ case class JsonTuple(children: Seq[Expression])
 }
 
 /**
- * Converts an json input string to a [[StructType]] with the specified schema.
+ * Converts an json input string to a [[StructType]] or [[ArrayType]] with the specified schema.
  */
 case class JsonToStruct(
-    schema: StructType,
+    schema: DataType,
     options: Map[String, String],
     child: Expression,
     timeZoneId: Option[String] = None)
   extends UnaryExpression with TimeZoneAwareExpression with CodegenFallback with ExpectsInputTypes {
   override def nullable: Boolean = true
 
-  def this(schema: StructType, options: Map[String, String], child: Expression) =
+  def this(schema: DataType, options: Map[String, String], child: Expression) =
     this(schema, options, child, None)
 
+  override def checkInputDataTypes(): TypeCheckResult = schema match {
+    case _: StructType | ArrayType(_: StructType, _) =>
+      super.checkInputDataTypes()
+    case _ => TypeCheckResult.TypeCheckFailure(
+      s"Input schema ${schema.simpleString} must be a struct or an array of structs.")
+  }
+
+  @transient
+  lazy val rowSchema = schema match {
+    case st: StructType => st
+    case ArrayType(st: StructType, _) => st
+  }
+
+  // This converts parsed rows to the desired output by the given schema.
+  @transient
+  lazy val converter = schema match {
+    case _: StructType =>
+      // These are always produced from json objects by `objectSupport` in `JacksonParser`.
+      (rows: Seq[InternalRow]) => rows.head
+
+    case ArrayType(_: StructType, _) =>
+      // These are always produced from json arrays by `arraySupport` in `JacksonParser`.
+      (rows: Seq[InternalRow]) => new GenericArrayData(rows)
+  }
+
   @transient
   lazy val parser =
     new JacksonParser(
-      schema,
-      new JSONOptions(options + ("mode" -> ParseModes.FAIL_FAST_MODE), timeZoneId.get))
+      rowSchema,
+      new JSONOptions(options + ("mode" -> ParseModes.FAIL_FAST_MODE), timeZoneId.get),
+      objectSupport = schema.isInstanceOf[StructType],
--- End diff --

> What does the input look like, and what are they specifying?

`JacksonParser.parse` takes a `StructType` and produces a `Seq[InternalRow]`. By `objectSupport` and `arraySupport` I meant JSON objects and JSON arrays: we support both as the root JSON element, and the flags were intended to produce `null` when the corresponding root type is disabled.
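
A sketch of the intended wiring (this reflects the intermediate design in this PR, not necessarily the final code):

```scala
// Which root JSON element the parser accepts is derived from the schema:
val objectSupport = schema.isInstanceOf[StructType]  // root JSON object allowed
val arraySupport = schema.isInstanceOf[ArrayType]    // root JSON array allowed

// So with schema = ArrayType(StructType(...)):
//   parse("""[{"a": 1}]""") parses normally (root array supported)
//   parse("""{"a": 1}""")   yields null     (root object disabled)
```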







[GitHub] spark pull request #16929: [SPARK-19595][SQL] Support json array in from_jso...

2017-02-27 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/16929#discussion_r103337028
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
@@ -2969,11 +2969,27 @@ object functions {
   }
 
  /**
-   * (Java-specific) Parses a column containing a JSON string into a `StructType` with the
+   * (Scala-specific) Parses a column containing a JSON array string into a `ArrayType` with the
    * specified schema. Returns `null`, in the case of an unparseable string.
    *
-   * @param e a string column containing JSON data.
-   * @param schema the schema to use when parsing the json string
+   * @param e a string column containing JSON array data.
+   * @param schema the schema to use when parsing the json array string
+   * @param options options to control how the json is parsed. accepts the same options and the
+   *                json data source.
+   *
+   * @group collection_funcs
+   * @since 2.2.0
+   */
+  def from_json(e: Column, schema: ArrayType, options: Map[String, String]): Column = withExpr {
--- End diff --

I think you can leave a bridge method in this case.
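
That is, keep the old binary signature and delegate (a sketch, not necessarily the merged shape):

```scala
// Bridge method: preserves the 2.1 signature taking a StructType...
def from_json(e: Column, schema: StructType, options: Map[String, String]): Column =
  from_json(e, schema: DataType, options)

// ...delegating to a general overload that accepts any supported schema type.
def from_json(e: Column, schema: DataType, options: Map[String, String]): Column =
  withExpr { JsonToStruct(schema, options, e.expr) }
```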





[GitHub] spark pull request #16929: [SPARK-19595][SQL] Support json array in from_jso...

2017-02-27 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16929#discussion_r103334238
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
@@ -480,36 +480,79 @@ case class JsonTuple(children: Seq[Expression])
 }
 
 /**
- * Converts an json input string to a [[StructType]] with the specified schema.
+ * Converts an json input string to a [[StructType]] or [[ArrayType]] with the specified schema.
  */
 case class JsonToStruct(
-    schema: StructType,
+    schema: DataType,
     options: Map[String, String],
     child: Expression,
     timeZoneId: Option[String] = None)
   extends UnaryExpression with TimeZoneAwareExpression with CodegenFallback with ExpectsInputTypes {
   override def nullable: Boolean = true
 
-  def this(schema: StructType, options: Map[String, String], child: Expression) =
+  def this(schema: DataType, options: Map[String, String], child: Expression) =
     this(schema, options, child, None)
 
+  override def checkInputDataTypes(): TypeCheckResult = schema match {
--- End diff --

Uh.. I thought `schema` is not a child but just a parameter. Let me check!





[GitHub] spark pull request #16929: [SPARK-19595][SQL] Support json array in from_jso...

2017-02-27 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/16929#discussion_r10990
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
@@ -2969,11 +2969,27 @@ object functions {
   }
 
  /**
-   * (Java-specific) Parses a column containing a JSON string into a `StructType` with the
+   * (Scala-specific) Parses a column containing a JSON array string into a `ArrayType` with the
    * specified schema. Returns `null`, in the case of an unparseable string.
    *
-   * @param e a string column containing JSON data.
-   * @param schema the schema to use when parsing the json string
+   * @param e a string column containing JSON array data.
+   * @param schema the schema to use when parsing the json array string
+   * @param options options to control how the json is parsed. accepts the same options and the
+   *                json data source.
+   *
+   * @group collection_funcs
+   * @since 2.2.0
+   */
+  def from_json(e: Column, schema: ArrayType, options: Map[String, String]): Column = withExpr {
--- End diff --

I thought changing `StructType` to `DataType` breaks binary compatibility, since the method signature changes and an app compiled against 2.1.0 would not run on 2.2.0.





[GitHub] spark pull request #16929: [SPARK-19595][SQL] Support json array in from_jso...

2017-02-27 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/16929#discussion_r103302035
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
@@ -480,36 +480,79 @@ case class JsonTuple(children: Seq[Expression])
 }
 
 /**
- * Converts an json input string to a [[StructType]] with the specified schema.
+ * Converts an json input string to a [[StructType]] or [[ArrayType]] with the specified schema.
  */
 case class JsonToStruct(
-    schema: StructType,
+    schema: DataType,
     options: Map[String, String],
     child: Expression,
     timeZoneId: Option[String] = None)
   extends UnaryExpression with TimeZoneAwareExpression with CodegenFallback with ExpectsInputTypes {
   override def nullable: Boolean = true
 
-  def this(schema: StructType, options: Map[String, String], child: Expression) =
+  def this(schema: DataType, options: Map[String, String], child: Expression) =
     this(schema, options, child, None)
 
+  override def checkInputDataTypes(): TypeCheckResult = schema match {
+    case _: StructType | ArrayType(_: StructType, _) =>
+      super.checkInputDataTypes()
+    case _ => TypeCheckResult.TypeCheckFailure(
+      s"Input schema ${schema.simpleString} must be a struct or an array of structs.")
+  }
+
+  @transient
+  lazy val rowSchema = schema match {
+    case st: StructType => st
+    case ArrayType(st: StructType, _) => st
+  }
+
+  // This converts parsed rows to the desired output by the given schema.
+  @transient
+  lazy val converter = schema match {
+    case _: StructType =>
+      // These are always produced from json objects by `objectSupport` in `JacksonParser`.
+      (rows: Seq[InternalRow]) => rows.head
+
+    case ArrayType(_: StructType, _) =>
+      // These are always produced from json arrays by `arraySupport` in `JacksonParser`.
+      (rows: Seq[InternalRow]) => new GenericArrayData(rows)
+  }
+
   @transient
   lazy val parser =
     new JacksonParser(
-      schema,
-      new JSONOptions(options + ("mode" -> ParseModes.FAIL_FAST_MODE), timeZoneId.get))
+      rowSchema,
+      new JSONOptions(options + ("mode" -> ParseModes.FAIL_FAST_MODE), timeZoneId.get),
+      objectSupport = schema.isInstanceOf[StructType],
--- End diff --

I'm not sure I follow.  What does the input look like, and what are they 
specifying?  I would avoid magic unless we really think users need it.





[GitHub] spark pull request #16929: [SPARK-19595][SQL] Support json array in from_jso...

2017-02-27 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/16929#discussion_r103300622
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
@@ -2969,11 +2969,27 @@ object functions {
   }
 
  /**
-   * (Java-specific) Parses a column containing a JSON string into a `StructType` with the
+   * (Scala-specific) Parses a column containing a JSON array string into a `ArrayType` with the
    * specified schema. Returns `null`, in the case of an unparseable string.
    *
-   * @param e a string column containing JSON data.
-   * @param schema the schema to use when parsing the json string
+   * @param e a string column containing JSON array data.
+   * @param schema the schema to use when parsing the json array string
+   * @param options options to control how the json is parsed. accepts the same options and the
+   *                json data source.
+   *
+   * @group collection_funcs
+   * @since 2.2.0
+   */
+  def from_json(e: Column, schema: ArrayType, options: Map[String, String]): Column = withExpr {
--- End diff --

Really, why not just support any `DataType` here?





[GitHub] spark pull request #16929: [SPARK-19595][SQL] Support json array in from_jso...

2017-02-27 Thread brkyvz
Github user brkyvz commented on a diff in the pull request:

https://github.com/apache/spark/pull/16929#discussion_r103268734
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala ---
@@ -39,7 +39,12 @@ private[sql] class SparkSQLJsonProcessingException(msg: String) extends RuntimeE
  */
 class JacksonParser(
     schema: StructType,
-    options: JSONOptions) extends Logging {
+    options: JSONOptions,
+    arraySupport: Boolean = true,
--- End diff --

as I commented above, I don't think we need this





[GitHub] spark pull request #16929: [SPARK-19595][SQL] Support json array in from_jso...

2017-02-27 Thread brkyvz
Github user brkyvz commented on a diff in the pull request:

https://github.com/apache/spark/pull/16929#discussion_r103268655
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
@@ -480,36 +480,79 @@ case class JsonTuple(children: Seq[Expression])
 }
 
 /**
- * Converts an json input string to a [[StructType]] with the specified schema.
+ * Converts an json input string to a [[StructType]] or [[ArrayType]] with the specified schema.
  */
 case class JsonToStruct(
-    schema: StructType,
+    schema: DataType,
     options: Map[String, String],
     child: Expression,
     timeZoneId: Option[String] = None)
   extends UnaryExpression with TimeZoneAwareExpression with CodegenFallback with ExpectsInputTypes {
   override def nullable: Boolean = true
 
-  def this(schema: StructType, options: Map[String, String], child: Expression) =
+  def this(schema: DataType, options: Map[String, String], child: Expression) =
     this(schema, options, child, None)
 
+  override def checkInputDataTypes(): TypeCheckResult = schema match {
+    case _: StructType | ArrayType(_: StructType, _) =>
+      super.checkInputDataTypes()
+    case _ => TypeCheckResult.TypeCheckFailure(
+      s"Input schema ${schema.simpleString} must be a struct or an array of structs.")
+  }
+
+  @transient
+  lazy val rowSchema = schema match {
+    case st: StructType => st
+    case ArrayType(st: StructType, _) => st
+  }
+
+  // This converts parsed rows to the desired output by the given schema.
+  @transient
+  lazy val converter = schema match {
+    case _: StructType =>
+      // These are always produced from json objects by `objectSupport` in `JacksonParser`.
+      (rows: Seq[InternalRow]) => rows.head
+
+    case ArrayType(_: StructType, _) =>
+      // These are always produced from json arrays by `arraySupport` in `JacksonParser`.
+      (rows: Seq[InternalRow]) => new GenericArrayData(rows)
+  }
+
   @transient
   lazy val parser =
     new JacksonParser(
-      schema,
-      new JSONOptions(options + ("mode" -> ParseModes.FAIL_FAST_MODE), timeZoneId.get))
+      rowSchema,
+      new JSONOptions(options + ("mode" -> ParseModes.FAIL_FAST_MODE), timeZoneId.get),
+      objectSupport = schema.isInstanceOf[StructType],
--- End diff --

Do you think we need `objectSupport` and `arraySupport`? I would rather not add them. If someone specifies an `ArrayType` but the row contains just an object, let's still return it as an `ArrayType`. I think users would appreciate this.
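
i.e., in terms of the suite quoted earlier (a sketch of the behavior I'd expect):

```scala
val schema = ArrayType(StructType(StructField("a", IntegerType) :: Nil))

// A JSON array input parses to a two-element array, as usual.
checkEvaluation(
  JsonToStruct(schema, Map.empty, Literal("""[{"a": 1}, {"a": 2}]"""), gmtId),
  InternalRow.fromSeq(1 :: Nil) :: InternalRow.fromSeq(2 :: Nil) :: Nil)

// A bare JSON object, instead of yielding null, comes back as a one-element array.
checkEvaluation(
  JsonToStruct(schema, Map.empty, Literal("""{"a": 1}"""), gmtId),
  InternalRow.fromSeq(1 :: Nil) :: Nil)
```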





[GitHub] spark pull request #16929: [SPARK-19595][SQL] Support json array in from_jso...

2017-02-27 Thread brkyvz
Github user brkyvz commented on a diff in the pull request:

https://github.com/apache/spark/pull/16929#discussion_r103268156
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---
@@ -480,36 +480,79 @@ case class JsonTuple(children: Seq[Expression])
 }
 
 /**
- * Converts an json input string to a [[StructType]] with the specified schema.
+ * Converts an json input string to a [[StructType]] or [[ArrayType]] with the specified schema.
  */
 case class JsonToStruct(
-    schema: StructType,
+    schema: DataType,
     options: Map[String, String],
     child: Expression,
     timeZoneId: Option[String] = None)
   extends UnaryExpression with TimeZoneAwareExpression with CodegenFallback with ExpectsInputTypes {
   override def nullable: Boolean = true
 
-  def this(schema: StructType, options: Map[String, String], child: Expression) =
+  def this(schema: DataType, options: Map[String, String], child: Expression) =
     this(schema, options, child, None)
 
+  override def checkInputDataTypes(): TypeCheckResult = schema match {
--- End diff --

why not just override:

`override def inputTypes = new TypeCollection(ArrayType, StructType) :: Nil`





[GitHub] spark pull request #16929: [SPARK-19595][SQL] Support json array in from_jso...

2017-02-27 Thread brkyvz
Github user brkyvz commented on a diff in the pull request:

https://github.com/apache/spark/pull/16929#discussion_r103262990
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
@@ -2969,11 +2969,27 @@ object functions {
   }
 
  /**
-   * (Java-specific) Parses a column containing a JSON string into a `StructType` with the
+   * (Scala-specific) Parses a column containing a JSON array string into a `ArrayType` with the
    * specified schema. Returns `null`, in the case of an unparseable string.
    *
-   * @param e a string column containing JSON data.
-   * @param schema the schema to use when parsing the json string
+   * @param e a string column containing JSON array data.
+   * @param schema the schema to use when parsing the json array string
+   * @param options options to control how the json is parsed. accepts the same options and the
+   *                json data source.
+   *
+   * @group collection_funcs
+   * @since 2.2.0
+   */
+  def from_json(e: Column, schema: ArrayType, options: Map[String, String]): Column = withExpr {
--- End diff --

Why do we need the `ArrayType`-specific methods? Can't we just change `StructType` to `DataType` and do a `require` check?
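
Something along these lines (a sketch; `withExpr` as used elsewhere in `functions.scala`):

```scala
def from_json(e: Column, schema: DataType, options: Map[String, String]): Column = {
  require(schema.isInstanceOf[StructType] || schema.isInstanceOf[ArrayType],
    s"schema must be a struct or an array of structs, got ${schema.simpleString}")
  withExpr { JsonToStruct(schema, options, e.expr) }
}
```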

