[jira] [Commented] (SPARK-23173) from_json can produce nulls for fields which are marked as non-nullable
    [ https://issues.apache.org/jira/browse/SPARK-23173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16380250#comment-16380250 ]

Apache Spark commented on SPARK-23173:
--------------------------------------

User 'mswit-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/20694

> from_json can produce nulls for fields which are marked as non-nullable
> -----------------------------------------------------------------------
>
>                 Key: SPARK-23173
>                 URL: https://issues.apache.org/jira/browse/SPARK-23173
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.1
>            Reporter: Herman van Hovell
>            Priority: Major
>              Labels: release-notes
>
> The {{from_json}} function uses a schema to convert a string into a Spark SQL
> struct. This schema can contain non-nullable fields. The underlying
> {{JsonToStructs}} expression does not check if a resulting struct respects
> the nullability of the schema. This leads to very weird problems in consuming
> expressions. In our case parquet writing would produce an illegal parquet
> file.
> There are roughly two solutions here:
> # Assume that each field in schema passed to {{from_json}} is nullable, and
> ignore the nullability information set in the passed schema.
> # Validate the object during runtime, and fail execution if the data is null
> where we are not expecting this.
> I currently am slightly in favor of option 1, since this is the more
> performant option and a lot easier to do.
> WDYT? cc [~rxin] [~marmbrus] [~hyukjin.kwon] [~brkyvz]

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
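To make the reported behaviour and option 1 concrete, here is a minimal sketch in plain Python. It does not use Spark: `Field`, `parse_struct`, `null_violations` and `as_nullable` are hypothetical names invented for illustration, and the parser only handles flat JSON objects. It assumes, as the issue describes, that the parser never consults the nullability flag.

```python
import json
from dataclasses import dataclass, replace
from typing import Any, Dict, List, Tuple


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for a struct field: a name plus a nullability flag."""
    name: str
    nullable: bool = True


Schema = Tuple[Field, ...]


def parse_struct(text: str, schema: Schema) -> Dict[str, Any]:
    """Convert a flat JSON object into a dict keyed by the schema's field names.

    Mirrors the behaviour described in the issue: ``nullable`` is never
    consulted, so a missing key or an explicit JSON null becomes None.
    """
    obj = json.loads(text)
    return {f.name: obj.get(f.name) for f in schema}


def null_violations(row: Dict[str, Any], schema: Schema) -> List[str]:
    """Names of fields that hold None although the schema says non-nullable."""
    return [f.name for f in schema if not f.nullable and row[f.name] is None]


def as_nullable(schema: Schema) -> Schema:
    """Option 1: ignore the declared nullability and mark every field nullable."""
    return tuple(replace(f, nullable=True) for f in schema)


declared = (Field("id", nullable=False), Field("name"))
row = parse_struct('{"name": "a"}', declared)

# Against the declared schema the row is inconsistent; against the relaxed
# schema from option 1 it is not, so downstream consumers see no surprise.
print(null_violations(row, declared))
print(null_violations(row, as_nullable(declared)))
```

The point of the sketch is that option 1 changes only the schema that downstream code sees, not the parsed data, which is why it is cheap at runtime.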
[jira] [Commented] (SPARK-23173) from_json can produce nulls for fields which are marked as non-nullable
    [ https://issues.apache.org/jira/browse/SPARK-23173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16339794#comment-16339794 ]

Reynold Xin commented on SPARK-23173:
-------------------------------------

Yea I agree with you Herman.

(In reply to Herman van Hovell (JIRA), Sun, Jan 21, 2018 at 5:44 PM.)
[jira] [Commented] (SPARK-23173) from_json can produce nulls for fields which are marked as non-nullable
    [ https://issues.apache.org/jira/browse/SPARK-23173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16339144#comment-16339144 ]

Michał Świtakowski commented on SPARK-23173:
--------------------------------------------

I'm going to work on this.
[jira] [Commented] (SPARK-23173) from_json can produce nulls for fields which are marked as non-nullable
    [ https://issues.apache.org/jira/browse/SPARK-23173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16334388#comment-16334388 ]

Michał Świtakowski commented on SPARK-23173:
--------------------------------------------

I think starting with option 1 is a good idea as a remedy for the corruption issue. Verifying the data would certainly be good, and it can be done in at least two ways:
(1) detect incorrect data and fail
(2) write rejected data to a separate file/column, as Burak suggests

(1) can even be orthogonal if the verification is done at the level of parquet encoding. It would help avoid the corruption with all sources, not just JSON.
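The "detect and fail" approach placed at the writer rather than in the JSON parser could look roughly like the sketch below. It is plain Python, not Spark or Parquet code: `checked_rows` and `NullabilityError` are hypothetical names, and rows are assumed to be dicts. Because the check sits directly in front of the writer, it does not care which source produced the rows.

```python
from typing import Any, Dict, Iterable, Iterator, Set


class NullabilityError(ValueError):
    """Raised when a column declared non-nullable holds a null."""


def checked_rows(
    rows: Iterable[Dict[str, Any]], required: Set[str]
) -> Iterator[Dict[str, Any]]:
    """Yield rows unchanged, failing on the first null in a required column.

    Intended to wrap whatever iterator feeds the writer, so the check is
    independent of where the rows came from (JSON or any other source).
    """
    for index, row in enumerate(rows):
        for column in sorted(required):
            if row.get(column) is None:
                raise NullabilityError(
                    f"row {index}: column {column!r} is null "
                    "but declared non-nullable"
                )
        yield row
```

A hypothetical writer would then consume `checked_rows(rows, {"id"})` instead of `rows`, so an invalid row aborts the write before anything illegal is encoded.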
[jira] [Commented] (SPARK-23173) from_json can produce nulls for fields which are marked as non-nullable
    [ https://issues.apache.org/jira/browse/SPARK-23173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16334226#comment-16334226 ]

Burak Yavuz commented on SPARK-23173:
-------------------------------------

In terms of usability, I prefer 1. From the viewpoint of a data engineer, I would like 2 as well if that's not too hard. Basically, if I expect that my data doesn't have nulls, but it is suddenly outputting them, I would rather have it fail initially (or get written out to the _corrupt_record column).

In an ideal world, I should be able to either permit nullable fields (Option 1), or have the record be written out as corrupt.
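Routing offending records to a corrupt-record column instead of failing could be sketched as follows. This is plain Python for illustration only: `parse_or_quarantine` is a hypothetical helper, only flat JSON objects are handled, and nulling out the data fields of a rejected record is a design choice of this sketch rather than a description of what Spark does.

```python
import json
from typing import Any, Dict, Sequence

CORRUPT_COLUMN = "_corrupt_record"  # column name mentioned in the thread


def parse_or_quarantine(
    text: str, fields: Sequence[str], required: Sequence[str]
) -> Dict[str, Any]:
    """Parse one JSON object into a row with an extra corrupt-record column.

    A well-formed record with all required fields present keeps its values
    and gets None in the corrupt column. Malformed input, or a null in a
    required field, yields all-None data fields plus the raw input text.
    """
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        obj = None

    if isinstance(obj, dict):
        row = {name: obj.get(name) for name in fields}
        if all(row.get(name) is not None for name in required):
            row[CORRUPT_COLUMN] = None
            return row

    # Malformed input or a nullability violation: keep only the raw text.
    rejected: Dict[str, Any] = {name: None for name in fields}
    rejected[CORRUPT_COLUMN] = text
    return rejected
```

With this shape the job keeps running, and the rejected inputs can be filtered out or inspected later by selecting rows whose corrupt column is not null.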
[jira] [Commented] (SPARK-23173) from_json can produce nulls for fields which are marked as non-nullable
    [ https://issues.apache.org/jira/browse/SPARK-23173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16334005#comment-16334005 ]

Liang-Chi Hsieh commented on SPARK-23173:
-----------------------------------------

+1 for 1 too.
[jira] [Commented] (SPARK-23173) from_json can produce nulls for fields which are marked as non-nullable
    [ https://issues.apache.org/jira/browse/SPARK-23173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333940#comment-16333940 ]

Wenchen Fan commented on SPARK-23173:
-------------------------------------

+1 on proposal 1.
[jira] [Commented] (SPARK-23173) from_json can produce nulls for fields which are marked as non-nullable
    [ https://issues.apache.org/jira/browse/SPARK-23173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333855#comment-16333855 ]

Hyukjin Kwon commented on SPARK-23173:
--------------------------------------

I believe this one is related to SPARK-17763. My first try was option 2 (roughly 1.5 years ago). The root cause seems to be the same: I just double-checked that the Jackson parsers produce {{null}} regardless of the nullability.

Just FYI, there was a related discussion in SPARK-16472 too. If I understood correctly, people there roughly preferred leaving the fields nullable over failing at runtime.

So, +1 for option 1 from me, given the past discussion; it also seems the simplest and least invasive.

cc [~cloud_fan] too, who was in the discussion of SPARK-16472.