Zachary Jablons created SPARK-27609:
---------------------------------------
Summary: [Documentation Issue?] from_json expects values of options dictionary to be
Key: SPARK-27609
URL: https://issues.apache.org/jira/browse/SPARK-27609
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 2.2.1
Environment: I've found this issue on an AWS Glue development endpoint which is running Spark 2.2.1 and being given jobs through a SparkMagic Python 2 kernel, running through Livy and all that. I don't know how much of that is important for reproduction, and can get more details if needed.
Reporter: Zachary Jablons

When reading a DataFrame column that consists of serialized JSON, one way to infer the schema and then parse the JSON is a two-step process:

{{# this results in a new dataframe where the top-level keys of the JSON are columns}}
{{df_parsed_direct = spark.read.json(df.rdd.map(lambda row: row.json_col))}}
{{# this does that while preserving the rest of df}}
{{schema = df_parsed_direct.schema}}
{{df_parsed = df.withColumn('parsed', from_json(df.json_col, schema))}}

When I do this, I sometimes find myself passing in options. My understanding from the documentation [here|http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.from_json] is that these options should behave the same whether I do

{{spark.read.option('option',value)}}

or

{{from_json(df.json_col, schema, options=\{'option':value})}}

However, I've found that the latter expects value to be a string representation of the value that can be decoded by JSON. So, for example, options=\{'multiLine':True} fails with

{{java.lang.ClassCastException: java.lang.Boolean cannot be cast to java.lang.String}}

whereas options=\{'multiLine':'true'} works just fine.

Notably, providing spark.read.option('multiLine',True) works fine!

The code for reproducing this issue, as well as the stacktrace from hitting it, is provided in [this gist|https://gist.github.com/zmjjmz/0af5cf9b059b4969951e825565e266aa].
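
One possible workaround is to convert option values to strings before handing them to from_json. The sketch below assumes Python 3 and uses a hypothetical helper, stringify_options, which is not part of PySpark. It JSON-encodes any non-string value, so True becomes 'true', the form reported above to work. The from_json call itself is left as a comment because it needs a running Spark session.

```python
import json


def stringify_options(options):
    """Return a copy of `options` with every value converted to a string.

    Hypothetical helper, not part of PySpark. Values that are already
    strings are kept as-is; anything else is JSON-encoded, so True
    becomes 'true' and 1 becomes '1'.
    """
    return {
        key: value if isinstance(value, str) else json.dumps(value)
        for key, value in options.items()
    }


opts = stringify_options({'multiLine': True, 'mode': 'PERMISSIVE'})
print(opts)

# Intended use (requires PySpark, not executed here):
# df.withColumn('parsed', from_json(df.json_col, schema, options=opts))
```

Whether every option accepted by from_json tolerates a JSON-encoded string has not been checked here; only the multiLine case is described in this report.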

I also noticed that from_json doesn't complain if you give it a garbage option key – but that seems separate.