What you suggested works in Spark 2.3, but on the version I am using (2.1) it produces the following compile error:

 

found   : org.apache.spark.sql.types.ArrayType
required: org.apache.spark.sql.types.StructType

       ds.select(from_json($"news", schema) as "news_parsed").show(false)

 

Is it viable/possible to backport that function from 2.3 to 2.1? What other options
do I have?
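
In the meantime, one workaround I can think of is to do the wrapping from my
original message inside the query itself, so the source files stay untouched.
A rough sketch (untested on 2.1; `ds` and the field names are from my example
below):

import org.apache.spark.sql.functions.{concat, from_json, lit}
import org.apache.spark.sql.types._

// Wrap the bare array in an object on the fly, so from_json can take a
// StructType root (the only root type 2.1 accepts), then pull the array
// field back out.
val itemSchema = new StructType().add("name", StringType).add("source", StringType)
val wrappedSchema = new StructType().add("news", ArrayType(itemSchema))

val parsed = ds
  .select(from_json(concat(lit("""{"news":"""), $"news", lit("}")), wrappedSchema) as "wrapped")
  .select($"wrapped.news" as "news_parsed")

parsed.show(false)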

 

Thank you.

 

 

From: Magnus Nilsson <ma...@kth.se> 
Sent: Saturday, February 23, 2019 3:43 PM
Cc: user@spark.apache.org
Subject: Re: How can I parse an "unnamed" json array present in a column?

 

Use spark.sql.types.ArrayType instead of a Scala Array as the root type when 
you define the schema and it will work.
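
For example, something like this (untested sketch, reusing the names from your
code):

import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

// ArrayType as the root of the schema, not Array(StructType(...)):
val schema = ArrayType(new StructType().add("name", StringType).add("source", StringType))

ds.select(from_json($"news", schema) as "news_parsed").show(false)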

 

Regards,

 

Magnus

 

On Fri, Feb 22, 2019 at 11:15 PM Yeikel <em...@yeikel.com> wrote:

I have an "unnamed" JSON array stored in a column.

The format is the following:

Column name: news

Data:

[
  {
    "source": "source1",
    "name": "News site1"
  },
   {
    "source": "source2",
    "name": "News site2"
  }
]


Ideally, I'd like to parse it as:

news ARRAY<struct<source:string, name:string>>

I've tried the following:

import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._
import spark.implicits._  // spark-shell SparkSession

val entry = scala.io.Source.fromFile("1.txt").mkString

val ds = Seq(entry).toDF("news")

// Root of the schema is a Scala Array -- this does not compile:
val schema = Array(new StructType().add("name", StringType).add("source", StringType))

ds.select(from_json($"news", schema) as "news_parsed").show(false)

But this is not allowed:

found   : Array[org.apache.spark.sql.types.StructType]
required: org.apache.spark.sql.types.StructType


I also tried passing the following schema:

val schema = StructType(new StructType().add("name", StringType).add("source", StringType))

But this only parsed the first record:

+--------------------+
|news_parsed         |
+--------------------+
|[News site1,source1]|
+--------------------+


I am aware that if I fix the JSON like this:

{
  "news": [
    {
      "source": "source1",
      "name": "News site1"
    },
    {
      "source": "source2",
      "name": "News site2"
    }
  ]
}

The parsing works as expected, but I would like to avoid doing that if
possible.

Another approach I can think of is to map over the column and parse it with a
third-party library like Gson, but I am not sure that is any better than
fixing the JSON beforehand.
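
For illustration, a rough (untested) sketch of that route using json4s, which
already ships with Spark, in place of Gson; `NewsItem` and `parseNews` are just
placeholder names:

import org.apache.spark.sql.functions.udf

case class NewsItem(source: String, name: String)

// UDF that parses the raw JSON array string into a list of structs.
val parseNews = udf { json: String =>
  import org.json4s._
  import org.json4s.jackson.JsonMethods.parse
  implicit val formats = DefaultFormats
  parse(json).extract[List[NewsItem]]
}

val parsed = ds.select(parseNews($"news") as "news_parsed")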

I am running Spark 2.1.


