Unfortunately, I can’t change the source system, so changing the JSON at 
runtime is the best I can do right now.

 

Is there a preferred way to modify the string other than a UDF or a map over 
the string? 

 

At the moment I am modifying it with a UDF that returns a generic type “t”, so 
I can use the same UDF for many different JSONs that have the same issue. 
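A minimal sketch of one way such a reusable, string-only fix-up UDF might look (the name wrapArray and the root attribute "arr" are made up for illustration):

```scala
import org.apache.spark.sql.functions.udf

// One string-to-string UDF serves any column with the same bare-array issue:
// it only wraps the array in a root attribute so from_json can parse it.
val wrapArray = udf((s: String) => s"""{"arr":$s}""")

val fixed = ds.withColumn("news", wrapArray($"news"))
```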

 

Also, is it possible to extract the function from the newer source code and 
run it on an older version of Spark, and is there any advantage to doing so? 

 

 

From: Magnus Nilsson <ma...@kth.se> 
Sent: Sunday, February 24, 2019 5:34 AM
To: Yeikel <em...@yeikel.com>
Cc: user@spark.apache.org
Subject: Re: How can I parse an "unnamed" json array present in a column?

 

That's a bummer. If you're unable to upgrade to Spark 2.3+, your best bet is 
probably to prepend/append the JSON-array string so that the array becomes the 
value of a root attribute in a JSON document (as in your first work 
around). I mean, it's such an easy and safe fix that you can still do it even if 
you stream the file.
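A sketch of that workaround using only column functions, assuming the column and field names from this thread (the root attribute "arr" is arbitrary):

```scala
import org.apache.spark.sql.functions.{concat, from_json, lit}
import org.apache.spark.sql.types._
import spark.implicits._

// Prepend/append so the bare array becomes the value of a root attribute,
// then parse with a plain StructType schema, which from_json accepts in 2.1.
val schema = new StructType()
  .add("arr", ArrayType(new StructType()
    .add("source", StringType)
    .add("name", StringType)))

val parsed = ds
  .withColumn("news", concat(lit("""{"arr":"""), $"news", lit("}")))
  .select(from_json($"news", schema).getField("arr") as "news_parsed")
```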

 

Even better, make the source system create a JSON Lines file instead of a JSON 
array, if possible.
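For reference, the same sample data as JSON Lines — one complete object per line, which spark.read.json reads directly without any of the workarounds above:

```
{"source": "source1", "name": "News site1"}
{"source": "source2", "name": "News site2"}
```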

 

When I use Datasets (Tungsten) I basically try to stay there and use the 
available column functions, unless I have no choice but to serialize and run 
custom advanced calculations/parsing. In your case, just modifying the string 
and using the tested from_json function beats the available alternatives, if 
you ask me.

 

 

On Sun, Feb 24, 2019 at 1:13 AM <em...@yeikel.com> wrote:

What you suggested works in Spark 2.3, but in the version that I am using 
(2.1) it produces the following error: 

 

found   : org.apache.spark.sql.types.ArrayType

required: org.apache.spark.sql.types.StructType

       ds.select(from_json($"news", schema) as "news_parsed").show(false)

 

Is it viable/possible to export a function from 2.3 to 2.1? What other options 
do I have? 

 

Thank you.

 

 

From: Magnus Nilsson <ma...@kth.se> 
Sent: Saturday, February 23, 2019 3:43 PM
Cc: user@spark.apache.org 
Subject: Re: How can I parse an "unnamed" json array present in a column?

 

Use spark.sql.types.ArrayType instead of a Scala Array as the root type when 
you define the schema and it will work.
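A sketch of that schema (per this thread it requires Spark 2.3+; column and field names as in the original question):

```scala
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

// The root of the schema is an ArrayType value, not a Scala Array of StructType.
val schema = ArrayType(new StructType()
  .add("source", StringType)
  .add("name", StringType))

ds.select(from_json($"news", schema) as "news_parsed").show(false)
```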

 

Regards,

 

Magnus

 

On Fri, Feb 22, 2019 at 11:15 PM Yeikel <em...@yeikel.com> wrote:

I have an "unnamed" JSON array stored in a *column*.

The format is the following:

column name: news

Data:

[
  {
    "source": "source1",
    "name": "News site1"
  },
   {
    "source": "source2",
    "name": "News site2"
  }
]


Ideally, I'd like to parse it as: 

news ARRAY<struct<source:string, name:string>>

I've tried the following: 

import org.apache.spark.sql.Encoders
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._
import spark.implicits._

val entry = scala.io.Source.fromFile("1.txt").mkString

val ds = Seq(entry).toDF("news")

val schema = Array(new StructType().add("name", StringType).add("source",
StringType))

ds.select(from_json($"news", schema) as "news_parsed").show(false)

But this is not allowed: 

found   : Array[org.apache.spark.sql.types.StructType]
required: org.apache.spark.sql.types.StructType


I also tried passing the following schema: 

val schema = StructType(new StructType().add("name",
StringType).add("source", StringType))

But this only parsed the first record: 

+--------------------+
|news_parsed         |
+--------------------+
|[News site1,source1]|
+--------------------+


I am aware that if I fix the JSON like this: 

{
  "news": [
    {
      "source": "source1",
      "name": "News site1"
    },
    {
      "source": "source2",
      "name": "News site2"
    }
  ]
}

The parsing works as expected, but I would like to avoid doing that if
possible. 

Another approach I can think of is to map over it and parse it using a
third-party library like Gson, but I am not sure if this is any better
than fixing the JSON beforehand. 
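For comparison, a sketch of that Gson approach (the case class NewsItem is invented for illustration; note that a typed map deserializes every row out of Spark's internal format, unlike from_json):

```scala
import com.google.gson.JsonParser
import spark.implicits._

// Illustrative target type for one element of the array.
case class NewsItem(source: String, name: String)

// ds is assumed to have a single string column holding the raw array.
val parsed = ds.as[String].map { raw =>
  val arr = new JsonParser().parse(raw).getAsJsonArray
  (0 until arr.size).map { i =>
    val obj = arr.get(i).getAsJsonObject
    NewsItem(obj.get("source").getAsString, obj.get("name").getAsString)
  }.toSeq
}
```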

I am running Spark 2.1



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
