> Then you have to preprocess it, or write your own implementation to
> handle the record delimiter for your JSON data case. But good luck with
> that: there is no perfect generic solution for any kind of JSON data you
> want to handle.
>
> Yong
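To make the "preprocess it" option concrete, here is a minimal non-Spark sketch (the function name and sample data are mine, purely illustrative): read the pretty-printed JSON array as one string and rewrite it as one compact object per line, the line-delimited format Spark's JSON reader handles directly.

```python
import json

def to_json_lines(multi_line_json: str) -> str:
    """Convert a JSON array document (possibly pretty-printed across
    many lines) into line-delimited JSON: one compact object per line."""
    records = json.loads(multi_line_json)  # parse the whole document once
    return "\n".join(json.dumps(r, separators=(",", ":")) for r in records)

pretty = """[
  {"fields": {"premise_name": "aaa"}},
  {"fields": {"premise_name": "bbb"}}
]"""
print(to_json_lines(pretty))
```

Of course, this assumes the whole document fits in memory during the preprocessing step, which is exactly the limitation being discussed for very large files.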
From: ljia...@gmail.com
Date: Thu, 7 Jul 2016 11:57:26 -0500
Subject: Re: Processing json document
To: gurwls...@gmail.com
CC: jornfra...@gmail.com; user@spark.apache.org
Hi, there,
Thank you all for your input. @Hyukjin, as a matter of fact, I had read
the blog link you posted before asking the question on the forum. As you
pointed out, the link uses wholeTextFiles(), which is bad in my case,
because my JSON file can be as large as 20G+ and OOM might occur.
The link uses the wholeTextFiles() API, which treats each file as a single record.
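In other words, with wholeTextFiles() each (filename, content) pair arrives as one record, and you parse the entire content string yourself. A rough non-Spark sketch of that per-record parsing step (the helper name is mine, not a Spark API):

```python
import json

def parse_whole_file(content: str):
    """Parse an entire file's text as one JSON document and return a
    list of records, whether the document is one object or an array."""
    doc = json.loads(content)
    return doc if isinstance(doc, list) else [doc]

# One "record" as wholeTextFiles() would hand it over: the full file text.
content = '{\n  "fields": {\n    "premise_name": "ccc"\n  }\n}'
print(parse_whole_file(content))
```

The catch is exactly what is raised below: the full file content is materialized as one string in one JVM, so a 20G+ file risks OOM.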
2016-07-07 15:42 GMT+09:00 Jörn Franke:
This does not necessarily need to be the case: if you look at the Hadoop
FileInputFormat architecture, you can even split large multi-line JSONs
without issues. I would need to have a look at it, but one large file does
not mean one executor, independent of the underlying format.
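The splitting Jörn describes works because JSON record boundaries can be found by scanning the byte stream, which is what a custom Hadoop FileInputFormat/RecordReader would do. A toy sketch of just the boundary-finding idea (brace-depth counting that ignores braces inside string literals; handling boundaries that straddle HDFS block splits is considerably more involved):

```python
def split_top_level_objects(text: str):
    """Split a stream of concatenated top-level JSON objects into one
    string per object by tracking brace depth outside string literals."""
    records, depth, start = [], 0, None
    in_string = escaped = False
    for i, ch in enumerate(text):
        if in_string:
            if escaped:
                escaped = False      # previous char was a backslash
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False    # closing quote of the literal
        elif ch == '"':
            in_string = True
        elif ch == "{":
            if depth == 0:
                start = i            # a new top-level record begins
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                records.append(text[start : i + 1])
    return records

stream = '{"id": 1,\n "v": "a}b"}\n{"id": 2}'
print(split_top_level_objects(stream))
```

Note the `"a}b"` value: a naive delimiter split would break on the brace inside the string, which is why depth tracking must be string-aware.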
There is a good link for this here:
http://searchdatascience.com/spark-adventures-1-processing-multi-line-json-files
If there are a lot of small files, then it would work pretty okay in a
distributed manner, but I am worried about the case of a single large file.
In that case, this would only work on a single executor.
Do you want id1, id2, id3 to be processed similarly?
The Java code I use is:
df = df.withColumn(K.NAME, df.col("fields.premise_name"));
The original structure is something like {"fields":{"premise_name":"ccc"}}.
Hope it helps.
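For comparison, here is what that nested-field lookup does, stripped of Spark: parse the object and walk the dotted path "fields.premise_name" through the nested structure. The helper below is a plain-Python illustration of mine, not a Spark API:

```python
import json

def get_path(obj: dict, dotted_path: str):
    """Walk a dotted path like 'fields.premise_name' through nested dicts."""
    for key in dotted_path.split("."):
        obj = obj[key]
    return obj

row = json.loads('{"fields": {"premise_name": "ccc"}}')
print(get_path(row, "fields.premise_name"))
```

In Spark, df.col("fields.premise_name") performs the same traversal on a struct column, which is why the dotted name works there.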
> On Jul 7, 2016, at 1:48 AM, Lan Jiang wrote:
Hi, there,
Spark has provided JSON document processing features for a long time. In
most examples I see, each line is a JSON object in the sample file. That is
the easiest case. But how can we process a JSON document that does not
conform to this standard format (one JSON object per line)? Here is