On 18 Oct 2016, at 10:58, Chetan Khatri <ckhatriman...@gmail.com> wrote:

Dear Xi shen,

Thank you for getting back to my question.

The approach I am following is as below:
I have MSSQL Server as the enterprise data lake.

1. Run Java jobs that generate JSON files; every file is almost 6 GB.
Correct, Spark needs every JSON object on a separate line, so I ran
sed -e 's/}/}\n/g' -s old-file.json > new-file.json
to get every JSON element on its own line.
2. Uploaded the files to an S3 bucket and read them from there using the
sqlContext.read.json() function (a rough sketch of this step follows this
list), where I am getting the above error.
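For context, the read step is roughly the following. This is only a minimal Scala sketch of step 2, not the actual job; the app name and the s3a path are placeholders I made up:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ReadJsonFromS3 {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("read-json"))
    val sqlContext = new SQLContext(sc)

    // read.json expects newline-delimited JSON: one complete object per
    // line, which is what the sed step above is meant to produce.
    val df = sqlContext.read.json("s3a://some-bucket/exports/*.json")  // placeholder path
    df.printSchema()
    println(df.count())
  }
}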

Note: If I run the same thing on small files I do not get this error; the
JSON elements are almost identically structured.

Current approach:


  *    Split the large JSON (6 GB) into 1 GB chunks, then process (as sketched below).
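A minimal sketch of that splitting step, assuming the input is already newline-delimited JSON so no object is ever cut in half. The file names and the 1 GB threshold are placeholders, and line.length counts characters, so the chunk size is only approximate:

import java.io.{BufferedReader, BufferedWriter, FileReader, FileWriter}

object SplitNdjson {
  // Split a newline-delimited JSON file into ~1 GB pieces without ever
  // breaking a line (and therefore a JSON object) across two files.
  def main(args: Array[String]): Unit = {
    val maxBytes = 1L * 1024 * 1024 * 1024          // target chunk size
    val in = new BufferedReader(new FileReader("new-file.json"))
    var part = 0
    var written = 0L
    var out = new BufferedWriter(new FileWriter(f"chunk-$part%03d.json"))
    var line = in.readLine()
    while (line != null) {
      if (written + line.length + 1 > maxBytes && written > 0) {
        out.close(); part += 1; written = 0
        out = new BufferedWriter(new FileWriter(f"chunk-$part%03d.json"))
      }
      out.write(line); out.newLine()
      written += line.length + 1
      line = in.readLine()
    }
    out.close(); in.close()
  }
}

GNU split with --line-bytes=1G would do much the same thing from the shell.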

Note: Cluster size is 1 master and 2 slaves, each with 4 vcores and 26 GB RAM.

I see what you are trying to do here: one JSON object per line, then splitting by
line so that you can parallelise JSON processing, as well as holding many JSON
objects in a single S3 file. This is a devious little trick. It just doesn't
work once the JSON file grows past 2^31 bytes, as the code that splits by line
breaks.

You could write your own input splitter which actually does basic JSON parsing,
splitting up by looking for the final } of each JSON clause (harder than you
think, as you need to remember how many {} clauses you have entered and not
count "{" characters that appear inside strings).
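Something along these lines, as a very rough sketch of the brace-counting idea only: it reads from a plain Reader, emits one string per top-level object, and deliberately ignores the harder problem of an input split that starts mid-object:

import java.io.{Reader, StringReader}

object JsonObjectScanner {
  // Scan a character stream, tracking {} depth and skipping braces that occur
  // inside JSON strings (including escaped quotes). Each time the depth drops
  // back to zero, one complete top-level object has been seen.
  def readJsonObjects(in: Reader)(handle: String => Unit): Unit = {
    val sb = new StringBuilder
    var depth = 0
    var inString = false
    var escaped = false
    var c = in.read()
    while (c != -1) {
      val ch = c.toChar
      if (depth > 0 || ch == '{') sb.append(ch)
      if (inString) {
        if (escaped) escaped = false
        else if (ch == '\\') escaped = true
        else if (ch == '"') inString = false
      } else ch match {
        case '"' => inString = true
        case '{' => depth += 1
        case '}' =>
          depth -= 1
          if (depth == 0) { handle(sb.toString); sb.clear() }
        case _   => ()
      }
      c = in.read()
    }
  }

  def main(args: Array[String]): Unit = {
    val sample = """{"a": "has } inside"}{"b": {"nested": 1}}"""
    readJsonObjects(new StringReader(sample))(println)   // prints two objects
  }
}

The projects linked below tackle the full problem, including split boundaries.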

A quick Google search shows a couple that may be a good starting point:

https://github.com/Pivotal-Field-Engineering/pmr-common/blob/master/PivotalMRCommon/src/main/java/com/gopivotal/mapreduce/lib/input/JsonInputFormat.java
https://github.com/alexholmes/json-mapreduce
