On 18 Oct 2016, at 10:58, Chetan Khatri <ckhatriman...@gmail.com> wrote:
Dear Xi Shen,

Thank you for getting back to my question. The approach I am following is as below. I have an MSSQL Server as the enterprise data lake.

1. Run Java jobs that generate JSON files; every file is almost 6 GB. Correct, Spark needs every JSON object on a separate line, so I did sed -e 's/}/}\n/g' -s old-file.json > new-file.json to get every JSON element onto its own line.
2. Uploaded the files to an S3 bucket and read them from there using the sqlContext.read.json() function, which is where I am getting the above error.

Note: if I run on smaller files, where the JSON elements are almost identically structured, I do not get this error.

Current approach: splitting the large JSON (6 GB) into roughly 1 GB pieces, then processing those.

Note: the cluster is 1 master and 2 slaves, each with 4 vcores and 26 GB RAM.


I see what you are trying to do here: one JSON object per line, then splitting by line so that you can parallelise the JSON processing while still holding many JSON objects in a single S3 file. This is a devious little trick. It just doesn't work once the JSON file goes past 2^31 bytes, because the code that splits it by line breaks down.

You could write your own input splitter which actually does basic JSON parsing, splitting records by looking for the final } of each JSON object (harder than you think, as you need to remember how many {} clauses you have entered and not count escaped "{" characters inside strings). A quick Google shows some existing work that may be a good starting point:

https://github.com/Pivotal-Field-Engineering/pmr-common/blob/master/PivotalMRCommon/src/main/java/com/gopivotal/mapreduce/lib/input/JsonInputFormat.java
https://github.com/alexholmes/json-mapreduce
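For anyone following the thread, the read path described above is just the standard line-delimited JSON load. A rough spark-shell sketch (the s3a bucket and file name are placeholders, not the poster's actual paths):

    // spark-shell sketch: sqlContext is already provided by the shell.
    // Each line of new-file.json must hold exactly one complete JSON object
    // ("JSON Lines"); the bucket/path below is made up for illustration.
    val df = sqlContext.read.json("s3a://some-bucket/new-file.json")
    df.printSchema()
    df.count()

This works as long as every object really does sit on its own line and no single line grows past the limit mentioned above.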
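To make the brace-counting idea concrete, here is an untested sketch (my own illustration, not code from either of the linked projects) of a scanner that finds the end of the next complete JSON object by tracking nesting depth and ignoring braces inside string literals:

    // Untested sketch: given a buffer of concatenated JSON objects and an
    // offset at (or just before) an object's opening '{', return the offset
    // one past its closing '}', or -1 if the buffer ends mid-object.
    def nextObjectEnd(data: Array[Byte], start: Int): Int = {
      var depth = 0
      var inString = false
      var escaped = false
      var i = start
      while (i < data.length) {
        val c = data(i).toChar
        if (inString) {
          if (escaped) escaped = false         // previous char was a backslash
          else if (c == '\\') escaped = true   // escape sequence begins
          else if (c == '"') inString = false  // closing quote of the string
        } else c match {
          case '"' => inString = true          // entering a string literal
          case '{' => depth += 1
          case '}' =>
            depth -= 1
            if (depth == 0) return i + 1       // one complete object consumed
          case _   =>                          // anything else is ignored
        }
        i += 1
      }
      -1
    }

A real InputFormat also has to cope with objects that straddle split boundaries, which is roughly what the two projects linked above appear to address; the sketch only shows the per-buffer scanning part.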