Dear Xi Shen,

Thank you for getting back to my question.
The approach I am following is as below: I have MS SQL Server as the enterprise data lake.

1. Ran Java jobs that generate JSON files; every file is almost 6 GB. Correct, Spark needs every JSON object on a *separate line*, so I did

    sed -e 's/}/}\n/g' -s old-file.json > new-file.json

to get every JSON element on a separate line.
2. Uploaded to an S3 bucket and read from there using the sqlContext.read.json() function, which is where I am getting the above error.

Note: If I run small files, whose JSON elements are almost identically structured, I do not get this error.

*Current approach:* splitting the large JSON (6 GB) into 1 GB chunks, then processing those.

Note: Cluster size is 1 master and 2 slaves, each with 4 vcores and 26 GB RAM.

Thanks.

On Tue, Oct 18, 2016 at 2:50 PM, Xi Shen <davidshe...@gmail.com> wrote:

> It is a plain Java IO error. Your line is too long. You should alter your
> JSON schema, so each line is a small JSON object.
>
> Please do not concatenate all the objects into an array, then write the
> array on one line. You will have difficulty handling your super large JSON
> array in Spark anyway.
>
> Because one array is one object, it cannot be split into multiple
> partitions.
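[Editor's note] The sed one-liner above inserts a newline after *every* `}`, including closing braces of nested objects, which can cut a record in half and produce unparseable lines. If the source file is a JSON array, a safer route is to parse it and re-emit one compact object per line (newline-delimited JSON), which is exactly the layout sqlContext.read.json expects. A minimal sketch in Python, assuming the array fits the parser (for a 6 GB file you would swap json.load for a streaming parser such as the third-party ijson; the function name here is illustrative, not from the thread):

```python
import json

def json_array_to_ndjson(src_path, dst_path):
    """Rewrite a JSON-array file as one compact JSON object per line (NDJSON)."""
    with open(src_path) as src:
        records = json.load(src)  # for multi-GB inputs, use a streaming parser instead
    with open(dst_path, "w") as dst:
        for rec in records:
            # separators=(",", ":") keeps each line compact, and json.dumps never
            # emits raw newlines, so one record always maps to exactly one line
            dst.write(json.dumps(rec, separators=(",", ":")) + "\n")
```

Because each record then sits on its own line, Hadoop's line reader can split the file at line boundaries and Spark can spread the records across partitions, which a single one-line array can never do.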
> On Tue, Oct 18, 2016 at 3:44 PM Chetan Khatri <ckhatriman...@gmail.com> wrote:
>
>> Hello Community members,
>>
>> I am getting an error while reading a large JSON file in Spark.
>>
>> *Code:*
>>
>> val landingVisitor = sqlContext.read.json("s3n://hist-ngdp/lvisitor/lvisitor-01-aug.json")
>>
>> *Error:*
>>
>> 16/10/18 07:30:30 ERROR Executor: Exception in task 8.0 in stage 0.0 (TID 8)
>> java.io.IOException: Too many bytes before newline: 2147483648
>> at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:249)
>> at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
>> at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:135)
>> at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
>> at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:237)
>>
>> What would be the resolution for the same?
>>
>> Thanks in advance!
>>
>> --
>> Yours Aye,
>> Chetan Khatri.

--
Thanks,
David S.

--
Yours Aye,
Chetan Khatri.
M. +91 76666 80574
Data Science Researcher
INDIA
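[Editor's note] The byte count in the exception above is not arbitrary: 2147483648 is exactly 2^31, one more than the largest value a Java int can hold, and Hadoop's LineReader appears to give up on any "line" once its length can no longer be counted in an int. So a file whose first newline comes after 2 GiB of text will fail no matter how much memory the cluster has. A quick arithmetic check (this observation is mine, not from the thread):

```python
# The exception reports 2147483648 bytes read without finding a newline.
java_int_max = 2**31 - 1   # Integer.MAX_VALUE, the cap a Java int can represent
reported = 2147483648

assert reported == 2**31            # exactly one byte past Integer.MAX_VALUE
assert reported > java_int_max      # so the line reader aborts before any newline
print("bytes before first newline:", reported, "=", reported / 2**30, "GiB")
```

This is why splitting the 6 GB file into smaller chunks only helps if each chunk also keeps every JSON object on its own line; the hard limit is per line, not per file.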