On 18 Oct 2016, at 08:43, Chetan Khatri <ckhatriman...@gmail.com> wrote:

Hello Community members,

I am getting an error while reading a large JSON file in Spark.


The underlying read code can't handle a single line of more than Integer.MAX_VALUE (2^31 - 1) bytes:

    // org.apache.hadoop.util.LineReader.readDefaultLine(), per the stack trace below
    if (bytesConsumed > Integer.MAX_VALUE) {
      throw new IOException("Too many bytes before newline: " + bytesConsumed);
    }

That's because it's trying to split the work by line, and of course, there aren't any lines to split on.

You need to move over to reading the JSON by other means, I'm afraid. At a guess, something involving SparkContext.binaryFiles() streaming the data straight into a JSON parser; a rough sketch follows.
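A minimal, untested sketch of that idea. The parsing details are my assumptions: Jackson (which Spark already bundles) as the parser, and a top-level JSON array of records in the file.

    import scala.collection.JavaConverters._
    import com.fasterxml.jackson.databind.ObjectMapper

    // binaryFiles() yields one (path, stream) pair per file, so nothing ever
    // tries to split the document on newlines.
    val records = sc.binaryFiles("s3n://hist-ngdp/lvisitor/lvisitor-01-aug.json")
      .flatMap { case (path, portableStream) =>
        val in = portableStream.open()   // DataInputStream over the whole file
        try {
          // readTree() parses straight off the stream, but still builds the
          // whole document in one executor's memory; for files bigger than
          // that, drop down to Jackson's streaming JsonParser API.
          val root = new ObjectMapper().readTree(in)
          root.elements().asScala.map(_.toString).toList
        } finally {
          in.close()
        }
      }

    // hand the per-record JSON strings back to Spark SQL for schema inference
    val landingVisitor = sqlContext.read.json(records)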



Code:

val landingVisitor = sqlContext.read.json("s3n://hist-ngdp/lvisitor/lvisitor-01-aug.json")

Unrelated, but use s3a if you can. It's better, you know.
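That is, assuming the hadoop-aws module and its AWS SDK dependency are on the classpath, it's just a scheme change:

    val landingVisitor = sqlContext.read.json("s3a://hist-ngdp/lvisitor/lvisitor-01-aug.json")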


Error:

16/10/18 07:30:30 ERROR Executor: Exception in task 8.0 in stage 0.0 (TID 8)
java.io.IOException: Too many bytes before newline: 2147483648
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:249)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:135)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:237)

What would be the resolution for this?

Thanks in advance!


--
Yours Aye,
Chetan Khatri.

