Re: Spark 1.1.1 SQLContext.jsonFile dumps trace if JSON has newlines
Yep. sc.textFile only guarantees that lines are preserved across splits; that is its semantic contract, and it is why jsonFile requires one JSON object per line. It would be possible to write a custom input format that handles multi-line JSON, but that hasn't been done yet. From the documentation: http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets

> - jsonFile - loads data from a directory of JSON files *where each
>   line of the files is a JSON object*.

On Wed, Dec 10, 2014 at 11:48 AM, Manoj Samel wrote:
> I am using SQLContext.jsonFile. If a valid JSON contains newlines,
> Spark 1.1.1 dumps the trace below. If the JSON is read as one line, it
> works fine. Is this known?
>
> 14/12/10 11:44:02 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 28)
> com.fasterxml.jackson.core.JsonParseException: Unexpected end-of-input within/between OBJECT entries
>  at [Source: java.io.StringReader@4c8dd4d9; line: 1, column: 19]
>  at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1524)
>  at com.fasterxml.jackson.core.json.ReaderBasedJsonParser._skipWS(ReaderBasedJsonParser.java:1682)
>  at com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:619)
>  at com.fasterxml.jackson.databind.deser.std.MapDeserializer._readAndBindStringMap(MapDeserializer.java:412)
>  at com.fasterxml.jackson.databind.deser.std.MapDeserializer.deserialize(MapDeserializer.java:312)
>  at com.fasterxml.jackson.databind.deser.std.MapDeserializer.deserialize(MapDeserializer.java:26)
>  at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:2986)
>  at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2091)
>  at org.apache.spark.sql.json.JsonRDD$$anonfun$parseJson$1$$anonfun$apply$5.apply(JsonRDD.scala:275)
>  at org.apache.spark.sql.json.JsonRDD$$anonfun$parseJson$1$$anonfun$apply$5.apply(JsonRDD.scala:274)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>  at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>  at scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:172)
>  at scala.collection.AbstractIterator.reduceLeft(Iterator.scala:1157)
>  at org.apache.spark.rdd.RDD$$anonfun$18.apply(RDD.scala:847)
>  at org.apache.spark.rdd.RDD$$anonfun$18.apply(RDD.scala:845)
>  at org.apache.spark.SparkContext$$anonfun$28.apply(SparkContext.scala:1179)
>  at org.apache.spark.SparkContext$$anonfun$28.apply(SparkContext.scala:1179)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
>  at org.apache.spark.scheduler.Task.run(Task.scala:54)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
>  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>  at java.lang.Thread.run(Thread.java:744)
>
> 14/12/10 11:44:02 WARN TaskSetManager: Lost task 0.0 in stage 14.0 (TID 28, localhost): com.fasterxml.jackson.core.JsonParseException: Unexpected end-of-input within/between OBJECT entries
>  at [Source: java.io.StringReader@4c8dd4d9; line: 1, column: 19]
>  com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1524)
>  com.fasterxml.jackson.core.json.ReaderBasedJsonParser._skipWS(ReaderBasedJsonParser.java:1682)
>  com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:619)
>  com.fasterxml.jackson.databind.deser.std.MapDeserializer._readAndBindStringMap(MapDeserializer.java:412)
>  com.fasterxml.jackson.databind.deser.std.MapDeserializer.deserialize(MapDeserializer.java:312)
>  com.fasterxml.jackson.databind.deser.std.MapDeserializer.deserialize(MapDeserializer.java:26)
>  com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:2986)
>  com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2091)
>  org.apache.spark.sql.json.JsonRDD$$anonfun$parseJson$1$$anonfun$apply$5.apply(JsonRDD.scala:275)
>  org.apache.spark.sql.json.JsonRDD$$anonfun$parseJson$1$$anonfun$apply$5.apply(JsonRDD.scala:274)
>  scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>  scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>  scala.collection.Iterator$class.foreach(Iterator.scala:727)
>  scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>  scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:172)
>  scala.collection.AbstractIterator.reduceLeft(Iterator.scala:1157)
>  org.apache.spark.rdd.RDD$$anonfun$18.apply(RDD.scala:847)
>  org.apache.spark.rdd.RDD$$anonfun$18.apply(RDD.scala:845)
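Until a multi-line input format exists, the practical workaround is to pre-process the data so each JSON object occupies exactly one physical line (the format jsonFile expects). A minimal sketch in plain Python, independent of Spark; the `to_json_line` helper and the sample record are illustrative, not part of any Spark API:

```python
import json

def to_json_line(pretty_text: str) -> str:
    """Re-serialize one (possibly pretty-printed, multi-line) JSON value
    onto a single line, as SQLContext.jsonFile expects."""
    obj = json.loads(pretty_text)
    # No indent and compact separators guarantee a newline-free string.
    return json.dumps(obj, separators=(",", ":"))

# A pretty-printed record: valid JSON, but spread over several lines,
# which is exactly what makes sc.textFile hand partial objects to Jackson.
pretty = """{
  "name": "Alice",
  "age": 30
}"""

line = to_json_line(pretty)
print(line)          # {"name":"Alice","age":30}
print("\n" in line)  # False -- safe to feed to jsonFile
```

The same re-serialization can be done inside Spark: read each file whole with sc.wholeTextFiles, map the file contents through a function like this, and pass the resulting RDD of single-line strings to SQLContext.jsonRDD. The trade-off is that each file then becomes one unsplittable record.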