Re: Spark 1.1.1 SQLContext.jsonFile dumps trace if JSON has newlines ...

2014-12-10 Thread Michael Armbrust
Yep, this is the semantic: sc.textFile only guarantees that lines are
preserved across splits, so each line must be a complete JSON object.
It would be possible to write a custom input format, but that hasn't
been done yet.  From the documentation:

http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets


> - jsonFile - loads data from a directory of JSON files *where each
>   line of the files is a JSON object*.
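
In the meantime, a possible workaround is to read each file whole with
sc.wholeTextFiles and hand the flattened strings to jsonRDD.  A minimal
sketch against the 1.1 API, assuming each file holds exactly one JSON
object (the path is illustrative):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("MultiLineJson"))
val sqlContext = new SQLContext(sc)

// wholeTextFiles yields (path, contents) pairs, so a JSON object that
// spans several lines stays together in a single record.
val oneObjectPerRecord = sc
  .wholeTextFiles("/path/to/json/dir")  // illustrative path
  .map { case (_, contents) => contents.replace("\n", " ") }  // also flattens newlines inside string values

// jsonRDD expects one JSON object per string record, which now holds.
val schemaRDD = sqlContext.jsonRDD(oneObjectPerRecord)
schemaRDD.printSchema()

Note that wholeTextFiles keeps each file in memory as a single record,
so this only makes sense for reasonably small files.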
On Wed, Dec 10, 2014 at 11:48 AM, Manoj Samel wrote:

> I am using SQLContext.jsonFile. If a valid JSON contains newlines,
> Spark 1.1.1 dumps the trace below. If the JSON is read as one line, it
> works fine. Is this known?
>
> 14/12/10 11:44:02 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 28)
> com.fasterxml.jackson.core.JsonParseException: Unexpected end-of-input within/between OBJECT entries
>  at [Source: java.io.StringReader@4c8dd4d9; line: 1, column: 19]
> [...]

Spark 1.1.1 SQLContext.jsonFile dumps trace if JSON has newlines ...

2014-12-10 Thread Manoj Samel
I am using SQLContext.jsonFile. If a valid JSON contains newlines,
Spark 1.1.1 dumps the trace below. If the JSON is read as one line, it
works fine. Is this known?
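
For example, a minimal pretty-printed file like the following
(illustrative, not my actual data) reproduces it:

{"a": 1,
 "b": 2}

while the same object on a single line, {"a": 1, "b": 2}, works,
presumably because each physical line is handed to the JSON parser as
its own record, so the parser hits end-of-input in the middle of the
object.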


14/12/10 11:44:02 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 28)
com.fasterxml.jackson.core.JsonParseException: Unexpected end-of-input within/between OBJECT entries
 at [Source: java.io.StringReader@4c8dd4d9; line: 1, column: 19]
at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1524)
at com.fasterxml.jackson.core.json.ReaderBasedJsonParser._skipWS(ReaderBasedJsonParser.java:1682)
at com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:619)
at com.fasterxml.jackson.databind.deser.std.MapDeserializer._readAndBindStringMap(MapDeserializer.java:412)
at com.fasterxml.jackson.databind.deser.std.MapDeserializer.deserialize(MapDeserializer.java:312)
at com.fasterxml.jackson.databind.deser.std.MapDeserializer.deserialize(MapDeserializer.java:26)
at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:2986)
at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2091)
at org.apache.spark.sql.json.JsonRDD$$anonfun$parseJson$1$$anonfun$apply$5.apply(JsonRDD.scala:275)
at org.apache.spark.sql.json.JsonRDD$$anonfun$parseJson$1$$anonfun$apply$5.apply(JsonRDD.scala:274)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:172)
at scala.collection.AbstractIterator.reduceLeft(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$18.apply(RDD.scala:847)
at org.apache.spark.rdd.RDD$$anonfun$18.apply(RDD.scala:845)
at org.apache.spark.SparkContext$$anonfun$28.apply(SparkContext.scala:1179)
at org.apache.spark.SparkContext$$anonfun$28.apply(SparkContext.scala:1179)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)

14/12/10 11:44:02 WARN TaskSetManager: Lost task 0.0 in stage 14.0 (TID 28, localhost): com.fasterxml.jackson.core.JsonParseException: Unexpected end-of-input within/between OBJECT entries
 at [Source: java.io.StringReader@4c8dd4d9; line: 1, column: 19]
com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1524)
com.fasterxml.jackson.core.json.ReaderBasedJsonParser._skipWS(ReaderBasedJsonParser.java:1682)
com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:619)
com.fasterxml.jackson.databind.deser.std.MapDeserializer._readAndBindStringMap(MapDeserializer.java:412)
com.fasterxml.jackson.databind.deser.std.MapDeserializer.deserialize(MapDeserializer.java:312)
com.fasterxml.jackson.databind.deser.std.MapDeserializer.deserialize(MapDeserializer.java:26)
com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:2986)
com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2091)
org.apache.spark.sql.json.JsonRDD$$anonfun$parseJson$1$$anonfun$apply$5.apply(JsonRDD.scala:275)
org.apache.spark.sql.json.JsonRDD$$anonfun$parseJson$1$$anonfun$apply$5.apply(JsonRDD.scala:274)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:172)
scala.collection.AbstractIterator.reduceLeft(Iterator.scala:1157)
org.apache.spark.rdd.RDD$$anonfun$18.apply(RDD.scala:847)
org.apache.spark.rdd.RDD$$anonfun$18.apply(RDD.scala:845)
org.apache.spark.SparkContext$$anonfun$28.apply(SparkContext.scala:1179)
org.apache.spark.SparkContext$$anonfun$28.apply(SparkContext.scala:1179)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:744)

14/12/10 11:44:02 ERROR TaskSetManager: Task 0 in stage 14.0 failed 1 times; aborting job
