Re: Exception when reading multiline JSON file
Hi Kumaresh,

This is most likely an issue with your Spark cluster not being large enough to accomplish the desired task. The usual hints for this kind of situation are stack-trace messages saying that a size limitation was exceeded or that a node was lost. However, this is also a great opportunity to learn how to monitor a Spark job for performance. The guidance on the Spark website (linked below) will get you started, and you will want to dig into the Spark cluster UI and the driver log files to see that information in action; a minimal event-log configuration sketch follows the quoted message below. Databricks also makes situations like this easier to troubleshoot, because its notebook interface lets you execute the Spark code one step at a time.

Hope that helps point you in the right direction.

https://spark.apache.org/docs/latest/monitoring.html
https://m.youtube.com/watch?v=KscZf1y97m8

Kevin

On Thu, Sep 12, 2019 at 12:04 PM Kumaresh AK wrote:

> Hello Spark Community!
> I am new to Spark. I tried to read a multiline JSON file (it has around
> 2M records and its gzipped size is about 2 GB) and encountered an
> exception. It works if I convert the same file into jsonl before reading
> it via Spark. Unfortunately the file is private and I cannot share it.
> Is there any information that can help me narrow it down?
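To act on the monitoring advice above, here is a minimal sketch of enabling Spark's event log so a failed run can be replayed in the history server; the application name and the local event-log directory are assumptions, not details from this thread:

# Minimal sketch: enable the event log so the run can be inspected afterwards.
# The app name and event-log directory are assumptions; adjust for your setup,
# and create the directory before starting the application.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("multiline-json-debug")                           # hypothetical name
    .config("spark.eventLog.enabled", "true")                  # record job/stage/task events
    .config("spark.eventLog.dir", "file:///tmp/spark-events")  # where the log is written
    .getOrCreate()
)

# After the run, start the history server pointed at the same directory
# (sbin/start-history-server.sh, with spark.history.fs.logDirectory set to
# file:///tmp/spark-events) and browse the stages at http://localhost:18080.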
Exception when reading multiline JSON file
Hello Spark Community!

I am new to Spark. I tried to read a multiline JSON file (it has around 2M records and its gzipped size is about 2 GB) and encountered an exception. It works if I convert the same file into jsonl before reading it via Spark. Unfortunately the file is private and I cannot share it. Is there any information that can help me narrow it down? I tried writing to both parquet and json; both fail with the same exception. This is the short form of the code:

from pyspark.sql.functions import explode  # import needed for explode() below

df = spark.read.option('multiline', 'true').json("file.json.gz") \
    .select(explode("objects")).select("col.*")

df.write.mode('overwrite').json(p)
# this crashes too: df.write.mode('overwrite').parquet(p)

The JSON file is of the form:

{
  "version": "xyz",
  "objects": [
    {
      "id": "abc",
      ...
    },
    {
      "id": "def",
      ...
    },
    ...
  ]
}

This is the exception:

> 2019-09-12 14:02:01,515 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, 172.21.0.3, executor 0): org.apache.spark.SparkException: Task failed while writing rows.
>         at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:257)
>         at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
>         at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>         at org.apache.spark.scheduler.Task.run(Task.scala:121)
>         at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:403)
>         at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:409)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.IllegalArgumentException: Cannot grow BufferHolder by size 16 because the size after growing exceeds size limitation 2147483632
>         at org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:71)
>         at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.grow(UnsafeWriter.java:62)
>         at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.writeUnalignedBytes(UnsafeWriter.java:126)
>         at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:109)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_0_0$(Unknown Source)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_1_6$(Unknown Source)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
>         at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1$$anonfun$apply$1.apply(FileFormat.scala:149)
>         at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1$$anonfun$apply$1.apply(FileFormat.scala:148)
>         at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
>         at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:104)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
>         at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
>         at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>         at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:212)
>         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
>         at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
>         at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:244)
>         at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242)
>         at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
>         at org.apache.spark.sql.exe
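Since the message above reports that converting the file to jsonl makes the read succeed, here is a hedged sketch of one way to do that conversion outside Spark; the file paths are hypothetical, the "objects" field name comes from the sample above, and the whole document is decompressed into memory once, so the machine needs enough RAM for that:

# Sketch: convert the multiline JSON document to JSON Lines (jsonl) so that
# Spark's default line-delimited reader can be used instead of multiline mode.
# Paths are hypothetical; json.load() holds the entire decompressed document
# in memory, so this requires enough RAM for the full file.
import gzip
import json

with gzip.open("file.json.gz", "rt", encoding="utf-8") as src:
    doc = json.load(src)                  # {"version": ..., "objects": [...]}

with gzip.open("file.jsonl.gz", "wt", encoding="utf-8") as dst:
    for obj in doc["objects"]:            # one JSON object per output line
        dst.write(json.dumps(obj) + "\n")

# Spark then reads it without the multiline option:
#   df = spark.read.json("file.jsonl.gz")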