Re: Exception when reading multiline JSON file
Hi Kumaresh,

This is most likely an issue with your Spark cluster not being large enough to accomplish the desired task. The usual hints for this kind of situation are stack-trace messages saying that a size limitation was exceeded or that a node was lost. However, this is also a great opportunity to learn how to monitor a Spark job for performance. The guidance on the Spark website (linked below) will get you started, and you will want to dig into the Spark cluster UI and the driver log files to see that information in action; a minimal event-log configuration sketch follows the quoted message below. Databricks also makes situations like this easier to troubleshoot, because its notebook interface lets you execute the Spark code one step at a time.

Hope that helps point you in the right direction.

https://spark.apache.org/docs/latest/monitoring.html
https://m.youtube.com/watch?v=KscZf1y97m8

Kevin

On Thu, Sep 12, 2019 at 12:04 PM Kumaresh AK wrote:

> Hello Spark Community!
> I am new to Spark. I tried to read a multiline JSON file (it has around
> 2M records and its gzipped size is about 2 GB) and encountered an
> exception. It works if I convert the same file into jsonl before reading
> it via Spark. Unfortunately the file is private and I cannot share it.
> Is there any information that can help me narrow it down?
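To act on the monitoring advice above, here is a minimal sketch of enabling Spark's event log so a failed run can be replayed in the history server; the application name and the local event-log directory are assumptions, not details from this thread:

# Minimal sketch: enable the event log so the run can be inspected afterwards.
# The app name and event-log directory are assumptions; adjust for your setup,
# and create the directory before starting the application.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("multiline-json-debug")                           # hypothetical name
    .config("spark.eventLog.enabled", "true")                  # record job/stage/task events
    .config("spark.eventLog.dir", "file:///tmp/spark-events")  # where the log is written
    .getOrCreate()
)

# After the run, start the history server pointed at the same directory
# (sbin/start-history-server.sh, with spark.history.fs.logDirectory set to
# file:///tmp/spark-events) and browse the stages at http://localhost:18080.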
Exception when reading multiline JSON file
Hello Spark Community!

I am new to Spark. I tried to read a multiline JSON file (it has around 2M records and its gzipped size is about 2 GB) and encountered an exception. It works if I convert the same file into jsonl before reading it via Spark. Unfortunately the file is private and I cannot share it. Is there any information that can help me narrow it down? I tried writing to both parquet and json; both fail with the same exception. This is the short form of the code:

from pyspark.sql.functions import explode  # import needed for explode() below

df = spark.read.option('multiline', 'true').json("file.json.gz") \
    .select(explode("objects")).select("col.*")

df.write.mode('overwrite').json(p)
# this crashes too: df.write.mode('overwrite').parquet(p)

The JSON file is of the form:

{
  "version": "xyz",
  "objects": [
    {
      "id": "abc",
      ...
    },
    {
      "id": "def",
      ...
    },
    ...
  ]
}

This is the exception:

> 2019-09-12 14:02:01,515 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, 172.21.0.3, executor 0): org.apache.spark.SparkException: Task failed while writing rows.
>         at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:257)
>         at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
>         at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>         at org.apache.spark.scheduler.Task.run(Task.scala:121)
>         at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:403)
>         at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:409)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.IllegalArgumentException: Cannot grow BufferHolder by size 16 because the size after growing exceeds size limitation 2147483632
>         at org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:71)
>         at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.grow(UnsafeWriter.java:62)
>         at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.writeUnalignedBytes(UnsafeWriter.java:126)
>         at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:109)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_0_0$(Unknown Source)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_1_6$(Unknown Source)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
>         at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1$$anonfun$apply$1.apply(FileFormat.scala:149)
>         at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1$$anonfun$apply$1.apply(FileFormat.scala:148)
>         at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
>         at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:104)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
>         at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
>         at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>         at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:212)
>         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
>         at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
>         at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:244)
>         at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242)
>         at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
>         at org.apache.spark.sql.exe
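Since the message above reports that converting the file to jsonl makes the read succeed, here is a hedged sketch of one way to do that conversion outside Spark; the file paths are hypothetical, the "objects" field name comes from the sample above, and the whole document is decompressed into memory once, so the machine needs enough RAM for that:

# Sketch: convert the multiline JSON document to JSON Lines (jsonl) so that
# Spark's default line-delimited reader can be used instead of multiline mode.
# Paths are hypothetical; json.load() holds the entire decompressed document
# in memory, so this requires enough RAM for the full file.
import gzip
import json

with gzip.open("file.json.gz", "rt", encoding="utf-8") as src:
    doc = json.load(src)                  # {"version": ..., "objects": [...]}

with gzip.open("file.jsonl.gz", "wt", encoding="utf-8") as dst:
    for obj in doc["objects"]:            # one JSON object per output line
        dst.write(json.dumps(obj) + "\n")

# Spark then reads it without the multiline option:
#   df = spark.read.json("file.jsonl.gz")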