This might be a hint. Maybe invalid data?

Caused by: java.lang.IllegalArgumentException: Missing required char ':' at
'struct<LeftData:int,collect_set^(RightData):array<int>>'
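If that parse failure is the root cause, the problem is likely the
auto-generated aggregate column name rather than the row data: the ORC
type parser stops at the '(' (the position marked by '^'), because a
parenthesis is not valid where it expects the ':' after a field name. A
minimal sketch of a workaround, under that assumption, is to alias the
aggregate before writing; the alias name "right_data_set" is just an
illustrative choice, and dframe/args(1) are taken from the snippet in the
original mail below:

    import org.apache.spark.sql.SaveMode
    import org.apache.spark.sql.functions.{array, collect_set}

    // Aliasing gives the output column a plain name, so the schema string
    // handed to the ORC writer no longer embeds '(' or ')' in a field name.
    val wind_2 = dframe
      .groupBy("LeftData")
      .agg(collect_set(array("RightData")).alias("right_data_set"))

    wind_2.write.mode(SaveMode.Append).format("orc").save(args(1))

More generally, aliasing every aggregate column before writing to a format
with a strict schema parser (ORC, Hive) avoids this class of error, since
Spark's default names such as collect_set(array(RightData)) contain
characters those parsers may reject.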
On Wed., 24 Jul. 2019, 2:15 pm Balakumar iyer S, <bala93ku...@gmail.com> wrote:

> Hi Bobby Evans,
>
> I apologise for the delayed response. Yes, you are right, I missed pasting
> the complete stack trace. I have attached the complete YARN log for the
> same.
>
> Thank you. It would be helpful if you could assist me with this error.
>
> Regards
> Balakumar Seetharaman
>
>
> On Mon, Jul 22, 2019 at 7:05 PM Bobby Evans <bo...@apache.org> wrote:
>
>> You are missing a lot of the stack trace that could explain the
>> exception. All it shows is that an exception happened while writing out
>> the ORC file, not what the underlying exception is; there should be at
>> least one more "Caused by" under the one you included.
>>
>> Thanks,
>>
>> Bobby
>>
>> On Mon, Jul 22, 2019 at 5:58 AM Balakumar iyer S <bala93ku...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I am trying to perform a group-by followed by an aggregate collect_set
>>> operation on a two-column dataset with schema (LeftData int, RightData
>>> int).
>>>
>>> Code snippet:
>>>
>>>   val wind_2 =
>>>     dframe.groupBy("LeftData").agg(collect_set(array("RightData")))
>>>
>>>   wind_2.write.mode(SaveMode.Append).format("orc").save(args(1))
>>>
>>> The above code works fine on a smaller dataset but throws the following
>>> error on a large dataset (where each key in the LeftData column needs to
>>> be grouped with approximately 64k values).
>>>
>>> Could someone assist me with this? Should I set any configuration to
>>> accommodate such large values?
>>>
>>> ERROR
>>> ---------------------------------
>>> Driver stacktrace:
>>> at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1599)
>>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1587)
>>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1586)
>>> at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>>> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>>> at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1586)
>>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
>>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
>>> at scala.Option.foreach(Option.scala:257)
>>> at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
>>> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1820)
>>> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1769)
>>> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1758)
>>> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>>> at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
>>> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
>>> at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:194)
>>>
>>> Caused by: org.apache.spark.SparkException: Task failed while writing rows.
>>> at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
>>>
>>> --
>>> REGARDS
>>> BALAKUMAR SEETHARAMAN
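For anyone hitting the same error: a quick way to confirm that it is the
auto-generated column name (and not the row data) that the writer rejects
is to print the schema before writing. This sketch assumes the dframe from
the snippet above:

    // The default output column name embeds parentheses, e.g.
    //  |-- collect_set(array(RightData)): array (nullable = true)
    // which matches the kind of field name the ORC type parser in the
    // error above appears to choke on.
    dframe
      .groupBy("LeftData")
      .agg(collect_set(array("RightData")))
      .printSchema()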