This might be a hint. Maybe invalid data, or an invalid character in the
generated column name? The parser appears to stop at the '(' where it
expects a ':'.

Caused by: java.lang.IllegalArgumentException: Missing required char
':' at 'struct<LeftData:int,collect_set^(RightData):array<int>>'
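If that is the cause, one possible workaround is to give the aggregate an
explicit alias so the ORC field name contains no parentheses. A minimal
sketch, assuming the same variable names as in the snippet below; the alias
"RightDataSet", the object name, and the inline sample data are illustrative
only, not from the original code:

  import org.apache.spark.sql.{SaveMode, SparkSession}
  import org.apache.spark.sql.functions.{array, collect_set}

  object CollectSetToOrc {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder().appName("collect-set-to-orc").getOrCreate()
      import spark.implicits._

      // Stand-in for the two-column dataset described in the thread
      // (LeftData int, RightData int).
      val dframe = Seq((1, 10), (1, 20), (2, 30)).toDF("LeftData", "RightData")

      // Aliasing the aggregate keeps '(' and ')' out of the column name,
      // which is what the quoted ORC error appears to trip over.
      val wind_2 = dframe
        .groupBy("LeftData")
        .agg(collect_set(array("RightData")).alias("RightDataSet"))

      wind_2.write.mode(SaveMode.Append).format("orc").save(args(1))

      spark.stop()
    }
  }

The only change relative to the snippet further down the thread is the
.alias(...) call; everything else is left as it was.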


On Wed., 24 Jul. 2019, 2:15 pm Balakumar iyer S, <bala93ku...@gmail.com>
wrote:

> Hi Bobby Evans,
>
> I apologise for the delayed response. Yes, you are right, I missed
> pasting the complete stack trace. I have attached the complete YARN log
> herewith.
>
> Thank you. It would be helpful if you could assist me with this error.
>
>
> -----------------------------------------------------------------------------------------------------------------------------------------
> Regards
> Balakumar Seetharaman
>
>
> On Mon, Jul 22, 2019 at 7:05 PM Bobby Evans <bo...@apache.org> wrote:
>
>> You are missing a lot of the stack trace that could explain the
>> exception. All it shows is that an exception happened while writing out
>> the ORC file, not what the underlying exception is; there should be at
>> least one more "Caused by" beneath the one you included.
>>
>> Thanks,
>>
>> Bobby
>>
>> On Mon, Jul 22, 2019 at 5:58 AM Balakumar iyer S <bala93ku...@gmail.com>
>> wrote:
>>
>>> Hi ,
>>>
>>> I am trying to perform a group-by followed by a collect_set aggregation
>>> on a two-column dataset with schema (LeftData int, RightData int).
>>>
>>> code snippet:
>>>
>>>   val wind_2 =
>>>     dframe.groupBy("LeftData").agg(collect_set(array("RightData")))
>>>
>>>   wind_2.write.mode(SaveMode.Append).format("orc").save(args(1))
>>>
>>> The above code works fine on a smaller dataset, but throws the following
>>> error on a large dataset (where each key in the LeftData column is
>>> grouped with approximately 64k values).
>>>
>>> Could someone assist me with this? Should I set any configuration to
>>> accommodate such large values?
>>>
>>> ERROR
>>> ---------------------------------
>>> Driver stacktrace:
>>> at org.apache.spark.scheduler.DAGScheduler.org
>>> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1599)
>>> at
>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1587)
>>> at
>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1586)
>>> at
>>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>>> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>>> at
>>> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1586)
>>> at
>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
>>> at
>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
>>> at scala.Option.foreach(Option.scala:257)
>>> at
>>> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
>>> at
>>> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1820)
>>> at
>>> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1769)
>>> at
>>> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1758)
>>> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>>> at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
>>> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
>>> at
>>> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:194)
>>>
>>>
>>> Caused by: org.apache.spark.SparkException: Task failed while writing
>>> rows.
>>> at
>>> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
>>>
>>> --
>>> REGARDS
>>> BALAKUMAR SEETHARAMAN
>>>
>>>
>
> --
> REGARDS
> BALAKUMAR SEETHARAMAN
>
>
