Re: [Spark Sql/ UDFs] Spark and Hive UDFs parity
Thanks Georg. But I'm not sure how mapPartitions is relevant here. Can you elaborate?

On Thu, Jun 15, 2017 at 4:18 AM, Georg Heiler wrote:
> What about using mapPartitions instead?
>
> RD wrote on Thu., June 15, 2017 at 06:52:
>
>> Hi Spark folks,
>>
>> Is there any plan to support the richer UDF API that Hive supports for
>> Spark UDFs? Hive supports the GenericUDF API, which has, among others,
>> methods like initialize() and configure() (called once on the cluster),
>> which a lot of our users use. We now have a lot of UDFs in Hive that make
>> use of these methods. We plan to move these UDFs to Spark UDFs but are
>> limited by not having similar lifecycle methods. Are there plans to
>> address this? Or do people usually adopt some sort of workaround?
>>
>> If we use the Hive UDFs directly in Spark, we pay a performance penalty.
>> I think Spark does a conversion from InternalRow to Row and back to
>> InternalRow for native Spark UDFs, and for Hive it does InternalRow to
>> Hive Object and back to InternalRow, but somehow the conversion for
>> native UDFs is more performant.
>>
>> -Best,
>> R.
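The mapPartitions suggestion presumably refers to the common workaround of doing one-time setup inside the function passed to mapPartitions, so expensive initialization runs once per partition rather than once per row, approximating Hive's initialize()/configure() hooks. A minimal sketch follows; the names `ExpensiveResource` and `withInit` are illustrative, not Spark or Hive API. Since the function given to `rdd.mapPartitions` is just an `Iterator[In] => Iterator[Out]`, the pattern can be demonstrated on a plain iterator:

```scala
// Stand-in for whatever initialize()/configure() would build in a Hive
// GenericUDF: a parser, a dictionary, a connection, etc. (hypothetical name).
class ExpensiveResource {
  def transform(s: String): String = s.toUpperCase // placeholder per-row work
}

// Build the resource once per partition, then reuse it for every row.
// The returned Iterator => Iterator function is exactly the shape that
// rdd.mapPartitions expects.
def withInit[R, In, Out](init: () => R)(f: (R, In) => Out): Iterator[In] => Iterator[Out] = {
  iter =>
    val resource = init()          // runs once per partition, not per row
    iter.map(row => f(resource, row))
}

val partitionFn =
  withInit(() => new ExpensiveResource)((r, s: String) => r.transform(s))

// Shown here on a plain iterator; on a real RDD this would be
// rdd.mapPartitions(partitionFn).
val out = partitionFn(Iterator("a", "b")).toList // List("A", "B")
```

Note this only approximates the Hive lifecycle: setup runs once per partition per task, not once per executor, and there is no counterpart to a cluster-wide configure() call.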
[Spark Sql/ UDFs] Spark and Hive UDFs parity
Hi Spark folks,

Is there any plan to support the richer UDF API that Hive supports for Spark UDFs? Hive supports the GenericUDF API, which has, among others, methods like initialize() and configure() (called once on the cluster), which a lot of our users use. We now have a lot of UDFs in Hive that make use of these methods. We plan to move these UDFs to Spark UDFs but are limited by not having similar lifecycle methods. Are there plans to address this? Or do people usually adopt some sort of workaround?

If we use the Hive UDFs directly in Spark, we pay a performance penalty. I think Spark does a conversion from InternalRow to Row and back to InternalRow for native Spark UDFs, and for Hive it does InternalRow to Hive Object and back to InternalRow, but somehow the conversion for native UDFs is more performant.

-Best,
R.
Many 'active' jobs are pending
Hi there,

I'm hitting a "many active jobs" issue when using direct Kafka streaming on YARN (Spark 1.5, Hadoop 2.6, CDH 5.5.1). The problem happens when Kafka has almost no traffic. In the application UI, I see many 'active' jobs pending for hours, and eventually the driver logs "Requesting 4 new executors because tasks are backlogged".

But when I look at the driver log for one of these 'active' jobs, the log says the job has finished. So why does the application UI show the job as active, seemingly forever? Thanks!

Here is the related log info for one of the 'active' jobs. There are two stages: a reduceByKey following a flatMap. The log says both stages finished in ~20 ms and the whole job finished in 64 ms.

Got job 6567
Final stage: ResultStage 9851 (foreachRDD at
Parents of final stage: List(ShuffleMapStage 9850)
Missing parents: List(ShuffleMapStage 9850)
...
Finished task 0.0 in stage 9850.0 (TID 29551) in 20 ms
Removed TaskSet 9850.0, whose tasks have all completed, from pool
ShuffleMapStage 9850 (flatMap at OpaTransLogAnalyzeWithShuffle.scala:83) finished in 0.022 s
...
Submitting ResultStage 9851 (ShuffledRDD[16419] at reduceByKey at OpaTransLogAnalyzeWithShuffle.scala:83), which is now runnable
...
ResultStage 9851 (foreachRDD at OpaTransLogAnalyzeWithShuffle.scala:84) finished in 0.023 s
Job 6567 finished: foreachRDD at OpaTransLogAnalyzeWithShuffle.scala:84, took 0.064372 s
Finished job streaming job 1468592373000 ms.1 from job set of time 1468592373000 ms

Wei Lu