Re: [Spark Sql/ UDFs] Spark and Hive UDFs parity

2017-06-16 Thread RD
Thanks, Georg. But I'm not sure how mapPartitions is relevant here. Can you
elaborate?



On Thu, Jun 15, 2017 at 4:18 AM, Georg Heiler wrote:

> What about using map partitions instead?
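
For context, mapPartitions is presumably relevant here because it runs a function once per partition and hands it the whole iterator of rows, so per-partition setup and teardown can be written around the row processing, roughly the role initialize()/configure() play in a Hive GenericUDF. A minimal spark-shell-style sketch (the resource being initialized is a hypothetical placeholder):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("mapPartitions-setup-sketch").getOrCreate()
import spark.implicits._

val ds = Seq("a", "b", "c").toDS()

val result = ds.mapPartitions { rows =>
  // Runs once per partition, not once per row: an approximation of initialize()/configure().
  val resource = new StringBuilder("per-partition state")   // hypothetical expensive setup
  rows.map(row => s"$row:${resource.length}")
}

result.show()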


[Spark Sql/ UDFs] Spark and Hive UDFs parity

2017-06-14 Thread RD
Hi Spark folks,

Is there any plan to support the richer UDF API that Hive supports for
Spark UDFs? Hive supports the GenericUDF API which has, among other things,
methods like initialize() and configure() (called once on the cluster),
which a lot of our users rely on. We now have a lot of UDFs in Hive that make
use of these methods. We plan to move our UDFs to Spark UDFs but are
limited by not having similar lifecycle methods.
   Are there plans to address these? Or do people usually adopt some sort
of workaround?
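
   To make the lifecycle hooks concrete, here is a minimal GenericUDF sketch; the
upper-casing logic and the class name are hypothetical placeholders, not an actual UDF:

import org.apache.hadoop.hive.ql.exec.MapredContext
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF.DeferredObject
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory

class UpperUdf extends GenericUDF {
  // Resolves argument and return types before any rows are evaluated.
  override def initialize(arguments: Array[ObjectInspector]): ObjectInspector = {
    require(arguments.length == 1, "upper_udf expects exactly one argument")
    PrimitiveObjectInspectorFactory.javaStringObjectInspector
  }

  // Optional hook that receives the MapReduce context; commonly used to read
  // job configuration or set up expensive shared state.
  override def configure(context: MapredContext): Unit = ()

  override def evaluate(arguments: Array[DeferredObject]): AnyRef = {
    val value = arguments(0).get()
    if (value == null) null else value.toString.toUpperCase
  }

  override def getDisplayString(children: Array[String]): String =
    s"upper_udf(${children.mkString(", ")})"
}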

   If we directly use the Hive UDFs in Spark we pay a performance
penalty. I think Spark does a conversion from InternalRow to Row and back to
InternalRow for native Spark UDFs, and for Hive UDFs it does InternalRow to
Hive object and back to InternalRow, but somehow the conversion for native
UDFs is more performant.
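
   For illustration, the two paths side by side, spark-shell style; the fully
qualified class name is a hypothetical placeholder and assumes a GenericUDF like the
sketch above is on the classpath of a Hive-enabled session:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("udf-registration-sketch")
  .enableHiveSupport()
  .getOrCreate()

// Path 1: use the Hive UDF directly (pays the InternalRow <-> Hive object conversion).
spark.sql("CREATE TEMPORARY FUNCTION upper_hive AS 'com.example.UpperUdf'")

// Path 2: re-implement as a native Spark UDF (InternalRow <-> Row conversion only,
// and no lifecycle hooks).
spark.udf.register("upper_native", (s: String) => if (s == null) null else s.toUpperCase)

spark.sql("SELECT upper_hive('a'), upper_native('a')").show()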

-Best,
R.


many 'active' jobs are pending

2016-07-15 Thread 陆巍|Wei Lu(RD
Hi there,

I am running into a “many active jobs” issue when using direct Kafka streaming on
YARN (Spark 1.5, Hadoop 2.6, CDH 5.5.1).

The problem happens when Kafka has almost NO traffic.

From the application UI, I see many ‘active’ jobs pending for hours, and eventually
the driver logs “Requesting 4 new executors because tasks are backlogged”.

But when looking at the driver log for one of these ‘active’ jobs, the log says the job
has finished. So why does the application UI show the job as active seemingly forever?

Thanks!


Here is the related log info for one of the ‘active’ jobs.
There are two stages: a reduceByKey follows a flatMap. The log says both stages
finish in ~20 ms and the job finishes in 64 ms.

Got job 6567
Final stage: ResultStage 9851(foreachRDD at
Parents of final stage: List(ShuffleMapStage 9850)
Missing parents: List(ShuffleMapStage 9850)
…
Finished task 0.0 in stage 9850.0 (TID 29551) in 20 ms
Removed TaskSet 9850.0, whose tasks have all completed, from pool
ShuffleMapStage 9850 (flatMap at OpaTransLogAnalyzeWithShuffle.scala:83) 
finished in 0.022 s
…
Submitting ResultStage 9851 (ShuffledRDD[16419] at reduceByKey at 
OpaTransLogAnalyzeWithShuffle.scala:83), which is now runnable
…
ResultStage 9851 (foreachRDD at OpaTransLogAnalyzeWithShuffle.scala:84) 
finished in 0.023 s
Job 6567 finished: foreachRDD at OpaTransLogAnalyzeWithShuffle.scala:84, took 
0.064372 s
Finished job streaming job 1468592373000 ms.1 from job set of time 
1468592373000 ms
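
For reference, a minimal sketch of the kind of direct-stream pipeline those stage names
suggest (flatMap into reduceByKey into foreachRDD), using the Spark 1.5 / Kafka 0.8 direct
API; the broker, topic, batch interval, and parsing logic are assumptions, not taken from
OpaTransLogAnalyzeWithShuffle.scala:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("direct-stream-sketch")
val ssc = new StreamingContext(conf, Seconds(1))                    // assumed batch interval

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")     // assumed broker
val topics = Set("trans-log")                                        // assumed topic

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

// One ShuffleMapStage (the flatMap feeding the shuffle) and one ResultStage
// (the reduceByKey output consumed by foreachRDD) per batch, as in the log above.
stream
  .flatMap { case (_, line) => line.split("\\s+").map(word => (word, 1L)) }
  .reduceByKey(_ + _)
  .foreachRDD(rdd => rdd.take(10).foreach(println))

ssc.start()
ssc.awaitTermination()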

Wei Lu