Re: Need help on Spark UDF (Join) Performance tuning .

2014-07-18 Thread S Malligarjunan
Hello Experts,

Appreciate your input highly, please suggest/ give me hint, what would be the 
issue here?

 
Thanks and Regards,
Malligarjunan S.  



On Thursday, 17 July 2014, 22:47, S Malligarjunan smalligarju...@yahoo.com 
wrote:
 


Hello Experts,

I am facing performance problem when I use the UDF function call. Please help 
me to tune the query.
Please find the details below

shark select count(*) from table1;
OK
151096
Time taken: 7.242 seconds
shark select count(*) from table2; 
OK
938
Time taken: 1.273 seconds

Without UDF:
shark SELECT
 
 count(pvc1.time)
    FROM table2 pvc2 JOIN table1 pvc1
    WHERE pvc1.col1 = pvc2.col2
    AND  unix_timestamp(pvc2.time, '-MM-dd HH:mm:ss,SSS')  
unix_timestamp(pvc1.time, '-MM-dd HH:mm:ss,SSS');
OK
328
Time taken: 200.487 seconds


shark 
  SELECT
  count(pvc1.time)
        FROM table2 pvc2 JOIN table1 pvc1
    WHERE (pvc1.col1 = pvc2.col1 OR pvc1.col1 = pvc2.col2)
    AND  unix_timestamp(pvc2.time, '-MM-dd HH:mm:ss,SSS')  
unix_timestamp(pvc1.time, '-MM-dd HH:mm:ss,SSS');
OK
331
Time taken: 292.86 seconds

With UDF:
shark  
   SELECT
  count(pvc1.time)
    FROM table2 pvc2 JOIN table1 pvc1
    WHERE testCompare(pvc1.col1, pvc1.col2, pvc2.col1,pvc2.col2)
   
 AND  unix_timestamp(pvc2.time, '-MM-dd HH:mm:ss,SSS')  
unix_timestamp(pvc1.time, '-MM-dd HH:mm:ss,SSS');

OK
331
Time taken: 3718.23 seconds

The above UDF query takes more time to run. 


Where testCompare is an udf function, The function just does the pvc1.col1 = 
pvc2.col1 OR pvc1.col1 = pvc2.col2

Please let me know what is the issue here?

 
Thanks and Regards,
Sankar S.  

Re: Need help on Spark UDF (Join) Performance tuning .

2014-07-18 Thread Michael Armbrust
It's likely that since your UDF is a black box to hive's query optimizer
that it must choose a less efficient join algorithm that passes all
possible  matches to your function for comparison.  This will happen any
time your UDF touches attributes from both sides of the join.

In general you can learn more about the chosen execution strategy by
running explain.
On Jul 18, 2014 12:04 PM, S Malligarjunan smalligarju...@yahoo.com
wrote:

 Hello Experts,

 Appreciate your input highly, please suggest/ give me hint, what would be
 the issue here?


 Thanks and Regards,
 Malligarjunan S.



   On Thursday, 17 July 2014, 22:47, S Malligarjunan 
 smalligarju...@yahoo.com wrote:


 Hello Experts,

 I am facing performance problem when I use the UDF function call. Please
 help me to tune the query.
 Please find the details below

 shark select count(*) from table1;
 OK
 151096
 Time taken: 7.242 seconds
 shark select count(*) from table2;
 OK
 938
 Time taken: 1.273 seconds


 *Without UDF:*shark SELECT
   count(pvc1.time)
 FROM table2 pvc2 JOIN table1 pvc1
 WHERE pvc1.col1 = pvc2.col2
 AND  unix_timestamp(pvc2.time, '-MM-dd HH:mm:ss,SSS') 
 unix_timestamp(pvc1.time, '-MM-dd HH:mm:ss,SSS');
 OK
 328
 Time taken: 200.487 seconds


 shark
   SELECT
   count(pvc1.time)
 FROM table2 pvc2 JOIN table1 pvc1
 WHERE (pvc1.col1 = pvc2.col1 OR pvc1.col1 = pvc2.col2)
 AND  unix_timestamp(pvc2.time, '-MM-dd HH:mm:ss,SSS') 
 unix_timestamp(pvc1.time, '-MM-dd HH:mm:ss,SSS');
 OK
 331
 Time taken: 292.86 seconds

 *With UDF:*
 shark
SELECT
   count(pvc1.time)
 FROM table2 pvc2 JOIN table1 pvc1
 WHERE testCompare(pvc1.col1, pvc1.col2, pvc2.col1,pvc2.col2)
 AND  unix_timestamp(pvc2.time, '-MM-dd HH:mm:ss,SSS') 
 unix_timestamp(pvc1.time, '-MM-dd HH:mm:ss,SSS');

 OK
 331
 Time taken: 3718.23 seconds
 The above UDF query takes more time to run.

 Where testCompare is an udf function, The function just does the pvc1.col1
 = pvc2.col1 OR pvc1.col1 = pvc2.col2

 Please let me know what is the issue here?


 Thanks and Regards,
 Sankar S.






Need help on Spark UDF (Join) Performance tuning .

2014-07-17 Thread S Malligarjunan
Hello Experts,

I am facing performance problem when I use the UDF function call. Please help 
me to tune the query.
Please find the details below

shark select count(*) from table1;
OK
151096
Time taken: 7.242 seconds
shark select count(*) from table2; 
OK
938
Time taken: 1.273 seconds

Without UDF:
shark SELECT
  count(pvc1.time)
    FROM table2 pvc2 JOIN table1 pvc1
    WHERE pvc1.col1 = pvc2.col2
    AND  unix_timestamp(pvc2.time, '-MM-dd HH:mm:ss,SSS')  
unix_timestamp(pvc1.time, '-MM-dd HH:mm:ss,SSS');
OK
328
Time taken: 200.487 seconds


shark 
  SELECT
  count(pvc1.time)
        FROM table2 pvc2 JOIN table1 pvc1
    WHERE (pvc1.col1 = pvc2.col1 OR pvc1.col1 = pvc2.col2)
    AND  unix_timestamp(pvc2.time, '-MM-dd HH:mm:ss,SSS')  
unix_timestamp(pvc1.time, '-MM-dd HH:mm:ss,SSS');
OK
331
Time taken: 292.86 seconds

With UDF:
shark  
   SELECT
  count(pvc1.time)
    FROM table2 pvc2 JOIN table1 pvc1
    WHERE testCompare(pvc1.col1,pvc1.col2, pvc2.col1,pvc2.col2)
    AND  unix_timestamp(pvc2.time, '-MM-dd HH:mm:ss,SSS')  
unix_timestamp(pvc1.time, '-MM-dd HH:mm:ss,SSS');

OK
331
Time taken: 3718.23 seconds

The above UDF query takes more time to run. 


Where testCompare is an udf function, The function just does the pvc1.col1 = 
pvc2.col1 OR pvc1.col1 = pvc2.col2

Please let me know what is the issue here?

 
Thanks and Regards,
Sankar S.