Hi gurus,
STDDEV function can be used for both aggregates and analytics. Fortunately in Hive it has been implemented which is great. I have a simple question on this if I may I would expect the in-built function STDDEV to run pretty efficiently in most databases as they are system defined functions. At least that is my understanding. I did a test on Hive with both STDDEV in-built function and also the equivalent of it that I wrote myself In this example the base table sales 918,843 rows imported from Oracle sh.sales table. and I am trying to work STDDEV for as column called amount_sold SELECT CASE WHEN COUNT (amount_sold) <= 1 THEN 0 ELSE SQRT((SUM(POWER(amount_sold,2))-(COUNT(1)*POWER(AVG(amount_sold),2)))/(COUNT (1)-1)) END FROM sales; I ran this first with straight forward calculation for standard deviation 1) It returned 259.7804899502628 in 19.656 seconds Then I ran it immediately after using STDDV function itself in hive hive> SELECT STDDEV(amount_sold) from sales; 2) It retuned 259.7803485874695 in 20.346 seconds Now the value itself is pretty close but seems to be taking slightly more time in 2) When I did the same tests on base table in Oracle I got first for my calculation 1 SELECT CASE 2 WHEN COUNT (amount_sold) <= 1 3 THEN 0 4 ELSE SQRT((SUM(POWER(amount_sold,2))-(COUNT(1)*POWER(AVG(amount_sold),2)))/(COUNT (1)-1)) 5 END 6* FROM sh.sales hddtes...@mydb.mich.LOCAL> / CASEWHENCOUNT(AMOUNT_SOLD)<=1THEN0ELSESQRT((SUM(POWER(AMOUNT_SOLD,2))-(COUNT (1)*POWER(AVG(AMOUNT_SOLD),2)))/(COUNT(1)-1))END ---------------------------------------------------------------------------- ------------------------------------------------ 259.78049 Elapsed: 00:00:00.23 It took 230 ms. Then I ran again using in built function in Oracle hddtes...@mydb.mich.LOCAL> SELECT STDDEV(amount_sold) FROM sh.sales; STDDEV(AMOUNT_SOLD) ------------------- 259.78049 Elapsed: 00:00:00.16 Right I would expect the in-built function to be more efficient and quicker. In Oracle it took 160 ms compared to 230 ms from my own calculation. My query is why in Hive it takes longer to do the same calculation with in-built STDDEV function? Is this behaviour expected. Thanks Mich http://talebzadehmich.wordpress.com Publications due shortly: Creating in-memory Data Grid for Trading Systems with Oracle TimesTen and Coherence Cache NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only, if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free, therefore neither Peridale Ltd, its subsidiaries nor their employees accept any responsibility.