Hi,
 
I'm trying to optimize a UDF that runs very slowly on Hive. The UDF reads a 
5 GB table and builds a large data structure out of it to facilitate lookups. 
The 5 GB input is loaded into the distributed cache with an 'add file <path>' 
command, and the UDF builds the data structure once per instance (or so it 
should). 
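
To make the intended build-once behavior concrete, here is a minimal sketch of 
the pattern in plain Java. It is not my actual UDF and does not use the Hive 
API: the class name LazyLookup, the tab-separated key/value format, and the 
buildCount counter are all made up for illustration. In the real UDF the reader 
would presumably be opened on the file shipped with 'add file' rather than on 
an in-memory string.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the UDF: builds its lookup table lazily,
// at most once per instance, on the first call to evaluate().
public class LazyLookup {
    private final String source;        // stand-in for the cached file's contents
    private Map<String, String> table;  // null until the first lookup
    private int buildCount = 0;         // illustration only: counts rebuilds

    public LazyLookup(String source) {
        this.source = source;
    }

    // Parse "key<TAB>value" lines into a hash map, only if not done yet.
    private void ensureLoaded() {
        if (table != null) {
            return; // already built for this instance
        }
        Map<String, String> m = new HashMap<>();
        try (BufferedReader r = new BufferedReader(new StringReader(source))) {
            String line;
            while ((line = r.readLine()) != null) {
                String[] parts = line.split("\t", 2);
                if (parts.length == 2) {
                    m.put(parts[0], parts[1]);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        table = m;
        buildCount++;
    }

    // Called once per row; should hit the already-built table after row one.
    public String evaluate(String key) {
        ensureLoaded();
        return table.get(key);
    }

    public int getBuildCount() {
        return buildCount;
    }
}
```

If a counter like buildCount were logged from the real UDF, a value that keeps 
growing within one task would suggest the structure is being rebuilt per row 
instead of once per instance.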
 
My problem is that the Hive UDF takes several hours to complete, while running 
the exact same code on my local machine takes 5 minutes! What could be causing 
Hive to be so impractically slow? According to the Hive logs, the data transfer 
takes 5-10 minutes, which is reasonable. What else is taking so long?
 
Thanks,
B
                                          
