Hi,
I'm trying to optimize a UDF that runs very slowly on Hive. The UDF takes in a
5GB table and builds a large data structure out of it to facilitate lookups.
The 5GB input is loaded into the distributed cache with an 'add file <path>'
command, and the UDF builds the data structure only once per instance (or at
least it is supposed to).
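To make "a single time per instance" concrete, here is a minimal sketch of the
build-once pattern, not the actual UDF code. It is plain Java with no Hive
classes so it runs standalone. The class name LazyLookup, the tab-separated
key/value line format, and the buildCount counter are all hypothetical and
exist only for illustration. In a real UDF, evaluate() would call lookup() and
the line source would read the file shipped with 'add file'.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical helper: builds a lookup table on first use and reuses it afterwards.
// Not thread-safe; assumes each instance is used by a single thread.
final class LazyLookup {
    private final Supplier<List<String>> lineSource; // stands in for reading the cached file
    private Map<String, String> table;               // null until the first lookup
    private int buildCount = 0;                      // number of times this instance built the table

    LazyLookup(Supplier<List<String>> lineSource) {
        this.lineSource = lineSource;
    }

    String lookup(String key) {
        if (table == null) {
            long start = System.nanoTime();
            Map<String, String> built = new HashMap<>();
            for (String line : lineSource.get()) {
                // Assumed format: key<TAB>value; lines without a tab are skipped.
                String[] parts = line.split("\t", 2);
                if (parts.length == 2) {
                    built.put(parts[0], parts[1]);
                }
            }
            table = built;
            buildCount++;
            long ms = (System.nanoTime() - start) / 1_000_000;
            // Logging each build makes repeated builds visible in the task logs.
            System.err.println("built lookup table (build #" + buildCount + ", "
                    + table.size() + " entries) in " + ms + " ms");
        }
        return table.get(key);
    }

    int getBuildCount() {
        return buildCount;
    }

    public static void main(String[] args) {
        LazyLookup lookup = new LazyLookup(() -> Arrays.asList("a\t1", "b\t2"));
        System.out.println(lookup.lookup("a"));
        System.out.println(lookup.lookup("b"));
        System.out.println("builds: " + lookup.getBuildCount());
    }
}
```

If a log line like the one above showed up more than once per task, or once per
row, that would suggest the structure is being rebuilt rather than reused.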
My problem is that the Hive UDF takes several hours to complete, while running
the exact same code on my local machine takes 5 minutes. What could be making
it so much slower on Hive? According to the Hive logs, the data transfer takes
5-10 minutes, which is reasonable, so the transfer alone does not explain the
gap. What else could be taking so long?
Thanks,
B