Hi all,

We use Hive (2.0.1) on Tez (0.9.1) at my company as our data processing layer. I recently raised hive.auto.convert.join.noconditionaltask.size, which caps the memory used to hold the broadcast (small-table) side of a map join, from 10 MB to 512 MB. Our Tez containers are allocated 3 GB, so the map-join memory accounted for roughly 17% of the total container memory.
Once this change was rolled out, our daily pipelines suffered increasing delays, and over the course of a week we were missing all our SLAs. Reverting it smoothed things out again. Funnily enough, individual query performance actually improved, since a map join skips the shuffle and sort phases of a merge join, but when running thousands of jobs an hour the setting doesn't seem to scale. Does anyone have insight into why this is the case? Do you normally raise the map-join memory setting for individual jobs, or do you set high defaults?

Thanks,
Nitin
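P.S. To clarify what I mean by the per-job alternative: instead of changing the cluster-wide default in hive-site.xml, the override would be set at the session level only for the join-heavy queries that benefit. A rough sketch (the value is in bytes; the exact numbers here are just illustrative):

```sql
-- Enable unconditional map-join conversion and raise the small-table
-- threshold to 512 MB for this session only (value is in bytes).
SET hive.auto.convert.join.noconditionaltask = true;
SET hive.auto.convert.join.noconditionaltask.size = 536870912;

-- ... run the one join-heavy query that benefits from the map join ...

-- Restore the conservative 10 MB default for everything else.
SET hive.auto.convert.join.noconditionaltask.size = 10000000;
```

That keeps the memory pressure confined to the handful of queries that need it, rather than every container on the cluster.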