Hi all,

We use Hive (2.0.1) on Tez (0.9.1) at my company as our data processing layer. I recently raised hive.auto.convert.join.noconditionaltask.size, which caps the memory used to hold the broadcast (small-table) side of a map join, from 10 MB to 512 MB. Our Tez containers are allocated 3 GB, so the map-join memory accounted for roughly 17% of the total container memory.
Once this change was rolled out, our daily pipelines suffered increasing delays, and over the course of a week we were missing all our SLAs. Reverting it smoothed things out again. Funnily enough, individual query performance actually improved, since a map join skips the shuffle and sort phases of a merge join, but when running thousands of jobs an hour the setting doesn't seem to scale. Does anyone have insight into why this is the case? Do you normally raise the map-join memory setting for individual jobs, or do you set high defaults?

Thanks,
Nitin
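P.S. To clarify what I mean by the per-job alternative: instead of changing the cluster-wide default in hive-site.xml, the override would be set at the session level only for the join-heavy queries that benefit. A rough sketch (the value is in bytes; the exact numbers here are just illustrative):

```sql
-- Enable unconditional map-join conversion and raise the small-table
-- threshold to 512 MB for this session only (value is in bytes).
SET hive.auto.convert.join.noconditionaltask = true;
SET hive.auto.convert.join.noconditionaltask.size = 536870912;

-- ... run the one join-heavy query that benefits from the map join ...

-- Restore the conservative 10 MB default for everything else.
SET hive.auto.convert.join.noconditionaltask.size = 10000000;
```

That keeps the memory pressure confined to the handful of queries that need it, rather than every container on the cluster.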