> The table also has a large Regex serde.

There are no stats fast paths for the Regex SerDe.
The statistics computation is lifting each row into memory, parsing it, and throwing it away. Most of your time would be spent in GC (check the GC time millis), due to the huge expense of the Regex SerDe.

For a direct comparison, you could compute stats while converting the table into another format:

    set hive.stats.autogather=true;
    create table tmp1 stored as orc as select * from oldtable;

Due to the nature of the columnar SerDes, that ETL would happen in parallel with the stats computation, off the same stream (i.e. autogather).

That said, I have noticed performance issues with the RegexSerDe, but haven't bothered to fix them yet - maybe you'd want to take a shot at fixing them?

Cheers,
Gopal
