> The table also has a large Regex serde.

There are no stats fast paths for Regex SerDe.

The statistics computation is lifting each row into memory, parsing it and
throwing it away.

Most of your time would be spent in GC (check the GC time millis counter),
due to the huge expense of the Regex SerDe.

For a direct comparison, you could compute stats while converting it into
another format:

set hive.stats.autogather=true;
create table tmp1 stored as orc as select * from oldtable;

Due to the nature of the columnar SerDes, that ETL would happen in
parallel with the stats computation off the same stream (i.e. autogather).
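
To put numbers on the comparison, something along these lines should work
as a rough sketch (it assumes the oldtable/tmp1 names from above and that
the CTAS has already finished):

-- time the explicit stats pass over the Regex SerDe table
analyze table oldtable compute statistics;

-- then look at what autogather recorded for the ORC copy
describe formatted tmp1;

The numRows and rawDataSize entries under Table Parameters in the describe
output are the basic stats that were gathered during the insert.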

That said, I have noticed performance issues with the RegexSerDe, but
haven't bothered to fix them yet - maybe you'd want to take a shot at
fixing them?


Cheers,
Gopal

