Re: Analyze table compute statistics on wide table taking too long

Gopal Vijayaraghavan Tue, 07 Apr 2015 18:45:07 -0700

> The table also has a large Regex serde.

There are no stats fast paths for Regex SerDe.


The statistics computation is lifting each row into memory, parsing it and
throwing it away.

Most of your time would be spent in GC (check the GC time millis), due to
the huge expense of the Regex Serde.

For a direct comparison, you could compute stats while converting the table
to another format:

set hive.stats.autogather=true;
create table tmp1 stored as orc as select * from oldtable;

Due to the nature of the columnar SerDes, that ETL would happen in
parallel with the stats computation off the same stream (i.e. autogather).
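
To check that autogather actually populated the basic stats during the
CTAS, something like the following should work. This is only a sketch:
tmp1 is the table from the example above, the parameter names in the
output can differ between Hive versions, and older versions may require an
explicit column list for the column-stats statement.

```sql
-- Basic table stats appear under "Table Parameters"
-- (typically numRows, rawDataSize, totalSize).
describe formatted tmp1;

-- Column-level stats are a separate pass; run against the ORC copy,
-- this scan no longer pays the regex parsing cost.
analyze table tmp1 compute statistics for columns;
```

If numRows is populated right after the CTAS, the stats came from the same
pass that wrote the ORC files, which is the comparison point against the
slow analyze on the regex-backed table.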

That said, I have noticed performance issues with the RegexSerde, but
haven't bothered to fix it yet - maybe you'd want to take a shot at fixing
it?


Cheers,
Gopal
