Hi Gopal,

Thanks for that. I'm happy to look into improving the Regex serde performance; any tips on where I should start looking?

Regards,
Roger

On 08/04/2015 11:44 AM, "Gopal Vijayaraghavan" <[email protected]> wrote:

> > The table also has a large Regex serde.
>
> There are no stats fast paths for Regex SerDe.
>
> The statistics computation is lifting each row into memory, parsing it and
> throwing it away.
>
> Most of your time would be spent in GC (check the GC time millis), due to
> the huge expense of the Regex Serde.
>
> For a direct comparison you could compute stats while turning it into
> another format
>
> set hive.stats.autogather=true;
> create table tmp1 stored as orc as select * from oldtable;
>
> Due to the nature of the columnar SerDes, that ETL would happen in
> parallel to the compute stats off the same stream (i.e autogather).
>
> That said, I have noticed performance issues with the RegexSerde, but
> haven't bothered to fix it yet - maybe you'd want to take a shot at fixing
> it?
>
> Cheers,
> Gopal
