Parquet

Gopal Vijayaraghavan Thu, 15 Dec 2016 15:00:52 -0800

 
> Actually, we don't have that many partitions - there are lot of gaps both in 
> days and time events as well.


Your partition description sounded a lot like one of the FAQs from Mithun's 
talks, which is why I asked

http://www.slideshare.net/Hadoop_Summit/hive-at-yahoo-letters-from-the-trenches/24

> But, I would like to understand when you say " time spent might partly be 
> query planning with million partitions"? I presume, this is in producing the 
> physical plan? -- does it spend time in allocating group of partition 
> directories to each map task

Yes, the physical planner is significant overhead, since all map-tasks get a 
list of all partitions and match each read against that list (partition schema 
can evolve, this is closer to O(n^2)) & the split-generation is bottlenecked by 
the total # of files involved in the operation (a slow O(n) operation is still 
slow).

The overhead in maintaining partitions is fairly high & the entire query 
planning will try to do something like a du -sh on each partition, if all the 
basic file statistics are missing etc.

Also, if you have >100k partitions, disabling the metadata optimizer 
(hive.optimize.metadataonly=false) would be a good thing, since that codepath 
is single threaded, while a compute-heavy full-table scan is much faster due to 
parallelism - it might burn more CPU, but it would come back in less than an 
hour.

You might want to opt for daily partitions & also run the stats gathering ops 
with "analyze table <table> compute statistics partialscan;" & "analyze table 
<table> compute statistics for columns" to speed up further queries.

At least in my experience with ORC + valid stats, a query like "select count(*) 
from table" should takes <500 milliseconds.

Cheers,
Gopal

Re: Hive/TEZ/Parquet

Reply via email to