Unless your table is partitioned or bucketed by myid, Hive generally
requires to read through all the records to find the records that match
your predicate.

In other words, Hive table are generally not indexed for single record
retrieval like you would expect RDBMs tables or Vertica tables to be
indexed to allow single record.
Some file formats like ORC (and maybe Parquet, I'm not sure) allow to add
bloom filters on specific columns of a table
which could work as a kind of index.
Also, depending on the query engine you are using (Hive, Spark-SQL, Impala,
Presto...) and its version, they may or may not be able to leverage certain
storage optimization.
For example, Spark still does not support Hive Bucketed Table optimization.
But it might come in the upcoming Spark 2.3.

I'm much less familiar with Parquet, so if anyone has links to a good
documentation for Parquet fine tuning (or even better a comparison with ORC
features) that would be really helpful.
By googling, I found these slides where someone at Netflix
seems to have tried the same kind of optimization as you in Parquet.

On 23 February 2018 at 12:02, Sun, Keith <ai...@ebay.com> wrote:

> Hi,
> Why Hive still read so much "records" even with a filter pushdown enabled
> and the returned dataset would be a very small amount ( 4k out of
> 30billion records).
> The "RECORDS_IN" counter of Hive which still showed the 30billion count
> and also the output in the map reduce log like this :
> org.apache.hadoop.hive.ql.exec.MapOperator: MAP[4]: records read - 100000
> BTW, I am using parquet as stoarg format and the filter pushdown did work
> as i see this in log :
> AM INFO: parquet.filter2.compat.FilterCompat: Filtering using predicate: 
> eq(myid, 223)
> Thanks,
> Keith

Reply via email to