Hi,
My env: Hive 1.2.1 and Parquet 1.8.1
Per my search in the Hive and Parquet source code (version 1.8.1), I did not see
the parameters in those slides, but I found these here:
https://github.com/apache/parquet-mr/blob/parquet-1.8.x/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputFormat.java
For Hive (also verified in my test), the row group filter is applied automatically;
check ParquetRecordReaderWrapper:
    if (filter != null) {
      filtedBlocks = RowGroupFilter.filterRowGroups(filter, splitGroup,
          fileMetaData.getSchema());
      if (filtedBlocks.isEmpty()) {
        LOG.debug("All row groups are dropped due to filter predicates");
        return null;
      }
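For reference, these are the Hive settings that, to my understanding, control whether the predicate actually reaches that code path (a sketch for Hive 1.2; please double-check the defaults in your build's HiveConf):

```sql
-- Enable predicate pushdown in the optimizer (usually on by default).
SET hive.optimize.ppd=true;
-- Push the residual predicate down into the record reader / storage format;
-- this is what lets ParquetRecordReaderWrapper build the row group filter.
SET hive.optimize.index.filter=true;
```

With these set, row groups whose min/max statistics rule out the predicate should be skipped entirely.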
I will dig into more detail of parquet and do some test later.
Thanks,
Keith
From: Furcy Pin <pin.fu...@gmail.com>
Sent: Friday, February 23, 2018 8:03:34 AM
To: user@hive.apache.org
Subject: Re: Why the filter push down does not reduce the read data record count
And if you come across a comprehensive documentation of parquet configuration,
please share it!!!
The Parquet documentation says that it can be configured but doesn't explain
how:
http://parquet.apache.org/documentation/latest/
and apparently, both Tajo
(http://tajo.apache.org/docs/0.8.0/table_management/parquet.html)
and Drill
(https://drill.apache.org/docs/parquet-format/)
seem to have some configuration parameters for Parquet.
If Hive has configuration parameters for Parquet too, I couldn't find them
documented anywhere.
On 23 February 2018 at 16:48, Sun, Keith <ai...@ebay.com> wrote:
I got your point, and thanks for the nice slides info.
So the Parquet filter is not an easy thing, and I will try it according to the
deck.
Thanks !
From: Furcy Pin <pin.fu...@gmail.com>
Sent: Friday, February 23, 2018 3:37:52 AM
To: user@hive.apache.org
Subject: Re: Why the filter push down does not reduce the read data record count
Hi,
Unless your table is partitioned or bucketed by myid, Hive generally has to
read through all the records to find those that match your predicate.
In other words, Hive tables are generally not indexed for single-record
retrieval the way you would expect RDBMS or Vertica tables to be.
Some file formats like ORC (and maybe Parquet, I'm not sure) allow adding bloom
filters on specific columns of a table
(https://snippetessay.wordpress.com/2015/07/25/hive-optimizations-with-indexes-bloom-filters-and-statistics/),
which can work as a kind of index.
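For ORC, a bloom filter can be requested at table creation time via table properties (a sketch; the table and column names here are just examples):

```sql
CREATE TABLE events (myid BIGINT, payload STRING)
STORED AS ORC
TBLPROPERTIES (
  'orc.bloom.filter.columns'='myid',  -- build bloom filters for this column
  'orc.bloom.filter.fpp'='0.05'       -- target false-positive probability
);
```

The reader can then skip stripes where the bloom filter proves a given myid value is absent, which is what makes it behave like a coarse index.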
Also, depending on the query engine you are using (Hive, Spark-SQL, Impala,
Presto...) and its version, it may or may not be able to leverage certain
storage optimizations.
For example, Spark still does not support the Hive bucketed table optimization,
but it might come in the upcoming Spark 2.3.
I'm much less familiar with Parquet, so if anyone has links to good
documentation for Parquet fine tuning (or even better, a comparison with ORC
features), that would be really helpful.
By googling, I found these slides, where someone at Netflix
(https://www.slideshare.net/RyanBlue3/parquet-performance-tuning-the-missing-guide)
seems to have tried the same kind of optimization as you in Parquet.
On 23 February 2018 at 12:02, Sun, Keith <ai...@ebay.com> wrote:
Hi,
Why does Hive still read so many "records" even with filter pushdown enabled and
the