And if you come across comprehensive documentation of Parquet configuration, please share it!
The Parquet documentation says that it can be configured but doesn't explain how: http://parquet.apache.org/documentation/latest/
Apparently, both Tajo (http://tajo.apache.org/docs/0.8.0/table_management/parquet.html) and Drill (https://drill.apache.org/docs/parquet-format/) have some configuration parameters for Parquet. If Hive has configuration parameters for Parquet too, I couldn't find them documented anywhere.

On 23 February 2018 at 16:48, Sun, Keith <ai...@ebay.com> wrote:

> I got your point and thanks for the nice slides info.
>
> So the parquet filter is not an easy thing and I will try that according
> to the deck.
>
> Thanks!
> ------------------------------
> *From:* Furcy Pin <pin.fu...@gmail.com>
> *Sent:* Friday, February 23, 2018 3:37:52 AM
> *To:* firstname.lastname@example.org
> *Subject:* Re: Why the filter push down does not reduce the read data
> record count
>
> Hi,
>
> Unless your table is partitioned or bucketed by myid, Hive generally
> has to read through all the records to find the ones that match your
> predicate.
>
> In other words, Hive tables are generally not indexed for single-record
> retrieval the way you would expect RDBMS or Vertica tables to be.
> Some file formats like ORC (and maybe Parquet, I'm not sure) allow adding
> bloom filters on specific columns of a table
> <https://snippetessay.wordpress.com/2015/07/25/hive-optimizations-with-indexes-bloom-filters-and-statistics/>,
> which can work as a kind of index.
> Also, depending on the query engine you are using (Hive, Spark-SQL,
> Impala, Presto...) and its version, it may or may not be able to leverage
> certain storage optimizations.
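For reference, the bucketing and ORC bloom filter setup described above can be sketched like this (a hypothetical table — the table and column names are made up, and it assumes Hive 1.2+ for ORC bloom filter support):

```sql
-- Sketch only: a table bucketed by myid, stored as ORC with a bloom
-- filter on myid. Table/column names are hypothetical.
CREATE TABLE events_bucketed (
  myid    BIGINT,
  payload STRING
)
CLUSTERED BY (myid) INTO 256 BUCKETS
STORED AS ORC
TBLPROPERTIES (
  'orc.bloom.filter.columns' = 'myid',   -- columns to build bloom filters for
  'orc.bloom.filter.fpp'     = '0.05'    -- target false-positive probability
);

-- With the bloom filter, a point lookup like this can skip stripes
-- whose filter says myid = 223 is definitely absent:
SELECT * FROM events_bucketed WHERE myid = 223;
```

Note that a bloom filter only lets the reader skip stripes/row groups, so it helps most when the matching values are concentrated in few stripes (e.g. when the data is sorted or clustered by that column).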
> For example, Spark still does not support Hive bucketed table
> optimization, but it might come in the upcoming Spark 2.3.
>
> I'm much less familiar with Parquet, so if anyone has links to good
> documentation for Parquet fine-tuning (or, even better, a comparison with
> ORC features) that would be really helpful.
> By googling, I found these slides where someone at Netflix
> <https://www.slideshare.net/RyanBlue3/parquet-performance-tuning-the-missing-guide>
> seems to have tried the same kind of optimization as you in Parquet.
>
> On 23 February 2018 at 12:02, Sun, Keith <ai...@ebay.com> wrote:
>
> Hi,
>
> Why does Hive still read so many "records" even with filter pushdown
> enabled, when the returned dataset is a very small amount (4k out of
> 30 billion records)?
>
> The "RECORDS_IN" counter of Hive still shows the 30 billion count, and
> the map-reduce log shows output like this:
>
> org.apache.hadoop.hive.ql.exec.MapOperator: MAP: records read - 100000
>
> BTW, I am using Parquet as the storage format, and the filter pushdown
> did work, as I see this in the log:
>
> AM INFO: parquet.filter2.compat.FilterCompat: Filtering using predicate:
> eq(myid, 223)
>
> Thanks,
> Keith
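To tie the thread together: the settings below are the Hive knobs that control whether predicates are pushed down into the storage layer at all (the table name is hypothetical). This is a sketch, not a guarantee — even with pushdown working, skipping happens at row-group/stripe granularity, which is consistent with RECORDS_IN staying large when matching rows are spread across many row groups:

```sql
-- Push predicates down in the query plan:
SET hive.optimize.ppd=true;
-- Push predicates into the storage format reader (ORC/Parquet), so row
-- groups whose min/max statistics exclude the value can be skipped:
SET hive.optimize.index.filter=true;

-- my_table is hypothetical; the predicate matches the log line
-- "Filtering using predicate: eq(myid, 223)" from the original question.
SELECT * FROM my_table WHERE myid = 223;
```

If the data in each Parquet file is not sorted or clustered by myid, the min/max ranges of most row groups will straddle 223, so almost nothing gets skipped — which matches the behavior Keith observed.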