Hi Jessica,

Interesting. ORC files are laid out in stripes whose size is set by *orc.stripe.size* (default 64MB). Within each stripe are row groups of 10,000 rows, each carrying statistics for both the data and the index. A query with a selective predicate should trigger a SARG pushdown that limits which row groups have to be read, so the reader can skip an entire file, or at least sections of it; this is by and large what a conventional RDBMS B-tree index achieves.
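For illustration, a minimal sketch (the table and column names are made up; the property keys and the values shown are the standard ORC defaults) of a table spelling out those layout knobs, plus the session setting that in 0.14 must be on before the reader will build and push a SARG at all:

  -- 64MB stripes, 10,000-row row groups, row-group indexes on
  -- (these are the defaults, spelled out explicitly)
  CREATE TABLE orc_layout_demo (id INT, val STRING)
  STORED AS ORC
  TBLPROPERTIES (
    'orc.stripe.size'='67108864',
    'orc.row.index.stride'='10000',
    'orc.create.index'='true'
  );

  -- off by default in Hive 0.14; without it no SARG is generated
  SET hive.optimize.index.filter=true;

If that flag is off, the SARG never appearing in the MR job configuration is exactly what you would see, so it is worth ruling out first.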
So with this in mind, an ORC file has the following components:

1. the ORC file itself
2. multiple stripes within the ORC file
3. multiple row groups (row batches) within each stripe

Please check two things:

1) Have you updated the statistics for the table (e.g. ANALYZE TABLE <TABLE> COMPUTE STATISTICS)?
2) What is the outcome of an ORC file dump? For example:

hive --orcfiledump /user/hive/warehouse/oraclehadoop.db/orctest/000000_0

HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com


On 28 February 2016 at 07:39, Jie Zhang <jiezh2...@gmail.com> wrote:

> Hi, Mich,
>
> Thanks for the reply. We don't set any tblproperties when creating the
> table. Here is the TBLPROPERTIES part from show create table:
>
> STORED AS ORC
> TBLPROPERTIES ('transient_lastDdlTime'='1455765074')
>
> Jessica
>
>
> On Sat, Feb 27, 2016 at 11:15 AM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Hi,
>>
>> Can you do show create table <TABLE> on your external table and send the
>> sections from
>>
>> STORED AS ORC
>> TBLPROPERTIES (
>>
>> onwards please?
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> On 27 February 2016 at 18:59, Jie Zhang <jiezh2...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> We have an external ORC table comprising ~200 relatively small ORC
>>> files (each less than 256MB). When querying the table with a selective
>>> SARG predicate (explain shows the predicate qualifies for pushdown), we
>>> expect only a few splits to be generated, pruned on the predicate
>>> condition, so that only a few files are scanned. However, predicate
>>> pushdown is not taking effect at all: all the files are scanned in the
>>> MR job, and the SARG does not even show up in the MR job configuration.
>>>
>>> After digging into the Hive code (version 0.14), it looks like split
>>> pruning only happens for the stripes within each file; if the file size
>>> is smaller than the default split size, the SARG is not considered.
>>> Here is the code we are referring to:
>>>
>>> https://github.com/apache/hive/blob/release-0.14.0/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L656
>>>
>>> Any idea why the SARG is ignored in this scenario? Also, can split
>>> pruning filter out files in which no stripe satisfies the SARG
>>> condition? Thanks for any help, really appreciated.
>>>
>>> Jessica