Hi

Please find answers inline.

On Feb 28, 2016, at 2:50 AM, Mich Talebzadeh 
<mich.talebza...@gmail.com> wrote:

Hi Jessica,

Interesting. ORC files are laid out in stripes whose size is specified by 
orc.stripe.size (default 64MB). Within each stripe you have row groups of 10K 
rows that keep statistics for both data and index. Your query should perform a 
SARG pushdown that limits which rows are required for the query, so it can 
avoid reading an entire file, or at least sections of the file. That is by and 
large what a conventional RDBMS B-tree index does.

So with this in mind, an ORC file will have the following components:

  1. ORC file itself
  2. Multiple stripes within the ORC file
  3. Multiple row groups (row batches) within each stripe
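
To make the layout above concrete, here is a minimal Python sketch of how
per-row-group min/max statistics let a reader skip row groups for a range
predicate. It is a toy model, not Hive's or ORC's actual reader code: the
names RowGroup and select_row_groups and the sample values are made up for
illustration, and only a single integer column is considered.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroup:
    """Toy stand-in for one row group and its statistics for one column."""
    index: int
    min_value: int
    max_value: int


def select_row_groups(groups: List[RowGroup], lo: int, hi: int) -> List[int]:
    """Return indexes of row groups that may hold rows with lo <= value <= hi.

    A group is skipped only when its [min, max] range cannot overlap [lo, hi].
    """
    return [g.index for g in groups if g.max_value >= lo and g.min_value <= hi]


# Hypothetical stripe with three row groups over a roughly sorted column.
stripe = [
    RowGroup(0, 1, 10_000),
    RowGroup(1, 10_001, 20_000),
    RowGroup(2, 20_001, 30_000),
]

# Only row groups whose statistics overlap the predicate range are kept.
print(select_row_groups(stripe, 12_000, 12_500))
```

The same idea only pays off when the data is laid out so that the ranges of
different row groups do not all overlap the predicate.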



Please check two things:

1) Have you updated statistics for the table?
2) What is the output of an ORC file dump? For example:

hive --orcfiledump  /user/hive/warehouse/oraclehadoop.db/orctest/000000_0

HTH

Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 28 February 2016 at 07:39, Jie Zhang 
<jiezh2...@gmail.com> wrote:
Hi, Mich,

Thanks for the reply. We don't set any TBLPROPERTIES when creating the table. 
Here is the TBLPROPERTIES part from show create table:

STORED AS ORC
TBLPROPERTIES ('transient_lastDdlTime'='1455765074')

Jessica


On Sat, Feb 27, 2016 at 11:15 AM, Mich Talebzadeh 
<mich.talebza...@gmail.com> wrote:
Hi,

Can you run show create table <TABLE> on your external table and send the 
sections from

STORED AS ORC
TBLPROPERTIES (

onwards please?

HTH

Dr Mich Talebzadeh






On 27 February 2016 at 18:59, Jie Zhang 
<jiezh2...@gmail.com> wrote:
Hi,

We have an external ORC table which includes ~200 relatively small ORC files 
(less than 256MB). When querying the table with a selective SARG predicate 
(explain shows the predicate qualifies for pushdown), we expect only a few 
splits to be generated, with pruning based on the predicate condition, so that 
only a few files are scanned. However, predicate pushdown is somehow not in 
effect at all: all the files are scanned in the MR job, and the SARG does not 
even show up in the MR job config.

After digging more into the Hive code (version 0.14), it looks like split 
pruning only happens for the stripes within each file. If the file size is 
smaller than the default split size, the SARG is not considered. Here is the 
code we are referring to:
https://github.com/apache/hive/blob/release-0.14.0/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L656

Any idea why SARG is ignored for this scenario?

Before Hive 0.14 the default ORC stripe size was 256MB, and the HDFS block 
size was calculated as Math.min(2*stripe_size, 1.5GB), so the block size was 
typically 512MB. When the entire file is smaller than a block, it is not 
beneficial to read the footer to eliminate stripes; it's usually wasted 
effort. So this optimization was added to skip reading footers when the entire 
file is smaller than the HDFS block size. You can change this behavior by 
setting mapreduce.input.fileinputformat.split.maxsize to a value less than 
your minimum file size, so that all file footers are forcibly read for split 
elimination. Note that this can increase split creation time if your files are 
not laid out properly or when no splits can be eliminated.
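
The rule described above can be sketched as follows. This is a simplified 
Python model of the behaviour as explained in this reply, not the actual 
OrcInputFormat code. The function names are hypothetical and the file sizes 
are made-up examples.

```python
MB = 1024 * 1024
GB = 1024 * MB


def pre_014_block_size(stripe_size: int) -> int:
    """HDFS block size rule quoted above for Hive before 0.14."""
    return min(2 * stripe_size, int(1.5 * GB))


def footer_is_read(file_size: int, max_split_size: int) -> bool:
    """Simplified model: footers are skipped for files smaller than the
    maximum split size, and read otherwise."""
    return file_size >= max_split_size


def files_with_footer_read(file_sizes, max_split_size):
    """Count files whose footer would be read under this model."""
    return sum(footer_is_read(size, max_split_size) for size in file_sizes)


# Hypothetical table: three small ORC files.
file_sizes = [40 * MB, 120 * MB, 200 * MB]

# Block size in MB for a 256MB stripe under the pre-0.14 rule.
print(pre_014_block_size(256 * MB) // MB)
# Threshold above every file: no footer is read, so no split elimination.
print(files_with_footer_read(file_sizes, 256 * MB))
# Threshold just below the smallest file: every footer is read.
print(files_with_footer_read(file_sizes, min(file_sizes) - 1))
```

In a Hive session the threshold would be lowered with something like 
set mapreduce.input.fileinputformat.split.maxsize=<bytes>; where the byte 
value has to be chosen from your own file sizes.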

From Hive 0.14 onwards the stripe size and block size are no longer tied 
together. The default stripe size is 64MB and the default block size is 256MB. 
We decreased the default stripe size for better split elimination and 
increased task parallelism (the minimum splittable unit is a stripe boundary).

Also, can split pruning filter out files in which no stripe satisfies the 
SARG condition?

Yes. If none of the files satisfies the SARG condition all files can be pruned 
(0 splits).
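
As a rough illustration of this answer, the Python sketch below models split 
generation when footers are read: each file contributes one candidate split 
per stripe whose min/max statistics can match an equality predicate, so a file 
with no matching stripe contributes nothing. The data and names 
(splits_for_files, the file names, the ranges) are hypothetical, and the 
sketch ignores how Hive may group stripes into larger splits.

```python
from typing import Dict, List, Tuple

# File name -> list of (min, max) statistics, one pair per stripe.
FileStats = Dict[str, List[Tuple[int, int]]]


def splits_for_files(files: FileStats, value: int) -> List[Tuple[str, int]]:
    """Return (file, stripe_index) pairs whose range may contain `value`."""
    return [
        (name, i)
        for name, stripes in files.items()
        for i, (lo, hi) in enumerate(stripes)
        if lo <= value <= hi
    ]


# Hypothetical table with two files of two stripes each.
files = {
    "000000_0": [(1, 100), (101, 200)],
    "000001_0": [(201, 300), (301, 400)],
}

# Only stripes whose statistics can match the value survive.
print(splits_for_files(files, 350))
# No stripe in any file can match, so the result is an empty list (0 splits).
print(splits_for_files(files, 999))
```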

Thanks for any help, really appreciated.

Jessica



