Re: Benefit of ORC format storing Sum, Min, Max...

Gopal Vijayaraghavan Fri, 29 May 2015 20:21:07 -0700

> I am new to Hive, please help me understand the benefit of ORC file
>format storing Sum, Min, Max values.
> Whenever we try to find a sum of values in a particular column, it still
>runs the MapReduce job.


ORC uses row-indexes to constraint filtering.

What you¹re looking at is the ORC file footer, where ORC only uses that to
speed up ³ANALYZE TABLE <table> COMPUTE STATISTICS PARTIALSCAN" instead of
reading all rows out, like RCFile does.

We do not push down sum() into the RowSchema at all, which is what I think
you want.

> select sum(col1) from orctable;
> select sum(col1) from txttable;

If you¹re interested, it would be a very interesting patch to push-down
the min/max/count/avg for Number types into the Hive RowSchema.

That would help both ORC and Parquet, since they both carry metadata in
easily accessible locations.

> For a sample file with around 100 records, i dint see any difference in
>performance running the above queries .. Please let me know what am i
>missing...

At hundreds of rows, you¹re probably better off dropping MapReduce from
your runs, if you¹re measuring in seconds.

hive> set hive.tez.exec.print.summary=true;
hive> create temporary table date_dim_txt stored as textfile as select *
from tpcds5_bin_partitioned_orc_200.date_Dim;
...

hive> select sum(d_date_sk) from date_dim_txt;
Status: DAG finished successfully in 1.01 seconds


METHOD                         DURATION(ms)
parse                                    1
semanticAnalyze                        200
TezBuildDag                            125
TezSubmitToRunningDag                    5
TotalPrepTime                          690

VERTICES         TOTAL_TASKS  FAILED_ATTEMPTS KILLED_TASKS
DURATION_SECONDS    CPU_TIME_MILLIS     GC_TIME_MILLIS  INPUT_RECORDS
OUTPUT_RECORDS 
Map 1                      1                0            0
0.60                630                  0         73,049                1
Reducer 2                  1                0            0
0.40                 50                  0              1                0
OK
179082983754
Time taken: 1.79 seconds, Fetched: 1 row(s)



Otherwise, the only difference between those runs would be the HDFS
BYTES_READ counter.


Cheers,
Gopal

Re: Benefit of ORC format storing Sum, Min, Max...

Reply via email to