> I am new to Hive, please help me understand the benefit of ORC file >format storing Sum, Min, Max values. > Whenever we try to find a sum of values in a particular column, it still >runs the MapReduce job.
ORC uses row-indexes to constraint filtering. What you¹re looking at is the ORC file footer, where ORC only uses that to speed up ³ANALYZE TABLE <table> COMPUTE STATISTICS PARTIALSCAN" instead of reading all rows out, like RCFile does. We do not push down sum() into the RowSchema at all, which is what I think you want. > select sum(col1) from orctable; > select sum(col1) from txttable; If you¹re interested, it would be a very interesting patch to push-down the min/max/count/avg for Number types into the Hive RowSchema. That would help both ORC and Parquet, since they both carry metadata in easily accessible locations. > For a sample file with around 100 records, i dint see any difference in >performance running the above queries .. Please let me know what am i >missing... At hundreds of rows, you¹re probably better off dropping MapReduce from your runs, if you¹re measuring in seconds. hive> set hive.tez.exec.print.summary=true; hive> create temporary table date_dim_txt stored as textfile as select * from tpcds5_bin_partitioned_orc_200.date_Dim; ... hive> select sum(d_date_sk) from date_dim_txt; Status: DAG finished successfully in 1.01 seconds METHOD DURATION(ms) parse 1 semanticAnalyze 200 TezBuildDag 125 TezSubmitToRunningDag 5 TotalPrepTime 690 VERTICES TOTAL_TASKS FAILED_ATTEMPTS KILLED_TASKS DURATION_SECONDS CPU_TIME_MILLIS GC_TIME_MILLIS INPUT_RECORDS OUTPUT_RECORDS Map 1 1 0 0 0.60 630 0 73,049 1 Reducer 2 1 0 0 0.40 50 0 1 0 OK 179082983754 Time taken: 1.79 seconds, Fetched: 1 row(s) Otherwise, the only difference between those runs would be the HDFS BYTES_READ counter. Cheers, Gopal
