On Mon, Jan 24, 2011 at 7:51 PM, yongqiang he <heyongqiang...@gmail.com> wrote: > Yes. It only support block compression. (No record level compression support.) > You can use the config 'hive.io.rcfile.record.buffer.size' to specify > the block size (before compression). The default is 4MB. > > Thanks > Yongqiang > On Mon, Jan 24, 2011 at 4:44 PM, Edward Capriolo <edlinuxg...@gmail.com> > wrote: >> On Mon, Jan 24, 2011 at 4:42 PM, Edward Capriolo <edlinuxg...@gmail.com> >> wrote: >>> On Mon, Jan 24, 2011 at 4:14 PM, yongqiang he <heyongqiang...@gmail.com> >>> wrote: >>>> How did you upload the data to the new table? >>>> You can get the data compressed by doing a insert overwrite to the >>>> destination table with setting "hive.exec.compress.output" to true. >>>> >>>> Thanks >>>> Yongqiang >>>> On Mon, Jan 24, 2011 at 12:30 PM, Edward Capriolo <edlinuxg...@gmail.com> >>>> wrote: >>>>> I am trying to explore some use case that I believe are perfect for >>>>> the columnarSerDe, tables with 100+ columns where only one or two are >>>>> selected in a particular query. >>>>> >>>>> CREATE TABLE (....) >>>>> ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe" >>>>> STORED AS RCFile ; >>>>> >>>>> My issue is my data from our source table, with gzip sequence files, >>>>> is much smaller then the ColumnarSerDe table and as a result any >>>>> performance gains are lost. >>>>> >>>>> Any ideas? >>>>> >>>>> Thank you, >>>>> Edward >>>>> >>>> >>> >>> Thank you! That was a RTFM question. >>> >>> set hive.exec.dynamic.partition.mode=nonstrict; >>> set hive.exec.compress.output=true; >>> set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec; >>> >>> I was unclear about 'STORED AS RCFile' since normally you would need >>> to use ' STORED AS SEQUENCEFILE' >>> >>> However >>> http://hive.apache.org/docs/r0.6.0/api/org/apache/hadoop/hive/ql/io/RCFile.html >>> explains this well. RCFILE is a special type of sequence file. >>> >>> I did get it working. Looks good compression for my table was smaller >>> then using GZIP BLOCK Sequence file. Query time was slightly better in >>> limited testing. Cool stuff. >>> >>> Edward >>> >> >> Do rcfiles support a blocksize for compression like other compressed >> sequence files? >> >
Great. Do you have any suggestions or hints on how to tune this. Any information on what the ceiling or the floor might be? Edward