On Tue, Jan 18, 2011 at 10:25 AM, Ajo Fod <ajo....@gmail.com> wrote:
> I tried with the gzip compression codec. BTW, what do you think of
> bz2? I've read that it is possible to split as input to different
> mappers ... is there a catch?
>
> Here are my flags now ... of these, the last 2 were added per your suggestion.
> SET hive.enforce.bucketing=TRUE;
> set hive.exec.compress.output=true;
> set hive.merge.mapfiles = false;
> set io.seqfile.compression.type = BLOCK;
> set io.seqfile.compress.blocksize=1000000;
> set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
>
> The results:
> text files result in about 18MB total (2 files) with compression,
> as earlier ... BTW, it takes 32sec to complete.
> sequence files are now stored in 2 files totaling 244MB ... takes
> about 84 seconds.
> ... mind you, the original was one file of 132MB.
>
> Cheers,
> -Ajo
>
>
> On Tue, Jan 18, 2011 at 6:36 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>> On Tue, Jan 18, 2011 at 9:07 AM, Ajo Fod <ajo....@gmail.com> wrote:
>>> Hello,
>>>
>>> My questions in short are:
>>> - Why are sequence files bigger than text files (considering that they
>>> are binary)?
>>> - It looks like compression does not make for a smaller sequence file
>>> than the original text file.
>>>
>>> -- Here is the sample data that is transferred into the tables below with
>>> an INSERT OVERWRITE:
>>> A  09:33:30  N  38.75  109100  0  522486  40
>>> A  09:33:31  M  38.75  200     0  0       0
>>> A  09:33:31  M  38.75  100     0  0       0
>>> A  09:33:31  M  38.75  100     0  0       0
>>> A  09:33:31  M  38.75  100     0  0       0
>>> A  09:33:31  M  38.75  100     0  0       0
>>> A  09:33:31  M  38.75  500     0  0       0
>>>
>>> -- So, focusing on columns 4 and 5:
>>> -- text representation: columns 4 and 5 are 5 and 3 bytes long respectively (8 bytes total).
>>> -- binary representation: columns 4 and 5 are 4 and 4 bytes long respectively (8 bytes total).
>>> -- NOTE: I drop the last 3 columns in the table representation.
>>>
>>> -- The original size of one sample partition was 132MB ... extract from <ls>:
>>> 132M  2011-01-16 18:20  data/2001-05-22
>>>
>>> -- ... so I set the following hive variables:
>>>
>>> set hive.exec.compress.output=true;
>>> set hive.merge.mapfiles = false;
>>> set io.seqfile.compression.type = BLOCK;
>>>
>>> -- ... and create the following table.
>>> CREATE TABLE alltrades
>>> (symbol STRING, time STRING, exchange STRING, price FLOAT, volume INT)
>>> PARTITIONED BY (dt STRING)
>>> CLUSTERED BY (symbol)
>>> SORTED BY (time ASC)
>>> INTO 4 BUCKETS
>>> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
>>> STORED AS TEXTFILE;
>>>
>>> -- ... now the table is split into 2 files (!! shouldn't this be 4?
>>> ... but that is discussed in the previous mail to this group).
>>> -- The bucket files total 17.5MB:
>>> 9,009,080  2011-01-18 05:32  /user/hive/warehouse/alltrades/dt=2001-05-22/attempt_201101180504_0013_m_000000_0.deflate
>>> 8,534,264  2011-01-18 05:32  /user/hive/warehouse/alltrades/dt=2001-05-22/attempt_201101180504_0013_m_000001_0.deflate
>>>
>>> -- ... so, I wondered, what would happen if I used SEQUENCEFILE instead:
>>> CREATE TABLE alltrades
>>> (symbol STRING, time STRING, exchange STRING, price FLOAT, volume INT)
>>> PARTITIONED BY (dt STRING)
>>> CLUSTERED BY (symbol)
>>> SORTED BY (time ASC)
>>> INTO 4 BUCKETS
>>> STORED AS SEQUENCEFILE;
>>>
>>> ... this created files that totaled 193MB (larger even than the original)!!
>>> 99,751,137  2011-01-18 05:24  /user/hive/warehouse/alltrades/dt=2001-05-22/attempt_201101180504_0007_m_000000_0
>>> 93,859,644  2011-01-18 05:24  /user/hive/warehouse/alltrades/dt=2001-05-22/attempt_201101180504_0007_m_000001_0
>>>
>>> So, in summary:
>>> Why are sequence files bigger than the original?
>>>
>>> -Ajo
>>>
>>
>> It looks like you have not explicitly set the compression codec or the
>> block size. This likely means you will end up with the DefaultCodec
>> and a block size that probably adds more overhead than compression.
>> Don't you just love this stuff?
>>
>> Experiment with these settings:
>> io.seqfile.compress.blocksize=1000000
>> mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
>>
I may have been unclear. Try different io.seqfile.compress.blocksize values (1,000,000 bytes is not really that big).
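
For example, something along these lines (just a sketch: the 10,000,000-byte block size and the raw_trades staging table name are illustrative placeholders, not values from your setup):

set hive.enforce.bucketing=true;
set hive.exec.compress.output=true;
set io.seqfile.compression.type=BLOCK;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
-- try a bigger compression block, e.g. ~10MB instead of 1MB (illustrative; sweep a few values)
set io.seqfile.compress.blocksize=10000000;

-- reload the SEQUENCEFILE partition and compare sizes; raw_trades is a hypothetical staging table
INSERT OVERWRITE TABLE alltrades PARTITION (dt='2001-05-22')
SELECT symbol, time, exchange, price, volume
FROM raw_trades;

Larger blocks give gzip more data to find repetition in, so the ratio should move closer to what you see on the deflate-compressed text files, though only a measurement on your own data will tell.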