I didn't do the test you suggested, but with the sequence file case:
- the size of what should have been compressed was bigger than the uncompressed data
- the files didn't have a .deflate suffix
- in contrast, in the text file case I got 10x compression or so

Cheers,
-Ajo
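A minimal session sketch that should produce block-compressed SequenceFile
output, assuming the standard Hadoop/Hive properties of that era; whether
mapred.output.compression.type is actually the missing piece here is an
assumption, and the source table name below is made up:

    SET hive.exec.compress.output=true;
    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
    -- SequenceFileOutputFormat takes its compression type (NONE/RECORD/BLOCK)
    -- from this job-level property, so io.seqfile.compression.type on its own
    -- may not switch the written files to BLOCK compression:
    SET mapred.output.compression.type=BLOCK;

    -- "raw_trades" is a hypothetical staging table standing in for the source data
    INSERT OVERWRITE TABLE alltrades PARTITION (dt='2001-05-22')
    SELECT symbol, time, exchange, price, volume
    FROM raw_trades
    WHERE dt = '2001-05-22';

A couple of concrete checks against the paths in the quoted thread follow at
the end.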
On Wed, Jan 19, 2011 at 11:30 AM, Steven Wong <sw...@netflix.com> wrote:
> Here's a simple check -- look inside one of your sequence files:
>
> hadoop fs -cat /your/seq/file | head
>
> If it is compressed, the header will contain the compression codec's name
> and the data will look like gibberish. Otherwise, it is not compressed.
>
>
> -----Original Message-----
> From: Ajo Fod [mailto:ajo....@gmail.com]
> Sent: Tuesday, January 18, 2011 8:46 AM
> To: user@hive.apache.org
> Subject: Re: On compressed storage : why are sequence files bigger than
> text files?
>
> I tried 10M for the blocksize ... the files are not any smaller.
>
> Also, I tried the BZ2 compression codec ... it takes forever ... the
> mapper ran for 10 mins and completed only 4% of the job on one
> partition. For comparison, with gzip, it took about 85 secs. So I
> terminated the job prematurely.
>
> In summary, what I started with ... gzip with text files ... seems to
> compress to about 10% of the original size. I also tried out:
> set io.seqfile.compression.type = RECORD;
>
> I have a feeling that compression is not turned on for sequence files
> for some reason.
>
> -Ajo.
>
> On Tue, Jan 18, 2011 at 7:28 AM, Edward Capriolo <edlinuxg...@gmail.com>
> wrote:
>> On Tue, Jan 18, 2011 at 10:25 AM, Ajo Fod <ajo....@gmail.com> wrote:
>>> I tried with the gzip compression codec. BTW, what do you think of
>>> bz2? I've read that it is possible to split it as input to different
>>> mappers ... is there a catch?
>>>
>>> Here are my flags now ... of these, the last 2 were added per your
>>> suggestion.
>>> SET hive.enforce.bucketing=TRUE;
>>> set hive.exec.compress.output=true;
>>> set hive.merge.mapfiles = false;
>>> set io.seqfile.compression.type = BLOCK;
>>> set io.seqfile.compress.blocksize=1000000;
>>> set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
>>>
>>> The results:
>>> Text files result in about 18MB total (2 files) with compression ...
>>> as earlier ... BTW, it takes 32 sec to complete.
>>> Sequence files are now stored in 2 files totaling 244MB ... takes
>>> about 84 seconds.
>>> ... mind you, the original was one file of 132MB.
>>>
>>> Cheers,
>>> -Ajo
>>>
>>>
>>> On Tue, Jan 18, 2011 at 6:36 AM, Edward Capriolo <edlinuxg...@gmail.com>
>>> wrote:
>>>> On Tue, Jan 18, 2011 at 9:07 AM, Ajo Fod <ajo....@gmail.com> wrote:
>>>>> Hello,
>>>>>
>>>>> My questions in short are:
>>>>> - Why are sequence files bigger than text files (considering that
>>>>> they are binary)?
>>>>> - It looks like compression does not make for a smaller sequence
>>>>> file than the original text file.
>>>>>
>>>>> -- Here is sample data that is transferred into the tables below
>>>>> with an INSERT OVERWRITE:
>>>>> A 09:33:30 N 38.75 109100 0 522486 40
>>>>> A 09:33:31 M 38.75 200 0 0 0
>>>>> A 09:33:31 M 38.75 100 0 0 0
>>>>> A 09:33:31 M 38.75 100 0 0 0
>>>>> A 09:33:31 M 38.75 100 0 0 0
>>>>> A 09:33:31 M 38.75 100 0 0 0
>>>>> A 09:33:31 M 38.75 500 0 0 0
>>>>>
>>>>> -- So, focusing on columns 4 and 5:
>>>>> -- text representation: columns 4 and 5 are 5 + 3 = 8 bytes long.
>>>>> -- binary representation: columns 4 and 5 are 4 + 4 = 8 bytes long.
>>>>> -- NOTE: I drop the last 3 columns in the table representation.
>>>>>
>>>>> -- The original size of one sample partition was 132MB ... extract
>>>>> from <ls>:
>>>>> 132M 2011-01-16 18:20 data/2001-05-22
>>>>>
>>>>> -- ... so I set the following hive variables:
>>>>> set hive.exec.compress.output=true;
>>>>> set hive.merge.mapfiles = false;
>>>>> set io.seqfile.compression.type = BLOCK;
>>>>>
>>>>> -- ... and create the following table.
>>>>> CREATE TABLE alltrades
>>>>>   (symbol STRING, time STRING, exchange STRING, price FLOAT, volume INT)
>>>>> PARTITIONED BY (dt STRING)
>>>>> CLUSTERED BY (symbol)
>>>>> SORTED BY (time ASC)
>>>>> INTO 4 BUCKETS
>>>>> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
>>>>> STORED AS TEXTFILE;
>>>>>
>>>>> -- ... now the table is split into 2 files. (!! shouldn't this be 4?
>>>>> ... but that is discussed in the previous mail to this group)
>>>>> -- The bucket files total 17.5MB.
>>>>> 9,009,080 2011-01-18 05:32
>>>>> /user/hive/warehouse/alltrades/dt=2001-05-22/attempt_201101180504_0013_m_000000_0.deflate
>>>>> 8,534,264 2011-01-18 05:32
>>>>> /user/hive/warehouse/alltrades/dt=2001-05-22/attempt_201101180504_0013_m_000001_0.deflate
>>>>>
>>>>> -- ... so I wondered what would happen if I used SEQUENCEFILE instead:
>>>>> CREATE TABLE alltrades
>>>>>   (symbol STRING, time STRING, exchange STRING, price FLOAT, volume INT)
>>>>> PARTITIONED BY (dt STRING)
>>>>> CLUSTERED BY (symbol)
>>>>> SORTED BY (time ASC)
>>>>> INTO 4 BUCKETS
>>>>> STORED AS SEQUENCEFILE;
>>>>>
>>>>> ... this created files totaling 193MB (larger even than the
>>>>> original)!!
>>>>> 99,751,137 2011-01-18 05:24
>>>>> /user/hive/warehouse/alltrades/dt=2001-05-22/attempt_201101180504_0007_m_000000_0
>>>>> 93,859,644 2011-01-18 05:24
>>>>> /user/hive/warehouse/alltrades/dt=2001-05-22/attempt_201101180504_0007_m_000001_0
>>>>>
>>>>> So, in summary:
>>>>> Why are sequence files bigger than the original?
>>>>>
>>>>> -Ajo
>>>>>
>>>>
>>>> It looks like you have not explicitly set the compression codec or
>>>> the block size. This likely means you will end up with the
>>>> DefaultCodec and a block size that probably adds more overhead than
>>>> compression. Don't you just love this stuff?
>>>>
>>>> Experiment with these settings:
>>>> io.seqfile.compress.blocksize=1000000
>>>> mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
>>>>
>>>
>>
>> I may have been unclear. Try different io.seqfile.compress.blocksize
>> values (1,000,000 is not really that big).
>
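Steven's header check, spelled out against the paths from this thread (the
byte count passed to head is arbitrary; a compressed SequenceFile names its
codec class, e.g. org.apache.hadoop.io.compress.DefaultCodec or GzipCodec,
near the top of its header):

    # look for a codec class name in the SequenceFile header
    hadoop fs -cat /user/hive/warehouse/alltrades/dt=2001-05-22/attempt_201101180504_0007_m_000000_0 | head -c 200

    # compare the on-disk sizes of the bucket files for the partition
    hadoop fs -du /user/hive/warehouse/alltrades/dt=2001-05-22/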
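And Edward's block-size suggestion as a concrete experiment; the value is
illustrative, not a recommendation:

    -- rerun the SEQUENCEFILE test with a larger block-compression buffer
    -- than the 1,000,000 bytes used above
    SET io.seqfile.compress.blocksize=8000000;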