On Tue, Jan 18, 2011 at 9:07 AM, Ajo Fod <ajo....@gmail.com> wrote:
> Hello,
>
> My questions in short are:
> - why are sequencefiles bigger than textfiles (considering that they are binary)?
> - It looks like compression does not make for a smaller sequence file than the original text file.
>
> -- here is a sample of the data that is transferred into the tables below with an INSERT OVERWRITE
> A 09:33:30 N 38.75 109100 0 522486 40
> A 09:33:31 M 38.75 200 0 0 0
> A 09:33:31 M 38.75 100 0 0 0
> A 09:33:31 M 38.75 100 0 0 0
> A 09:33:31 M 38.75 100 0 0 0
> A 09:33:31 M 38.75 100 0 0 0
> A 09:33:31 M 38.75 500 0 0 0
>
> -- so focusing on columns 4 and 5:
> -- text representation: columns 4 and 5 are 5 + 3 = 8 bytes long respectively.
> -- binary representation: columns 4 and 5 are 4 + 4 = 8 bytes long respectively.
> -- NOTE: I drop the last 3 columns in the table representation.
>
> -- The original size of one sample partition was 132MB ... extract from <ls>:
> 132M 2011-01-16 18:20 data/2001-05-22
>
> -- ... so I set the following hive variables:
>
> set hive.exec.compress.output=true;
> set hive.merge.mapfiles = false;
> set io.seqfile.compression.type = BLOCK;
>
> -- ... and create the following table.
> CREATE TABLE alltrades
> (symbol STRING, time STRING, exchange STRING, price FLOAT, volume INT)
> PARTITIONED BY (dt STRING)
> CLUSTERED BY (symbol)
> SORTED BY (time ASC)
> INTO 4 BUCKETS
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> STORED AS TEXTFILE;
>
> -- ... now the table is split into 2 files. (!! shouldn't this be 4 ... but that is discussed in the previous mail to this group)
> -- The bucket files total 17.5MB.
> 9,009,080 2011-01-18 05:32 /user/hive/warehouse/alltrades/dt=2001-05-22/attempt_201101180504_0013_m_000000_0.deflate
> 8,534,264 2011-01-18 05:32 /user/hive/warehouse/alltrades/dt=2001-05-22/attempt_201101180504_0013_m_000001_0.deflate
>
> -- ... so, I wondered, what would happen if I used SEQUENCEFILE instead
> CREATE TABLE alltrades
> (symbol STRING, time STRING, exchange STRING, price FLOAT, volume INT)
> PARTITIONED BY (dt STRING)
> CLUSTERED BY (symbol)
> SORTED BY (time ASC)
> INTO 4 BUCKETS
> STORED AS SEQUENCEFILE;
>
> -- ... this created files that totaled 193MB (larger even than the original)!!
> 99,751,137 2011-01-18 05:24 /user/hive/warehouse/alltrades/dt=2001-05-22/attempt_201101180504_0007_m_000000_0
> 93,859,644 2011-01-18 05:24 /user/hive/warehouse/alltrades/dt=2001-05-22/attempt_201101180504_0007_m_000001_0
>
> So, in summary:
> Why are sequence files bigger than the original?
>
> -Ajo
It looks like you have not explicitly set the compression codec or the block size. That likely means you end up with the default codec and a block size small enough that the per-block overhead outweighs whatever compression you gain. Don't you just love this stuff? Experiment with these settings:

io.seqfile.compress.blocksize=1000000
mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
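
For concreteness, this is roughly the session I would try before re-running the INSERT OVERWRITE into the SEQUENCEFILE table. Untested sketch: the property names are the standard Hadoop/Hive ones, but check the defaults on your cluster version, and the source table alltrades_text is just a placeholder for whatever table you load the raw text into.

set hive.exec.compress.output=true;
set io.seqfile.compression.type=BLOCK;
set io.seqfile.compress.blocksize=1000000;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

INSERT OVERWRITE TABLE alltrades PARTITION (dt='2001-05-22')
SELECT symbol, time, exchange, price, volume
FROM alltrades_text
WHERE dt='2001-05-22';

Then compare the sizes of the files under /user/hive/warehouse/alltrades/dt=2001-05-22/ against the 132MB text partition and the 193MB uncompressed sequence files you saw before.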