On Tue, Jan 18, 2011 at 9:07 AM, Ajo Fod <ajo....@gmail.com> wrote:
> Hello,
>
> My questions in short are:
> - Why are SequenceFiles bigger than text files (considering that they
> are binary)?
> - Why does compression not produce a smaller SequenceFile than the
> original text file?
>
> -- here is sample data that is transferred into the tables below with
> an INSERT OVERWRITE:
> A       09:33:30        N       38.75   109100  0       522486  40
> A       09:33:31        M       38.75   200     0       0       0
> A       09:33:31        M       38.75   100     0       0       0
> A       09:33:31        M       38.75   100     0       0       0
> A       09:33:31        M       38.75   100     0       0       0
> A       09:33:31        M       38.75   100     0       0       0
> A       09:33:31        M       38.75   500     0       0       0
>
> -- so focusing on columns 4 and 5:
> -- text representation: columns 4 and 5 are 5 and 3 bytes long
> respectively, 8 bytes in total.
> -- binary representation: columns 4 and 5 (FLOAT and INT) are 4 bytes
> each, 8 bytes in total.
> -- NOTE: I drop the last 3 columns in the table representation.
>
> -- The original size of one sample partition was 132MB ... an extract
> from ls:
> 132M 2011-01-16 18:20 data/2001-05-22
>
> -- ... so I set the following Hive variables:
>
> set hive.exec.compress.output=true;
> set hive.merge.mapfiles = false;
> set io.seqfile.compression.type = BLOCK;
>
> -- ... and create the following table.
> CREATE TABLE alltrades
>       (symbol STRING, time STRING, exchange STRING, price FLOAT, volume INT)
> PARTITIONED BY (dt STRING)
> CLUSTERED BY (symbol)
> SORTED BY (time ASC)
> INTO 4 BUCKETS
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> STORED AS TEXTFILE ;
>
> -- ... now the partition is split into 2 files. (!! shouldn't this be
> 4? ... but that is discussed in my previous mail to this group)
> -- The bucket files total 17.5MB:
> 9,009,080 2011-01-18 05:32
> /user/hive/warehouse/alltrades/dt=2001-05-22/attempt_201101180504_0013_m_000000_0.deflate
> 8,534,264 2011-01-18 05:32
> /user/hive/warehouse/alltrades/dt=2001-05-22/attempt_201101180504_0013_m_000001_0.deflate
>
> -- ... so I wondered: what would happen if I used SEQUENCEFILE instead?
> CREATE TABLE alltrades
>       (symbol STRING, time STRING, exchange STRING, price FLOAT, volume INT)
> PARTITIONED BY (dt STRING)
> CLUSTERED BY (symbol)
> SORTED BY (time ASC)
> INTO 4 BUCKETS
> STORED AS SEQUENCEFILE;
>
> -- ... this created files totaling 193MB (even larger than the
> original)!!
> 99,751,137 2011-01-18 05:24
> /user/hive/warehouse/alltrades/dt=2001-05-22/attempt_201101180504_0007_m_000000_0
> 93,859,644 2011-01-18 05:24
> /user/hive/warehouse/alltrades/dt=2001-05-22/attempt_201101180504_0007_m_000001_0
>
> So, in summary:
> Why are the SequenceFiles bigger than the original?
>
>
> -Ajo
>

It looks like you have not explicitly set the compression codec or the
block size. This likely means you ended up with the DefaultCodec and a
block size whose overhead probably outweighs the compression. Don't you
just love this stuff?
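
As a quick sanity check: in the Hive CLI, "set <property>;" with no
value echoes the current setting, so you can confirm what the session
is actually using before the INSERT OVERWRITE runs. A sketch:

set mapred.output.compression.codec;
set io.seqfile.compression.type;
set io.seqfile.compress.blocksize;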

Experiment with these settings:
io.seqfile.compress.blocksize=1000000
mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
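
For example, a full session might look like the sketch below. This is
only a sketch: alltrades_text is a made-up name standing in for
whatever staging table your INSERT OVERWRITE reads from, and the exact
property names can vary between Hadoop/Hive versions.

set hive.exec.compress.output=true;
set io.seqfile.compression.type=BLOCK;
set io.seqfile.compress.blocksize=1000000;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
-- hive.enforce.bucketing=true may also be needed to actually get the
-- 4 buckets you declared (guessing here; see your earlier bucketing thread)

INSERT OVERWRITE TABLE alltrades PARTITION (dt='2001-05-22')
SELECT symbol, time, exchange, price, volume
FROM alltrades_text   -- hypothetical staging table
WHERE dt = '2001-05-22';

With BLOCK compression and an explicit codec, the SequenceFiles should
come out smaller than the text original rather than larger.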
