On Tue, Jan 18, 2011 at 10:25 AM, Ajo Fod <ajo....@gmail.com> wrote:
> I tried with the gzip compression codec. BTW, what do you think of
> bz2? I've read that it is possible to split as input to different
> mappers ... is there a catch?
>
> Here are my flags now ... of these, the last 2 were added per your suggestion.
> SET hive.enforce.bucketing=TRUE;
> set hive.exec.compress.output=true;
> set hive.merge.mapfiles = false;
> set io.seqfile.compression.type = BLOCK;
> set io.seqfile.compress.blocksize=1000000;
> set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
>
> The results:
> text files result in about 18MB total (2 files) with compression,
> as earlier ... BTW, it takes 32sec to complete.
> sequence files are now stored in 2 files totaling 244MB ... takes
> about 84 seconds.
> ... mind you, the original was one file of 132MB.
>
> Cheers,
> -Ajo
>
>
> On Tue, Jan 18, 2011 at 6:36 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>> On Tue, Jan 18, 2011 at 9:07 AM, Ajo Fod <ajo....@gmail.com> wrote:
>>> Hello,
>>>
>>> My questions in short are:
>>> - Why are sequence files bigger than text files (considering that they
>>> are binary)?
>>> - It looks like compression does not make for a smaller sequence file
>>> than the original text file.
>>>
>>> -- Here is the sample data that is transferred into the tables below with
>>> an INSERT OVERWRITE:
>>> A  09:33:30  N  38.75  109100  0  522486  40
>>> A  09:33:31  M  38.75  200     0  0       0
>>> A  09:33:31  M  38.75  100     0  0       0
>>> A  09:33:31  M  38.75  100     0  0       0
>>> A  09:33:31  M  38.75  100     0  0       0
>>> A  09:33:31  M  38.75  100     0  0       0
>>> A  09:33:31  M  38.75  500     0  0       0
>>>
>>> -- So, focusing on columns 4 and 5:
>>> -- text representation: columns 4 and 5 are 5 and 3 bytes long respectively (8 bytes total).
>>> -- binary representation: columns 4 and 5 are 4 and 4 bytes long respectively (8 bytes total).
>>> -- NOTE: I drop the last 3 columns in the table representation.
>>>
>>> -- The original size of one sample partition was 132MB ... extract from <ls>:
>>> 132M  2011-01-16 18:20  data/2001-05-22
>>>
>>> -- ... so I set the following hive variables:
>>>
>>> set hive.exec.compress.output=true;
>>> set hive.merge.mapfiles = false;
>>> set io.seqfile.compression.type = BLOCK;
>>>
>>> -- ... and create the following table.
>>> CREATE TABLE alltrades
>>> (symbol STRING, time STRING, exchange STRING, price FLOAT, volume INT)
>>> PARTITIONED BY (dt STRING)
>>> CLUSTERED BY (symbol)
>>> SORTED BY (time ASC)
>>> INTO 4 BUCKETS
>>> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
>>> STORED AS TEXTFILE;
>>>
>>> -- ... now the table is split into 2 files (!! shouldn't this be 4?
>>> ... but that is discussed in the previous mail to this group).
>>> -- The bucket files total 17.5MB:
>>> 9,009,080  2011-01-18 05:32  /user/hive/warehouse/alltrades/dt=2001-05-22/attempt_201101180504_0013_m_000000_0.deflate
>>> 8,534,264  2011-01-18 05:32  /user/hive/warehouse/alltrades/dt=2001-05-22/attempt_201101180504_0013_m_000001_0.deflate
>>>
>>> -- ... so, I wondered, what would happen if I used SEQUENCEFILE instead:
>>> CREATE TABLE alltrades
>>> (symbol STRING, time STRING, exchange STRING, price FLOAT, volume INT)
>>> PARTITIONED BY (dt STRING)
>>> CLUSTERED BY (symbol)
>>> SORTED BY (time ASC)
>>> INTO 4 BUCKETS
>>> STORED AS SEQUENCEFILE;
>>>
>>> ... this created files that totaled 193MB (larger even than the original)!!
>>> 99,751,137  2011-01-18 05:24  /user/hive/warehouse/alltrades/dt=2001-05-22/attempt_201101180504_0007_m_000000_0
>>> 93,859,644  2011-01-18 05:24  /user/hive/warehouse/alltrades/dt=2001-05-22/attempt_201101180504_0007_m_000001_0
>>>
>>> So, in summary:
>>> Why are sequence files bigger than the original?
>>>
>>> -Ajo
>>>
>>
>> It looks like you have not explicitly set the compression codec or the
>> block size. This likely means you will end up with the DefaultCodec
>> and a block size that probably adds more overhead than compression.
>> Don't you just love this stuff?
>>
>> Experiment with these settings:
>> io.seqfile.compress.blocksize=1000000
>> mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
>>
I may have been unclear. Try different io.seqfile.compress.blocksize values (1,000,000 bytes is not really that big).
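
For example, something along these lines (just a sketch: the 10,000,000-byte block size and the raw_trades staging table name are illustrative placeholders, not values from your setup):

set hive.enforce.bucketing=true;
set hive.exec.compress.output=true;
set io.seqfile.compression.type=BLOCK;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
-- try a bigger compression block, e.g. ~10MB instead of 1MB (illustrative; sweep a few values)
set io.seqfile.compress.blocksize=10000000;

-- reload the SEQUENCEFILE partition and compare sizes; raw_trades is a hypothetical staging table
INSERT OVERWRITE TABLE alltrades PARTITION (dt='2001-05-22')
SELECT symbol, time, exchange, price, volume
FROM raw_trades;

Larger blocks give gzip more data to find repetition in, so the ratio should move closer to what you see on the deflate-compressed text files, though only a measurement on your own data will tell.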