I didn't do the test you suggested, but with the sequence file case:
- the size of what should have been compressed was bigger than the uncompressed data
- the files didn't have a .deflate suffix
- in contrast, in the text file case I got 10x compression or so

Cheers,
-Ajo
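A minimal session sketch that should produce block-compressed SequenceFile
output, assuming the standard Hadoop/Hive properties of that era; whether
mapred.output.compression.type is actually the missing piece here is an
assumption, and the source table name below is made up:

    SET hive.exec.compress.output=true;
    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
    -- SequenceFileOutputFormat takes its compression type (NONE/RECORD/BLOCK)
    -- from this job-level property, so io.seqfile.compression.type on its own
    -- may not switch the written files to BLOCK compression:
    SET mapred.output.compression.type=BLOCK;

    -- "raw_trades" is a hypothetical staging table standing in for the source data
    INSERT OVERWRITE TABLE alltrades PARTITION (dt='2001-05-22')
    SELECT symbol, time, exchange, price, volume
    FROM raw_trades
    WHERE dt = '2001-05-22';

A couple of concrete checks against the paths in the quoted thread follow at
the end.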
On Wed, Jan 19, 2011 at 11:30 AM, Steven Wong <sw...@netflix.com> wrote:
> Here's a simple check -- look inside one of your sequence files:
>
> hadoop fs -cat /your/seq/file | head
>
> If it is compressed, the header will contain the compression codec's name
> and the data will look like gibberish. Otherwise, it is not compressed.
>
>
> -----Original Message-----
> From: Ajo Fod [mailto:ajo....@gmail.com]
> Sent: Tuesday, January 18, 2011 8:46 AM
> To: user@hive.apache.org
> Subject: Re: On compressed storage : why are sequence files bigger than
> text files?
>
> I tried 10M for the blocksize ... the files are not any smaller.
>
> Also, I tried the BZ2 compression codec ... it takes forever ... the
> mapper ran for 10 mins and completed only 4% of the job on one
> partition. For comparison, with gzip, it took about 85 secs. So I
> terminated the job prematurely.
>
> In summary, what I started with ... gzip with text files ... seems to
> compress to about 10% of the original size. I also tried out:
> set io.seqfile.compression.type = RECORD;
>
> I have a feeling that compression is not turned on for sequence files
> for some reason.
>
> -Ajo.
>
> On Tue, Jan 18, 2011 at 7:28 AM, Edward Capriolo <edlinuxg...@gmail.com>
> wrote:
>> On Tue, Jan 18, 2011 at 10:25 AM, Ajo Fod <ajo....@gmail.com> wrote:
>>> I tried with the gzip compression codec. BTW, what do you think of
>>> bz2? I've read that it is possible to split it as input to different
>>> mappers ... is there a catch?
>>>
>>> Here are my flags now ... of these, the last 2 were added per your
>>> suggestion.
>>> SET hive.enforce.bucketing=TRUE;
>>> set hive.exec.compress.output=true;
>>> set hive.merge.mapfiles = false;
>>> set io.seqfile.compression.type = BLOCK;
>>> set io.seqfile.compress.blocksize=1000000;
>>> set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
>>>
>>> The results:
>>> Text files result in about 18MB total (2 files) with compression ...
>>> as earlier ... BTW, it takes 32 sec to complete.
>>> Sequence files are now stored in 2 files totaling 244MB ... takes
>>> about 84 seconds.
>>> ... mind you, the original was one file of 132MB.
>>>
>>> Cheers,
>>> -Ajo
>>>
>>>
>>> On Tue, Jan 18, 2011 at 6:36 AM, Edward Capriolo <edlinuxg...@gmail.com>
>>> wrote:
>>>> On Tue, Jan 18, 2011 at 9:07 AM, Ajo Fod <ajo....@gmail.com> wrote:
>>>>> Hello,
>>>>>
>>>>> My questions in short are:
>>>>> - Why are sequence files bigger than text files (considering that
>>>>> they are binary)?
>>>>> - It looks like compression does not make for a smaller sequence
>>>>> file than the original text file.
>>>>>
>>>>> -- Here is sample data that is transferred into the tables below
>>>>> with an INSERT OVERWRITE:
>>>>> A 09:33:30 N 38.75 109100 0 522486 40
>>>>> A 09:33:31 M 38.75 200 0 0 0
>>>>> A 09:33:31 M 38.75 100 0 0 0
>>>>> A 09:33:31 M 38.75 100 0 0 0
>>>>> A 09:33:31 M 38.75 100 0 0 0
>>>>> A 09:33:31 M 38.75 100 0 0 0
>>>>> A 09:33:31 M 38.75 500 0 0 0
>>>>>
>>>>> -- So, focusing on columns 4 and 5:
>>>>> -- text representation: columns 4 and 5 are 5 + 3 = 8 bytes long.
>>>>> -- binary representation: columns 4 and 5 are 4 + 4 = 8 bytes long.
>>>>> -- NOTE: I drop the last 3 columns in the table representation.
>>>>>
>>>>> -- The original size of one sample partition was 132MB ... extract
>>>>> from <ls>:
>>>>> 132M 2011-01-16 18:20 data/2001-05-22
>>>>>
>>>>> -- ... so I set the following hive variables:
>>>>> set hive.exec.compress.output=true;
>>>>> set hive.merge.mapfiles = false;
>>>>> set io.seqfile.compression.type = BLOCK;
>>>>>
>>>>> -- ... and create the following table.
>>>>> CREATE TABLE alltrades
>>>>>   (symbol STRING, time STRING, exchange STRING, price FLOAT, volume INT)
>>>>> PARTITIONED BY (dt STRING)
>>>>> CLUSTERED BY (symbol)
>>>>> SORTED BY (time ASC)
>>>>> INTO 4 BUCKETS
>>>>> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
>>>>> STORED AS TEXTFILE;
>>>>>
>>>>> -- ... now the table is split into 2 files. (!! shouldn't this be 4?
>>>>> ... but that is discussed in the previous mail to this group)
>>>>> -- The bucket files total 17.5MB.
>>>>> 9,009,080 2011-01-18 05:32
>>>>> /user/hive/warehouse/alltrades/dt=2001-05-22/attempt_201101180504_0013_m_000000_0.deflate
>>>>> 8,534,264 2011-01-18 05:32
>>>>> /user/hive/warehouse/alltrades/dt=2001-05-22/attempt_201101180504_0013_m_000001_0.deflate
>>>>>
>>>>> -- ... so I wondered what would happen if I used SEQUENCEFILE instead:
>>>>> CREATE TABLE alltrades
>>>>>   (symbol STRING, time STRING, exchange STRING, price FLOAT, volume INT)
>>>>> PARTITIONED BY (dt STRING)
>>>>> CLUSTERED BY (symbol)
>>>>> SORTED BY (time ASC)
>>>>> INTO 4 BUCKETS
>>>>> STORED AS SEQUENCEFILE;
>>>>>
>>>>> ... this created files totaling 193MB (larger even than the
>>>>> original)!!
>>>>> 99,751,137 2011-01-18 05:24
>>>>> /user/hive/warehouse/alltrades/dt=2001-05-22/attempt_201101180504_0007_m_000000_0
>>>>> 93,859,644 2011-01-18 05:24
>>>>> /user/hive/warehouse/alltrades/dt=2001-05-22/attempt_201101180504_0007_m_000001_0
>>>>>
>>>>> So, in summary:
>>>>> Why are sequence files bigger than the original?
>>>>>
>>>>> -Ajo
>>>>>
>>>>
>>>> It looks like you have not explicitly set the compression codec or
>>>> the block size. This likely means you will end up with the
>>>> DefaultCodec and a block size that probably adds more overhead than
>>>> compression. Don't you just love this stuff?
>>>>
>>>> Experiment with these settings:
>>>> io.seqfile.compress.blocksize=1000000
>>>> mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
>>>>
>>>
>>
>> I may have been unclear. Try different io.seqfile.compress.blocksize
>> values (1,000,000 is not really that big).
>
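Steven's header check, spelled out against the paths from this thread (the
byte count passed to head is arbitrary; a compressed SequenceFile names its
codec class, e.g. org.apache.hadoop.io.compress.DefaultCodec or GzipCodec,
near the top of its header):

    # look for a codec class name in the SequenceFile header
    hadoop fs -cat /user/hive/warehouse/alltrades/dt=2001-05-22/attempt_201101180504_0007_m_000000_0 | head -c 200

    # compare the on-disk sizes of the bucket files for the partition
    hadoop fs -du /user/hive/warehouse/alltrades/dt=2001-05-22/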
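And Edward's block-size suggestion as a concrete experiment; the value is
illustrative, not a recommendation:

    -- rerun the SEQUENCEFILE test with a larger block-compression buffer
    -- than the 1,000,000 bytes used above
    SET io.seqfile.compress.blocksize=8000000;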