I think LZO is a good format for compressing these files, because both compression and decompression are fast. As LZO is not bundled with Hadoop's built-in compression codecs (because of its GPL license?), I need to write a Java program to compress the files on HDFS.
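Something like the following is what I have in mind. This is only a rough sketch, assuming the hadoop-lzo library (the kevinweil/hadoop-lzo project, which provides com.hadoop.compression.lzo.LzopCodec) is on the classpath and that LzopCodec is registered under io.compression.codecs in core-site.xml; everything else is the standard Hadoop FileSystem API. Corrections welcome:

import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class LzoCompressDir {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Look the codec up by class name so there is no compile-time
        // dependency on hadoop-lzo; this returns null unless LzopCodec
        // is listed in io.compression.codecs.
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec = factory.getCodecByClassName(
            "com.hadoop.compression.lzo.LzopCodec");

        // Compress every plain file under the given directory, keeping the
        // file name and appending the codec's default extension (".lzo").
        for (FileStatus status : fs.listStatus(new Path(args[0]))) {
            if (status.isDir()) {
                continue;
            }
            Path src = status.getPath();
            Path dst = src.suffix(codec.getDefaultExtension());
            InputStream in = fs.open(src);
            OutputStream out = codec.createOutputStream(fs.create(dst));
            IOUtils.copyBytes(in, out, conf); // copies and closes both streams
        }
    }
}

One thing I am not sure about: as far as I know a plain .lzo file is only splittable after an index is built for it (hadoop-lzo ships an LzoIndexer / DistributedLzoIndexer for that), so I would plan to run the indexer over the directory afterwards.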
2012/4/4 Raghu Angadi <[email protected]>

> SequenceFileStorage in elephant-bird lets you load and store to sequence
> files. If your input is text lines, you can store each line as 'value'.
> You can experiment with different codecs.
>
> Depending on your use case, simple bzip2 files may not be a bad choice.
>
> On Tue, Apr 3, 2012 at 1:57 PM, Mohit Anchlia <[email protected]> wrote:
>
> > Thanks for the examples. It appears that Snappy is not splittable and the
> > suggested approach is to write to sequence files.
> >
> > I know how to load from sequence files, but in Pig I can't find a way to
> > write to sequence files using Snappy compression.
> >
> > On Tue, Apr 3, 2012 at 1:30 PM, Prashant Kommireddi <[email protected]> wrote:
> >
> > > > Does it mean Snappy is splittable?
> > >
> > > http://www.cloudera.com/blog/2011/09/snappy-and-hadoop/
> > >
> > > > If so then how can I use it in pig?
> > >
> > > http://hadoopified.wordpress.com/2012/01/24/snappy-compression-with-pig/
> > >
> > > On Tue, Apr 3, 2012 at 1:02 PM, Mohit Anchlia <[email protected]> wrote:
> > >
> > > > I am currently using Snappy in sequence files. I wasn't aware Snappy
> > > > uses block compression. Does it mean Snappy is splittable? If so, then
> > > > how can I use it in pig?
> > > >
> > > > Thanks again
> > > >
> > > > On Tue, Apr 3, 2012 at 12:42 PM, Prashant Kommireddi <[email protected]> wrote:
> > > >
> > > > > Most companies handling BigData use LZO, and a few have started
> > > > > exploring/using Snappy as well (which is not any easier to
> > > > > configure). These are the 2 splittable fast-compression algorithms.
> > > > > Note that Snappy is not efficient space-wise compared to gzip or
> > > > > other compression algos, but it is a lot faster (ideal for
> > > > > compression between Map and Reduce).
> > > > >
> > > > > Is there any repeated/heavy computation involved on the outputs
> > > > > other than pushing this data to a database? If not, maybe it's fine
> > > > > to use gzip, but you have to make sure the individual files are
> > > > > close to the block size, or you will have a lot of unnecessary IO
> > > > > transfers taking place. If you read the outputs to perform further
> > > > > MapReduce computation, gzip is not the best.
> > > > >
> > > > > -Prashant
> > > > >
> > > > > On Tue, Apr 3, 2012 at 12:18 PM, Mohit Anchlia <[email protected]> wrote:
> > > > >
> > > > > > Thanks for your input.
> > > > > >
> > > > > > It looks like it's some work to configure LZO. What are the other
> > > > > > alternatives? We read new sequence files and generate output
> > > > > > continuously. What are my options? Should I split the output into
> > > > > > small pieces and gzip them? How do people solve similar problems
> > > > > > where there is a continuous flow of data that generates tons of
> > > > > > output continuously?
> > > > > >
> > > > > > After the output is generated we read it again and load it into
> > > > > > an OLAP db or do some other analysis.
> > > > > >
> > > > > > On Tue, Apr 3, 2012 at 11:48 AM, Prashant Kommireddi <[email protected]> wrote:
> > > > > >
> > > > > > > Yes, it is splittable.
> > > > > > >
> > > > > > > Bzip2 consumes a lot of CPU in decompression. With Hadoop jobs
> > > > > > > generally being IO bound, Bzip2 can sometimes become the
> > > > > > > performance bottleneck due to this slow decompression rate (the
> > > > > > > algorithm is unable to decompress at the disk read rate).
> > > > > > >
> > > > > > > On Tue, Apr 3, 2012 at 11:39 AM, Mohit Anchlia <[email protected]> wrote:
> > > > > > >
> > > > > > > > Is bzip2 not advisable? I think it can split too and is
> > > > > > > > supported out of the box.
> > > > > > > >
> > > > > > > > On Thu, Mar 29, 2012 at 8:08 PM, 帝归 <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > When I use LzoPigStorage, it will load all files under a
> > > > > > > > > directory. But I want to compress every file under a
> > > > > > > > > directory and keep the file names unchanged, just with a
> > > > > > > > > .lzo extension. How can I do this? Maybe I must write a
> > > > > > > > > MapReduce job?
> > > > > > > > >
> > > > > > > > > 2012/3/30 Jonathan Coveney <[email protected]>
> > > > > > > > >
> > > > > > > > > > Check out:
> > > > > > > > > > https://github.com/kevinweil/elephant-bird/tree/master/src/java/com/twitter/elephantbird/pig/store
> > > > > > > > > >
> > > > > > > > > > 2012/3/29 Mohit Anchlia <[email protected]>
> > > > > > > > > >
> > > > > > > > > > > Thanks! When I store output, how can I tell Pig to
> > > > > > > > > > > compress it in LZO format?
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Mar 29, 2012 at 4:02 PM, Dmitriy Ryaboy <[email protected]> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > You might find the elephant-bird project helpful for
> > > > > > > > > > > > reading and creating LZO files, in raw Hadoop or
> > > > > > > > > > > > using Pig. (Disclaimer: I'm a committer on
> > > > > > > > > > > > elephant-bird.)
> > > > > > > > > > > >
> > > > > > > > > > > > D
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Mar 28, 2012 at 9:49 AM, Prashant Kommireddi <[email protected]> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Pig supports LZO for splittable compression.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > Prashant
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Mar 28, 2012, at 9:45 AM, Mohit Anchlia <[email protected]> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > We currently have 100s of GB of uncompressed data
> > > > > > > > > > > > > > which we would like to zip using some compression
> > > > > > > > > > > > > > that is block compression, so that we can use
> > > > > > > > > > > > > > multiple input splits. Does Pig support any such
> > > > > > > > > > > > > > compression?
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > ‘(hello world)

--
‘(hello world)
