I think LZO is a good format for compressing these files, because both compression and decompression are fast. As LZO is not bundled with Hadoop's built-in compression codecs (because of its GPL license?), I need to write a Java program to compress the files on HDFS.
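Something like the following is what I have in mind. This is only a rough sketch, assuming the hadoop-lzo library (the kevinweil/hadoop-lzo project, which provides com.hadoop.compression.lzo.LzopCodec) is on the classpath and that LzopCodec is registered under io.compression.codecs in core-site.xml; everything else is the standard Hadoop FileSystem API. Corrections welcome:

import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class LzoCompressDir {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Look the codec up by class name so there is no compile-time
        // dependency on hadoop-lzo; this returns null unless LzopCodec
        // is listed in io.compression.codecs.
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec = factory.getCodecByClassName(
            "com.hadoop.compression.lzo.LzopCodec");

        // Compress every plain file under the given directory, keeping the
        // file name and appending the codec's default extension (".lzo").
        for (FileStatus status : fs.listStatus(new Path(args[0]))) {
            if (status.isDir()) {
                continue;
            }
            Path src = status.getPath();
            Path dst = src.suffix(codec.getDefaultExtension());
            InputStream in = fs.open(src);
            OutputStream out = codec.createOutputStream(fs.create(dst));
            IOUtils.copyBytes(in, out, conf); // copies and closes both streams
        }
    }
}

One thing I am not sure about: as far as I know a plain .lzo file is only splittable after an index is built for it (hadoop-lzo ships an LzoIndexer / DistributedLzoIndexer for that), so I would plan to run the indexer over the directory afterwards.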
2012/4/4 Raghu Angadi <[email protected]>

> SequenceFileStorage in elephant-bird lets you load and store to sequence
> files. If your input is text lines, you can store each line as 'value'.
> You can experiment with different codecs.
>
> Depending on your use case, simple bzip2 files may not be a bad choice.
>
> On Tue, Apr 3, 2012 at 1:57 PM, Mohit Anchlia <[email protected]> wrote:
>
> > Thanks for the examples. It appears that Snappy is not splittable and the
> > suggested approach is to write to sequence files.
> >
> > I know how to load from sequence files, but in Pig I can't find a way to
> > write to sequence files using Snappy compression.
> >
> > On Tue, Apr 3, 2012 at 1:30 PM, Prashant Kommireddi <[email protected]> wrote:
> >
> > > > Does it mean Snappy is splittable?
> > >
> > > http://www.cloudera.com/blog/2011/09/snappy-and-hadoop/
> > >
> > > > If so then how can I use it in pig?
> > >
> > > http://hadoopified.wordpress.com/2012/01/24/snappy-compression-with-pig/
> > >
> > > On Tue, Apr 3, 2012 at 1:02 PM, Mohit Anchlia <[email protected]> wrote:
> > >
> > > > I am currently using Snappy in sequence files. I wasn't aware Snappy
> > > > uses block compression. Does it mean Snappy is splittable? If so, then
> > > > how can I use it in pig?
> > > >
> > > > Thanks again
> > > >
> > > > On Tue, Apr 3, 2012 at 12:42 PM, Prashant Kommireddi <[email protected]> wrote:
> > > >
> > > > > Most companies handling BigData use LZO, and a few have started
> > > > > exploring/using Snappy as well (which is not any easier to
> > > > > configure). These are the 2 splittable fast-compression algorithms.
> > > > > Note that Snappy is not efficient space-wise compared to gzip or
> > > > > other compression algos, but it is a lot faster (ideal for
> > > > > compression between Map and Reduce).
> > > > >
> > > > > Is there any repeated/heavy computation involved on the outputs
> > > > > other than pushing this data to a database? If not, maybe it's fine
> > > > > to use gzip, but you have to make sure the individual files are
> > > > > close to the block size, or you will have a lot of unnecessary IO
> > > > > transfers taking place. If you read the outputs to perform further
> > > > > MapReduce computation, gzip is not the best.
> > > > >
> > > > > -Prashant
> > > > >
> > > > > On Tue, Apr 3, 2012 at 12:18 PM, Mohit Anchlia <[email protected]> wrote:
> > > > >
> > > > > > Thanks for your input.
> > > > > >
> > > > > > It looks like it's some work to configure LZO. What are the other
> > > > > > alternatives? We read new sequence files and generate output
> > > > > > continuously. What are my options? Should I split the output into
> > > > > > small pieces and gzip them? How do people solve similar problems
> > > > > > where there is a continuous flow of data that generates tons of
> > > > > > output continuously?
> > > > > >
> > > > > > After the output is generated we read it again and load it into
> > > > > > an OLAP db or do some other analysis.
> > > > > >
> > > > > > On Tue, Apr 3, 2012 at 11:48 AM, Prashant Kommireddi <[email protected]> wrote:
> > > > > >
> > > > > > > Yes, it is splittable.
> > > > > > >
> > > > > > > Bzip2 consumes a lot of CPU in decompression. With Hadoop jobs
> > > > > > > generally being IO bound, Bzip2 can sometimes become the
> > > > > > > performance bottleneck due to this slow decompression rate (the
> > > > > > > algorithm is unable to decompress at the disk read rate).
> > > > > > >
> > > > > > > On Tue, Apr 3, 2012 at 11:39 AM, Mohit Anchlia <[email protected]> wrote:
> > > > > > >
> > > > > > > > Is bzip2 not advisable? I think it can split too and is
> > > > > > > > supported out of the box.
> > > > > > > >
> > > > > > > > On Thu, Mar 29, 2012 at 8:08 PM, 帝归 <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > When I use LzoPigStorage, it will load all files under a
> > > > > > > > > directory. But I want to compress every file under a
> > > > > > > > > directory and keep the file names unchanged, just with a
> > > > > > > > > .lzo extension. How can I do this? Maybe I must write a
> > > > > > > > > MapReduce job?
> > > > > > > > >
> > > > > > > > > 2012/3/30 Jonathan Coveney <[email protected]>
> > > > > > > > >
> > > > > > > > > > Check out:
> > > > > > > > > > https://github.com/kevinweil/elephant-bird/tree/master/src/java/com/twitter/elephantbird/pig/store
> > > > > > > > > >
> > > > > > > > > > 2012/3/29 Mohit Anchlia <[email protected]>
> > > > > > > > > >
> > > > > > > > > > > Thanks! When I store output, how can I tell Pig to
> > > > > > > > > > > compress it in LZO format?
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Mar 29, 2012 at 4:02 PM, Dmitriy Ryaboy <[email protected]> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > You might find the elephant-bird project helpful for
> > > > > > > > > > > > reading and creating LZO files, in raw Hadoop or
> > > > > > > > > > > > using Pig. (Disclaimer: I'm a committer on
> > > > > > > > > > > > elephant-bird.)
> > > > > > > > > > > >
> > > > > > > > > > > > D
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Mar 28, 2012 at 9:49 AM, Prashant Kommireddi <[email protected]> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Pig supports LZO for splittable compression.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > Prashant
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Mar 28, 2012, at 9:45 AM, Mohit Anchlia <[email protected]> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > We currently have 100s of GB of uncompressed data
> > > > > > > > > > > > > > which we would like to zip using some compression
> > > > > > > > > > > > > > that is block compression, so that we can use
> > > > > > > > > > > > > > multiple input splits. Does Pig support any such
> > > > > > > > > > > > > > compression?
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > ‘(hello world)

--
‘(hello world)
