Hey Som,

Something seems amiss; I use this trick in Cloudera ML to handle output compression, viz.:

https://github.com/cloudera/ml/blob/master/client/src/main/java/com/cloudera/science/ml/client/params/PipelineParameters.java

Can you send me a gist of what you're trying if that doesn't work?

J

On Fri, Aug 2, 2013 at 5:33 PM, Som Satpathy <[email protected]> wrote:

> Thanks Josh. I tried setting compression parameters via the Configuration
> object and also via the command line, but the output sequence file never
> seems to get compressed. I'm trying to Snappy-compress it.
>
> If I try creating a sequence file outside of Crunch using
> SequenceFile.createWriter, I see the file getting compressed with my
> compression type (i.e. Snappy).
>
> I was wondering if this is a known issue with Crunch.
>
> Thanks,
> Som
>
>
> On Fri, Aug 2, 2013 at 4:56 PM, Josh Wills <[email protected]> wrote:
>
>> Hey Som,
>>
>> The Pipeline object that coordinates the flow has a getConfiguration()
>> method where you can set any options you might like and they will propagate
>> to all of your jobs.
>>
>> I usually implement Hadoop's Tool interface and then specify these
>> configuration options on the command line so I can play with them
>> independent of the logic of my runtime, and I end up w/something like:
>>
>> hadoop jar <crunch-job.jar> -D mapred.output.compress=true -D
>> mapred.output.compression.type=BLOCK etc.
>>
>> I think that having some syntactic sugar for compressing Target objects
>> (like To.sequenceFile or To.avroFile) would be a nice JIRA.
>>
>> J
>>
>>
>> On Fri, Aug 2, 2013 at 3:58 PM, Som Satpathy <[email protected]> wrote:
>>
>>> Hi all,
>>>
>>> I am trying to write compressed sequence files at the end of my Crunch
>>> pipeline. I'm doing a pipeline.write(mycollection, To.sequenceFile(path))
>>> for that.
>>> However, Crunch is writing an uncompressed sequence file by default. How
>>> do I pass the codec that I want to use to Crunch?
>>>
>>> Looking forward to your input.
>>>
>>> Thanks,
>>> Som
>>>
>>>
>>
>>
>> --
>> Director of Data Science
>> Cloudera <http://www.cloudera.com>
>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>
>
>
--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
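
For reference, here is a rough sketch of the settings discussed in this thread. It assumes Hadoop 1.x property names, where the flag that enables output compression is `mapred.output.compress`, the compression type is an upper-case enum name such as `BLOCK`, and the codec is given by class name. The class `CompressionOptions` and its methods are hypothetical helpers invented for this illustration; they are not part of Crunch or Hadoop, and the sketch uses only `java.util` so it can be run without a cluster.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not a Crunch or Hadoop class): collects the
// output-compression properties and renders them as -D arguments.
public class CompressionOptions {

    // Hadoop 1.x (mapred.*) property names for Snappy, block-compressed output.
    public static Map<String, String> snappyBlockOutput() {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("mapred.output.compress", "true");
        props.put("mapred.output.compression.type", "BLOCK");
        props.put("mapred.output.compression.codec",
                  "org.apache.hadoop.io.compress.SnappyCodec");
        return props;
    }

    // Renders each property as the "-D key=value" pair accepted by jobs that
    // implement Hadoop's Tool interface and are launched via ToolRunner.
    public static List<String> toCommandLineArgs(Map<String, String> props) {
        List<String> args = new ArrayList<>();
        for (Map.Entry<String, String> e : props.entrySet()) {
            args.add("-D");
            args.add(e.getKey() + "=" + e.getValue());
        }
        return args;
    }

    public static void main(String[] argv) {
        Map<String, String> props = snappyBlockOutput();
        // Prints the flags to append after "hadoop jar <crunch-job.jar>".
        System.out.println(String.join(" ", toCommandLineArgs(props)));
        // Inside a Crunch job, the same pairs could instead be applied with
        // pipeline.getConfiguration().set(key, value) before pipeline.write(...).
    }
}
```

Whether these settings actually reach the sequence file writer depends on the Crunch and Hadoop versions in use, so it is worth checking the job's effective configuration if the output still comes out uncompressed.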
