On Sun, Sep 13, 2015 at 10:36 AM, Everett Anderson <[email protected]> wrote:
> Hi!
>
> On Sat, Sep 12, 2015 at 11:15 PM, Josh Wills <[email protected]> wrote:
>
>> On Sat, Sep 12, 2015 at 2:35 PM, Everett Anderson <[email protected]>
>> wrote:
>>
>>> Hi,
>>>
>>> I've got two basic questions about org.apache.crunch.io.Compress
>>> <https://crunch.apache.org/apidocs/0.12.0/index.html?overview-summary.html>.
>>>
>>> 1) It seems like it should only be used to wrap Targets that are
>>> themselves binary file output formats, but org.apache.crunch.io.To only
>>> has text, avro, and sequence, none of which seem appropriate. How do
>>> people tend to use this? Is there a Hadoop FileOutputFormat that they
>>> give to To.formattedFile?
>>>
>>
>> I don't understand the question-- the Compress methods can be used for
>> any sort of output format that extends FileOutputFormat, it doesn't matter
>> whether it's text/sequence/avro or a custom thing.
>>
>
> I think I may just not understand how it's to be used.
>
> For example, if you do something like this:
>
>   PCollection<String> data = ...
>
>   Target baseTarget = To.textFile("out1");
>   Target compressedTarget = Compress.gzip(baseTarget);
>
>   data.write(compressedTarget);
>
> What is the output file supposed to be? Is it a UTF-8 encoded text file of
> Strings, each of which has been passed through gzip?
>
> I'm actually looking for a way to compress each of the part-* output files
> itself, such that they'd be gzip (or lzo) files that contain text. Does
> that make sense? Is there an easy wrapper to do that?
>

I think that what it does now is what you want-- each part-* file is
gzipped (or snappied, or whatever). Is that not what seems to be happening
when you run it?

>>
>>> 2) The implementation of Compress.gzip is
>>>
>>>   public static <T extends Target> T gzip(T target) {
>>>     return (T) compress(target, GzipCodec.class)
>>>         .outputConf(AvroJob.OUTPUT_CODEC,
>>>             DataFileConstants.DEFLATE_CODEC);
>>>   }
>>>
>>> Does this mean it can only work with Avro?
>>>
>>
>> No, it's just that Avro has its own built-in support for gzip/snappy
>> serialization and it requires some extra conf to enable it. Any other
>> output format will just ignore that configuration parameter.
>>
>
> Cool!
>
>>> Thanks!
>>>
>>> *DISCLAIMER:* The contents of this email, including any attachments,
>>> may contain information that is confidential, proprietary in nature,
>>> protected health information (PHI), or otherwise protected by law from
>>> disclosure, and is solely for the use of the intended recipient(s). If you
>>> are not the intended recipient, you are hereby notified that any use,
>>> disclosure or copying of this email, including any attachments, is
>>> unauthorized and strictly prohibited. If you have received this email in
>>> error, please notify the sender of this email. Please delete this and all
>>> copies of this email from your system. Any opinions either expressed or
>>> implied in this email and all attachments, are those of its author only,
>>> and do not necessarily reflect those of Nuna Health, Inc.
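
P.S. One way to sanity-check the "whole part file is gzipped" layout is to
open an output file as a single gzip stream and read text lines from it.
The sketch below is only an illustration and does not use Crunch at all: it
relies on java.util.zip from the JDK, and the class name, helper names, and
temp file name are made up for this example. It also assumes the text
output is newline-terminated UTF-8.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class, not part of Crunch or Hadoop.
public class GzipPartFileSketch {

    // Writes every line as UTF-8 text into ONE gzip stream, i.e. the file
    // as a whole is compressed rather than each string separately.
    public static void writeGzippedLines(Path file, List<String> lines) throws IOException {
        try (BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
                new GZIPOutputStream(Files.newOutputStream(file)), StandardCharsets.UTF_8))) {
            for (String line : lines) {
                out.write(line);
                out.write('\n');
            }
        }
    }

    // Decompresses the whole file as one gzip stream and returns its lines.
    public static List<String> readGzippedLines(Path file) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(file)), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a compressed part file; the name is an assumption.
        Path part = Files.createTempFile("part-sketch", ".gz");
        writeGzippedLines(part, Arrays.asList("first record", "second record"));
        for (String line : readGzippedLines(part)) {
            System.out.println(line);
        }
        Files.delete(part);
    }
}
```

If readGzippedLines succeeds on one of your real part-* files and returns
your records, the file is a gzip file that contains text, which is the
layout you were asking for.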
