Yeah, more like laziness on the part of whoever wrote the MemPipeline impl. ;)

On Sun, Sep 13, 2015 at 7:17 PM Everett Anderson <[email protected]> wrote:
> On Sun, Sep 13, 2015 at 6:03 PM, Josh Wills <[email protected]> wrote:
>
>> On Sun, Sep 13, 2015 at 10:36 AM, Everett Anderson <[email protected]>
>> wrote:
>>
>>> Hi!
>>>
>>> On Sat, Sep 12, 2015 at 11:15 PM, Josh Wills <[email protected]>
>>> wrote:
>>>
>>>> On Sat, Sep 12, 2015 at 2:35 PM, Everett Anderson <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I've got two basic questions about org.apache.crunch.io.Compress
>>>>> <https://crunch.apache.org/apidocs/0.12.0/index.html?overview-summary.html>
>>>>> .
>>>>>
>>>>> 1) It seems like it should only be used to wrap Targets that are
>>>>> themselves binary file output formats, but org.apache.crunch.io.To
>>>>> only has text, avro, and sequence, none of which seem appropriate. How do
>>>>> people tend to use this? Is there a Hadoop FileOutputFormat that they give
>>>>> to To.formattedFile?
>>>>>
>>>>
>>>> I don't understand the question-- the Compress methods can be used for
>>>> any sort of output format that extends FileOutputFormat, it doesn't matter
>>>> whether it's text/sequence/avro or a custom thing.
>>>>
>>>
>>> I think I may just not understand how it's to be used.
>>>
>>> For example, if you do something like this:
>>>
>>> PCollection<String> data = ...
>>>
>>> Target baseTarget = To.textFile("out1");
>>> Target compressedTarget = Compress.gzip(baseTarget);
>>>
>>> data.write(compressedTarget);
>>>
>>> What is the output file supposed to be? Is it a UTF-8 encoded text file
>>> of Strings, each of which has been passed through gzip?
>>>
>>> I'm actually looking for a way to compress each of the part-* output
>>> files itself, such that they'd be gzip (or lzo) files that contain text.
>>> Does that make sense? Is there an easy wrapper to do that?
>>>
>>
>> I think that what it does now is what you want-- each part-* file is
>> gzipped (or snappied, or whatever). Is that not what seems to be happening
>> when you run it?
>>
>
> Oh! It looks like it does create .gz part files with the MRPipeline, but
> with the MemPipeline, which was what I was using to play around with, it
> just creates a text file.
>
> Example:
>
> Pipeline pipeline = MemPipeline.getInstance();
> List<String> dataElements = new ArrayList<>(100);
> for (int i = 0; i < 100; i++) {
>   dataElements.add("Test data element");
> }
>
> PCollection<String> data = pipeline.create(dataElements,
>     Writables.strings());
>
> Target baseTarget = To.textFile("out1");
> Target compressedTarget = Compress.gzip(baseTarget);
> data.write(compressedTarget, Target.WriteMode.OVERWRITE);
>
> pipeline.done();
>
> Results in a out1/out1.txt file which is just plain text.
>
> Switching to the MRPipeline results in a out1/part-m-00000.gz file which
> is, indeed, a gzip file.
>
> I'm not sure if this is a bug given the MemPipeline is likely only meant
> to be used for unit tests?
>
>>>>> 2) The implementation of Compress.gzip is
>>>>>
>>>>> public static <T extends Target> T gzip(T target) {
>>>>>   return (T) compress(target, GzipCodec.class)
>>>>>       .outputConf(AvroJob.OUTPUT_CODEC,
>>>>>           DataFileConstants.DEFLATE_CODEC);
>>>>> }
>>>>>
>>>>> Does this mean it can only work with Avro?
>>>>>
>>>>
>>>> No, it's just that Avro has its own built-in support for gzip/snappy
>>>> serialization and it requires some extra conf to enable it. Any other
>>>> output format will just ignore that configuration parameter.
>>>>
>>>
>>> Cool!
>>>
>>>>> Thanks!
>>>>>
>>>>> *DISCLAIMER:* The contents of this email, including any attachments,
>>>>> may contain information that is confidential, proprietary in nature,
>>>>> protected health information (PHI), or otherwise protected by law from
>>>>> disclosure, and is solely for the use of the intended recipient(s).
>>>>> If you are not the intended recipient, you are hereby notified that any
>>>>> use, disclosure or copying of this email, including any attachments, is
>>>>> unauthorized and strictly prohibited. If you have received this email in
>>>>> error, please notify the sender of this email. Please delete this and all
>>>>> copies of this email from your system. Any opinions either expressed or
>>>>> implied in this email and all attachments, are those of its author only,
>>>>> and do not necessarily reflect those of Nuna Health, Inc.
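
For anyone who wants to check which kind of file a given pipeline actually wrote (the plain-text out1/out1.txt versus the gzipped out1/part-m-00000.gz described in the thread), here is a minimal sketch that looks at the two-byte gzip magic number at the start of a file. It uses only the JDK, not Crunch. The class name GzipCheck and the methods isGzip and writeSample are hypothetical names made up for this illustration, and the sample files are written by the sketch itself purely to have something to inspect; they are not real pipeline output.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper for illustration only; not part of Apache Crunch.
class GzipCheck {

    /** Returns true if the file starts with the gzip magic bytes 0x1f 0x8b. */
    static boolean isGzip(Path file) {
        try (InputStream in = Files.newInputStream(file)) {
            int first = in.read();
            int second = in.read();
            if (first < 0 || second < 0) {
                return false; // fewer than two bytes, so it cannot be gzip
            }
            // GZIP_MAGIC is 0x8b1f: the second byte is the high-order half.
            return ((second << 8) | first) == GZIPInputStream.GZIP_MAGIC;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes one line of sample text to a fresh temp directory, optionally gzipped. */
    static Path writeSample(String name, boolean gzip) {
        try {
            Path file = Files.createTempDirectory("out1").resolve(name);
            try (OutputStream raw = Files.newOutputStream(file);
                 OutputStream out = gzip ? new GZIPOutputStream(raw) : raw) {
                out.write("Test data element\n".getBytes(StandardCharsets.UTF_8));
            }
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // File names borrowed from the thread; contents are generated here.
        Path plain = writeSample("out1.txt", false);
        Path zipped = writeSample("part-m-00000.gz", true);
        System.out.println(plain.getFileName() + " gzip? " + isGzip(plain));
        System.out.println(zipped.getFileName() + " gzip? " + isGzip(zipped));
    }
}
```

Pointing isGzip at a real part file from either pipeline should tell you whether the Compress wrapper took effect for that run, regardless of the file extension.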
