It sounds like this could be down to block-level vs record-level compression -- could you check that mapred.output.compression.type was set to the same thing (should probably be BLOCK) in both cases?
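For reference, a minimal sketch of pinning both jobs to the same block-level compression settings through the pipeline's Hadoop configuration. This assumes an MRPipeline named `pipeline` (Crunch's Pipeline exposes getConfiguration()); the property names are the pre-YARN Hadoop 1.x keys in use at the time:

```java
// Hypothetical setup -- assumes an existing MRPipeline named `pipeline`.
Configuration conf = pipeline.getConfiguration();
conf.setBoolean("mapred.output.compress", true);
conf.set("mapred.output.compression.codec",
         "org.apache.hadoop.io.compress.SnappyCodec");
// BLOCK compresses many records together; RECORD compresses each record
// separately and typically yields much larger files.
conf.set("mapred.output.compression.type", "BLOCK");
```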
On Thu, Oct 31, 2013 at 7:57 PM, Josh Wills <[email protected]> wrote:
> That's surprising -- I know that the block size can matter for sequence/avro
> files w/Snappy, but I don't know of any similar issues or settings that need
> to be in place for text.
>
> On Thu, Oct 31, 2013 at 11:38 AM, Durfey,Stephen <[email protected]> wrote:
>> Coincidentally enough, yesterday I was also looking into a way to merge
>> csv output files into one larger csv output file, to prevent cluttering up
>> the namenode with many smaller csv files.
>>
>> Background:
>> In our crunch pipeline we capture context information about errors we
>> encountered, and then write it out to csv files. The csv files themselves
>> are just a side effect of our processing and not the main output; they are
>> written out from our map tasks, before the data we did process is bulk
>> loaded into hbase. The output of these csv files is compressed as snappy.
>>
>> Problem:
>> I ran the pipeline against one of our data sources and it produced 14
>> different snappy-compressed csv files, totaling 4.6GB. After the job
>> finished I created a new TextFileSource pointing to the directory in hdfs
>> that contained the 14 files and, using Shard, set the number of partitions
>> to 1 to write everything out to one file. The new file size after the
>> combination is 11.6GB, compressed as snappy. It's not clear to me why the
>> file size would almost triple. Any ideas?
>>
>> Thanks,
>> Stephen
>>
>> From: Som Satpathy <[email protected]>
>> Reply-To: "[email protected]" <[email protected]>
>> Date: Wednesday, October 30, 2013 5:36 PM
>> To: "[email protected]" <[email protected]>
>> Subject: Re: Making crunch job output single file
>>
>> Thanks for the help Josh!
>>
>> On Wed, Oct 30, 2013 at 2:37 PM, Josh Wills <[email protected]> wrote:
>>> Best guess is that the input data is compressed, but the output data is
>>> not -- Crunch does not turn it on by default.
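The near-tripling described above is consistent with record-level compression: compressing each small record independently pays per-unit overhead and cannot exploit redundancy across records. A minimal stand-alone sketch of the effect, using the JDK's DEFLATE as a stand-in for Snappy (Snappy itself is not in the JDK; the class and method names here are illustrative):

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

// Demonstrates why record-level compression (one compressed unit per record)
// produces much larger output than block-level compression (many records
// compressed together). DEFLATE stands in for Snappy here.
public class CompressionGranularityDemo {

    // Compress one byte[] and return only the compressed length.
    static int deflatedLength(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] buf = new byte[4096];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buf); // count output bytes, discard them
        }
        deflater.end();
        return total;
    }

    // Record-level: every record is compressed independently.
    static int recordLevelSize(byte[][] records) {
        int total = 0;
        for (byte[] r : records) {
            total += deflatedLength(r);
        }
        return total;
    }

    // Block-level: all records are concatenated and compressed as one block.
    static int blockLevelSize(byte[][] records) {
        ByteArrayOutputStream all = new ByteArrayOutputStream();
        for (byte[] r : records) {
            all.write(r, 0, r.length);
        }
        return deflatedLength(all.toByteArray());
    }

    // Small, highly repetitive csv-like rows, as in the thread's use case.
    static byte[][] sampleRecords() {
        byte[][] records = new byte[200][];
        for (int i = 0; i < records.length; i++) {
            records[i] = ("row," + (i % 7) + ",some,repeated,csv,payload\n")
                    .getBytes(java.nio.charset.StandardCharsets.UTF_8);
        }
        return records;
    }

    public static void main(String[] args) {
        byte[][] records = sampleRecords();
        // Record-level total comes out several times larger than block-level.
        System.out.println("record-level: " + recordLevelSize(records));
        System.out.println("block-level:  " + blockLevelSize(records));
    }
}
```

With repetitive rows like these, the block-level total is a small fraction of the record-level total, which is the same direction of effect as the 4.6GB-to-11.6GB jump described above.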
>>> On Oct 30, 2013 4:56 PM, "Som Satpathy" <[email protected]> wrote:
>>>> Maybe we can expect the csv to size up by that much compared to the
>>>> input sequence file; I just wanted to confirm that I'm using shard()
>>>> correctly.
>>>>
>>>> Thanks,
>>>> Som
>>>>
>>>> On Wed, Oct 30, 2013 at 1:46 PM, Som Satpathy <[email protected]> wrote:
>>>>> Hi Josh,
>>>>>
>>>>> Thank you for the input. I incorporated Shard in the mrpipeline; this
>>>>> time I get one output csv part-r file, but interestingly the file size
>>>>> is much bigger than the input sequence file size.
>>>>>
>>>>> The input sequence file is around 11GB and the final csv turns out to
>>>>> be 65GB in size.
>>>>>
>>>>> Let me explain what I'm trying to do. This is my mrpipeline:
>>>>>
>>>>> PCollection<T> collection1 = pipeline.read(fromSequenceFile).parallelDo(doFn1());
>>>>> PCollection<T> collection2 = collection1.filter(filterFn1());
>>>>> PCollection<T> collection3 = collection2.filter(filterFn2());
>>>>> PCollection<T> collection4 = collection3.parallelDo(doFn3());
>>>>>
>>>>> PCollection<T> finalShardedCollection = Shard.shard(collection4, 1);
>>>>>
>>>>> pipeline.writeTextFile(finalShardedCollection, csvFilePath);
>>>>>
>>>>> pipeline.done();
>>>>>
>>>>> Am I using the shard correctly? It is weird that the output file is
>>>>> much bigger than the input file.
>>>>>
>>>>> I look forward to hearing from you.
>>>>>
>>>>> Thanks,
>>>>> Som
>>>>>
>>>>> On Wed, Oct 30, 2013 at 8:14 AM, Josh Wills <[email protected]> wrote:
>>>>>> Hey Som,
>>>>>>
>>>>>> Check out org.apache.crunch.lib.Shard, it does what you want.
>>>>>>
>>>>>> J
>>>>>>
>>>>>> On Wed, Oct 30, 2013 at 8:05 AM, Som Satpathy <[email protected]> wrote:
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I have a crunch job that should process a big sequence file and
>>>>>>> produce a single csv file.
>>>>>>> I am using
>>>>>>> "pipeline.writeTextFile(transformedRecords, csvFilePath)" to write to
>>>>>>> a csv (csvFilePath is like "/data/csv_directory"). The larger the
>>>>>>> input sequence file is, the more mappers are created, and thus an
>>>>>>> equivalent number of csv output files.
>>>>>>>
>>>>>>> In classic mapreduce one could output a single file by setting the
>>>>>>> #reducers to 1 while configuring the job. How could I achieve this
>>>>>>> with crunch?
>>>>>>>
>>>>>>> I would really appreciate any help here.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Som
>>>>>>
>>>>>> --
>>>>>> Director of Data Science
>>>>>> Cloudera
>>>>>> Twitter: @josh_wills
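The Shard.shard(collection, 1) approach recommended in the thread works because every record is routed to the same single partition, so one reducer writes one part-r file. A minimal stand-alone sketch of that routing step (plain Java, no Crunch dependency; the class and method names are illustrative, and the hash rule mirrors Hadoop's default HashPartitioner):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the partitioning behind Shard.shard(coll, n):
// each record is assigned to a reducer by hash modulo the partition count.
// With numPartitions = 1, every record lands in partition 0, so a single
// reducer -- and hence a single output file -- receives all the data.
public class ShardSketch {

    // Mirrors Hadoop's default HashPartitioner.getPartition().
    static int partitionFor(Object record, int numPartitions) {
        return (record.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Distribute records into numPartitions buckets.
    static List<List<String>> shard(List<String> records, int numPartitions) {
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (String r : records) {
            partitions.get(partitionFor(r, numPartitions)).add(r);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<String> rows = List.of("a,1", "b,2", "c,3", "d,4");
        // One partition: everything ends up in a single "part-r-00000".
        System.out.println(shard(rows, 1).get(0).size()); // prints 4
    }
}
```

The trade-off, as the thread shows, is that one reducer writes the entire dataset, so output compression settings matter all the more for the merged file.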
