That's surprising -- I know that the block size can matter for sequence/Avro files with Snappy, but I don't know of any similar issues or settings that need to be in place for text.
On Thu, Oct 31, 2013 at 11:38 AM, Durfey,Stephen <[email protected]> wrote:

> Coincidentally enough, yesterday I was also looking into a way to merge
> csv output files into one larger csv output file, to prevent cluttering up
> the namenode with many smaller csv files.
>
> Background:
> In our crunch pipeline we are capturing context information about errors
> we encountered, and then writing them out to csv files. The csv files
> themselves are just a side effect of our processing and not the main
> output; they are written out from our map tasks before the data we did
> process is bulk loaded into hbase. The output of these csv files is
> compressed as Snappy.
>
> Problem:
> I ran the pipeline against one of our data sources and it produced 14
> different Snappy-compressed csv files, totaling 4.6GB. After the job
> finished I created a new TextFileSource pointing to the directory in
> HDFS that contained the 14 files and, using Shard, set the number of
> partitions to 1 to write everything out to one file. The new file size
> after the combination is 11.6GB, compressed as Snappy. It's not clear to
> me why the file size would almost triple. Any ideas?
>
> Thanks,
> Stephen
>
> From: Som Satpathy <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Wednesday, October 30, 2013 5:36 PM
> To: "[email protected]" <[email protected]>
> Subject: Re: Making crunch job output single file
>
> Thanks for the help Josh!
>
>
> On Wed, Oct 30, 2013 at 2:37 PM, Josh Wills <[email protected]> wrote:
>
>> Best guess is that the input data is compressed, but the output data is
>> not -- Crunch does not turn it on by default.
>> On Oct 30, 2013 4:56 PM, "Som Satpathy" <[email protected]> wrote:
>>
>>> Maybe we can expect the csv to size up by that much compared to the
>>> input sequence file; I just wanted to confirm that I'm using shard()
>>> correctly.
>>>
>>> Thanks,
>>> Som
>>>
>>>
>>> On Wed, Oct 30, 2013 at 1:46 PM, Som Satpathy <[email protected]> wrote:
>>>
>>>> Hi Josh,
>>>>
>>>> Thank you for the input. I incorporated Shard in the mrpipeline; this
>>>> time I get one output csv part-r file, but interestingly the file size
>>>> is much bigger than the input sequence file size.
>>>>
>>>> The input sequence file is around 11GB and the final csv turns out to
>>>> be 65GB in size.
>>>>
>>>> Let me explain what I'm trying to do. This is my mrpipeline:
>>>>
>>>> PCollection<T> collection1 = pipeline.read(fromSequenceFile).parallelDo(doFn1());
>>>> PCollection<T> collection2 = collection1.filter(filterFn1());
>>>> PCollection<T> collection3 = collection2.filter(filterFn2());
>>>> PCollection<T> collection4 = collection3.parallelDo(doFn3());
>>>>
>>>> PCollection<T> finalShardedCollection = Shard.shard(collection4, 1);
>>>>
>>>> pipeline.writeTextFile(finalShardedCollection, csvFilePath);
>>>>
>>>> pipeline.done();
>>>>
>>>> Am I using the shard correctly? It is weird that the output file size
>>>> is much bigger than the input file.
>>>>
>>>> I look forward to hearing from you.
>>>>
>>>> Thanks,
>>>> Som
>>>>
>>>>
>>>>
>>>> On Wed, Oct 30, 2013 at 8:14 AM, Josh Wills <[email protected]> wrote:
>>>>
>>>>> Hey Som,
>>>>>
>>>>> Check out org.apache.crunch.lib.Shard, it does what you want.
>>>>>
>>>>> J
>>>>>
>>>>>
>>>>> On Wed, Oct 30, 2013 at 8:05 AM, Som Satpathy <[email protected]> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I have a crunch job that should process a big sequence file and
>>>>>> produce a single csv file. I am using
>>>>>> "pipeline.writeTextFile(transformedRecords, csvFilePath)" to write to
>>>>>> a csv (csvFilePath is like "/data/csv_directory"). The larger the
>>>>>> input sequence file is, the more mappers are created, and thus an
>>>>>> equivalent number of csv output files are created.
>>>>>>
>>>>>> In classic MapReduce one could output a single file by setting the
>>>>>> number of reducers to 1 while configuring the job. How can I achieve
>>>>>> this with crunch?
>>>>>>
>>>>>> I would really appreciate any help here.
>>>>>>
>>>>>> Thanks,
>>>>>> Som
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Director of Data Science
>>>>> Cloudera <http://www.cloudera.com>
>>>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>>
>>>>
>>>
>

--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
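[Editor's note] Since Josh's point that Crunch does not enable output compression by default comes up twice in this thread, here is a minimal sketch of the standard Hadoop output-compression properties one would set on the job configuration before writing text output. The class name `OutputCompressionConfig` is hypothetical, and the plain `Map` below is only a stand-in for `org.apache.hadoop.conf.Configuration` so the sketch compiles without Hadoop on the classpath; in a real pipeline you would call `conf.set(...)` on the `Configuration` handed to `MRPipeline`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OutputCompressionConfig {

    // Stand-in for org.apache.hadoop.conf.Configuration: collects the
    // property names and values that would be set on the real job config.
    static Map<String, String> outputCompressionProps() {
        Map<String, String> conf = new LinkedHashMap<>();
        // Classic MRv1 property names (Hadoop 1.x era, matching this thread):
        conf.put("mapred.output.compress", "true");
        conf.put("mapred.output.compression.codec",
                 "org.apache.hadoop.io.compress.SnappyCodec");
        // MRv2 equivalents, for newer clusters:
        conf.put("mapreduce.output.fileoutputformat.compress", "true");
        conf.put("mapreduce.output.fileoutputformat.compress.codec",
                 "org.apache.hadoop.io.compress.SnappyCodec");
        return conf;
    }

    public static void main(String[] args) {
        outputCompressionProps()
            .forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

Note that Snappy on plain text compresses each output file independently, so a single sharded file is not guaranteed to match the combined size of the originals.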
