Coincidentally enough, yesterday I was also looking into a way to merge csv output files into one larger csv output file to prevent cluttering up the namenode with many smaller csv files.
Background: In our Crunch pipeline we are capturing context information about errors we encountered, and then writing them out to csv files. The csv files themselves are just a side effect of our processing and not the main output, and they are written out from our map tasks, before the data we did process is bulk loaded into HBase. The output of these csv files is compressed as snappy.

Problem: I ran the pipeline against one of our data sources and it produced 14 different snappy-compressed csv files, totaling 4.6GB. After the job had finished, I created a new TextFileSource pointing to the directory in HDFS that contained the 14 files and, using Shard, set the number of partitions to 1 to write everything out to one file. The new file size after the combination is 11.6GB, compressed as snappy. It's not clear to me why the file size would almost triple.

Any ideas?

Thanks,
Stephen

From: Som Satpathy <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, October 30, 2013 5:36 PM
To: "[email protected]" <[email protected]>
Subject: Re: Making crunch job output single file

Thanks for the help Josh!

On Wed, Oct 30, 2013 at 2:37 PM, Josh Wills <[email protected]> wrote:

Best guess is that the input data is compressed, but the output data is not. Crunch does not turn output compression on by default.

On Oct 30, 2013 4:56 PM, "Som Satpathy" <[email protected]> wrote:

Maybe we can expect the csv to size up by that much compared to the input sequence file; I just wanted to confirm that I'm using shard() correctly.

Thanks,
Som

On Wed, Oct 30, 2013 at 1:46 PM, Som Satpathy <[email protected]> wrote:

Hi Josh,

Thank you for the input.
I incorporated Shard in the MRPipeline. This time I get a single output csv part-r file, but interestingly the file is much bigger than the input sequence file. The input sequence file is around 11GB and the final csv turns out to be 65GB.

Let me explain what I'm trying to do. This is my MRPipeline:

    PCollection<T> collection1 = pipeline.read(fromSequenceFile).parallelDo(doFn1());
    PCollection<T> collection2 = collection1.filter(filterFn1());
    PCollection<T> collection3 = collection2.filter(filterFn2());
    PCollection<T> collection4 = collection3.parallelDo(doFn3());
    PCollection<T> finalShardedCollection = Shard.shard(collection4, 1);
    pipeline.writeTextFile(finalShardedCollection, csvFilePath);
    pipeline.done();

Am I using the shard correctly? It is weird that the output file is much bigger than the input file.

Look forward to hearing from you.

Thanks,
Som

On Wed, Oct 30, 2013 at 8:14 AM, Josh Wills <[email protected]> wrote:

Hey Som,

Check out org.apache.crunch.lib.Shard, it does what you want.

J

On Wed, Oct 30, 2013 at 8:05 AM, Som Satpathy <[email protected]> wrote:

Hi all,

I have a crunch job that should process a big sequence file and produce a single csv file. I am using "pipeline.writeTextFile(transformedRecords, csvFilePath)" to write to a csv (csvFilePath is like "/data/csv_directory"). The larger the input sequence file is, the more mappers are created, and thus an equal number of csv output files are created. In classic mapreduce one could output a single file by setting the number of reducers to 1 while configuring the job. How could I achieve this with crunch?

I would really appreciate any help here.
Thanks,
Som

--
Director of Data Science
Cloudera
Twitter: @josh_wills
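
Josh's guess in the thread is that the compressed input is being compared against uncompressed text output. As a rough illustration of how much repetitive CSV text can shrink under compression, here is a minimal sketch using only the JDK. It makes several assumptions: gzip stands in for Snappy (the JDK ships no Snappy codec, and the two codecs give different ratios), the rows are synthetic, and the class and method names are hypothetical. It does not use Crunch or Hadoop and says nothing about the actual sizes reported above.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class CompressionSizeSketch {

    // Builds synthetic CSV-like text; the row layout is invented for illustration.
    public static byte[] syntheticCsv(int rows) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < rows; i++) {
            sb.append(i).append(",source-a,ERROR,record could not be parsed\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Compresses a byte array in memory with the JDK's gzip implementation
    // (a stand-in for Snappy, which is not part of the JDK).
    public static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        byte[] raw = syntheticCsv(100_000);
        byte[] packed = gzip(raw);
        // Print both sizes so the gap between compressed and plain text is visible.
        System.out.println("uncompressed bytes: " + raw.length);
        System.out.println("gzip bytes:         " + packed.length);
        System.out.printf("ratio: %.1fx%n", (double) raw.length / packed.length);
    }
}
```

Running main prints the two sizes and their ratio. The exact numbers depend on the data, so the point is only that a several-fold difference between compressed input and plain-text output is not surprising in itself.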
