Coincidentally, I was looking into the same thing yesterday: a way to merge csv 
output files into one larger csv output file, to avoid cluttering up the 
namenode with many smaller csv files.

Background:
In our crunch pipeline we are capturing context information about errors we 
encountered, and then writing them out to csv files. The csv files themselves 
are just a side effect of our processing and not the main output, and they are 
written out from our map tasks, before the data we did process is bulk loaded 
into hbase. The output of these csv files is compressed as snappy.

Problem:
I ran the pipeline against one of our data sources and it produced 14 different 
snappy compressed csv files, totaling 4.6GB. After the job finished, I 
created a new TextFileSource pointing to the directory in hdfs that 
contained the 14 files and, using Shard, set the number of partitions to 1 to 
write everything out to one file. The new file size after the combination is 
11.6GB, compressed as snappy. It's not clear to me why the file size would 
grow roughly 2.5x. Any ideas?

Thanks,
Stephen

From: Som Satpathy <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, October 30, 2013 5:36 PM
To: "[email protected]" <[email protected]>
Subject: Re: Making crunch job output single file

Thanks for the help Josh!


On Wed, Oct 30, 2013 at 2:37 PM, Josh Wills <[email protected]> wrote:

Best guess is that the input data is compressed but the output data is not; 
Crunch does not turn on output compression by default.
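
For reference, a minimal sketch of the job settings that usually turn on 
compressed output. The property names are the standard Hadoop 2 ones (older 
MR1 clusters use mapred.output.compress and mapred.output.compression.codec 
instead). The class and method names are hypothetical, and applying each pair 
with pipeline.getConfiguration().set(key, value) before the pipeline runs is 
an assumption about how this job is set up, not something confirmed in this 
thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: collects the Hadoop properties that request
// Snappy-compressed file output. It does not touch a cluster by itself.
final class OutputCompressionSettings {

    static Map<String, String> snappyTextOutput() {
        Map<String, String> settings = new LinkedHashMap<>();
        // Ask FileOutputFormat-based outputs to compress what they write.
        settings.put("mapreduce.output.fileoutputformat.compress", "true");
        // Codec class that ships with Hadoop for Snappy.
        settings.put("mapreduce.output.fileoutputformat.compress.codec",
                "org.apache.hadoop.io.compress.SnappyCodec");
        return settings;
    }

    public static void main(String[] args) {
        // Assumption: with a Hadoop Configuration in hand, each pair would be
        // applied as conf.set(key, value) before the job is launched.
        snappyTextOutput().forEach((key, value) ->
                System.out.println(key + "=" + value));
    }
}
```

Whether the text output actually honors these settings is worth confirming 
on a small input before rerunning the full job.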

On Oct 30, 2013 4:56 PM, "Som Satpathy" <[email protected]> wrote:
Maybe we can expect the csv to grow by that much compared to the input 
sequence file; I just wanted to confirm that I'm using shard() correctly.

Thanks,
Som


On Wed, Oct 30, 2013 at 1:46 PM, Som Satpathy <[email protected]> wrote:
Hi Josh,

Thank you for the input. I incorporated Shard into the MRPipeline, and this 
time I get a single output csv part-r file, but interestingly the file is much 
bigger than the input sequence file.

The input sequence file is around 11GB and the final csv turns out to be 
65GB.

Let me explain what I'm trying to do. This is my mrpipeline:

// Read the sequence file, then transform and filter the records.
PCollection<T> collection1 = pipeline.read(fromSequenceFile).parallelDo(doFn1());
PCollection<T> collection2 = collection1.filter(filterFn1());
PCollection<T> collection3 = collection2.filter(filterFn2());
PCollection<T> collection4 = collection3.parallelDo(doFn3());

// Force a single partition so that only one output file is written.
PCollection<T> finalShardedCollection = Shard.shard(collection4, 1);

pipeline.writeTextFile(finalShardedCollection, csvFilePath);

pipeline.done();

Am I using Shard correctly? It seems odd that the output file is so much 
bigger than the input file.
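
As a rough illustration of how far apart compressed and plain-text sizes can 
be, the sketch below builds some made-up csv rows and compares their raw size 
with their compressed size. It assumes nothing about the real data: the row 
layout is invented, and GZIP from the JDK stands in for whatever codec the 
sequence file uses, since Snappy is not part of the standard library. 
Repetitive text such as csv tends to compress well, so an uncompressed csv 
several times larger than its compressed input would not be surprising.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class; nothing here comes from the actual pipeline.
final class CsvCompressionRatio {

    // Builds synthetic csv text. The columns are invented for illustration.
    static byte[] syntheticCsv(int rows) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < rows; i++) {
            sb.append(i).append(",source-").append(i % 7)
              .append(",ERROR,record could not be parsed\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Returns the number of bytes the input occupies after GZIP compression.
    static int gzipSize(byte[] raw) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.size();
    }

    public static void main(String[] args) {
        byte[] raw = syntheticCsv(100_000);
        int compressed = gzipSize(raw);
        System.out.printf("raw=%d bytes, gzip=%d bytes, ratio=%.1fx%n",
                raw.length, compressed, (double) raw.length / compressed);
    }
}
```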

Looking forward to hearing from you.

Thanks,
Som



On Wed, Oct 30, 2013 at 8:14 AM, Josh Wills <[email protected]> wrote:
Hey Som,

Check out org.apache.crunch.lib.Shard, it does what you want.

J


On Wed, Oct 30, 2013 at 8:05 AM, Som Satpathy <[email protected]> wrote:
Hi all,

I have a crunch job that should process a big sequence file and produce a 
single csv file. I am using "pipeline.writeTextFile(transformedRecords, 
csvFilePath)" to write the csv (csvFilePath is like "/data/csv_directory"). 
The larger the input sequence file is, the more mappers are created, and thus 
an equal number of csv output files are created.

In classic mapreduce one could output a single file by setting the number of 
reducers to 1 while configuring the job. How could I achieve this with crunch?
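
To make the single-reducer idea concrete, here is a small sketch of why one 
reducer means one output file. It mirrors the formula used by Hadoop's 
default HashPartitioner; the class name and sample keys are hypothetical, and 
it is not a description of how any Crunch library is implemented internally. 
With a partition count of 1, every key lands in partition 0, so a single 
reduce task (and therefore a single part file) receives all records.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of hash partitioning, independent of any cluster.
final class SinglePartitionSketch {

    // Same arithmetic as Hadoop's default HashPartitioner: clear the sign
    // bit of the hash, then take it modulo the number of partitions.
    static int partitionFor(Object key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("alpha", "beta", "gamma", "delta");
        for (String key : keys) {
            System.out.println(key
                    + " -> with 4 partitions: " + partitionFor(key, 4)
                    + ", with 1 partition: " + partitionFor(key, 1));
        }
    }
}
```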

I would really appreciate any help here.

Thanks,
Som



--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>


