Maybe we can expect the csv to grow by that much compared to the input sequence file; I just wanted to confirm that I'm using shard() correctly.
Thanks,
Som

On Wed, Oct 30, 2013 at 1:46 PM, Som Satpathy <[email protected]> wrote:

> Hi Josh,
>
> Thank you for the input. I incorporated Shard in the mrpipeline; this time
> I get one output csv part-r file, but interestingly the file size is much
> bigger than the input sequence file size.
>
> The input sequence file is around 11GB and the final csv turns out to
> be 65GB in size.
>
> Let me explain what I'm trying to do. This is my mrpipeline:
>
> PCollection<T> collection1 = pipeline.read(fromSequenceFile).parallelDo(doFn1());
> PCollection<T> collection2 = collection1.filter(filterFn1());
> PCollection<T> collection3 = collection2.filter(filterFn2());
> PCollection<T> collection4 = collection3.parallelDo(doFn3());
>
> PCollection<T> finalShardedCollection = Shard.shard(collection4, 1);
>
> pipeline.writeTextFile(finalShardedCollection, csvFilePath);
>
> pipeline.done();
>
> Am I using shard correctly? It is weird that the output file size is
> much bigger than the input file.
>
> Looking forward to hearing from you.
>
> Thanks,
> Som
>
>
> On Wed, Oct 30, 2013 at 8:14 AM, Josh Wills <[email protected]> wrote:
>
>> Hey Som,
>>
>> Check out org.apache.crunch.lib.Shard, it does what you want.
>>
>> J
>>
>>
>> On Wed, Oct 30, 2013 at 8:05 AM, Som Satpathy <[email protected]> wrote:
>>
>>> Hi all,
>>>
>>> I have a crunch job that should process a big sequence file and produce
>>> a single csv file. I am using
>>> "pipeline.writeTextFile(transformedRecords, csvFilePath)" to write to a
>>> csv. (csvFilePath is like "/data/csv_directory".) The larger the input
>>> sequence file is, the more mappers are created, and thus an
>>> equivalent number of csv output files are created.
>>>
>>> In classic mapreduce one could output a single file by setting the
>>> number of reducers to 1 while configuring the job. How could I achieve
>>> this with crunch?
>>>
>>> I would really appreciate any help here.
>>>
>>> Thanks,
>>> Som
>>
>>
>> --
>> Director of Data Science
>> Cloudera <http://www.cloudera.com>
>> Twitter: @josh_wills <http://twitter.com/josh_wills>
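For context on the 11GB-to-65GB question: sequence files typically hold compact binary serializations and are often record- or block-compressed, while writeTextFile emits uncompressed text, so a several-fold size increase is plausible and not necessarily a sign that Shard is misused. Below is a small standalone sketch of that effect, using java.util.zip's DEFLATE (one of the codecs Hadoop supports) on some made-up repetitive CSV-like records; the record format here is purely hypothetical, not Som's actual data.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;

public class CompressionRatioDemo {
    // Returns {plainTextBytes, deflatedBytes} for a block of repetitive,
    // CSV-like records (hypothetical sample data).
    static long[] sizes() throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10_000; i++) {
            sb.append("2013-10-30,click,user_").append(i % 100).append(",200\n");
        }
        byte[] text = sb.toString().getBytes(StandardCharsets.UTF_8);

        // DEFLATE the same bytes, roughly what a compressed sequence
        // file would store instead of raw text.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (DeflaterOutputStream dos = new DeflaterOutputStream(
                bos, new Deflater(Deflater.DEFAULT_COMPRESSION))) {
            dos.write(text);
        }
        return new long[] { text.length, bos.size() };
    }

    public static void main(String[] args) throws IOException {
        long[] s = sizes();
        System.out.println("plain text: " + s[0] + " bytes");
        System.out.println("deflated:   " + s[1] + " bytes");
        System.out.printf("inflation if stored as text: %.1fx%n",
                (double) s[0] / s[1]);
    }
}
```

The exact ratio depends entirely on how repetitive the data is, but the point stands: comparing a compressed binary input against an uncompressed text output is not apples to apples.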
