Thanks for the help Josh!
On Wed, Oct 30, 2013 at 2:37 PM, Josh Wills <[email protected]> wrote:

> Best guess is that the input data is compressed, but the output data is
> not - Crunch does not turn it on by default.
>
> On Oct 30, 2013 4:56 PM, "Som Satpathy" <[email protected]> wrote:
>
>> Maybe we can expect the csv to grow by that much compared to the input
>> sequence file; I just wanted to confirm that I'm using shard()
>> correctly.
>>
>> Thanks,
>> Som
>>
>> On Wed, Oct 30, 2013 at 1:46 PM, Som Satpathy <[email protected]> wrote:
>>
>>> Hi Josh,
>>>
>>> Thank you for the input. I incorporated Shard into the MRPipeline;
>>> this time I get a single part-r output csv file, but interestingly the
>>> file size is much bigger than the input sequence file size.
>>>
>>> The input sequence file is around 11GB and the final csv turns out to
>>> be 65GB.
>>>
>>> Let me explain what I'm trying to do. This is my MRPipeline:
>>>
>>> PCollection<T> collection1 = pipeline.read(fromSequenceFile).parallelDo(doFn1());
>>> PCollection<T> collection2 = collection1.filter(filterFn1());
>>> PCollection<T> collection3 = collection2.filter(filterFn2());
>>> PCollection<T> collection4 = collection3.parallelDo(doFn3());
>>>
>>> PCollection<T> finalShardedCollection = Shard.shard(collection4, 1);
>>>
>>> pipeline.writeTextFile(finalShardedCollection, csvFilePath);
>>>
>>> pipeline.done();
>>>
>>> Am I using shard correctly? It is weird that the output file is so
>>> much bigger than the input file.
>>>
>>> I look forward to hearing from you.
>>>
>>> Thanks,
>>> Som
>>>
>>> On Wed, Oct 30, 2013 at 8:14 AM, Josh Wills <[email protected]> wrote:
>>>
>>>> Hey Som,
>>>>
>>>> Check out org.apache.crunch.lib.Shard, it does what you want.
>>>>
>>>> J
>>>>
>>>> On Wed, Oct 30, 2013 at 8:05 AM, Som Satpathy <[email protected]> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I have a Crunch job that should process a big sequence file and
>>>>> produce a single csv file. I am using
>>>>> "pipeline.writeTextFile(transformedRecords, csvFilePath)" to write
>>>>> the csv (csvFilePath is something like "/data/csv_directory"). The
>>>>> larger the input sequence file, the more mappers are created, and an
>>>>> equal number of csv output files are produced.
>>>>>
>>>>> In classic MapReduce one could output a single file by setting the
>>>>> number of reducers to 1 when configuring the job. How can I achieve
>>>>> this with Crunch?
>>>>>
>>>>> I would really appreciate any help here.
>>>>>
>>>>> Thanks,
>>>>> Som
>>>>
>>>> --
>>>> Director of Data Science
>>>> Cloudera <http://www.cloudera.com>
>>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
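Following up on Josh's diagnosis (the input sequence file is compressed, while Crunch writes the text output uncompressed by default), one way to shrink the output would be to turn on MapReduce output compression in the Hadoop Configuration passed to the pipeline. A minimal sketch, assuming the classic MR1-era property names that were current in 2013 and a hypothetical driver class `CsvExportJob`; this is a configuration fragment, not a complete runnable job:

```java
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.hadoop.conf.Configuration;

// Enable compressed text output before constructing the pipeline, so the
// single sharded part-r file is written through a compression codec.
Configuration conf = new Configuration();
conf.setBoolean("mapred.output.compress", true);
conf.set("mapred.output.compression.codec",
         "org.apache.hadoop.io.compress.GzipCodec");

// CsvExportJob is a placeholder for your own driver class.
Pipeline pipeline = new MRPipeline(CsvExportJob.class, conf);
```

Note the trade-off: with gzip enabled the output is a `.gz` file rather than a plain-text csv, so downstream consumers must decompress it. If a plain csv is required, the 11GB-to-65GB growth may simply be the expected cost of converting a compressed binary sequence file into uncompressed text.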
