My best guess is that the input data is compressed but the output data is not; Crunch does not enable output compression by default.

On Oct 30, 2013 4:56 PM, "Som Satpathy" <[email protected]> wrote:
> Maybe we can expect the CSV to grow by that much compared to the input
> sequence file; I just wanted to confirm that I'm using shard() correctly.
>
> Thanks,
> Som
>
> On Wed, Oct 30, 2013 at 1:46 PM, Som Satpathy <[email protected]> wrote:
>
>> Hi Josh,
>>
>> Thank you for the input. I incorporated Shard into the MRPipeline; this
>> time I get a single output CSV part-r file, but interestingly the file
>> size is much bigger than the input sequence file size.
>>
>> The input sequence file is around 11GB and the final CSV turns out to be
>> 65GB.
>>
>> Let me explain what I'm trying to do. This is my MRPipeline:
>>
>> PCollection<T> collection1 =
>>     pipeline.read(fromSequenceFile).parallelDo(doFn1());
>> PCollection<T> collection2 = collection1.filter(filterFn1());
>> PCollection<T> collection3 = collection2.filter(filterFn2());
>> PCollection<T> collection4 = collection3.parallelDo(doFn3());
>>
>> PCollection<T> finalShardedCollection = Shard.shard(collection4, 1);
>>
>> pipeline.writeTextFile(finalShardedCollection, csvFilePath);
>>
>> pipeline.done();
>>
>> Am I using shard correctly? It is weird that the output file is so much
>> bigger than the input file.
>>
>> Looking forward to hearing from you.
>>
>> Thanks,
>> Som
>>
>> On Wed, Oct 30, 2013 at 8:14 AM, Josh Wills <[email protected]> wrote:
>>
>>> Hey Som,
>>>
>>> Check out org.apache.crunch.lib.Shard, it does what you want.
>>>
>>> J
>>>
>>> On Wed, Oct 30, 2013 at 8:05 AM, Som Satpathy <[email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I have a Crunch job that should process a big sequence file and
>>>> produce a single CSV file. I am using
>>>> "pipeline.writeTextFile(transformedRecords, csvFilePath)" to write the
>>>> CSV (csvFilePath is like "/data/csv_directory"). The larger the input
>>>> sequence file, the more mappers are created, and an equivalent number
>>>> of CSV output files are produced.
>>>>
>>>> In classic MapReduce one could produce a single output file by setting
>>>> the number of reducers to 1 when configuring the job. How could I
>>>> achieve this with Crunch?
>>>>
>>>> I would really appreciate any help here.
>>>>
>>>> Thanks,
>>>> Som
>>>
>>> --
>>> Director of Data Science
>>> Cloudera <http://www.cloudera.com>
>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
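For anyone hitting the same size blowup: following Josh's diagnosis above, one way to get the output compressed like the input is to set Hadoop's output-compression properties on the Configuration the pipeline is built from. Below is a minimal sketch, not from the thread: the CompressedCsvJob class is hypothetical, and the property names assume the standard Hadoop 2.x keys (on Hadoop 1.x the equivalents were "mapred.output.compress" and "mapred.output.compression.codec").

import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.hadoop.conf.Configuration;

// Hypothetical driver class; names are illustrative, not from the thread.
public class CompressedCsvJob {
  public static void main(String[] args) {
    // Assumed: standard Hadoop 2.x output-compression properties.
    Configuration conf = new Configuration();
    conf.setBoolean("mapreduce.output.fileoutputformat.compress", true);
    conf.set("mapreduce.output.fileoutputformat.compress.codec",
             "org.apache.hadoop.io.compress.GzipCodec");

    // MapReduce jobs launched by this pipeline inherit the Configuration,
    // so text output written via writeTextFile() comes out compressed.
    Pipeline pipeline = new MRPipeline(CompressedCsvJob.class, conf);

    // ... read the sequence file, apply the DoFns/filters, and call
    // Shard.shard(collection, 1) as in the thread above ...

    pipeline.done();
  }
}

With Shard.shard(collection, 1) forcing a single reducer, the result would be one gzip-compressed part file, which should land much closer to the 11GB input than the 65GB uncompressed CSV.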
