It sounds like this could be down to block-level vs record-level compression -- could you check that mapred.output.compression.type was set to the same thing (should probably be BLOCK) in both cases?
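For reference, a minimal sketch of pinning both jobs to the same block-level compression settings through the pipeline's Hadoop configuration. This assumes an MRPipeline named `pipeline` (Crunch's Pipeline exposes getConfiguration()); the property names are the pre-YARN Hadoop 1.x keys in use at the time:

```java
// Hypothetical setup -- assumes an existing MRPipeline named `pipeline`.
Configuration conf = pipeline.getConfiguration();
conf.setBoolean("mapred.output.compress", true);
conf.set("mapred.output.compression.codec",
         "org.apache.hadoop.io.compress.SnappyCodec");
// BLOCK compresses many records together; RECORD compresses each record
// separately and typically yields much larger files.
conf.set("mapred.output.compression.type", "BLOCK");
```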
On Thu, Oct 31, 2013 at 7:57 PM, Josh Wills <[email protected]> wrote:
> That's surprising -- I know that the block size can matter for sequence/avro
> files w/Snappy, but I don't know of any similar issues or settings that need
> to be in place for text.
>
> On Thu, Oct 31, 2013 at 11:38 AM, Durfey,Stephen <[email protected]> wrote:
>> Coincidentally enough, yesterday I was also looking into a way to merge
>> csv output files into one larger csv output file, to prevent cluttering up
>> the namenode with many smaller csv files.
>>
>> Background:
>> In our crunch pipeline we capture context information about errors we
>> encountered, and then write it out to csv files. The csv files themselves
>> are just a side effect of our processing and not the main output; they are
>> written out from our map tasks, before the data we did process is bulk
>> loaded into hbase. The output of these csv files is compressed as snappy.
>>
>> Problem:
>> I ran the pipeline against one of our data sources and it produced 14
>> different snappy-compressed csv files, totaling 4.6GB. After the job
>> finished I created a new TextFileSource pointing to the directory in hdfs
>> that contained the 14 files and, using Shard, set the number of partitions
>> to 1 to write everything out to one file. The new file size after the
>> combination is 11.6GB, compressed as snappy. It's not clear to me why the
>> file size would almost triple. Any ideas?
>>
>> Thanks,
>> Stephen
>>
>> From: Som Satpathy <[email protected]>
>> Reply-To: "[email protected]" <[email protected]>
>> Date: Wednesday, October 30, 2013 5:36 PM
>> To: "[email protected]" <[email protected]>
>> Subject: Re: Making crunch job output single file
>>
>> Thanks for the help Josh!
>>
>> On Wed, Oct 30, 2013 at 2:37 PM, Josh Wills <[email protected]> wrote:
>>> Best guess is that the input data is compressed, but the output data is
>>> not -- Crunch does not turn it on by default.
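The near-tripling described above is consistent with record-level compression: compressing each small record independently pays per-unit overhead and cannot exploit redundancy across records. A minimal stand-alone sketch of the effect, using the JDK's DEFLATE as a stand-in for Snappy (Snappy itself is not in the JDK; the class and method names here are illustrative):

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

// Demonstrates why record-level compression (one compressed unit per record)
// produces much larger output than block-level compression (many records
// compressed together). DEFLATE stands in for Snappy here.
public class CompressionGranularityDemo {

    // Compress one byte[] and return only the compressed length.
    static int deflatedLength(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] buf = new byte[4096];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buf); // count output bytes, discard them
        }
        deflater.end();
        return total;
    }

    // Record-level: every record is compressed independently.
    static int recordLevelSize(byte[][] records) {
        int total = 0;
        for (byte[] r : records) {
            total += deflatedLength(r);
        }
        return total;
    }

    // Block-level: all records are concatenated and compressed as one block.
    static int blockLevelSize(byte[][] records) {
        ByteArrayOutputStream all = new ByteArrayOutputStream();
        for (byte[] r : records) {
            all.write(r, 0, r.length);
        }
        return deflatedLength(all.toByteArray());
    }

    // Small, highly repetitive csv-like rows, as in the thread's use case.
    static byte[][] sampleRecords() {
        byte[][] records = new byte[200][];
        for (int i = 0; i < records.length; i++) {
            records[i] = ("row," + (i % 7) + ",some,repeated,csv,payload\n")
                    .getBytes(java.nio.charset.StandardCharsets.UTF_8);
        }
        return records;
    }

    public static void main(String[] args) {
        byte[][] records = sampleRecords();
        // Record-level total comes out several times larger than block-level.
        System.out.println("record-level: " + recordLevelSize(records));
        System.out.println("block-level:  " + blockLevelSize(records));
    }
}
```

With repetitive rows like these, the block-level total is a small fraction of the record-level total, which is the same direction of effect as the 4.6GB-to-11.6GB jump described above.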
>>> On Oct 30, 2013 4:56 PM, "Som Satpathy" <[email protected]> wrote:
>>>> Maybe we can expect the csv to size up by that much compared to the
>>>> input sequence file; I just wanted to confirm that I'm using shard()
>>>> correctly.
>>>>
>>>> Thanks,
>>>> Som
>>>>
>>>> On Wed, Oct 30, 2013 at 1:46 PM, Som Satpathy <[email protected]> wrote:
>>>>> Hi Josh,
>>>>>
>>>>> Thank you for the input. I incorporated Shard in the mrpipeline; this
>>>>> time I get one output csv part-r file, but interestingly the file size
>>>>> is much bigger than the input sequence file size.
>>>>>
>>>>> The input sequence file is around 11GB and the final csv turns out to
>>>>> be 65GB in size.
>>>>>
>>>>> Let me explain what I'm trying to do. This is my mrpipeline:
>>>>>
>>>>> PCollection<T> collection1 = pipeline.read(fromSequenceFile).parallelDo(doFn1());
>>>>> PCollection<T> collection2 = collection1.filter(filterFn1());
>>>>> PCollection<T> collection3 = collection2.filter(filterFn2());
>>>>> PCollection<T> collection4 = collection3.parallelDo(doFn3());
>>>>>
>>>>> PCollection<T> finalShardedCollection = Shard.shard(collection4, 1);
>>>>>
>>>>> pipeline.writeTextFile(finalShardedCollection, csvFilePath);
>>>>>
>>>>> pipeline.done();
>>>>>
>>>>> Am I using the shard correctly? It is weird that the output file is
>>>>> much bigger than the input file.
>>>>>
>>>>> I look forward to hearing from you.
>>>>>
>>>>> Thanks,
>>>>> Som
>>>>>
>>>>> On Wed, Oct 30, 2013 at 8:14 AM, Josh Wills <[email protected]> wrote:
>>>>>> Hey Som,
>>>>>>
>>>>>> Check out org.apache.crunch.lib.Shard, it does what you want.
>>>>>>
>>>>>> J
>>>>>>
>>>>>> On Wed, Oct 30, 2013 at 8:05 AM, Som Satpathy <[email protected]> wrote:
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I have a crunch job that should process a big sequence file and
>>>>>>> produce a single csv file.
>>>>>>> I am using
>>>>>>> "pipeline.writeTextFile(transformedRecords, csvFilePath)" to write to
>>>>>>> a csv (csvFilePath is like "/data/csv_directory"). The larger the
>>>>>>> input sequence file is, the more mappers are created, and thus an
>>>>>>> equivalent number of csv output files.
>>>>>>>
>>>>>>> In classic mapreduce one could output a single file by setting the
>>>>>>> #reducers to 1 while configuring the job. How could I achieve this
>>>>>>> with crunch?
>>>>>>>
>>>>>>> I would really appreciate any help here.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Som
>>>>>>
>>>>>> --
>>>>>> Director of Data Science
>>>>>> Cloudera
>>>>>> Twitter: @josh_wills
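The Shard.shard(collection, 1) approach recommended in the thread works because every record is routed to the same single partition, so one reducer writes one part-r file. A minimal stand-alone sketch of that routing step (plain Java, no Crunch dependency; the class and method names are illustrative, and the hash rule mirrors Hadoop's default HashPartitioner):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the partitioning behind Shard.shard(coll, n):
// each record is assigned to a reducer by hash modulo the partition count.
// With numPartitions = 1, every record lands in partition 0, so a single
// reducer -- and hence a single output file -- receives all the data.
public class ShardSketch {

    // Mirrors Hadoop's default HashPartitioner.getPartition().
    static int partitionFor(Object record, int numPartitions) {
        return (record.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Distribute records into numPartitions buckets.
    static List<List<String>> shard(List<String> records, int numPartitions) {
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (String r : records) {
            partitions.get(partitionFor(r, numPartitions)).add(r);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<String> rows = List.of("a,1", "b,2", "c,3", "d,4");
        // One partition: everything ends up in a single "part-r-00000".
        System.out.println(shard(rows, 1).get(0).size()); // prints 4
    }
}
```

The trade-off, as the thread shows, is that one reducer writes the entire dataset, so output compression settings matter all the more for the merged file.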
