And the other settings look fine -- mapred.output.compress and mapred.output.compression.codec?
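(For anyone following the thread: these properties can be checked or set on the pipeline's Configuration before the job runs. A minimal sketch under the old `mapred.*` property names discussed in this thread -- the driver class is hypothetical, and only the property names come from the messages below:)

```java
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.hadoop.conf.Configuration;

public class CompressionConfig {
  public static void main(String[] args) {
    // Hypothetical driver class; Crunch uses it to locate the job jar.
    Pipeline pipeline = new MRPipeline(CompressionConfig.class);
    Configuration conf = pipeline.getConfiguration();

    // Turn on output compression, pick the Snappy codec, and use BLOCK-level
    // compression (record-level vs block-level can change output sizes a lot).
    conf.setBoolean("mapred.output.compress", true);
    conf.set("mapred.output.compression.codec",
        "org.apache.hadoop.io.compress.SnappyCodec");
    conf.set("mapred.output.compression.type", "BLOCK");
  }
}
```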
On Fri, Nov 1, 2013 at 7:30 AM, Durfey, Stephen <[email protected]> wrote:

Checking the job.xml on the job tracker, mapred.output.compression.type for
both the original output and the combined output (a separate job) is set to
BLOCK-level compression.

Stephen Durfey
Software Engineer | The Record
816-201-2689 | [email protected]

On 11/1/13 12:31 AM, "Gabriel Reid" <[email protected]> wrote:

It sounds like this could be down to block-level vs record-level
compression -- could you check that mapred.output.compression.type was set
to the same thing (should probably be BLOCK) in both cases?

On Thu, Oct 31, 2013 at 7:57 PM, Josh Wills <[email protected]> wrote:

That's surprising -- I know that the block size can matter for
sequence/avro files w/Snappy, but I don't know of any similar issues or
settings that need to be in place for text.

On Thu, Oct 31, 2013 at 11:38 AM, Durfey, Stephen <[email protected]> wrote:

Coincidentally enough, yesterday I was also looking into a way to merge csv
output files into one larger csv output file, to prevent cluttering up the
namenode with many smaller csv files.

Background:
In our crunch pipeline we capture context information about errors we
encounter and write it out to csv files. The csv files themselves are just
a side effect of our processing and not the main output; they are written
out from our map tasks before the data we did process is bulk loaded into
hbase. The output of these csv files is compressed as snappy.

Problem:
I ran the pipeline against one of our data sources and it produced 14
different snappy-compressed csv files, totaling 4.6GB.
After the job finished, I created a new TextFileSource pointing to the
directory in hdfs that contained the 14 files and, using Shard, set the
number of partitions to 1 to write everything out to one file. The new file
size after the combination is 11.6GB, compressed as snappy. It's not clear
to me why the file size would almost triple. Any ideas?

Thanks,
Stephen

From: Som Satpathy <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, October 30, 2013 5:36 PM
To: "[email protected]" <[email protected]>
Subject: Re: Making crunch job output single file

Thanks for the help Josh!

On Wed, Oct 30, 2013 at 2:37 PM, Josh Wills <[email protected]> wrote:

Best guess is that the input data is compressed, but the output data is
not -- Crunch does not turn it on by default.

On Oct 30, 2013 4:56 PM, "Som Satpathy" <[email protected]> wrote:

Maybe we can expect the csv to size up by that much compared to the input
sequence file; I just wanted to confirm that I'm using shard() correctly.

Thanks,
Som

On Wed, Oct 30, 2013 at 1:46 PM, Som Satpathy <[email protected]> wrote:

Hi Josh,

Thank you for the input. I incorporated Shard in the MRPipeline, and this
time I get one output csv part-r file, but interestingly the file size is
much bigger than the input sequence file size.

The input sequence file size is around 11GB and the final csv turns out to
be 65GB in size.

Let me explain what I'm trying to do.
This is my MRPipeline:

    PCollection<T> collection1 =
        pipeline.read(fromSequenceFile).parallelDo(doFn1());
    PCollection<T> collection2 = collection1.filter(filterFn1());
    PCollection<T> collection3 = collection2.filter(filterFn2());
    PCollection<T> collection4 = collection3.parallelDo(doFn3());

    PCollection<T> finalShardedCollection = Shard.shard(collection4, 1);

    pipeline.writeTextFile(finalShardedCollection, csvFilePath);

    pipeline.done();

Am I using the shard correctly? It is weird that the output file size is
much bigger than the input file.

Looking forward to hearing from you.

Thanks,
Som

On Wed, Oct 30, 2013 at 8:14 AM, Josh Wills <[email protected]> wrote:

Hey Som,

Check out org.apache.crunch.lib.Shard, it does what you want.

J

On Wed, Oct 30, 2013 at 8:05 AM, Som Satpathy <[email protected]> wrote:

Hi all,

I have a crunch job that should process a big sequence file and produce a
single csv file. I am using "pipeline.writeTextFile(transformedRecords,
csvFilePath)" to write to a csv (csvFilePath is something like
"/data/csv_directory"). The larger the input sequence file is, the more
mappers are created, and thus an equivalent number of csv output files.

In classic mapreduce one could output a single file by setting the number
of reducers to 1 while configuring the job. How could I achieve this with
crunch?

I would really appreciate any help here.
Thanks,
Som

--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
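(For reference, the merge-and-shard approach discussed above -- reading a directory of text part files back in and sharding to a single partition -- could look roughly like this. A sketch only: the class name and paths are hypothetical, and the same output-compression caveats from the thread apply to the merge job:)

```java
import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.io.From;
import org.apache.crunch.lib.Shard;

public class MergeCsvFiles {
  public static void main(String[] args) {
    Pipeline pipeline = new MRPipeline(MergeCsvFiles.class);

    // Read every csv part file in the directory as lines of text.
    PCollection<String> lines =
        pipeline.read(From.textFile("/data/csv_directory"));

    // Shard to a single partition so a single output file is written.
    PCollection<String> merged = Shard.shard(lines, 1);

    pipeline.writeTextFile(merged, "/data/csv_merged");
    pipeline.done();
  }
}
```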
