That is what I was suggesting, yes.
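In other words, max-file-blocks ≈ (total bytes from 'hadoop count') / (block size in bytes) / (number of output files wanted). One thing worth double-checking in the math below: 131072 bytes is 128 KB, not 128 MB; a 128 MB block is 134217728 bytes, which would make the block count for ~1.5 GB of data come out to roughly 12 rather than 11740.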

On Wed, Jul 31, 2013 at 4:39 PM, Something Something <[email protected]> wrote:
> So you are saying, we will first do a 'hadoop count' to get the total # of
> bytes for all files. Let's say that comes to: 1538684305
>
> Default Block Size is: 128M
>
> So, total # of blocks needed: 1538684305 / 131072 = 11740
>
> Max file blocks = 11740 / 50 (# of output files) = 234
>
> Does this calculation look right?
>
> On Wed, Jul 31, 2013 at 10:28 AM, John Meagher <[email protected]> wrote:
> > It is file size based, not file count based. For fewer files up the
> > max-file-blocks setting.
> >
> > On Wed, Jul 31, 2013 at 12:21 PM, Something Something <[email protected]> wrote:
> > > Thanks, John. But I don't see an option to specify the # of output files.
> > > How does Crush decide how many files to create? Is it only based on
> > > file sizes?
> > >
> > > On Wed, Jul 31, 2013 at 6:28 AM, John Meagher <[email protected]> wrote:
> > > > Here's a great tool for handling exactly that case:
> > > > https://github.com/edwardcapriolo/filecrush
> > > >
> > > > On Wed, Jul 31, 2013 at 2:40 AM, Something Something <[email protected]> wrote:
> > > > > Each bz2 file after merging is about 50Megs. The reducers take
> > > > > about 9 minutes.
> > > > >
> > > > > Note: 'getmerge' is not an option. There isn't enough disk space to
> > > > > do a getmerge on the local production box. Plus we need a scalable
> > > > > solution as these files will get a lot bigger soon.
> > > > >
> > > > > On Tue, Jul 30, 2013 at 10:34 PM, Ben Juhn <[email protected]> wrote:
> > > > > > How big are your 50 files? How long are the reducers taking?
> > > > > >
> > > > > > On Jul 30, 2013, at 10:26 PM, Something Something <[email protected]> wrote:
> > > > > > > Hello,
> > > > > > >
> > > > > > > One of our pig scripts creates over 500 small part files. To save
> > > > > > > on namespace, we need to cut down the # of files, so instead of
> > > > > > > saving 500 small files we need to merge them into 50. We tried the
> > > > > > > following:
> > > > > > >
> > > > > > > 1) When we set parallel number to 50, the Pig script takes a long
> > > > > > > time - for obvious reasons.
> > > > > > > 2) If we use Hadoop Streaming, it puts some garbage values into
> > > > > > > the key field.
> > > > > > > 3) We wrote our own Map Reducer program that reads these 500 small
> > > > > > > part files & uses 50 reducers. Basically, the Mappers simply write
> > > > > > > the line & reducers loop thru values & write them out. We set
> > > > > > > job.setOutputKeyClass(NullWritable.class) so that the key is not
> > > > > > > written to the output file. This is performing better than Pig.
> > > > > > > Actually Mappers run very fast, but Reducers take some time to
> > > > > > > complete, but this approach seems to be working well.
> > > > > > >
> > > > > > > Is there a better way to do this? What strategy can you think of
> > > > > > > to increase speed of reducers.
> > > > > > >
> > > > > > > Any help in this regard will be greatly appreciated. Thanks.

--
https://github.com/bearrito
@deepbearrito
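
(For readers finding this thread later: below is a minimal, untested sketch of the identity map/reduce concatenation job described in the original message, using the Hadoop 2 org.apache.hadoop.mapreduce API. The class name, the input/output arguments, and the use of the byte-offset key to spread lines across the reducers are placeholders and assumptions, not code from the thread.)

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ConcatSmallFiles {

    // Pass every input line through unchanged; the byte-offset key from
    // TextInputFormat is only used to spread lines across the reducers.
    public static class LineMapper
            extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(key, value);
        }
    }

    // Loop through the values and write them back out with a NullWritable
    // key so only the line itself ends up in the output files.
    public static class LineReducer
            extends Reducer<LongWritable, Text, NullWritable, Text> {
        @Override
        protected void reduce(LongWritable key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text value : values) {
                context.write(NullWritable.get(), value);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "concat small files");
        job.setJarByClass(ConcatSmallFiles.class);
        job.setMapperClass(LineMapper.class);
        job.setReducerClass(LineReducer.class);
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(NullWritable.class);   // key is not written to the output
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(50);                   // ~500 small inputs -> 50 output files
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

setNumReduceTasks(50) is what fixes the number of output part files at 50, and the NullWritable output key is what keeps the key out of the output, as described in the original message.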
