Don't have time to read the thread, but in case it has not been mentioned...

Unleash filecrush! https://github.com/edwardcapriolo/filecrush
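If you'd rather roll your own, here is a rough, untested sketch of the SequenceFile approach Sam suggests below: pack each small file into a single SequenceFile record keyed by filename. The input directory ("invoices") and output path ("invoices.seq") are just placeholders, not anything from the thread.

import java.io.File;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Output path is a placeholder; point it at HDFS in a real run.
        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path("invoices.seq")),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class),
                // Block compression amortizes overhead across many tiny records.
                SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK));
        try {
            // "invoices" is a placeholder local directory of small files.
            for (File f : new File("invoices").listFiles()) {
                byte[] bytes = Files.readAllBytes(f.toPath());
                // One record per invoice: key = filename, value = raw bytes.
                writer.append(new Text(f.getName()), new BytesWritable(bytes));
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}

Downstream MapReduce jobs can then read the result with SequenceFileInputFormat and get sensible splits instead of one mapper per 4 KB invoice.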
On Sun, Jul 20, 2014 at 4:47 AM, Kilaru, Sambaiah <[email protected]> wrote:

> This is not the place to discuss the merits or demerits of MapR, but small
> files screw up very badly with MapR. Small files go into one container (to
> fill up 256 MB or whatever the container size is), and with locality most
> of the mappers go to three datanodes.
>
> You should be looking into the sequence file format.
>
> Thanks,
> Sam
>
> From: "M. C. Srivas" <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Sunday, July 20, 2014 at 8:01 AM
> To: "[email protected]" <[email protected]>
> Subject: Re: Merging small files
>
> You should look at MapR ... a few hundreds of billions of small files is
> absolutely no problem. (disc: I work for MapR)
>
> On Sat, Jul 19, 2014 at 10:29 AM, Shashidhar Rao <[email protected]> wrote:
>
>> Hi,
>>
>> Has anybody worked on a retail use case? Our production Hadoop cluster's
>> block size is 256 MB, but the retail invoice data we have to process is
>> merely, say, 4 KB per invoice. Do we merge the invoice data to make one
>> large file, say 1 GB? What is the best practice in this scenario?
>>
>> Regards,
>> Shashi
