That is what I was suggesting, yes.
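In other words, max-file-blocks ≈ (total bytes from 'hadoop count') / (block size in bytes) / (number of output files wanted). One thing worth double-checking in the math below: 131072 bytes is 128 KB, not 128 MB; a 128 MB block is 134217728 bytes, which would make the block count for ~1.5 GB of data come out to roughly 12 rather than 11740.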

On Wed, Jul 31, 2013 at 4:39 PM, Something Something <[email protected]> wrote:
> So you are saying, we will first do a 'hadoop count' to get the total # of
> bytes for all files. Let's say that comes to: 1538684305
>
> Default Block Size is: 128M
>
> So, total # of blocks needed: 1538684305 / 131072 = 11740
>
> Max file blocks = 11740 / 50 (# of output files) = 234
>
> Does this calculation look right?
>
> On Wed, Jul 31, 2013 at 10:28 AM, John Meagher <[email protected]> wrote:
> > It is file size based, not file count based. For fewer files up the
> > max-file-blocks setting.
> >
> > On Wed, Jul 31, 2013 at 12:21 PM, Something Something <[email protected]> wrote:
> > > Thanks, John. But I don't see an option to specify the # of output files.
> > > How does Crush decide how many files to create? Is it only based on
> > > file sizes?
> > >
> > > On Wed, Jul 31, 2013 at 6:28 AM, John Meagher <[email protected]> wrote:
> > > > Here's a great tool for handling exactly that case:
> > > > https://github.com/edwardcapriolo/filecrush
> > > >
> > > > On Wed, Jul 31, 2013 at 2:40 AM, Something Something <[email protected]> wrote:
> > > > > Each bz2 file after merging is about 50Megs. The reducers take
> > > > > about 9 minutes.
> > > > >
> > > > > Note: 'getmerge' is not an option. There isn't enough disk space to
> > > > > do a getmerge on the local production box. Plus we need a scalable
> > > > > solution as these files will get a lot bigger soon.
> > > > >
> > > > > On Tue, Jul 30, 2013 at 10:34 PM, Ben Juhn <[email protected]> wrote:
> > > > > > How big are your 50 files? How long are the reducers taking?
> > > > > >
> > > > > > On Jul 30, 2013, at 10:26 PM, Something Something <[email protected]> wrote:
> > > > > > > Hello,
> > > > > > >
> > > > > > > One of our pig scripts creates over 500 small part files. To save
> > > > > > > on namespace, we need to cut down the # of files, so instead of
> > > > > > > saving 500 small files we need to merge them into 50. We tried the
> > > > > > > following:
> > > > > > >
> > > > > > > 1) When we set parallel number to 50, the Pig script takes a long
> > > > > > > time - for obvious reasons.
> > > > > > > 2) If we use Hadoop Streaming, it puts some garbage values into
> > > > > > > the key field.
> > > > > > > 3) We wrote our own Map Reducer program that reads these 500 small
> > > > > > > part files & uses 50 reducers. Basically, the Mappers simply write
> > > > > > > the line & reducers loop thru values & write them out. We set
> > > > > > > job.setOutputKeyClass(NullWritable.class) so that the key is not
> > > > > > > written to the output file. This is performing better than Pig.
> > > > > > > Actually Mappers run very fast, but Reducers take some time to
> > > > > > > complete, but this approach seems to be working well.
> > > > > > >
> > > > > > > Is there a better way to do this? What strategy can you think of
> > > > > > > to increase speed of reducers.
> > > > > > >
> > > > > > > Any help in this regard will be greatly appreciated. Thanks.

--
https://github.com/bearrito
@deepbearrito
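
(For readers finding this thread later: below is a minimal, untested sketch of the identity map/reduce concatenation job described in the original message, using the Hadoop 2 org.apache.hadoop.mapreduce API. The class name, the input/output arguments, and the use of the byte-offset key to spread lines across the reducers are placeholders and assumptions, not code from the thread.)

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ConcatSmallFiles {

    // Pass every input line through unchanged; the byte-offset key from
    // TextInputFormat is only used to spread lines across the reducers.
    public static class LineMapper
            extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(key, value);
        }
    }

    // Loop through the values and write them back out with a NullWritable
    // key so only the line itself ends up in the output files.
    public static class LineReducer
            extends Reducer<LongWritable, Text, NullWritable, Text> {
        @Override
        protected void reduce(LongWritable key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text value : values) {
                context.write(NullWritable.get(), value);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "concat small files");
        job.setJarByClass(ConcatSmallFiles.class);
        job.setMapperClass(LineMapper.class);
        job.setReducerClass(LineReducer.class);
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(NullWritable.class);   // key is not written to the output
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(50);                   // ~500 small inputs -> 50 output files
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

setNumReduceTasks(50) is what fixes the number of output part files at 50, and the NullWritable output key is what keeps the key out of the output, as described in the original message.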
