Re: Merging files

2013-07-31 Thread Something Something
So you are saying, we will first do a 'hadoop count' to get the total # of bytes for all files. Let's say that comes to: 1538684305
Default Block Size is: 128M
So, total # of blocks needed: 1538684305 / 131072 = 11740
Max file blocks = 11740 / 50 (# of output files) = 234
Does this calculat
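To make the arithmetic in this message easy to check, here is a minimal Python sketch that redoes the division. It assumes that "128M" means 128 MiB (134217728 bytes); the divisor used in the message, 131072, is 128 KiB when expressed in bytes, so the two readings give very different block counts. The helper name `blocks_needed` is hypothetical and is not part of Hadoop or Crush.

```python
MIB = 1024 * 1024


def blocks_needed(total_bytes: int, block_size_bytes: int) -> int:
    """Number of blocks needed to hold total_bytes, rounded up."""
    return -(-total_bytes // block_size_bytes)  # integer ceiling division


total_bytes = 1538684305   # byte total quoted in the message
output_files = 50          # desired number of merged files

# Divisor exactly as written in the message (131072 bytes = 128 KiB).
as_posted = blocks_needed(total_bytes, 131072)

# Assumption: "128M" is 128 MiB = 134217728 bytes.
with_128_mib = blocks_needed(total_bytes, 128 * MIB)

# Average size of each output file if the data is split evenly across 50 files.
bytes_per_output = -(-total_bytes // output_files)

print(as_posted)          # blocks with the 128 KiB divisor
print(with_128_mib)       # blocks with a 128 MiB block size
print(bytes_per_output)   # bytes per output file for 50 files
```

Under the 128 MiB assumption the whole data set spans far fewer blocks than the figure in the message, and an even split into 50 files puts each file well below a single block.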

Re: Merging files

2013-07-31 Thread Something Something
Thanks, John. But I don't see an option to specify the # of output files. How does Crush decide how many files to create? Is it only based on file sizes?
On Wed, Jul 31, 2013 at 6:28 AM, John Meagher wrote:
> Here's a great tool for handling exactly that case:
> https://github.com/edwardcaprio

Re: Merging files

2013-07-30 Thread Something Something
Each bz2 file after merging is about 50 MB. The reducers take about 9 minutes. Note: 'getmerge' is not an option. There isn't enough disk space to do a getmerge on the local production box. Plus, we need a scalable solution, as these files will get a lot bigger soon.
On Tue, Jul 30, 2013 at 10

Merging files

2013-07-30 Thread Something Something
Hello,
One of our Pig scripts creates over 500 small part files. To save on namespace, we need to cut down the # of files, so instead of saving 500 small files we need to merge them into 50. We tried the following:
1) When we set parallel number to 50, the Pig script takes a long time - for ob