So you are saying we will first run 'hadoop fs -count' (or 'hadoop fs -du -s')
to get the total # of bytes across all files. Let's say that comes to: 1538684305
Default block size is 128 MB, i.e. 134217728 bytes (not 131072, which is 128 KB).
So, total # of blocks needed: 1538684305 / 134217728 ≈ 11.5, i.e. 12 blocks.
Max file blocks = 12 / 50 (# of output files) ≈ 1 (rounding up).
Does this calculation look right?
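A quick sanity check of the arithmetic above (the byte total and the 50-file target are the thread's hypothetical numbers; note that 128 MB is 134217728 bytes, whereas 131072 is 128 KB):

```python
import math

# Hypothetical numbers from the thread, not measured values.
TOTAL_BYTES = 1538684305        # e.g. from 'hadoop fs -count' or 'hadoop fs -du -s'
BLOCK_SIZE = 128 * 1024 * 1024  # default HDFS block size: 128 MB = 134217728 bytes
N_OUTPUT_FILES = 50             # desired number of merged output files

# Total blocks the data occupies, rounding up to whole blocks.
total_blocks = math.ceil(TOTAL_BYTES / BLOCK_SIZE)

# Blocks per output file, at least 1.
max_file_blocks = max(1, math.ceil(total_blocks / N_OUTPUT_FILES))

print(total_blocks, max_file_blocks)
```

With ~1.5 GB of input the data only spans about 12 blocks, so at 50 output files each file is under one block, and max file blocks comes out to 1.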
Thanks, John. But I don't see an option to specify the # of output files.
How does Crush decide how many files to create? Is it only based on file
sizes?
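My understanding (an assumption based on how Crush's max-file-blocks knob is described, not a confirmed spec) is that Crush does not take an output-file count directly; it packs input files into bins of at most max-file-blocks × block-size bytes, so the output count falls out of the total input size:

```python
import math

# Hedged sketch of how the output file count might be derived.
# Numbers are the thread's hypotheticals; MAX_FILE_BLOCKS = 8 is an
# assumed default for Crush, not verified against its source.
TOTAL_BYTES = 1538684305
BLOCK_SIZE = 128 * 1024 * 1024
MAX_FILE_BLOCKS = 8

# Maximum bytes Crush would pack into one output file.
bin_bytes = MAX_FILE_BLOCKS * BLOCK_SIZE

# Estimated number of merged output files.
est_output_files = math.ceil(TOTAL_BYTES / bin_bytes)

print(est_output_files)
```

Under those assumptions, ~1.5 GB of input with 1 GB bins would yield about 2 output files; to get more files you would lower max-file-blocks rather than specify a count.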
On Wed, Jul 31, 2013 at 6:28 AM, John Meagher wrote:
> Here's a great tool for handling exactly that case:
> https://github.com/edwardcaprio
Each bz2 file after merging is about 50 MB. The reducers take about 9
minutes.
Note: 'getmerge' is not an option. There isn't enough disk space to do a
getmerge on the local production box. Plus we need a scalable solution as
these files will get a lot bigger soon.
On Tue, Jul 30, 2013 at 10
Hello,
One of our Pig scripts creates over 500 small part files. To save on
namenode namespace, we need to cut down the # of files: instead of saving
500 small files we need to merge them into 50. We tried the following:
1) When we set parallel number to 50, the Pig script takes a long time -
for ob