One thing we use is filecrush to merge small files below a size threshold. It works pretty well. http://www.jointhegrid.com/hadoop_filecrush/index.jsp
On Jul 16, 2011, at 1:17 AM, jagaran das wrote:

> Hi,
>
> Due to requirements in our current production CDH3 cluster, we need to copy
> around 11520 small files (12 GB total) to the cluster for one application.
> We have 20 applications like this that would run in parallel.
>
> So one set would have 11520 files totalling 12 GB, and we would have 15
> such sets in parallel.
>
> The total SLA for the pipeline (copy, Pig aggregation, copy to local, and
> SQL load) is 15 minutes.
>
> What we do:
>
> 1. Merge files so that we get rid of small files. This is a huge time hit;
>    do we have any other option?
> 2. Copy to the cluster
> 3. Execute the Pig job
> 4. Copy to local
> 5. SQL loader
>
> Can we perform the merge and the copy to the cluster from a host other than
> the NameNode? We want an out-of-cluster machine running a Java process that
> would:
>
> 1. Run periodically
> 2. Merge files
> 3. Copy to the cluster
>
> Secondly, can we append to an existing file in the cluster?
>
> Please share your thoughts, as maintaining the SLA is becoming tough.
>
> Regards,
> Jagaran
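On the merge question: the merge does not need to run on the NameNode, or on any cluster node at all; any machine with the Hadoop client config can concatenate files locally and then `hadoop fs -put` the results. A minimal sketch of that local merge step (step 1 above), with hypothetical paths and a hypothetical chunk-size threshold, might look like this:

```python
import os
from pathlib import Path

def merge_small_files(src_dir, out_dir, target_bytes=128 * 1024 * 1024):
    """Concatenate many small files into chunks of roughly target_bytes,
    so far fewer (larger) files get copied to the cluster.
    Returns the number of merged chunks written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    chunk_idx, written, sink = 0, 0, None
    for f in sorted(Path(src_dir).iterdir()):
        if not f.is_file():
            continue
        # Start a new output chunk once the current one reaches the target.
        if sink is None or written >= target_bytes:
            if sink:
                sink.close()
            chunk_idx += 1
            written = 0
            sink = open(out / f"merged-{chunk_idx:05d}", "wb")
        with open(f, "rb") as src:
            while True:
                buf = src.read(1 << 20)  # copy in 1 MiB chunks
                if not buf:
                    break
                sink.write(buf)
                written += len(buf)
    if sink:
        sink.close()
    return chunk_idx
```

A periodic Java process doing the equivalent (or `FileUtil.copyMerge` from the Hadoop API) would work the same way; the key point is that it runs as an ordinary HDFS client, wherever the client jars and config are installed.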
