One thing we use is filecrush to merge small files below a size threshold. It works pretty well. http://www.jointhegrid.com/hadoop_filecrush/index.jsp
On Jul 16, 2011, at 1:17 AM, jagaran das wrote:

> Hi,
>
> Due to requirements in our current production CDH3 cluster, we need to copy
> around 11520 small files (12 GB total) to the cluster for one application.
> We have 20 applications like this that would run in parallel.
>
> So one set would have 11520 files totalling 12 GB, and we would have 15
> such sets in parallel.
>
> The total SLA for the pipeline (copy, Pig aggregation, copy to local, and
> SQL load) is 15 minutes.
>
> What we do:
>
> 1. Merge files so that we get rid of small files. This is a huge time hit;
>    do we have any other option?
> 2. Copy to the cluster
> 3. Execute the Pig job
> 4. Copy to local
> 5. SQL loader
>
> Can we perform the merge and the copy to the cluster from a host other than
> the NameNode? We want an out-of-cluster machine running a Java process that
> would:
>
> 1. Run periodically
> 2. Merge files
> 3. Copy to the cluster
>
> Secondly, can we append to an existing file in the cluster?
>
> Please share your thoughts, as maintaining the SLA is becoming tough.
>
> Regards,
> Jagaran
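On the merge question: the merge does not need to run on the NameNode, or on any cluster node at all; any machine with the Hadoop client config can concatenate files locally and then `hadoop fs -put` the results. A minimal sketch of that local merge step (step 1 above), with hypothetical paths and a hypothetical chunk-size threshold, might look like this:

```python
import os
from pathlib import Path

def merge_small_files(src_dir, out_dir, target_bytes=128 * 1024 * 1024):
    """Concatenate many small files into chunks of roughly target_bytes,
    so far fewer (larger) files get copied to the cluster.
    Returns the number of merged chunks written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    chunk_idx, written, sink = 0, 0, None
    for f in sorted(Path(src_dir).iterdir()):
        if not f.is_file():
            continue
        # Start a new output chunk once the current one reaches the target.
        if sink is None or written >= target_bytes:
            if sink:
                sink.close()
            chunk_idx += 1
            written = 0
            sink = open(out / f"merged-{chunk_idx:05d}", "wb")
        with open(f, "rb") as src:
            while True:
                buf = src.read(1 << 20)  # copy in 1 MiB chunks
                if not buf:
                    break
                sink.write(buf)
                written += len(buf)
    if sink:
        sink.close()
    return chunk_idx
```

A periodic Java process doing the equivalent (or `FileUtil.copyMerge` from the Hadoop API) would work the same way; the key point is that it runs as an ordinary HDFS client, wherever the client jars and config are installed.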
