1) Correct.
2) You can copy to the cluster from any machine; just have the cluster config on the classpath, or specify the full namespace URI in your copy command (hdfs://my-nn/path/to/destination).
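A minimal sketch of such a copy from a non-cluster machine, using the full namespace URI so no core-site.xml is needed locally. This assumes the Hadoop client jars are on the classpath; the NameNode address (hdfs://my-nn:8020) and both paths are placeholders, not values from the thread:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopy {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Passing the full hdfs:// URI means this box needs no cluster config;
        // "my-nn:8020" is a hypothetical NameNode host:port.
        FileSystem fs = FileSystem.get(URI.create("hdfs://my-nn:8020"), conf);
        fs.copyFromLocalFile(new Path("/local/merged/app1.dat"),
                             new Path("/path/to/destination/app1.dat"));
        fs.close();
    }
}
```

The same thing works from the shell with `hadoop fs -copyFromLocal <local> hdfs://my-nn/<dest>`, as long as the client install is present.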
On Sat, Jul 16, 2011 at 1:00 PM, jagaran das <[email protected]> wrote:
> ok, then:
>
> 1. Do we have to write a Pig job for merging, or does Pig itself merge, so that
> fewer mappers are invoked?
>
> 2. Can we copy to a cluster from a non-cluster machine, using the namespace
> URI of the NN? We could dedicate some well-configured boxes to do our merging
> and copying, and then copy to the NN over the network.
>
> 3. How is the performance of the FileCrusher tool?
>
> We found that copying 12 GB of data for all 15 apps in parallel took 35
> mins. We ran 15 copy-from-local jobs, each with 12 GB of data.
>
> Thanks
> JD
>
> ________________________________
> From: Dmitriy Ryaboy <[email protected]>
> To: [email protected]; jagaran das <[email protected]>
> Sent: Saturday, 16 July 2011 7:58 AM
> Subject: Re: Hadoop Production Issue
>
> Merging: it doesn't actually speed things up all that much; it reduces load
> on the NameNode and speeds up job initialization somewhat. You don't have
> to do it on the namenode itself. Neither do you have to do the copying on
> the NN. In fact, don't do anything but run the NameNode on the
> namenode.
>
> Pig can transparently combine small input files into larger splits, so you
> won't be stuck with 11K mappers.
>
> Don't copy to local and then run SQL loader. Use Sqoop export, and load
> directly from Hadoop.
>
> You cannot append to a file that already exists in the cluster. This
> will be available in one of the coming Hadoop releases. You can
> certainly create a new file in a directory, and load whole
> directories.
>
> -D
>
> On Sat, Jul 16, 2011 at 1:17 AM, jagaran das <[email protected]> wrote:
>>
>> Hi,
>>
>> Due to requirements in our current production CDH3 cluster, we need to copy
>> around 11,520 small files (12 GB total) to the cluster for one
>> application.
>> Like this, we have 20 applications that would run in parallel.
>>
>> So one set would have 11,520 files of 12 GB total size.
>> Like this, we would have 15 sets in parallel.
>>
>> Our total SLA for the pipeline, from copy through Pig aggregation to
>> copy-to-local and SQL load, is 15 mins.
>>
>> What we do:
>>
>> 1. Merge files so that we get rid of small files. Is this a huge time hit?
>>    Do we have any other option?
>> 2. Copy to cluster.
>> 3. Execute Pig job.
>> 4. Copy to local.
>> 5. SQL loader.
>>
>> Can we perform the merge and the copy to the cluster from a different host
>> other than the Namenode?
>> We want an out-of-cluster machine running a Java process that would:
>> 1. Run periodically
>> 2. Merge files
>> 3. Copy to cluster
>>
>> Secondly, can we append to an existing file in the cluster?
>>
>> Please provide your thoughts, as maintaining the SLA is becoming tough.
>>
>> Regards,
>> Jagaran
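The merge step in the quoted plan above (an out-of-cluster Java process that periodically merges small files before uploading) can be done with plain java.nio concatenation, since no Hadoop dependency is needed on the merge box. A minimal sketch, assuming the small files hold newline-terminated records (so plain byte concatenation is safe) and using hypothetical staging paths:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class SmallFileMerger {

    // Concatenate every regular file in srcDir into a single file at dest.
    // Order of concatenation is unspecified (DirectoryStream order), which
    // is fine when downstream processing does not depend on record order.
    public static long merge(Path srcDir, Path dest) throws IOException {
        long bytesWritten = 0;
        try (OutputStream out = Files.newOutputStream(dest);
             DirectoryStream<Path> files = Files.newDirectoryStream(srcDir)) {
            for (Path f : files) {
                if (Files.isRegularFile(f)) {
                    bytesWritten += Files.copy(f, out);
                }
            }
        }
        return bytesWritten;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical layout: one incoming dir and one merged file per app.
        Path src = Paths.get(args.length > 0 ? args[0] : "/tmp/incoming/app1");
        if (!Files.isDirectory(src)) {
            System.out.println("no input directory: " + src);
            return;
        }
        Path merged = Paths.get("/tmp/staging/app1-merged.dat");
        long n = merge(src, merged);
        System.out.println("merged " + n + " bytes into " + merged);
    }
}
```

After the merge, the same process can push the large file to the cluster (e.g. with the Hadoop client), which is what keeps all of this off the NameNode machine, per Dmitriy's advice.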
