1) Correct.
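For point 1, Pig's split combination can be turned on via job properties. A minimal sketch; the property names are from Pig 0.8+, and the script name and split size are placeholders, so check them against your version:

```shell
# Combine many small input files into fewer map tasks (Pig 0.8+ properties;
# verify against your Pig version). myscript.pig is a placeholder script.
pig -Dpig.splitCombination=true \
    -Dpig.maxCombinedSplitSize=134217728 \
    myscript.pig    # combine inputs up to 128 MB per split
```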

2) You can copy to the cluster from any machine. Just have the Hadoop
config on the classpath, or specify the full destination URI in your copy
command (hdfs://my-nn/path/to/destination).
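For point 2, a minimal sketch of the copy from a non-cluster client box; "my-nn" is the NameNode host from this thread, and the local and HDFS paths are placeholders:

```shell
# Push local files into HDFS from any machine with the Hadoop client
# installed; no need to run the copy on the NameNode box itself.
hadoop fs -mkdir -p hdfs://my-nn/data/incoming
hadoop fs -put /local/staging/*.dat hdfs://my-nn/data/incoming/
```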



On Sat, Jul 16, 2011 at 1:00 PM, jagaran das <[email protected]> wrote:
> ok then
>
> 1. Do we have to write a Pig job for merging, or does Pig itself merge so 
> that fewer mappers are invoked?
>
> 2. Can we copy to a cluster from a non-cluster machine, using the namespace 
> URI of the NN? We can dedicate some well-configured boxes to do our merging 
> and copying, and then copy it to the NN over the network.
>
> 3. How is the performance of the FileCrusher tool?
>
> We found that copying 12 GB of data for all 15 apps in parallel took 35 
> minutes.
>
> We ran 15 copy-from-local operations, each with 12 GB of data.
>
> Thanks
> JD
>
>
> ________________________________
> From: Dmitriy Ryaboy <[email protected]>
> To: [email protected]; jagaran das <[email protected]>
> Sent: Saturday, 16 July 2011 7:58 AM
> Subject: Re: Hadoop Production Issue
>
> Merging: doesn't actually speed things up all that much; reduces load
> on the Namenode, and speeds up job initialization some. You don't have
> to do it on the namenode itself. Neither do you have to do copying on
> the NN. In fact, don't do anything but run the NameNode on the
> namenode.
>
> Pig jobs can transparently combine small files into larger splits, so you
> won't be stuck with 11K mappers.
>
> Don't copy to local and then run SQL loader. Use Sqoop export, and load
> directly from Hadoop.
>
> You cannot append to a file that already exists in the cluster. This
> will be available in one of the coming Hadoop releases. You can
> certainly create a new file in a directory, and load whole
> directories.
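Writing a new file per batch and loading the whole directory, as described above, might look like this sketch; the paths and batch-naming scheme are placeholders:

```shell
# Instead of appending, land each batch as a fresh file in one directory...
hadoop fs -put batch-$(date +%s).dat hdfs://my-nn/data/incoming/
# ...and point Pig at the directory rather than a single file:
#   A = LOAD '/data/incoming' USING PigStorage(',');
```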
>
> -D
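The Sqoop export suggested above could look like this sketch; the JDBC URL, credentials, table name, and export directory are all hypothetical placeholders:

```shell
# Export Pig's output directory straight into the database, skipping the
# copy-to-local step entirely.
sqoop export \
  --connect jdbc:oracle:thin:@db-host:1521/orcl \
  --username etl --password-file /user/etl/.pw \
  --table AGG_RESULTS \
  --export-dir /data/pig-output \
  -m 4    # run 4 export mappers in parallel
```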
>
> On Sat, Jul 16, 2011 at 1:17 AM, jagaran das <[email protected]> wrote:
>>
>>
>> Hi,
>>
>> Due to requirements in our current production CDH3 cluster, we need to copy 
>> around 11520 small files (12 GB in total) to the cluster for one 
>> application.
>> We have 20 such applications that would run in parallel.
>>
>> So one set would have 11520 files totaling 12 GB.
>> We would have 15 such sets in parallel.
>>
>> The total SLA for the pipeline, from copy through Pig aggregation to copy to 
>> local and SQL load, is 15 minutes.
>>
>> What we do:
>>
>> 1. Merge files so that we get rid of small files. - Is this a huge time 
>> hit? Do we have any other option?
>> 2. Copy to cluster
>> 3. Execute PIG job
>> 4. copy to local
>> 5. SQL loader
>>
>> Can we perform merge and copy to cluster from a different host other than 
>> the Namenode?
>> We want an out of cluster machine running a java process that would
>> 1. Run periodically
>> 2. Merge Files
>> 3. Copy to Cluster
>>
>> Secondly,
>> Can we append to an existing file in the cluster?
>>
>> Please provide your thoughts, as maintaining the SLA is becoming tough.
>>
>> Regards,
>> Jagaran
>>