Is it possible to configure HDFS in federation mode and HA mode at the same time?

2016-08-15 Thread Alexandr Porunov
Hello all,

I don't understand whether it is possible to configure HDFS in both modes at
the same time. Does it make sense? Can somebody show a simple configuration of
HDFS in both modes? (nameNode1, nameNode2, nameNodeStandby1,
nameNodeStandby2)

Sincerely,
Alexandr
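
For reference: federation and HA can be combined, since each federated
nameservice can be its own active/standby pair. A minimal, hedged
hdfs-site.xml sketch, assuming two nameservices ns1/ns2 and hypothetical
hosts nn1-host..nn4-host; JournalNode, fencing and automatic-failover
settings are omitted:

  <property>
    <name>dfs.nameservices</name>
    <value>ns1,ns2</value>              <!-- two federated nameservices -->
  </property>
  <property>
    <name>dfs.ha.namenodes.ns1</name>
    <value>nn1,nn2</value>              <!-- active/standby pair for ns1 -->
  </property>
  <property>
    <name>dfs.ha.namenodes.ns2</name>
    <value>nn3,nn4</value>              <!-- active/standby pair for ns2 -->
  </property>
  <property>
    <name>dfs.namenode.rpc-address.ns1.nn1</name>
    <value>nn1-host:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.ns1.nn2</name>
    <value>nn2-host:8020</value>
  </property>
  <!-- same rpc-address pattern for ns2.nn3 and ns2.nn4 -->
  <property>
    <name>dfs.client.failover.proxy.provider.ns1</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  </property>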


Re: How to distcp data between two clusters which are not in the same local network?

2016-08-15 Thread Shady Xu
Thanks Wei-Chiu and Sunil, I have read the docs you mentioned before
starting. The specific problem now is that the DataNodes of the source
cluster report their local IPs instead of the public ones, which cannot be
accessed from the NodeManagers of the destination cluster. It seems the
solution is to set the `dfs.datanode.dns.interface` property, but
unfortunately it doesn't work.
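
For reference, a hedged sketch of the hdfs-site.xml settings that usually
matter here, assuming the public hostnames of the source DataNodes resolve
correctly from the destination cluster. Note that dfs.datanode.dns.interface
only changes what the DataNode registers with the NameNode; the distcp map
tasks act as HDFS clients, so the client-side switch matters as well:

  <!-- On the source DataNodes: register with the NameNode by hostname -->
  <property>
    <name>dfs.datanode.use.datanode.hostname</name>
    <value>true</value>
  </property>

  <!-- On the client side (the distcp job on the destination cluster):
       connect to DataNodes by hostname instead of the registered IP -->
  <property>
    <name>dfs.client.use.datanode.hostname</name>
    <value>true</value>
  </property>

The client-side flag can also be passed per job, e.g.
hadoop distcp -Ddfs.client.use.datanode.hostname=true ... (a sketch, not
verified here).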

2016-08-15 22:06 GMT+08:00 Sunil Govind :

> Hi
>
> I think you can also refer to the link below.
> http://aajisaka.github.io/hadoop-project/hadoop-distcp/DistCp.html
>
> Thanks
> Sunil
>
> On Mon, Aug 15, 2016 at 7:26 PM Wei-Chiu Chuang 
> wrote:
>
>> Hello,
>> if I understand your question correctly, you are actually building a
>> multi-homed Hadoop cluster, correct?
>> A multi-homed Hadoop cluster can be tricky to set up, to the extent that
>> Cloudera does not recommend it. I've not set up a multi-homed Hadoop
>> cluster before, but I think you have to make sure that reverse DNS
>> resolution works for the IP addresses.
>>
>> https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/
>> HdfsMultihoming.html
>>
>>
>> On Mon, Aug 15, 2016 at 1:06 AM, Shady Xu  wrote:
>>
>>> Hi all,
>>>
>>> Recently I tried to use distcp to copy data across two clusters which
>>> are not in the same local network. Fortunately, the nodes of the source
>>> cluster each have an extra interface and IP which can be accessed from the
>>> destination cluster. But during the distcp run, the map tasks always used
>>> the local IPs of the source cluster nodes, which they cannot reach.
>>>
>>> I tried changing the property 'dfs.datanode.dns.interface' to the one I
>>> want, and I also tried setting the property
>>> 'dfs.datanode.use.datanode.hostname' to true. Nothing works.
>>>
>>> Does Hadoop support this now, or am I missing something?
>>>
>>
>>


Re: Hadoop archives (.har) are really really slow

2016-08-15 Thread Aaron Turner
Oh, I should mention that creating the archive took only a few hours, but
copying the files out of the archive back to HDFS ran at 80MB/min. It would
take years to copy everything back, which seems really surprising.
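
For reference, since har:// paths are readable like any other filesystem, one
hedged way to parallelize the copy-out (the paths below are only
placeholders) is:

  # Copy out of the archive with many mappers instead of a single stream.
  hadoop distcp har:///archives/data.har/dir hdfs:///restored/dir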

-Aaron


> On Aug 15, 2016, at 12:33 PM, Tsz Wo Sze  wrote:
> 
> ls over files in har:// may be about 10 times slower than ls over regular
> files.  It does not sound normal unless it would take ~1 day to list out
> all the 250TB of files when they are stored as regular files.
> Tsz-Wo
> 
> 
> On Monday, August 15, 2016 10:01 AM, Aaron Turner  
> wrote:
> 
> 
> Basically I want to list all the files in a .har file and compare the
> file list/sizes to an existing directory in HDFS.  The problem is that
> running commands like: hdfs dfs -ls -R  is orders of
> magnitude slower than running the same command against a live HDFS
> file system.
> 
> How much slower?  I've calculated it will take ~19 days to list all
> the files in 250TB worth of content spread between 2 .har files.
> 
> Is this normal?  Can I do this faster (write a map/reduce job/etc?)
> 
> --
> Aaron Turner
> https://synfin.net/ Twitter: @synfinatic
> Those who would give up essential Liberty, to purchase a little temporary
> Safety, deserve neither Liberty nor Safety.
> -- Benjamin Franklin
> 
> 
> 
> 


Re: Hadoop archives (.har) are really really slow

2016-08-15 Thread Aaron Turner
I can list all the files out of HDFS in a few hours, not a day. Listing the 
files in a single directory in the har takes ~50 min.  Honestly I'd be happy 
with only a 10x performance hit. I'm seeing closer to 100-150x. 

-Aaron


> On Aug 15, 2016, at 12:33 PM, Tsz Wo Sze  wrote:
> 
> ls over files in har:// may be about 10 times slower than ls over regular
> files.  It does not sound normal unless it would take ~1 day to list out
> all the 250TB of files when they are stored as regular files.
> Tsz-Wo
> 
> 
> On Monday, August 15, 2016 10:01 AM, Aaron Turner  
> wrote:
> 
> 
> Basically I want to list all the files in a .har file and compare the
> file list/sizes to an existing directory in HDFS.  The problem is that
> running commands like: hdfs dfs -ls -R  is orders of
> magnitude slower than running the same command against a live HDFS
> file system.
> 
> How much slower?  I've calculated it will take ~19 days to list all
> the files in 250TB worth of content spread between 2 .har files.
> 
> Is this normal?  Can I do this faster (write a map/reduce job/etc?)
> 
> --
> Aaron Turner
> https://synfin.net/ Twitter: @synfinatic
> Those who would give up essential Liberty, to purchase a little temporary
> Safety, deserve neither Liberty nor Safety.
> -- Benjamin Franklin
> 
> 
> 
> 


Hadoop archives (.har) are really really slow

2016-08-15 Thread Aaron Turner
Basically I want to list all the files in a .har file and compare the
file list/sizes to an existing directory in HDFS.  The problem is that
running commands like: hdfs dfs -ls -R  is orders of
magnitude slower than running the same command against a live HDFS
file system.

How much slower?  I've calculated it will take ~19 days to list all
the files in 250TB worth of content spread between 2 .har files.

Is this normal?  Can I do this faster (write a map/reduce job/etc?)
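
For what it's worth, a .har is just an HDFS directory holding part-* data
files plus _index and _masterindex metadata files, so a hedged shortcut for
getting the file list without walking har:// is to read the index directly
(the archive path below is a placeholder, and the exact line format varies by
version):

  # Each _index line describes one archived entry: a URL-encoded path plus
  # metadata such as the part file, offset and length. Reading it is one
  # sequential scan instead of one har:// listing call per directory.
  hdfs dfs -cat /archives/data.har/_index | head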

--
Aaron Turner
https://synfin.net/ Twitter: @synfinatic
Those who would give up essential Liberty, to purchase a little temporary
Safety, deserve neither Liberty nor Safety.
-- Benjamin Franklin




Re: Yarn web UI shows more memory used than actual

2016-08-15 Thread Ravi Prakash
Hi Suresh!

YARN's accounting of memory on each node is completely different from the
Linux kernel's accounting of memory used. For example, I could launch a
MapReduce task which in reality allocates just 100 MB, and tell YARN to give
it 8 GB. The kernel would show the memory requested by the task and its
resident memory (which would be ~100 MB), while the NodeManager page will
show 8 GB used.
Please see
https://yahooeng.tumblr.com/post/147408435396/moving-the-utilization-needle-with-hadoop
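
In other words, the gap is typically the difference between the container
size requested from YARN and the heap the task JVM actually uses. A hedged
mapred-site.xml sketch of the two knobs (values are purely illustrative):

  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>8192</value>        <!-- what YARN accounts for and the UI shows -->
  </property>
  <property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx1024m</value>   <!-- the heap cap the task JVM actually runs with -->
  </property>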

HTH
Ravi

On Mon, Aug 15, 2016 at 5:58 AM, Sunil Govind 
wrote:

> Hi Suresh
>
> "This 'memory used' would be the memory used by all containers running on
> that node"
> >> "Memory Used" in Nodes page indicates how memory is used in all the
> node managers with respect to the corresponding demand made to RM. For eg,
> if application has asked for 4GB resource and if its really using only 2GB,
> then this kind of difference can be shown (one possibility). Which means
> 4GB will be displayed in Node page.
>
> As Ray has mentioned if the demand for resource is more from AM itself OR
> with highly configured JVM size for containers (through java opts), there
> can be chances that containers may take more that you intented and UI will
> display higher value.
>
> Thanks
> Sunil
>
> On Sun, Aug 14, 2016 at 6:35 AM Suresh V  wrote:
>
>> Hello Ray,
>>
>> I'm referring to the nodes of the cluster page, which shows the
>> individual nodes and the total memory available in each node and the memory
>> used in each node.
>>
>> This 'memory used' would be the memory used by all containers running on
>> that node; however, if I run the free command on the node, there is a
>> significant difference. I'm unable to understand this...
>>
>> I would appreciate any light on this. I agree the main RM page shows the
>> total container memory utilization across nodes, which matches the sum of
>> the memory used on each node as displayed on the 'Nodes of the cluster' page...
>>
>> Thank you
>> Suresh.
>>
>>
>> Suresh V
>> http://www.justbirds.in
>>
>>
>> On Sat, Aug 13, 2016 at 12:44 PM, Ray Chiang  wrote:
>>
>>> The RM page will show the combined container memory usage.  If you have
>>> a significant difference between any or all of
>>>
>>> 1) actual process memory usage
>>> 2) JVM heap size
>>> 3) container maximum
>>>
>>> then you will have significant memory underutilization.
>>>
>>> -Ray
>>>
>>>
>>> On 20160813 6:31 AM, Suresh V wrote:
>>>
>>> Hello,
>>>
>>> In our cluster, when an MR job is running, the 'Nodes of the cluster'
>>> page shows the memory used as 84GB out of 87GB allocated to the YARN
>>> NodeManagers.
>>> However, when I actually run top or free while logged in to the node,
>>> it shows only 23GB used and about 95GB or more free.
>>>
>>> I would imagine the memory used displayed in the Yarn web UI should
>>> match the memory used shown by top or free command on the node.
>>>
>>> Please advise if this is right thinking or am I missing something?
>>>
>>> Thank you
>>> Suresh.
>>>
>>>
>>>
>>>
>>


Re: How to distcp data between two clusters which are not in the same local network?

2016-08-15 Thread Sunil Govind
Hi

I think you can also refer to the link below.
http://aajisaka.github.io/hadoop-project/hadoop-distcp/DistCp.html

Thanks
Sunil

On Mon, Aug 15, 2016 at 7:26 PM Wei-Chiu Chuang  wrote:

> Hello,
> if I understand your question correctly, you are actually building a
> multi-homed Hadoop cluster, correct?
> A multi-homed Hadoop cluster can be tricky to set up, to the extent that
> Cloudera does not recommend it. I've not set up a multi-homed Hadoop
> cluster before, but I think you have to make sure that reverse DNS
> resolution works for the IP addresses.
>
>
> https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HdfsMultihoming.html
>
>
> On Mon, Aug 15, 2016 at 1:06 AM, Shady Xu  wrote:
>
>> Hi all,
>>
>> Recently I tried to use distcp to copy data across two clusters which are
>> not in the same local network. Fortunately, the nodes of the source cluster
>> each have an extra interface and IP which can be accessed from the
>> destination cluster. But during the distcp run, the map tasks always used
>> the local IPs of the source cluster nodes, which they cannot reach.
>>
>> I tried changing the property 'dfs.datanode.dns.interface' to the one I
>> want, and I also tried setting the property
>> 'dfs.datanode.use.datanode.hostname' to true. Nothing works.
>>
>> Does Hadoop support this now, or am I missing something?
>>
>
>


Re: How to distcp data between two clusters which are not in the same local network?

2016-08-15 Thread Wei-Chiu Chuang
Hello,
if I understand your question correctly, you are actually building a
multi-homed Hadoop cluster, correct?
A multi-homed Hadoop cluster can be tricky to set up, to the extent that
Cloudera does not recommend it. I've not set up a multi-homed Hadoop
cluster before, but I think you have to make sure that reverse DNS
resolution works for the IP addresses.

https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HdfsMultihoming.html
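
The page above mainly covers the bind-host settings that make the NameNode
endpoints listen on all interfaces; a hedged hdfs-site.xml sketch of those
(they complement, rather than replace, working forward and reverse DNS for
the public names):

  <!-- Listen on all interfaces while still advertising the configured
       rpc-address / http-address to clients -->
  <property>
    <name>dfs.namenode.rpc-bind-host</name>
    <value>0.0.0.0</value>
  </property>
  <property>
    <name>dfs.namenode.http-bind-host</name>
    <value>0.0.0.0</value>
  </property>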


On Mon, Aug 15, 2016 at 1:06 AM, Shady Xu  wrote:

> Hi all,
>
> Recently I tried to use distcp to copy data across two clusters which are
> not in the same local network. Fortunately, the nodes of the source cluster
> each have an extra interface and IP which can be accessed from the
> destination cluster. But during the distcp run, the map tasks always used
> the local IPs of the source cluster nodes, which they cannot reach.
>
> I tried changing the property 'dfs.datanode.dns.interface' to the one I
> want, and I also tried setting the property
> 'dfs.datanode.use.datanode.hostname' to true. Nothing works.
>
> Does Hadoop support this now, or am I missing something?
>


Re: Yarn web UI shows more memory used than actual

2016-08-15 Thread Sunil Govind
Hi Suresh

"This 'memory used' would be the memory used by all containers running on
that node"
>> "Memory Used" in Nodes page indicates how memory is used in all the node
managers with respect to the corresponding demand made to RM. For eg, if
application has asked for 4GB resource and if its really using only 2GB,
then this kind of difference can be shown (one possibility). Which means
4GB will be displayed in Node page.

As Ray has mentioned if the demand for resource is more from AM itself OR
with highly configured JVM size for containers (through java opts), there
can be chances that containers may take more that you intented and UI will
display higher value.
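
One hedged way to see the two views side by side (the node ID below is only a
placeholder):

  # YARN's view: memory *allocated* to containers on this node
  yarn node -status <nodemanager-host>:<port>
  # Kernel's view on the same host: memory actually used / free
  free -m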

Thanks
Sunil

On Sun, Aug 14, 2016 at 6:35 AM Suresh V  wrote:

> Hello Ray,
>
> I'm referring to the nodes of the cluster page, which shows the individual
> nodes and the total memory available in each node and the memory used in
> each node.
>
> This 'memory used' would be the memory used by all containers running on
> that node; however, if I run the free command on the node, there is a
> significant difference. I'm unable to understand this...
>
> I would appreciate any light on this. I agree the main RM page shows the
> total container memory utilization across nodes, which matches the sum of
> the memory used on each node as displayed on the 'Nodes of the cluster' page...
>
> Thank you
> Suresh.
>
>
> Suresh V
> http://www.justbirds.in
>
>
> On Sat, Aug 13, 2016 at 12:44 PM, Ray Chiang  wrote:
>
>> The RM page will show the combined container memory usage.  If you have a
>> significant difference between any or all of
>>
>> 1) actual process memory usage
>> 2) JVM heap size
>> 3) container maximum
>>
>> then you will have significant memory underutilization.
>>
>> -Ray
>>
>>
>> On 20160813 6:31 AM, Suresh V wrote:
>>
>> Hello,
>>
>> In our cluster, when an MR job is running, the 'Nodes of the cluster'
>> page shows the memory used as 84GB out of 87GB allocated to the YARN
>> NodeManagers.
>> However, when I actually run top or free while logged in to the node,
>> it shows only 23GB used and about 95GB or more free.
>>
>> I would imagine the memory used displayed in the Yarn web UI should match
>> the memory used shown by top or free command on the node.
>>
>> Please advise if this is right thinking or am I missing something?
>>
>> Thank you
>> Suresh.
>>
>>
>>
>>
>


How to distcp data between two clusters which are not in the same local network?

2016-08-15 Thread Shady Xu
Hi all,

Recently I tried to use distcp to copy data across two clusters which are
not in the same local network. Fortunately, the nodes of the source cluster
each have an extra interface and IP which can be accessed from the
destination cluster. But during the distcp run, the map tasks always used
the local IPs of the source cluster nodes, which they cannot reach.
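
For reference, a typical invocation of this kind, run from the destination
cluster (NameNode addresses and paths are placeholders):

  hadoop distcp hdfs://source-nn:8020/data/dir hdfs://dest-nn:8020/data/dir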

I tried changing the property 'dfs.datanode.dns.interface' to the one I
want, and I also tried setting the property
'dfs.datanode.use.datanode.hostname' to true. Nothing works.

Does Hadoop support this now, or am I missing something?