RE: System auto reboot When MR runs

2016-04-25 Thread sunww
It may be a kernel bug:
https://bugs.centos.org/print_bug_page.php?bug_id=7770


  

System auto reboot When MR runs

2016-04-24 Thread sunww
Hi
I'm using Hadoop 2.7 with cgroups enabled on Red Hat 7.1.

When I run large MR jobs, some NodeManager machines reboot automatically.
If I use DefaultLCEResourcesHandler instead of CgroupsLCEResourcesHandler,
the MR jobs run fine.
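
The relevant yarn-site.xml settings for this swap look roughly like the sketch below (assuming the standard Hadoop 2.7 LinuxContainerExecutor setup; the cgroups hierarchy value is illustrative):

<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <!-- Swapping this value to ...util.DefaultLCEResourcesHandler is what avoids the reboots. -->
  <name>yarn.nodemanager.linux-container-executor.resources-handler.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler</value>
</property>
<property>
  <!-- Illustrative: the cgroup hierarchy under which YARN places containers. -->
  <name>yarn.nodemanager.linux-container-executor.cgroups.hierarchy</name>
  <value>/hadoop-yarn</value>
</property>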

/var/crash/127.0.0.1-2016.04.23-21:52:08/vmcore-dmesg.txt looks like this:
CPU: 29 PID: 63957 Comm: java Not tainted 3.10.0-229.el7.x86_64 #1
...
...
[15770.097168] Call Trace:
[15770.097536]  [] ? pick_next_task_fair+0x129/0x1d0
[15770.097905]  [] __schedule+0x127/0x7c0
[15770.098271]  [] schedule+0x29/0x70
[15770.098633]  [] futex_wait_queue_me+0xd3/0x130
[15770.098992]  [] futex_wait+0x179/0x280
[15770.099353]  [] ? native_sched_clock+0x13/0x80
[15770.099698]  [] ? sched_clock+0x9/0x10
[15770.100057]  [] ? sched_slice.isra.51+0x5e/0xc0
[15770.100419]  [] ? __enqueue_entity+0x78/0x80
[15770.100783]  [] do_futex+0xfe/0x5b0
[15770.101143]  [] ? wake_up_new_task+0x104/0x160
[15770.101496]  [] SyS_futex+0x80/0x180
[15770.101852]  [] system_call_fastpath+0x16/0x1b


Any suggestions would be appreciated. Thanks.
  

write to most datanode fail quickly

2014-10-14 Thread sunww
Hi
I'm using HBase with about 20 regionservers. One regionserver quickly failed to write to most of the datanodes, which finally caused the regionserver to die, while the other regionservers are fine.

The logs look like this:

java.io.IOException: Bad response ERROR for block BP-165080589-132.228.248.11-1371617709677:blk_5069077415583579127_39339217 from datanode 132.228.248.20:50010
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:681)
2014-10-13 09:23:01,227 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block BP-165080589-132.228.248.11-1371617709677:blk_5069077415583579127_39339217 in pipeline 132.228.248.17:50010, 132.228.248.20:50010, 132.228.248.41:50010: bad datanode 132.228.248.20:50010
2014-10-13 09:23:32,021 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-165080589-132.228.248.11-1371617709677:blk_5069077415583579127_39339415
java.io.IOException: Bad response ERROR for block BP-165080589-132.228.248.11-1371617709677:blk_5069077415583579127_39339415 from datanode 132.228.248.41:50010
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:681)

Then several firstBadLink errors:

2014-10-13 09:23:33,390 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream
java.io.IOException: Bad connect ack with firstBadLink as 132.228.248.18:50010
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1090)

Then several "Failed to add a datanode" errors:

2014-10-13 09:23:44,331 WARN org.apache.hadoop.hdfs.DFSClient: Error while syncing
java.io.IOException: Failed to add a datanode.  User may turn off this feature by setting dfs.client.block.write.replace-datanode-on-failure.policy in configuration, where the current policy is DEFAULT.  (Nodes: current=[132.228.248.17:50010, 132.228.248.35:50010], original=[132.228.248.17:50010, 132.228.248.35:50010])
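
By the way, the feature mentioned in the last exception is controlled by these client-side hdfs-site.xml properties (a rough sketch; the values shown are the Hadoop 2.x defaults):

<property>
  <!-- Enables replacing a bad datanode in the write pipeline on failure. -->
  <name>dfs.client.block.write.replace-datanode-on-failure.enable</name>
  <value>true</value>
</property>
<property>
  <!-- DEFAULT is the policy named in the log above; NEVER turns replacement off. -->
  <name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
  <value>DEFAULT</value>
</property>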

The full log is at http://paste2.org/xfn16jm2

Any suggestions would be appreciated. Thanks.

RE: write to most datanode fail quickly

2014-10-14 Thread sunww

I'm using Hadoop 2.0.0 and have not run fsck. Only one regionserver has these dfs logs, which is strange.

Thanks
CC: user@hadoop.apache.org
From: yuzhih...@gmail.com
Subject: Re: write to most datanode fail quickly
Date: Tue, 14 Oct 2014 02:43:26 -0700
To: user@hadoop.apache.org

Which Hadoop release are you using ?
Have you run fsck ?
Cheers
  

RE: write to most datanode fail quickly

2014-10-14 Thread sunww
Hi
dfs.client.read.shortcircuit is true.
Here is the namenode log from that moment: http://paste2.org/U0zDA9ms
There seems to be nothing special in the namenode log.
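
For context, turning on short circuit read here means roughly the following hdfs-site.xml setup (a sketch of the usual Hadoop 2.x configuration; the socket path is illustrative and the exact properties vary by release):

<property>
  <!-- Lets the DFS client read local replicas directly, bypassing the datanode. -->
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
<property>
  <!-- Illustrative path for the datanode domain socket used by short-circuit reads. -->
  <name>dfs.domain.socket.path</name>
  <value>/var/lib/hadoop-hdfs/dn_socket</value>
</property>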

Thanks
CC: user@hadoop.apache.org
From: yuzhih...@gmail.com
Subject: Re: write to most datanode fail quickly
Date: Tue, 14 Oct 2014 03:09:24 -0700
To: user@hadoop.apache.org

Can you check NameNode log for 132.228.48.20 ?
Have you turned on short circuit read ?
Cheers
  

RE: write to most datanode fail quickly

2014-10-14 Thread sunww
Hi
The correct IP is 132.228.248.20. I checked the hdfs log on the dead regionserver; it has some error messages that may be useful:
http://paste2.org/NwpcaGVv

Thanks

Date: Tue, 14 Oct 2014 10:28:31 -0700
Subject: Re: write to most datanode fail quickly
From: yuzhih...@gmail.com
To: user@hadoop.apache.org

132.228.48.20 didn't show up in the snippet (spanning only 3 minutes) that you posted.

I don't see any error or exception either.
Perhaps search in a wider scope.

RE: about long time balance stop

2014-10-09 Thread sunww
Maybe it is related to https://issues.apache.org/jira/browse/HDFS-5806

  

about long time balance stop

2014-10-08 Thread sunww
Hi
I'm using Hadoop 2.2.0. After adding some new nodes to the cluster, I ran the balancer. After several days, I find only "block.BlockTokenSecretManager: Setting block keys" in the balancer log. It looks like balancing has stopped, but the balancer process still exists. Here are some logs:
14/10/06 00:32:06 INFO balancer.Balancer: Moving block 1166241997 from 132.228.248.112:50010 to 132.228.248.131:50010 through 132.228.248.68:50010 is succeeded.
14/10/06 02:40:12 INFO block.BlockTokenSecretManager: Setting block keys
14/10/06 05:10:12 INFO block.BlockTokenSecretManager: Setting block keys
14/10/06 07:40:12 INFO block.BlockTokenSecretManager: Setting block keys
14/10/06 10:10:12 INFO block.BlockTokenSecretManager: Setting block keys
14/10/06 12:40:12 INFO block.BlockTokenSecretManager: Setting block keys
14/10/06 15:10:12 INFO block.BlockTokenSecretManager: Setting block keys
14/10/06 17:40:12 INFO block.BlockTokenSecretManager: Setting block keys
14/10/06 20:10:12 INFO block.BlockTokenSecretManager: Setting block keys
14/10/06 22:40:12 INFO block.BlockTokenSecretManager: Setting block keys
14/10/07 01:10:12 INFO block.BlockTokenSecretManager: Setting block keys
14/10/07 03:40:12 INFO block.BlockTokenSecretManager: Setting block keys
14/10/07 06:10:12 INFO block.BlockTokenSecretManager: Setting block keys
14/10/07 08:40:12 INFO block.BlockTokenSecretManager: Setting block keys
14/10/07 11:10:12 INFO block.BlockTokenSecretManager: Setting block keys
14/10/07 13:40:12 INFO block.BlockTokenSecretManager: Setting block keys
14/10/07 16:10:12 INFO block.BlockTokenSecretManager: Setting block keys
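
For context, the balancer's per-datanode throughput is capped by the hdfs-site.xml property below (a sketch showing the usual Hadoop 2.x default; this is tuning context only and does not by itself explain the hang above):

<property>
  <!-- Maximum bytes per second each datanode may spend on balancing; 1 MB/s is the default. -->
  <name>dfs.datanode.balance.bandwidthPerSec</name>
  <value>1048576</value>
</property>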


Any suggestions would be appreciated. Thanks.