[ https://issues.apache.org/jira/browse/HADOOP-9085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13503159#comment-13503159 ]

Steve Loughran commented on HADOOP-9085:
----------------------------------------

Pid recycling is a permanent problem on Unix systems, and you are correct that
something needs to be done. We can't rely on deleting the pid file on a
successful shutdown either, as every way of killing the process counts as
"successful", even a server reboot, and not all of them give the process a
chance to delete the file.

I don't think the proposed patch would work, as it still checks for the file
{{$pid}} even though that check is no longer needed, and the file path is also
what gets printed in the error text. It would be better to skip the -f check
and use {{$curpid}} in the error. Even after that, it's pretty brittle against
unintentional command matches.
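A minimal sketch of a pre-start check with those two changes applied, for illustration only. The proposed patch itself is not shown here, so the helper name {{check_not_running}}, its {{expected_pattern}} argument, and the use of {{/proc/<pid>/cmdline}} (Linux-specific) are all assumptions. It reads the pid file into {{curpid}} without a separate -f test, names {{$curpid}} in the error, and still has the weakness noted above: any unrelated process whose command line contains the pattern will pass.

```shell
#!/usr/bin/env bash
# Hypothetical pre-start check in the style of hadoop-daemon.sh.
# Linux-specific: reads /proc/<pid>/cmdline.

# check_not_running PIDFILE PATTERN
# Returns 0 if it looks safe to start, 1 if a matching process is alive.
check_not_running() {
  local pid=$1 expected_pattern=$2 curpid cmdline
  # No separate [ -f "$pid" ] test: a missing or unreadable pid file
  # simply leaves curpid empty.
  curpid=$(cat "$pid" 2>/dev/null)
  # Empty or non-numeric content: treat as "not running".
  case $curpid in ''|*[!0-9]*) return 0 ;; esac
  # kill -0 only proves that something with this id exists and can be
  # signalled; on Linux that includes a thread of another process.
  kill -0 "$curpid" 2>/dev/null || return 0
  # Also require the command line to contain the pattern. This is the
  # brittle part: an unrelated process whose arguments match also passes.
  cmdline=$(tr '\0' ' ' 2>/dev/null < "/proc/$curpid/cmdline")
  case $cmdline in
    *"$expected_pattern"*)
      # Report the live pid rather than the pid-file path.
      echo "process $curpid is still running. Stop it first." >&2
      return 1 ;;
  esac
  return 0
}
```

For example, a start script could run {{check_not_running "$pid" NameNode || exit 1}} before launching the daemon.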

What we need to do is move away from pid-file-liveness tests altogether.

There is a far more robust alternative: the service, once started, should take
an exclusive write lock on a well-known file. When the process dies, the OS
automatically releases this lock. I'll open a JIRA on it.
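A minimal sketch of that approach in shell, assuming Linux with the util-linux {{flock}} command; the names {{LOCK_FILE}} and {{start_service}} are hypothetical and not part of hadoop-daemon.sh. The lock is taken on a file descriptor that the service process inherits, so the kernel drops it whenever that process exits, however it exits.

```shell
#!/usr/bin/env bash
# Hypothetical lock-file guard: the service holds an exclusive lock on a
# well-known file for its whole lifetime. Requires flock(1).

LOCK_FILE=${LOCK_FILE:-/tmp/hadoop-namenode.lock}

# start_service CMD [ARGS...]
# Execs CMD in place of the current shell while holding the lock, so
# call it in the background or in a subshell.
start_service() {
  # Open (or create) the lock file on fd 9; appending never truncates it.
  exec 9>>"$LOCK_FILE"
  # -n: fail at once instead of waiting; -x: exclusive (write) lock.
  if ! flock -n -x 9; then
    echo "already running: $LOCK_FILE is locked by another process" >&2
    exit 1
  fi
  # The service inherits fd 9 and with it the lock. When the service
  # process dies for any reason (clean stop, kill -9, crash), the kernel
  # closes fd 9 and releases the lock. The file itself stays behind, but
  # an unlocked file simply means nothing is running.
  exec "$@"
}
```

Usage would look like {{start_service <command> &}}. One caveat of this design: child processes that inherit fd 9 and outlive the service keep the lock held.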

                
> Starting the namenode fails because, before it starts, the namenode pid file
> holds another process's pid or a thread id
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-9085
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9085
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: bin
>    Affects Versions: 2.0.1-alpha, 2.0.3-alpha
>         Environment: NA
>            Reporter: liaowenrui
>             Fix For: 2.0.1-alpha, 2.0.2-alpha, 2.0.3-alpha
>
>
> If, before the namenode is started, the pid in the namenode pid file belongs
> to another process or to a thread, starting the namenode fails. The
> hadoop-daemon.sh script checks the pid in the namenode pid file with kill -0
> before starting the namenode. When that pid belongs to another process or
> thread, kill -0 returns success, which the script takes to mean the namenode
> is running. In reality, the namenode is not running.
> 2338 is the dead namenode's pid
> 2305 is the datanode's pid
> cqn2:/tmp # kill -0 2338
> cqn2:/tmp # ps -wweLo pid,ppid,tid | grep 2338
>  2305     1  2338
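The transcript shows why kill -0 is misleading here: on Linux, kill also accepts a thread id, and 2338 is now a thread (TID) inside the datanode process 2305, so the signal check succeeds. A minimal Linux-only sketch of a stricter test compares the id with the {{Tgid}} field of {{/proc/<id>/status}}, which equals the id only for a process itself, not for one of its other threads; {{is_process_leader}} is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper; Linux-specific (reads /proc/<id>/status).

# is_process_leader ID
# Succeeds only if ID is a live process (thread-group leader); fails if
# ID is unused, non-numeric, or a non-leader thread of another process.
is_process_leader() {
  local id=$1 key value rest tgid=
  case $id in ''|*[!0-9]*) return 1 ;; esac
  [ -r "/proc/$id/status" ] || return 1
  # /proc/<tid>/status exists for every thread, even though /proc only
  # lists process ids; its Tgid line names the owning process.
  while read -r key value rest; do
    if [ "$key" = "Tgid:" ]; then
      tgid=$value
      break
    fi
  done < "/proc/$id/status"
  [ "$tgid" = "$id" ]
}
```

In the reported case, {{is_process_leader 2338}} would fail, since thread 2338 belongs to thread group 2305, so a pre-start check could treat the pid file as stale. It does nothing for an id recycled by an unrelated process, which is why the lock-file approach is the more robust fix.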

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
