[jira] [Comment Edited] (TEZ-2303) ConcurrentModificationException while processing recovery

2015-04-27 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14516200#comment-14516200
 ] 

Jeff Zhang edited comment on TEZ-2303 at 4/28/15 2:22 AM:
--

[~hitesh] Yes, I think it makes sense as a short-term fix. At least it fixes the 
ConcurrentModificationException, so the recovery process can keep going. 

Regarding the issue of not providing info to clients until the recovery phase 
is over, I think there are 2 main scenarios:

* The ClientHandler RPC is started but the recovery log has not been read yet. 
In this case, the AM throws a "No running dag at present" exception, which has 
no effect on the client side, so I think it is OK.
{code}
2015-04-28 09:32:02,054 INFO [IPC Server handler 0 on 6000] ipc.Server: IPC Server handler 0 on 6000, call org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPB.getDAGStatus from 127.0.0.1:63539 Call#9557 Retry#0
org.apache.tez.dag.api.TezException: No running dag at present
    at org.apache.tez.dag.api.client.DAGClientHandler.getDAG(DAGClientHandler.java:89)
    at org.apache.tez.dag.api.client.DAGClientHandler.getACLManager(DAGClientHandler.java:156)
    at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPBServerImpl.getDAGStatus(DAGClientAMProtocolBlockingPBServerImpl.java:95)
    at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolRPC$DAGClientAMProtocol$2.callBlockingMethod(DAGClientAMProtocolRPC.java:7465)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
{code}

* The second scenario is that even after the recovery log is read, the 
RecoveryTransition may not have completed. The client side may then still get 
a wrong DAG status. As I mentioned, this may need some big changes to the 
recovery. We can leave it for the future and take it into account when 
refactoring the recovery code. 
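
For illustration only, one possible shape for "not providing info to clients 
until the recovery phase is over" is a simple gate in front of the status 
call. This is a rough sketch, not the Tez implementation and not part of any 
attached patch: the RecoveryGate class, its method names, and the exception 
message are all hypothetical.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical gate; the class and method names are assumptions for this sketch.
class RecoveryGate {
  // Flipped exactly once, after the whole recovery phase has finished.
  private final AtomicBoolean recoveryDone = new AtomicBoolean(false);
  private volatile String dagStatus = "UNKNOWN";

  // Called by the recovery code once the recovered state is final.
  void finishRecovery(String recoveredStatus) {
    dagStatus = recoveredStatus;   // publish the state first
    recoveryDone.set(true);        // then open the gate
  }

  // Called on the client RPC path; refuses to answer while recovery is running.
  String getDAGStatus() {
    if (!recoveryDone.get()) {
      throw new IllegalStateException("Recovery in progress, DAG status not available yet");
    }
    return dagStatus;
  }
}
```

With a gate like this, a client that polls during recovery gets an explicit 
error instead of a half-recovered status, which is similar in effect to the 
first scenario above.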



was (Author: zjffdu):
[~hitesh] Yes I think it make sense for the short term fix as least it fix the 
ConcurrentModificationException. 

Regarding the issue of not providing info to clients until the recovery phase 
is over, I think there are 2 main scenario:

* ClientHandler RPC is started but recovery log is not read. In this case, it 
will throw "No dag running" exception in AM, no effect on the client side.  so 
I think it is OK.
{code}
2015-04-28 09:32:02,054 INFO [IPC Server handler 0 on 6000] ipc.Server: IPC 
Server handler 0 on 6000, call 
org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPB.getDAGStatus 
from 127.0.0.1:63539 Call#9557 Retry#0
org.apache.tez.dag.api.TezException: No running dag at present
at 
org.apache.tez.dag.api.client.DAGClientHandler.getDAG(DAGClientHandler.java:89)
at 
org.apache.tez.dag.api.client.DAGClientHandler.getACLManager(DAGClientHandler.java:156)
at 
org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPBServerImpl.getDAGStatus(DAGClientAMProtocolBlockingPBServerImpl.java:95)
at 
org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolRPC$DAGClientAMProtocol$2.callBlockingMethod(DAGClientAMProtocolRPC.java:7465)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
{code}

* The second scenario is that even the recovery log is read, the 
RecoveryTransition may not have completed. Then the client side may still get 
wrong dag status.  As I mentioned, this may need some big change on the 
recovery. We can leave it in future and take it into account when refactoring 
the recovery code. 


> ConcurrentModificationException while processing recovery
> -
>
> Key: TEZ-2303
> URL: https://issues.apache.org/jira/browse/TEZ-2303
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.6.0
> Reporter: Jason Lowe
> Assignee: Jeff Zhang
> Attachments: TEZ-2303-1.patch, TEZ-2303-2.patch, TEZ-2303-4.patch
>
>
> 

[jira] [Comment Edited] (TEZ-2303) ConcurrentModificationException while processing recovery

2015-04-23 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510217#comment-14510217
 ] 

Jeff Zhang edited comment on TEZ-2303 at 4/24/15 1:27 AM:
--

[~hitesh] Uploaded a new patch:

* Start the services after the recovery data is read. It should be fine to 
start all services after the recovery process because RecoveryParser doesn't 
use any services.
* Also verified that a versionMismatch can shut down the AM properly even if 
the services are not started.
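
The ordering described above can be sketched roughly as follows. This is an 
illustrative model only, not code from the patch: AppMasterStartup, its 
fields, and the step names are hypothetical, and the recovery parsing is 
reduced to a flag that says whether a version mismatch was found.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical AM startup sketch; names are illustrative, not the Tez API.
class AppMasterStartup {
  // Records the order of the steps so the sequence can be inspected.
  final List<String> steps = new ArrayList<>();
  private final boolean versionMismatch;
  private boolean servicesStarted = false;

  AppMasterStartup(boolean versionMismatch) {
    this.versionMismatch = versionMismatch;
  }

  void start() {
    // 1. Read the recovery data first; this step needs no running services.
    steps.add("parseRecoveryData");
    if (versionMismatch) {
      // Shut down before any service was started.
      shutdown();
      return;
    }
    // 2. Only now start the services, so nothing can observe half-read state.
    servicesStarted = true;
    steps.add("startServices");
  }

  void shutdown() {
    // Stop services only if they were actually started.
    if (servicesStarted) {
      steps.add("stopServices");
      servicesStarted = false;
    }
    steps.add("shutdown");
  }
}
```

The point of the sketch is the shutdown path: it has to tolerate being called 
while the services have never been started, which is the versionMismatch case 
mentioned in the second bullet.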


was (Author: zjffdu):
[~hitesh] Upload a new patch 

* Start the services after the recovery data is read. It should be fine stop 
all services after the recovery process because RecoveryParser don't use any 
services
* Also verify versionMismatch can shutdown the AM properly even services are 
not started.

> ConcurrentModificationException while processing recovery
> -
>
> Key: TEZ-2303
> URL: https://issues.apache.org/jira/browse/TEZ-2303
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.6.0
> Reporter: Jason Lowe
> Assignee: Jeff Zhang
> Attachments: TEZ-2303-1.patch, TEZ-2303-2.patch
>
>
> Saw a Tez AM log a few ConcurrentModificationException messages while trying 
> to recover from a previous attempt that crashed.  Exception details to follow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (TEZ-2303) ConcurrentModificationException while processing recovery

2015-04-22 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14508428#comment-14508428
 ] 

Jeff Zhang edited comment on TEZ-2303 at 4/23/15 4:18 AM:
--

Uploaded a patch (acquire the write lock first in restoreFromEvent; this 
prevents the ConcurrentModificationException while recovering, but the client 
can still get an incorrect DAG status). [~hitesh] Please help review it.
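
A minimal sketch of the locking idea, assuming the entity already guards its 
readers with a read lock from the same ReentrantReadWriteLock. The class 
RecoverableEntity, its fields, and the string events are hypothetical 
stand-ins, not the actual DAG/Vertex/Task classes or the patch itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for a DAG/Vertex/Task implementation.
class RecoverableEntity {
  private final ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock();
  private final Lock readLock = rwLock.readLock();
  private final Lock writeLock = rwLock.writeLock();

  // Plain, non-thread-safe state that recovery mutates and clients read.
  private final List<String> recoveredEvents = new ArrayList<>();

  // Recovery path: take the write lock before touching shared state.
  void restoreFromEvent(String event) {
    writeLock.lock();
    try {
      recoveredEvents.add(event);
    } finally {
      writeLock.unlock();
    }
  }

  // Client path: iterate under the read lock, so the iteration can never
  // overlap with a restoreFromEvent call on another thread.
  String getStatus() {
    readLock.lock();
    try {
      StringBuilder sb = new StringBuilder();
      for (String e : recoveredEvents) {
        sb.append(e).append(';');
      }
      return sb.toString();
    } finally {
      readLock.unlock();
    }
  }
}
```

The lock is released between events, so a reader that gets in after the first 
restoreFromEvent call but before the last one sees a consistent but incomplete 
state. That is the "incorrect DAG status" caveat noted above.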


was (Author: zjffdu):
Upload patch ( acquire writelock first in restoreFromEvent, this would prevent 
the ConcurrentModificationException while recovering, but will still cause 
client get incorrect dag status)

> ConcurrentModificationException while processing recovery
> -
>
> Key: TEZ-2303
> URL: https://issues.apache.org/jira/browse/TEZ-2303
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.6.0
> Reporter: Jason Lowe
> Assignee: Jeff Zhang
> Attachments: TEZ-2303-1.patch
>
>
> Saw a Tez AM log a few ConcurrentModificationException messages while trying 
> to recover from a previous attempt that crashed.  Exception details to follow.





[jira] [Comment Edited] (TEZ-2303) ConcurrentModificationException while processing recovery

2015-04-13 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492046#comment-14492046
 ] 

Jeff Zhang edited comment on TEZ-2303 at 4/13/15 7:59 AM:
--

DAG/Vertex/Task may have a similar issue. It looks like we should acquire the 
write lock first in restoreFromEvent.


was (Author: zjffdu):
DAG/Vertex/Task may have the similar issue. 

> ConcurrentModificationException while processing recovery
> -
>
> Key: TEZ-2303
> URL: https://issues.apache.org/jira/browse/TEZ-2303
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.6.0
> Reporter: Jason Lowe
> Assignee: Jeff Zhang
>
> Saw a Tez AM log a few ConcurrentModificationException messages while trying 
> to recover from a previous attempt that crashed.  Exception details to follow.





[jira] [Comment Edited] (TEZ-2303) ConcurrentModificationException while processing recovery

2015-04-13 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-2303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492046#comment-14492046
 ] 

Jeff Zhang edited comment on TEZ-2303 at 4/13/15 7:49 AM:
--

DAG/Vertex/Task may have a similar issue. 


was (Author: zjffdu):
DAG/Vertex/Task may has the similar issue. 

> ConcurrentModificationException while processing recovery
> -
>
> Key: TEZ-2303
> URL: https://issues.apache.org/jira/browse/TEZ-2303
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.6.0
> Reporter: Jason Lowe
> Assignee: Jeff Zhang
>
> Saw a Tez AM log a few ConcurrentModificationException messages while trying 
> to recover from a previous attempt that crashed.  Exception details to follow.


