[jira] [Commented] (FLINK-10133) finished job's jobgraph never been cleaned up in zookeeper for standalone clusters (HA mode with multiple masters)

2018-10-23 Thread Xiangyu Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16660738#comment-16660738
 ] 

Xiangyu Zhu commented on FLINK-10133:
-------------------------------------

Hi all,

We tested with 1.6.1 and the issue seems to be gone, so it can be closed.

Thank you.

> finished job's jobgraph never been cleaned up in zookeeper for standalone 
> clusters (HA mode with multiple masters)
> -------------------------------------------------------------------------
>
> Key: FLINK-10133
> URL: https://issues.apache.org/jira/browse/FLINK-10133
> Project: Flink
> Issue Type: Bug
> Components: JobManager
> Affects Versions: 1.5.0, 1.5.2, 1.6.0
> Reporter: Xiangyu Zhu
> Priority: Major
> Attachments: client.log, namenode.log, standalonesession.log, zookeeper.log
>
>
> Hi,
> We have 3 servers in our test environment, noted as node1-3. The setup is as
> follows:
>  * hadoop hdfs: node1 as namenode, node2,3 as datanodes
>  * zookeeper: node1-3 as a quorum (but we also tried node1 alone)
>  * flink: node1,2 as masters, node2,3 as slaves
> My understanding is that when a job finishes, its blob data is expected to be
> deleted from the HDFS path, and the node under ZooKeeper's path
> `/{zk path root}/{cluster-id}/jobgraphs/{job id}` should be deleted after that.
> However, we observe that whenever we submit a job and it finishes (via
> `bin/flink run WordCount.jar`), the blob data is gone whereas the job id node
> under ZooKeeper is still there, with a UUID-style lock node inside it. In the
> ZooKeeper debug log we observed something like "cannot be deleted because non
> empty" (a sketch of how to inspect this node follows after this quoted
> description). Because of this, as long as a job has finished and its jobgraph
> node persists, restarting the cluster or killing one job manager (to test HA
> mode) makes Flink try to recover the finished job; it cannot find the blob
> data under HDFS, and the whole cluster goes down.
> If we use only node1 as master and node2,3 as slaves, the jobgraphs node is
> deleted successfully. When the jobgraphs path is clean, killing one job
> manager makes another standby JM become leader, so it is only this jobgraphs
> issue that prevents HA from working.
> I'm not sure whether something is wrong with our configuration, because this
> happens every time for a finished job (we only tested with WordCount.jar,
> though). I'm aware of FLINK-10011 and FLINK-10029, but unlike FLINK-10011 this
> happens every time, rendering HA mode unusable for us.
> Any idea what might cause this?
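
As a quick way to confirm the leftover state described in the quoted report, the ZooKeeper CLI can be pointed at Flink's HA root. This is a minimal sketch assuming the default `high-availability.zookeeper.path.root` of `/flink` and the default cluster-id `/default`; the job id shown is purely illustrative:

{code}
# Run from the ZooKeeper installation, against the quorum (node1 in the setup above)
bin/zkCli.sh -server node1:2181

# Inside the zkCli shell: list leftover job graph nodes for the cluster
ls /flink/default/jobgraphs
# e.g. [2c2d1b7f0e6a4c2e9c7f0d1a2b3c4d5e]   <- finished job that was never cleaned up

# The UUID-style lock child is what keeps the node non-empty
ls /flink/default/jobgraphs/2c2d1b7f0e6a4c2e9c7f0d1a2b3c4d5e

# A plain delete therefore fails, matching the "cannot be deleted because non empty" message
delete /flink/default/jobgraphs/2c2d1b7f0e6a4c2e9c7f0d1a2b3c4d5e
# Node not empty: /flink/default/jobgraphs/2c2d1b7f0e6a4c2e9c7f0d1a2b3c4d5e
{code}

If the node and its lock child show up like this after the job has finished, the behaviour above is reproduced; removing the stale node by hand (deleting the lock child first) only works around the problem until the next job finishes.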





[jira] [Commented] (FLINK-10133) finished job's jobgraph never been cleaned up in zookeeper for standalone clusters (HA mode with multiple masters)

2018-08-13 Thread Xiangyu Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579248#comment-16579248
 ] 

Xiangyu Zhu commented on FLINK-10133:
-------------------------------------

[~Wosinsan] [~elevy] I have uploaded the logs with some sensitive info redacted. 
If the logs look OK to you, this issue can be closed as a duplicate. Thanks!






[jira] [Commented] (FLINK-10133) finished job's jobgraph never been cleaned up in zookeeper for standalone clusters (HA mode with multiple masters)

2018-08-13 Thread Elias Levy (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578792#comment-16578792
 ] 

Elias Levy commented on FLINK-10133:
------------------------------------

[~Frefreak] This is likely the same issue as FLINK-10011. If so, mark this one 
as a duplicate.






[jira] [Commented] (FLINK-10133) finished job's jobgraph never been cleaned up in zookeeper for standalone clusters (HA mode with multiple masters)

2018-08-13 Thread Xiangyu Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578452#comment-16578452
 ] 

Xiangyu Zhu commented on FLINK-10133:
-------------------------------------

[~Wosinsan] Sure, but I have no access to my servers right now. I will post the 
logs tomorrow.






[jira] [Commented] (FLINK-10133) finished job's jobgraph never been cleaned up in zookeeper for standalone clusters (HA mode with multiple masters)

2018-08-13 Thread Dominik Wosiński (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578403#comment-16578403
 ] 

Dominik Wosiński commented on FLINK-10133:
------------------------------------------

I have faced this issue when trying to deploy multiple JobManagers, and I was 
not able to solve it.
It is odd that it still affects 1.5.2 and 1.6.0, though: my understanding was 
that jobs were failing because the removal of blobs is independent of the 
removal of the job graph, and that was fixed in FLINK-9575. Could you post some 
logs here?
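
For reference, here is one way the relevant JobManager log lines could be pulled out before attaching the full files. This is only a sketch: it assumes the default standalone log file naming and greps for class names present in 1.5/1.6 (ZooKeeperSubmittedJobGraphStore, BlobServer); the directory is a placeholder, so adjust paths and patterns to the actual installation.

{code}
# On the leading JobManager host (node1 or node2 in the setup above)
cd /path/to/flink/log   # placeholder for the Flink log directory

# Job graph store activity (adding/removing submitted job graphs in ZooKeeper)
grep -i "SubmittedJobGraphStore" flink-*-standalonesession-*.log

# Blob server / blob store clean-up against the HA storage directory
grep -iE "BlobServer|BlobStore" flink-*-standalonesession-*.log
{code}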



