[jira] [Created] (RATIS-697) Provide helper scripts for code quality checks
Marton Elek created RATIS-697:

Summary: Provide helper scripts for code quality checks
Key: RATIS-697
URL: https://issues.apache.org/jira/browse/RATIS-697
Project: Ratis
Issue Type: Improvement
Reporter: Marton Elek
Assignee: Marton Elek

In Ozone we started to use simple shell scripts (for example dev-support/check/checkstyle.sh) to check the quality of the code. They help us to execute local Maven commands quickly and collect all of the results.

They should also help us to use GitHub Actions or another highly parallel CI in the future.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (RATIS-750) Ratis server fails with "java.lang.ClassNotFoundException: com.codahale.metrics.jvm.GarbageCollectorMetricSet"
[ https://issues.apache.org/jira/browse/RATIS-750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16971684#comment-16971684 ] Marton Elek commented on RATIS-750: ---
Optional is used intentionally, as metrics may or may not be used by the apps. It's the responsibility of the user of Ratis to put a specific version on the classpath. I think a better fix is to add it as a real compile dependency only where it is required (e.g. add it to the logService and to the example server).
> Ratis server fails with "java.lang.ClassNotFoundException: > com.codahale.metrics.jvm.GarbageCollectorMetricSet" > -- > > Key: RATIS-750 > URL: https://issues.apache.org/jira/browse/RATIS-750 > Project: Ratis > Issue Type: Bug > Components: build >Reporter: Clay B. >Assignee: Clay B. >Priority: Major > Attachments: > 0001-RATIS-750.-Ratis-server-fails-with-java.lang.ClassNo.patch > > > In testing the current master, starting the Ratis server via > {{./ratis-examples/src/main/bin/server.sh filestore server --storage $storage > --id $id --peers $peers 2>&1 | \}} I end up with the following failure to > start: > {code:java} > Found > /home/vagrant/incubator-ratis/ratis-examples/target/ratis-examples-0.5.0-SNAPSHOT.jar > 2019-11-11 03:27:52 INFO MetricRegistries:64 - Loaded MetricRegistries class > org.apache.ratis.metrics.impl.MetricRegistriesImpl > 2019-11-11 03:27:52 WARN MetricRegistriesImpl:61 - First MetricRegistry has > been created without registering reporters. You may need to call > MetricRegistries.global().addReportRegistration(...) before. 
> Exception in thread "main" java.lang.NoClassDefFoundError: > com/codahale/metrics/jvm/GarbageCollectorMetricSet > at > org.apache.ratis.metrics.JVMMetrics.addJvmMetrics(JVMMetrics.java:42) > at > org.apache.ratis.metrics.JVMMetrics.initJvmMetrics(JVMMetrics.java:32) > at org.apache.ratis.examples.filestore.cli.Server.run(Server.java:60) > at org.apache.ratis.examples.common.Runner.main(Runner.java:58) > Caused by: java.lang.ClassNotFoundException: > com.codahale.metrics.jvm.GarbageCollectorMetricSet > at java.net.URLClassLoader.findClass(URLClassLoader.java:382) > at java.lang.ClassLoader.loadClass(ClassLoader.java:424) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349) > at java.lang.ClassLoader.loadClass(ClassLoader.java:357) > ... 4 more > === Command terminated normally (Mon Nov 11 03:27:52 2019) === {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
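The NoClassDefFoundError above is what an application sees when a class from an optional dependency is referenced but the artifact is not on the classpath. As a minimal sketch (not the RATIS-750 patch), an application could probe for the class before enabling JVM metrics. The class name OptionalMetricsProbe and the method isPresent are hypothetical and are not part of Ratis; only the probed class name is taken from the stack trace.

```java
// Hypothetical helper, not part of Ratis: checks whether an optional class can be loaded.
final class OptionalMetricsProbe {
  // Class named in the stack trace above; it is provided by the optional metrics-jvm artifact.
  static final String GC_METRIC_SET =
      "com.codahale.metrics.jvm.GarbageCollectorMetricSet";

  /** Returns true if the named class can be located by this class's class loader. */
  static boolean isPresent(String className) {
    try {
      // initialize=false: only locate and load the class, do not run static initializers.
      Class.forName(className, false, OptionalMetricsProbe.class.getClassLoader());
      return true;
    } catch (ClassNotFoundException | LinkageError e) {
      return false;
    }
  }

  public static void main(String[] args) {
    if (isPresent(GC_METRIC_SET)) {
      System.out.println("metrics-jvm is on the classpath, JVM metrics can be enabled");
    } else {
      System.out.println("metrics-jvm is missing, skipping JVM metrics");
    }
  }
}
```

Whether such a guard or an explicit compile dependency in the modules that need it is preferable is a design choice; the comment above argues for the explicit dependency.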
[jira] [Assigned] (RATIS-752) Update Ratis thirdparty to 0.3.0
[ https://issues.apache.org/jira/browse/RATIS-752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marton Elek reassigned RATIS-752: - Assignee: Mukul Kumar Singh > Update Ratis thirdparty to 0.3.0 > > > Key: RATIS-752 > URL: https://issues.apache.org/jira/browse/RATIS-752 > Project: Ratis > Issue Type: Bug > Components: thirdparty >Reporter: Mukul Kumar Singh >Assignee: Mukul Kumar Singh >Priority: Major > Attachments: RATIS-752.001.patch > > > This jira updates the ratis thirdparty version to 0.3.0 and also updates the > protobuf.version to 3.10.0 and grpc.version to 1.24.0. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (RATIS-752) Update Ratis thirdparty to 0.3.0
[ https://issues.apache.org/jira/browse/RATIS-752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marton Elek updated RATIS-752: -- Fix Version/s: 0.5.0 > Update Ratis thirdparty to 0.3.0 > > > Key: RATIS-752 > URL: https://issues.apache.org/jira/browse/RATIS-752 > Project: Ratis > Issue Type: Bug > Components: thirdparty >Reporter: Mukul Kumar Singh >Assignee: Mukul Kumar Singh >Priority: Major > Fix For: 0.5.0 > > Attachments: RATIS-752.001.patch > > > This jira updates the ratis thirdparty version to 0.3.0 and also updates the > protobuf.version to 3.10.0 and grpc.version to 1.24.0. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (RATIS-750) Ratis server fails with "java.lang.ClassNotFoundException: com.codahale.metrics.jvm.GarbageCollectorMetricSet"
[ https://issues.apache.org/jira/browse/RATIS-750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16972388#comment-16972388 ] Marton Elek commented on RATIS-750: ---
Thanks for the update, [~clayb]. Yes, I think it gives us more flexibility, as Ratis can be used without metrics-jvm; it's optional. I tested it and it worked well. Committed to master.
> Ratis server fails with "java.lang.ClassNotFoundException: > com.codahale.metrics.jvm.GarbageCollectorMetricSet" > -- > > Key: RATIS-750 > URL: https://issues.apache.org/jira/browse/RATIS-750 > Project: Ratis > Issue Type: Bug > Components: build >Reporter: Clay B. >Assignee: Clay B. >Priority: Major > Attachments: > 0001-RATIS-750.-Ratis-server-fails-with-java.lang.ClassNo.patch, > 0002-RATIS-750.-Ratis-server-fails-with-java.lang.ClassNo.patch > > > In testing the current master, starting the Ratis server via > {{./ratis-examples/src/main/bin/server.sh filestore server --storage $storage > --id $id --peers $peers 2>&1 | \}} I end up with the following failure to > start: > {code:java} > Found > /home/vagrant/incubator-ratis/ratis-examples/target/ratis-examples-0.5.0-SNAPSHOT.jar > 2019-11-11 03:27:52 INFO MetricRegistries:64 - Loaded MetricRegistries class > org.apache.ratis.metrics.impl.MetricRegistriesImpl > 2019-11-11 03:27:52 WARN MetricRegistriesImpl:61 - First MetricRegistry has > been created without registering reporters. You may need to call > MetricRegistries.global().addReportRegistration(...) before. 
> Exception in thread "main" java.lang.NoClassDefFoundError: > com/codahale/metrics/jvm/GarbageCollectorMetricSet > at > org.apache.ratis.metrics.JVMMetrics.addJvmMetrics(JVMMetrics.java:42) > at > org.apache.ratis.metrics.JVMMetrics.initJvmMetrics(JVMMetrics.java:32) > at org.apache.ratis.examples.filestore.cli.Server.run(Server.java:60) > at org.apache.ratis.examples.common.Runner.main(Runner.java:58) > Caused by: java.lang.ClassNotFoundException: > com.codahale.metrics.jvm.GarbageCollectorMetricSet > at java.net.URLClassLoader.findClass(URLClassLoader.java:382) > at java.lang.ClassLoader.loadClass(ClassLoader.java:424) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349) > at java.lang.ClassLoader.loadClass(ClassLoader.java:357) > ... 4 more > === Command terminated normally (Mon Nov 11 03:27:52 2019) === {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (RATIS-702) Make metrics reporting implementation pluggable
[ https://issues.apache.org/jira/browse/RATIS-702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marton Elek reassigned RATIS-702: - Assignee: Marton Elek > Make metrics reporting implementation pluggable > --- > > Key: RATIS-702 > URL: https://issues.apache.org/jira/browse/RATIS-702 > Project: Ratis > Issue Type: Wish > Components: metrics >Reporter: Henrik Hegardt >Assignee: Marton Elek >Priority: Major > > It would be really nice if the metrics functionality also was pluggable so > one could choose how to report metrics. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (RATIS-702) Make metrics reporting implementation pluggable
[ https://issues.apache.org/jira/browse/RATIS-702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16945977#comment-16945977 ] Marton Elek commented on RATIS-702: ---
Thank you very much, [~hheg], for reporting this issue. I totally agree. The reporters should be registered by the user of the Ratis library.
> Make metrics reporting implementation pluggable > --- > > Key: RATIS-702 > URL: https://issues.apache.org/jira/browse/RATIS-702 > Project: Ratis > Issue Type: Wish > Components: metrics >Reporter: Henrik Hegardt >Assignee: Marton Elek >Priority: Major > > It would be really nice if the metrics functionality also was pluggable so > one could choose how to report metrics.
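To illustrate the idea that reporters are supplied by the user of the library rather than hard-wired inside it, here is a minimal, library-free sketch. It is not the Ratis metrics API and not the RATIS-702 patch: ReporterRegistrySketch, addReporter, increment and report are hypothetical names, and a plain Map of counters stands in for a real metric registry.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

// Hypothetical sketch: the library owns the counters, the application plugs in reporters.
final class ReporterRegistrySketch {
  private final List<Consumer<Map<String, Long>>> reporters = new CopyOnWriteArrayList<>();
  private final Map<String, Long> counters = new ConcurrentHashMap<>();

  /** Called by the application to decide how metrics are reported. */
  void addReporter(Consumer<Map<String, Long>> reporter) {
    reporters.add(reporter);
  }

  /** Called by library code to record an event. */
  void increment(String name) {
    counters.merge(name, 1L, Long::sum);
  }

  /** Hands an immutable snapshot of all counters to every registered reporter. */
  void report() {
    Map<String, Long> snapshot = Collections.unmodifiableMap(new HashMap<>(counters));
    for (Consumer<Map<String, Long>> reporter : reporters) {
      reporter.accept(snapshot);
    }
  }

  public static void main(String[] args) {
    ReporterRegistrySketch registry = new ReporterRegistrySketch();
    // The application chooses the reporting backend; here it simply prints to stdout.
    registry.addReporter(snapshot -> System.out.println("metrics: " + snapshot));
    registry.increment("appendEntries");
    registry.report();
  }
}
```

With this shape the library never needs to depend on a specific reporter implementation; an application that wants no reporting simply registers nothing.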
[jira] [Updated] (RATIS-702) Make metrics reporting implementation pluggable
[ https://issues.apache.org/jira/browse/RATIS-702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marton Elek updated RATIS-702: -- Attachment: RATIS-702.001.patch > Make metrics reporting implementation pluggable > --- > > Key: RATIS-702 > URL: https://issues.apache.org/jira/browse/RATIS-702 > Project: Ratis > Issue Type: Wish > Components: metrics >Reporter: Henrik Hegardt >Assignee: Marton Elek >Priority: Major > Attachments: RATIS-702.001.patch > > Time Spent: 50m > Remaining Estimate: 0h > > It would be really nice if the metrics functionality also was pluggable so > one could choose how to report metrics. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (RATIS-702) Make metrics reporting implementation pluggable
[ https://issues.apache.org/jira/browse/RATIS-702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16948429#comment-16948429 ] Marton Elek commented on RATIS-702: ---
Oh, sorry if I misunderstood the goal of this jira. Making Ratis metrics totally vendor independent is a bigger task, as we have Dropwizard interfaces in our interfaces. Supporting both Dropwizard 3 and 4 seems to be easier. As far as I can see, after this patch only JVMMetrics depends on Dropwizard metrics 3 and all the other interfaces are compatible. I made the JVM and Ganglia related dependencies optional. I haven't tested it (yet), but I think that if you use the metrics library and bump the version of the Dropwizard dependencies, it should work with 4.
> Make metrics reporting implementation pluggable > --- > > Key: RATIS-702 > URL: https://issues.apache.org/jira/browse/RATIS-702 > Project: Ratis > Issue Type: Wish > Components: metrics >Reporter: Henrik Hegardt >Assignee: Marton Elek >Priority: Major > Attachments: RATIS-702.001.patch > > Time Spent: 50m > Remaining Estimate: 0h > > It would be really nice if the metrics functionality also was pluggable so > one could choose how to report metrics.
[jira] [Updated] (RATIS-702) Make metrics reporting implementation pluggable
[ https://issues.apache.org/jira/browse/RATIS-702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marton Elek updated RATIS-702: -- Attachment: RATIS-702.002.patch > Make metrics reporting implementation pluggable > --- > > Key: RATIS-702 > URL: https://issues.apache.org/jira/browse/RATIS-702 > Project: Ratis > Issue Type: Wish > Components: metrics >Reporter: Henrik Hegardt >Assignee: Marton Elek >Priority: Major > Attachments: RATIS-702.001.patch, RATIS-702.002.patch > > Time Spent: 50m > Remaining Estimate: 0h > > It would be really nice if the metrics functionality also was pluggable so > one could choose how to report metrics. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (RATIS-697) Provide helper scripts for code quality checks
[ https://issues.apache.org/jira/browse/RATIS-697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16948461#comment-16948461 ] Marton Elek commented on RATIS-697: --- I just enabled github actions as a test right inside this pull request. > Provide helper scripts for code quality checks > -- > > Key: RATIS-697 > URL: https://issues.apache.org/jira/browse/RATIS-697 > Project: Ratis > Issue Type: Improvement >Reporter: Marton Elek >Assignee: Marton Elek >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > In Ozone we started to use simple shell scripts to check the quality of the > code. (dev-support/check/checkstyle.sh). They help us to execute local maven > commands quickly and collect all of the results. > They also help us to use github-actions or other highly parallel CI in the > future. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (RATIS-706) Dead lock in GrpcClientRpc
[ https://issues.apache.org/jira/browse/RATIS-706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marton Elek updated RATIS-706: -- Attachment: jstack.txt > Dead lock in GrpcClientRpc > -- > > Key: RATIS-706 > URL: https://issues.apache.org/jira/browse/RATIS-706 > Project: Ratis > Issue Type: Bug > Components: gRPC >Reporter: Marton Elek >Priority: Major > Attachments: jstack.txt > > > I started an Ozone cluster on Kubernetes and started a freon test (ozone > freon ockg -n1) > After a while I found that the one freon instance is not creating keys any > more. I checked the om RPC endpoint with ozone insight and no RPC messages > has been arrived. > Based on the jstack output we have a deadlock between > PeerProxyMap.handleException and GrpcClientRpc.sendRequestAsync. > I am not sure (yet) what is the exact problem, but based on the stack traces > It seems to be Ratis related. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (RATIS-706) Dead lock in GrpcClientRpc
Marton Elek created RATIS-706:

Summary: Dead lock in GrpcClientRpc
Key: RATIS-706
URL: https://issues.apache.org/jira/browse/RATIS-706
Project: Ratis
Issue Type: Bug
Components: gRPC
Reporter: Marton Elek
Attachments: jstack.txt

I started an Ozone cluster on Kubernetes and started a freon test (ozone freon ockg -n1).

After a while I found that one freon instance was not creating keys any more. I checked the OM RPC endpoint with ozone insight and no RPC messages had arrived.

Based on the jstack output we have a deadlock between PeerProxyMap.handleException and GrpcClientRpc.sendRequestAsync.

I am not sure (yet) what the exact problem is, but based on the stack traces it seems to be Ratis related.
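The report above relies on jstack output to identify the deadlock. As a hedged sketch, the same kind of check can be done from inside a JVM with the standard java.lang.management API, which can be handy in a test or a health check. DeadlockProbe and describeDeadlocks are hypothetical names; nothing here is taken from Ratis or Ozone.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: reports threads that the JVM considers deadlocked.
final class DeadlockProbe {
  /** Returns one description per deadlocked thread, or an empty list if there is none. */
  static List<String> describeDeadlocks() {
    ThreadMXBean threads = ManagementFactory.getThreadMXBean();
    // Covers object monitors and ownable synchronizers; returns null when nothing is deadlocked.
    long[] ids = threads.findDeadlockedThreads();
    List<String> result = new ArrayList<>();
    if (ids == null) {
      return result;
    }
    for (ThreadInfo info : threads.getThreadInfo(ids)) {
      if (info == null) {
        continue; // the thread terminated in the meantime
      }
      result.add(info.getThreadName() + " waits for " + info.getLockName()
          + " held by " + info.getLockOwnerName());
    }
    return result;
  }

  public static void main(String[] args) {
    List<String> deadlocks = describeDeadlocks();
    if (deadlocks.isEmpty()) {
      System.out.println("no deadlock detected");
    } else {
      deadlocks.forEach(System.out::println);
    }
  }
}
```

This only detects a deadlock that already exists; it does not explain which lock ordering caused it, so the stack traces are still needed for diagnosis.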
[jira] [Commented] (RATIS-706) Dead lock in GrpcClientRpc
[ https://issues.apache.org/jira/browse/RATIS-706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16951107#comment-16951107 ] Marton Elek commented on RATIS-706: ---
Thanks for the quick fix, [~szetszwo]. I deployed it and it worked well; I couldn't see the deadlock any more. I am not sure I am the right person to judge the code, but I will try to review it if nobody else does.
> Dead lock in GrpcClientRpc > -- > > Key: RATIS-706 > URL: https://issues.apache.org/jira/browse/RATIS-706 > Project: Ratis > Issue Type: Bug > Components: gRPC >Reporter: Marton Elek >Assignee: Tsz-wo Sze >Priority: Major > Attachments: jstack.txt, r706_20191011.patch, r706_20191011b.patch > > > I started an Ozone cluster on Kubernetes and started a freon test (ozone > freon ockg -n1) > After a while I found that the one freon instance is not creating keys any > more. I checked the om RPC endpoint with ozone insight and no RPC messages > has been arrived. > Based on the jstack output we have a deadlock between > PeerProxyMap.handleException and GrpcClientRpc.sendRequestAsync. > I am not sure (yet) what is the exact problem, but based on the stack traces > It seems to be Ratis related. 
> {code} > Found one Java-level deadlock: > = > "pool-2-thread-6": > waiting to lock monitor 0x7f80356c8800 (object 0x00033eb70a00, a > java.lang.Object), > which is held by > "java.util.concurrent.ThreadPoolExecutor$Worker@77329f41[State = -1, empty > queue]" > "java.util.concurrent.ThreadPoolExecutor$Worker@77329f41[State = -1, empty > queue]": > waiting to lock monitor 0x01170980 (object 0x00033eb99b10, a > org.apache.ratis.util.SlidingWindow$Client), > which is held by > "java.util.concurrent.ThreadPoolExecutor$Worker@df368f8[State = -1, empty > queue]" > "java.util.concurrent.ThreadPoolExecutor$Worker@df368f8[State = -1, empty > queue]": > waiting to lock monitor 0x7f80356c8800 (object 0x00033eb70a00, a > java.lang.Object), > which is held by > "java.util.concurrent.ThreadPoolExecutor$Worker@77329f41[State = -1, empty > queue]" > Java stack information for the threads listed above: > === > "pool-2-thread-6": > at org.apache.ratis.util.PeerProxyMap.getProxy(PeerProxyMap.java:103) > - waiting to lock <0x00033eb70a00> (a java.lang.Object) > at > org.apache.ratis.grpc.client.GrpcClientRpc.sendRequestAsyncUnordered(GrpcClientRpc.java:78) > at > org.apache.ratis.client.impl.UnorderedAsync.sendRequestWithRetry(UnorderedAsync.java:75) > at > org.apache.ratis.client.impl.UnorderedAsync.send(UnorderedAsync.java:59) > at > org.apache.ratis.client.impl.RaftClientImpl.sendWatchAsync(RaftClientImpl.java:139) > at > org.apache.hadoop.hdds.scm.XceiverClientRatis.watchForCommit(XceiverClientRatis.java:282) > at > org.apache.hadoop.hdds.scm.storage.CommitWatcher.watchForCommit(CommitWatcher.java:198) > at > org.apache.hadoop.hdds.scm.storage.CommitWatcher.watchOnLastIndex(CommitWatcher.java:161) > at > org.apache.hadoop.hdds.scm.storage.BlockOutputStream.watchForCommit(BlockOutputStream.java:346) > at > org.apache.hadoop.hdds.scm.storage.BlockOutputStream.handleFlush(BlockOutputStream.java:482) > at > 
org.apache.hadoop.hdds.scm.storage.BlockOutputStream.close(BlockOutputStream.java:496) > at > org.apache.hadoop.ozone.client.io.BlockOutputStreamEntry.close(BlockOutputStreamEntry.java:143) > at > org.apache.hadoop.ozone.client.io.KeyOutputStream.handleFlushOrClose(KeyOutputStream.java:435) > at > org.apache.hadoop.ozone.client.io.KeyOutputStream.close(KeyOutputStream.java:473) > at > org.apache.hadoop.ozone.client.io.OzoneOutputStream.close(OzoneOutputStream.java:60) > - locked <0x0003f2ba4240> (a > org.apache.hadoop.ozone.client.io.OzoneOutputStream) > at > org.apache.hadoop.ozone.freon.RandomKeyGenerator.createKey(RandomKeyGenerator.java:710) > at > org.apache.hadoop.ozone.freon.RandomKeyGenerator.access$1100(RandomKeyGenerator.java:88) > at > org.apache.hadoop.ozone.freon.RandomKeyGenerator$ObjectCreator.run(RandomKeyGenerator.java:615) > at > java.util.concurrent.Executors$RunnableAdapter.call(java.base@11.0.3/Executors.java:515) > at > java.util.concurrent.FutureTask.run(java.base@11.0.3/FutureTask.java:264) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.3/ThreadPoolExecutor.java:1128) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.3/ThreadPoolExecutor.java:628) > at java.lang.Thread.run(java.base@11.0.3/Thread.java:834) > "java.util.concurrent.ThreadPoolExecutor$Worker@77329f41[State = -1, empty > queue]": > at >
[jira] [Created] (RATIS-820) Use https for the maven repositories
Marton Elek created RATIS-820: - Summary: Use https for the maven repositories Key: RATIS-820 URL: https://issues.apache.org/jira/browse/RATIS-820 Project: Ratis Issue Type: Improvement Reporter: Marton Elek As reported here: https://github.com/apache/incubator-ratis/pull/53 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (RATIS-820) Use https for the maven repositories
[ https://issues.apache.org/jira/browse/RATIS-820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marton Elek resolved RATIS-820. --- Fix Version/s: 0.6.0 Resolution: Fixed > Use https for the maven repositories > > > Key: RATIS-820 > URL: https://issues.apache.org/jira/browse/RATIS-820 > Project: Ratis > Issue Type: Improvement >Reporter: Marton Elek >Priority: Major > Fix For: 0.6.0 > > > As reported here: https://github.com/apache/incubator-ratis/pull/53 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (RATIS-804) Race condition between cache evict and load in LogSegment
[ https://issues.apache.org/jira/browse/RATIS-804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17025848#comment-17025848 ] Marton Elek commented on RATIS-804: ---
The problematic segment is this (in CacheInvalidationPolicy.java):
{code:java}
if (result.isEmpty()) {
  for (int i = safeIndex; i >= j; i--) {
    LogSegment s = segments.get(i);
    if (s.getStartIndex() > lastAppliedIndex && s.hasCache()) {
      result.add(s);
      break;
    }
  }
}
{code}
This is the last step of the algorithm. The evictImpl:
# First checks which segments are not flushed yet. They should be kept.
# (In case of a follower) checks which segments are already applied.
# (In case of a follower, and if there are no segments to remove up to this point) *removes the segments between the lastAppliedIndex and the localFlushIndex*, in the hope that they can be loaded again at any time. They can, but only with locks.
> Race condition between cache evict and load in LogSegment > - > > Key: RATIS-804 > URL: https://issues.apache.org/jira/browse/RATIS-804 > Project: Ratis > Issue Type: Bug >Reporter: Marton Elek >Priority: Critical > > I am doing some kind of stress testing with Ozone. I start one Datanode in > FOLLOWER mode and the load generator (Freon) behaves like a LEADER. > I am sending huge number of AppendLogEntries to the FOLLOWER without > inhibitions. 
> As a result I got NPE: > {code:java} > 2020-01-28 15:08:20 ERROR StateMachineUpdater:184 - > 3fda0c39-ce3c-4540-a804-44d9ac1f4853@group-E1B13B4CA5C0-StateMachineUpdater: > the StateMachineUp > dater hits Throwable > org.apache.ratis.server.raftlog.RaftLogIOException: > java.lang.NullPointerException > at > org.apache.ratis.server.raftlog.segmented.LogSegment.loadCache(LogSegment.java:320) > at > org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.get(SegmentedRaftLog.java:293) > at > org.apache.ratis.server.impl.StateMachineUpdater.applyLog(StateMachineUpdater.java:218) > at > org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:167) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.lang.NullPointerException > at java.util.Objects.requireNonNull(Objects.java:203) > at > org.apache.ratis.server.raftlog.segmented.LogSegment$LogEntryLoader.load(LogSegment.java:214) > at > org.apache.ratis.server.raftlog.segmented.LogSegment.loadCache(LogSegment.java:318) > ... 4 more {code} > It seems to be a race condition between LogSegment.evictCache() and > LogSegment.loadCache(). > # StateMachineUpdater tries to update the StateMachine with the next log > entry > # It can't be found in the cache, therefore the LogSegment.loadCache() is > called > # The LogSegment.LogEntryLoader.load() reads the segment files from the disk > # After loading, it returns with the loaded entry > If the GRPC thread evicts the cache between 3 and 4. (it's possible that the > log segment is already flushed, therefore can be evicted) an NPE will be > thrown. -- This message was sent by Atlassian Jira (v8.3.4#803005)
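The remark "only with locks" in the comment above can be illustrated with a small, self-contained sketch. This is not the code of LogSegment and not the r804 patch; SegmentCacheSketch and its Supplier-based loader are hypothetical stand-ins. The point is only that load and evict share one lock, and that the loader returns the value it just read instead of re-reading a field that another thread may have cleared.

```java
import java.util.Objects;
import java.util.function.Supplier;

// Hypothetical sketch of a cache whose load and evict operations cannot interleave.
final class SegmentCacheSketch<T> {
  private final Supplier<T> loader; // stands in for reading the segment file from disk
  private T cached;                 // guarded by "this"

  SegmentCacheSketch(Supplier<T> loader) {
    this.loader = Objects.requireNonNull(loader, "loader");
  }

  /** Loads the value if needed and returns it; an evict cannot run between load and return. */
  synchronized T loadCache() {
    if (cached == null) {
      cached = Objects.requireNonNull(loader.get(), "loaded value");
    }
    return cached;
  }

  /** Drops the cached value; a later loadCache() reads it again. */
  synchronized void evictCache() {
    cached = null;
  }

  synchronized boolean hasCache() {
    return cached != null;
  }

  public static void main(String[] args) {
    SegmentCacheSketch<String> cache = new SegmentCacheSketch<>(() -> "entry-42");
    System.out.println(cache.loadCache());
    cache.evictCache();
    System.out.println(cache.hasCache());
  }
}
```

A coarse lock like this trades some concurrency for simplicity; a finer-grained variant would need the same guarantee that the value handed back by the load is the one that was just read.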
[jira] [Commented] (RATIS-804) Race condition between cache evict and load in LogSegment
[ https://issues.apache.org/jira/browse/RATIS-804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032334#comment-17032334 ] Marton Elek commented on RATIS-804: ---
{quote}[~elek], would you mind testing the patch? {quote}
Sure, thanks for the patch. I have just started a new build to deploy and test.
> Race condition between cache evict and load in LogSegment > - > > Key: RATIS-804 > URL: https://issues.apache.org/jira/browse/RATIS-804 > Project: Ratis > Issue Type: Bug > Components: server >Reporter: Marton Elek >Assignee: Tsz-wo Sze >Priority: Critical > Attachments: r804_20200205.patch > > > I am doing some kind of stress testing with Ozone. I start one Datanode in > FOLLOWER mode and the load generator (Freon) behaves like a LEADER. > I am sending huge number of AppendLogEntries to the FOLLOWER without > inhibitions. > As a result I got NPE: > {code:java} > 2020-01-28 15:08:20 ERROR StateMachineUpdater:184 - > 3fda0c39-ce3c-4540-a804-44d9ac1f4853@group-E1B13B4CA5C0-StateMachineUpdater: > the StateMachineUp > dater hits Throwable > org.apache.ratis.server.raftlog.RaftLogIOException: > java.lang.NullPointerException > at > org.apache.ratis.server.raftlog.segmented.LogSegment.loadCache(LogSegment.java:320) > at > org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.get(SegmentedRaftLog.java:293) > at > org.apache.ratis.server.impl.StateMachineUpdater.applyLog(StateMachineUpdater.java:218) > at > org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:167) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.lang.NullPointerException > at java.util.Objects.requireNonNull(Objects.java:203) > at > org.apache.ratis.server.raftlog.segmented.LogSegment$LogEntryLoader.load(LogSegment.java:214) > at > org.apache.ratis.server.raftlog.segmented.LogSegment.loadCache(LogSegment.java:318) > ... 4 more {code} > It seems to be a race condition between LogSegment.evictCache() and > LogSegment.loadCache(). 
> # StateMachineUpdater tries to update the StateMachine with the next log > entry > # It can't be found in the cache, therefore the LogSegment.loadCache() is > called > # The LogSegment.LogEntryLoader.load() reads the segment files from the disk > # After loading, it returns with the loaded entry > If the GRPC thread evicts the cache between 3 and 4. (it's possible that the > log segment is already flushed, therefore can be evicted) an NPE will be > thrown. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (RATIS-804) Race condition between cache evict and load in LogSegment
[ https://issues.apache.org/jira/browse/RATIS-804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032519#comment-17032519 ] Marton Elek commented on RATIS-804: --- +1 I tested and couldn't reproduce the Exception any more. (In fact it's very hard to reproduce with a properly configured client. I used a specific client which doesn't close the GRPC requests. With the fixed client, it's very hard to see the Exception during real tests...) > Race condition between cache evict and load in LogSegment > - > > Key: RATIS-804 > URL: https://issues.apache.org/jira/browse/RATIS-804 > Project: Ratis > Issue Type: Bug > Components: server >Reporter: Marton Elek >Assignee: Tsz-wo Sze >Priority: Critical > Attachments: r804_20200205.patch > > > I am doing some kind of stress testing with Ozone. I start one Datanode in > FOLLOWER mode and the load generator (Freon) behaves like a LEADER. > I am sending huge number of AppendLogEntries to the FOLLOWER without > inhibitions. > As a result I got NPE: > {code:java} > 2020-01-28 15:08:20 ERROR StateMachineUpdater:184 - > 3fda0c39-ce3c-4540-a804-44d9ac1f4853@group-E1B13B4CA5C0-StateMachineUpdater: > the StateMachineUp > dater hits Throwable > org.apache.ratis.server.raftlog.RaftLogIOException: > java.lang.NullPointerException > at > org.apache.ratis.server.raftlog.segmented.LogSegment.loadCache(LogSegment.java:320) > at > org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.get(SegmentedRaftLog.java:293) > at > org.apache.ratis.server.impl.StateMachineUpdater.applyLog(StateMachineUpdater.java:218) > at > org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:167) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.lang.NullPointerException > at java.util.Objects.requireNonNull(Objects.java:203) > at > org.apache.ratis.server.raftlog.segmented.LogSegment$LogEntryLoader.load(LogSegment.java:214) > at > 
org.apache.ratis.server.raftlog.segmented.LogSegment.loadCache(LogSegment.java:318) > ... 4 more {code} > It seems to be a race condition between LogSegment.evictCache() and > LogSegment.loadCache(). > # StateMachineUpdater tries to update the StateMachine with the next log > entry > # It can't be found in the cache, therefore the LogSegment.loadCache() is > called > # The LogSegment.LogEntryLoader.load() reads the segment files from the disk > # After loading, it returns with the loaded entry > If the GRPC thread evicts the cache between 3 and 4. (it's possible that the > log segment is already flushed, therefore can be evicted) an NPE will be > thrown. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (RATIS-804) Race condition between cache evict and load in LogSegment
Marton Elek created RATIS-804:

Summary: Race condition between cache evict and load in LogSegment
Key: RATIS-804
URL: https://issues.apache.org/jira/browse/RATIS-804
Project: Ratis
Issue Type: Bug
Reporter: Marton Elek

I am doing some stress testing with Ozone. I start one Datanode in FOLLOWER mode and the load generator (Freon) behaves like a LEADER.

I am sending a huge number of AppendLogEntries to the FOLLOWER without any limit.

As a result I got an NPE:
{code:java}
2020-01-28 15:08:20 ERROR StateMachineUpdater:184 - 3fda0c39-ce3c-4540-a804-44d9ac1f4853@group-E1B13B4CA5C0-StateMachineUpdater: the StateMachineUpdater hits Throwable
org.apache.ratis.server.raftlog.RaftLogIOException: java.lang.NullPointerException
	at org.apache.ratis.server.raftlog.segmented.LogSegment.loadCache(LogSegment.java:320)
	at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.get(SegmentedRaftLog.java:293)
	at org.apache.ratis.server.impl.StateMachineUpdater.applyLog(StateMachineUpdater.java:218)
	at org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:167)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
	at java.util.Objects.requireNonNull(Objects.java:203)
	at org.apache.ratis.server.raftlog.segmented.LogSegment$LogEntryLoader.load(LogSegment.java:214)
	at org.apache.ratis.server.raftlog.segmented.LogSegment.loadCache(LogSegment.java:318)
	... 4 more
{code}
It seems to be a race condition between LogSegment.evictCache() and LogSegment.loadCache().
# StateMachineUpdater tries to update the StateMachine with the next log entry
# It can't be found in the cache, therefore LogSegment.loadCache() is called
# LogSegment.LogEntryLoader.load() reads the segment files from the disk
# After loading, it returns the loaded entry

If the gRPC thread evicts the cache between steps 3 and 4 (which is possible, because the log segment is already flushed and can therefore be evicted), an NPE is thrown.
[jira] [Created] (RATIS-816) Use peerId in error log / exception of GrpcServerProtocolClient
Marton Elek created RATIS-816:

Summary: Use peerId in error log / exception of GrpcServerProtocolClient
Key: RATIS-816
URL: https://issues.apache.org/jira/browse/RATIS-816
Project: Ratis
Issue Type: Improvement
Reporter: Marton Elek

GrpcServerProtocolClient is used to send out requestVote and appendLogEntry requests.

I propose to store the raftPeerId in the constructor and use it in the error / exception message.

This not only gives a more meaningful message (which is nice to have); in HDDS-3023 I am also modifying the byte code to mock the leader->follower communication, and that is much easier to do if the required raftPeerId is available in the class.
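The proposal above can be sketched in a few lines. This is not GrpcServerProtocolClient and not the RATIS-816 patch: PeerAwareClientSketch, wrap and the plain String peer id are hypothetical, chosen only to show the id being captured once in the constructor and reused in every error message.

```java
import java.io.IOException;
import java.util.Objects;

// Hypothetical sketch: keep the target peer id in the client and include it in errors.
final class PeerAwareClientSketch {
  private final String raftPeerId;

  PeerAwareClientSketch(String raftPeerId) {
    this.raftPeerId = Objects.requireNonNull(raftPeerId, "raftPeerId");
  }

  String getRaftPeerId() {
    return raftPeerId;
  }

  /** Wraps a low-level failure so that the message names the operation and the target peer. */
  IOException wrap(String operation, Throwable cause) {
    return new IOException(operation + " to peer " + raftPeerId + " failed: " + cause, cause);
  }

  public static void main(String[] args) {
    PeerAwareClientSketch client = new PeerAwareClientSketch("s1");
    IOException e = client.wrap("requestVote", new IllegalStateException("channel closed"));
    System.out.println(e.getMessage());
  }
}
```

Having the id as a field also means that tooling which inspects or instruments the class can read it directly instead of reconstructing it from the call site.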
[jira] [Updated] (RATIS-816) Use peerId in error log / exception of GrpcServerProtocolClient
[ https://issues.apache.org/jira/browse/RATIS-816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marton Elek updated RATIS-816: -- Attachment: RATIS-816.001.patch > Use peerId in error log / exception of GrpcServerProtocolClient > --- > > Key: RATIS-816 > URL: https://issues.apache.org/jira/browse/RATIS-816 > Project: Ratis > Issue Type: Improvement >Reporter: Marton Elek >Assignee: Marton Elek >Priority: Major > Attachments: RATIS-816.001.patch > > > GrpcServerProtocolClient is used to send out requestVote and appendLogEntry > requests. > I propose to persist raftPeerId in the constructor and use it in the error / > exception message. > This is not just getting more meaningful message (it's a nice to have) but in > HDDS-3023 I am modifying the byte code to mock the leader->follower > communication. It's way more easier to do if the required raftPeerId is > available in the class. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (RATIS-816) Use peerId in error log / exception of GrpcServerProtocolClient
[ https://issues.apache.org/jira/browse/RATIS-816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marton Elek reassigned RATIS-816: - Assignee: Marton Elek > Use peerId in error log / exception of GrpcServerProtocolClient > --- > > Key: RATIS-816 > URL: https://issues.apache.org/jira/browse/RATIS-816 > Project: Ratis > Issue Type: Improvement >Reporter: Marton Elek >Assignee: Marton Elek >Priority: Major > > GrpcServerProtocolClient is used to send out requestVote and appendLogEntry > requests. > I propose to persist raftPeerId in the constructor and use it in the error / > exception message. > This is not just getting more meaningful message (it's a nice to have) but in > HDDS-3023 I am modifying the byte code to mock the leader->follower > communication. It's way more easier to do if the required raftPeerId is > available in the class. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (RATIS-816) Use peerId in error log / exception of GrpcServerProtocolClient
[ https://issues.apache.org/jira/browse/RATIS-816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marton Elek updated RATIS-816: -- Fix Version/s: 0.6.0 > Use peerId in error log / exception of GrpcServerProtocolClient > --- > > Key: RATIS-816 > URL: https://issues.apache.org/jira/browse/RATIS-816 > Project: Ratis > Issue Type: Improvement >Reporter: Marton Elek >Assignee: Marton Elek >Priority: Major > Fix For: 0.6.0 > > Attachments: RATIS-816.001.patch, RATIS-816.002.patch, > RATIS-816.003.patch, RATIS-816.004.patch > > > GrpcServerProtocolClient is used to send out requestVote and appendLogEntry > requests. > I propose to persist raftPeerId in the constructor and use it in the error / > exception message. > This is not just getting more meaningful message (it's a nice to have) but in > HDDS-3023 I am modifying the byte code to mock the leader->follower > communication. It's way more easier to do if the required raftPeerId is > available in the class. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (RATIS-816) Use peerId in error log / exception of GrpcServerProtocolClient
[ https://issues.apache.org/jira/browse/RATIS-816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17057810#comment-17057810 ] Marton Elek commented on RATIS-816: --- Thanks for the review, [~msingh] and [~arp]. The Checkstyle problem is fixed; I am merging it to master. > Use peerId in error log / exception of GrpcServerProtocolClient > --- > > Key: RATIS-816 > URL: https://issues.apache.org/jira/browse/RATIS-816 > Project: Ratis > Issue Type: Improvement >Reporter: Marton Elek >Assignee: Marton Elek >Priority: Major > Attachments: RATIS-816.001.patch, RATIS-816.002.patch, > RATIS-816.003.patch, RATIS-816.004.patch > > > GrpcServerProtocolClient is used to send out requestVote and appendLogEntry > requests. > I propose to persist raftPeerId in the constructor and use it in the error / > exception message. > This is not just getting more meaningful message (it's a nice to have) but in > HDDS-3023 I am modifying the byte code to mock the leader->follower > communication. It's way more easier to do if the required raftPeerId is > available in the class.
[jira] [Updated] (RATIS-816) Use peerId in error log / exception of GrpcServerProtocolClient
[ https://issues.apache.org/jira/browse/RATIS-816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marton Elek updated RATIS-816: -- Attachment: RATIS-816.004.patch > Use peerId in error log / exception of GrpcServerProtocolClient > --- > > Key: RATIS-816 > URL: https://issues.apache.org/jira/browse/RATIS-816 > Project: Ratis > Issue Type: Improvement >Reporter: Marton Elek >Assignee: Marton Elek >Priority: Major > Attachments: RATIS-816.001.patch, RATIS-816.002.patch, > RATIS-816.003.patch, RATIS-816.004.patch > > > GrpcServerProtocolClient is used to send out requestVote and appendLogEntry > requests. > I propose to persist raftPeerId in the constructor and use it in the error / > exception message. > This is not just getting more meaningful message (it's a nice to have) but in > HDDS-3023 I am modifying the byte code to mock the leader->follower > communication. It's way more easier to do if the required raftPeerId is > available in the class. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (RATIS-816) Use peerId in error log / exception of GrpcServerProtocolClient
[ https://issues.apache.org/jira/browse/RATIS-816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marton Elek updated RATIS-816: -- Attachment: RATIS-816.002.patch > Use peerId in error log / exception of GrpcServerProtocolClient > --- > > Key: RATIS-816 > URL: https://issues.apache.org/jira/browse/RATIS-816 > Project: Ratis > Issue Type: Improvement >Reporter: Marton Elek >Assignee: Marton Elek >Priority: Major > Attachments: RATIS-816.001.patch, RATIS-816.002.patch > > > GrpcServerProtocolClient is used to send out requestVote and appendLogEntry > requests. > I propose to persist raftPeerId in the constructor and use it in the error / > exception message. > This is not just getting more meaningful message (it's a nice to have) but in > HDDS-3023 I am modifying the byte code to mock the leader->follower > communication. It's way more easier to do if the required raftPeerId is > available in the class. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (RATIS-816) Use peerId in error log / exception of GrpcServerProtocolClient
[ https://issues.apache.org/jira/browse/RATIS-816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17057015#comment-17057015 ] Marton Elek commented on RATIS-816: --- Sure, uploaded in the 2nd patch. > Use peerId in error log / exception of GrpcServerProtocolClient > --- > > Key: RATIS-816 > URL: https://issues.apache.org/jira/browse/RATIS-816 > Project: Ratis > Issue Type: Improvement >Reporter: Marton Elek >Assignee: Marton Elek >Priority: Major > Attachments: RATIS-816.001.patch, RATIS-816.002.patch > > > GrpcServerProtocolClient is used to send out requestVote and appendLogEntry > requests. > I propose to persist raftPeerId in the constructor and use it in the error / > exception message. > This is not just getting more meaningful message (it's a nice to have) but in > HDDS-3023 I am modifying the byte code to mock the leader->follower > communication. It's way more easier to do if the required raftPeerId is > available in the class. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (RATIS-816) Use peerId in error log / exception of GrpcServerProtocolClient
[ https://issues.apache.org/jira/browse/RATIS-816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marton Elek updated RATIS-816: -- Attachment: RATIS-816.003.patch > Use peerId in error log / exception of GrpcServerProtocolClient > --- > > Key: RATIS-816 > URL: https://issues.apache.org/jira/browse/RATIS-816 > Project: Ratis > Issue Type: Improvement >Reporter: Marton Elek >Assignee: Marton Elek >Priority: Major > Attachments: RATIS-816.001.patch, RATIS-816.002.patch, > RATIS-816.003.patch > > > GrpcServerProtocolClient is used to send out requestVote and appendLogEntry > requests. > I propose to persist raftPeerId in the constructor and use it in the error / > exception message. > This is not just getting more meaningful message (it's a nice to have) but in > HDDS-3023 I am modifying the byte code to mock the leader->follower > communication. It's way more easier to do if the required raftPeerId is > available in the class. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (RATIS-827) Ratis example is leaking to the ratis-tools classpath
Marton Elek created RATIS-827: - Summary: Ratis example is leaking to the ratis-tools classpath Key: RATIS-827 URL: https://issues.apache.org/jira/browse/RATIS-827 Project: Ratis Issue Type: Improvement Reporter: Marton Elek Assignee: Marton Elek
ratis-tools depends on the ratis-example project, which means that every project using ratis-tools can get unexpected dependencies from the example project. For example, I see the following in Ozone:
{code}
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hadoop/share/ozone/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop/share/ozone/lib/ratis-examples-0.6.0-a320ae0-SNAPSHOT.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
{code}
I propose to move the example-dependent tools implementation to the example project and make the example project depend on the tools instead of the opposite direction.
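The SLF4J warning above is triggered when the same binder class is visible from more than one jar. A small check along the following lines could list where a class-path resource comes from; it is only a sketch, the class name `BindingScanSketch` and its methods are hypothetical, and the resource path is taken from the log output quoted in the report (it applies to the SLF4J 1.7.x style of binding shown there).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper that lists every class-path location providing a resource. */
class BindingScanSketch {
  // Resource path as it appears in the SLF4J warning quoted in the report.
  static final String SLF4J_BINDER = "org/slf4j/impl/StaticLoggerBinder.class";

  /** Returns all URLs (typically jar entries) that contain the given resource. */
  static List<URL> findOnClasspath(String resourcePath) {
    try {
      ClassLoader loader = BindingScanSketch.class.getClassLoader();
      return Collections.list(loader.getResources(resourcePath));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** True when more than one jar ships an SLF4J binding, as in the log above. */
  static boolean hasDuplicateBindings() {
    return findOnClasspath(SLF4J_BINDER).size() > 1;
  }
}
```

Printing `findOnClasspath(BindingScanSketch.SLF4J_BINDER)` in a project that pulls in ratis-tools would show whether the examples jar is one of the sources of the binding.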
[jira] [Updated] (RATIS-827) Ratis example is leaking to the ratis-tools classpath
[ https://issues.apache.org/jira/browse/RATIS-827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marton Elek updated RATIS-827: -- Attachment: RATIS-827.001.patch > Ratis example is leaking to the ratis-tools classpath > - > > Key: RATIS-827 > URL: https://issues.apache.org/jira/browse/RATIS-827 > Project: Ratis > Issue Type: Improvement >Reporter: Marton Elek >Assignee: Marton Elek >Priority: Critical > Attachments: RATIS-827.001.patch > > > ratis-tools depends on ratis-example project which means that all the > projects using ratis-tools can get unexpected dependencies from the example > project: > For example I see the following ozone. > {code} > SLF4J: Class path contains multiple SLF4J bindings. > SLF4J: Found binding in > [jar:file:/opt/hadoop/share/ozone/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: Found binding in > [jar:file:/opt/hadoop/share/ozone/lib/ratis-examples-0.6.0-a320ae0-SNAPSHOT.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an > explanation. > SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] > {code} > I propose to move the example dependent tools implementation to the example > project and make the example project depends on the tools instead of the > opposite direction. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (RATIS-827) Ratis example is leaking to the ratis-tools classpath
[ https://issues.apache.org/jira/browse/RATIS-827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marton Elek updated RATIS-827: -- Attachment: RATIS-827.002.patch > Ratis example is leaking to the ratis-tools classpath > - > > Key: RATIS-827 > URL: https://issues.apache.org/jira/browse/RATIS-827 > Project: Ratis > Issue Type: Improvement >Reporter: Marton Elek >Assignee: Marton Elek >Priority: Critical > Attachments: RATIS-827.001.patch, RATIS-827.002.patch > > > ratis-tools depends on ratis-example project which means that all the > projects using ratis-tools can get unexpected dependencies from the example > project: > For example I see the following ozone. > {code} > SLF4J: Class path contains multiple SLF4J bindings. > SLF4J: Found binding in > [jar:file:/opt/hadoop/share/ozone/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: Found binding in > [jar:file:/opt/hadoop/share/ozone/lib/ratis-examples-0.6.0-a320ae0-SNAPSHOT.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an > explanation. > SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] > {code} > I propose to move the example dependent tools implementation to the example > project and make the example project depends on the tools instead of the > opposite direction. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (RATIS-827) Ratis example is leaking to the ratis-tools classpath
[ https://issues.apache.org/jira/browse/RATIS-827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marton Elek updated RATIS-827: -- Attachment: RATIS-827.003.patch > Ratis example is leaking to the ratis-tools classpath > - > > Key: RATIS-827 > URL: https://issues.apache.org/jira/browse/RATIS-827 > Project: Ratis > Issue Type: Improvement >Reporter: Marton Elek >Assignee: Marton Elek >Priority: Critical > Attachments: RATIS-827.001.patch, RATIS-827.002.patch, > RATIS-827.003.patch > > > ratis-tools depends on ratis-example project which means that all the > projects using ratis-tools can get unexpected dependencies from the example > project: > For example I see the following ozone. > {code} > SLF4J: Class path contains multiple SLF4J bindings. > SLF4J: Found binding in > [jar:file:/opt/hadoop/share/ozone/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: Found binding in > [jar:file:/opt/hadoop/share/ozone/lib/ratis-examples-0.6.0-a320ae0-SNAPSHOT.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an > explanation. > SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] > {code} > I propose to move the example dependent tools implementation to the example > project and make the example project depends on the tools instead of the > opposite direction. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (RATIS-840) Memory leak of LogAppender
[ https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094548#comment-17094548 ] Marton Elek commented on RATIS-840: --- [~szetszwo] can you please help review it? Ozone test results are very noisy because of this issue. > Memory leak of LogAppender > -- > > Key: RATIS-840 > URL: https://issues.apache.org/jira/browse/RATIS-840 > Project: Ratis > Issue Type: Bug > Components: server >Reporter: runzhiwang >Assignee: runzhiwang >Priority: Blocker > Attachments: RATIS-840.001.patch, RATIS-840.002.patch, > RATIS-840.003.patch, image-2020-04-06-14-27-28-485.png, > image-2020-04-06-14-27-39-582.png, screenshot-1.png, screenshot-2.png > > Time Spent: 10m > Remaining Estimate: 0h > > *What's the problem ?* > When run hadoop-ozone for 4 days, datanode memory leak. When dump heap, I > found there are 460710 instances of GrpcLogAppender. But there are only 6 > instances of SenderList, and each SenderList contains 1-2 instance of > GrpcLogAppender. And there are a lot of logs related to > [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428]. > {code:java}INFO impl.RaftServerImpl: > 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: > Restarting GrpcLogAppender for > 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code} > > So there are a lot of GrpcLogAppender did not stop the Daemon Thread when > removed from senders. > !image-2020-04-06-14-27-28-485.png! > !image-2020-04-06-14-27-39-582.png! > > *Why > [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428] > so many times ?* > 1. 
As the image shows, when remove group, SegmentedRaftLog will close, then > GrpcLogAppender throw exception when find the SegmentedRaftLog was closed. > Then GrpcLogAppender will be > [restarted|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94], > and the new GrpcLogAppender throw exception again when find the > SegmentedRaftLog was closed, then GrpcLogAppender will be restarted again ... > . It results in an infinite restart of GrpcLogAppender. > 2. Actually, when remove group, GrpcLogAppender will be stoped: > RaftServerImpl::shutdown -> > [RoleInfo::shutdownLeaderState|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L266] > -> LeaderState::stop -> LogAppender::stopAppender, then SegmentedRaftLog > will be closed: RaftServerImpl::shutdown -> > [ServerState:close|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L271] > ... . Though RoleInfo::shutdownLeaderState called before ServerState:close, > but the GrpcLogAppender was stopped asynchronously. So infinite restart of > GrpcLogAppender happens, when GrpcLogAppender stop after SegmentedRaftLog > close. > !screenshot-1.png! > *Why GrpcLogAppender did not stop the Daemon Thread when removed from senders > ?* > I find a lot of GrpcLogAppender blocked inside logs4j. I think it's > GrpcLogAppender restart too fast, then blocked in logs4j. > !screenshot-2.png! > *Can the new GrpcLogAppender work normally ?* > 1. Even though without the above problem, the new created GrpcLogAppender > still can not work normally. > 2. 
When creat a new GrpcLogAppender, a new FollowerInfo will also be created: > LeaderState::addAndStartSenders -> > LeaderState::addSenders->RaftServerImpl::newLogAppender -> [new > FollowerInfo|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L129] > 3. When the new created GrpcLogAppender append entry to follower, then the > follower response SUCCESS. > 4. Then LeaderState::updateCommit -> [LeaderState::getMajorityMin | > https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L599] > -> > [voterLists.get(0) | > https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L607]. > {color:#DE350B}Error happens because voterLists.get(0) return the > FollowerInfo of the old GrpcLogAppender, not the FollowerInfo of the new > GrpcLogAppender. {color} > 5. Because the majority commit got from the FollowerInfo of the old > GrpcLogAppender never changes. So even though follower has append
[jira] [Created] (RATIS-875) Bump the copyright year in the NOTICE of thirdparty
Marton Elek created RATIS-875: - Summary: Bump the copyright year in the NOTICE of thirdparty Key: RATIS-875 URL: https://issues.apache.org/jira/browse/RATIS-875 Project: Ratis Issue Type: Improvement Reporter: Marton Elek Assignee: Marton Elek Reported by [~arp] during a 0.4.0 rc vote. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (RATIS-880) Update github description and disable merge options apart from Squash and merge
[ https://issues.apache.org/jira/browse/RATIS-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marton Elek resolved RATIS-880. --- Fix Version/s: 0.6.0 Resolution: Fixed > Update github description and disable merge options apart from Squash and > merge > --- > > Key: RATIS-880 > URL: https://issues.apache.org/jira/browse/RATIS-880 > Project: Ratis > Issue Type: Bug >Reporter: Mukul Kumar Singh >Assignee: Mukul Kumar Singh >Priority: Major > Fix For: 0.6.0 > > Time Spent: 20m > Remaining Estimate: 0h > > Update github description and disable merge options apart from Squash and > merge -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (RATIS-840) Memory leak of LogAppender
[ https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marton Elek updated RATIS-840: -- Priority: Blocker (was: Critical) > Memory leak of LogAppender > -- > > Key: RATIS-840 > URL: https://issues.apache.org/jira/browse/RATIS-840 > Project: Ratis > Issue Type: Bug > Components: server >Reporter: runzhiwang >Assignee: runzhiwang >Priority: Blocker > Attachments: RATIS-840.001.patch, RATIS-840.002.patch, > RATIS-840.003.patch, image-2020-04-06-14-27-28-485.png, > image-2020-04-06-14-27-39-582.png, screenshot-1.png, screenshot-2.png > > > *What's the problem ?* > When run hadoop-ozone for 4 days, datanode memory leak. When dump heap, I > found there are 460710 instances of GrpcLogAppender. But there are only 6 > instances of SenderList, and each SenderList contains 1-2 instance of > GrpcLogAppender. And there are a lot of logs related to > [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428]. > {code:java}INFO impl.RaftServerImpl: > 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: > Restarting GrpcLogAppender for > 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code} > > So there are a lot of GrpcLogAppender did not stop the Daemon Thread when > removed from senders. > !image-2020-04-06-14-27-28-485.png! > !image-2020-04-06-14-27-39-582.png! > > *Why > [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428] > so many times ?* > 1. As the image shows, when remove group, SegmentedRaftLog will close, then > GrpcLogAppender throw exception when find the SegmentedRaftLog was closed. 
> Then GrpcLogAppender will be > [restarted|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94], > and the new GrpcLogAppender throw exception again when find the > SegmentedRaftLog was closed, then GrpcLogAppender will be restarted again ... > . It results in an infinite restart of GrpcLogAppender. > 2. Actually, when remove group, GrpcLogAppender will be stoped: > RaftServerImpl::shutdown -> > [RoleInfo::shutdownLeaderState|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L266] > -> LeaderState::stop -> LogAppender::stopAppender, then SegmentedRaftLog > will be closed: RaftServerImpl::shutdown -> > [ServerState:close|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L271] > ... . Though RoleInfo::shutdownLeaderState called before ServerState:close, > but the GrpcLogAppender was stopped asynchronously. So infinite restart of > GrpcLogAppender happens, when GrpcLogAppender stop after SegmentedRaftLog > close. > !screenshot-1.png! > *Why GrpcLogAppender did not stop the Daemon Thread when removed from senders > ?* > I find a lot of GrpcLogAppender blocked inside logs4j. I think it's > GrpcLogAppender restart too fast, then blocked in logs4j. > !screenshot-2.png! > *Can the new GrpcLogAppender work normally ?* > 1. Even though without the above problem, the new created GrpcLogAppender > still can not work normally. > 2. When creat a new GrpcLogAppender, a new FollowerInfo will also be created: > LeaderState::addAndStartSenders -> > LeaderState::addSenders->RaftServerImpl::newLogAppender -> [new > FollowerInfo|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L129] > 3. 
When the new created GrpcLogAppender append entry to follower, then the > follower response SUCCESS. > 4. Then LeaderState::updateCommit -> [LeaderState::getMajorityMin | > https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L599] > -> > [voterLists.get(0) | > https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L607]. > {color:#DE350B}Error happens because voterLists.get(0) return the > FollowerInfo of the old GrpcLogAppender, not the FollowerInfo of the new > GrpcLogAppender. {color} > 5. Because the majority commit got from the FollowerInfo of the old > GrpcLogAppender never changes. So even though follower has append entry > successfully, the leader can not update commit. So the new created > GrpcLogAppender can never work normally. > 6. The reason of unit test of
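The stale-reference problem described in steps 2 to 5 of the quoted report can be shown in isolation. The sketch below is hypothetical and heavily simplified: `StaleFollowerInfoSketch`, `Appender`, `restartWithNewInfo` and `restartReusingInfo` are illustrative names, a single follower is modelled, and reusing the existing `FollowerInfo` on restart is only one conceivable remedy, not a description of the RATIS-840 patch.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of a leader that tracks followers in a list and restarts senders. */
class StaleFollowerInfoSketch {
  static final class FollowerInfo {
    long matchIndex; // highest log index known to be replicated on the follower
  }

  static final class Appender {
    final FollowerInfo follower;

    Appender(FollowerInfo follower) {
      this.follower = follower;
    }
  }

  // The list consulted when computing the commit index (one follower for brevity).
  private final List<FollowerInfo> voters = new ArrayList<>();
  private Appender sender;

  StaleFollowerInfoSketch() {
    FollowerInfo info = new FollowerInfo();
    voters.add(info);
    sender = new Appender(info);
  }

  /** Problematic restart: the new appender updates a FollowerInfo the list never sees. */
  void restartWithNewInfo() {
    sender = new Appender(new FollowerInfo());
  }

  /** Alternative restart: the new appender keeps updating the tracked FollowerInfo. */
  void restartReusingInfo() {
    sender = new Appender(sender.follower);
  }

  /** The follower acknowledged entries up to the given index. */
  void onAppendSuccess(long index) {
    sender.follower.matchIndex = index;
  }

  /** What the commit calculation sees for the follower. */
  long committedView() {
    return voters.get(0).matchIndex;
  }
}
```

After `restartWithNewInfo()`, a successful append no longer changes `committedView()`, which mirrors the report's observation that the leader cannot advance the commit index even though the follower appended the entries.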
[jira] [Commented] (RATIS-840) Memory leak of LogAppender
[ https://issues.apache.org/jira/browse/RATIS-840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17095242#comment-17095242 ] Marton Elek commented on RATIS-840: --- bq. Marton Elek Please wait for me, I have to make sure the patch does not generate new failed ut. Because there are about 30 failed ut in ratis even though without my patch currently, it's need some time to do it. I am a huge +1 on this approach, but do you suggest fixing all the unit tests before RATIS-840? As I wrote, Ozone is bleeding. What is the plan? > Memory leak of LogAppender > -- > > Key: RATIS-840 > URL: https://issues.apache.org/jira/browse/RATIS-840 > Project: Ratis > Issue Type: Bug > Components: server >Reporter: runzhiwang >Assignee: runzhiwang >Priority: Blocker > Attachments: RATIS-840.001.patch, RATIS-840.002.patch, > RATIS-840.003.patch, image-2020-04-06-14-27-28-485.png, > image-2020-04-06-14-27-39-582.png, screenshot-1.png, screenshot-2.png > > Time Spent: 0.5h > Remaining Estimate: 0h > > *What's the problem ?* > When run hadoop-ozone for 4 days, datanode memory leak. When dump heap, I > found there are 460710 instances of GrpcLogAppender. But there are only 6 > instances of SenderList, and each SenderList contains 1-2 instance of > GrpcLogAppender. And there are a lot of logs related to > [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428]. > {code:java}INFO impl.RaftServerImpl: > 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-LeaderState: > Restarting GrpcLogAppender for > 1665f5ea-ab17-4a0e-af6d-6958efd322fa@group-F64B465F37B5-\u003e229cbcc1-a3b2-4383-9c0d-c0f4c28c3d4a\n","stream":"stderr","time":"2020-04-06T03:59:53.37892512Z"}{code} > > So there are a lot of GrpcLogAppender did not stop the Daemon Thread when > removed from senders. > !image-2020-04-06-14-27-28-485.png! > !image-2020-04-06-14-27-39-582.png! 
> > *Why > [LeaderState::restartSender|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L428] > so many times ?* > 1. As the image shows, when remove group, SegmentedRaftLog will close, then > GrpcLogAppender throw exception when find the SegmentedRaftLog was closed. > Then GrpcLogAppender will be > [restarted|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LogAppender.java#L94], > and the new GrpcLogAppender throw exception again when find the > SegmentedRaftLog was closed, then GrpcLogAppender will be restarted again ... > . It results in an infinite restart of GrpcLogAppender. > 2. Actually, when remove group, GrpcLogAppender will be stoped: > RaftServerImpl::shutdown -> > [RoleInfo::shutdownLeaderState|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L266] > -> LeaderState::stop -> LogAppender::stopAppender, then SegmentedRaftLog > will be closed: RaftServerImpl::shutdown -> > [ServerState:close|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L271] > ... . Though RoleInfo::shutdownLeaderState called before ServerState:close, > but the GrpcLogAppender was stopped asynchronously. So infinite restart of > GrpcLogAppender happens, when GrpcLogAppender stop after SegmentedRaftLog > close. > !screenshot-1.png! > *Why GrpcLogAppender did not stop the Daemon Thread when removed from senders > ?* > I find a lot of GrpcLogAppender blocked inside logs4j. I think it's > GrpcLogAppender restart too fast, then blocked in logs4j. > !screenshot-2.png! > *Can the new GrpcLogAppender work normally ?* > 1. Even though without the above problem, the new created GrpcLogAppender > still can not work normally. > 2. 
When creat a new GrpcLogAppender, a new FollowerInfo will also be created: > LeaderState::addAndStartSenders -> > LeaderState::addSenders->RaftServerImpl::newLogAppender -> [new > FollowerInfo|https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L129] > 3. When the new created GrpcLogAppender append entry to follower, then the > follower response SUCCESS. > 4. Then LeaderState::updateCommit -> [LeaderState::getMajorityMin | > https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L599] > -> > [voterLists.get(0) | > https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderState.java#L607]. > {color:#DE350B}Error happens because
[jira] [Updated] (RATIS-948) Update Sonar statistics only from the apache repo, not from the forks
[ https://issues.apache.org/jira/browse/RATIS-948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marton Elek updated RATIS-948: -- Description: RATIS-940 enabled the Sonar check for all the commits but it doesn't work for forked repositories (unless somebody sets their own SONAR_TOKEN). The fix is the same as HDDS-2627, we can restrict the execution to the apache repository. Thanks to [~ljain] who reported this issue: Example failure: https://github.com/lokeshj1703/incubator-ratis/runs/716253801 Note: PRs are not affected (there is no Sonar check there); only the builds of forked repos are. was: RATIS-940 enabled the Sonar check for all the commits but it doesn't work for forked repositories (unless somebody sets their own SONAR_TOKEN). The fix is the same as HDDS-2627, we can restrict the execution to the apache repository. > Update Sonar statistics only from the apache repo, not from the forks > - > > Key: RATIS-948 > URL: https://issues.apache.org/jira/browse/RATIS-948 > Project: Ratis > Issue Type: Improvement >Reporter: Marton Elek >Assignee: Marton Elek >Priority: Major > > RATIS-940 enabled the Sonar check for all the commits but it doesn't work for > forked repositories (unless somebody sets their own SONAR_TOKEN). > The fix is the same as HDDS-2627, we can restrict the execution to the apache > repository. > Thanks to [~ljain] who reported this issue: > Example failure: > https://github.com/lokeshj1703/incubator-ratis/runs/716253801 > Note: PRs are not affected (there is no Sonar check there); only the > builds of forked repos are.
[jira] [Moved] (RATIS-948) Update Sonar statistics only from the apache repo, not from the forks
[ https://issues.apache.org/jira/browse/RATIS-948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marton Elek moved HDDS-3675 to RATIS-948: - Key: RATIS-948 (was: HDDS-3675) Target Version/s: 0.6.0 (was: 0.6.0) Workflow: no-reopen-closed, patch-avail (was: patch-available, re-open possible) Project: Ratis (was: Hadoop Distributed Data Store) > Update Sonar statistics only from the apache repo, not from the forks > - > > Key: RATIS-948 > URL: https://issues.apache.org/jira/browse/RATIS-948 > Project: Ratis > Issue Type: Improvement >Reporter: Marton Elek >Assignee: Marton Elek >Priority: Major > > RATIS-940 enabled the Sonar check for all the commits but it doesn't work for > forked repositories (unless somebody set an own SONAR_TOKEN). > The fix is the same as HDDS-2627, we can restrict the execution to the apache > repostiroy. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (RATIS-948) Update Sonar statistics only from the apache repo, not from the forks
[ https://issues.apache.org/jira/browse/RATIS-948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marton Elek resolved RATIS-948. --- Resolution: Fixed > Update Sonar statistics only from the apache repo, not from the forks > - > > Key: RATIS-948 > URL: https://issues.apache.org/jira/browse/RATIS-948 > Project: Ratis > Issue Type: Improvement >Reporter: Marton Elek >Assignee: Marton Elek >Priority: Major > Time Spent: 20m > Remaining Estimate: 0h > > RATIS-940 enabled the Sonar check for all the commits but it doesn't work for > forked repositories (unless somebody set an own SONAR_TOKEN). > The fix is the same as HDDS-2627, we can restrict the execution to the apache > repostiroy. > Thanks to [~ljain] who reported this issue: > Example failure: > https://github.com/lokeshj1703/incubator-ratis/runs/716253801 > Note: PRs are not affected as they work well (no sonar check there) only the > builds of forked repos. -- This message was sent by Atlassian Jira (v8.3.4#803005)