[GitHub] [kafka] viktorsomogyi commented on pull request #13421: KAFKA-14824: ReplicaAlterLogDirsThread may cause serious disk growing in case of potential exception
viktorsomogyi commented on PR #13421:
URL: https://github.com/apache/kafka/pull/13421#issuecomment-1596994184

> In addition, if we add integration tests, put them in this PR, or need to open another PR?

Let's add them to this PR.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
viktorsomogyi commented on PR #13421:
URL: https://github.com/apache/kafka/pull/13421#issuecomment-1593258910

It seems like @clolov is right. I tested it in both quorum and zk mode, and Kafka successfully reconciles the questionable case (when X-1 on B comes back after A has compacted the logs), so I think it's fine to merge this PR.

I was also thinking of creating an integration test for this, but it's hard to simulate disk errors in Java and we can't make any assumptions about where the tests run, so I think that should be a separate task, as it's out of scope for this one. If you folks know a good fault injection framework, I'm all ears.

I'll come back tomorrow for a last round of review, and if I find everything fine, I'll merge this.
viktorsomogyi commented on PR #13421:
URL: https://github.com/apache/kafka/pull/13421#issuecomment-1582838059

I have some context on the replica fetcher area (mostly from reading and debugging), so I hope I can help. First, since the conversation is a bit long, let me summarize what I understand:

- The problem is that disk A reaches its capacity limit
- The solution is to move partition X-1 to disk B
- During the reassignment, log cleaning is disabled on X-1 (which can therefore fill disk A)
- The reassignment of X-1 fails, it is left in a failed state on B, and X-1 on A keeps growing

Is this correct? If it is, we may need to separate the deletion and compaction cases. I think resuming deletion is safe; however, resuming compaction might not be, since compaction alters the log. If an operator somehow resumes B and lets replication continue, then the history of X-1 on A and B might differ (I'm still working on a local test case that reproduces this). What do you think?
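The failure mode summarized in the list above can be sketched with a toy model. This is only an illustration under stated assumptions: `ToyLogDirMove`, `ToyLog`, `copyToDestination`, and `resumeOnFailure` are hypothetical names, not Kafka classes or configs, and the model covers only size-based deletion, not compaction.

```java
import java.io.IOException;

/** Toy model only: none of these names are Kafka classes. */
public class ToyLogDirMove {

    /** Stand-in for partition X-1's log on the source disk (disk A). */
    static final class ToyLog {
        boolean cleaningPaused = false;
        long sizeBytes = 0;

        void append(long bytes) {
            sizeBytes += bytes;
        }

        /** Deletion-style cleanup: trims the log only while cleaning is not paused. */
        void runCleanup(long retainBytes) {
            if (!cleaningPaused && sizeBytes > retainBytes) {
                sizeBytes = retainBytes;
            }
        }
    }

    /** Hypothetical copy to the destination disk (disk B); throws when failCopy is true. */
    static void copyToDestination(boolean failCopy) throws IOException {
        if (failCopy) {
            throw new IOException("simulated failure on destination disk");
        }
    }

    /** Pauses cleaning for the move; on failure, cleaning resumes only if resumeOnFailure is set. */
    static boolean move(ToyLog log, boolean failCopy, boolean resumeOnFailure) {
        log.cleaningPaused = true;
        try {
            copyToDestination(failCopy);
            log.cleaningPaused = false; // normal completion
            return true;
        } catch (IOException e) {
            if (resumeOnFailure) {
                log.cleaningPaused = false; // the behaviour discussed in this thread
            }
            return false;
        }
    }

    /** Size of the source log after a failed move followed by five 100-byte appends. */
    static long sizeAfterFailedMove(boolean resumeOnFailure) {
        ToyLog log = new ToyLog();
        move(log, true, resumeOnFailure);
        for (int i = 0; i < 5; i++) {
            log.append(100);
            log.runCleanup(200); // retain at most 200 bytes
        }
        return log.sizeBytes;
    }

    public static void main(String[] args) {
        System.out.println("without resume: " + sizeAfterFailedMove(false));
        System.out.println("with resume:    " + sizeAfterFailedMove(true));
    }
}
```

In this toy, the log that stays paused after the failed move grows with every append, while the one that resumes cleanup stays bounded. It says nothing about whether resuming compaction is safe, which is the open question in the comment above.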
viktorsomogyi commented on PR #13421:
URL: https://github.com/apache/kafka/pull/13421#issuecomment-1578862889

@hudeqi I added myself as a reviewer. I may not have time to review this today, but I will get to it this week.