[kudu-CR] Fix stray memory writes due to tcmalloc profiling
Todd Lipcon has submitted this change and it was merged.

Change subject: Fix stray memory writes due to tcmalloc profiling

Fix stray memory writes due to tcmalloc profiling

This fixes an issue that has been causing frequent crashes in JD.com's production cluster as well as various Cloudera test clusters. The crashes occurred in many different places, but the key signature was that offset 120 in some data structure or array (e.g. the 16th element of a vector) would be corrupted.

After doing a git bisect using an integration testing cluster running an ITBLL workload, I found that this was a regression caused by the introduction of tcmalloc contention profiling[1].

The short explanation is that, if we experienced contention while freeing a Trace object, we could in some cases increment offset 120 of some other allocation which occurred soon after the deallocation of the Trace. The issue is described in more detail in a new comment in trace.h.

With this patch, I was unable to reproduce the issue on the test cluster. No new test is added since this is quite timing-dependent and not amenable to unit testing or even stress testing.

[1] commit f6691e744b9cb796e1bbc6e07953f21f387c9a88

Change-Id: I9afca83d9cc24585960f6bf68d8996c4736ce6cb
Reviewed-on: http://gerrit.cloudera.org:8080/3445
Reviewed-by: David Ribeiro Alves
Reviewed-by: Mike Percy
Tested-by: Kudu Jenkins
---
M src/kudu/util/trace.h
1 file changed, 23 insertions(+), 3 deletions(-)

Approvals:
  David Ribeiro Alves: Looks good to me, approved
  Mike Percy: Looks good to me, approved
  Kudu Jenkins: Verified

--
To view, visit http://gerrit.cloudera.org:8080/3445
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I9afca83d9cc24585960f6bf68d8996c4736ce6cb
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon
Gerrit-Reviewer: David Ribeiro Alves
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy
Gerrit-Reviewer: Todd Lipcon
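The crash signature in the commit message above relies on a small piece of offset arithmetic: byte offset 120 corresponds to the 16th element of a vector only when the elements are 8 bytes wide. The following is a minimal sketch of that arithmetic, not code from the patch. The helper `ElementIndexAtOffset` and the constant `kCorruptedOffset` are hypothetical names introduced for illustration, and the 8-byte element size is an assumption inferred from the "16th element" remark.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of Kudu or tcmalloc): maps a byte offset
// within an allocation to the zero-based index of the array element that
// contains that byte.
constexpr std::size_t ElementIndexAtOffset(std::size_t byte_offset,
                                           std::size_t element_size) {
  return byte_offset / element_size;
}

// The corrupted byte offset reported in the commit message.
constexpr std::size_t kCorruptedOffset = 120;

// Assumption: 8-byte elements (for example pointers or int64_t on a 64-bit
// build). Offset 120 then falls at index 15, which is the 16th element.
static_assert(
    ElementIndexAtOffset(kCorruptedOffset, sizeof(std::int64_t)) == 15,
    "offset 120 is the 16th 8-byte element");
```

With a different element width the same stray write would land elsewhere (index 30 for 4-byte elements), which may be one reason the crashes appeared in unrelated places while sharing the same offset.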
[kudu-CR] Fix stray memory writes due to tcmalloc profiling
Mike Percy has posted comments on this change.

Change subject: Fix stray memory writes due to tcmalloc profiling

Patch Set 1: Code-Review+2

--
To view, visit http://gerrit.cloudera.org:8080/3445
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I9afca83d9cc24585960f6bf68d8996c4736ce6cb
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon
Gerrit-Reviewer: David Ribeiro Alves
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy
Gerrit-Reviewer: song bruce zhang
Gerrit-HasComments: No
[kudu-CR] Fix stray memory writes due to tcmalloc profiling
David Ribeiro Alves has posted comments on this change.

Change subject: Fix stray memory writes due to tcmalloc profiling

Patch Set 1: Code-Review+2

good catch

--
To view, visit http://gerrit.cloudera.org:8080/3445
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I9afca83d9cc24585960f6bf68d8996c4736ce6cb
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon
Gerrit-Reviewer: David Ribeiro Alves
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy
Gerrit-Reviewer: song bruce zhang
Gerrit-HasComments: No
[kudu-CR] Fix stray memory writes due to tcmalloc profiling
Hello David Ribeiro Alves, Mike Percy,

I'd like you to do a code review. Please visit http://gerrit.cloudera.org:8080/3445 to review the following change.

Change subject: Fix stray memory writes due to tcmalloc profiling

Fix stray memory writes due to tcmalloc profiling

This fixes an issue that has been causing frequent crashes in JD.com's production cluster as well as various Cloudera test clusters. The crashes occurred in many different places, but the key signature was that offset 120 in some data structure or array (e.g. the 16th element of a vector) would be corrupted.

After doing a git bisect using an integration testing cluster running an ITBLL workload, I found that this was a regression caused by the introduction of tcmalloc contention profiling[1].

The short explanation is that, if we experienced contention while freeing a Trace object, we could in some cases increment offset 120 of some other allocation which occurred soon after the deallocation of the Trace. The issue is described in more detail in a new comment in trace.h.

With this patch, I was unable to reproduce the issue on the test cluster. No new test is added since this is quite timing-dependent and not amenable to unit testing or even stress testing.

[1] commit f6691e744b9cb796e1bbc6e07953f21f387c9a88

Change-Id: I9afca83d9cc24585960f6bf68d8996c4736ce6cb
---
M src/kudu/util/trace.h
1 file changed, 23 insertions(+), 3 deletions(-)

git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/45/3445/1

--
To view, visit http://gerrit.cloudera.org:8080/3445
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I9afca83d9cc24585960f6bf68d8996c4736ce6cb
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon
Gerrit-Reviewer: David Ribeiro Alves
Gerrit-Reviewer: Mike Percy