[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Mike Percy has submitted this change and it was merged. Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata The test intends to corrupt the metadata of one of the tserver tablets. While the cluster is in the process of resurrecting the failed tserver, the Partition and PartitionSchema on that tserver is accessed using TabletPeer in an unguarded manner via ListTablets RPCs. The proposed fix here keeps the Partition and PartitionSchema immutable after it is loaded for the very first time. These fields are never overwritten again during tablet-copy or any other workflow. Also a unit test is added to exercise this data race window. i.e, the unit test aims to issue ListTablets RPCs during a follower's tablet copy stage and provides an optimistic coverage for the fix. DCHECKs are added to make sure the oncoming protobuf fields match the immutable fields not updated during tablet-copy. Testing: Spun a test TabletCopyITest.TestDeleteTabletDuringTabletCopy which exercises this data race window, also ran original test which caught the race in raft_consensus-itest.TestCorruptReplicaMetadata. Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 Reviewed-on: http://gerrit.cloudera.org:8080/3823 Tested-by: Kudu Jenkins Reviewed-by: Mike Percy --- M src/kudu/common/partition.cc M src/kudu/common/partition.h M src/kudu/integration-tests/tablet_copy-itest.cc M src/kudu/tablet/tablet_metadata.cc 4 files changed, 129 insertions(+), 16 deletions(-) Approvals: Mike Percy: Looks good to me, approved Kudu Jenkins: Verified -- To view, visit http://gerrit.cloudera.org:8080/3823 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: merged Gerrit-Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 Gerrit-PatchSet: 18 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dinesh Bhat Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Mike Percy has posted comments on this change. Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. Patch Set 17: Code-Review+2 OK let's commit this :) -- To view, visit http://gerrit.cloudera.org:8080/3823 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 Gerrit-PatchSet: 17 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dinesh Bhat Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Dinesh Bhat has posted comments on this change. Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. Patch Set 17: > Do you guys think we should pull this in for 1.0 even though we > don't have an upgrade test yet? Maybe it's worth filing a JIRA and > committing this, since I think Adar saw it in a test environment. I think so, given that this change doesn't break forward compat(i.e, 0.10 should be able to accept remote-copy from 1.0). If we intro partition changes in 1.0, we anyways need to handle backward-compat in the future code. I had filed JIRA for upgrade/downgrade test. -- To view, visit http://gerrit.cloudera.org:8080/3823 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 Gerrit-PatchSet: 17 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dinesh Bhat Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Mike Percy has posted comments on this change. Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. Patch Set 17: Do you guys think we should pull this in for 1.0 even though we don't have an upgrade test yet? Maybe it's worth filing a JIRA and committing this, since I think Adar saw it in a test environment. -- To view, visit http://gerrit.cloudera.org:8080/3823 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 Gerrit-PatchSet: 17 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dinesh Bhat Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Dinesh Bhat has posted comments on this change. Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. Patch Set 17: (2 comments) TFTRs Mike/Dan/Adar, updated the patch to reflect the discussions we have had on this. Please take a look. http://gerrit.cloudera.org:8080/#/c/3823/15/src/kudu/tablet/tablet_metadata.cc File src/kudu/tablet/tablet_metadata.cc: PS15, Line 313: CHECK_EQ(table_id_, superblock.table_id()); : PartitionSchema partition_schema; : RETURN_NOT_OK(PartitionSchema::FromPB(superblock.partition_schema(), : *schema_, &partition_schema)); : CHECK(partition_schema_.Equals(partition_schema)); : : Partition partition; : Partition::FromPB(superblock.partition(), &partition); : CHECK(partition_.Equals(partition)); > Providing equals on partition should be easy, it just needs to check that t also checked for hash_buckets equality Dan, assuming that's immutable as well. PS15, Line 313: CHECK_EQ(table_id_, superblock.table_id()); : PartitionSchema partition_schema; : RETURN_NOT_OK(PartitionSchema::FromPB(superblock.partition_schema(), : *schema_, &partition_schema)); : CHECK(partition_schema_.Equals(partition_schema)); : : Partition partition; : Partition::FromPB(superblock.partition(), &partition); : CHECK(partition_.Equals(partition)); > I suppose we could change this to CHECK and just rely on upgrade / downgrad Cool, thanks Mike, updated the code with Equals() for both partition and partition_schema, and moving to CHECK to keep the code same for all build types. I agree with catching the compat issues with white/black box automated tests for upgrade/downgrade. -- To view, visit http://gerrit.cloudera.org:8080/3823 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 Gerrit-PatchSet: 17 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dinesh Bhat Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Kudu Jenkins has posted comments on this change. Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. Patch Set 17: Build Started http://104.196.14.100/job/kudu-gerrit/3152/ -- To view, visit http://gerrit.cloudera.org:8080/3823 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 Gerrit-PatchSet: 17 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dinesh Bhat Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Hello Dan Burkert, Mike Percy, Adar Dembo, Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/3823 to look at the new patch set (#17). Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata The test intends to corrupt the metadata of one of the tserver tablets. While the cluster is in the process of resurrecting the failed tserver, the Partition and PartitionSchema on that tserver is accessed using TabletPeer in an unguarded manner via ListTablets RPCs. The proposed fix here keeps the Partition and PartitionSchema immutable after it is loaded for the very first time. These fields are never overwritten again during tablet-copy or any other workflow. Also a unit test is added to exercise this data race window. i.e, the unit test aims to issue ListTablets RPCs during a follower's tablet copy stage and provides an optimistic coverage for the fix. DCHECKs are added to make sure the oncoming protobuf fields match the immutable fields not updated during tablet-copy. Testing: Spun a test TabletCopyITest.TestDeleteTabletDuringTabletCopy which exercises this data race window, also ran original test which caught the race in raft_consensus-itest.TestCorruptReplicaMetadata. Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 --- M src/kudu/common/partition.cc M src/kudu/common/partition.h M src/kudu/integration-tests/tablet_copy-itest.cc M src/kudu/tablet/tablet_metadata.cc 4 files changed, 129 insertions(+), 16 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/23/3823/17 -- To view, visit http://gerrit.cloudera.org:8080/3823 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 Gerrit-PatchSet: 17 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dinesh Bhat Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Dan Burkert has posted comments on this change. Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. Patch Set 15: (1 comment) http://gerrit.cloudera.org:8080/#/c/3823/15/src/kudu/tablet/tablet_metadata.cc File src/kudu/tablet/tablet_metadata.cc: PS15, Line 313: DCHECK_EQ(table_id_, superblock.table_id()); : PartitionSchemaPB partition_schema_pb; : partition_schema_.ToPB(&partition_schema_pb); : DCHECK_EQ(superblock.partition_schema().SerializeAsString(), : partition_schema_pb.SerializeAsString()); : PartitionPB partition_pb; : partition_.ToPB(&partition_pb); : DCHECK_EQ(superblock.partition().SerializeAsString(), : partition_pb.SerializeAsString()); > I suppose we could change this to CHECK and just rely on upgrade / downgrad Providing equals on partition should be easy, it just needs to check that the lower and upper bound partition key strings are equal. -- To view, visit http://gerrit.cloudera.org:8080/3823 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 Gerrit-PatchSet: 15 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dinesh Bhat Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Mike Percy has posted comments on this change. Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. Patch Set 15: (1 comment) http://gerrit.cloudera.org:8080/#/c/3823/15/src/kudu/tablet/tablet_metadata.cc File src/kudu/tablet/tablet_metadata.cc: PS15, Line 313: DCHECK_EQ(table_id_, superblock.table_id()); : PartitionSchemaPB partition_schema_pb; : partition_schema_.ToPB(&partition_schema_pb); : DCHECK_EQ(superblock.partition_schema().SerializeAsString(), : partition_schema_pb.SerializeAsString()); : PartitionPB partition_pb; : partition_.ToPB(&partition_pb); : DCHECK_EQ(superblock.partition().SerializeAsString(), : partition_pb.SerializeAsString()); > I liked this idea, but it still doesn't prevent compat break if the wire fo I suppose we could change this to CHECK and just rely on upgrade / downgrade tests for coverage on this stuff, if we don't want release and debug builds to act differently. After all, this isn't a hot path. Dinesh, to your point: providing an Equals() method for the non-PB objects doesn't prevent forward-compatibility problems, you are right. These are just inherent issues we have to deal with and as far as I can tell, having upgrade tests are really the only way to avoid unforeseen compatibility problems. -- To view, visit http://gerrit.cloudera.org:8080/3823 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 Gerrit-PatchSet: 15 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dinesh Bhat Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Dinesh Bhat has posted comments on this change. Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. Patch Set 16: (1 comment) http://gerrit.cloudera.org:8080/#/c/3823/15/src/kudu/tablet/tablet_metadata.cc File src/kudu/tablet/tablet_metadata.cc: PS15, Line 313: } else { : DCHECK_EQ(table_id_, superblock.table_id()); : PartitionSchemaPB partition_schema_pb; : partition_schema_.ToPB(&partition_schema_pb); : DCHECK_EQ(superblock.partition_schema().SerializeAsString(), : partition_schema_pb.SerializeAsString()); : PartitionPB partition_pb; : partition_.ToPB(&partition_pb); : DCHECK_EQ(superblock.partition().SerializeAs > Well, it looks like we already have PartitionSchema::Equals() so maybe we j I liked this idea, but it still doesn't prevent compat break if the wire format differs from the local copy if the destination version is ahead of source during tablet-copy. In general, it seems this is a question of : if the node is accepting a remote copy, is it a good idea to accept a copy from a server with release (N-1) ? -- To view, visit http://gerrit.cloudera.org:8080/3823 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 Gerrit-PatchSet: 16 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dinesh Bhat Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Mike Percy has posted comments on this change. Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. Patch Set 15: (1 comment) http://gerrit.cloudera.org:8080/#/c/3823/15/src/kudu/tablet/tablet_metadata.cc File src/kudu/tablet/tablet_metadata.cc: PS15, Line 313: DCHECK_EQ(table_id_, superblock.table_id()); : PartitionSchemaPB partition_schema_pb; : partition_schema_.ToPB(&partition_schema_pb); : DCHECK_EQ(superblock.partition_schema().SerializeAsString(), : partition_schema_pb.SerializeAsString()); : PartitionPB partition_pb; : partition_.ToPB(&partition_pb); : DCHECK_EQ(superblock.partition().SerializeAsString(), : partition_pb.SerializeAsString()); > I understand that this code isn't executed in release mode, but I feel fair Well, it looks like we already have PartitionSchema::Equals() so maybe we just need to implement Partition::Equals() to do this without comparing protobuf encodings. We could have something like: PartitionNSchema s; RETURN_NOT_OK(PartitionSchema::FromPB(superblock.partition_schema(), &s)); DCHECK(partition_schema_.Equals(s)); And something similar for Partition. In general I'm mostly concerned with forward and backward compatibility in release mode, although also maintaining it for DEBUG builds clearly would be nice. I think writing automated upgrade / downgrade tests would also help detect problems before release. -- To view, visit http://gerrit.cloudera.org:8080/3823 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 Gerrit-PatchSet: 15 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dinesh Bhat Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Adar Dembo has posted comments on this change. Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. Patch Set 15: (1 comment) http://gerrit.cloudera.org:8080/#/c/3823/15/src/kudu/tablet/tablet_metadata.cc File src/kudu/tablet/tablet_metadata.cc: PS15, Line 313: DCHECK_EQ(table_id_, superblock.table_id()); : PartitionSchemaPB partition_schema_pb; : partition_schema_.ToPB(&partition_schema_pb); : DCHECK_EQ(superblock.partition_schema().SerializeAsString(), : partition_schema_pb.SerializeAsString()); : PartitionPB partition_pb; : partition_.ToPB(&partition_pb); : DCHECK_EQ(superblock.partition().SerializeAsString(), : partition_pb.SerializeAsString()); > #1 is a good idea I understand that this code isn't executed in release mode, but I feel fairly strongly that forward (or backward, for that matter) compatibility behavior shouldn't vary between Kudu build types. That's really unexpected, and it's the kind of thing that may show up when a user encounters a real bug and tries to reproduce it with a debug build to get more information. -- To view, visit http://gerrit.cloudera.org:8080/3823 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 Gerrit-PatchSet: 15 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dinesh Bhat Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Mike Percy has posted comments on this change. Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. Patch Set 15: (1 comment) http://gerrit.cloudera.org:8080/#/c/3823/15/src/kudu/tablet/tablet_metadata.cc File src/kudu/tablet/tablet_metadata.cc: PS15, Line 313: DCHECK_EQ(table_id_, superblock.table_id()); : PartitionSchemaPB partition_schema_pb; : partition_schema_.ToPB(&partition_schema_pb); : DCHECK_EQ(superblock.partition_schema().SerializeAsString(), : partition_schema_pb.SerializeAsString()); : PartitionPB partition_pb; : partition_.ToPB(&partition_pb); : DCHECK_EQ(superblock.partition().SerializeAsString(), : partition_pb.SerializeAsString()); > 1. I actually thought about NDEBUG, but forgot to materialize in between th #1 is a good idea Regarding #2, since it's a DCHECK we ignore it in release mode. Below is what the PBs look like. I suppose we could create equality functions for them to try to future proof them from PB changes but I'm not sure it's worth it for a DCHECK. message PartitionPB { // The hash buckets of the partition. The number of hash buckets must match // the number of hash bucket components in the partition's schema. repeated int32 hash_buckets = 1 [packed = true]; // The encoded start partition key (inclusive). optional bytes partition_key_start = 2; // The encoded end partition key (exclusive). optional bytes partition_key_end = 3; } // The serialized format of a Kudu table partition schema. message PartitionSchemaPB { // A column identifier for partition schemas. In general, the name will be // used when a client creates the table since column IDs are assigned by the // master. All other uses of partition schemas will use the numeric column ID. message ColumnIdentifierPB { oneof identifier { int32 id = 1; string name = 2; } } message RangeSchemaPB { // Column identifiers of columns included in the range. All columns must be // a component of the primary key. repeated ColumnIdentifierPB columns = 1; } message HashBucketSchemaPB { // Column identifiers of columns included in the hash. Every column must be // a component of the primary key. repeated ColumnIdentifierPB columns = 1; // Number of buckets into which columns will be hashed. Must be at least 2. required int32 num_buckets = 2; // Seed value for hash calculation. Administrators may set a seed value // on a per-table basis in order to randomize the mapping of rows to // buckets. Setting a seed provides some amount of protection against denial // of service attacks when the hash bucket columns contain user provided // input. optional uint32 seed = 3; enum HashAlgorithm { UNKNOWN = 0; MURMUR_HASH_2 = 1; } // The hash algorithm to use for calculating the hash bucket. optional HashAlgorithm hash_algorithm = 4; } repeated HashBucketSchemaPB hash_bucket_schemas = 1; optional RangeSchemaPB range_schema = 2; } -- To view, visit http://gerrit.cloudera.org:8080/3823 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 Gerrit-PatchSet: 15 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dinesh Bhat Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Kudu Jenkins has posted comments on this change. Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. Patch Set 16: Build Started http://104.196.14.100/job/kudu-gerrit/3118/ -- To view, visit http://gerrit.cloudera.org:8080/3823 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 Gerrit-PatchSet: 16 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dinesh Bhat Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Hello Dan Burkert, Mike Percy, Adar Dembo, Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/3823 to look at the new patch set (#16). Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata The test intends to corrupt the metadata of one of the tserver tablets. While the cluster is in the process of resurrecting the failed tserver, the Partition and PartitionSchema on that tserver is accessed using TabletPeer in an unguarded manner via ListTablets RPCs. The proposed fix here keeps the Partition and PartitionSchema immutable after it is loaded for the very first time. These fields are never overwritten again during tablet-copy or any other workflow. Also a unit test is added to exercise this data race window. i.e, the unit test aims to issue ListTablets RPCs during a follower's tablet copy stage and provides an optimistic coverage for the fix. DCHECKs are added to make sure the oncoming protobuf fields match the immutable fields not updated during tablet-copy. Testing: Spun a test TabletCopyITest.TestDeleteTabletDuringTabletCopy which exercises this data race window, also ran original test which caught the race in raft_consensus-itest.TestCorruptReplicaMetadata. Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 --- M src/kudu/integration-tests/tablet_copy-itest.cc M src/kudu/tablet/tablet_metadata.cc 2 files changed, 120 insertions(+), 16 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/23/3823/16 -- To view, visit http://gerrit.cloudera.org:8080/3823 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 Gerrit-PatchSet: 16 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dinesh Bhat Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Dinesh Bhat has posted comments on this change. Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. Patch Set 15: (4 comments) TFTR Adar, I will update the new patch in a while, please see responses inline. http://gerrit.cloudera.org:8080/#/c/3823/15/src/kudu/integration-tests/tablet_copy-itest.cc File src/kudu/integration-tests/tablet_copy-itest.cc: PS15, Line 271: fix(KUDU-1500) > Nit: "fix (KUDU-1500)" (separate 'fix' from the parenthetical block with a Done PS15, Line 325: scoped_refptr thread; : ASSERT_OK(kudu::Thread::Create(CURRENT_TEST_NAME(), : Substitute("list-tablet-thread-$0", i), :&ListTabletsDuringTabletCopy, :follower_ts, tablet_id, :&barrier, &finish, &thread)); : threads.push_back(thread); > Now that we can use C++11 we prefer std::thread for test threads. The advan TBH I need to read up about std::thread to understand what you are saying about these subtle differences. I will post an update once I convert this into std::thread. http://gerrit.cloudera.org:8080/#/c/3823/15/src/kudu/tablet/tablet_metadata.cc File src/kudu/tablet/tablet_metadata.cc: PS15, Line 304: // We treat few fields in the metadata as immutable objects, : // so they are loaded from protobuf only when tablet replica : // is being instantiated on this peer(KUDU-1500). > Nit: this comment is a little confusing. How about rewording like so: "Some Done. PS15, Line 313: DCHECK_EQ(table_id_, superblock.table_id()); : PartitionSchemaPB partition_schema_pb; : partition_schema_.ToPB(&partition_schema_pb); : DCHECK_EQ(superblock.partition_schema().SerializeAsString(), : partition_schema_pb.SerializeAsString()); : PartitionPB partition_pb; : partition_.ToPB(&partition_pb); : DCHECK_EQ(superblock.partition().SerializeAsString(), : partition_pb.SerializeAsString()); > I have some questions about this logic: 1. I actually thought about NDEBUG, but forgot to materialize in between the test experiments, good that you reminded :), consider it done. 2. Hmmm, good Qn. I believe we may land into similar issue if we change the superblock version(or format). i.e, destination version differs from source in the tablet-copy workflow. My hunch is that(not very sure) we probably have to worry about that compatibility at several other places more than here. However Mike may have more clearer picture here to answer this Qn. -- To view, visit http://gerrit.cloudera.org:8080/3823 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 Gerrit-PatchSet: 15 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dinesh Bhat Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Adar Dembo has posted comments on this change. Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. Patch Set 15: (5 comments) Came to +2, ended up leaving some comments. :/ http://gerrit.cloudera.org:8080/#/c/3823/15//COMMIT_MSG Commit Message: PS15, Line 15: once after Nit: seems like you only need one of these words, but not both. http://gerrit.cloudera.org:8080/#/c/3823/15/src/kudu/integration-tests/tablet_copy-itest.cc File src/kudu/integration-tests/tablet_copy-itest.cc: PS15, Line 271: fix(KUDU-1500) Nit: "fix (KUDU-1500)" (separate 'fix' from the parenthetical block with a space). PS15, Line 325: scoped_refptr thread; : ASSERT_OK(kudu::Thread::Create(CURRENT_TEST_NAME(), : Substitute("list-tablet-thread-$0", i), :&ListTabletsDuringTabletCopy, :follower_ts, tablet_id, :&barrier, &finish, &thread)); : threads.push_back(thread); Now that we can use C++11 we prefer std::thread for test threads. The advantage is less boilerplate code, both in starting up threads and in the use of lambdas for the thread functions themselves, with lambda capture reducing parameter passing. See master-test.cc:TestConcurrentCreateOfSameTable for an example. http://gerrit.cloudera.org:8080/#/c/3823/15/src/kudu/tablet/tablet_metadata.cc File src/kudu/tablet/tablet_metadata.cc: PS15, Line 304: // We treat few fields in the metadata as immutable objects, : // so they are loaded from protobuf only when tablet replica : // is being instantiated on this peer(KUDU-1500). Nit: this comment is a little confusing. How about rewording like so: "Some metadata fields are assumed to be immutable and thus are only read from the protobuf when the tablet metadata is loaded for the very first time. See KUDU-1500 for more details." PS15, Line 313: DCHECK_EQ(table_id_, superblock.table_id()); : PartitionSchemaPB partition_schema_pb; : partition_schema_.ToPB(&partition_schema_pb); : DCHECK_EQ(superblock.partition_schema().SerializeAsString(), : partition_schema_pb.SerializeAsString()); : PartitionPB partition_pb; : partition_.ToPB(&partition_pb); : DCHECK_EQ(superblock.partition().SerializeAsString(), : partition_pb.SerializeAsString()); I have some questions about this logic: 1. Since this is just sanity checks that are enabled in debug builds, the entire block should be conditioned on #ifndef NDEBUG (and then the DCHECK_EQs should be converted into CHECK_EQ). This is to avoid the ToPB() calls in non-debug builds, where their results aren't used. 2. I'm worried about impacting forward compatibility by comparing serialized protobufs like this. In commit 43d2749, Mike removed an equivalent check in tserver registration which effectively prevented us from adding new fields to the registration protobuf. Would this lead to the same problem? Or is it different? -- To view, visit http://gerrit.cloudera.org:8080/3823 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 Gerrit-PatchSet: 15 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dinesh Bhat Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Kudu Jenkins has posted comments on this change. Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. Patch Set 15: Build Started http://104.196.14.100/job/kudu-gerrit/3107/ -- To view, visit http://gerrit.cloudera.org:8080/3823 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 Gerrit-PatchSet: 15 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dinesh Bhat Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Hello Dan Burkert, Mike Percy, Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/3823 to look at the new patch set (#15). Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata The test intends to corrupt the metadata of one of the tserver tablets. While the cluster is in the process of resurrecting the failed tserver, the Partition and PartitionSchema on that tserver is accessed using TabletPeer in an unguarded manner via ListTablets RPCs. The proposed fix here keeps the Partition and PartitionSchema immutable once after it is loaded during replica bootstrap. These fields are never overwritten again during tablet-copy or any other workflow. Also a unit test is added to exercise this data race window. i.e, the unit test aims to issue ListTablets RPCs during a follower's tablet copy stage and provides an optimistic coverage for the fix. DCHECKs are added to make sure the oncoming protobuf fields match the immutable fields not updated during tablet-copy. Testing: Spun a test TabletCopyITest.TestDeleteTabletDuringTabletCopy which exercises this data race window, also ran original test which caught the race in raft_consensus-itest.TestCorruptReplicaMetadata. Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 --- M src/kudu/integration-tests/tablet_copy-itest.cc M src/kudu/tablet/tablet_metadata.cc 2 files changed, 132 insertions(+), 16 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/23/3823/15 -- To view, visit http://gerrit.cloudera.org:8080/3823 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 Gerrit-PatchSet: 15 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dinesh Bhat Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Hello Dan Burkert, Mike Percy, Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/3823 to look at the new patch set (#14). Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata The test intends to corrupt the metadata of one of the tserver tablets. While the cluster is in the process of resurrecting the failed tserver, the Partition and PartitionSchema on that tserver is accessed using TabletPeer in an unguarded manner via ListTablets RPCs. The proposed fix here keeps the Partition and PartitionSchema immutable once after it is loaded during replica bootstrap. These fields are never overwritten again during tablet-copy or any other workflow. Also a unit test is added to exercise this data race window. i.e, the unit test aims to issue ListTablets RPCs during a follower's tablet copy stage and provides an optimistic coverage for the fix. DCHECKs are added to make sure the oncoming protobuf fields match the immutable fields not updated during tablet-copy. Testing: Spun a test TabletCopyITest.TestDeleteTabletDuringTabletCopy which exercises this data race window, also ran original test which caught the race in raft_consensus-itest.TestCorruptReplicaMetadata. Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 --- M src/kudu/integration-tests/tablet_copy-itest.cc M src/kudu/tablet/tablet_metadata.cc 2 files changed, 131 insertions(+), 16 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/23/3823/14 -- To view, visit http://gerrit.cloudera.org:8080/3823 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 Gerrit-PatchSet: 14 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dinesh Bhat Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Kudu Jenkins has posted comments on this change. Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. Patch Set 14: Build Started http://104.196.14.100/job/kudu-gerrit/3101/ -- To view, visit http://gerrit.cloudera.org:8080/3823 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 Gerrit-PatchSet: 14 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dinesh Bhat Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Mike Percy has posted comments on this change. Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. Patch Set 13: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/3823 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 Gerrit-PatchSet: 13 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dinesh Bhat Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Hello Dan Burkert, Mike Percy, Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/3823 to look at the new patch set (#13). Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata The test intends to corrupt the metadata of one of the tserver tablets. While the cluster is in the process of resurrecting the failed tserver, the Partition and PartitionSchema on that tserver is accessed using TabletPeer in an unguarded manner via ListTablets RPCs. The proposed fix here keeps the Partition and PartitionSchema immutable once after it is loaded during replica bootstrap. These fields are never overwritten again during tablet-copy or any other workflow. Also a unit test is added to exercise this data race window. i.e, the unit test aims to issue ListTablets RPCs during a follower's tablet copy stage and provides an optimistic coverage for the fix. DCHECKs are added to make sure the oncoming protobuf fields match the immutable fields not updated during tablet-copy. Testing: Spun a test TabletCopyITest.TestDeleteTabletDuringTabletCopy which exercises this data race window, also ran original test which caught the race in raft_consensus-itest.TestCorruptReplicaMetadata. Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 --- M src/kudu/integration-tests/cluster_itest_util.h M src/kudu/integration-tests/tablet_copy-itest.cc M src/kudu/tablet/tablet_metadata.cc 3 files changed, 133 insertions(+), 16 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/23/3823/13 -- To view, visit http://gerrit.cloudera.org:8080/3823 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 Gerrit-PatchSet: 13 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dinesh Bhat Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Kudu Jenkins has posted comments on this change. Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. Patch Set 13: Build Started http://104.196.14.100/job/kudu-gerrit/3100/ -- To view, visit http://gerrit.cloudera.org:8080/3823 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 Gerrit-PatchSet: 13 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dinesh Bhat Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Dinesh Bhat has posted comments on this change. Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. Patch Set 12: (1 comment) Thanks again, updated along with std:: prefixes to keep jenkins happy. http://gerrit.cloudera.org:8080/#/c/3823/11/src/kudu/integration-tests/tablet_copy-itest.cc File src/kudu/integration-tests/tablet_copy-itest.cc: Line 268: > Can get rid of this Nnice ! yeah that was a useless return, updated all of above. -- To view, visit http://gerrit.cloudera.org:8080/3823 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 Gerrit-PatchSet: 12 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dinesh Bhat Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Kudu Jenkins has posted comments on this change. Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. Patch Set 12: Build Started http://104.196.14.100/job/kudu-gerrit/3099/ -- To view, visit http://gerrit.cloudera.org:8080/3823 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 Gerrit-PatchSet: 12 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dinesh Bhat Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Hello Dan Burkert, Mike Percy, Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/3823 to look at the new patch set (#12). Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata The test intends to corrupt the metadata of one of the tserver tablets. While the cluster is in the process of resurrecting the failed tserver, the Partition and PartitionSchema on that tserver is accessed using TabletPeer in an unguarded manner via ListTablets RPCs. The proposed fix here keeps the Partition and PartitionSchema immutable once after it is loaded during replica bootstrap. These fields are never overwritten again during tablet-copy or any other workflow. Also a unit test is added to exercise this data race window. i.e, the unit test aims to issue ListTablets RPCs during a follower's tablet copy stage and provides an optimistic coverage for the fix. DCHECKs are added to make sure the oncoming protobuf fields match the immutable fields not updated during tablet-copy. Testing: Spun a test TabletCopyITest.TestDeleteTabletDuringTabletCopy which exercises this data race window, also ran original test which caught the race in raft_consensus-itest.TestCorruptReplicaMetadata. Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 --- M src/kudu/integration-tests/cluster_itest_util.h M src/kudu/integration-tests/tablet_copy-itest.cc M src/kudu/tablet/tablet_metadata.cc 3 files changed, 133 insertions(+), 16 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/23/3823/12 -- To view, visit http://gerrit.cloudera.org:8080/3823 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 Gerrit-PatchSet: 12 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dinesh Bhat Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Mike Percy has posted comments on this change. Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. Patch Set 11: (3 comments) http://gerrit.cloudera.org:8080/#/c/3823/11/src/kudu/integration-tests/tablet_copy-itest.cc File src/kudu/integration-tests/tablet_copy-itest.cc: Line 258: Status ListTabletsDuringTabletCopy(const TServerDetails* ts, Shouldn't this return void because it's a main thread function? Line 265: ListTablets(ts, MonoDelta::FromSeconds(30), &tablets); CHECK_OK() ? Line 268: return Status::OK(); Can get rid of this -- To view, visit http://gerrit.cloudera.org:8080/3823 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 Gerrit-PatchSet: 11 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dinesh Bhat Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Dinesh Bhat has posted comments on this change. Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. Patch Set 10: (10 comments) TFTR Mike, updated new patch, and also responses inline below. http://gerrit.cloudera.org:8080/#/c/3823/8/src/kudu/integration-tests/tablet_copy-itest.cc File src/kudu/integration-tests/tablet_copy-itest.cc: Line 327: // Threads to issue ListTablets RPCs to follower. > 4 spaces is the minimum. ok cool, i may have missed 'minimum' part. http://gerrit.cloudera.org:8080/#/c/3823/10/src/kudu/integration-tests/tablet_copy-itest.cc File src/kudu/integration-tests/tablet_copy-itest.cc: Line 23: #include > nit: alphabetical order for includes would put this at the top Fixed. Line 254: Status ListTabletsDuringTabletCopy(const TServerDetails* ts, > Mind adding a quick doc comment to this function? Done. Line 276: // This is only optimistic in that, we hope to catch the > Worth mentioning that we rely on TSAN to "catch" it. added some more verbiage, the line above had a tsan mention though. Line 322: std::atomic transition(false); > I don't think we use the transition variable anymore. good catch, removed. Line 324: int num_threads = FLAGS_test_num_threads; > why bother having this extra variable num_threads? good point, removed. that also makes the actual ListTablet thread lot thinner :) Line 355: finish = true; > It's only a test, but we can get away with finish.store(true, memory_order_ my thinking was since this is a test we could get away with the memory barrier :), but I see your point and agree with you, done. http://gerrit.cloudera.org:8080/#/c/3823/8/src/kudu/tablet/tablet_metadata.cc File src/kudu/tablet/tablet_metadata.cc: Line 283: last_durable_mrs_id_ = superblock.last_durable_mrs_id(); > Are you sure table_name_ is not accessed by ListTablets()? I was speaking from immutability point of view. To answer your Qn, table_name_ is accessed all over including ListTablets, but in safe way. The same lock which is held during SuperblockLoading is held across reads for table_name_ too. string TabletMetadata::table_name() const { std::lock_guard l(data_lock_); DCHECK_NE(state_, kNotLoadedYet); return table_name_; } http://gerrit.cloudera.org:8080/#/c/3823/10/src/kudu/tablet/tablet_metadata.cc File src/kudu/tablet/tablet_metadata.cc: Line 317: DCHECK_EQ( > Although, another thing to consider is whether there is always a guaranteed H, good point. Tests on localhost passing, I will run various tests on different platforms exercising this routine to make sure assumption holds true. Line 323: DCHECK_EQ(0, table_id_.compare(superblock.table_id())); > DCHECK_EQ(table_id_, superblock.table_id()) Ha ! didn't realize DCHECK_CHECK actually acts on string. thank you. Took care of all the above comments. -- To view, visit http://gerrit.cloudera.org:8080/3823 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 Gerrit-PatchSet: 10 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dinesh Bhat Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Kudu Jenkins has posted comments on this change. Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. Patch Set 11: Build Started http://104.196.14.100/job/kudu-gerrit/3096/ -- To view, visit http://gerrit.cloudera.org:8080/3823 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 Gerrit-PatchSet: 11 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dinesh Bhat Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Hello Dan Burkert, Mike Percy, Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/3823 to look at the new patch set (#11). Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata The test intends to corrupt the metadata of one of the tserver tablets. While the cluster is in the process of resurrecting the failed tserver, the Partition and PartitionSchema on that tserver is accessed using TabletPeer in an unguarded manner via ListTablets RPCs. The proposed fix here keeps the Partition and PartitionSchema immutable once after it is loaded during replica bootstrap. These fields are never overwritten again during tablet-copy or any other workflow. Also a unit test is added to exercise this data race window. i.e, the unit test aims to issue ListTablets RPCs during a follower's tablet copy stage and provides an optimistic coverage for the fix. DCHECKs are added to make sure the oncoming protobuf fields match the immutable fields not updated during tablet-copy. Testing: Spun a new test TabletCopyITest.TestDeleteTabletDuringTabletCopy which exercises this data race window, also ran original test which caught the race in raft_consensus-itest.TestCorruptReplicaMetadata Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 --- M src/kudu/integration-tests/cluster_itest_util.h M src/kudu/integration-tests/tablet_copy-itest.cc M src/kudu/tablet/tablet_metadata.cc 3 files changed, 135 insertions(+), 16 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/23/3823/11 -- To view, visit http://gerrit.cloudera.org:8080/3823 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 Gerrit-PatchSet: 11 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dinesh Bhat Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Mike Percy has posted comments on this change. Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. Patch Set 10: (1 comment) http://gerrit.cloudera.org:8080/#/c/3823/10/src/kudu/tablet/tablet_metadata.cc File src/kudu/tablet/tablet_metadata.cc: Line 317: DCHECK_EQ( > I don't think this is valid. The easiest thing to do is DCHECK_EQ(superbloc Although, another thing to consider is whether there is always a guaranteed order of the components. I'm not exactly sure -- To view, visit http://gerrit.cloudera.org:8080/3823 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 Gerrit-PatchSet: 10 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dinesh Bhat Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Mike Percy has posted comments on this change. Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. Patch Set 10: (13 comments) http://gerrit.cloudera.org:8080/#/c/3823/8/src/kudu/integration-tests/tablet_copy-itest.cc File src/kudu/integration-tests/tablet_copy-itest.cc: Line 327: // Threads to issue ListTablets RPCs to follower. > thought of following 'function call' formatting guideline. Shouldn't next l 4 spaces is the minimum. http://gerrit.cloudera.org:8080/#/c/3823/10/src/kudu/integration-tests/tablet_copy-itest.cc File src/kudu/integration-tests/tablet_copy-itest.cc: Line 23: #include nit: alphabetical order for includes would put this at the top Line 254: Status ListTabletsDuringTabletCopy(const TServerDetails* ts, Mind adding a quick doc comment to this function? Line 276: // This is only optimistic in that, we hope to catch the Worth mentioning that we rely on TSAN to "catch" it. Line 322: std::atomic transition(false); I don't think we use the transition variable anymore. Line 324: int num_threads = FLAGS_test_num_threads; why bother having this extra variable num_threads? Line 355: finish = true; It's only a test, but we can get away with finish.store(true, memory_order_relaxed) here since we don't need a memory barrier. http://gerrit.cloudera.org:8080/#/c/3823/8/src/kudu/tablet/tablet_metadata.cc File src/kudu/tablet/tablet_metadata.cc: Line 283: last_durable_mrs_id_ = superblock.last_durable_mrs_id(); > table_name_ is being set by AlterSchema, so even though it's not supported, Are you sure table_name_ is not accessed by ListTablets()? http://gerrit.cloudera.org:8080/#/c/3823/10/src/kudu/tablet/tablet_metadata.cc File src/kudu/tablet/tablet_metadata.cc: Line 312: } else { Can you do the validation in the else-block in the same order that you do assignment in the if-block? Line 313: PartitionPB partitionPB; nit: camelCaps not allowed for variable names in C++ by the style guide. Also below. Name this partition_pb and the below one partition_schema_pb (also note there is a typo in the variable name "partion_schemaPB"). Line 317: DCHECK_EQ( I don't think this is valid. The easiest thing to do is DCHECK_EQ(superblock.partition().SerializeAsString(), partition_pb.SerializeAsString()); Line 320: DCHECK_EQ( Also here Line 323: DCHECK_EQ(0, table_id_.compare(superblock.table_id())); DCHECK_EQ(table_id_, superblock.table_id()) -- To view, visit http://gerrit.cloudera.org:8080/3823 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 Gerrit-PatchSet: 10 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dinesh Bhat Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Hello Dan Burkert, Mike Percy, Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/3823 to look at the new patch set (#10). Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata The test intends to corrupt the metadata of one of the tserver tablets. While the cluster is in the process of resurrecting the failed tserver, the Partition and PartitionSchema on that tserver is accessed using TabletPeer in an unguarded manner via ListTablets RPCs. The proposed fix here keeps the Partition and PartitionSchema immutable once after it is loaded during replica bootstrap. These fields are never overwritten again during Tablet Copy or any other workflow. Also a unit test is added to exercise this data race window. i.e, the unit test aims to issue ListTablets RPCs during a follower's tablet copy stage and provides an optimistic coverage for the fix. Testing: Passed about 2000 iteration of the failing test raft_consensus-itest.TestCorruptReplicaMetadata Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 --- M src/kudu/integration-tests/cluster_itest_util.h M src/kudu/integration-tests/tablet_copy-itest.cc M src/kudu/tablet/tablet_metadata.cc 3 files changed, 140 insertions(+), 16 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/23/3823/10 -- To view, visit http://gerrit.cloudera.org:8080/3823 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 Gerrit-PatchSet: 10 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dinesh Bhat Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Kudu Jenkins has posted comments on this change. Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. Patch Set 10: Build Started http://104.196.14.100/job/kudu-gerrit/3061/ -- To view, visit http://gerrit.cloudera.org:8080/3823 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 Gerrit-PatchSet: 10 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dinesh Bhat Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Kudu Jenkins has posted comments on this change. Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. Patch Set 9: Build Started http://104.196.14.100/job/kudu-gerrit/3053/ -- To view, visit http://gerrit.cloudera.org:8080/3823 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 Gerrit-PatchSet: 9 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dinesh Bhat Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Dinesh Bhat has posted comments on this change. Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. Patch Set 8: (12 comments) TFTR MIke, addressed all the comments, and also re-ran the test suite. Please take a look at responses inline. http://gerrit.cloudera.org:8080/#/c/3823/8/src/kudu/integration-tests/tablet_copy-itest.cc File src/kudu/integration-tests/tablet_copy-itest.cc: Line 42: DEFINE_int32(num_test_threads, 16, > Let's follow the convention of the other params and name this test_num_thre Done. Line 262: // Step 4: Verify that at least one thread caught the state transition. > Let's just remove step 4. Done. Line 287: for (int i = 0; i < cluster_->num_tablet_servers(); i++) { > This check is not needed. WaitForServersToAgree() already effectively achie Yeah, bad copy paste skills :), done. Line 298: // Find out who is leader so that we can tombstone a follower > It doesn't matter whether you kill a leader or a follower, does it? We only Sounds good, that's true - that was the original intention of choosing a follower to tombstone to keep this test specifically geared towards kudu-1500. Line 306: AtomicBool transition(false); > These days, prefer std::atomic_bool or std::atomic (the former is a t Done. Line 327: << " on TS " << follower_ts->uuid(); > nit: it's prettier / more readable to align the << with the one on the prev thought of following 'function call' formatting guideline. Shouldn't next line be offset by 4 spaces ? Took your suggestion anyways. Line 347: if (!transition.Load()) > Hmm. I think we should remove this since it's not really doing anything, an Done. Agreed. Line 351: ASSERT_TRUE(cluster_->tablet_server(kTsIndex)->IsProcessAlive()); > This can be replaced with NO_FATALS(cluster_->AssertNoCrashes()); Done, thanks. http://gerrit.cloudera.org:8080/#/c/3823/8/src/kudu/tablet/tablet_metadata.cc File src/kudu/tablet/tablet_metadata.cc: Line 283: table_id_ = superblock.table_id(); > What about table_id_? Shouldn't this also be inside the "if" statement belo table_name_ is being set by AlterSchema, so even though it's not supported, we could keep it mutable(as-is) as long as it is not racy. I liked idea of doing DCHECK on all other 3 objects inside an else{} that they didn't differ from our assumption, added code and also tested them. Please take a look again. I am also getting this warning for the memcmp usage: "/Users/dinesh/Documents/kudu_exp/kudu/src/kudu/tablet/tablet_metadata.cc:317:140: warning: first operand of this 'memcmp' call is a pointer to dynamic class 'PartitionPB'; vtable pointer will be compared [-Wdynamic-class-memaccess]" Line 298: LOG_WITH_PREFIX(WARNING) << "Upgrading from Kudu 0.5.0 directly to this" > This seems like it should be a FATAL error to me. Dan? Hmmm, I kept it as warning for now, but I guess I was supposed to return error from here, just fixed that now. Line 300: << " moving to a higher version." << std::endl; > std::endl is not required here and should be removed Done. Line 303: if (state_ == kNotLoadedYet) { > Please add a comment here explaining why we only do this when the TabletMet Done. -- To view, visit http://gerrit.cloudera.org:8080/3823 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 Gerrit-PatchSet: 8 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dinesh Bhat Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Hello Dan Burkert, Mike Percy, Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/3823 to look at the new patch set (#9). Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata The test intends to corrupt the metadata of one of the tserver tablets. While the cluster is in the process of resurrecting the failed tserver, the Partition and PartitionSchema on that tserver is accessed using TabletPeer in an unguarded manner via ListTablets RPCs. The proposed fix here keeps the Partition and PartitionSchema immutable once after it is loaded during replica bootstrap. These fields are never overwritten again during Tablet Copy or any other workflow. Also a unit test is added to exercise this data race window. i.e, the unit test aims to issue ListTablets RPCs during a follower's tablet copy stage and provides an optimistic coverage for the fix. Testing: Passed about 2000 iteration of the failing test raft_consensus-itest.TestCorruptReplicaMetadata Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 --- M src/kudu/integration-tests/cluster_itest_util.h M src/kudu/integration-tests/tablet_copy-itest.cc M src/kudu/tablet/tablet_metadata.cc 3 files changed, 139 insertions(+), 16 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/23/3823/9 -- To view, visit http://gerrit.cloudera.org:8080/3823 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 Gerrit-PatchSet: 9 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dinesh Bhat Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Mike Percy has posted comments on this change. Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. Patch Set 8: (13 comments) http://gerrit.cloudera.org:8080/#/c/3823/8/src/kudu/integration-tests/cluster_itest_util.h File src/kudu/integration-tests/cluster_itest_util.h: Line 242: Status ListTabletsDuringTabletCopy(const TServerDetails* ts, I don't think this utility is generic enough to put here. It is very specific. Let's move it to the test. http://gerrit.cloudera.org:8080/#/c/3823/8/src/kudu/integration-tests/tablet_copy-itest.cc File src/kudu/integration-tests/tablet_copy-itest.cc: Line 42: DEFINE_int32(num_test_threads, 16, Let's follow the convention of the other params and name this test_num_threads Line 262: // Step 4: Verify that at least one thread caught the state transition. Let's just remove step 4. Line 287: for (int i = 0; i < cluster_->num_tablet_servers(); i++) { This check is not needed. WaitForServersToAgree() already effectively achieves this. Line 298: // Find out who is leader so that we can tombstone a follower It doesn't matter whether you kill a leader or a follower, does it? We only care that one of the servers eventually becomes leader and starts a tablet copy on our tombstoned replica. That said, even though this isn't necessary for correctness, it makes average test time lower since we don't have to wait for an election to get the tablet copy kicked off. Let's just document that as the reason for doing this. Line 306: AtomicBool transition(false); These days, prefer std::atomic_bool or std::atomic (the former is a typedef of the latter) from C++11 to this older C++03 compatible atomics library. Line 327: << " on TS " << follower_ts->uuid(); nit: it's prettier / more readable to align the << with the one on the previous line Line 347: if (!transition.Load()) Hmm. I think we should remove this since it's not really doing anything, and trying to rely on it would make this test racy. No one will ever check this test log output and if they did it wouldn't tell them very much. Line 351: ASSERT_TRUE(cluster_->tablet_server(kTsIndex)->IsProcessAlive()); This can be replaced with NO_FATALS(cluster_->AssertNoCrashes()); http://gerrit.cloudera.org:8080/#/c/3823/8/src/kudu/tablet/tablet_metadata.cc File src/kudu/tablet/tablet_metadata.cc: Line 283: table_id_ = superblock.table_id(); What about table_id_? Shouldn't this also be inside the "if" statement below? Also the table_name_. Right now, we don't support renaming tables. I think we should add a DHECK that these things didn't change and then put them inside the "if". Line 298: LOG_WITH_PREFIX(WARNING) << "Upgrading from Kudu 0.5.0 directly to this" This seems like it should be a FATAL error to me. Dan? Line 300: << " moving to a higher version." << std::endl; std::endl is not required here and should be removed Line 303: if (state_ == kNotLoadedYet) { Please add a comment here explaining why we only do this when the TabletMetadata is uninitialized. -- To view, visit http://gerrit.cloudera.org:8080/3823 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 Gerrit-PatchSet: 8 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dinesh Bhat Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Dinesh Bhat has posted comments on this change. Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. Patch Set 8: (2 comments) TFTR Dan, addressed both comments with responses inline... http://gerrit.cloudera.org:8080/#/c/3823/7/src/kudu/integration-tests/tablet_copy-itest.cc File src/kudu/integration-tests/tablet_copy-itest.cc: Line 323: barrier.Wait(); > Could this be done with a countdown latch or similar? 5 seconds is a very Just stumbled across a cute Barrier implementation in util/, I am beginning to appreciate the power of C++ util libraries at the moment :) http://gerrit.cloudera.org:8080/#/c/3823/7/src/kudu/tablet/tablet_metadata.cc File src/kudu/tablet/tablet_metadata.cc: Line 298: LOG_WITH_PREFIX(WARNING) << "Upgrading from Kudu 0.5.0 directly to this" > I think this is a little hard to follow, perhaps a better way to state it i Done, although I wonder if there are any other situations which meet !superblock.has_partition() in which case this message may become misleading. -- To view, visit http://gerrit.cloudera.org:8080/3823 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 Gerrit-PatchSet: 8 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dinesh Bhat Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Kudu Jenkins has posted comments on this change. Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. Patch Set 8: Build Started http://104.196.14.100/job/kudu-gerrit/2866/ -- To view, visit http://gerrit.cloudera.org:8080/3823 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 Gerrit-PatchSet: 8 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dinesh Bhat Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Hello Dan Burkert, Mike Percy, Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/3823 to look at the new patch set (#8). Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata The test intends to corrupt the metadata of one of the tserver tablets. While the cluster is in the process of resurrecting the failed tserver, the Partition and PartitionSchema on that tserver is accessed using TabletPeer in an unguarded manner via ListTablets RPCs. The proposed fix here keeps the Partition and PartitionSchema immutable once after it is loaded during replica bootstrap. These fields are never overwritten again during Tablet Copy or any other workflow. Also a unit test is added to exercise this data race window. i.e, the unit test aims to issue ListTablets RPCs during a follower's tablet copy stage and provides an optimistic coverage for the fix. Testing: Passed about 2000 iteration of the failing test raft_consensus-itest.TestCorruptReplicaMetadata Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 --- M src/kudu/integration-tests/cluster_itest_util.cc M src/kudu/integration-tests/cluster_itest_util.h M src/kudu/integration-tests/tablet_copy-itest.cc M src/kudu/tablet/tablet_metadata.cc 4 files changed, 138 insertions(+), 16 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/23/3823/8 -- To view, visit http://gerrit.cloudera.org:8080/3823 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 Gerrit-PatchSet: 8 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dinesh Bhat Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Dan Burkert has posted comments on this change. Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. Patch Set 7: (2 comments) http://gerrit.cloudera.org:8080/#/c/3823/7/src/kudu/integration-tests/tablet_copy-itest.cc File src/kudu/integration-tests/tablet_copy-itest.cc: Line 323: SleepFor(MonoDelta::FromSeconds(5)); Could this be done with a countdown latch or similar? 5 seconds is a very long time to wait. http://gerrit.cloudera.org:8080/#/c/3823/7/src/kudu/tablet/tablet_metadata.cc File src/kudu/tablet/tablet_metadata.cc: Line 298: LOG_WITH_PREFIX(WARNING) << "Upgrade from release <= 0.5 to > 0.6 " I think this is a little hard to follow, perhaps a better way to state it is: Upgrading from Kudu 0.5.0 directly to this version is not supported. Please upgrade to 0.6.0 before moving to a higher version. -- To view, visit http://gerrit.cloudera.org:8080/3823 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 Gerrit-PatchSet: 7 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dinesh Bhat Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Dinesh Bhat has abandoned this change. Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. Abandoned abandoning due to duplicate links. original review is under http://gerrit.cloudera.org:8080/3823 -- To view, visit http://gerrit.cloudera.org:8080/3955 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: abandon Gerrit-Change-Id: I0a7a62b7c5aa62e9888b4b454790861df9fe2ad5 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Dinesh Bhat has abandoned this change. Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. Abandoned abandoning due to duplicate links. original review is under http://gerrit.cloudera.org:8080/3823 -- To view, visit http://gerrit.cloudera.org:8080/3957 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: abandon Gerrit-Change-Id: I204e9ef326b95b4d6216778a35b0438760ff Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Dinesh Bhat has abandoned this change. Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. Abandoned abandoning due to duplicate links. original review is under http://gerrit.cloudera.org:8080/3823 -- To view, visit http://gerrit.cloudera.org:8080/3956 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: abandon Gerrit-Change-Id: I0cfea3ea1b8e308fe05a0b3e94c93b7b1369ac24 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Hello Dan Burkert, Mike Percy, Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/3823 to look at the new patch set (#7). Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata The test intends to corrupt the metadata of one of the tserver tablets. While the cluster is in the process of resurrecting the failed tserver, the Partition and PartitionSchema on that tserver is accessed using TabletPeer in an unguarded manner via ListTablets RPCs. The proposed fix here keeps the Partition and PartitionSchema immutable once after it is loaded during replica bootstrap. These fields are never overwritten again during Tablet Copy or any other workflow. Also a unit test is added to exercise this data race window. i.e, the unit test aims to issue ListTablets RPCs during a follower's tablet copy stage and provides an optimistic coverage for the fix. Testing: Passed about 2000 iteration of the failing test raft_consensus-itest.TestCorruptReplicaMetadata Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 --- M src/kudu/integration-tests/cluster_itest_util.cc M src/kudu/integration-tests/cluster_itest_util.h M src/kudu/integration-tests/tablet_copy-itest.cc M src/kudu/tablet/tablet_metadata.cc 4 files changed, 136 insertions(+), 16 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/23/3823/7 -- To view, visit http://gerrit.cloudera.org:8080/3823 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 Gerrit-PatchSet: 7 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dinesh Bhat Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Kudu Jenkins has posted comments on this change. Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. Patch Set 7: Build Started http://104.196.14.100/job/kudu-gerrit/2840/ -- To view, visit http://gerrit.cloudera.org:8080/3823 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 Gerrit-PatchSet: 7 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dinesh Bhat Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Hello Dan Burkert, Mike Percy, I'd like you to do a code review. Please visit http://gerrit.cloudera.org:8080/3957 to review the following change. Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata The test intends to corrupt the metadata of one of the tserver tablets. While the cluster is in the process of resurrecting the failed tserver, the Partition and PartitionSchema on that tserver is accessed using TabletPeer in an unguarded manner via ListTablets RPCs. The proposed fix here keeps the Partition and PartitionSchema immutable once after it is loaded during replica bootstrap. These fields are never overwritten again during Tablet Copy or any other workflow. Also a unit test is added to exercise this data race window. i.e, the unit test aims to issue ListTablets RPCs during a follower's tablet copy stage and provides an optimistic coverage for the fix. Testing: Passed about 2000 iteration of the failing test raft_consensus-itest.TestCorruptReplicaMetadata Change-Id: I0a7a62b7c5aa62e9888b4b454790861df9fe2ad5 Change-Id: I204e9ef326b95b4d6216778a35b0438760ff --- M src/kudu/integration-tests/cluster_itest_util.cc M src/kudu/integration-tests/cluster_itest_util.h M src/kudu/integration-tests/tablet_copy-itest.cc M src/kudu/tablet/tablet_metadata.cc 4 files changed, 136 insertions(+), 16 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/57/3957/1 -- To view, visit http://gerrit.cloudera.org:8080/3957 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newchange Gerrit-Change-Id: I204e9ef326b95b4d6216778a35b0438760ff Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Mike Percy
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Kudu Jenkins has posted comments on this change. Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. Patch Set 1: Build Started http://104.196.14.100/job/kudu-gerrit/2836/ -- To view, visit http://gerrit.cloudera.org:8080/3957 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I204e9ef326b95b4d6216778a35b0438760ff Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-HasComments: No
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Dinesh Bhat has posted comments on this change. Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. Patch Set 6: (2 comments) TFTR mpercy/dan, I have posted the new diffs under https://gerrit.cloudera.org/#/c/3955/, and decided to abandon this patch. PS: I tried to use same Change-Id as this one in my commit message to keep everything in one thread, but failed to do so because I got the new link when I published the new set of diffs. http://gerrit.cloudera.org:8080/#/c/3823/6/src/kudu/tablet/tablet_metadata.h File src/kudu/tablet/tablet_metadata.h: Line 109: const Partition partition_locked() const { > Is this the right return type? Seems weird to return const by value. Yeah, please ignore this change altogether, I have posted a new review with the new fix we had discussed earlier this week, and I am gonna abandon this change. http://gerrit.cloudera.org:8080/#/c/3823/6/src/kudu/tserver/tserver-path-handlers.cc File src/kudu/tserver/tserver-path-handlers.cc: Line 211: peer->tablet_metadata()->schema()); > Note: schema_ is also modified in TabletMetadata::LoadFromSuperBlock() Yes, but schema_ is not race-prone in that it has atomic pointer swap to new schema_ and also keeps history of stale schema_. -- To view, visit http://gerrit.cloudera.org:8080/3823 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 Gerrit-PatchSet: 6 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dinesh Bhat Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Hello Dan Burkert, Mike Percy, I'd like you to do a code review. Please visit http://gerrit.cloudera.org:8080/3955 to review the following change. Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata The test intends to corrupt the metadata of one of the tserver tablets. While the cluster is in the process of resurrecting the failed tserver, the Partition and PartitionSchema on that tserver is accessed using TabletPeer in an unguarded manner via ListTablets RPCs. The proposed fix here keeps the Partition and PartitionSchema immutable once after it is loaded during replica bootstrap. These fields are never overwritten again during Tablet Copy or any other workflow. Also a unit test is added to exercise this data race window. i.e, the unit test aims to issue ListTablets RPCs during a follower's tablet copy stage and provides an optimistic coverage for the fix. Testing: Passed about 2000 iteration of the failing test raft_consensus-itest.TestCorruptReplicaMetadata Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 Change-Id: I0a7a62b7c5aa62e9888b4b454790861df9fe2ad5 --- M src/kudu/integration-tests/cluster_itest_util.cc M src/kudu/integration-tests/cluster_itest_util.h M src/kudu/integration-tests/tablet_copy-itest.cc M src/kudu/tablet/tablet_metadata.cc 4 files changed, 136 insertions(+), 16 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/55/3955/1 -- To view, visit http://gerrit.cloudera.org:8080/3955 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newchange Gerrit-Change-Id: I0a7a62b7c5aa62e9888b4b454790861df9fe2ad5 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Mike Percy
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Kudu Jenkins has posted comments on this change. Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. Patch Set 1: Build Started http://104.196.14.100/job/kudu-gerrit/2835/ -- To view, visit http://gerrit.cloudera.org:8080/3956 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I0cfea3ea1b8e308fe05a0b3e94c93b7b1369ac24 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-HasComments: No
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Hello Dan Burkert, Mike Percy, I'd like you to do a code review. Please visit http://gerrit.cloudera.org:8080/3956 to review the following change. Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata The test intends to corrupt the metadata of one of the tserver tablets. While the cluster is in the process of resurrecting the failed tserver, the Partition and PartitionSchema on that tserver is accessed using TabletPeer in an unguarded manner via ListTablets RPCs. The proposed fix here keeps the Partition and PartitionSchema immutable once after it is loaded during replica bootstrap. These fields are never overwritten again during Tablet Copy or any other workflow. Also a unit test is added to exercise this data race window. i.e, the unit test aims to issue ListTablets RPCs during a follower's tablet copy stage and provides an optimistic coverage for the fix. Testing: Passed about 2000 iteration of the failing test raft_consensus-itest.TestCorruptReplicaMetadata Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 Change-Id: I0cfea3ea1b8e308fe05a0b3e94c93b7b1369ac24 --- M src/kudu/integration-tests/cluster_itest_util.cc M src/kudu/integration-tests/cluster_itest_util.h M src/kudu/integration-tests/tablet_copy-itest.cc M src/kudu/tablet/tablet_metadata.cc 4 files changed, 136 insertions(+), 16 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/56/3956/1 -- To view, visit http://gerrit.cloudera.org:8080/3956 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newchange Gerrit-Change-Id: I0cfea3ea1b8e308fe05a0b3e94c93b7b1369ac24 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Mike Percy
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Kudu Jenkins has posted comments on this change. Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. Patch Set 1: Build Started http://104.196.14.100/job/kudu-gerrit/2834/ -- To view, visit http://gerrit.cloudera.org:8080/3955 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I0a7a62b7c5aa62e9888b4b454790861df9fe2ad5 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-HasComments: No
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Dan Burkert has posted comments on this change. Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. Patch Set 6: (2 comments) http://gerrit.cloudera.org:8080/#/c/3823/6/src/kudu/tablet/tablet_metadata.h File src/kudu/tablet/tablet_metadata.h: Line 109: const Partition partition_locked() const { Is this the right return type? Seems weird to return const by value. Line 142: const PartitionSchema partition_schema_locked() const { same here. -- To view, visit http://gerrit.cloudera.org:8080/3823 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 Gerrit-PatchSet: 6 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dinesh Bhat Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Mike Percy has posted comments on this change. Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. Patch Set 6: (1 comment) http://gerrit.cloudera.org:8080/#/c/3823/6/src/kudu/tserver/tserver-path-handlers.cc File src/kudu/tserver/tserver-path-handlers.cc: Line 211: peer->tablet_metadata()->schema()); Note: schema_ is also modified in TabletMetadata::LoadFromSuperBlock() -- To view, visit http://gerrit.cloudera.org:8080/3823 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 Gerrit-PatchSet: 6 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dinesh Bhat Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Dinesh Bhat has posted comments on this change. Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. Patch Set 6: (1 comment) TFTR MIke, this is still WIP, I uploaded one missing piece from my previous patchset to confirm jenkins test suite pass. I will upload another patchset today. http://gerrit.cloudera.org:8080/#/c/3823/4/src/kudu/tablet/tablet_metadata.h File src/kudu/tablet/tablet_metadata.h: Line 142: const PartitionSchema partition_schema_locked() const { > I don't think this is needed anymore if we make PartitionSchema::PartitionD I think I had missed out one important callsite in ListTablets(), while cleaning up the changes. I am inclined to keep this routine and PartitionDebugString as it is since this routine is called by 2 other callers now. This is still WIP as I uploaded the patch only to see if kudu jenkins test suite is happy, -- To view, visit http://gerrit.cloudera.org:8080/3823 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 Gerrit-PatchSet: 6 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dinesh Bhat Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/3823 to look at the new patch set (#6). Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata The test intends to corrupt the metadata of one of the tserver tablets. While the cluster is in the process of resurrecting the failed tserver, the partition schema on that tserver is accessed in an unguarded manner from a thread servicing ListTablets API. The proposed fix here accesses the metadata partition resources via copy by value mechanism now in slow paths, where the copy itself happens under the same lock which makes the metadata resurrection operation idempotent. Fast paths are untouched since the parition schema is bound to be accessed in fast paths only after tablet metadata is fully resurrected and tablet is in RUNNING state. Testing: Passed about 2000 iteration of the failing test raft_consensus-itest.TestCorruptReplicaMetadata Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 --- M src/kudu/tablet/tablet.cc M src/kudu/tablet/tablet_bootstrap.cc M src/kudu/tablet/tablet_bootstrap.h M src/kudu/tablet/tablet_metadata.h M src/kudu/tablet/tablet_peer.cc M src/kudu/tools/fs_tool.cc M src/kudu/tserver/tablet_service.cc M src/kudu/tserver/tserver-path-handlers.cc 8 files changed, 20 insertions(+), 14 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/23/3823/6 -- To view, visit http://gerrit.cloudera.org:8080/3823 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 Gerrit-PatchSet: 6 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dinesh Bhat Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Kudu Jenkins has posted comments on this change. Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. Patch Set 6: Build Started http://104.196.14.100/job/kudu-gerrit/2760/ -- To view, visit http://gerrit.cloudera.org:8080/3823 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 Gerrit-PatchSet: 6 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dinesh Bhat Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Kudu Jenkins has posted comments on this change. Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. Patch Set 5: Build Started http://104.196.14.100/job/kudu-gerrit/2758/ -- To view, visit http://gerrit.cloudera.org:8080/3823 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 Gerrit-PatchSet: 5 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dinesh Bhat Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Mike Percy has posted comments on this change. Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. Patch Set 4: (3 comments) http://gerrit.cloudera.org:8080/#/c/3823/4/src/kudu/tablet/tablet_metadata.h File src/kudu/tablet/tablet_metadata.h: Line 104: const Partition& partition() const { Typically when we create functions like this we name them like: GetPartitionUnlocked() GetPartitionCopy() And add thread-safety comments to them. I'd even suggest going one step further. Have Tablet keep a copy of the Partition field locally when it's constructed and never access it from the TabletMetadata after that. That way all the calls that looks like tablet->metadata()->partition_locked() etc will just look like tablet->partition() and they don't have to worry about taking locks or copying. Line 142: const PartitionSchema partition_schema_locked() const { I don't think this is needed anymore if we make PartitionSchema::PartitionDebugString() static http://gerrit.cloudera.org:8080/#/c/3823/4/src/kudu/tserver/tserver-path-handlers.cc File src/kudu/tserver/tserver-path-handlers.cc: Line 209: string partition = peer->tablet_metadata()->partition_schema_locked() Can we just make PartitionSchema::PartitionDebugString() a static method? -- To view, visit http://gerrit.cloudera.org:8080/3823 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 Gerrit-PatchSet: 4 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dinesh Bhat Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Kudu Jenkins has posted comments on this change. Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. Patch Set 4: Build Started http://104.196.14.100/job/kudu-gerrit/2746/ -- To view, visit http://gerrit.cloudera.org:8080/3823 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 Gerrit-PatchSet: 4 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dinesh Bhat Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Kudu Jenkins has posted comments on this change. Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. Patch Set 3: Build Started http://104.196.14.100/job/kudu-gerrit/2740/ -- To view, visit http://gerrit.cloudera.org:8080/3823 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 Gerrit-PatchSet: 3 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dinesh Bhat Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/3823 to look at the new patch set (#3). Change subject: KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata .. KUDU-1500: Data race in RaftConsensusITest.TestCorruptReplicaMetadata The test intends to corrupt the metadata of one of the tserver tablets. While the cluster is in the process of resurrecting the failed tserver, the partition schema on that tserver is accessed in an unguarded manner from a thread servicing ListTablets API. The proposed fix here accesses the metadata partition resources via copy by value mechanism now in slow paths, where the copy itself happens under the same lock which makes the metadata resurrection operation idempotent. Fast paths are untouched since the parition schema is bound to be accessed in fast paths only after tablet metadata is fully resurrected and tablet is in RUNNING state. Testing: Passed about 2000 iteration of the failing test raft_consensus-itest.TestCorruptReplicaMetadata Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 --- M src/kudu/tablet/tablet.cc M src/kudu/tablet/tablet_bootstrap.cc M src/kudu/tablet/tablet_bootstrap.h M src/kudu/tablet/tablet_metadata.h M src/kudu/tablet/tablet_peer.cc M src/kudu/tools/fs_tool.cc M src/kudu/tserver/tserver-path-handlers.cc 7 files changed, 18 insertions(+), 13 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/23/3823/3 -- To view, visit http://gerrit.cloudera.org:8080/3823 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: If57617e22b41296b8d4e8ad131220f1ebb235019 Gerrit-PatchSet: 3 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Dinesh Bhat Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon