[jira] [Commented] (CASSANDRA-6134) More efficient BatchlogManager
[ https://issues.apache.org/jira/browse/CASSANDRA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969167#comment-13969167 ]

Aleksey Yeschenko commented on CASSANDRA-6134:

[~m0nstermind] FYI, I pushed a 2.1-rebased version to https://github.com/iamaleksey/cassandra/commits/6134

There are at least a couple of issues with it:
1. The rateLimiter.acquire() call at https://github.com/iamaleksey/cassandra/commit/337368f14aa3546e9d8057c48ab8f5a32efe88c4#diff-642bb5d5ca328b50d59f2a550c94e5edR280 uses the size of the list instead of the mutation size
2. Using Verb.READ_REPAIR in https://github.com/iamaleksey/cassandra/commit/337368f14aa3546e9d8057c48ab8f5a32efe88c4#diff-642bb5d5ca328b50d59f2a550c94e5edR402 does not, in fact, stop C* from writing a hint on timeout

I haven't forgotten about the issue, but need to fix a few batchlog/HHOM bugs first :\

More efficient BatchlogManager

Key: CASSANDRA-6134
URL: https://issues.apache.org/jira/browse/CASSANDRA-6134
Project: Cassandra
Issue Type: Improvement
Reporter: Oleg Anastasyev
Assignee: Oleg Anastasyev
Priority: Minor
Attachments: 6134-async.txt, 6134-cleanup.txt, BatchlogManager.txt

As we discussed earlier in CASSANDRA-6079 this is the new BatchManager.
It stores batch records in
{code}
CREATE TABLE batchlog (
  id_partition int,
  id timeuuid,
  data blob,
  PRIMARY KEY (id_partition, id)
) WITH COMPACT STORAGE AND CLUSTERING ORDER BY (id DESC)
{code}
where id_partition is minute-since-epoch of id uuid.
So when it scans for batches to replay it scans within a single partition for a slice of ids since the last processed date till now minus write timeout.
So neither a full batchlog CF scan nor a lot of random reads are made on a normal cycle.

Other improvements:
1. It runs every 1/2 of write timeout and replays all batches written within 0.9 * write timeout from now. This way we ensure that batched updates will be replayed by the moment the client times out from the coordinator.
2. It submits all mutations from a single batch in parallel (like StorageProxy does). The old implementation played them one by one, so a client can see half-applied batches in the CF for a long time (depending on the size of the batch).
3. It fixes a subtle racing bug with incorrect hint ttl calculation

--
This message was sent by Atlassian JIRA (v6.2#6252)
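The partitioning described in the issue can be sketched roughly as follows. This is an illustration in Python rather than the attached patch: the function names (`unix_millis`, `id_partition`, `partitions_to_scan`) are hypothetical, and the scan window is assumed to run from the last processed time up to now minus the write timeout, as the description states.

```python
import uuid

# 100 ns ticks between the UUID epoch (1582-10-15) and the Unix epoch.
UUID_EPOCH_OFFSET = 0x01B21DD213814000


def unix_millis(time_uuid: uuid.UUID) -> int:
    """Unix time in milliseconds embedded in a version 1 (time-based) UUID."""
    if time_uuid.version != 1:
        raise ValueError("not a time-based (v1) UUID")
    return (time_uuid.time - UUID_EPOCH_OFFSET) // 10_000


def id_partition(time_uuid: uuid.UUID) -> int:
    """Minutes since the Unix epoch: the partition a batch record lands in."""
    return unix_millis(time_uuid) // 60_000


def partitions_to_scan(last_processed_ms: int, now_ms: int,
                       write_timeout_ms: int) -> range:
    """Minute partitions covering [last_processed, now - write_timeout]."""
    upper_ms = now_ms - write_timeout_ms
    if upper_ms < last_processed_ms:
        return range(0)  # nothing is old enough to replay yet
    return range(last_processed_ms // 60_000, upper_ms // 60_000 + 1)
```

On a normal cycle the window is far shorter than a minute, so under these assumptions the range holds one partition, or two when the window straddles a minute boundary.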
[jira] [Commented] (CASSANDRA-6134) More efficient BatchlogManager
[ https://issues.apache.org/jira/browse/CASSANDRA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13910081#comment-13910081 ]

Oleg Anastasyev commented on CASSANDRA-6134:

Um, I am not sure what exactly I can do on this task. If #1 is implemented by Aleksey, I could take the async replay then.
[jira] [Commented] (CASSANDRA-6134) More efficient BatchlogManager
[ https://issues.apache.org/jira/browse/CASSANDRA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13905471#comment-13905471 ]

Jonathan Ellis commented on CASSANDRA-6134:

Are you planning to pick this back up, [~m0nstermind]?
[jira] [Commented] (CASSANDRA-6134) More efficient BatchlogManager
[ https://issues.apache.org/jira/browse/CASSANDRA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13905891#comment-13905891 ]

Aleksey Yeschenko commented on CASSANDRA-6134:

For the record, several of those improvements have already made it into C*. If we don't do the partitioning, then only two are left to implement:
1. Don't do full scans, but limit the range to (nothing could be written earlier than that, batches not ready to replay yet) - the uuids are timeuuids there now, so it's a simple change, on my todo list
2. Replay several batches simultaneously, async - this is slightly more work, but only slightly

Stuff that made it recently, thanks to rbranson: CASSANDRA-6569, CASSANDRA-6550, CASSANDRA-6488, CASSANDRA-6481
Stuff that's still waiting (aside from 1. and 2.): CASSANDRA-6551
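Replaying several batches at once could look something like the sketch below. It is written in Python with a thread pool purely for illustration; `replay_all`, `replay_batch` and `max_in_flight` are hypothetical names, and how the actual BatchlogManager schedules its work is not shown in this thread.

```python
from concurrent.futures import ThreadPoolExecutor


def replay_all(batches, replay_batch, max_in_flight=4):
    """Replay batches concurrently; return ids of batches that replayed cleanly.

    `batches` is an iterable of (batch_id, mutations) pairs and
    `replay_batch(batch_id, mutations)` is a caller-supplied callable
    that raises on failure.
    """
    replayed = []
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        # Submit everything first so up to max_in_flight batches run at once.
        futures = {
            batch_id: pool.submit(replay_batch, batch_id, mutations)
            for batch_id, mutations in batches
        }
        for batch_id, future in futures.items():
            # exception() blocks until the future finishes and returns None on success.
            if future.exception() is None:
                replayed.append(batch_id)
    return replayed
```

A failed batch is simply left out of the result here, so the caller can keep its record for the next cycle instead of deleting it.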
[jira] [Commented] (CASSANDRA-6134) More efficient BatchlogManager
[ https://issues.apache.org/jira/browse/CASSANDRA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13807928#comment-13807928 ]

Oleg Anastasyev commented on CASSANDRA-6134:

So, you'll be writing your own version, so there is nothing to be done by me on this task. Did I get you right?
[jira] [Commented] (CASSANDRA-6134) More efficient BatchlogManager
[ https://issues.apache.org/jira/browse/CASSANDRA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13807939#comment-13807939 ]

Aleksey Yeschenko commented on CASSANDRA-6134:

bq. So, you'll be writing your own version, so there is nothing to be done by me on this task. Did I get you right?

No, I've got way too much on my plate for this. I was hoping that you could bring the improvements you suggested to the current schema.
[jira] [Commented] (CASSANDRA-6134) More efficient BatchlogManager
[ https://issues.apache.org/jira/browse/CASSANDRA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13807959#comment-13807959 ]

Oleg Anastasyev commented on CASSANDRA-6134:

This means rewriting already tested and working code almost from scratch. I don't see a reason for it. If, as you mentioned, the 2.0 to 2.1 upgrade requires a full stop of the cluster, then trying to preserve the old schema is meaningless, because we don't have to provide a migration at all.
[jira] [Commented] (CASSANDRA-6134) More efficient BatchlogManager
[ https://issues.apache.org/jira/browse/CASSANDRA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13808024#comment-13808024 ]

Aleksey Yeschenko commented on CASSANDRA-6134:

W/ timeuuid as key you can also start the scanning from the last known uuid (kinda. Doing this naively is not exactly safe, b/c an old batch record might arrive with a delay of up to the write_timeout, and won't be replayed if we just start from the last-replayed entry).
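One way to read the caveat above: rather than resuming exactly at the last replayed entry, back the lower bound off by one write timeout. A minimal sketch in Python, with hypothetical names; the `replayed_ids` set stands in for the fact that replayed batches are normally deleted and so would not be seen again.

```python
def scan_lower_bound_ms(last_replayed_ms: int, write_timeout_ms: int) -> int:
    """Start one write timeout before the last replayed entry,
    so a record that arrived late is still inside the scanned range."""
    return max(0, last_replayed_ms - write_timeout_ms)


def pending(batches, last_replayed_ms, write_timeout_ms, replayed_ids):
    """Ids of batches inside the safe range that have not been replayed yet.

    `batches` is an iterable of (batch_id, written_at_ms) pairs.
    """
    lower = scan_lower_bound_ms(last_replayed_ms, write_timeout_ms)
    return [
        batch_id
        for batch_id, written_at_ms in batches
        if written_at_ms >= lower and batch_id not in replayed_ids
    ]
```

With a naive lower bound of `last_replayed_ms`, a record stamped slightly earlier but delivered later would never be picked up; the wider range costs a small re-scan on each cycle.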
[jira] [Commented] (CASSANDRA-6134) More efficient BatchlogManager
[ https://issues.apache.org/jira/browse/CASSANDRA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13808015#comment-13808015 ]

Aleksey Yeschenko commented on CASSANDRA-6134:

No stopping the cluster, obviously - that would be a deal-breaker. Just stay on 2.0.3+ until all the batches have been flushed (say, 10 minutes) before proceeding to 2.1. So only the people migrating from 1.2 to 2.1 (through 2.0) will have to take some extra action (wait a little on 2.0 before switching to 2.1).
[jira] [Commented] (CASSANDRA-6134) More efficient BatchlogManager
[ https://issues.apache.org/jira/browse/CASSANDRA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13806901#comment-13806901 ]

Aleksey Yeschenko commented on CASSANDRA-6134:

bq. Do you mean time jumping, if the operator forcibly changes the time on the machine, or some other scenario?

Yup. That's a minor concern though.

bq. Using it without COMPACT STORAGE will add 2x to memory and disk.

How so? And yeah, having the ability to add a map or a set with some extra metadata there is useful. While it hasn't been done to the batchlog, we've done it for other system cf-s (system.schema_columnfamilies for one) and were burnt by COMPACT with system.schema_keyspaces (can't switch rf options to a map and have to keep the ghetto-json b/c can't add a map) (see CASSANDRA-4603).
[jira] [Commented] (CASSANDRA-6134) More efficient BatchlogManager
[ https://issues.apache.org/jira/browse/CASSANDRA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13806934#comment-13806934 ]

Aleksey Yeschenko commented on CASSANDRA-6134:

I don't like the idea of having two batchlog cfs, and two separate batchlog implementations. But as Oleg says, there is a lot of room for improvement in the current batchlog implementation. I want to bring as many of them as possible w/out changing the schema (in incompatible ways).

Regarding full scan - we can actually start using v1 uuid instead of random for the batchlog keys, without changing the key type ('uuid' will accept any uuid type, unlike 'timeuuid' that would only accept v1). And then stop replaying as soon as we stumble upon a batch that is too new. (Can't exactly do that in 2.0, but we can start using v1 ids in 2.0 and tell people to either force batchlog replay or wait for a while on the fully upgraded 2.0 cluster before moving to 2.1, where we could start using this logic). We already require a stop at 2.0 for anyone upgrading to 2.1 so this should work.
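The "stop at the first batch that is too new" idea can be sketched as below, in Python for illustration. Two things here are assumptions of the sketch and not statements from the thread: that ids are handed over oldest-first, and that legacy random (v4) ids, which carry no timestamp, are always treated as replayable. The names `written_at_ms`, `ready_to_replay` and `replay_after_ms` are hypothetical.

```python
import uuid

# 100 ns ticks between the UUID epoch (1582-10-15) and the Unix epoch.
UUID_EPOCH_OFFSET = 0x01B21DD213814000


def written_at_ms(batch_id: uuid.UUID) -> int:
    """Unix time in milliseconds embedded in a v1 UUID."""
    return (batch_id.time - UUID_EPOCH_OFFSET) // 10_000


def ready_to_replay(batch_ids, now_ms: int, replay_after_ms: int):
    """Collect ids until the first v1 id that is younger than the cutoff."""
    cutoff_ms = now_ms - replay_after_ms
    ready = []
    for batch_id in batch_ids:
        if batch_id.version != 1:
            ready.append(batch_id)  # legacy v4 id: no timestamp to inspect
        elif written_at_ms(batch_id) <= cutoff_ms:
            ready.append(batch_id)
        else:
            break  # too new; under the ordering assumption, so is everything after it
    return ready
```

If the scan does not return ids in time order, the `break` would have to become a `continue`, which keeps the filtering but loses the early exit.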
[jira] [Commented] (CASSANDRA-6134) More efficient BatchlogManager
[ https://issues.apache.org/jira/browse/CASSANDRA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13806954#comment-13806954 ]

Aleksey Yeschenko commented on CASSANDRA-6134:

(Ninja-committed the v4-v1 UUID change in b5d563ec3c7d569e626119bd5900026c07f247b6)
[jira] [Commented] (CASSANDRA-6134) More efficient BatchlogManager
[ https://issues.apache.org/jira/browse/CASSANDRA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13802135#comment-13802135 ]

Oleg Anastasyev commented on CASSANDRA-6134:

bq. This isn't what we want to ensure though. The current timeout (write timeout * 2) is there to account for maximum batchlog write timeout + actual data write timeout. Avoiding extra mutations is IMO more important than having less delay in the failure scenario (and slow writes would happen more often than outright failures).

As we discussed earlier, the whole batchlog thing makes little sense if clients cannot read their own writes. Consider a client that wrote to the batchlog very fast and timed out from the coordinator with the batch half applied. Reading from another coordinator, it would see the batch partially applied for almost yet another write timeout. So just having write timeout*2 is not a good idea.
On the other hand, the hammering is just a one-by-one replay of unplayed mutations. I don't think this could be an issue in practice. +1 on having a RateLimiter there, so the hammering could be more limited.

bq. -1 on using writeTime for TTL calculation from the UUID (the time can actually jump, but uuids will always increase, and it's not what we want for TTL calc)

Do you mean time jumping, if the operator forcibly changes the time on the machine, or some other scenario?

bq. making the table COMPACT STORAGE limits our flexibility wrt future batchlog schema changes, so -1 on that

Using it without COMPACT STORAGE will add 2x to memory and disk. Is supporting changes really necessary? I have not noticed any changes to the original structure since the very beginning of the batchlog.

bq. We should avoid any potentially brittle/breaking extra migration code on the already slow-ish startup.

Um, I was not thinking about migrating old batchlog records on startup. This cannot be done, because old-version nodes will continue to write old-format batchlog entries while the operator rolls the upgrade through the cluster.
What I was thinking of is having BatchlogManagerOld read from the old batchlog CF and replay batches the old way, and having BatchlogManager read from the new batchlog2 CF and replay batches the new way. As soon as all nodes are upgraded they start to write to the new batchlog2 CF, so BatchlogManagerOld, after it has processed all old records, reads nothing from the old batchlog CF and basically does a NOP cycle every 60 secs.
So the migration is not such a big deal that we should aim at not changing the structure of the batchlog.
[jira] [Commented] (CASSANDRA-6134) More efficient BatchlogManager
[ https://issues.apache.org/jira/browse/CASSANDRA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13800851#comment-13800851 ]

Aleksey Yeschenko commented on CASSANDRA-6134:

bq. It runs every 1/2 of write timeout and replays all batches written within 0.9 * write timeout from now. This way we ensure that batched updates will be replayed by the moment the client times out from the coordinator.

This isn't what we want to ensure though. The current timeout (write timeout * 2) is there to account for maximum batchlog write timeout + actual data write timeout. Avoiding extra mutations is IMO more important than having less delay in the failure scenario (and slow writes would happen more often than outright failures). And you definitely don't want to hammer an already slow node with twice the load. So -1 on this particular change.

bq. It submits all mutations from a single batch in parallel (like StorageProxy does). The old implementation played them one by one, so a client can see half-applied batches in the CF for a long time (depending on the size of the batch).

This is fine. But yeah, we could/should parallelize batchlog replay more (can be done w/out modifying the schema).

bq. It fixes a subtle racing bug with incorrect hint TTL calculation

Care to elaborate? I think there was a tricky open bug related to this, but can't find the JIRA #.

To avoid random reads, we could read the mutation blob in replayAllFailedBatches() and pass it to replayBatch() (I thought we were already doing that).
To make replay more async, as you suggest, we could read several batches and initiate their replay async instead of replaying them one by one (but w/ RateLimiter in place).
To avoid iterating over the already replayed batches (tombstones), we could purge the replayed batches directly from the memtable (although I'd need to see a benchmark proving that it's worth doing it first).

Other stuff, in no particular order:
- making the table COMPACT STORAGE limits our flexibility wrt future batchlog schema changes, so -1 on that
- we should probably rate-limit batchlog replay w/ RateLimiter
- +1 on moving forceBatchlogReplay() to batchlogTasks as well (this was an omission from CASSANDRA-6079, ninja-committed it in 7e057f504613e68082a76642983d353f3f0400fb)
- +1 on running cleanup() on startup
- -1 on using writeTime for TTL calculation from the UUID (the time can actually jump, but uuids will always increase, and it's not what we want for TTL calc)

In general:
I like some of the suggested changes, and would like to see the ones that are possible w/out the schema change implemented first. I'm strongly against altering the batchlog schema, unless the benchmarks can clearly prove that the version with the partitioned schema is significantly better than what we could come up with without altering the schema, and many of them can be. We should avoid any potentially brittle/breaking extra migration code on the already slow-ish startup.

Could you give it a try, [~m0nstermind]? Namely,
- replaying several mutations read in replayAllFailedBatches() simultaneously instead of 1-by-1
- avoiding the random read by passing the read blob to replayBatch()
- measure the effect of purging the replayed batch from the memtable (when not read from the disk)

If this gives us most of the win of a version with the altered schema, then I'll be satisfied with just those changes. If benchmarks say that we have a lot of extra relative and absolute efficiency to gain from the schema change, then I won't argue with the data.
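Rate-limiting replay is easier to reason about when the limiter is charged in bytes rather than in number of mutations; the thread elsewhere points out that passing the list size to acquire() is a bug for exactly that reason. The sketch below is a hand-rolled stand-in written in Python, not the RateLimiter class the patch uses, and the names `ByteRateLimiter`, `reserve` and `replay_cost` are hypothetical.

```python
class ByteRateLimiter:
    """Tiny pacing limiter: callers reserve bytes and are told how long to wait."""

    def __init__(self, bytes_per_second: float):
        self.bytes_per_second = bytes_per_second
        self.next_free = 0.0  # time at which the next reservation may start

    def reserve(self, nbytes: int, now: float) -> float:
        """Reserve nbytes at time `now`; return the delay in seconds to honour."""
        wait = max(0.0, self.next_free - now)
        self.next_free = max(self.next_free, now) + nbytes / self.bytes_per_second
        return wait


def replay_cost(mutations) -> int:
    """Charge the limiter by total serialized size, not by len(mutations)."""
    return sum(len(m) for m in mutations)
```

A replay loop would call `limiter.reserve(replay_cost(mutations), now)` before sending each batch and sleep for the returned delay, so a batch of a few large mutations is throttled harder than a batch of many tiny ones.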
[jira] [Commented] (CASSANDRA-6134) More efficient BatchlogManager
[ https://issues.apache.org/jira/browse/CASSANDRA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13790064#comment-13790064 ]

Oleg Anastasyev commented on CASSANDRA-6134:

The timestamp type will add 4 more bytes for every batch in traffic and in memory storage (timestamp is 8 bytes instead of 4 for int, AFAIK). Making it human-readable does not seem very useful to me. The CLUSTERING order is there to slice mostly from the beginning of the partition. This could give some (though not much) performance gain, esp. if the batchlog is flushed to disk.

More efficient BatchlogManager
--
Key: CASSANDRA-6134
URL: https://issues.apache.org/jira/browse/CASSANDRA-6134
Project: Cassandra
Issue Type: Improvement
Reporter: Oleg Anastasyev
Priority: Minor
Attachments: BatchlogManager.txt

As we discussed earlier in CASSANDRA-6079, this is the new BatchManager. It stores batch records in
{code}
CREATE TABLE batchlog (
  id_partition int,
  id timeuuid,
  data blob,
  PRIMARY KEY (id_partition, id)
) WITH COMPACT STORAGE AND CLUSTERING ORDER BY (id DESC)
{code}
where id_partition is the minute-since-epoch of the id uuid. So when it scans for batches to replay, it scans within a single partition for a slice of ids from the last processed date until now minus the write timeout. So neither a full batchlog CF scan nor a lot of random reads is needed on a normal cycle.

Other improvements:
1. It runs every 1/2 of the write timeout and replays all batches written within 0.9 * write timeout from now. This way we ensure that batched updates will be replayed by the moment the client times out from the coordinator.
2. It submits all mutations from a single batch in parallel (like StorageProxy does). The old implementation played them one by one, so the client could see half-applied batches in the CF for a long time (depending on the size of the batch).
3. It fixes a subtle race bug with incorrect hint TTL calculation.

-- This message was sent by Atlassian JIRA (v6.1#6144)
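For readers following the int-versus-timestamp discussion, here is a minimal Python sketch of how a minute-since-epoch partition key could be derived from a version 1 (time-based) UUID, and which minute buckets a replay cycle would have to look at. The function names and the bucket-range helper are assumptions made for illustration; the attached patch may compute these differently.

```python
import uuid

# Number of 100-ns intervals between the UUID epoch (1582-10-15)
# and the Unix epoch (1970-01-01).
UUID_UNIX_OFFSET = 0x01B21DD213814000


def unix_millis(u: uuid.UUID) -> int:
    """Unix time in milliseconds encoded in a version 1 UUID."""
    return (u.time - UUID_UNIX_OFFSET) // 10_000


def id_partition(u: uuid.UUID) -> int:
    """Minute-since-epoch bucket for a timeuuid (hypothetical helper)."""
    return unix_millis(u) // 60_000


def partitions_to_scan(last_processed_ms: int, now_ms: int, write_timeout_ms: int):
    """Minute buckets covering [last processed, now - write timeout].

    Assumption: a cycle that spans a minute boundary touches more than
    one bucket; in the common case the list has a single element.
    """
    upper_ms = now_ms - write_timeout_ms
    if upper_ms < last_processed_ms:
        return []  # nothing is old enough to replay yet
    return list(range(last_processed_ms // 60_000, upper_ms // 60_000 + 1))
```

Minute-since-epoch values for present-day timestamps are in the tens of millions, so they fit comfortably in a 4-byte signed int, which is the size argument made in the comment above.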
[jira] [Commented] (CASSANDRA-6134) More efficient BatchlogManager
[ https://issues.apache.org/jira/browse/CASSANDRA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13788849#comment-13788849 ]

Jonathan Ellis commented on CASSANDRA-6134:

I'm going to bikeshed this a bit. How about
{code}
CREATE TABLE batchlog (
  partition_id timestamp,
  id timeuuid,
  data blob,
  PRIMARY KEY (partition_id, id)
) WITH COMPACT STORAGE
{code}
Giving it timestamp type instead of int just makes it a bit more human-readable. Does the CLUSTERING order actually matter?
[jira] [Commented] (CASSANDRA-6134) More efficient BatchlogManager
[ https://issues.apache.org/jira/browse/CASSANDRA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13784111#comment-13784111 ]

Oleg Anastasyev commented on CASSANDRA-6134:

Well, how to migrate old batchlog records is subject to discussion and TBD. The easiest way is to have a batchlog2 CF with the new definition alongside batchlog with the old one, but I find it somewhat ugly.
[jira] [Commented] (CASSANDRA-6134) More efficient BatchlogManager
[ https://issues.apache.org/jira/browse/CASSANDRA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13784115#comment-13784115 ]

Jonathan Ellis commented on CASSANDRA-6134:

I vote for easy in this case, if users get so low level that they care what this table is named then they have no right to be offended. :)
[jira] [Commented] (CASSANDRA-6134) More efficient BatchlogManager
[ https://issues.apache.org/jira/browse/CASSANDRA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13784122#comment-13784122 ]

Aleksey Yeschenko commented on CASSANDRA-6134:

If the changes are worth it in practice *AND* if there is absolutely no way to reuse the current schema, you still have to migrate the old batches.
[jira] [Commented] (CASSANDRA-6134) More efficient BatchlogManager
[ https://issues.apache.org/jira/browse/CASSANDRA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13784148#comment-13784148 ]

Oleg Anastasyev commented on CASSANDRA-6134:

Alex: It seems that the current schema is completely incompatible with the new one. So could you please take a look and decide whether the new batchlog manager is useful for you, and so whether it is worth implementing the migration?