[jira] [Commented] (CASSANDRA-6134) More efficient BatchlogManager

2014-04-14 Thread Aleksey Yeschenko (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13969167#comment-13969167
 ] 

Aleksey Yeschenko commented on CASSANDRA-6134:
--

[~m0nstermind] FYI, I pushed a 2.1-rebased version to 
https://github.com/iamaleksey/cassandra/commits/6134

There are at least a couple of issues with it:
1. The rateLimiter.acquire() call at 
https://github.com/iamaleksey/cassandra/commit/337368f14aa3546e9d8057c48ab8f5a32efe88c4#diff-642bb5d5ca328b50d59f2a550c94e5edR280
 uses the size of the list instead of the mutation size
2. Using Verb.READ_REPAIR in 
https://github.com/iamaleksey/cassandra/commit/337368f14aa3546e9d8057c48ab8f5a32efe88c4#diff-642bb5d5ca328b50d59f2a550c94e5edR402
 does not, in fact, stop C* from writing a hint on timeout
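
To illustrate issue 1, here is a minimal sketch (not the code from the linked commit) of charging a rate limiter by serialized mutation size rather than by the number of list entries. The class and method names are hypothetical. It assumes the mutations are available as byte arrays and that the result would be passed to Guava's RateLimiter.acquire(int), which rejects non-positive permit counts.

```java
import java.util.List;

/** Hypothetical helper, for illustration only. */
final class ReplayPermits {
    private ReplayPermits() {}

    /**
     * Returns the total serialized size in bytes of all mutations,
     * clamped to the range [1, Integer.MAX_VALUE] so the value is a
     * valid permit count.
     */
    static int permitsFor(List<byte[]> serializedMutations) {
        long totalBytes = 0;
        for (byte[] mutation : serializedMutations) {
            totalBytes += mutation.length;
        }
        return (int) Math.max(1L, Math.min((long) Integer.MAX_VALUE, totalBytes));
    }
}
```

With two mutations of 10 and 32 bytes, list.size() would charge 2 permits, while permitsFor(list) charges 42, so a byte-based rate limit actually tracks the volume being replayed.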

I haven't forgotten about the issue, but need to fix a few batchlog/HHOM bugs 
first :\

 More efficient BatchlogManager
 --

 Key: CASSANDRA-6134
 URL: https://issues.apache.org/jira/browse/CASSANDRA-6134
 Project: Cassandra
  Issue Type: Improvement
Reporter: Oleg Anastasyev
Assignee: Oleg Anastasyev
Priority: Minor
 Attachments: 6134-async.txt, 6134-cleanup.txt, BatchlogManager.txt


 As we discussed earlier in CASSANDRA-6079, this is the new BatchlogManager.
 It stores batch records in 
 {code}
 CREATE TABLE batchlog (
   id_partition int,
   id timeuuid,
   data blob,
   PRIMARY KEY (id_partition, id)
 ) WITH COMPACT STORAGE AND
   CLUSTERING ORDER BY (id DESC)
 {code}
 where id_partition is the minute-since-epoch of the id timeuuid. 
 So when it scans for batches to replay, it scans within a single partition for 
 a slice of ids from the last processed date until now minus write timeout.
 So no full batchlog CF scan and no large number of random reads are made on a 
 normal cycle. 
 Other improvements:
 1. It runs every 1/2 of write timeout and replays all batches written within 
 0.9 * write timeout from now. This way we ensure that batched updates will 
 be replayed by the moment the client times out from the coordinator.
 2. It submits all mutations from a single batch in parallel (like StorageProxy 
 does). The old implementation played them one by one, so a client could see 
 half-applied batches in the CF for a long time (depending on the size of the batch).
 3. It fixes a subtle race bug with incorrect hint TTL calculation.
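
The id_partition derivation described above can be sketched as follows. This is an illustration only, not code from the attached patch: the class name, method names, and the timeUuidAt helper are hypothetical. It relies on java.util.UUID.timestamp(), which returns the version-1 timestamp in 100-nanosecond units since 1582-10-15.

```java
import java.util.UUID;

/** Hypothetical helper, for illustration only. */
final class BatchlogPartition {
    /** 100-ns intervals between the UUID epoch (1582-10-15) and the Unix epoch. */
    static final long UUID_EPOCH_OFFSET = 0x01B21DD213814000L;

    private BatchlogPartition() {}

    /** Unix time in milliseconds embedded in a version-1 UUID. */
    static long unixMillis(UUID timeUuid) {
        return (timeUuid.timestamp() - UUID_EPOCH_OFFSET) / 10_000L;
    }

    /** Minute-since-epoch of the UUID, i.e. the assumed id_partition value. */
    static int idPartition(UUID timeUuid) {
        return (int) (unixMillis(timeUuid) / 60_000L);
    }

    /**
     * Builds a version-1 UUID for a given Unix time. Clock sequence and
     * node are zeroed, so this is only suitable for demonstrations.
     */
    static UUID timeUuidAt(long unixMillis) {
        long ts = unixMillis * 10_000L + UUID_EPOCH_OFFSET;
        long msb = ((ts & 0xFFFFFFFFL) << 32)        // time_low
                 | (((ts >>> 32) & 0xFFFFL) << 16)   // time_mid
                 | 0x1000L                           // version 1
                 | ((ts >>> 48) & 0x0FFFL);          // time_hi
        return new UUID(msb, 0x8000000000000000L);   // RFC 4122 variant
    }
}
```

For example, idPartition(timeUuidAt(1_382_400_000_000L)) evaluates to 23040000, so every batch written during that minute lands in the same partition and a replay cycle only needs to slice the few partitions covering its time window.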



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (CASSANDRA-6134) More efficient BatchlogManager

2014-02-23 Thread Oleg Anastasyev (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13910081#comment-13910081
 ] 

Oleg Anastasyev commented on CASSANDRA-6134:


Um, I am not sure what exactly I can do on this task. If #1 is 
implemented by Aleksey, I could take the async replay then.



[jira] [Commented] (CASSANDRA-6134) More efficient BatchlogManager

2014-02-19 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13905471#comment-13905471
 ] 

Jonathan Ellis commented on CASSANDRA-6134:
---

Are you planning to pick this back up, [~m0nstermind]?



[jira] [Commented] (CASSANDRA-6134) More efficient BatchlogManager

2014-02-19 Thread Aleksey Yeschenko (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13905891#comment-13905891
 ] 

Aleksey Yeschenko commented on CASSANDRA-6134:
--

For the record, several of those improvements have already made it into C*. If 
we don't do the partitioning, then only two are left to implement:

1. Don't do full scans, but limit the range to (nothing could be written 
earlier than that, batches not ready to replay yet) - the uuids are timeuuids 
there now, so it's a simple change, on my todo list
2. Replay several batches simultaneously, async - this is slightly more work, 
but only slightly
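
Item 2 can be sketched roughly as below. This is a hedged illustration rather than the eventual implementation: AsyncReplay and replayAll are hypothetical names, the batch type is left generic, and CompletableFuture is used for brevity even though it requires Java 8.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.function.Consumer;

/** Hypothetical helper, for illustration only. */
final class AsyncReplay {
    private AsyncReplay() {}

    /**
     * Submits every batch to the executor so that replays overlap,
     * then blocks until all of them have completed. If any replay
     * throws, join() rethrows it as a CompletionException.
     */
    static <B> void replayAll(List<B> batches, Consumer<B> replayOne, Executor executor) {
        List<CompletableFuture<Void>> inFlight = new ArrayList<>();
        for (B batch : batches) {
            inFlight.add(CompletableFuture.runAsync(() -> replayOne.accept(batch), executor));
        }
        CompletableFuture.allOf(inFlight.toArray(new CompletableFuture<?>[0])).join();
    }
}
```

The size of the executor's thread pool bounds how many batches are in flight at once, which is one simple way to keep parallel replay from overloading the target nodes.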

Stuff that made it recently, thanks to rbranson: CASSANDRA-6569, 
CASSANDRA-6550, CASSANDRA-6488, CASSANDRA-6481

Stuff that's still waiting (aside from 1. and 2.) : CASSANDRA-6551



[jira] [Commented] (CASSANDRA-6134) More efficient BatchlogManager

2013-10-29 Thread Oleg Anastasyev (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13807928#comment-13807928
 ] 

Oleg Anastasyev commented on CASSANDRA-6134:


So, you'll be writing your own version, so there is nothing for me to do on this 
task. Did I get you right?



[jira] [Commented] (CASSANDRA-6134) More efficient BatchlogManager

2013-10-29 Thread Aleksey Yeschenko (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13807939#comment-13807939
 ] 

Aleksey Yeschenko commented on CASSANDRA-6134:
--

bq. So, you'll be writing your own version, so there is nothing for me to do on this 
task. Did I get you right?

No, I've got way too much on my plate for this. I was hoping that you could 
bring the improvements you suggested to the current schema.



[jira] [Commented] (CASSANDRA-6134) More efficient BatchlogManager

2013-10-29 Thread Oleg Anastasyev (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13807959#comment-13807959
 ] 

Oleg Anastasyev commented on CASSANDRA-6134:


This means rewriting already tested and working code almost from scratch. I 
don't see a reason for it. As you mentioned, if the 2.0 to 2.1 upgrade requires a 
full stop of the cluster, then trying to preserve the old schema is meaningless, 
because we don't have to provide a migration at all.



[jira] [Commented] (CASSANDRA-6134) More efficient BatchlogManager

2013-10-29 Thread Aleksey Yeschenko (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13808024#comment-13808024
 ] 

Aleksey Yeschenko commented on CASSANDRA-6134:
--

W/ timeuuid as key you can also start the scanning from the last known uuid 
(kinda. doing this naively is not exactly safe, b/c an old batch record might 
arrive with a delay of up to the write_timeout, and won't be replayed if we 
just start from the last-replayed entry).
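
A minimal sketch of the safer variant hinted at above: instead of resuming exactly at the last-replayed entry, back the lower bound off by one write timeout so that late-arriving records are still covered. The names are hypothetical, and the upper bound simply subtracts whatever replay delay is in use (the thread mentions write timeout * 2 as the current value).

```java
/** Hypothetical helper, for illustration only. */
final class ReplayWindow {
    private ReplayWindow() {}

    /**
     * Lower bound of the scan: the timestamp of the last replayed
     * batch minus the write timeout, never below zero. Records that
     * arrived late but carry an older timestamp fall inside it.
     */
    static long scanFromMillis(long lastReplayedMillis, long writeTimeoutMillis) {
        return Math.max(0L, lastReplayedMillis - writeTimeoutMillis);
    }

    /**
     * Upper bound of the scan: batches newer than this are assumed
     * not to be ready for replay yet.
     */
    static long scanToMillis(long nowMillis, long replayDelayMillis) {
        return nowMillis - replayDelayMillis;
    }
}
```

The cost of the overlap is that some already-handled entries are scanned again on the next cycle. The sketch assumes this is acceptable because replayed batches are deleted after replay.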



[jira] [Commented] (CASSANDRA-6134) More efficient BatchlogManager

2013-10-29 Thread Aleksey Yeschenko (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13808015#comment-13808015
 ] 

Aleksey Yeschenko commented on CASSANDRA-6134:
--

No stopping the cluster, obviously - that would be a deal-breaker. Just stay on 
2.0.3+ until all the batches have been flushed (say, 10 minutes) before 
proceeding to 2.1. So only the people migrating from 1.2 to 2.1 (through 2.0) 
will have to take some extra action (wait a little on 2.0 before switching to 
2.1).



[jira] [Commented] (CASSANDRA-6134) More efficient BatchlogManager

2013-10-28 Thread Aleksey Yeschenko (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13806901#comment-13806901
 ] 

Aleksey Yeschenko commented on CASSANDRA-6134:
--

bq. Do you mean time jumping, if operator forcibly changes time on machine or 
some other scenario ?

Yup. That's a minor concern though.

bq. Using it without COMPACT STORAGE will add 2x to memory and disk.

How so? And yeah, having the ability to add a map or a set with some extra 
metadata there is useful. While it hasn't been done to the batchlog, we've done 
it for other system cf-s (system.schema_columnfamilies for one) and were burnt 
by COMPACT with system.schema_keyspaces (can't switch rf options to a map and 
have to keep the ghetto-json b/c can't add a map) (see CASSANDRA-4603).



[jira] [Commented] (CASSANDRA-6134) More efficient BatchlogManager

2013-10-28 Thread Aleksey Yeschenko (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13806934#comment-13806934
 ] 

Aleksey Yeschenko commented on CASSANDRA-6134:
--

I don't like the idea of having two batchlog cfs, and two separate batchlog 
implementations. But as Oleg says, there is a lot of room for improvement in 
the current batchlog implementation.

I want to bring as many of them as possible w/out changing the schema (in 
incompatible ways).

Regarding full scan - we can actually start using v1 uuid instead of random for 
the batchlog keys, without changing the key type ('uuid' will accept any uuid 
type, unlike 'timeuuid' that would only accept v1). And then stop replaying as 
soon as we stumble upon a batch that is too new. (Can't exactly do that in 2.0, 
but we can start using v1 ids in 2.0 and tell people to either force batchlog 
replay or wait for a while on the fully upgraded 2.0 cluster before moving to 
2.1, where we could start using this logic). We already require a stop at 2.0 
for anyone upgrading to 2.1 so this should work.
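
The "stop as soon as a batch is too new" check could look roughly like this. It is a sketch under assumptions: ReplayCutoff and tooNew are hypothetical names, and ids that are not version 1 (for example, random ids written before the switch) are treated as never too new because they carry no usable timestamp.

```java
import java.util.UUID;

/** Hypothetical helper, for illustration only. */
final class ReplayCutoff {
    /** 100-ns intervals between the UUID epoch (1582-10-15) and the Unix epoch. */
    static final long UUID_EPOCH_OFFSET = 0x01B21DD213814000L;

    private ReplayCutoff() {}

    /**
     * Returns true only for version-1 UUIDs whose embedded time is
     * later than the cutoff. Other versions return false, since
     * UUID.timestamp() is unsupported for them and no early stop
     * can be justified.
     */
    static boolean tooNew(UUID id, long cutoffUnixMillis) {
        if (id.version() != 1) {
            return false;
        }
        long millis = (id.timestamp() - UUID_EPOCH_OFFSET) / 10_000L;
        return millis > cutoffUnixMillis;
    }
}
```

Because a column of type 'uuid' can hold both kinds of id, a mixed batchlog remains readable. The early stop only becomes dependable once the old random ids have been replayed or flushed out, which is what the wait on a fully upgraded 2.0 cluster is meant to guarantee.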



[jira] [Commented] (CASSANDRA-6134) More efficient BatchlogManager

2013-10-28 Thread Aleksey Yeschenko (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13806954#comment-13806954
 ] 

Aleksey Yeschenko commented on CASSANDRA-6134:
--

(Ninja-committed the v4-v1 UUID change in 
b5d563ec3c7d569e626119bd5900026c07f247b6)



[jira] [Commented] (CASSANDRA-6134) More efficient BatchlogManager

2013-10-22 Thread Oleg Anastasyev (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13802135#comment-13802135
 ] 

Oleg Anastasyev commented on CASSANDRA-6134:


bq. This isn't what we want to ensure though. The current timeout (write 
timeout * 2) is there to account for maximum batchlog write timeout + actual 
data write timeout. Avoiding extra mutations is IMO more important than having 
less delay in the failure scenario (and slow writes would happen more often 
than outright failures). 

As we discussed earlier, the whole batchlog thing makes little sense if clients 
cannot read their own writes. Consider a client that wrote to the batchlog very 
fast and timed out from the coordinator with the batch half applied. Reading from 
another coordinator, it would see the batch partially applied for almost another 
write timeout. So just having write timeout * 2 is not a good idea. On the other 
hand, the hammering is a one-by-one replay of unplayed mutations. I don't think 
this could be an issue in practice. +1 on having a RateLimiter there, so the 
hammering could be more limited.

bq. -1 on using writeTime for TTL calculation from the UUID (the time can 
actually jump, but uuids will always increase, and it's not what we want for 
TTL calc)

Do you mean time jumping, if operator forcibly changes time on machine or some 
other scenario ?

bq. making the table COMPACT STORAGE limits our flexibility wrt future batchlog 
schema changes, so -1 on that

Using it without COMPACT STORAGE will add 2x to memory and disk. Is 
supporting changes really necessary? I have not noticed any changes to the original 
structure since the very beginning of the batchlog.

bq. We should avoid any potentially brittle/breaking extra migration code on 
the already slow-ish startup.

Um, I was not thinking about migrating old batchlog records on startup. This 
cannot be done, because old-version nodes will continue to write old-format 
batchlog entries while the operator rolls the upgrade across the cluster. What I 
was thinking of is having BatchlogManagerOld read from the old batchlog CF and 
replay batches the old way, and having BatchlogManager read from the new 
batchlog2 CF and replay batches the new way. As soon as all nodes are upgraded, 
they start to write to the new batchlog2 CF, so BatchlogManagerOld, after it has 
processed all old records, reads nothing from the old batchlog CF and basically 
does a NOP cycle every 60 secs. So the migration is not such a big deal that we 
should try so hard to avoid changing the structure of the batchlog.



[jira] [Commented] (CASSANDRA-6134) More efficient BatchlogManager

2013-10-21 Thread Aleksey Yeschenko (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13800851#comment-13800851
 ] 

Aleksey Yeschenko commented on CASSANDRA-6134:
--

bq. It runs every 1/2 of write timeout and replays all batches written within 
0.9 * write timeout from now. This way we ensure, that batched updates will be 
replayed to the moment client times out from coordinator.

This isn't what we want to ensure though. The current timeout (write timeout * 
2) is there to account for maximum batchlog write timeout + actual data write 
timeout. Avoiding extra mutations is IMO more important than having less delay 
in the failure scenario (and slow writes would happen more often than outright 
failures). And you definitely don't want to hammer an already slow node with 
twice the load. So -1 on this particular change.

bq. It submits all mutations from single batch in parallel (Like StorageProxy 
do). Old implementation played them one-by-one, so client can see half applied 
batches in CF for a long time (depending on size of batch).

This is fine. But yeah, we could/should parallelize batchlog replay more (can 
be done w/out modifying the schema).

bq. It fixes a subtle racing bug with incorrect hint TTL calculation

Care to elaborate? I think there was a tricky open bug related to this, but 
can't find the JIRA #.

To avoid random reads, we could read the mutation blob in 
replayAllFailedBatches() and pass it to replayBatch() (I thought we were 
already doing that). To make replay more async, as you suggest, we could read 
several batches and initiate their replay async instead of replaying them one by 
one (but w/ RateLimiter in place).

To avoid iterating over the already replayed batches (tombstones), we could 
purge the replayed batches directly from the memtable (although I'd need to see 
a benchmark proving that it's worth doing it first).

Other stuff, in no particular order:

- making the table COMPACT STORAGE limits our flexibility wrt future batchlog 
schema changes, so -1 on that
- we should probably rate-limit batchlog replay w/ RateLimiter
- +1 on moving forceBatchlogReplay() to batchlogTasks as well (this was an 
omission from CASSANDRA-6079, ninja-committed it in 
7e057f504613e68082a76642983d353f3f0400fb)
- +1 on running cleanup() on startup
- -1 on using writeTime for TTL calculation from the UUID (the time can 
actually jump, but uuids will always increase, and it's not what we want for 
TTL calc)

In general:

I like some of the suggested changes, and would like to see the ones that are 
possible w/out the schema change implemented first (and many of them are). I'm 
strongly against altering the batchlog schema, unless the benchmarks can 
clearly prove that the version with the partitioned schema is significantly 
better than what we could come up with without altering the schema. We should 
avoid any potentially brittle/breaking extra migration code on the already 
slow-ish startup.

Could you give it a try, [~m0nstermind]? Namely,
- replaying several mutations read in replayAllFailedBatches() simultaneously 
instead of 1-by-1
- avoiding the random read by passing the read blob to replayBatch()
- measuring the effect of purging the replayed batch from the memtable (when 
not read from the disk)

If this gives us most of the win of a version with the altered schema, then 
I'll be satisfied with just those changes. If benchmarks say that we have a lot 
of extra relative and absolute efficiency to gain from the schema change, then 
I won't argue with the data.

 More efficient BatchlogManager
 --

 Key: CASSANDRA-6134
 URL: https://issues.apache.org/jira/browse/CASSANDRA-6134
 Project: Cassandra
  Issue Type: Improvement
Reporter: Oleg Anastasyev
Priority: Minor
 Attachments: BatchlogManager.txt


 As we discussed earlier in CASSANDRA-6079 this is the new BatchManager.
 It stores batch records in 
 {code}
 CREATE TABLE batchlog (
   id_partition int,
   id timeuuid,
   data blob,
   PRIMARY KEY (id_partition, id)
 ) WITH COMPACT STORAGE AND
   CLUSTERING ORDER BY (id DESC)
 {code}
 where id_partition is minute-since-epoch of id uuid. 
 So when it scans for batches to replay, it scans within a single partition for 
 a slice of ids from the last processed date until now minus the write timeout. 
 So no full batchlog CF scan and no large number of random reads are made in a 
 normal cycle. 
 Other improvements:
 1. It runs every 1/2 of write timeout and replays all batches written within 
 0.9 * write timeout from now. This way we ensure that batched updates will 
 be replayed by the moment the client times out from the coordinator.
 2. It submits all mutations from a single batch in parallel (like StorageProxy 
 does). The old implementation played them one-by-one, so a client can see half 
 applied batches in CF 

[jira] [Commented] (CASSANDRA-6134) More efficient BatchlogManager

2013-10-09 Thread Oleg Anastasyev (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13790064#comment-13790064
 ] 

Oleg Anastasyev commented on CASSANDRA-6134:


The timestamp type will add 4 more bytes for every batch, both in traffic and 
in memory storage (a timestamp is 8 bytes instead of 4 for an int, AFAIK). 
Making it human-readable does not seem very useful to me.

CLUSTERING ORDER is there so that slices are read mostly from the beginning of 
the partition. This could give some (though not much) performance gain, 
especially if the batchlog is flushed to disk.
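
For illustration, a minute-since-epoch partition id could be derived from a 
timeuuid along these lines. This is a sketch using only java.util.UUID; the 
class and helper names are hypothetical, and timeUuidFor() exists only to 
build a version-1 UUID for the example (clock sequence and node are left as 
zero).

```java
import java.util.UUID;

// Illustrative only: hypothetical names, not Cassandra's implementation.
final class BatchlogBuckets {
    // Number of 100-ns intervals between the UUID epoch (1582-10-15) and the
    // Unix epoch (1970-01-01).
    private static final long UUID_EPOCH_OFFSET = 0x01B21DD213814000L;

    // Builds a version-1 (time-based) UUID carrying the given Unix time.
    static UUID timeUuidFor(long unixMillis) {
        long ts = unixMillis * 10_000L + UUID_EPOCH_OFFSET;
        long timeLow = ts & 0xFFFFFFFFL;
        long timeMid = (ts >>> 32) & 0xFFFFL;
        long timeHi = (ts >>> 48) & 0x0FFFL;
        long msb = (timeLow << 32) | (timeMid << 16) | 0x1000L | timeHi;
        long lsb = 0x8000000000000000L; // variant bits only
        return new UUID(msb, lsb);
    }

    // Converts the 100-ns UUID timestamp back to Unix milliseconds.
    static long unixMillis(UUID timeUuid) {
        return (timeUuid.timestamp() - UUID_EPOCH_OFFSET) / 10_000L;
    }

    // Minute-since-epoch bucket; throws ArithmeticException if it overflows an int.
    static int minuteBucket(UUID timeUuid) {
        return Math.toIntExact(unixMillis(timeUuid) / 60_000L);
    }

    public static void main(String[] args) {
        UUID id = timeUuidFor(System.currentTimeMillis());
        System.out.println("id_partition = " + minuteBucket(id));
    }
}
```

A 4-byte int holds minutes since the Unix epoch for roughly four thousand 
years, which is why the narrower type is sufficient for the partition key.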



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (CASSANDRA-6134) More efficient BatchlogManager

2013-10-07 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13788849#comment-13788849
 ] 

Jonathan Ellis commented on CASSANDRA-6134:
---

I'm going to bikeshed this a bit.  How about

{code}
CREATE TABLE batchlog (
  partition_id timestamp,
  id timeuuid,
  data blob,
  PRIMARY KEY (partition_id, id)
) WITH COMPACT STORAGE
{code}

Giving it timestamp type instead of int just makes it a bit more human-readable.

Does the CLUSTERING order actually matter?



[jira] [Commented] (CASSANDRA-6134) More efficient BatchlogManager

2013-10-02 Thread Oleg Anastasyev (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13784111#comment-13784111
 ] 

Oleg Anastasyev commented on CASSANDRA-6134:


Well, how to migrate old batchlog records is open to discussion and TBD. The 
easiest way is to have a batchlog2 CF with the new definition alongside 
batchlog with the old one, but I find it somewhat ugly.



[jira] [Commented] (CASSANDRA-6134) More efficient BatchlogManager

2013-10-02 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13784115#comment-13784115
 ] 

Jonathan Ellis commented on CASSANDRA-6134:
---

I vote for easy in this case. If users get so low-level that they care what 
this table is named, then they have no right to be offended. :)



[jira] [Commented] (CASSANDRA-6134) More efficient BatchlogManager

2013-10-02 Thread Aleksey Yeschenko (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13784122#comment-13784122
 ] 

Aleksey Yeschenko commented on CASSANDRA-6134:
--

If the changes are worth it in practice *AND* if there is absolutely no way to 
reuse the current schema, you still have to migrate the old batches.



[jira] [Commented] (CASSANDRA-6134) More efficient BatchlogManager

2013-10-02 Thread Oleg Anastasyev (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13784148#comment-13784148
 ] 

Oleg Anastasyev commented on CASSANDRA-6134:


Alex: It seems that the current schema is completely incompatible with the new 
one. So could you please take a look and decide whether the new batchlog 
manager is useful for you, and therefore whether it is worth implementing the 
migration?



