Re: [Discuss] num_tokens default in Cassandra 4.0

2020-01-31 Thread Carl Mueller
"large/giant clusters and admins are the target audience for the value we
select"

There are reasons aside from massive scale to pick cassandra, but the
primary technical reason Cassandra is selected is to support horizontally
scaling to large clusters.

Why pick a value that you have to switch away from once you reach scale?
It's still a ticking time bomb, although 16 won't be as bad as 256.

H. But 4 is bad and could scare off adoption.

Ultimately a well-written article on operations and how to transition from
16 --> 4 and at what point that is a good idea (aka not when your cluster
is too big) should be a critical part of this.

On Fri, Jan 31, 2020 at 11:45 AM Michael Shuler 
wrote:

> On 1/31/20 9:58 AM, Dimitar Dimitrov wrote:
> > one corollary of the way the algorithm works (or more
> > precisely might not work) with multiple seeds or simultaneous
> > multi-node bootstraps or decommissions, is that a lot of dtests
> > start failing due to deterministic token conflicts. I wasn't
> > able to fix that by changing solely ccm and the dtests
> I appreciate all the detailed discussion. For a little historic context,
> since I brought up this topic in the contributors zoom meeting, unstable
> dtests was precisely the reason we moved the dtest configurations to
> 'num_tokens: 32'. That value has been used in CI dtest since something
> like 2014, when we found that this helped stabilize a large segment of
> flaky dtest failures. No real science there, other than "this hurts less."
>
> I have no real opinion on the suggestions of using 4 or 16, other than I
> believe most "default config using" new users are starting with smaller
> numbers of nodes. The small-but-growing users and veteran large cluster
> admins should be gaining more operational knowledge and be able to
> adjust their own config choices according to their needs (and good
> comment suggestions in the yaml). Whatever default config value is
> chosen for num_tokens, I think it should suit the new users with smaller
> clusters. The suggestion Mick makes that 16 makes a better choice for
> small numbers of nodes, well, that would seem to be the better choice
> for those users we are trying to help the most with the default.
>
> I fully agree that science, maths, and support/ops experience should
> guide the choice, but I don't believe that large/giant clusters and
> admins are the target audience for the value we select.
>
> --
> Kind regards,
> Michael
>
>
>


Re: [Discuss] num_tokens default in Cassandra 4.0

2020-01-31 Thread Carl Mueller
edit: 4 is bad at small cluster sizes and could scare off adoption

On Fri, Jan 31, 2020 at 12:15 PM Carl Mueller 
wrote:

> "large/giant clusters and admins are the target audience for the value we
> select"
>
> There are reasons aside from massive scale to pick cassandra, but the
> primary technical reason Cassandra is selected is to support horizontally
> scaling to large clusters.
>
> Why pick a value that you have to switch away from once you reach scale?
> It's still a ticking time bomb, although 16 won't be as bad as 256.
>
> H. But 4 is bad and could scare off adoption.
>
> Ultimately a well-written article on operations and how to transition from
> 16 --> 4 and at what point that is a good idea (aka not when your cluster
> is too big) should be a critical part of this.
>
> On Fri, Jan 31, 2020 at 11:45 AM Michael Shuler 
> wrote:
>
>> On 1/31/20 9:58 AM, Dimitar Dimitrov wrote:
>> > one corollary of the way the algorithm works (or more
>> > precisely might not work) with multiple seeds or simultaneous
>> > multi-node bootstraps or decommissions, is that a lot of dtests
>> > start failing due to deterministic token conflicts. I wasn't
>> > able to fix that by changing solely ccm and the dtests
>> I appreciate all the detailed discussion. For a little historic context,
>> since I brought up this topic in the contributors zoom meeting, unstable
>> dtests was precisely the reason we moved the dtest configurations to
>> 'num_tokens: 32'. That value has been used in CI dtest since something
>> like 2014, when we found that this helped stabilize a large segment of
>> flaky dtest failures. No real science there, other than "this hurts less."
>>
>> I have no real opinion on the suggestions of using 4 or 16, other than I
>> believe most "default config using" new users are starting with smaller
>> numbers of nodes. The small-but-growing users and veteran large cluster
>> admins should be gaining more operational knowledge and be able to
>> adjust their own config choices according to their needs (and good
>> comment suggestions in the yaml). Whatever default config value is
>> chosen for num_tokens, I think it should suit the new users with smaller
>> clusters. The suggestion Mick makes that 16 makes a better choice for
>> small numbers of nodes, well, that would seem to be the better choice
>> for those users we are trying to help the most with the default.
>>
>> I fully agree that science, maths, and support/ops experience should
>> guide the choice, but I don't believe that large/giant clusters and
>> admins are the target audience for the value we select.
>>
>> --
>> Kind regards,
>> Michael
>>
>>
>>


Re: [Discuss] num_tokens default in Cassandra 4.0

2020-01-31 Thread Carl Mueller
So why even have virtual nodes at all? Why not work on improving single-token
approaches so that we can support cluster doubling, which IMO would enable
Cassandra to scale more quickly for volatile loads?

It's my guess/understanding that vnodes eliminate the token rebalancing
that existed back in the days of single tokens. Did vnodes also help reduce
the amount of streamed data in rebalancing/expansion? Vnodes also enable
streaming from multiple sources during expansion, but if that limits us to
single-node expansion, it really limits flexibility on large-node-count clusters.

Were there other advantages to VNodes that I missed?

IIRC, high vnode counts basically broke low-cardinality secondary
indexes, and num_tokens=4 might help that a lot.

But if 4 hasn't shown any balancing issues, I'm all for it.

On Fri, Jan 31, 2020 at 9:58 AM Dimitar Dimitrov 
wrote:

> Hey all,
>
> At some point not too long ago I spent some time trying to
> make the token allocation algorithm the default.
>
> I didn't foresee it, although it might be obvious for many of
> you, but one corollary of the way the algorithm works (or more
> precisely might not work) with multiple seeds or simultaneous
> multi-node bootstraps or decommissions, is that a lot of dtests
> start failing due to deterministic token conflicts. I wasn't
> able to fix that by changing solely ccm and the dtests, unless
> careful, sequential node bootstrap was enforced. While it's strongly
> suggested to users to do exactly that in the real world, it would
> have exploded dtest run times to unacceptable levels.
>
> I have to clarify that what I'm working with is not exactly
> C*, and my knowledge of the C* codebase is not as up to date as
> I would want it to, but I suspect that the above problem might very
> well affect C* too, in which case changing the defaults might
> be a less-than-trivial undertaking.
>
> Regards,
> Dimitar
>
> On Fri, 31.01.2020 at 17:20, Joshua McKenzie 
> wrote:
>
> > >
> > > We should be using the default value that benefits the most people,
> > rather
> > > than an arbitrary compromise.
> >
> > I'd caution we're talking about the default value *we believe* will
> benefit
> > the most people according to our respective understandings of C* usage.
> >
> >  Most clusters don't shrink, they stay the same size or grow. I'd say 90%
> > > or more fall in this category.
> >
> > While I agree with the "most don't shrink, they stay the same or grow"
> > claim intuitively, there's a distinct difference impacting the 4 vs. 16
> > debate between what ratio we think stays the same size and what ratio we
> > think grows that I think informs this discussion.
> >
> > There's a *lot* of Cassandra out in the world, and these changes are
> going
> > to impact all of it. I'm not advocating a certain position on 4 vs. 16,
> but
> > I do think we need to be very careful about how strongly we hold our
> > beliefs and present them as facts in discussions like this.
> >
> > For my unsolicited .02, it sounds an awful lot like we're stuck between a
> > rock and a hard place in that there is no correct "one size fits all"
> > answer here (or, said another way: both 4 and 16 are correct, just for
> > different cases and we don't know / agree on which one we think is the
> > right one to target), so perhaps a discussion on a smart evolution of
> token
> > allocation counts based on quantized tiers of cluster size and dataset
> > growth (either automated or through operational best practices) could be
> > valuable along with this.
> >
> > On Fri, Jan 31, 2020 at 8:57 AM Alexander Dejanovski <
> > a...@thelastpickle.com>
> > wrote:
> >
> > > While I (mostly) understand the maths behind using 4 vnodes as a
> default
> > > (which really is a question of extreme availability), I don't think
> they
> > > provide noticeable performance improvements over using 16, while 16
> > vnodes
> > > will protect folks from imbalances. It is very hard to deal with
> > unbalanced
> > > clusters, and people start to deal with it once some nodes are already
> > > close to being full. Operationally, it's far from trivial.
> > > We're going to make some experiments at bootstrapping clusters with 4
> > > tokens on the latest alpha to see how much balance we can expect, and
> how
> > > removing one node could impact it.
> > >
> > > If we're talking about repairs, using 4 vnodes will generate
> > overstreaming,
> > > which can create lots of serious performance issues. Even on clusters
> > with
> > > 500GB of node density, we never use less than ~15 segments per node
> with
> > > Reaper.
> > > Not everyone uses Reaper, obviously, and there will be no protection
> > > against overstreaming with such a low default for folks not using
> > subrange
> > > repairs.
> > > On small clusters, even with 256 vnodes, using Cassandra 3.0/3.x and
> > Reaper
> > > already makes it possible to get good repair performance because token ranges
> > sharing
> > > the exact same replicas will be processed in a single repair 

gossip tuning

2019-09-26 Thread Carl Mueller
We have three datacenters (EU, AP, US) in aws and have problems
bootstrapping new nodes in the AP datacenter:

java.lang.RuntimeException: A node required to move the data
consistently is down (/SOME_NODE). If you wish to move the data from a
potentially inconsistent replica, restart the node with
-Dcassandra.consistent.rangemovement=false

Per the code, increasing the ring_delay fixed the bootstrap sleep race
condition. But my research brought up some other questions. We are
generally frustrated with AWS networking, but it is what it is, and I was
wondering if we could further tune gossip. We have been getting some
intermittent:

Gossip stage has {} pending tasks; skipping status check (no nodes will be
marked down)

That indicates the GossipStage queue is backing up, and another page indicated
that gossip has one thread in its pool.

For large clusters (>30 nodes from what we've seen in crappy AWS),
wouldn't it be good to increase the threads so more frequent communication
occurs and a slow cross-region gossip connection doesn't slow things? Can the
GossipStage be increased to two threads from one? Or is the Gossiper
single-threaded/not reentrant?

Is the random ordering of selecting a node to gossip-check a shuffle of the
known IPs to contact, or a totally random pick-one every second? The one-second
interval is based on a comment in the code; is that the gossip interval,
and can it be changed?

Would it be a bad idea to create a prioritized gossip request so that nodes
that have been up for a while respond first to bootstrapping nodes / nodes
that are known to be restarting? Is that even possible? That might also get
hinted handoff processing moving more quickly.

For multi-datacenter setups, would it also make sense to have gossip process
local-datacenter gossip requests more quickly (or have a thread dedicated to
fast-tracking that) and handle the "far" nodes in a different thread? Again, to
prevent cross-Pacific gossip requests from holding up other requests that can
be handled more quickly?


Re: Bootstrapping process questions (CASSANDRA-15155)

2019-06-17 Thread Carl Mueller
We are in conversations with AWS, hopefully with an IPv6 expert, to examine
what happened.

On Thu, Jun 13, 2019 at 11:19 AM Carl Mueller 
wrote:

> Our cassandra.ring_delay_ms is currently around 30 to get nodes to
> bootstrap.
>
> On Wed, Jun 12, 2019 at 5:56 PM Carl Mueller 
> wrote:
>
>> We're seeing nodes bootstrapping but not streaming and joining a cluster
>> in 2.2.13.
>>
>> I have been looking through the MigrationManager code and the
>> StorageService code that seems relevant based on the Bootstrap status
>> messages that are coming through. I'll be referencing line numbers from the
>> 2.2.13 source code.
>>
>> Major unit of work #1: prechecks
>>
>> We start our bootstrap process with StorageService:759 checking if we are
>> autobootstrap (yes), bootstrap already in progress (no), bootstrap complete
>> (no), and a check if seeds contains the current broadcast address (no)
>>
>> We proceed through the 771 shouldBootstrap() call to the "JOINING:
>> waiting for ring information" state.
>>
>> Prior to this point, MigrationManager has already Gossip checked the
>> current (nonmatching) schema version uuid. I believe in this state we only
>> have system keyspace setup based on log messages that have already occurred
>> like:
>>
>> Schema.java:421 - Adding org.apache.cassandra.config.CFMetaData@35645047
>> [cfId=5a1ff267-ace0-3f12-8563-cfae6103c65e,ksName=system,cfName=sstable_activity
>>
>> Those only have system tables listed, nothing else.
>>
>> Major unit of work #2: Ring Information
>>
>> Here we seem to poll all the nodes getting their token responsibilities.
>> Each time for almost every node we also get:
>>
>> DEBUG [GossipStage:1] 2019-06-12 15:20:07,621 MigrationManager.java:96 -
>> Not pulling schema because versions match or shouldPullSchemaFrom returned
>> false
>>
>> However our schema versions DON'T match, so the shouldPullSchemaFrom must
>> be short-circuiting this. shouldPullSchemaFrom() is a one-liner in
>> MigrationManager on line 151; it seems to check whether it knows the IP it
>> is talking to, whether that version's schema matches (which it definitely
>> should not at this point), and whether it is a gossip-only member
>> (isGossipOnlyMember), which I may have seen referred to as a "fat client".
>> Perhaps we are defaulting to this state
>> somehow for nodes/endpoints that are far away or not communicating? Or our
>> view hasn't fully initialized yet?
>>
>> Anyway, I think these are all occurring in StorageService line 779, a for
>> loop. We do not see any of the "got schema" messages that would break this
>> for loop. So we must reach the dealy point? Is the delay parameter equal
>> to -Dcassandra.ring_delay_ms setting? Because perhaps the successes we DID
>> see were due to the increase to ring delay and that allowed this for loop
>> to actually complete?
>>
>> Anyway, we seem to make it out of that for loop without initializing ring
>> information, which also seems to not synchronize the schema.
>>
>> From there we seem to sail through line 791, the
>> MigrationManager.isReadyForBootstrap() for some reason. This was also
>> mentioned in https://issues.apache.org/jira/browse/CASSANDRA-6648 as
>> leading to bootstrap race conditions similar to ours and was allegedly
>> fixed back in 1.2.X/2.0.X releases.
>>
>> Major Unit of work #3: perform bootstrap tasks
>>
>> lines 796+ of StorageService output ready to bootstrap / range
>> calculation messages, and confirm ring + schema info has been performed,
>> even though the logs are showing we didn't seem to sync schema.
>>
>> the strict consistency check is performed at line 803, which we are using
>> by the way
>>
>> The log outputs the getting bootstrap token status message of JOINING
>> from line 821
>>
>> From here there is a ton of log errors for UnknownColumnFamilyExceptions.
>> Hundreds of them
>>
>> 19-06-12 15:21:47,644 IncomingTcpConnection.java:100 -
>> UnknownColumnFamilyException reading from socket; closing
>> org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find
>> cfId=dd5d7fa0-1e50-11e9-a62d-0d41c97b2404
>>
>> At the end of these exceptions are three log statements:
>>
>> INFO  [main] 2019-06-12 15:23:25,515 StorageService.java:1142 - JOINING:
>> Starting to bootstrap...
>>
>> INFO  [main] 2019-06-12 15:23:25,525 StreamResultFuture.java:87 - [Stream
>> #05af9ee0-8d26-11e9-85c1-bd5476090c54] Executing streaming plan for
>> Bootstrap
>>
>> INFO  [main] 2019

Re: Bootstrapping process questions (CASSANDRA-15155)

2019-06-13 Thread Carl Mueller
Our cassandra.ring_delay_ms is currently around 30 to get nodes to
bootstrap.

On Wed, Jun 12, 2019 at 5:56 PM Carl Mueller 
wrote:

> We're seeing nodes bootstrapping but not streaming and joining a cluster
> in 2.2.13.
>
> I have been looking through the MigrationManager code and the
> StorageService code that seems relevant based on the Bootstrap status
> messages that are coming through. I'll be referencing line numbers from the
> 2.2.13 source code.
>
> Major unit of work #1: prechecks
>
> We start our bootstrap process with StorageService:759 checking if we are
> autobootstrap (yes), bootstrap already in progress (no), bootstrap complete
> (no), and a check if seeds contains the current broadcast address (no)
>
> We proceed through the 771 shouldBootstrap() call to the "JOINING: waiting
> for ring information" state.
>
> Prior to this point, MigrationManager has already Gossip checked the
> current (nonmatching) schema version uuid. I believe in this state we only
> have system keyspace setup based on log messages that have already occurred
> like:
>
> Schema.java:421 - Adding org.apache.cassandra.config.CFMetaData@35645047
> [cfId=5a1ff267-ace0-3f12-8563-cfae6103c65e,ksName=system,cfName=sstable_activity
>
> Those only have system tables listed, nothing else.
>
> Major unit of work #2: Ring Information
>
> Here we seem to poll all the nodes getting their token responsibilities.
> Each time for almost every node we also get:
>
> DEBUG [GossipStage:1] 2019-06-12 15:20:07,621 MigrationManager.java:96 -
> Not pulling schema because versions match or shouldPullSchemaFrom returned
> false
>
> However our schema versions DON'T match, so the shouldPullSchemaFrom must
> be short-circuiting this. shouldPullSchemaFrom() is a one-liner in
> MigrationManager on line 151; it seems to check whether it knows the IP it
> is talking to, whether that version's schema matches (which it definitely
> should not at this point), and whether it is a gossip-only member
> (isGossipOnlyMember), which I may have seen referred to as a "fat client".
> Perhaps we are defaulting to this state
> somehow for nodes/endpoints that are far away or not communicating? Or our
> view hasn't fully initialized yet?
>
> Anyway, I think these are all occurring in StorageService line 779, a for
> loop. We do not see any of the "got schema" messages that would break this
> for loop. So we must reach the delay point? Is the delay parameter equal
> to the -Dcassandra.ring_delay_ms setting? Because perhaps the successes we DID
> see were due to the increase to ring delay and that allowed this for loop
> to actually complete?
>
> Anyway, we seem to make it out of that for loop without initializing ring
> information, which also seems to not synchronize the schema.
>
> From there we seem to sail through line 791, the
> MigrationManager.isReadyForBootstrap() for some reason. This was also
> mentioned in https://issues.apache.org/jira/browse/CASSANDRA-6648 as
> leading to bootstrap race conditions similar to ours and was allegedly
> fixed back in 1.2.X/2.0.X releases.
>
> Major Unit of work #3: perform bootstrap tasks
>
> lines 796+ of StorageService output ready to bootstrap / range calculation
> messages, and confirm ring + schema info has been performed, even though
> the logs are showing we didn't seem to sync schema.
>
> the strict consistency check is performed at line 803, which we are using
> by the way
>
> The log outputs the getting bootstrap token status message of JOINING from
> line 821
>
> From here there is a ton of log errors for UnknownColumnFamilyExceptions.
> Hundreds of them
>
> 19-06-12 15:21:47,644 IncomingTcpConnection.java:100 -
> UnknownColumnFamilyException reading from socket; closing
> org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find
> cfId=dd5d7fa0-1e50-11e9-a62d-0d41c97b2404
>
> At the end of these exceptions are three log statements:
>
> INFO  [main] 2019-06-12 15:23:25,515 StorageService.java:1142 - JOINING:
> Starting to bootstrap...
>
> INFO  [main] 2019-06-12 15:23:25,525 StreamResultFuture.java:87 - [Stream
> #05af9ee0-8d26-11e9-85c1-bd5476090c54] Executing streaming plan for
> Bootstrap
>
> INFO  [main] 2019-06-12 15:23:25,526 StorageService.java:1199 - Bootstrap
> completed! for the tokens [-7314981925085449175, <254 more tokens>,
> 5499447097629838103]
>
> And that is where we think our node has joined the cluster as a data-less
> UN node. Maybe like a "fat client"? Who knows.
>
> Bootstrap completed! message comes from a Future in StorageService:1199,
> which was produced on 1192 by bootstrapper.bootstrap(). It times out in 10s
> of millis or less, so it clearly isn't doing anything or hasn't b

Bootstrapping process questions (CASSANDRA-15155)

2019-06-12 Thread Carl Mueller
We're seeing nodes bootstrapping but not streaming and joining a cluster in
2.2.13.

I have been looking through the MigrationManager code and the
StorageService code that seems relevant based on the Bootstrap status
messages that are coming through. I'll be referencing line numbers from the
2.2.13 source code.

Major unit of work #1: prechecks

We start our bootstrap process with StorageService:759 checking if we are
autobootstrap (yes), bootstrap already in progress (no), bootstrap complete
(no), and a check if seeds contains the current broadcast address (no)

We proceed through the 771 shouldBootstrap() call to the "JOINING: waiting
for ring information" state.

Prior to this point, MigrationManager has already Gossip checked the
current (nonmatching) schema version uuid. I believe in this state we only
have system keyspace setup based on log messages that have already occurred
like:

Schema.java:421 - Adding org.apache.cassandra.config.CFMetaData@35645047
[cfId=5a1ff267-ace0-3f12-8563-cfae6103c65e,ksName=system,cfName=sstable_activity

Those only have system tables listed, nothing else.

Major unit of work #2: Ring Information

Here we seem to poll all the nodes getting their token responsibilities.
Each time for almost every node we also get:

DEBUG [GossipStage:1] 2019-06-12 15:20:07,621 MigrationManager.java:96 -
Not pulling schema because versions match or shouldPullSchemaFrom returned
false

However our schema versions DON'T match, so the shouldPullSchemaFrom must
be short-circuiting this. shouldPullSchemaFrom() is a one-liner in
MigrationManager on line 151; it seems to check whether it knows the IP it
is talking to, whether that version's schema matches (which it definitely
should not at this point), and whether it is a gossip-only member
(isGossipOnlyMember), which I may have seen referred to as a "fat client".
Perhaps we are defaulting to this state
somehow for nodes/endpoints that are far away or not communicating? Or our
view hasn't fully initialized yet?
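
For reference, here is my condensed paraphrase of that check, written from
memory of the 2.x source rather than copied from 2.2.13, so treat the exact
names and conditions as approximate:

import java.net.InetAddress;

import org.apache.cassandra.gms.Gossiper;
import org.apache.cassandra.net.MessagingService;

// Approximate paraphrase (not verbatim 2.2.13): schema is only pulled from an
// endpoint whose messaging version is known and current, and that is not a
// gossip-only member ("fat client").
final class SchemaPullCheck
{
    static boolean shouldPullSchemaFrom(InetAddress endpoint)
    {
        return MessagingService.instance().knowsVersion(endpoint)
               && MessagingService.instance().getRawVersion(endpoint) == MessagingService.current_version
               && !Gossiper.instance.isGossipOnlyMember(endpoint);
    }
}

If that paraphrase is right, the version being compared is the messaging
protocol version rather than the schema version, which would explain why a
mismatched schema by itself doesn't force a pull here.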

Anyway, I think these are all occurring in StorageService line 779, a for
loop. We do not see any of the "got schema" messages that would break this
for loop. So we must reach the delay point? Is the delay parameter equal
to the -Dcassandra.ring_delay_ms setting? Because perhaps the successes we DID
see were due to the increase to ring delay and that allowed this for loop
to actually complete?

Anyway, we seem to make it out of that for loop without initializing ring
information, which also seems to not synchronize the schema.

From there we seem to sail through line 791, the
MigrationManager.isReadyForBootstrap() for some reason. This was also
mentioned in https://issues.apache.org/jira/browse/CASSANDRA-6648 as
leading to bootstrap race conditions similar to ours and was allegedly
fixed back in 1.2.X/2.0.X releases.

Major Unit of work #3: perform bootstrap tasks

lines 796+ of StorageService output ready to bootstrap / range calculation
messages, and confirm ring + schema info has been performed, even though
the logs are showing we didn't seem to sync schema.

the strict consistency check is performed at line 803, which we are using
by the way

The log outputs the getting bootstrap token status message of JOINING from
line 821

From here there is a ton of log errors for UnknownColumnFamilyExceptions.
Hundreds of them

19-06-12 15:21:47,644 IncomingTcpConnection.java:100 -
UnknownColumnFamilyException reading from socket; closing
org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find
cfId=dd5d7fa0-1e50-11e9-a62d-0d41c97b2404

At the end of these exceptions are three log statements:

INFO  [main] 2019-06-12 15:23:25,515 StorageService.java:1142 - JOINING:
Starting to bootstrap...

INFO  [main] 2019-06-12 15:23:25,525 StreamResultFuture.java:87 - [Stream
#05af9ee0-8d26-11e9-85c1-bd5476090c54] Executing streaming plan for
Bootstrap

INFO  [main] 2019-06-12 15:23:25,526 StorageService.java:1199 - Bootstrap
completed! for the tokens [-7314981925085449175, <254 more tokens>,
5499447097629838103]

And that is where we think our node has joined the cluster as a data-less
UN node. Maybe like a "fat client"? Who knows.

Bootstrap completed! message comes from a Future in StorageService:1199,
which was produced on 1192 by bootstrapper.bootstrap(). It times out in 10s
of millis or less, so it clearly isn't doing anything or hasn't been given
anything to do.

--

AFTER the bootstrap completion message, we have messages for setting up

system_traces and system_distributed

These come 10s of millis after the completed future, so they must be lines
905 and 906 executing.

THEN we have more messages for gossiping schema version (three or so) where
FINALLY the UUID of the schema matches all the other schema uuids as shown
in gossipinfo. I speculate this is occurring in the finishJoiningRing() call
on line 912.


---

So it appears the "guard" for loop is not sufficiently guarding, and schema
is not being synchronized in our case in 

Upgrading 2.1.x with EC2MRS problems: CASSANDRA-15068

2019-03-26 Thread Carl Mueller
Can someone do a quick check of

https://issues.apache.org/jira/browse/CASSANDRA-15068?jql=text%20~%20%22EC2MRS%22

I think we may write a custom old-behavior snitch class that doesn't have the
broadcast_rpc_address==null check and see if that works.

But that seems extreme; can someone who knows the EC2MRS ins and outs tell
me if we're just doing something fundamentally wrong in our configuration?

We have in 2.1.x:

listen_address:  vpc internal address
rpc_address: 0.0.0.0
broadcast_rpc_address: vpc internal address (which is overridden by EC2MRS
in 2.1.x to public IP)
broadcast_address is commented out.


Re: SSTable exclusion from read path based on sstable metadata marked by custom compaction strategies

2019-02-01 Thread Carl Mueller
I'd still need a "all events for app_id" query. We have seconds-level
events :-(


On Fri, Feb 1, 2019 at 3:02 PM Jeff Jirsa  wrote:

> On Fri, Feb 1, 2019 at 12:58 PM Carl Mueller
>  wrote:
>
> > Jeff: so the partition key with timestamp would then need a separate
> index
> > table to track the appid->partition keys. Which isn't horrible, but also
> > tracks into another desire of mine: some way to make the replica mapping
> > match locally between the index table and the data table:
> >
> > So in the composite partition key for the TWCS table, you'd have app_id +
> > timestamp, BUT ONLY THE app_id GENERATES the hash/key.
> >
> >
> Huh? No, you'd have a composite partition key of app_id + timestamp
> ROUNDED/CEIL/FLOOR to some time window, and both would be used for
> hash/key.
>
> And you don't need any extra table, because app_id is known and the
> timestamp can be calculated (e.g., 4 digits of year + 3 digits for day of
> year makes today 2019032)
>
>
>
> > Thus it would match with the index table that is just partition key
> app_id,
> > column key timestamp.
> >
> > And then theoretically a node-local "join" could be done without an
> > additional query hop, and batched updates would be more easily atomic to
> a
> > single node.
> >
> > Now how we would communicate all that in CQL/etc: who knows. Hm. Maybe
> > materialized views cover this, but I haven't tracked that since we don't
> > have versions that support them and they got "deprecated".
> >
> >
> > On Fri, Feb 1, 2019 at 2:53 PM Carl Mueller <
> carl.muel...@smartthings.com>
> > wrote:
> >
> > > Interesting. Now that we have semiautomated upgrades, we are going to
> > > hopefully get everything to 3.11X once we get the intermediate hop to
> > 2.2.
> > >
> > > I'm thinking we could also use sstable metadata markings + custom
> > > compactors for things like multiple customers on the same table. So you
> > > could sequester the data for a customer in their own sstables and then
> > > queries could effectively be subdivided against only the sstables that
> > had
> > > that customer. Maybe the min and max would cover that, I'd have to look
> > at
> > > the details.
> > >
> > > On Thu, Jan 31, 2019 at 8:11 PM Jonathan Haddad 
> > wrote:
> > >
> > >> In addition to what Jeff mentioned, there was an optimization in 3.4
> > that
> > >> can significantly reduce the number of sstables accessed when a LIMIT
> > >> clause was used.  This can be a pretty big win with TWCS.
> > >>
> > >>
> > >>
> >
> http://thelastpickle.com/blog/2017/03/07/The-limit-clause-in-cassandra-might-not-work-as-you-think.html
> > >>
> > >> On Thu, Jan 31, 2019 at 5:50 PM Jeff Jirsa  wrote:
> > >>
> > >> > In my original TWCS talk a few years back, I suggested that people
> > make
> > >> > the partitions match the time window to avoid exactly what you’re
> > >> > describing. I added that to the talk because my first team that used
> > >> TWCS
> > >> > (the team for which I built TWCS) had a data model not unlike yours,
> > and
> > >> > the read-every-sstable thing turns out not to work that well if you
> > have
> > >> > lots of windows (or very large partitions). If you do this, you can
> > fan
> > >> out
> > >> > a bunch of async reads for the first few days and ask for more as
> you
> > >> need
> > >> > to fill the page - this means the reads are more distributed, too,
> > >> which is
> > >> > an extra bonus when you have noisy partitions.
> > >> >
> > >> > In 3.0 and newer (I think, don’t quote me in the specific version),
> > the
> > >> > sstable metadata has the min and max clustering which helps exclude
> > >> > sstables from the read path quite well if everything in the table is
> > >> using
> > >> > timestamp clustering columns. I know there was some issue with this
> > and
> > >> RTs
> > >> > recently, so I’m not sure if it’s current state, but worth
> considering
> > >> that
> > >> > this may be much better on 3.0+
> > >> >
> > >> >
> > >> >
> > >> > --
> > >> > Jeff Jirsa
> > >> >
> > >> >
> > >> > > On Jan 31, 

Re: SSTable exclusion from read path based on sstable metadata marked by custom compaction strategies

2019-02-01 Thread Carl Mueller
Jeff: so the partition key with timestamp would then need a separate index
table to track the appid->partition keys. Which isn't horrible, but also
tracks into another desire of mine: some way to make the replica mapping
match locally between the index table and the data table:

So in the composite partition key for the TWCS table, you'd have app_id +
timestamp, BUT ONLY THE app_id GENERATES the hash/key.

Thus it would match with the index table that is just partition key app_id,
column key timestamp.

And then theoretically a node-local "join" could be done without an
additional query hop, and batched updates would be more easily atomic to a
single node.

Now how we would communicate all that in CQL/etc: who knows. Hm. Maybe
materialized views cover this, but I haven't tracked that since we don't
have versions that support them and they got "deprecated".


On Fri, Feb 1, 2019 at 2:53 PM Carl Mueller 
wrote:

> Interesting. Now that we have semiautomated upgrades, we are going to
> hopefully get everything to 3.11X once we get the intermediate hop to 2.2.
>
> I'm thinking we could also use sstable metadata markings + custom
> compactors for things like multiple customers on the same table. So you
> could sequester the data for a customer in their own sstables and then
> queries could effectively be subdivided against only the sstables that had
> that customer. Maybe the min and max would cover that, I'd have to look at
> the details.
>
> On Thu, Jan 31, 2019 at 8:11 PM Jonathan Haddad  wrote:
>
>> In addition to what Jeff mentioned, there was an optimization in 3.4 that
>> can significantly reduce the number of sstables accessed when a LIMIT
>> clause was used.  This can be a pretty big win with TWCS.
>>
>>
>> http://thelastpickle.com/blog/2017/03/07/The-limit-clause-in-cassandra-might-not-work-as-you-think.html
>>
>> On Thu, Jan 31, 2019 at 5:50 PM Jeff Jirsa  wrote:
>>
>> > In my original TWCS talk a few years back, I suggested that people make
>> > the partitions match the time window to avoid exactly what you’re
>> > describing. I added that to the talk because my first team that used
>> TWCS
>> > (the team for which I built TWCS) had a data model not unlike yours, and
>> > the read-every-sstable thing turns out not to work that well if you have
>> > lots of windows (or very large partitions). If you do this, you can fan
>> out
>> > a bunch of async reads for the first few days and ask for more as you
>> need
>> > to fill the page - this means the reads are more distributed, too,
>> which is
>> > an extra bonus when you have noisy partitions.
>> >
>> > In 3.0 and newer (I think, don’t quote me in the specific version), the
>> > sstable metadata has the min and max clustering which helps exclude
>> > sstables from the read path quite well if everything in the table is
>> using
>> > timestamp clustering columns. I know there was some issue with this and
>> RTs
>> > recently, so I’m not sure if it’s current state, but worth considering
>> that
>> > this may be much better on 3.0+
>> >
>> >
>> >
>> > --
>> > Jeff Jirsa
>> >
>> >
>> > > On Jan 31, 2019, at 1:56 PM, Carl Mueller <
>> carl.muel...@smartthings.com.invalid>
>> > wrote:
>> > >
>> > > Situation:
>> > >
>> > > We use TWCS for a task history table (partition is user, column key is
>> > > timeuuid of task, TWCS is used due to tombstone TTLs that rotate out
>> the
>> > > tasks every say month. )
>> > >
>> > > However, if we want to get a "slice" of tasks (say, tasks in the last
>> two
>> > > days and we are using TWCS sstable blocks of 12 hours).
>> > >
>> > > The problem is, this is a frequent user and they have tasks in ALL the
>> > > sstables that are organized by the TWCS into time-bucketed sstables.
>> > >
>> > > So Cassandra has to first read in, say 80 sstables to reconstruct the
>> > row,
>> > > THEN it can exclude/slice on the column key.
>> > >
>> > > Question:
>> > >
>> > > Or am I wrong that the read path needs to grab all relevant sstables
>> > before
>> > > applying column key slicing and this is possible? Admittedly we are in
>> > 2.1
>> > > for this table (we in the process of upgrading now that we have an
>> > > automated upgrading program that seems to work pretty well)
>> > >
>> > > If my assumption is co

Re: SSTable exclusion from read path based on sstable metadata marked by custom compaction strategies

2019-02-01 Thread Carl Mueller
Interesting. Now that we have semiautomated upgrades, we are going to
hopefully get everything to 3.11X once we get the intermediate hop to 2.2.

I'm thinking we could also use sstable metadata markings + custom
compactors for things like multiple customers on the same table. So you
could sequester the data for a customer in their own sstables and then
queries could effectively be subdivided against only the sstables that had
that customer. Maybe the min and max would cover that, I'd have to look at
the details.

On Thu, Jan 31, 2019 at 8:11 PM Jonathan Haddad  wrote:

> In addition to what Jeff mentioned, there was an optimization in 3.4 that
> can significantly reduce the number of sstables accessed when a LIMIT
> clause was used.  This can be a pretty big win with TWCS.
>
>
> http://thelastpickle.com/blog/2017/03/07/The-limit-clause-in-cassandra-might-not-work-as-you-think.html
>
> On Thu, Jan 31, 2019 at 5:50 PM Jeff Jirsa  wrote:
>
> > In my original TWCS talk a few years back, I suggested that people make
> > the partitions match the time window to avoid exactly what you’re
> > describing. I added that to the talk because my first team that used TWCS
> > (the team for which I built TWCS) had a data model not unlike yours, and
> > the read-every-sstable thing turns out not to work that well if you have
> > lots of windows (or very large partitions). If you do this, you can fan
> out
> > a bunch of async reads for the first few days and ask for more as you
> need
> > to fill the page - this means the reads are more distributed, too, which
> is
> > an extra bonus when you have noisy partitions.
> >
> > In 3.0 and newer (I think, don’t quote me in the specific version), the
> > sstable metadata has the min and max clustering which helps exclude
> > sstables from the read path quite well if everything in the table is
> using
> > timestamp clustering columns. I know there was some issue with this and
> RTs
> > recently, so I’m not sure if it’s current state, but worth considering
> that
> > this may be much better on 3.0+
> >
> >
> >
> > --
> > Jeff Jirsa
> >
> >
> > > On Jan 31, 2019, at 1:56 PM, Carl Mueller <
> carl.muel...@smartthings.com.invalid>
> > wrote:
> > >
> > > Situation:
> > >
> > > We use TWCS for a task history table (partition is user, column key is
> > > timeuuid of task, TWCS is used due to tombstone TTLs that rotate out
> the
> > > tasks every say month. )
> > >
> > > However, if we want to get a "slice" of tasks (say, tasks in the last
> two
> > > days and we are using TWCS sstable blocks of 12 hours).
> > >
> > > The problem is, this is a frequent user and they have tasks in ALL the
> > > sstables that are organized by the TWCS into time-bucketed sstables.
> > >
> > > So Cassandra has to first read in, say 80 sstables to reconstruct the
> > row,
> > > THEN it can exclude/slice on the column key.
> > >
> > > Question:
> > >
> > > Or am I wrong that the read path needs to grab all relevant sstables
> > before
> > > applying column key slicing and this is possible? Admittedly we are in
> > 2.1
> > > for this table (we in the process of upgrading now that we have an
> > > automated upgrading program that seems to work pretty well)
> > >
> > > If my assumption is correct, then the compaction strategy knows as it
> > > writes the sstables what it is bucketing them as (and could encode in
> > > sstable metadata?). If my assumption about slicing is that the whole
> row
> > > needs reconstruction, if we had a perfect infinite monkey coding team
> > that
> > > could generate whatever we wanted within some feasibility, could we
> > provide
> > > special hooks to do sstable exclusion based on metadata if we know that
> > > that the metadata will indicate exclusion/inclusion of columns based on
> > > metadata?
> > >
> > > Goal:
> > >
> > > The overall goal would be to support exclusion of sstables from a read
> > > path, in case we had compaction strategies hand-tailored for other
> > queries.
> > > Essentially we would be doing a first-pass bucketsort exclusion with
> the
> > > sstable metadata marking the buckets. This might aid support of
> superwide
> > > rows and paging through column keys if we allowed the table creator to
> > > specify bucketing as flushing occurs. In general it appears query
> > > performance quickly degrades based on # sstables required for a loo

SSTable exclusion from read path based on sstable metadata marked by custom compaction strategies

2019-01-31 Thread Carl Mueller
Situation:

We use TWCS for a task history table (partition is user, column key is
timeuuid of task, TWCS is used due to tombstone TTLs that rotate out the
tasks every, say, month.)

However, if we want to get a "slice" of tasks (say, tasks in the last two
days and we are using TWCS sstable blocks of 12 hours).

The problem is, this is a frequent user and they have tasks in ALL the
sstables that are organized by the TWCS into time-bucketed sstables.

So Cassandra has to first read in, say 80 sstables to reconstruct the row,
THEN it can exclude/slice on the column key.

Question:

Or am I wrong that the read path needs to grab all relevant sstables before
applying column key slicing and this is possible? Admittedly we are in 2.1
for this table (we are in the process of upgrading now that we have an
automated upgrading program that seems to work pretty well)

If my assumption is correct, then the compaction strategy knows as it
writes the sstables what it is bucketing them as (and could encode in
sstable metadata?). If my assumption about slicing is correct and the whole row
needs reconstruction, and if we had a perfect infinite monkey coding team that
could generate whatever we wanted within reason, could we provide
special hooks to do sstable exclusion based on metadata, given that we know
the metadata will indicate exclusion/inclusion of columns?

Goal:

The overall goal would be to support exclusion of sstables from a read
path, in case we had compaction strategies hand-tailored for other queries.
Essentially we would be doing a first-pass bucketsort exclusion with the
sstable metadata marking the buckets. This might aid support of superwide
rows and paging through column keys if we allowed the table creator to
specify bucketing as flushing occurs. In general it appears query
performance quickly degrades based on # sstables required for a lookup.

I still don't know the code nearly well enough to do patches, but based on my
look at custom compaction strategies and the basic read path, it would seem
that this would be a useful extension for advanced users.

The fallback would be a set of tables to serve as buckets and we span the
buckets with queries when one bucket runs out. The tables rotate.


Re: SEDA, queues, and a second lower-priority queue set

2019-01-16 Thread Carl Mueller
Additionally, a certain number of the threads in each stage could be
restricted from serving the low-priority queues at all, say 8/32 or 16/32
threads, to further ensure processing availability to the higher-priority
tasks.
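
Purely to illustrate the shape of the idea (invented names, plain JDK, nothing
taken from Cassandra's actual stage code): each stage gets two queues, every
worker drains the high-priority queue first, and only some workers are allowed
to touch the low-priority queue at all.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class TwoTierStage
{
    private final BlockingQueue<Runnable> high = new LinkedBlockingQueue<>();
    private final BlockingQueue<Runnable> low  = new LinkedBlockingQueue<>();

    // reservedForHigh workers never serve the low-priority queue, so a slice
    // of the pool always stays available for normal-priority tasks.
    public void start(int totalThreads, int reservedForHigh)
    {
        for (int i = 0; i < totalThreads; i++)
        {
            boolean mayServeLow = i >= reservedForHigh;
            Thread t = new Thread(() -> workerLoop(mayServeLow), "stage-worker-" + i);
            t.setDaemon(true);
            t.start();
        }
    }

    private void workerLoop(boolean mayServeLow)
    {
        while (!Thread.currentThread().isInterrupted())
        {
            try
            {
                // high-priority work always wins; low-priority work is only
                // picked up when the high queue is momentarily empty
                Runnable task = high.poll(10, TimeUnit.MILLISECONDS);
                if (task == null && mayServeLow)
                    task = low.poll(10, TimeUnit.MILLISECONDS);
                if (task != null)
                    task.run();
            }
            catch (InterruptedException e)
            {
                Thread.currentThread().interrupt();
            }
        }
    }

    public void submit(Runnable task, boolean lowPriority)
    {
        (lowPriority ? low : high).add(task);
    }
}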

On Wed, Jan 16, 2019 at 3:04 PM Carl Mueller 
wrote:

> At a theoretical level assuming it could be implemented with a magic wand,
> would there be value to having a dual set of queues/threadpools at each of
> the SEDA stages inside Cassandra for a two-tier priority scheme? Such that you
> could mark queries that return pages and pages of data as lower-priority
> while smaller single-partition queries could be marked/defaulted as normal
> priority, such that the lower-priority queues are only served if the normal
> priority queues are empty?
>
> I suppose rough equivalency to this would be dual-datacenter with an
> analysis cluster to serve the "slow" queries and a frontline one for the
> higher priority stuff.
>
> However, it has come up several times that I'd like to run a one-off
> maintenance job/query against production that could not be easily changed
> (can't just throw up a DC), and while I can do app-level throttling with
> some pain and sweat, it would seem something like this could do
> lower-priority work in a somewhat-loaded cluster without impacting the main
> workload.
>
>
>


SEDA, queues, and a second lower-priority queue set

2019-01-16 Thread Carl Mueller
At a theoretical level assuming it could be implemented with a magic wand,
would there be value to having a dual set of queues/threadpools at each of
the SEDA stages inside Cassandra for a two-tier priority scheme? Such that you
could mark queries that return pages and pages of data as lower-priority
while smaller single-partition queries could be marked/defaulted as normal
priority, such that the lower-priority queues are only served if the normal
priority queues are empty?

I suppose rough equivalency to this would be dual-datacenter with an
analysis cluster to serve the "slow" queries and a frontline one for the
higher priority stuff.

However, it has come up several times that I'd like to run a one-off
maintenance job/query against production that could not be easily changed
(can't just throw up a DC), and while I can do app-level throttling with
some pain and sweat, it would seem something like this could do
lower-priority work in a somewhat-loaded cluster without impacting the main
workload.


Re: EOL 2.1 series?

2019-01-16 Thread Carl Mueller
A second vote for any bugfixes we can get for 2.1, we'd probably use it.

We are finally getting traction behind upgrades with an automated
upgrade/rollback for 2.1 --> 2.2, but it's going to be a while.

Aaron, if you want to use our 2.1-2.2 migration tool, we can talk about it at
the next MPLS meetup. I'm not saying it's the hot sauce, but it does seem to
do things fairly well based on the 10 or so clusters we've done with it (and
the bugfixes it has shown).

The code hasn't been OSS'd yet, but it is slated to be.

On Tue, Jan 8, 2019 at 8:49 AM Aaron Ploetz  wrote:

> Speaking as a user with a few large clusters still running on 2.1, I think
> a final release of it with some additional fixes would be welcomed.  That
> being said, I thought we were holding off on EOLing 2.1 until 4.0 is
> released.  Although, the timings probably don’t need to coincide.
>
> In short, a (non-binding) final release of 2.1 is a good idea.
>
> Thanks,
>
> Aaron
>
>
> > On Jan 7, 2019, at 8:01 PM, Michael Shuler 
> wrote:
> >
> > It came to my attention on IRC a week or so ago, and following up on the
> > ticket that someone asked if they should commit to 2.1, that developers
> > have been actively ignoring the 2.1 branch. If we're not committing
> > critical fixes there, even when we know they exist, I think it's time to
> > just call it EOL and do one last release of the few fixes that did get
> > committed. Comments?
> >
> > Michael
> >
> >
>
>
>


Re: CASSANDRA-13241 lower default chunk_length_in_kb

2018-11-03 Thread Carl Mueller
IMO slightly bigger memory requirements in exchange for substantial
improvements are a good trade, especially for a 4.0 release of the database.
Optane and lots of other memory options are coming down the hardware pipeline,
and risk-wise almost all Cassandra people know to testbed the major versions,
so major versions are a good time for significant default changes (vnode
count, this). I've read TLP blogs on this before, and the memory impact only
seems to get huge for node sizes that are already beyond the ideal, and if
people want to run nodes that big then fine, run big memory too.

But I don't actually write code for the project so I don't count :-)
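
For a rough sense of scale (my back-of-envelope assumption is roughly one
8-byte off-heap offset per compressed chunk, which is the bookkeeping being
discussed):

public class ChunkOffsetMemory
{
    // memory for the chunk-offset map grows linearly as chunk_length_in_kb shrinks
    static long offsetMapBytes(long dataBytesOnDisk, int chunkLengthInKb)
    {
        long chunks = dataBytesOnDisk / (chunkLengthInKb * 1024L);
        return chunks * 8L; // assume one 8-byte offset per chunk
    }

    public static void main(String[] args)
    {
        long oneTiB = 1024L * 1024 * 1024 * 1024;
        System.out.println("64 KB chunks on 1 TiB: " + (offsetMapBytes(oneTiB, 64) >> 20) + " MiB"); // ~128 MiB
        System.out.println(" 4 KB chunks on 1 TiB: " + (offsetMapBytes(oneTiB, 4) >> 20) + " MiB");  // ~2048 MiB
    }
}

That is why the impact only really bites at node densities that are already
past the sweet spot.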

On Mon, Oct 29, 2018 at 2:42 PM Jonathan Haddad  wrote:

> Looks straightforward, I can review today.
>
> On Mon, Oct 29, 2018 at 12:25 PM Ariel Weisberg  wrote:
>
> > Hi,
> >
> > Seeing too many -'s for changing the representation and essentially no
> +1s
> > so I submitted a patch for just changing the default. I could use a
> > reviewer for https://issues.apache.org/jira/browse/CASSANDRA-13241
> >
> > I created https://issues.apache.org/jira/browse/CASSANDRA-14857  "Use a
> > more space efficient representation for compressed chunk offsets" for
> post
> > 4.0.
> >
> > Regards,
> > Ariel
> >
> > On Tue, Oct 23, 2018, at 11:46 AM, Ariel Weisberg wrote:
> > > Hi,
> > >
> > > To summarize who we have heard from so far
> > >
> > > WRT to changing just the default:
> > >
> > > +1:
> > > Jon Haddad
> > > Ben Bromhead
> > > Alain Rodriguez
> > > Sankalp Kohli (not explicit)
> > >
> > > -0:
> > > Sylvain Lebresne
> > > Jeff Jirsa
> > >
> > > Not sure:
> > > Kurt Greaves
> > > Joshua McKenzie
> > > Benedict Elliott Smith
> > >
> > > WRT to change the representation:
> > >
> > > +1:
> > > There are only conditional +1s at this point
> > >
> > > -0:
> > > Sylvain Lebresne
> > >
> > > -.5:
> > > Jeff Jirsa
> > >
> > > This
> > > (
> >
> https://github.com/aweisberg/cassandra/commit/a9ae85daa3ede092b9a1cf84879fb1a9f25b9dce
> )
> >
> > > is a rough cut of the change for the representation. It needs better
> > > naming, unit tests, javadoc etc. but it does implement the change.
> > >
> > > Ariel
> > > On Fri, Oct 19, 2018, at 3:42 PM, Jonathan Haddad wrote:
> > > > Sorry, to be clear - I'm +1 on changing the configuration default,
> but
> > I
> > > > think changing the compression in memory representations warrants
> > further
> > > > discussion and investigation before making a case for or against it
> > yet.
> > > > An optimization that reduces in memory cost by over 50% sounds pretty
> > good
> > > > and we never were really explicit that those sort of optimizations
> > would be
> > > > excluded after our feature freeze.  I don't think they should
> > necessarily
> > > > be excluded at this time, but it depends on the size and risk of the
> > patch.
> > > >
> > > > On Sat, Oct 20, 2018 at 8:38 AM Jonathan Haddad 
> > wrote:
> > > >
> > > > > I think we should try to do the right thing for the most people
> that
> > we
> > > > > can.  The number of folks impacted by 64KB is huge.  I've worked on
> > a lot
> > > > > of clusters created by a lot of different teams, going from brand
> > new to
> > > > > pretty damn knowledgeable.  I can't think of a single time over the
> > last 2
> > > > > years that I've seen a cluster use non-default settings for
> > compression.
> > > > > With only a handful of exceptions, I've lowered the chunk size
> > considerably
> > > > > (usually to 4 or 8K) and the impact has always been very
> noticeable,
> > > > > frequently resulting in hardware reduction and cost savings.  Of
> all
> > the
> > > > > poorly chosen defaults we have, this is one of the biggest
> offenders
> > that I
> > > > > see.  There's a good reason ScyllaDB  claims they're so much faster
> > than
> > > > > Cassandra - we ship a DB that performs poorly for 90+% of teams
> > because we
> > > > > ship for a specific use case, not a general one (time series on
> > memory
> > > > > constrained boxes being the specific use case)
> > > > >
> > > > > This doesn't impact existing tables, just new ones.  More and more
> > teams
> > > > > are using Cassandra as a general purpose database, we should
> > acknowledge
> > > > > that adjusting our defaults accordingly.  Yes, we use a little bit
> > more
> > > > > memory on new tables if we just change this setting, and what we
> get
> > out of
> > > > > it is a massive performance win.
> > > > >
> > > > > I'm +1 on the change as well.
> > > > >
> > > > >
> > > > >
> > > > > On Sat, Oct 20, 2018 at 4:21 AM Sankalp Kohli <
> > kohlisank...@gmail.com>
> > > > > wrote:
> > > > >
> > > > >> (We should definitely harden the definition for freeze in a
> separate
> > > > >> thread)
> > > > >>
> > > > >> My thinking is that this is the best time to do this change as we
> > have
> > > > >> not even cut alpha or beta. All the people involved in the test
> will
> > > > >> definitely be testing it again when we have these releases.
> > > > >>
> > > > >> > On Oct 

Re: Built in trigger: double-write for app migration

2018-10-19 Thread Carl Mueller
Also we have 2.1.x and 2.2 clusters, so we can't use CDC since apparently
that is a 3.8 feature.

Virtual tables are very exciting, so we could do some collating stuff (which
I'd LOVE to do with our scheduling application, where we can split tasks
into near-term/most frequent (hours to days), medium-term/less common (days
to weeks), and long-term (years)), with the aim of avoiding having to do
compaction at all and just truncating buckets as they "expire" for a nice
O(1) compaction process.

On Fri, Oct 19, 2018 at 9:57 AM Carl Mueller 
wrote:

> A new DC and then a split is one way, but you have to wait for it to stream,
> and then how do you know the DC coherence is good enough to switch the
> targeted DC for local_quorum? And then once we split it we'd have downtime
> to "change the name" and other work that would distinguish it from the
> original cluster, from what I'm told from the peoples that do the DC /
> cluster setup and aws provisioning. It is a tool in the toolchest...
>
> We might be able to get stats of the queries and updates impacting the
> cluster in a centralized manner with a trigger too.
>
> We will probably do stream-to-kafka trigger based on what is on the
> intarweb and since we have kafka here already.
>
> I will look at CDC.
>
> Thank you everybody!
>
>
> On Fri, Oct 19, 2018 at 3:29 AM Antonis Papaioannou 
> wrote:
>
>> It reminds me of “shadow writes” described in [1].
>> During data migration the coordinator forwards  a copy of any write
>> request regarding tokens that are being transferred to the new node.
>>
>> [1] Incremental Elasticity for NoSQL Data Stores, SRDS’17,
>> https://ieeexplore.ieee.org/document/8069080
>>
>>
>> > On 18 Oct 2018, at 18:53, Carl Mueller 
>> > 
>> wrote:
>> >
>> > tl;dr: a generic trigger on TABLES that will mirror all writes to
>> > facilitate data migrations between clusters or systems. What is
>> necessary
>> > to ensure full write mirroring/coherency?
>> >
>> > When cassandra clusters have several "apps" aka keyspaces serving
>> > applications colocated on them, but the app/keyspace bandwidth and size
>> > demands begin impacting other keyspaces/apps, then one strategy is to
>> > migrate the keyspace to its own dedicated cluster.
>> >
>> > With backups/sstableloading, this will entail a delay and therefore a
>> > "coherency" shortfall between the clusters. So typically one would
>> employ a
>> > "double write, read once":
>> >
>> > - all updates are mirrored to both clusters
>> > - reads come from the currently most coherent cluster.
>> >
>> > Often two sstable loads are done:
>> >
>> > 1) first load
>> > 2) turn on double writes/write mirroring
>> > 3) a second load is done to finalize coherency
>> > 4) switch the app to point to the new cluster now that it is coherent
>> >
>> > The double writes and read is the sticking point. We could do it at the
>> app
>> > layer, but if the app wasn't written with that, it is a lot of testing
>> and
>> > customization specific to the framework.
>> >
>> > We could theoretically do some sort of proxying of the java-driver
>> somehow,
>> > but all the async structures and complex interfaces/apis would be
>> difficult
>> > to proxy. Maybe there is a lower level in the java-driver that is
>> possible.
>> > This also would only apply to the java-driver, and not
>> > python/go/javascript/other drivers.
>> >
>> > Finally, I suppose we could do a trigger on the tables. It would be
>> really
>> > nice if we could add to the cassandra toolbox the basics of a write
>> > mirroring trigger that could be activated "fairly easily"... now I know
>> > there are the complexities of inter-cluster access, and if we are even
>> > using cassandra as the target mirror system (for example there is an
>> > article on triggers write-mirroring to kafka:
>> > https://dzone.com/articles/cassandra-to-kafka-data-pipeline-part-1).
>> >
>> > And this starts to get into the complexities of hinted handoff as well.
>> But
>> > fundamentally this seems something that would be a very nice feature
>> > (especially when you NEED it) to have in the core of cassandra.
>> >
>> > Finally, is the mutation hook in triggers sufficient to track all
>> incoming
>> > mutations (outside of "shudder" other triggers generating data)
>>
>>


Re: Built in trigger: double-write for app migration

2018-10-19 Thread Carl Mueller
new DC and then split is one way, but you have to wait for it to stream,
and then how do you know the dc coherence is good enough to switch the
targeted DC for local_quorum? And then once we split it we'd have downtime
to "change the name" and other work that would distinguish it from the
original cluster, from what I'm told by the people who do the DC /
cluster setup and aws provisioning. It is a tool in the toolchest...

We might be able to get stats of the queries and updates impacting the
cluster in a centralized manner with a trigger too.

We will probably do stream-to-kafka trigger based on what is on the
intarweb and since we have kafka here already.

I will look at CDC.

Thank you everybody!


On Fri, Oct 19, 2018 at 3:29 AM Antonis Papaioannou 
wrote:

> It reminds me of “shadow writes” described in [1].
> During data migration the coordinator forwards  a copy of any write
> request regarding tokens that are being transferred to the new node.
>
> [1] Incremental Elasticity for NoSQL Data Stores, SRDS’17,
> https://ieeexplore.ieee.org/document/8069080
>
>
> > On 18 Oct 2018, at 18:53, Carl Mueller 
> > 
> wrote:
> >
> > tl;dr: a generic trigger on TABLES that will mirror all writes to
> > facilitate data migrations between clusters or systems. What is necessary
> > to ensure full write mirroring/coherency?
> >
> > When cassandra clusters have several "apps" aka keyspaces serving
> > applications colocated on them, but the app/keyspace bandwidth and size
> > demands begin impacting other keyspaces/apps, then one strategy is to
> > migrate the keyspace to its own dedicated cluster.
> >
> > With backups/sstableloading, this will entail a delay and therefore a
> > "coherency" shortfall between the clusters. So typically one would
> employ a
> > "double write, read once":
> >
> > - all updates are mirrored to both clusters
> > - writes come from the current most coherent.
> >
> > Often two sstable loads are done:
> >
> > 1) first load
> > 2) turn on double writes/write mirroring
> > 3) a second load is done to finalize coherency
> > 4) switch the app to point to the new cluster now that it is coherent
> >
> > The double writes and read is the sticking point. We could do it at the
> app
> > layer, but if the app wasn't written with that, it is a lot of testing
> and
> > customization specific to the framework.
> >
> > We could theoretically do some sort of proxying of the java-driver
> somehow,
> > but all the async structures and complex interfaces/apis would be
> difficult
> > to proxy. Maybe there is a lower level in the java-driver that is
> possible.
> > This also would only apply to the java-driver, and not
> > python/go/javascript/other drivers.
> >
> > Finally, I suppose we could do a trigger on the tables. It would be
> really
> > nice if we could add to the cassandra toolbox the basics of a write
> > mirroring trigger that could be activated "fairly easily"... now I know
> > there are the complexities of inter-cluster access, and if we are even
> > using cassandra as the target mirror system (for example there is an
> > article on triggers write-mirroring to kafka:
> > https://dzone.com/articles/cassandra-to-kafka-data-pipeline-part-1).
> >
> > And this starts to get into the complexities of hinted handoff as well.
> But
> > fundamentally this seems something that would be a very nice feature
> > (especially when you NEED it) to have in the core of cassandra.
> >
> > Finally, is the mutation hook in triggers sufficient to track all
> incoming
> > mutations (outside of "shudder" other triggers generating data)
>
>


Re: Built in trigger: double-write for app migration

2018-10-18 Thread Carl Mueller
Thanks. Well, at a minimum I'll probably start writing something soon for
trigger-based write mirroring, and we will probably support kafka and
another cassandra cluster, so if those seem to work I will contribute
those.

On Thu, Oct 18, 2018 at 11:27 AM Jeff Jirsa  wrote:

> The write sampling is adding an extra instance with the same schema to
> test things like yaml params or compaction without impacting reads or
> correctness - it’s different than what you describe
>
>
>
> --
> Jeff Jirsa
>
>
> > On Oct 18, 2018, at 5:57 PM, Carl Mueller 
> > 
> wrote:
> >
> > I guess there is also write-survey-mode from cass 1.1:
> >
> > https://issues.apache.org/jira/browse/CASSANDRA-3452
> >
> > Were triggers intended to supersede this capability? I can't find a lot
> of
> > "user level" info on it.
> >
> >
> > On Thu, Oct 18, 2018 at 10:53 AM Carl Mueller <
> carl.muel...@smartthings.com>
> > wrote:
> >
> >> tl;dr: a generic trigger on TABLES that will mirror all writes to
> >> facilitate data migrations between clusters or systems. What is
> necessary
> >> to ensure full write mirroring/coherency?
> >>
> >> When cassandra clusters have several "apps" aka keyspaces serving
> >> applications colocated on them, but the app/keyspace bandwidth and size
> >> demands begin impacting other keyspaces/apps, then one strategy is to
> >> migrate the keyspace to its own dedicated cluster.
> >>
> >> With backups/sstableloading, this will entail a delay and therefore a
> >> "coherency" shortfall between the clusters. So typically one would
> employ a
> >> "double write, read once":
> >>
> >> - all updates are mirrored to both clusters
> >> - writes come from the current most coherent.
> >>
> >> Often two sstable loads are done:
> >>
> >> 1) first load
> >> 2) turn on double writes/write mirroring
> >> 3) a second load is done to finalize coherency
> >> 4) switch the app to point to the new cluster now that it is coherent
> >>
> >> The double writes and read is the sticking point. We could do it at the
> >> app layer, but if the app wasn't written with that, it is a lot of
> testing
> >> and customization specific to the framework.
> >>
> >> We could theoretically do some sort of proxying of the java-driver
> >> somehow, but all the async structures and complex interfaces/apis would
> be
> >> difficult to proxy. Maybe there is a lower level in the java-driver
> that is
> >> possible. This also would only apply to the java-driver, and not
> >> python/go/javascript/other drivers.
> >>
> >> Finally, I suppose we could do a trigger on the tables. It would be
> really
> >> nice if we could add to the cassandra toolbox the basics of a write
> >> mirroring trigger that could be activated "fairly easily"... now I know
> >> there are the complexities of inter-cluster access, and if we are even
> >> using cassandra as the target mirror system (for example there is an
> >> article on triggers write-mirroring to kafka:
> >> https://dzone.com/articles/cassandra-to-kafka-data-pipeline-part-1).
> >>
> >> And this starts to get into the complexities of hinted handoff as well.
> >> But fundamentally this seems something that would be a very nice feature
> >> (especially when you NEED it) to have in the core of cassandra.
> >>
> >> Finally, is the mutation hook in triggers sufficient to track all
> incoming
> >> mutations (outside of "shudder" other triggers generating data)
> >>
> >>
> >>
> >>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: dev-h...@cassandra.apache.org
>
>


Re: Built in trigger: double-write for app migration

2018-10-18 Thread Carl Mueller
I guess there is also write-survey-mode from cass 1.1:

https://issues.apache.org/jira/browse/CASSANDRA-3452

Were triggers intended to supersede this capability? I can't find a lot of
"user level" info on it.


On Thu, Oct 18, 2018 at 10:53 AM Carl Mueller 
wrote:

> tl;dr: a generic trigger on TABLES that will mirror all writes to
> facilitate data migrations between clusters or systems. What is necessary
> to ensure full write mirroring/coherency?
>
> When cassandra clusters have several "apps" aka keyspaces serving
> applications colocated on them, but the app/keyspace bandwidth and size
> demands begin impacting other keyspaces/apps, then one strategy is to
> migrate the keyspace to its own dedicated cluster.
>
> With backups/sstableloading, this will entail a delay and therefore a
> "coherency" shortfall between the clusters. So typically one would employ a
> "double write, read once":
>
> - all updates are mirrored to both clusters
> - writes come from the current most coherent.
>
> Often two sstable loads are done:
>
> 1) first load
> 2) turn on double writes/write mirroring
> 3) a second load is done to finalize coherency
> 4) switch the app to point to the new cluster now that it is coherent
>
> The double writes and read is the sticking point. We could do it at the
> app layer, but if the app wasn't written with that, it is a lot of testing
> and customization specific to the framework.
>
> We could theoretically do some sort of proxying of the java-driver
> somehow, but all the async structures and complex interfaces/apis would be
> difficult to proxy. Maybe there is a lower level in the java-driver that is
> possible. This also would only apply to the java-driver, and not
> python/go/javascript/other drivers.
>
> Finally, I suppose we could do a trigger on the tables. It would be really
> nice if we could add to the cassandra toolbox the basics of a write
> mirroring trigger that could be activated "fairly easily"... now I know
> there are the complexities of inter-cluster access, and if we are even
> using cassandra as the target mirror system (for example there is an
> article on triggers write-mirroring to kafka:
> https://dzone.com/articles/cassandra-to-kafka-data-pipeline-part-1).
>
> And this starts to get into the complexities of hinted handoff as well.
> But fundamentally this seems something that would be a very nice feature
> (especially when you NEED it) to have in the core of cassandra.
>
> Finally, is the mutation hook in triggers sufficient to track all incoming
> mutations (outside of "shudder" other triggers generating data)
>
>
>
>


Built in trigger: double-write for app migration

2018-10-18 Thread Carl Mueller
tl;dr: a generic trigger on TABLES that will mirror all writes to
facilitate data migrations between clusters or systems. What is necessary
to ensure full write mirroring/coherency?

When cassandra clusters have several "apps" aka keyspaces serving
applications colocated on them, but the app/keyspace bandwidth and size
demands begin impacting other keyspaces/apps, then one strategy is to
migrate the keyspace to its own dedicated cluster.

With backups/sstableloading, this will entail a delay and therefore a
"coherency" shortfall between the clusters. So typically one would employ a
"double write, read once":

- all updates are mirrored to both clusters
- writes come from the current most coherent.

Often two sstable loads are done:

1) first load
2) turn on double writes/write mirroring
3) a second load is done to finalize coherency
4) switch the app to point to the new cluster now that it is coherent

The double writes and read is the sticking point. We could do it at the app
layer, but if the app wasn't written with that, it is a lot of testing and
customization specific to the framework.

We could theoretically do some sort of proxying of the java-driver somehow,
but all the async structures and complex interfaces/apis would be difficult
to proxy. Maybe there is a lower level in the java-driver that is possible.
This also would only apply to the java-driver, and not
python/go/javascript/other drivers.

Finally, I suppose we could do a trigger on the tables. It would be really
nice if we could add to the cassandra toolbox the basics of a write
mirroring trigger that could be activated "fairly easily"... now I know
there are the complexities of inter-cluster access, and if we are even
using cassandra as the target mirror system (for example there is an
article on triggers write-mirroring to kafka:
https://dzone.com/articles/cassandra-to-kafka-data-pipeline-part-1).

And this starts to get into the complexities of hinted handoff as well. But
fundamentally this seems something that would be a very nice feature
(especially when you NEED it) to have in the core of cassandra.

Finally, is the mutation hook in triggers sufficient to track all incoming
mutations (outside of "shudder" other triggers generating data)
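
To make the idea concrete, here is a rough sketch of what such a trigger
could look like against the 3.x trigger API (Collection<Mutation>
augment(Partition)). The MirrorClient below is purely a placeholder for a
kafka producer or a driver session pointed at the target cluster;
serialization, batching, retries, and the hinted-handoff questions are all
hand-waved away:

import java.util.Collection;
import java.util.Collections;

import org.apache.cassandra.db.Mutation;
import org.apache.cassandra.db.partitions.Partition;
import org.apache.cassandra.triggers.ITrigger;

// Sketch only: mirror every partition update to an external target.
// MirrorClient is a stand-in, not a real Cassandra class.
public class MirroringTrigger implements ITrigger
{
    private final MirrorClient mirror = new MirrorClient();

    public Collection<Mutation> augment(Partition update)
    {
        try
        {
            // Triggers run on the coordinator; an exception here fails the
            // client write, so mirroring is kept strictly best-effort.
            mirror.send(update);
        }
        catch (Exception e)
        {
            // log and drop (or queue for retry) in a real implementation
        }
        // No additional local mutations to apply.
        return Collections.emptyList();
    }

    private static final class MirrorClient
    {
        void send(Partition update) { /* serialize and ship it elsewhere */ }
    }
}

The hard parts raised above -- coherency during the bulk load, what to do
when the mirror target is down -- would all live inside that send().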


Re: Proposing an Apache Cassandra Management process

2018-10-16 Thread Carl Mueller
I too have built a framework over the last year similar to what cstar does
but for our purposes at smartthings. The intention is to OSS it, but it
needs a round of polish, since it really is more of a utility toolbox for
our small cassandra group.

It relies heavily on ssh and on performing work on the nodes themselves, using
the software that is installed on the nodes. It also assumes a bash shell.
It is groovy based, uses jsch for the ssh'ing, and we don't use ssh config
(for better or worse). It is fairly aws-centric, although I would like to
abstract out the aws dependence. We use it for backups, restores, some data
collection/triage analysis. The backups can do incrementals or fulls, where
incrementals compare the current sstable set against the previous one and only
upload the newer sstables along with manifests that link to the locations
of the already-uploaded sstables. This in particular is very aws/s3 centric
and would be a primary focus to get that dependency abstracted.

Our clusters sit behind a variety of access modes, from bastions to
jumpboxes, once upon a time direct global ssh access, and now global
IPs with internal backend access. We will also have ipv6 soon. We
run 2.1.x, 2.2.x, and 3.1 (even some dse community still), so the framework
attempts to deal with all the intricacies and headaches that entails. The
clusters have a lot of variance on what security has been enabled (ssl,
password, password files, etc). Not all operations have been done against
2.2 and especially 3.1, but our big project is a push to 3.x, and this will
be a big tool to enable a lot of that, so I hope to work through a lot of the
idiosyncrasies for those versions as we go through those upgrades. We will
be running Kubernetes soon too, and I will look into abstracting the access
method to the nodes to maybe use kubectl commands once I get to know
Kubernetes better.

I have recently found out about cstar. I'm going to look at that and see if
their way of doing things is better than how we are doing it, especially
with regards to ssh connection maintenance and those types of things. One
thing that I have found is having a "registry" that stores all the
different cluster-specific idiosyncrasies, so you can just do 
  for lots of things.

It isn't particularly efficient in some ways, especially with regard to
connection pooling/caching/conservation. It wastes a lot of time reconnecting,
but slow and steady works ok for our backups without disrupting the work
the clusters need to do. Parallelism helps a lot, but cstar may have a lot
of good ideas for throttling, parallelism, etc.

We also plan on using it to make some dashboards/UI too at some point.
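
For what it's worth, the incremental logic is conceptually just a set
difference against the previous manifest. A stripped-down sketch (Java here
for brevity even though the real tool is groovy; the actual upload and the
manifest storage are elided, and a real backup would ship every sstable
component, not just the Data files):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class IncrementalBackupSketch
{
    // One entry per sstable, keyed off the Data component.
    static Set<String> currentSstables(Path tableDir) throws IOException
    {
        try (Stream<Path> files = Files.list(tableDir))
        {
            return files.map(p -> p.getFileName().toString())
                        .filter(n -> n.endsWith("-Data.db"))
                        .collect(Collectors.toCollection(TreeSet::new));
        }
    }

    public static void main(String[] args) throws IOException
    {
        Path tableDir = Paths.get(args[0]);      // .../data/<keyspace>/<table>
        Path manifest = Paths.get(args[1]);      // manifest from the last run
        Set<String> previous = new TreeSet<>(Files.readAllLines(manifest));
        Set<String> current = currentSstables(tableDir);

        // Only sstables absent from the previous manifest need uploading; the
        // new manifest still lists everything so a restore can locate the
        // files that earlier backups already uploaded.
        Set<String> toUpload = new TreeSet<>(current);
        toUpload.removeAll(previous);
        toUpload.forEach(f -> System.out.println("upload " + tableDir.resolve(f)));

        Files.write(manifest, current);          // becomes the next "previous"
    }
}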

On Thu, Oct 4, 2018 at 7:20 PM Mick Semb Wever  wrote:

> Dinesh / Sankalp,
>
> My suggestion was to document the landscape in hope and an attempt to
> better understand the requirements possible to a side-car.  It wasn't a
> suggestion to patchwork together everything. But rather as part of
> brainstorming, designing, and exercising an inclusive process to see what's
> been done and how it could have been done better.
>
> A well designed side-car could also be a valuable fundamental to some of
> these third-party solutions, not just our own designs and ideals. Maybe, I
> hope, that's already obvious.
>
> It would be really fantastic to see more explorative documentation in
> confluence. Part of that can be to list up all these external tools,
> listing their goals, state, and how a side-car might help them. Reaching
> out to their maintainers to be involved in the process would be awesome
> too. I can start something in the cwiki (but i'm on vacation this week),
> I've also given you write-access Dinesh.
>
> > I also haven't seen a process to propose & discuss larger changes to
> Cassandra. The Cassandra contribution[1] guide possibly needs to be
> updated. Some communities have a process which facilitate things. See Kafka
> Improvement Process[2], Spark Improvement Process[3].
>
> Bringing this up was gold, imho. I would love to see something like this
> exist in the C* community (also in cwiki), and the side-car brainstorming
> and design used to test and flesh it out.
>
> regards,
> Mick
>
>
> On Sun, 30 Sep 2018, at 05:19, Dinesh Joshi wrote:
> > > On Sep 27, 2018, at 7:35 PM, Mick Semb Wever  wrote:
> > >
> > > Reaper,
> >
> > I have looked at this already.
> >
> > > Priam,
> >
> > I have looked at this already.
> >
> > > Marcus Olsson's offering,
> >
> > This isn't OSS.
> >
> > > CStar,
> >
> > I have looked at this already.
> >
> > > OpsCenter.
> >
> > Latest release is only compatible with DSE and not Apache Cassandra[1]
> >
> > > Then there's a host of command line tools like:
> > > ic-tools,
> > > ctop (was awesome, but is it maintained anymore?),
> > > tablesnap.
> >
> > These are interesting tools and I don't think they do what we're
> > interested in doing.
> >
> > > And maybe it's worth including the diy approach people take…
> pssh/dsh/clusterssh/mussh/fabric, etc
> >
> > 

Re: Java 11 Z garbage collector

2018-09-06 Thread Carl Mueller
Thanks Jeff.

On Fri, Aug 31, 2018 at 1:01 PM Jeff Jirsa  wrote:

> Read heavy workload with wider partitions (like 1-2gb) and disable the key
> cache will be worst case for GC
>
>
>
>
> --
> Jeff Jirsa
>
>
> > On Aug 31, 2018, at 10:51 AM, Carl Mueller 
> > 
> wrote:
> >
> > I'm assuming that p99 that Rocksandra tries to target is caused by GC
> > pauses, does anyone have data patterns or datasets that will generate GC
> > pauses in Cassandra to highlight the abilities of Rocksandra (and...
> > Scylla?) and perhaps this GC approach?
> >
> > On Thu, Aug 30, 2018 at 8:11 PM Carl Mueller <
> carl.muel...@smartthings.com>
> > wrote:
> >
> >> Oh nice, I'll check that out.
> >>
> >> On Thu, Aug 30, 2018 at 11:07 AM Jonathan Haddad 
> >> wrote:
> >>
> >>> Advertised, yes, but so far I haven't found it to be any better than
> >>> ParNew + CMS or G1 in the performance tests I did when writing
> >>> http://thelastpickle.com/blog/2018/08/16/java11.html.
> >>>
> >>> That said, I didn't try it with a huge heap (i think it was 16 or
> 24GB),
> >>> so
> >>> maybe it'll do better if I throw 50 GB RAM at it.
> >>>
> >>>
> >>>
> >>> On Thu, Aug 30, 2018 at 8:42 AM Carl Mueller
> >>>  wrote:
> >>>
> >>>> https://www.opsian.com/blog/javas-new-zgc-is-very-exciting/
> >>>>
> >>>> .. max of 4ms for stop the world, large terabyte heaps, seems
> promising.
> >>>>
> >>>> Will this be a major boon to cassandra p99 times? Anyone know the
> >>> aspects
> >>>> of cassandra that cause the most churn and lead to StopTheWorld GC? I
> >>> was
> >>>> under the impression that bloom filters, caches, etc are statically
> >>>> allocated at startup.
> >>>>
> >>>
> >>>
> >>> --
> >>> Jon Haddad
> >>> http://www.rustyrazorblade.com
> >>> twitter: rustyrazorblade
> >>>
> >>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: dev-h...@cassandra.apache.org
>
>


Re: Transient Replication 4.0 status update

2018-08-31 Thread Carl Mueller
I see, so there are no dedicated transient nodes, just other nodes that
function as witnesses.

This is still very exciting.


On Fri, Aug 31, 2018 at 1:49 PM Ariel Weisberg  wrote:

> Hi,
>
> All nodes being the same (in terms of functionality) is something we
> wanted to stick with at least for now. I think we want a design that
> changes the operational, availability, and consistency story as little as
> possible when it's completed.
>
> Ariel
> On Fri, Aug 31, 2018, at 2:27 PM, Carl Mueller wrote:
> > SOrry to spam this with two messages...
> >
> > This ticket is also interesting because it is very close to what I
> imagined
> > a useful use case of RF4 / RF6: being basically RF3 + hot spare where you
> > marked (in the case of RF4) three nodes as primary and the fourth as hot
> > standby, which may be equivalent if I understand the paper/protocol to
> > RF3+1 transient.
> >
> > On Fri, Aug 31, 2018 at 1:07 PM Carl Mueller <
> carl.muel...@smartthings.com>
> > wrote:
> >
> > > I put these questions on the ticket too... Sorry if some of them are
> > > stupid.
> > >
> > > So are (basically) these transient nodes basically serving as
> centralized
> > > hinted handoff caches rather than having the hinted handoffs
> cluttering up
> > > full replicas, especially nodes that have no concern for the token
> range
> > > involved? I understand that hinted handoffs aren't being replaced by
> this,
> > > but is that kind of the idea?
> > >
> > > Are the transient nodes "sitting around"?
> > >
> > > Will the transient nodes have cheaper/lower hardware requirements?
> > >
> > > During cluster expansion, does the newly streaming node acquiring data
> > > function as a temporary transient node until it becomes a full replica?
> > > Likewise while shrinking, does a previously full replica function as a
> > > transient while it streams off data?
> > >
> > > Can this help vnode expansion with multiple concurrent nodes?
> Admittedly
> > > I'm not familiar with how much work has gone into fixing cluster
> expansion
> > > with vnodes, it is my understanding that you typically expand only one
> node
> > > at a time or in multiples of the datacenter size
> > >
> > > On Mon, Aug 27, 2018 at 12:29 PM Ariel Weisberg 
> wrote:
> > >
> > >> Hi all,
> > >>
> > >> I wanted to give everyone an update on how development of Transient
> > >> Replication is going and where we are going to be as of 9/1. Blake
> > >> Eggleston, Alex Petrov, Benedict Elliott Smith, and myself have been
> > >> working to get TR implemented for 4.0. Up to now we have avoided
> merging
> > >> anything related to TR to trunk because we weren't 100% sure we were
> going
> > >> to make the 9/1 deadline and even minimal TR functionality requires
> > >> significant changes (see 14405).
> > >>
> > >> We focused on getting a minimal set of deployable functionality
> working,
> > >> and want to avoid overselling what's going to work in the first
> version.
> > >> The feature is marked explicitly as experimental and has to be
> enabled via
> > >> a feature flag in cassandra.yaml. The expected audience for TR in 4.0
> is
> > >> more experienced users who are ready to tackle deploying experimental
> > >> functionality. As it is deployed by experienced users and we gain more
> > >> confidence in it and remove caveats the # of users it will be
> appropriate
> > >> for will expand.
> > >>
> > >> For 4.0 it looks like we will be able to merge TR with support for
> normal
> > >> reads and writes without monotonic reads. Monotonic reads require
> blocking
> > >> read repair and blocking read repair with TR requires further changes
> that
> > >> aren't feasible by 9/1.
> > >>
> > >> Future TR support would look something like
> > >>
> > >> 4.0.next:
> > >> * vnodes (https://issues.apache.org/jira/browse/CASSANDRA-14404)
> > >>
> > >> 4.next:
> > >> * Monotonic reads (
> > >> https://issues.apache.org/jira/browse/CASSANDRA-14665)
> > >> * LWT (https://issues.apache.org/jira/browse/CASSANDRA-14547)
> > >> * Batch log (
> https://issues.apache.org/jira/browse/CASSANDRA-14549)
> > >>   

Re: Transient Replication 4.0 status update

2018-08-31 Thread Carl Mueller
Sorry to spam this with two messages...

This ticket is also interesting because it is very close to what I imagined
a useful use case of RF4 / RF6: being basically RF3 + hot spare where you
marked (in the case of RF4) three nodes as primary and the fourth as hot
standby, which may be equivalent if I understand the paper/protocol to
RF3+1 transient.

On Fri, Aug 31, 2018 at 1:07 PM Carl Mueller 
wrote:

> I put these questions on the ticket too... Sorry if some of them are
> stupid.
>
> So are (basically) these transient nodes basically serving as centralized
> hinted handoff caches rather than having the hinted handoffs cluttering up
> full replicas, especially nodes that have no concern for the token range
> involved? I understand that hinted handoffs aren't being replaced by this,
> but is that kind of the idea?
>
> Are the transient nodes "sitting around"?
>
> Will the transient nodes have cheaper/lower hardware requirements?
>
> During cluster expansion, does the newly streaming node acquiring data
> function as a temporary transient node until it becomes a full replica?
> Likewise while shrinking, does a previously full replica function as a
> transient while it streams off data?
>
> Can this help vnode expansion with multiple concurrent nodes? Admittedly
> I'm not familiar with how much work has gone into fixing cluster expansion
> with vnodes, it is my understanding that you typically expand only one node
> at a time or in multiples of the datacenter size
>
> On Mon, Aug 27, 2018 at 12:29 PM Ariel Weisberg  wrote:
>
>> Hi all,
>>
>> I wanted to give everyone an update on how development of Transient
>> Replication is going and where we are going to be as of 9/1. Blake
>> Eggleston, Alex Petrov, Benedict Elliott Smith, and myself have been
>> working to get TR implemented for 4.0. Up to now we have avoided merging
>> anything related to TR to trunk because we weren't 100% sure we were going
>> to make the 9/1 deadline and even minimal TR functionality requires
>> significant changes (see 14405).
>>
>> We focused on getting a minimal set of deployable functionality working,
>> and want to avoid overselling what's going to work in the first version.
>> The feature is marked explicitly as experimental and has to be enabled via
>> a feature flag in cassandra.yaml. The expected audience for TR in 4.0 is
>> more experienced users who are ready to tackle deploying experimental
>> functionality. As it is deployed by experienced users and we gain more
>> confidence in it and remove caveats the # of users it will be appropriate
>> for will expand.
>>
>> For 4.0 it looks like we will be able to merge TR with support for normal
>> reads and writes without monotonic reads. Monotonic reads require blocking
>> read repair and blocking read repair with TR requires further changes that
>> aren't feasible by 9/1.
>>
>> Future TR support would look something like
>>
>> 4.0.next:
>> * vnodes (https://issues.apache.org/jira/browse/CASSANDRA-14404)
>>
>> 4.next:
>> * Monotonic reads (
>> https://issues.apache.org/jira/browse/CASSANDRA-14665)
>> * LWT (https://issues.apache.org/jira/browse/CASSANDRA-14547)
>> * Batch log (https://issues.apache.org/jira/browse/CASSANDRA-14549)
>> * Counters (https://issues.apache.org/jira/browse/CASSANDRA-14548)
>>
>> Possibly never:
>> * Materialized views
>>
>> Probably never:
>> * Secondary indexes
>>
>> The most difficult changes to support Transient Replication should be
>> behind us. LWT, Batch log, and counters shouldn't be that hard to make
>> transient replication aware. Monotonic reads require some changes to the
>> read path, but are at least conceptually not that hard to support. I am
>> confident that by 4.next TR will have fewer tradeoffs.
>>
>> If you want to take a peek the current feature branch is
>> https://github.com/aweisberg/cassandra/tree/14409-7 although we will be
>> moving to 14409-8 to rebase on to trunk.
>>
>> Regards,
>> Ariel
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
>> For additional commands, e-mail: dev-h...@cassandra.apache.org
>>
>>


Re: Transient Replication 4.0 status update

2018-08-31 Thread Carl Mueller
I put these questions on the ticket too... Sorry if some of them are
stupid.

So are (basically) these transient nodes basically serving as centralized
hinted handoff caches rather than having the hinted handoffs cluttering up
full replicas, especially nodes that have no concern for the token range
involved? I understand that hinted handoffs aren't being replaced by this,
but is that kind of the idea?

Are the transient nodes "sitting around"?

Will the transient nodes have cheaper/lower hardware requirements?

During cluster expansion, does the newly streaming node acquiring data
function as a temporary transient node until it becomes a full replica?
Likewise while shrinking, does a previously full replica function as a
transient while it streams off data?

Can this help vnode expansion with multiple concurrent nodes? Admittedly
I'm not familiar with how much work has gone into fixing cluster expansion
with vnodes, it is my understanding that you typically expand only one node
at a time or in multiples of the datacenter size

On Mon, Aug 27, 2018 at 12:29 PM Ariel Weisberg  wrote:

> Hi all,
>
> I wanted to give everyone an update on how development of Transient
> Replication is going and where we are going to be as of 9/1. Blake
> Eggleston, Alex Petrov, Benedict Elliott Smith, and myself have been
> working to get TR implemented for 4.0. Up to now we have avoided merging
> anything related to TR to trunk because we weren't 100% sure we were going
> to make the 9/1 deadline and even minimal TR functionality requires
> significant changes (see 14405).
>
> We focused on getting a minimal set of deployable functionality working,
> and want to avoid overselling what's going to work in the first version.
> The feature is marked explicitly as experimental and has to be enabled via
> a feature flag in cassandra.yaml. The expected audience for TR in 4.0 is
> more experienced users who are ready to tackle deploying experimental
> functionality. As it is deployed by experienced users and we gain more
> confidence in it and remove caveats the # of users it will be appropriate
> for will expand.
>
> For 4.0 it looks like we will be able to merge TR with support for normal
> reads and writes without monotonic reads. Monotonic reads require blocking
> read repair and blocking read repair with TR requires further changes that
> aren't feasible by 9/1.
>
> Future TR support would look something like
>
> 4.0.next:
> * vnodes (https://issues.apache.org/jira/browse/CASSANDRA-14404)
>
> 4.next:
> * Monotonic reads (
> https://issues.apache.org/jira/browse/CASSANDRA-14665)
> * LWT (https://issues.apache.org/jira/browse/CASSANDRA-14547)
> * Batch log (https://issues.apache.org/jira/browse/CASSANDRA-14549)
> * Counters (https://issues.apache.org/jira/browse/CASSANDRA-14548)
>
> Possibly never:
> * Materialized views
>
> Probably never:
> * Secondary indexes
>
> The most difficult changes to support Transient Replication should be
> behind us. LWT, Batch log, and counters shouldn't be that hard to make
> transient replication aware. Monotonic reads require some changes to the
> read path, but are at least conceptually not that hard to support. I am
> confident that by 4.next TR will have fewer tradeoffs.
>
> If you want to take a peek the current feature branch is
> https://github.com/aweisberg/cassandra/tree/14409-7 although we will be
> moving to 14409-8 to rebase on to trunk.
>
> Regards,
> Ariel
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: dev-h...@cassandra.apache.org
>
>


Re: Java 11 Z garbage collector

2018-08-31 Thread Carl Mueller
I'm assuming that p99 that Rocksandra tries to target is caused by GC
pauses, does anyone have data patterns or datasets that will generate GC
pauses in Cassandra to highlight the abilities of Rocksandra (and...
Scylla?) and perhaps this GC approach?

On Thu, Aug 30, 2018 at 8:11 PM Carl Mueller 
wrote:

> Oh nice, I'll check that out.
>
> On Thu, Aug 30, 2018 at 11:07 AM Jonathan Haddad 
> wrote:
>
>> Advertised, yes, but so far I haven't found it to be any better than
>> ParNew + CMS or G1 in the performance tests I did when writing
>> http://thelastpickle.com/blog/2018/08/16/java11.html.
>>
>> That said, I didn't try it with a huge heap (i think it was 16 or 24GB),
>> so
>> maybe it'll do better if I throw 50 GB RAM at it.
>>
>>
>>
>> On Thu, Aug 30, 2018 at 8:42 AM Carl Mueller
>>  wrote:
>>
>> > https://www.opsian.com/blog/javas-new-zgc-is-very-exciting/
>> >
>> > .. max of 4ms for stop the world, large terabyte heaps, seems promising.
>> >
>> > Will this be a major boon to cassandra p99 times? Anyone know the
>> aspects
>> > of cassandra that cause the most churn and lead to StopTheWorld GC? I
>> was
>> > under the impression that bloom filters, caches, etc are statically
>> > allocated at startup.
>> >
>>
>>
>> --
>> Jon Haddad
>> http://www.rustyrazorblade.com
>> twitter: rustyrazorblade
>>
>


Re: Java 11 Z garbage collector

2018-08-30 Thread Carl Mueller
Oh nice, I'll check that out.

On Thu, Aug 30, 2018 at 11:07 AM Jonathan Haddad  wrote:

> Advertised, yes, but so far I haven't found it to be any better than
> ParNew + CMS or G1 in the performance tests I did when writing
> http://thelastpickle.com/blog/2018/08/16/java11.html.
>
> That said, I didn't try it with a huge heap (i think it was 16 or 24GB), so
> maybe it'll do better if I throw 50 GB RAM at it.
>
>
>
> On Thu, Aug 30, 2018 at 8:42 AM Carl Mueller
>  wrote:
>
> > https://www.opsian.com/blog/javas-new-zgc-is-very-exciting/
> >
> > .. max of 4ms for stop the world, large terabyte heaps, seems promising.
> >
> > Will this be a major boon to cassandra p99 times? Anyone know the aspects
> > of cassandra that cause the most churn and lead to StopTheWorld GC? I was
> > under the impression that bloom filters, caches, etc are statically
> > allocated at startup.
> >
>
>
> --
> Jon Haddad
> http://www.rustyrazorblade.com
> twitter: rustyrazorblade
>


Java 11 Z garbage collector

2018-08-30 Thread Carl Mueller
https://www.opsian.com/blog/javas-new-zgc-is-very-exciting/

.. max of 4ms for stop the world, large terabyte heaps, seems promising.

Will this be a major boon to cassandra p99 times? Anyone know the aspects
of cassandra that cause the most churn and lead to StopTheWorld GC? I was
under the impression that bloom filters, caches, etc are statically
allocated at startup.
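
For anyone who wants to kick the tires: on JDK 11 ZGC is experimental and
Linux/x64 only, so (if I have this right) trying it should just be a matter
of replacing the CMS/G1 settings in jvm.options with something like:

# ZGC is behind the experimental flag on JDK 11 (Linux/x64 only there)
-XX:+UnlockExperimentalVMOptions
-XX:+UseZGC
# plus whatever large heap you want to test with, e.g.
-Xmx64G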


Re: replicated data in different sstables

2018-07-25 Thread Carl Mueller
Oh duh, RACS does this already. But it would be nice to get some education
on the bloom filter memory use vs # sstables question.

On Wed, Jul 25, 2018 at 10:41 AM Carl Mueller 
wrote:

> It would seem to me that if the replicated data managed by a node is in
> separate sstables from the "main" data it manages, when a new node came
> online it would be easier to discard the data it no longer is responsible
> for since it was shifted a slot down the ring.
>
> Generally speaking I've been asking lots of questions about sstables that
> would increase the number of them. It is my impression that the size of
> bloom filters are linearly proportional to the number of hash keys
> contained in the sstables of a particular node. Is that true?
>
> We also want to avoid massive numbers of sstables mostly due to
> filesystem/inode problems? Because the endstate of me suggesting sstables
> be segmented by RACS, primary/replicated, and possibly application-specific
> separations would impose say 5-10x more sstables, even though the absolute
> amount of data and partition keys wouldn't change.
>


replicated data in different sstables

2018-07-25 Thread Carl Mueller
It would seem to me that if the replicated data managed by a node is in
separate sstables from the "main" data it manages, when a new node came
online it would be easier to discard the data it no longer is responsible
for since it was shifted a slot down the ring.

Generally speaking I've been asking lots of questions about sstables that
would increase the number of them. It is my impression that the size of the
bloom filters is linearly proportional to the number of hash keys
contained in the sstables of a particular node. Is that true?

We also want to avoid massive numbers of sstables mostly due to
filesystem/inode problems? Because the endstate of me suggesting sstables
be segmented by RACS, primary/replicated, and possibly application-specific
separations would impose say 5-10x more sstables, even though the absolute
amount of data and partition keys wouldn't change.
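
A quick back-of-the-envelope with standard Bloom filter math (not
necessarily Cassandra's exact implementation) suggests the linear
relationship holds: for a target false-positive chance p you need roughly
-ln(p) / (ln 2)^2 bits per key, so the memory tracks the total number of
partition keys on the node, not the number of sstables they are spread
across (each extra sstable only adds a small fixed overhead):

public class BloomFilterMath
{
    public static void main(String[] args)
    {
        double p = 0.01;              // e.g. a typical bloom_filter_fp_chance
        long keys = 1_000_000_000L;   // one billion partition keys on the node
        double bitsPerKey = -Math.log(p) / (Math.log(2) * Math.log(2));
        double totalGiB = keys * bitsPerKey / 8 / (1024.0 * 1024 * 1024);
        // ~9.59 bits/key, ~1.1 GiB for a billion keys at p = 0.01
        System.out.printf("%.2f bits/key, ~%.2f GiB for %d keys%n",
                          bitsPerKey, totalGiB, keys);
    }
}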


RangeAwareCompaction for manual token management

2018-07-19 Thread Carl Mueller
I don't want to comment on the 10540 ticket since it seems very well
focused on vnode-aligned sstable partitioning and compaction. I'm pretty
excited about that ticket. RACS should enable:

- smaller scale LCS, more constrained I/O consumption
- fewer sstables to hit in the read path
- multithreaded/multiprocessor compactions and even serving of data based
on individual vnode or pools of vnodes
- better alignment of tombstones with data they should be
nullifying/eventually removing
- repair streaming efficiency
- backups have more granularity for not uploading sstables that didn't
change for the range since last backup snapshot

There are ongoing discussions about using Priam for cluster management where
I am, and as I understand it (superficially) Priam does not use vnodes, uses
manual tokens, and expands via node multiples. I believe it has certain
advantages over vnodes, including expanding by multiple machines at once, and
doing backups of only (node count / RF) nodes rather than the mess with
vnodes where you basically have to back up all of them.

But we could still do some divisor split of the manual range and apply RACS
to that. I guess this would be vnode-lite. We could have some number like
100 subranges on a node, and expansion might just involve a temporarily lower
count of subranges until the sstables can be reprocessed back to the
typical subrange count.

Is this theoretically correct, or are there glaring things I might have
missed with respect to RACS-style compaction and manual tokens?
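
A quick sketch of the subrange arithmetic I have in mind, assuming the
Murmur3 token space and ignoring wraparound ranges for simplicity (pure
illustration, not anything that exists in the codebase):

import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

public class SubrangeSplitter
{
    // Split a (start, end] token range into `parts` roughly equal contiguous
    // subranges. BigInteger sidesteps long overflow, since the full Murmur3
    // token space spans nearly 2^64 values.
    static List<long[]> split(long start, long end, int parts)
    {
        BigInteger s = BigInteger.valueOf(start);
        BigInteger width = BigInteger.valueOf(end).subtract(s);
        List<long[]> result = new ArrayList<>();
        BigInteger prev = s;
        for (int i = 1; i <= parts; i++)
        {
            BigInteger next = s.add(width.multiply(BigInteger.valueOf(i))
                                         .divide(BigInteger.valueOf(parts)));
            result.add(new long[] { prev.longValueExact(), next.longValueExact() });
            prev = next;
        }
        return result;
    }

    public static void main(String[] args)
    {
        // e.g. one node's contiguous manual range carved into 100 "vnode-lite"
        // subranges that sstables and compaction could be bucketed by
        for (long[] r : split(-4611686018427387904L, 4611686018427387903L, 100))
            System.out.println("(" + r[0] + ", " + r[1] + "]");
    }
}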


Re: Quantifying Virtual Node Impact on Cassandra Availability

2018-04-17 Thread Carl Mueller
Is this a fundamental vnode disadvantage:

do Vnodes preclude cluster expansion faster than 1 at a time? I would think
with manual management you could expand a datacenter by multiples of
machines/nodes. Or at least in multiples of ReplicationFactor:

RF3 starts as:

a1 b1 c1

doubles to:

a1 a2 b1 b2 c1 c2

expands again by 3:

a1 a2 a3 b1 b2 b3 c1 c2 c3

all via sneakernet or similar schemes? Or am I wrong about being able to do
bigger expansions on manual tokens and that vnodes can't safely do that?

Most of the paper seems to center on the streaming time being what exposes
the cluster to risk. But manual tokens lend themselves to sneakernet
rebuilds, do they not?


On Tue, Apr 17, 2018 at 11:16 AM, Richard Low <rich...@wentnet.com> wrote:

> I'm also not convinced the problems listed in the paper with removenode are
> so serious. With lots of vnodes per node, removenode causes data to be
> streamed into all other nodes in parallel, so is (n-1) times quicker than
> replacement for n nodes. For R=3, the failure rate goes up with vnodes
> (without vnodes, after the first failure, any 4 neighbouring node failures
> lose quorum but for vnodes, any other node failure loses quorum) by a
> factor of (n-1)/4. The increase in speed more than offsets this so in fact
> vnodes with removenode give theoretically 4x higher availability than no
> vnodes.
>
> If anyone is interested in using vnodes in large clusters I'd strongly
> suggest testing this out to see if the concerns in section 4.3.3 are valid.
>
> Richard.
>
> On 17 April 2018 at 08:29, Jeff Jirsa <jji...@gmail.com> wrote:
>
> > There are two huge advantages
> >
> > 1) during expansion / replacement / decom, you stream from far more
> > ranges. Since streaming is single threaded per stream, this enables you
> to
> > max out machines during streaming where single token doesn’t
> >
> > 2) when adjusting the size of a cluster, you can often grow incrementally
> > without rebalancing
> >
> > Streaming entire wholly covered/contained/owned sstables during range
> > movements is probably a huge benefit in many use cases that may make the
> > single threaded streaming implementation less of a concern, and likely
> > works reasonably well without major changes to LCS in particular  - I’m
> > fairly confident there’s a JIRA for this, if not it’s been discussed in
> > person among various operators for years as an obvious future
> improvement.
> >
> > --
> > Jeff Jirsa
> >
> >
> > > On Apr 17, 2018, at 8:17 AM, Carl Mueller <
> carl.muel...@smartthings.com>
> > wrote:
> > >
> > > Do Vnodes address anything besides alleviating cluster planners from
> > doing
> > > token range management on nodes manually? Do we have a centralized list
> > of
> > > advantages they provide beyond that?
> > >
> > > There seem to be lots of downsides. 2i index performance, the above
> > > availability, etc.
> > >
> > > I also wonder if in vnodes (and manually managed tokens... I'll return
> to
> > > this) the node recovery scenarios are being hampered by sstables having
> > the
> > > hash ranges of the vnodes intermingled in the same set of sstables. I
> > > wondered in another thread in vnodes why sstables are separated into
> sets
> > > by the vnode ranges they represent. For a manually managed contiguous
> > token
> > > range, you could separate the sstables into a fixed number of sets,
> kind
> > of
> > > vnode-light.
> > >
> > > So if there was rebalancing or reconstruction, you could sneakernet or
> > > reliably send entire sstable sets that would belong in a range.
> > >
> > > I also thing this would improve compactions and repairs too.
> Compactions
> > > would be naturally parallelizable in all compaction schemes, and
> repairs
> > > would have natural subsets to do merkle tree calculations.
> > >
> > > Granted sending sstables might result in "overstreaming" due to data
> > > replication across the sstables, but you wouldn't have CPU and random
> I/O
> > > to look up the data. Just sequential transfers.
> > >
> > > For manually managed tokens with subdivided sstables, if there was
> > > rebalancing, you would have the "fringe" edges of the hash range
> > subdivided
> > > already, and you would only need to deal with the data in the border
> > areas
> > > of the token range, and again could sneakernet / dumb transfer the
> tables
> > > and then let the new node remove the unneeded in future repairs.
> 

Re: Quantifying Virtual Node Impact on Cassandra Availability

2018-04-17 Thread Carl Mueller
I've posted a bunch of things relevant to commitlog --> sstable and
associated compaction / sstable metadata changes on here. I really need to
learn that section of the code.

On Tue, Apr 17, 2018 at 10:29 AM, Jeff Jirsa <jji...@gmail.com> wrote:

> There are two huge advantages
>
> 1) during expansion / replacement / decom, you stream from far more
> ranges. Since streaming is single threaded per stream, this enables you to
> max out machines during streaming where single token doesn’t
>
> 2) when adjusting the size of a cluster, you can often grow incrementally
> without rebalancing
>
> Streaming entire wholly covered/contained/owned sstables during range
> movements is probably a huge benefit in many use cases that may make the
> single threaded streaming implementation less of a concern, and likely
> works reasonably well without major changes to LCS in particular  - I’m
> fairly confident there’s a JIRA for this, if not it’s been discussed in
> person among various operators for years as an obvious future improvement.
>
> --
> Jeff Jirsa
>
>
> > On Apr 17, 2018, at 8:17 AM, Carl Mueller <carl.muel...@smartthings.com>
> wrote:
> >
> > Do Vnodes address anything besides alleviating cluster planners from
> doing
> > token range management on nodes manually? Do we have a centralized list
> of
> > advantages they provide beyond that?
> >
> > There seem to be lots of downsides. 2i index performance, the above
> > availability, etc.
> >
> > I also wonder if in vnodes (and manually managed tokens... I'll return to
> > this) the node recovery scenarios are being hampered by sstables having
> the
> > hash ranges of the vnodes intermingled in the same set of sstables. I
> > wondered in another thread in vnodes why sstables are separated into sets
> > by the vnode ranges they represent. For a manually managed contiguous
> token
> > range, you could separate the sstables into a fixed number of sets, kind
> of
> > vnode-light.
> >
> > So if there was rebalancing or reconstruction, you could sneakernet or
> > reliably send entire sstable sets that would belong in a range.
> >
> > I also thing this would improve compactions and repairs too. Compactions
> > would be naturally parallelizable in all compaction schemes, and repairs
> > would have natural subsets to do merkle tree calculations.
> >
> > Granted sending sstables might result in "overstreaming" due to data
> > replication across the sstables, but you wouldn't have CPU and random I/O
> > to look up the data. Just sequential transfers.
> >
> > For manually managed tokens with subdivided sstables, if there was
> > rebalancing, you would have the "fringe" edges of the hash range
> subdivided
> > already, and you would only need to deal with the data in the border
> areas
> > of the token range, and again could sneakernet / dumb transfer the tables
> > and then let the new node remove the unneeded in future repairs.
> > (Compaction does not remove data that is not longer managed by a node,
> only
> > repair does? Or does only nodetool clean do that?)
> >
> > Pre-subdivided sstables for manually maanged tokens would REALLY pay big
> > dividends in large-scale cluster expansion. Say you wanted to double or
> > triple the cluster. Since the sstables are already split by some numeric
> > factor that has lots of even divisors (60 for RF 2,3,4,5), you simply
> bulk
> > copy the already-subdivided sstables for the new nodes' hash ranges and
> > you'd basically be done. In AWS EBS volumes, that could just be a drive
> > detach / drive attach.
> >
> >
> >
> >
> >> On Tue, Apr 17, 2018 at 7:37 AM, kurt greaves <k...@instaclustr.com>
> wrote:
> >>
> >> Great write up. Glad someone finally did the math for us. I don't think
> >> this will come as a surprise for many of the developers. Availability is
> >> only one issue raised by vnodes. Load distribution and performance are
> also
> >> pretty big concerns.
> >>
> >> I'm always a proponent for fixing vnodes, and removing them as a default
> >> until we do. Happy to help on this and we have ideas in mind that at
> some
> >> point I'll create tickets for...
> >>
> >>> On Tue., 17 Apr. 2018, 06:16 Joseph Lynch, <joe.e.ly...@gmail.com>
> wrote:
> >>>
> >>> If the blob link on github doesn't work for the pdf (looks like mobile
> >>> might not like it), try:
> >>>
> >>>
> >>> https://github.com/jolynch/python_performance_toolki

Re: Quantifying Virtual Node Impact on Cassandra Availability

2018-04-17 Thread Carl Mueller
Do Vnodes address anything besides relieving cluster planners of manual
token range management on the nodes? Do we have a centralized list of
advantages they provide beyond that?

There seem to be lots of downsides. 2i index performance, the above
availability, etc.

I also wonder if in vnodes (and manually managed tokens... I'll return to
this) the node recovery scenarios are being hampered by sstables having the
hash ranges of the vnodes intermingled in the same set of sstables. I
wondered in another thread in vnodes why sstables are separated into sets
by the vnode ranges they represent. For a manually managed contiguous token
range, you could separate the sstables into a fixed number of sets, kind of
vnode-light.

So if there was rebalancing or reconstruction, you could sneakernet or
reliably send entire sstable sets that would belong in a range.

I also think this would improve compactions and repairs too. Compactions
would be naturally parallelizable in all compaction schemes, and repairs
would have natural subsets to do merkle tree calculations.

Granted sending sstables might result in "overstreaming" due to data
replication across the sstables, but you wouldn't have CPU and random I/O
to look up the data. Just sequential transfers.

For manually managed tokens with subdivided sstables, if there was
rebalancing, you would have the "fringe" edges of the hash range subdivided
already, and you would only need to deal with the data in the border areas
of the token range, and again could sneakernet / dumb transfer the tables
and then let the new node remove the unneeded data in future repairs.
(Compaction does not remove data that is no longer managed by a node, only
repair does? Or does only nodetool cleanup do that?)

Pre-subdivided sstables for manually managed tokens would REALLY pay big
dividends in large-scale cluster expansion. Say you wanted to double or
triple the cluster. Since the sstables are already split by some numeric
factor that has lots of even divisors (60 for RF 2,3,4,5), you simply bulk
copy the already-subdivided sstables for the new nodes' hash ranges and
you'd basically be done. In AWS EBS volumes, that could just be a drive
detach / drive attach.




On Tue, Apr 17, 2018 at 7:37 AM, kurt greaves  wrote:

> Great write up. Glad someone finally did the math for us. I don't think
> this will come as a surprise for many of the developers. Availability is
> only one issue raised by vnodes. Load distribution and performance are also
> pretty big concerns.
>
> I'm always a proponent for fixing vnodes, and removing them as a default
> until we do. Happy to help on this and we have ideas in mind that at some
> point I'll create tickets for...
>
> On Tue., 17 Apr. 2018, 06:16 Joseph Lynch,  wrote:
>
> > If the blob link on github doesn't work for the pdf (looks like mobile
> > might not like it), try:
> >
> >
> > https://github.com/jolynch/python_performance_toolkit/
> raw/master/notebooks/cassandra_availability/whitepaper/cassandra-
> availability-virtual.pdf
> >
> > -Joey
> > <
> > https://github.com/jolynch/python_performance_toolkit/
> raw/master/notebooks/cassandra_availability/whitepaper/cassandra-
> availability-virtual.pdf
> > >
> >
> > On Mon, Apr 16, 2018 at 1:14 PM, Joseph Lynch 
> > wrote:
> >
> > > Josh Snyder and I have been working on evaluating virtual nodes for
> large
> > > scale deployments and while it seems like there is a lot of anecdotal
> > > support for reducing the vnode count [1], we couldn't find any concrete
> > > math on the topic, so we had some fun and took a whack at quantifying
> how
> > > different choices of num_tokens impact a Cassandra cluster.
> > >
> > > According to the model we developed [2] it seems that at small cluster
> > > sizes there isn't much of a negative impact on availability, but when
> > > clusters scale up to hundreds of hosts, vnodes have a major impact on
> > > availability. In particular, the probability of outage during short
> > > failures (e.g. process restarts or failures) or permanent failure (e.g.
> > > disk or machine failure) appears to be orders of magnitude higher for
> > large
> > > clusters.
> > >
> > > The model attempts to explain why we may care about this and advances a
> > > few existing/new ideas for how to fix the scalability problems that
> > vnodes
> > > fix without the availability (and consistency—due to the effects on
> > repair)
> > > problems high num_tokens create. We would of course be very interested
> in
> > > any feedback. The model source code is on github [3], PRs are welcome
> or
> > > feel free to play around with the jupyter notebook to match your
> > > environment and see what the graphs look like. I didn't attach the pdf
> > here
> > > because it's too large apparently (lots of pretty graphs).
> > >
> > > I know that users can always just pick whichever number they prefer,
> but
> > I
> > > think the current default was 

Re: Repair scheduling tools

2018-04-16 Thread Carl Mueller
So reading (
https://www.datastax.com/dev/blog/anticompaction-in-cassandra-2-1)...
anticompaction problems from repair seem related to the fact that the
sstables for a repair range can have data that isn't in the repaired data
range, so we then have an sstable with the repaired data (I'm ... guessing
... this "repaired" sstable only has the repair-range-relevant data?), and
the unrepaired sstable with data outside the repair range needs to stick
around too.

But if our sstables from the start were organized by subdivided ranges
(either the vnode ranges or some fraction of manually managed tokens), then
the hash range is constrained for both the repair and compaction... and if
the sstables are implicitly bucketed by a hash range, compactions are very
easy to parallelize?

I guess for 256 vnodes and RF 3 that would be 768 sets of sstables per
table...

On Mon, Apr 16, 2018 at 12:21 PM, Carl Mueller <carl.muel...@smartthings.com
> wrote:

> Is the fundamental nature of sstable fragmentation the big wrench here?
> I've been trying to imagine aids like an offline repair resolver or a
> gradual node replacement/regenerator process that could serve as a
> backstop/insurance for compaction and repair problems. After all, some of
> the "we don't even bother repairing" places just do gradual automatic node
> replacement, or what the one with the ALL scrubber was doing.
>
> Is there a reason cassandra does not subdivide sstables by hash range,
> especially for vnodes? Reduction of seeks (not an issue in the ssd era
> really)?  Since repair that avoids overstreaming is performed on subranges
> and generate new sstables for further compaction, if the sstables (in
> vnodes or not) were split by dedicated hash ranges then maybe the scale of
> data being dealt with on a node and a repair and compaction would be
> reduced in scope/complexity.
>
> It's before lunch for me, so I'm probably missing a major major caveat
> here...
>
> But I'm trying to think why we wouldn't bucket sstables by hash range.
> Seems to me it would be simple to do in the commitlog --> sstable step, an
> addition to the sstable metadata that isn't too big, and then the
> compaction and repair processes could unentangle and validate ranges with
> more quickly with less excess I/O
>
> On Thu, Apr 12, 2018 at 9:18 PM, Rahul Singh <rahul.xavier.si...@gmail.com
> > wrote:
>
>> Schedule scheme looks good. I believe in process / sidecar can both
>> coexist. As an admin would love to be able to run one or the other or none.
>>
>> Thank you for taking a lead and producing a plan that can actually be
>> executed.
>>
>> --
>> Rahul Singh
>> rahul.si...@anant.us
>>
>> Anant Corporation
>>
>> On Apr 12, 2018, 6:35 PM -0400, Joseph Lynch <joe.e.ly...@gmail.com>,
>> wrote:
>> > Given the feedback here and on the ticket, I've written up a proposal
>> > for a repair
>> > sidecar tool
>> > <https://docs.google.com/document/d/1RV4rOrG1gwlD5IljmrIq_t4
>> 5rz7H3xs9GbFSEyGzEtM/edit#heading=h.5f10ng8gzle8
>> > in the ticket's design document. If there are no major concerns we're
>> going
>> > to start working on porting the Priam implementation into this new tool
>> > soon.
>> >
>> > -Joey
>> >
>> > On Tue, Apr 10, 2018 at 4:17 PM, Elliott Sims <elli...@backblaze.com>
>> wrote:
>> >
>> > > My two cents as a (relatively small) user. I'm coming at this from the
>> > > ops/user side, so my apologies if some of these don't make sense
>> based on a
>> > > more detailed understanding of the codebase:
>> > >
>> > > Repair is definitely a major missing piece of Cassandra. Integrated
>> would
>> > > be easier, but a sidecar might be more flexible. As an intermediate
>> step
>> > > that works towards both options, does it make sense to start with
>> > > finer-grained tracking and reporting for subrange repairs? That is,
>> expose
>> > > a set of interfaces (both internally and via JMX) that give a
>> scheduler
>> > > enough information to run subrange repairs across multiple keyspaces
>> or
>> > > even non-overlapping ranges at the same time. That lets people
>> experiment
>> > > with and quickly/safely/easily iterate on different scheduling
>> strategies
>> > > in the short term, and long-term those strategies can be integrated
>> into a
>> > > built-in scheduler
>> > >
>> > > On the subject of scheduling, I think adjusting
>> parallelism/aggression with
>> > > a possible whitelist 

Re: Repair scheduling tools

2018-04-16 Thread Carl Mueller
Is the fundamental nature of sstable fragmentation the big wrench here?
I've been trying to imagine aids like an offline repair resolver or a
gradual node replacement/regenerator process that could serve as a
backstop/insurance for compaction and repair problems. After all, some of
the "we don't even bother repairing" places just do gradual automatic node
replacement, or what the one with the ALL scrubber was doing.

Is there a reason Cassandra does not subdivide sstables by hash range,
especially for vnodes? Reduction of seeks (not really an issue in the SSD
era)? Since repair that avoids overstreaming is performed on subranges and
generates new sstables for further compaction, then if the sstables (in
vnodes or not) were split by dedicated hash ranges, maybe the scale of data
a node has to deal with during repair and compaction would be reduced in
scope and complexity.

It's before lunch for me, so I'm probably missing a major major caveat
here...

But I'm trying to think why we wouldn't bucket sstables by hash range.
Seems to me it would be simple to do in the commitlog --> sstable step, with
an addition to the sstable metadata that isn't too big, and then the
compaction and repair processes could untangle and validate ranges more
quickly with less excess I/O.

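Just to make the bucketing idea concrete, here is a rough sketch in Java of
the flush-side piece. Everything below is hypothetical (BucketedFlush and
its methods are invented for illustration and are not Cassandra's real flush
path); it only shows how a Murmur3-style token could be mapped to a fixed
hash-range bucket so a flush produces one sstable per bucket instead of one
big one.

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: route partitions to one writer per fixed hash-range
// bucket during a memtable flush, so repair/compaction of a subrange only
// has to touch the buckets that overlap it.
final class BucketedFlush {
    private final int numBuckets;
    private final List<List<Long>> buckets; // stand-ins for per-bucket sstable writers

    BucketedFlush(int numBuckets) {
        this.numBuckets = numBuckets;
        this.buckets = new ArrayList<>();
        for (int i = 0; i < numBuckets; i++)
            buckets.add(new ArrayList<>());
    }

    // Map a signed 64-bit Murmur3-style token onto [0, numBuckets).
    int bucketFor(long token) {
        double normalized = (token - (double) Long.MIN_VALUE) / Math.pow(2, 64);
        return Math.min(numBuckets - 1, (int) (normalized * numBuckets));
    }

    // In the real thing this would append the partition to that bucket's writer.
    void add(long token) {
        buckets.get(bucketFor(token)).add(token);
    }
}

A subrange repair or validation would then only read the buckets whose hash
range intersects the range being repaired, which is the "less excess I/O"
part of the argument.
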
On Thu, Apr 12, 2018 at 9:18 PM, Rahul Singh 
wrote:

> Schedule scheme looks good. I believe in process / sidecar can both
> coexist. As an admin would love to be able to run one or the other or none.
>
> Thank you for taking a lead and producing a plan that can actually be
> executed.
>
> --
> Rahul Singh
> rahul.si...@anant.us
>
> Anant Corporation
>
> On Apr 12, 2018, 6:35 PM -0400, Joseph Lynch ,
> wrote:
> > Given the feedback here and on the ticket, I've written up a proposal
> > for a repair
> > sidecar tool
> >  t45rz7H3xs9GbFSEyGzEtM/edit#heading=h.5f10ng8gzle8
> > in the ticket's design document. If there are no major concerns we're
> going
> > to start working on porting the Priam implementation into this new tool
> > soon.
> >
> > -Joey
> >
> > On Tue, Apr 10, 2018 at 4:17 PM, Elliott Sims 
> wrote:
> >
> > > My two cents as a (relatively small) user. I'm coming at this from the
> > > ops/user side, so my apologies if some of these don't make sense based
> on a
> > > more detailed understanding of the codebase:
> > >
> > > Repair is definitely a major missing piece of Cassandra. Integrated
> would
> > > be easier, but a sidecar might be more flexible. As an intermediate
> step
> > > that works towards both options, does it make sense to start with
> > > finer-grained tracking and reporting for subrange repairs? That is,
> expose
> > > a set of interfaces (both internally and via JMX) that give a scheduler
> > > enough information to run subrange repairs across multiple keyspaces or
> > > even non-overlapping ranges at the same time. That lets people
> experiment
> > > with and quickly/safely/easily iterate on different scheduling
> strategies
> > > in the short term, and long-term those strategies can be integrated
> into a
> > > built-in scheduler
> > >
> > > On the subject of scheduling, I think adjusting parallelism/aggression
> with
> > > a possible whitelist or blacklist would be a lot more useful than a
> "time
> > > between repairs". That is, if repairs run for a few hours then don't
> run
> > > for a few (somewhat hard-to-predict) hours, I still have to size the
> > > cluster for the load when the repairs are running. The only reason I
> can
> > > think of for an interval between repairs is to allow re-compaction from
> > > repair anticompactions, and subrange repairs seem to eliminate this.
> Even
> > > if they didn't, a more direct method along the lines of "don't repair
> when
> > > the compaction queue is too long" might make more sense. Blacklisted
> > > timeslots might be useful for avoiding peak time or batch jobs, but
> only if
> > > they can be specified for consistent time-of-day intervals instead of
> > > unpredictable lulls between repairs.
> > >
> > > I really like the idea of automatically adjusting gc_grace_seconds
> based on
> > > repair state. The only_purge_repaired_tombstones option fixes this
> > > elegantly for sequential/incremental repairs on STCS, but not for
> subrange
> > > repairs or LCS (unless a scheduler gains the ability somehow to
> determine
> > > that every subrange in an sstable has been repaired and mark it
> > > accordingly?)
> > >
> > >
> > > On 2018/04/03 17:48:14, Blake Eggleston  wrote:
> > > > Hi dev@,
> > > >
> > > > >
> > > >
> > > > The question of the best way to schedule repairs came up on
> > > CASSANDRA-14346, and I thought it would be good to bring up the idea
> of an
> > > external tool on the dev list.
> > > >
> > > > >
> > > >
> > > > Cassandra lacks any sort of tools for automating routine tasks that
> are
> > > required for 

Re: Repair scheduling tools

2018-04-03 Thread Carl Mueller
The Last Pickle's Reaper should be the starting point of any discussion on
repair scheduling.

On Tue, Apr 3, 2018 at 12:48 PM, Blake Eggleston 
wrote:

> Hi dev@,
>
>
>
> The question of the best way to schedule repairs came up on
> CASSANDRA-14346, and I thought it would be good to bring up the idea of an
> external tool on the dev list.
>
>
>
> Cassandra lacks any sort of tools for automating routine tasks that are
> required for running clusters, specifically repair. Regular repair is a
> must for most clusters, like compaction. This means that, especially as far
> as eventual consistency is concerned, Cassandra isn’t totally functional
> out of the box. Operators either need to find a 3rd party solution or
> implement one themselves. Adding this to Cassandra would make it easier to
> use.
>
>
>
> Is this something we should be doing? If so, what should it look like?
>
>
>
> Personally, I feel like this is a pretty big gap in the project and would
> like to see an out of process tool offered. Ideally, Cassandra would just
> take care of itself, but writing a distributed repair scheduler that you
> trust to run in production is a lot harder than writing a single process
> management application that can failover.
>
>
>
> Any thoughts on this?
>
>
>
> Thanks,
>
>
>
> Blake
>
>


Re: [DISCUSS] java 9 and the future of cassandra on the jdk

2018-03-25 Thread Carl Mueller
ent versions, which is only possible starting with Java 9.
>> We don't even have that and would still have to make sure the same code
>> runs on both 8 and 11.
>>
>> 3) Release 4.0 for Java 8, branch 4.1 for Java 11 later
>>
>> Don't do anything yet and release 4.0 for Java 8. Keep an eye on how the
>> situation unfolds during the next months and how fast Java 11 will be
>> adopted by Cassandra users. Branch 4.1 for Java 11, if there's public
>> demand and we agree that it makes sense at that point. This is basically
>> an incremental approach to 1), but we'll end up with another branch,
>> which we also would have to support in the future (4.0 for 8, 4.1 for 11).
>>
>>
>>
>>
>> On 22.03.2018 23:30, Michael Shuler wrote:
>>
>>> As I mentioned in IRC and was pasted earlier in the thread, I believe
>>> the easiest path is to follow the major releases of OpenJDK in the
>>> long-term-support Linux OS releases. Currently, Debian Stable (Stretch),
>>> Ubuntu 16.04 (Bionic (near release)), and Red Hat / CentOS 7 all have
>>> OpenJDK 8 as the default JDK. For long-term support, they all have build
>>> facilities in place for their supported architectures and developers
>>> that care about security updates for users through their documented EOL
>>> dates.
>>>
>>> The current deb and rpm packages for Apache Cassandra all properly
>>> depend on OpenJDK 8, so there's really nothing to be done here, until
>>> the project decides to implicitly depend on a JDK version not easily
>>> installable on the major OS LTS releases. (Users of older OS versions
>>> may need to fiddle with yum and apt sources to get OpenJDK 8, but this
>>> is a relatively solved problem.)
>>>
>>> Users have the ability to deviate and set a JAVA_HOME env var to use a
>>> custom-installed JDK of their liking, or go down the `alternatives` path
>>> of their favorite OS.
>>>
>>> 1) I don't think we should be get into the business of distributing
>>> Java, even if licensing allowed it.
>>> 2) The OS vendors are in the business of keeping users updated with
>>> upstream releases of Java, so there's no reason not to utilize them.
>>>
>>> Michael
>>>
>>> On 03/22/2018 05:12 PM, Jason Brown wrote:
>>>
>>>> See the legal-discuss@ thread:
>>>> https://mail-archives.apache.org/mod_mbox/www-legal-discuss/
>>>> 201803.mbox/browser
>>>> .
>>>>
>>>> TL;DR jlink-based distributions are not gonna fly due to OpenJDK's
>>>> license,
>>>> so let's focus on other paths forward.
>>>>
>>>>
>>>> On Thu, Mar 22, 2018 at 2:04 PM, Carl Mueller <
>>>> carl.muel...@smartthings.com>
>>>> wrote:
>>>>
>>>> Is OpenJDK really not addressing this at all? Is that because OpenJDK is
>>>>> beholden to Oracle somehow? This is a major disservice to Apache and
>>>>> the
>>>>> java ecosystem as a whole.
>>>>>
>>>>> When java was fully open sourced, it was supposed to free the
>>>>> ecosystem to
>>>>> a large degree from Oracle. Why is OpenJDK being so uncooperative? Are
>>>>> they
>>>>> that resource strapped? Can no one, from consulting empires, Google,
>>>>> IBM,
>>>>> Amazon, and a host of other major companies take care of this?
>>>>>
>>>>> This is basically OpenSSL all over again.
>>>>>
>>>>> Deciding on a way to get a stable language runtime isn't our job. It's
>>>>> the
>>>>> job of either the runtime authors (OpenJDK) or another group that
>>>>> should
>>>>> form around it.
>>>>>
>>>>> There is no looming deadline on this, is there? Can we just let the
>>>>> dust
>>>>> settle on this in the overall ecosystem to see what happens? And again,
>>>>> what is the Apache Software Foundation's approach to this that affects
>>>>> so
>>>>> many of their projects?
>>>>>
>>>>> On Wed, Mar 21, 2018 at 12:55 PM, Jason Brown <jasedbr...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> Well, that was quick. TL;DR Redistributing any part of the OpenJDK is
>>>>>> basically a no-go.
>>>>>>
>>>>>> Thus, that option is off the table.
>>>>>>
>>

Re: [DISCUSS] java 9 and the future of cassandra on the jdk

2018-03-23 Thread Carl Mueller
I am now thinking that aligning to the major JDK release that offers three
years of paid support (if you want it) is the best strategy. What I think
will happen is that there will be a consortium that maintains/backports that
release level independent of Oracle, if only to spite them. I'm thinking
IBM, Azul, etc. will do so.

Fundamentally the project doesn't really care about the churn in the
language spec from new releases, except for breaking changes for a JDK8-ish
codebase, which should be, what, Jigsaw? What the project cares about are
security patches and JVM advancements.

JVM advancements will probably trickle into the long-term non-Oracle build
(assuming one appears) if they are important enough.

On Fri, Mar 23, 2018 at 11:16 AM, Jonathan Haddad  wrote:

> I suppose given the short lifetime of each Java release you could argue
> we're always close to EOL.  I feel like we shouldn't ship with a version
> that is currently EOL.
>
> Coming up with a policy for all upcoming releases may also be incredibly
> difficult.  6 months java releases could pan out like Tick Tock and reveal
> itself to be a fun idea with some really bad consequences, and it goes away
> after Java 12.  Impossible to tell.  How about we figure out the next
> release and get a little experience under our belts with their new release
> schedule before we try to make long term decisions?
>
> On Fri, Mar 23, 2018 at 9:08 AM Josh McKenzie 
> wrote:
>
> > > At this point I feel like we should already be
> > > targeting Java 10 at a minimum.
> > Barring some surprises from other people supporting 10 longer-term,
> > wouldn't that be coupling C*'s 4.0 release with a runtime that's
> > likely EOL shortly after?
> >
> > On Fri, Mar 23, 2018 at 11:52 AM, Jonathan Haddad 
> > wrote:
> > > Java 8 was marked as EOL in the middle of last year, I hope we wouldn't
> > > require it for Cassandra 4.  At this point I feel like we should
> already
> > be
> > > targeting Java 10 at a minimum.
> > >
> > > Personally I'd prefer not to tie our releases to any vendor / product /
> > > package's release schedule.
> > >
> > >
> > > On Fri, Mar 23, 2018 at 6:49 AM Jason Brown 
> > wrote:
> > >
> > >> I'm coming to be on-board with #3.
> > >>
> > >> One thing to watch out for (we can't account for it now) is how our
> > >> dependencies choose to move forward. If we need to upgrade a jar
> (netty,
> > >> for example) due to some leak or vulnerability, and it only runs on a
> > >> higher version, we may be forced to upgrade the base java version.
> > Like, I
> > >> said we can't possibly foresee these things, and we'll just have to
> > make a
> > >> hard decision if the situation arises, but just something to keep in
> > mind.
> > >>
> > >>
> > >> On Fri, Mar 23, 2018 at 5:39 AM, Josh McKenzie 
> > >> wrote:
> > >>
> > >> > >
> > >> > > 3) Release 4.0 for Java 8, *optionally* branch 4.1 for Java 11
> later
> > >> >
> > >> > This seems like the best of our bad options, with the addition of
> > >> > "optionally".
> > >> >
> > >> >
> > >> > On Fri, Mar 23, 2018 at 8:12 AM, Gerald Henriksen <
> ghenr...@gmail.com
> > >
> > >> > wrote:
> > >> >
> > >> > > On Fri, 23 Mar 2018 04:54:23 +, you wrote:
> > >> > >
> > >> > > >I think Michael is right. It would be impossible to make everyone
> > >> follow
> > >> > > >such a fast release scheme, and supporting it will be pressured
> > onto
> > >> the
> > >> > > >various distributions, M$ and Apple.
> > >> > > >On the other hand https://adoptopenjdk.net has already done a
> lot
> > of
> > >> > the
> > >> > > >work and it's already rumoured they may take up backporting of
> > >> > > security/bug
> > >> > > >fixes. I'd fully expect a lot of users to collaborate around this
> > (or
> > >> > > >similar), and there's no reason we couldn't do our part to
> > contribute.
> > >> > >
> > >> > > A posting on Reddit yesterday from someone from adoptopenjdk
> claimes
> > >> > > that they will be doing LTS releases starting with Java 11, and
> > there
> > >> > > should be updates to their website to reflect that soon:
> > >> > >
> > >> > >
> > https://www.reddit.com/r/java/comments/86ce66/java_long_term_support/
> > >> > >
> > >> > > So I guess a wait and see to what they commit could be worthwhile.
> > >> > >
> > >> > >
> > -
> > >> > > To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> > >> > > For additional commands, e-mail: dev-h...@cassandra.apache.org
> > >> > >
> > >> > >
> > >> >
> > >>
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> > For additional commands, e-mail: dev-h...@cassandra.apache.org
> >
> >
>


Re: [DISCUSS] java 9 and the future of cassandra on the jdk

2018-03-22 Thread Carl Mueller
Is OpenJDK really not addressing this at all? Is that because OpenJDK is
beholden to Oracle somehow? This is a major disservice to Apache and the
Java ecosystem as a whole.

When Java was fully open sourced, it was supposed to free the ecosystem to
a large degree from Oracle. Why is OpenJDK being so uncooperative? Are they
that resource-strapped? Can no one, from the consulting empires to Google,
IBM, Amazon, and a host of other major companies, take care of this?

This is basically OpenSSL all over again.

Deciding on a way to get a stable language runtime isn't our job. It's the
job of either the runtime authors (OpenJDK) or another group that should
form around it.

There is no looming deadline on this, is there? Can we just let the dust
settle on this in the overall ecosystem to see what happens? And again,
what is the Apache Software Foundation's approach to this that affects so
many of their projects?

On Wed, Mar 21, 2018 at 12:55 PM, Jason Brown  wrote:

> Well, that was quick. TL;DR Redistributing any part of the OpenJDK is
> basically a no-go.
>
> Thus, that option is off the table.
>
> On Wed, Mar 21, 2018 at 10:46 AM, Jason Brown 
> wrote:
>
> > ftr, I've sent a message to legal-discuss to inquire about the licensing
> > aspect of the OpenJDK as we've been discussing. I believe anyone can
> follow
> > the thread by subscribing to the legal-discuss@ ML, or you can wait for
> > updates on this thread as I get them.
> >
> > On Wed, Mar 21, 2018 at 9:49 AM, Jason Brown 
> wrote:
> >
> >> If we went down this path, I can't imagine we would build OpenJDK
> >> ourselves, but probably build a release with jlink or javapackager. I
> >> haven't done homework on that yet, but i *think* it uses a blessed
> OpenJDK
> >> release for the packaging (or perhaps whatever JDK you happen to be
> >> compiling/building with). Thus as long as we build/release when an
> openJDK
> >> rev is released, we would hypothetically be ok from a secutiry POV.
> >>
> >> That being said, Micke's points about multiple architectures and other
> >> OSes (Windows for sure, macOS not so sure) are a legit concern as those
> >> would need to be separate packages, with separate CI/testing and so on
> :(
> >>
> >> I'm not sure betting the farm on linux disto support is the path to
> >> happiness, either. Not everyone uses one of the distros mentioned (RH,
> >> ubuntu), nor does everyone use linux (sure, the vast majority is
> Linux/x86,
> >> but we do support Windows deployment and macOS development).
> >>
> >> -Jason
> >>
> >>
> >>
> >> On Wed, Mar 21, 2018 at 9:26 AM, Michael Burman 
> >> wrote:
> >>
> >>> On 03/21/2018 04:52 PM, Josh McKenzie wrote:
> >>>
> >>> This would certainly mitigate a lot of the core problems with the new
>  release model. Has there been any public statements of plans/intent
>  with regards to distros doing this?
> 
> >>> Since the latest official LTS version is Java 8, that's the only one
> >>> with publicly available information For RHEL, OpenJDK8 will receive
> updates
> >>> until October 2020.  "A major version of OpenJDK is supported for a
> period
> >>> of six years from the time that it is first introduced in any version
> of
> >>> RHEL, or until the retirement date of the underlying RHEL platform ,
> >>> whichever is earlier." [1]
> >>>
> >>> [1] https://access.redhat.com/articles/1299013
> >>>
> >>> In terms of the burden of bugfixes and security fixes if we bundled a
>  JRE w/C*, cutting a patch release of C* with a new JRE distribution
>  would be a really low friction process (add to build, check CI, green,
>  done), so I don't think that would be a blocker for the concept.
> 
>  And do we have someone actively monitoring CVEs for this? Would we
> ship
> >>> a version of OpenJDK which ensures that it works with all the major
> >>> distributions? Would we run tests against all the major distributions
> for
> >>> each of the OpenJDK version we would ship after each CVE with each
> >>> Cassandra version? Who compiles the OpenJDK distribution we would
> create
> >>> (which wouldn't be the official one if we need to maintain support for
> each
> >>> distribution we support) ? What if one build doesn't work for one
> distro?
> >>> Would we not update that CVE? OpenJDK builds that are in the distros
> are
> >>> not necessarily the pure ones from the upstream, they might include
> patches
> >>> that provide better support for the distribution - or even fix bugs
> that
> >>> are not yet in the upstream version.
> >>>
> >>> I guess we also need the Windows versions, maybe the PowerPC & ARM
> >>> versions also at some point. I'm not sure if we plan to support J9 or
> other
> >>> JVMs at some point.
> >>>
> >>> We would also need to create CVE reports after each Java CVE for
> >>> Cassandra as well I would assume since it would affect us separately
> (and
> >>> updating only the Java wouldn't help).

Re: [DISCUSS] java 9 and the future of cassandra on the jdk

2018-03-20 Thread Carl Mueller
So this is basically Oracle imposing a rapid upgrade path on free users to
force them to buy commercial to get LTS stability?

This will probably shake out in the community somehow. Cassandra is complex,
but we are small fry in the land of IT support and enterprise upgrades.
Something will organize around OpenJDK, I'd guess. There are probably
collective tens or hundreds of billions of dollars in IT budgets affected by
this, so something will shake out.

So many other projects are impacted by this. What is the official/consensus
Apache Software Foundation strategy on this?

On Tue, Mar 20, 2018 at 3:50 PM, Jason Brown  wrote:

> Thanks to Hannu and others pointing out that the OracleJDK is a
> *commercial* LTS, and thus not an option. mea culpa for missing the
> "commercial" and just focusing on the "LTS" bit. OpenJDK is is, then.
>
> Stefan's elastic search link is rather interesting. Looks like they are
> compiling for both a LTS version as well as the current OpenJDK. They
> assume some of their users will stick to a LTS version and some will run
> the current version of OpenJDK.
>
> While it's extra work to add JDK version as yet another matrix variable in
> addition to our branching, is that something we should consider? Or are we
> going to burden maintainers even more? Do we have a choice? Note: I think
> this is similar to what Jeremiah is proposed.
>
> @Ariel: Going beyond 3 years could be tricky in the worst case because
> bringing in up to 3 years of JDK changes to an older release might mean
> some of our dependencies no longer function and now it's not just minor
> fixes it's bringing in who knows what in terms of updated dependencies
>
> I'm not sure we have a choice anymore, as we're basically bound to what the
> JDK developers choose to do (and we're bound to the JDK ...). However, if
> we have the changes necessary for the JDK releases higher than the LTS (if
> we following the elastic search model), perhaps it'll be a reasonably
> smooth transition?
>
> On Tue, Mar 20, 2018 at 1:31 PM, Jason Brown  wrote:
>
> > copied directly from dev channel, just to keep with this ML conversation
> >
> > 08:08:26   Robert Stupp jasobrown: https://www.azul.com/java-
> > stable-secure-free-choose-two-three/ and https://blogs.oracle.com/java-
> > platform-group/faster-and-easier-use-and-redistribution-of-java-se
> > 08:08:38 the 2nd says: "The Oracle JDK will continue as a commercial long
> > term support offering"
> > 08:08:46 also: http://www.oracle.com/technetwork/java/eol-135779.html
> > 08:09:21 the keyword in that cite is "commercial"
> > 08:21:21  Michael Shuler a couple more thoughts.. 1) keep C*
> > support in step with latest Ubuntu LTS OpenJDK major in main, 2) bundle
> JRE
> > in C* releases? (JDK is not "legal" to bundle)
> > 08:23:44  https://www.elastic.co/blog/elasticsearch-
> > java-9-and-beyond  - interesting read on that matter
> > 08:26:04 can't wait for the infra and CI testing implications.. will be
> > lot's of fun ;(
> > 08:42:13  Robert Stupp Not sure whether stepping with Ubuntu is
> > necessary. It's not so difficult to update apt.source ;)
> > 08:42:43 CI ? It just let's your test matrix explode - a litte ;)
> > 08:46:48  Michael Shuler yep, we currently `def jdkLabel = 'JDK 1.8
> > (latest)'` in job DSL and could easily modify that
> >
> > On Tue, Mar 20, 2018 at 9:08 AM, Kant Kodali  wrote:
> >
> >> Java 10 is releasing today!
> >>
> >> On Tue, Mar 20, 2018 at 9:07 AM, Ariel Weisberg 
> >> wrote:
> >>
> >> > Hi,
> >> >
> >> > +1 to what Jordan is saying.
> >> >
> >> > It seems like if we are cutting a release off of trunk we want to make
> >> > sure we get N years of supported JDK out of it. For a single LTS
> >> release N
> >> > could be at most 3 and historically that isn't long enough and it's
> very
> >> > likely we will get < 3 after a release is cut.
> >> >
> >> > Going beyond 3 years could be tricky in the worst case because
> bringing
> >> in
> >> > up to 3 years of JDK changes to an older release might mean some of
> our
> >> > dependencies no longer function and now it's not just minor fixes it's
> >> > bringing in who knows what in terms of updated dependencies.
> >> >
> >> > I think in some cases we are going to need to take a release we have
> >> > already cut and make it work with an LTS release that didn't exist
> when
> >> the
> >> > release was cut.
> >> >
> >> > We also need to update how CI works. We should at least build and run
> a
> >> > quick smoke test with the JDKs we are claiming to support and
> >> > asynchronously run all the tests on the rather large matrix that now
> >> exists.
> >> >
> >> > Ariel
> >> >
> >> > On Tue, Mar 20, 2018, at 11:07 AM, Jeremiah Jordan wrote:
> >> > > My suggestion would be to keep trunk on the latest LTS by default,
> but
> >> > > with compatibility with the latest release if possible.  Since
> Oracle
> >> > > LTS releases are every 3 

Re: Making RF4 useful aka primary and secondary ranges

2018-03-15 Thread Carl Mueller
Close. I'm suggesting that if you have RF4, 5, or 6, you get to designate
a subset of three replicas that are strongly preferred. If you do QUORUM
against this "virtual subset/datacenter", it just requires n/2+1 of the
subset. Updates are still sent to the non-primary replicas, and if a
primary fails, one of the remaining replicas is then promoted to primary.

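(To spell out the n/2+1 arithmetic: with a designated primary subset of
size 3, the subset quorum stays at 2 no matter how large the total
replication factor gets, while a plain QUORUM keeps growing with RF. Rough
sketch below; the class name is invented and this is not any real driver
API, just the math.)

// Sketch of the subset-quorum arithmetic described above.
final class SubsetQuorum {
    static int quorum(int replicas) { return replicas / 2 + 1; }

    public static void main(String[] args) {
        int subsetSize = 3; // the designated "primary" replicas
        for (int rf = 4; rf <= 6; rf++)
            System.out.printf("RF=%d  full QUORUM=%d  subset QUORUM=%d%n",
                              rf, quorum(rf), quorum(subsetSize));
        // RF=4  full QUORUM=3  subset QUORUM=2
        // RF=5  full QUORUM=3  subset QUORUM=2
        // RF=6  full QUORUM=4  subset QUORUM=2
    }
}
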
For your RF4 case you write at QUORUM and read at HALF, but that limits the
number of "hot spares", although admittedly with a better consistency
guarantee on the hot spares.

I'm suggesting something where you could have a ton more hot spares, say
RF8 or RF10, but you still rely on the first three in the ring as the
primaries that you do CL2 QUORUM against, and you accept that the additional
replicas are eventually consistent whenever they get there.

Yours, mine, and the two others that were linked all kind of bandy around
the same thing: trying to finagle more replicas while not being shoehorned
into having to do more replica agreements on the reads. Some have better
guarantees than the others...

Doing mine via the driver basically makes the driver the coordinator. That
works in some network topologies, but the cost would be too high otherwise.


On Wed, Mar 14, 2018 at 9:14 PM, Jason Brown <jasedbr...@gmail.com> wrote:

> I feel like we've had a very similar conversation (not so) recently:
> https://lists.apache.org/thread.html/9952c419398a1a2f22e2887e3492f9
> d6899365f0ea7c2b68d6fbe0d4@%3Cuser.cassandra.apache.org%3E
>
> Which led to the creation of this JIRA:
> https://issues.apache.org/jira/browse/CASSANDRA-13645
>
>
> On Wed, Mar 14, 2018 at 4:23 PM, Carl Mueller <
> carl.muel...@smartthings.com>
> wrote:
>
> > Since this is basically driver syntactic sugar... Yes I'll try that.
> >
> >
> > On Wed, Mar 14, 2018 at 5:59 PM, Jonathan Haddad <j...@jonhaddad.com>
> > wrote:
> >
> > > You could use a load balancing policy at the driver level to do what
> you
> > > want, mixed with the existing consistency levels as Jeff suggested.
> > >
> > > On Wed, Mar 14, 2018 at 3:47 PM Carl Mueller <
> > carl.muel...@smartthings.com
> > > >
> > > wrote:
> > >
> > > > But we COULD have CL2 write (for RF4)
> > > >
> > > > The extension to this idea is multiple backup/secondary replicas. So
> > you
> > > > have RF5 or RF6 or higher, but still are performing CL2 against the
> > > > preferred first three for both read and write.
> > > >
> > > > You could also ascertain the general write health of affected ranges
> > > before
> > > > taking a node down for maintenance from the primary, and then know
> the
> > > > switchover is in good shape. Yes there are CAP limits and race
> > conditions
> > > > there, but you could get pretty good assurances (all repaired,
> low/zero
> > > > queued hinted handoffs, etc).
> > > >
> > > > This is essentially like if you had two datacenters, but are doing
> > > > local_quorum on the one datacenter. Well, except switchover is a bit
> > more
> > > > granular if you run out of replicas in the local.
> > > >
> > > >
> > > >
> > > > On Wed, Mar 14, 2018 at 5:17 PM, Jeff Jirsa <jji...@gmail.com>
> wrote:
> > > >
> > > > > Write at CL 3 and read at CL 2
> > > > >
> > > > > --
> > > > > Jeff Jirsa
> > > > >
> > > > >
> > > > > > On Mar 14, 2018, at 2:40 PM, Carl Mueller <
> > > > carl.muel...@smartthings.com>
> > > > > wrote:
> > > > > >
> > > > > > Currently there is little use for RF4. You're getting the
> > > requirements
> > > > of
> > > > > > QUORUM-3 but only one extra backup.
> > > > > >
> > > > > > I'd like to propose something that would make RF4 a sort of more
> > > > heavily
> > > > > > backed up RF3.
> > > > > >
> > > > > > A lot of this is probably achievable with strictly driver-level
> > > logic,
> > > > so
> > > > > > perhaps it would belong more there.
> > > > > >
> > > > > > Basically the idea is to have four replicas of the data, but only
> > > have
> > > > to
> > > > > > practically do QUORUM with three nodes. We consider the first
> three
> > > > > > replicas the "primary replicas". On an ongoing basis for QU

Re: Making RF4 useful aka primary and secondary ranges

2018-03-14 Thread Carl Mueller
Since this is basically driver syntactic sugar... Yes I'll try that.


On Wed, Mar 14, 2018 at 5:59 PM, Jonathan Haddad <j...@jonhaddad.com> wrote:

> You could use a load balancing policy at the driver level to do what you
> want, mixed with the existing consistency levels as Jeff suggested.
>
> On Wed, Mar 14, 2018 at 3:47 PM Carl Mueller <carl.muel...@smartthings.com
> >
> wrote:
>
> > But we COULD have CL2 write (for RF4)
> >
> > The extension to this idea is multiple backup/secondary replicas. So you
> > have RF5 or RF6 or higher, but still are performing CL2 against the
> > preferred first three for both read and write.
> >
> > You could also ascertain the general write health of affected ranges
> before
> > taking a node down for maintenance from the primary, and then know the
> > switchover is in good shape. Yes there are CAP limits and race conditions
> > there, but you could get pretty good assurances (all repaired, low/zero
> > queued hinted handoffs, etc).
> >
> > This is essentially like if you had two datacenters, but are doing
> > local_quorum on the one datacenter. Well, except switchover is a bit more
> > granular if you run out of replicas in the local.
> >
> >
> >
> > On Wed, Mar 14, 2018 at 5:17 PM, Jeff Jirsa <jji...@gmail.com> wrote:
> >
> > > Write at CL 3 and read at CL 2
> > >
> > > --
> > > Jeff Jirsa
> > >
> > >
> > > > On Mar 14, 2018, at 2:40 PM, Carl Mueller <
> > carl.muel...@smartthings.com>
> > > wrote:
> > > >
> > > > Currently there is little use for RF4. You're getting the
> requirements
> > of
> > > > QUORUM-3 but only one extra backup.
> > > >
> > > > I'd like to propose something that would make RF4 a sort of more
> > heavily
> > > > backed up RF3.
> > > >
> > > > A lot of this is probably achievable with strictly driver-level
> logic,
> > so
> > > > perhaps it would belong more there.
> > > >
> > > > Basically the idea is to have four replicas of the data, but only
> have
> > to
> > > > practically do QUORUM with three nodes. We consider the first three
> > > > replicas the "primary replicas". On an ongoing basis for QUORUM reads
> > and
> > > > writes, we would rely on only those three replicas to satisfy
> > > > two-out-of-three QUORUM. Writes are persisted to the fourth replica
> in
> > > the
> > > > normal manner of cassandra, it just doesn't count towards the QUORUM
> > > write.
> > > >
> > > > On reads, with token and node health awareness by the driver, if the
> > > > primaries are all healthy, two-of-three QUORUM is calculated from
> > those.
> > > >
> > > > If however one of the three primaries is down, read QUORUM is a bit
> > > > different:
> > > > 1) if the first two replies come from the two remaining primaries and
> > > > agree, the is returned
> > > > 2) if the first two replies are a primary and the "hot spare" and
> those
> > > > agree, that is returned
> > > > 3) if the primary and hot spare disagree, wait for the next primary
> to
> > > > return, and then take the agreement (hopefully) that results
> > > >
> > > > Then once the previous primary comes back online, the read quorum
> goes
> > > back
> > > > to preferring that set, with the assuming hinted handoff and repair
> > will
> > > > get it back up to snuff.
> > > >
> > > > There could also be some mechanism examining the hinted handoff
> status
> > of
> > > > the four to determine when to reactivate the primary that was down.
> > > >
> > > > For mutations, one could prefer a "QUORUM plus" that was a quorum of
> > the
> > > > primaries plus the hot spare.
> > > >
> > > > Of course one could do multiple hot spares, so RF5 could still be
> > treated
> > > > as RF3 + hot spares.
> > > >
> > > > The goal here is more data resiliency but not having to rely on as
> many
> > > > nodes for resiliency.
> > > >
> > > > Since the data is ring-distributed, the fact there are primary owners
> > of
> > > > ranges should still be evenly distributed and no hot nodes should
> > result
> > >
> > > -
> > > To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> > > For additional commands, e-mail: dev-h...@cassandra.apache.org
> > >
> > >
> >
>


Re: Making RF4 useful aka primary and secondary ranges

2018-03-14 Thread Carl Mueller
But we COULD have CL2 write (for RF4)

The extension to this idea is multiple backup/secondary replicas. So you
have RF5 or RF6 or higher, but are still performing CL2 against the
preferred first three for both read and write.

You could also ascertain the general write health of the affected ranges
before taking a primary node down for maintenance, and then know the
switchover is in good shape. Yes, there are CAP limits and race conditions
there, but you could get pretty good assurances (all repaired, low/zero
queued hinted handoffs, etc.).

This is essentially like having two datacenters but doing LOCAL_QUORUM
against one of them, except that switchover is a bit more granular if you
run out of replicas in the local one.



On Wed, Mar 14, 2018 at 5:17 PM, Jeff Jirsa <jji...@gmail.com> wrote:

> Write at CL 3 and read at CL 2
>
> --
> Jeff Jirsa
>
>
> > On Mar 14, 2018, at 2:40 PM, Carl Mueller <carl.muel...@smartthings.com>
> wrote:
> >
> > Currently there is little use for RF4. You're getting the requirements of
> > QUORUM-3 but only one extra backup.
> >
> > I'd like to propose something that would make RF4 a sort of more heavily
> > backed up RF3.
> >
> > A lot of this is probably achievable with strictly driver-level logic, so
> > perhaps it would belong more there.
> >
> > Basically the idea is to have four replicas of the data, but only have to
> > practically do QUORUM with three nodes. We consider the first three
> > replicas the "primary replicas". On an ongoing basis for QUORUM reads and
> > writes, we would rely on only those three replicas to satisfy
> > two-out-of-three QUORUM. Writes are persisted to the fourth replica in
> the
> > normal manner of cassandra, it just doesn't count towards the QUORUM
> write.
> >
> > On reads, with token and node health awareness by the driver, if the
> > primaries are all healthy, two-of-three QUORUM is calculated from those.
> >
> > If however one of the three primaries is down, read QUORUM is a bit
> > different:
> > 1) if the first two replies come from the two remaining primaries and
> > agree, the is returned
> > 2) if the first two replies are a primary and the "hot spare" and those
> > agree, that is returned
> > 3) if the primary and hot spare disagree, wait for the next primary to
> > return, and then take the agreement (hopefully) that results
> >
> > Then once the previous primary comes back online, the read quorum goes
> back
> > to preferring that set, with the assuming hinted handoff and repair will
> > get it back up to snuff.
> >
> > There could also be some mechanism examining the hinted handoff status of
> > the four to determine when to reactivate the primary that was down.
> >
> > For mutations, one could prefer a "QUORUM plus" that was a quorum of the
> > primaries plus the hot spare.
> >
> > Of course one could do multiple hot spares, so RF5 could still be treated
> > as RF3 + hot spares.
> >
> > The goal here is more data resiliency but not having to rely on as many
> > nodes for resiliency.
> >
> > Since the data is ring-distributed, the fact there are primary owners of
> > ranges should still be evenly distributed and no hot nodes should result
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: dev-h...@cassandra.apache.org
>
>


Re: Making RF4 useful aka primary and secondary ranges

2018-03-14 Thread Carl Mueller
I also wonder if the state of hinted handoff can inform the validity of
extra replicas. Repair is mentioned in 7168.


On Wed, Mar 14, 2018 at 4:55 PM, Carl Mueller <carl.muel...@smartthings.com>
wrote:

> For my reference: https://issues.apache.org/jira/browse/CASSANDRA-7168
>
>
> On Wed, Mar 14, 2018 at 4:46 PM, Ariel Weisberg <ar...@weisberg.ws> wrote:
>
>> Hi,
>>
>> There is a JIRA for decoupling the size of the group size used for
>> consensus with level of data redundancy. https://issues.apache.org/jira
>> /browse/CASSANDRA-13442
>>
>> It's been discussed quite a bit offline and I did a presentation on it at
>> NGCC. Hopefully we will see some movement on it soon.
>>
>> Ariel
>>
>> On Wed, Mar 14, 2018, at 5:40 PM, Carl Mueller wrote:
>> > Currently there is little use for RF4. You're getting the requirements
>> of
>> > QUORUM-3 but only one extra backup.
>> >
>> > I'd like to propose something that would make RF4 a sort of more heavily
>> > backed up RF3.
>> >
>> > A lot of this is probably achievable with strictly driver-level logic,
>> so
>> > perhaps it would belong more there.
>> >
>> > Basically the idea is to have four replicas of the data, but only have
>> to
>> > practically do QUORUM with three nodes. We consider the first three
>> > replicas the "primary replicas". On an ongoing basis for QUORUM reads
>> and
>> > writes, we would rely on only those three replicas to satisfy
>> > two-out-of-three QUORUM. Writes are persisted to the fourth replica in
>> the
>> > normal manner of cassandra, it just doesn't count towards the QUORUM
>> write.
>> >
>> > On reads, with token and node health awareness by the driver, if the
>> > primaries are all healthy, two-of-three QUORUM is calculated from those.
>> >
>> > If however one of the three primaries is down, read QUORUM is a bit
>> > different:
>> > 1) if the first two replies come from the two remaining primaries and
>> > agree, the is returned
>> > 2) if the first two replies are a primary and the "hot spare" and those
>> > agree, that is returned
>> > 3) if the primary and hot spare disagree, wait for the next primary to
>> > return, and then take the agreement (hopefully) that results
>> >
>> > Then once the previous primary comes back online, the read quorum goes
>> back
>> > to preferring that set, with the assuming hinted handoff and repair will
>> > get it back up to snuff.
>> >
>> > There could also be some mechanism examining the hinted handoff status
>> of
>> > the four to determine when to reactivate the primary that was down.
>> >
>> > For mutations, one could prefer a "QUORUM plus" that was a quorum of the
>> > primaries plus the hot spare.
>> >
>> > Of course one could do multiple hot spares, so RF5 could still be
>> treated
>> > as RF3 + hot spares.
>> >
>> > The goal here is more data resiliency but not having to rely on as many
>> > nodes for resiliency.
>> >
>> > Since the data is ring-distributed, the fact there are primary owners of
>> > ranges should still be evenly distributed and no hot nodes should result
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
>> For additional commands, e-mail: dev-h...@cassandra.apache.org
>>
>>
>


Re: Making RF4 useful aka primary and secondary ranges

2018-03-14 Thread Carl Mueller
For my reference: https://issues.apache.org/jira/browse/CASSANDRA-7168


On Wed, Mar 14, 2018 at 4:46 PM, Ariel Weisberg <ar...@weisberg.ws> wrote:

> Hi,
>
> There is a JIRA for decoupling the size of the group size used for
> consensus with level of data redundancy. https://issues.apache.org/
> jira/browse/CASSANDRA-13442
>
> It's been discussed quite a bit offline and I did a presentation on it at
> NGCC. Hopefully we will see some movement on it soon.
>
> Ariel
>
> On Wed, Mar 14, 2018, at 5:40 PM, Carl Mueller wrote:
> > Currently there is little use for RF4. You're getting the requirements of
> > QUORUM-3 but only one extra backup.
> >
> > I'd like to propose something that would make RF4 a sort of more heavily
> > backed up RF3.
> >
> > A lot of this is probably achievable with strictly driver-level logic, so
> > perhaps it would belong more there.
> >
> > Basically the idea is to have four replicas of the data, but only have to
> > practically do QUORUM with three nodes. We consider the first three
> > replicas the "primary replicas". On an ongoing basis for QUORUM reads and
> > writes, we would rely on only those three replicas to satisfy
> > two-out-of-three QUORUM. Writes are persisted to the fourth replica in
> the
> > normal manner of cassandra, it just doesn't count towards the QUORUM
> write.
> >
> > On reads, with token and node health awareness by the driver, if the
> > primaries are all healthy, two-of-three QUORUM is calculated from those.
> >
> > If however one of the three primaries is down, read QUORUM is a bit
> > different:
> > 1) if the first two replies come from the two remaining primaries and
> > agree, the is returned
> > 2) if the first two replies are a primary and the "hot spare" and those
> > agree, that is returned
> > 3) if the primary and hot spare disagree, wait for the next primary to
> > return, and then take the agreement (hopefully) that results
> >
> > Then once the previous primary comes back online, the read quorum goes
> back
> > to preferring that set, with the assuming hinted handoff and repair will
> > get it back up to snuff.
> >
> > There could also be some mechanism examining the hinted handoff status of
> > the four to determine when to reactivate the primary that was down.
> >
> > For mutations, one could prefer a "QUORUM plus" that was a quorum of the
> > primaries plus the hot spare.
> >
> > Of course one could do multiple hot spares, so RF5 could still be treated
> > as RF3 + hot spares.
> >
> > The goal here is more data resiliency but not having to rely on as many
> > nodes for resiliency.
> >
> > Since the data is ring-distributed, the fact there are primary owners of
> > ranges should still be evenly distributed and no hot nodes should result
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: dev-h...@cassandra.apache.org
>
>


Making RF4 useful aka primary and secondary ranges

2018-03-14 Thread Carl Mueller
Currently there is little use for RF4. You take on the requirements of a
quorum of 3 but get only one extra backup.

I'd like to propose something that would make RF4 a sort of more heavily
backed up RF3.

A lot of this is probably achievable with strictly driver-level logic, so
perhaps it would belong more there.

Basically the idea is to have four replicas of the data, but only have to
practically do QUORUM with three nodes. We consider the first three
replicas the "primary replicas". On an ongoing basis for QUORUM reads and
writes, we would rely on only those three replicas to satisfy
two-out-of-three QUORUM. Writes are persisted to the fourth replica in the
normal Cassandra manner; it just doesn't count towards the QUORUM write.

On reads, with token and node health awareness by the driver, if the
primaries are all healthy, two-of-three QUORUM is calculated from those.

If however one of the three primaries is down, read QUORUM is a bit
different:
1) if the first two replies come from the two remaining primaries and
agree, that is returned
2) if the first two replies are a primary and the "hot spare" and those
agree, that is returned
3) if the primary and hot spare disagree, wait for the next primary to
return, and then take the agreement (hopefully) that results

Then once the previous primary comes back online, the read quorum goes back
to preferring that set, with the assumption that hinted handoff and repair
will get it back up to snuff.

There could also be some mechanism examining the hinted handoff status of
the four to determine when to reactivate the primary that was down.

For mutations, one could prefer a "QUORUM plus" that was a quorum of the
primaries plus the hot spare.

Of course one could do multiple hot spares, so RF5 could still be treated
as RF3 + hot spares.

The goal here is more data resiliency but not having to rely on as many
nodes for resiliency.

Since the data is ring-distributed, primary ownership of ranges should still
be evenly distributed, and no hot nodes should result.

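Since most of this can live at the driver level, here is a minimal sketch of
the host-ordering piece. It is hypothetical: the class, the String host
names, and the isUp predicate are invented for illustration and are not any
real driver API. It orders the RF replicas of a partition so the three
"primary" replicas are tried first and the extra replicas only come into
play as hot spares; the usual two-of-three quorum then applies to the
primary subset.

import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of the "primary subset + hot spares" query plan.
final class PrimarySubsetPlan {
    static final int PRIMARY_COUNT = 3;

    // replicasInRingOrder: all RF replicas for the partition, in ring order.
    static List<String> queryPlan(List<String> replicasInRingOrder, Predicate<String> isUp) {
        int primaryEnd = Math.min(PRIMARY_COUNT, replicasInRingOrder.size());
        List<String> primaries = replicasInRingOrder.subList(0, primaryEnd);
        List<String> spares = replicasInRingOrder.subList(primaryEnd, replicasInRingOrder.size());

        List<String> plan = new ArrayList<>();
        for (String p : primaries)
            if (isUp.test(p)) plan.add(p);   // healthy primaries first
        for (String s : spares)
            if (isUp.test(s)) plan.add(s);   // hot spares only as fallback

        return plan;  // the read/write still waits for 2 agreeing responses
    }
}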

Re: Why isn't there a separate JVM per table?

2018-02-22 Thread Carl Mueller
Alternative: JVM per vnode.

On Thu, Feb 22, 2018 at 4:52 PM, Carl Mueller <carl.muel...@smartthings.com>
wrote:

> BLoom filters... nevermind
>
>
> On Thu, Feb 22, 2018 at 4:48 PM, Carl Mueller <
> carl.muel...@smartthings.com> wrote:
>
>> Is the current reason for a large starting heap due to the memtable?
>>
>> On Thu, Feb 22, 2018 at 4:44 PM, Carl Mueller <
>> carl.muel...@smartthings.com> wrote:
>>
>>>  ... compaction on its own jvm was also something I was thinking about,
>>> but then I realized even more JVM sharding could be done at the table level.
>>>
>>> On Thu, Feb 22, 2018 at 4:09 PM, Jon Haddad <j...@jonhaddad.com> wrote:
>>>
>>>> Yeah, I’m in the compaction on it’s own JVM camp, in an ideal world
>>>> where we’re isolating crazy GC churning parts of the DB.  It would mean
>>>> reworking how tasks are created and removal of all shared state in favor of
>>>> messaging + a smarter manager, which imo would be a good idea regardless.
>>>>
>>>> It might be a better use of time (especially for 4.0) to do some GC
>>>> performance profiling and cut down on the allocations, since that doesn’t
>>>> involve a massive effort.
>>>>
>>>> I’ve been meaning to do a little benchmarking and profiling for a while
>>>> now, and it seems like a few others have the same inclination as well,
>>>> maybe now is a good time to coordinate that.  A nice perf bump for 4.0
>>>> would be very rewarding.
>>>>
>>>> Jon
>>>>
>>>> > On Feb 22, 2018, at 2:00 PM, Nate McCall <zznat...@gmail.com> wrote:
>>>> >
>>>> > I've heard a couple of folks pontificate on compaction in its own
>>>> > process as well, given it has such a high impact on GC. Not sure about
>>>> > the value of individual tables. Interesting idea though.
>>>> >
>>>> > On Fri, Feb 23, 2018 at 10:45 AM, Gary Dusbabek <gdusba...@gmail.com>
>>>> wrote:
>>>> >> I've given it some thought in the past. In the end, I usually talk
>>>> myself
>>>> >> out of it because I think it increases the surface area for failure.
>>>> That
>>>> >> is, managing N processes is more difficult that managing one
>>>> process. But
>>>> >> if the additional failure modes are addressed, there are some
>>>> interesting
>>>> >> possibilities.
>>>> >>
>>>> >> For example, having gossip in its own process would decrease the
>>>> odds that
>>>> >> a node is marked dead because STW GC is happening in the storage
>>>> JVM. On
>>>> >> the flipside, you'd need checks to make sure that the gossip process
>>>> can
>>>> >> recognize when the storage process has died vs just running a long
>>>> GC.
>>>> >>
>>>> >> I don't know that I'd go so far as to have separate processes for
>>>> >> keyspaces, etc.
>>>> >>
>>>> >> There is probably some interesting work that could be done to
>>>> support the
>>>> >> orgs who run multiple cassandra instances on the same node (multiple
>>>> >> gossipers in that case is at least a little wasteful).
>>>> >>
>>>> >> I've also played around with using domain sockets for IPC inside of
>>>> >> cassandra. I never ran a proper benchmark, but there were some
>>>> throughput
>>>> >> advantages to this approach.
>>>> >>
>>>> >> Cheers,
>>>> >>
>>>> >> Gary.
>>>> >>
>>>> >>
>>>> >> On Thu, Feb 22, 2018 at 8:39 PM, Carl Mueller <
>>>> carl.muel...@smartthings.com>
>>>> >> wrote:
>>>> >>
>>>> >>> GC pauses may have been improved in newer releases, since we are in
>>>> 2.1.x,
>>>> >>> but I was wondering why cassandra uses one jvm for all tables and
>>>> >>> keyspaces, intermingling the heap for on-JVM objects.
>>>> >>>
>>>> >>> ... so why doesn't cassandra spin off a jvm per table so each jvm
>>>> can be
>>>> >>> tuned per table and gc tuned and gc impacts not impact other
>>>> tables? It
>>>> >>> would probably increase the number of endpoints if we avoid having
>>>> an
>>>> >>> overarching query router.
>>>> >>>
>>>> >
>>>> > -
>>>> > To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
>>>> > For additional commands, e-mail: dev-h...@cassandra.apache.org
>>>> >
>>>>
>>>>
>>>> -
>>>> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
>>>> For additional commands, e-mail: dev-h...@cassandra.apache.org
>>>>
>>>>
>>>
>>
>


Re: Why isn't there a separate JVM per table?

2018-02-22 Thread Carl Mueller
Bloom filters... never mind


On Thu, Feb 22, 2018 at 4:48 PM, Carl Mueller <carl.muel...@smartthings.com>
wrote:

> Is the current reason for a large starting heap due to the memtable?
>
> On Thu, Feb 22, 2018 at 4:44 PM, Carl Mueller <
> carl.muel...@smartthings.com> wrote:
>
>>  ... compaction on its own jvm was also something I was thinking about,
>> but then I realized even more JVM sharding could be done at the table level.
>>
>> On Thu, Feb 22, 2018 at 4:09 PM, Jon Haddad <j...@jonhaddad.com> wrote:
>>
>>> Yeah, I’m in the compaction on it’s own JVM camp, in an ideal world
>>> where we’re isolating crazy GC churning parts of the DB.  It would mean
>>> reworking how tasks are created and removal of all shared state in favor of
>>> messaging + a smarter manager, which imo would be a good idea regardless.
>>>
>>> It might be a better use of time (especially for 4.0) to do some GC
>>> performance profiling and cut down on the allocations, since that doesn’t
>>> involve a massive effort.
>>>
>>> I’ve been meaning to do a little benchmarking and profiling for a while
>>> now, and it seems like a few others have the same inclination as well,
>>> maybe now is a good time to coordinate that.  A nice perf bump for 4.0
>>> would be very rewarding.
>>>
>>> Jon
>>>
>>> > On Feb 22, 2018, at 2:00 PM, Nate McCall <zznat...@gmail.com> wrote:
>>> >
>>> > I've heard a couple of folks pontificate on compaction in its own
>>> > process as well, given it has such a high impact on GC. Not sure about
>>> > the value of individual tables. Interesting idea though.
>>> >
>>> > On Fri, Feb 23, 2018 at 10:45 AM, Gary Dusbabek <gdusba...@gmail.com>
>>> wrote:
>>> >> I've given it some thought in the past. In the end, I usually talk
>>> myself
>>> >> out of it because I think it increases the surface area for failure.
>>> That
>>> >> is, managing N processes is more difficult that managing one process.
>>> But
>>> >> if the additional failure modes are addressed, there are some
>>> interesting
>>> >> possibilities.
>>> >>
>>> >> For example, having gossip in its own process would decrease the odds
>>> that
>>> >> a node is marked dead because STW GC is happening in the storage JVM.
>>> On
>>> >> the flipside, you'd need checks to make sure that the gossip process
>>> can
>>> >> recognize when the storage process has died vs just running a long GC.
>>> >>
>>> >> I don't know that I'd go so far as to have separate processes for
>>> >> keyspaces, etc.
>>> >>
>>> >> There is probably some interesting work that could be done to support
>>> the
>>> >> orgs who run multiple cassandra instances on the same node (multiple
>>> >> gossipers in that case is at least a little wasteful).
>>> >>
>>> >> I've also played around with using domain sockets for IPC inside of
>>> >> cassandra. I never ran a proper benchmark, but there were some
>>> throughput
>>> >> advantages to this approach.
>>> >>
>>> >> Cheers,
>>> >>
>>> >> Gary.
>>> >>
>>> >>
>>> >> On Thu, Feb 22, 2018 at 8:39 PM, Carl Mueller <
>>> carl.muel...@smartthings.com>
>>> >> wrote:
>>> >>
>>> >>> GC pauses may have been improved in newer releases, since we are in
>>> 2.1.x,
>>> >>> but I was wondering why cassandra uses one jvm for all tables and
>>> >>> keyspaces, intermingling the heap for on-JVM objects.
>>> >>>
>>> >>> ... so why doesn't cassandra spin off a jvm per table so each jvm
>>> can be
>>> >>> tuned per table and gc tuned and gc impacts not impact other tables?
>>> It
>>> >>> would probably increase the number of endpoints if we avoid having an
>>> >>> overarching query router.
>>> >>>
>>> >
>>> > -
>>> > To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
>>> > For additional commands, e-mail: dev-h...@cassandra.apache.org
>>> >
>>>
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
>>> For additional commands, e-mail: dev-h...@cassandra.apache.org
>>>
>>>
>>
>


Re: Why isn't there a separate JVM per table?

2018-02-22 Thread Carl Mueller
Is the current reason for a large starting heap due to the memtable?

On Thu, Feb 22, 2018 at 4:44 PM, Carl Mueller <carl.muel...@smartthings.com>
wrote:

>  ... compaction on its own jvm was also something I was thinking about,
> but then I realized even more JVM sharding could be done at the table level.
>
> On Thu, Feb 22, 2018 at 4:09 PM, Jon Haddad <j...@jonhaddad.com> wrote:
>
>> Yeah, I’m in the compaction on it’s own JVM camp, in an ideal world where
>> we’re isolating crazy GC churning parts of the DB.  It would mean reworking
>> how tasks are created and removal of all shared state in favor of messaging
>> + a smarter manager, which imo would be a good idea regardless.
>>
>> It might be a better use of time (especially for 4.0) to do some GC
>> performance profiling and cut down on the allocations, since that doesn’t
>> involve a massive effort.
>>
>> I’ve been meaning to do a little benchmarking and profiling for a while
>> now, and it seems like a few others have the same inclination as well,
>> maybe now is a good time to coordinate that.  A nice perf bump for 4.0
>> would be very rewarding.
>>
>> Jon
>>
>> > On Feb 22, 2018, at 2:00 PM, Nate McCall <zznat...@gmail.com> wrote:
>> >
>> > I've heard a couple of folks pontificate on compaction in its own
>> > process as well, given it has such a high impact on GC. Not sure about
>> > the value of individual tables. Interesting idea though.
>> >
>> > On Fri, Feb 23, 2018 at 10:45 AM, Gary Dusbabek <gdusba...@gmail.com>
>> wrote:
>> >> I've given it some thought in the past. In the end, I usually talk
>> myself
>> >> out of it because I think it increases the surface area for failure.
>> That
>> >> is, managing N processes is more difficult that managing one process.
>> But
>> >> if the additional failure modes are addressed, there are some
>> interesting
>> >> possibilities.
>> >>
>> >> For example, having gossip in its own process would decrease the odds
>> that
>> >> a node is marked dead because STW GC is happening in the storage JVM.
>> On
>> >> the flipside, you'd need checks to make sure that the gossip process
>> can
>> >> recognize when the storage process has died vs just running a long GC.
>> >>
>> >> I don't know that I'd go so far as to have separate processes for
>> >> keyspaces, etc.
>> >>
>> >> There is probably some interesting work that could be done to support
>> the
>> >> orgs who run multiple cassandra instances on the same node (multiple
>> >> gossipers in that case is at least a little wasteful).
>> >>
>> >> I've also played around with using domain sockets for IPC inside of
>> >> cassandra. I never ran a proper benchmark, but there were some
>> throughput
>> >> advantages to this approach.
>> >>
>> >> Cheers,
>> >>
>> >> Gary.
>> >>
>> >>
>> >> On Thu, Feb 22, 2018 at 8:39 PM, Carl Mueller <
>> carl.muel...@smartthings.com>
>> >> wrote:
>> >>
>> >>> GC pauses may have been improved in newer releases, since we are in
>> 2.1.x,
>> >>> but I was wondering why cassandra uses one jvm for all tables and
>> >>> keyspaces, intermingling the heap for on-JVM objects.
>> >>>
>> >>> ... so why doesn't cassandra spin off a jvm per table so each jvm can
>> be
>> >>> tuned per table and gc tuned and gc impacts not impact other tables?
>> It
>> >>> would probably increase the number of endpoints if we avoid having an
>> >>> overarching query router.
>> >>>
>> >
>> > -
>> > To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
>> > For additional commands, e-mail: dev-h...@cassandra.apache.org
>> >
>>
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
>> For additional commands, e-mail: dev-h...@cassandra.apache.org
>>
>>
>


Re: Why isn't there a separate JVM per table?

2018-02-22 Thread Carl Mueller
 ... compaction on its own JVM was also something I was thinking about, but
then I realized even more JVM sharding could be done at the table level.

On Thu, Feb 22, 2018 at 4:09 PM, Jon Haddad <j...@jonhaddad.com> wrote:

> Yeah, I’m in the compaction on it’s own JVM camp, in an ideal world where
> we’re isolating crazy GC churning parts of the DB.  It would mean reworking
> how tasks are created and removal of all shared state in favor of messaging
> + a smarter manager, which imo would be a good idea regardless.
>
> It might be a better use of time (especially for 4.0) to do some GC
> performance profiling and cut down on the allocations, since that doesn’t
> involve a massive effort.
>
> I’ve been meaning to do a little benchmarking and profiling for a while
> now, and it seems like a few others have the same inclination as well,
> maybe now is a good time to coordinate that.  A nice perf bump for 4.0
> would be very rewarding.
>
> Jon
>
> > On Feb 22, 2018, at 2:00 PM, Nate McCall <zznat...@gmail.com> wrote:
> >
> > I've heard a couple of folks pontificate on compaction in its own
> > process as well, given it has such a high impact on GC. Not sure about
> > the value of individual tables. Interesting idea though.
> >
> > On Fri, Feb 23, 2018 at 10:45 AM, Gary Dusbabek <gdusba...@gmail.com>
> wrote:
> >> I've given it some thought in the past. In the end, I usually talk
> myself
> >> out of it because I think it increases the surface area for failure.
> That
> >> is, managing N processes is more difficult than managing one process.
> But
> >> if the additional failure modes are addressed, there are some
> interesting
> >> possibilities.
> >>
> >> For example, having gossip in its own process would decrease the odds
> that
> >> a node is marked dead because STW GC is happening in the storage JVM. On
> >> the flipside, you'd need checks to make sure that the gossip process can
> >> recognize when the storage process has died vs just running a long GC.
> >>
> >> I don't know that I'd go so far as to have separate processes for
> >> keyspaces, etc.
> >>
> >> There is probably some interesting work that could be done to support
> the
> >> orgs who run multiple cassandra instances on the same node (multiple
> >> gossipers in that case is at least a little wasteful).
> >>
> >> I've also played around with using domain sockets for IPC inside of
> >> cassandra. I never ran a proper benchmark, but there were some
> throughput
> >> advantages to this approach.
> >>
> >> Cheers,
> >>
> >> Gary.
> >>
> >>
> >> On Thu, Feb 22, 2018 at 8:39 PM, Carl Mueller <
> carl.muel...@smartthings.com>
> >> wrote:
> >>
> >>> GC pauses may have been improved in newer releases, since we are in
> 2.1.x,
> >>> but I was wondering why cassandra uses one jvm for all tables and
> >>> keyspaces, intermingling the heap for on-JVM objects.
> >>>
> >>> ... so why doesn't cassandra spin off a jvm per table so each jvm can
> be
> >>> tuned per table and gc tuned and gc impacts not impact other tables? It
> >>> would probably increase the number of endpoints if we avoid having an
> >>> overarching query router.
> >>>
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> > For additional commands, e-mail: dev-h...@cassandra.apache.org
> >
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: dev-h...@cassandra.apache.org
>
>


Why isn't there a separate JVM per table?

2018-02-22 Thread Carl Mueller
GC pauses may have been improved in newer releases, since we are still on
2.1.x, but I was wondering why Cassandra uses one JVM for all tables and
keyspaces, intermingling their on-JVM objects in a single heap.

... so why doesn't Cassandra spin off a JVM per table, so that each JVM and
its GC can be tuned per table and GC impacts in one table don't affect the
others? It would probably increase the number of endpoints if we avoid
having an overarching query router.


penn state academic paper - "scalable" bloom filters

2018-02-22 Thread Carl Mueller
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.62.7953&rep=rep1&type=pdf

It looks to be an adaptive approach where the "initial guess" bloom filters
are enhanced with more layers of filters generated after usage stats are
gained.

Disclaimer: I suck at reading academic papers.
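
A minimal sketch of the layered idea as described above (my own
illustration, not the paper's exact construction): keep a stack of filters,
start a new and larger layer when the current one reaches its target
capacity, and have membership checks consult every layer:

import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class LayeredBloomFilter {
    private static final class Layer {
        final BitSet bits;
        final int numBits;
        final int numHashes;
        final int capacity;
        int count;

        Layer(int capacity, int numBits, int numHashes) {
            this.capacity = capacity;
            this.numBits = numBits;
            this.numHashes = numHashes;
            this.bits = new BitSet(numBits);
        }

        void add(long keyHash) {
            for (int i = 0; i < numHashes; i++)
                bits.set(indexFor(keyHash, i));
            count++;
        }

        boolean mightContain(long keyHash) {
            for (int i = 0; i < numHashes; i++)
                if (!bits.get(indexFor(keyHash, i)))
                    return false;
            return true;
        }

        private int indexFor(long keyHash, int i) {
            // Kirsch-Mitzenmacher style double hashing: h1 + i * h2.
            long h1 = keyHash;
            long h2 = (keyHash >>> 32) | 1;        // force odd so strides differ
            return (int) Math.floorMod(h1 + i * h2, numBits);
        }
    }

    private final List<Layer> layers = new ArrayList<>();
    private int nextCapacity;

    public LayeredBloomFilter(int initialCapacity) {
        this.nextCapacity = initialCapacity;
        addLayer();
    }

    private void addLayer() {
        // ~10 bits/element with 7 hashes gives roughly a 1% false-positive rate.
        layers.add(new Layer(nextCapacity, nextCapacity * 10, 7));
        nextCapacity *= 2;                         // each new layer is larger
    }

    public void add(Object key) {
        Layer current = layers.get(layers.size() - 1);
        if (current.count >= current.capacity) {
            addLayer();                            // "scale" by stacking a new layer
            current = layers.get(layers.size() - 1);
        }
        current.add(hash(key));
    }

    public boolean mightContain(Object key) {
        long h = hash(key);
        for (Layer layer : layers)
            if (layer.mightContain(h))
                return true;
        return false;
    }

    private static long hash(Object key) {
        // Cheap 64-bit mix of the JDK hashCode; a real filter would use murmur3.
        long h = key.hashCode();
        h ^= (h << 21); h ^= (h >>> 35); h ^= (h << 4);
        return h;
    }
}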


Memtable flush -> SSTable: customizable or same for all compaction strategies?

2018-02-20 Thread Carl Mueller
When memtables/CommitLogs are flushed to disk as sstables, does the sstable
go through an organization specific to each compaction strategy, or is
sstable creation the same for all compaction strategies, with it left up to
the compaction strategy to recompact the sstable later if desired?
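
As far as I understand the code (worth verifying), flush writes sstables the
same way for every strategy, and the strategy is only notified of the new
sstable and decides later whether to reorganize it. A conceptual sketch of
that split, not Cassandra's actual classes:

import java.util.Collection;
import java.util.Optional;

// Conceptual only: flush produces sstables identically for every strategy;
// the strategy is merely notified and later picks sstables to recompact.
interface SSTable {}

interface CompactionStrategy {
    /** Called after a memtable flush (or streaming) adds a new sstable. */
    void addSSTable(SSTable flushed);

    /** Called by a background thread; empty when there is nothing to do. */
    Optional<Collection<SSTable>> nextBackgroundCompaction();
}

final class FlushPath {
    static SSTable flush(Object memtable, CompactionStrategy strategy) {
        SSTable sstable = writeGenericSSTable(memtable); // same writer for all strategies
        strategy.addSSTable(sstable);                    // strategy reorganizes later, if at all
        return sstable;
    }

    private static SSTable writeGenericSSTable(Object memtable) {
        return new SSTable() {};                         // stand-in for the real writer
    }
}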


Re: scheduled work compaction strategy

2018-02-17 Thread Carl Mueller
I'm probably going to take a shot at doing it, basing it off of TWCS, but I
don't know the fundamentals of compaction strategies and the code that well.
Fundamentally you have memtable sets being flushed out to sstables, then
those sstables being reprocessed by background threads, and then bloom
filters being built from the sstables, though that last part might just be a
Cassandra service/method call.

For example, what are "aggressive tombstone sub-properties"? Is that metadata
attached to sstables about the tombstones within them?
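
For anyone else wondering: the "tombstone sub-properties" Jeff mentions are
most likely the per-table compaction sub-options tombstone_threshold,
tombstone_compaction_interval, and unchecked_tombstone_compaction. A rough
sketch of how a custom strategy might read them from its options map
(defaults are the commonly documented ones; verify against your Cassandra
version):

import java.util.Map;

public class TombstoneOptions {
    // Droppable-tombstone ratio that triggers a single-sstable compaction.
    public final double tombstoneThreshold;
    // Seconds an sstable must age before such a compaction is considered.
    public final long tombstoneCompactionInterval;
    // Skip the overlap pre-check, i.e. the "aggressive" knob.
    public final boolean uncheckedTombstoneCompaction;

    public TombstoneOptions(Map<String, String> options) {
        this.tombstoneThreshold =
            Double.parseDouble(options.getOrDefault("tombstone_threshold", "0.2"));
        this.tombstoneCompactionInterval =
            Long.parseLong(options.getOrDefault("tombstone_compaction_interval", "86400"));
        this.uncheckedTombstoneCompaction =
            Boolean.parseBoolean(options.getOrDefault("unchecked_tombstone_compaction", "false"));
    }
}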

On Fri, Feb 16, 2018 at 8:17 PM, Jeff Jirsa <jji...@gmail.com> wrote:

> There’s a company using TWCS in this config - I’m not going to out them,
> but I think they do it (or used to) with aggressive tombstone sub
> properties. They may have since extended/enhanced it somewhat.
>
> --
> Jeff Jirsa
>
>
> > On Feb 16, 2018, at 2:24 PM, Carl Mueller <carl.muel...@smartthings.com>
> wrote:
> >
> > Oh and as a further refinement outside of our use case.
> >
> > If we could group/organize the sstables by the rowkey time value or
> > inherent TTL value, the naive version would be evenly distributed buckets
> > into the future.
> >
> > But many/most data patterns like this have "busy" data in the near term.
> > Far out scheduled stuff would be more sparse. In our case, 50% of the
> data
> > is in the first 12 hours, 50% of the remaining in the next day or two,
> 50%
> > of the remaining in the next week, etc etc.
> >
> > So we could have a "long term" general bucket to take data far in the
> > future. But here's the thing, if we could actively process the "long
> term"
> > sstable on a regular basis into two sstables: the stuff that is still
> "long
> > term" and sstables for the "near term", that could solve many general
> > cases. The "long term" bucket could even be STCS by default, and as the
> > near term comes into play, that is considered a different "level".
> >
> > Of course all this relies on the ability to look at the data in the
> rowkey
> > or the TTL associated with the row.
> >
> > On Fri, Feb 16, 2018 at 4:17 PM, Carl Mueller <
> carl.muel...@smartthings.com>
> > wrote:
> >
> >> We have a scheduler app here at smartthings, where we track per-second
> >> tasks to be executed.
> >>
> >> These are all TTL'd to be destroyed after the second the event was
> >> registered with has passed.
> >>
> >> If the scheduling window was sufficiently small, say, 1 day, we could
> >> probably use a time window compaction strategy with this. But the
> window is
> >> one-two years worth of adhoc event registration per the contract.
> >>
> >> Thus, the intermingling of all this data TTL'ing at the different times
> >> since they are registered at different times means the sstables are not
> >> written with data TTLing in the same rough time period. If they were,
> then
> >> compaction would be a relatively easy process since the entire sstable
> >> would tombstone.
> >>
> >> We could kind of do this by doing sharded tables for the time periods
> and
> >> rotating the shards for duty, and truncating them as they are recycled.
> >>
> >> But an elegant way would be a custom compaction strategy that would
> >> "window" the data into clustered sstables that could be compacted with
> >> other similarly time bucketed sstables.
> >>
> >> This would require visibility into the rowkey when it came time to
> convert
> >> the memtable data to sstables. Is that even possible with compaction
> >> schemes? We would provide a requirement that the time-based data would
> be
> >> in the row key if it is a composite row key, making it required.
> >>
> >>
> >>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: dev-h...@cassandra.apache.org
>
>


Re: scheduled work compaction strategy

2018-02-16 Thread Carl Mueller
An even MORE complicated version could address the case where the TTLs are
at the column key rather than the row key. That would divide a rowkey's data
across sstables, in essence the opposite of what most compaction strategies
try to do: eventually centralize the data for a rowkey in one sstable. This
strategy assumes TTLs would be cleaning up these row fragments, so that
spreading the data across many, many sstables wouldn't pollute the bloom
filters too much.

On Fri, Feb 16, 2018 at 4:24 PM, Carl Mueller <carl.muel...@smartthings.com>
wrote:

> Oh and as a further refinement outside of our use case.
>
> If we could group/organize the sstables by the rowkey time value or
> inherent TTL value, the naive version would be evenly distributed buckets
> into the future.
>
> But many/most data patterns like this have "busy" data in the near term.
> Far out scheduled stuff would be more sparse. In our case, 50% of the data
> is in the first 12 hours, 50% of the remaining in the next day or two, 50%
> of the remaining in the next week, etc etc.
>
> So we could have a "long term" general bucket to take data far in the
> future. But here's the thing, if we could actively process the "long term"
> sstable on a regular basis into two sstables: the stuff that is still "long
> term" and sstables for the "near term", that could solve many general
> cases. The "long term" bucket could even be STCS by default, and as the
> near term comes into play, that is considered a different "level".
>
> Of course all this relies on the ability to look at the data in the rowkey
> or the TTL associated with the row.
>
> On Fri, Feb 16, 2018 at 4:17 PM, Carl Mueller <
> carl.muel...@smartthings.com> wrote:
>
>> We have a scheduler app here at smartthings, where we track per-second
>> tasks to be executed.
>>
>> These are all TTL'd to be destroyed after the second the event was
>> registered with has passed.
>>
>> If the scheduling window was sufficiently small, say, 1 day, we could
>> probably use a time window compaction strategy with this. But the window is
>> one-two years worth of adhoc event registration per the contract.
>>
>> Thus, the intermingling of all this data TTL'ing at the different times
>> since they are registered at different times means the sstables are not
>> written with data TTLing in the same rough time period. If they were, then
>> compaction would be a relatively easy process since the entire sstable
>> would tombstone.
>>
>> We could kind of do this by doing sharded tables for the time periods and
>> rotating the shards for duty, and truncating them as they are recycled.
>>
>> But an elegant way would be a custom compaction strategy that would
>> "window" the data into clustered sstables that could be compacted with
>> other similarly time bucketed sstables.
>>
>> This would require visibility into the rowkey when it came time to
>> convert the memtable data to sstables. Is that even possible with
>> compaction schemes? We would provide a requirement that the time-based data
>> would be in the row key if it is a composite row key, making it required.
>>
>>
>>
>


Re: scheduled work compaction strategy

2018-02-16 Thread Carl Mueller
Oh, and as a further refinement beyond our use case:

If we could group/organize the sstables by the rowkey time value or the
inherent TTL value, the naive version would be evenly distributed buckets
into the future.

But many/most data patterns like this have "busy" data in the near term;
far-out scheduled stuff would be more sparse. In our case, 50% of the data
is in the first 12 hours, 50% of the remaining in the next day or two, 50%
of the remaining in the next week, and so on.

So we could have a "long term" general bucket to take data far in the
future. But here's the thing: if we could regularly reprocess the "long
term" sstable into two sets, the stuff that is still "long term" and
sstables for the "near term", that could solve many general cases. The
"long term" bucket could even be STCS by default, and as the near term
comes into play, that is considered a different "level".

Of course, all of this relies on the ability to look at the time value in
the rowkey or the TTL associated with the row.
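
A rough sketch of the uneven bucketing described above (illustrative only,
not a proposed Cassandra API): small windows for near-term expirations,
coarser windows farther out, and a catch-all long-term bucket that would be
re-split as its contents approach the horizon:

import java.time.Duration;
import java.time.Instant;

public final class ExpiryBuckets {
    private static final Duration HORIZON = Duration.ofDays(30);

    public static String bucketFor(Instant now, Instant expiry) {
        Duration untilExpiry = Duration.between(now, expiry);
        if (untilExpiry.compareTo(HORIZON) > 0) {
            return "long-term"; // re-split periodically as contents near the horizon
        }
        // Window size grows with distance: hourly inside 12h, daily inside 3d, weekly otherwise.
        Duration window;
        if (untilExpiry.compareTo(Duration.ofHours(12)) <= 0) window = Duration.ofHours(1);
        else if (untilExpiry.compareTo(Duration.ofDays(3)) <= 0) window = Duration.ofDays(1);
        else window = Duration.ofDays(7);

        long windowMillis = window.toMillis();
        long windowStart = Math.floorDiv(expiry.toEpochMilli(), windowMillis) * windowMillis;
        // Window size plus window start uniquely identifies the bucket.
        return window + "@" + windowStart;
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        System.out.println(bucketFor(now, now.plus(Duration.ofHours(3))));  // hourly bucket
        System.out.println(bucketFor(now, now.plus(Duration.ofDays(2))));   // daily bucket
        System.out.println(bucketFor(now, now.plus(Duration.ofDays(400)))); // "long-term"
    }
}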

On Fri, Feb 16, 2018 at 4:17 PM, Carl Mueller <carl.muel...@smartthings.com>
wrote:

> We have a scheduler app here at smartthings, where we track per-second
> tasks to be executed.
>
> These are all TTL'd to be destroyed after the second the event was
> registered with has passed.
>
> If the scheduling window was sufficiently small, say, 1 day, we could
> probably use a time window compaction strategy with this. But the window is
> one-two years worth of adhoc event registration per the contract.
>
> Thus, the intermingling of all this data TTL'ing at the different times
> since they are registered at different times means the sstables are not
> written with data TTLing in the same rough time period. If they were, then
> compaction would be a relatively easy process since the entire sstable
> would tombstone.
>
> We could kind of do this by doing sharded tables for the time periods and
> rotating the shards for duty, and truncating them as they are recycled.
>
> But an elegant way would be a custom compaction strategy that would
> "window" the data into clustered sstables that could be compacted with
> other similarly time bucketed sstables.
>
> This would require visibility into the rowkey when it came time to convert
> the memtable data to sstables. Is that even possible with compaction
> schemes? We would provide a requirement that the time-based data would be
> in the row key if it is a composite row key, making it required.
>
>
>


scheduled work compaction strategy

2018-02-16 Thread Carl Mueller
We have a scheduler app here at SmartThings, where we track per-second
tasks to be executed.

These are all TTL'd to be destroyed once the second the event was
registered for has passed.

If the scheduling window were sufficiently small, say one day, we could
probably use a time window compaction strategy for this. But the window is
one to two years' worth of ad hoc event registration per the contract.

Thus, because events are registered at different times and therefore TTL at
different times, the sstables are not written with data that TTLs in the
same rough time period. If they were, compaction would be a relatively easy
process, since the entire sstable would tombstone at once.

We could roughly approximate this by sharding tables by time period,
rotating the shards for duty and truncating them as they are recycled.

But a more elegant way would be a custom compaction strategy that "windows"
the data into clustered sstables that can be compacted with other similarly
time-bucketed sstables.

This would require visibility into the rowkey when it came time to convert
the memtable data to sstables. Is that even possible with compaction
schemes? We would make it a requirement that the time-based value be part
of the row key (or of the composite row key, if there is one).
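
For what it's worth, TWCS avoids needing rowkey visibility by bucketing on
per-sstable metadata (the max data timestamp) rather than inspecting keys at
flush time, as far as I understand it. A sketch of grouping on a similar
per-sstable expiry statistic, purely illustrative and not Cassandra's API:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public final class ExpiryWindowGrouper {
    /** Minimal stand-in for the per-sstable stats we'd need. */
    public static final class SSTableStats {
        public final String name;
        public final long maxExpiryMillis;   // latest TTL expiration in the sstable
        public SSTableStats(String name, long maxExpiryMillis) {
            this.name = name;
            this.maxExpiryMillis = maxExpiryMillis;
        }
    }

    /** Buckets sstables by expiry window; each bucket is a candidate compaction. */
    public static Map<Long, List<SSTableStats>> group(List<SSTableStats> sstables,
                                                      long windowMillis) {
        Map<Long, List<SSTableStats>> buckets = new HashMap<>();
        for (SSTableStats s : sstables) {
            long window = s.maxExpiryMillis / windowMillis * windowMillis;
            buckets.computeIfAbsent(window, w -> new ArrayList<>()).add(s);
        }
        return buckets;
    }
}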


Re: row tombstones as a separate sstable citizen

2018-02-16 Thread Carl Mueller
Re: the tombstone sstables being read-only inputs to compaction: there
would be one case where the non-tombstone sstables would be inputs to the
compaction of the row tombstones, namely when the row no longer exists in
any of the data sstables with respect to the row tombstone timestamp (at
which point the tombstone itself can be purged).

There may be other opportunities for simplified processing of the row
tombstone sstables, as they are pure key-value (row key : deletion flag)
rather than columnar data. We may be able to offer the option of a memory
map if the row tombstones fit in a sufficiently small space. The "row
cache" may be way simpler for these than the general row-cache difficulties
for Cassandra data. Those caches could also be loaded only during
compaction operations.
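
A rough sketch of that purge rule (illustrative; a real check would consult
bloom filters and per-sstable min timestamps):

import java.util.Collection;

public final class RowTombstonePurge {

    /** Stand-in for whatever per-sstable check a real implementation would use. */
    public interface DataSSTable {
        /** May this sstable still hold writes to rowKey older than the given timestamp? */
        boolean mayContainOlderData(byte[] rowKey, long timestamp);
    }

    public static boolean canDropTombstone(byte[] rowKey,
                                           long tombstoneTimestamp,
                                           boolean gcGraceElapsed,
                                           Collection<DataSSTable> dataSSTables) {
        if (!gcGraceElapsed) {
            return false; // must survive gc_grace_seconds so it can still reach lagging replicas
        }
        for (DataSSTable sstable : dataSSTables) {
            if (sstable.mayContainOlderData(rowKey, tombstoneTimestamp)) {
                return false; // the tombstone is still shadowing (or may shadow) something
            }
        }
        return true; // the data sstables have "fed back" into this compaction: nothing left to delete
    }
}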

On Thu, Feb 15, 2018 at 11:24 AM, Jeff Jirsa <jji...@gmail.com> wrote:

> Worth a JIRA, yes
>
>
> On Wed, Feb 14, 2018 at 9:45 AM, Carl Mueller <
> carl.muel...@smartthings.com>
> wrote:
>
> > So is this at least a decent candidate for a feature request ticket?
> >
> >
> > On Tue, Feb 13, 2018 at 8:09 PM, Carl Mueller <
> > carl.muel...@smartthings.com>
> > wrote:
> >
> > > I'm particularly interested in getting the tombstones to "promote" up
> the
> > > levels of LCS more quickly. Currently they get attached at the low
> level
> > > and don't propagate up to higher levels until enough activity at a
> lower
> > > level promotes the data. Meanwhile, LCS means compactions can occur in
> > > parallel at each level. So row tombstones in their own sstable could be
> > up
> > > promoted the LCS levels preferentially before normal processes would
> move
> > > them up.
> > >
> > > So if the delete-only sstables could move up more quickly, the
> compaction
> > > at the levels would happen more quickly.
> > >
> > > The threshold stuff is nice if I read 7019 correctly, but what is the %
> > > there? % of rows? % of columns? or % of the size of the sstable? Row
> > > tombstones are pretty compact being just the rowkey and the tombstone
> > > marker. So if 7019 is triggered at 10% of the sstable size, even a
> > crapton
> > > of tombstones deleting practically the entire database would only be a
> > small
> > > % size of the sstable.
> > >
> > > Since the row tombstones are so compact, that's why I think they are
> good
> > > candidates for special handling.
> > >
> > > On Tue, Feb 13, 2018 at 5:22 PM, J. D. Jordan <
> jeremiah.jor...@gmail.com
> > >
> > > wrote:
> > >
> > >> Have you taken a look at the new stuff introduced by
> > >> https://issues.apache.org/jira/browse/CASSANDRA-7019 ?  I think it
> may
> > >> go a ways to reducing the need for something complicated like this.
> > >> Though it is an interesting idea as special handling for bulk deletes.
> > >> If they were truly just sstables that only contained deletes the logic
> > from
> > >> 7019 would probably go a long ways. Though if you are bulk inserting
> > >> deletes that is what you would end up with, so maybe it already works.
> > >>
> > >> -Jeremiah
> > >>
> > >> > On Feb 13, 2018, at 6:04 PM, Jeff Jirsa <jji...@gmail.com> wrote:
> > >> >
> > >> > On Tue, Feb 13, 2018 at 2:38 PM, Carl Mueller <
> > >> carl.muel...@smartthings.com>
> > >> > wrote:
> > >> >
> > >> >> In process of doing my second major data purge from a cassandra
> > system.
> > >> >>
> > >> >> Almost all of my purging is done via row tombstones. While
> performing
> > >> this
> > >> >> the second time while trying to cajole compaction to occur (in
> 2.1.x,
> > >> >> LevelledCompaction) to goddamn actually compact the data, I've been
> > >> >> thinking as to why there isn't a separate set of sstable
> > infrastructure
> > >> >> setup for row deletion tombstones.
> > >> >>
> > >> >> I'm imagining that row tombstones are written to separate sstables
> > than
> > >> >> mainline data updates/appends and range/column tombstones.
> > >> >>
> > >> >> By writing them to separate sstables, the compaction systems can
> > >> >> preferentially merge / process them when compacting sstables.
> > >> >>
> > >> >> This would create an additional sstable for lookup in the bloom
> > >>

Re: row tombstones as a separate sstable citizen

2018-02-14 Thread Carl Mueller
So is this at least a decent candidate for a feature request ticket?


On Tue, Feb 13, 2018 at 8:09 PM, Carl Mueller <carl.muel...@smartthings.com>
wrote:

> I'm particularly interested in getting the tombstones to "promote" up the
> levels of LCS more quickly. Currently they get attached at the low level
> and don't propagate up to higher levels until enough activity at a lower
> level promotes the data. Meanwhile, LCS means compactions can occur in
> parallel at each level. So row tombstones in their own sstable could be up
> promoted the LCS levels preferentially before normal processes would move
> them up.
>
> So if the delete-only sstables could move up more quickly, the compaction
> at the levels would happen more quickly.
>
> The threshold stuff is nice if I read 7019 correctly, but what is the %
> there? % of rows? % of columns? or % of the size of the sstable? Row
> tombstones are pretty compact being just the rowkey and the tombstone
> marker. So if 7019 is triggered at 10% of the sstable size, even a crapton
> of tombstones deleting practically the entire database would only be a small
> % size of the sstable.
>
> Since the row tombstones are so compact, that's why I think they are good
> candidates for special handling.
>
> On Tue, Feb 13, 2018 at 5:22 PM, J. D. Jordan <jeremiah.jor...@gmail.com>
> wrote:
>
>> Have you taken a look at the new stuff introduced by
>> https://issues.apache.org/jira/browse/CASSANDRA-7019 ?  I think it may
>> go a ways to reducing the need for something complicated like this.
>> Though it is an interesting idea as special handling for bulk deletes.
>> If they were truly just sstables that only contained deletes the logic from
>> 7019 would probably go a long ways. Though if you are bulk inserting
>> deletes that is what you would end up with, so maybe it already works.
>>
>> -Jeremiah
>>
>> > On Feb 13, 2018, at 6:04 PM, Jeff Jirsa <jji...@gmail.com> wrote:
>> >
>> > On Tue, Feb 13, 2018 at 2:38 PM, Carl Mueller <
>> carl.muel...@smartthings.com>
>> > wrote:
>> >
>> >> In process of doing my second major data purge from a cassandra system.
>> >>
>> >> Almost all of my purging is done via row tombstones. While performing
>> this
>> >> the second time while trying to cajole compaction to occur (in 2.1.x,
>> >> LevelledCompaction) to goddamn actually compact the data, I've been
>> >> thinking as to why there isn't a separate set of sstable infrastructure
>> >> setup for row deletion tombstones.
>> >>
>> >> I'm imagining that row tombstones are written to separate sstables than
>> >> mainline data updates/appends and range/column tombstones.
>> >>
>> >> By writing them to separate sstables, the compaction systems can
>> >> preferentially merge / process them when compacting sstables.
>> >>
>> >> This would create an additional sstable for lookup in the bloom
>> filters,
>> >> granted. I had visions of short circuiting the lookups to other
>> sstables if
>> >> a row tombstone was present in one of the special row tombstone
>> sstables.
>> >>
>> >>
>> > All of the above sounds really interesting to me, but I suspect it's a
>> LOT
>> > of work to make it happen correctly.
>> >
>> > You'd almost end up with 2 sets of logs for the LSM - a tombstone
>> > log/generation, and a data log/generation, and the tombstone logs would
>> be
>> > read-only inputs to data compactions.
>> >
>> >
>> >> But that would only be possible if there was the notion of a "super row
>> >> tombstone" that permanently deleted a rowkey and all future writes
>> would be
>> >> invalidated. Kind of like how a tombstone with a mistakenly huge
>> timestamp
>> >> becomes a sneaky permanent tombstone, but intended. There could be a
>> >> special operation / statement to undo this permanent tombstone, and
>> since
>> >> the row tombstones would be in their own dedicated sstables, they could
>> >> process and compact more quickly, with prioritization by the compactor.
>> >>
>> >>
>> > This part sounds way less interesting to me (other than the fact you can
>> > already do this with a timestamp in the future, but it'll gc away at
>> gcgs).
>> >
>> >
>> >> I'm thinking there must be something I am forgetting in the
>> >> read/write/compaction paths that invalidate this.
>> >>
>> >
>> > There are a lot of places where we do "smart" things to make sure we
>> don't
>> > accidentally resurrect data. Read path includes old sstables for
>> tombstones
>> > for example. Those all need to be concretely identified and handled (and
>> > tested),.
>>
>
>


Re: row tombstones as a separate sstable citizen

2018-02-13 Thread Carl Mueller
I'm particularly interested in getting the tombstones to "promote" up the
levels of LCS more quickly. Currently they enter at the lowest level and
don't propagate up to higher levels until enough activity at a lower level
promotes the data. Meanwhile, LCS means compactions can occur in parallel
at each level. So row tombstones in their own sstables could be promoted up
the LCS levels preferentially, before the normal processes would move them
up.

So if the delete-only sstables could move up more quickly, the compaction
at the levels would happen more quickly.

The threshold stuff is nice if I read 7019 correctly, but what is the %
there? % of rows? % of columns? Or % of the size of the sstable? Row
tombstones are pretty compact, being just the rowkey and the tombstone
marker. So if 7019 is triggered at 10% of the sstable size, even a crapton
of tombstones deleting practically the entire database would only be a
small % of the sstable's size.

Since the row tombstones are so compact, that's why I think they are good
candidates for special handling.

On Tue, Feb 13, 2018 at 5:22 PM, J. D. Jordan <jeremiah.jor...@gmail.com>
wrote:

> Have you taken a look at the new stuff introduced by
> https://issues.apache.org/jira/browse/CASSANDRA-7019 ?  I think it may go
> a ways to reducing the need for something complicated like this.
> Though it is an interesting idea as special handling for bulk deletes.  If
> they were truly just sstables that only contained deletes the logic from
> 7019 would probably go a long ways. Though if you are bulk inserting
> deletes that is what you would end up with, so maybe it already works.
>
> -Jeremiah
>
> > On Feb 13, 2018, at 6:04 PM, Jeff Jirsa <jji...@gmail.com> wrote:
> >
> > On Tue, Feb 13, 2018 at 2:38 PM, Carl Mueller <
> carl.muel...@smartthings.com>
> > wrote:
> >
> >> In process of doing my second major data purge from a cassandra system.
> >>
> >> Almost all of my purging is done via row tombstones. While performing
> this
> >> the second time while trying to cajole compaction to occur (in 2.1.x,
> >> LevelledCompaction) to goddamn actually compact the data, I've been
> >> thinking as to why there isn't a separate set of sstable infrastructure
> >> setup for row deletion tombstones.
> >>
> >> I'm imagining that row tombstones are written to separate sstables than
> >> mainline data updates/appends and range/column tombstones.
> >>
> >> By writing them to separate sstables, the compaction systems can
> >> preferentially merge / process them when compacting sstables.
> >>
> >> This would create an additional sstable for lookup in the bloom filters,
> >> granted. I had visions of short circuiting the lookups to other
> sstables if
> >> a row tombstone was present in one of the special row tombstone
> sstables.
> >>
> >>
> > All of the above sounds really interesting to me, but I suspect it's a
> LOT
> > of work to make it happen correctly.
> >
> > You'd almost end up with 2 sets of logs for the LSM - a tombstone
> > log/generation, and a data log/generation, and the tombstone logs would
> be
> > read-only inputs to data compactions.
> >
> >
> >> But that would only be possible if there was the notion of a "super row
> >> tombstone" that permanently deleted a rowkey and all future writes
> would be
> >> invalidated. Kind of like how a tombstone with a mistakenly huge
> timestamp
> >> becomes a sneaky permanent tombstone, but intended. There could be a
> >> special operation / statement to undo this permanent tombstone, and
> since
> >> the row tombstones would be in their own dedicated sstables, they could
> >> process and compact more quickly, with prioritization by the compactor.
> >>
> >>
> > This part sounds way less interesting to me (other than the fact you can
> > already do this with a timestamp in the future, but it'll gc away at
> gcgs).
> >
> >
> >> I'm thinking there must be something I am forgetting in the
> >> read/write/compaction paths that invalidate this.
> >>
> >
> > There are a lot of places where we do "smart" things to make sure we
> don't
> > accidentally resurrect data. Read path includes old sstables for
> tombstones
> > for example. Those all need to be concretely identified and handled (and
> > tested),.
>


row tombstones as a separate sstable citizen

2018-02-13 Thread Carl Mueller
I'm in the process of doing my second major data purge from a Cassandra system.

Almost all of my purging is done via row tombstones. While performing this
a second time and trying to cajole compaction (LeveledCompaction on 2.1.x)
to goddamn actually compact the data, I've been wondering why there isn't a
separate set of sstable infrastructure set up for row deletion tombstones.

I'm imagining that row tombstones would be written to separate sstables
from mainline data updates/appends and range/column tombstones.

By writing them to separate sstables, the compaction systems can
preferentially merge / process them when compacting sstables.

This would create an additional sstable for lookup in the bloom filters,
granted. I had visions of short circuiting the lookups to other sstables if
a row tombstone was present in one of the special row tombstone sstables.

But that would only be possible if there were the notion of a "super row
tombstone" that permanently deleted a rowkey, so that all future writes to
it would be invalidated. Kind of like how a tombstone with a mistakenly huge
timestamp becomes a sneaky permanent tombstone, but intentional. There could
be a special operation/statement to undo this permanent tombstone, and since
the row tombstones would be in their own dedicated sstables, they could be
processed and compacted more quickly, with prioritization by the compactor.

I'm thinking there must be something I am forgetting in the
read/write/compaction paths that invalidates this.
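
A rough sketch of how compaction of the data sstables might consult such
tombstone-only sstables: load their (rowkey, deletion timestamp) entries and
drop any merged row whose newest write is older than the covering row
tombstone. Names and shapes are illustrative, not Cassandra's internal API:

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public final class RowTombstoneFilter {
    private final Map<ByteBuffer, Long> deletionTimes = new HashMap<>();

    /** Called once per row tombstone read from the tombstone-only sstables. */
    public void addRowTombstone(ByteBuffer rowKey, long deletionTimestamp) {
        deletionTimes.merge(rowKey, deletionTimestamp, Math::max);
    }

    /** True if the merged row survives, i.e. it has writes newer than any covering tombstone. */
    public boolean survives(ByteBuffer rowKey, long newestWriteTimestamp) {
        Long deletedAt = deletionTimes.get(rowKey);
        return deletedAt == null || newestWriteTimestamp > deletedAt;
    }

    public static void main(String[] args) {
        RowTombstoneFilter filter = new RowTombstoneFilter();
        ByteBuffer key = ByteBuffer.wrap("user:42".getBytes(StandardCharsets.UTF_8));
        filter.addRowTombstone(key, 2_000L);
        System.out.println(filter.survives(key, 1_500L)); // false: shadowed, drop during compaction
        System.out.println(filter.survives(key, 2_500L)); // true: written after the delete
    }
}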