Re: [DISCUSS] Adding support for BETWEEN operator

2024-05-14 Thread Bowen Song via dev

Ranged update sounds like a disaster for compaction and read performance.

Imagine compacting or reading some SSTables in which a large number of 
overlapping but non-identical ranges were updated with different values. 
It gives me a headache just thinking about it.


Ranged delete is much simpler, because the "value" is the same tombstone 
marker, and it is also guaranteed to expire and disappear eventually, so 
the performance impact of dealing with them at read and compaction time 
does not persist in the long term.



On 14/05/2024 16:59, Benjamin Lerer wrote:
It should be like range tombstones ... only much worse ;-). A tombstone 
is a simple marker (deleted). An update can be far more complex.


On Tue, 14 May 2024 at 15:52, Jon Haddad wrote:

Is there a technical limitation that would prevent a range write
that functions the same way as a range tombstone, other than
probably needing a version bump of the storage format?


On Tue, May 14, 2024 at 12:03 AM Benjamin Lerer
 wrote:

Range restrictions (>, >=, <=, < and BETWEEN) do not work on
UPDATEs. They do work on DELETEs because under the hood C*
translates them into range tombstones.
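
A minimal CQL sketch of that distinction (the table and values are hypothetical, for illustration only):

    CREATE TABLE ks.events (pk int, ck int, v text, PRIMARY KEY (pk, ck));
    -- works today: the range restriction becomes a single range tombstone
    DELETE FROM ks.events WHERE pk = 1 AND ck >= 10 AND ck < 20;
    -- rejected today: range restrictions are not supported on UPDATE
    UPDATE ks.events SET v = 'x' WHERE pk = 1 AND ck >= 10 AND ck < 20;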

On Tue, 14 May 2024 at 02:44, David Capwell wrote:

I would also include in UPDATE… but yeah, <3 BETWEEN and
welcome this work.


On May 13, 2024, at 7:40 AM, Patrick McFadin
 wrote:

This is a great feature addition to CQL! I get
asked about it from time to time but then people figure
out a workaround. It will be great to just have it
available.

And right on Simon! I think the only project I had as a
high school senior was figuring out how many parties I
could go to and still maintain a passing grade. Thanks
for your work here.

Patrick

On Mon, May 13, 2024 at 1:35 AM Benjamin Lerer
 wrote:

Hi everybody,

Just raising awareness that Simon is working on
adding support for the BETWEEN operator in WHERE
clauses (SELECT and DELETE) in CASSANDRA-19604. We
plan to add support for it in conditions in a
separate patch.
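
For illustration, the kind of query this enables (the schema is hypothetical; the exact grammar is whatever CASSANDRA-19604 lands):

    SELECT * FROM ks.events WHERE pk = 1 AND ck BETWEEN 10 AND 20;
    DELETE FROM ks.events WHERE pk = 1 AND ck BETWEEN 10 AND 20;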

The patch is available.

As a side note, Simon chose to do his high school
senior project contributing to Apache Cassandra. This
patch is his first contribution for his senior
project (and his second feature contribution to Apache
Cassandra).




Re: Schema Disagreement Issue for Cassandra 4.1

2024-04-01 Thread Bowen Song via dev

It sounds worthy of a Jira ticket.

On 01/04/2024 06:23, Cheng Wang via dev wrote:

Hello,

I have recently encountered a problem concerning schema disagreement 
in Cassandra 4.1. It appears that the schema versions do not reconcile 
as expected.


The issue can be reproduced by following these steps:
- Disable the gossip in Node A.
- Make a schema change in Node B, such as creating a new table.
- Re-enable the gossip in Node A.
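
Spelled out at the command level, the reproduction is roughly the following (node names and the table definition are placeholders):

    nodetool disablegossip                                    # on node A
    cqlsh -e "CREATE TABLE ks.repro (id int PRIMARY KEY);"    # on node B
    nodetool enablegossip                                     # on node A
    nodetool describecluster                                  # compare the reported schema versions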

My expectation was that the schema versions would eventually 
reconcile. However, in Cassandra 4.1, it seems that reconciliation 
hangs indefinitely unless I reboot the node. Interestingly, when 
performing the same steps in Cassandra 3.0, the schema version 
synchronizes within about a minute.


Has anyone else experienced this issue with Cassandra 4.x? It appears 
to me that this could be a regression in the 4.x series.


Any insights or suggestions would be greatly appreciated.

Thanks,
Cheng


Re: Default table compression defined in yaml.

2024-03-19 Thread Bowen Song via dev
I believe the `foobar_in_kb: 123` format in the cassandra.yaml file is 
deprecated, and the new format is `foobar: 123KiB`. Is there a need to 
introduce new settings entries with the deprecated format only to be 
removed at a later version?



On 18/03/2024 14:39, Claude Warren, Jr via dev wrote:
After much work by several people, I have pulled together the changes 
to define the default compression in the cassandra.yaml file and have 
created a pull request [1].


If you are interested in this topic, please take a look at the changes 
and give at least a cursory review.


[1] https://github.com/apache/cassandra/pull/3168 



Thanks,
Claude

Re: [DISCUSS] What SHOULD we do when we index an inet type that is ipv4?

2024-03-07 Thread Bowen Song via dev
I think the answer to that is, if an inet type column is a partition 
key, can I write to it in IPv4 and then query it with IPv6 and find the 
record? I believe the behaviour between SAI and partition key should be 
the same.
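
Expressed as a CQL sketch (the table is hypothetical; whether the SELECT finds the row is exactly the open question):

    CREATE TABLE ks.clients (addr inet PRIMARY KEY);
    INSERT INTO ks.clients (addr) VALUES ('127.0.0.1');
    -- the same address written in an IPv6 form
    SELECT * FROM ks.clients WHERE addr = '0:0:0:0:0::7f00:0001';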


On 07/03/2024 17:43, Caleb Rackliffe wrote:
Yeah, what we have with inet is much like if we had a type like 
"numeric" that allowed you to write both ints and doubles. If we had 
actual "inet4" and "inet6" types, SAI would have been able to index 
them as fixed length values without doing the 4 -> 16 byte conversion. 
Given SAI could easily change this to go one way or another at 
post-filtering time, perhaps there's another option:


4.) Have an option on the column index that allows the user to specify 
whether ipv4 and ipv6 addresses are comparable. If they are, nothing 
changes. If they aren't, we can just take the matches from the index 
and filter "strictly".


I'm not sure what's best here, because what it seems to hinge on is 
what users actually want to do when they throw both v4 and v6 
addresses into a single column. Without any real loss in storage 
efficiency, you could index them in two separate columns on the same 
table, and none of this matters. If they are mixed, it feels like we 
should at least have the option to make them comparable, kind of like 
we have the option to make text case-insensitive or unicode normalized 
right now.
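
For reference, the existing text analogue looks roughly like this (the index name is made up, and the option spelling should be checked against the SAI documentation); a hypothetical inet comparability option could follow the same pattern:

    CREATE CUSTOM INDEX users_name_idx ON ks.users (name)
    USING 'StorageAttachedIndex'
    WITH OPTIONS = {'case_sensitive': 'false'};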


On Wed, Mar 6, 2024 at 4:35 PM Bowen Song via dev 
 wrote:


Technically, 127.0.0.1 (IPv4) is not 0:0:0:0:0::7f00:0001 (IPv6),
but their values are equal. Just like 1.0 (double) is not 1 (int),
but
their values are equal. So, what is the meaning of "=" in CQL?

On 06/03/2024 21:36, David Capwell wrote:
> So, was reviewing SAI and found we convert ipv4 to ipv6 (which
is valid for the type) and made me wonder what the behavior would
be if client mixed ipv4 with ipv4 encoded as ipv6… this caused me
to find a different behavior in SAI to the rest of C*… where I
feel C* is doing the wrong thing…
>
> Let's walk through a simple example
>
> ipv4: 127.0.0.1
> ipv6: 0:0:0:0:0::7f00:0001
>
> Both of these addresses are equal according to networking and
java… but for C* they are different!  These are 2 different values
as ipv4 is 4 bytes and ipv6 is 16 bytes, so 4 != 16!
>
> With SAI we convert all ipv4 to ipv6 so that the search logic is
correct… this causes SAI to return partitions that ALLOW FILTERING
and other indexes wouldn’t…
>
> This gets to the question in the subject… what SHOULD we do for
this type?
>
> I see 3 options:
>
> 1) SAI use the custom C* semantics where 4 != 16… this keeps us
consistent…
> 2) ALLOW FILTERING and other indexes are “fixed” so that we
actually match correctly… we are not really able to fix if the
type is in a partition or clustering column though…
> 3) deprecate inet in favor of a inet_better type… where inet
semantics is the custom C* semantics and inet_better handles this case
>
> Thoughts?


Re: [DISCUSS] What SHOULD we do when we index an inet type that is ipv4?

2024-03-06 Thread Bowen Song via dev
Technically, 127.0.0.1 (IPv4) is not 0:0:0:0:0::7f00:0001 (IPv6), 
but their values are equal. Just like 1.0 (double) is not 1 (int), but 
their values are equal. So, what is the meaning of "=" in CQL?


On 06/03/2024 21:36, David Capwell wrote:

So, was reviewing SAI and found we convert ipv4 to ipv6 (which is valid for the 
type) and made me wonder what the behavior would be if client mixed ipv4 with 
ipv4 encoded as ipv6… this caused me to find a different behavior in SAI to the 
rest of C*… where I feel C* is doing the wrong thing…

Let's walk through a simple example

ipv4: 127.0.0.1
ipv6: 0:0:0:0:0::7f00:0001

Both of these addresses are equal according to networking and Java… but for C* 
they are different!  These are 2 different values as ipv4 is 4 bytes and ipv6 
is 16 bytes, so 4 != 16!

With SAI we convert all ipv4 to ipv6 so that the search logic is correct… this 
causes SAI to return partitions that ALLOW FILTERING and other indexes wouldn’t…

This gets to the question in the subject… what SHOULD we do for this type?

I see 3 options:

1) SAI use the custom C* semantics where 4 != 16… this keeps us consistent…
2) ALLOW FILTERING and other indexes are “fixed” so that we actually match 
correctly… we are not really able to fix if the type is in a partition or 
clustering column though…
3) deprecate inet in favor of a inet_better type… where inet semantics is the 
custom C* semantics and inet_better handles this case

Thoughts?


Re: [DISCUSS] New CQL command/option for listing roles with superuser privileges

2024-02-29 Thread Bowen Song via dev

I believe that opens the door to this kind of situation (sketched in CQL after the steps):

1. create superuser role "role1"
2. create superuser role "role2"
3. add "role2" to members of "role1"
4. remove "role2" from the members of "role1"
5. "role2" now inexplicitly lost the superuser state

TBH, my preferred solution is making superuser roles not inheritable. If 
a role has members, it cannot be made superuser; and if a role is 
superuser, no members can be added to it.


It doesn't make much sense to inherit from a superuser role, because it 
has unrestricted permissions, which renders any permission explicitly 
set on the child roles useless. This forces a role to be made 
superuser explicitly, which makes all the display and filtering issues 
related to inheritance go away.


On 29/02/2024 11:30, Štefan Miklošovič wrote:
Why don't we just update the is_superuser column of a role when it 
effectively achieves superuser status by being granted some 
superuser role? Similarly, we would remove its superuser status when 
there are no superuser roles granted to it anymore.


I think that at least for the second case (when a superuser role is 
revoked and none remain), there would need to be some 
transaction, because while it checks whether any superuser roles remain 
before setting the flag to false, somebody else might grant a superuser 
role to it again, so we might end up with is_superuser set to false 
while the role still has a superuser role granted.


I am not sure if this is achievable and I am sorry if this was already 
answered / rejected elsewhere.


On Thu, Feb 29, 2024 at 11:33 AM  wrote:

Hi Maxwell,

Currently system_auth.roles table doesn’t have acquired superuser
info available in columns to filter on it. Below is the
system_auth.roles table for the example I have listed in the
previous email. If you notice, though role1 and role11 acquired
superuser status through grants, is_superuser column is False for
these roles and acquired superuser status is not apparent directly
from the columns of this table. member_of column shows immediate
parent/grant of a given role. But these grants can result in a
huge tree of roles hierarchy and there may be a role anywhere up
in the hierarchy which is a superuser.

cassandra@cqlsh> select * from system_auth.roles;

 role   | can_login | is_superuser | member_of  | salted_hash
--------+-----------+--------------+------------+-------------
  role2 |     False |        False |       null |        null
 role11 |     False |        False |  {'role1'} |        null
 super1 |     False |         True |       null |        null
  role1 |     False |        False | {'super1'} |        null


Thanks,
Shailaja



On Feb 29, 2024, at 2:11 AM, guo Maxwell 
wrote:

Hi,
 1. Can this CQL, "SELECT role FROM system_auth.roles WHERE
is_superuser = True ALLOW FILTERING;", meet your needs, if the
user executing it has the permission to do so?
 2. I think maybe we can also add the ability to filter in the LIST
ROLES/USERS grammar, for example: LIST USERS WHERE super = True;



Shailaja Koppu wrote on Wed, 28 Feb 2024 at 20:40:

Hi Team,

Currently the LIST ROLES command doesn't indicate whether a role has
superuser privilege if it was acquired through a grant in the role
hierarchy (LIST ROLES has the super column true only if the role
is created with SUPERUSER=true). In the example below, super1 is a
superuser, role1 acquired superuser status through a grant of
super1, and role11 acquired superuser status through a grant of
role1. The LIST ROLES output has the super column true only for super1.


cassandra@cqlsh> create role super1 WITH SUPERUSER = true;
cassandra@cqlsh> create role role1;
cassandra@cqlsh> create role role11;
cassandra@cqlsh> create role role2;
cassandra@cqlsh> grant super1 to role1;
cassandra@cqlsh> grant role1 to role11;
cassandra@cqlsh> list roles;

 role   | super | login | options | datacenters
--------+-------+-------+---------+-------------
  role1 | False | False |      {} |         ALL
 role11 | False | False |      {} |         ALL
  role2 | False | False |      {} |         ALL
 super1 |  True | False |      {} |         ALL


One way to check whether a role has acquired superuser status is by
running LIST ROLES OF <role> and looking for at least one
row with the super column true. This works fine for checking the
superuser status of a given role.

cassandra@cqlsh> list roles of role11;

 role   | super | login | options | datacenters
--------+-------+-------+---------+-------------
  role1 | False | 

Re: Table name length limit in Cassandra

2024-02-22 Thread Bowen Song via dev

Hi Gaurav,

I would be less worried about performance issues than about interoperability 
issues. Other tools/client libraries do not expect this, and longer names may 
cause them to behave unexpectedly (e.g. truncating/crashing/...).


If you can, try to get rid of common prefixes/suffixes, and use abbreviations 
where possible. You shouldn't have thousands of tables (and yes, there are 
performance issues with that), so the table name length limit really 
shouldn't be an issue.


Best,
Bowen

On 22/02/2024 05:47, Gaurav Agarwal wrote:

Hi team,

Currently Cassandra has a table name length limit of 48 characters. If 
I understand correctly, it was chosen because a filename cannot 
be longer than 255 characters on Windows. However, Linux supports file 
names of up to 4096 bytes.


Given my Cassandra nodes are on Linux systems, can I increase the 
limit from 48 characters to 64 characters? Will there be any 
performance issues due to increasing the limit?


Thanks
Gaurav


Re: [DISCUSS] Add subscription management instructions to user@, dev@ message footers

2024-01-22 Thread Bowen Song via dev
Google Groups works slightly differently. It "forwards" emails using the 
group's email address as the "From" address, not the original sender's 
email address, unless the sender address happens to be a Google mail 
address (including Gmail and others). Technically speaking, that's not 
forwarding, but sending a new email with the original email's content, 
subject, sender name (but not address), etc. copied over.


I believe the mailing list software this mailing list is using also 
supports such a feature. For example, this email's "From" address is 
"Bowen Song via dev ", not my actual email 
address (which is the "Cc" address). If we add the footer to all emails, 
all "From" addresses, other than those Apache email addresses (e.g. 
f...@apache.org), will have to be turned into "dev@cassandra.apache.org". 
This works, but there's a catch. Many people habitually hit the 
"reply all" button on their mail client instead of the "reply" button, 
and as a result, the person being replied to will receive two 
nearly identical emails: one addressed to the mailing list, which is then 
modified to add the footer, and the other Cc-ed to them without the 
footer. This may turn out to be very annoying if a mailing list 
participant can't (or doesn't know how to) set up inbox rules to filter 
these out.


There's no "Prefect Solution™", unsurprisingly.


On 22/01/2024 19:08, C. Scott Andreas wrote:

Bowen and Jeremiah, thanks for remembering this.

I'd remembered the DKIM/SPF issue, but not its relationship to the 
message footer - appreciate your work fixing that, Bowen.


I'm part of a few Google Groups that relay messages with an appended 
footer that don't seem to encounter invalidation, but am not curious 
enough to learn how they make that work right now. :)


I withdraw the proposal. 

– Scott


On Jan 22, 2024, at 10:56 AM, Brandon Williams  wrote:


That's right, I'd forgotten about this.  I change my +1 to -1, 
there's not enough value in this to break signatures.


Kind Regards,
Brandon


On Mon, Jan 22, 2024 at 12:42 PM Jeremiah Jordan 
 wrote:



Here was the thread where it was removed:
https://lists.apache.org/thread/9wtw9m4r858xdm78krf1z74q3krc27st



On Jan 22, 2024, at 12:37 PM, J. D. Jordan
 wrote:
I think we used to have this and removed them because it was
breaking the encryption signature on messages or something which
meant they were very likely to be treated as spam?

Not saying we can’t put it back on, but it was removed for good
reasons from what I recall.


On Jan 22, 2024, at 12:19 PM, Brandon Williams
 wrote:

+1

Kind Regards,
Brandon


On Mon, Jan 22, 2024 at 12:10 PM C. Scott Andreas
 wrote:

Hi all,

I'd like to propose appending the following two footers to
messages sent to the user@ and dev@ lists. The proposed
postscript including line breaks is between the "X" blocks
below.

User List Footer:
X

---
Unsubscribe: Send a blank email to
user-unsubscr...@cassandra.apache.org. Do not reply to this
message.
Cassandra Community: Follow other mailing lists or join us in
Slack: https://cassandra.apache.org/_/community.html
X

Dev List Footer:
X

---
Unsubscribe: Send a blank email to
dev-unsubscr...@cassandra.apache.org. Do not reply to this
message.
Cassandra Community: Follow other mailing lists or join us in
Slack: https://cassandra.apache.org/_/community.html
X

Offering this proposal for three reasons:
– Many users are sending "Unsubscribe" messages to the full
mailing list which prompts others to wish to unsubscribe – a
negative cascade that affects the size of our user community.
– Many users don't know where to go to figure out how to
unsubscribe, especially if they'd joined many years ago.
– Nearly all mailing lists provide a one-click mechanism for
unsubscribing or built-in mail client integration to do so via
message headers. Including compact instructions on how to
leave is valuable to subscribers.

#asfinfra indicates that such footers can be appended given
project consensus and an INFRA- ticket:
https://the-asf.slack.com/archives/CBX4TSBQ8/p1705939868631079

If we reach consensus on adding a message footer, I'll file an
INFRA ticket with a link to this thread.

Thanks,

– Scott







Re: [DISCUSS] Add subscription management instructions to user@, dev@ message footers

2024-01-22 Thread Bowen Song via dev
Adding a footer or modifying the email content in any way will break the 
DKIM signature of the email if it has one. Since the mailing list's mail 
server forwards the emails to the recipients, the SPF check will 
fail too. Failing the DKIM signature and SPF checks will result in the 
email likely being treated as spam, either ending up in the spam/junk 
mailbox or being rejected by the recipients' mail servers. The DMARC 
standard also requires that at least one of the DKIM signature and SPF 
checks pass, otherwise the message is considered a failure. If the sender 
domain has a valid DMARC rule to reject or quarantine failing emails, 
mailing list subscribers whose mail service provider supports the 
DMARC standard will never see any email from these senders via the 
mailing list land in their inbox.


Balancing the pros and cons, I believe it's better to have a small number 
of users occasionally spamming the mailing lists with invalid 
unsubscription emails than to have the vast majority of users unable to 
receive emails from a subset of senders (e.g. anyone from the @yahoo.com 
domain, or myself).


On 22/01/2024 18:10, C. Scott Andreas wrote:

Hi all,

I'd like to propose appending the following two footers to messages 
sent to the user@ and dev@ lists. The proposed postscript including 
line breaks is between the "X" blocks below.


User List Footer:
X

---
Unsubscribe: Send a blank email to 
user-unsubscr...@cassandra.apache.org. Do not reply to this message.
Cassandra Community: Follow other mailing lists or join us in Slack: 
https://cassandra.apache.org/_/community.html

X

Dev List Footer:
X

---
Unsubscribe: Send a blank email to 
dev-unsubscr...@cassandra.apache.org. Do not reply to this message.
Cassandra Community: Follow other mailing lists or join us in Slack: 
https://cassandra.apache.org/_/community.html

X

Offering this proposal for three reasons:
– Many users are sending "Unsubscribe" messages to the full mailing 
list which prompts others to wish to unsubscribe – a negative cascade 
that affects the size of our user community.
– Many users don't know where to go to figure out how to unsubscribe, 
especially if they'd joined many years ago.
– Nearly all mailing lists provide a one-click mechanism for 
unsubscribing or built-in mail client integration to do so via message 
headers. Including compact instructions on how to leave is valuable to 
subscribers.


#asfinfra indicates that such footers can be appended given project 
consensus and an INFRA- ticket: 
https://the-asf.slack.com/archives/CBX4TSBQ8/p1705939868631079


If we reach consensus on adding a message footer, I'll file an INFRA 
ticket with a link to this thread.


Thanks,

– Scott



Re: [DISCUSS] Maintain backwards compatibility after dependency upgrade in the 5.0

2023-06-28 Thread Bowen Song via dev
IMHO, anyone upgrading software between major versions should expect to 
see breaking changes. Introducing breaking or major changes is the whole 
point of bumping major version numbers.


Since the library upgrade needs to happen sooner or later, I don't see 
any reason why it should not happen in the 5.0 release.



On 27/06/2023 19:21, Maxim Muzafarov wrote:

Hello everyone,


We use the Dropwizard Metrics 3.1.5 library, which provides a basic
set of classes to easily expose Cassandra internals to a user through
various interfaces (the most common being JMX). We want to upgrade
this library version in the next major release 5.0 up to the latest
stable 4.2.19 for the following reasons:
- the 3.x (and 4.0.x) Dropwizard Metrics library is no longer
supported, which means that if we face a critical CVE, we'll still
need to upgrade, so it's better to do it sooner and more calmly;
- as of 4.2.5 the library supports jdk11, jdk17, so we will be in-sync
[1] as well as having some of the compatibility fixes mentioned in the
related JIRA [2];
- there have been a few user-related requests [3][4] whose
applications collide with the old version of the library, we want to
help them;


The problem

The problem with simply upgrading is that the JmxReporter class of the
library has moved from the com.codahale.metrics package in the 3.x
release to the com.codahale.metrics.jmx package in the 4.x release.
This is a problem for applications/tools that rely on the cassandra
classpath (lib/jars) as after the upgrade they may be looking for the
JmxReporter class which has changed its location.

A good example of the problem that we (or a user) may face after the
upgrade is our tests and the cassandra-driver-core 3.1.1, which uses
the old 3.x version of the library in tests. Of course, in this case,
we can upgrade the cassandra driver up to 4.x [5][6] to fix the
problem, as the new driver uses a newer version of the library, but
that's another story I won't go into for now. I'm talking more about
visualising the problem a user might face after upgrading to 5.0 if
they rely on the cassandra classpath; on the other hand, they
might not face this problem at all because, as I understand it, they will
provide this library in their applications themselves.


So, since Cassandra has a huge ecosystem and a variety of tools that I
can't even imagine, the main question here is:

Can we move forward with this change without breaking backwards
compatibility with any kind of tools that we have considering the
example above as the main case? Do you have any thoughts on this?

The changes are here:
https://github.com/apache/cassandra/pull/2238/files



[1] 
https://github.com/dropwizard/metrics/pull/2180/files#diff-5dbf1a803ecc13ff945a08ed3eb09149a83615e83f15320550af8e3a91976446R14
[2] https://issues.apache.org/jira/browse/CASSANDRA-14667
[3] https://github.com/dropwizard/metrics/issues/1581#issuecomment-628430870
[4] https://issues.apache.org/jira/browse/STORM-3204
[5] https://issues.apache.org/jira/browse/CASSANDRA-15750
[6] https://issues.apache.org/jira/browse/CASSANDRA-17231


Re: [DISCUSS] Introduce DATABASE as an alternative to KEYSPACE

2023-04-06 Thread Bowen Song via dev

/> I'm quite happy to leave things as they are if that is the consensus./

+1 to the above


On 06/04/2023 14:54, Mike Adamson wrote:
My apologies. I started this discussion off the back of a usability 
discussion around new user accessibility to Cassandra and the premise 
that there is an initial steep learning curve for new users. Including 
new users who have worked for a long time in the traditional DBMS field.


On the basis of the reason for the discussion,  TABLEGROUP doesn't sit 
well because of user types / functions / indexes etc. which are not 
strictly tables and is also yet another Cassandra only term.


NAMESPACE could work but its different usage in other systems could 
be just as confusing to new users.


And, I certainly don't think having multiple names for the same thing 
just to satisfy different parties is a good idea at all.


I'm quite happy to leave things as they are if that is the consensus.

On Thu, 6 Apr 2023 at 14:16, Josh McKenzie  wrote:


KEYSPACE is fine. If we want to introduce a standard nomenclature
like DATABASE that’s also fine. Inventing brand new ones is not
fine, there’s no benefit.

I'm with Benedict in principle, with Aleksey in practice; I think
KEYSPACE and SCHEMA are actually fine enough.

If and when we get to any kind of multi-tenancy, having a more
metaphorical abstraction that users are familiar with like these
becomes more valuable; it's pretty clear that things in different
keyspaces, different databases, or even different schemas could
have different access rules, resourcing, etc from one another.

While the off-the-cuff logical TABLEGROUP thing is a /literal/
statement about what the thing is, it'd be another unique term to
us;  we have enough things in our system where we've charted our
own path. My personal .02 is we don't need to go adding more. :)

On Thu, Apr 6, 2023, at 8:54 AM, Mick Semb Wever wrote:


… but that should be a different discussion about how we
evolve config.



I disagree. Nomenclature being difficult can benefit from
holistic and forward thinking.
Sure you can label this off-topic if you like, but I value our
discuss threads being collaborative in an open-mode.
Sometimes the best idea is on the tail end of a sequence of bad
and/or unpopular ideas.








--
*Mike Adamson*
Engineering

+1 650 389 6000 | datastax.com



Re: [DISCUSS] Introduce DATABASE as an alternative to KEYSPACE

2023-04-04 Thread Bowen Song via dev
I personally prefer to use the name "keyspace", because it avoids the 
confusion between the "database software/server" and the "collection of 
tables in a database". "An SQL database" can mean different things in 
different contexts, but "a Cassandra keyspace" always means the same thing.


On 04/04/2023 16:48, Mike Adamson wrote:

Hi,

I'd like to propose that we add DATABASE to the CQL grammar as an 
alternative to KEYSPACE.


Background: While TABLE was introduced as an alternative for 
COLUMNFAMILY in the grammar we have kept KEYSPACE for the container 
name for a group of tables. Nearly all traditional SQL databases use 
DATABASE as the container name for a group of tables so it would make 
sense for Cassandra to adopt this naming as well.


KEYSPACE would be kept in the grammar but we would update some logging 
and documentation to encourage use of the new name.


Mike Adamson

--
*Mike Adamson*
Engineering

+1 650 389 6000 | datastax.com



Re: [DISCUSS] Change the usage of nodetool tablehistograms

2023-03-23 Thread Bowen Song via dev
In that case, I would recommend fixing the bug that prints everything when 
an arbitrary number of arguments is given.


On 23/03/2023 13:40, guo Maxwell wrote:
Firstly, I think anything that exists must be there for a reason, so the 
ignore option for tablestats must be something users need; at least I have 
used it sometimes.
Secondly, in order to keep this as simple as possible, I think leaving 
the option unchanged is enough, because the original usage is simple 
enough. The user can just print a specified table by running nodetool 
tablehistograms ks table, and if there are ten tables in the keyspace, it 
is simple enough to type the command ten times with different table names, 
though I thought at first that accepting just the keyspace name as an 
argument would be enough.
When we just want to see eight of the tables in the keyspace, the user 
would have to type eight table names, whereas ignoring two tables may be enough.





Bowen Song via dev wrote on Thu, 23 Mar 2023 at 8:07 PM:


I don't think the nodetool tablestats command's parameters should
be used as a reference implementation for the nodetool
tablehistograms command. Because:

  * the tablehistograms command can take the keyspace and table as
two separate parameters, but the tablestats command can't.
  * the tablestats command can take keyspace (without table) as a
parameter, but the tablehistograms command can't.

The introduction of the -ks and -tbs options is unnecessary for
the tablestats command, because its parameters are:

nodetool tablestats [<keyspace>|<keyspace.table>
[<keyspace>|<keyspace.table> [...]]]

Which means any positional parameter without a dot is treated as a
keyspace name, otherwise it's treated as dot-separated keyspace
and table name. That, however, does not apply to the nodetool
tablehistograms command, which led to your workaround - the
addition of the -ks and -tbs options.
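
For example (the keyspace and table names are just illustrative):

    nodetool tablestats system_auth            # one keyspace
    nodetool tablestats system_auth.roles      # one keyspace.table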

But if you could just forget about the nodetool tablestats command
for a moment, and look at the nodetool tablehistograms command
alone, you will see that it's unnecessary to introduce the -ks and
-tbs options, because the command already takes keyspace name and
table name, just in a different format.

In addition to that, I would be interested to know how often
people use the -i option in the nodetool tablestats command. My
best guess is: very, very rarely.

If my guess is correct, we should keep the nodetool
tablehistograms command as simple as:

nodetool tablehistograms [<keyspace> <table> [<table> [...]] |
<keyspace.table> [<keyspace.table> [...]]]

It's good enough if the above can cover the majority of use cases.
The remaining use cases can be dealt with individually, by
multiple invocations of the same command or providing it with a
script-generated list of tables in the <keyspace.table> format.

TL;DR: The KISS principle
<https://en.wikipedia.org/wiki/KISS_principle> should apply here -
keep it simple.


On 23/03/2023 03:05, guo Maxwell wrote:


Maybe I didn't describe the usage of the option "-i" clearly. The
reason why I think the command arguments should be like this:


1. nodetool tablehistograms ks.tb1 or ks tb1 ... // this is
*one of the old ways* of using tablehistograms: it will print out
the histograms of the table ks.tb1. We keep the old format to
print out the table histograms. Besides, if more than two
arguments are provided, such as nodetool tablehistograms
system.local system_schema.columns system_schema.tables, then
all tables' histograms will be printed out (I think this is
a bug, not as expected in the documentation's description; we
should remind the user that this is an incorrect usage)

2. nodetool tablehistograms -tbs ks.tb1 ks.tb2 // print
out a list of tables' histograms given in the keyspace.table format
3. nodetool tablehistograms -ks ks1 ks2 ks3 ... // print out
histograms for a list of keyspaces
4. nodetool tablehistograms -i -ks ks1 ks2 // print out
table histograms except for the keyspaces listed behind
the option -i
5. nodetool tablehistograms -i -tbs ks.tb1 ks.tb2 // print out
tables' histograms except for the tables ks.tb1 and ks.tb2
6. nodetool tablehistograms -i -tbs ks.tb1 ks.tb2 -ks ks1 //
print out tables' histograms except for the tables ks.tb1 and
ks.tb2 and all tables in ks1
7. no option specified: all tables' histograms will be
printed out. // this is *another one of the old ways* of using
tablehistograms.


 is to make the command format consistent with the format
of nodetool tablestats, so that for users there will be a unified
way of using these two commands, rather than different
commands requiring a different usage awareness. We can see the
description in the tablestats doc for the option "-i":

Ignore the list of tables and display the remaining tables


that is to say, if -i appears, all the lists 

Re: [DISCUSS] Change the usage of nodetool tablehistograms

2023-03-23 Thread Bowen Song via dev
 is I displayed parameters specifying option -ks and 
-tbs , but tablestats don't.





Josh McKenzie  于2023年3月22日周三 23:35写道:

Agree w/Bowen. I think the straight forward simplicity of "clear
inclusion and exclusion semantics, default to include all in scope
excepting things that are explicitly ignored" would be ideal.


    On Wed, Mar 22, 2023, at 8:45 AM, Bowen Song via dev wrote:


TBH, the syntax looks unnecessarily complex and confusing to me.

For example, for this command:

nodetool tablehistograms -ks ks1 -i -tbs ks1.tb1 ks2.tb2

Which one of the following should it do?

 1. all tables in the keyspace ks1,  except the table tb1; or
 2. all tables in all keyspaces, except any table in the keyspace
ks1 and the table tb2 in the keyspace ks2


I personally would prefer the simplicity of this approach:

nodetool tablehistograms ks1 tb1 tb2 tb3

nodetool tablehistograms ks1.tb1 ks1.tb2 ks2.tb3

nodetool tablehistograms -i ks1 -i ks2

nodetool tablehistograms -i ks1.tb1 -i ks2.tb2


They are self-explanatory. You don't need to read comments to
understand what they do, as long as you know that "-i" means
"exclude".

A more complex and possibly confusing option could be:



nodetool tablehistograms ks1 -i ks1.tb1 -i ks1.tb2  # all
tables in the keyspace ks1, except the table tb1 and tb2

nodetool tablehistograms -i ks1.tb1 -i ks1.tb2 ks1  #
identical as above, as -i takes only one parameter

To avoid the above confusion, the command could enforce that the
"-i" option may only be used after any positional options, thus
making the 2nd command a syntax error.


Beyond that, I don't see why the user can't make multiple
invocations of the nodetool tablehistograms command if they have
a more complex or specific need.

For example, in this case:

/> 6. nodetool tablehistograms -i -tbs ks.tb1 ks.tb2 -ks ks1
// print out tables' histograms except for the tables ks.tb1 and
ks.tb2 and all tables in ks1/

The same result can be achieved by concatenating the outputs of
the following two commands:

nodetool tablehistograms -i ks -i ks1

nodetool tablehistograms ks -i ks.tb1 -i ks.tb2


On 22/03/2023 05:12, guo Maxwell wrote:

Thanks everyone. So it seems that it is better to add new
parameter options to meet our needs, while keeping the original
parameters' functions unaffected to achieve backward compatibility.
So the new options are:
1. nodetool tablehistograms ks.tb1 or ks tb1 ... // this is *one
of the old ways* of using tablehistograms: it will print out the
histograms of the table ks.tb1. We keep the old format to print out
the table histograms. Besides, if more than two arguments are
provided, such as nodetool tablehistograms system.local
system_schema.columns system_schema.tables, then all tables'
histograms will be printed out (I think this is a bug, not
as expected in the documentation's description; we should remind the
user that this is an incorrect usage)

2. nodetool tablehistograms -tbs ks.tb1 ks.tb2 // print out
a list of tables' histograms given in the keyspace.table format
3. nodetool tablehistograms -ks ks1 ks2 ks3 ... // print out a list
of keyspace histograms
4. nodetool tablehistograms -i -ks ks1 ks2 // print out a list
of table histograms except for the keyspaces listed behind the
option -i
5. nodetool tablehistograms -i -tbs ks.tb1 ks.tb2 // print out
tables' histograms except for the tables ks.tb1 and ks.tb2
6. nodetool tablehistograms -i -tbs ks.tb1 ks.tb2 -ks ks1 //
print out tables' histograms except for the tables ks.tb1 and
ks.tb2 and all tables in ks1
7. no option specified: all tables' histograms will be
printed out. // this is *another one of the old ways* of using
tablehistograms.

So we add some more options like "-i", "-ks", "-tbs"; we can
combine these options and we can also use any of them
individually. Besides, we can still use the tool the old way
if a table in the ks.tb format is provided.


Jeremiah D Jordan  于2023年3月16日周四
23:14写道:

-1 on any change which breaks the previously documented usage.
+1 any additions to what the tool can do without breaking
previously documented behavior.


On Mar 16, 2023, at 7:42 AM, Josh McKenzie
 wrote:

We could also consider augmenting the tool with new named
arguments with the functionality you described and leave
the positional usage intact.

On Thu, Mar 16, 2023, at 6:43 AM, Bowen Song via dev wrote:


The documented command options are:

nodetool tablehistograms [<keyspace> <table> | <keyspace.table>]



That means one parameter will be treated as dot separated
keyspace and 

Re: [DISCUSS] Change the usage of nodetool tablehistograms

2023-03-22 Thread Bowen Song via dev

TBH, the syntax looks unnecessarily complex and confusing to me.

For example, for this command:

   nodetool tablehistograms -ks ks1 -i -tbs ks1.tb1 ks2.tb2

Which one of the following should it do?

1. all tables in the keyspace ks1,  except the table tb1; or
2. all tables in all keyspaces, except any table in the keyspace ks1
   and the table tb2 in the keyspace ks2


I personally would prefer the simplicity of this approach:

   nodetool tablehistograms ks1 tb1 tb2 tb3

   nodetool tablehistograms ks1.tb1 ks1.tb2 ks2.tb3

   nodetool tablehistograms -i ks1 -i ks2

   nodetool tablehistograms -i ks1.tb1 -i ks2.tb2

They are self-explanatory. You don't need to read comments to understand 
what they do, as long as you know that "-i" means "exclude".


A more complex and possibly confusing option could be:

   nodetool tablehistograms ks1 -i ks1.tb1 -i ks1.tb2  # all tables in
   the keyspace ks1, except the table tb1 and tb2

   nodetool tablehistograms -i ks1.tb1 -i ks1.tb2 ks1  # identical as
   above, as -i takes only one parameter

To avoid the above confusion, the command could enforce that the "-i" 
option may only be used after any positional options, thus making the 2nd 
command a syntax error.


Beyond that, I don't see why the user can't make multiple invocations of 
the nodetool tablehistograms command if they have a more complex or 
specific need.


For example, in this case:

   /> 6. nodetool tablehistograms -i -tbs ks.tb1 ks.tb2 -ks ks1 // print
   out tables' histograms except for the tables ks.tb1 and ks.tb2 and
   all tables in ks1/

The same result can be achieved by concatenating the outputs of the 
following two commands:


   nodetool tablehistograms -i ks -i ks1

   nodetool tablehistograms ks -i ks.tb1 -i ks.tb2


On 22/03/2023 05:12, guo Maxwell wrote:
Thanks everyone. So it seems that it is better to add new parameter 
options to meet our needs, while keeping the original parameters' 
functions unaffected to achieve backward compatibility.

So the new options are:
1. nodetool tablehistograms ks.tb1 or ks tb1 ... // this is *one of 
the old ways* of using tablehistograms: it will print out the histograms 
of the table ks.tb1. We keep the old format to print out the table 
histograms. Besides, if more than two arguments are provided, such as 
nodetool tablehistograms system.local system_schema.columns 
system_schema.tables, then all tables' histograms will be printed out 
(I think this is a bug, not as expected in the documentation's 
description; we should remind the user that this is an incorrect usage)


2. nodetool tablehistograms -tbs ks.tb1 ks.tb2 // print out a list 
of tables' histograms given in the keyspace.table format
3. nodetool tablehistograms -ks ks1 ks2 ks3 ... // print out a list of 
keyspace histograms
4. nodetool tablehistograms -i -ks ks1 ks2 // print out a list of 
table histograms except for the keyspaces listed behind the option -i
5. nodetool tablehistograms -i -tbs ks.tb1 ks.tb2 // print out 
tables' histograms except for the tables ks.tb1 and ks.tb2
6. nodetool tablehistograms -i -tbs ks.tb1 ks.tb2 -ks ks1 // print out 
tables' histograms except for the tables ks.tb1 and ks.tb2 and all 
tables in ks1
7. no option specified: all tables' histograms will be printed 
out. // this is *another one of the old ways* of using tablehistograms.


So we add some more options like "-i", "-ks", "-tbs"; we can combine 
these options and we can also use any of them individually. Besides, 
we can still use the tool the old way if a table in the ks.tb format 
is provided.



Jeremiah D Jordan  于2023年3月16日周四 
23:14写道:


-1 on any change which breaks the previously documented usage.
+1 any additions to what the tool can do without breaking
previously documented behavior.


On Mar 16, 2023, at 7:42 AM, Josh McKenzie 
wrote:

We could also consider augmenting the tool with new named
arguments with the functionality you described and leave the
positional usage intact.

On Thu, Mar 16, 2023, at 6:43 AM, Bowen Song via dev wrote:


The documented command options are:

nodetool tablehistograms [<keyspace> <table> | <keyspace.table>]



That means one parameter will be treated as dot separated
keyspace and table. Alternatively, two parameters will be
treated as the keyspace and table respectively.

To remain compatible with the documented behaviour, my
suggestion is to change the command options to:

nodetool tablehistograms [<keyspace> <table> [<table> [...]] |
<keyspace.table> [<keyspace.table> [...]]]

Feel free to add the "all except ..." feature to the above.

This doesn't break backward compatibility in documented ways. It
only changes the undocumented behaviour. If someone is using the
undocumented behaviour, they must know things may break when the
software is upgraded. We can just add a line to the NEWS.txt and
let them update their scripts.


On 16/03/2023 08:53, guo Maxwell wrote:

Hello ev

Re: [DISCUSS] Change the usage of nodetool tablehistograms

2023-03-16 Thread Bowen Song via dev

The documented command options are:

   nodetool tablehistograms [<keyspace> <table> | <keyspace.table>]


That means one parameter will be treated as dot separated keyspace and 
table. Alternatively, two parameters will be treated as the keyspace and 
table respectively.


To remain compatible with the documented behaviour, my suggestion is to 
change the command options to:


   nodetool tablehistograms [<keyspace> <table> [<table> [...]] |
<keyspace.table> [<keyspace.table> [...]]]

Feel free to add the "all except ..." feature to the above.

This doesn't break backward compatibility in documented ways. It only 
changes the undocumented behaviour. If someone is using the undocumented 
behaviour, they must know things may break when the software is 
upgraded. We can just add a line to the NEWS.txt and let them update 
their scripts.



On 16/03/2023 08:53, guo Maxwell wrote:

Hello everyone:
The nodetool tablehistograms command has one argument which you can fill 
with only one table name in the format "keyspace_name.table_name 
/ keyspace_name table_name", so that you can get the table histograms 
of the specified table.


And if no arguments are set, all the tables' histograms will be 
printed out. And if more than 2 arguments (no matter whether the format 
is right or wrong) are set, all the tables' histograms will also be 
printed out (which is a bug in my mind).


So the usage of nodetool tablehistograms has some restrictions: it 
either outputs one table, or all of them.


As described in CASSANDRA-18296, I 
will change the usage of nodetool tablehistograms, which will support the 
features below:
1. nodetool tablehistograms ks.tb1 ks.tb2 // print out a list of 
tables' histograms given in the keyspace.table format
2. nodetool tablehistograms ks1 ks2 ks3 ... // print out a list of 
keyspace histograms
3. nodetool tablehistograms -i ks1 ks2 // print out a list of table 
histograms except for the keyspaces listed behind the option -i
4. nodetool tablehistograms -i ks ks.tb // print out tables' 
histograms except for tables in the keyspace ks and the table ks.tb

5. no option specified: all tables' histograms will be printed out.

This usage breaks compatibility with how it was done previously, 
and this is a user-facing tool.


So, What do you think?

Thanks~~~


Re: [DISCUSS] Enhanced Disk Error Handling

2023-03-09 Thread Bowen Song via dev
Personally, I'd like to see the fix for this issue come after 
CEP-21. It could be feasible to implement a fix before then, that 
detects bit-errors on the read path and refuses to respond to the 
coordinator, implicitly having speculative execution handle the 
retry against another replica while repair of that range happens. 
But that feels suboptimal to me when a better framework is on the 
horizon.
I originally typed something in agreement with you but the more I 
think about this, the more a node-local "reject queries for 
specific token ranges" degradation profile seems like it _could_ 
work. I don't see an obvious way to remove the need for a 
human-in-the-loop on fixing things in a pre-CEP-21 world without 
opening pandora's box (Gossip + TMD + non-deterministic agreement 
on ownership state cluster-wide /cry).


And even in a post CEP-21 world you're definitely in the "at what 
point is it better to declare a host dead and replace it" fuzzy 
territory where there's no immediately correct answers.


A system_distributed table of corrupt token ranges that are 
currently being rejected by replicas with a mechanism to kick off a 
repair of those ranges could be interesting.
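
As a sketch only, such a table might look something like this (the name, 
columns and types are entirely made up for illustration):

    CREATE TABLE system_distributed.corrupt_token_ranges (
        keyspace_name text,
        table_name text,
        range_start bigint,
        range_end bigint,
        reported_by inet,
        reported_at timestamp,
        PRIMARY KEY ((keyspace_name, table_name), range_start, range_end)
    );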


On Thu, Mar 9, 2023, at 1:45 PM, Abe Ratnofsky wrote:
Thanks for proposing this discussion Bowen. I see a few different 
issues here:


1. How do we safely handle corruption of a handful of tokens 
without taking an entire instance offline for re-bootstrap? This 
includes refusal to serve read requests for the corrupted 
token(s), and correct repair of the data.
2. How do we expose the corruption rate to operators, in a way 
that lets them decide whether a full disk replacement is worthwhile?
3. When CEP-21 lands it should become feasible to support 
ownership draining, which would let us migrate read traffic for a 
given token range away from an instance where that range is 
corrupted. Is it worth planning a fix for this issue before CEP-21 
lands?


I'm also curious whether there's any existing literature on how 
different filesystems and storage media accommodate bit-errors 
(correctable and uncorrectable), so we can be consistent with 
those behaviors.


Personally, I'd like to see the fix for this issue come after 
CEP-21. It could be feasible to implement a fix before then, that 
detects bit-errors on the read path and refuses to respond to the 
coordinator, implicitly having speculative execution handle the 
retry against another replica while repair of that range happens. 
But that feels suboptimal to me when a better framework is on the 
horizon.


--
Abe

On Mar 9, 2023, at 8:23 AM, Bowen Song via dev 
 wrote:


Hi Jeremiah,

I'm fully aware of that, which is why I said that deleting the 
affected SSTable files is "less safe".


If the "bad blocks" logic is implemented and the node abort the 
current read query when hitting a bad block, it should remain 
safe, as the data in other SSTable files will not be used. The 
streamed data should contain the unexpired tombstones, and that's 
enough to keep the data consistent on the node.



Cheers,
Bowen


On 09/03/2023 15:58, Jeremiah D Jordan wrote:
It is actually more complicated than just removing the sstable 
and running repair.


In the face of expired tombstones that might be covering data in 
other sstables the only safe way to deal with a bad sstable is 
wipe the token range in the bad sstable and rebuild/bootstrap 
that range (or wipe/rebuild the whole node which is usually the 
easier way).  If there are expired tombstones in play, it means 
they could have already been compacted away on the other 
replicas, but may not have compacted away on the current 
replica, meaning the data they cover could still be present in 
other sstables on this node.  Removing the sstable will mean 
resurrecting that data.  And pulling the range from other nodes 
does not help because they can have already compacted away the 
tombstone, so you won’t get it back.


Tl;DR you can’t just remove the one sstable you have to remove 
all data in the token range covered by the sstable (aka all data 
that sstable may have had a tombstone covering).  Then you can 
stream from the other nodes to get the data back.


-Jeremiah

On Mar 8, 2023, at 7:24 AM, Bowen Song via dev
<dev@cassandra.apache.org> wrote:


At the moment, when a read error, such as unrecoverable bit 
error or data corruption, occurs in the SSTable data files, 
regardless of the disk_failure_policy configuration, manual (or 
to be precise, external) intervention is required to recover 
from the error.


Commonly, there are two approaches to recover from such an error:

 1. The safer, but slower recovery strategy: replace the entire
node.
 2. The less safe, but faster recovery strategy: shut down the
node, delete the affected SSTable file(s), and then bring
the node back online and run repair.

Based on my understanding of Cassandra, it should be possible 
to recover from such e

Re: [DISCUSS] Enhanced Disk Error Handling

2023-03-09 Thread Bowen Song via dev

   /When we attempt to rectify any bit-error by streaming data from
   peers, we implicitly take a lock on token ownership. A user needs to
   know that it is unsafe to change token ownership in a cluster that
   is currently in the process of repairing a corruption error on one
   of its instances' disks./

I'm not sure about this.

Based on my knowledge, streaming does not require a lock on the token 
ownership. If the node subsequently loses ownership of the token 
range being streamed, it will just end up with some extra SSTable files 
containing useless data, and those files will get deleted when nodetool 
cleanup is run.


BTW, just pointing out the obvious, streaming is neither repairing nor 
bootstrapping. The latter two may require a lock on the token ownership.


On 09/03/2023 19:56, Abe Ratnofsky wrote:
I'm not seeing any reasons why CEP-21 would make this more difficult 
to implement, besides the fact that it hasn't landed yet.


There are two major potential pitfalls that CEP-21 would help us avoid:
1. Bit-errors beget further bit-errors, so we ought to be resistant to 
a high frequency of corruption events
2. Avoid token ownership changes when attempting to stream a corrupted 
token


I found some data supporting (1) - 
https://www.flashmemorysummit.com/English/Collaterals/Proceedings/2014/20140806_T1_Hetzler.pdf


If we detect bit-errors and store them in system_distributed, then we 
need a capacity to throttle that load and ensure that consistency is 
maintained.


When we attempt to rectify any bit-error by streaming data from peers, 
we implicitly take a lock on token ownership. A user needs to know 
that it is unsafe to change token ownership in a cluster that is 
currently in the process of repairing a corruption error on one of its 
instances' disks. CEP-21 makes this sequencing safe, and provides 
abstractions to better expose this information to operators.


--
Abe


On Mar 9, 2023, at 10:55 AM, Josh McKenzie  wrote:

Personally, I'd like to see the fix for this issue come after 
CEP-21. It could be feasible to implement a fix before then, that 
detects bit-errors on the read path and refuses to respond to the 
coordinator, implicitly having speculative execution handle the 
retry against another replica while repair of that range happens. 
But that feels suboptimal to me when a better framework is on the 
horizon.
I originally typed something in agreement with you but the more I 
think about this, the more a node-local "reject queries for specific 
token ranges" degradation profile seems like it _could_ work. I don't 
see an obvious way to remove the need for a human-in-the-loop on 
fixing things in a pre-CEP-21 world without opening pandora's box 
(Gossip + TMD + non-deterministic agreement on ownership state 
cluster-wide /cry).


And even in a post CEP-21 world you're definitely in the "at what 
point is it better to declare a host dead and replace it" fuzzy 
territory where there's no immediately correct answers.


A system_distributed table of corrupt token ranges that are currently 
being rejected by replicas with a mechanism to kick off a repair of 
those ranges could be interesting.


On Thu, Mar 9, 2023, at 1:45 PM, Abe Ratnofsky wrote:
Thanks for proposing this discussion Bowen. I see a few different 
issues here:


1. How do we safely handle corruption of a handful of tokens without 
taking an entire instance offline for re-bootstrap? This includes 
refusal to serve read requests for the corrupted token(s), and 
correct repair of the data.
2. How do we expose the corruption rate to operators, in a way that 
lets them decide whether a full disk replacement is worthwhile?
3. When CEP-21 lands it should become feasible to support ownership 
draining, which would let us migrate read traffic for a given token 
range away from an instance where that range is corrupted. Is it 
worth planning a fix for this issue before CEP-21 lands?


I'm also curious whether there's any existing literature on how 
different filesystems and storage media accommodate bit-errors 
(correctable and uncorrectable), so we can be consistent with those 
behaviors.


Personally, I'd like to see the fix for this issue come after 
CEP-21. It could be feasible to implement a fix before then, that 
detects bit-errors on the read path and refuses to respond to the 
coordinator, implicitly having speculative execution handle the 
retry against another replica while repair of that range happens. 
But that feels suboptimal to me when a better framework is on the 
horizon.


--
Abe

On Mar 9, 2023, at 8:23 AM, Bowen Song via dev 
 wrote:


Hi Jeremiah,

I'm fully aware of that, which is why I said that deleting the 
affected SSTable files is "less safe".


If the "bad blocks" logic is implemented and the node abort the 
current read query when hitting a bad block, it should remain safe, 
as the data in other SSTable files will not be used. The streamed 
data should cont

Re: [DISCUSS] Enhanced Disk Error Handling

2023-03-09 Thread Bowen Song via dev

Hi Jeremiah,

I'm fully aware of that, which is why I said that deleting the affected 
SSTable files is "less safe".


If the "bad blocks" logic is implemented and the node abort the current 
read query when hitting a bad block, it should remain safe, as the data 
in other SSTable files will not be used. The streamed data should 
contain the unexpired tombstones, and that's enough to keep the data 
consistent on the node.


Cheers,
Bowen


On 09/03/2023 15:58, Jeremiah D Jordan wrote:
It is actually more complicated than just removing the sstable and 
running repair.


In the face of expired tombstones that might be covering data in other 
sstables the only safe way to deal with a bad sstable is wipe the 
token range in the bad sstable and rebuild/bootstrap that range (or 
wipe/rebuild the whole node which is usually the easier way).  If 
there are expired tombstones in play, it means they could have already 
been compacted away on the other replicas, but may not have compacted 
away on the current replica, meaning the data they cover could still 
be present in other sstables on this node.  Removing the sstable will 
mean resurrecting that data.  And pulling the range from other nodes 
does not help because they can have already compacted away the 
tombstone, so you won’t get it back.


Tl;DR you can’t just remove the one sstable you have to remove all 
data in the token range covered by the sstable (aka all data that 
sstable may have had a tombstone covering).  Then you can stream from 
the other nodes to get the data back.


-Jeremiah

On Mar 8, 2023, at 7:24 AM, Bowen Song via dev 
 wrote:


At the moment, when a read error, such as unrecoverable bit error or 
data corruption, occurs in the SSTable data files, regardless of the 
disk_failure_policy configuration, manual (or to be precise, 
external) intervention is required to recover from the error.


Commonly, there are two approaches to recover from such an error:

 1. The safer, but slower recovery strategy: replace the entire node.
 2. The less safe, but faster recovery strategy: shut down the node,
delete the affected SSTable file(s), and then bring the node back
online and run repair.

Based on my understanding of Cassandra, it should be possible to 
recover from such error by marking the affected token range in the 
existing SSTable as "corrupted" and stop reading from them (e.g. 
creating a "bad block" file or in memory), and then streaming the 
affected token range from the healthy replicas. The corrupted SSTable 
file can then be removed upon the next successful compaction 
involving it, or alternatively an anti-compaction is performed on it 
to remove the corrupted data.


The advantage of this strategy is:

  * Reduced node down time - node restart or replacement is not needed
  * Less data streaming is required - only the affected token range
  * Faster recovery time - less streaming and delayed compaction or
anti-compaction
  * No less safe than replacing the entire node
  * This process can be automated internally, removing the need for
operator inputs

The disadvantage is added complexity on the SSTable read path and it 
may mask disk failures from the operator who is not paying attention 
to it.


What do you think about this?



Re: [DISCUSS] Enhanced Disk Error Handling

2023-03-08 Thread Bowen Song via dev

   /– A repair of the affected range would need to be completed among
   the replicas without such corruption (including paxos repair)./

It can be safe without a repair by over-streaming the data from more (or 
all) available replicas, either within the DC (when LOCAL_* CL is used) 
or across the whole cluster (when other CL is used), then perform a 
compaction locally on the streamed SSTables to get rid of the duplicate 
data. Since the read error should only affect a fairly limited range of 
tokens, over-streaming in theory should not be an issue.



   /– And we'd need a mechanism to execute repair on the affected node
   without it being available to respond to queries, either via the
   client protocol or via internode (similar to a partial bootstrap)./

The mechanism to not respond to queries already exists. I believe there 
may be better ways to do this, but at the minimal level, the affected 
node could just drop that read request silently, and then the 
coordinator will automatically retry it on other replicas if speculative 
retry is enabled, or the client may get a query failure (the "required 
responses N, received responses N-1" error).



   /My hunch is that the examples where this are desirable might be
   limited though. It might allow one to limp along on a bad drive
   momentarily while a proper replacement is bootstrapped, but
   typically with disk failures where there's smoke there's fire - I
   wouldn't expect a drive reporting uncorrectable errors / filesystem
   corruption to be long for this world./

Actually no. Regardless of whether it's a mechanical hard drive or an SSD, 
they all have a certain level of uncorrectable bit-error rate (UBER).


For example, a consumer grade hard drive may have an UBER of 1 in 1e14, 
which means on average roughly every 11 TiB read will lead to an 
unrecoverable read error, which results in an entire 512-byte or 4096-byte 
sector becoming unreadable. That's perfectly normal; the hard drive is 
still in good health and may still last for many years if not decades. 
Consumer grade SSDs often have an UBER of 1 in 1e15, and data centre grade 
SSDs have far better UBER than consumer grade drives, but even then, the 
best still have an UBER of about 1 in 1e17.
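
As a rough sanity check of those figures, the "roughly every 11 TiB" follows 
directly from the quoted UBER. A minimal, illustrative calculation (assuming 
the UBER is expressed as one unrecoverable error per N bits read):

    public class UberMath
    {
        public static void main(String[] args)
        {
            // bits read per uncorrectable error, per the figures quoted above
            double[] ubers = { 1e14, 1e15, 1e17 };
            for (double bitsPerError : ubers)
            {
                double bytesPerError = bitsPerError / 8;
                double tibPerError = bytesPerError / Math.pow(1024, 4);
                System.out.printf("UBER 1 in %.0e -> ~%.1f TiB read per unrecoverable error%n",
                                  bitsPerError, tibPerError);
            }
        }
    }

which gives roughly 11 TiB, 114 TiB and 11,000 TiB respectively.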


When managing a cluster of hundreds of Cassandra nodes, each with 
hundreds (if not thousands) of GB of data read per day, the probability of 
hitting an uncorrectable bit-error is pretty high. The Cassandra cluster of 
approximately 300 nodes I manage hits this fairly often, and replacing 
nodes for the sake of data consistency has become a chore.



On 08/03/2023 16:53, C. Scott Andreas wrote:

For this to be safe, my understanding is that:

– A repair of the affected range would need to be completed among the 
replicas without such corruption (including paxos repair).
– And we'd need a mechanism to execute repair on the affected node 
without it being available to respond to queries, either via the 
client protocol or via internode (similar to a partial bootstrap).


My hunch is that the examples where this are desirable might be 
limited though. It might allow one to limp along on a bad drive 
momentarily while a proper replacement is bootstrapped, but typically 
with disk failures where there's smoke there's fire - I wouldn't 
expect a drive reporting uncorrectable errors / filesystem corruption 
to be long for this world.


Can you say more about the scenarios you have in mind?

– Scott

On Mar 8, 2023, at 5:24 AM, Bowen Song via dev 
 wrote:



At the moment, when a read error, such as unrecoverable bit error or 
data corruption, occurs in the SSTable data files, regardless of the 
disk_failure_policy configuration, manual (or to be precise, 
external) intervention is required to recover from the error.


Commonly, there are two approaches to recover from such an error:

 1. The safer, but slower recovery strategy: replace the entire node.
 2. The less safe, but faster recovery strategy: shut down the node,
delete the affected SSTable file(s), and then bring the node back
online and run repair.

Based on my understanding of Cassandra, it should be possible to 
recover from such error by marking the affected token range in the 
existing SSTable as "corrupted" and stop reading from them (e.g. 
creating a "bad block" file or in memory), and then streaming the 
affected token range from the healthy replicas. The corrupted SSTable 
file can then be removed upon the next successful compaction 
involving it, or alternatively an anti-compaction is performed on it 
to remove the corrupted data.


The advantage of this strategy is:

  * Reduced node down time - node restart or replacement is not needed
  * Less data streaming is required - only the affected token range
  * Faster recovery time - less streaming and delayed compaction or
anti-compaction
  * No less safe than replacing the entire node
  * This process can be automated internally, removing the need for
operator inputs

[DISCUSS] Enhanced Disk Error Handling

2023-03-08 Thread Bowen Song via dev
At the moment, when a read error, such as unrecoverable bit error or 
data corruption, occurs in the SSTable data files, regardless of the 
disk_failure_policy configuration, manual (or to be precise, external) 
intervention is required to recover from the error.


Commonly, there are two approaches to recover from such an error:

1. The safer, but slower recovery strategy: replace the entire node.
2. The less safe, but faster recovery strategy: shut down the node,
   delete the affected SSTable file(s), and then bring the node back
   online and run repair.

Based on my understanding of Cassandra, it should be possible to recover 
from such error by marking the affected token range in the existing 
SSTable as "corrupted" and stop reading from them (e.g. creating a "bad 
block" file or in memory), and then streaming the affected token range 
from the healthy replicas. The corrupted SSTable file can then be 
removed upon the next successful compaction involving it, or 
alternatively an anti-compaction is performed on it to remove the 
corrupted data.


The advantage of this strategy is:

 * Reduced node down time - node restart or replacement is not needed
 * Less data streaming is required - only the affected token range
 * Faster recovery time - less streaming and delayed compaction or
   anti-compaction
 * No less safe than replacing the entire node
 * This process can be automated internally, removing the need for
   operator inputs

The disadvantage is added complexity on the SSTable read path and it may 
mask disk failures from the operator who is not paying attention to it.


What do you think about this?
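
To make the proposed mechanism a little more concrete, here is a minimal 
sketch of what an in-memory "bad block" registry consulted on the read path 
could look like. The class and method names are purely illustrative and are 
not existing Cassandra APIs:

    import java.util.Map;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.CopyOnWriteArraySet;

    public final class BadBlockRegistry
    {
        // closed token range [left, right] flagged as corrupted within one SSTable
        public record CorruptedRange(long left, long right)
        {
            boolean contains(long token) { return token >= left && token <= right; }
        }

        private final Map<String, Set<CorruptedRange>> badRanges = new ConcurrentHashMap<>();

        // called when a read hits an unrecoverable error in an SSTable data file;
        // the node would then stream [left, right] back from healthy replicas
        public void markCorrupted(String sstable, long left, long right)
        {
            badRanges.computeIfAbsent(sstable, k -> new CopyOnWriteArraySet<>())
                     .add(new CorruptedRange(left, right));
        }

        // read path check: skip/abort the local read if the token is flagged
        public boolean isCorrupted(String sstable, long token)
        {
            Set<CorruptedRange> ranges = badRanges.get(sstable);
            return ranges != null && ranges.stream().anyMatch(r -> r.contains(token));
        }

        // once a compaction (or anti-compaction) rewrites or drops the SSTable,
        // its bad ranges can be forgotten
        public void onSSTableRemoved(String sstable)
        {
            badRanges.remove(sstable);
        }
    }

Whether the registry is kept purely in memory or persisted as a "bad block" 
file next to the SSTable is an open design choice; persisting it would keep 
the range masked across restarts until the streaming and compaction described 
above have completed.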


Re: [PROPOSAL] Moving deb/rpm repositories from downloads.apache.org to apache.jfrog.io

2022-08-11 Thread Bowen Song via dev

I see. In that case, sticking to the original plan makes more sense.

On 11/08/2022 22:46, Mick Semb Wever wrote:


We should have the new domain/URL created before the final move is
made,
and redirecting to the existing download.apache.org
 for the time being.
This will ensure users can have a transition time and avoid causing a
cliff edge moment.


Good idea, but in this situation it would only complicate things, 
because of the following (but mostly (3)):
1. The jfrog repositories already exist, and have for a while now (we 
just have not publicised them so much).
2. The new URLs are already in place, redirecting to the jfrog 
repositories.
3. ASF Infra is requesting we remove the rpm/deb files from 
downloads.a.o asap.





Re: [PROPOSAL] Moving deb/rpm repositories from downloads.apache.org to apache.jfrog.io

2022-08-11 Thread Bowen Song via dev
I see. Now I fully understand the change. There are no objections from me, 
everything sounds fine.


We should have the new domain/URL created before the final move is made, 
and redirecting to the existing download.apache.org for the time being. 
This will ensure users can have a transition time and avoid causing a 
cliff edge moment.



On 11/08/2022 22:24, Brandon Williams wrote:

Nothing is changing in regard to signing.  Both package management
systems have their own system for that which will remain.  The package
locations are being moved because downloads.apache.org wants another
level of (superfluous) signing on top of that, which we do not
currently have.

Kind Regards,
Brandon

On Thu, Aug 11, 2022 at 4:20 PM Bowen Song via dev
 wrote:

In that case, the move from signed RPM/DEB to unsigned can be quite problematic 
to some enterprise users.

On 11/08/2022 22:16, Jeremiah D Jordan wrote:

For ASF projects the binary releases are always considered as “convenience 
binaries”; the official release is always just the source artifacts.  See the 
ASF release policy for more information.

https://www.apache.org/legal/release-policy.html#compiled-packages


On Aug 11, 2022, at 4:12 PM, Bowen Song via dev  
wrote:

I'm a bit unclear what's the scope of this change. Is it limited to the 
"*-bin.tar.gz" files only?

I would assume the RPM/DEB packages are considered as parts of the "official 
releases", and aren't affected by this change. Am I right?


On 11/08/2022 21:59, Mick Semb Wever wrote:



These repositories and their binaries are "convenience binaries" and not the 
official Cassandra source binaries

Then where are the official binaries?



Wrong wording there, thanks for catching me.
The official *releases* are the source artefacts, e.g. the *-src.tar.gz in 
https://downloads.apache.org/cassandra/4.0.5/

The binaries (e.g. *-bin.tar.gz) are not considered official, but convenience.

https://infra.apache.org/release-distribution.html#release-content
https://www.apache.org/legal/release-policy.html#artifacts






Re: [PROPOSAL] Moving deb/rpm repositories from downloads.apache.org to apache.jfrog.io

2022-08-11 Thread Bowen Song via dev
In that case, the move from signed RPM/DEB to unsigned can be quite 
problematic to some enterprise users.


On 11/08/2022 22:16, Jeremiah D Jordan wrote:
For ASF projects the binary releases are always considered as 
“convenience binaries”; the official release is always just the source 
artifacts.  See the ASF release policy for more information.


https://www.apache.org/legal/release-policy.html#compiled-packages


On Aug 11, 2022, at 4:12 PM, Bowen Song via dev 
 wrote:


I'm a bit unclear what's the scope of this change. Is it limited to 
the "*-bin.tar.gz" files only?


I would assume the RPM/DEB packages are considered as parts of the 
"official releases", and aren't affected by this change. Am I right?



On 11/08/2022 21:59, Mick Semb Wever wrote:


> /These repositories and their binaries are "convenience
binaries" and not the official Cassandra source binaries/

Then where are the official binaries?



Wrong wording there, thanks for catching me.
The official *releases* are the source artefacts, e.g. the 
*-src.tar.gz in https://downloads.apache.org/cassandra/4.0.5/


The binaries (e.g. *-bin.tar.gz) are not considered official, but 
convenience.


https://infra.apache.org/release-distribution.html#release-content
https://www.apache.org/legal/release-policy.html#artifacts





Re: [PROPOSAL] Moving deb/rpm repositories from downloads.apache.org to apache.jfrog.io

2022-08-11 Thread Bowen Song via dev
I'm a bit unclear what's the scope of this change. Is it limited to the 
"*-bin.tar.gz" files only?


I would assume the RPM/DEB packages are considered as parts of the 
"official releases", and aren't affected by this change. Am I right?



On 11/08/2022 21:59, Mick Semb Wever wrote:


> /These repositories and their binaries are "convenience
binaries" and not the official Cassandra source binaries/

Then where are the official binaries?



Wrong wording there, thanks for catching me.
The official *releases* are the source artefacts, e.g. the 
*-src.tar.gz in https://downloads.apache.org/cassandra/4.0.5/


The binaries (e.g. *-bin.tar.gz) are not considered official, but 
convenience.


https://infra.apache.org/release-distribution.html#release-content
https://www.apache.org/legal/release-policy.html#artifacts




Re: [PROPOSAL] Moving deb/rpm repositories from downloads.apache.org to apache.jfrog.io

2022-08-11 Thread Bowen Song via dev
> /These repositories and their binaries are "convenience binaries" and 
not the official Cassandra source binaries/


Then where are the official binaries?


On 11/08/2022 21:40, Mick Semb Wever wrote:


The proposal is to move our official debian and redhat repositories 
from downloads.apache.org to Apache's JFrog Artifactory server at 
apache.jfrog.io, fronting it with the URL aliases 
debian.cassandra.apache.org and redhat.cassandra.apache.org.


That is to replace the following URLs from
https://downloads.apache.org/cassandra/debian/
https://downloads.apache.org/cassandra/redhat/

to
https://debian.cassandra.apache.org

https://redhat.cassandra.apache.org


(which in turn redirect to our jfrog repositories at)
https://apache.jfrog.io/artifactory/cassandra-deb
https://apache.jfrog.io/artifactory/cassandra-rpm


The rationale to do this is to avoid the strict checksum and signature 
requirements on downloads.a.o (which is the same as dist.a.o), as the 
debian and redhat repositories have their own system for integrity and 
signing (which we already do).


These repositories and their binaries are "convenience binaries" and 
not the official Cassandra source binaries, so they do not need to be 
on downloads.a.o and can be served from apache.jfrog.io 
. This is similar to maven binaries (and 
docker images).


This will BREAK everyone's existing 
`/etc/apt/sources.list.d/cassandra.sources.list` and 
`/etc/yum.repos.d/cassandra.repo` files. Folk will need to update 
these files to point to the new repo URLs.


The plan is to do the following to ensure people are informed about 
this breaking change:

 - announcement to users@
 - README.md in the original URL locations explaining the breakage and 
how to fix. (The README.md must be voted on, signed and checksummed),

 - A warning banner on our website downloads page,
 - Every release email for the next 12 months will contain the warning.


background: https://issues.apache.org/jira/browse/CASSANDRA-17748

Anyone with any questions/objections?
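
For operators, the switch amounts to replacing the repository URL in the 
package manager configuration. A hedged example of what the updated files 
might look like — the release series name "41x" and the exact file layout are 
assumptions for illustration; follow the announcement/README for the 
authoritative values:

    # /etc/apt/sources.list.d/cassandra.sources.list
    deb https://debian.cassandra.apache.org 41x main

    # /etc/yum.repos.d/cassandra.repo
    [cassandra]
    name=Apache Cassandra
    baseurl=https://redhat.cassandra.apache.org/41x/
    gpgcheck=1
    repo_gpgcheck=1
    gpgkey=https://downloads.apache.org/cassandra/KEYS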


Re: Unsubscribe

2022-08-09 Thread Bowen Song via dev
To unsubscribe from this mailing list, you'll need to send an email to 
dev-unsubscr...@cassandra.apache.org


On 09/08/2022 12:52, Schmidtberger, Brian M. (STL) wrote:


unsubscribe

+

BRIAN SCHMIDTBERGER

Software Engineering Senior Advisor, Core Engineering, Express Scripts

M: 785.766.7450

EVERNORTH.COM 



Re: [DISCUSS] Deprecate and remove resumable bootstrap and decommission

2022-08-03 Thread Bowen Song via dev
I should also add that because we use vnodes and STCS, in the absence of 
CASSANDRA-10540 <https://issues.apache.org/jira/browse/CASSANDRA-10540>, 
I don't think we will benefit from the zero copy streaming at all, as 
almost all SSTable files from the streaming source will contain a very 
wide token range outside the receiving node's desired token range.


On 04/08/2022 00:41, Bowen Song wrote:


That was Cassandra 3.11, before the introduction of zero copy. But I 
must say I'm not certain whether the new zero copy streaming can 
prevent the long GC pauses, as I haven't tried it.


On 03/08/2022 23:37, Josh McKenzie wrote:
I had to resume the bootstrap once or twice in order to get these 
nodes to finish joining the cluster.
Was this before or after the addition of zero copy streaming? The 
premise is that the pain point resumable bootstrap targets is 
mitigated by the much faster bootstrapping times without the 
correctness risks.


On Wed, Aug 3, 2022, at 6:21 PM, Bowen Song via dev wrote:


That would have to be assessed on a case by case basis.

* When the code doesn't delete data, which means there's a zero 
probability of resurrecting deleted data, I will still use resumable 
bootstrap.


* When resurrected data doesn't pose a problem to the system, it 
often can still be an acceptable behaviour to save hours or days of 
bootstrapping time. I may use resumable bootstrap.


* In other cases, where data correctness is important and there's a 
chance for resurrecting deleted data, I would certainly not use it 
if I had known it in advance (which I don't).



On 03/08/2022 23:11, Jeff Jirsa wrote:
The hypothetical concern described is around potential data 
resurrection - would you still use resumable bootstrap if you knew 
that data deleted during those STW pauses was improperly resurrected?


On Wed, Aug 3, 2022 at 2:40 PM Bowen Song via dev 
mailto:dev@cassandra.apache.org>> wrote:


I have benefited from the resumable bootstrap before, and I'm
in favour of keeping the feature around.

I've had streaming failures due to long STW GC pauses on some
bootstrapping nodes, and I had to resume the bootstrap once or
twice in order to get these nodes to finish joining the cluster.
They had not experienced more long STW GC pauses since they
joined the cluster. I would imagine I will spend a lot of time
tuning the GC parameters in order to get these nodes to join if
the resumable bootstrapping feature is removed. Also, I'm not
concerned about racing conditions involving repairs, because we
don't run repairs while we are adding new nodes (to minimize
the additional load on the cluster).


On 03/08/2022 19:46, Josh McKenzie wrote:

Context: https://issues.apache.org/jira/browse/CASSANDRA-17679
<https://issues.apache.org/jira/browse/CASSANDRA-17679>

From the .yaml comment on the param I was working on adding:
In certain environments, operators may want to disable resumable bootstrap 
in order to avoid potential correctness violations or data loss scenarios. 
Largely this centers around nodes going down during bootstrap, tombstones being 
written, and potential races with repair. By default we leave this on as it's 
been enabled for quite some time, however the option to disable it is more 
palatable now that we have zero copy streaming as that greatly accelerates


Given zero copy streaming in the system and the general
unexplored correctness concerns of
https://issues.apache.org/jira/browse/CASSANDRA-8838
<https://issues.apache.org/jira/browse/CASSANDRA-8838>,
specifically pointed out by Jeff here:

https://issues.apache.org/jira/browse/CASSANDRA-8838?focusedCommentId=16900234=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16900234

<https://issues.apache.org/jira/browse/CASSANDRA-8838?focusedCommentId=16900234=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16900234>,
 I've
been chatting w/Paulo about this and we've both concluded we
think the functionality should be made configurable, default
off (?), deprecated in 4.2 and then completely removed next.

- First: anyone have any concerns with the general arc of
"remove resumable bootstrap and decommission"?
- Second: Should we leave them enabled by default in 4.2 or
disabled?
- Third: Should we consider revisiting older branches with
this functionality and making it toggle-able?

~Josh




Re: [DISCUSS] Deprecate and remove resumable bootstrap and decommission

2022-08-03 Thread Bowen Song via dev
That was Cassandra 3.11, before the introduction of zero copy. But I 
must say I'm not certain whether the new zero copy streaming can prevent 
the long GC pauses, as I haven't tried it.


On 03/08/2022 23:37, Josh McKenzie wrote:
I had to resume the bootstrap once or twice in order to get these 
nodes to finish joining the cluster.
Was this before or after the addition of zero copy streaming? The 
premise is that the pain point resumable bootstrap targets is 
mitigated by the much faster bootstrapping times without the 
correctness risks.


On Wed, Aug 3, 2022, at 6:21 PM, Bowen Song via dev wrote:


That would have to be assessed on a case by case basis.

* When the code doesn't delete data, which means there's a zero 
probability of resurrecting deleted data, I will still use resumable 
bootstrap.


* When resurrected data doesn't pose a problem to the system, it 
often can still be an acceptable behaviour to save hours or days of 
bootstrapping time. I may use resumable bootstrap.


* In other cases, where data correctness is important and there's a 
chance for resurrecting deleted data, I would certainly not use it if 
I had known it in advance (which I don't).



On 03/08/2022 23:11, Jeff Jirsa wrote:
The hypothetical concern described is around potential data 
resurrection - would you still use resumable bootstrap if you knew 
that data deleted during those STW pauses was improperly resurrected?


On Wed, Aug 3, 2022 at 2:40 PM Bowen Song via dev 
mailto:dev@cassandra.apache.org>> wrote:


I have benefited from the resumable bootstrap before, and I'm in
favour of keeping the feature around.

I've had streaming failures due to long STW GC pauses on some
bootstrapping nodes, and I had to resume the bootstrap once or
twice in order to get these nodes to finish joining the cluster.
They had not experienced more long STW GC pauses since they
joined the cluster. I would imagine I will spend a lot of time
tuning the GC parameters in order to get these nodes to join if the
resumable bootstrapping feature is removed. Also, I'm not
concerned about racing conditions involving repairs, because we
don't run repairs while we are adding new nodes (to minimize the
additional load on the cluster).


On 03/08/2022 19:46, Josh McKenzie wrote:

Context: https://issues.apache.org/jira/browse/CASSANDRA-17679
<https://issues.apache.org/jira/browse/CASSANDRA-17679>

From the .yaml comment on the param I was working on adding:
In certain environments, operators may want to disable resumable bootstrap 
in order to avoid potential correctness violations or data loss scenarios. 
Largely this centers around nodes going down during bootstrap, tombstones being 
written, and potential races with repair. By default we leave this on as it's 
been enabled for quite some time, however the option to disable it is more 
palatable now that we have zero copy streaming as that greatly accelerates


Given zero copy streaming in the system and the general
unexplored correctness concerns of
https://issues.apache.org/jira/browse/CASSANDRA-8838
<https://issues.apache.org/jira/browse/CASSANDRA-8838>,
specifically pointed out by Jeff here:

https://issues.apache.org/jira/browse/CASSANDRA-8838?focusedCommentId=16900234=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16900234

<https://issues.apache.org/jira/browse/CASSANDRA-8838?focusedCommentId=16900234=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16900234>,
 I've
been chatting w/Paulo about this and we've both concluded we
think the functionality should be made configurable, default
off (?), deprecated in 4.2 and then completely removed next.

- First: anyone have any concerns with the general arc of
"remove resumable bootstrap and decommission"?
- Second: Should we leave them enabled by default in 4.2 or
disabled?
- Third: Should we consider revisiting older branches with this
functionality and making it toggle-able?

~Josh




Re: [DISCUSS] Deprecate and remove resumable bootstrap and decommission

2022-08-03 Thread Bowen Song via dev

That would have to be assessed on a case by case basis.

* When the code doesn't delete data, which means there's a zero 
probability of resurrecting deleted data, I will still use resumable 
bootstrap.


* When resurrected data doesn't pose a problem to the system, it often 
can still be an acceptable behaviour to save hours or days of 
bootstrapping time. I may use resumable bootstrap.


* In other cases, where data correctness is important and there's a 
chance for resurrecting deleted data, I would certainly not use it if I 
had known it in advance (which I don't).



On 03/08/2022 23:11, Jeff Jirsa wrote:
The hypothetical concern described is around potential data 
resurrection - would you still use resumable bootstrap if you knew 
that data deleted during those STW pauses was improperly resurrected?


On Wed, Aug 3, 2022 at 2:40 PM Bowen Song via dev 
 wrote:


I have benefited from the resumable bootstrap before, and I'm in
favour of keeping the feature around.

I've had streaming failures due to long STW GC pauses on some
bootstrapping nodes, and I had to resume the bootstrap once or
twice in order to get these nodes to finish joining the cluster.
They had not experienced more long STW GC pauses since they joined
the cluster. I would imagine I will spend a lot of time tuning
the GC parameters in order to get these nodes to join if the
resumable bootstrapping feature is removed. Also, I'm not
concerned about racing conditions involving repairs, because we
don't run repairs while we are adding new nodes (to minimize the
additional load on the cluster).


On 03/08/2022 19:46, Josh McKenzie wrote:

Context: https://issues.apache.org/jira/browse/CASSANDRA-17679

From the .yaml comment on the param I was working on adding:
In certain environments, operators may want to disable resumable bootstrap 
in order to avoid potential correctness violations or data loss scenarios. 
Largely this centers around nodes going down during bootstrap, tombstones being 
written, and potential races with repair. By default we leave this on as it's 
been enabled for quite some time, however the option to disable it is more 
palatable now that we have zero copy streaming as that greatly accelerates

Given zero copy streaming in the system and the general
unexplored correctness concerns of
https://issues.apache.org/jira/browse/CASSANDRA-8838,
specifically pointed out by Jeff here:

https://issues.apache.org/jira/browse/CASSANDRA-8838?focusedCommentId=16900234=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16900234

<https://issues.apache.org/jira/browse/CASSANDRA-8838?focusedCommentId=16900234=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16900234>,
 I've
been chatting w/Paulo about this and we've both concluded we
think the functionality should be made configurable, default off
(?), deprecated in 4.2 and then completely removed next.

- First: anyone have any concerns with the general arc of "remove
resumable bootstrap and decommission"?
- Second: Should we leave them enabled by default in 4.2 or disabled?
- Third: Should we consider revisiting older branches with this
functionality and making it toggle-able?

~Josh


Re: [DISCUSS] Deprecate and remove resumable bootstrap and decommission

2022-08-03 Thread Bowen Song via dev
I have benefited from the resumable bootstrap before, and I'm in favour 
of keeping the feature around.


I've had streaming failures due to long STW GC pauses on some 
bootstrapping nodes, and I had to resume the bootstrap once or twice in 
order to get these nodes to finish joining the cluster. They had not 
experienced more long STW GC pauses since they joined the cluster. I 
would imagine I will spend a lot of time tuning the GC parameters in 
order to get these nodes to join if the resumable bootstrapping feature is 
removed. Also, I'm not concerned about racing conditions involving 
repairs, because we don't run repairs while we are adding new nodes (to 
minimize the additional load on the cluster).



On 03/08/2022 19:46, Josh McKenzie wrote:

Context: https://issues.apache.org/jira/browse/CASSANDRA-17679

From the .yaml comment on the param I was working on adding:
In certain environments, operators may want to disable resumable bootstrap in 
order to avoid potential correctness violations or data loss scenarios. 
Largely this centers around nodes going down during bootstrap, tombstones being 
written, and potential races with repair. By default we leave this on as it's 
been enabled for quite some time, however the option to disable it is more 
palatable now that we have zero copy streaming as that greatly accelerates

Given zero copy streaming in the system and the general unexplored 
correctness concerns of 
https://issues.apache.org/jira/browse/CASSANDRA-8838, specifically 
pointed out by Jeff here: 
https://issues.apache.org/jira/browse/CASSANDRA-8838?focusedCommentId=16900234=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16900234 
, I've 
been chatting w/Paulo about this and we've both concluded we think the 
functionality should be made configurable, default off (?), deprecated 
in 4.2 and then completely removed next.


- First: anyone have any concerns with the general arc of "remove 
resumable bootstrap and decommission"?

- Second: Should we leave them enabled by default in 4.2 or disabled?
- Third: Should we consider revisiting older branches with this 
functionality and making it toggle-able?


~Josh

Re: [DISCUSS] Improve Commitlog write path

2022-07-26 Thread Bowen Song via dev

Hi Amit,

That's some brilliant tests you have done there. It shows that the 
compaction throughput not only can be a bottleneck on the speed of 
insert operations, but it can also stress the JVM garbage collector. As 
a result of GC pressure, it can cause other things, such as insert, to fail.


Your last statement is correct. The commit log change can be beneficial 
for atypical workloads where large volume of data is getting inserted 
and then expired soon, for example when using the 
TimeWindowCompactionStrategy with short TTL. But I must point out that 
this kind of atypical usage is often an anti-pattern in Cassandra, as 
Cassandra is a database, not a queue or cache system.


This, however, is not saying the commit log change should not be 
introduced. As others have pointed out, it's down to a balancing act 
between the cost and benefit, and it will depend on the code complexity 
and the effect it has on typical workload, such as CPU and JVM heap 
usage. After all, we should prioritise the performance and reliability 
of typical usage before optimising for atypical use cases.


Best,
Bowen

On 26/07/2022 12:41, Pawar, Amit wrote:


[Public]


Hi Bowen,

Thanks for the reply and it helped to identify the failure point. 
Tested compaction throughput with different values; the threads active 
in compaction reported a “java.lang.OutOfMemoryError: Map failed” error 
earlier with 1024 MB/s compared to other values. This shows that with lower 
throughput such issues are going to come up not immediately but in 
days or weeks. Test results are given below.


Records | Compaction Throughput (MB/s) | 5 large files (GB)      | Disk usage (GB)
--------+------------------------------+-------------------------+----------------
20      | 8                            | Not collected           | 500
20      | 16                           | Not collected           | 500
9       | 64                           | 3.5, 3.5, 3.5, 3.5, 3.5 | 273
9       | 128                          | 3.5, 3.9, 4.9, 8.0, 15  | 287
9       | 256                          | 11, 11, 12, 16, 20      | 359
9       | 512                          | 14, 19, 23, 27, 28      | 469
9       | 1024                         | 14, 18, 23, 27, 28      | 458
9       | 0                            | 6.9, 6.9, 7.0, 28, 28   | 223

Issues observed with increasing compaction throughput.

 1. Out of memory errors
 2. Scores reduce as throughput increased
 3. File sizes grow as throughput increased
 4. Insert failures are noticed

After this testing, I feel that this change is beneficial for 
workloads where data is not kept/left on nodes for too long. With 
lower throughput a large system can ingest more data. Does it make sense?


Thanks,

Amit

*From:* Bowen Song via dev 
*Sent:* Friday, July 22, 2022 4:37 PM
*To:* dev@cassandra.apache.org
*Subject:* Re: [DISCUSS] Improve Commitlog write path

[CAUTION: External Email]

Hi Amit,

The compaction bottleneck is not an instantly visible limitation. It 
in effect limits the total size of writes over a fairly long period of 
time, because compaction is asynchronous and can be queued. That means 
if compaction can't keep up with the writes, they will be queued, and 
Cassandra remains fully functional until hitting the "too many open 
files" error or the filesystem runs out of free inodes. This can 
happen over many days or even weeks.


For the purpose of benchmarking, you may prefer to measure the max 
concurrent compaction throughput, instead of actually waiting for that 
breaking moment. The max write throughput is a fraction of the max 
concurrent compaction throughput, usually by a factor of 5 or more for 
a non-trivial sized table, depending on the table size in bytes. 
Search for "STCS write amplification" to understand why that's the 
case. That means if you've measured the max concurrent compaction 
throughput is 1GB/s, your average max insertion speed over a period of 
time is probably less than 200MB/s.


If you really decide to test the compaction bottleneck in action, it's 
better to measure the table size in bytes on disk, rather than the number 
of records.

Re: [DISCUSS] Improve Commitlog write path

2022-07-22 Thread Bowen Song via dev

Hi Amit,


The compaction bottleneck is not an instantly visible limitation. It in 
effect limits the total size of writes over a fairly long period of 
time, because compaction is asynchronous and can be queued. That means 
if compaction can't keep up with the writes, they will be queued, and 
Cassandra remains fully functional until hitting the "too many open 
files" error or the filesystem runs out of free inodes. This can happen 
over many days or even weeks.


For the purpose of benchmarking, you may prefer to measure the max 
concurrent compaction throughput, instead of actually waiting for that 
breaking moment. The max write throughput is a fraction of the max 
concurrent compaction throughput, usually by a factor of 5 or more for a 
non-trivial sized table, depending on the table size in bytes. Search 
for "STCS write amplification" to understand why that's the case. That 
means if you've measured the max concurrent compaction throughput is 
1GB/s, your average max insertion speed over a period of time is 
probably less than 200MB/s.
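
To put that rule of thumb into numbers, a tiny illustrative calculation (the 
write amplification factor of 5 is an assumption for a non-trivial STCS 
table, not a measured constant):

    public class InsertCeiling
    {
        public static void main(String[] args)
        {
            double compactionThroughputMBps = 1000; // measured max concurrent compaction throughput
            double writeAmplification = 5;          // assumed STCS write amplification factor
            double insertCeilingMBps = compactionThroughputMBps / writeAmplification;
            // prints: sustained insert ceiling ~200 MB/s for 1000 MB/s of compaction throughput
            System.out.printf("sustained insert ceiling ~%.0f MB/s for %.0f MB/s of compaction throughput%n",
                              insertCeilingMBps, compactionThroughputMBps);
        }
    }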


If you really decide to test the compaction bottleneck in action, it's 
better to measure the table size in bytes on disk, rather than the 
number of records. That's because not only the record count, but also 
the size of partitions and compression ratio, all have meaningful effect 
on the compaction workload. It's also worth mentioning that if using the 
STCS strategy, which is more suitable for write heavy workload, you may 
want to keep an eye on the SSTable data file size distribution. 
Initially the compaction may not involve any large SSTable data file, so 
it won't be a bottleneck at all. As more bigger SSTable data files are 
created over time, they will get involved in compactions more and more 
frequently. The bottleneck will only show up (i.e. become problematic) 
when there's a sufficient number of large SSTable data files involved in 
multiple concurrent compactions, occupying all available compactors and 
blocking (queuing) a larger number of compactions involving smaller 
SSTable data files.



Regards,

Bowen


On 22/07/2022 11:19, Pawar, Amit wrote:


[Public]

Thank you Bowen for your reply. Took some time to respond due to 
testing issue.


I tested the multi-threaded feature again with the number of records from 260 
million to 2 billion, and an improvement to around 80% of the Ramdisk score is 
still seen. It is still possible that compaction can become the new 
bottleneck, which could be a new opportunity to fix it. I am a newbie here 
and it is possible that I failed to understand your suggestion completely. 
At least with this testing the multi-threading benefit is reflected in the 
score.


Do you think multi-threading is good to have now? Otherwise, please suggest 
if I need to test further.


Thanks,

Amit

*From:* Bowen Song via dev 
*Sent:* Wednesday, July 20, 2022 4:13 PM
*To:* dev@cassandra.apache.org
*Subject:* Re: [DISCUSS] Improve Commitlog write path

[CAUTION: External Email]

From my past experience, the bottleneck for insert heavy workload is 
likely to be compaction, not commit log. You initially may see commit 
log as the bottleneck when the table size is relatively small, but as 
the table size increases, compaction will likely take its place and 
become the new bottleneck.


On 20/07/2022 11:11, Pawar, Amit wrote:

[Public]

Hi all,

(My previous mail is not appearing in mailing list and resending
again after 2 days)

I am Amit, working at AMD Bangalore, India. I am new to
Cassandra and need to do Cassandra testing on large core-count systems.
Usually one should test on multi-node Cassandra, but I started with
single node testing to understand how Cassandra scales with
increasing core counts.

Test details:

Operation: Insert > 90% (insert heavy)

Operation: Scan < 10%

Cassandra: 3.11.10 and trunk

Benchmark: TPCx-IOT (similar to YCSB)

Results show scaling is poor beyond 16 cores and it is almost
linear. The following are the common settings that helped to get
the better scores.

 1. Memtable heap allocation: offheap_objects
 2. memtable_flush_writers > 4
 3. Java heap: 8-32GB with survivor ratio tuning
 4. Separate storage space for Commitlog and Data.

Many online blogs suggest adding a new Cassandra node when unable to
take high writes. But with large systems, high writes should be
easily taken due to many cores. The need was to improve the scaling
with more cores, so this suggestion didn’t help. After many rounds
of testing it was observed that the current implementation uses a single
thread for Commitlog syncing activity. Commitlog files are mapped
using mmap system call and changes are written with msync.
Periodic syncing with JVisualvm tool shows

 1. thread is not 100% busy with Ramdisk usage for Commitlog
storage and scaling improved on large systems. Ramdisk scores
> 2 X NVME score.
 2. thread becomes 100% busy with NVME usage for Commitlog and score
does not improve much beyond 16 cores.

Re: [DISCUSS] Improve Commitlog write path

2022-07-20 Thread Bowen Song via dev
From my past experience, the bottleneck for insert heavy workload is 
likely to be compaction, not commit log. You initially may see commit 
log as the bottleneck when the table size is relatively small, but as 
the table size increases, compaction will likely take its place and 
become the new bottleneck.


On 20/07/2022 11:11, Pawar, Amit wrote:


[Public]

Hi all,

(My previous mail is not appearing in mailing list and resending again 
after 2 days)


I am Amit, working at AMD Bangalore, India. I am new to Cassandra 
and need to do Cassandra testing on large core-count systems. Usually one 
should test on multi-node Cassandra, but I started with single node testing 
to understand how Cassandra scales with increasing core counts.


Test details:

Operation: Insert > 90% (insert heavy)

Operation: Scan < 10%

Cassandra: 3.11.10 and trunk

Benchmark: TPCx-IOT (similar to YCSB)

Results show scaling is poor beyond 16 cores and it is almost linear. 
The following are the common settings that helped to get the better 
scores.


 1. Memtable heap allocation: offheap_objects
 2. memtable_flush_writers > 4
 3. Java heap: 8-32GB with survivor ratio tuning
 4. Separate storage space for Commitlog and Data.

Many online blogs suggest adding a new Cassandra node when unable to 
take high writes. But with large systems, high writes should be easily 
taken due to many cores. The need was to improve the scaling with more 
cores, so this suggestion didn’t help. After many rounds of testing it 
was observed that the current implementation uses a single thread for 
Commitlog syncing activity. Commitlog files are mapped using mmap 
system call and changes are written with msync. Periodic syncing with 
JVisualvm tool shows


 1. thread is not 100% busy with Ramdisk usage for Commitlog storage
and scaling improved on large systems. Ramdisk scores > 2 X NVME
score.
 2. thread becomes 100% busy with NVME usage for Commitlog and score
does not improve much beyond 16 cores.

Linux kernel uses 4K pages for mapped memory with mmap system call. 
So, to understand this further, disk I/O testing was done using FIO 
tool and results shows


 1. NVME 4K random R/W throughput is very less with single thread and
it improves with multi-threaded.
 2. Ramdisk 4K random R/W throughput is good with single thread only
and also better with multi-threaded

Based on the FIO test results following two ideas were tested for 
Commitlog files with Cassandra-3.1.10 sources.


 1. Enable Direct IO feature for Commitlog files (similar to
[CASSANDRA-14466] Enable Direct I/O - ASF JIRA (apache.org)
 )
 2. Enable Multi-threaded syncing for Commitlog files.

First one need to retest. Interestingly second one helped to improve 
the score with “NVME” disk. NVME disk configuration score is almost 
within 80-90% of ramdisk and 2 times that of the single threaded 
implementation. Multithreading was enabled by adding a new thread pool in the 
“AbstractCommitLogSegmentManager” class and changing the syncing thread into 
the manager thread for this new thread pool to take care of synchronization. 
Only tested with Cassandra-3.11.10 and needs complete testing but this 
change is working in my test environment. Tried these few experiments 
so that I could discuss here and seek your valuable suggestions to 
identify the right fix for insert heavy workloads.


 1. Is it a good idea to convert the single threaded syncing to a
multi-threaded implementation to improve the disk IO?
 2. Direct I/O throughput is high with single thread and best fit for
Commitlog case due to file size. This will improve writes on small
to large systems. Good to bring this support for Commitlog files?

Please suggest.

Thanks,

Amit Pawar
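
A rough sketch of the second idea, as I understand it from the description 
above: a single manager thread still decides when to sync, but the actual 
msync/force calls for the pending segments are fanned out to a small pool. 
Class names are illustrative only and do not correspond to the actual 
Cassandra code:

    import java.nio.MappedByteBuffer;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public final class ParallelSegmentSyncer
    {
        private final ExecutorService syncPool;

        public ParallelSegmentSyncer(int threads)
        {
            this.syncPool = Executors.newFixedThreadPool(threads);
        }

        // Called by the single sync manager thread. Each segment's mapped buffer
        // is forced on a pool thread; the manager waits for all of them before
        // marking the segments as synced, preserving the existing ordering guarantees.
        public void syncAll(List<MappedByteBuffer> pendingSegments)
                throws InterruptedException, ExecutionException
        {
            List<Future<?>> futures = new ArrayList<>();
            for (MappedByteBuffer segment : pendingSegments)
                futures.add(syncPool.submit(() -> { segment.force(); }));
            for (Future<?> f : futures)
                f.get(); // propagate any sync failure back to the manager thread
        }

        public void shutdown()
        {
            syncPool.shutdown();
        }
    }

Whether this helps will depend heavily on the underlying device: as the FIO 
results described above suggest, NVMe benefits from concurrent flushes while 
a ramdisk is already saturated by a single thread.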


Re: Dropping Python 3.6 support in 4.1

2022-04-05 Thread Bowen Song

I'm against this change.

CentOS 7 only has Python up to 3.6 available from the EPEL repository, 
and the maintenance updates for CentOS 7 ends in 2024. See: 
https://wiki.centos.org/About/Product


To install Python>3.6 on CentOS 7, the user must either use a 3rd party 
repository that's not maintained by the same project or compile it from 
source. None of these is as simple as "yum install epel-release && yum 
install python36".


I would strongly recommend keeping Python 3.6 compatibility until 
2024-06-30 when the CentOS 7 maintenance updates stop.


On 05/04/2022 11:35, Stefan Miklosovic wrote:

Hello,

I stumbled upon this ticket (1)

We will have Cassandra running with unsupported Python 3.6 once we
release 4.1 which is not good in my books.

I would like to try to bump it to 3.8 as minimum, it will get security
updates to 2024 at least.

Does it make sense to people? Especially so close to the freeze. I
guess we would need to update Python in Jenkins images mostly and so
on. I am running 3.8.10 locally with all the tests so it really seems
to be just a version bump.

Regards

(1) https://issues.apache.org/jira/browse/CASSANDRA-17450


Re: Updating our Code Contribution/Style Guide

2022-03-14 Thread Bowen Song
Oh, I certainly don't mean to block the proposed update. I'm sorry if it 
sounded that way. All I'm saying is we should add Python code style 
guides to it, and a follow up addendum can surely do the job.


On 14/03/2022 11:41, Josh McKenzie wrote:
+1 to the doc as written. A good portion of it also applies to python 
code style (structure, clarity in naming, etc).


Perhaps a python specific addendum as a follow up might make sense Bowen?

On Mon, Mar 14, 2022, at 7:21 AM, bened...@apache.org wrote:


I think the community would be happy to introduce a python style 
guide, but I am not well placed to do so, having chosen throughout my 
career to limit my exposure to python. Probably a parallel effort 
would be best - perhaps you could work with Stefan and others to 
produce such a proposal?




*From: *Bowen Song 
*Date: *Monday, 14 March 2022 at 10:53
*To: *dev@cassandra.apache.org 
*Subject: *Re: Updating our Code Contribution/Style Guide

I found there's no mention of Python code style at all. If we are 
going to update the style guide, can this be addressed too?


FYI, a quick "flake8" style check shows many existing issues in the 
Python code, including libraries imported but unused, redefinition of 
unused imports and invalid escape sequence in strings.



On 14/03/2022 09:41, bened...@apache.org wrote:

Our style guide hasn’t been updated in about a decade, and I
think it is overdue some improvements that address some
shortcomings as well as modern facilities such as streams and
lambdas.


Most of this was put together for an effort Dinesh started a few
years ago, but has languished since, in part because the project
has always seemed to have other priorities. I figure there’s
never a good time to raise a contended topic, so here is my
suggested update to contributor guidelines:



https://docs.google.com/document/d/1sjw0crb0clQin2tMgZLt_ob4hYfLJYaU4lRX722htTo


Many of these suggestions codify norms already widely employed,
sometimes in spite of the style guide, but some likely remain
contentious. Some potentially contentious things to draw your
attention to:


 1. Deemphasis of getX() nomenclature, in favour of richer set of
prefixes and more succinct simple x() to retrieve where clear
 2. Avoid implementing methods, incl. equals(), hashCode() and
toString(), unless actually used
 3. Modified new-line rules for multi-line function calls
 4. External dependency rules (require DISCUSS thread before
introducing)





Re: Updating our Code Contribution/Style Guide

2022-03-14 Thread Bowen Song
I think there's two separate issues, the style guide for Python code, 
and fixing the existing code style. In my opinion, the style guide 
should come first, and we can follow that to fix the existing code's style.


BTW, I can see the changes you made in CASSANDRA-17413 has already been 
merged into trunk. However, latest code in trunk still has all the 
issues I mentioned in the previous email. Some of the issues are purely 
code styling, such as visual indentation and white spaces around 
operators, but some are a little more than that, such as unused imports 
which can slightly impact performance and memory usage.


I think there are two valid approaches to address the issue. We could 
create a style guide first, and then fix them all in one go; or split 
the issues into two categories: pure style issues to be fixed after the 
code style guide is published, and other issues which can be fixed now. 
I personally prefer the former, because it involves less amount of work 
- no need to spend time to triage the issues reported by tools such as 
"flake8".


On 14/03/2022 11:11, Stefan Miklosovic wrote:

Hi Bowen,

we were working on that recently, like CASSANDRA-17413 + a lot of
improvements around Python stuff are coming. If you identify more
places for improvements we are definitely interested.

Regards

On Mon, 14 Mar 2022 at 11:53, Bowen Song  wrote:

I found there's no mention of Python code style at all. If we are going to 
update the style guide, can this be addressed too?

FYI, a quick "flake8" style check shows many existing issues in the Python 
code, including libraries imported but unused, redefinition of unused imports and invalid 
escape sequence in strings.


On 14/03/2022 09:41, bened...@apache.org wrote:

Our style guide hasn’t been updated in about a decade, and I think it is 
overdue some improvements that address some shortcomings as well as modern 
facilities such as streams and lambdas.



Most of this was put together for an effort Dinesh started a few years ago, but 
has languished since, in part because the project has always seemed to have 
other priorities. I figure there’s never a good time to raise a contended 
topic, so here is my suggested update to contributor guidelines:



https://docs.google.com/document/d/1sjw0crb0clQin2tMgZLt_ob4hYfLJYaU4lRX722htTo



Many of these suggestions codify norms already widely employed, sometimes in 
spite of the style guide, but some likely remain contentious. Some potentially 
contentious things to draw your attention to:



Deemphasis of getX() nomenclature, in favour of richer set of prefixes and more 
succinct simple x() to retrieve where clear
Avoid implementing methods, incl. equals(), hashCode() and toString(), unless 
actually used
Modified new-line rules for multi-line function calls
External dependency rules (require DISCUSS thread before introducing)






Re: Updating our Code Contribution/Style Guide

2022-03-14 Thread Bowen Song
I found there's no mention of Python code style at all. If we are 
going to update the style guide, can this be addressed too?


FYI, a quick "flake8" style check shows many existing issues in the 
Python code, including libraries imported but unused, redefinition of 
unused imports and invalid escape sequence in strings.
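
For reference, the categories mentioned map onto standard flake8 codes, so a 
check along these lines reproduces the findings (the paths are illustrative; 
adjust to wherever the Python sources live in the tree):

    # F401: imported but unused, F811: redefinition of unused name, W605: invalid escape sequence
    flake8 --select=F401,F811,W605 pylib/ bin/cqlsh.py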



On 14/03/2022 09:41, bened...@apache.org wrote:


Our style guide hasn’t been updated in about a decade, and I think it 
is overdue some improvements that address some shortcomings as well as 
modern facilities such as streams and lambdas.


Most of this was put together for an effort Dinesh started a few years 
ago, but has languished since, in part because the project has always 
seemed to have other priorities. I figure there’s never a good time to 
raise a contended topic, so here is my suggested update to contributor 
guidelines:


https://docs.google.com/document/d/1sjw0crb0clQin2tMgZLt_ob4hYfLJYaU4lRX722htTo

Many of these suggestions codify norms already widely employed, 
sometimes in spite of the style guide, but some likely remain 
contentious. Some potentially contentious things to draw your 
attention to:


  * Deemphasis of getX() nomenclature, in favour of richer set of
prefixes and more succinct simple x() to retrieve where clear
  * Avoid implementing methods, incl. equals(), hashCode() and
toString(), unless actually used
  * Modified new-line rules for multi-line function calls
  * External dependency rules (require DISCUSS thread before introducing)


Re: [DISCUSS] CASSANDRA-17292 Move cassandra.yaml toward a nested structure around major database concepts

2022-02-23 Thread Bowen Song
I don't see the two formats being mutually exclusive. For example, if 
only one option is different from the default in a deeply nested 
structure, it would be far easier to set "a.b.c.d.e: true" than having 5 
lines in the config file. Mixing both formats in the same settings file 
seems like a reasonable thing to do.
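
To illustrate (with made-up option names), the same override could be written 
either way:

    # nested form
    a:
      b:
        c:
          d:
            e: true

    # flat form, overriding a single deeply nested option
    a.b.c.d.e: true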


On 23/02/2022 19:47, Jeremy Hanna wrote:
If we support both formats for a time, I just would want to make 
absolutely sure that it will read only one or the other so there's no 
uncertainty about the server configuration.  Perhaps to avoid 
unforeseen migration problems, we only read the old format if a 
specific flag is set?  So with version 5, we only read the new format 
by default.  So if you only have the old format and you try to start 
5.0, it will fail with a log message about a JVM option to be used 
("READ_CASSANDRA_YAML" or something).  So if you enable that, you 
*only* read the old config.  It would be one or the other so you don't 
have weird dilemmas of which one to choose.


On Feb 23, 2022, at 11:30 AM, Caleb Rackliffe 
 wrote:



Continuing to parse the old format for some time seems unavoidable, 
and allowing dot-separated options in the old format seems reasonable.


There will certainly be some interesting problems when we move into 
implementation space with this. One approach might be to implement a 
clean object model that corresponds to the new format, work out how 
it's parsed/populated from the file, and then have some kind of 
converter from the old Config object to the new object model that 
allows us to provide values to DatabaseDescriptor from only the new 
one (thereby avoiding any changes to the places all over the codebase 
that use DD).


On Wed, Feb 23, 2022 at 4:46 AM Bowen Song  wrote:

I agree with Benedict, there's legit use cases for both the flat
and structured config file format. The operator should be able to
choose which one is best suited for their own use case. It will
also make the upgrade process easier if both formats are
supported by future versions of Cassandra.

On 23/02/2022 07:52, bened...@apache.org wrote:


I agree that a new configuration layout should be introduced
once only, not incrementally.

However, I disagree that we should immediately deprecate the old
config file and refuse to parse it. We can maintain
compatibility indefinitely at low cost, so we should do so.

Users of the old format, when using new configuration options,
can simply use dot separators to specify them. Since most
settings are not required, this is by far the least painful
upgrade process.

*From: *Berenguer Blasi 
<mailto:berenguerbl...@gmail.com>
*Date: *Wednesday, 23 February 2022 at 06:53
*To: *dev@cassandra.apache.org 
<mailto:dev@cassandra.apache.org>
*Subject: *Re: [DISCUSS] CASSANDRA-17292 Move cassandra.yaml
toward a nested structure around major database concepts

+1 to a non-incremental approach as well.

On 23/2/22 1:27, Caleb Rackliffe wrote:
> @Patrick I’m absolutely intending for this to be a 5.0
concern. The only reason why it would have any bearing on 4.x is
the case where we’re adding new config that could fit into the
v2 structure now and not require any later changes.
>
>> On Feb 22, 2022, at 3:22 PM, Bernardo Sanchez

<mailto:bernard...@pointclickcare.com> wrote:
>>
>> unsubscribe
>>
>> -Original Message-
>> From: Stefan Miklosovic 
<mailto:stefan.mikloso...@instaclustr.com>
>> Sent: Tuesday, February 22, 2022 3:53 PM
>> To: dev@cassandra.apache.org
>> Subject: Re: [DISCUSS] CASSANDRA-17292 Move cassandra.yaml
toward a nested structure around major database concepts
>>
>> "EXTERNAL EMAIL" - This email originated from outside of the
organization. Do not click or open attachments unless you
recognize the sender and know the content is safe. If you are
unsure, please contact hel...@pointclickcare.com.
>>
>> I want to add that to, however, on the other hand, we also do
have dtests in Python and they need to run with old configs too.
That is what Ekaterina was doing - supporting old configuration
while introducing new one. If we make "a big cut" and old way of
doing things would not be possible, how are we going to treat
this in dtests when we will have stuff for 3.11, 4 on old
configs and 5 on new configs?
>>
>>> On Tue, 22 Feb 2022 at 21:48, Stefan Miklosovic

<mailto:stefan.mikloso...@instaclustr.com> wrote:
>>>
>>> +1 to what Patrick says.
>>>
>>>> On Tue, 22 Feb 2022 at 21:40, Patrick McFadin
 <mailto:pmcfa...@gmail.com> wrote:
>>>>

Re: [DISCUSS] CASSANDRA-17292 Move cassandra.yaml toward a nested structure around major database concepts

2022-02-23 Thread Bowen Song
I agree with Benedict, there's legit use cases for both the flat and 
structured config file format. The operator should be able to choose 
which one is best suited for their own use case. It will also make the 
upgrade process easier if both formats are supported by future versions 
of Cassandra.


On 23/02/2022 07:52, bened...@apache.org wrote:


I agree that a new configuration layout should be introduced once 
only, not incrementally.


However, I disagree that we should immediately deprecate the old 
config file and refuse to parse it. We can maintain compatibility 
indefinitely at low cost, so we should do so.


Users of the old format, when using new configuration options, can 
simply use dot separators to specify them. Since most settings are not 
required, this is by far the least painful upgrade process.


*From: *Berenguer Blasi 
*Date: *Wednesday, 23 February 2022 at 06:53
*To: *dev@cassandra.apache.org 
*Subject: *Re: [DISCUSS] CASSANDRA-17292 Move cassandra.yaml toward a 
nested structure around major database concepts


+1 to a non-incremental approach as well.

On 23/2/22 1:27, Caleb Rackliffe wrote:
> @Patrick I’m absolutely intending for this to be a 5.0 concern. The 
only reason why it would have any bearing on 4.x is the case where 
we’re adding new config that could fit into the v2 structure now and 
not require any later changes.

>
>> On Feb 22, 2022, at 3:22 PM, Bernardo Sanchez 
 wrote:

>>
>> unsubscribe
>>
>> -Original Message-
>> From: Stefan Miklosovic 
>> Sent: Tuesday, February 22, 2022 3:53 PM
>> To: dev@cassandra.apache.org
>> Subject: Re: [DISCUSS] CASSANDRA-17292 Move cassandra.yaml toward a 
nested structure around major database concepts

>>
>> "EXTERNAL EMAIL" - This email originated from outside of the 
organization. Do not click or open attachments unless you recognize 
the sender and know the content is safe. If you are unsure, please 
contact hel...@pointclickcare.com.

>>
>> I want to add that to, however, on the other hand, we also do have 
dtests in Python and they need to run with old configs too. That is 
what Ekaterina was doing - supporting old configuration while 
introducing new one. If we make "a big cut" and old way of doing 
things would not be possible, how are we going to treat this in dtests 
when we will have stuff for 3.11, 4 on old configs and 5 on new configs?

>>
>>> On Tue, 22 Feb 2022 at 21:48, Stefan Miklosovic 
 wrote:

>>>
>>> +1 to what Patrick says.
>>>
 On Tue, 22 Feb 2022 at 21:40, Patrick McFadin 
 wrote:


 I'm going to put up a red flag of making config file changes of 
this scale on a dot release. This should really be a 5.0 consideration.


 With that, I would propose a #5. 5.0 nodes will only read the new 
config files and reject old config files. If any of you went through 
the config file changes from Apache HTTPd 1.3 -> 2.0 you know how much 
of a lifesaver that can be for ops. Make it a part of the total 
upgrade to a new major version, not a radical change inside of a dot 
version, and make it a clean break. No "legacy config" laying around. 
That's just a recipe for surprises later if there are new required 
config values and somebody doesn't even realize they have some old 4.x 
yaml files laying around.


 Patrick

 On Tue, Feb 22, 2022 at 11:51 AM Tibor Répási 
 wrote:

> Glad to be agree on #4. That feature could be add anytime.
>
> If a version element is added to the YAML, then it is not 
necessary to change the filename, thus we could end up with #3. The 
value of the version element could default to 1 in the first phase, 
which does not need any change for legacy format configuration. New 
config format must include version: 2. When in some later version the 
support for legacy configuration is removed, the default for the 
version element could be changed to 2 or removed.

>
> On 22. Feb 2022, at 19:30, Caleb Rackliffe 
 wrote:

>
> My initial preference would be something like combining #1 and 
#4. We could add something like a simple "version: <1|2>" element to 
the YAML that would eliminate any possible confusion about back-compat 
within a given file.

>
> Thanks for enumerating these!
>
> On Tue, Feb 22, 2022 at 10:42 AM Tibor Répási 
 wrote:

>> Hi,
>>
>> I like the idea of having cassandra.yaml better structured, as 
an operator, my primer concern is the transition. How would we change 
the config structure from legacy to the new one during a rolling 
upgrade? My thoughts on this:

>>
>> 1. Legacy and new configuration is stored in different files. 
Cassandra will read the legacy file on startup if it exists, the new 
one otherwise. May raise warning on startup when legacy was used.

>>    pros:
>> - separate files for separate formats
>> - clean and operator controlled switch to new format
>> - already known procedure, e.g. change from 
PropertyFileSnitch to 

Re: Client password hashing

2022-02-16 Thread Bowen Song
To me this doesn't sound very useful. Here are a few threat models I can 
think of that may be related to this proposal, why this doesn't address 
them, and what should be done instead.


1. passwords sent over the network in plaintext allow a passive packet 
sniffer to learn the password


When the user logs in and authenticates themselves, they will have to 
send both the username and password to the server in plaintext anyway.


Securing the connection with TLS should address this concern.

2. malicious intermediaries (external loadbalancer, middleware, etc.) are 
able to learn the password


The admin user must log in via the intermediary before creating/altering 
other users, which exposes the admin user's credentials to the malicious 
intermediary.


Only use trusted intermediaries, and use TLS between the client & 
Cassandra server wherever possible (e.g. don't terminate TLS at the 
loadbalancer).


3. accidentally logging the password to an insecure log file

Logging a hashed password to an insecure log file is still very bad

The logger module should correctly redact the data
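
Something along these lines (purely an illustration of what I mean by 
redacting, not existing Cassandra code - the class name and the regex are 
made up for the example) would cover the common CREATE/ALTER ROLE case:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public final class PasswordRedactor
    {
        // Matches PASSWORD 'secret' or PASSWORD = 'secret' (case-insensitive).
        private static final Pattern PASSWORD_LITERAL =
                Pattern.compile("(PASSWORD\\s*(?:=\\s*)?)'[^']*'", Pattern.CASE_INSENSITIVE);

        // Replaces the password literal with a fixed placeholder before logging.
        public static String redact(String cql)
        {
            Matcher m = PASSWORD_LITERAL.matcher(cql);
            return m.replaceAll("$1'***'");
        }

        public static void main(String[] args)
        {
            // Prints: CREATE ROLE alice WITH PASSWORD = '***' AND LOGIN = true
            System.out.println(redact("CREATE ROLE alice WITH PASSWORD = 'secret' AND LOGIN = true"));
        }
    }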


If this proposal helps mitigate a different threat model that you have 
in mind, please kindly share it with us.



On 16/02/2022 07:44, Berenguer Blasi wrote:

Hi all,

I would like to propose to add support for client password hashing 
(https://issues.apache.org/jira/browse/CASSANDRA-17334). If anybody 
has any concerns or question with this functionality I will be happy 
to discuss them.


Thx in advance.



Re: [DISCUSS] CEP-19: Trie memtable implementation

2022-02-09 Thread Bowen Song
TBH, I don't have an opinion on the configuration. I just want to say 
that if in the end we decide the configuration in the YAML should 
override the table schema, I would recommend specifying a list of 
whitelisted (or blacklisted) "templates" in the YAML file; the template 
chosen by the table schema is used if it's enabled, otherwise we fall 
back to a default template, which could be the first element in the 
whitelist if that's used, or a separate configuration entry if a 
blacklist is used. The list should be optional in the YAML, and an empty 
list or the absence of it means everything is enabled.


Advantage of this:

1. it doesn't require the operator to configure this, as an empty or 
absent list by default enables all templates and should work fine in 
most cases.


2. it allows the operator to whitelist / blacklist any template if ever 
needed (e.g. due to a bug), and also allow them to choose a fallback option.


3. the table schema has priority as long as the chosen template is not 
explicitly disabled by the YAML.


4. it allows the operator to selectively disable some templates without 
forcing all tables to use the same template specified by the YAML.
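
To make the selection logic concrete, here is a rough sketch of what I 
have in mind (the names are made up for illustration; this is not code 
from the actual patch):

    import java.util.List;

    final class MemtableTemplateSelector
    {
        // An empty or absent whitelist means every template is enabled, so the
        // schema's choice always wins. Otherwise the schema's choice is used
        // only if it is whitelisted, and we fall back to the configured default
        // (or the first whitelisted template) when it is not.
        static String select(String schemaTemplate, List<String> whitelist, String defaultTemplate)
        {
            boolean everythingEnabled = whitelist == null || whitelist.isEmpty();
            if (everythingEnabled || whitelist.contains(schemaTemplate))
                return schemaTemplate;
            return defaultTemplate != null ? defaultTemplate : whitelist.get(0);
        }
    }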



On 09/02/2022 09:43, bened...@apache.org wrote:


Why not have some default templates that can be specified by the 
schema without touching the yaml, but overridden in the yaml as necessary?


*From: *Branimir Lambov 
*Date: *Wednesday, 9 February 2022 at 09:35
*To: *dev@cassandra.apache.org 
*Subject: *Re: [DISCUSS] CEP-19: Trie memtable implementation

If I understand this correctly, you prefer _not_ to have an option to 
give the configuration explicitly in the schema. I.e. force the 
configurations ("templates" in current terms) to be specified in the 
yaml, and only allow tables to specify which one to use among them?


This does sound at least as good to me, and I'll happily change the API.

Regards,

Branimir

On Tue, Feb 8, 2022 at 10:40 PM Dinesh Joshi  wrote:

My quick reading of the code suggests that schema will override
the operator's default preference in the YAML. In the event of a
bug in the new implementation, there could be situation where the
operator might need to override this via the YAML.



On Feb 8, 2022, at 12:29 PM, Jeremiah D Jordan
 wrote:

I don’t really see most users touching the default
implementation.  I would expect the main reason someone would
change would be

1. They run into some bug that is only in one of the
implementations.

2. They have persistent memory and so want to use
https://issues.apache.org/jira/browse/CASSANDRA-13981

Given that I doubt most people will touch it, I think it is
good to give advanced operators the ability to have more
control over switching to things that have new performance
characteristics.  So I like the idea that the proposed
configuration approach which allows someone to change to a new
implementation one node at a time and only for specific tables.



On Feb 8, 2022, at 2:21 PM, Dinesh Joshi
 wrote:

Thank you for sharing the perf test results.

Going back to the schema vs yaml configuration. I am
concerned users may pick the wrong implementation for
their use-case. Is there any chance for us to
automatically pick a MemTable implementation based on
heuristics? Do we foresee users ever picking the existing
SkipList implementation over the Trie Given the
performance tests, it seems the Trie implementation is the
clear winner.

To be clear, I am not suggesting we remove the existing
implementation. I am for maintaining a pluggable API for
various components.

Dinesh



On Feb 7, 2022, at 8:39 AM, Branimir Lambov
 wrote:

Added some performance results to the ticket:
https://issues.apache.org/jira/browse/CASSANDRA-17240

Regards,

Branimir

On Sat, Feb 5, 2022 at 10:59 PM Dinesh Joshi
 wrote:

This is excellent. Thanks for opening up this CEP.
It would be great to get some stats around GC
allocation rate / memory pressure, read & write
latencies, etc. compared to existing implementation.

Dinesh



On Jan 18, 2022, at 2:13 AM, Branimir Lambov
 wrote:

The memtable pluggability API (CEP-11) is
per-table to enable memtable selection
that suits specific workflows. It also makes
full sense to permit per-node configuration,
both to be able to modify the 

Re: Issue while trying to run pytest command

2022-01-10 Thread Bowen Song
Did you run the pytest command in the cassandra directory (the cassandra 
git repo) or the cassandra-dtest directory (the cassandra-dtest git 
repo)? You should run the pytest command in the cassandra-dtest directory.


On 09/01/2022 11:33, Manish G wrote:
Initial installation is done following 
https://github.com/apache/cassandra-dtest#python-dependencies.



On Sun, Jan 9, 2022 at 11:43 AM Manish G 
 wrote:


I am trying to run pytest command :

(dtest) manish.ghildiyal@MacBook-Pro-3 cassandra % pytest
--cassandra-dir=/Users/../cassandra


But I get error as:

pytest: error: unrecognized arguments:
--cassandra-dir=/Users/./cassandra


I have done initial installation following this.

Do I need to do any more configuration?

Manish


Re: [DISCUSS] Disabling MIME-part filtering on this mailing list

2021-12-21 Thread Bowen Song
I have just received a confirmation from Infra informing me that this 
change has been made. I'm sending this email as an update but also as a 
test. Hopefully it arrives in your inbox without trouble, and my email 
address no longer has the ".INVALID" appended to it.


On 04/12/2021 17:15, Bowen Song wrote:

Hello,


Currently this mailing list has MIME-part filtering turned on, which 
will result in "From:" address munging (appending ".INVALID" to the 
sender's email address) for domains enforcing strict DMARC rules, such 
as apple.com, zoho.com and all Yahoo.** domains. This behaviour may 
cause some emails to be treated as spam by the recipients' email 
service providers, because the resulting "From:" address, such as 
"some...@yahoo.com.INVALID", is not valid and cannot be verified.


I have created a Jira ticket INFRA-22548 
<https://issues.apache.org/jira/browse/INFRA-22548> asking to change 
this, but the Infra team said dropping certain MIME part types is to 
prevent spam and harmful attachments, and would require a consensus 
from the project before they can make the change. Therefore I'm 
sending this email asking for your opinions on this.


To be clear, turning off the MIME-part filtering will not turn off the 
anti-spam and anti-virus feature on the mailing list; all emails sent 
to the list will still need to pass the checks before being forwarded 
to subscribers. Modern (since the 90s?) anti-spam and anti-virus software 
will scan the MIME parts too, in addition to the plain-text and/or 
HTML email body. Your email service provider is also almost certainly 
going to have their own anti-spam and anti-virus software, in addition 
to the one on the mailing list. The difference is whether the mailing 
list proactively removes MIME parts not in the predefined whitelist.


To help you understand the change, here's the difference between the 
two behaviours:



With the MIME-part filtering enabled (current behaviour)

* the mailing list will remove certain MIME-part types, such as 
executable file attachments, before forwarding it


* the mailing list will append ".INVALID" to some senders' email address

* the emails from the "*@*.INVALID" sender address are more likely to 
end up in recipients' spam folder


* it's harder for people to directly reply to someone whose email 
address has been modified in this way


* recipients running their own email server without anti-spam and/or 
anti-virus software on it have some extra protections



With MIME-part filtering disabled

* the mailing list forwards all non-spam and non-infected emails as they 
are, without changing them


* the mailing list will not change senders' email address

* the emails from this mailing list are less likely to end up in 
recipients' spam folder


* it's easier for people to directly reply to anyone in this mailing list

* recipients running their own email server without anti-spam and/or 
anti-virus software on it may be exposed to some threats



What's your opinion on this? Do you support or oppose disabling the 
MIME-part filtering on the Cassandra-dev mailing list?



p.s.: as you can see, my email address has the ".INVALID" appended to 
it by this mailing list.



Regards,

Bowen



Re: [DISCUSS] Disabling MIME-part filtering on this mailing list

2021-12-04 Thread Bowen Song
Hmm.. It's too late to change that. I opened it as "DISCUSS" because I 
was not sure if the information in it was enough for people to vote on 
it. There's clearly a lot more that can be asked or discussed. For 
example, the change will also stop the "To unsubscribe, ..." footer from 
being appended to some emails (but not this one), and that may make it 
harder for some people to find the unsubscribe link/address.


Hint #1: you can find the unsubscribe button where you found 
 the subscribe button.


Hint #2: you can also find the unsubscribe address in the email header 
"List-Unsubscribe"


Hint #3: failing all the above, you can also ask in the list. I'm sure 
other people will help.


On 04/12/2021 22:31, Mick Semb Wever wrote:

What's your opinion on this? Do you support or oppose disabling the
MIME-part filtering on the Cassandra-dev mailing list?


+1

(nit: am thinking the thread should have had a [VOTE] prefix, since you are
calling out for consensus)


[DISCUSS] Disabling MIME-part filtering on this mailing list

2021-12-04 Thread Bowen Song

Hello,


Currently this mailing list has MIME-part filtering turned on, which 
will result in "From:" address munging (appending ".INVALID" to the 
sender's email address) for domains enforcing strict DMARC rules, such 
as apple.com, zoho.com and all Yahoo.** domains. This behaviour may 
cause some emails to be treated as spam by the recipients' email service 
providers, because the resulting "From:" address, such as 
"some...@yahoo.com.INVALID", is not valid and cannot be verified.


I have created a Jira ticket INFRA-22548 
 asking to change 
this, but the Infra team said dropping certain MIME part types is to 
prevent spam and harmful attachments, and would require a consensus from 
the project before they can make the change. Therefore I'm sending this 
email asking for your opinions on this.


To be clear, turning off the MIME-part filtering will not turn off the 
anti-spam and anti-virus feature on the mailing list; all emails sent to 
the list will still need to pass the checks before being forwarded to 
subscribers. Modern (since the 90s?) anti-spam and anti-virus software will 
scan the MIME parts too, in addition to the plain-text and/or HTML email 
body. Your email service provider is also almost certainly going to have 
their own anti-spam and anti-virus software, in addition to the one on 
the mailing list. The difference is whether the mailing list proactively 
removes MIME parts not in the predefined whitelist.


To help you understand the change, here's the difference between the two 
behaviours:



With the MIME-part filtering enabled (current behaviour)

* the mailing list will remove certain MIME-part types, such as 
executable file attachments, before forwarding it


* the mailing list will append ".INVALID" to some senders' email address

* the emails from the "*@*.INVALID" sender address are more likely to 
end up in recipients' spam folder


* it's harder for people to directly reply to someone whose email 
address has been modified in this way


* recipients running their own email server without anti-spam and/or 
anti-virus software on it have some extra protections



With MIME-part filtering disabled

* the mailing list forwards all non-spam and non-infected emails as they 
are, without changing them


* the mailing list will not change senders' email address

* the emails from this mailing list are less likely to end up in 
recipients' spam folder


* it's easier for people to directly reply to anyone in this mailing list

* recipients running their own email server without anti-spam and/or 
anti-virus software on it may be exposed to some threats



What's your opinion on this? Do you support or oppose disabling the 
MIME-part filtering on the Cassandra-dev mailing list?



p.s.: as you can see, my email address has the ".INVALID" appended to it 
by this mailing list.



Regards,

Bowen


Re: [DISCUSS] Nested YAML configs for new features

2021-11-29 Thread Bowen Song
In ElasticSearch, the default is a flattened format with almost all 
lines commented out. See 
https://github.com/elastic/elasticsearch/blob/master/distribution/src/config/elasticsearch.yml


I guess they chose to do that because users can uncomment individual 
lines to make changes. In a structured config file, the user will have 
to uncomment all lines containing the parent keys to get it to work. For 
example, if someone wants to set the config keyABB to a non-default 
value, they will have to correctly uncomment 3 lines: keyA, keyAB and 
keyABB, which can be annoying and makes it easy to make a mistake. If any 
of the first two keys is not uncommented, the YAML file will still be 
valid but a config like keyA.keyAB.keyABB might just get silently 
ignored by the database.


   keyX:
     keyY:
       keyZ: value
   # keyA:
   #   keyAA:
   #     keyAAA: value
   #   keyAB:
   #     keyABA: value
   #     keyABB: value

On 29/11/2021 15:54, Benjamin Lerer wrote:

I do not think that supporting both options is an issue. The settings
virtual table would have to use the flattened version.
If we support both formats, the question would be: what should be the one
used by default in the configuration file?

Le ven. 26 nov. 2021 à 15:40, bened...@apache.org a écrit :


This is the approach I favour for config files also. We had a much less
engaged discussion on this topic only a few months ago, so glad to see more
people getting involved now.

I would however personally prefer to see the configuration file slowly
deprecated (if perhaps never retired), in favour of virtual tables, so that
operators may easily set configurations for the entire cluster. Ideally it
would be possible to specify configuration per cluster, per DC and per
node, with the most specific configuration applying I would like to see a
similar hierarchy for Keyspace, Table and Per-Query options. Ideally only
the barest minimum number of options would be necessary to supply in a
config file, and only on first launch – seed nodes, for instance.

So whatever design we employ here, we should IMO be aiming for it to be
compatible with a CQL representation also.


From: Bowen Song
Date: Wednesday, 24 November 2021 at 18:15
To:dev@cassandra.apache.org  
Subject: Re: [DISCUSS] Nested YAML configs for new features
Since you mentioned ElasticSearch, I'm actually pretty happy with their
config file syntax. It allows the user to completely flatten out the
entire config file. To give people who aren't familiar with ElasticSearch
an idea, here is a config file we use:

 cluster.name: foobar

 node.remote_cluster_client: false
 node.name: "foo.example.com"
 node.master: true
 node.data: true
 node.ingest: true
 node.ml: false

 xpack.ml.enabled: false
 xpack.security.enabled: false
 xpack.security.audit.enabled: false
 xpack.watcher.enabled: false

 action.auto_create_index: "+.,-*"

 network.host: _global_

 discovery.zen.hosts_provider: file
 discovery.zen.minimum_master_nodes: 2

 http.publish_host: "foo.example.com"
 http.publish_port: 443
 http.bind_host: 127.0.0.1

 transport.publish_host: "bar.example.com"
 transport.bind_host: 0.0.0.0

 indices.fielddata.cache.size: 1GB
 indices.breaker.total.use_real_memory: false

 path.logs: /var/log/elasticsearch
 path.data: /var/lib/elasticsearch/data

As you can see we can use the flat (grep-able) syntax for everything.
This is also human readable because we can group options together by
inserting empty lines between them.

The equivalent of the above in a structured syntax will be:

   cluster:
     name: foobar

   node:
     remote_cluster_client: false
     name: "foo.example.com"
     master: true
     data: true
     ingest: true
     ml: false

   xpack:
     ml:
       enabled: false
     security:
       enabled: false
       audit:
         enabled: false
     watcher:
       enabled: false

   action:
     auto_create_index: "+.,-*"

   network:
     host: _global_

   discovery:
     zen:
       hosts_provider: file
       minimum_master_nodes: 2

   http:
     publish_host: "foo.example.com"
     publish_port: 443
     bind_host: 127.0.0.1

   transport:
     publish_host: "bar.example.com"
     bind_host: 0.0.0.0

   indices:
     fielddata:
       cache:
         size: 1GB
     breaker:
       total:
         use_real_memory: false

   path:
     logs: /var/log/elasticsearch
     data: /var/lib/elasticsearch/data

This may be easier to read for some people, but it is a total nightmare
for "grep" - so many keys have identical names, such as "enabled".

Also, for the virtual tables

Re: [DISCUSS] Nested YAML configs for new features

2021-11-24 Thread Bowen Song
Since you mentioned ElasticSearch, I'm actually pretty happy with their 
config file syntax. It allows the user to completely flatten out the 
entire config file. To give people who aren't familiar with ElasticSearch 
an idea, here is a config file we use:


   cluster.name: foobar

   node.remote_cluster_client: false
   node.name: "foo.example.com"
   node.master: true
   node.data: true
   node.ingest: true
   node.ml: false

   xpack.ml.enabled: false
   xpack.security.enabled: false
   xpack.security.audit.enabled: false
   xpack.watcher.enabled: false

   action.auto_create_index: "+.,-*"

   network.host: _global_

   discovery.zen.hosts_provider: file
   discovery.zen.minimum_master_nodes: 2

   http.publish_host: "foo.example.com"
   http.publish_port: 443
   http.bind_host: 127.0.0.1

   transport.publish_host: "bar.example.com"
   transport.bind_host: 0.0.0.0

   indices.fielddata.cache.size: 1GB
   indices.breaker.total.use_real_memory: false

   path.logs: /var/log/elasticsearch
   path.data: /var/lib/elasticsearch/data

As you can see we can use the flat (grep-able) syntax for everything. 
This is also human readable because we can group options together by 
inserting empty lines between them.


The equivalent of the above in a structured syntax will be:

   cluster:
     name: foobar

   node:
     remote_cluster_client: false
     name: "foo.example.com"
     master: true
     data: true
     ingest: true
     ml: false

   xpack:
     ml:
       enabled: false
     security:
       enabled: false
       audit:
         enabled: false
     watcher:
       enabled: false

   action:
     auto_create_index: "+.,-*"

   network:
     host: _global_

   discovery:
     zen:
       hosts_provider: file
       minimum_master_nodes: 2

   http:
     publish_host: "foo.example.com"
     publish_port: 443
     bind_host: 127.0.0.1

   transport:
     publish_host: "bar.example.com"
     bind_host: 0.0.0.0

   indices:
     fielddata:
       cache:
         size: 1GB
     breaker:
       total:
         use_real_memory: false

   path:
     logs: /var/log/elasticsearch
     data: /var/lib/elasticsearch/data

This may be easier to read for some people, but it is a total nightmare 
for "grep" - so many keys have identical names, such as "enabled".


Also, for the virtual tables, it would be a lot easier to represent 
individual values in a virtual table when the config is flat and keys 
are unique. The virtual tables would need to either support encoding 
and decoding the structured config into a flat structure, or use JSON 
encoded string values. The use of JSON would make querying individual 
values much harder.
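
For example, flattening the nested config into unique dot-separated keys 
for such a virtual table could be done with something as simple as the 
sketch below (purely illustrative, not existing Cassandra code):

    import java.util.LinkedHashMap;
    import java.util.Map;

    final class ConfigFlattener
    {
        // Turns {"a": {"b": 12}} into {"a.b": 12}, so every row in the
        // virtual table has a unique key.
        static Map<String, Object> flatten(Map<String, Object> nested)
        {
            Map<String, Object> flat = new LinkedHashMap<>();
            flatten("", nested, flat);
            return flat;
        }

        @SuppressWarnings("unchecked")
        private static void flatten(String prefix, Map<String, Object> node, Map<String, Object> out)
        {
            for (Map.Entry<String, Object> e : node.entrySet())
            {
                String key = prefix.isEmpty() ? e.getKey() : prefix + "." + e.getKey();
                if (e.getValue() instanceof Map)
                    flatten(key, (Map<String, Object>) e.getValue(), out);
                else
                    out.put(key, e.getValue());
            }
        }
    }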


On 22/11/2021 16:16, Joseph Lynch wrote:

Isn't one of the primary reasons to have a YAML configuration instead
of a properties file is to allow typed and structured (implies nested)
configuration? I think it makes a lot of sense to group related
configuration options (e.g. a feature) into a typed class when we're
talking about more than one or two related options.

It's pretty standard elsewhere in the JVM ecosystem to encode YAMLs to
period encoded key->value pairs when required (usually when providing
a property or override layer), Spring and Elasticsearch yamls both
come to mind. It seems pretty reasonable to support dot encoding and
decoding, for example {"a": {"b": 12}} -> '"a.b": 12'.

Regarding quickly telling what configuration a node is running I think
we should lean on virtual tables for "what is the current
configuration" now that we have them, as others have said the written
cassandra.yaml is not necessarily the current configuration ... and
also grep -C or -A exist for this reason.

-Joey

On Mon, Nov 22, 2021 at 4:14 AM Benjamin Lerer  wrote:

I do not have a strong opinion for one or the other but wanted to raise the
issue I see with the "Settings" virtual table.

Currently the "Settings" virtual table converts nested options into flat
options using a "_" separator. For those options it allows a user to query
the all set of options through some hack.
If we decide to move to more nesting (more than one level), it seems to me
that we need to change the way this table is behaving and how we can query
its data.

We would need to start using "." as a nesting separator to ensure that
things are consistent between the configuration and the table and add
support for LIKE restrictions for filtering queries to allow operators to
be able to select the precise set of settings that the operator is looking
for.

Doing so is not really complicated in itself but might impact some users.

Le ven. 19 nov. 2021 à 22:39, David Capwell  a
écrit :


it is really handy to grep
cassandra.yaml on some config key and you know the value instantly.

You can still do that

$ grep -A2 coordinator_read_size conf/cassandra.yaml
# coordinator_read_size:
#

Re: [DISCUSS] Nested YAML configs for new features

2021-11-24 Thread Bowen Song
It only works if the output is for a human to read. If you have a large 
number of servers, very often you want to do "grep -q ... && 
other_command" (or || other_command), or chain the grep results from 
parallel-ssh into another command (grep or sort). The -A/-B/-C switches 
will not work in this case. If the nested configurations have multiple 
keys with the same name (e.g.: a dictionary where the values are very 
similar dictionaries), even chaining 3 grep commands in the form of 
"grep -A ... | grep -B ... | grep -q ..." is unlikely to work.


Structured / nested config is easier for human eyes to read but very 
hard for simple scripts to handle. Flat config is harder for human eyes 
but easy for simple scripts. I can see users may prefer one over the 
other depending on their own use case. If the structured / nested config 
must be introduced, I would like to see both syntaxes supported to allow 
the user to make their own choice.
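
Supporting both syntaxes in the parsing layer should be straightforward. 
Here is a minimal sketch of expanding a flat dot-separated key back into 
the nested structure before binding it to the config object (the names 
are invented for illustration, this is not existing Cassandra code):

    import java.util.HashMap;
    import java.util.Map;

    final class DotKeyExpander
    {
        // Expands a flat key such as "track_warnings.row_index_size.abort_threshold"
        // into the nested map form, so a loader can accept flat and nested
        // syntax interchangeably.
        @SuppressWarnings("unchecked")
        static void put(Map<String, Object> root, String dottedKey, Object value)
        {
            String[] parts = dottedKey.split("\\.");
            Map<String, Object> node = root;
            for (int i = 0; i < parts.length - 1; i++)
                node = (Map<String, Object>) node.computeIfAbsent(parts[i], k -> new HashMap<String, Object>());
            node.put(parts[parts.length - 1], value);
        }
    }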



On 24/11/2021 16:21, Henrik Ingo wrote:

Grepping is an important use case, and having worked with another database
that does nest its configs, I can offer some tips how I survived:

With good old grep, it can help to use the before and after options:

grep -A 5 track_warnings | grep -B 5 warn_threshold

Would find you this:

track_warnings:
 enabled: true
 coordinator_read_size:
 warn_threshold: 10kb

It would require magic expert knowledge to guess right numbers for -A and
-B but in many cases you could just use a large number like  and it
will work in most cases.

For more frequent use, you will want to just install `yq` (aka yaml query):
https://github.com/kislyuk/yq

henrik


On Fri, Nov 19, 2021 at 9:07 PM Stefan Miklosovic <
stefan.mikloso...@instaclustr.com> wrote:


Hi David,

while I do not oppose nested structure, it is really handy to grep
cassandra.yaml on some config key and you know the value instantly.
This is not possible when it is nested (easily & fastly) as it is on
two lines. Or maybe my grepping is just not advanced enough to cover
this case? If it is flat, I can just grep "track_warnings" and I have
them all.

Can you elaborate on your last bullet point? Parsing layer ... What do
you mean specifically?

Thanks

On Fri, 19 Nov 2021 at 19:36, David Capwell  wrote:

This has been brought up in a few tickets, so pushing to the dev list.

CASSANDRA-15234 - Standardise config and JVM parameters
CASSANDRA-16896 - hard/soft limits for queries
CASSANDRA-17147 - Guardrails prototype

In short, do we as a project wish to move "new features" into nested
YAML when the feature has "enough" to justify the nesting?  I would
really like to focus this discussion on new features rather than
retroactively grouping (leaving that to CASSANDRA-15234), as there is
already a place to talk about that.

To get things started, let's start with the track-warning feature
(hard/soft limits for queries), currently the configs look as follows
(assuming 15234)

track_warnings:
 enabled: true
 coordinator_read_size:
 warn_threshold: 10kb
 abort_threshold: 1mb
 local_read_size:
 warn_threshold: 10kb
 abort_threshold: 1mb
 row_index_size:
 warn_threshold: 100mb
 abort_threshold: 1gb

or should this be "flat"

track_warnings_enabled: true
track_warnings_coordinator_read_size_warn_threshold: 10kb
track_warnings_coordinator_read_size_abort_threshold: 1mb
track_warnings_local_read_size_warn_threshold: 10kb
track_warnings_local_read_size_abort_threshold: 1mb
track_warnings_row_index_size_warn_threshold: 100mb
track_warnings_row_index_size_abort_threshold: 1gb

For me I prefer nested for a few reasons
* easier to enforce consistency as the configs can use shared types;
in the track warnings patch I had mismatches cross configs (warn vs
warns, fail vs abort, etc.) before going nested, now everything reuses
the same types
* even though it is longer, things can be more clear how they are related
* parsing layer can add support for mixed or purely flat depending on
user preference (example:
track_warnings.row_index_size.abort_threshold, using the '.' notation
to represent nested structures)

Thoughts?

-
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org


-
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org




-
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org



Re: [DISCUSS] Nested YAML configs for new features

2021-11-19 Thread Bowen Song
I'm with Stefan. I prefer the flat YAML file, which I can easily grep to 
check and confirm the settings on a large number of servers with 
parallel-ssh. This will be very hard to do with nested config in a YAML file.


In addition to that, I also use grep in the Cassandra source code to 
locate the relevant files based on the config name. The flat config name 
is long and unique, and this helps me efficiently navigate within the 
source code. I can imagine this is not going to work very well (if it 
works at all) with the nested config name.


p.s.: I'm not a Java developer, it will take me much longer to find the 
relevant code if grep doesn't work in the source code. It is also going 
to be harder for me to understand it if the nested config is turned into 
a Java object/class.


On 19/11/2021 19:07, Stefan Miklosovic wrote:

Hi David,

while I do not oppose nested structure, it is really handy to grep
cassandra.yaml on some config key and you know the value instantly.
This is not possible when it is nested (easily & fastly) as it is on
two lines. Or maybe my grepping is just not advanced enough to cover
this case? If it is flat, I can just grep "track_warnings" and I have
them all.

Can you elaborate on your last bullet point? Parsing layer ... What do
you mean specifically?

Thanks

On Fri, 19 Nov 2021 at 19:36, David Capwell  wrote:

This has been brought up in a few tickets, so pushing to the dev list.

CASSANDRA-15234 - Standardise config and JVM parameters
CASSANDRA-16896 - hard/soft limits for queries
CASSANDRA-17147 - Guardrails prototype

In short, do we as a project wish to move "new features" into nested
YAML when the feature has "enough" to justify the nesting?  I would
really like to focus this discussion on new features rather than
retroactively grouping (leaving that to CASSANDRA-15234), as there is
already a place to talk about that.

To get things started, let's start with the track-warning feature
(hard/soft limits for queries), currently the configs look as follows
(assuming 15234)

track_warnings:
 enabled: true
 coordinator_read_size:
 warn_threshold: 10kb
 abort_threshold: 1mb
 local_read_size:
 warn_threshold: 10kb
 abort_threshold: 1mb
 row_index_size:
 warn_threshold: 100mb
 abort_threshold: 1gb

or should this be "flat"

track_warnings_enabled: true
track_warnings_coordinator_read_size_warn_threshold: 10kb
track_warnings_coordinator_read_size_abort_threshold: 1mb
track_warnings_local_read_size_warn_threshold: 10kb
track_warnings_local_read_size_abort_threshold: 1mb
track_warnings_row_index_size_warn_threshold: 100mb
track_warnings_row_index_size_abort_threshold: 1gb

For me I prefer nested for a few reasons
* easier to enforce consistency as the configs can use shared types;
in the track warnings patch I had mismatches cross configs (warn vs
warns, fail vs abort, etc.) before going nested, now everything reuses
the same types
* even though it is longer, things can be more clear how they are related
* parsing layer can add support for mixed or purely flat depending on
user preference (example:
track_warnings.row_index_size.abort_threshold, using the '.' notation
to represent nested structures)

Thoughts?

-
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org


-
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org



-
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org



Re: Resurrection of CASSANDRA-9633 - SSTable encryption

2021-11-19 Thread Bowen Song
On the performance note, I copied & pasted a small piece of Java code to 
do AES256-CBC on stdin and write the result to stdout. I then ran the 
following two commands on the same machine (with AES-NI) for comparison:


   $ dd if=/dev/zero bs=4096 count=$((4*1024*1024)) status=none | time
   /usr/lib/jvm/java-11-openjdk/bin/java -jar aes-bench.jar >/dev/null
   36.24s user 5.96s system 100% cpu 41.912 total
   $ dd if=/dev/zero bs=4096 count=$((4*1024*1024)) status=none | time
   openssl enc -aes-256-cbc -e -K
   "0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef"
   -iv "0123456789abcdef0123456789abcdef" >/dev/null
   31.09s user 3.92s system 99% cpu 35.043 total

This is not an accurate test of the AES performance, as the Java test 
includes the JVM start up time and the key and IV generation in the Java 
code. But this gives us a pretty good idea that the total performance 
regression is definitely far from the 2x to 10x slower claimed in some 
previous emails.



The Java code I used:

    package com.example.AesBenchmark;

    import java.security.Security;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.security.SecureRandom;

    import javax.crypto.Cipher;
    import javax.crypto.KeyGenerator;
    import javax.crypto.SecretKey;
    import javax.crypto.spec.IvParameterSpec;
    import javax.crypto.spec.SecretKeySpec;

    public class AesBenchmark {
        static {
            try {
                Security.setProperty("crypto.policy", "unlimited");
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        static final int BUF_LEN = 4096;

        public static void main(String[] args) throws Exception
        {
            KeyGenerator keyGenerator = KeyGenerator.getInstance("AES");
            keyGenerator.init(256);

            // Generate Key
            SecretKey key = keyGenerator.generateKey();

            // Generating IV.
            byte[] IV = new byte[16];
            SecureRandom random = new SecureRandom();
            random.nextBytes(IV);

            // Get Cipher Instance
            Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");

            // Create SecretKeySpec
            SecretKeySpec keySpec = new SecretKeySpec(key.getEncoded(), "AES");

            // Create IvParameterSpec
            IvParameterSpec ivSpec = new IvParameterSpec(IV);

            // Initialize Cipher for ENCRYPT_MODE
            cipher.init(Cipher.ENCRYPT_MODE, keySpec, ivSpec);

            // Encrypt stdin and write the ciphertext to stdout
            byte[] bufInput = new byte[BUF_LEN];
            FileInputStream fis = new FileInputStream(new File("/dev/stdin"));
            FileOutputStream fos = new FileOutputStream(new File("/dev/stdout"));
            int nBytes;
            while ((nBytes = fis.read(bufInput, 0, BUF_LEN)) != -1)
            {
                fos.write(cipher.update(bufInput, 0, nBytes));
            }
            fos.write(cipher.doFinal());
        }
    }

On 19/11/2021 13:28, Jeff Jirsa wrote:


For better or worse, different threat models mean that it’s not strictly better 
to do FDE and some use cases definitely want this at the db layer instead of 
file system.


On Nov 19, 2021, at 12:54 PM, Joshua McKenzie  wrote:




setting performance requirements on this regard is a
nonsense. As long as it's reasonably usable in real world, and Cassandra
makes the estimated effects on performance available, it will be up to
the operators to decide whether to turn on the feature

I think Joey's argument, and correct me if I'm wrong, is that implementing
a complex feature in Cassandra that we then have to manage that's
essentially worse in every way compared to a built-in full-disk encryption
option via LUKS+LVM etc is a poor use of our time and energy.

i.e. we'd be better off investing our time into documenting how to do full
disk encryption in a variety of scenarios + explaining why that is our
recommended approach instead of taking the time and energy to design,
implement, debug, and then maintain an inferior solution.


On Fri, Nov 19, 2021 at 7:49 AM Joshua McKenzie
wrote:

Are you for real here?

Please keep things cordial. Statements like this don't help move the
conversation along.


On Fri, Nov 19, 2021 at 3:57 AM Stefan Miklosovic <
stefan.mikloso...@instaclustr.com> wrote:


On Fri, 19 Nov 2021 at 02:51, Joseph Lynch  wrote:

On Thu, Nov 18, 2021 at 7:23 PM Kokoori, Shylaja <

shylaja.koko...@intel.com>

wrote:


To address Joey's concern, the OpenJDK JVM and its derivatives

optimize

Java crypto based on the underlying HW capabilities. For example, if

the

underlying HW supports AES-NI, JVM intrinsics will use those for

crypto

operations. Likewise, the new vector AES available on the latest Intel
platform is utilized by the JVM while running on that platform to make
crypto operations faster.


Which JDK version were you running? We have had a number of issues with

the

JVM being 2-10x slower 

Re: Resurrection of CASSANDRA-9633 - SSTable encryption

2021-11-19 Thread Bowen Song
Sorry, but IMHO setting performance requirements in this regard is 
nonsense. As long as it's reasonably usable in the real world, and 
Cassandra makes the estimated effects on performance available, it will 
be up to the operators to decide whether to turn on the feature. It's a 
trade-off between security and performance, and everyone has different needs.


On 19/11/2021 01:50, Joseph Lynch wrote:

On Thu, Nov 18, 2021 at 7:23 PM Kokoori, Shylaja 
wrote:


To address Joey's concern, the OpenJDK JVM and its derivatives optimize
Java crypto based on the underlying HW capabilities. For example, if the
underlying HW supports AES-NI, JVM intrinsics will use those for crypto
operations. Likewise, the new vector AES available on the latest Intel
platform is utilized by the JVM while running on that platform to make
crypto operations faster.


Which JDK version were you running? We have had a number of issues with the
JVM being 2-10x slower than native crypto on Java 8 (especially MD5, SHA1,
and AES-GCM) and to a lesser extent Java 11 (usually ~2x slower). Again I
think we could get the JVM crypto penalty down to ~2x native if we linked
in e.g. ACCP by default [1, 2] but even the very best Java crypto I've seen
(fully utilizing hardware instructions) is still ~2x slower than native
code. The operating system has a number of advantages here in that they
don't pay JVM allocation costs or the JNI barrier (in the case of ACCP) and
the kernel also takes advantage of hardware instructions.



 From our internal experiments, we see single digit % regression when
transparent data encryption is enabled.


Which workloads are you testing and how are you measuring the regression? I
suspect that compaction, repair (validation compaction), streaming, and
quorum reads are probably much slower (probably ~10x slower for the
throughput bound operations and ~2x slower on the read path). As
compaction/repair/streaming usually take up between 10-20% of available CPU
cycles making them 2x slower might show up as <10% overall utilization
increase when you've really regressed 100% or more on key metrics
(compaction throughput, streaming throughput, memory allocation rate, etc
...). For example, if compaction was able to achieve 2 MiBps of throughput
before encryption and it was only able to achieve 1MiBps of throughput
afterwards, that would be a huge real world impact to operators as
compactions now take twice as long.

I think a CEP or details on the ticket that indicate the performance tests
and workloads that will be run might be wise? Perhaps something like
"encryption creates no more than a 1% regression of: compaction throughput
(MiBps), streaming throughput (MiBps), repair validation throughput
(duration of full repair on the entire cluster), read throughput at 10ms
p99 tail at quorum consistency (QPS handled while not exceeding P99 SLO of
10ms), etc ... while a sustained load is applied to a multi-node cluster"?
Even a microbenchmark that just sees how long it takes to encrypt and
decrypt a 500MiB dataset using the proposed JVM implementation versus
encrypting it with a native implementation might be enough to confirm/deny.
For example, keypipe (C, [3]) achieves around 2.8 GiBps symmetric of
AES-GCM and age (golang, ChaCha20-Poly1305, [4]) achieves about 1.6 GiBps
encryption and 1.0 GiBps decryption; from my past experiences with Java
crypto is it would achieve maybe 200 MiBps of _non-authenticated_ AES.

Cheers,
-Joey

[1] https://issues.apache.org/jira/browse/CASSANDRA-15294
[2] https://github.com/corretto/amazon-corretto-crypto-provider
[3] https://github.com/FiloSottile/age
[4] https://github.com/hashbrowncipher/keypipe#encryption



-
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org



Re: Resurrection of CASSANDRA-9633 - SSTable encryption

2021-11-16 Thread Bowen Song
I don't like the idea of FDE (Full Disk Encryption) as an alternative to 
application-managed encryption at rest. Each has its own advantages 
and disadvantages.


For example, if the encryption key is the same across nodes in the same 
cluster, and Cassandra can share the key securely between authenticated 
nodes, rolling restart of the servers will be a lot simpler than if the 
servers were using FDE - someone will have to type in the passphrase on 
each reboot, or have a script to mount the encrypted device over SSH and 
then start Cassandra service after a reboot.


Another valid use case of encryption implemented in Cassandra is to 
selectively encrypt some tables but leave others unencrypted. Doing 
this outside Cassandra at the filesystem level is very tedious and 
error-prone - a lot of symlinks, and it's pretty hard to handle newly 
created tables or keyspaces.


However, I don't know if there's enough demand to justify the above use 
cases.



On 16/11/2021 14:45, Joseph Lynch wrote:

I think a CEP is wise (or a more thorough design document on the
ticket) given how easy it is to do security incorrectly and key
management, rotation and key derivation are not particularly
straightforward.

I am curious what advantage Cassandra implementing encryption has over
asking the user to use an encrypted filesystem or disks instead where
the kernel or device will undoubtedly be able to do the crypto more
efficiently than we can in the JVM and we wouldn't have to further
complicate the storage engine? I think the state of encrypted
filesystems (e.g. LUKS on Linux) is significantly more user friendly
these days than it was in 2015 when that ticket was created.

If the application has existing exfiltration paths (e.g. backups) it's
probably better to encrypt/decrypt in the backup/restore process via
something extremely fast (and modern) like piping through age [1]
isn't it?

[1] https://github.com/FiloSottile/age

-Joey


On Sat, Nov 13, 2021 at 6:01 AM Stefan Miklosovic
 wrote:

Hi list,

an engineer from Intel - Shylaja Kokoori (who is watching this list
closely) has retrofitted the original code from CASSANDRA-9633 work in
times of 3.4 to the current trunk with my help here and there, mostly
cosmetic.

I would like to know if there is a general consensus about me going to
create a CEP for this feature or what is your perception on this. I
know we have it a little bit backwards here as we should first discuss
and then code but I am super glad that we have some POC we can
elaborate further on and CEP would just cement  and summarise the
approach / other implementation aspects of this feature.

I think that having 9633 merged will fill quite a big operational gap
when it comes to security. There are a lot of enterprises who desire
this feature so much. I can not remember when I last saw a ticket with
50 watchers which was inactive for such a long time.

Regards

-
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org


-
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org



-
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org



Re: Resurrection of CASSANDRA-9633 - SSTable encryption

2021-11-16 Thread Bowen Song
If the same user-chosen key Km is used across all nodes in the same 
cluster, the sender will only need to share their SSTable generation GEN 
with the receiving side. This is because the receiving side will need to 
use the GEN to reproduce the KEK used on the source node. The receiving 
side will then need to unwrap Kr with the KEK and re-wrap it with a new 
KEK' derived from their own GEN. GEN is not considered a secret.
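
To illustrate the wrap/unwrap step (a sketch only - the KDF, the wrap 
algorithm and the parameter choices here are my assumptions for the 
example, not necessarily what the patch does):

    import java.security.spec.KeySpec;
    import javax.crypto.Cipher;
    import javax.crypto.SecretKey;
    import javax.crypto.SecretKeyFactory;
    import javax.crypto.spec.PBEKeySpec;
    import javax.crypto.spec.SecretKeySpec;

    final class SSTableKeyWrapping
    {
        // Derives a KEK from the cluster-wide master key Km and the
        // non-secret generation GEN (used here as the salt).
        static SecretKey deriveKek(char[] km, byte[] gen) throws Exception
        {
            SecretKeyFactory f = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256");
            KeySpec spec = new PBEKeySpec(km, gen, 10_000, 256);
            return new SecretKeySpec(f.generateSecret(spec).getEncoded(), "AES");
        }

        // Wraps the random per-SSTable data key Kr under the KEK; the wrapped
        // bytes are what gets stored alongside the SSTable.
        static byte[] wrap(SecretKey kek, SecretKey kr) throws Exception
        {
            Cipher c = Cipher.getInstance("AESWrap");
            c.init(Cipher.WRAP_MODE, kek);
            return c.wrap(kr);
        }

        // The receiving node derives its own KEK' from (Km, its GEN), unwraps
        // Kr with the sender's KEK and re-wraps it with KEK'.
        static SecretKey unwrap(SecretKey kek, byte[] wrappedKr) throws Exception
        {
            Cipher c = Cipher.getInstance("AESWrap");
            c.init(Cipher.UNWRAP_MODE, kek);
            return (SecretKey) c.unwrap(wrappedKr, "AES", Cipher.SECRET_KEY);
        }
    }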



On 16/11/2021 12:13, Stefan Miklosovic wrote:

Thanks for the insights of everybody.

I would like to return to Km. If we require that all Km's are the same
before streaming, is it not true that we do not need to move any
secrets around at all? So TLS would not be required either as only
encrypted tables would ever be streamed. That way Kr would never ever
leave the node and new Km would be rolled over first. To use correct
Km, we would have hash of that upon received table from the
recipient's perspective. This would also avoid the fairly complex
algorithm in the last Bowen's reply when I got that right.

On Tue, 16 Nov 2021 at 13:02, bened...@apache.org  wrote:

We already have the facility to authenticate peers, I am suggesting we should 
e.g. refuse to enable encryption if there is no such facility configured for a 
replica, or fail to start if there is encrypted data present and no 
authentication facility configured.

It is in my opinion much more problematic to remove encryption from data and 
ship it to another node in the network than it is to ship data that is already 
unencrypted to another node on the network. Either is bad, but it is probably 
fine to leave the unencrypted case to the cognizance of the operator who may be 
happy relying on their general expectation that there are no nefarious actors 
on the network. Encrypting data suggests this is not an acceptable assumption, 
so I think we should make it harder for users that require encryption to 
accidentally misconfigure in this way, since they probably have higher security 
expectations (and compliance requirements) than users that do not encrypt their 
data at rest.


From: Bowen Song 
Date: Tuesday, 16 November 2021 at 11:56
To: dev@cassandra.apache.org 
Subject: Re: Resurrection of CASSANDRA-9633 - SSTable encryption
I think authenticating a receiving node is important, but it is perhaps
not in the scope of this ticket (or CEP if it becomes one). This applies
to not only encrypted SSTables, but also unencrypted SSTables. A
malicious node can join the cluster and send bogus requests to other
nodes is a general problem not specific to the on-disk encryption.

On 16/11/2021 10:50, bened...@apache.org wrote:

I assume the key would be decrypted before being streamed, or perhaps encrypted 
using a public key provided to you by the receiving node. This would permit 
efficient “zero copy” streaming for the data portion, but not require any 
knowledge of the recipient node’s master key(s).

Either way, we would still want to ensure we had some authentication of the 
recipient node before streaming the file as it would effectively be decrypted 
to any node that could request this streaming action.


From: Stefan Miklosovic 
Date: Tuesday, 16 November 2021 at 10:45
To: dev@cassandra.apache.org 
Subject: Re: Resurrection of CASSANDRA-9633 - SSTable encryption
Ok but this also means that Km would need to be the same for all nodes right?

If we are rolling in node by node fashion, Km is changed at node 1, we
change the wrapped key which is stored on disk and we stream this
table to the other node which is still on the old Km. Would this work?
I think we would need to rotate first before anything is streamed. Or
no?

On Tue, 16 Nov 2021 at 11:17, Bowen Song  wrote:

Yes, that's correct. The actual key used to encrypt the SSTable will
stay the same once the SSTable is created. This is a widely used
practice in many encrypt-at-rest applications. One good example is the
LUKS full disk encryption, which also supports multiple keys to unlock
(decrypt) the same data. Multiple unlocking keys is only possible
because the actual key used to encrypt the data is randomly generated
and then stored encrypted by (a key derived from) a user chosen key.

If this approach is adopted, the streaming process can share the Kr
without disclosing the Km, therefore enabling zero-copy streaming.

On 16/11/2021 08:56, Stefan Miklosovic wrote:

Hi Bowen, Very interesting idea indeed. So if I got it right, the very
key for the actual sstable encryption would be always the same, it is
just what is wrapped would differ. So if we rotate, we basically only
change Km hence KEK hence the result of wrapping but there would still
be the original Kr key used.

Jeremiah - I will prepare that branch very soon.

On Tue, 16 Nov 2021 at 01:09, Bowen Song  wrote:

   The second question is about key rotation. If an operator needs to
   roll the key because it was compromised or there is some policy around
   that, we should be able to provide some

Re: Resurrection of CASSANDRA-9633 - SSTable encryption

2021-11-16 Thread Bowen Song
I think a warning message is fine, but Cassandra should not enforce 
network encryption when on-disk encryption is enabled. It's definitely a 
valid use case to have Cassandra over IPSec without enabling TLS.


On 16/11/2021 12:02, bened...@apache.org wrote:

We already have the facility to authenticate peers, I am suggesting we should 
e.g. refuse to enable encryption if there is no such facility configured for a 
replica, or fail to start if there is encrypted data present and no 
authentication facility configured.

It is in my opinion much more problematic to remove encryption from data and 
ship it to another node in the network than it is to ship data that is already 
unencrypted to another node on the network. Either is bad, but it is probably 
fine to leave the unencrypted case to the cognizance of the operator who may be 
happy relying on their general expectation that there are no nefarious actors 
on the network. Encrypting data suggests this is not an acceptable assumption, 
so I think we should make it harder for users that require encryption to 
accidentally misconfigure in this way, since they probably have higher security 
expectations (and compliance requirements) than users that do not encrypt their 
data at rest.


From: Bowen Song 
Date: Tuesday, 16 November 2021 at 11:56
To: dev@cassandra.apache.org 
Subject: Re: Resurrection of CASSANDRA-9633 - SSTable encryption
I think authenticating a receiving node is important, but it is perhaps
not in the scope of this ticket (or CEP if it becomes one). This applies
to not only encrypted SSTables, but also unencrypted SSTables. A
malicious node can join the cluster and send bogus requests to other
nodes is a general problem not specific to the on-disk encryption.

On 16/11/2021 10:50, bened...@apache.org wrote:

I assume the key would be decrypted before being streamed, or perhaps encrypted 
using a public key provided to you by the receiving node. This would permit 
efficient “zero copy” streaming for the data portion, but not require any 
knowledge of the recipient node’s master key(s).

Either way, we would still want to ensure we had some authentication of the 
recipient node before streaming the file as it would effectively be decrypted 
to any node that could request this streaming action.


From: Stefan Miklosovic 
Date: Tuesday, 16 November 2021 at 10:45
To: dev@cassandra.apache.org 
Subject: Re: Resurrection of CASSANDRA-9633 - SSTable encryption
Ok but this also means that Km would need to be the same for all nodes right?

If we are rolling in node by node fashion, Km is changed at node 1, we
change the wrapped key which is stored on disk and we stream this
table to the other node which is still on the old Km. Would this work?
I think we would need to rotate first before anything is streamed. Or
no?

On Tue, 16 Nov 2021 at 11:17, Bowen Song  wrote:

Yes, that's correct. The actual key used to encrypt the SSTable will
stay the same once the SSTable is created. This is a widely used
practice in many encrypt-at-rest applications. One good example is the
LUKS full disk encryption, which also supports multiple keys to unlock
(decrypt) the same data. Multiple unlocking keys is only possible
because the actual key used to encrypt the data is randomly generated
and then stored encrypted by (a key derived from) a user chosen key.

If this approach is adopted, the streaming process can share the Kr
without disclosing the Km, therefore enabling zero-copy streaming.

On 16/11/2021 08:56, Stefan Miklosovic wrote:

Hi Bowen, Very interesting idea indeed. So if I got it right, the very
key for the actual sstable encryption would be always the same, it is
just what is wrapped would differ. So if we rotate, we basically only
change Km hence KEK hence the result of wrapping but there would still
be the original Kr key used.

Jeremiah - I will prepare that branch very soon.

On Tue, 16 Nov 2021 at 01:09, Bowen Song  wrote:

   The second question is about key rotation. If an operator needs to
   roll the key because it was compromised or there is some policy around
   that, we should be able to provide some way to rotate it. Our idea is
   to write a tool (either a subcommand of nodetool (rewritesstables)
   command or a completely standalone one in tools) which would take the
   first, original key, the second, new key and dir with sstables as
   input and it would literally take the data and it would rewrite it to
   the second set of sstables which would be encrypted with the second
   key. What do you think about this?

   I would rather suggest that “what key encrypted this” be part of the 
sstable metadata, and allow there to be multiple keys in the system.  This way 
you can just add a new “current key” so new sstables use the new key, but 
existing sstables would use the old key.  An operator could then trigger a 
“nodetool upgradesstables —all” to rewrite the existing sstables with the new 
“current key

Re: Resurrection of CASSANDRA-9633 - SSTable encryption

2021-11-16 Thread Bowen Song
I think authenticating a receiving node is important, but it is perhaps 
not in the scope of this ticket (or CEP if it becomes one). This applies 
not only to encrypted SSTables, but also to unencrypted SSTables. That a 
malicious node can join the cluster and send bogus requests to other 
nodes is a general problem, not one specific to on-disk encryption.


On 16/11/2021 10:50, bened...@apache.org wrote:

I assume the key would be decrypted before being streamed, or perhaps encrypted 
using a public key provided to you by the receiving node. This would permit 
efficient “zero copy” streaming for the data portion, but not require any 
knowledge of the recipient node’s master key(s).

Either way, we would still want to ensure we had some authentication of the 
recipient node before streaming the file as it would effectively be decrypted 
to any node that could request this streaming action.


From: Stefan Miklosovic 
Date: Tuesday, 16 November 2021 at 10:45
To: dev@cassandra.apache.org 
Subject: Re: Resurrection of CASSANDRA-9633 - SSTable encryption
Ok but this also means that Km would need to be the same for all nodes right?

If we are rolling in node by node fashion, Km is changed at node 1, we
change the wrapped key which is stored on disk and we stream this
table to the other node which is still on the old Km. Would this work?
I think we would need to rotate first before anything is streamed. Or
no?

On Tue, 16 Nov 2021 at 11:17, Bowen Song  wrote:

Yes, that's correct. The actual key used to encrypt the SSTable will
stay the same once the SSTable is created. This is a widely used
practice in many encrypt-at-rest applications. One good example is the
LUKS full disk encryption, which also supports multiple keys to unlock
(decrypt) the same data. Multiple unlocking keys are only possible
because the actual key used to encrypt the data is randomly generated
and then stored encrypted by (a key derived from) a user chosen key.

If this approach is adopted, the streaming process can share the Kr
without disclosing the Km, therefore enabling zero-copy streaming.

On 16/11/2021 08:56, Stefan Miklosovic wrote:

Hi Bowen, Very interesting idea indeed. So if I got it right, the very
key for the actual sstable encryption would always be the same; it is
just how it is wrapped that would differ. So if we rotate, we basically only
change Km hence KEK hence the result of wrapping but there would still
be the original Kr key used.

Jeremiah - I will prepare that branch very soon.

On Tue, 16 Nov 2021 at 01:09, Bowen Song  wrote:

  The second question is about key rotation. If an operator needs to
  roll the key because it was compromised or there is some policy around
  that, we should be able to provide some way to rotate it. Our idea is
  to write a tool (either a subcommand of nodetool (rewritesstables)
  command or a completely standalone one in tools) which would take the
  first, original key, the second, new key and dir with sstables as
  input and it would literally take the data and it would rewrite it to
  the second set of sstables which would be encrypted with the second
  key. What do you think about this?

  I would rather suggest that “what key encrypted this” be part of the 
sstable metadata, and allow there to be multiple keys in the system.  This way 
you can just add a new “current key” so new sstables use the new key, but 
existing sstables would use the old key.  An operator could then trigger a 
“nodetool upgradesstables —all” to rewrite the existing sstables with the new 
“current key”.

There's a much better approach to solve this issue. You can store a
wrapped key in an encryption info file alongside the SSTable file.
Here's how it works:
1. randomly generate a key Kr
2. encrypt the SSTable file with the key Kr, store the encrypted SSTable
file on disk
3. derive a key encryption key KEK from the SSTable file's information
(e.g.: table UUID + generation) and the user chosen master key Km, so
you have KEK = KDF(UUID+GEN, Km)
4. wrap (encrypt) the key Kr with the KEK, so you have WKr = KW(Kr, KEK)
5. hash the Km, the hash will be used as a key ID to identify which master
key was used to encrypt the key Kr if the server has multiple master
keys in use
6. store the WKr and the hash of Km in a separate file alongside
the SSTable file
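
To make the write path concrete, here is a minimal Java sketch of steps 1 and 
3-6 (the bulk encryption of the SSTable contents with Kr in step 2 is omitted). 
HMAC-SHA256 stands in for the KDF and RFC 3394 AES key wrap for KW; the class 
and method names are assumptions for illustration only, not existing Cassandra 
code:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.UUID;
    import javax.crypto.Cipher;
    import javax.crypto.KeyGenerator;
    import javax.crypto.Mac;
    import javax.crypto.SecretKey;
    import javax.crypto.spec.SecretKeySpec;

    final class SSTableEncryptionInfo
    {
        final byte[] wrappedKey;    // WKr, persisted alongside the SSTable
        final byte[] masterKeyId;   // hash of Km, identifies which master key wrapped Kr

        SSTableEncryptionInfo(byte[] wrappedKey, byte[] masterKeyId)
        {
            this.wrappedKey = wrappedKey;
            this.masterKeyId = masterKeyId;
        }

        // Step 3: KEK = KDF(UUID + GEN, Km), here with HMAC-SHA256 as a stand-in KDF.
        static SecretKey deriveKek(UUID tableId, long generation, SecretKey km) throws Exception
        {
            Mac kdf = Mac.getInstance("HmacSHA256");
            kdf.init(new SecretKeySpec(km.getEncoded(), "HmacSHA256"));
            byte[] context = (tableId + ":" + generation).getBytes(StandardCharsets.UTF_8);
            return new SecretKeySpec(kdf.doFinal(context), "AES");
        }

        // Steps 1, 4, 5 and 6: generate Kr, wrap it with the KEK, record the hash of Km.
        static SSTableEncryptionInfo createForNewSSTable(UUID tableId, long generation, SecretKey km)
                throws Exception
        {
            KeyGenerator gen = KeyGenerator.getInstance("AES");
            gen.init(256);
            SecretKey kr = gen.generateKey();                        // step 1: random Kr
            // step 2 (not shown): encrypt the SSTable contents with Kr

            Cipher kw = Cipher.getInstance("AESWrap");               // RFC 3394 key wrap
            kw.init(Cipher.WRAP_MODE, deriveKek(tableId, generation, km));
            byte[] wkr = kw.wrap(kr);                                // step 4: WKr = KW(Kr, KEK)

            byte[] kmId = MessageDigest.getInstance("SHA-256").digest(km.getEncoded()); // step 5
            return new SSTableEncryptionInfo(wkr, kmId);             // step 6: write next to the SSTable
        }
    }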

In the read path, the Kr should be kept in memory to help improve
performance and this will also allow zero-downtime master key rotation.

During a key rotation:
1. derive the KEK in the same way: KEK = KDF(UUID+GEN, Km)
2. read the WKr from the encryption information file, and unwrap
(decrypt) it using the KEK to get the Kr
3. derive a new KEK' from the new master key Km' in the same way as above
4. wrap (encrypt) the key Kr with KEK' to get WKr' = KW(Kr, KEK')
5. hash the new master key Km', and store it together with the WKr' in
the encryption info file
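
A matching sketch of the rotation path, under the same assumptions (stand-in 
KDF and key wrap, illustrative names) as the write-path sketch above; note that 
only the small encryption info file is rewritten, never the SSTable data 
encrypted with Kr:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.UUID;
    import javax.crypto.Cipher;
    import javax.crypto.Mac;
    import javax.crypto.SecretKey;
    import javax.crypto.spec.SecretKeySpec;

    final class EncryptionInfoRotation
    {
        // Same stand-in KDF as above: KEK = HMAC-SHA256(Km, UUID + GEN).
        static SecretKey deriveKek(UUID tableId, long generation, SecretKey km) throws Exception
        {
            Mac kdf = Mac.getInstance("HmacSHA256");
            kdf.init(new SecretKeySpec(km.getEncoded(), "HmacSHA256"));
            byte[] context = (tableId + ":" + generation).getBytes(StandardCharsets.UTF_8);
            return new SecretKeySpec(kdf.doFinal(context), "AES");
        }

        // Steps 1-5: unwrap WKr with the old KEK, re-wrap Kr with the new KEK,
        // and return the new wrapped key plus the hash of the new master key.
        static byte[][] rotate(UUID tableId, long generation, byte[] wkr,
                               SecretKey oldKm, SecretKey newKm) throws Exception
        {
            Cipher kw = Cipher.getInstance("AESWrap");
            kw.init(Cipher.UNWRAP_MODE, deriveKek(tableId, generation, oldKm));       // steps 1-2
            SecretKey kr = (SecretKey) kw.unwrap(wkr, "AES", Cipher.SECRET_KEY);

            kw.init(Cipher.WRAP_MODE, deriveKek(tableId, generation, newKm));         // step 3
            byte[] wkrPrime = kw.wrap(kr);                                             // step 4

            byte[] newKmId = MessageDigest.getInstance("SHA-256").digest(newKm.getEncoded()); // step 5
            return new byte[][] { wkrPrime, newKmId };   // rewrite the encryption info file with these
        }
    }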

Since the key rotation only involves rewriting the encryption info file

Re: Resurrection of CASSANDRA-9633 - SSTable encryption

2021-11-16 Thread Bowen Song
No, the Km does not need to be the same across nodes. Each node can 
store its own encryption info file created with its own Km. The 
streaming process only requires that the Kr is shared.


A quick description of the streaming process via an insecure connection:

1. the sender unwraps the wrapped key WKr with its Km, and gets the key Kr

2. the sender and the receiver use DH key exchange to establish a shared 
secret Ks, so that sender and receiver both know the Ks


3. the sender derives a KEKs from the table info (SSTable gen is not 
persisted across nodes) & streaming info (TODO) and the shared secret 
Ks, so KEKs = KDF(Table UUID + TBD STREAMING INFO, Ks)


4. the sender wraps the Kr with KEKs to get WKrs = KW(Kr, KEKs)

5. the sender sends WKrs and the (encrypted) SSTable file to the receiver

6. the receiver derives the KEKs in the same way as the sender

7. the receiver unwraps WKrs using the KEKs and gets Kr

8. the receiver wraps the Kr with a KEK' derived from their own Km


This enables zero-copy streaming, and the Kr is never sent in plaintext 
over an insecure communication channel. A passive observer cannot learn 
anything about the Kr. If the streaming is done over TLS, the Kr can be 
sent over the TLS connection without all the additional work. The SSTable 
can be sent via an insecure connection to enable zero-copy streaming. An 
HMAC of the SSTable should also be sent over TLS to ensure the SSTable 
has not been damaged or modified.
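
For illustration, the sender/receiver key handling for such a stream might look 
as follows, using ephemeral X25519 key agreement (JDK 11+) as a stand-in for 
whatever key exchange the streaming protocol would actually use, plus the same 
stand-in KDF and key wrap as in the earlier sketches; all names are hypothetical:

    import java.nio.charset.StandardCharsets;
    import java.security.KeyPair;
    import java.security.KeyPairGenerator;
    import java.security.PublicKey;
    import java.util.UUID;
    import javax.crypto.Cipher;
    import javax.crypto.KeyAgreement;
    import javax.crypto.Mac;
    import javax.crypto.SecretKey;
    import javax.crypto.spec.SecretKeySpec;

    final class StreamingKeyExchange
    {
        // Step 2: each side contributes an ephemeral X25519 key pair; only the
        // public keys cross the wire, the shared secret Ks never does.
        static KeyPair newEphemeralKeyPair() throws Exception
        {
            return KeyPairGenerator.getInstance("X25519").generateKeyPair();
        }

        static byte[] sharedSecret(KeyPair mine, PublicKey theirs) throws Exception
        {
            KeyAgreement ka = KeyAgreement.getInstance("X25519");
            ka.init(mine.getPrivate());
            ka.doPhase(theirs, true);
            return ka.generateSecret();   // Ks, known to both sides
        }

        // Steps 3 and 6: KEKs = KDF(table UUID + stream session id, Ks),
        // again with HMAC-SHA256 as a stand-in KDF.
        static SecretKey deriveStreamingKek(UUID tableId, UUID streamSession, byte[] ks) throws Exception
        {
            Mac kdf = Mac.getInstance("HmacSHA256");
            kdf.init(new SecretKeySpec(ks, "HmacSHA256"));
            byte[] context = (tableId + ":" + streamSession).getBytes(StandardCharsets.UTF_8);
            return new SecretKeySpec(kdf.doFinal(context), "AES");
        }

        // Steps 4-5, sender side: WKrs = KW(Kr, KEKs), sent alongside the
        // still-encrypted SSTable file.
        static byte[] wrapForStream(SecretKey kr, SecretKey keks) throws Exception
        {
            Cipher kw = Cipher.getInstance("AESWrap");
            kw.init(Cipher.WRAP_MODE, keks);
            return kw.wrap(kr);
        }

        // Step 7, receiver side: recover Kr from WKrs; step 8 then re-wraps Kr
        // with a KEK' derived from the receiver's own Km, as on the write path.
        static SecretKey unwrapFromStream(byte[] wkrs, SecretKey keks) throws Exception
        {
            Cipher kw = Cipher.getInstance("AESWrap");
            kw.init(Cipher.UNWRAP_MODE, keks);
            return (SecretKey) kw.unwrap(wkrs, "AES", Cipher.SECRET_KEY);
        }
    }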



On 16/11/2021 10:45, Stefan Miklosovic wrote:

Ok but this also means that Km would need to be the same for all nodes right?

If we are rolling in node by node fashion, Km is changed at node 1, we
change the wrapped key which is stored on disk and we stream this
table to the other node which is still on the old Km. Would this work?
I think we would need to rotate first before anything is streamed. Or
no?

On Tue, 16 Nov 2021 at 11:17, Bowen Song  wrote:

Yes, that's correct. The actual key used to encrypt the SSTable will
stay the same once the SSTable is created. This is a widely used
practice in many encrypt-at-rest applications. One good example is the
LUKS full disk encryption, which also supports multiple keys to unlock
(decrypt) the same data. Multiple unlocking keys are only possible
because the actual key used to encrypt the data is randomly generated
and then stored encrypted by (a key derived from) a user chosen key.

If this approach is adopted, the streaming process can share the Kr
without disclosing the Km, therefore enabling zero-copy streaming.

On 16/11/2021 08:56, Stefan Miklosovic wrote:

Hi Bowen, Very interesting idea indeed. So if I got it right, the very
key for the actual sstable encryption would always be the same; it is
just how it is wrapped that would differ. So if we rotate, we basically only
change Km hence KEK hence the result of wrapping but there would still
be the original Kr key used.

Jeremiah - I will prepare that branch very soon.

On Tue, 16 Nov 2021 at 01:09, Bowen Song  wrote:

  The second question is about key rotation. If an operator needs to
  roll the key because it was compromised or there is some policy around
  that, we should be able to provide some way to rotate it. Our idea is
  to write a tool (either a subcommand of nodetool (rewritesstables)
  command or a completely standalone one in tools) which would take the
  first, original key, the second, new key and dir with sstables as
  input and it would literally take the data and it would rewrite it to
  the second set of sstables which would be encrypted with the second
  key. What do you think about this?

  I would rather suggest that “what key encrypted this” be part of the 
sstable metadata, and allow there to be multiple keys in the system.  This way 
you can just add a new “current key” so new sstables use the new key, but 
existing sstables would use the old key.  An operator could then trigger a 
“nodetool upgradesstables —all” to rewrite the existing sstables with the new 
“current key”.

There's a much better approach to solve this issue. You can store a
wrapped key in an encryption info file alongside the SSTable file.
Here's how it works:
1. randomly generate a key Kr
2. encrypt the SSTable file with the key Kr, store the encrypted SSTable
file on disk
3. derive a key encryption key KEK from the SSTable file's information
(e.g.: table UUID + generation) and the user chosen master key Km, so
you have KEK = KDF(UUID+GEN, Km)
4. wrap (encrypt) the key Kr with the KEK, so you have WKr = KW(Kr, KEK)
5. hash the Km, the hash will be used as a key ID to identify which master
key was used to encrypt the key Kr if the server has multiple master
keys in use
6. store the WKr and the hash of Km in a separate file alongside
the SSTable file

In the read path, the Kr should be kept in memory to help improve
performance and this will also allow zero-downtime master key rotation.

During a key rotation:
1. derive the

Re: Resurrection of CASSANDRA-9633 - SSTable encryption

2021-11-16 Thread Bowen Song
Yes, that's correct. The actual key used to encrypt the SSTable will 
stay the same once the SSTable is created. This is a widely used 
practice in many encrypt-at-rest applications. One good example is the 
LUKS full disk encryption, which also supports multiple keys to unlock 
(decrypt) the same data. Multiple unlocking keys are only possible 
because the actual key used to encrypt the data is randomly generated 
and then stored encrypted by (a key derived from) a user chosen key.


If this approach is adopted, the streaming process can share the Kr 
without disclosing the Km, therefore enabling zero-copy streaming.


On 16/11/2021 08:56, Stefan Miklosovic wrote:

Hi Bowen, Very interesting idea indeed. So if I got it right, the very
key for the actual sstable encryption would always be the same; it is
just how it is wrapped that would differ. So if we rotate, we basically only
change Km hence KEK hence the result of wrapping but there would still
be the original Kr key used.

Jeremiah - I will prepare that branch very soon.

On Tue, 16 Nov 2021 at 01:09, Bowen Song  wrote:

 The second question is about key rotation. If an operator needs to
 roll the key because it was compromised or there is some policy around
 that, we should be able to provide some way to rotate it. Our idea is
 to write a tool (either a subcommand of nodetool (rewritesstables)
 command or a completely standalone one in tools) which would take the
 first, original key, the second, new key and dir with sstables as
 input and it would literally take the data and it would rewrite it to
 the second set of sstables which would be encrypted with the second
 key. What do you think about this?

 I would rather suggest that “what key encrypted this” be part of the 
sstable metadata, and allow there to be multiple keys in the system.  This way 
you can just add a new “current key” so new sstables use the new key, but 
existing sstables would use the old key.  An operator could then trigger a 
“nodetool upgradesstables —all” to rewrite the existing sstables with the new 
“current key”.

There's a much better approach to solve this issue. You can store a
wrapped key in an encryption info file alongside the SSTable file.
Here's how it works:
1. randomly generate a key Kr
2. encrypt the SSTable file with the key Kr, store the encrypted SSTable
file on disk
3. derive a key encryption key KEK from the SSTable file's information
(e.g.: table UUID + generation) and the user chosen master key Km, so
you have KEK = KDF(UUID+GEN, Km)
4. wrap (encrypt) the key Kr with the KEK, so you have WKr = KW(Kr, KEK)
5. hash the Km, the hash will be used as a key ID to identify which master
key was used to encrypt the key Kr if the server has multiple master
keys in use
6. store the WKr and the hash of Km in a separate file alongside
the SSTable file

In the read path, the Kr should be kept in memory to help improve
performance and this will also allow zero-downtime master key rotation.

During a key rotation:
1. derive the KEK in the same way: KEK = KDF(UUID+GEN, Km)
2. read the WKr from the encryption information file, and unwrap
(decrypt) it using the KEK to get the Kr
3. derive a new KEK' from the new master key Km' in the same way as above
4. wrap (encrypt) the key Kr with KEK' to get WKr' = KW(Kr, KEK')
5. hash the new master key Km', and store it together with the WKr' in
the encryption info file

Since the key rotation only involves rewriting the encryption info file,
the operation should take only a few milliseconds per SSTable file, so it
will be much faster than decrypting and then re-encrypting the SSTable data.



On 15/11/2021 18:42, Jeremiah D Jordan wrote:

On Nov 14, 2021, at 3:53 PM, Stefan 
Miklosovic  wrote:

Hey,

there are two points we are not completely sure about.

The first one is streaming. If there is a cluster of 5 nodes, each
node has its own unique encryption key. Hence, if a SSTable is stored
on a disk with the key for node 1 and this is streamed to node 2 -
which has a different key - it would not be able to decrypt that. Our
idea is to actually send data over the wire _decrypted_; however, it
would still be secure if internode communication is done via TLS. Is
this approach good with you?


So would you fail startup if someone enabled sstable encryption but did not 
have TLS for internode communication?  Another concern here is making sure zero 
copy streaming does not get triggered for this case.
Have you considered having some way to distribute the keys to all nodes such 
that you don’t need to decrypt on the sending side?  Having to do this will 
mean a lot more overhead for the sending side of a streaming operation.


The second question is about key rotation. If an operator needs to
roll the key because it was compromised or there is some policy around
that, we should be able to provide some way to rotate it. Our idea is
to write a tool (either a subcommand of nodetool (rewritesstables)
command

Re: Resurrection of CASSANDRA-9633 - SSTable encryption

2021-11-15 Thread Bowen Song

The second question is about key rotation. If an operator needs to
roll the key because it was compromised or there is some policy around
that, we should be able to provide some way to rotate it. Our idea is
to write a tool (either a subcommand of nodetool (rewritesstables)
command or a completely standalone one in tools) which would take the
first, original key, the second, new key and dir with sstables as
input and it would literally take the data and it would rewrite it to
the second set of sstables which would be encrypted with the second
key. What do you think about this?


   I would rather suggest that “what key encrypted this” be part of the sstable 
metadata, and allow there to be multiple keys in the system.  This way you can 
just add a new “current key” so new sstables use the new key, but existing 
sstables would use the old key.  An operator could then trigger a “nodetool 
upgradesstables —all” to rewrite the existing sstables with the new “current 
key”.

There's a much better approach to solve this issue. You can store a 
wrapped key in an encryption info file alongside the SSTable file. 
Here's how it works:

1. randomly generate a key Kr
2. encrypt the SSTable file with the key Kr, store the encrypted SSTable 
file on disk
3. derive a key encryption key KEK from the SSTable file's information 
(e.g.: table UUID + generation) and the user chosen master key Km, so 
you have KEK = KDF(UUID+GEN, Km)

4. wrap (encrypt) the key Kr with the KEK, so you have WKr = KW(Kr, KEK)
5. hash the Km, the hash will be used as a key ID to identify which master 
key was used to encrypt the key Kr if the server has multiple master 
keys in use
6. store the WKr and the hash of Km in a separate file alongside 
the SSTable file


In the read path, the Kr should be kept in memory to help improve 
performance and this will also allow zero-downtime master key rotation.


During a key rotation:
1. derive the KEK in the same way: KEK = KDF(UUID+GEN, Km)
2. read the WKr from the encryption information file, and unwrap 
(decrypt) it using the KEK to get the Kr

3. derive a new KEK' from the new master key Km' in the same way as above
4. wrap (encrypt) the key Kr with KEK' to get WKr' = KW(Kr, KEK')
5. hash the new master key Km', and store it together with the WKr' in 
the encryption info file


Since the key rotation only involves rewriting the encryption info file, 
the operation should take only a few milliseconds per SSTable file, so it 
will be much faster than decrypting and then re-encrypting the SSTable data.




On 15/11/2021 18:42, Jeremiah D Jordan wrote:



On Nov 14, 2021, at 3:53 PM, Stefan 
Miklosovic  wrote:

Hey,

there are two points we are not completely sure about.

The first one is streaming. If there is a cluster of 5 nodes, each
node has its own unique encryption key. Hence, if a SSTable is stored
on a disk with the key for node 1 and this is streamed to node 2 -
which has a different key - it would not be able to decrypt that. Our
idea is to actually send data over the wire _decrypted_; however, it
would still be secure if internode communication is done via TLS. Is
this approach good with you?


So would you fail startup if someone enabled sstable encryption but did not 
have TLS for internode communication?  Another concern here is making sure zero 
copy streaming does not get triggered for this case.
Have you considered having some way to distribute the keys to all nodes such 
that you don’t need to decrypt on the sending side?  Having to do this will 
mean a lot more overhead for the sending side of a streaming operation.


The second question is about key rotation. If an operator needs to
roll the key because it was compromised or there is some policy around
that, we should be able to provide some way to rotate it. Our idea is
to write a tool (either a subcommand of nodetool (rewritesstables)
command or a completely standalone one in tools) which would take the
first, original key, the second, new key and dir with sstables as
input and it would literally take the data and it would rewrite it to
the second set of sstables which would be encrypted with the second
key. What do you think about this?

I would rather suggest that “what key encrypted this” be part of the sstable 
metadata, and allow there to be multiple keys in the system.  This way you can 
just add a new “current key” so new sstables use the new key, but existing 
sstables would use the old key.  An operator could then trigger a “nodetool 
upgradesstables —all” to rewrite the existing sstables with the new “current 
key”.


Regards

On Sat, 13 Nov 2021 at 19:35,  wrote:

Same reaction here - great to have traction on this ticket. Shylaja, thanks for 
your work on this and to Stefan as well! It would be wonderful to have the 
feature complete.

One thing I’d mention is that a lot’s changed about the project’s testing 
strategy since the original patch was written. I see that the 2016 version 

Re: [DISCUSS] Creating a new slack channel for newcomers

2021-11-09 Thread Bowen Song
As a newcomer (made two commits since October) who has been watching 
this mailing list since then, I don't like the idea of a separate 
channel for beginner questions. The volume in this mailing list is 
fairly low, and I can't see any legitimate reason for diverting a portion of 
that into another channel, further reducing the volume in the existing 
channel and perhaps not creating much volume in the new channel either.


Personally, I think a clearly written and easy to find community 
guideline highlighting that this mailing list is suitable for beginner 
questions, and giving some suggestions/recommendations on when, where and 
how to ask beginner questions would be more useful.


At the moment, because the volume of beginner questions is very low on 
this mailing list, newcomers like me don't feel comfortable asking 
questions here. That's not because there are 600 pairs of eyes watching 
(TBH, if you hadn't mentioned it, I wouldn't have noticed), but because 
of herd mentality: if not many questions are asked here, most people 
won't start asking them. It's all about creating an environment that 
makes people feel comfortable asking questions here.


On 08/11/2021 16:28, Benjamin Lerer wrote:

Hi everybody,

Aleksei Zotov mentioned to me that it was a bit intimidating for newcomers
to ask beginner questions in the cassandra-dev channel as it has over 600
followers and that we should probably have a specific channel for
newcomers.
This proposal makes total sense to me.

What is your opinion on this? Do you have any concerns about it?

Benjamin






Re: [DISCUSS] How to implement backward compatibility (CASSANDRA-17048)

2021-10-26 Thread Bowen Song
The user will be able to test the new feature in a testing environment 
and not push the changes to their production environment if they are not 
satisfied.


On 26/10/2021 12:00, Jacek Lewandowski wrote:

Though, the user is unable to test the new feature without enabling it. And
when it is enabled, the user is unable to revert it.

- - -- --- -  -
Jacek Lewandowski


On Tue, Oct 26, 2021 at 12:54 PM Bowen Song  wrote:


Personally, I would prefer a transition period in which the new feature
is not enabled by default. This not only makes version upgrading easier,
it also allows the user to stay on the old behaviour if they experience
any issue with the new feature (e.g.: bugs in the new feature, or edge
use cases / 3rd party tools depending on the old behaviour) until the
issue is resolved.

On 26/10/2021 10:21, Jacek Lewandowski wrote:

Hi,

In short, we are discussing UUID based sstable generation identifiers in
https://issues.apache.org/jira/browse/CASSANDRA-17048.

The question which somehow holds us up is support for downgrading. Long
story short, when we generate new sstables with uuid based ids, they are
not readable by older C* versions.

1. should we implement a downgrade tool? (it may be quite complex)
2. should we let users enable the new uuid ids later when they are sure
they will not downgrade in the future?

Thanks,
Jacek






Re: [DISCUSS] How to implement backward compatibility (CASSANDRA-17048)

2021-10-26 Thread Bowen Song
Personally, I would prefer a transition period in which the new feature 
is not enabled by default. This not only makes version upgrading easier, 
it also allows the user to stay on the old behaviour if they experience 
any issue with the new feature (e.g.: bugs in the new feature, or edge 
use cases / 3rd party tools depending on the old behaviour) until the 
issue is resolved.


On 26/10/2021 10:21, Jacek Lewandowski wrote:

Hi,

In short, we are discussing UUID based sstable generation identifiers in 
https://issues.apache.org/jira/browse/CASSANDRA-17048.

The question which somehow holds us up is support for downgrading. Long story 
short, when we generate new sstables with uuid based ids, they are not readable 
by older C* versions.

1. should we implement a downgrade tool? (it may be quite complex)
2. should we let users enable the new uuid ids later when they are sure they 
will not downgrade in the future?

Thanks,
Jacek






Re: Tradeoffs for Cassandra transaction management

2021-10-15 Thread Bowen Song
I'm worried that by the time a consensus is reached, the people who 
originally proposed the CEP may have long lost their passion for it 
and may no longer be willing to contribute.


On 15/10/2021 16:55, Benjamin Lerer wrote:

Reaching consensus is hard but we will get there :-)

Le ven. 15 oct. 2021 à 17:33, Mick Semb Wever  a écrit :


I have reviewed CEP-15 and I must say, I'm excited to see its inclusion
into mainline Cassandra, and I'm disheartened to see what appears to be
an unsubstantiated veto of the proposal from the committee's leadership.



Leif,
the Accord paper and CEP-15 have indeed generated a lot of excitement in the
community.

But please don't misinterpret what vetoes are. Cassandra 4.0 (from RCs) was
vetoed four times before it got released; every veto was important and in
support of getting 4.0.0 out, and appreciated by all. No one doubted that 4.0.0
was about to come out.

The ASF community has a precedent of seeking consensus, and valuing
community over code. The latter point is a touchy topic, wide open to
different opinions about what constitutes a healthy and inclusive community
both in the short and long term. In my opinion, rushing people never helps,
bear with us and we will get there and get there together. And I believe we
will have some valuable retrospectives from current threads to help us
become even better at what we do.

kind regards,
Mick


