Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2025-04-11 Thread Jon Haddad
Right, exactly.  Which (I think) makes the object store about as valuable
as an ephemeral disk if you don't keep everything on there.  It's a
tradeoff I'd never make given the cost / benefit.

Does that mean you agree that we should focus on writethrough cache mode
first?

Jon



On Fri, Apr 11, 2025 at 2:09 PM Jeff Jirsa  wrote:



Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2025-04-11 Thread Jeff Jirsa



> On Apr 11, 2025, at 1:15 PM, Jon Haddad  wrote:
> 
> 
> I also keep running up against my concern about treating object store as a 
> write back cache instead of write through.  "Tiering" data off has real 
> consequences for the user, the big one being data loss, especially with 
> regards to tombstones.  I think this is a pretty serious foot gun.  It's the 
> same problem we originally had with JBOD, where we could have tombstones on 
> one disk and the shadowed data on the other.  Losing one disk results in data 
> getting resurrected.  Anthony covered it in a blog post [1] and I believe 
> CASSANDRA-6696 was the JIRA that addressed the problem.  Introducing tiering 
> would essentially bring this problem back.


If you lose one disk, you shoot the instance. We have to stop pretending you 
can have partial failures. That’s it. That’s the fix. You don’t get to lose 
part of a machine and pretend it’s still viable. Just like losing a commit 
log segment or losing an object in a bucket, if you lose one object, you throw 
it away or you’ve resurrected data / violated consistency. 





Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2025-04-11 Thread Jon Haddad
I've been thinking about this a bit more recently, and I think Joey's
suggestion about improving the yaml based disk configuration is a better
first step than what I wrote (table definition), for a couple reasons.

1. Attaching it to the schema means we need to have the disk configuration
as part of the schema as well, or we need to start enforcing homogeneous
disk setups.  We can't rely on the drive configs to be the same across the
cluster currently.
2. Changing the disk configuration is something we can only do as an
offline operation today.  Making it mutable opens up a world of complexity.
3. I'm not even sure how we'd define this at a configuration level in a way
that wouldn't be insanely confusing.  We currently don't allow people to
have multiple disk configs or allow for custom placement, so this would add
that complexity as well as the complexity of caching.

I also keep running up against my concern about treating object store as a
write back cache instead of write through.  "Tiering" data off has real
consequences for the user, the big one being data loss, especially with
regards to tombstones.  I think this is a pretty serious foot gun.  It's
the same problem we originally had with JBOD, where we could have
tombstones on one disk and the shadowed data on the other.  Losing one disk
results in data getting resurrected.  Anthony covered it in a blog post [1]
and I believe CASSANDRA-6696 was the JIRA that addressed the problem.
Introducing tiering would essentially bring this problem back.

I think we should update the proposal as Joey suggested, improving
data_file_locations to allow for the configuration of treating a local disk
as a cache and another disk or object store as the durable store.  Also
known as a writethrough cache.  It's defined clearly in the LVM docs:

> Writethrough ensures that any data written will be stored both in the
cache and on the origin LV.  The loss of a device associated with the cache
in this case would not mean the loss of any data.
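
To make that concrete, the improved data_file_locations might express the
writethrough pairing along these lines (purely a sketch; the "role" key and
its values are placeholders I'm making up here, not a settled format):

```yaml
data_file_locations:
  disk:
    path: "file:///var/lib/cassandra/data"
    role: "cache"     # local disk is the cache; losing it loses no data
  object:
    path: "s3://..."
    role: "origin"    # durable store; writes block until persisted here
```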

We can discuss the other stuff as a follow up, but IMO we should first
focus on this use case, which has the fewest sharp edges.  Getting this in
would be a huge win on its own and I think it's what the majority of users
would benefit from.  That makes it a pretty big project on its own, and we
can work on the follow up CEPs to enhance this body of work in parallel
with the implementation of this CEP, if it's really desired.

Jon

[1]
https://thelastpickle.com/blog/2018/08/22/the-fine-print-when-using-multiple-data-directories.html
[2] https://issues.apache.org/jira/browse/CASSANDRA-6696
[3] https://www.man7.org/linux/man-pages/man7/lvmcache.7.html



On Sat, Mar 8, 2025 at 3:00 PM Jon Haddad  wrote:

Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2025-03-08 Thread Jon Haddad
I really like the data directories and replication configuration.  I think
it makes a ton of sense to put it in the yaml, but we should probably yaml
it all, and not nest JSON :), and we can probably simplify it a little with
a file uri scheme, something like this:

data_file_locations:
  disk:
path: "file:///var/lib/cassandra/data"
  object:
path: "s3://..."

You've opened up a couple fun options here with this as well.  I hadn't
thought about 3 tiers of storage at all, but I could definitely see

NVMe -> EBS -> Object Store

as a valuable setup.

Does spread == JBOD here?  Maybe we just call it jbod?  I can't help but
bikeshed :)

Jon

On Sat, Mar 8, 2025 at 8:06 AM Joseph Lynch  wrote:


Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2025-03-08 Thread Štefan Miklošovič
That is cool, but this still does not show / explain what it would look like
when it comes to the dependencies needed for actually talking to storage
like s3.

Maybe I am missing something here, and please correct me if I am mistaken,
but if I understand correctly, for talking to s3 we would need to use a
library like this, right? (1). So that would be added among Cassandra's
dependencies? Hence Cassandra starts to be biased towards s3? Why s3? Every
time somebody comes up with support for a new remote storage, would that be
added to the classpath as well? How are these dependencies going to play
with each other and with Cassandra in general? Will all these storage
provider libraries for arbitrary clouds even be compatible with Cassandra
licence-wise?

I am sorry I keep repeating these questions, but this is the part that I
just don't get at all.

We can indeed add an API for this, sure, why not. But for people who do not
want to deal with this at all and would just be OK with a mounted FS, why
would we block them from doing that?

(1) https://github.com/aws/aws-sdk-java/blob/master/aws-java-sdk-s3/pom.xml

On Wed, Mar 5, 2025 at 3:28 PM Mick Semb Wever  wrote:

>
>
>
> It’s not an area where I can currently dedicate engineering effort. But if
>> others are interested in contributing a feature like this, I’d see it as
>> valuable for the project and would be happy to collaborate on
>> design/architecture/goals.
>>
>
>
> Jake mentioned 17 months ago a custom FileSystemProvider we could offer.
>
> None of us at DataStax has gotten around to providing that, but to quickly
> throw something over the wall this is it:
>
> https://github.com/datastax/cassandra/blob/main/src/java/org/apache/cassandra/io/storage/StorageProvider.java
>
>   (with a few friend classes under o.a.c.io.util)
>
> We then have a RemoteStorageProvider, private in another repo, that
> implements that and also provides the RemoteFileSystemProvider that Jake
> refers to.
>
> Hopefully that's a start to get people thinking about CEP level details,
> while we get a cleaned abstract of RemoteStorageProvider and friends to
> offer.
>
>


Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2025-03-08 Thread Joseph Lynch
Jon, I like where you are headed with that, just brainstorming out what the
end interface might look like (might be getting a bit ahead of things
talking about directories if we don't even have files implemented yet).
What do folks think about pairing data_file_locations (fka
data_file_directories) with three table level tunables: replication
strategy ("spread", "tier"), the eviction strategy ("none", "lfu", "lru",
etc...) and the writeback duration (0s, 10s, 8d)? So your three examples

data_file_locations:
  disk: {type: "filesystem", "path": "/var/lib/cassandra/data"}
  object: {type: "s3", "path": "s3://..."}

data_file_eviction_strategies:
  none: {type: "NONE"}
  lfu: {type: "LFU", "min_retention": "7d"}
  hot-cold: {type: "LRU", "min_retention": "60d"}

Then on tables to achieve your three proposals
WITH storage = {locations: ["disk", "object"], "replication": "tier"}
WITH storage = {locations: ["disk", "object"], "replication": "tier",
"eviction": ["lfu"], "writeback": ["10s"]}
WITH storage = {locations: ["disk", "object"], "replication": "tier",
"eviction": ["hot-cold"], "writeback": ["8d"]}

We definitely wouldn't want to implement all of the eviction strategies in
the first cut - probably just the object file location (CEP 36 fwict) and
eviction strategy "none". Default eviction would be "none" if not
specified (throw errors on full storage), default writeback would be "0s".
I am thinking that the strategies maybe should live in cassandra.yaml and
not table properties because they're usually a property of the node being
setup, tables can't magically make you have more locations to store data in.
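
To pin down the semantics, here is a rough sketch of how an eviction pass
over local sstable copies might behave under these strategies (all names are
hypothetical, nothing here is proposed API):

```python
import time
from dataclasses import dataclass


@dataclass
class SSTable:
    name: str
    created_at: float      # when the sstable landed on local disk
    last_access: float     # for LRU ordering
    access_count: int      # for LFU ordering


def evictable(tables, strategy, min_retention_s, now=None):
    """Return local sstable copies eligible for eviction, least-valuable
    first. Only sstables older than min_retention are candidates; "none"
    never evicts (a full disk becomes an error instead)."""
    now = now if now is not None else time.time()
    if strategy == "none":
        return []
    candidates = [t for t in tables if now - t.created_at > min_retention_s]
    if strategy == "lru":
        return sorted(candidates, key=lambda t: t.last_access)
    if strategy == "lfu":
        return sorted(candidates, key=lambda t: t.access_count)
    raise ValueError(f"unknown eviction strategy: {strategy}")
```

The important property is that eviction only ever drops a local *copy*; the
durable object in the remote store is untouched.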

Nice thing about this is that our JBOD thing with multiple directories just
becomes a replication strategy, default stays the same (everything local),
and we can add more eviction strategies as folks want them. One could even
imagine crazy tuned setups like tmpfs -> local-ssd -> remote-ssd ->
remote-object configurations >_<

-Joey

On Sat, Mar 8, 2025 at 9:33 AM Jon Haddad  wrote:


Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2025-03-08 Thread Jon Haddad
Taking this a step further, this opens up a different way of bootstrapping
new nodes, by using the object store's native copy commands.
This is something that's impossible when just using the filesystem mount.
I think Joey mentioned something along these lines to me several years ago,
maybe at the conference we had in Vegas.

Even longer term, it might make sense to rethink how we handle token ranges
and ownership.  If ranges are predetermined, we could organize data by
token range instead of by node.  Then we could, if the data was fully in
the object store, scale up and down dynamically without doing any streaming
or copying.  Imagine a scenario where initially ranges [0-10], [11-20] are
handled by node A.  We fire up node B, and TCM reassigns ownership of
[11-20] range to B.  Want to scale down?  Reassign [11-20] back to A.  It
also allows for zero copy streaming to be used 100% of the time in
non-object store use cases because we can partition the on disk data by
range ahead of time, and token ranges wouldn't be split during bootstrap,
just reassigned.

All those teams that need to scale up their clusters for the holiday season
could do so the day before, going from 9 nodes to 90, then back down to 9
after it's over without any streaming overhead. Massive cost savings.
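
Assuming data fully in the object store and predetermined ranges, the
ownership move is just a metadata update; a toy model of the idea
(hypothetical, nothing like the real TCM API):

```python
def reassign(ownership, token_range, new_node):
    """Move a predetermined token range to another node. Because each
    range's sstables already live in the object store, no streaming or
    copying happens; only the ownership map changes."""
    if token_range not in ownership:
        raise KeyError(f"unknown range: {token_range}")
    updated = dict(ownership)
    updated[token_range] = new_node
    return updated


# Scale up: node B takes over [11-20] from A; scale back down by reversing.
ownership = {(0, 10): "A", (11, 20): "A"}
ownership = reassign(ownership, (11, 20), "B")
ownership = reassign(ownership, (11, 20), "A")
```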

It's pretty clear to me there's a ton of opportunity here.  This is a path
to making C* cost a small fraction of what it does today.  I've been
talking about 20TB / node for years, now I'm thinking about 100TB / node.

Jon


On Sat, Mar 8, 2025 at 6:31 AM Jon Haddad  wrote:


Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2025-03-08 Thread Jon Haddad
Thanks Jordan and Joey for the additional info.

One thing I'd like to clarify - what I'm mostly after is 100% of my data on
object store, local disk acting as an LRU cache, although there's also a
case for the mirror.

What I see so far are three high level ways of running this:

1. Mirror Mode

This is 100% on both.  When SSTables are written, they'd be written to the
object store as well, and we'd block till it's fully written to the object
store to ensure durability.  We don't exceed the space of the disk.  This
is essentially a RAID 1.

2. LRU

This behaves similarly to mirror mode, but we can drop local data to make
room.  If we need it from the object store we can fetch it, with a bit of
extra latency.  This is the mode I'm most interested in.  I have customers
today that would use this because it would cut their operational costs
massively.  This is like the LVM cache pool or the S3 cache Joey
described.  You should probably be able to configure how much of your local
disk you want to keep free, or specify a minimum period to keep local.
Might be some other fine grained control here.

3. Tiered Mode

Data is written locally, but not immediately replicated to S3.  This is the
TWCS scenario, where you're using window of size X, you copy the data to
object store after Y period, and keep it local for Z.  This allows you to
avoid pushing data up every time it's compacted, you keep all hot data
local, and you tier it off.  Maybe you use a 7-day TWCS window, copy after 8
days, and retain local for 60.  For TWCS to work well here, it should be
updated to use a max sstable size for the final compaction.

To me, all 3 of these should be possible, and each has valid use cases. I
think it should also be applicable per table, because some stuff you want
to mirror (my user table), some stuff you want to LRU (infrequently
accessed nightly jobs), and some stuff you want to tier off (user
engagement).

What it really boils down to are these properties:

* Do we upload to object store when writing the SSTable?
* Do we upload after a specific amount of time?
* Do we free up space locally when we reach a threshold?
* Do we delete local files after a specific time period?

If this is per table, then it makes sense to me to follow the same pattern
as compaction, giving us something like this:

WITH storage = {"class": "Mirror"}
WITH storage = {"class": "LRU", "min_time_to_retain":"7 days"}
WITH storage = {"class": "Tiered", "copy_after":"8 days",
"min_time_to_retain":"60 days"}

Something like that, at least.
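
Mechanically, those four properties reduce to a small per-sstable policy
check; a rough sketch, with parameter names invented to mirror the
compaction-style options above (not a proposed implementation):

```python
def storage_actions(cls, sstable_age_days=0, disk_usage=0.0,
                    copy_after_days=0, retain_days=None, free_at=0.9):
    """Return (upload_now, may_drop_local) for an sstable under the three
    proposed storage classes. Hypothetical sketch only."""
    if cls == "Mirror":
        # always written through to the object store, never dropped locally
        return (True, False)
    if cls == "LRU":
        # written through; local copy droppable once disk crosses threshold
        return (True, disk_usage >= free_at)
    if cls == "Tiered":
        # uploaded only after copy_after; local copy kept for retain_days
        upload = sstable_age_days >= copy_after_days
        droppable = retain_days is not None and sstable_age_days >= retain_days
        return (upload, droppable)
    raise ValueError(f"unknown storage class: {cls}")
```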

Jon



On Sat, Mar 8, 2025 at 5:39 AM Joseph Lynch  wrote:


Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2025-03-08 Thread Joseph Lynch
Great discussion - I agree strongly with Jon's points, giving operators
this option will make many operators' lives easier. Even if you still have
to have 100% disk space to meet performance requirements, that's still much
more efficient than you can run C* with just disks (as you need to leave
headroom). Jon's points are spot on regarding replacement, backup, and
analytics.

Regarding Cheng and Jordan's concern around mount perf, I want to share the
"why". In the test Chris and I ran, we used s3-mountpoint's built-in cache
[1] which uses the local filesystem to cache remote calls. While all data
was in ephemeral cache and compaction wasn't running, performance was
excellent for both writes and reads. The problem was that cache did not
work well with compaction - it caused cache evictions of actively read hot
data at well below full disk usage. The architecture was fine - it was the
implementation of the cache that was poor (and talking to AWS S3 engineers
about it, good cache eviction is a harder problem for them to solve
generally and is relatively low on their priority list at least right now)
- to me this strongly supports the need for this CEP so the project can
decide how to make these tradeoffs.

Allowing operators to choose to treat Cassandra as a cache just seems good
because caches are easier to run - you just have to manage cache filling
and dumping, and monitor your cache hit ratio. Like Jon says, some people
may want 100% hit ratio, but others may be ok with lower ratios (or even
zero ratio) to save cost. If the proposed ChannelProxy implemented
write-through caching with local disk (which it should probably do for
availability anyway), would that alleviate some of the concerns? That
would let operators choose to provision enough local disk somewhere on the
spectrum of:

a) hold everything (full HA, lowest latency, still cheaper than status quo
and easier)
b) hold commitlogs and hot sstables (hot-cold storage, middle option that
has degraded p99 read latency for cold data)
c) hold commitlogs (just HA writes, high and variable latency on reads)

I'll also note that when the cluster degrades from a to b, in contrast to
the status quo where you lose data and blow up, this would just get slower
on the p99 read - and since replacement is so easy in comparison,
recovering would be straightforward (move to machines with more disk and
run the local disk equivalent of happycache load [2]).

-Joey

[1]
https://github.com/awslabs/mountpoint-s3/blob/main/doc/CONFIGURATION.md#caching-configuration
[2] https://github.com/hashbrowncipher/happycache
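To make the write-through idea above concrete, here is a minimal, hypothetical sketch (invented class and file names, not the actual CEP-36 API): every write is persisted to both the local cache directory and the remote location before it is acknowledged, so losing the local disk only costs latency, never data.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;

// Hypothetical write-through sketch: writes land on both tiers before being
// acknowledged; reads prefer the local copy and re-hydrate the cache on miss.
public final class WriteThroughSketch {

    private final Path local;   // fast local disk, acting as the cache
    private final Path remote;  // stand-in for the object store

    public WriteThroughSketch(Path local, Path remote) {
        this.local = local;
        this.remote = remote;
    }

    // Write-through: persist to both tiers before returning.
    public void write(String name, byte[] data) throws IOException {
        Files.write(local.resolve(name), data);
        Files.write(remote.resolve(name), data);  // durable copy
    }

    // Read: local first (cache hit), otherwise re-hydrate from remote.
    public byte[] read(String name) throws IOException {
        Path cached = local.resolve(name);
        if (Files.exists(cached))
            return Files.readAllBytes(cached);
        byte[] data = Files.readAllBytes(remote.resolve(name));
        Files.write(cached, data);  // repopulate the cache on miss
        return data;
    }

    public static void main(String[] args) throws IOException {
        Path local = Files.createTempDirectory("local");
        Path remote = Files.createTempDirectory("remote");
        WriteThroughSketch proxy = new WriteThroughSketch(local, remote);

        proxy.write("na-1-big-Data.db", "sstable bytes".getBytes(StandardCharsets.UTF_8));

        // Simulate losing the local disk: the remote copy still serves reads.
        Files.delete(local.resolve("na-1-big-Data.db"));
        String back = new String(proxy.read("na-1-big-Data.db"), StandardCharsets.UTF_8);
        System.out.println(back);  // prints "sstable bytes"
    }
}
```

Options (a), (b), and (c) above then differ only in how aggressively the local tier evicts, not in durability.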


On Fri, Mar 7, 2025 at 4:01 PM Jordan West  wrote:

> I too initially felt we should just use mounts and was excited by e.g.
> Single Zone Express mounting. As Cheng mentioned we tried it…and the
> results were disappointing (except for use cases that could sometimes
> tolerate seconds of p99 latency). That brought me around to needing an
> implementation we own that we can optimize properly as others have
> discussed.
>
> Regarding what percent of data should be in the cold store I would love to
> see an implementation that allows what Jon is proposing with the full data
> set and what the original proposal included with partial dataset. I think
> there are different reasons to use both.
>
> Jordan
>
>
> On Fri, Mar 7, 2025 at 11:02 Jon Haddad  wrote:
>
>> Supporting a filesystem mount is perfectly reasonable. If you wanted to
>> use that with the S3 mount, there's nothing that should prevent you from
>> doing so, and the filesystem version is probably the default implementation
>> that we'd want to ship with since, to your point, it doesn't require
>> additional dependencies.
>>
>> Allowing support for object stores is just that - allowing it.  It's just
>> more functionality, more flexibility.  I don't claim to know every object
>> store or remote filesystem, there are going to be folks that want to do
>> something custom I can't think of right now and there's no reason to box
>> ourselves in.  The abstraction to allow it is a small one.  If folks want
>> to put SSTables on HDFS, they should be able to.  Who am I to police them,
>> especially if we can do it in a way that doesn't carry any risk, making it
>> available as a separate plugin?
>>
>> My comment with regard to not wanting to treat object store as tiered
>> storage has everything to do with what I want to do with the data.  I want
>> 100% of it on object store (with a copy of hot data on the local node) for
>> multiple reasons, some of which (maybe all) were made by Jeff and Scott:
>>
>> * Analytics is easier, no transfer off C*
>> * Replace is dead simple, just pull the data off the mount when it boots
>> up.  Like using EBS, but way cheaper.  (thanks Scott for some numbers on
>> this)
>> * It makes backups easier, just copy the bucket.
> * If you're using an object store for backups anyway, then I don't see why
> you wouldn't keep all your data there
> * I hadn't even really thought about scale to zero before but I love this
> too

Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2025-03-07 Thread Štefan Miklošovič
Because an earlier reply hinted that mounting a bucket yields "terrible
results". That has moved the discussion, in my mind, practically to the
place of "we are not going to do this", to which I explained that in this
particular case I do not find the speed important, because the use cases
you want to use it for do not have anything in common with what I want to
use it for.

Since then, in my eyes this was a binary "either / or". I was repeatedly
trying to get an opinion about being able to mount it regardless, to which,
afaik, only you explicitly expressed an opinion that it is OK but you are
not a fan of it:

"I personally can't see myself using something that treats an object store
as cold storage where SSTables are moved (implying they weren't there
before), and I've expressed my concerns with this, but other folks seem to
want it and that's OK."

So my assumption was that, aside from you being OK with it, mounting was
not viable, and that is why it looked like we were forcing it.

To be super honest, if we made custom storage providers / proxies possible
and it was already in place, then my urge to do "something fast and
functioning" (e.g. mounting a bucket) would not exist. I would not use
mounted buckets if we had this already in place and configurable in such a
way that we could say that everything except (e.g) snapshots would be
treated as it is now.

But, I can see how this will take a very, very long time to implement. This
is a complex CEP to tackle. I remember this topic being discussed in the
past as well. I _think_ there were at least two occasions when this was
discussed, e.g. that it might be ported / retrofitted from what Mick was
showing. Nothing happened. Maybe mounting a bucket is not perfect and
doing it the other way is a more fitting solution, but as the saying
goes, "perfect is the enemy of good".
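For reference, mounting a bucket with each of the "big three" FUSE clients from the links below looks roughly like the following. Bucket and mount-point names are placeholders, and flags vary by client version, so treat this as a sketch rather than a recipe:

```shell
# AWS (mountpoint-s3); --cache enables the local-filesystem cache discussed
# elsewhere in this thread
mount-s3 --cache /mnt/s3-cache my-bucket /mnt/cassandra-snapshots

# GCP (Cloud Storage FUSE)
gcsfuse my-bucket /mnt/cassandra-snapshots

# Azure (blobfuse2; most configuration lives in a YAML file)
blobfuse2 mount /mnt/cassandra-snapshots --config-file=./blobfuse2.yaml
```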

On Fri, Mar 7, 2025 at 6:32 PM Jon Haddad  wrote:

> If that's not your intent, then you should be more careful with your
> replies.  When you write something like this:
>
> > While this might work, what I find tricky is that we are forcing this to
> users. Not everybody is interested in putting everything into a bucket and
> serving traffic from that. They just don't want to do that. Because reasons.
> They are just happy with what they have etc, it works fine for years and so
> on. They just want to upload SSTables upon snapshotting and call it a day.
>
> > I don't think we should force our worldview on them if they are not
> interested in it.
>
> It comes off *extremely* negative.  You use the word "force" here multiple
> times.
>
>
>
>
> On Fri, Mar 7, 2025 at 9:18 AM Štefan Miklošovič 
> wrote:
>
>> I was explaining multiple times (1) that I don't have anything against
>> what is discussed here.
>>
>> Having questions about what that is going to look like does not mean I am
>> dismissive.
>>
>> (1) https://lists.apache.org/thread/ofh2q52p92cr89wh2l3djsm5n9dmzzsg
>>
>> On Fri, Mar 7, 2025 at 5:44 PM Jon Haddad 
>> wrote:
>>
>>> Nobody is saying you can't work with a mount, and this isn't a
>>> conversation about snapshots.
>>>
>>> Nobody is forcing users to use object storage either.
>>>
>>> You're making a ton of negative assumptions here about both the
>>> discussion, and the people you're having it with.  Try to be more open
>>> minded.
>>>
>>>
>>> On Fri, Mar 7, 2025 at 2:28 AM Štefan Miklošovič 
>>> wrote:
>>>
 The only way I see that working is that, if everything was in a bucket,
 if you take a snapshot, these SSTables would be "copied" from live data dir
 (living in a bucket) to snapshots dir (living in a bucket). Basically, we
 would need to say "and if you go to take a snapshot on this table, instead
 of hardlinking these SSTables, do a copy". But this "copying" would be
 internal to a bucket itself. We would not need to "upload" from node's
 machine to s3.

 While this might work, what I find tricky is that we are forcing this
 to users. Not everybody is interested in putting everything into a bucket and
 serving traffic from that. They just don't want to do that. Because reasons.
 They are just happy with what they have etc, it works fine for years and so
 on. They just want to upload SSTables upon snapshotting and call it a day.

 I don't think we should force our worldview on them if they are not
 interested in it.

 On Fri, Mar 7, 2025 at 11:02 AM Štefan Miklošovič <
 [email protected]> wrote:

> BTW, snapshots are quite special because these are not "files", they
> are just hard links. They "materialize" as regular files once underlying
> SSTables are compacted away. How are you going to hardlink from local
> storage to an object storage anyway? We will always need to "upload".
>
> On Fri, Mar 7, 2025 at 10:51 AM Štefan Miklošovič <
> [email protected]> wrote:
>
>> Jon,
>>
>> all "big three" support mounting a bucket locally. That being said, I
>> do

Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2025-03-07 Thread Jordan West
I too initially felt we should just use mounts and was excited by e.g.
Single Zone Express mounting. As Cheng mentioned we tried it…and the
results were disappointing (except for use cases that could sometimes
tolerate seconds of p99 latency). That brought me around to needing an
implementation we own that we can optimize properly as others have
discussed.

Regarding what percent of data should be in the cold store, I would love to
see an implementation that allows both what Jon is proposing with the full
data set and what the original proposal included with a partial dataset. I think
there are different reasons to use both.

Jordan


On Fri, Mar 7, 2025 at 11:02 Jon Haddad  wrote:

> Supporting a filesystem mount is perfectly reasonable. If you wanted to
> use that with the S3 mount, there's nothing that should prevent you from
> doing so, and the filesystem version is probably the default implementation
> that we'd want to ship with since, to your point, it doesn't require
> additional dependencies.
>
> Allowing support for object stores is just that - allowing it.  It's just
> more functionality, more flexibility.  I don't claim to know every object
> store or remote filesystem, there are going to be folks that want to do
> something custom I can't think of right now and there's no reason to box
> ourselves in.  The abstraction to allow it is a small one.  If folks want
> to put SSTables on HDFS, they should be able to.  Who am I to police them,
> especially if we can do it in a way that doesn't carry any risk, making it
> available as a separate plugin?
>
> My comment with regard to not wanting to treat object store as tiered
> storage has everything to do with what I want to do with the data.  I want
> 100% of it on object store (with a copy of hot data on the local node) for
> multiple reasons, some of which (maybe all) were made by Jeff and Scott:
>
> * Analytics is easier, no transfer off C*
> * Replace is dead simple, just pull the data off the mount when it boots
> up.  Like using EBS, but way cheaper.  (thanks Scott for some numbers on
> this)
> * It makes backups easier, just copy the bucket.
> * If you're using an object store for backups anyway, then I don't see why
> you wouldn't keep all your data there
> * I hadn't even really thought about scale to zero before but I love this
> too
>
> Some folks want to treat the object store as a second tier, implying to me
> that once the SSTables reach a certain age or aren't touched, they're
> uploaded to the object store. Supporting this use case shouldn't be that
> much different.  Maybe you don't care about the most recent data, and
> you're OK with losing everything from the last few days, because you can
> reload from Kafka.  You'd be treating C* as a temporary staging area for
> whatever purpose, and you only want to do analytics on the cold data.   As
> I've expressed already, it's just a difference in policy of when to
> upload.  Yes I'm being a bit hand wavy about this part but it's an email
> discussion.  I don't have this use case today, but it's certainly valid.
> Or maybe it works in conjunction with witness replicas / transient
> replication, allowing you to offload data from a node until there's a
> failure, in which case C* can grab it.  I'm just throwing out ideas here.
>
> Feel free to treat object store here as "secondary location".  It can be a
> NAS, an S3 mount, a custom FUSE filesystem, or the S3 api, or whatever else
> people come up with.
>
> In the past, I've made the case that this functionality can be achieved
> with LVM cache pools [1][2], and even provided some benchmarks showing it
> can be used.  I've also argued that we can do node replacements with
> rsync.  While these things are both technically true, others have convinced
> me that having this functionality as first class in the database makes it
> easier for our users and thus better for the project.  Should someone have
> to *just* understand all of LVM, or *just* understand the nuance of
> potential data loss due to rsync's default second resolution?  I keep
> relearning that whenever I say "*just* do X", that X isn't as convenient or
> easy for other people as it is for me, and I need to relax a bit.
>
> The view of the world where someone *just* needs to know dozens of
> workarounds to use the database makes it harder for non-experts to use.
> The bar for usability is constantly being raised, and whatever makes it
> better for a first time user is better for the community.
>
> Anyways, that's where I'm at now.  Easier = better, even if we're
> reinventing some of the wheel to do so.  Sometimes you get a better wheel,
> too.
>
> Jon
>
> [1] https://rustyrazorblade.com/post/2018/2018-04-24-intro-to-lvm/
> [2] https://issues.apache.org/jira/browse/CASSANDRA-8460
>
>
>
> On Fri, Mar 7, 2025 at 10:08 AM Štefan Miklošovič 
> wrote:
>
>> Because an earlier reply hinted that mounting a bucket yields "terrible
>> results". That has moved the discussion, in my mind, practically to t

Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2025-03-07 Thread Jon Haddad
Supporting a filesystem mount is perfectly reasonable. If you wanted to use
that with the S3 mount, there's nothing that should prevent you from doing
so, and the filesystem version is probably the default implementation that
we'd want to ship with since, to your point, it doesn't require additional
dependencies.

Allowing support for object stores is just that - allowing it.  It's just
more functionality, more flexibility.  I don't claim to know every object
store or remote filesystem, there are going to be folks that want to do
something custom I can't think of right now and there's no reason to box
ourselves in.  The abstraction to allow it is a small one.  If folks want
to put SSTables on HDFS, they should be able to.  Who am I to police them,
especially if we can do it in a way that doesn't carry any risk, making it
available as a separate plugin?
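As a sketch of how small that abstraction could be, something like the following keeps local disk as the zero-dependency default, with object-store variants as separate plugins. All names here are invented for illustration; the real CEP-36 interface may look nothing like this:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.file.*;

// Hypothetical pluggable storage abstraction: the core only sees the
// interface, and provider-specific libraries (S3, GCS, Azure, HDFS) live
// in plugins off the default classpath.
public final class StorageLocationSketch {

    public interface StorageLocation {
        SeekableByteChannel open(String name, OpenOption... opts) throws IOException;
    }

    // Default implementation: plain local disk, zero extra dependencies.
    public static final class LocalStorage implements StorageLocation {
        private final Path root;
        public LocalStorage(Path root) { this.root = root; }

        @Override
        public SeekableByteChannel open(String name, OpenOption... opts) throws IOException {
            return Files.newByteChannel(root.resolve(name), opts);
        }
    }

    public static void main(String[] args) throws IOException {
        StorageLocation store = new LocalStorage(Files.createTempDirectory("sstables"));
        try (SeekableByteChannel ch = store.open("na-1-big-Data.db",
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            ch.write(ByteBuffer.wrap(new byte[] {1, 2, 3}));
        }
        try (SeekableByteChannel ch = store.open("na-1-big-Data.db", StandardOpenOption.READ)) {
            System.out.println(ch.size());  // prints 3
        }
    }
}
```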

My comment with regard to not wanting to treat object store as tiered
storage has everything to do with what I want to do with the data.  I want
100% of it on object store (with a copy of hot data on the local node) for
multiple reasons, some of which (maybe all) were made by Jeff and Scott:

* Analytics is easier, no transfer off C*
* Replace is dead simple, just pull the data off the mount when it boots
up.  Like using EBS, but way cheaper.  (thanks Scott for some numbers on
this)
* It makes backups easier, just copy the bucket.
* If you're using an object store for backups anyway, then I don't see why you
wouldn't keep all your data there
* I hadn't even really thought about scale to zero before but I love this
too

Some folks want to treat the object store as a second tier, implying to me
that once the SSTables reach a certain age or aren't touched, they're
uploaded to the object store. Supporting this use case shouldn't be that
much different.  Maybe you don't care about the most recent data, and
you're OK with losing everything from the last few days, because you can
reload from Kafka.  You'd be treating C* as a temporary staging area for
whatever purpose, and you only want to do analytics on the cold data.   As
I've expressed already, it's just a difference in policy of when to
upload.  Yes I'm being a bit hand wavy about this part but it's an email
discussion.  I don't have this use case today, but it's certainly valid.
Or maybe it works in conjunction with witness replicas / transient
replication, allowing you to offload data from a node until there's a
failure, in which case C* can grab it.  I'm just throwing out ideas here.
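The "difference in policy of when to upload" can be made concrete: write-through versus tiered storage differ only in a predicate. A tiny, hypothetical sketch (names invented):

```java
import java.nio.file.attribute.FileTime;
import java.time.Duration;
import java.time.Instant;
import java.util.function.Predicate;

// Hypothetical upload policies: write-through vs. age-based tiering differ
// only in which sstables the predicate selects for the object store.
public final class UploadPolicies {

    // Write-through: every sstable goes to the object store immediately.
    public static final Predicate<FileTime> WRITE_THROUGH = created -> true;

    // Tiered: only upload (and optionally evict locally) sstables older than maxAge.
    public static Predicate<FileTime> olderThan(Duration maxAge) {
        return created -> created.toInstant().isBefore(Instant.now().minus(maxAge));
    }

    public static void main(String[] args) {
        FileTime fresh = FileTime.from(Instant.now());
        FileTime old = FileTime.from(Instant.now().minus(Duration.ofDays(30)));
        System.out.println(WRITE_THROUGH.test(fresh));               // true
        System.out.println(olderThan(Duration.ofDays(7)).test(old));   // true
        System.out.println(olderThan(Duration.ofDays(7)).test(fresh)); // false
    }
}
```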

Feel free to treat object store here as "secondary location".  It can be a
NAS, an S3 mount, a custom FUSE filesystem, or the S3 api, or whatever else
people come up with.

In the past, I've made the case that this functionality can be achieved
with LVM cache pools [1][2], and even provided some benchmarks showing it
can be used.  I've also argued that we can do node replacements with
rsync.  While these things are both technically true, others have convinced
me that having this functionality as first class in the database makes it
easier for our users and thus better for the project.  Should someone have
to *just* understand all of LVM, or *just* understand the nuance of
potential data loss due to rsync's default second resolution?  I keep
relearning that whenever I say "*just* do X", that X isn't as convenient or
easy for other people as it is for me, and I need to relax a bit.

The view of the world where someone *just* needs to know dozens of
workarounds to use the database makes it harder for non-experts to use.
The bar for usability is constantly being raised, and whatever makes it
better for a first time user is better for the community.

Anyways, that's where I'm at now.  Easier = better, even if we're
reinventing some of the wheel to do so.  Sometimes you get a better wheel,
too.

Jon

[1] https://rustyrazorblade.com/post/2018/2018-04-24-intro-to-lvm/
[2] https://issues.apache.org/jira/browse/CASSANDRA-8460
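For anyone curious what the LVM cache-pool approach in [1][2] looks like in practice, a rough sketch follows. Device and volume-group names are placeholders, and this needs root plus real block devices, so treat it as illustration rather than a recipe:

```shell
# Slow, large device (e.g. HDD or network-backed volume) holds the data LV.
lvcreate -L 1T -n data vg0 /dev/slow

# Carve a cache pool out of the fast local NVMe device.
lvcreate --type cache-pool -L 100G -n datacache vg0 /dev/nvme0n1

# Attach the cache pool in writethrough mode, so every write hits the slow
# device before being acknowledged.
lvconvert --type cache --cachepool vg0/datacache --cachemode writethrough vg0/data
```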



On Fri, Mar 7, 2025 at 10:08 AM Štefan Miklošovič 
wrote:

> Because an earlier reply hinted that mounting a bucket yields "terrible
> results". That has moved the discussion, in my mind, practically to the
> place of "we are not going to do this", to which I explained that in this
> particular case I do not find the speed important, because the use cases
> you want to use it for do not have anything in common with what I want to
> use it for.
>
> Since then, in my eyes this was binary "either / or", I was repeatedly
> trying to get an opinion about being able to mount it regardless, to which,
> afaik, only you explicitly expressed an opinion that it is OK but you are
> not a fan of it:
>
> "I personally can't see myself using something that treats an object store
> as cold storage where SSTables are moved (implying they weren't there
> before), and I've expressed my concerns with this, but other folks seem to
> want it and that's OK."
>
> So my assumption was that, aside from you being OK with it, mounting was
> not viable, and that is why it looked like we were forcing it.

Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2025-03-07 Thread Jon Haddad
If that's not your intent, then you should be more careful with your
replies.  When you write something like this:

> While this might work, what I find tricky is that we are forcing this to
users. Not everybody is interested in putting everything into a bucket and
serving traffic from that. They just don't want to do that. Because reasons.
They are just happy with what they have etc, it works fine for years and so
on. They just want to upload SSTables upon snapshotting and call it a day.

> I don't think we should force our worldview on them if they are not
interested in it.

It comes off *extremely* negative.  You use the word "force" here multiple
times.




On Fri, Mar 7, 2025 at 9:18 AM Štefan Miklošovič 
wrote:

> I was explaining multiple times (1) that I don't have anything against
> what is discussed here.
>
> Having questions about what that is going to look like does not mean I am
> dismissive.
>
> (1) https://lists.apache.org/thread/ofh2q52p92cr89wh2l3djsm5n9dmzzsg
>
> On Fri, Mar 7, 2025 at 5:44 PM Jon Haddad  wrote:
>
>> Nobody is saying you can't work with a mount, and this isn't a
>> conversation about snapshots.
>>
>> Nobody is forcing users to use object storage either.
>>
>> You're making a ton of negative assumptions here about both the
>> discussion, and the people you're having it with.  Try to be more open
>> minded.
>>
>>
>> On Fri, Mar 7, 2025 at 2:28 AM Štefan Miklošovič 
>> wrote:
>>
>>> The only way I see that working is that, if everything was in a bucket,
>>> if you take a snapshot, these SSTables would be "copied" from live data dir
>>> (living in a bucket) to snapshots dir (living in a bucket). Basically, we
>>> would need to say "and if you go to take a snapshot on this table, instead
>>> of hardlinking these SSTables, do a copy". But this "copying" would be
>>> internal to a bucket itself. We would not need to "upload" from node's
>>> machine to s3.
>>>
>>> While this might work, what I find tricky is that we are forcing this to
>> users. Not everybody is interested in putting everything into a bucket and
>> serving traffic from that. They just don't want to do that. Because reasons.
>>> They are just happy with what they have etc, it works fine for years and so
>>> on. They just want to upload SSTables upon snapshotting and call it a day.
>>>
>>> I don't think we should force our worldview on them if they are not
>>> interested in it.
>>>
>>> On Fri, Mar 7, 2025 at 11:02 AM Štefan Miklošovič <
>>> [email protected]> wrote:
>>>
 BTW, snapshots are quite special because these are not "files", they
 are just hard links. They "materialize" as regular files once underlying
 SSTables are compacted away. How are you going to hardlink from local
 storage to an object storage anyway? We will always need to "upload".

 On Fri, Mar 7, 2025 at 10:51 AM Štefan Miklošovič <
 [email protected]> wrote:

> Jon,
>
> all "big three" support mounting a bucket locally. That being said, I
> do not think it is reasonable to completely ditch the possibility of
> Cassandra working with a mount, e.g. just for uploading snapshots there.
>
> GCP
>
>
> https://cloud.google.com/storage/docs/cloud-storage-fuse/quickstart-mount-bucket
>
> Azure (this one is quite sophisticated), lot of options ...
>
>
> https://learn.microsoft.com/en-us/azure/storage/blobs/blobfuse2-how-to-deploy?tabs=RHEL
>
> S3, lot of options how to mount that
>
> https://bluexp.netapp.com/blog/amazon-s3-as-a-file-system
>
> On Thu, Mar 6, 2025 at 4:17 PM Jon Haddad 
> wrote:
>
>> Assuming everything else is identical, might not matter for S3.
>> However, not every object store has a filesystem mount.
>>
>> Regarding sprawling dependencies, we can always make the provider
>> specific libraries available as a separate download and put them on their
>> own thread with a separate class path. I think in JVM dtest does this
>> already.  Someone just started asking about IAM for login, it sounds 
>> like a
>> similar problem.
>>
>>
>> On Thu, Mar 6, 2025 at 12:53 AM Benedict  wrote:
>>
>>> I think another way of saying what Stefan may be getting at is what
>>> does a library give us that an appropriately configured mount dir 
>>> doesn’t?
>>>
>>> We don’t want to treat S3 the same as local disk, but this can be
>>> achieved easily with config. Is there some other benefit of direct
>>> integration? Well defined exceptions if we need to distinguish cases is 
>>> one
>>> that maybe springs to mind but perhaps there are others?
>>>
>>>
>>> On 6 Mar 2025, at 08:39, Štefan Miklošovič 
>>> wrote:
>>>
>>> 
>>>
>>> That is cool but this still does not show / explain how it would
>>> look like when it comes to dependencies needed for actually talking to
>>> storages like s3.
>

Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2025-03-07 Thread Štefan Miklošovič
I was explaining multiple times (1) that I don't have anything against what
is discussed here.

Having questions about what that is going to look like does not mean I am
dismissive.

(1) https://lists.apache.org/thread/ofh2q52p92cr89wh2l3djsm5n9dmzzsg

On Fri, Mar 7, 2025 at 5:44 PM Jon Haddad  wrote:

> Nobody is saying you can't work with a mount, and this isn't a
> conversation about snapshots.
>
> Nobody is forcing users to use object storage either.
>
> You're making a ton of negative assumptions here about both the
> discussion, and the people you're having it with.  Try to be more open
> minded.
>
>
> On Fri, Mar 7, 2025 at 2:28 AM Štefan Miklošovič 
> wrote:
>
>> The only way I see that working is that, if everything was in a bucket,
>> if you take a snapshot, these SSTables would be "copied" from live data dir
>> (living in a bucket) to snapshots dir (living in a bucket). Basically, we
>> would need to say "and if you go to take a snapshot on this table, instead
>> of hardlinking these SSTables, do a copy". But this "copying" would be
>> internal to a bucket itself. We would not need to "upload" from node's
>> machine to s3.
>>
>> While this might work, what I find tricky is that we are forcing this to
>> users. Not everybody is interested in putting everything into a bucket and
>> serving traffic from that. They just don't want to do that. Because reasons.
>> They are just happy with what they have etc, it works fine for years and so
>> on. They just want to upload SSTables upon snapshotting and call it a day.
>>
>> I don't think we should force our worldview on them if they are not
>> interested in it.
>>
>> On Fri, Mar 7, 2025 at 11:02 AM Štefan Miklošovič 
>> wrote:
>>
>>> BTW, snapshots are quite special because these are not "files", they are
>>> just hard links. They "materialize" as regular files once underlying
>>> SSTables are compacted away. How are you going to hardlink from local
>>> storage to an object storage anyway? We will always need to "upload".
>>>
>>> On Fri, Mar 7, 2025 at 10:51 AM Štefan Miklošovič <
>>> [email protected]> wrote:
>>>
 Jon,

 all "big three" support mounting a bucket locally. That being said, I
 do not think it is reasonable to completely ditch the possibility of
 Cassandra working with a mount, e.g. just for uploading snapshots there.

 GCP


 https://cloud.google.com/storage/docs/cloud-storage-fuse/quickstart-mount-bucket

 Azure (this one is quite sophisticated), lot of options ...


 https://learn.microsoft.com/en-us/azure/storage/blobs/blobfuse2-how-to-deploy?tabs=RHEL

 S3, lot of options how to mount that

 https://bluexp.netapp.com/blog/amazon-s3-as-a-file-system

 On Thu, Mar 6, 2025 at 4:17 PM Jon Haddad 
 wrote:

> Assuming everything else is identical, might not matter for S3.
> However, not every object store has a filesystem mount.
>
> Regarding sprawling dependencies, we can always make the provider
> specific libraries available as a separate download and put them on their
> own thread with a separate class path. I think in JVM dtest does this
> already.  Someone just started asking about IAM for login, it sounds like 
> a
> similar problem.
>
>
> On Thu, Mar 6, 2025 at 12:53 AM Benedict  wrote:
>
>> I think another way of saying what Stefan may be getting at is what
>> does a library give us that an appropriately configured mount dir 
>> doesn’t?
>>
>> We don’t want to treat S3 the same as local disk, but this can be
>> achieved easily with config. Is there some other benefit of direct
>> integration? Well defined exceptions if we need to distinguish cases is 
>> one
>> that maybe springs to mind but perhaps there are others?
>>
>>
>> On 6 Mar 2025, at 08:39, Štefan Miklošovič 
>> wrote:
>>
>> 
>>
>> That is cool but this still does not show / explain how it would look
>> like when it comes to dependencies needed for actually talking to 
>> storages
>> like s3.
>>
>> Maybe I am missing something here, and please explain if I am
>> mistaken, but if I understand that correctly, for talking to S3 we would
>> need to use a library like this, right? (1). So that would be added among
>> Cassandra dependencies? Hence Cassandra starts to be biased against s3? 
>> Why
>> s3? Every time somebody comes up with a new remote storage support, that
>> would be added to classpath as well? How are these dependencies going to
>> play with each other and with Cassandra in general? Will all these 
>> storage
>> provider libraries for arbitrary clouds be even compatible with Cassandra
>> licence-wise?
>>
>> I am sorry I keep repeating these questions but this part of that I
>> just don't get at all.
>>
>> We can indeed add an API for this, sure sure, why not.

Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2025-03-07 Thread Štefan Miklošovič
The only way I see that working is that, if everything was in a bucket, if
you take a snapshot, these SSTables would be "copied" from live data dir
(living in a bucket) to snapshots dir (living in a bucket). Basically, we
would need to say "and if you go to take a snapshot on this table, instead
of hardlinking these SSTables, do a copy". But this "copying" would be
internal to a bucket itself. We would not need to "upload" from node's
machine to s3.

While this might work, what I find tricky is that we are forcing this to
users. Not everybody is interested in putting everything into a bucket and
serving traffic from that. They just don't want to do that. Because reasons.
They are just happy with what they have etc, it works fine for years and so
on. They just want to upload SSTables upon snapshotting and call it a day.

I don't think we should force our worldview on them if they are not
interested in it.

On Fri, Mar 7, 2025 at 11:02 AM Štefan Miklošovič 
wrote:

> BTW, snapshots are quite special because these are not "files", they are
> just hard links. They "materialize" as regular files once underlying
> SSTables are compacted away. How are you going to hardlink from local
> storage to an object storage anyway? We will always need to "upload".
>
> On Fri, Mar 7, 2025 at 10:51 AM Štefan Miklošovič 
> wrote:
>
>> Jon,
>>
>> all "big three" support mounting a bucket locally. That being said, I do
>> not think it is reasonable to completely ditch the possibility of Cassandra
>> working with a mount, e.g. just for uploading snapshots there.
>>
>> GCP
>>
>>
>> https://cloud.google.com/storage/docs/cloud-storage-fuse/quickstart-mount-bucket
>>
>> Azure (this one is quite sophisticated), lot of options ...
>>
>>
>> https://learn.microsoft.com/en-us/azure/storage/blobs/blobfuse2-how-to-deploy?tabs=RHEL
>>
>> S3, lot of options how to mount that
>>
>> https://bluexp.netapp.com/blog/amazon-s3-as-a-file-system
>>
>> On Thu, Mar 6, 2025 at 4:17 PM Jon Haddad 
>> wrote:
>>
>>> Assuming everything else is identical, might not matter for S3. However,
>>> not every object store has a filesystem mount.
>>>
>>> Regarding sprawling dependencies, we can always make the provider
>>> specific libraries available as a separate download and put them on their
>>> own thread with a separate class path. I think in JVM dtest does this
>>> already.  Someone just started asking about IAM for login, it sounds like a
>>> similar problem.
>>>
>>>
>>> On Thu, Mar 6, 2025 at 12:53 AM Benedict  wrote:
>>>
 I think another way of saying what Stefan may be getting at is what
 does a library give us that an appropriately configured mount dir doesn’t?

 We don’t want to treat S3 the same as local disk, but this can be
 achieved easily with config. Is there some other benefit of direct
 integration? Well defined exceptions if we need to distinguish cases is one
 that maybe springs to mind but perhaps there are others?


 On 6 Mar 2025, at 08:39, Štefan Miklošovič 
 wrote:

 

 That is cool but this still does not show / explain how it would look
 like when it comes to dependencies needed for actually talking to storages
 like s3.

 Maybe I am missing something here, and please explain if I am mistaken,
 but if I understand that correctly, for talking to S3 we would need to use
 a library like this, right? (1). So that would be added among Cassandra
 dependencies? Hence Cassandra starts to be biased against s3? Why s3? Every
 time somebody comes up with a new remote storage support, that would be
 added to classpath as well? How are these dependencies going to play with
 each other and with Cassandra in general? Will all these storage
 provider libraries for arbitrary clouds be even compatible with Cassandra
 licence-wise?

 I am sorry I keep repeating these questions but this part of that I
 just don't get at all.

 We can indeed add an API for this, sure sure, why not. But for people
 who do not want to deal with this at all and just be OK with a FS mounted,
 why would we block them doing that?

 (1)
 https://github.com/aws/aws-sdk-java/blob/master/aws-java-sdk-s3/pom.xml

 On Wed, Mar 5, 2025 at 3:28 PM Mick Semb Wever  wrote:

>.
>
>
> It’s not an area where I can currently dedicate engineering effort.
>> But if others are interested in contributing a feature like this, I’d see
>> it as valuable for the project and would be happy to collaborate on
>> design/architecture/goals.
>>
>
>
> Jake mentioned 17 months ago a custom FileSystemProvider we could
> offer.
>
> None of us at DataStax has gotten around to providing that, but to
> quickly throw something over the wall this is it:
>
> https://github.com/datastax/cassandra/blob/main/src/java/org/apache/cassandra/io/storage/StorageProvider.java
>
>   

Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2025-03-07 Thread Jon Haddad
Nobody is saying you can't work with a mount, and this isn't a conversation
about snapshots.

Nobody is forcing users to use object storage either.

You're making a ton of negative assumptions here about both the discussion
and the people you're having it with.  Try to be more open-minded.


On Fri, Mar 7, 2025 at 2:28 AM Štefan Miklošovič 
wrote:

> The only way I see that working is that, if everything was in a bucket, if
> you take a snapshot, these SSTables would be "copied" from live data dir
> (living in a bucket) to snapshots dir (living in a bucket). Basically, we
> would need to say "and if you go to take a snapshot on this table, instead
> of hardlinking these SSTables, do a copy". But this "copying" would be
> internal to a bucket itself. We would not need to "upload" from node's
> machine to s3.
>
> While this might work, what I find tricky is that we are forcing this to
> users. Not everybody is interested in putting everything to a bucket and
> server traffic from that. They just don't want to do that. Because reasons.
> They are just happy with what they have etc, it works fine for years and so
> on. They just want to upload SSTables upon snapshotting and call it a day.
>
> I don't think we should force our worldview on them if they are not
> interested in it.

Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2025-03-07 Thread guo Maxwell
Thank you very much; I'm certainly interested. I'll start working on the
update for CEP-36 next week.




Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2025-03-07 Thread Mick Semb Wever
On Thu, 6 Mar 2025 at 09:40, Štefan Miklošovič 
wrote:

> That is cool but this still does not show / explain how it would look like
> when it comes to dependencies needed for actually talking to storages like
> s3.
>


As Benedict writes, dealing with optional dependencies is not hard (and, as
Jon writes, we should work on improving how we deal with it).

More configurable paths would also be welcome (snapshots, backups, etc.),
and you can already see that in the StorageProvider.DirectoryType enum.
Given the overlap this has with other ongoing work, the CEP should be
written to account for and coordinate with it.  Simply configuring different
paths will certainly satisfy a lot of users' demands.

Object storage is still a very valuable goal (as Jeff and Scott write).
Maxwell, are you interested in updating the CEP to reflect all the new
input?  I believe that's the next step here, and I'd be happy to help if
you take the lead.
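A minimal sketch of the "more configurable paths" idea: each class of directory resolves against its own configurable root, so e.g. snapshots can live on a different (possibly remote-mounted) volume than live data. The enum values and all class/method names here are hypothetical stand-ins, not the actual StorageProvider.DirectoryType from the linked DataStax class.

```java
import java.nio.file.Path;
import java.util.EnumMap;
import java.util.Map;

// Illustrative only: maps each directory class to its own configurable root.
// The DirectoryType values below are hypothetical stand-ins for
// StorageProvider.DirectoryType in the DataStax class linked in this thread.
public final class DirectoryConfig {
    enum DirectoryType { DATA, SNAPSHOT, BACKUP, COMMITLOG }

    private final Map<DirectoryType, Path> roots = new EnumMap<>(DirectoryType.class);

    DirectoryConfig withRoot(DirectoryType type, Path root) {
        roots.put(type, root);
        return this;
    }

    /** Resolve a keyspace/table-relative location under the root for its type. */
    Path resolve(DirectoryType type, String relative) {
        Path root = roots.get(type);
        if (root == null)
            throw new IllegalStateException("no root configured for " + type);
        return root.resolve(relative);
    }
}
```

With something like this, a deployment could point SNAPSHOT at a FUSE-mounted bucket while DATA stays on a local disk.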


Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2025-03-07 Thread Štefan Miklošovič
BTW, snapshots are quite special because they are not "files"; they are
just hard links. They only "materialize" as regular files once the
underlying SSTables are compacted away. How are you going to hard-link from
local storage to an object storage anyway? We will always need to "upload".





Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2025-03-07 Thread Štefan Miklošovič
Jon,

All of the "big three" cloud providers support mounting a bucket locally.
That being said, I do not think it is reasonable to completely ditch the
possibility of Cassandra working with a mount, e.g. just for uploading
snapshots there.

GCP:

https://cloud.google.com/storage/docs/cloud-storage-fuse/quickstart-mount-bucket

Azure (this one is quite sophisticated), with a lot of options:

https://learn.microsoft.com/en-us/azure/storage/blobs/blobfuse2-how-to-deploy?tabs=RHEL

S3, with a lot of options for mounting:

https://bluexp.netapp.com/blog/amazon-s3-as-a-file-system



Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2025-03-06 Thread Joel Shepherd

On 3/6/2025 7:16 AM, Jon Haddad wrote:
Assuming everything else is identical, might not matter for S3. 
However, not every object store has a filesystem mount.


Regarding sprawling dependencies, we can always make the provider 
specific libraries available as a separate download and put them on 
their own thread with a separate class path. I think in JVM dtest does 
this already.  Someone just started asking about IAM for login, it 
sounds like a similar problem.


That was me. :-) Cassandra's auth already has fairly well defined 
interfaces and a plug-in mechanism, so it's easy to vend alternative 
auth solutions without polluting the main project's dependency graph, at 
build-time anyway. A similar approach could be beneficial for CEP-36, 
particularly (IMO) for cold-storage purposes. I suspect decoupling 
pluggable alternate channel proxies for cold storage from configurable 
alternate channel proxies for redirecting data locally to free up space, 
migrate to a different storage device, etc., would make both easier. The 
CEP seems to be trying to do both, but they smell like pretty different 
goals to me.


Thanks -- Joel.





Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2025-03-06 Thread Jon Haddad
Hey Joel, thanks for chiming in!

Regarding dependencies - while it's possible to provide pluggable
interfaces, the issue I'm concerned about is conflicting versions of
transitive dependencies at runtime.  For example, I used a Java agent that
had a different version of SnakeYAML, and it ended up breaking C*'s startup
sequence [1].  I suggest putting external modules on separate threads with
their own classpath to avoid this issue.

I think there's quite a bit of overlap between the two desires expressed in
this thread, even though they achieve very different results.  I personally
can't see myself using something that treats an object store as cold
storage where SSTables are moved (implying they weren't there before), and
I've expressed my concerns with this, but other folks seem to want it and
that's OK.  I feel very strongly that treating local storage as a cache
with the full dataset on object store is a better approach, but ultimately
different people have different priorities.  Either way, stuff is moved to
object store at some point, and pulled to the local disk on demand.

I am *firmly* of the position that this CEP should not exclude the local
storage as cache option, and should be accounted for in the design.

Jon

[1] https://issues.apache.org/jira/browse/CASSANDRA-19663
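The "local storage as cache" model described above can be sketched as a read-through lookup: check the local disk first, and pull the object down on demand on a miss. Everything below (class and method names, the RemoteStore interface) is a hypothetical illustration, not a Cassandra API; a real implementation would also need eviction, concurrency control, and checksum validation.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Illustrative sketch only (not a Cassandra API): the object store holds the
// full dataset and the local disk is a read-through cache, so a file is
// pulled down on demand the first time it is opened.
public final class ReadThroughFileCache {
    /** Stand-in for an object-store client; hypothetical interface. */
    public interface RemoteStore { byte[] fetch(String key) throws IOException; }

    private final Path cacheDir;
    private final RemoteStore remote;

    public ReadThroughFileCache(Path cacheDir, RemoteStore remote) {
        this.cacheDir = cacheDir;
        this.remote = remote;
    }

    /** Returns a local path for the key, downloading the object on first access. */
    public Path resolve(String key) throws IOException {
        Path local = cacheDir.resolve(key);
        if (Files.notExists(local)) {                       // cache miss
            Files.createDirectories(local.getParent());
            Path tmp = Files.createTempFile(cacheDir, "fetch", ".part");
            Files.write(tmp, remote.fetch(key));            // pull the object down
            Files.move(tmp, local, StandardCopyOption.ATOMIC_MOVE); // publish atomically
        }
        return local;                                        // hit: already local
    }
}
```

Under the write-through variant Jon prefers, new files would also be uploaded before the local copy is considered durable, so the local disk only ever holds cache and losing it loses no data.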



Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2025-03-06 Thread Benedict
It anyway seems reasonable to me that we would support multiple
FileSystemProviders. So perhaps these are really two problems we're
conflating:

1) a mechanism for dropping in jars that can register a FileSystemProvider
for Cassandra to utilise
2) a way to mark directories (from any provider) as "remote storage" so
that they can be treated in the appropriate manner

I think the harder problems by far live in (2). Less for performance, and
more for how we handle errors correctly. For instance, a failure to read
from object storage may mean that the data is lost, or it may mean the
service has been interrupted. This might mean a total or partial loss of
any intersecting tokens, depending on how the cluster stripes its
dependency on the object storage. Either way, we almost certainly don't
want to handle this like we do local disk errors.

On 6 Mar 2025, at 15:16, Jon Haddad wrote:

> Assuming everything else is identical, might not matter for S3. However,
> not every object store has a filesystem mount.
>
> Regarding sprawling dependencies, we can always make the provider
> specific libraries available as a separate download and put them on their
> own thread with a separate class path. I think in JVM dtest does this
> already.  Someone just started asking about IAM for login, it sounds like
> a similar problem.
>
> On Thu, Mar 6, 2025 at 12:53 AM Benedict wrote:
>
>> I think another way of saying what Stefan may be getting at is what does
>> a library give us that an appropriately configured mount dir doesn’t?
>>
>> We don’t want to treat S3 the same as local disk, but this can be
>> achieved easily with config. Is there some other benefit of direct
>> integration? Well defined exceptions if we need to distinguish cases is
>> one that maybe springs to mind but perhaps there are others?
>>
>> On 6 Mar 2025, at 08:39, Štefan Miklošovič wrote:
>>
>>> That is cool but this still does not show / explain how it would look
>>> like when it comes to dependencies needed for actually talking to
>>> storages like s3.
>>>
>>> Maybe I am missing something here and please explain when I am mistaken
>>> but If I understand that correctly, for talking to s3 we would need to
>>> use a library like this, right? (1). So that would be added among
>>> Cassandra dependencies? Hence Cassandra starts to be biased against s3?
>>> Why s3? Every time somebody comes up with a new remote storage support,
>>> that would be added to classpath as well? How are these dependencies
>>> going to play with each other and with Cassandra in general? Will all
>>> these storage provider libraries for arbitrary clouds be even compatible
>>> with Cassandra licence-wise?
>>>
>>> I am sorry I keep repeating these questions but this part of that I
>>> just don't get at all.
>>>
>>> We can indeed add an API for this, sure sure, why not. But for people
>>> who do not want to deal with this at all and just be OK with a FS
>>> mounted, why would we block them doing that?
>>>
>>> (1)
>>> https://github.com/aws/aws-sdk-java/blob/master/aws-java-sdk-s3/pom.xml
>>>
>>> On Wed, Mar 5, 2025 at 3:28 PM Mick Semb Wever wrote:
>>>
>>>>> It’s not an area where I can currently dedicate engineering effort.
>>>>> But if others are interested in contributing a feature like this, I’d
>>>>> see it as valuable for the project and would be happy to collaborate
>>>>> on design/architecture/goals.
>>>>
>>>> Jake mentioned 17 months ago a custom FileSystemProvider we could
>>>> offer.
>>>>
>>>> None of us at DataStax has gotten around to providing that, but to
>>>> quickly throw something over the wall this is it:
>>>>
>>>> https://github.com/datastax/cassandra/blob/main/src/java/org/apache/cassandra/io/storage/StorageProvider.java
>>>>
>>>> (with a few friend classes under o.a.c.io.util)
>>>>
>>>> We then have a RemoteStorageProvider, private in another repo, that
>>>> implements that and also provides the RemoteFileSystemProvider that
>>>> Jake refers to.
>>>>
>>>> Hopefully that's a start to get people thinking about CEP level
>>>> details, while we get a cleaned abstract of RemoteStorageProvider and
>>>> friends to offer.
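Point (1) refers to Java's `java.nio.file.spi.FileSystemProvider` SPI. A bare-bones skeleton of such a provider might look like the following; the "objfs" scheme and all stubbed behaviour are assumptions for illustration, and a real provider jar would list the class in `META-INF/services/java.nio.file.spi.FileSystemProvider` so the JDK discovers it.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.channels.SeekableByteChannel;
import java.nio.file.AccessMode;
import java.nio.file.CopyOption;
import java.nio.file.DirectoryStream;
import java.nio.file.FileStore;
import java.nio.file.FileSystem;
import java.nio.file.FileSystemNotFoundException;
import java.nio.file.LinkOption;
import java.nio.file.OpenOption;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.nio.file.attribute.FileAttribute;
import java.nio.file.attribute.FileAttributeView;
import java.nio.file.spi.FileSystemProvider;
import java.util.Map;
import java.util.Set;

// Hypothetical skeleton of a drop-in provider jar. The "objfs" scheme and the
// stubbed bodies are illustrative only; a real provider would back these with
// an object-store SDK and ship a service-loader entry so that
// Paths.get(URI.create("objfs://bucket/ks/tbl/...")) resolves through it.
public class ObjectStoreFileSystemProvider extends FileSystemProvider {
    private static UnsupportedOperationException todo() {
        return new UnsupportedOperationException("not implemented in this sketch");
    }

    @Override public String getScheme() { return "objfs"; }
    @Override public FileSystem newFileSystem(URI uri, Map<String, ?> env) { throw todo(); }
    @Override public FileSystem getFileSystem(URI uri) { throw new FileSystemNotFoundException(uri.toString()); }
    @Override public Path getPath(URI uri) { throw todo(); }
    @Override public SeekableByteChannel newByteChannel(Path path, Set<? extends OpenOption> opts, FileAttribute<?>... attrs) { throw todo(); }
    @Override public DirectoryStream<Path> newDirectoryStream(Path dir, DirectoryStream.Filter<? super Path> filter) { throw todo(); }
    @Override public void createDirectory(Path dir, FileAttribute<?>... attrs) { throw todo(); }
    @Override public void delete(Path path) { throw todo(); }
    @Override public void copy(Path source, Path target, CopyOption... opts) { throw todo(); }
    @Override public void move(Path source, Path target, CopyOption... opts) { throw todo(); }
    @Override public boolean isSameFile(Path a, Path b) { throw todo(); }
    @Override public boolean isHidden(Path path) { throw todo(); }
    @Override public FileStore getFileStore(Path path) { throw todo(); }
    @Override public void checkAccess(Path path, AccessMode... modes) { throw todo(); }
    @Override public <V extends FileAttributeView> V getFileAttributeView(Path path, Class<V> type, LinkOption... opts) { throw todo(); }
    @Override public <A extends BasicFileAttributes> A readAttributes(Path path, Class<A> type, LinkOption... opts) { throw todo(); }
    @Override public Map<String, Object> readAttributes(Path path, String attrs, LinkOption... opts) { throw todo(); }
    @Override public void setAttribute(Path path, String attr, Object value, LinkOption... opts) { throw todo(); }
}
```

Shipping such a class in its own jar, on an isolated classpath as Jon suggests, would keep its SDK dependencies out of Cassandra's own dependency graph.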




Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2025-03-06 Thread Jon Haddad
Assuming everything else is identical, it might not matter for S3. However,
not every object store has a filesystem mount.

Regarding sprawling dependencies, we can always make the provider-specific
libraries available as a separate download and put them on their own thread
with a separate class path. I think in-JVM dtest does this already.
Someone just started asking about IAM for login; it sounds like a similar
problem.
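A minimal sketch of the separate-classpath idea: load the provider jars through a URLClassLoader whose parent is the platform class loader, so the plugin resolves its transitive dependencies (e.g. its own SnakeYAML) from its own jars rather than colliding with the versions on Cassandra's classpath. All names here are hypothetical illustrations.

```java
import java.net.URL;
import java.net.URLClassLoader;

// Hypothetical sketch: give a storage-provider plugin its own class loader,
// parented on the platform loader (JDK classes only). The plugin then
// resolves its transitive dependencies from its own jars, so their versions
// cannot conflict with whatever is on Cassandra's application classpath.
public final class IsolatedPluginLoader {
    public static ClassLoader forJars(URL... pluginJars) {
        return new URLClassLoader("storage-plugin", pluginJars,
                                  ClassLoader.getPlatformClassLoader());
    }

    public static void main(String[] args) throws Exception {
        ClassLoader plugin = forJars(); // no jars: just demonstrates the isolation
        // JDK classes remain visible through the platform parent ...
        System.out.println(plugin.loadClass("java.lang.String").getName());
        // ... but application-classpath classes are not:
        try {
            plugin.loadClass("IsolatedPluginLoader");
        } catch (ClassNotFoundException expected) {
            System.out.println("application classes are hidden from the plugin");
        }
    }
}
```

The remaining work is then a narrow bridge interface that both sides share, which is roughly what a pluggable FileSystemProvider or StorageProvider gives.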


On Thu, Mar 6, 2025 at 12:53 AM Benedict  wrote:

> I think another way of saying what Stefan may be getting at is what does a
> library give us that an appropriately configured mount dir doesn’t?
>
> We don’t want to treat S3 the same as local disk, but this can be achieved
> easily with config. Is there some other benefit of direct integration? Well
> defined exceptions if we need to distinguish cases is one that maybe
> springs to mind but perhaps there are others?
>
>
> On 6 Mar 2025, at 08:39, Štefan Miklošovič  wrote:
>
> 
>
> That is cool but this still does not show / explain how it would look like
> when it comes to dependencies needed for actually talking to storages like
> s3.
>
> Maybe I am missing something here and please explain when I am mistaken
> but If I understand that correctly, for talking to s3 we would need to use
> a library like this, right? (1). So that would be added among Cassandra
> dependencies? Hence Cassandra starts to be biased against s3? Why s3? Every
> time somebody comes up with a new remote storage support, that would be
> added to classpath as well? How are these dependencies going to play with
> each other and with Cassandra in general? Will all these storage
> provider libraries for arbitrary clouds be even compatible with Cassandra
> licence-wise?
>
> I am sorry I keep repeating these questions but this part of that I just
> don't get at all.
>
> We can indeed add an API for this, sure sure, why not. But for people who
> do not want to deal with this at all and just be OK with a FS mounted, why
> would we block them doing that?
>
> (1)
> https://github.com/aws/aws-sdk-java/blob/master/aws-java-sdk-s3/pom.xml
>
> On Wed, Mar 5, 2025 at 3:28 PM Mick Semb Wever  wrote:
>
>>.
>>
>>
>> It’s not an area where I can currently dedicate engineering effort. But
>>> if others are interested in contributing a feature like this, I’d see it as
>>> valuable for the project and would be happy to collaborate on
>>> design/architecture/goals.
>>>
>>
>>
>> Jake mentioned 17 months ago a custom FileSystemProvider we could offer.
>>
>> None of us at DataStax has gotten around to providing that, but to
>> quickly throw something over the wall this is it:
>>
>> https://github.com/datastax/cassandra/blob/main/src/java/org/apache/cassandra/io/storage/StorageProvider.java
>>
>>   (with a few friend classes under o.a.c.io.util)
>>
>> We then have a RemoteStorageProvider, private in another repo, that
>> implements that and also provides the RemoteFileSystemProvider that Jake
>> refers to.
>>
>> Hopefully that's a start to get people thinking about CEP level details,
>> while we get a cleaned abstract of RemoteStorageProvider and friends to
>> offer.
>>
>>
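To make the FileSystemProvider idea concrete: a hedged sketch, which is not the DataStax `StorageProvider` API linked above, using the JDK's bundled zipfs provider as a stand-in for a hypothetical remote-storage provider. It shows that caller-side code keeps using the plain `java.nio.file` API no matter which provider backs the target `Path`.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.file.*;
import java.util.Map;

// Illustrative only: the JDK's zipfs provider stands in for a hypothetical
// RemoteFileSystemProvider. The caller-side code (Files.copy on a Path) is
// identical either way, which is the point of the FileSystemProvider approach.
public class FileSystemProviderDemo {
    public static boolean snapshotToAlternateFs() throws IOException {
        Path sstable = Files.createTempFile("nb-1-big-Data", ".db");
        Files.writeString(sstable, "sstable bytes");

        Path archive = Files.createTempDirectory("snap").resolve("backup.zip");
        URI uri = URI.create("jar:" + archive.toUri());
        try (FileSystem altFs = FileSystems.newFileSystem(uri, Map.of("create", "true"))) {
            Path target = altFs.getPath("/" + sstable.getFileName());
            Files.copy(sstable, target);      // same call as a local-to-local copy
            return Files.exists(target);      // resolved by the zipfs provider
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(snapshotToAlternateFs());
    }
}
```

A remote provider slots into the same mechanism via `FileSystemProvider` registration, which is what makes the "Cassandra code stays path-based" argument work.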



Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2025-03-05 Thread Mick Semb Wever
   .


It’s not an area where I can currently dedicate engineering effort. But if
> others are interested in contributing a feature like this, I’d see it as
> valuable for the project and would be happy to collaborate on
> design/architecture/goals.
>


Jake mentioned 17 months ago a custom FileSystemProvider we could offer.

None of us at DataStax has gotten around to providing that, but to quickly
throw something over the wall this is it:

https://github.com/datastax/cassandra/blob/main/src/java/org/apache/cassandra/io/storage/StorageProvider.java

  (with a few friend classes under o.a.c.io.util)

We then have a RemoteStorageProvider, private in another repo, that
implements that and also provides the RemoteFileSystemProvider that Jake
refers to.

Hopefully that's a start to get people thinking about CEP level details,
while we get a cleaned abstract of RemoteStorageProvider and friends to
offer.


Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2025-03-05 Thread Štefan Miklošovič
Scott,

what you wrote is all correct, but I have a feeling that both you and Jeff
are talking about something different, some other aspect of using that.

It seems that I still need to explain that I don't consider object storage
useless; it is as if everybody has to keep making the opposite point. I agree
with you already.

It seems to me that the use cases you want s3 (or any object storage for that
matter) for are actively reading / writing to satisfy queries, offloading some
stuff, etc.

What I am talking about, when it comes to s3 mounted locally, is just using it
for copying SSTables there which were taken by a snapshot. In (1), I never
wanted to use s3 for anything else than to literally copy SSTables there as
part of snapshotting and be done with it. To exaggerate to make a point: where
do I need to "hurry", so that I would care about the speed in this particular
case? I have never said I consider it to be important.

What I consider important is that it is super easy to use - this approach
is cloud agnostic, we do not need to implement anything in Cassandra, no
messing with dependencies etc. It is "future-proof" in such a sense that
whatever cloud somebody wants to use for storing snapshots, all it takes is
to _somehow_ mount it locally and all will work out of the box.

You want to leverage object storage for more "involved" use cases.

I do not see how mounting a dir and copying files there would "clash" with
your way of looking at it. Why can't we have both?

I keep repeating this, but I still don't know how it is actually going to be
done. If you want to e.g. support object storage like, I don't know, GCP (I do
not have any clue if that is even code-able), then
there would need to be _somebody_ who will integrate with it. With mounting
a dir for snapshotting purposes, we do not need to deal with that. This
aspect of additional complexity when it comes to coding, integrating,
deploying etc. seems to be repeatedly overlooked and I would really
appreciate it if we spent a little bit more time expanding that area.

I do not have a problem with using object storage for things you want, but
I do not get why it should automatically disqualify scenarios where a user
figures out that mounting that storage locally is sufficient.

(1) https://lists.apache.org/thread/8cz5fh835ojnxwtn1479q31smm5x7nxt

On Wed, Mar 5, 2025 at 6:22 AM C. Scott Andreas 
wrote:

> To Jeff’s point on tactical vs. strategic, here’s the big picture for me
> on object storage:
>
> *– Object storage is 70% cheaper:*
> Replicated flash block storage is extremely expensive, and more so with
> compute resources constantly attached. If one were to build a storage
> platform on top of a cloud provider’s compute and storage infrastructure,
> selective use of object storage is essential to even being in the ballpark
> of managed offerings on price. EBS is 8¢/GB. S3 is 2.3¢/GB. It’s over 70%
> cheaper.
>
> *– Local/block storage is priced on storage *provisioned*. Object storage
> is priced on storage *consumed*:*
> It’s actually better than 70%. Local/block storage is priced based on the
> size of disks/volumes provisioned. While they may be resizable, resizing is
> generally inelastic. This typically produces a large gap between storage
> consumed vs. storage provisioned - and poor utilization. Object storage is
> typically priced on storage that is actually consumed.
>
> *– Object storage integration is the simplest path to complete decoupling
> of CPU and storage:*
> Block volumes are more fungible than local disk, but aren’t even close in
> flexibility to an SSTable that can be accessed by any function. Object is
> also the only sensible path to implementing a serverless database whose
> query facilities can be deployed on a function-as-a-service platform. That
> enables one to reach an idle compute cost of zero and an idle storage cost
> of 2.3¢/GB/month (S3).
>
> *– Object storage enables scale-to-zero:*
> Object storage integration is the only path for most databases to provide
> a scale-to-zero offering that doesn’t rely on keeping hot NVMe or block
> storage attached 24/7 while a database receives zero queries per second.
>
> *– Scale to zero is one of the easiest paths to zero marginal cost (the
> other is multitenancy - and not mutually exclusive):*
> Database platforms operated in a cluster-as-a-service model incur a
> constant fixed cost of provisioned resources regardless of whether they are
> in use. That’s fine for platforms that pass the full cost of resources
> consumed back to someone — but it produces poor economics and resource
> waste. Ability to scale to zero dramatically reduces the cost of
> provisioning and maintaining an un/underutilized database.
>
> *– It’s not all or nothing:*
> There are super sensible ways to pair local/block storage and object
> storage. One might be to store upper-level SSTable data components in
> object storage; and all other SSTable components (TOC, Compres
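A quick check of the cost arithmetic above, using the thread's own figures (EBS 8¢/GB-month, S3 2.3¢/GB-month); the utilization factor modeling "priced on provisioned vs. consumed" is an illustrative assumption of mine, not from the thread:

```java
// Sanity-check of the cost claims in the thread: 8 cents/GB (EBS) vs 2.3 cents/GB (S3).
public class StorageCost {
    public static double savings(double blockPerGb, double objectPerGb, double utilization) {
        // Block storage bills on provisioned capacity, object storage on consumed
        // bytes, so low utilization widens the gap further.
        double effectiveBlockPerGb = blockPerGb / utilization;
        return 1.0 - objectPerGb / effectiveBlockPerGb;
    }

    public static void main(String[] args) {
        System.out.println(savings(0.08, 0.023, 1.0)); // > 0.70, i.e. "over 70% cheaper"
        System.out.println(savings(0.08, 0.023, 0.5)); // gap widens at 50% utilization
    }
}
```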


Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2025-03-04 Thread Cheng Wang via dev
I agree with all the points mentioned by Scott. We are actually very
interested in exploring tiered storage for the same reasons above. Our
first experiment with S3 Express One Zone was, unfortunately, awfully
slow compared to ephemeral disks and EBS.

On Tue, Mar 4, 2025 at 9:22 PM C. Scott Andreas 
wrote:

> To Jeff’s point on tactical vs. strategic, here’s the big picture for me
> on object storage:
>
> *– Object storage is 70% cheaper:*
> Replicated flash block storage is extremely expensive, and more so with
> compute resources constantly attached. If one were to build a storage
> platform on top of a cloud provider’s compute and storage infrastructure,
> selective use of object storage is essential to even being in the ballpark
> of managed offerings on price. EBS is 8¢/GB. S3 is 2.3¢/GB. It’s over 70%
> cheaper.
>
> *– Local/block storage is priced on storage *provisioned*. Object storage
> is priced on storage *consumed*:*
> It’s actually better than 70%. Local/block storage is priced based on the
> size of disks/volumes provisioned. While they may be resizable, resizing is
> generally inelastic. This typically produces a large gap between storage
> consumed vs. storage provisioned - and poor utilization. Object storage is
> typically priced on storage that is actually consumed.
>
> *– Object storage integration is the simplest path to complete decoupling
> of CPU and storage:*
> Block volumes are more fungible than local disk, but aren’t even close in
> flexibility to an SSTable that can be accessed by any function. Object is
> also the only sensible path to implementing a serverless database whose
> query facilities can be deployed on a function-as-a-service platform. That
> enables one to reach an idle compute cost of zero and an idle storage cost
> of 2.3¢/GB/month (S3).
>
> *– Object storage enables scale-to-zero:*
> Object storage integration is the only path for most databases to provide
> a scale-to-zero offering that doesn’t rely on keeping hot NVMe or block
> storage attached 24/7 while a database receives zero queries per second.
>
> *– Scale to zero is one of the easiest paths to zero marginal cost (the
> other is multitenancy - and not mutually exclusive):*
> Database platforms operated in a cluster-as-a-service model incur a
> constant fixed cost of provisioned resources regardless of whether they are
> in use. That’s fine for platforms that pass the full cost of resources
> consumed back to someone — but it produces poor economics and resource
> waste. Ability to scale to zero dramatically reduces the cost of
> provisioning and maintaining an un/underutilized database.
>
> *– It’s not all or nothing:*
> There are super sensible ways to pair local/block storage and object
> storage. One might be to store upper-level SSTable data components in
> object storage; and all other SSTable components (TOC, CompressionInfo,
> primary index, etc) on local flash. This gives you a way to rapidly
> enumerate and navigate SSTable metadata while only paying the cost of reads
> when fetching data (and possibly from upper-level SSTables only).
> Alternately, one could offload only older partitions of data in TWCS - time
> series data older than 30 days, a year, etc.
>
> *– Mounting a bucket as a filesystem is unusable:*
> Others have made this point. Naively mounting an object storage bucket as
> a filesystem produces uniformly *terrible* results. S3 time to first byte
> is in the 20-30ms+ range. S3 one-zone is closer to the 5-10ms range which
> is on par with the seek latency of a spinning disk. Despite C* originally
> being designed to operate well on spinning disks, a filesystem-like
> abstraction backed by object storage today will result in awful surprises
> due to IO patterns like those found by Jon H. and Jordan recently.
>
> *– Object Storage can unify “database” and “lakehouse”:*
> One could imagine Iceberg integration that enables manifest/snapshot-based
> querying of SSTables in an object store via Spark or similar platforms,
> with zero ETL or light cone contact with a production database process.
>
> The reason people care about object is that it’s 70%+ cheaper than flash -
> and 90%+ cheaper if the software querying it isn’t always running, too.
>
> – Scott
>
> —
> Mobile
>
> On Mar 4, 2025, at 12:29 PM, Štefan Miklošovič 
> wrote:
>
> 
> Jeff,
>
> when it comes to snapshots, there was already discussion in the other
> thread I am not sure you are aware of (1), here (2) I am talking about
> Sidecar + snapshots specifically. One "caveat" of Sidecar is that you
> actually _need_ sidecar if we ever contemplated Sidecar doing upload /
> backup (by whatever means).
>
> "s3 bucket as a mounted directory" bypasses the necessity of having
> Sidecar deployed. I do not want to repeat what I wrote in (2), all the
> reasoning is there.
>
> My primary motivation to do it by mounting is to 1) not use Sidecar if not
> needed 2) just save snapshot SSTables outside of table data dirs (which is
> imho com
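Scott's "mounting a bucket is unusable" point is easy to quantify. A hedged back-of-envelope model, with illustrative figures (roughly 25 ms S3 time-to-first-byte, per the 20-30ms range cited in the thread, vs. roughly 0.1 ms for local NVMe) and an assumed count of SSTables touched per read:

```java
// Back-of-envelope: per-query read latency when every SSTable touch pays the
// storage medium's first-byte latency. All figures are illustrative assumptions.
public class ReadLatencyModel {
    public static double queryMillis(double firstByteMs, int sstablesTouched) {
        return firstByteMs * sstablesTouched; // serial touches, ignoring caching
    }

    public static void main(String[] args) {
        int touched = 4; // assumed SSTables consulted per read
        System.out.println(queryMillis(0.1, touched));  // local NVMe: well under 1 ms
        System.out.println(queryMillis(25.0, touched)); // naive S3 mount: ~100 ms
    }
}
```

This is why a filesystem-like abstraction that keeps block-style access patterns produces the "awful surprises" described above, independent of throughput.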

Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2025-03-04 Thread Jon Haddad
I've come around on the rsync-vs-built-in topic, and I think this is the
same. Having this managed in-process gives us more options for control.

I think it's critical that 100% of the data be pushed to the object store.
I mentioned this in my email on Dec 15, but nobody directly responded to
that.  I keep seeing messages that sound like people only want a subset of
SSTables to live in the object store, I think that would be a massive
mistake.

Jon



On Tue, Mar 4, 2025 at 8:11 AM Jeff Jirsa  wrote:

> Mounted dirs give up the opportunity to change the IO model to account for
> different behaviors. The configurable channel proxy may suffer from the
> same IO constraints depending on implementation, too. But it may also
> become viable.
>
> The snapshot outside of the mounted file system seems like you’re
> implicitly implementing a one off in process backup. This is the same
> feedback I gave the sidecar with the rsync to another machine proposal. If
> we stop doing one off tactical projects and map out the actual problem
> we’re solving, we can get the right interfaces.  Also, you can probably
> just have the sidecar rsync thing do your snapshot to another directory on
> host.
>
> But if every sstable makes its way to s3, things like native backup,
> restoring from backup, recovering from local volumes can look VERY different
>
>
>
> On Mar 4, 2025, at 3:57 PM, Štefan Miklošovič 
> wrote:
>
> 
> For what it's worth, as it might seem to somebody that I am rejecting this
> altogether (which is not the case; all I am trying to say is that we should
> just think about it more): it would be cool to know more about the
> experience of others when it comes to this, maybe somebody already tried to
> mount and it did not work as expected?
>
> On the other hand, there is this "snapshots outside data dir" effort I am
> doing and if we did it with this, then I can imagine that we could say "and
> if you deal with snapshots, use this proxy instead" which would
> transparently upload it to s3.
>
> Then we would not need to do anything at all, code-wise. We would not need
> to store snapshots "outside of data dir" just to be able to place it on a
> directory which is mounted as an s3 bucket.
>
> I don't know if it is possible to do it like that. Worth to explore I
> guess.
>
> I like mounted dirs for their simplicity and I guess that for copying files
> it might be just enough. Plus we would not need to add all the s3 jars to
> the CP either etc ...
>
> On Tue, Mar 4, 2025 at 2:46 PM Štefan Miklošovič 
> wrote:
>
>> I am not saying that using remote object storage is useless.
>>
>> I am just saying that I don't see the difference. I have not measured
>> it but I can imagine that s3 mounted would use, under the hood, the same
>> calls to the s3 api. How else would it be done? You need to talk to remote s3
>> storage eventually anyway. So why does it matter if we call the s3 api from
>> Java or by other means from some "s3 driver"?  It is eventually using the
>> same thing, no?
>>
>> On Tue, Mar 4, 2025 at 12:47 PM Jeff Jirsa  wrote:
>>
>>> Mounting an s3 bucket as a directory is an easy but poor implementation
>>> of object backed storage for databases
>>>
>>> Object storage is durable (most data loss is due to bugs, not concurrent
>>> hardware failures), cheap (can be 5-10x cheaper) and ubiquitous. A huge
>>> number of modern systems are object-storage-only because the approximately
>>> infinite scale / cost / throughput tradeoffs often make up for the latency.
>>>
>>> Outright dismissing object storage for Cassandra is short sighted - it
>>> needs to be done in a way that makes sense, not just blindly copying over
>>> the block access patterns to object.
>>>
>>>
>>> On Mar 4, 2025, at 11:19 AM, Štefan Miklošovič 
>>> wrote:
>>>
>>> 
>>> I do not think we need this CEP, honestly. I don't want to diss this
>>> unnecessarily but if you mount a remote storage locally (e.g. mounting s3
>>> bucket as if it was any other directory on node's machine), then what is
>>> this CEP good for?
>>>
>>> Not talking about the necessity to put all dependencies to be able to
>>> talk to respective remote storage to Cassandra's class path, introducing
>>> potential problems with dependencies and their possible incompatibilities /
>>> different versions etc ...
>>>
>>> On Thu, Feb 27, 2025 at 6:21 AM C. Scott Andreas 
>>> wrote:
>>>
 I’d love to see this implemented — where “this” is a proxy for some
 notion of support for remote object storage, perhaps usable by compaction
 strategies like TWCS to migrate data older than a threshold from a local
 filesystem to remote object.

 It’s not an area where I can currently dedicate engineering effort. But
 if others are interested in contributing a feature like this, I’d see it as
 valuable for the project and would be happy to collaborate on
 design/architecture/goals.

 – Scott

 On Feb 26, 2025, at 6:56 AM, guo Maxwell  wrote:

 
 Is anyone else interested
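The "configurable ChannelProxy" under discussion can be pictured as a thin routing layer. A hedged sketch, assuming nothing about the CEP's actual API (the class name and alias scheme below are invented for illustration): channel opens whose paths fall under a configured prefix get re-rooted onto another FileSystem, which could be backed by an object-store provider.

```java
import java.io.IOException;
import java.nio.channels.SeekableByteChannel;
import java.nio.file.*;
import java.util.Map;

// Invented-for-illustration sketch of a path-aliasing channel proxy: opens under
// a configured prefix are redirected to another root (here a local directory
// standing in for an object-store-backed FileSystem).
public class AliasingChannelProxy {
    private final Map<String, Path> aliases; // path prefix -> aliased root

    public AliasingChannelProxy(Map<String, Path> aliases) {
        this.aliases = aliases;
    }

    public SeekableByteChannel open(Path path, OpenOption... options) throws IOException {
        String p = path.toString();
        for (Map.Entry<String, Path> alias : aliases.entrySet()) {
            if (p.startsWith(alias.getKey() + "/")) {
                // Re-root the tail of the path onto the aliased FileSystem; a real
                // remote provider would turn reads into ranged GETs here.
                Path rerooted = alias.getValue()
                        .resolve(p.substring(alias.getKey().length() + 1));
                return Files.newByteChannel(rerooted, options);
            }
        }
        return Files.newByteChannel(path, options); // default: untouched local IO
    }
}
```

Whether such a proxy helps depends on the IO-model point Jeff raises above: redirecting channels alone keeps block-style access patterns, so a real remote backend would need to change read granularity, not just the destination.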

Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2025-03-04 Thread Štefan Miklošovič
Jeff,

when it comes to snapshots, there was already discussion in the other
thread I am not sure you are aware of (1), here (2) I am talking about
Sidecar + snapshots specifically. One "caveat" of Sidecar is that you
actually _need_ sidecar if we ever contemplated Sidecar doing upload /
backup (by whatever means).

"s3 bucket as a mounted directory" bypasses the necessity of having Sidecar
deployed. I do not want to repeat what I wrote in (2), all the reasoning is
there.

My primary motivation to do it by mounting is to 1) not use Sidecar if not
needed 2) just save snapshot SSTables outside of table data dirs (which is
imho completely fine requirement on its own, e.g. putting snapshots on a
different / slow disk etc, why does it have to be on the same disk as data
are?)

The vibe I got from that thread was that what I was proposing is in general
acceptable and I wanted to figure out the details as Blake was mentioning
incremental backups as well. Other than that, I was thinking that we are
pretty much settled on how that should be, that thread was up for a long
time so I was thinking people in general do not have a problem with that.

How I read your email, specifically this part:

"This is the same feedback I gave the sidecar with the rsync to another
machine proposal. If we stop doing one off tactical projects and map out
the actual problem we’re solving, we can get the right interfaces."

it seems to me that you are categorizing "s3 bucket mounted locally" as one
of these "one off tactical projects" which in translation means that you
would like to see that approach not implemented and we should rather focus
on doing everything via proxies?

One big downside of that which nobody has answered yet (it seems like that to
me, as far as I checked) is how this would actually look in deployment and
delivery. As I said in (1), we would need to code up a proxy for every remote
storage, Azure, S3, GCP to name a few. We would need to implement all of
these for each and every cloud.

Secondly, who would implement that and where would that code live? Is it up
to individuals to code it up internally? When we want to talk to s3, we
need to put all the s3 dependencies on the classpath; who is going to
integrate that, and is that even possible? Similarly for other clouds.

By mounting a dir, we just do not do anything with Cassandra's class path.
It is as it was, simple, easy, and we interact with it as we are used to.

I see that proxy might be viable for some applications but I think it has
also non-trivial disadvantages operationally-wise.

(1) https://lists.apache.org/thread/8cz5fh835ojnxwtn1479q31smm5x7nxt
(2) https://lists.apache.org/thread/mttg75ps49qkob6km4l74fmp879v76qs

On Tue, Mar 4, 2025 at 5:13 PM Jeff Jirsa  wrote:

> Mounted dirs give up the opportunity to change the IO model to account for
> different behaviors. The configurable channel proxy may suffer from the
> same IO constraints depending on implementation, too. But it may also
> become viable.
>
> The snapshot outside of the mounted file system seems like you’re
> implicitly implementing a one off in process backup. This is the same
> feedback I gave the sidecar with the rsync to another machine proposal. If
> we stop doing one off tactical projects and map out the actual problem
> we’re solving, we can get the right interfaces.  Also, you can probably
> just have the sidecar rsync thing do your snapshot to another directory on
> host.
>
> But if every sstable makes its way to s3, things like native backup,
> restoring from backup, recovering from local volumes can look VERY different
>
>
>
> On Mar 4, 2025, at 3:57 PM, Štefan Miklošovič 
> wrote:
>
> 
> For what it's worth, as it might seem to somebody that I am rejecting this
> altogether (which is not the case; all I am trying to say is that we should
> just think about it more): it would be cool to know more about the
> experience of others when it comes to this, maybe somebody already tried to
> mount and it did not work as expected?
>
> On the other hand, there is this "snapshots outside data dir" effort I am
> doing and if we did it with this, then I can imagine that we could say "and
> if you deal with snapshots, use this proxy instead" which would
> transparently upload it to s3.
>
> Then we would not need to do anything at all, code-wise. We would not need
> to store snapshots "outside of data dir" just to be able to place it on a
> directory which is mounted as an s3 bucket.
>
> I don't know if it is possible to do it like that. Worth to explore I
> guess.
>
> I like mounted dirs for their simplicity and I guess that for copying files
> it might be just enough. Plus we would not need to add all the s3 jars to
> the CP either etc ...
>
> On Tue, Mar 4, 2025 at 2:46 PM Štefan Miklošovič 
> wrote:
>
>> I am not saying that using remote object storage is useless.
>>
>> I am just saying that I don't see the difference. I have not measured
>> that but I can imagine that s3 mounted would use, under the hood, the same
>> calls to s

Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2025-03-04 Thread Jeff Jirsa
Mounted dirs give up the opportunity to change the IO model to account for different behaviors. The configurable channel proxy may suffer from the same IO constraints, depending on implementation, too. But it may also become viable.

The snapshot outside of the mounted file system seems like you’re implicitly implementing a one-off, in-process backup. This is the same feedback I gave the sidecar with the rsync-to-another-machine proposal. If we stop doing one-off tactical projects and map out the actual problem we’re solving, we can get the right interfaces. Also, you can probably just have the sidecar rsync thing do your snapshot to another directory on the host.

But if every sstable makes its way to s3, things like native backup, restoring from backup, and recovering from local volumes can look VERY different.
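To make the "aliasing external storage locations" idea concrete, here is a minimal sketch of the kind of routing a configurable channel proxy implies: file access dispatched per path prefix to either the local filesystem or an object store. All names here are hypothetical illustrations, not the CEP's actual interface, and a real remote-backed proxy would also carry the different IO model Jeff describes (request latency, ranged reads) rather than pretending to be a disk.

```java
import java.nio.file.Path;
import java.util.Map;

// Hypothetical sketch only: decide, per sstable path, whether IO should go
// through a local file channel or an object-store-backed channel. The real
// CEP-36 interface may look nothing like this.
public final class AliasingProxySelector {
    public enum Backend { LOCAL_FILE, OBJECT_STORE }

    private final Map<String, Backend> aliases; // configured path prefix -> backend

    public AliasingProxySelector(Map<String, Backend> aliases) {
        this.aliases = aliases;
    }

    /** Longest-prefix match over configured aliases; unaliased paths stay local. */
    public Backend backendFor(Path sstable) {
        String p = sstable.toString();
        Backend best = Backend.LOCAL_FILE;
        int bestLen = -1;
        for (Map.Entry<String, Backend> e : aliases.entrySet()) {
            if (p.startsWith(e.getKey()) && e.getKey().length() > bestLen) {
                best = e.getValue();
                bestLen = e.getKey().length();
            }
        }
        return best;
    }
}
```

The routing itself is trivial; the hard parts the thread is debating (failure semantics, backup/restore, consistency of the remote tier) live behind whichever channel the selector picks.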

Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2025-03-04 Thread Štefan Miklošovič
For what it's worth, in case it comes across as if I am rejecting this
altogether (which is not the case; all I am trying to say is that we should
just think about it more): it would be cool to know more about the
experience of others when it comes to this. Maybe somebody has already tried
to mount it and it did not work as expected?

On the other hand, there is this "snapshots outside data dir" effort I am
doing and if we did it with this, then I can imagine that we could say "and
if you deal with snapshots, use this proxy instead" which would
transparently upload it to s3.

Then we would not need to do anything at all, code-wise. We would not need
to store snapshots "outside of data dir" just to be able to place it on a
directory which is mounted as an s3 bucket.

I don't know if it is possible to do it like that. Worth exploring, I guess.

I like mounted dirs for their simplicity, and I guess that for copying files
it might be just enough. Plus we would not need to add all the s3 jars to the
classpath either, etc ...

On Tue, Mar 4, 2025 at 2:46 PM Štefan Miklošovič 
wrote:

> I don't say that using remote object storage is useless.
>
> I am just saying that I don't see the difference. I have not measured that
> but I can imagine that s3 mounted would use, under the hood, the same calls
> to s3 api. How else would it be done? You need to talk to remote s3 storage
> eventually anyway. So why does it matter if we call s3 api from Java or by
> other means from some "s3 driver"?  It is eventually using same thing, no?
>
> On Tue, Mar 4, 2025 at 12:47 PM Jeff Jirsa  wrote:
>
>> Mounting an s3 bucket as a directory is an easy but poor implementation
>> of object backed storage for databases
>>
>> Object storage is durable (most data loss is due to bugs not concurrent
>> hardware failures), cheap (can 5-10x cheaper) and ubiquitous. A  huge
>> number of modern systems are object-storage-only because the approximately
>> infinite scale / cost / throughput tradeoffs often make up for the latency.
>>
>> Outright dismissing object storage for Cassandra is short sighted - it
>> needs to be done in a way that makes sense, not just blindly copying over
>> the block access patterns to object.
>>
>>
>> On Mar 4, 2025, at 11:19 AM, Štefan Miklošovič 
>> wrote:
>>
>> 
>> I do not think we need this CEP, honestly. I don't want to diss this
>> unnecessarily but if you mount a remote storage locally (e.g. mounting s3
>> bucket as if it was any other directory on node's machine), then what is
>> this CEP good for?
>>
>> Not talking about the necessity to put all dependencies to be able to
>> talk to respective remote storage to Cassandra's class path, introducing
>> potential problems with dependencies and their possible incompatibilities /
>> different versions etc ...
>>
>> On Thu, Feb 27, 2025 at 6:21 AM C. Scott Andreas 
>> wrote:
>>
>>> I’d love to see this implemented — where “this” is a proxy for some
>>> notion of support for remote object storage, perhaps usable by compaction
>>> strategies like TWCS to migrate data older than a threshold from a local
>>> filesystem to remote object.
>>>
>>> It’s not an area where I can currently dedicate engineering effort. But
>>> if others are interested in contributing a feature like this, I’d see it as
>>> valuable for the project and would be happy to collaborate on
>>> design/architecture/goals.
>>>
>>> – Scott
>>>
>>> On Feb 26, 2025, at 6:56 AM, guo Maxwell  wrote:
>>>
>>> 
>>> Is anyone else interested in continuing to discuss this topic?
>>>
>>> On Fri, Sep 20, 2024 at 09:44, guo Maxwell wrote:
>>>
 I discussed this offline with Claude, he is no longer working on this.

 It's a pity. I think this is a very valuable thing. Commitlog's
 archiving and restore may be able to use the relevant code if it is
 completed.

 Patrick McFadin 于2024年9月20日 周五上午2:01写道:

> Thanks for reviving this one!
>
> On Wed, Sep 18, 2024 at 12:06 AM guo Maxwell 
> wrote:
>
>> Is there any update on this topic?  It seems that things can make a
>> big progress if  Jake Luciani  can find someone who can make the
>> FileSystemProvider code accessible.
>>
>> On Sat, Dec 16, 2023 at 05:29, Jon Haddad wrote:
>>
>>> At a high level I really like the idea of being able to better
>>> leverage cheaper storage especially object stores like S3.
>>>
>>> One important thing though - I feel pretty strongly that there's a
>>> big, deal breaking downside.   Backups, disk failure policies, snapshots
>>> and possibly repairs would get more complicated which haven't been
>>> particularly great in the past, and of course there's the issue of 
>>> failure
>>> recovery being only partially possible if you're looking at a durable 
>>> block
>>> store paired with an ephemeral one with some of your data not 
>>> replicated to
>>> the cold side.  That introduces a failure case that's unacceptable for 
>>> most
>>> teams, which results in need

Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2025-03-04 Thread guo Maxwell
If we want to do this, we should wrap the object storage downwards and
provide file system API capabilities upwards (to the Cassandra layer), if my
understanding is correct.


Brandon Williams 于2025年3月4日 周二下午9:55写道:

> A failing remote api that you are calling and a failing filesystem you
> are using have different implications.
>
> Kind Regards,
> Brandon
>
> On Tue, Mar 4, 2025 at 7:47 AM Štefan Miklošovič 
> wrote:
> >
> > I don't say that using remote object storage is useless.
> >
> > I am just saying that I don't see the difference. I have not measured
> that but I can imagine that s3 mounted would use, under the hood, the same
> calls to s3 api. How else would it be done? You need to talk to remote s3
> storage eventually anyway. So why does it matter if we call s3 api from
> Java or by other means from some "s3 driver"?  It is eventually using same
> thing, no?
> >
> > On Tue, Mar 4, 2025 at 12:47 PM Jeff Jirsa  wrote:
> >>
> >> Mounting an s3 bucket as a directory is an easy but poor implementation
> of object backed storage for databases
> >>
> >> Object storage is durable (most data loss is due to bugs not concurrent
> hardware failures), cheap (can 5-10x cheaper) and ubiquitous. A  huge
> number of modern systems are object-storage-only because the approximately
> infinite scale / cost / throughput tradeoffs often make up for the latency.
> >>
> >> Outright dismissing object storage for Cassandra is short sighted - it
> needs to be done in a way that makes sense, not just blindly copying over
> the block access patterns to object.
> >>
> >>
> >> On Mar 4, 2025, at 11:19 AM, Štefan Miklošovič 
> wrote:
> >>
> >> 
> >> I do not think we need this CEP, honestly. I don't want to diss this
> unnecessarily but if you mount a remote storage locally (e.g. mounting s3
> bucket as if it was any other directory on node's machine), then what is
> this CEP good for?
> >>
> >> Not talking about the necessity to put all dependencies to be able to
> talk to respective remote storage to Cassandra's class path, introducing
> potential problems with dependencies and their possible incompatibilities /
> different versions etc ...
> >>
> >> On Thu, Feb 27, 2025 at 6:21 AM C. Scott Andreas 
> wrote:
> >>>
> >>> I’d love to see this implemented — where “this” is a proxy for some
> notion of support for remote object storage, perhaps usable by compaction
> strategies like TWCS to migrate data older than a threshold from a local
> filesystem to remote object.
> >>>
> >>> It’s not an area where I can currently dedicate engineering effort.
> But if others are interested in contributing a feature like this, I’d see
> it as valuable for the project and would be happy to collaborate on
> design/architecture/goals.
> >>>
> >>> – Scott
> >>>
> >>> On Feb 26, 2025, at 6:56 AM, guo Maxwell  wrote:
> >>>
> >>> 
> >>> Is anyone else interested in continuing to discuss this topic?
> >>>
> >>> On Fri, Sep 20, 2024 at 09:44, guo Maxwell wrote:
> 
>  I discussed this offline with Claude, he is no longer working on this.
> 
>  It's a pity. I think this is a very valuable thing. Commitlog's
> archiving and restore may be able to use the relevant code if it is
> completed.
> 
>  On Fri, Sep 20, 2024 at 2:01 AM, Patrick McFadin wrote:
> >
> > Thanks for reviving this one!
> >
> > On Wed, Sep 18, 2024 at 12:06 AM guo Maxwell 
> wrote:
> >>
> >> Is there any update on this topic?  It seems that things can make a
> big progress if  Jake Luciani  can find someone who can make the
> FileSystemProvider code accessible.
> >>
> >> On Sat, Dec 16, 2023 at 05:29, Jon Haddad wrote:
> >>>
> >>> At a high level I really like the idea of being able to better
> leverage cheaper storage especially object stores like S3.
> >>>
> >>> One important thing though - I feel pretty strongly that there's a
> big, deal breaking downside.   Backups, disk failure policies, snapshots
> and possibly repairs would get more complicated which haven't been
> particularly great in the past, and of course there's the issue of failure
> recovery being only partially possible if you're looking at a durable block
> store paired with an ephemeral one with some of your data not replicated to
> the cold side.  That introduces a failure case that's unacceptable for most
> teams, which results in needing to implement potentially 2 different backup
> solutions.  This is operationally complex with a lot of surface area for
> headaches.  I think a lot of teams would probably have an issue with the
> big question mark around durability and I probably would avoid it myself.
> >>>
> >>> On the other hand, I'm +1 if we approach it something slightly
> differently - where _all_ the data is located on the cold storage, with the
> local hot storage used as a cache.  This means we can use the cold
> directories for the complete dataset, simplifying backups and node
> replacements.
> >>>
> >>> For a little background, we had a ticket several years a

Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2025-03-04 Thread Štefan Miklošovič
I would be very cautious about "reinventing the wheel" here. Are we
confident that we can implement it in a way which is bug-free / robust
enough? Do we think we can do a better job than the authors of an "s3
driver"? If they did a good job (which I assume they did), then failing an
operation while writing to / reading from remote storage mounted locally
_should_ ideally be indistinguishable from dealing with it locally.

So, what are these differences exactly? Not asking you specifically, but in
general. I think this should all be answered before doing this, in order to
know what we are getting ourselves into.

Plus there are the disadvantages when it comes to putting all the
dependencies on the classpath. I think we would never release it. We would
at most expose an API for others to integrate with, and it would be their
job to deal with all the complexity.

On Tue, Mar 4, 2025 at 2:57 PM Brandon Williams  wrote:

> A failing remote api that you are calling and a failing filesystem you
> are using have different implications.
>
> Kind Regards,
> Brandon
>
> On Tue, Mar 4, 2025 at 7:47 AM Štefan Miklošovič 
> wrote:
> >
> > I don't say that using remote object storage is useless.
> >
> > I am just saying that I don't see the difference. I have not measured
> that but I can imagine that s3 mounted would use, under the hood, the same
> calls to s3 api. How else would it be done? You need to talk to remote s3
> storage eventually anyway. So why does it matter if we call s3 api from
> Java or by other means from some "s3 driver"?  It is eventually using same
> thing, no?
> >
> > On Tue, Mar 4, 2025 at 12:47 PM Jeff Jirsa  wrote:
> >>
> >> Mounting an s3 bucket as a directory is an easy but poor implementation
> of object backed storage for databases
> >>
> >> Object storage is durable (most data loss is due to bugs not concurrent
> hardware failures), cheap (can 5-10x cheaper) and ubiquitous. A  huge
> number of modern systems are object-storage-only because the approximately
> infinite scale / cost / throughput tradeoffs often make up for the latency.
> >>
> >> Outright dismissing object storage for Cassandra is short sighted - it
> needs to be done in a way that makes sense, not just blindly copying over
> the block access patterns to object.
> >>
> >>
> >> On Mar 4, 2025, at 11:19 AM, Štefan Miklošovič 
> wrote:
> >>
> >> 
> >> I do not think we need this CEP, honestly. I don't want to diss this
> unnecessarily but if you mount a remote storage locally (e.g. mounting s3
> bucket as if it was any other directory on node's machine), then what is
> this CEP good for?
> >>
> >> Not talking about the necessity to put all dependencies to be able to
> talk to respective remote storage to Cassandra's class path, introducing
> potential problems with dependencies and their possible incompatibilities /
> different versions etc ...
> >>
> >> On Thu, Feb 27, 2025 at 6:21 AM C. Scott Andreas 
> wrote:
> >>>
> >>> I’d love to see this implemented — where “this” is a proxy for some
> notion of support for remote object storage, perhaps usable by compaction
> strategies like TWCS to migrate data older than a threshold from a local
> filesystem to remote object.
> >>>
> >>> It’s not an area where I can currently dedicate engineering effort.
> But if others are interested in contributing a feature like this, I’d see
> it as valuable for the project and would be happy to collaborate on
> design/architecture/goals.
> >>>
> >>> – Scott
> >>>
> >>> On Feb 26, 2025, at 6:56 AM, guo Maxwell  wrote:
> >>>
> >>> 
> >>> Is anyone else interested in continuing to discuss this topic?
> >>>
> >>> On Fri, Sep 20, 2024 at 09:44, guo Maxwell wrote:
> 
>  I discussed this offline with Claude, he is no longer working on this.
> 
>  It's a pity. I think this is a very valuable thing. Commitlog's
> archiving and restore may be able to use the relevant code if it is
> completed.
> 
>  On Fri, Sep 20, 2024 at 2:01 AM, Patrick McFadin wrote:
> >
> > Thanks for reviving this one!
> >
> > On Wed, Sep 18, 2024 at 12:06 AM guo Maxwell 
> wrote:
> >>
> >> Is there any update on this topic?  It seems that things can make a
> big progress if  Jake Luciani  can find someone who can make the
> FileSystemProvider code accessible.
> >>
> >> On Sat, Dec 16, 2023 at 05:29, Jon Haddad wrote:
> >>>
> >>> At a high level I really like the idea of being able to better
> leverage cheaper storage especially object stores like S3.
> >>>
> >>> One important thing though - I feel pretty strongly that there's a
> big, deal breaking downside.   Backups, disk failure policies, snapshots
> and possibly repairs would get more complicated which haven't been
> particularly great in the past, and of course there's the issue of failure
> recovery being only partially possible if you're looking at a durable block
> store paired with an ephemeral one with some of your data not replicated to
> the cold side.  That introduces a failure case that's unacceptable for

Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2025-03-04 Thread Jeff Jirsa
Most obviously, you don’t need to move all components of the sstable to s3; you could keep index + compression offsets locally.

On Mar 4, 2025, at 1:46 PM, Štefan Miklošovič wrote:

> I don't say that using remote object storage is useless. I am just saying that I don't see the difference. I have not measured that but I can imagine that s3 mounted would use, under the hood, the same calls to s3 api. How else would it be done? You need to talk to remote s3 storage eventually anyway. So why does it matter if we call s3 api from Java or by other means from some "s3 driver"? It is eventually using same thing, no?
>
> On Tue, Mar 4, 2025 at 12:47 PM Jeff Jirsa wrote:
>
>> Mounting an s3 bucket as a directory is an easy but poor implementation of object backed storage for databases.
>>
>> Object storage is durable (most data loss is due to bugs not concurrent hardware failures), cheap (can be 5-10x cheaper) and ubiquitous. A huge number of modern systems are object-storage-only because the approximately infinite scale / cost / throughput tradeoffs often make up for the latency.
>>
>> Outright dismissing object storage for Cassandra is short sighted - it needs to be done in a way that makes sense, not just blindly copying over the block access patterns to object.
>>
>> On Mar 4, 2025, at 11:19 AM, Štefan Miklošovič wrote:
>>
>>> I do not think we need this CEP, honestly. I don't want to diss this unnecessarily but if you mount a remote storage locally (e.g. mounting s3 bucket as if it was any other directory on node's machine), then what is this CEP good for? Not talking about the necessity to put all dependencies to be able to talk to respective remote storage to Cassandra's class path, introducing potential problems with dependencies and their possible incompatibilities / different versions etc ...
>>>
>>> On Thu, Feb 27, 2025 at 6:21 AM C. Scott Andreas wrote:
>>>
>>>> I’d love to see this implemented — where “this” is a proxy for some notion of support for remote object storage, perhaps usable by compaction strategies like TWCS to migrate data older than a threshold from a local filesystem to remote object. It’s not an area where I can currently dedicate engineering effort. But if others are interested in contributing a feature like this, I’d see it as valuable for the project and would be happy to collaborate on design/architecture/goals. – Scott
>>>>
>>>> On Feb 26, 2025, at 6:56 AM, guo Maxwell wrote:
>>>>
>>>>> Is anyone else interested in continuing to discuss this topic?
>>>>>
>>>>> On Fri, Sep 20, 2024 at 09:44, guo Maxwell wrote:
>>>>>
>>>>>> I discussed this offline with Claude, he is no longer working on this. It's a pity. I think this is a very valuable thing. Commitlog's archiving and restore may be able to use the relevant code if it is completed.
>>>>>>
>>>>>> On Fri, Sep 20, 2024 at 2:01 AM, Patrick McFadin wrote:
>>>>>>
>>>>>>> Thanks for reviving this one!
>>>>>>>
>>>>>>> On Wed, Sep 18, 2024 at 12:06 AM, guo Maxwell wrote:
>>>>>>>
>>>>>>>> Is there any update on this topic? It seems that things can make big progress if Jake Luciani can find someone who can make the FileSystemProvider code accessible.
>>>>>>>>
>>>>>>>> On Sat, Dec 16, 2023 at 05:29, Jon Haddad wrote:
>>>>>>>>
>>>>>>>>> At a high level I really like the idea of being able to better leverage cheaper storage, especially object stores like S3.
>>>>>>>>>
>>>>>>>>> One important thing though - I feel pretty strongly that there's a big, deal breaking downside. Backups, disk failure policies, snapshots and possibly repairs would get more complicated, which haven't been particularly great in the past, and of course there's the issue of failure recovery being only partially possible if you're looking at a durable block store paired with an ephemeral one with some of your data not replicated to the cold side. That introduces a failure case that's unacceptable for most teams, which results in needing to implement potentially 2 different backup solutions. This is operationally complex with a lot of surface area for headaches. I think a lot of teams would probably have an issue with the big question mark around durability and I probably would avoid it myself.
>>>>>>>>>
>>>>>>>>> On the other hand, I'm +1 if we approach it slightly differently - where _all_ the data is located on the cold storage, with the local hot storage used as a cache. This means we can use the cold directories for the complete dataset, simplifying backups and node replacements.
>>>>>>>>>
>>>>>>>>> For a little background, we had a ticket several years ago where I pointed out it was possible to do this *today* at the operating system level as long as you're using block devices (vs an object store) and LVM [1]. For example, this works well with GP3 EBS w/ low IOPS provisioning + local NVMe to get a nice balance of great read performance without going nuts on the cost for IOPS. I also wrote about this in a little more detail in my blog [2]. There's also the new mount point tech in AWS which pretty much does exactly what I've suggested above [3] that's probably worth evaluating just to get
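Jeff's point at the top of this message, that only some sstable components need to move to the object store, can be sketched as a simple placement policy: the bulky Data component migrates, while the small, seek-heavy components stay on local disk. The component names below are real Cassandra (big-format) sstable components; the particular split is illustrative only, not what any implementation actually does.

```java
import java.util.Set;

// Illustrative sketch: pin latency-sensitive sstable components locally and
// let only the large data component live in the object store. The chosen
// split is an assumption for illustration, not CEP-36's design.
public final class ComponentPlacement {
    private static final Set<String> LOCAL_ONLY = Set.of(
        "Index.db",            // primary index: seeked on every read
        "CompressionInfo.db",  // chunk offsets: needed to address remote ranged reads
        "Filter.db",           // bloom filter
        "Summary.db",          // index summary
        "Statistics.db",       // metadata
        "TOC.txt");            // component listing

    public static String placementFor(String component) {
        return LOCAL_ONLY.contains(component) ? "LOCAL" : "OBJECT_STORE";
    }
}
```

With a split like this, a read can consult the local index and compression offsets first and then issue a single ranged GET against the remote Data component, instead of paying object-store latency for every index seek.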

Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2025-03-04 Thread Brandon Williams
A failing remote api that you are calling and a failing filesystem you
are using have different implications.

Kind Regards,
Brandon

On Tue, Mar 4, 2025 at 7:47 AM Štefan Miklošovič  wrote:
>
> I don't say that using remote object storage is useless.
>
> I am just saying that I don't see the difference. I have not measured that 
> but I can imagine that s3 mounted would use, under the hood, the same calls 
> to s3 api. How else would it be done? You need to talk to remote s3 storage 
> eventually anyway. So why does it matter if we call s3 api from Java or by 
> other means from some "s3 driver"?  It is eventually using same thing, no?
>
> On Tue, Mar 4, 2025 at 12:47 PM Jeff Jirsa  wrote:
>>
>> Mounting an s3 bucket as a directory is an easy but poor implementation of 
>> object backed storage for databases
>>
>> Object storage is durable (most data loss is due to bugs not concurrent 
>> hardware failures), cheap (can 5-10x cheaper) and ubiquitous. A  huge number 
>> of modern systems are object-storage-only because the approximately infinite 
>> scale / cost / throughput tradeoffs often make up for the latency.
>>
>> Outright dismissing object storage for Cassandra is short sighted - it needs 
>> to be done in a way that makes sense, not just blindly copying over the 
>> block access patterns to object.
>>
>>
>> On Mar 4, 2025, at 11:19 AM, Štefan Miklošovič  
>> wrote:
>>
>> 
>> I do not think we need this CEP, honestly. I don't want to diss this 
>> unnecessarily but if you mount a remote storage locally (e.g. mounting s3 
>> bucket as if it was any other directory on node's machine), then what is 
>> this CEP good for?
>>
>> Not talking about the necessity to put all dependencies to be able to talk 
>> to respective remote storage to Cassandra's class path, introducing 
>> potential problems with dependencies and their possible incompatibilities / 
>> different versions etc ...
>>
>> On Thu, Feb 27, 2025 at 6:21 AM C. Scott Andreas  
>> wrote:
>>>
>>> I’d love to see this implemented — where “this” is a proxy for some notion 
>>> of support for remote object storage, perhaps usable by compaction 
>>> strategies like TWCS to migrate data older than a threshold from a local 
>>> filesystem to remote object.
>>>
>>> It’s not an area where I can currently dedicate engineering effort. But if 
>>> others are interested in contributing a feature like this, I’d see it as 
>>> valuable for the project and would be happy to collaborate on 
>>> design/architecture/goals.
>>>
>>> – Scott
>>>
>>> On Feb 26, 2025, at 6:56 AM, guo Maxwell  wrote:
>>>
>>> 
>>> Is anyone else interested in continuing to discuss this topic?
>>>
>>> On Fri, Sep 20, 2024 at 09:44, guo Maxwell wrote:

 I discussed this offline with Claude, he is no longer working on this.

 It's a pity. I think this is a very valuable thing. Commitlog's archiving 
 and restore may be able to use the relevant code if it is completed.

 Patrick McFadin 于2024年9月20日 周五上午2:01写道:
>
> Thanks for reviving this one!
>
> On Wed, Sep 18, 2024 at 12:06 AM guo Maxwell  wrote:
>>
>> Is there any update on this topic?  It seems that things can make a big 
>> progress if  Jake Luciani  can find someone who can make the 
>> FileSystemProvider code accessible.
>>
>> On Sat, Dec 16, 2023 at 05:29, Jon Haddad wrote:
>>>
>>> At a high level I really like the idea of being able to better leverage 
>>> cheaper storage especially object stores like S3.
>>>
>>> One important thing though - I feel pretty strongly that there's a big, 
>>> deal breaking downside.   Backups, disk failure policies, snapshots and 
>>> possibly repairs would get more complicated which haven't been 
>>> particularly great in the past, and of course there's the issue of 
>>> failure recovery being only partially possible if you're looking at a 
>>> durable block store paired with an ephemeral one with some of your data 
>>> not replicated to the cold side.  That introduces a failure case that's 
>>> unacceptable for most teams, which results in needing to implement 
>>> potentially 2 different backup solutions.  This is operationally 
>>> complex with a lot of surface area for headaches.  I think a lot of 
>>> teams would probably have an issue with the big question mark around 
>>> durability and I probably would avoid it myself.
>>>
>>> On the other hand, I'm +1 if we approach it something slightly 
>>> differently - where _all_ the data is located on the cold storage, with 
>>> the local hot storage used as a cache.  This means we can use the cold 
>>> directories for the complete dataset, simplifying backups and node 
>>> replacements.
>>>
>>> For a little background, we had a ticket several years ago where I 
>>> pointed out it was possible to do this *today* at the operating system 
>>> level as long as you're using block devices (vs an object store) an

Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2025-03-04 Thread Štefan Miklošovič
I am not saying that using remote object storage is useless.

I am just saying that I don't see the difference. I have not measured it,
but I can imagine that a mounted s3 would use, under the hood, the same
calls to the s3 api. How else would it be done? You need to talk to remote
s3 storage eventually anyway. So why does it matter if we call the s3 api
from Java or by other means from some "s3 driver"? It is eventually using
the same thing, no?

On Tue, Mar 4, 2025 at 12:47 PM Jeff Jirsa  wrote:

> Mounting an s3 bucket as a directory is an easy but poor implementation of
> object backed storage for databases
>
> Object storage is durable (most data loss is due to bugs not concurrent
> hardware failures), cheap (can 5-10x cheaper) and ubiquitous. A  huge
> number of modern systems are object-storage-only because the approximately
> infinite scale / cost / throughput tradeoffs often make up for the latency.
>
> Outright dismissing object storage for Cassandra is short sighted - it
> needs to be done in a way that makes sense, not just blindly copying over
> the block access patterns to object.
>
>
> On Mar 4, 2025, at 11:19 AM, Štefan Miklošovič 
> wrote:
>
> 
> I do not think we need this CEP, honestly. I don't want to diss this
> unnecessarily but if you mount a remote storage locally (e.g. mounting s3
> bucket as if it was any other directory on node's machine), then what is
> this CEP good for?
>
> Not talking about the necessity to put all dependencies to be able to talk
> to respective remote storage to Cassandra's class path, introducing
> potential problems with dependencies and their possible incompatibilities /
> different versions etc ...
>
> On Thu, Feb 27, 2025 at 6:21 AM C. Scott Andreas 
> wrote:
>
>> I’d love to see this implemented — where “this” is a proxy for some
>> notion of support for remote object storage, perhaps usable by compaction
>> strategies like TWCS to migrate data older than a threshold from a local
>> filesystem to remote object.
>>
>> It’s not an area where I can currently dedicate engineering effort. But
>> if others are interested in contributing a feature like this, I’d see it as
>> valuable for the project and would be happy to collaborate on
>> design/architecture/goals.
>>
>> – Scott
>>
>> On Feb 26, 2025, at 6:56 AM, guo Maxwell  wrote:
>>
>> 
>> Is anyone else interested in continuing to discuss this topic?
>>
>> On Fri, Sep 20, 2024 at 09:44, guo Maxwell wrote:
>>
>>> I discussed this offline with Claude, he is no longer working on this.
>>>
>>> It's a pity. I think this is a very valuable thing. Commitlog's
>>> archiving and restore may be able to use the relevant code if it is
>>> completed.
>>>
>>> On Fri, Sep 20, 2024 at 2:01 AM, Patrick McFadin wrote:
>>>
 Thanks for reviving this one!

 On Wed, Sep 18, 2024 at 12:06 AM guo Maxwell 
 wrote:

> Is there any update on this topic?  It seems that things can make a
> big progress if  Jake Luciani  can find someone who can make the
> FileSystemProvider code accessible.
>
> On Sat, Dec 16, 2023 at 05:29, Jon Haddad wrote:
>
>> At a high level I really like the idea of being able to better
>> leverage cheaper storage especially object stores like S3.
>>
>> One important thing though - I feel pretty strongly that there's a
>> big, deal breaking downside.   Backups, disk failure policies, snapshots
>> and possibly repairs would get more complicated which haven't been
>> particularly great in the past, and of course there's the issue of 
>> failure
>> recovery being only partially possible if you're looking at a durable 
>> block
>> store paired with an ephemeral one with some of your data not replicated 
>> to
>> the cold side.  That introduces a failure case that's unacceptable for 
>> most
>> teams, which results in needing to implement potentially 2 different 
>> backup
>> solutions.  This is operationally complex with a lot of surface area for
>> headaches.  I think a lot of teams would probably have an issue with the
>> big question mark around durability and I probably would avoid it myself.
>>
>> On the other hand, I'm +1 if we approach it something slightly
>> differently - where _all_ the data is located on the cold storage, with 
>> the
>> local hot storage used as a cache.  This means we can use the cold
>> directories for the complete dataset, simplifying backups and node
>> replacements.
>>
>> For a little background, we had a ticket several years ago where I
>> pointed out it was possible to do this *today* at the operating system
>> level as long as you're using block devices (vs an object store) and LVM
>> [1].  For example, this works well with GP3 EBS w/ low IOPS provisioning 
>> +
>> local NVMe to get a nice balance of great read performance without going
>> nuts on the cost for IOPS.  I also wrote about this in a little more 
>> detail
>> in my blog [2].  There's als

Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2025-03-04 Thread Jeff Jirsa
Mounting an s3 bucket as a directory is an easy but poor implementation of
object backed storage for databases.

Object storage is durable (most data loss is due to bugs, not concurrent
hardware failures), cheap (can be 5-10x cheaper) and ubiquitous. A huge
number of modern systems are object-storage-only because the approximately
infinite scale / cost / throughput tradeoffs often make up for the latency.

Outright dismissing object storage for Cassandra is short sighted - it needs
to be done in a way that makes sense, not just blindly copying over the
block access patterns to object.

On Mar 4, 2025, at 11:19 AM, Štefan Miklošovič  wrote:

> I do not think we need this CEP, honestly. I don't want to diss this
> unnecessarily but if you mount a remote storage locally (e.g. mounting s3
> bucket as if it was any other directory on node's machine), then what is
> this CEP good for?

Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2025-03-04 Thread Štefan Miklošovič
I do not think we need this CEP, honestly. I don't want to diss this
unnecessarily but if you mount a remote storage locally (e.g. mounting s3
bucket as if it was any other directory on node's machine), then what is
this CEP good for?

Not to mention the necessity of putting all the dependencies needed to talk
to the respective remote storage on Cassandra's class path, introducing
potential problems with dependencies and their possible incompatibilities /
different versions, etc ...

On Thu, Feb 27, 2025 at 6:21 AM C. Scott Andreas 
wrote:

> I’d love to see this implemented — where “this” is a proxy for some
> notion of support for remote object storage, perhaps usable by compaction
> strategies like TWCS to migrate data older than a threshold from a local
> filesystem to remote object.
>
> It’s not an area where I can currently dedicate engineering effort. But if
> others are interested in contributing a feature like this, I’d see it as
> valuable for the project and would be happy to collaborate on
> design/architecture/goals.
>
> – Scott
>
> On Feb 26, 2025, at 6:56 AM, guo Maxwell  wrote:
>
> 
> Is anyone else interested in continuing to discuss this topic?
>
> guo Maxwell  于2024年9月20日周五 09:44写道:
>
>> I discussed this offline with Claude, he is no longer working on this.
>>
>> It's a pity. I think this is a very valuable thing. Commitlog's archiving
>> and restore may be able to use the relevant code if it is completed.
>>
>> Patrick McFadin 于2024年9月20日 周五上午2:01写道:
>>
>>> Thanks for reviving this one!
>>>
>>> On Wed, Sep 18, 2024 at 12:06 AM guo Maxwell 
>>> wrote:
>>>
 Is there any update on this topic?  It seems that things can make a big
 progress if  Jake Luciani  can find someone who can make the
 FileSystemProvider code accessible.

 Jon Haddad  于2023年12月16日周六 05:29写道:

> At a high level I really like the idea of being able to better
> leverage cheaper storage especially object stores like S3.
>
> One important thing though - I feel pretty strongly that there's a
> big, deal breaking downside.   Backups, disk failure policies, snapshots
> and possibly repairs would get more complicated which haven't been
> particularly great in the past, and of course there's the issue of failure
> recovery being only partially possible if you're looking at a durable 
> block
> store paired with an ephemeral one with some of your data not replicated 
> to
> the cold side.  That introduces a failure case that's unacceptable for 
> most
> teams, which results in needing to implement potentially 2 different 
> backup
> solutions.  This is operationally complex with a lot of surface area for
> headaches.  I think a lot of teams would probably have an issue with the
> big question mark around durability and I probably would avoid it myself.
>
> On the other hand, I'm +1 if we approach it slightly
> differently - where _all_ the data is located on the cold storage, with 
> the
> local hot storage used as a cache.  This means we can use the cold
> directories for the complete dataset, simplifying backups and node
> replacements.
>
> For a little background, we had a ticket several years ago where I
> pointed out it was possible to do this *today* at the operating system
> level as long as you're using block devices (vs an object store) and LVM
> [1].  For example, this works well with GP3 EBS w/ low IOPS provisioning +
> local NVMe to get a nice balance of great read performance without going
> nuts on the cost for IOPS.  I also wrote about this in a little more 
> detail
> in my blog [2].  There's also the new mount point tech in AWS which pretty
> much does exactly what I've suggested above [3] that's probably worth
> evaluating just to get a feel for it.
>
> I'm not insisting we require LVM or the AWS S3 fs, since that would
> rule out other cloud providers, but I am pretty confident that the entire
> dataset should reside in the "cold" side of things for the practical and
> technical reasons I listed above.  I don't think it massively changes the
> proposal, and should simplify things for everyone.
>
> Jon
>
> [1] https://issues.apache.org/jira/browse/CASSANDRA-8460
> [2] https://rustyrazorblade.com/post/2018/2018-04-24-intro-to-lvm/
> [3]
> https://aws.amazon.com/about-aws/whats-new/2023/03/mountpoint-amazon-s3/
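
The cache-over-a-complete-cold-copy layout Jon describes can be sketched in a
few lines. Everything below is illustrative only — hypothetical class and
method names, not Cassandra's ChannelProxy or any real API: writes land on
the cold store first, and the local disk acts purely as a cache.

```python
# Illustrative-only sketch of a write-through layout: the cold store holds
# the complete dataset; the local disk is purely a cache. Names here are
# hypothetical, not Cassandra APIs.

class ColdStore:
    """Stand-in for an object store / cheap block volume with ALL the data."""
    def __init__(self):
        self.blobs = {}

    def put(self, name, data):
        self.blobs[name] = data

    def get(self, name):
        return self.blobs[name]


class WriteThroughProxy:
    """Writes hit cold storage first, then warm the local cache.
    Evicting locally never loses data, because the cold copy is complete --
    which is what simplifies backups and node replacement."""
    def __init__(self, cold):
        self.cold = cold
        self.cache = {}

    def write(self, name, data):
        self.cold.put(name, data)   # durable copy first
        self.cache[name] = data     # then the hot local copy

    def read(self, name):
        if name in self.cache:      # hot path: local disk
            return self.cache[name]
        data = self.cold.get(name)  # cache miss: fall back to cold storage
        self.cache[name] = data     # repopulate for the next read
        return data

    def evict(self, name):
        self.cache.pop(name, None)  # safe: full dataset remains cold-side


proxy = WriteThroughProxy(ColdStore())
proxy.write("na-1-big-Data.db", b"sstable bytes")
proxy.evict("na-1-big-Data.db")            # simulate local cache pressure
assert proxy.read("na-1-big-Data.db") == b"sstable bytes"  # refetched cold
```

The key property is in `evict`: because every write went through to the cold
side, dropping the local copy is always safe, unlike a write-back tiering
scheme where unflushed local data would be lost with the node.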
>
>
> On Thu, Dec 14, 2023 at 1:56 AM Claude Warren 
> wrote:
>
>> Is there still interest in this?  Can we get some points down on
>> electrons so that we all understand the issues?
>>
>> While it is fairly simple to redirect the read/write to something
>> other  than the local system for a single node this will not solve the
>> problem for tiered storage.
>>
>> Tiered storage will require that on read/write the primary key be
>

Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2025-03-04 Thread Rolo, Carlos via dev
Hello,

I would love to discuss this and provide feedback and design work for it.
Since I'm not an experienced Java programmer I can't go hands-on with the
code, but I can pick this up and try to carry it forward.

Carlos


From: guo Maxwell 
Sent: 26 February 2025 14:54
To: [email protected] 
Subject: Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external 
storage locations


Is anyone else interested in continuing to discuss this topic?

guo Maxwell  于2024年9月20日周五 09:44写道:
I discussed this offline with Claude, he is no longer working on this.

It's a pity. I think this is a very valuable thing. Commitlog's archiving and 
restore may be able to use the relevant code if it is completed.

Patrick McFadin 于2024年9月20日 周五上午2:01写道:
Thanks for reviving this one!

On Wed, Sep 18, 2024 at 12:06 AM guo Maxwell  wrote:
Is there any update on this topic?  It seems that things can make big
progress if Jake Luciani can find someone who can make the FileSystemProvider
code accessible.

Jon Haddad  于2023年12月16日周六 05:29写道:
At a high level I really like the idea of being able to better leverage cheaper 
storage especially object stores like S3.

One important thing though - I feel pretty strongly that there's a big, deal 
breaking downside.   Backups, disk failure policies, snapshots and possibly 
repairs would get more complicated which haven't been particularly great in the 
past, and of course there's the issue of failure recovery being only partially 
possible if you're looking at a durable block store paired with an ephemeral 
one with some of your data not replicated to the cold side.  That introduces a 
failure case that's unacceptable for most teams, which results in needing to 
implement potentially 2 different backup solutions.  This is operationally 
complex with a lot of surface area for headaches.  I think a lot of teams would 
probably have an issue with the big question mark around durability and I 
probably would avoid it myself.

On the other hand, I'm +1 if we approach it slightly differently - 
where _all_ the data is located on the cold storage, with the local hot storage 
used as a cache.  This means we can use the cold directories for the complete 
dataset, simplifying backups and node replacements.

For a little background, we had a ticket several years ago where I pointed out 
it was possible to do this *today* at the operating system level as long as 
you're using block devices (vs an object store) and LVM [1].  For example, this 
works well with GP3 EBS w/ low IOPS provisioning + local NVMe to get a nice 
balance of great read performance without going nuts on the cost for IOPS.  I 
also wrote about this in a little more detail in my blog [2].  There's also the 
new mount point tech in AWS which pretty much does exactly what I've suggested 
above [3] that's probably worth evaluating just to get a feel for it.

I'm not insisting we require LVM or the AWS S3 fs, since that would rule out 
other cloud providers, but I am pretty confident that the entire dataset should 
reside in the "cold" side of things for the practical and technical reasons I 
listed above.  I don't think it massively changes the proposal, and should 
simplify things for everyone.

Jon

[1] https://issues.apache.org/jira/browse/CASSANDRA-8460
[2] https://rustyrazorblade.com/post/2018/2018-04-24-intro-to-lvm/
[3] https://aws.amazon.com/about-aws/whats-new/2023/03/mountpoint-amazon-s3/


On Thu, Dec 14, 2023 at 1:56 AM Claude Warren  wrote:
Is there still interest in this?  Can we get some points down on electrons so 
that we all understand the issues?

While it is fairly simple to redirect the read/write to something other  than 
the local system for a single node this will not solve the problem for tiered 
storage.

Tiered storage will require that on read/write the primary key be assessed and 
determine if the read/write should be redirected.  My reasoning for this 
statement is that in a cluster with a replication factor greater than 1 the 

Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2025-02-26 Thread C. Scott Andreas
I’d love to see this implemented — where “this” is a proxy for some notion
of support for remote object storage, perhaps usable by compaction
strategies like TWCS to migrate data older than a threshold from a local
filesystem to remote object.

It’s not an area where I can currently dedicate engineering effort. But if
others are interested in contributing a feature like this, I’d see it as
valuable for the project and would be happy to collaborate on
design/architecture/goals.

– Scott

On Feb 26, 2025, at 6:56 AM, guo Maxwell  wrote:

> Is anyone else interested in continuing to discuss this topic?








Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2025-02-26 Thread guo Maxwell
Is anyone else interested in continuing to discuss this topic?

guo Maxwell  于2024年9月20日周五 09:44写道:

> I discussed this offline with Claude, he is no longer working on this.
>
> It's a pity. I think this is a very valuable thing. Commitlog's archiving
> and restore may be able to use the relevant code if it is completed.
>
> Patrick McFadin 于2024年9月20日 周五上午2:01写道:
>
>> Thanks for reviving this one!
>>
>> On Wed, Sep 18, 2024 at 12:06 AM guo Maxwell 
>> wrote:
>>
>>> Is there any update on this topic?  It seems that things can make big
>>> progress if Jake Luciani can find someone who can make the
>>> FileSystemProvider code accessible.
>>>
>>> Jon Haddad  于2023年12月16日周六 05:29写道:
>>>
 At a high level I really like the idea of being able to better leverage
 cheaper storage especially object stores like S3.

 One important thing though - I feel pretty strongly that there's a big,
 deal breaking downside.   Backups, disk failure policies, snapshots and
 possibly repairs would get more complicated which haven't been particularly
 great in the past, and of course there's the issue of failure recovery
 being only partially possible if you're looking at a durable block store
 paired with an ephemeral one with some of your data not replicated to the
 cold side.  That introduces a failure case that's unacceptable for most
 teams, which results in needing to implement potentially 2 different backup
 solutions.  This is operationally complex with a lot of surface area for
 headaches.  I think a lot of teams would probably have an issue with the
 big question mark around durability and I probably would avoid it myself.

 On the other hand, I'm +1 if we approach it slightly
 differently - where _all_ the data is located on the cold storage, with the
 local hot storage used as a cache.  This means we can use the cold
 directories for the complete dataset, simplifying backups and node
 replacements.

 For a little background, we had a ticket several years ago where I
 pointed out it was possible to do this *today* at the operating system
 level as long as you're using block devices (vs an object store) and LVM
 [1].  For example, this works well with GP3 EBS w/ low IOPS provisioning +
 local NVMe to get a nice balance of great read performance without going
 nuts on the cost for IOPS.  I also wrote about this in a little more detail
 in my blog [2].  There's also the new mount point tech in AWS which pretty
 much does exactly what I've suggested above [3] that's probably worth
 evaluating just to get a feel for it.

 I'm not insisting we require LVM or the AWS S3 fs, since that would
 rule out other cloud providers, but I am pretty confident that the entire
 dataset should reside in the "cold" side of things for the practical and
 technical reasons I listed above.  I don't think it massively changes the
 proposal, and should simplify things for everyone.

 Jon

 [1] https://issues.apache.org/jira/browse/CASSANDRA-8460
 [2] https://rustyrazorblade.com/post/2018/2018-04-24-intro-to-lvm/
 [3]
 https://aws.amazon.com/about-aws/whats-new/2023/03/mountpoint-amazon-s3/


 On Thu, Dec 14, 2023 at 1:56 AM Claude Warren 
 wrote:

> Is there still interest in this?  Can we get some points down on
> electrons so that we all understand the issues?
>
> While it is fairly simple to redirect the read/write to something
> other  than the local system for a single node this will not solve the
> problem for tiered storage.
>
> Tiered storage will require that on read/write the primary key be
> assessed and determine if the read/write should be redirected.  My
> reasoning for this statement is that in a cluster with a replication 
> factor
> greater than 1 the node will store data for the keys that would be
> allocated to it in a cluster with a replication factor = 1, as well as 
> some
> keys from nodes earlier in the ring.
>
> Even if we can get the primary keys for all the data we want to write
> to "cold storage" to map to a single node a replication factor > 1 means
> that data will also be placed in "normal storage" on subsequent nodes.
>
> To overcome this, we have to explore ways to route data to different
> storage based on the keys and that different storage may have to be
> available on _all_  the nodes.
>
> Have any of the partial solutions mentioned in this email chain (or
> others) solved this problem?
>
> Claude
>
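
Claude's replication-factor point can be made concrete with a toy ring model.
This is a SimpleStrategy-style placement sketch with made-up tokens and node
names, purely for illustration of why key-based tiering decisions have to be
evaluated on every replica:

```python
# Toy ring model illustrating the RF > 1 problem: each node also stores
# replicas of ranges owned by earlier nodes in the ring, so a key that a
# tiering policy sends to "cold" on its primary node still lands on other
# nodes as a replica. Tokens and names are illustrative only.

def replicas(ring, key_token, rf):
    """SimpleStrategy-style placement: the primary replica is the first
    node whose token is >= the key's token (wrapping around), followed by
    the next rf-1 nodes clockwise on the ring."""
    tokens = sorted(ring)
    idx = next((i for i, t in enumerate(tokens) if key_token <= t), 0)
    return [ring[tokens[(idx + i) % len(tokens)]] for i in range(rf)]

ring = {100: "node-A", 200: "node-B", 300: "node-C"}

# With RF=1 the key lands only on its primary...
assert replicas(ring, 150, rf=1) == ["node-B"]

# ...but with RF=3 every node holds a copy, so tiering the key on node-B
# alone would still leave hot replicas on node-A and node-C.
assert set(replicas(ring, 150, rf=3)) == {"node-A", "node-B", "node-C"}
```

This is why routing data to different storage by key implies the alternate
storage must be reachable from all nodes, as Claude argues above.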



Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2024-09-19 Thread guo Maxwell
I discussed this offline with Claude, he is no longer working on this.

It's a pity. I think this is a very valuable thing. Commitlog's archiving
and restore may be able to use the relevant code if it is completed.

Patrick McFadin 于2024年9月20日 周五上午2:01写道:

> Thanks for reviving this one!
>
> On Wed, Sep 18, 2024 at 12:06 AM guo Maxwell  wrote:
>
>> Is there any update on this topic?  It seems that things can make big
>> progress if Jake Luciani can find someone who can make the
>> FileSystemProvider code accessible.
>>
>> Jon Haddad  于2023年12月16日周六 05:29写道:
>>
>>> At a high level I really like the idea of being able to better leverage
>>> cheaper storage especially object stores like S3.
>>>
>>> One important thing though - I feel pretty strongly that there's a big,
>>> deal breaking downside.   Backups, disk failure policies, snapshots and
>>> possibly repairs would get more complicated which haven't been particularly
>>> great in the past, and of course there's the issue of failure recovery
>>> being only partially possible if you're looking at a durable block store
>>> paired with an ephemeral one with some of your data not replicated to the
>>> cold side.  That introduces a failure case that's unacceptable for most
>>> teams, which results in needing to implement potentially 2 different backup
>>> solutions.  This is operationally complex with a lot of surface area for
>>> headaches.  I think a lot of teams would probably have an issue with the
>>> big question mark around durability and I probably would avoid it myself.
>>>
>>> On the other hand, I'm +1 if we approach it slightly
>>> differently - where _all_ the data is located on the cold storage, with the
>>> local hot storage used as a cache.  This means we can use the cold
>>> directories for the complete dataset, simplifying backups and node
>>> replacements.
>>>
>>> For a little background, we had a ticket several years ago where I
>>> pointed out it was possible to do this *today* at the operating system
>>> level as long as you're using block devices (vs an object store) and LVM
>>> [1].  For example, this works well with GP3 EBS w/ low IOPS provisioning +
>>> local NVMe to get a nice balance of great read performance without going
>>> nuts on the cost for IOPS.  I also wrote about this in a little more detail
>>> in my blog [2].  There's also the new mount point tech in AWS which pretty
>>> much does exactly what I've suggested above [3] that's probably worth
>>> evaluating just to get a feel for it.
>>>
>>> I'm not insisting we require LVM or the AWS S3 fs, since that would rule
>>> out other cloud providers, but I am pretty confident that the entire
>>> dataset should reside in the "cold" side of things for the practical and
>>> technical reasons I listed above.  I don't think it massively changes the
>>> proposal, and should simplify things for everyone.
>>>
>>> Jon
>>>
>>> [1] https://issues.apache.org/jira/browse/CASSANDRA-8460
>>> [2] https://rustyrazorblade.com/post/2018/2018-04-24-intro-to-lvm/
>>> [3]
>>> https://aws.amazon.com/about-aws/whats-new/2023/03/mountpoint-amazon-s3/
>>>
>>>
>>> On Thu, Dec 14, 2023 at 1:56 AM Claude Warren  wrote:
>>>
 Is there still interest in this?  Can we get some points down on
 electrons so that we all understand the issues?

 While it is fairly simple to redirect the read/write to something
 other  than the local system for a single node this will not solve the
 problem for tiered storage.

 Tiered storage will require that on read/write the primary key be
 assessed and determine if the read/write should be redirected.  My
 reasoning for this statement is that in a cluster with a replication factor
 greater than 1 the node will store data for the keys that would be
 allocated to it in a cluster with a replication factor = 1, as well as some
 keys from nodes earlier in the ring.

 Even if we can get the primary keys for all the data we want to write
 to "cold storage" to map to a single node a replication factor > 1 means
 that data will also be placed in "normal storage" on subsequent nodes.

 To overcome this, we have to explore ways to route data to different
 storage based on the keys and that different storage may have to be
 available on _all_  the nodes.

 Have any of the partial solutions mentioned in this email chain (or
 others) solved this problem?

 Claude

>>>


Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2024-09-19 Thread Patrick McFadin
Thanks for reviving this one!

On Wed, Sep 18, 2024 at 12:06 AM guo Maxwell  wrote:

> Is there any update on this topic?  It seems that things can make big
> progress if Jake Luciani can find someone who can make the
> FileSystemProvider code accessible.
>
> Jon Haddad  于2023年12月16日周六 05:29写道:
>
>> At a high level I really like the idea of being able to better leverage
>> cheaper storage especially object stores like S3.
>>
>> One important thing though - I feel pretty strongly that there's a big,
>> deal breaking downside.   Backups, disk failure policies, snapshots and
>> possibly repairs would get more complicated which haven't been particularly
>> great in the past, and of course there's the issue of failure recovery
>> being only partially possible if you're looking at a durable block store
>> paired with an ephemeral one with some of your data not replicated to the
>> cold side.  That introduces a failure case that's unacceptable for most
>> teams, which results in needing to implement potentially 2 different backup
>> solutions.  This is operationally complex with a lot of surface area for
>> headaches.  I think a lot of teams would probably have an issue with the
>> big question mark around durability and I probably would avoid it myself.
>>
>> On the other hand, I'm +1 if we approach it slightly
>> differently - where _all_ the data is located on the cold storage, with the
>> local hot storage used as a cache.  This means we can use the cold
>> directories for the complete dataset, simplifying backups and node
>> replacements.
>>
>> For a little background, we had a ticket several years ago where I
>> pointed out it was possible to do this *today* at the operating system
>> level as long as you're using block devices (vs an object store) and LVM
>> [1].  For example, this works well with GP3 EBS w/ low IOPS provisioning +
>> local NVMe to get a nice balance of great read performance without going
>> nuts on the cost for IOPS.  I also wrote about this in a little more detail
>> in my blog [2].  There's also the new mount point tech in AWS which pretty
>> much does exactly what I've suggested above [3] that's probably worth
>> evaluating just to get a feel for it.
>>
>> I'm not insisting we require LVM or the AWS S3 fs, since that would rule
>> out other cloud providers, but I am pretty confident that the entire
>> dataset should reside in the "cold" side of things for the practical and
>> technical reasons I listed above.  I don't think it massively changes the
>> proposal, and should simplify things for everyone.
>>
>> Jon
>>
>> [1] https://issues.apache.org/jira/browse/CASSANDRA-8460
>> [2] https://rustyrazorblade.com/post/2018/2018-04-24-intro-to-lvm/
>> [3]
>> https://aws.amazon.com/about-aws/whats-new/2023/03/mountpoint-amazon-s3/
>>
>>
>> On Thu, Dec 14, 2023 at 1:56 AM Claude Warren  wrote:
>>
>>> Is there still interest in this?  Can we get some points down on
>>> electrons so that we all understand the issues?
>>>
>>> While it is fairly simple to redirect the read/write to something other
>>> than the local system for a single node this will not solve the problem for
>>> tiered storage.
>>>
>>> Tiered storage will require that on read/write the primary key be
>>> assessed to determine if the read/write should be redirected.  My
>>> reasoning for this statement is that in a cluster with a replication factor
>>> greater than 1 the node will store data for the keys that would be
>>> allocated to it in a cluster with a replication factor = 1, as well as some
>>> keys from nodes earlier in the ring.
>>>
>>> Even if we can get the primary keys for all the data we want to write to
>>> "cold storage" to map to a single node a replication factor > 1 means that
>>> data will also be placed in "normal storage" on subsequent nodes.
>>>
>>> To overcome this, we have to explore ways to route data to different
>>> storage based on the keys and that different storage may have to be
>>> available on _all_  the nodes.
>>>
>>> Have any of the partial solutions mentioned in this email chain (or
>>> others) solved this problem?
>>>
>>> Claude
>>>
>>


Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2024-09-18 Thread guo Maxwell
Is there any update on this topic?  It seems things could make big
progress if Jake Luciani can find someone who can make the
FileSystemProvider code accessible.

Jon Haddad  wrote on Sat, Dec 16, 2023 at 05:29:

> At a high level I really like the idea of being able to better leverage
> cheaper storage especially object stores like S3.
>
> One important thing though - I feel pretty strongly that there's a big,
> deal-breaking downside.  Backups, disk failure policies, snapshots and
> possibly repairs would get more complicated which haven't been particularly
> great in the past, and of course there's the issue of failure recovery
> being only partially possible if you're looking at a durable block store
> paired with an ephemeral one with some of your data not replicated to the
> cold side.  That introduces a failure case that's unacceptable for most
> teams, which results in needing to implement potentially 2 different backup
> solutions.  This is operationally complex with a lot of surface area for
> headaches.  I think a lot of teams would probably have an issue with the
> big question mark around durability and I probably would avoid it myself.
>
> On the other hand, I'm +1 if we approach it slightly differently
> - where _all_ the data is located on the cold storage, with the local hot
> storage used as a cache.  This means we can use the cold directories for
> the complete dataset, simplifying backups and node replacements.
>
> For a little background, we had a ticket several years ago where I pointed
> out it was possible to do this *today* at the operating system level as
> long as you're using block devices (vs an object store) and LVM [1].  For
> example, this works well with GP3 EBS w/ low IOPS provisioning + local NVMe
> to get a nice balance of great read performance without going nuts on the
> cost for IOPS.  I also wrote about this in a little more detail in my blog
> [2].  There's also the new mount point tech in AWS which pretty much does
> exactly what I've suggested above [3] that's probably worth evaluating just
> to get a feel for it.
>
> I'm not insisting we require LVM or the AWS S3 fs, since that would rule
> out other cloud providers, but I am pretty confident that the entire
> dataset should reside in the "cold" side of things for the practical and
> technical reasons I listed above.  I don't think it massively changes the
> proposal, and should simplify things for everyone.
>
> Jon
>
> [1] https://issues.apache.org/jira/browse/CASSANDRA-8460
> [2] https://rustyrazorblade.com/post/2018/2018-04-24-intro-to-lvm/
> [3]
> https://aws.amazon.com/about-aws/whats-new/2023/03/mountpoint-amazon-s3/
>
>
> On Thu, Dec 14, 2023 at 1:56 AM Claude Warren  wrote:
>
>> Is there still interest in this?  Can we get some points down on
>> electrons so that we all understand the issues?
>>
>> While it is fairly simple to redirect the read/write to something other
>> than the local system for a single node this will not solve the problem for
>> tiered storage.
>>
>> Tiered storage will require that on read/write the primary key be
>> assessed to determine if the read/write should be redirected.  My
>> reasoning for this statement is that in a cluster with a replication factor
>> greater than 1 the node will store data for the keys that would be
>> allocated to it in a cluster with a replication factor = 1, as well as some
>> keys from nodes earlier in the ring.
>>
>> Even if we can get the primary keys for all the data we want to write to
>> "cold storage" to map to a single node a replication factor > 1 means that
>> data will also be placed in "normal storage" on subsequent nodes.
>>
>> To overcome this, we have to explore ways to route data to different
>> storage based on the keys and that different storage may have to be
>> available on _all_  the nodes.
>>
>> Have any of the partial solutions mentioned in this email chain (or
>> others) solved this problem?
>>
>> Claude
>>
>


Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2023-12-15 Thread Jon Haddad
At a high level I really like the idea of being able to better leverage
cheaper storage, especially object stores like S3.

One important thing though - I feel pretty strongly that there's a big,
deal-breaking downside.  Backups, disk failure policies, snapshots and
possibly repairs would get more complicated which haven't been particularly
great in the past, and of course there's the issue of failure recovery
being only partially possible if you're looking at a durable block store
paired with an ephemeral one with some of your data not replicated to the
cold side.  That introduces a failure case that's unacceptable for most
teams, which results in needing to implement potentially 2 different backup
solutions.  This is operationally complex with a lot of surface area for
headaches.  I think a lot of teams would probably have an issue with the
big question mark around durability and I probably would avoid it myself.

On the other hand, I'm +1 if we approach it slightly differently
- where _all_ the data is located on the cold storage, with the local hot
storage used as a cache.  This means we can use the cold directories for
the complete dataset, simplifying backups and node replacements.

For a little background, we had a ticket several years ago where I pointed
out it was possible to do this *today* at the operating system level as
long as you're using block devices (vs an object store) and LVM [1].  For
example, this works well with GP3 EBS w/ low IOPS provisioning + local NVMe
to get a nice balance of great read performance without going nuts on the
cost for IOPS.  I also wrote about this in a little more detail in my blog
[2].  There's also the new mount point tech in AWS which pretty much does
exactly what I've suggested above [3] that's probably worth evaluating just
to get a feel for it.

I'm not insisting we require LVM or the AWS S3 fs, since that would rule
out other cloud providers, but I am pretty confident that the entire
dataset should reside in the "cold" side of things for the practical and
technical reasons I listed above.  I don't think it massively changes the
proposal, and should simplify things for everyone.

Jon

[1] https://issues.apache.org/jira/browse/CASSANDRA-8460
[2] https://rustyrazorblade.com/post/2018/2018-04-24-intro-to-lvm/
[3] https://aws.amazon.com/about-aws/whats-new/2023/03/mountpoint-amazon-s3/


On Thu, Dec 14, 2023 at 1:56 AM Claude Warren  wrote:

> Is there still interest in this?  Can we get some points down on electrons
> so that we all understand the issues?
>
> While it is fairly simple to redirect the read/write to something other
> than the local system for a single node this will not solve the problem for
> tiered storage.
>
> Tiered storage will require that on read/write the primary key be assessed
> to determine if the read/write should be redirected.  My reasoning for
> this statement is that in a cluster with a replication factor greater than
> 1 the node will store data for the keys that would be allocated to it in a
> cluster with a replication factor = 1, as well as some keys from nodes
> earlier in the ring.
>
> Even if we can get the primary keys for all the data we want to write to
> "cold storage" to map to a single node a replication factor > 1 means that
> data will also be placed in "normal storage" on subsequent nodes.
>
> To overcome this, we have to explore ways to route data to different
> storage based on the keys and that different storage may have to be
> available on _all_  the nodes.
>
> Have any of the partial solutions mentioned in this email chain (or
> others) solved this problem?
>
> Claude
>


Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2023-12-14 Thread Claude Warren
Is there still interest in this?  Can we get some points down on electrons so 
that we all understand the issues?

While it is fairly simple to redirect the read/write to something other  than 
the local system for a single node this will not solve the problem for tiered 
storage.

Tiered storage will require that on read/write the primary key be assessed to
determine if the read/write should be redirected.  My reasoning for this 
statement is that in a cluster with a replication factor greater than 1 the 
node will store data for the keys that would be allocated to it in a cluster 
with a replication factor = 1, as well as some keys from nodes earlier in the 
ring.

Even if we can get the primary keys for all the data we want to write to "cold 
storage" to map to a single node a replication factor > 1 means that data will 
also be placed in "normal storage" on subsequent nodes.

To overcome this, we have to explore ways to route data to different storage 
based on the keys and that different storage may have to be available on _all_  
the nodes.

Have any of the partial solutions mentioned in this email chain (or others) 
solved this problem?

Claude
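The routing problem Claude describes can be sketched in miniature: because a replication factor greater than 1 spreads every key across several nodes, tier selection has to be a pure function of the key (or its table) that every replica can evaluate on both the read and the write path. The Java sketch below is illustrative only; StorageTierRouter and coldPredicate are hypothetical names, not Cassandra code.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.function.Predicate;

/**
 * Toy sketch of key-based tier routing.  Every node holds the same routing
 * rule, so replicas agree on which tier a key's data lives in regardless of
 * where that key falls in the ring.
 */
public class StorageTierRouter {
    private final Path hotRoot;
    private final Path coldRoot;
    private final Predicate<String> coldPredicate; // e.g. by table, key prefix, or age

    public StorageTierRouter(Path hotRoot, Path coldRoot, Predicate<String> coldPredicate) {
        this.hotRoot = hotRoot;
        this.coldRoot = coldRoot;
        this.coldPredicate = coldPredicate;
    }

    /** Resolve the directory a partition key's data should live under. */
    public Path resolve(String partitionKey, String relativeSSTablePath) {
        Path root = coldPredicate.test(partitionKey) ? coldRoot : hotRoot;
        return root.resolve(relativeSSTablePath);
    }

    public static void main(String[] args) {
        StorageTierRouter router = new StorageTierRouter(
                Paths.get("/var/lib/cassandra/data"),
                Paths.get("/mnt/cold/cassandra/data"),
                key -> key.startsWith("archive"));   // toy predicate for the sketch
        System.out.println(router.resolve("archive:2019", "ks/tbl/nb-1-big-Data.db"));
        System.out.println(router.resolve("live:2023", "ks/tbl/nb-1-big-Data.db"));
    }
}
```

The sketch also makes the thread's objection concrete: the predicate must be available and identical on _all_ nodes, or replicas of the same key end up on different tiers.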


Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2023-10-31 Thread Claude Warren, Jr via dev
@henrik,  Have you made any progress on this?  I would like to help drive
it forward but I am waiting to see what your code looks like and figure out
what I need to do.  Any update on timeline would be appreciated.

On Mon, Oct 23, 2023 at 9:07 PM Jon Haddad 
wrote:

> I think this is great, and more generally useful than the two scenarios
> you've outlined.  I think it could / should be possible to use an object
> store as the primary storage for sstables and rely on local disk as a cache
> for reads.
>
> I don't know the roadmap for TCM, but imo if it allowed for more stable,
> pre-allocated ranges that compaction will always be aware of (plus a bunch
> of plumbing I'm deliberately avoiding the details on), then you could
> bootstrap a new node by copying s3 directories around rather than streaming
> data between nodes.  That's how we get to 20TB / node, easy scale up /
> down, etc, and always-ZCS for non-object store deployments.
>
> Jon
>
> On 2023/09/25 06:48:06 "Claude Warren, Jr via dev" wrote:
> > I have just filed CEP-36 [1] to allow for keyspace/table storage outside
> of
> > the standard storage space.
> >
> > There are two desires  driving this change:
> >
> >1. The ability to temporarily move some keyspaces/tables to storage
> >outside the normal directory tree to other disk so that compaction can
> >occur in situations where there is not enough disk space for
> compaction and
> >the processing to the moved data can not be suspended.
> >2. The ability to store infrequently used data on slower cheaper
> storage
> >layers.
> >
> > I have a working POC implementation [2] though there are some issues
> still
> > to be solved and much logging to be reduced.
> >
> > I look forward to productive discussions,
> > Claude
> >
> > [1]
> >
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-36%3A+A+Configurable+ChannelProxy+to+alias+external+storage+locations
> > [2] https://github.com/Claudenw/cassandra/tree/channel_proxy_factory
> >
>


Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2023-10-23 Thread Jon Haddad
I think this is great, and more generally useful than the two scenarios you've
outlined.  I think it could / should be possible to use an object store as the 
primary storage for sstables and rely on local disk as a cache for reads.  

I don't know the roadmap for TCM, but imo if it allowed for more stable, 
pre-allocated ranges that compaction will always be aware of (plus a bunch of 
plumbing I'm deliberately avoiding the details on), then you could bootstrap a 
new node by copying s3 directories around rather than streaming data between 
nodes.  That's how we get to 20TB / node, easy scale up / down, etc, and 
always-ZCS for non-object store deployments.

Jon
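Jon's model of the object store as primary storage with local disk as a read cache can be sketched as a read-through cache keyed on an sstable's relative path. This is a toy illustration under stated assumptions: two local directories stand in for the object store and the NVMe cache, and ReadThroughCache is a hypothetical name, not part of any proposed patch.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/**
 * Read-through cache sketch: the cold root holds the authoritative copy of
 * every file; the cache root is populated lazily on first read.  Losing the
 * cache loses no data, which is the durability property argued for above.
 */
public class ReadThroughCache {
    private final Path coldRoot;   // authoritative copy (object store stand-in)
    private final Path cacheRoot;  // local hot cache (NVMe stand-in)

    public ReadThroughCache(Path coldRoot, Path cacheRoot) {
        this.coldRoot = coldRoot;
        this.cacheRoot = cacheRoot;
    }

    /** Return a locally readable path, pulling the file from cold on a miss. */
    public Path open(String relative) throws IOException {
        Path cached = cacheRoot.resolve(relative);
        if (Files.exists(cached))
            return cached;                            // cache hit
        Path cold = coldRoot.resolve(relative);       // cache miss: copy whole file
        Files.createDirectories(cached.getParent());
        Files.copy(cold, cached, StandardCopyOption.REPLACE_EXISTING);
        return cached;
    }

    public static void main(String[] args) throws IOException {
        Path cold = Files.createTempDirectory("cold");
        Path cache = Files.createTempDirectory("cache");
        Files.createDirectories(cold.resolve("ks/tbl"));
        Files.writeString(cold.resolve("ks/tbl/nb-1-big-Data.db"), "sstable bytes");
        ReadThroughCache c = new ReadThroughCache(cold, cache);
        System.out.println(Files.readString(c.open("ks/tbl/nb-1-big-Data.db")));
    }
}
```

A real implementation would of course read ranges rather than whole files and handle eviction; the point of the sketch is only that the complete dataset lives on one side.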

On 2023/09/25 06:48:06 "Claude Warren, Jr via dev" wrote:
> I have just filed CEP-36 [1] to allow for keyspace/table storage outside of
> the standard storage space.
> 
> There are two desires  driving this change:
> 
>1. The ability to temporarily move some keyspaces/tables to storage
>outside the normal directory tree to other disk so that compaction can
>occur in situations where there is not enough disk space for compaction and
>the processing to the moved data can not be suspended.
>2. The ability to store infrequently used data on slower cheaper storage
>layers.
> 
> I have a working POC implementation [2] though there are some issues still
> to be solved and much logging to be reduced.
> 
> I look forward to productive discussions,
> Claude
> 
> [1]
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-36%3A+A+Configurable+ChannelProxy+to+alias+external+storage+locations
> [2] https://github.com/Claudenw/cassandra/tree/channel_proxy_factory
> 


Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2023-10-19 Thread Claude Warren, Jr via dev
>>>> Why does DatabaseDescriptor.getNonLocalSystemKeyspacesDataFileLocations() return an
>>>> array?
>>>>
>>>> Should the system set the path to the root of the ColumnFamilyStore in
>>>> the ColumnFamilyStore directories instance?
>>>> Should the Directories.getLocationForDisk() do the proxy to the other
>>>> file system?
>>>>
>>>> Where is the proper location to change from the standard internal
>>>> representation to the remote location?
>>>>
>>>>
>>>> On Fri, Sep 29, 2023 at 8:07 AM Claude Warren, Jr <
>>>> [email protected]> wrote:
>>>>
>>>>> Sorry I was out sick and did not respond yesterday.
>>>>>
>>>>> Henrik,  How does your system work?  What is the design strategy?
>>>>> Also is your code available somewhere?
>>>>>
>>>>> After looking at the code some more I think that the best solution is
>>>>> not a FileChannelProxy but to modify the Cassandra File class to get a
>>>>> FileSystem object for a Factory to build the Path that is used within that
>>>>> object.  I think that this makes it a very small change that will pick up
>>>>> 90+% of the cases.  We then just need to find the edge cases.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Sep 29, 2023 at 1:14 AM German Eichberger via dev <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Super excited about this as well. Happy to help test with Azure and
>>>>>> any other way needed.
>>>>>>
>>>>>> Thanks,
>>>>>> German
>>>>>> --
>>>>>> *From:* guo Maxwell 
>>>>>> *Sent:* Wednesday, September 27, 2023 7:38 PM
>>>>>> *To:* [email protected] 
>>>>>> *Subject:* [EXTERNAL] Re: [DISCUSS] CEP-36: A Configurable
>>>>>> ChannelProxy to alias external storage locations
>>>>>>
>>>>>> Thanks.  So I think a jira can be created now. And I'd be happy to
>>>>>> provide some help with this as well if needed.
>>>>>>
>>>>>> Henrik Ingo  wrote on Thu, Sep 28, 2023 at 00:21:
>>>>>>
>>>>>> It seems I was volunteered to rebase the Astra implementation of this
>>>>>> functionality (FileSystemProvider) onto Cassandra trunk. (And publish it,
>>>>>> of course) I'll try to get going today or tomorrow, so that this
>>>>>> discussion can then benefit from having that code available for 
>>>>>> inspection.
>>>>>> And potentially using it as a solution to this use case.
>>>>>>
>>>>>> On Tue, Sep 26, 2023 at 8:04 PM Jake Luciani 
>>>>>> wrote:
>>>>>>
>>>>>> We (DataStax) have a FileSystemProvider for Astra we can provide.
>>>>>> Works with S3/GCS/Azure.
>>>>>>
>>>>>> I'll ask someone on our end to make it accessible.
>>>>>>
>>>>>> This would work by having a bucket prefix per node. But there are lots
>>>>>> of details needed to support things like out of bound compaction
>>>>>> (mentioned in CEP).
>>>>>>
>>>>>> Jake
>>>>>>
>>>>>> On Tue, Sep 26, 2023 at 12:56 PM Benedict 
>>>>>> wrote:
>>>>>> >
>>>>>> > I agree with Ariel, the more suitable insertion point is probably
>>>>>> the JDK level FileSystemProvider and FileSystem abstraction.
>>>>>> >
>>>>>> > It might also be that we can reuse existing work here in some cases?
>>>>>> >
>>>>>> > On 26 Sep 2023, at 17:49, Ariel Weisberg  wrote:
>>>>>> >
>>>>>> > 
>>>>>> > Hi,
>>>>>> >
>>>>>> > Support for multiple storage backends including remote storage
>>>>>> backends is a pretty high value piece of functionality. I am happy to see
>>>>>> there is interest in that.
>>>>>> >
>>>>>> > I think that `ChannelProxyFactory` as an integration point is going
>>>>>> to quickly turn into a dead end as we get into really using multiple
>>>>>> storage backends. We need to be able to list

Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2023-10-18 Thread guo Maxwell
If it is ok for Henrik to rebase the Astra implementation of this
functionality (FileSystemProvider) onto Cassandra trunk, then we can create
a jira to move this forward a small step.

Claude Warren, Jr  wrote on Wed, Oct 18, 2023 at 15:05:

> Henrik and Guo,
>
> Have you moved forward on this topic?  I have not seen anything recently.
> I have posted a solution that intercepts calls for directories and injects
> directories from different FileSystems.  This means that a node can have
> keyspaces both on the local file system and one or more other FileSystem
> implementations.
>
> I look forward to hearing from you,
> Claude
>
>
> On Wed, Oct 18, 2023 at 9:00 AM Claude Warren, Jr 
> wrote:
>
>> After a bit more analysis and some testing I have a new branch that I
>> think solves the problem. [1]  I have also created a pull request internal
>> to my clone so that it is easy to see the changes. [2]
>>
>> The strategy change is to move the insertion of the proxy from the
>> Cassandra File class to the Directories class.  This means that all action
>> with the table is captured (this solves a problem encountered in the
>> earlier strategy).
>> The strategy is to create a path on a different FileSystem and return
>> that.  The example code only moves the data for the table to another
>> directory on the same FileSystem but using a different FileSystem
>> implementation should be a trivial change.
>>
>> The current code works on an entire keyspace.  While code exists to
>> limit the redirect to a table, I have not tested that branch yet and am not
>> certain that it will work.  There is also some code (i.e. the PathParser)
>> that may no longer be needed but has not been removed yet.
>>
>> Please take a look and let me know if you see any issues with this
>> solution.
>>
>> Claude
>>
>> [1] https://github.com/Claudenw/cassandra/tree/FileSystemProxy
>> [2] https://github.com/Claudenw/cassandra/pull/5/files
>>
>>
>>
>> On Tue, Oct 10, 2023 at 10:28 AM Claude Warren, Jr <
>> [email protected]> wrote:
>>
>>> I have been exploring adding a second Path to the Cassandra File
>>> object.  The original path being the path within the standard Cassandra
>>> directory tree and the second being a translated path when there is what
>>> was called a ChannelProxy in place.
>>>
>>> A problem arises when the Directories.getLocationForDisk() is called.
>>> It seems to be looking for locations that start with the data directory
>>> absolute path.   I can change it to make it look for the original path not
>>> the translated path.  But in other cases the translated path is the one
>>> that is needed.
>>>
>>> I notice that there is a concept of multiple file locations in the code
>>> base, particularly in the Directories.DataDirectories class where there are
>>> "locationsForNonSystemKeyspaces" and "locationsForSystemKeyspace" in the
>>> constructor, and in the
>>> DatabaseDescriptor.getNonLocalSystemKeyspacesDataFileLocations() method
>>> which returns an array of String and is populated from the cassandra.yaml
>>> file.
>>>
>>> The DatabaseDescriptor.getNonLocalSystemKeyspacesDataFileLocations()
>>> only ever seems to return an array of one item.
>>>
>>> Why does
>>> DatabaseDescriptor.getNonLocalSystemKeyspacesDataFileLocations()  return an
>>> array?
>>>
>>> Should the system set the path to the root of the ColumnFamilyStore in
>>> the ColumnFamilyStore directories instance?
>>> Should the Directories.getLocationForDisk() do the proxy to the other
>>> file system?
>>>
>>> Where is the proper location to change from the standard internal
>>> representation to the remote location?
>>>
>>>
>>> On Fri, Sep 29, 2023 at 8:07 AM Claude Warren, Jr <
>>> [email protected]> wrote:
>>>
>>>> Sorry I was out sick and did not respond yesterday.
>>>>
>>>> Henrik,  How does your system work?  What is the design strategy?  Also
>>>> is your code available somewhere?
>>>>
>>>> After looking at the code some more I think that the best solution is
>>>> not a FileChannelProxy but to modify the Cassandra File class to get a
>>>> FileSystem object for a Factory to build the Path that is used within that
>>>> object.  I think that this makes it a very small change that will pick up
>>>> 90+% of the cases.  We then just need to find the ed

Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2023-10-18 Thread Claude Warren, Jr via dev
Henrik and Guo,

Have you moved forward on this topic?  I have not seen anything recently.
I have posted a solution that intercepts calls for directories and injects
directories from different FileSystems.  This means that a node can have
keyspaces both on the local file system and one or more other FileSystem
implementations.

I look forward to hearing from you,
Claude


On Wed, Oct 18, 2023 at 9:00 AM Claude Warren, Jr 
wrote:

> After a bit more analysis and some testing I have a new branch that I
> think solves the problem. [1]  I have also created a pull request internal
> to my clone so that it is easy to see the changes. [2]
>
> The strategy change is to move the insertion of the proxy from the
> Cassandra File class to the Directories class.  This means that all action
> with the table is captured (this solves a problem encountered in the
> earlier strategy).
> The strategy is to create a path on a different FileSystem and return
> that.  The example code only moves the data for the table to another
> directory on the same FileSystem but using a different FileSystem
> implementation should be a trivial change.
>
> The current code works on an entire keyspace.  While code exists to
> limit the redirect to a table, I have not tested that branch yet and am not
> certain that it will work.  There is also some code (i.e. the PathParser)
> that may no longer be needed but has not been removed yet.
>
> Please take a look and let me know if you see any issues with this
> solution.
>
> Claude
>
> [1] https://github.com/Claudenw/cassandra/tree/FileSystemProxy
> [2] https://github.com/Claudenw/cassandra/pull/5/files
>
>
>
> On Tue, Oct 10, 2023 at 10:28 AM Claude Warren, Jr 
> wrote:
>
>> I have been exploring adding a second Path to the Cassandra File object.
>> The original path being the path within the standard Cassandra directory
>> tree and the second being a translated path when there is what was called a
>> ChannelProxy in place.
>>
>> A problem arises when the Directories.getLocationForDisk() is called.  It
>> seems to be looking for locations that start with the data directory
>> absolute path.   I can change it to make it look for the original path not
>> the translated path.  But in other cases the translated path is the one
>> that is needed.
>>
>> I notice that there is a concept of multiple file locations in the code
>> base, particularly in the Directories.DataDirectories class where there are
>> "locationsForNonSystemKeyspaces" and "locationsForSystemKeyspace" in the
>> constructor, and in the
>> DatabaseDescriptor.getNonLocalSystemKeyspacesDataFileLocations() method
>> which returns an array of String and is populated from the cassandra.yaml
>> file.
>>
>> The DatabaseDescriptor.getNonLocalSystemKeyspacesDataFileLocations()
>> only ever seems to return an array of one item.
>>
>> Why does
>> DatabaseDescriptor.getNonLocalSystemKeyspacesDataFileLocations()  return an
>> array?
>>
>> Should the system set the path to the root of the ColumnFamilyStore in
>> the ColumnFamilyStore directories instance?
>> Should the Directories.getLocationForDisk() do the proxy to the other
>> file system?
>>
>> Where is the proper location to change from the standard internal
>> representation to the remote location?
>>
>>
>> On Fri, Sep 29, 2023 at 8:07 AM Claude Warren, Jr 
>> wrote:
>>
>>> Sorry I was out sick and did not respond yesterday.
>>>
>>> Henrik,  How does your system work?  What is the design strategy?  Also
>>> is your code available somewhere?
>>>
>>> After looking at the code some more I think that the best solution is
>>> not a FileChannelProxy but to modify the Cassandra File class to get a
>>> FileSystem object for a Factory to build the Path that is used within that
>>> object.  I think that this makes it a very small change that will pick up
>>> 90+% of the cases.  We then just need to find the edge cases.
>>>
>>>
>>>
>>>
>>>
>>> On Fri, Sep 29, 2023 at 1:14 AM German Eichberger via dev <
>>> [email protected]> wrote:
>>>
>>>> Super excited about this as well. Happy to help test with Azure and any
>>>> other way needed.
>>>>
>>>> Thanks,
>>>> German
>>>> --
>>>> *From:* guo Maxwell 
>>>> *Sent:* Wednesday, September 27, 2023 7:38 PM
>>>> *To:* [email protected] 
>>>> *Subject:* [EXTERNAL] Re: [DISCUSS] CEP-36: A Configurable
>&g

Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2023-10-18 Thread Claude Warren, Jr via dev
After a bit more analysis and some testing I have a new branch that I think
solves the problem. [1]  I have also created a pull request internal to my
clone so that it is easy to see the changes. [2]

The strategy change is to move the insertion of the proxy from the
Cassandra File class to the Directories class.  This means that all action
with the table is captured (this solves a problem encountered in the
earlier strategy).
The strategy is to create a path on a different FileSystem and return
that.  The example code only moves the data for the table to another
directory on the same FileSystem but using a different FileSystem
implementation should be a trivial change.

The current code works on an entire keyspace.  While code exists to
limit the redirect to a table, I have not tested that branch yet and am not
certain that it will work.  There is also some code (i.e. the PathParser)
that may no longer be needed but has not been removed yet.

Please take a look and let me know if you see any issues with this solution.

Claude

[1] https://github.com/Claudenw/cassandra/tree/FileSystemProxy
[2] https://github.com/Claudenw/cassandra/pull/5/files



On Tue, Oct 10, 2023 at 10:28 AM Claude Warren, Jr 
wrote:

> I have been exploring adding a second Path to the Cassandra File object.
> The original path being the path within the standard Cassandra directory
> tree and the second being a translated path when there is what was called a
> ChannelProxy in place.
>
> A problem arises when the Directories.getLocationForDisk() is called.  It
> seems to be looking for locations that start with the data directory
> absolute path.   I can change it to make it look for the original path not
> the translated path.  But in other cases the translated path is the one
> that is needed.
>
> I notice that there is a concept of multiple file locations in the code
> base, particularly in the Directories.DataDirectories class where there are
> "locationsForNonSystemKeyspaces" and "locationsForSystemKeyspace" in the
> constructor, and in the
> DatabaseDescriptor.getNonLocalSystemKeyspacesDataFileLocations() method
> which returns an array of String and is populated from the cassandra.yaml
> file.
>
> The DatabaseDescriptor.getNonLocalSystemKeyspacesDataFileLocations()  only
> ever seems to return an array of one item.
>
> Why does DatabaseDescriptor.getNonLocalSystemKeyspacesDataFileLocations()
> return an array?
>
> Should the system set the path to the root of the ColumnFamilyStore in the
> ColumnFamilyStore directories instance?
> Should the Directories.getLocationForDisk() do the proxy to the other file
> system?
>
> Where is the proper location to change from the standard internal
> representation to the remote location?
>
>
> On Fri, Sep 29, 2023 at 8:07 AM Claude Warren, Jr 
> wrote:
>
>> Sorry I was out sick and did not respond yesterday.
>>
>> Henrik,  How does your system work?  What is the design strategy?  Also
>> is your code available somewhere?
>>
>> After looking at the code some more I think that the best solution is not
>> a FileChannelProxy but to modify the Cassandra File class to get a
>> FileSystem object for a Factory to build the Path that is used within that
>> object.  I think that this makes it a very small change that will pick up
>> 90+% of the cases.  We then just need to find the edge cases.
>>
>>
>>
>>
>>
>> On Fri, Sep 29, 2023 at 1:14 AM German Eichberger via dev <
>> [email protected]> wrote:
>>
>>> Super excited about this as well. Happy to help test with Azure and any
>>> other way needed.
>>>
>>> Thanks,
>>> German
>>> --
>>> *From:* guo Maxwell 
>>> *Sent:* Wednesday, September 27, 2023 7:38 PM
>>> *To:* [email protected] 
>>> *Subject:* [EXTERNAL] Re: [DISCUSS] CEP-36: A Configurable ChannelProxy
>>> to alias external storage locations
>>>
>>> Thanks.  So I think a jira can be created now. And I'd be happy to
>>> provide some help with this as well if needed.
>>>
>>> Henrik Ingo  wrote on Thu, Sep 28, 2023 at 00:21:
>>>
>>> It seems I was volunteered to rebase the Astra implementation of this
>>> functionality (FileSystemProvider) onto Cassandra trunk. (And publish it,
>>> of course) I'll try to get going today or tomorrow, so that this
>>> discussion can then benefit from having that code available for inspection.
>>> And potentially using it as a solution to this use case.
>>>
>>> On Tue, Sep 26, 2023 at 8:04 PM Jake Luciani  wrote:
>>>
>>> We (DataStax) have a FileSystemProvid

Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2023-10-10 Thread Claude Warren, Jr via dev
I have been exploring adding a second Path to the Cassandra File object.
The original path being the path within the standard Cassandra directory
tree and the second being a translated path when there is what was called a
ChannelProxy in place.

A problem arises when the Directories.getLocationForDisk() is called.  It
seems to be looking for locations that start with the data directory
absolute path.   I can change it to make it look for the original path not
the translated path.  But in other cases the translated path is the one
that is needed.

I notice that there is a concept of multiple file locations in the code
base, particularly in the Directories.DataDirectories class where there are
"locationsForNonSystemKeyspaces" and "locationsForSystemKeyspace" in the
constructor, and in the
DatabaseDescriptor.getNonLocalSystemKeyspacesDataFileLocations() method
which returns an array of String and is populated from the cassandra.yaml
file.

The DatabaseDescriptor.getNonLocalSystemKeyspacesDataFileLocations()  only
ever seems to return an array of one item.

Why does DatabaseDescriptor.getNonLocalSystemKeyspacesDataFileLocations()
return an array?

Should the system set the path to the root of the ColumnFamilyStore in the
ColumnFamilyStore directories instance?
Should the Directories.getLocationForDisk() do the proxy to the other file
system?

Where is the proper location to change from the standard internal
representation to the remote location?
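The Factory-built-Path idea above can be sketched in a few lines. Everything below is illustrative — the `PathFactory` name and shape are hypothetical, not Cassandra APIs: a factory decides which `FileSystem` a logical path resolves against, so call sites keep using the original path while the factory supplies the translated one.

```java
import java.nio.file.*;

// Illustrative sketch only: PathFactory is a hypothetical name, not a Cassandra API.
public class PathFactoryDemo {
    interface PathFactory {
        Path create(String first, String... more);
    }

    // Default behaviour: resolve against the local filesystem, as today.
    static final PathFactory LOCAL = Paths::get;

    // Aliased behaviour: re-root the same logical path under another FileSystem,
    // e.g. one backed by remote storage.
    static PathFactory aliased(FileSystem fs, String root) {
        return (first, more) -> {
            String[] all = new String[more.length + 1];
            all[0] = first;
            System.arraycopy(more, 0, all, 1, more.length);
            return fs.getPath(root, all);
        };
    }

    public static void main(String[] args) {
        // On a Unix default filesystem these print:
        //   data/ks1/table1
        //   /mnt/remote/data/ks1/table1
        System.out.println(LOCAL.create("data", "ks1", "table1"));
        System.out.println(aliased(FileSystems.getDefault(), "/mnt/remote")
                .create("data", "ks1", "table1"));
    }
}
```

The open question of where the original vs. translated path is needed then becomes a question of which factory a given call site asks.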


On Fri, Sep 29, 2023 at 8:07 AM Claude Warren, Jr 
wrote:

> Sorry I was out sick and did not respond yesterday.
>
> Henrik,  How does your system work?  What is the design strategy?  Also is
> your code available somewhere?
>
> After looking at the code some more I think that the best solution is not
> a FileChannelProxy but to modify the Cassandra File class to get a
> FileSystem object for a Factory to build the Path that is used within that
> object.  I think that this makes it a very small change that will pick up
> 90+% of the cases.  We then just need to find the edge cases.
>
>
>
>
>
> On Fri, Sep 29, 2023 at 1:14 AM German Eichberger via dev <
> [email protected]> wrote:
>
>> Super excited about this as well. Happy to help test with Azure and any
>> other way needed.
>>
>> Thanks,
>> German
>> --
>> *From:* guo Maxwell 
>> *Sent:* Wednesday, September 27, 2023 7:38 PM
>> *To:* [email protected] 
>> *Subject:* [EXTERNAL] Re: [DISCUSS] CEP-36: A Configurable ChannelProxy
>> to alias external storage locations
>>
>> Thanks, so I think a JIRA can be created now. And I'd be happy to
>> provide some help with this as well if needed.
>>
>> Henrik Ingo wrote on Thu, Sep 28, 2023 at 00:21:
>>
>> It seems I was volunteered to rebase the Astra implementation of this
>> functionality (FileSystemProvider) onto Cassandra trunk. (And publish it,
>> of course) I'll try to get going today or tomorrow, so that this
>> discussion can then benefit from having that code available for inspection.
>> And potentially using it as a solution to this use case.
>>
>> On Tue, Sep 26, 2023 at 8:04 PM Jake Luciani  wrote:
>>
>> We (DataStax) have a FileSystemProvider for Astra we can provide.
>> Works with S3/GCS/Azure.
>>
>> I'll ask someone on our end to make it accessible.
>>
>> This would work by having a bucket prefix per node. But there are lots
>> of details needed to support things like out of bound compaction
>> (mentioned in CEP).
>>
>> Jake
>>
>> On Tue, Sep 26, 2023 at 12:56 PM Benedict  wrote:
>> >
>> > I agree with Ariel, the more suitable insertion point is probably the
>> JDK level FileSystemProvider and FileSystem abstraction.
>> >
>> > It might also be that we can reuse existing work here in some cases?
>> >
>> > On 26 Sep 2023, at 17:49, Ariel Weisberg  wrote:
>> >
>> > 
>> > Hi,
>> >
>> > Support for multiple storage backends including remote storage backends
>> is a pretty high value piece of functionality. I am happy to see there is
>> interest in that.
>> >
>> > I think that `ChannelProxyFactory` as an integration point is going to
>> quickly turn into a dead end as we get into really using multiple storage
>> backends. We need to be able to list files and really the full range of
>> filesystem interactions that Java supports should work with any backend to
>> make development, testing, and using existing code straightforward.
>> >
>> > It's a little more work to get C* to create paths for alternate
>> backends where appropriate, but that work is probably necessary even with
>> `ChannelProxyFactory` and munging UNIX paths (vs supporting multiple
>> Filesystems).

Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2023-09-28 Thread Claude Warren, Jr via dev
Sorry I was out sick and did not respond yesterday.

Henrik,  How does your system work?  What is the design strategy?  Also is
your code available somewhere?

After looking at the code some more I think that the best solution is not a
FileChannelProxy but to modify the Cassandra File class to get a FileSystem
object for a Factory to build the Path that is used within that object.  I
think that this makes it a very small change that will pick up 90+% of the
cases.  We then just need to find the edge cases.





On Fri, Sep 29, 2023 at 1:14 AM German Eichberger via dev <
[email protected]> wrote:

> Super excited about this as well. Happy to help test with Azure and any
> other way needed.
>
> Thanks,
> German
> --
> *From:* guo Maxwell 
> *Sent:* Wednesday, September 27, 2023 7:38 PM
> *To:* [email protected] 
> *Subject:* [EXTERNAL] Re: [DISCUSS] CEP-36: A Configurable ChannelProxy
> to alias external storage locations
>
> Thanks, so I think a JIRA can be created now. And I'd be happy to provide
> some help with this as well if needed.
>
> Henrik Ingo wrote on Thu, Sep 28, 2023 at 00:21:
>
> It seems I was volunteered to rebase the Astra implementation of this
> functionality (FileSystemProvider) onto Cassandra trunk. (And publish it,
> of course) I'll try to get going today or tomorrow, so that this
> discussion can then benefit from having that code available for inspection.
> And potentially using it as a solution to this use case.
>
> On Tue, Sep 26, 2023 at 8:04 PM Jake Luciani  wrote:
>
> We (DataStax) have a FileSystemProvider for Astra we can provide.
> Works with S3/GCS/Azure.
>
> I'll ask someone on our end to make it accessible.
>
> This would work by having a bucket prefix per node. But there are lots
> of details needed to support things like out of bound compaction
> (mentioned in CEP).
>
> Jake
>
> On Tue, Sep 26, 2023 at 12:56 PM Benedict  wrote:
> >
> > I agree with Ariel, the more suitable insertion point is probably the
> JDK level FileSystemProvider and FileSystem abstraction.
> >
> > It might also be that we can reuse existing work here in some cases?
> >
> > On 26 Sep 2023, at 17:49, Ariel Weisberg  wrote:
> >
> > 
> > Hi,
> >
> > Support for multiple storage backends including remote storage backends
> is a pretty high value piece of functionality. I am happy to see there is
> interest in that.
> >
> > I think that `ChannelProxyFactory` as an integration point is going to
> quickly turn into a dead end as we get into really using multiple storage
> backends. We need to be able to list files and really the full range of
> filesystem interactions that Java supports should work with any backend to
> make development, testing, and using existing code straightforward.
> >
> > It's a little more work to get C* to create paths for alternate
> backends where appropriate, but that work is probably necessary even with
> `ChannelProxyFactory` and munging UNIX paths (vs supporting multiple
> Filesystems). There will probably also be backend-specific behaviors that
> show up above the `ChannelProxy` layer that will depend on the backend.
> >
> > Ideally there would be some config to specify several backend
> filesystems and their individual configuration that can be used, as well as
> configuration and support for a "backend file router" for file creation
> (and opening) that can be used to route files to the backend most
> appropriate.
> >
> > Regards,
> > Ariel
> >
> > On Mon, Sep 25, 2023, at 2:48 AM, Claude Warren, Jr via dev wrote:
> >
> > I have just filed CEP-36 [1] to allow for keyspace/table storage outside
> of the standard storage space.
> >
> > There are two desires  driving this change:
> >
> > The ability to temporarily move some keyspaces/tables to storage outside
> the normal directory tree to other disk so that compaction can occur in
> situations where there is not enough disk space for compaction and the
> processing to the moved data can not be suspended.
> > The ability to store infrequently used data on slower cheaper storage
> layers.
> >
> > I have a working POC implementation [2] though there are some issues
> still to be solved and much logging to be reduced.
> >
> > I look forward to productive discussions,
> > Claude
> >
> > [1]
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-36%3A+A+Configurable+ChannelProxy+to+alias+external+storage+locations
> > [2] https://github.com/Claudenw/cassandra/tree/channel_proxy_factory
> >
> >
> >
>
>
> --
> http://twitter.com/tjake
>
>
>
> --
>
> Henrik Ingo
>
> c. +358 40 569 7354
>
> w. www.datastax.com
>
> <https://www.facebook.com/datastax>  <https://twitter.com/datastax>
> <https://www.linkedin.com/company/datastax/>
> <https://github.com/datastax/>
>
>
>
> --
> you are the apple of my eye !
>


Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2023-09-28 Thread German Eichberger via dev
Super excited about this as well. Happy to help test with Azure and any other 
way needed.

Thanks,
German

From: guo Maxwell 
Sent: Wednesday, September 27, 2023 7:38 PM
To: [email protected] 
Subject: [EXTERNAL] Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias 
external storage locations

Thanks, so I think a JIRA can be created now. And I'd be happy to provide some
help with this as well if needed.

Henrik Ingo wrote on Thu, Sep 28, 2023 at 00:21:
It seems I was volunteered to rebase the Astra implementation of this 
functionality (FileSystemProvider) onto Cassandra trunk. (And publish it, of 
course) I'll try to get going today or tomorrow, so that this  discussion can 
then benefit from having that code available for inspection. And potentially 
using it as a solution to this use case.

On Tue, Sep 26, 2023 at 8:04 PM Jake Luciani wrote:
We (DataStax) have a FileSystemProvider for Astra we can provide.
Works with S3/GCS/Azure.

I'll ask someone on our end to make it accessible.

This would work by having a bucket prefix per node. But there are lots
of details needed to support things like out of bound compaction
(mentioned in CEP).

Jake

On Tue, Sep 26, 2023 at 12:56 PM Benedict wrote:
>
> I agree with Ariel, the more suitable insertion point is probably the JDK 
> level FileSystemProvider and FileSystem abstraction.
>
> It might also be that we can reuse existing work here in some cases?
>
> On 26 Sep 2023, at 17:49, Ariel Weisberg wrote:
>
> 
> Hi,
>
> Support for multiple storage backends including remote storage backends is a 
> pretty high value piece of functionality. I am happy to see there is interest 
> in that.
>
> I think that `ChannelProxyFactory` as an integration point is going to 
> quickly turn into a dead end as we get into really using multiple storage 
> backends. We need to be able to list files and really the full range of 
> filesystem interactions that Java supports should work with any backend to 
> make development, testing, and using existing code straightforward.
>
> It's a little more work to get C* to create paths for alternate backends
> where appropriate, but that work is probably necessary even with
> `ChannelProxyFactory` and munging UNIX paths (vs supporting multiple
> Filesystems). There will probably also be backend-specific behaviors that show
> up above the `ChannelProxy` layer that will depend on the backend.
>
> Ideally there would be some config to specify several backend filesystems and 
> their individual configuration that can be used, as well as configuration and 
> support for a "backend file router" for file creation (and opening) that can 
> be used to route files to the backend most appropriate.
>
> Regards,
> Ariel
>
> On Mon, Sep 25, 2023, at 2:48 AM, Claude Warren, Jr via dev wrote:
>
> I have just filed CEP-36 [1] to allow for keyspace/table storage outside of 
> the standard storage space.
>
> There are two desires  driving this change:
>
> The ability to temporarily move some keyspaces/tables to storage outside the 
> normal directory tree to other disk so that compaction can occur in 
> situations where there is not enough disk space for compaction and the 
> processing to the moved data can not be suspended.
> The ability to store infrequently used data on slower cheaper storage layers.
>
> I have a working POC implementation [2] though there are some issues still to 
> be solved and much logging to be reduced.
>
> I look forward to productive discussions,
> Claude
>
> [1] 
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-36%3A+A+Configurable+ChannelProxy+to+alias+external+storage+locations
> [2] https://github.com/Claudenw/cassandra/tree/channel_proxy_factory
>
>
>


--
http://twitter.com/tjake


--


Henrik Ingo

c. +358 40 569 7354

w. www.datastax.com<http://www.datastax.com/>


Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2023-09-27 Thread Jeff Jirsa
Claude if you’re still at POC phase does it make sense for you to perhaps help validate / qualify the work that Henrik seems willing to rebase rather than reinventing the wheel?

On Sep 26, 2023, at 11:23 PM, Claude Warren, Jr via dev wrote:

I spent a little (very little) time building an S3 implementation using an Apache licensed S3 filesystem package. I have not yet tested it but if anyone is interested it is at https://github.com/Aiven-Labs/S3-Cassandra-ChannelProxy

In looking at some of the code I think the Cassandra File class needs to be modified to ask the ChannelProxy for the default file system for the file in question. This should resolve some of the issues my original demo has with some files being created in the data tree. It may also handle many of the cases for offline tools as well.

On Tue, Sep 26, 2023 at 7:33 PM Miklosovic, Stefan <[email protected]> wrote:

Would it be possible to make Jimfs integration production-ready then? I see we are using it in the tests already.

It might be one of the reference implementations of this CEP. If there is a type of workload / type of nodes with plenty of RAM but no disk, some kind of compute nodes, it would just hold it all in memory and we might "flush" it to a cloud-based storage if rendered to be not necessary anymore (whatever that means).

We could then completely bypass the memtables as fetching data from an SSTable from memory would be basically roughly same?

On the other hand, that might be achieved by creating a ramdisk so I am not sure what exactly we would gain here. However, if it was eventually storing these SSTables in a cloud storage, we might "compact" "TWCS tables" automatically after so-and-so period by moving them there.


From: Jake Luciani <[email protected]>
Sent: Tuesday, September 26, 2023 19:03
To: [email protected]
Subject: Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

NetApp Security WARNING: This is an external email. Do not click links or open attachments unless you recognize the sender and know the content is safe.




We (DataStax) have a FileSystemProvider for Astra we can provide.
Works with S3/GCS/Azure.

I'll ask someone on our end to make it accessible.

This would work by having a bucket prefix per node. But there are lots
of details needed to support things like out of bound compaction
(mentioned in CEP).

Jake

On Tue, Sep 26, 2023 at 12:56 PM Benedict <[email protected]> wrote:
>
> I agree with Ariel, the more suitable insertion point is probably the JDK level FileSystemProvider and FileSystem abstraction.
>
> It might also be that we can reuse existing work here in some cases?
>
> On 26 Sep 2023, at 17:49, Ariel Weisberg <[email protected]> wrote:
>
> 
> Hi,
>
> Support for multiple storage backends including remote storage backends is a pretty high value piece of functionality. I am happy to see there is interest in that.
>
> I think that `ChannelProxyFactory` as an integration point is going to quickly turn into a dead end as we get into really using multiple storage backends. We need to be able to list files and really the full range of filesystem interactions that Java supports should work with any backend to make development, testing, and using existing code straightforward.
>
> It's a little more work to get C* to create paths for alternate backends where appropriate, but that work is probably necessary even with `ChannelProxyFactory` and munging UNIX paths (vs supporting multiple Filesystems). There will probably also be backend-specific behaviors that show up above the `ChannelProxy` layer that will depend on the backend.
>
> Ideally there would be some config to specify several backend filesystems and their individual configuration that can be used, as well as configuration and support for a "backend file router" for file creation (and opening) that can be used to route files to the backend most appropriate.
>
> Regards,
> Ariel
>
> On Mon, Sep 25, 2023, at 2:48 AM, Claude Warren, Jr via dev wrote:
>
> I have just filed CEP-36 [1] to allow for keyspace/table storage outside of the standard storage space.
>
> There are two desires  driving this change:
>
> The ability to temporarily move some keyspaces/tables to storage outside the normal directory tree to other disk so that compaction can occur in situations where there is not enough disk space for compaction and the processing to the moved data can not be suspended.
> The ability to store infrequently used data on slower cheaper storage layers.
>
> I have a working POC implementation [2] though there are some issues still to be solved and much logging to be reduced.
>
> I look forward to productive discussions,
> Claude
>
> [1] https://cwiki.apache.org/conflue

Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2023-09-27 Thread guo Maxwell
Thanks, so I think a JIRA can be created now. And I'd be happy to provide
some help with this as well if needed.

Henrik Ingo wrote on Thu, Sep 28, 2023 at 00:21:

> It seems I was volunteered to rebase the Astra implementation of this
> functionality (FileSystemProvider) onto Cassandra trunk. (And publish it,
> of course) I'll try to get going today or tomorrow, so that this
> discussion can then benefit from having that code available for inspection.
> And potentially using it as a solution to this use case.
>
> On Tue, Sep 26, 2023 at 8:04 PM Jake Luciani  wrote:
>
>> We (DataStax) have a FileSystemProvider for Astra we can provide.
>> Works with S3/GCS/Azure.
>>
>> I'll ask someone on our end to make it accessible.
>>
>> This would work by having a bucket prefix per node. But there are lots
>> of details needed to support things like out of bound compaction
>> (mentioned in CEP).
>>
>> Jake
>>
>> On Tue, Sep 26, 2023 at 12:56 PM Benedict  wrote:
>> >
>> > I agree with Ariel, the more suitable insertion point is probably the
>> JDK level FileSystemProvider and FileSystem abstraction.
>> >
>> > It might also be that we can reuse existing work here in some cases?
>> >
>> > On 26 Sep 2023, at 17:49, Ariel Weisberg  wrote:
>> >
>> > 
>> > Hi,
>> >
>> > Support for multiple storage backends including remote storage backends
>> is a pretty high value piece of functionality. I am happy to see there is
>> interest in that.
>> >
>> > I think that `ChannelProxyFactory` as an integration point is going to
>> quickly turn into a dead end as we get into really using multiple storage
>> backends. We need to be able to list files and really the full range of
>> filesystem interactions that Java supports should work with any backend to
>> make development, testing, and using existing code straightforward.
>> >
>> > It's a little more work to get C* to create paths for alternate
>> backends where appropriate, but that work is probably necessary even with
>> `ChannelProxyFactory` and munging UNIX paths (vs supporting multiple
>> Filesystems). There will probably also be backend-specific behaviors that
>> show up above the `ChannelProxy` layer that will depend on the backend.
>> >
>> > Ideally there would be some config to specify several backend
>> filesystems and their individual configuration that can be used, as well as
>> configuration and support for a "backend file router" for file creation
>> (and opening) that can be used to route files to the backend most
>> appropriate.
>> >
>> > Regards,
>> > Ariel
>> >
>> > On Mon, Sep 25, 2023, at 2:48 AM, Claude Warren, Jr via dev wrote:
>> >
>> > I have just filed CEP-36 [1] to allow for keyspace/table storage
>> outside of the standard storage space.
>> >
>> > There are two desires  driving this change:
>> >
>> > The ability to temporarily move some keyspaces/tables to storage
>> outside the normal directory tree to other disk so that compaction can
>> occur in situations where there is not enough disk space for compaction and
>> the processing to the moved data can not be suspended.
>> > The ability to store infrequently used data on slower cheaper storage
>> layers.
>> >
>> > I have a working POC implementation [2] though there are some issues
>> still to be solved and much logging to be reduced.
>> >
>> > I look forward to productive discussions,
>> > Claude
>> >
>> > [1]
>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-36%3A+A+Configurable+ChannelProxy+to+alias+external+storage+locations
>> > [2] https://github.com/Claudenw/cassandra/tree/channel_proxy_factory
>> >
>> >
>> >
>>
>>
>> --
>> http://twitter.com/tjake
>>
>
>
> --
>
> Henrik Ingo
>
> c. +358 40 569 7354
>
> w. www.datastax.com
>
>   
> 
> 
>
>

-- 
you are the apple of my eye !


Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2023-09-27 Thread Henrik Ingo
It seems I was volunteered to rebase the Astra implementation of this
functionality (FileSystemProvider) onto Cassandra trunk. (And publish it,
of course) I'll try to get going today or tomorrow, so that this
discussion can then benefit from having that code available for inspection.
And potentially using it as a solution to this use case.

On Tue, Sep 26, 2023 at 8:04 PM Jake Luciani  wrote:

> We (DataStax) have a FileSystemProvider for Astra we can provide.
> Works with S3/GCS/Azure.
>
> I'll ask someone on our end to make it accessible.
>
> This would work by having a bucket prefix per node. But there are lots
> of details needed to support things like out of bound compaction
> (mentioned in CEP).
>
> Jake
>
> On Tue, Sep 26, 2023 at 12:56 PM Benedict  wrote:
> >
> > I agree with Ariel, the more suitable insertion point is probably the
> JDK level FileSystemProvider and FileSystem abstraction.
> >
> > It might also be that we can reuse existing work here in some cases?
> >
> > On 26 Sep 2023, at 17:49, Ariel Weisberg  wrote:
> >
> > 
> > Hi,
> >
> > Support for multiple storage backends including remote storage backends
> is a pretty high value piece of functionality. I am happy to see there is
> interest in that.
> >
> > I think that `ChannelProxyFactory` as an integration point is going to
> quickly turn into a dead end as we get into really using multiple storage
> backends. We need to be able to list files and really the full range of
> filesystem interactions that Java supports should work with any backend to
> make development, testing, and using existing code straightforward.
> >
> > It's a little more work to get C* to create paths for alternate
> backends where appropriate, but that work is probably necessary even with
> `ChannelProxyFactory` and munging UNIX paths (vs supporting multiple
> Filesystems). There will probably also be backend-specific behaviors that
> show up above the `ChannelProxy` layer that will depend on the backend.
> >
> > Ideally there would be some config to specify several backend
> filesystems and their individual configuration that can be used, as well as
> configuration and support for a "backend file router" for file creation
> (and opening) that can be used to route files to the backend most
> appropriate.
> >
> > Regards,
> > Ariel
> >
> > On Mon, Sep 25, 2023, at 2:48 AM, Claude Warren, Jr via dev wrote:
> >
> > I have just filed CEP-36 [1] to allow for keyspace/table storage outside
> of the standard storage space.
> >
> > There are two desires  driving this change:
> >
> > The ability to temporarily move some keyspaces/tables to storage outside
> the normal directory tree to other disk so that compaction can occur in
> situations where there is not enough disk space for compaction and the
> processing to the moved data can not be suspended.
> > The ability to store infrequently used data on slower cheaper storage
> layers.
> >
> > I have a working POC implementation [2] though there are some issues
> still to be solved and much logging to be reduced.
> >
> > I look forward to productive discussions,
> > Claude
> >
> > [1]
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-36%3A+A+Configurable+ChannelProxy+to+alias+external+storage+locations
> > [2] https://github.com/Claudenw/cassandra/tree/channel_proxy_factory
> >
> >
> >
>
>
> --
> http://twitter.com/tjake
>


-- 

Henrik Ingo

c. +358 40 569 7354

w. www.datastax.com

  
  


Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2023-09-26 Thread Claude Warren, Jr via dev
I spent a little (very little) time building an S3 implementation using an
Apache licensed S3 filesystem package.  I have not yet tested it but if
anyone is interested it is at
https://github.com/Aiven-Labs/S3-Cassandra-ChannelProxy

In looking at some of the code I think the Cassandra File class needs to be
modified to ask the ChannelProxy for the default file system for the file
in question.  This should resolve some of the issues my original demo has
with some files being created in the data tree.  It may also handle many of
the cases for offline tools as well.
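"Ask the ChannelProxy for the default file system for the file in question" amounts to a routing decision, close to the "backend file router" Ariel described earlier in the thread. A hypothetical sketch — the `BackendRouter` name and its prefix-based policy are assumptions, not Cassandra code; the JDK's zipfs provider stands in for an S3 or Jimfs backend so the example is self-contained:

```java
import java.net.URI;
import java.nio.file.*;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical router: choose the backend FileSystem for a logical path by
// longest matching prefix, falling back to the local filesystem. Illustrative only.
public class BackendRouter {
    private final NavigableMap<String, FileSystem> byPrefix = new TreeMap<>();
    private final FileSystem fallback = FileSystems.getDefault();

    public void mount(String pathPrefix, FileSystem fs) {
        byPrefix.put(pathPrefix, fs);
    }

    public FileSystem route(String logicalPath) {
        // Descending key order visits longer, more specific prefixes first.
        for (Map.Entry<String, FileSystem> e : byPrefix.descendingMap().entrySet()) {
            if (logicalPath.startsWith(e.getKey())) {
                return e.getValue();
            }
        }
        return fallback;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in "cold" backend: a zip archive via the JDK's zipfs provider.
        Path zip = Files.createTempFile("cold", ".zip");
        Files.delete(zip);
        try (FileSystem coldFs = FileSystems.newFileSystem(
                URI.create("jar:" + zip.toUri()), Map.of("create", "true"))) {
            BackendRouter router = new BackendRouter();
            router.mount("/data/cold_ks", coldFs);
            System.out.println(router.route("/data/cold_ks/t1") == coldFs);                  // true
            System.out.println(router.route("/data/hot_ks/t1") == FileSystems.getDefault()); // true
        } finally {
            Files.deleteIfExists(zip);
        }
    }
}
```

Longest-prefix routing would let a `/data/cold_ks` mount override a broader `/data` mount, matching the per-keyspace/table granularity CEP-36 aims for.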


On Tue, Sep 26, 2023 at 7:33 PM Miklosovic, Stefan <
[email protected]> wrote:

> Would it be possible to make Jimfs integration production-ready then? I
> see we are using it in the tests already.
>
> It might be one of the reference implementations of this CEP. If there is
> a type of workload / type of nodes with plenty of RAM but no disk, some
> kind of compute nodes, it would just hold it all in memory and we might
> "flush" it to a cloud-based storage if rendered to be not necessary anymore
> (whatever that means).
>
> We could then completely bypass the memtables as fetching data from an
> SSTable from memory would be basically roughly same?
>
> On the other hand, that might be achieved by creating a ramdisk so I am
> not sure what exactly we would gain here. However, if it was eventually
> storing these SSTables in a cloud storage, we might "compact" "TWCS tables"
> automatically after so-and-so period by moving them there.
>
> 
> From: Jake Luciani 
> Sent: Tuesday, September 26, 2023 19:03
> To: [email protected]
> Subject: Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias
> external storage locations
>
> NetApp Security WARNING: This is an external email. Do not click links or
> open attachments unless you recognize the sender and know the content is
> safe.
>
>
>
>
> We (DataStax) have a FileSystemProvider for Astra we can provide.
> Works with S3/GCS/Azure.
>
> I'll ask someone on our end to make it accessible.
>
> This would work by having a bucket prefix per node. But there are lots
> of details needed to support things like out of bound compaction
> (mentioned in CEP).
>
> Jake
>
> On Tue, Sep 26, 2023 at 12:56 PM Benedict  wrote:
> >
> > I agree with Ariel, the more suitable insertion point is probably the
> JDK level FileSystemProvider and FileSystem abstraction.
> >
> > It might also be that we can reuse existing work here in some cases?
> >
> > On 26 Sep 2023, at 17:49, Ariel Weisberg  wrote:
> >
> > 
> > Hi,
> >
> > Support for multiple storage backends including remote storage backends
> is a pretty high value piece of functionality. I am happy to see there is
> interest in that.
> >
> > I think that `ChannelProxyFactory` as an integration point is going to
> quickly turn into a dead end as we get into really using multiple storage
> backends. We need to be able to list files and really the full range of
> filesystem interactions that Java supports should work with any backend to
> make development, testing, and using existing code straightforward.
> >
> > It's a little more work to get C* to create paths for alternate
> backends where appropriate, but that work is probably necessary even with
> `ChannelProxyFactory` and munging UNIX paths (vs supporting multiple
> Filesystems). There will probably also be backend-specific behaviors that
> show up above the `ChannelProxy` layer that will depend on the backend.
> >
> > Ideally there would be some config to specify several backend
> filesystems and their individual configuration that can be used, as well as
> configuration and support for a "backend file router" for file creation
> (and opening) that can be used to route files to the backend most
> appropriate.
> >
> > Regards,
> > Ariel
> >
> > On Mon, Sep 25, 2023, at 2:48 AM, Claude Warren, Jr via dev wrote:
> >
> > I have just filed CEP-36 [1] to allow for keyspace/table storage outside
> of the standard storage space.
> >
> > There are two desires  driving this change:
> >
> > The ability to temporarily move some keyspaces/tables to storage outside
> the normal directory tree to other disk so that compaction can occur in
> situations where there is not enough disk space for compaction and the
> processing to the moved data can not be suspended.
> > The ability to store infrequently used data on slower cheaper storage
> layers.
> >
> > I have a working POC implementation [2] though there are some issues
> still to be solved and much logging to be reduced.
> >
> > I look forward to productive discussions,
> > Claude
> >
> > [1]
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-36%3A+A+Configurable+ChannelProxy+to+alias+external+storage+locations
> > [2] https://github.com/Claudenw/cassandra/tree/channel_proxy_factory
> >
> >
> >
>
>
> --
> http://twitter.com/tjake
>


Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2023-09-26 Thread Miklosovic, Stefan
Would it be possible to make Jimfs integration production-ready then? I see we 
are using it in the tests already.

It might be one of the reference implementations of this CEP. If there is a 
type of workload / type of nodes with plenty of RAM but no disk, some kind of 
compute nodes, it would just hold it all in memory and we might "flush" it to a 
cloud-based storage if rendered to be not necessary anymore (whatever that 
means).

We could then completely bypass the memtables as fetching data from an SSTable 
from memory would be basically roughly same?

On the other hand, that might be achieved by creating a ramdisk so I am not 
sure what exactly we would gain here. However, if it was eventually storing 
these SSTables in a cloud storage, we might "compact" "TWCS tables" 
automatically after so-and-so period by moving them there.
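Jimfs is a third-party library, but the plug-in point underneath it is the JDK's own `FileSystemProvider` SPI, and the zipfs provider that ships with the JDK demonstrates the same mechanism with no extra dependency. A minimal sketch (all paths illustrative): once a `Path` comes from a non-default `FileSystem`, ordinary `java.nio` calls transparently read and write the alternate backend.

```java
import java.net.URI;
import java.nio.file.*;
import java.util.Map;

// Sketch of the FileSystemProvider plug-in point, using the JDK's built-in zipfs
// provider as a stand-in for Jimfs or a remote-storage backend. Illustrative only.
public class AltFileSystemDemo {
    static String roundTrip() throws Exception {
        Path zip = Files.createTempFile("sstables", ".zip");
        Files.delete(zip); // let zipfs create the archive itself
        URI uri = URI.create("jar:" + zip.toUri());
        try (FileSystem fs = FileSystems.newFileSystem(uri, Map.of("create", "true"))) {
            // Plain java.nio calls now target the archive, not the local data tree.
            Path table = fs.getPath("/ks1", "table1");
            Files.createDirectories(table);
            Files.writeString(table.resolve("nb-1-big-Data.db"), "sstable bytes");
            return Files.readString(table.resolve("nb-1-big-Data.db"));
        } finally {
            Files.deleteIfExists(zip);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip()); // prints: sstable bytes
    }
}
```

A Jimfs-backed or cloud-backed provider would slot into the same `FileSystems.newFileSystem` call with a different URI scheme, which is what makes this insertion point attractive compared to a ChannelProxy-only hook.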


From: Jake Luciani 
Sent: Tuesday, September 26, 2023 19:03
To: [email protected]
Subject: Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external 
storage locations


We (DataStax) have a FileSystemProvider for Astra we can provide.
Works with S3/GCS/Azure.

I'll ask someone on our end to make it accessible.

This would work by having a bucket prefix per node. But there are lots
of details needed to support things like out of bound compaction
(mentioned in CEP).

Jake

On Tue, Sep 26, 2023 at 12:56 PM Benedict  wrote:
>
> I agree with Ariel, the more suitable insertion point is probably the JDK 
> level FileSystemProvider and FileSystem abstraction.
>
> It might also be that we can reuse existing work here in some cases?
>
> On 26 Sep 2023, at 17:49, Ariel Weisberg  wrote:
>
> 
> Hi,
>
> Support for multiple storage backends including remote storage backends is a 
> pretty high value piece of functionality. I am happy to see there is interest 
> in that.
>
> I think that `ChannelProxyFactory` as an integration point is going to 
> quickly turn into a dead end as we get into really using multiple storage 
> backends. We need to be able to list files and really the full range of 
> filesystem interactions that Java supports should work with any backend to 
> make development, testing, and using existing code straightforward.
>
> It's a little more work to get C* to create paths for alternate backends 
> where appropriate, but that work is probably necessary even with 
> `ChannelProxyFactory` and munging UNIX paths (vs supporting multiple 
> Filesystems). There will probably also be backend-specific behaviors that show 
> up above the `ChannelProxy` layer that will depend on the backend.
>
> Ideally there would be some config to specify several backend filesystems and 
> their individual configuration that can be used, as well as configuration and 
> support for a "backend file router" for file creation (and opening) that can 
> be used to route files to the backend most appropriate.
>
> Regards,
> Ariel
>
> On Mon, Sep 25, 2023, at 2:48 AM, Claude Warren, Jr via dev wrote:
>
> I have just filed CEP-36 [1] to allow for keyspace/table storage outside of 
> the standard storage space.
>
> There are two desires  driving this change:
>
> The ability to temporarily move some keyspaces/tables to storage outside the 
> normal directory tree to other disk so that compaction can occur in 
> situations where there is not enough disk space for compaction and the 
> processing to the moved data can not be suspended.
> The ability to store infrequently used data on slower cheaper storage layers.
>
> I have a working POC implementation [2] though there are some issues still to 
> be solved and much logging to be reduced.
>
> I look forward to productive discussions,
> Claude
>
> [1] 
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-36%3A+A+Configurable+ChannelProxy+to+alias+external+storage+locations
> [2] https://github.com/Claudenw/cassandra/tree/channel_proxy_factory
>
>
>


--
http://twitter.com/tjake


Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2023-09-26 Thread Jake Luciani
We (DataStax) have a FileSystemProvider for Astra we can provide.
Works with S3/GCS/Azure.

I'll ask someone on our end to make it accessible.

This would work by having a bucket prefix per node. But there are lots
of details needed to support things like out of bound compaction
(mentioned in CEP).

Jake

On Tue, Sep 26, 2023 at 12:56 PM Benedict  wrote:
>
> I agree with Ariel, the more suitable insertion point is probably the JDK 
> level FileSystemProvider and FileSystem abstraction.
>
> It might also be that we can reuse existing work here in some cases?
>
> On 26 Sep 2023, at 17:49, Ariel Weisberg  wrote:
>
> 
> Hi,
>
> Support for multiple storage backends including remote storage backends is a 
> pretty high value piece of functionality. I am happy to see there is interest 
> in that.
>
> I think that `ChannelProxyFactory` as an integration point is going to 
> quickly turn into a dead end as we get into really using multiple storage 
> backends. We need to be able to list files and really the full range of 
> filesystem interactions that Java supports should work with any backend to 
> make development, testing, and using existing code straightforward.
>
> It's a little more work to get C* to create paths for alternate backends 
> where appropriate, but that work is probably necessary even with 
> `ChannelProxyFactory` and munging UNIX paths (vs supporting multiple 
> Filesystems). There will probably also be backend-specific behaviors that show 
> up above the `ChannelProxy` layer that will depend on the backend.
>
> Ideally there would be some config to specify several backend filesystems and 
> their individual configuration that can be used, as well as configuration and 
> support for a "backend file router" for file creation (and opening) that can 
> be used to route files to the backend most appropriate.
>
> Regards,
> Ariel
>
> On Mon, Sep 25, 2023, at 2:48 AM, Claude Warren, Jr via dev wrote:
>
> I have just filed CEP-36 [1] to allow for keyspace/table storage outside of 
> the standard storage space.
>
> There are two desires  driving this change:
>
> The ability to temporarily move some keyspaces/tables to storage outside the 
> normal directory tree to other disk so that compaction can occur in 
> situations where there is not enough disk space for compaction and the 
> processing to the moved data can not be suspended.
> The ability to store infrequently used data on slower cheaper storage layers.
>
> I have a working POC implementation [2] though there are some issues still to 
> be solved and much logging to be reduced.
>
> I look forward to productive discussions,
> Claude
>
> [1] 
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-36%3A+A+Configurable+ChannelProxy+to+alias+external+storage+locations
> [2] https://github.com/Claudenw/cassandra/tree/channel_proxy_factory
>
>
>


-- 
http://twitter.com/tjake


Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2023-09-26 Thread Benedict
I agree with Ariel, the more suitable insertion point is probably the JDK level 
FileSystemProvider and FileSystem abstraction.

It might also be that we can reuse existing work here in some cases?
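The appeal of the JDK-level abstraction is that code written against java.nio.file.Path and FileSystem is backend-agnostic. A minimal sketch using only stock JDK APIs — here the bundled zip filesystem provider stands in for an alternate backend; a hypothetical S3-backed provider would plug into the same API:

```java
import java.io.IOException;
import java.net.URI;
import java.nio.file.*;
import java.util.Map;

public class FsAbstractionDemo {
    // Write and read through the generic FileSystem API; this same code path
    // works whether the backend is the local disk, a zip archive, or
    // (hypothetically) a remote-object-store provider on the classpath.
    static String roundTrip(FileSystem fs, String name, String payload) throws IOException {
        Path p = fs.getPath(name);
        Files.writeString(p, payload);
        return Files.readString(p);
    }

    public static void main(String[] args) throws IOException {
        Path zip = Files.createTempFile("backend", ".zip");
        Files.delete(zip); // the zip provider creates the archive itself
        URI uri = URI.create("jar:" + zip.toUri());
        try (FileSystem zipFs = FileSystems.newFileSystem(uri, Map.of("create", "true"))) {
            System.out.println(roundTrip(zipFs, "/sstable-data.db", "hello")); // prints "hello"
        }
    }
}
```

The caller never names the backend; swapping providers is a matter of which FileSystem instance is handed in.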

> On 26 Sep 2023, at 17:49, Ariel Weisberg  wrote:
> 
> 
> Hi,
> 
> Support for multiple storage backends including remote storage backends is a 
> pretty high value piece of functionality. I am happy to see there is interest 
> in that.
> 
> I think that `ChannelProxyFactory` as an integration point is going to 
> quickly turn into a dead end as we get into really using multiple storage 
> backends. We need to be able to list files and really the full range of 
> filesystem interactions that Java supports should work with any backend to 
> make development, testing, and using existing code straightforward.
> 
> It's a little more work to get C* to create paths for alternate backends 
> where appropriate, but that work is probably necessary even with 
> `ChannelProxyFactory` and munging UNIX paths (vs supporting multiple 
> Filesystems). There will probably also be backend-specific behaviors that show 
> up above the `ChannelProxy` layer that will depend on the backend.
> 
> Ideally there would be some config to specify several backend filesystems and 
> their individual configuration that can be used, as well as configuration and 
> support for a "backend file router" for file creation (and opening) that can 
> be used to route files to the backend most appropriate.
> 
> Regards,
> Ariel
> 
>> On Mon, Sep 25, 2023, at 2:48 AM, Claude Warren, Jr via dev wrote:
>> I have just filed CEP-36 [1] to allow for keyspace/table storage outside of 
>> the standard storage space.  
>> 
>> There are two desires  driving this change:
>> The ability to temporarily move some keyspaces/tables to storage outside the 
>> normal directory tree to other disk so that compaction can occur in 
>> situations where there is not enough disk space for compaction and the 
>> processing to the moved data can not be suspended.
>> The ability to store infrequently used data on slower cheaper storage layers.
>> I have a working POC implementation [2] though there are some issues still 
>> to be solved and much logging to be reduced.
>> 
>> I look forward to productive discussions,
>> Claude
>> 
>> [1] 
>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-36%3A+A+Configurable+ChannelProxy+to+alias+external+storage+locations
>> [2] https://github.com/Claudenw/cassandra/tree/channel_proxy_factory 
>> 
>> 
> 


Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2023-09-26 Thread Ariel Weisberg
Hi,

Support for multiple storage backends including remote storage backends is a 
pretty high value piece of functionality. I am happy to see there is interest 
in that.

I think that `ChannelProxyFactory` as an integration point is going to quickly 
turn into a dead end as we get into really using multiple storage backends. We 
need to be able to list files and really the full range of filesystem 
interactions that Java supports should work with any backend to make 
development, testing, and using existing code straightforward.

It's a little more work to get C* to create paths for alternate backends where 
appropriate, but that work is probably necessary even with 
`ChannelProxyFactory` and munging UNIX paths (vs supporting multiple 
Filesystems). There will probably also be backend-specific behaviors that show 
up above the `ChannelProxy` layer that will depend on the backend.

Ideally there would be some config to specify several backend filesystems and 
their individual configuration that can be used, as well as configuration and 
support for a "backend file router" for file creation (and opening) that can be 
used to route files to the backend most appropriate.

Regards,
Ariel
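The "backend file router" described above could be sketched roughly as follows. The class and method names (`BackendFileRouter`, `routeFor`) are purely illustrative, not an existing Cassandra API:

```java
import java.nio.file.FileSystem;
import java.nio.file.Path;
import java.util.Map;

// Hypothetical router: pick one of several configured java.nio.file.FileSystem
// backends per keyspace when creating or opening a file, falling back to the
// default (local-disk) backend when no rule matches.
public class BackendFileRouter {
    private final Map<String, FileSystem> backendsByKeyspace;
    private final FileSystem defaultBackend;

    public BackendFileRouter(Map<String, FileSystem> backendsByKeyspace, FileSystem defaultBackend) {
        this.backendsByKeyspace = backendsByKeyspace;
        this.defaultBackend = defaultBackend;
    }

    // Resolve a path on whichever backend is most appropriate for this keyspace.
    public Path routeFor(String keyspace, String relativePath) {
        FileSystem fs = backendsByKeyspace.getOrDefault(keyspace, defaultBackend);
        return fs.getPath(relativePath);
    }
}
```

A real router would likely also consider file type (data vs. index vs. commit log) and table-level options, not just keyspace.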

On Mon, Sep 25, 2023, at 2:48 AM, Claude Warren, Jr via dev wrote:
> I have just filed CEP-36 [1] to allow for keyspace/table storage outside of 
> the standard storage space.  
> 
> There are two desires  driving this change:
>  1. The ability to temporarily move some keyspaces/tables to storage outside 
> the normal directory tree to other disk so that compaction can occur in 
> situations where there is not enough disk space for compaction and the 
> processing to the moved data can not be suspended.
>  2. The ability to store infrequently used data on slower cheaper storage 
> layers.
> I have a working POC implementation [2] though there are some issues still to 
> be solved and much logging to be reduced.
> 
> I look forward to productive discussions,
> Claude
> 
> [1] 
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-36%3A+A+Configurable+ChannelProxy+to+alias+external+storage+locations
> [2] https://github.com/Claudenw/cassandra/tree/channel_proxy_factory 
> 
> 


Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2023-09-26 Thread guo Maxwell
Yeah, there are so many things to do, as Cassandra (shared-nothing) is
different from some other systems like HBase, so I think we can break the
final goal into multiple steps. The first is what Claude proposed. But I
suggest that this design make the interface more scalable and consider the
implementation of cloud storage, so that someone can extend the interface in
the future.

Josh McKenzie  wrote on Tue, Sep 26, 2023 at 18:40:

> it may be better to support most cloud storage
> It simply only supports S3, which feels a bit customized for a certain
> user and is not universal enough.Am I right ?
>
> I agree w/the eventual goal (and constraint on design now) of supporting
> most popular cloud storage vendors, but if we have someone with an itch to
> scratch and at the end of that we end up with first steps in a compatible
> direction to ultimately supporting decoupled / abstracted storage systems,
> that's fantastic.
>
> To Jeff's point - so long as we can think about and chart a general path
> of where we want to go, if Claude has the time and inclination to handle
> abstracting out the API in that direction and one implementation, that's
> fantastic IMO.
>
> I know there's some other folks out there who've done some interception /
> refactoring of the FileChannel stuff to support disaggregated storage;
> curious what their experiences were like.
>
>
> On Tue, Sep 26, 2023, at 4:20 AM, Claude Warren, Jr via dev wrote:
>
> The intention of the CEP is to lay the groundwork to allow development of
> ChannelProxyFactories that are pluggable in Cassandra.  In this way any
> storage system can be a candidate for Cassandra storage provided
> FileChannels can be created for the system.
>
> As I stated before I think that there may be a need for a
> java.nio.FileSystem implementation for  the proxies but I have not had the
> time to dig into it yet.
>
> Claude
>
>
> On Tue, Sep 26, 2023 at 9:01 AM guo Maxwell  wrote:
>
> In my mind , it may be better to support most cloud storage : aws,
> azure,gcp,aliyun and so on . We may make it a plugable. But in that way, it
> seems there may need a filesystem interface layer for object storage. And
> should we support ,distributed system like hdfs ,or something else. We
> should first discuss what should be done and what should not be done. It
> simply only supports S3, which feels a bit customized for a certain user
> and is not universal enough.Am I right ?
>
Claude Warren, Jr  wrote on Tue, Sep 26, 2023 at 14:36:
>
> My intention is to develop an S3 storage system using
> https://github.com/carlspring/s3fs-nio
>
> There are several issues yet to be solved:
>
>1. There are some internal calls that create files in the table
>directory that do not use the channel proxy.  I believe that these are
>making calls on File objects.  I think those File objects are Cassandra
>File objects not Java I/O File objects, but am unsure.
>2. Determine if the carlspring s3fs-nio library will be performant
>enough to work in the long run.  There may be issues with it:
>1. Downloading entire files before using them rather than using views
>   into larger remotely stored files.
>   2. Requiring a complete file to upload rather than using the
>   partial upload capability of the S3 interface.
>
>
>
> On Tue, Sep 26, 2023 at 4:11 AM guo Maxwell  wrote:
>
> "Rather than building this piece by piece, I think it'd be awesome if
> someone drew up an end-to-end plan to implement tiered storage, so we can
> make sure we're discussing the whole final state, and not an implementation
> detail of one part of the final state?"
>
> Do agree with jeff for this ~~~ If these feature can be supported in oss
> cassandra , I think it will be very popular, whether in  a private
> deployment environment or a public cloud service (our experience can prove
> it). In addition, it is also a cost-cutting option for users too
>
Jeff Jirsa  wrote on Tue, Sep 26, 2023 at 00:11:
>
>
> - I think this is a great step forward.
> - Being able to move sstables around between tiers of storage is a feature
> Cassandra desperately needs, especially if one of those tiers is some sort
> of object storage
> - This looks like it's a foundational piece that enables that. Perhaps by
> a team that's already implemented this end to end?
> - Rather than building this piece by piece, I think it'd be awesome if
> someone drew up an end-to-end plan to implement tiered storage, so we can
> make sure we're discussing the whole final state, and not an implementation
> detail of one part of the final state?
>
>
>
>
>
>
> On Sun, Sep 24, 2023 at 11:49 PM Claude Warren, Jr via dev <
> [email protected]> wrote:
>
> I have just filed CEP-36 [1] to allow for keyspace/table storage outside
> of the standard storage space.
>
> There are two desires  driving this change:
>
>1. The ability to temporarily move some keyspaces/tables to storage
>outside the normal directory tree to other disk so that compaction can
>occur in situations where there is not enough disk space for compaction and
>the processing to the moved data can not be suspended.
>2. The ability to store infrequently used data on slower cheaper storage
>layers.

Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2023-09-26 Thread Josh McKenzie
> it may be better to support most cloud storage
> It simply only supports S3, which feels a bit customized for a certain user 
> and is not universal enough.Am I right ?
I agree w/the eventual goal (and constraint on design now) of supporting most 
popular cloud storage vendors, but if we have someone with an itch to scratch 
and at the end of that we end up with first steps in a compatible direction to 
ultimately supporting decoupled / abstracted storage systems, that's fantastic.

To Jeff's point - so long as we can think about and chart a general path of 
where we want to go, if Claude has the time and inclination to handle 
abstracting out the API in that direction and one implementation, that's 
fantastic IMO.

I know there's some other folks out there who've done some interception / 
refactoring of the FileChannel stuff to support disaggregated storage; curious 
what their experiences were like.


On Tue, Sep 26, 2023, at 4:20 AM, Claude Warren, Jr via dev wrote:
> The intention of the CEP is to lay the groundwork to allow development of 
> ChannelProxyFactories that are pluggable in Cassandra.  In this way any 
> storage system can be a candidate for Cassandra storage provided FileChannels 
> can be created for the system. 
> 
> As I stated before I think that there may be a need for a java.nio.FileSystem 
> implementation for  the proxies but I have not had the time to dig into it 
> yet.
> 
> Claude
> 
> 
> On Tue, Sep 26, 2023 at 9:01 AM guo Maxwell  wrote:
>> In my mind , it may be better to support most cloud storage : aws, 
>> azure,gcp,aliyun and so on . We may make it a plugable. But in that way, it 
>> seems there may need a filesystem interface layer for object storage. And 
>> should we support ,distributed system like hdfs ,or something else. We 
>> should first discuss what should be done and what should not be done. It 
>> simply only supports S3, which feels a bit customized for a certain user and 
>> is not universal enough.Am I right ?
>> 
>> Claude Warren, Jr  wrote on Tue, Sep 26, 2023 at 14:36:
>>> My intention is to develop an S3 storage system using  
>>> https://github.com/carlspring/s3fs-nio 
>>> 
>>> There are several issues yet to be solved:
>>>  1. There are some internal calls that create files in the table directory 
>>> that do not use the channel proxy.  I believe that these are making calls 
>>> on File objects.  I think those File objects are Cassandra File objects not 
>>> Java I/O File objects, but am unsure.
>>>  2. Determine if the carlspring s3fs-nio library will be performant enough 
>>> to work in the long run.  There may be issues with it:
>>>1. Downloading entire files before using them rather than using views 
>>> into larger remotely stored files.
>>>2. Requiring a complete file to upload rather than using the partial 
>>> upload capability of the S3 interface.
>>> 
>>> 
>>> On Tue, Sep 26, 2023 at 4:11 AM guo Maxwell  wrote:
 "Rather than building this piece by piece, I think it'd be awesome if 
 someone drew up an end-to-end plan to implement tiered storage, so we can 
 make sure we're discussing the whole final state, and not an 
 implementation detail of one part of the final state?"
 
 Do agree with jeff for this ~~~ If these feature can be supported in oss 
 cassandra , I think it will be very popular, whether in  a private 
 deployment environment or a public cloud service (our experience can prove 
 it). In addition, it is also a cost-cutting option for users too
 
 Jeff Jirsa  wrote on Tue, Sep 26, 2023 at 00:11:
> 
> - I think this is a great step forward. 
> - Being able to move sstables around between tiers of storage is a 
> feature Cassandra desperately needs, especially if one of those tiers is 
> some sort of object storage
> - This looks like it's a foundational piece that enables that. Perhaps by 
> a team that's already implemented this end to end? 
> - Rather than building this piece by piece, I think it'd be awesome if 
> someone drew up an end-to-end plan to implement tiered storage, so we can 
> make sure we're discussing the whole final state, and not an 
> implementation detail of one part of the final state?
> 
> 
> 
> 
> 
> 
> On Sun, Sep 24, 2023 at 11:49 PM Claude Warren, Jr via dev 
>  wrote:
>> I have just filed CEP-36 [1] to allow for keyspace/table storage outside 
>> of the standard storage space.  
>> 
>> There are two desires  driving this change:
>>  1. The ability to temporarily move some keyspaces/tables to storage 
>> outside the normal directory tree to other disk so that compaction can 
>> occur in situations where there is not enough disk space for compaction 
>> and the processing to the moved data can not be suspended.
>>  2. The ability to store infrequently used data on slower cheaper 
>> storage layers.
>> I have a working POC implementation [2] though there are some issues still 
>> to be solved and much logging to be reduced.

Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2023-09-26 Thread Claude Warren, Jr via dev
The intention of the CEP is to lay the groundwork to allow development of
ChannelProxyFactories that are pluggable in Cassandra.  In this way any
storage system can be a candidate for Cassandra storage provided
FileChannels can be created for the system.

As I stated before, I think that there may be a need for a
java.nio.FileSystem implementation for the proxies, but I have not had the
time to dig into it yet.

Claude
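The pluggable-factory idea can be sketched, very loosely, as an interface that hands Cassandra a FileChannel per path. The interface and names below are illustrative only, not the CEP's actual API:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical shape of a pluggable factory in the spirit of the CEP's
// ChannelProxyFactory: Cassandra asks the factory for a FileChannel, and any
// backend (local disk, S3, in-memory) can satisfy the request.
public interface ChannelFactorySketch {
    FileChannel openForRead(Path path) throws IOException;

    // Default local-disk implementation; an S3-backed factory would instead
    // return a channel that translates reads into ranged object requests.
    static ChannelFactorySketch localDisk() {
        return path -> FileChannel.open(path, StandardOpenOption.READ);
    }
}
```

The point of the single narrow seam is that storage systems become candidates for Cassandra storage as soon as they can produce FileChannels.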


On Tue, Sep 26, 2023 at 9:01 AM guo Maxwell  wrote:

> In my mind , it may be better to support most cloud storage : aws,
> azure,gcp,aliyun and so on . We may make it a plugable. But in that way, it
> seems there may need a filesystem interface layer for object storage. And
> should we support ,distributed system like hdfs ,or something else. We
> should first discuss what should be done and what should not be done. It
> simply only supports S3, which feels a bit customized for a certain user
> and is not universal enough.Am I right ?
>
> Claude Warren, Jr  wrote on Tue, Sep 26, 2023 at 14:36:
>
>> My intention is to develop an S3 storage system using
>> https://github.com/carlspring/s3fs-nio
>>
>> There are several issues yet to be solved:
>>
>>1. There are some internal calls that create files in the table
>>directory that do not use the channel proxy.  I believe that these are
>>making calls on File objects.  I think those File objects are Cassandra
>>File objects not Java I/O File objects, but am unsure.
>>2. Determine if the carlspring s3fs-nio library will be performant
>>enough to work in the long run.  There may be issues with it:
>>   1. Downloading entire files before using them rather than using
>>   views into larger remotely stored files.
>>   2. Requiring a complete file to upload rather than using the
>>   partial upload capability of the S3 interface.
>>
>>
>>
>> On Tue, Sep 26, 2023 at 4:11 AM guo Maxwell  wrote:
>>
>>> "Rather than building this piece by piece, I think it'd be awesome if
>>> someone drew up an end-to-end plan to implement tiered storage, so we can
>>> make sure we're discussing the whole final state, and not an implementation
>>> detail of one part of the final state?"
>>>
>>> Do agree with jeff for this ~~~ If these feature can be supported in oss
>>> cassandra , I think it will be very popular, whether in  a private
>>> deployment environment or a public cloud service (our experience can prove
>>> it). In addition, it is also a cost-cutting option for users too
>>>
>>> Jeff Jirsa  wrote on Tue, Sep 26, 2023 at 00:11:
>>>

 - I think this is a great step forward.
 - Being able to move sstables around between tiers of storage is a
 feature Cassandra desperately needs, especially if one of those tiers is
 some sort of object storage
 - This looks like it's a foundational piece that enables that. Perhaps
 by a team that's already implemented this end to end?
 - Rather than building this piece by piece, I think it'd be awesome if
 someone drew up an end-to-end plan to implement tiered storage, so we can
 make sure we're discussing the whole final state, and not an implementation
 detail of one part of the final state?






 On Sun, Sep 24, 2023 at 11:49 PM Claude Warren, Jr via dev <
 [email protected]> wrote:

> I have just filed CEP-36 [1] to allow for keyspace/table storage
> outside of the standard storage space.
>
> There are two desires  driving this change:
>
>1. The ability to temporarily move some keyspaces/tables to
>storage outside the normal directory tree to other disk so that 
> compaction
>can occur in situations where there is not enough disk space for 
> compaction
>and the processing to the moved data can not be suspended.
>2. The ability to store infrequently used data on slower cheaper
>storage layers.
>
> I have a working POC implementation [2] though there are some issues
> still to be solved and much logging to be reduced.
>
> I look forward to productive discussions,
> Claude
>
> [1]
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-36%3A+A+Configurable+ChannelProxy+to+alias+external+storage+locations
> [2] https://github.com/Claudenw/cassandra/tree/channel_proxy_factory
>
>
>
>>>
>>> --
>>> you are the apple of my eye !
>>>
>>
>
> --
> you are the apple of my eye !
>


Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2023-09-26 Thread guo Maxwell
In my mind, it may be better to support most cloud storage: AWS, Azure, GCP,
Aliyun, and so on. We could make it pluggable. But in that way, it seems
there may need to be a filesystem interface layer for object storage. And
should we support distributed systems like HDFS, or something else? We
should first discuss what should be done and what should not be done.
Supporting only S3 feels a bit customized for a certain user and is not
universal enough. Am I right?
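As a sketch of what such pluggable configuration might look like — every option name and class below is hypothetical; nothing like this exists in cassandra.yaml today:

```yaml
# Hypothetical cassandra.yaml fragment: declare several storage backends and
# let a router direct files to them. Illustrative only, not a real schema.
storage_providers:
  - name: local
    class: org.apache.cassandra.io.util.LocalChannelProxyFactory   # hypothetical
  - name: s3_cold
    class: org.example.S3ChannelProxyFactory                       # hypothetical plugin
    parameters:
      bucket: my-cluster-bucket
      region: us-east-1
```

A vendor-neutral interface plus per-provider `parameters` is one way to avoid baking S3 specifics into the core.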

Claude Warren, Jr  wrote on Tue, Sep 26, 2023 at 14:36:

> My intention is to develop an S3 storage system using
> https://github.com/carlspring/s3fs-nio
>
> There are several issues yet to be solved:
>
>1. There are some internal calls that create files in the table
>directory that do not use the channel proxy.  I believe that these are
>making calls on File objects.  I think those File objects are Cassandra
>File objects not Java I/O File objects, but am unsure.
>2. Determine if the carlspring s3fs-nio library will be performant
>enough to work in the long run.  There may be issues with it:
>   1. Downloading entire files before using them rather than using
>   views into larger remotely stored files.
>   2. Requiring a complete file to upload rather than using the
>   partial upload capability of the S3 interface.
>
>
>
> On Tue, Sep 26, 2023 at 4:11 AM guo Maxwell  wrote:
>
>> "Rather than building this piece by piece, I think it'd be awesome if
>> someone drew up an end-to-end plan to implement tiered storage, so we can
>> make sure we're discussing the whole final state, and not an implementation
>> detail of one part of the final state?"
>>
>> Do agree with jeff for this ~~~ If these feature can be supported in oss
>> cassandra , I think it will be very popular, whether in  a private
>> deployment environment or a public cloud service (our experience can prove
>> it). In addition, it is also a cost-cutting option for users too
>>
>> Jeff Jirsa  wrote on Tue, Sep 26, 2023 at 00:11:
>>
>>>
>>> - I think this is a great step forward.
>>> - Being able to move sstables around between tiers of storage is a
>>> feature Cassandra desperately needs, especially if one of those tiers is
>>> some sort of object storage
>>> - This looks like it's a foundational piece that enables that. Perhaps
>>> by a team that's already implemented this end to end?
>>> - Rather than building this piece by piece, I think it'd be awesome if
>>> someone drew up an end-to-end plan to implement tiered storage, so we can
>>> make sure we're discussing the whole final state, and not an implementation
>>> detail of one part of the final state?
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Sun, Sep 24, 2023 at 11:49 PM Claude Warren, Jr via dev <
>>> [email protected]> wrote:
>>>
 I have just filed CEP-36 [1] to allow for keyspace/table storage
 outside of the standard storage space.

 There are two desires  driving this change:

1. The ability to temporarily move some keyspaces/tables to storage
outside the normal directory tree to other disk so that compaction can
occur in situations where there is not enough disk space for compaction 
 and
the processing to the moved data can not be suspended.
2. The ability to store infrequently used data on slower cheaper
storage layers.

 I have a working POC implementation [2] though there are some issues
 still to be solved and much logging to be reduced.

 I look forward to productive discussions,
 Claude

 [1]
 https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-36%3A+A+Configurable+ChannelProxy+to+alias+external+storage+locations
 [2] https://github.com/Claudenw/cassandra/tree/channel_proxy_factory



>>
>> --
>> you are the apple of my eye !
>>
>

-- 
you are the apple of my eye !


Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2023-09-25 Thread Claude Warren, Jr via dev
My intention is to develop an S3 storage system using
https://github.com/carlspring/s3fs-nio

There are several issues yet to be solved:

   1. There are some internal calls that create files in the table
   directory that do not use the channel proxy.  I believe that these are
   making calls on File objects.  I think those File objects are Cassandra
   File objects not Java I/O File objects, but am unsure.
   2. Determine if the carlspring s3fs-nio library will be performant
   enough to work in the long run.  There may be issues with it:
  1. Downloading entire files before using them rather than using views
  into larger remotely stored files.
  2. Requiring a complete file to upload rather than using the partial
  upload capability of the S3 interface.
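Concern 2.1 matters because Cassandra's SSTable reads are positional: it seeks into large files rather than streaming them whole. A small JDK-only sketch of the access pattern any remote-storage FileChannel would need to serve efficiently (ideally via ranged requests rather than whole-object downloads):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.*;

public class PositionalReadDemo {
    // Read len bytes at an absolute offset. A remote-backed channel that must
    // download the entire file to answer this call would perform poorly for
    // SSTable workloads; the file and offsets here are illustrative.
    static String readAt(Path p, long offset, int len) throws IOException {
        try (FileChannel ch = FileChannel.open(p, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate(len);
            ch.read(buf, offset); // positional read; channel position unchanged
            return new String(buf.array());
        }
    }

    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("sstable", ".db");
        Files.writeString(p, "index-block|data-block");
        System.out.println(readAt(p, 12, 10)); // prints "data-block"
    }
}
```

The same positional-read shape maps naturally onto an S3 ranged GET, which is what a performant s3fs-nio-style backend would want to issue here.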



On Tue, Sep 26, 2023 at 4:11 AM guo Maxwell  wrote:

> "Rather than building this piece by piece, I think it'd be awesome if
> someone drew up an end-to-end plan to implement tiered storage, so we can
> make sure we're discussing the whole final state, and not an implementation
> detail of one part of the final state?"
>
> Do agree with jeff for this ~~~ If these feature can be supported in oss
> cassandra , I think it will be very popular, whether in  a private
> deployment environment or a public cloud service (our experience can prove
> it). In addition, it is also a cost-cutting option for users too
>
> Jeff Jirsa  wrote on Tue, Sep 26, 2023 at 00:11:
>
>>
>> - I think this is a great step forward.
>> - Being able to move sstables around between tiers of storage is a
>> feature Cassandra desperately needs, especially if one of those tiers is
>> some sort of object storage
>> - This looks like it's a foundational piece that enables that. Perhaps by
>> a team that's already implemented this end to end?
>> - Rather than building this piece by piece, I think it'd be awesome if
>> someone drew up an end-to-end plan to implement tiered storage, so we can
>> make sure we're discussing the whole final state, and not an implementation
>> detail of one part of the final state?
>>
>>
>>
>>
>>
>>
>> On Sun, Sep 24, 2023 at 11:49 PM Claude Warren, Jr via dev <
>> [email protected]> wrote:
>>
>>> I have just filed CEP-36 [1] to allow for keyspace/table storage outside
>>> of the standard storage space.
>>>
>>> There are two desires driving this change:
>>>
>>>    1. The ability to temporarily move some keyspaces/tables to storage
>>>    outside the normal directory tree to another disk, so that compaction
>>>    can occur in situations where there is not enough disk space for
>>>    compaction and processing of the moved data cannot be suspended.
>>>    2. The ability to store infrequently used data on slower, cheaper
>>>    storage layers.
>>>
>>> I have a working POC implementation [2] though there are some issues
>>> still to be solved and much logging to be reduced.
>>>
>>> I look forward to productive discussions,
>>> Claude
>>>
>>> [1]
>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-36%3A+A+Configurable+ChannelProxy+to+alias+external+storage+locations
>>> [2] https://github.com/Claudenw/cassandra/tree/channel_proxy_factory
>>>
>>>
>>>
>
> --
> you are the apple of my eye !
>


Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2023-09-25 Thread guo Maxwell
"Rather than building this piece by piece, I think it'd be awesome if
someone drew up an end-to-end plan to implement tiered storage, so we can
make sure we're discussing the whole final state, and not an implementation
detail of one part of the final state?"

I agree with Jeff on this. If these features can be supported in OSS
Cassandra, I think they will be very popular, whether in a private
deployment environment or a public cloud service (our experience can
attest to this). In addition, it is also a cost-cutting option for users.

Jeff Jirsa wrote on Tue, Sep 26, 2023 at 00:11:

>
> - I think this is a great step forward.
> - Being able to move sstables around between tiers of storage is a feature
> Cassandra desperately needs, especially if one of those tiers is some sort
> of object storage
> - This looks like it's a foundational piece that enables that. Perhaps by
> a team that's already implemented this end to end?
> - Rather than building this piece by piece, I think it'd be awesome if
> someone drew up an end-to-end plan to implement tiered storage, so we can
> make sure we're discussing the whole final state, and not an implementation
> detail of one part of the final state?
>
>
>
>
>
>
> On Sun, Sep 24, 2023 at 11:49 PM Claude Warren, Jr via dev <
> [email protected]> wrote:
>
>> I have just filed CEP-36 [1] to allow for keyspace/table storage outside
>> of the standard storage space.
>>
>> There are two desires driving this change:
>>
>>    1. The ability to temporarily move some keyspaces/tables to storage
>>    outside the normal directory tree to another disk, so that compaction
>>    can occur in situations where there is not enough disk space for
>>    compaction and processing of the moved data cannot be suspended.
>>    2. The ability to store infrequently used data on slower, cheaper
>>    storage layers.
>>
>> I have a working POC implementation [2] though there are some issues
>> still to be solved and much logging to be reduced.
>>
>> I look forward to productive discussions,
>> Claude
>>
>> [1]
>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-36%3A+A+Configurable+ChannelProxy+to+alias+external+storage+locations
>> [2] https://github.com/Claudenw/cassandra/tree/channel_proxy_factory
>>
>>
>>

-- 
you are the apple of my eye !


Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2023-09-25 Thread Jeff Jirsa
- I think this is a great step forward.
- Being able to move sstables around between tiers of storage is a feature
Cassandra desperately needs, especially if one of those tiers is some sort
of object storage
- This looks like it's a foundational piece that enables that. Perhaps by a
team that's already implemented this end to end?
- Rather than building this piece by piece, I think it'd be awesome if
someone drew up an end-to-end plan to implement tiered storage, so we can
make sure we're discussing the whole final state, and not an implementation
detail of one part of the final state?






On Sun, Sep 24, 2023 at 11:49 PM Claude Warren, Jr via dev <
[email protected]> wrote:

> I have just filed CEP-36 [1] to allow for keyspace/table storage outside
> of the standard storage space.
>
> There are two desires driving this change:
>
>    1. The ability to temporarily move some keyspaces/tables to storage
>    outside the normal directory tree to another disk, so that compaction
>    can occur in situations where there is not enough disk space for
>    compaction and processing of the moved data cannot be suspended.
>    2. The ability to store infrequently used data on slower, cheaper
>    storage layers.
>
> I have a working POC implementation [2] though there are some issues still
> to be solved and much logging to be reduced.
>
> I look forward to productive discussions,
> Claude
>
> [1]
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-36%3A+A+Configurable+ChannelProxy+to+alias+external+storage+locations
> [2] https://github.com/Claudenw/cassandra/tree/channel_proxy_factory
>
>
>


Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2023-09-25 Thread Claude Warren, Jr via dev
External storage can be any storage for which you can produce a
FileChannel. There is an S3 library that does this, so S3 is a definite
possibility for storage in this solution. My example code only writes to a
different directory on the same system. There are also a couple of places
where I did not catch the file creation; those have to be found and
redirected to the proxy location. I think it may be necessary to have a
Java FileSystem object to make the whole thing work. The S3 library that I
found also has an S3 FileSystem class.

This solution uses the internal file name, for example an sstable name.
The proxy factory can examine the entire path and determine where to
read/write the file, so any determination that can be made based on the
information in the file path can be implemented with this approach. There
is no direct inspection of the data being written to determine routing;
the only routing data is in the file name.
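The path-based routing described above could be sketched as follows. This is a minimal illustration with hypothetical names, not the actual CEP-36 ChannelProxy API; the matching rule (substring of the path) stands in for whatever path inspection the real factory performs.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.OpenOption;
import java.nio.file.Path;
import java.util.Map;

/**
 * Hypothetical sketch: route files for selected tables to an alternate
 * storage root, based purely on inspecting the logical file path.
 */
class RoutingChannelProxyFactory {
    // table directory name fragment -> alternate storage root
    private final Map<String, Path> tableRoots;

    RoutingChannelProxyFactory(Map<String, Path> tableRoots) {
        this.tableRoots = tableRoots;
    }

    /** Decide where the file really lives by examining its logical path. */
    Path resolve(Path logical) {
        for (Map.Entry<String, Path> e : tableRoots.entrySet()) {
            // sstable paths contain the table directory name,
            // e.g. .../keyspace/table-uuid/na-1-big-Data.db
            if (logical.toString().contains(e.getKey())) {
                return e.getValue().resolve(logical.getFileName());
            }
        }
        return logical; // everything else stays in the normal directory tree
    }

    FileChannel open(Path logical, OpenOption... opts) throws IOException {
        return FileChannel.open(resolve(logical), opts);
    }
}
```

Because only the path is consulted, the same mechanism works whether the alternate root is another local disk or a mounted/virtual FileSystem backed by S3.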

I ran an in-house demo where I showed that we could reroute a single table
to different storage while leaving the rest of the tables in the same
keyspace alone.

In discussing this with a colleague we hit upon the term "tiered nodes".
If you can spread your data across the nodes so that some nodes get the
infrequently used data (cold data) and other nodes receive the frequently
used data (hot data), then the cold-data nodes can use this process to
store the data on S3 or similar systems.

On Mon, Sep 25, 2023 at 10:45 AM guo Maxwell  wrote:

> Great suggestion. Can external storage only be local storage media, or
> can data be stored in any storage medium, such as S3 object storage?
> We have previously implemented a tiered storage capability, that is,
> multiple storage media on one node (SSD, HDD) with data placement based
> on requests. After briefly browsing the proposal, it seems there are
> some differences. Can you help explain? Thanks.
>
>
> Claude Warren, Jr via dev wrote on Mon, Sep 25, 2023 at 14:49:
>
>> I have just filed CEP-36 [1] to allow for keyspace/table storage outside
>> of the standard storage space.
>>
>> There are two desires driving this change:
>>
>>    1. The ability to temporarily move some keyspaces/tables to storage
>>    outside the normal directory tree to another disk, so that compaction
>>    can occur in situations where there is not enough disk space for
>>    compaction and processing of the moved data cannot be suspended.
>>    2. The ability to store infrequently used data on slower, cheaper
>>    storage layers.
>>
>> I have a working POC implementation [2] though there are some issues
>> still to be solved and much logging to be reduced.
>>
>> I look forward to productive discussions,
>> Claude
>>
>> [1]
>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-36%3A+A+Configurable+ChannelProxy+to+alias+external+storage+locations
>> [2] https://github.com/Claudenw/cassandra/tree/channel_proxy_factory
>>
>>
>>
>
> --
> you are the apple of my eye !
>


Re: [DISCUSS] CEP-36: A Configurable ChannelProxy to alias external storage locations

2023-09-25 Thread guo Maxwell
Great suggestion. Can external storage only be local storage media, or can
data be stored in any storage medium, such as S3 object storage?
We have previously implemented a tiered storage capability, that is,
multiple storage media on one node (SSD, HDD) with data placement based on
requests. After briefly browsing the proposal, it seems there are some
differences. Can you help explain? Thanks.


Claude Warren, Jr via dev wrote on Mon, Sep 25, 2023 at 14:49:

> I have just filed CEP-36 [1] to allow for keyspace/table storage outside
> of the standard storage space.
>
> There are two desires driving this change:
>
>    1. The ability to temporarily move some keyspaces/tables to storage
>    outside the normal directory tree to another disk, so that compaction
>    can occur in situations where there is not enough disk space for
>    compaction and processing of the moved data cannot be suspended.
>    2. The ability to store infrequently used data on slower, cheaper
>    storage layers.
>
> I have a working POC implementation [2] though there are some issues still
> to be solved and much logging to be reduced.
>
> I look forward to productive discussions,
> Claude
>
> [1]
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-36%3A+A+Configurable+ChannelProxy+to+alias+external+storage+locations
> [2] https://github.com/Claudenw/cassandra/tree/channel_proxy_factory
>
>
>

-- 
you are the apple of my eye !