Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-05-04 Thread guo Maxwell
Thanks Dinesh, that will be great. On Thursday, May 4, 2023 at 11:06 PM, Dinesh Joshi wrote: > Hi Guo, > > I would expect that there would be release artifacts for the sidecar as > well as the library once this functionality is available. > > Dinesh > > On May 4, 2023, at 12:03 AM, guo Maxwell wrote: > > This is

Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-05-04 Thread Dinesh Joshi
Hi Guo, I would expect that there would be release artifacts for the sidecar as well as the library once this functionality is available. Dinesh > On May 4, 2023, at 12:03 AM, guo Maxwell wrote: > > This is a very meaningful work, thanks , but I would like to ask a question > that is not

Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-05-04 Thread guo Maxwell
This is very meaningful work, thanks, but I would like to ask a question that is not particularly related to the CEP's code design itself but to the project's engineering management: what is the future development and release plan of this project? As far as I know, project Cassandra Sidecar

Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-05-03 Thread Dinesh Joshi
If there aren't additional questions / comments I will start the VOTE thread on this CEP tonight. On 2023/05/01 19:50:12 Dinesh Joshi wrote: > Does anybody have any questions that we could answer about this proposal?

Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-05-02 Thread Dinesh Joshi
We're reusing existing Cassandra code so the performance characteristics for parsing should be the same as Cassandra. I will need to check if we have benchmarks. If we do, we'll add them to the CEP wiki page. On 2023/05/02 19:52:28 Sebastian Estevez wrote: > Hey Dinesh, > > Yeah it makes sense

Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-05-02 Thread Sebastian Estevez
Hey Dinesh, Yeah it makes sense that the sstable streaming is network bound since it's mostly just moving files. Do you have any performance stats on the sstable parsing side inside spark? --Seb On Tue, May 2, 2023 at 3:31 PM Dinesh Joshi wrote: > It is line rate / network bound. We have a

Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-05-02 Thread Dinesh Joshi
It is line rate / network bound. We have a patch out in vert.x that should use the zero copy path for it. But it's not a strict prereq for it. On 2023/05/02 15:39:02 Sebastian Estevez wrote: > Hi folks, > > Great stuff thanks for sharing. > > The performance numbers I've seen so far are for

Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-05-02 Thread Sebastian Estevez
Hi folks, Great stuff thanks for sharing. The performance numbers I've seen so far are for the sidecar streaming sstables (seems like this is just network bound?). What kind of perf are you seeing at the Spark executors (at the per task level)? --Seb On Mon, May 1, 2023 at 3:50 PM Dinesh Joshi

Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-05-01 Thread Dinesh Joshi
Does anybody have any questions that we could answer about this proposal? > On Apr 27, 2023, at 1:24 PM, Francisco Guerrero > wrote: > > Hi folks, > > We have updated the confluence page with the source code for CEP-28. > There are two repositories with contributions. One is the patch [1] >

RE: Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-04-27 Thread Francisco Guerrero
Hi folks, We have updated the confluence page with the source code for CEP-28. There are two repositories with contributions. One is the patch [1] for Cassandra Sidecar with the bulk APIs that enable the Cassandra Spark Analytics library. The second is a new repository [2] with contributions

Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-04-12 Thread James Berragan
> From: Doug Rohrer <droh...@apple.com> > Sent: Tuesday, April 11, 2023 0:37 > To: dev@cassandra.apache.org > Subject: Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark > Bulk A

Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-04-11 Thread Miklosovic, Stefan
/debezium/debezium-connector-cassandra From: Doug Rohrer Sent: Tuesday, April 11, 2023 0:37 To: dev@cassandra.apache.org Subject: Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-04-11 Thread J. D. Jordan
Thanks for those. They are very helpful. I think the CEP needs to call out all of the classes/interfaces from the cassandra-all jar that the “Spark driver” is using. Given this CEP is exposing “sstables as an external API” I would think all the interfaces and code associated with using those would

Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-04-10 Thread Doug Rohrer
I’ve updated the CEP with two overview diagrams of the interactions between Sidecar, Cassandra, and the Bulk Analytics library. Hope this helps folks better understand how things work, and thanks for the patience as it took a bit longer than expected for me to find the time for this. Doug >

Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-04-05 Thread Doug Rohrer
Sorry for the delay in responding here - yes, we can add some diagrams to the CEP - I’ll try to get that done by end-of-week. Thanks, Doug > On Mar 28, 2023, at 1:14 PM, J. D. Jordan wrote: > > Maybe some data flow diagrams could be added to the cep showing some example > operations for

Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-28 Thread J. D. Jordan
Maybe some data flow diagrams could be added to the CEP showing some example operations for read/write? On Mar 28, 2023, at 11:35 AM, Yifan Cai wrote: A lot of great discussions! On the sidecar front, especially what the role sidecar plays in terms of this CEP, I feel there might be some

Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-28 Thread Yifan Cai
A lot of great discussions! On the sidecar front, especially what the role sidecar plays in terms of this CEP, I feel there might be some confusion. Once the code is published, we should have clarity. Sidecar does not read sstables nor do any coordination for analytics queries. It is local to the

Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-28 Thread Benedict
I disagree with the first claim, as the process has all the information it chooses to utilise about which resources it’s using and what it’s using those resources for. The inability to isolate GC domains is something we cannot address, but also probably not a problem if we were doing everything

Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-28 Thread Derek Chen-Becker
On Tue, Mar 28, 2023 at 9:03 AM Joseph Lynch wrote: ... > I think we might be underselling how valuable JVM isolation is, > especially for analytics queries that are going to pass the entire > dataset through heap somewhat constantly. Big +1 here. The JVM simply does not have significant

Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-28 Thread Jeremiah D Jordan
> One of the explicit goals of making an official sidecar project was to > try to make it something the project does not break compatibility with > as one of the main issues the third-party sidecars (that handle > distributed control, backup, repair, etc ...) have is they break > constantly

Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-28 Thread Benedict
FWIW, I’m sceptical of the performance angle long term. You can do a lot more to control QoS when you understand what each query is doing, and what your SLOs are. You can also more efficiently apportion your resources (not leaving any lying fallow to ensure it’s free later). But, we’re a long

Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-28 Thread Joseph Lynch
> If we want to bring groups/containers/etc into the default deployment > mechanisms of C*, great. I am all for dividing it up into micro services > given we solve all the problems I listed in the complexity section. > > I am actually all for dividing C* up into multiple micro services, but the

Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-28 Thread Joseph Lynch
One of the explicit goals of making an official sidecar project was to try to make it something the project does not break compatibility with as one of the main issues the third-party sidecars (that handle distributed control, backup, repair, etc ...) have is they break constantly because C*

Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-28 Thread Jeremiah D Jordan
>> Given the sidecar is running on the same node as the main C* process, the >> only real resource isolation you have is in heap/GC? CPU/Memory/IO are all >> still shared between the main C* process and the side car, and coordinating >> those across processes is harder than coordinating them

Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-28 Thread Jeff Jirsa
On Tue, Mar 28, 2023 at 7:30 AM Jeremiah D Jordan wrote: > - Resources isolation. Having the said service running within the same JVM > may negatively impact Cassandra storage's performance. It could be more > beneficial to have them in Sidecar, which offers strong resource isolation >

Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-28 Thread Jeremiah D Jordan
> - Resources isolation. Having the said service running within the same JVM > may negatively impact Cassandra storage's performance. It could be more > beneficial to have them in Sidecar, which offers strong resource isolation > guarantees. How does having this in a side car change the impact

Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-27 Thread James Berragan
Complex predicates on non-partition keys naturally require pulling the entire data set into the Spark DataFrame to perform the query. We have some optimizations around column filtering and partition key predicates, utilizing the Filter.db/Summary.db/Index.db files to only read the data it
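To illustrate why a partition key predicate can avoid reading data: in Cassandra, Filter.db holds a per-SSTable Bloom filter, so a reader can rule out SSTables that definitely do not contain a key. The following is a minimal sketch of that idea only. The class `TinyBloomFilter`, its hash function, and its sizing are hypothetical and are not the library's or Cassandra's implementation.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

// Hypothetical illustration; not Cassandra's Filter.db format or hashing.
final class TinyBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    TinyBloomFilter(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    // FNV-1a style hash, seeded differently for each of the `hashes` probes.
    private int index(byte[] key, int probe) {
        int h = 0x811C9DC5 ^ (probe * 0x9E3779B9);
        for (byte b : key) {
            h ^= (b & 0xFF);
            h *= 0x01000193;
        }
        return Math.floorMod(h, size);
    }

    void add(String partitionKey) {
        byte[] k = partitionKey.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < hashes; i++) {
            bits.set(index(k, i));
        }
    }

    // false means "definitely absent": the SSTable can be skipped entirely.
    // true means "possibly present": the SSTable still has to be read.
    boolean mightContain(String partitionKey) {
        byte[] k = partitionKey.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < hashes; i++) {
            if (!bits.get(index(k, i))) {
                return false;
            }
        }
        return true;
    }
}
```

A reader built this way would call `mightContain` once per SSTable and open only those that return true. A predicate on a non-partition column gets no such shortcut, which is consistent with the point above about pulling the whole data set.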

Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-27 Thread Jeremy Hanna
Thank you for the write-up and the efforts on CASSANDRA-16222. It sounds like you've been using this for some time. I understand from the rejected alternatives that the Spark Cassandra Connector was slower because it goes through the read and write path for C* rather than this backdoor

Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-27 Thread James Berragan
On the Sidecar discussion, while Sidecar is the preferred mechanism for the reasons described, the API is sufficiently generic to plug in a user implementation (essentially, provide a list of SSTables for a token range, and a mechanism to open an InputStream on any SSTable file
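The two capabilities named above (listing SSTables for a token range and opening an InputStream on an SSTable file) could be sketched as a small interface. This is a rough illustration under stated assumptions: `SSTableSource`, `InMemorySSTableSource`, and every method name and signature here are hypothetical and are not taken from the CEP-28 code.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical names throughout; none of these come from the CEP-28 source.
interface SSTableSource {
    /** Names of SSTables overlapping the inclusive token range [startToken, endToken]. */
    List<String> listSSTables(long startToken, long endToken);

    /** Opens a stream over one component file (for example "Data.db") of an SSTable. */
    InputStream open(String sstable, String component) throws IOException;
}

// Toy implementation backed by in-memory maps, standing in for Sidecar or any other store.
final class InMemorySSTableSource implements SSTableSource {
    private final Map<String, long[]> ranges = new TreeMap<>(); // name -> {firstToken, lastToken}
    private final Map<String, byte[]> files = new TreeMap<>();  // "name/component" -> bytes

    void add(String sstable, long firstToken, long lastToken, String component, String content) {
        ranges.put(sstable, new long[] {firstToken, lastToken});
        files.put(sstable + "/" + component, content.getBytes(StandardCharsets.UTF_8));
    }

    @Override
    public List<String> listSSTables(long startToken, long endToken) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, long[]> e : ranges.entrySet()) {
            long[] r = e.getValue();
            // Two inclusive ranges overlap when each starts before the other ends.
            if (r[0] <= endToken && r[1] >= startToken) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    @Override
    public InputStream open(String sstable, String component) throws IOException {
        byte[] bytes = files.get(sstable + "/" + component);
        if (bytes == null) {
            throw new IOException("No such file: " + sstable + "/" + component);
        }
        return new ByteArrayInputStream(bytes);
    }
}
```

The point of the design is that the reader depends only on the interface, so a Sidecar-backed, filesystem-backed, or object-store-backed implementation could be swapped in without changing the reading code.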

Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-26 Thread Josh McKenzie
I want to second what Yifan's spoken to, specifically in terms of resource isolation and availability. While the sidecar hasn't seen a ton of traffic and contributions since the acceptance into the project and clearance of CEP-1, my intuition is that that's due to the entrenched maturity of

Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-25 Thread Brandon Williams
Oh, that's significantly different and great news, please do! Thanks for the clarification, Doug! Kind Regards, Brandon On Fri, Mar 24, 2023 at 4:42 PM Doug Rohrer wrote: > > I agree that the analytics library will need to support vnodes. To be clear, > there’s nothing preventing the solution

Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-24 Thread Yifan Cai
Hi Jeremiah, There are good reasons to not have these inside Cassandra. Consider the following. - Resources isolation. Having the said service running within the same JVM may negatively impact Cassandra storage's performance. It could be more beneficial to have them in Sidecar, which offers

Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-24 Thread Doug Rohrer
I agree that the analytics library will need to support vnodes. To be clear, there’s nothing preventing the solution from working with vnodes right now, and no assumptions about a 1:1 topology between a token and a node. However, we don’t, today, have the ability to test vnode support

Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-24 Thread Brandon Williams
On Fri, Mar 24, 2023 at 10:39 AM Jeremiah D Jordan wrote: > > I have concerns with the majority of this being in the sidecar and not in the > database itself. I think it would make sense for the server side of this to > be a new service exposed by the database, not in the sidecar. That way it

Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-24 Thread Jeremiah D Jordan
>> From: Doug Rohrer <droh...@apple.com> >> Sent: Thursday, March 23, 2023 18:33 >> To: dev@cassandra.apache.org

Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-24 Thread Dinesh Joshi
.apache.org <mailto:dev@cassandra.apache.org> > Cc: James Berragan > Subject: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with > Spark Bulk Analytics > > NetApp Security WARNING: This is an external email. Do not click > links or open attac

Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-24 Thread Miklosovic, Stefan
: Benjamin Lerer Sent: Friday, March 24, 2023 10:35 To: dev@cassandra.apache.org Subject: Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-24 Thread Benjamin Lerer
it might be a logical > replacement of that. > > Regards > > > From: Doug Rohrer > Sent: Thursday, March 23, 2023 18:33 > To: dev@cassandra.apache.org > Cc: James Berragan > Subject: [DISCUSS] CEP-28: Reading and Writing Ca

Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-23 Thread Miklosovic, Stefan
From: Doug Rohrer Sent: Thursday, March 23, 2023 18:33 To: dev@cassandra.apache.org Cc: James Berragan Subject: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

[DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-03-23 Thread Doug Rohrer
Hi everyone, Wiki: https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-28%3A+Reading+and+Writing+Cassandra+Data+with+Spark+Bulk+Analytics We’d like to propose this CEP for adoption by the community. It is common for teams using Cassandra to find themselves looking for a way to interact