RE: Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

Francisco Guerrero Thu, 27 Apr 2023 13:25:22 -0700

Hi folks,


We have updated the Confluence page with the source code for CEP-28.
There are two repositories with contributions. One is the patch [1]
for Cassandra Sidecar with the bulk APIs that enable the Cassandra
Spark Analytics library. The second is a new repository [2] with
contributions to the Cassandra Spark Analytics code.


We also have a README markdown file that you can follow to give the
code a try:


https://github.com/frankgh/cassandra-analytics/blob/trunk/cassandra-analytics-core-example/README.md


Best,

- Francisco


[1] Apache Cassandra Sidecar bulk APIs source code:
https://github.com/frankgh/cassandra-sidecar/tree/CEP-28-bulk-apis

[2] Apache Cassandra Spark Analytics source code:
https://github.com/frankgh/cassandra-analytics


On 2023/04/05 15:18:07 Doug Rohrer wrote:
> Sorry for the delay in responding here - yes, we can add some diagrams
> to the CEP - I’ll try to get that done by end-of-week.
>
> Thanks,
>
> Doug
>
> > On Mar 28, 2023, at 1:14 PM, J. D. Jordan <jeremiah.jor...@gmail.com>
> > wrote:
> >
> > Maybe some data flow diagrams could be added to the cep showing some
> > example operations for read/write?
> >
> >> On Mar 28, 2023, at 11:35 AM, Yifan Cai <yc25c...@gmail.com> wrote:
> >>
> >> A lot of great discussions!
> >>
> >> On the sidecar front, especially what the role sidecar plays in
> >> terms of this CEP, I feel there might be some confusion. Once the
> >> code is published, we should have clarity.
> >> Sidecar does not read sstables nor do any coordination for analytics
> >> queries. It is local to the companion Cassandra instance. For bulk
> >> read, it takes snapshots and streams sstables to spark workers to
> >> read. For bulk write, it imports the sstables uploaded from spark
> >> workers. All commands are existing jmx/nodetool functionalities from
> >> Cassandra. Sidecar adds the http interface to them. It might be an
> >> over simplified description. The complex computation is performed in
> >> spark clusters only.
> >>
> >> In the long run, Cassandra might evolve into a database that does
> >> both OLTP and OLAP. (Not what this thread aims for)
> >> At the current stage, Spark is very suited for analytic purposes.
> >>
> >> On Tue, Mar 28, 2023 at 9:06 AM Benedict <bened...@apache.org> wrote:
> >>> I disagree with the first claim, as the process has all the
> >>> information it chooses to utilise about which resources it’s using
> >>> and what it’s using those resources for.
> >>>
> >>> The inability to isolate GC domains is something we cannot address,
> >>> but also probably not a problem if we were doing everything with
> >>> memory management as well as we could be.
> >>>
> >>> But, not worth detailing this thread for. Today we do very little
> >>> well on this front within the process, and a separate process is
> >>> well justified given the state of play.
> >>>
> >>>> On 28 Mar 2023, at 16:38, Derek Chen-Becker
> >>>> <de...@chen-becker.org> wrote:
> >>>>
> >>>> On Tue, Mar 28, 2023 at 9:03 AM Joseph Lynch
> >>>> <joe.e.ly...@gmail.com> wrote:
> >>>> ...
> >>>>
> >>>>> I think we might be underselling how valuable JVM isolation is,
> >>>>> especially for analytics queries that are going to pass the entire
> >>>>> dataset through heap somewhat constantly.
> >>>>
> >>>> Big +1 here. The JVM simply does not have significant granularity
> >>>> of control for resource utilization, but this is explicitly a
> >>>> feature of separate processes. Add in being able to separate GC
> >>>> domains and you can avoid a lot of noisy neighbor in-VM behavior
> >>>> for the disparate workloads.
> >>>>
> >>>> Cheers,
> >>>>
> >>>> Derek
> >>>>
> >>>> --
> >>>> +---------------------------------------------------------------+
> >>>> | Derek Chen-Becker                                             |
> >>>> | GPG Key available at https://keybase.io/dchenbecker and       |
> >>>> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
> >>>> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7 7F42 AFC5 AFEE 96E4 6ACC   |
> >>>> +---------------------------------------------------------------+
-- 
Francisco Guerrero
