Re: About the relationship between the sstable compaction and the read path

2019-01-09 Thread Jinhua Luo
> We stop at the memtable if we know that’s all we need. This depends on a lot 
> of factors (schema, point read vs slice, etc)

The code seems to search sstables without checking whether the query is
already satisfied by the memtable alone.
Could you point out the related code snippets for what you said?




Could you give a quick and simple answer to my questions about the complex types:

For collections: when I select a column of a collection type, e.g. a map, it
is necessary to search all sstables to ensure the whole set of map entries is
collected.

For a UDT, it needs to ensure that all fields of the UDT are collected.

For counters, it needs to merge all mutations distributed across all sstables
to produce the final counter value.




Another related question: an sstable only contains a partition key index and a
clustering key index (inline within the index file), but no index for
collections such as map and set. So, to read a single entry, does Cassandra
need to iterate over all the cells, or can it do a quick lookup based on
their sorted order?
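
To make the collection case concrete, here is a minimal sketch of the situation
described above (Python, DataStax driver; the contact point, keyspace and table
names are made-up placeholders): with a non-frozen map, every key is stored as
its own cell, so cells written at different times can land in different
sstables and a read of the whole map has to merge them.

    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])       # placeholder contact point
    session = cluster.connect("test_ks")   # placeholder keyspace

    # Non-frozen map: every map key is stored as a separate cell.
    session.execute(
        "CREATE TABLE IF NOT EXISTS profiles (id int PRIMARY KEY, attrs map<text, text>)")

    # These two updates touch different map keys. If a memtable flush happens
    # between them, the two cells end up in two different sstables.
    session.execute("UPDATE profiles SET attrs['city'] = 'Oslo' WHERE id = 1")
    # ... imagine a flush (e.g. nodetool flush) happening here ...
    session.execute("UPDATE profiles SET attrs['lang'] = 'en' WHERE id = 1")

    # Reading the whole map must merge the cells from every sstable that may
    # still hold part of the partition (after bloom-filter/metadata exclusion).
    row = session.execute("SELECT attrs FROM profiles WHERE id = 1").one()
    print(row.attrs)   # {'city': 'Oslo', 'lang': 'en'}

Counters are similar in spirit: the current value is the merge of whatever
counter shards survive in the memtable and sstables for that partition.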

Jeff Jirsa wrote on Wed, Jan 9, 2019 at 10:43 PM:
>
> You’re comparing single machine key/value stores to a distributed db with a 
> much richer data model (partitions/slices, statics, range reads, range 
> deletions, etc). They’re going to read very differently. Instead of 
> explaining why they’re not like rocks/ldb, how about you tell us what you’re 
> trying to do / learn so we can answer the real question?
>
> Few other notes inline.
>
> --
> Jeff Jirsa
>
>
> > On Jan 8, 2019, at 10:51 PM, Jinhua Luo  wrote:
> >
> > Thanks. Let me clarify my questions more.
> >
> > 1) For memtable, if the selected columns (assuming they are in simple
> > types) could be found in memtable only, why bother to search sstables
> > then? In leveldb and rocksdb, they would stop consulting sstables if
> > the memtable already fulfill the query.
>
> We stop at the memtable if we know that’s all we need. This depends on a lot 
> of factors (schema, point read vs slice, etc)
>
> >
> > 2) For STCS and LCS, obviously, the sstables are grouped in
> > generations (old mutations would promoted into next level or bucket),
> > so why not search the columns level by level (or bucket by bucket)
> > until all selected columns are collected? In leveldb and rocksdb, they
> > do in this way.
>
> They’re single machine and Cassandra isn’t. There’s no guarantee in Cassandra 
> that the small sstables in stcs or low levels in LCS are newest:
>
> - you can write arbitrary timestamps into the memtable
> - read repair can put old data in the memtable
> - streaming (bootstrap/repair) can put old data into new files
> - user processes (nodetool refresh) can put old data into new files
>
>
> >
> > 3) Could you explain the collection, cdt and counter types in more
> > detail? Does they need to iterate all sstables? Because they could not
> > be simply filtered by timestamp or value range.
> >
>
> I can’t (combination of time available and it’s been a long time since I’ve 
> dealt with that code and I don’t want to misspeak).
>
>
> > For collection, when I select a column of collection type, e.g.
> > map, to ensure the whole set of map fields is collected,
> > it is necessary to search in all sstables.
> >
> > For cdt, it needs to ensure all fields of the cdt is collected.
> >
> > For counter, it needs to merge all mutations distributed in all
> > sstables to give a final state of counter value.
> >
> > Am I correct? If so, then there three complex types seems less
> > efficient than simple types, right?
> >
> > Jeff Jirsa wrote on Tue, Jan 8, 2019 at 11:58 PM:
> >>
> >> First:
> >>
> >> Compaction controls how sstables are combined but not how they’re read. 
> >> The read path (with one tiny exception) doesn’t know or care which 
> >> compaction strategy you’re using.
> >>
> >> A few more notes inline.
> >>
> >>> On Jan 8, 2019, at 3:04 AM, Jinhua Luo  wrote:
> >>>
> >>> Hi All,
> >>>
> >>> The compaction would organize the sstables, e.g. with LCS, the
> >>> sstables would be categorized into levels, and the read path should
> >>> read sstables level by level until the read is fulfilled, correct?
> >>
> >> LCS levels are to minimize the number of sstables scanned - at most one 
> >> per level - but there’s no attempt to fulfill the read with low levels 
> >> beyond the filtering done by timestamp.
> >>
> >>>
> >>> For STCS, it would search sstables in buckets from smallest to largest?
> >>
> >> Nope. No attempt to do this.
> >>
> >>>
> >>> What about other compaction cases? They would iterate all sstables?
> >>
> >> In all cases, we’ll use a combination of bloom filters and sstable 
> >> metadata and indices to include / exclude sstables. If the bloom filter 
> >> hits, we’ll consider things like timestamps and whether or not the min/max 
> >> clustering of the sstable matches the slice we care about. We don’t 
> >> consult the compaction strategy, though the compaction strategy may have 
> >> (in the case of LCS or TWCS) placed the sstables into a state that makes 
> >> this read less expensive.

Re: [EXTERNAL] Re: Good way of configuring Apache spark with Apache Cassandra

2019-01-09 Thread Dor Laor
On Wed, Jan 9, 2019 at 7:28 AM Durity, Sean R 
wrote:

> I think you could consider option C: Create a (new) analytics DC in
> Cassandra and run your spark nodes there. Then you can address the scaling
> just on that DC. You can also use less vnodes, only replicate certain
> keyspaces, etc. in order to perform the analytics more efficiently.
>

But this way you duplicate the entire dataset another RF times over, which is
very expensive.
It is common practice to run Spark against a separate Cassandra (virtual)
datacenter, but that is done to isolate the analytic workload from the
realtime workload and preserve low-latency guarantees.
We addressed this problem elsewhere; it's beyond this scope.
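
As a rough sketch of that isolation pattern (the DC name, host, keyspace/table
and the exact option names are assumptions; connection options vary between
spark-cassandra-connector versions), the Spark job is pointed at the analytics
datacenter and kept at LOCAL_* consistency so it never reads from the realtime DC:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("analytics-job")
             .config("spark.cassandra.connection.host", "analytics-node-1")
             # Pin the driver to the analytics DC; recent connector versions
             # call this spark.cassandra.connection.localDC (name may differ).
             .config("spark.cassandra.connection.localDC", "analytics")
             .config("spark.cassandra.input.consistency.level", "LOCAL_ONE")
             .getOrCreate())

    df = (spark.read.format("org.apache.spark.sql.cassandra")
          .options(keyspace="my_ks", table="events")
          .load())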


>
>
>
> Sean Durity
>
>
>
> *From:* Dor Laor 
> *Sent:* Friday, January 04, 2019 4:21 PM
> *To:* user@cassandra.apache.org
> *Subject:* [EXTERNAL] Re: Good way of configuring Apache spark with
> Apache Cassandra
>
>
>
> I strongly recommend option B, separate clusters. Reasons:
>
>  - Networking of node-node is negligible compared to networking within the
> node
>
>  - Different scaling considerations
>
>Your workload may require 10 Spark nodes and 20 database nodes, so why
> bundle them?
>
>This ratio may also change over time as your application evolves and
> amount of data changes.
>
>  - Isolation - If Spark has a spike in cpu/IO utilization, you wouldn't
> want it to affect Cassandra and the opposite.
>
>If you isolate it with cgroups, you may have too much idle time when
> the above doesn't happen.
>
>
>
>
>
> On Fri, Jan 4, 2019 at 12:47 PM Goutham reddy 
> wrote:
>
> Hi,
>
> We have requirement of heavy data lifting and analytics requirement and
> decided to go with Apache Spark. In the process we have come up with two
> patterns
>
> a. Apache Spark and Apache Cassandra co-located and shared on same nodes.
>
> b. Apache Spark on one independent cluster and Apache Cassandra as one
> independent cluster.
>
>
>
> Need good pattern how to use the analytic engine for Cassandra. Thanks in
> advance.
>
>
>
> Regards
>
> Goutham.
>
>


Re: [EXTERNAL] Howto avoid tombstones when inserting NULL values

2019-01-09 Thread Jonathan Haddad
> I’m still not sure if having tombstones vs. empty values / frozen UDTs
> will have the same results.

When in doubt, benchmark.

Good luck,
Jon
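
A minimal sketch of such a benchmark (Python driver; the two table names and
partition key layout are hypothetical stand-ins for the two schema variants
discussed in the thread below):

    import time
    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])       # placeholder contact point
    session = cluster.connect("test_ks")   # placeholder keyspace

    def bench(query, params, n=500):
        """Average latency of n sequential reads, in seconds."""
        start = time.perf_counter()
        for _ in range(n):
            list(session.execute(query, params))   # materialize all rows
        return (time.perf_counter() - start) / n

    # events_flat: one column per field (nulls become cell tombstones).
    # events_frozen: a single frozen UDT / blob column.
    flat = bench(
        "SELECT * FROM events_flat WHERE year=%s AND month=%s AND day=%s",
        (2019, 1, 9))
    frozen = bench(
        "SELECT * FROM events_frozen WHERE year=%s AND month=%s AND day=%s",
        (2019, 1, 9))
    print("flat: %.2f ms/read, frozen: %.2f ms/read" % (flat * 1000, frozen * 1000))

Tracing a few of the queries (and watching the tombstone warnings in the logs)
alongside the latency numbers shows whether tombstones are actually the
dominant cost.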

On Wed, Jan 9, 2019 at 3:02 PM Tomas Bartalos 
wrote:

> Loosing atomic updates is a good point, but in my use case its not a
> problem, since I always overwrite the whole record (no partitial updates).
>
> I’m still not sure if having tombstones vs. empty values / frozen UDTs
> will have the same results.
> When I update one row with 10 null columns it will create 10 tombstones.
> We do OLAP processing of data stored in Cassandra with Spark.
>
> When Spark requests range of data, lets say 1000 rows, I can easily hit
> the 10 000 tombstones threshold.
>
> Even if I would not hit the error threshold Spark requests would increase
> the heap pressure, because tombstones have to be collected and returned to
> coordinator.
>
> Are my assumptions correct ?
>
> On 4 Jan 2019, at 21:15, DuyHai Doan  wrote:
>
> The idea of storing your data as a single blob can be dangerous.
>
> Indeed, you loose the ability to perform atomic update on each column.
>
> In Cassandra, LWW is the rule. Suppose 2 concurrent updates on the same
> row, 1st update changes column Firstname (let's say it's a Person record)
> and 2nd update changes column Lastname
>
> Now depending on the timestamp between the 2 updates, you'll have:
>
> - old Firstname, new Lastname
> - new Firstname, old Lastname
>
> having updates on columns atomically guarantees you to have new Firstname,
> new Lastname
>
> On Fri, Jan 4, 2019 at 8:17 PM Jonathan Haddad  wrote:
>
>> Those are two different cases though.  It *sounds like* (again, I may be
>> missing the point) you're trying to overwrite a value with another value.
>> You're either going to serialize a blob and overwrite a single cell, or
>> you're going to overwrite all the cells and include a tombstone.
>>
>> When you do a read, reading a single tombstone vs a single vs is
>> essentially the same thing, performance wise.
>>
>> In your description you said "~ 20-100 events", and you're overwriting
>> the event each time, so I don't know how you go to 10K tombstones either.
>> Compaction will bring multiple tombstones together for a cell in the same
>> way it compacts multiple values for a single cell.
>>
>> I sounds to make like you're taking some advice about tombstones out of
>> context and trying to apply the advice to a different problem.  Again, I
>> might be misunderstanding what you're doing.
>>
>>
>> On Fri, Jan 4, 2019 at 10:49 AM Tomas Bartalos 
>> wrote:
>>
>>> Hello Jon,
>>>
>>> I thought having tombstones is much higher overhead than just
>>> overwriting values. The compaction overhead can be l similar, but I think
>>> the read performance is much worse.
>>>
>>> Tombstones accumulate and hang for 10 days (by default) before they are
>>> eligible for compaction.
>>>
>>> Also we have tombstone warning and error thresholds. If cassandra scans
>>> more than 10 000 tombstones, she will abort the query.
>>>
>>> According to this article:
>>> https://opencredo.com/blogs/cassandra-tombstones-common-issues/
>>>
>>> "The cassandra.yaml comments explain in perfectly: *“When executing a
>>> scan, within or across a partition, we need to keep the tombstones seen in
>>> memory so we can return them to the coordinator, which will use them to
>>> make sure other replicas also know about the deleted rows. With workloads
>>> that generate a lot of tombstones, this can cause performance problems and
>>> even exhaust the server heap. "*
>>>
>>> Regards,
>>> Tomas
>>>
>>> On Fri, 4 Jan 2019, 7:06 pm Jonathan Haddad wrote:
 If you're overwriting values, it really doesn't matter much if it's a
 tombstone or any other value, they still need to be compacted and have the
 same overhead at read time.

 Tombstones are problematic when you try to use Cassandra as a queue (or
 something like a queue) and you need to scan over thousands of tombstones
 in order to get to the real data.  You're simply overwriting a row and
 trying to avoid a single tombstone.

 Maybe I'm missing something here.  Why do you think overwriting a
 single cell with a tombstone is any worse than overwriting a single cell
 with a value?

 Jon


 On Fri, Jan 4, 2019 at 9:57 AM Tomas Bartalos 
 wrote:

> Hello,
>
> I beleive your approach is the same as using spark with "
> spark.cassandra.output.ignoreNulls=true"
> This will not cover the situation when a value have to be overwriten
> with null.
>
> I found one possible solution - change the schema to keep only primary
> key fields and move all other fields to frozen UDT.
> create table (year, month, day, id, frozen, primary key((year,
> month, day), id) )
> In this way anything that is null inside event doesn't create
> tombstone, since event is serialized to BLOB.
> The penalty is in need of deserializing the whole Event when selecting only few columns.

Re: Cassandra and Apache Arrow

2019-01-09 Thread Jonathan Haddad
Not sure why they put that in there; it's definitely misleading. There's
nothing Arrow-related in Cassandra.

There's an open JIRA, but nothing has been committed yet:
https://issues.apache.org/jira/browse/CASSANDRA-9259
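
The closest thing available today is doing the Arrow conversion on the Spark
side rather than inside Cassandra. A sketch (host, keyspace and table are
placeholders; it assumes the spark-cassandra-connector and pyarrow are
available):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("cassandra-to-arrow")
             .config("spark.cassandra.connection.host", "127.0.0.1")
             # Spark 3.x option; Spark 2.x used spark.sql.execution.arrow.enabled
             .config("spark.sql.execution.arrow.pyspark.enabled", "true")
             .getOrCreate())

    df = (spark.read.format("org.apache.spark.sql.cassandra")
          .options(keyspace="my_ks", table="events")
          .load())

    # toPandas() ships the collected rows to the driver as Arrow record batches
    # when the option above is enabled. Cassandra itself still speaks its native
    # protocol; the Arrow conversion happens entirely inside Spark.
    pdf = df.toPandas()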

On Wed, Jan 9, 2019 at 3:48 PM Tomas Bartalos 
wrote:

> There is a diagram on the homepage displaying Cassandra (with other
> storages) as source of data.
> https://arrow.apache.org/img/shared.png
>
> Which made me think there should be some integration...
>
> On Thu, 10 Jan 2019, 12:38 am Jonathan Haddad 
>> Where are you seeing that it works with Cassandra?  There's no mention of
>> it under https://arrow.apache.org/powered_by/, and on the homepage it
>> says only says that a Cassandra developer worked on it.
>>
>> We (unfortunately) don't do anything with it at the moment.
>>
>> On Wed, Jan 9, 2019 at 3:24 PM Tomas Bartalos 
>> wrote:
>>
>>> I’ve read lot of nice things about Apache Arrow in-memory columnar
>>> format. On their homepage they mention Cassandra as a possible storage
>>> which could interoperate with Arrow. Unfortunately I was not able to find
>>> any working example which would demonstrate their cooperation.
>>>
>>> *My use case:* I’m doing OLAP processing of data stored in Cassandra
>>> with Spark. I need to deduplicate data with Cassandra’s upserts, so other
>>> (more-suitable) storages like HDFS + parquet, ORC didn’t seem like an
>>> option.
>>> *What I’d like to achieve: *speed-up spark’s data ingestion from
>>> Cassandra.
>>>
>>> Is it possible to query data from Cassandra in Arrow format ?
>>>
>>
>>
>> --
>> Jon Haddad
>> http://www.rustyrazorblade.com
>> twitter: rustyrazorblade
>>
>

-- 
Jon Haddad
http://www.rustyrazorblade.com
twitter: rustyrazorblade


Re: Cassandra and Apache Arrow

2019-01-09 Thread Tomas Bartalos
There is a diagram on the homepage displaying Cassandra (with other
storages) as source of data.
https://arrow.apache.org/img/shared.png

Which made me think there should be some integration...

On Thu, 10 Jan 2019, 12:38 am Jonathan Haddad wrote:

> Where are you seeing that it works with Cassandra?  There's no mention of
> it under https://arrow.apache.org/powered_by/, and on the homepage it
> says only says that a Cassandra developer worked on it.
>
> We (unfortunately) don't do anything with it at the moment.
>
> On Wed, Jan 9, 2019 at 3:24 PM Tomas Bartalos 
> wrote:
>
>> I’ve read lot of nice things about Apache Arrow in-memory columnar
>> format. On their homepage they mention Cassandra as a possible storage
>> which could interoperate with Arrow. Unfortunately I was not able to find
>> any working example which would demonstrate their cooperation.
>>
>> *My use case:* I’m doing OLAP processing of data stored in Cassandra
>> with Spark. I need to deduplicate data with Cassandra’s upserts, so other
>> (more-suitable) storages like HDFS + parquet, ORC didn’t seem like an
>> option.
>> *What I’d like to achieve: *speed-up spark’s data ingestion from
>> Cassandra.
>>
>> Is it possible to query data from Cassandra in Arrow format ?
>>
>
>
> --
> Jon Haddad
> http://www.rustyrazorblade.com
> twitter: rustyrazorblade
>


Re: Cassandra and Apache Arrow

2019-01-09 Thread Jonathan Haddad
Where are you seeing that it works with Cassandra?  There's no mention of
it under https://arrow.apache.org/powered_by/, and on the homepage it only
says that a Cassandra developer worked on it.

We (unfortunately) don't do anything with it at the moment.

On Wed, Jan 9, 2019 at 3:24 PM Tomas Bartalos 
wrote:

> I’ve read lot of nice things about Apache Arrow in-memory columnar format.
> On their homepage they mention Cassandra as a possible storage which could
> interoperate with Arrow. Unfortunately I was not able to find any working
> example which would demonstrate their cooperation.
>
> *My use case:* I’m doing OLAP processing of data stored in Cassandra with
> Spark. I need to deduplicate data with Cassandra’s upserts, so other
> (more-suitable) storages like HDFS + parquet, ORC didn’t seem like an
> option.
> *What I’d like to achieve: *speed-up spark’s data ingestion from
> Cassandra.
>
> Is it possible to query data from Cassandra in Arrow format ?
>


-- 
Jon Haddad
http://www.rustyrazorblade.com
twitter: rustyrazorblade


Cassandra and Apache Arrow

2019-01-09 Thread Tomas Bartalos
I’ve read a lot of nice things about the Apache Arrow in-memory columnar format.
On their homepage they mention Cassandra as a possible storage which could
interoperate with Arrow. Unfortunately, I was not able to find any working
example that demonstrates them working together.

My use case: I’m doing OLAP processing of data stored in Cassandra with Spark.
I need to deduplicate data with Cassandra’s upserts, so other (more suitable)
storages like HDFS + Parquet or ORC didn’t seem like an option.
What I’d like to achieve: speed up Spark’s data ingestion from Cassandra.

Is it possible to query data from Cassandra in Arrow format ?

Re: [EXTERNAL] Howto avoid tombstones when inserting NULL values

2019-01-09 Thread Tomas Bartalos
Losing atomic updates is a good point, but in my use case it's not a problem,
since I always overwrite the whole record (no partial updates).

I’m still not sure whether having tombstones vs. empty values / frozen UDTs will
give the same results.
When I update one row with 10 null columns, it will create 10 tombstones.
We do OLAP processing of data stored in Cassandra with Spark.

When Spark requests a range of data, let's say 1000 rows, I can easily hit the
10,000-tombstone threshold.

Even if I did not hit the error threshold, Spark requests would increase heap
pressure, because tombstones have to be kept in memory and returned to the
coordinator.

Are my assumptions correct?
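
A small sketch of the difference being discussed (Python driver; the table and
column names are made up to mirror the schema in the thread): binding an
explicit null writes a cell tombstone, while binding an "unset" value writes
nothing at all for that column, which is effectively what
spark.cassandra.output.ignoreNulls=true does.

    from cassandra.cluster import Cluster
    from cassandra.query import UNSET_VALUE

    cluster = Cluster(["127.0.0.1"])       # placeholder contact point
    session = cluster.connect("test_ks")   # placeholder keyspace

    insert = session.prepare(
        "INSERT INTO events (year, month, day, id, col_a, col_b) "
        "VALUES (?, ?, ?, ?, ?, ?)")

    # Explicit nulls: each None writes a cell tombstone for col_a / col_b.
    session.execute(insert, (2019, 1, 9, 42, None, None))

    # UNSET_VALUE: the columns are simply not written, so no tombstone is
    # created (requires protocol v4, i.e. Cassandra 2.2+).
    session.execute(insert, (2019, 1, 9, 43, "value-a", UNSET_VALUE))

The catch raised earlier in the thread still applies: if an existing value
genuinely has to be cleared, a null (and therefore a tombstone) is unavoidable
with the flat layout.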

> On 4 Jan 2019, at 21:15, DuyHai Doan  wrote:
> 
> The idea of storing your data as a single blob can be dangerous.
> 
> Indeed, you loose the ability to perform atomic update on each column.
> 
> In Cassandra, LWW is the rule. Suppose 2 concurrent updates on the same row, 
> 1st update changes column Firstname (let's say it's a Person record) and 2nd 
> update changes column Lastname
> 
> Now depending on the timestamp between the 2 updates, you'll have:
> 
> - old Firstname, new Lastname
> - new Firstname, old Lastname
> 
> having updates on columns atomically guarantees you to have new Firstname, 
> new Lastname
> 
> On Fri, Jan 4, 2019 at 8:17 PM Jonathan Haddad wrote:
> Those are two different cases though.  It *sounds like* (again, I may be 
> missing the point) you're trying to overwrite a value with another value.  
> You're either going to serialize a blob and overwrite a single cell, or 
> you're going to overwrite all the cells and include a tombstone.
> 
> When you do a read, reading a single tombstone vs a single vs is essentially 
> the same thing, performance wise.  
> 
> In your description you said "~ 20-100 events", and you're overwriting the 
> event each time, so I don't know how you go to 10K tombstones either.  
> Compaction will bring multiple tombstones together for a cell in the same way 
> it compacts multiple values for a single cell.  
> 
> I sounds to make like you're taking some advice about tombstones out of 
> context and trying to apply the advice to a different problem.  Again, I 
> might be misunderstanding what you're doing.
> 
> 
> On Fri, Jan 4, 2019 at 10:49 AM Tomas Bartalos wrote:
> Hello Jon, 
> 
> I thought having tombstones is much higher overhead than just overwriting 
> values. The compaction overhead can be l similar, but I think the read 
> performance is much worse.
> 
> Tombstones accumulate and hang for 10 days (by default) before they are 
> eligible for compaction. 
> 
> Also we have tombstone warning and error thresholds. If cassandra scans more 
> than 10 000 tombstones, she will abort the query.
> 
> According to this article: 
> https://opencredo.com/blogs/cassandra-tombstones-common-issues/ 
> 
> 
> "The cassandra.yaml comments explain in perfectly: “When executing a scan, 
> within or across a partition, we need to keep the tombstones seen in memory 
> so we can return them to the coordinator, which will use them to make sure 
> other replicas also know about the deleted rows. With workloads that generate 
> a lot of tombstones, this can cause performance problems and even exhaust the 
> server heap. "
> 
> Regards, 
> Tomas
> 
> On Fri, 4 Jan 2019, 7:06 pm Jonathan Haddad   wrote:
> If you're overwriting values, it really doesn't matter much if it's a 
> tombstone or any other value, they still need to be compacted and have the 
> same overhead at read time.  
> 
> Tombstones are problematic when you try to use Cassandra as a queue (or 
> something like a queue) and you need to scan over thousands of tombstones in 
> order to get to the real data.  You're simply overwriting a row and trying to 
> avoid a single tombstone.  
> 
> Maybe I'm missing something here.  Why do you think overwriting a single cell 
> with a tombstone is any worse than overwriting a single cell with a value?
> 
> Jon
> 
> 
> On Fri, Jan 4, 2019 at 9:57 AM Tomas Bartalos wrote:
> Hello,
> 
> I beleive your approach is the same as using spark with 
> "spark.cassandra.output.ignoreNulls=true"
> This will not cover the situation when a value have to be overwriten with 
> null. 
> 
> I found one possible solution - change the schema to keep only primary key 
> fields and move all other fields to frozen UDT.
> create table (year, month, day, id, frozen, primary key((year, month, 
> day), id) )
> In this way anything that is null inside event doesn't create tombstone, 
> since event is serialized to BLOB.
> The penalty is in need of deserializing the whole Event when selecting only 
> few columns. 
> Can anyone confirm if this is good solution performance wise?
> 
> Thank you, 
> 
> On Fri, 4 Jan 2019, 2:20 pm DuyHai 

Re: [EXTERNAL] Re: Good way of configuring Apache spark with Apache Cassandra

2019-01-09 Thread Goutham reddy
Thanks, Sean. But what if I want to have both Spark and Elasticsearch with
Cassandra as a separate data center? Does that cause any overhead?

On Wed, Jan 9, 2019 at 7:28 AM Durity, Sean R 
wrote:

> I think you could consider option C: Create a (new) analytics DC in
> Cassandra and run your spark nodes there. Then you can address the scaling
> just on that DC. You can also use less vnodes, only replicate certain
> keyspaces, etc. in order to perform the analytics more efficiently.
>
>
>
>
>
> Sean Durity
>
>
>
> *From:* Dor Laor 
> *Sent:* Friday, January 04, 2019 4:21 PM
> *To:* user@cassandra.apache.org
> *Subject:* [EXTERNAL] Re: Good way of configuring Apache spark with
> Apache Cassandra
>
>
>
> I strongly recommend option B, separate clusters. Reasons:
>
>  - Networking of node-node is negligible compared to networking within the
> node
>
>  - Different scaling considerations
>
>Your workload may require 10 Spark nodes and 20 database nodes, so why
> bundle them?
>
>This ratio may also change over time as your application evolves and
> amount of data changes.
>
>  - Isolation - If Spark has a spike in cpu/IO utilization, you wouldn't
> want it to affect Cassandra and the opposite.
>
>If you isolate it with cgroups, you may have too much idle time when
> the above doesn't happen.
>
>
>
>
>
> On Fri, Jan 4, 2019 at 12:47 PM Goutham reddy 
> wrote:
>
> Hi,
>
> We have requirement of heavy data lifting and analytics requirement and
> decided to go with Apache Spark. In the process we have come up with two
> patterns
>
> a. Apache Spark and Apache Cassandra co-located and shared on same nodes.
>
> b. Apache Spark on one independent cluster and Apache Cassandra as one
> independent cluster.
>
>
>
> Need good pattern how to use the analytic engine for Cassandra. Thanks in
> advance.
>
>
>
> Regards
>
> Goutham.
>
>
-- 
Regards
Goutham Reddy


RE: [EXTERNAL] Re: Good way of configuring Apache spark with Apache Cassandra

2019-01-09 Thread Durity, Sean R
I think you could consider option C: Create a (new) analytics DC in Cassandra 
and run your Spark nodes there. Then you can address the scaling just on that 
DC. You can also use fewer vnodes, only replicate certain keyspaces, etc. in 
order to perform the analytics more efficiently.
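
As a sketch of the "only replicate certain keyspaces" part (keyspace and DC
names are placeholders), per-keyspace replication settings control which data
reaches the new analytics DC:

    from cassandra.cluster import Cluster

    cluster = Cluster(["10.0.0.1"])   # placeholder contact point
    session = cluster.connect()

    # Replicate only the keyspaces the analytics jobs actually need.
    session.execute("""
        ALTER KEYSPACE analytics_data
        WITH replication = {'class': 'NetworkTopologyStrategy',
                            'dc_realtime': 3, 'dc_analytics': 2}
    """)

    # Keyspaces that Spark never touches simply omit the analytics DC.
    session.execute("""
        ALTER KEYSPACE oltp_only
        WITH replication = {'class': 'NetworkTopologyStrategy', 'dc_realtime': 3}
    """)

After adding the DC to a keyspace, the analytics nodes typically still need a
nodetool rebuild to stream in the pre-existing data.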


Sean Durity

From: Dor Laor 
Sent: Friday, January 04, 2019 4:21 PM
To: user@cassandra.apache.org
Subject: [EXTERNAL] Re: Good way of configuring Apache spark with Apache 
Cassandra

I strongly recommend option B, separate clusters. Reasons:
 - Networking of node-node is negligible compared to networking within the node
 - Different scaling considerations
   Your workload may require 10 Spark nodes and 20 database nodes, so why 
bundle them?
   This ratio may also change over time as your application evolves and amount 
of data changes.
 - Isolation - If Spark has a spike in cpu/IO utilization, you wouldn't want it 
to affect Cassandra and the opposite.
   If you isolate it with cgroups, you may have too much idle time when the 
above doesn't happen.


On Fri, Jan 4, 2019 at 12:47 PM Goutham reddy wrote:
Hi,
We have requirement of heavy data lifting and analytics requirement and decided 
to go with Apache Spark. In the process we have come up with two patterns
a. Apache Spark and Apache Cassandra co-located and shared on same nodes.
b. Apache Spark on one independent cluster and Apache Cassandra as one 
independent cluster.

Need good pattern how to use the analytic engine for Cassandra. Thanks in 
advance.

Regards
Goutham.





Re: About the relationship between the sstable compaction and the read path

2019-01-09 Thread Jeff Jirsa
You’re comparing single machine key/value stores to a distributed db with a 
much richer data model (partitions/slices, statics, range reads, range 
deletions, etc). They’re going to read very differently. Instead of explaining 
why they’re not like rocks/ldb, how about you tell us what you’re trying to do 
/ learn so we can answer the real question?

Few other notes inline.

-- 
Jeff Jirsa


> On Jan 8, 2019, at 10:51 PM, Jinhua Luo  wrote:
> 
> Thanks. Let me clarify my questions more.
> 
> 1) For memtable, if the selected columns (assuming they are in simple
> types) could be found in memtable only, why bother to search sstables
> then? In leveldb and rocksdb, they would stop consulting sstables if
> the memtable already fulfill the query.

We stop at the memtable if we know that’s all we need. This depends on a lot of 
factors (schema, point read vs slice, etc)

> 
> 2) For STCS and LCS, obviously, the sstables are grouped in
> generations (old mutations would promoted into next level or bucket),
> so why not search the columns level by level (or bucket by bucket)
> until all selected columns are collected? In leveldb and rocksdb, they
> do in this way.

They’re single machine and Cassandra isn’t. There’s no guarantee in Cassandra 
that the small sstables in stcs or low levels in LCS are newest:

- you can write arbitrary timestamps into the memtable
- read repair can put old data in the memtable
- streaming (bootstrap/repair) can put old data into new files
- user processes (nodetool refresh) can put old data into new files
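
To make the first point concrete, a small sketch (Python driver; table, values
and contact point are placeholders) of a client supplying its own, much older
write timestamp, so a freshly flushed sstable can easily contain older data
than a large, long-compacted one:

    import time
    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])       # placeholder contact point
    session = cluster.connect("test_ks")   # placeholder keyspace

    # Write timestamps are expressed in microseconds, and a client may pick
    # any value it likes, e.g. a backfill dated one year in the past.
    one_year_ago_us = int((time.time() - 365 * 86400) * 1_000_000)
    session.execute(
        "INSERT INTO events (id, payload) VALUES (1, 'backfilled row') "
        "USING TIMESTAMP %d" % one_year_ago_us)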


> 
> 3) Could you explain the collection, cdt and counter types in more
> detail? Does they need to iterate all sstables? Because they could not
> be simply filtered by timestamp or value range.
> 

I can’t (combination of time available and it’s been a long time since I’ve 
dealt with that code and I don’t want to misspeak).


> For collection, when I select a column of collection type, e.g.
> map, to ensure the whole set of map fields is collected,
> it is necessary to search in all sstables.
> 
> For cdt, it needs to ensure all fields of the cdt is collected.
> 
> For counter, it needs to merge all mutations distributed in all
> sstables to give a final state of counter value.
> 
> Am I correct? If so, then there three complex types seems less
> efficient than simple types, right?
> 
> Jeff Jirsa wrote on Tue, Jan 8, 2019 at 11:58 PM:
>> 
>> First:
>> 
>> Compaction controls how sstables are combined but not how they’re read. The 
>> read path (with one tiny exception) doesn’t know or care which compaction 
>> strategy you’re using.
>> 
>> A few more notes inline.
>> 
>>> On Jan 8, 2019, at 3:04 AM, Jinhua Luo  wrote:
>>> 
>>> Hi All,
>>> 
>>> The compaction would organize the sstables, e.g. with LCS, the
>>> sstables would be categorized into levels, and the read path should
>>> read sstables level by level until the read is fulfilled, correct?
>> 
>> LCS levels are to minimize the number of sstables scanned - at most one per 
>> level - but there’s no attempt to fulfill the read with low levels beyond 
>> the filtering done by timestamp.
>> 
>>> 
>>> For STCS, it would search sstables in buckets from smallest to largest?
>> 
>> Nope. No attempt to do this.
>> 
>>> 
>>> What about other compaction cases? They would iterate all sstables?
>> 
>> In all cases, we’ll use a combination of bloom filters and sstable metadata 
>> and indices to include / exclude sstables. If the bloom filter hits, we’ll 
>> consider things like timestamps and whether or not the min/max clustering of 
>> the sstable matches the slice we care about. We don’t consult the compaction 
>> strategy, though the compaction strategy may have (in the case of LCS or 
>> TWCS) placed the sstables into a state that makes this read less expensive.
>> 
>>> 
>>> But in the codes, I'm confused a lot:
>>> In 
>>> org.apache.cassandra.db.SinglePartitionReadCommand#queryMemtableAndDiskInternal,
>>> it seems that no matter whether the selected columns (except the
>>> collection/cdt and counter cases, let's assume here the selected
>>> columns are simple cell) are collected and satisfied, it would search
>>> both memtable and all sstables, regardless of the compaction strategy.
>> 
>> There’s another that includes timestamps that will do some smart-ish 
>> exclusion of sstables that aren’t needed for the read command.
>> 
>>> 
>>> Why?
>>> 
>>> Moreover, for collection/cdt (non-frozen) and counter types, it would
>>> need to iterate all sstable to ensure the whole set of the fields are
>>> collected, correct? If so, such multi-cell or counter types are
>>> heavyweight in performance, correct?
>>> 

Re: How seed nodes are working and how to upgrade/replace them?

2019-01-09 Thread Jonathan Ballet
On Tue, 8 Jan 2019 at 18:29, Jeff Jirsa  wrote:

> Given Consul's popularity, seems like someone could make an argument that
> we should be shipping a consul-aware seed provider.
>

Elasticsearch has a very handy dedicated file-based discovery system:
https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-discovery-zen.html#file-based-hosts-provider
It's similar to what Cassandra's built-in SimpleSeedProvider does, but it
doesn't require keeping the *whole* cassandra.yaml file up to date, and it
would probably be simpler to watch dynamically for changes.
Ultimately, there are plenty of external applications that could be used to
pull in information from your favorite service discovery tool (etcd,
Consul, etc.) or configuration management system and keep this file up to
date, without needing a plugin for every system out there.
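
As a sketch of that "external tool keeps the file up to date" idea (the Consul
URL, service name and file path are placeholders; the catalog response fields
may differ per setup):

    import json
    import urllib.request

    # Pull the current members of a hypothetical "cassandra" service from
    # Consul's catalog API and write a plain seed-hosts file that config
    # management or a custom seed provider could consume.
    CONSUL_URL = "http://127.0.0.1:8500/v1/catalog/service/cassandra"  # placeholder
    SEED_FILE = "/etc/cassandra/seeds.txt"                             # placeholder

    with urllib.request.urlopen(CONSUL_URL) as resp:
        entries = json.load(resp)

    # The catalog response is a list of JSON objects; "Address" holds the node
    # address. Keep a small, stable subset as seeds rather than every node.
    seeds = sorted(entry["Address"] for entry in entries)[:3]

    with open(SEED_FILE, "w") as f:
        f.write("\n".join(seeds) + "\n")

Out of the box, Cassandra's SimpleSeedProvider only reads the seed list from
cassandra.yaml, so the last step is either templating that file or pointing a
custom SeedProvider at the generated one.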


Re: How seed nodes are working and how to upgrade/replace them?

2019-01-09 Thread Jonathan Ballet
On Tue, 8 Jan 2019 at 18:39, Jeff Jirsa  wrote:

> On Tue, Jan 8, 2019 at 8:19 AM Jonathan Ballet  wrote:
>
>> Hi Jeff,
>>
>> thanks for answering to most of my points!
>> From the reloadseeds' ticket, I followed to
>> https://issues.apache.org/jira/browse/CASSANDRA-3829 which was very
>> instructive, although a bit old.
>>
>>
>> On Mon, 7 Jan 2019 at 17:23, Jeff Jirsa  wrote:
>>
>>> > On Jan 7, 2019, at 6:37 AM, Jonathan Ballet 
>>> wrote:
>>> >
>>> [...]
>>>
>>> >   In essence, in my example that would be:
>>> >
>>> >   - decide that #2 and #3 will be the new seed nodes
>>> >   - update all the configuration files of all the nodes to write the
>>> IP addresses of #2 and #3
>>> >   - DON'T restart any node - the new seed configuration will be picked
>>> up only if the Cassandra process restarts
>>> >
>>> > * If I can manage to sort my Cassandra nodes by their age, could it be
>>> a strategy to have the seeds set to the 2 oldest nodes in the cluster?
>>> (This implies these nodes would change as the cluster's nodes get
>>> upgraded/replaced).
>>>
>>> You could do this, seems like a lot of headache for little benefit.
>>> Could be done with simple seed provider and config management
>>> (puppet/chef/ansible) laying  down new yaml or with your own seed provider
>>>
>>
>> So, just to make it clear: sorting by age isn't a goal in itself, it was
>> just an example on how I could get a stable list.
>>
>> Right now, we have a dedicated group of seed nodes + a dedicated group
>> for non-seeds: doing rolling-upgrade of the nodes from the second list is
>> relatively painless (although slow) whereas we are facing the issues
>> discussed in CASSANDRA-3829 for the first group which are non-seeds nodes
>> are not bootstrapping automatically and we need to operate them in a more
>> careful way.
>>
> Rolling upgrade shouldn't need to re-bootstrap. Only replacing a host
> should need a new bootstrap. That should be a new host in your list, so it
> seems like this should be fairly rare?
>

Sorry, that's internal pidgin: by "rolling upgrade" I meant replacing all the
nodes in a rolling fashion.


> What I'm really looking for is a way to simplify adding and removing nodes
>> into our (small) cluster: I can easily provide a small list of nodes from
>> our cluster with our config management tool so that new nodes are
>> discovering the rest of the cluster, but the documentation seems to imply
>> that seed nodes also have other functions and I'm not sure what problems we
>> could face trying to simplify this approach.
>>
>> Ideally, what I would like to have would be:
>>
>> * Considering a stable cluster (no new nodes, no nodes leaving), the N
>> seeds should be always the same N nodes
>> * Adding new nodes should not change that list
>> * Stopping/removing one of these N nodes should "promote" another
>> (non-seed) node as a seed
>>   - that would not restart the already running Cassandra nodes but would
>> update their configuration files.
>>   - if a node restart for whatever reason it would pick up this new
>> configuration
>>
>> So: no node would start its life as a seed, only a few already existing
>> node would have this status. We would not have to deal with the "a seed
>> node doesn't bootstrap" problem and it would make our operation process
>> simpler.
>>
>>
>>> > I also have some more general questions about seed nodes and how they
>>> work:
>>> >
>>> > * I understand that seed nodes are used when a node starts and needs
>>> to discover the rest of the cluster's nodes. Once the node has joined and
>>> the cluster is stable, are seed nodes still playing a role in day to day
>>> operations?
>>>
>>> They’re used probabilistically in gossip to encourage convergence.
>>> Mostly useful in large clusters.
>>>
>>
>> How "large" are we speaking here? How many nodes would it start to be
>> considered "large"?
>>
>
> ~800-1000
>

Alllrriigght, we still have a long way :)

 Jonathan