Re: Incremental Repair

2017-03-12 Thread Jeff Jirsa


On 2017-03-12 10:44 (-0700), Anuj Wadehra  wrote: 
> Hi,
> 
> Our setup is as follows:
> 2 DCs with N nodes, RF=DC1:3,DC2:3, Hinted Handoff=3 hours, Incremental 
> Repair scheduled once on every node (ALL DCs) within the gc grace period.
> 
> I have following queries regarding incremental repairs:
> 
> 1. When a node is down for X hours (where X is greater than the hinted handoff 
> window and less than gc grace time), I think incremental repair is sufficient 
> rather than doing a full repair. Is that understanding correct? 
> 

Incremental repair SHOULD provide the same guarantees as regular repair.

> 2. DataStax recommends "Run incremental repair daily, run full repairs weekly 
> to monthly". Does that mean I have to run full repairs every week to month EVEN 
> IF I do daily incremental repairs? If yes, what's the reasoning for running a 
> full repair when incremental repair is already run?
> 
> Reference: 
> https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsRepairNodesWhen.html
> 

I don't know why DataStax suggests this; there are some nasty edge cases when 
you mix incremental repair and full repair ( 
https://issues.apache.org/jira/browse/CASSANDRA-13153 ).

> 3. We run incremental repair at least once within gc grace instead of the 
> general recommendation that incremental repair should be run daily. Do you see 
> any problem with this approach? 
> 
>

The more often you run it, the less data will be transferred, and the less 
painful it will be.  By running it weekly, you're making each run do 7x as much 
work compared to running it daily, increasing the chance of it impacting 
your typical latencies.
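A minimal sketch of the decision in question 1 (when a node has been down for X hours), using the 3-hour hint window from the setup above and the default gc_grace_seconds of 10 days (an assumption; your table-level setting may differ). Illustrative Python only, not a Cassandra API:

```python
# Rule of thumb for a node that was down for `downtime_h` hours.
# Thresholds are assumptions taken from the thread: a 3-hour hinted
# handoff window and the default gc_grace_seconds (10 days).

HINT_WINDOW_H = 3        # hinted handoff window from the setup above
GC_GRACE_H = 10 * 24     # default gc_grace_seconds (864000 s), in hours

def action_after_downtime(downtime_h: float) -> str:
    if downtime_h <= HINT_WINDOW_H:
        # Other replicas still hold hints and will replay them.
        return "hints replay; no repair strictly required"
    if downtime_h < GC_GRACE_H:
        # Missed writes must be repaired before tombstones expire;
        # per the answer above, incremental repair is sufficient here.
        return "run (incremental) repair before gc_grace expires"
    # Tombstones may already be purged; repairing risks resurrecting
    # deleted data, so rebuild the node instead.
    return "remove and rebuild/replace the node"
```

The same linear arithmetic is behind the 7x remark: the amount of unrepaired data each run must handle grows roughly with the interval between runs.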




Re: Row cache tuning

2017-03-12 Thread preetika tyagi
I see. Thanks, Arvydas!

In terms of eviction policy in the row cache, does a write operation
invalidate only the row(s) that are going to be modified, or the whole
partition? In older versions of Cassandra, I believe the whole partition
gets invalidated even if only one row is modified. Is that still true for
the latest release (3.10)? I browsed through many online articles and
tutorials but cannot find information on this.

On Sun, Mar 12, 2017 at 2:25 PM, Arvydas Jonusonis <
arvydas.jonuso...@gmail.com> wrote:

> You can experiment quite easily without even needing to restart the
> Cassandra service.
>
> The caches (row and key) can be enabled on a table-by-table basis via a
> schema directive. But the cache capacity (which is the one that you
> referred to in your original post, set to 0 in cassandra.yaml) is a global
> setting and can be manipulated via JMX or nodetool (nodetool
> setcachecapacity).
>
> Arvydas
>
> On Sun, Mar 12, 2017 at 9:46 AM, preetika tyagi 
> wrote:
>
>> Thanks, Matija! That was insightful.
>>
>> I don't really have a use case in particular; what I'm trying to
>> do is figure out how Cassandra's performance can be improved by using
>> different caching mechanisms, such as the row cache, key cache, partition
>> summary etc. Of course, it will also heavily depend on the type of workload
>> but I'm trying to gain more understanding of what's available in the
>> Cassandra framework.
>>
>> Also, I read somewhere that either row cache or key cache can be turned
>> on for a specific table, not both. Based on your comment, I guess the
>> combination of page cache and key cache is used widely for tuning the
>> performance.
>>
>> Thanks,
>> Preetika
>>
>> On Sat, Mar 11, 2017 at 2:01 PM, Matija Gobec 
>> wrote:
>>
>>> Hi,
>>>
>>> In 99% of use cases Cassandra's row cache is not something you should
>>> look into. Leveraging the page cache yields good results and, if accounted
>>> for, can provide a performance increase on the read side.
>>> I'm not a fan of the default row cache implementation and its invalidation
>>> mechanism on updates, so you really need to be careful when and how you use
>>> it. It's not so much about configuration as about your use case. Maybe
>>> explain what you're trying to solve with the row cache and people can get
>>> into a discussion with more context.
>>>
>>> Regards,
>>> Matija
>>>
>>> On Sat, Mar 11, 2017 at 9:15 PM, preetika tyagi >> > wrote:
>>>
 Hi,

 I'm new to Cassandra and trying to get a better understanding on how
 the row cache can be tuned to optimize the performance.

 I came across this article: https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsConfiguringCaches.html

 And it suggests not to even touch the row cache unless the read workload is >
 95%, and to mostly rely on the machine's default caching mechanism, which
 comes with the OS.

 The default row cache size is 0 in the cassandra.yaml file, so the row cache
 won't be utilized at all.

 Therefore, I'm wondering how exactly I can decide whether to tweak the row
 cache if needed. Are there any good pointers one can provide on this?

 Thanks,
 Preetika

>>>
>>>
>>
>


Re: scylladb

2017-03-12 Thread Kant Kodali
I don't think the ScyllaDB guys started this conversation in the first place to
suggest or promote a "drop-in replacement". It was something that was brought
up by one of the Cassandra users, and the ScyllaDB guys just clarified it. They
are gracious enough to share the internals in detail.

Honestly, I find it weird when I see questions about whether a topic belongs
on a mailing list or not, especially in this case. If one doesn't
like it they can simply not follow the thread. I am not sure what the
harm is here.



On Sun, Mar 12, 2017 at 2:29 PM, James Carman 
wrote:

> Well, looking back, it appears this thread is from 2015, so apparently
> everyone is okay with it.
>
> Promoting a value-add product that makes using Cassandra easier/more
> efficient/etc would be cool, but coming to the Cassandra mailing list to
> promote a "drop-in replacement" (use us, not Cassandra) isn't cool, IMHO.
>
>
> On Sun, Mar 12, 2017 at 5:04 PM Kant Kodali  wrote:
>
> yes.
>
> On Sun, Mar 12, 2017 at 2:01 PM, James Carman 
> wrote:
>
> Does all of this Scylla talk really even belong on the Cassandra user
> mailing list in the first place?
>
>
>
>
> On Sun, Mar 12, 2017 at 4:07 PM Jeff Jirsa  wrote:
>
>
>
> On 2017-03-11 22:33 (-0700), Dor Laor  wrote:
> > On Sat, Mar 11, 2017 at 10:02 PM, Jeff Jirsa  wrote:
> > > On 2017-03-10 09:57 (-0800), Rakesh Kumar wrote:
> > > > Cassandra vs Scylla is a valid comparison because they both are
> > > compatible. Scylla is a drop-in replacement for Cassandra.
> > >
> > > No, they aren't, and no, it isn't
> > >
> >
> > Jeff is angry with us for some reason. I don't know why; it's natural that
> > when a new opponent appears there are objections, and the burden of proof
> > lies on us.
>
> I'm not angry. When I'm angry I send emails with paragraphs of expletives.
> It doesn't happen very often.
>
> This is an open source ASF project, it's not about fighting for market
> share against startups who find it necessary to inflate their level of
> compatibility to sell support contracts, it's about providing software that
> people can use (with a license that makes it easy to use). I don't work for
> a company that makes money selling Cassandra based solutions and you're not
> an opponent.
>
> >
> > Scylla IS a drop in replacement for C*. We support the same CQL (from
> > version 1.7 it's cql 3.3.1, protocol v4), the same SStable format (based
> on
> > 2.1.8).
>
> Scylla doesn't even run on all of the supported operating systems, let
> alone have feature parity or network level compatibility (which you'd
> probably need if you REALLY want to be drop-in 
> stop-one-cassandra-node-swap-binaries-start-it-up
> compatible, which is what your site used to claim, but obviously isn't
> supported). You support a subset of one query language and can read and
> write one sstable format. You do it with great supporting tech and a great
> engineering team, but you're not compatible, and if I were your cofounder
> I'd ask you to focus on the tech strengths and not your drop-in
> compatibility, so engineers who care about facts don't grow to resent your
> public lies.
>
> I've used a lot of databases in my life, but I don't know that I've ever
> had someone call me angry because I pointed out that database A wasn't
> compatible with database B, but I guess I'll chalk it up to 2017 and the
> year of fake news / alternative facts.
>
> Hugs and kisses,
> - Jeff
>
>
>


Re: scylladb

2017-03-12 Thread Jeff Jirsa


On 2017-03-12 14:29 (-0700), James Carman  wrote: 
> Well, looking back, it appears this thread is from 2015, so apparently
> everyone is okay with it.
> 
> Promoting a value-add product that makes using Cassandra easier/more
> efficient/etc would be cool, but coming to the Cassandra mailing list to
> promote a "drop-in replacement" (use us, not Cassandra) isn't cool, IMHO.
> 

Agreed. Questions about scylla internals belong on scylla lists. user@cassandra 
is meant for questions about Apache Cassandra - let's please not revive 2015 
threads, and let's keep mails to the list on topic for the list. 





Re: scylladb

2017-03-12 Thread James Carman
Well, looking back, it appears this thread is from 2015, so apparently
everyone is okay with it.

Promoting a value-add product that makes using Cassandra easier/more
efficient/etc would be cool, but coming to the Cassandra mailing list to
promote a "drop-in replacement" (use us, not Cassandra) isn't cool, IMHO.


On Sun, Mar 12, 2017 at 5:04 PM Kant Kodali  wrote:

yes.

On Sun, Mar 12, 2017 at 2:01 PM, James Carman 
wrote:

Does all of this Scylla talk really even belong on the Cassandra user
mailing list in the first place?




On Sun, Mar 12, 2017 at 4:07 PM Jeff Jirsa  wrote:



On 2017-03-11 22:33 (-0700), Dor Laor  wrote:
> On Sat, Mar 11, 2017 at 10:02 PM, Jeff Jirsa  wrote:
> > On 2017-03-10 09:57 (-0800), Rakesh Kumar wrote:
> > > Cassandra vs Scylla is a valid comparison because they both are
> > compatible. Scylla is a drop-in replacement for Cassandra.
> >
> > No, they aren't, and no, it isn't
> >
>
> Jeff is angry with us for some reason. I don't know why; it's natural that
> when a new opponent appears there are objections, and the burden of proof lies on us.

I'm not angry. When I'm angry I send emails with paragraphs of expletives.
It doesn't happen very often.

This is an open source ASF project, it's not about fighting for market
share against startups who find it necessary to inflate their level of
compatibility to sell support contracts, it's about providing software that
people can use (with a license that makes it easy to use). I don't work for
a company that makes money selling Cassandra based solutions and you're not
an opponent.

>
> Scylla IS a drop in replacement for C*. We support the same CQL (from
> version 1.7 it's cql 3.3.1, protocol v4), the same SStable format (based
on
> 2.1.8).

Scylla doesn't even run on all of the supported operating systems, let
alone have feature parity or network level compatibility (which you'd
probably need if you REALLY want to be drop-in
stop-one-cassandra-node-swap-binaries-start-it-up compatible, which is what
your site used to claim, but obviously isn't supported). You support a
subset of one query language and can read and write one sstable format. You
do it with great supporting tech and a great engineering team, but you're
not compatible, and if I were your cofounder I'd ask you to focus on the
tech strengths and not your drop-in compatibility, so engineers who care
about facts don't grow to resent your public lies.

I've used a lot of databases in my life, but I don't know that I've ever
had someone call me angry because I pointed out that database A wasn't
compatible with database B, but I guess I'll chalk it up to 2017 and the
year of fake news / alternative facts.

Hugs and kisses,
- Jeff


Re: Row cache tuning

2017-03-12 Thread Arvydas Jonusonis
You can experiment quite easily without even needing to restart the
Cassandra service.

The caches (row and key) can be enabled on a table-by-table basis via a
schema directive. But the cache capacity (which is the one that you
referred to in your original post, set to 0 in cassandra.yaml) is a global
setting and can be manipulated via JMX or nodetool (nodetool
setcachecapacity).
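To make the steps above concrete, here is a sketch of the per-table schema directive (the keyspace/table name `ks.users` and the 100-row figure are made-up examples; syntax as in Cassandra 3.x CQL):

```sql
-- Enable the row cache for a single table, caching at most
-- 100 rows per partition; the key cache stays fully enabled.
ALTER TABLE ks.users
  WITH caching = {'keys': 'ALL', 'rows_per_partition': '100'};
```

The global capacities can then be raised on a live node with something like `nodetool setcachecapacity <key-cache-MiB> <row-cache-MiB> <counter-cache-MiB>`; set `row_cache_size_in_mb` in cassandra.yaml to make the change persist across restarts.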

Arvydas

On Sun, Mar 12, 2017 at 9:46 AM, preetika tyagi 
wrote:

> Thanks, Matija! That was insightful.
>
> I don't really have a use case in particular; what I'm trying to
> do is figure out how Cassandra's performance can be improved by using
> different caching mechanisms, such as the row cache, key cache, partition
> summary etc. Of course, it will also heavily depend on the type of workload
> but I'm trying to gain more understanding of what's available in the
> Cassandra framework.
>
> Also, I read somewhere that either row cache or key cache can be turned on
> for a specific table, not both. Based on your comment, I guess the
> combination of page cache and key cache is used widely for tuning the
> performance.
>
> Thanks,
> Preetika
>
> On Sat, Mar 11, 2017 at 2:01 PM, Matija Gobec 
> wrote:
>
>> Hi,
>>
>> In 99% of use cases Cassandra's row cache is not something you should
>> look into. Leveraging the page cache yields good results and, if accounted
>> for, can provide a performance increase on the read side.
>> I'm not a fan of the default row cache implementation and its invalidation
>> mechanism on updates, so you really need to be careful when and how you use
>> it. It's not so much about configuration as about your use case. Maybe
>> explain what you're trying to solve with the row cache and people can get
>> into a discussion with more context.
>>
>> Regards,
>> Matija
>>
>> On Sat, Mar 11, 2017 at 9:15 PM, preetika tyagi 
>> wrote:
>>
>>> Hi,
>>>
>>> I'm new to Cassandra and trying to get a better understanding on how the
>>> row cache can be tuned to optimize the performance.
>>>
>>> I came across this article: https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsConfiguringCaches.html
>>>
>>> And it suggests not to even touch the row cache unless the read workload is >
>>> 95%, and to mostly rely on the machine's default caching mechanism, which
>>> comes with the OS.
>>>
>>> The default row cache size is 0 in the cassandra.yaml file, so the row cache
>>> won't be utilized at all.
>>>
>>> Therefore, I'm wondering how exactly I can decide whether to tweak the row
>>> cache if needed. Are there any good pointers one can provide on this?
>>>
>>> Thanks,
>>> Preetika
>>>
>>
>>
>


Re: scylladb

2017-03-12 Thread Kant Kodali
yes.

On Sun, Mar 12, 2017 at 2:01 PM, James Carman 
wrote:

> Does all of this Scylla talk really even belong on the Cassandra user
> mailing list in the first place?
>
>
>
>
> On Sun, Mar 12, 2017 at 4:07 PM Jeff Jirsa  wrote:
>
>
>
> On 2017-03-11 22:33 (-0700), Dor Laor  wrote:
> > On Sat, Mar 11, 2017 at 10:02 PM, Jeff Jirsa  wrote:
> > > On 2017-03-10 09:57 (-0800), Rakesh Kumar wrote:
> > > > Cassandra vs Scylla is a valid comparison because they both are
> > > compatible. Scylla is a drop-in replacement for Cassandra.
> > >
> > > No, they aren't, and no, it isn't
> > >
> >
> > Jeff is angry with us for some reason. I don't know why; it's natural that
> > when a new opponent appears there are objections, and the burden of proof
> > lies on us.
>
> I'm not angry. When I'm angry I send emails with paragraphs of expletives.
> It doesn't happen very often.
>
> This is an open source ASF project, it's not about fighting for market
> share against startups who find it necessary to inflate their level of
> compatibility to sell support contracts, it's about providing software that
> people can use (with a license that makes it easy to use). I don't work for
> a company that makes money selling Cassandra based solutions and you're not
> an opponent.
>
> >
> > Scylla IS a drop in replacement for C*. We support the same CQL (from
> > version 1.7 it's cql 3.3.1, protocol v4), the same SStable format (based
> on
> > 2.1.8).
>
> Scylla doesn't even run on all of the supported operating systems, let
> alone have feature parity or network level compatibility (which you'd
> probably need if you REALLY want to be drop-in 
> stop-one-cassandra-node-swap-binaries-start-it-up
> compatible, which is what your site used to claim, but obviously isn't
> supported). You support a subset of one query language and can read and
> write one sstable format. You do it with great supporting tech and a great
> engineering team, but you're not compatible, and if I were your cofounder
> I'd ask you to focus on the tech strengths and not your drop-in
> compatibility, so engineers who care about facts don't grow to resent your
> public lies.
>
> I've used a lot of databases in my life, but I don't know that I've ever
> had someone call me angry because I pointed out that database A wasn't
> compatible with database B, but I guess I'll chalk it up to 2017 and the
> year of fake news / alternative facts.
>
> Hugs and kisses,
> - Jeff
>
>


Re: scylladb

2017-03-12 Thread James Carman
Does all of this Scylla talk really even belong on the Cassandra user
mailing list in the first place?




On Sun, Mar 12, 2017 at 4:07 PM Jeff Jirsa  wrote:



On 2017-03-11 22:33 (-0700), Dor Laor  wrote:
> On Sat, Mar 11, 2017 at 10:02 PM, Jeff Jirsa  wrote:
> > On 2017-03-10 09:57 (-0800), Rakesh Kumar wrote:
> > > Cassandra vs Scylla is a valid comparison because they both are
> > compatible. Scylla is a drop-in replacement for Cassandra.
> >
> > No, they aren't, and no, it isn't
> >
>
> Jeff is angry with us for some reason. I don't know why; it's natural that
> when a new opponent appears there are objections, and the burden of proof lies on us.

I'm not angry. When I'm angry I send emails with paragraphs of expletives.
It doesn't happen very often.

This is an open source ASF project, it's not about fighting for market
share against startups who find it necessary to inflate their level of
compatibility to sell support contracts, it's about providing software that
people can use (with a license that makes it easy to use). I don't work for
a company that makes money selling Cassandra based solutions and you're not
an opponent.

>
> Scylla IS a drop in replacement for C*. We support the same CQL (from
> version 1.7 it's cql 3.3.1, protocol v4), the same SStable format (based
on
> 2.1.8).

Scylla doesn't even run on all of the supported operating systems, let
alone have feature parity or network level compatibility (which you'd
probably need if you REALLY want to be drop-in
stop-one-cassandra-node-swap-binaries-start-it-up compatible, which is what
your site used to claim, but obviously isn't supported). You support a
subset of one query language and can read and write one sstable format. You
do it with great supporting tech and a great engineering team, but you're
not compatible, and if I were your cofounder I'd ask you to focus on the
tech strengths and not your drop-in compatibility, so engineers who care
about facts don't grow to resent your public lies.

I've used a lot of databases in my life, but I don't know that I've ever
had someone call me angry because I pointed out that database A wasn't
compatible with database B, but I guess I'll chalk it up to 2017 and the
year of fake news / alternative facts.

Hugs and kisses,
- Jeff


Re: scylladb

2017-03-12 Thread Edward Capriolo
On Sun, Mar 12, 2017 at 3:45 PM, Dor Laor  wrote:

> On Sun, Mar 12, 2017 at 12:11 PM, Edward Capriolo 
> wrote:
>
>> The simple claim that "Scylla IS a drop in replacement for C*" shows
>> that they clearly don't know as much as they think they do.
>>
>> Even if it did supposedly "support everything" it would not actually work
>> like that. For example, some things in Cassandra work "the way they work" .
>> They are not specifically defined in a unit test or a document that
>> describes how they are supposed to work. During a massive code port you
>> cannot reason about whether the code still works the same way in all situations.
>>
>> For example, without SEDA, using something else, it definitely won't
>> work the same way when the thread pools fill up and it starts blocking,
>> dropping, whatever. There is so much implicitly undefined behavior.
>>
>
> According to your definition there is no such thing as a drop-in
> replacement, is there?
>
> One of our users asked us to add a protocol verb that identifies Scylla as
> Scylla, so they'll know which is which while they run 2 clusters.
>
> Look, if we claim we have all the features and then someone checks and sees
> we don't have LWT, it makes us a bad service. Usually when we get
> someone (specific) interested, we map their C* usage and say what feature
> isn't there yet. So far it's just the lack of those not-yet-implemented
> features that holds users back. We do try to mimic the exact behaviour of C*.
>
> Clearly, I can't defend a 100% drop-in replacement. Once we implement
> someone's selected
> featureset, then we're a drop-in replacement for them and we're not a good
> match for others.
> We're not after quick wins, quite the opposite.
>
>
>> Also just for argument sake. YCSB proves nothing. Nothing. It generates
>> key-value data, and well frankly that is not the primary use case of
>> Cassandra. So again. Know what you don't know.
>>
>>
> a. We do not pretend we know it all.
> We have 3 years of mileage with Cassandra and 2.5 with Scylla and we
> gained some knowledge... before we decided to go after the C* path, we
> considered
> to reimplement Mongo, HDFS, Kafka and few more examples and the fact
> we chose
> C* shows our appreciation to this project and not the opposite.
>
> b. YCSB is an industry standard, and that's why everybody uses it.
> We don't like it at all since it doesn't have prepared statements
> (it's time that
> someone will merge this support).
> It's not a plain K/V since it's a table of 10 columns of 100b each.
> We do support wide rows and learned (the hard way) their challenge,
> especially
> with compaction, repair and streaming. The current Scylla code doesn't
> cache
> wide row beyond 10MB which isn't ideal. In 1.8 (next month) we have a
> partial
> row caching which is supposed to be very good. During the past 20 months
> since
> our beta we tried to focus on good out-of-the-box experience to all
> real workloads
> and we knowingly deferred features like LWT since we wanted a good
> solid base
> before we reach feature parity. If we'll do a good job with a
> benchmark but a bad
> one in real workload, we just shot ourselves in the foot. This was the
> case around our
> beta but it was just a beta. Today we think we're in a very solid
> position. We still
> have lots to complete around repair (which is ok but not great). There
> is a work
> in progress to switch out from Merkle tree to a new algorithm and
> reduced latency
> (almost there). We have mixed feelings about anti-compaction for
> incremental repair
> but we're likely to go through this route too
>
>
>>
>>
>>
>> On Sun, Mar 12, 2017 at 2:15 PM, Jonathan Haddad 
>> wrote:
>>
>>> I don't think Jeff comes across as angry.  He's simply pointing out that
>>> ScyllaDB isn't a drop in replacement for Cassandra.  Saying that it is is
>>> very misleading.  The marketing material should really say something like
>>> "drop in replacement for some workloads" or "aims to be a drop in
>>> replacement".  As is, it doesn't support everything, so it's not a drop in.
>>>
>>>
>>> On Sat, Mar 11, 2017 at 10:34 PM Dor Laor  wrote:
>>>
 On Sat, Mar 11, 2017 at 10:02 PM, Jeff Jirsa  wrote:



 On 2017-03-10 09:57 (-0800), Rakesh Kumar wrote:
 > Cassandra vs Scylla is a valid comparison because they both are
 compatible. Scylla is a drop-in replacement for Cassandra.

 No, they aren't, and no, it isn't


 Jeff is angry with us for some reason. I don't know why; it's natural that
 when a new opponent appears there are objections, and the burden of proof lies on us.
 We go to great lengths to do it, and we don't just throw out comments
 without backing.

 Scylla IS a drop in replacement for C*. We support the same CQL (from
 version 1.7 it's cql 

Repair while upgradesstables is running

2017-03-12 Thread Anuj Wadehra
Hi,
What is the implication of running incremental repair when all nodes have been
upgraded to the new Cassandra rpm but a parallel upgradesstables is still
running on one or more of the nodes?

So the upgrade is like:
1. Rolling upgrade of all nodes (rpm install)
2. Parallel upgradesstables on all nodes (no issues with IO; we can afford it)
3. Incremental repair while step 2 is still running??

Thanks
Anuj




Re: scylladb

2017-03-12 Thread Jeff Jirsa


On 2017-03-11 22:33 (-0700), Dor Laor  wrote: 
> On Sat, Mar 11, 2017 at 10:02 PM, Jeff Jirsa  wrote:
> > On 2017-03-10 09:57 (-0800), Rakesh Kumar wrote:
> > > Cassandra vs Scylla is a valid comparison because they both are
> > compatible. Scylla is a drop-in replacement for Cassandra.
> >
> > No, they aren't, and no, it isn't
> >
> 
> Jeff is angry with us for some reason. I don't know why; it's natural that
> when a new opponent appears there are objections, and the burden of proof lies on us.

I'm not angry. When I'm angry I send emails with paragraphs of expletives. It 
doesn't happen very often. 

This is an open source ASF project, it's not about fighting for market share 
against startups who find it necessary to inflate their level of compatibility 
to sell support contracts, it's about providing software that people can use 
(with a license that makes it easy to use). I don't work for a company that 
makes money selling Cassandra based solutions and you're not an opponent.

> 
> Scylla IS a drop in replacement for C*. We support the same CQL (from
> version 1.7 it's cql 3.3.1, protocol v4), the same SStable format (based on
> 2.1.8). 

Scylla doesn't even run on all of the supported operating systems, let alone 
have feature parity or network level compatibility (which you'd probably need 
if you REALLY want to be drop-in 
stop-one-cassandra-node-swap-binaries-start-it-up compatible, which is what 
your site used to claim, but obviously isn't supported). You support a subset 
of one query language and can read and write one sstable format. You do it with 
great supporting tech and a great engineering team, but you're not compatible, 
and if I were your cofounder I'd ask you to focus on the tech strengths and not 
your drop-in compatibility, so engineers who care about facts don't grow to 
resent your public lies.

I've used a lot of databases in my life, but I don't know that I've ever had 
someone call me angry because I pointed out that database A wasn't compatible 
with database B, but I guess I'll chalk it up to 2017 and the year of fake news 
/ alternative facts. 

Hugs and kisses,
- Jeff


Re: scylladb

2017-03-12 Thread Dor Laor
On Sun, Mar 12, 2017 at 12:11 PM, Edward Capriolo 
wrote:

> The simple claim that "Scylla IS a drop in replacement for C*" shows that
> they clearly don't know as much as they think they do.
>
> Even if it did supposedly "support everything" it would not actually work
> like that. For example, some things in Cassandra work "the way they work" .
> They are not specifically defined in a unit test or a document that
> describes how they are supposed to work. During a massive code port you
> cannot reason about whether the code still works the same way in all situations.
>
> For example, without SEDA, using something else, it definitely won't
> work the same way when the thread pools fill up and it starts blocking,
> dropping, whatever. There is so much implicitly undefined behavior.
>

According to your definition there is no such thing as a drop-in
replacement, is there?

One of our users asked us to add a protocol verb that identifies Scylla as
Scylla, so they'll know which is which while they run 2 clusters.

Look, if we claim we have all the features and then someone checks and sees
we don't have LWT, it makes us a bad service. Usually when we get
someone (specific) interested, we map their C* usage and say what feature
isn't there yet. So far it's just the lack of those not-yet-implemented
features that holds users back. We do try to mimic the exact behaviour of C*.

Clearly, I can't defend a 100% drop-in replacement. Once we implement
someone's selected
featureset, then we're a drop-in replacement for them and we're not a good
match for others.
We're not after quick wins, quite the opposite.


> Also just for argument sake. YCSB proves nothing. Nothing. It generates
> key-value data, and well frankly that is not the primary use case of
> Cassandra. So again. Know what you don't know.
>
>
a. We do not pretend we know it all.
We have 3 years of mileage with Cassandra and 2.5 with Scylla and we
gained some knowledge... before we decided to go after the C* path, we
considered
to reimplement Mongo, HDFS, Kafka and few more examples and the fact we
chose
C* shows our appreciation to this project and not the opposite.

b. YCSB is an industry standard, and that's why everybody uses it.
We don't like it at all since it doesn't have prepared statements (it's
time that
someone will merge this support).
It's not a plain K/V since it's a table of 10 columns of 100b each.
We do support wide rows and learned (the hard way) their challenge,
especially
with compaction, repair and streaming. The current Scylla code doesn't
cache
wide row beyond 10MB which isn't ideal. In 1.8 (next month) we have a
partial
row caching which is supposed to be very good. During the past 20 months
since
our beta we tried to focus on good out-of-the-box experience to all
real workloads
and we knowingly deferred features like LWT since we wanted a good
solid base
before we reach feature parity. If we'll do a good job with a benchmark
but a bad
one in real workload, we just shot ourselves in the foot. This was the
case around our
beta but it was just a beta. Today we think we're in a very solid
position. We still
have lots to complete around repair (which is ok but not great). There
is a work
in progress to switch out from Merkle tree to a new algorithm and
reduced latency
(almost there). We have mixed feelings about anti-compaction for
incremental repair
but we're likely to go through this route too


>
>
>
> On Sun, Mar 12, 2017 at 2:15 PM, Jonathan Haddad 
> wrote:
>
>> I don't think Jeff comes across as angry.  He's simply pointing out that
>> ScyllaDB isn't a drop in replacement for Cassandra.  Saying that it is is
>> very misleading.  The marketing material should really say something like
>> "drop in replacement for some workloads" or "aims to be a drop in
>> replacement".  As is, it doesn't support everything, so it's not a drop in.
>>
>>
>> On Sat, Mar 11, 2017 at 10:34 PM Dor Laor  wrote:
>>
>>> On Sat, Mar 11, 2017 at 10:02 PM, Jeff Jirsa  wrote:
>>>
>>>
>>>
>>> On 2017-03-10 09:57 (-0800), Rakesh Kumar wrote:
>>> > Cassandra vs Scylla is a valid comparison because they both are
>>> compatible. Scylla is a drop-in replacement for Cassandra.
>>>
>>> No, they aren't, and no, it isn't
>>>
>>>
>>> Jeff is angry with us for some reason. I don't know why; it's natural that
>>> when a new opponent appears there are objections, and the burden of proof lies on us.
>>> We go to great lengths to do it, and we don't just throw out comments
>>> without backing.
>>>
>>> Scylla IS a drop in replacement for C*. We support the same CQL (from
>>> version 1.7 it's cql 3.3.1, protocol v4), the same SStable format (based on
>>> 2.1.8). In 1.7 release we support cql uploader
>>> from 3.x. We will support the SStable format of 3.x natively in 3 month
>>> time. Soon all of the feature set will be implemented. 

Re: scylladb

2017-03-12 Thread Edward Capriolo
The simple claim that "Scylla IS a drop in replacement for C*" shows that
they clearly don't know as much as they think they do.

Even if it did supposedly "support everything" it would not actually work
like that. For example, some things in Cassandra work "the way they work" .
They are not specifically defined in a unit test or a document that
describes how they are supposed to work. During a massive code port you
cannot reason about whether the code still works the same way in all situations.

For example, without SEDA, using something else, it definitely won't
work the same way when the thread pools fill up and it starts blocking,
dropping, whatever. There is so much implicitly undefined behavior.

Also, just for argument's sake: YCSB proves nothing. Nothing. It generates
key-value data, and frankly that is not the primary use case of
Cassandra. So again: know what you don't know.
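Edward's point about implicit overload behavior can be made concrete. The sketch below is not Cassandra code; the `Stage` class and its `policy` knob are invented for illustration. Two equally plausible implementations of a bounded stage behave identically under light load, then diverge exactly when the queue fills up:

```python
import queue

class Stage:
    """Toy bounded stage (illustrative only, not Cassandra's SEDA code).

    `policy` decides what happens when the queue is full: "block" stalls
    the producer, "drop" silently discards the event. Under light load
    the two are indistinguishable; under overload they are not -- which
    is exactly the kind of implicitly defined behavior described above.
    """
    def __init__(self, capacity, policy="block"):
        self.q = queue.Queue(maxsize=capacity)
        self.policy = policy
        self.dropped = 0

    def submit(self, event):
        if self.policy == "block":
            self.q.put(event)            # producer waits for free space
        else:
            try:
                self.q.put_nowait(event)
            except queue.Full:
                self.dropped += 1        # overload => silent data loss

# No consumer is draining the queue, so overload happens immediately.
stage = Stage(capacity=2, policy="drop")
for i in range(5):
    stage.submit(i)
print(stage.q.qsize(), stage.dropped)    # 2 3
```

Swap `policy="drop"` for `policy="block"` and the same five submits hang instead of losing data: identical API, very different behavior under pressure.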





On Sun, Mar 12, 2017 at 2:15 PM, Jonathan Haddad  wrote:

> I don't think Jeff comes across as angry.  He's simply pointing out that
> ScyllaDB isn't a drop in replacement for Cassandra.  Saying that it is is
> very misleading.  The marketing material should really say something like
> "drop in replacement for some workloads" or "aims to be a drop in
> replacement".  As is, it doesn't support everything, so it's not a drop in.
>
>
> On Sat, Mar 11, 2017 at 10:34 PM Dor Laor  wrote:
>
>> On Sat, Mar 11, 2017 at 10:02 PM, Jeff Jirsa  wrote:
>>
>>
>>
>> On 2017-03-10 09:57 (-0800), Rakesh Kumar wrote:
>> > Cassandra vs Scylla is a valid comparison because they both are
>> compatible. Scylla is a drop-in replacement for Cassandra.
>>
>> No, they aren't, and no, it isn't
>>
>>
>> Jeff is angry with us for some reason. I don't know why; it's natural
>> that when
>> a new opponent arrives there are objections and the burden of proof lies on us.
>> We go through a great deal of effort to prove it and we don't just throw comments
>> without backing.
>>
>> Scylla IS a drop in replacement for C*. We support the same CQL (from
>> version 1.7 it's cql 3.3.1, protocol v4), the same SStable format (based on
>> 2.1.8). In 1.7 release we support cql uploader
>> from 3.x. We will support the SStable format of 3.x natively in 3 month
>> time. Soon all of the feature set will be implemented. We always have been
>> using this page (not 100% up to date, we'll update it this week):
>> http://www.scylladb.com/technology/status/
>>
>> We add a jmx-proxy daemon in java in order to make the transition as
>> smooth as possible. Almost all the nodetool commands just work, for sure
>> all the important ones.
>> Btw: we have a RESTapi and Prometheus formats, much better than the hairy
>> jmx one.
>>
>> Spark, Kairosdb, Presto and probably Titan (we add Thrift just for legacy
>> users and we don't intend
>> to decommission an api).
>>
>> Regarding benchmarks, if someone finds a flaw in them, we'll do the best
>> to fix it.
>> Let's ignore them and just hear what our users have to say:
>> http://www.scylladb.com/users/
>>
>>
>>


Re: scylladb

2017-03-12 Thread Dor Laor
On Sun, Mar 12, 2017 at 11:15 AM, Jonathan Haddad  wrote:

> I don't think Jeff comes across as angry.  He's simply pointing out that
> ScyllaDB isn't a drop in
>

Agree, I take it back; it wasn't due to this.


> replacement for Cassandra.  Saying that it is is very misleading.  The
> marketing material should really say something like "drop in replacement
> for some workloads" or "aims to be a drop in replacement".  As is, it
> doesn't support everything, so it's not a drop in.
>
>
When we need to describe what Scylla is in 140 characters or a one-liner, we
say "drop-in replacement". When we talk about the details, we provide the
full details as I did above.
The code is open and we take the upstream-first approach and there is the
status page
to summarize it. If someone depends on LWT or UDF we don't have an
immediate answer.
We do have answers for the rest. The vast majority of users don't get to
use these features
and thus they can (and some did) seamlessly migrate.

For a reference sanity check, see all the databases/tools that claim SQL
ability; most of them
don't comply with the ANSI standard. As you said, our desire is to be 100%
compatible.

Btw, going back to the technology discussion: while there are lots of reasons
to use C++, the only
challenge is in features like UDF/triggers which rely on JVM-based code
execution. We are likely to use Lua for it initially, and later we'll
integrate it with a JVM-based solution.



>
> On Sat, Mar 11, 2017 at 10:34 PM Dor Laor  wrote:
>
>> On Sat, Mar 11, 2017 at 10:02 PM, Jeff Jirsa  wrote:
>>
>>
>>
>> On 2017-03-10 09:57 (-0800), Rakesh Kumar wrote:
>> > Cassandra vs Scylla is a valid comparison because they both are
>> compatible. Scylla is a drop-in replacement for Cassandra.
>>
>> No, they aren't, and no, it isn't
>>
>>
>> Jeff is angry with us for some reason. I don't know why; it's natural
>> that when
>> a new opponent arrives there are objections and the burden of proof lies on us.
>> We go through a great deal of effort to prove it and we don't just throw comments
>> without backing.
>>
>> Scylla IS a drop in replacement for C*. We support the same CQL (from
>> version 1.7 it's cql 3.3.1, protocol v4), the same SStable format (based on
>> 2.1.8). In 1.7 release we support cql uploader
>> from 3.x. We will support the SStable format of 3.x natively in 3 month
>> time. Soon all of the feature set will be implemented. We always have been
>> using this page (not 100% up to date, we'll update it this week):
>> http://www.scylladb.com/technology/status/
>>
>> We add a jmx-proxy daemon in java in order to make the transition as
>> smooth as possible. Almost all the nodetool commands just work, for sure
>> all the important ones.
>> Btw: we have a RESTapi and Prometheus formats, much better than the hairy
>> jmx one.
>>
>> Spark, Kairosdb, Presto and probably Titan (we add Thrift just for legacy
>> users and we don't intend
>> to decommission an api).
>>
>> Regarding benchmarks, if someone finds a flaw in them, we'll do the best
>> to fix it.
>> Let's ignore them and just hear what our users have to say:
>> http://www.scylladb.com/users/
>>
>>
>>


Re: scylladb

2017-03-12 Thread Dor Laor
On Sun, Mar 12, 2017 at 6:40 AM, Stefan Podkowinski  wrote:

> If someone would create a benchmark showing that Cassandra is 10x faster
> than Aerospike, would that mean Cassandra is 100x faster than ScyllaDB?
>
> Joking aside, I personally don't pay a lot of attention to any published
> benchmarks and look at them as pure marketing material. What I'm interested
> in instead is to learn why exactly one solution is faster than the other
> and I have to say that Avi is doing a really good job explaining the design
> motivations behind ScyllaDB in his presentations.
>
> But the Aerospike comparison also has a good point by showing that you
> probably always will be able to find a solution that is faster for a
> certain work load. Therefore the most important step when looking for the
> fastest datastore is to first really understand your work load
> characteristics. Unfortunately this is something people tend to skip and
> instead get lost in controversial benchmark discussions, which are more fun
> than thinking about your data model and talking to people about projected
> long term load. Because if you do, you might realize that those benchmark
> test scenarios (e.g. insert 1TB as fast as possible and measure compaction
> times) aren't actually that relevant for your application.
>
Agree. However, it allows you to realize what a real workload will suffer
from, and that's why we
measured a 'read while heavily writing' result too. In addition we measured
small, medium and large datasets for read-only. Still, benchmarks are not a
real workload and we always advise using our detailed Prometheus metrics
to see whether the hardware is utilized and to understand what the
bottleneck is. Scylla implemented CQL tracing and can run the slow-query
tracing all of the time with a low performance impact.



>
> On 03/10/2017 05:58 PM, Bhuvan Rawal wrote:
>
> Agreed C++ gives an added advantage to talk to underlying hardware with
> better efficiency; it sounds good, but can a piece of code written in C++ give
> 1000% more throughput than a Java app? Is TPC design 10X more performant than
> SEDA arch?
>
> And if C/C++ is indeed that fast how can Aerospike (which is itself
> written in C) claim to be 10X faster than Scylla here
> http://www.aerospike.com/benchmarks/scylladb-initial/ ? (Combining yours
> and Aerospike's benchmarks it appears that Aerospike is 100X as performant
> as C* - I highly doubt that!! )
>
> For a moment lets forget about evaluating 2 different databases, one can
> observe 10X performance difference between a mistuned cassandra cluster and
> one thats tuned as per data model - there are so many Tunables in yaml as
> well as table configs.
>
> Idea is - in order to strengthen your claim, you need to provide complete
> system metrics (Disk, CPU, Network), the OPS increase starts to decay along
> with the configs used. Having plain ops per second and 99p latency is
> blackbox.
>
> Regards,
> Bhuvan
>
> On Fri, Mar 10, 2017 at 12:47 PM, Avi Kivity  wrote:
>
>> ScyllaDB engineer here.
>>
>> C++ is really an enabling technology here. It is directly responsible for
>> a small fraction of the gain by executing faster than Java.  But it is
>> indirectly responsible for the gain by allowing us direct control over
>> memory and threading.  Just as an example, Scylla starts by taking over
>> almost all of the machine's memory, and dynamically assigning it to
>> memtables, cache, and working memory needed to handle requests in flight.
>> Memory is statically partitioned across cores, allowing us to exploit NUMA
>> fully.  You can't do these things in Java.
>>
>> I would say the major contributors to Scylla performance are:
>>  - thread-per-core design
>>  - replacement of the page cache with a row cache
>>  - careful attention to many small details, each contributing a little,
>> but with a large overall impact
>>
>> While I'm here I can say that performance is not the only goal here, it
>> is stable and predictable performance over varying loads and during
>> maintenance operations like repair, without any special tuning.  We measure
>> the amount of CPU and I/O spent on foreground (user) and background
>> (maintenance) tasks and divide them fairly.  This work is not complete but
>> already makes operating Scylla a lot simpler.
>>
>>
>> On 03/10/2017 01:42 AM, Kant Kodali wrote:
>>
>> I don't think ScyllaDB performance is because of C++. The design decisions
>> in scylladb are indeed different from Cassandra such as getting rid of SEDA
>> and moving to TPC and so on.
>>
>> If someone thinks it is because of C++ then just show the benchmarks that
>> proves it is indeed the C++ which gave 10X performance boost as ScyllaDB
>> claims instead of stating it.
>>
>>
>> On Thu, Mar 9, 2017 at 3:22 PM, Richard L. Burton III > > wrote:
>>
>>> They spend an enormous amount of time focusing on performance. You can
>>> expect them to continue on with their optimization and keep 

Re: scylladb

2017-03-12 Thread Jonathan Haddad
I don't think Jeff comes across as angry.  He's simply pointing out that
ScyllaDB isn't a drop in replacement for Cassandra.  Saying that it is is
very misleading.  The marketing material should really say something like
"drop in replacement for some workloads" or "aims to be a drop in
replacement".  As is, it doesn't support everything, so it's not a drop in.


On Sat, Mar 11, 2017 at 10:34 PM Dor Laor  wrote:

> On Sat, Mar 11, 2017 at 10:02 PM, Jeff Jirsa  wrote:
>
>
>
> On 2017-03-10 09:57 (-0800), Rakesh Kumar wrote:
> > Cassandra vs Scylla is a valid comparison because they both are
> compatible. Scylla is a drop-in replacement for Cassandra.
>
> No, they aren't, and no, it isn't
>
>
> Jeff is angry with us for some reason. I don't know why; it's natural that
> when
> a new opponent arrives there are objections and the burden of proof lies on us.
> We go through a great deal of effort to prove it and we don't just throw comments
> without backing.
>
> Scylla IS a drop in replacement for C*. We support the same CQL (from
> version 1.7 it's cql 3.3.1, protocol v4), the same SStable format (based on
> 2.1.8). In 1.7 release we support cql uploader
> from 3.x. We will support the SStable format of 3.x natively in 3 month
> time. Soon all of the feature set will be implemented. We always have been
> using this page (not 100% up to date, we'll update it this week):
> http://www.scylladb.com/technology/status/
>
> We add a jmx-proxy daemon in java in order to make the transition as
> smooth as possible. Almost all the nodetool commands just work, for sure
> all the important ones.
> Btw: we have a RESTapi and Prometheus formats, much better than the hairy
> jmx one.
>
> Spark, Kairosdb, Presto and probably Titan (we add Thrift just for legacy
> users and we don't intend
> to decommission an api).
>
> Regarding benchmarks, if someone finds a flaw in them, we'll do the best
> to fix it.
> Let's ignore them and just hear what our users have to say:
> http://www.scylladb.com/users/
>
>
>


Incremental Repair

2017-03-12 Thread Anuj Wadehra
Hi,

Our setup is as follows:
2 DCs with N nodes, RF=DC1:3,DC2:3, Hinted Handoff=3 hours, Incremental Repair 
scheduled once on every node (ALL DCs) within the gc grace period.

I have the following queries regarding incremental repairs:

1. When a node is down for X hours (where X > hinted handoff hours and less 
than gc grace time), I think incremental repair is sufficient rather than doing 
the full repair. Is this understanding correct? 

2. DataStax recommends "Run incremental repair daily, run full repairs weekly 
to monthly". Does that mean that I have to run full repairs every week to month 
EVEN IF I do daily incremental repairs? If yes, what's the reasoning behind running 
a full repair when an inc repair has already been run?

Reference: 
https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsRepairNodesWhen.html

3. We run inc repair at least once in gc grace instead of the general 
recommendation that inc repair should be run daily. Do you see any problem with 
the approach? 

As per my understanding, if we run inc repair less frequently, compaction 
between unrepaired and repaired data won't happen on a node until some node runs 
inc repair on the unrepaired data range. Thus, there can be some impact on disk 
space and read performance, but immediate compaction is never guaranteed by 
Cassandra anyway. So, I see minimal impact on performance, and that too just on the 
reads of delta data generated between repairs. 

Thanks
Anuj
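For concreteness, the cadence in question 2 is usually wired up with something like the crontab sketch below. This is only an illustration: the flags assume a Cassandra 2.2+/3.x nodetool, where plain `nodetool repair` runs an incremental repair and `-full` forces a full one, and the times are placeholders that should be staggered per node.

```shell
# Illustrative crontab on each node (stagger start times across nodes).
# Daily incremental repair -- `nodetool repair` is incremental by default
# from Cassandra 2.2 onwards.
0 2 * * *  nodetool repair

# Monthly full repair of this node's primary ranges; see CASSANDRA-13153
# for caveats when mixing incremental and full repairs.
0 4 1 * *  nodetool repair -full -pr
```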


RE: scylladb

2017-03-12 Thread Jacques-Henri Berthemet
Will you support custom secondary indexes, triggers and UDF?
I checked the index code, but it’s just a couple of files with commented-out Java code. 
I’m curious to test ScyllaDB, but our application uses LWT and custom secondary 
indexes; I understand LWT is coming (soon?).

--
Jacques-Henri Berthemet

From: sfesc...@gmail.com [mailto:sfesc...@gmail.com]
Sent: dimanche 12 mars 2017 09:23
To: user@cassandra.apache.org
Subject: Re: scylladb


On Sat, Mar 11, 2017 at 1:52 AM Avi Kivity 
> wrote:


Lastly, why don't you test Scylla yourself?  It's pretty easy to set up, 
there's nothing to tune.

Avi

 I'll look seriously at Scylla when it is 3.0.12 compatible.


Re: scylladb

2017-03-12 Thread sfesc...@gmail.com
On Sat, Mar 11, 2017 at 1:52 AM Avi Kivity  wrote:

>
>
> Lastly, why don't you test Scylla yourself?  It's pretty easy to set up,
> there's nothing to tune.
>
> Avi
>

 I'll look seriously at Scylla when it is 3.0.12 compatible.


Re: scylladb

2017-03-12 Thread Edward Capriolo
On Sun, Mar 12, 2017 at 11:40 AM, Edward Capriolo 
wrote:

>
>
> On Sun, Mar 12, 2017 at 1:38 AM, benjamin roth  wrote:
>
>> There is no reason to be angry. This is progress. This is the circle of
>> live.
>>
>> It happens anywhere at any time.
>>
>> Am 12.03.2017 07:34 schrieb "Dor Laor" :
>>
>>> On Sat, Mar 11, 2017 at 10:02 PM, Jeff Jirsa  wrote:
>>>


 On 2017-03-10 09:57 (-0800), Rakesh Kumar wrote:
 > Cassandra vs Scylla is a valid comparison because they both are
 compatible. Scylla is a drop-in replacement for Cassandra.

 No, they aren't, and no, it isn't

>>>
>>> Jeff is angry with us for some reason. I don't know why; it's natural
>>> that when
>>> a new opponent arrives there are objections and the burden of proof lies on us.
>>> We go through a great deal of effort to prove it and we don't just throw comments
>>> without backing.
>>>
>>> Scylla IS a drop in replacement for C*. We support the same CQL (from
>>> version 1.7 it's cql 3.3.1, protocol v4), the same SStable format (based on
>>> 2.1.8). In 1.7 release we support cql uploader
>>> from 3.x. We will support the SStable format of 3.x natively in 3 month
>>> time. Soon all of the feature set will be implemented. We always have been
>>> using this page (not 100% up to date, we'll update it this week):
>>> http://www.scylladb.com/technology/status/
>>>
>>> We add a jmx-proxy daemon in java in order to make the transition as
>>> smooth as possible. Almost all the nodetool commands just work, for sure
>>> all the important ones.
>>> Btw: we have a RESTapi and Prometheus formats, much better than the
>>> hairy jmx one.
>>>
>>> Spark, Kairosdb, Presto and probably Titan (we add Thrift just for
>>> legacy users and we don't intend
>>> to decommission an api).
>>>
>>> Regarding benchmarks, if someone finds a flaw in them, we'll do the best
>>> to fix it.
>>> Let's ignore them and just hear what our users have to say:
>>> http://www.scylladb.com/users/
>>>
>>>
>>>
>
> Scylla is NOT a drop in replacement for Cassandra. Cassandra is a TM.
> Cassandra is NOT a certification body. You are not a certification body.
>
> "Scylla IS a drop in replacement for C*. We support the same CQL (from
> version 1.7 it's cql 3.3.1, protocol v4), the same SStable format (based on
> 2.1.8). In 1.7 release we support cql uploader
> from 3.x. We will support the SStable format of 3.x natively in 3 month
> time. Soon all of the feature set will be implemented. We always have been
> using this page (not 100% up to date, we'll update it this week):
> http://www.scylladb.com/technology/status/ "
>
> No matter how "compatible" you believe Scylla is you can not assert this
> claim.
>
>
>
Also there is no reason to say Jeff is 'angry' because he asserted his
belief in a fact.

"No, they aren't, and no, it isn't"

Does not sound angry.

Besides that, your own words prove it:

"Scylla IS a drop in replacement for C*"
"Soon all of the feature set will be implemented"

Something is NOT a "drop in replacement" when it does NOT have all the
features.

Also, knowing Jeff, who is very even-keeled, I highly doubt he is "angry"
just because he made a short, concise statement.

That being said, I am a little bit angry about the shameless self-promotion and
jabbering on you seem to be doing. We get it: you know about kernels and
page faults and want to talk endlessly about them.


Re: scylladb

2017-03-12 Thread Edward Capriolo
On Sun, Mar 12, 2017 at 1:38 AM, benjamin roth  wrote:

> There is no reason to be angry. This is progress. This is the circle of
> live.
>
> It happens anywhere at any time.
>
> Am 12.03.2017 07:34 schrieb "Dor Laor" :
>
>> On Sat, Mar 11, 2017 at 10:02 PM, Jeff Jirsa  wrote:
>>
>>>
>>>
>>> On 2017-03-10 09:57 (-0800), Rakesh Kumar wrote:
>>> > Cassandra vs Scylla is a valid comparison because they both are
>>> compatible. Scylla is a drop-in replacement for Cassandra.
>>>
>>> No, they aren't, and no, it isn't
>>>
>>
>> Jeff is angry with us for some reason. I don't know why; it's natural
>> that when
>> a new opponent arrives there are objections and the burden of proof lies on us.
>> We go through a great deal of effort to prove it and we don't just throw comments
>> without backing.
>>
>> Scylla IS a drop in replacement for C*. We support the same CQL (from
>> version 1.7 it's cql 3.3.1, protocol v4), the same SStable format (based on
>> 2.1.8). In 1.7 release we support cql uploader
>> from 3.x. We will support the SStable format of 3.x natively in 3 month
>> time. Soon all of the feature set will be implemented. We always have been
>> using this page (not 100% up to date, we'll update it this week):
>> http://www.scylladb.com/technology/status/
>>
>> We add a jmx-proxy daemon in java in order to make the transition as
>> smooth as possible. Almost all the nodetool commands just work, for sure
>> all the important ones.
>> Btw: we have a RESTapi and Prometheus formats, much better than the hairy
>> jmx one.
>>
>> Spark, Kairosdb, Presto and probably Titan (we add Thrift just for legacy
>> users and we don't intend
>> to decommission an api).
>>
>> Regarding benchmarks, if someone finds a flaw in them, we'll do the best
>> to fix it.
>> Let's ignore them and just hear what our users have to say:
>> http://www.scylladb.com/users/
>>
>>
>>

Scylla is NOT a drop in replacement for Cassandra. Cassandra is a TM.
Cassandra is NOT a certification body. You are not a certification body.

"Scylla IS a drop in replacement for C*. We support the same CQL (from
version 1.7 it's cql 3.3.1, protocol v4), the same SStable format (based on
2.1.8). In 1.7 release we support cql uploader
from 3.x. We will support the SStable format of 3.x natively in 3 month
time. Soon all of the feature set will be implemented. We always have been
using this page (not 100% up to date, we'll update it this week):
http://www.scylladb.com/technology/status/ "

No matter how "compatible" you believe Scylla is you can not assert this
claim.


Re: scylladb

2017-03-12 Thread Stefan Podkowinski
If someone would create a benchmark showing that Cassandra is 10x faster
than Aerospike, would that mean Cassandra is 100x faster than ScyllaDB?

Joking aside, I personally don't pay a lot of attention to any published
benchmarks and look at them as pure marketing material. What I'm
interested in instead is to learn why exactly one solution is faster
than the other and I have to say that Avi is doing a really good job
explaining the design motivations behind ScyllaDB in his presentations.

But the Aerospike comparison also has a good point by showing that you
probably always will be able to find a solution that is faster for a
certain work load. Therefore the most important step when looking for the
fastest datastore is to first really understand your work load
characteristics. Unfortunately this is something people tend to skip and
instead get lost in controversial benchmark discussions, which are more
fun than thinking about your data model and talking to people about
projected long term load. Because if you do, you might realize that
those benchmark test scenarios (e.g. insert 1TB as fast as possible and
measure compaction times) aren't actually that relevant for your
application.


On 03/10/2017 05:58 PM, Bhuvan Rawal wrote:
> Agreed C++ gives an added advantage to talk to underlying hardware
> with better efficiency; it sounds good, but can a piece of code written
> in C++ give 1000% more throughput than a Java app? Is TPC design 10X more
> performant than SEDA arch?
>
> And if C/C++ is indeed that fast how can Aerospike (which is itself
> written in C) claim to be 10X faster than Scylla
> here http://www.aerospike.com/benchmarks/scylladb-initial/ ?
> (Combining your's and aerospike's benchmarks it appears that Aerospike
> is 100X performant than C* - I highly doubt that!! )
>
> For a moment lets forget about evaluating 2 different databases, one
> can observe 10X performance difference between a mistuned cassandra
> cluster and one thats tuned as per data model - there are so many
> Tunables in yaml as well as table configs.
>
> Idea is - in order to strengthen your claim, you need to provide
> complete system metrics (Disk, CPU, Network), the OPS increase starts
> to decay along with the configs used. Having plain ops per second and
> 99p latency is blackbox.
>
> Regards,
> Bhuvan
>
> On Fri, Mar 10, 2017 at 12:47 PM, Avi Kivity  > wrote:
>
> ScyllaDB engineer here.
>
> C++ is really an enabling technology here. It is directly
> responsible for a small fraction of the gain by executing faster
> than Java.  But it is indirectly responsible for the gain by
> allowing us direct control over memory and threading.  Just as an
> example, Scylla starts by taking over almost all of the machine's
> memory, and dynamically assigning it to memtables, cache, and
> working memory needed to handle requests in flight.  Memory is
> statically partitioned across cores, allowing us to exploit NUMA
> fully.  You can't do these things in Java.
>
> I would say the major contributors to Scylla performance are:
>  - thread-per-core design
>  - replacement of the page cache with a row cache
>  - careful attention to many small details, each contributing a
> little, but with a large overall impact
>
> While I'm here I can say that performance is not the only goal
> here, it is stable and predictable performance over varying loads
> and during maintenance operations like repair, without any special
> tuning.  We measure the amount of CPU and I/O spent on foreground
> (user) and background (maintenance) tasks and divide them fairly. 
> This work is not complete but already makes operating Scylla a lot
> simpler.
>
>
> On 03/10/2017 01:42 AM, Kant Kodali wrote:
>> I don't think ScyllaDB performance is because of C++. The design
>> decisions in scylladb are indeed different from Cassandra such as
>> getting rid of SEDA and moving to TPC and so on. 
>>
>> If someone thinks it is because of C++ then just show the
>> benchmarks that proves it is indeed the C++ which gave 10X
>> performance boost as ScyllaDB claims instead of stating it.
>>
>>
>> On Thu, Mar 9, 2017 at 3:22 PM, Richard L. Burton III
>> > wrote:
>>
>> They spend an enormous amount of time focusing on
>> performance. You can expect them to continue on with their
>> optimization and keep crushing it.
>>
>> P.S., I don't work for ScyllaDB.  
>>
>> On Thu, Mar 9, 2017 at 6:02 PM, Rakesh Kumar
>> > > wrote:
>>
>> In all of their presentation they keep harping on the
>> fact that scylladb is written in C++ and does not carry
>> the overhead of Java.  Still the difference looks staggering.
>> 

Re: scylladb

2017-03-12 Thread Kant Kodali
One more thing. Pretty much every database that is written in C++ or Java
uses native kernel threads for non-blocking I/O as well. They didn't use
Seastar or Quasar, but anyway I am going to read up on Seastar and see what
it really does.
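As a point of reference, the "native kernel threads plus non-blocking I/O" pattern mentioned above boils down to a readiness loop dispatching callbacks on one thread. A minimal sketch, with the socketpair and callback names invented for illustration:

```python
import selectors
import socket

sel = selectors.DefaultSelector()
a, b = socket.socketpair()        # stand-in for a real network connection
a.setblocking(False)
b.setblocking(False)

received = []

def on_readable(conn):
    # Callback invoked once the kernel reports the socket as readable.
    received.append(conn.recv(1024))

sel.register(b, selectors.EVENT_READ, on_readable)
a.send(b"ping")                   # make `b` readable

# One iteration of the event loop: wait for readiness, then dispatch.
for key, _ in sel.select(timeout=1):
    key.data(key.fileobj)

a.close(); b.close()
print(received)                   # [b'ping']
```

Servers like Nginx run essentially this loop (via epoll/kqueue) on a small, fixed number of kernel threads, which is how they reach 10K+ concurrent connections without user-level threads.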

On Sun, Mar 12, 2017 at 3:48 AM, Kant Kodali  wrote:

>
> If you have thread-per-core and N (logical) cores, and have M tasks
>> running concurrently where M > N, then you need a scheduler to decide which
>> of those M tasks gets to run on those N kernel threads.  Whether those M
>> tasks are user-level threads, or callbacks, or a mix of the two is
>> immaterial.  In such cases a scheduler always exists, even if it is a
>> simple FIFO queue.
>>
>
>
>> Yes, of course a scheduler is needed. But the part you said is immaterial is
>> where I see the devil, or where our conflict of arguments really lies. Let the
>> kernel thread per core deal with callbacks rather than having to build a
>> user-level thread library and its scheduling mechanisms and the mapping
>> between them. This sounds more of an overhead in general but may work in a
>> specific case.
>>
>
>


Re: scylladb

2017-03-12 Thread Kant Kodali
Sorry, I made some typos; here is a better version.

@Avi

"User-level scheduling is great for high performance I/O intensive
applications like databases and file systems." This is generally a claim
made by people who want to use user-level threads, but I have rarely seen any
significant performance gain. Since you are claiming that you do, it would
be great if you could quantify that. The other day I saw a benchmark of
a Golang server, which supports user-level threads/green threads natively,
and it was able to handle 10K concurrent requests. Even Nginx, which is
written in C and uses kernel threads, can handle that many with non-blocking
I/O. We all know concurrency is not parallelism.

One may have to pay for something which could be any of the following.

*Duplication of the schedulers*
M:N requires two schedulers which basically do same work, one at user level
and one in kernel. This is undesirable. It requires frequent data
communications between kernel and user space for scheduling information
transference.

Duplication takes more space in both the Dcache and Icache for scheduling than
a single scheduler. It is highly undesirable if cache misses are caused by
the schedulers rather than the application, because an L2 cache miss could be
more expensive than a kernel thread switch. Then the additional scheduler might
become a troublemaker! In this case, saving kernel traps does not justify a
user-level scheduler, which is all the more true as processors provide faster
and faster kernel-trap execution.

*Thread local data maintenance*
M:N has to maintain thread-specific data, which is already provided by
the kernel for kernel threads, such as TLS data and the error number. Providing
the same feature for user threads is not straightforward because, for
example, the error number is returned on system call failure and supported
by the kernel. User-level support degrades system performance and increases
system complexity.

*System info oblivious*
The kernel scheduler is close to the underlying platform and architecture. It can
take advantage of their features. This is difficult for a user thread library
because it's a layer at user level. User threads are second-order entities
in the system. If a kernel thread uses a GDT slot for TLS data, a user
thread perhaps can only use an LDT slot for TLS data. With increasingly
more support available from new processors for threading/scheduling
(Hyperthreading, NUMA, many-core), this second-order nature seriously limits
the ability of M:N threading.
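The "duplication of the schedulers" point can be sketched in a few lines. Below, generators stand in for M user-level threads multiplexed over N kernel threads; the FIFO loop in `worker` is the second, user-level scheduler sitting on top of the kernel's. This is a toy illustration of the structure, with invented names, not how Seastar or any real M:N runtime is implemented:

```python
import threading
from collections import deque

def user_task(n, results, lock):
    """A 'green thread': runs in three slices, yielding between them."""
    for i in range(3):
        with lock:
            results.append((n, i))
        yield                            # user-level yield, not a syscall

def worker(runq, qlock):
    """User-level FIFO scheduler running on one kernel thread."""
    while True:
        with qlock:
            if not runq:
                return
            task = runq.popleft()
        try:
            next(task)                   # run one slice of the task
        except StopIteration:
            continue                     # task finished
        with qlock:
            runq.append(task)            # reschedule the unfinished task

M, N = 6, 2                              # M user tasks over N kernel threads
results, rlock, qlock = [], threading.Lock(), threading.Lock()
runq = deque(user_task(n, results, rlock) for n in range(M))
threads = [threading.Thread(target=worker, args=(runq, qlock)) for _ in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(results))                      # 18 slices: M tasks * 3 each
```

Note that two schedulers really do run here: the kernel decides when each of the N worker threads executes, and the `runq` loop decides which of the M tasks each thread runs next, which is exactly the duplication being criticized.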


Re: scylladb

2017-03-12 Thread Kant Kodali
> If you have thread-per-core and N (logical) cores, and have M tasks
> running concurrently where M > N, then you need a scheduler to decide which
> of those M tasks gets to run on those N kernel threads.  Whether those M
> tasks are user-level threads, or callbacks, or a mix of the two is
> immaterial.  In such cases a scheduler always exists, even if it is a
> simple FIFO queue.
>


> Yes, of course a scheduler is needed. But what you called immaterial is where
> I see the devil, or where our arguments really conflict. Let the kernel
> thread per core deal with callbacks rather than having to build a
> user-level thread library, its scheduling mechanisms, and the mapping
> between them. This sounds like more overhead in general but may work in a
> specific case.
>


Re: scylladb

2017-03-12 Thread Avi Kivity
If you have thread-per-core and N (logical) cores, and have M tasks 
running concurrently where M > N, then you need a scheduler to decide 
which of those M tasks gets to run on those N kernel threads.  Whether 
those M tasks are user-level threads, or callbacks, or a mix of the two 
is immaterial.  In such cases a scheduler always exists, even if it is a 
simple FIFO queue.



Scheduling happens either voluntarily (the task issues I/O) or 
involuntarily (the scheduler decides it needs to run another task to 
satisfy latency SLA), but it has to happen.  The only case where it 
doesn't need to happen is if M<=N, in which case your server will be 
underutilized whenever your task has to wait.



On 03/12/2017 12:17 PM, Kant Kodali wrote:

@Avi

I don't disagree with the thread-per-core design, and in fact I said it 
is a reasonable/good choice. But I am having a hard time seeing 
how user-level scheduling can make a significant difference 
even in the non-blocking I/O case. My question really is: if you 
already have TPC, why do you need user-level scheduling? And if the 
answer is to switch between user-level tasks, then I am simply trying 
to say "concurrency is not parallelism" (just because one was able to 
switch between user-level threads doesn't mean they are running in 
parallel underneath). Why not simply schedule those on kernel threads 
running on those cores and have a callback mechanism? Why would one 
need to deal with user-level scheduling overhead and all the problems 
that come with it? This to me just sounds like a difference in 
design paradigm but doesn't seem to add much to the performance.


Seastar sounds very similar to Quasar. And I am not seeing great 
benefits from it.





On Sun, Mar 12, 2017 at 1:48 AM, Avi Kivity wrote:


We already quantified it, the result is Scylla. Now, Scylla's
performance is only in part due to the threading model, so I can't
give you a number that quantifies how much just this aspect of the
design is worth.  Removing it (or adding it to Cassandra) is a
multi-man-year effort that I can't justify for this conversation.


If you want to continue to use kernel threads for your
applications, by all means continue to do so.  They're the right
choice for all but the most I/O intensive applications.  But for
these I/O intensive applications thread-per-core is the right
choice, regardless of the points you raise.


I encourage you to study the seastar code base [1] and
documentation [2] to see how we handled those problems. I'll also
comment a bit below.


[1] https://github.com/scylladb/seastar


[2] http://www.seastar-project.org/ 


On 03/12/2017 11:07 AM, Kant Kodali wrote:

@Avi

"User-level scheduling is great for high performance I/O
intensive applications like databases and file systems." This is
generally a claim made by people who want you to use user-level
threads, but I have rarely seen any significant performance gain.
Since you are claiming that you do, it would be great if you could
quantify that. The other day I saw a benchmark of a Golang
server which supports user-level threads/green threads natively
and was able to handle 10K concurrent requests. Even Nginx,
which is written in C and uses kernel threads, can handle that
many with non-blocking I/O. We all know concurrency is not
parallelism.

You may have to pay for something which could be any of the
following.

*Duplication of the schedulers*
M:N requires two schedulers which basically do the same work, one at
user level and one in the kernel. This is undesirable. It requires
frequent data communication between kernel and user space to transfer
scheduling information.

Duplication takes more space in both Dcache and Icache for
scheduling than a single scheduler. It is highly undesirable if
cache misses are caused by the schedulers rather than the application,
because an L2 cache miss could be more expensive than a kernel
thread switch. Then the additional scheduler might become a
troublemaker! In this case, saving kernel traps does not
justify a user-scheduler, which is all the more true as processors
are providing faster and faster kernel trap execution.



That's not a problem, at least in my experience. The kernel
scheduler needs to schedule only one thread, and that very
infrequently. It is completely out of any hot path.



*Thread local data maintenance*
M:N has to maintain thread-specific data, which the kernel already
provides for kernel threads, such as TLS data and the error
number. Providing the same feature for user threads is not
straightforward because, for example, the error number is
returned on system call failure and supported by the kernel.

Re: scylladb

2017-03-12 Thread Kant Kodali
@Avi

I don't disagree with the thread-per-core design, and in fact I said it is a
reasonable/good choice. But I am having a hard time seeing how user-level
scheduling can make a significant difference even in the non-blocking I/O
case. My question really is: if you already have TPC, why do you need
user-level scheduling? And if the answer is to switch between user-level
tasks, then I am simply trying to say "concurrency is not parallelism" (just
because one was able to switch between user-level threads doesn't mean they
are running in parallel underneath). Why not simply schedule those on
kernel threads running on those cores and have a callback mechanism? Why
would one need to deal with user-level scheduling overhead and all the
problems that come with it? This to me just sounds like a difference in
design paradigm but doesn't seem to add much to the performance.

Seastar sounds very similar to Quasar. And I am not seeing great benefits
from it.




On Sun, Mar 12, 2017 at 1:48 AM, Avi Kivity  wrote:

> We already quantified it, the result is Scylla. Now, Scylla's performance
> is only in part due to the threading model, so I can't give you a number
> that quantifies how much just this aspect of the design is worth.  Removing
> it (or adding it to Cassandra) is a multi-man-year effort that I can't
> justify for this conversation.
>
>
> If you want to continue to use kernel threads for your applications, by all
> means continue to do so.  They're the right choice for all but the most I/O
> intensive applications.  But for these I/O intensive applications
> thread-per-core is the right choice, regardless of the points you raise.
>
>
> I encourage you to study the seastar code base [1] and documentation [2]
> to see how we handled those problems.  I'll also comment a bit below.
>
>
> [1] https://github.com/scylladb/seastar
>
> [2] http://www.seastar-project.org/
>
> On 03/12/2017 11:07 AM, Kant Kodali wrote:
>
> @Avi
>
> "User-level scheduling is great for high performance I/O intensive
> applications like databases and file systems." This is generally a claim
> made by people who want you to use user-level threads, but I have rarely seen
> any significant performance gain. Since you are claiming that you do, it
> would be great if you could quantify that. The other day I saw a benchmark of
> a Golang server which supports user-level threads/green threads natively
> and was able to handle 10K concurrent requests. Even Nginx, which is
> written in C and uses kernel threads, can handle that many with
> non-blocking I/O. We all know concurrency is not parallelism.
>
> You may have to pay for something which could be any of the following.
>
> *Duplication of the schedulers*
> M:N requires two schedulers which basically do the same work, one at user
> level and one in the kernel. This is undesirable. It requires frequent data
> communication between kernel and user space to transfer scheduling
> information.
>
> Duplication takes more space in both Dcache and Icache for scheduling than
> a single scheduler. It is highly undesirable if cache misses are caused by
> the schedulers rather than the application, because an L2 cache miss could
> be more expensive than a kernel thread switch. Then the additional scheduler
> might become a troublemaker! In this case, saving kernel traps does not
> justify a user-scheduler, which is all the more true as processors are
> providing faster and faster kernel trap execution.
>
>
>
> That's not a problem, at least in my experience. The kernel scheduler
> needs to schedule only one thread, and that very infrequently. It is
> completely out of any hot path.
>
>
> *Thread local data maintenance*
> M:N has to maintain thread-specific data, which the kernel already provides
> for kernel threads, such as TLS data and the error number. Providing the
> same feature for user threads is not straightforward because, for example,
> the error number is returned on system call failure and supported by the
> kernel. User-level support degrades system performance and increases
> system complexity.
>
>
> This is also not a problem, we capture error codes in exceptions
> immediately after a system call and so we don't need to rely on TLS for
> errno.
>
>
> *System info oblivious*
> Kernel scheduler is close to underlying platform and architecture. It can
> take advantage of their features. This is difficult for a user thread
> library because it's a layer at user level. User threads are second-order
> entities in the system. If a kernel thread uses a GDT slot for TLS data, a
> user thread perhaps can only use an LDT slot for TLS data. With increasingly
> more support available from new processors for threading/scheduling
> (Hyperthreading, NUMA, many-core), the second-order nature seriously limits
> the ability of M:N threading.
>
>
> Those are non-issues, in my experience.  In fact it's the other way
> around, the kernel scheduler cannot assume anything about the 

Re: scylladb

2017-03-12 Thread Avi Kivity
We already quantified it, the result is Scylla. Now, Scylla's 
performance is only in part due to the threading model, so I can't give 
you a number that quantifies how much just this aspect of the design is 
worth.  Removing it (or adding it to Cassandra) is a multi-man-year 
effort that I can't justify for this conversation.



If you want to continue to use kernel threads for your applications, by 
all means continue to do so.  They're the right choice for all but the 
most I/O intensive applications.  But for these I/O intensive 
applications thread-per-core is the right choice, regardless of the 
points you raise.



I encourage you to study the seastar code base [1] and documentation [2] 
to see how we handled those problems.  I'll also comment a bit below.



[1] https://github.com/scylladb/seastar

[2] http://www.seastar-project.org/


On 03/12/2017 11:07 AM, Kant Kodali wrote:

@Avi

"User-level scheduling is great for high performance I/O intensive 
applications like databases and file systems." This is generally a 
claim made by people who want you to use user-level threads, but I have 
rarely seen any significant performance gain. Since you are claiming 
that you do, it would be great if you could quantify that. The other day 
I saw a benchmark of a Golang server which supports user-level 
threads/green threads natively and was able to handle 10K concurrent 
requests. Even Nginx, which is written in C and uses kernel threads, 
can handle that many with non-blocking I/O. We all know concurrency is 
not parallelism.


You may have to pay for something which could be any of the following.

*Duplication of the schedulers*
M:N requires two schedulers which basically do the same work, one at 
user level and one in the kernel. This is undesirable. It requires 
frequent data communication between kernel and user space to transfer 
scheduling information.


Duplication takes more space in both Dcache and Icache for scheduling 
than a single scheduler. It is highly undesirable if cache misses are 
caused by the schedulers rather than the application, because an L2 
cache miss could be more expensive than a kernel thread switch. Then 
the additional scheduler might become a troublemaker! In this case, 
saving kernel traps does not justify a user-scheduler, which is all the 
more true as processors are providing faster and faster kernel trap 
execution.



That's not a problem, at least in my experience. The kernel scheduler 
needs to schedule only one thread, and that very infrequently. It is 
completely out of any hot path.




*Thread local data maintenance*
M:N has to maintain thread-specific data, which the kernel already 
provides for kernel threads, such as TLS data and the error number. 
Providing the same feature for user threads is not straightforward 
because, for example, the error number is returned on system call 
failure and supported by the kernel. User-level support degrades system 
performance and increases system complexity.


This is also not a problem, we capture error codes in exceptions 
immediately after a system call and so we don't need to rely on TLS for 
errno.




*System info oblivious*
Kernel scheduler is close to underlying platform and architecture. It 
can take advantage of their features. This is difficult for a user 
thread library because it's a layer at user level. User threads are 
second-order entities in the system. If a kernel thread uses a GDT 
slot for TLS data, a user thread perhaps can only use an LDT slot for 
TLS data. With increasingly more support available from new processors 
for threading/scheduling (Hyperthreading, NUMA, many-core), the 
second-order nature seriously limits the ability of M:N threading.


Those are non-issues, in my experience.  In fact it's the other way 
around, the kernel scheduler cannot assume anything about the threads it 
is preempting and so has to save more state.  The threads being 
preempted also cannot assume anything about the kernel scheduler, and so 
have to use atomic read-modify-write instructions for synchronization, 
and to perform a system call whenever they need to block or wake another 
thread.






On Sun, Mar 12, 2017 at 1:05 AM, Avi Kivity wrote:


btw, for an example of how user-level tasks can be scheduled in a
way that cannot be done with kernel threads, see this pair of blog
posts:


http://www.scylladb.com/2016/04/14/io-scheduler-1/


http://www.scylladb.com/2016/04/29/io-scheduler-2/



There's simply no way to get this kind of control when you rely on
the kernel for scheduling and page cache management.  As a result
you have to overprovision your node and then you mostly
underutilize it.


On 03/12/2017 10:23 AM, Avi Kivity wrote:




On 03/12/2017 12:19 AM, Kant Kodali wrote:

My response is inline.


Re: scylladb

2017-03-12 Thread Kant Kodali
On Sun, Mar 12, 2017 at 12:23 AM, Avi Kivity  wrote:

>
>
> On 03/12/2017 12:19 AM, Kant Kodali wrote:
>
> My response is inline.
>
> On Sat, Mar 11, 2017 at 1:43 PM, Avi Kivity  wrote:
>
>> There are several issues at play here.
>>
>> First, a database runs a large number of concurrent operations, each of
>> which only consumes a small amount of CPU. The high concurrency is needed to
>> hide latency: disk latency, or the latency of contacting a remote node.
>>
>
> *OK, so you are talking about hiding I/O latency.  If all these I/Os are
> non-blocking system calls, then a thread per core and a callback mechanism
> should suffice, shouldn't it?*
>
>
>
> Scylla uses a mix of user-level threads and callbacks. Most of the code
> uses callbacks (fronted by a future/promise API). SSTable writers
> (memtable flush, compaction) use a user-level thread (internally
> implemented using callbacks).  The important bit is multiplexing many
> concurrent operations onto a single kernel thread.
>
>
> This means that the scheduler will need to switch contexts very often. A
>> kernel thread scheduler knows very little about the application, so it has
>> to switch a lot of context.  A user level scheduler is tightly bound to the
>> application, so it can perform the switching faster.
>>
>
> *Sure, but this applies in the other direction as well. A user-level scheduler
> has no idea about the kernel-level scheduler either.  There is literally no
> coordination between the kernel-level scheduler and the user-level scheduler
> in Linux or any major OS. It may be possible with OSes that support scheduler
> activations (LWPs) and an upcall mechanism. *
>
>
> There is no need for coordination, because the kernel scheduler has no
> scheduling decisions to make.  With one thread per core, bound to its core,
> the kernel scheduler can't make the wrong decision because it has just one
> choice.
>
>
> *Even then it is hard to say if it is all worth it (the research shows the
> performance may not outweigh the complexity). Golang's problem is exactly
> this: if one creates 1000 goroutines/green threads where each of them is
> making a blocking system call, it would create 1000 kernel threads
> underneath because it has no way to know that the kernel thread is blocked
> (no upcall). *
>
>
> All of the significant system calls we issue are through the main thread,
> either asynchronous or non-blocking.
>
> *And in the non-blocking case I still don't even see a significant performance
> gain when compared to a few kernel threads with a callback mechanism.*
>
>
> We do.
>
>
> *  If you are saying user-level scheduling is the future (perhaps I would
> just let the researchers argue about it), as of today that is not the case;
> otherwise languages would have it natively instead of relying on third-party
> frameworks or libraries. *
>
>
> User-level scheduling is great for high performance I/O intensive
> applications like databases and file systems.  It's not a general solution,
> and it involves a lot of effort to set up the infrastructure. However, for
> our use case, it was worth it.
>

*Even with I/O-intensive applications it is very much debatable. The
numbers I have seen aren't convincing at all. *

>
>
>
>
>> There are also implications on the concurrency primitives in use (locks
>> etc.) -- they will be much faster for the user-level scheduler, because
>> they cooperate with the scheduler.  For example, no atomic
>> read-modify-write instructions need to be executed.
>>
>
>
>  Second, how many (kernel) threads should you run?
> * This question one will always have. If there are 10K user-level threads
> that map to only one kernel thread, then they cannot exploit parallelism.
> So there is no right answer, but a thread per core is a reasonable/good
> choice. *
>
>
> Only if you can multiplex many operations on top of each of those
> threads.  Otherwise, the CPUs end up underutilized.
>

*Yes, that's exactly my point about your question "how many (kernel) threads
should you run?", so I will repeat myself here.  This question one will
always have: even those who prefer user-level thread scheduling still need
to know how many kernel threads to map to, so one ends up with the same
question, which is how many kernel threads to create. If there are 10K
user-level threads that map to only one kernel thread, then they cannot
exploit parallelism. So there is no right answer, but a thread per core is a
reasonable/good choice. *


>
>
>
>
>> If you run too few threads, then you will not be able to saturate the CPU
>> resources.  This is a common problem with Cassandra -- it's very hard to
>> get it to consume all of the CPU power on even a moderately large machine.
>> On the other hand, if you have too many threads, you will see latency rise
>> very quickly, because kernel scheduling granularity is on the order of
>> milliseconds.  User-level scheduling, because it leaves control in the hand
>> of the application, allows you to both saturate the CPU and 

Re: scylladb

2017-03-12 Thread Bhuvan Rawal
​

On Sun, Mar 12, 2017 at 2:42 PM, Bhuvan Rawal  wrote:

> Looking at the costs of cloud instances, it clearly appears the cost of
> CPU dictates the overall cost of the instance. Having 2X more cores
> increases cost by nearly 2X keeping other things same as can be seen below
> as an example:
>
> (C3 may have a slightly better processor, but not more than a 10-15%
> performance increase)
>
> Optimising for fewer CPU cycles will invariably reduce costs by a large
> factor. On a modern-day machine with SSDs, where data density per node can
> be high and more requests can be assumed to be served from a single node,
> things get CPU bound. Perhaps it's because Cassandra was invented at a time
> when SSDs did not exist. If we observe closely, many of Cassandra's defaults
> assume the disk is rotational - number of flush writers, concurrent
> compactors, etc. The design suggests that too (using sequential I/O as far
> as possible; in fact that's the underlying philosophy for sequential sstable
> flushes and sequential commitlog files, to avoid random I/O). Perhaps if it
> were designed today, things might look radically different.
>
> Comparing an average hard disk (~200 IOPS) with an SSD (~40K IOPS) is
> approximately a 200x increase, effectively raising the expectation on the
> processor to serve significantly higher ops per second.
>
> In order to extract the best from a modern-day node, it may need significant
> changes such as the one below:
> https://issues.apache.org/jira/browse/CASSANDRA-10989
> Possibly going forward the number of cores per node is only going to
> increase, as has been seen for the last 5-6 years. In a way that suggests
> a significant change in design, and possibly that's what ScyllaDB is up to.
>
> "We found that we need a cpu scheduler which takes into account the
> priority of different tasks, such as repair, compaction, streaming, read
> operations and write operations."
> From my understanding in Cassandra as well compaction threads run on low
> nice priority - not sure about repair/streaming.
> http://grokbase.com/t/cassandra/user/14a85xpce7/significant-nice-cpu-usage
>
> Regards,
>
> On Sun, Mar 12, 2017 at 2:35 PM, Avi Kivity  wrote:
>
>> btw, for an example of how user-level tasks can be scheduled in a way
>> that cannot be done with kernel threads, see this pair of blog posts:
>>
>>
>>   http://www.scylladb.com/2016/04/14/io-scheduler-1/
>>
>>   http://www.scylladb.com/2016/04/29/io-scheduler-2/
>>
>>
>> There's simply no way to get this kind of control when you rely on the
>> kernel for scheduling and page cache management.  As a result you have to
>> overprovision your node and then you mostly underutilize it.
>>
>> On 03/12/2017 10:23 AM, Avi Kivity wrote:
>>
>>
>>
>> On 03/12/2017 12:19 AM, Kant Kodali wrote:
>>
>> My response is inline.
>>
>> On Sat, Mar 11, 2017 at 1:43 PM, Avi Kivity  wrote:
>>
>>> There are several issues at play here.
>>>
>>> First, a database runs a large number of concurrent operations, each of
>>> which only consumes a small amount of CPU. The high concurrency is needed to
>>> hide latency: disk latency, or the latency of contacting a remote node.
>>>
>>
>> *OK, so you are talking about hiding I/O latency.  If all these I/Os are
>> non-blocking system calls, then a thread per core and a callback mechanism
>> should suffice, shouldn't it?*
>>
>>
>>
>> Scylla uses a mix of user-level threads and callbacks. Most of the code
>> uses callbacks (fronted by a future/promise API). SSTable writers
>> (memtable flush, compaction) use a user-level thread (internally
>> implemented using callbacks).  The important bit is multiplexing many
>> concurrent operations onto a single kernel thread.
>>
>>
>> This means that the scheduler will need to switch contexts very often. A
>>> kernel thread scheduler knows very little about the application, so it has
>>> to switch a lot of context.  A user level scheduler is tightly bound to the
>>> application, so it can perform the switching faster.
>>>
>>
>> *Sure, but this applies in the other direction as well. A user-level scheduler
>> has no idea about the kernel-level scheduler either.  There is literally no
>> coordination between the kernel-level scheduler and the user-level scheduler
>> in Linux or any major OS. It may be possible with OSes that support scheduler
>> activations (LWPs) and an upcall mechanism. *
>>
>>
>> There is no need for coordination, because the kernel scheduler has no
>> scheduling decisions to make.  With one thread per core, bound to its core,
>> the kernel scheduler can't make the wrong decision because it has just one
>> choice.
>>
>>
>> *Even then it is hard to say if it is all worth it (the research shows the
>> performance may not outweigh the complexity). Golang's problem is exactly
>> this: if one creates 1000 goroutines/green threads where each of them is
>> making a blocking system call, it would create 1000 kernel threads
>> underneath because it has no way to know that the kernel thread is blocked
>> (no 

Re: scylladb

2017-03-12 Thread Bhuvan Rawal
Looking at the costs of cloud instances, it clearly appears the cost of CPU
dictates the overall cost of the instance. Having 2X more cores increases
cost by nearly 2X keeping other things same as can be seen below as an
example:

(C3 may have a slightly better processor, but not more than a 10-15%
performance increase)

Optimising for fewer CPU cycles will invariably reduce costs by a large
factor. On a modern-day machine with SSDs, where data density per node can
be high and more requests can be assumed to be served from a single node,
things get CPU bound. Perhaps it's because Cassandra was invented at a time
when SSDs did not exist. If we observe closely, many of Cassandra's defaults
assume the disk is rotational - number of flush writers, concurrent
compactors, etc. The design suggests that too (using sequential I/O as far
as possible; in fact that's the underlying philosophy for sequential sstable
flushes and sequential commitlog files, to avoid random I/O). Perhaps if it
were designed today, things might look radically different.

Comparing an average hard disk (~200 IOPS) with an SSD (~40K IOPS) is
approximately a 200x increase, effectively raising the expectation on the
processor to serve significantly higher ops per second.

In order to extract the best from a modern-day node, it may need significant
changes such as the one below:
https://issues.apache.org/jira/browse/CASSANDRA-10989
Possibly going forward the number of cores per node is only going to
increase, as has been seen for the last 5-6 years. In a way that suggests
a significant change in design, and possibly that's what ScyllaDB is up to.

"We found that we need a cpu scheduler which takes into account the
priority of different tasks, such as repair, compaction, streaming, read
operations and write operations."
From my understanding in Cassandra as well compaction threads run on low
nice priority - not sure about repair/streaming.
http://grokbase.com/t/cassandra/user/14a85xpce7/significant-nice-cpu-usage

Regards,

On Sun, Mar 12, 2017 at 2:35 PM, Avi Kivity  wrote:

> btw, for an example of how user-level tasks can be scheduled in a way that
> cannot be done with kernel threads, see this pair of blog posts:
>
>
>   http://www.scylladb.com/2016/04/14/io-scheduler-1/
>
>   http://www.scylladb.com/2016/04/29/io-scheduler-2/
>
>
> There's simply no way to get this kind of control when you rely on the
> kernel for scheduling and page cache management.  As a result you have to
> overprovision your node and then you mostly underutilize it.
>
> On 03/12/2017 10:23 AM, Avi Kivity wrote:
>
>
>
> On 03/12/2017 12:19 AM, Kant Kodali wrote:
>
> My response is inline.
>
> On Sat, Mar 11, 2017 at 1:43 PM, Avi Kivity  wrote:
>
>> There are several issues at play here.
>>
>> First, a database runs a large number of concurrent operations, each of
>> which only consumes a small amount of CPU. The high concurrency is needed to
>> hide latency: disk latency, or the latency of contacting a remote node.
>>
>
> *OK, so you are talking about hiding I/O latency.  If all these I/Os are
> non-blocking system calls, then a thread per core and a callback mechanism
> should suffice, shouldn't it?*
>
>
>
> Scylla uses a mix of user-level threads and callbacks. Most of the code
> uses callbacks (fronted by a future/promise API). SSTable writers
> (memtable flush, compaction) use a user-level thread (internally
> implemented using callbacks).  The important bit is multiplexing many
> concurrent operations onto a single kernel thread.
>
>
> This means that the scheduler will need to switch contexts very often. A
>> kernel thread scheduler knows very little about the application, so it has
>> to switch a lot of context.  A user level scheduler is tightly bound to the
>> application, so it can perform the switching faster.
>>
>
> *Sure, but this applies in the other direction as well. A user-level scheduler
> has no idea about the kernel-level scheduler either.  There is literally no
> coordination between the kernel-level scheduler and the user-level scheduler
> in Linux or any major OS. It may be possible with OSes that support scheduler
> activations (LWPs) and an upcall mechanism. *
>
>
> There is no need for coordination, because the kernel scheduler has no
> scheduling decisions to make.  With one thread per core, bound to its core,
> the kernel scheduler can't make the wrong decision because it has just one
> choice.
>
>
> *Even then it is hard to say if it is all worth it (the research shows the
> performance may not outweigh the complexity). Golang's problem is exactly
> this: if one creates 1000 goroutines/green threads where each of them is
> making a blocking system call, it would create 1000 kernel threads
> underneath because it has no way to know that the kernel thread is blocked
> (no upcall). *
>
>
> All of the significant system calls we issue are through the main thread,
> either asynchronous or non-blocking.
>
> *And in the non-blocking case I still don't even see a significant performance
> when 

Re: scylladb

2017-03-12 Thread Kant Kodali
@Avi

"User-level scheduling is great for high performance I/O intensive
applications like databases and file systems." This is generally a claim
made by people who want you to use user-level threads, but I have rarely seen
any significant performance gain. Since you are claiming that you do, it would
be great if you could quantify that. The other day I saw a benchmark of
a Golang server which supports user-level threads/green threads natively
and was able to handle 10K concurrent requests. Even Nginx, which is
written in C and uses kernel threads, can handle that many with
non-blocking I/O. We all know concurrency is not parallelism.

You may have to pay for something which could be any of the following.

*Duplication of the schedulers*
M:N requires two schedulers which basically do the same work, one at user level
and one in the kernel. This is undesirable. It requires frequent data
communication between kernel and user space to transfer scheduling
information.

Duplication takes more space in both Dcache and Icache for scheduling than
a single scheduler. It is highly undesirable if cache misses are caused by
the schedulers rather than the application, because an L2 cache miss could be
more expensive than a kernel thread switch. Then the additional scheduler
might become a troublemaker! In this case, saving kernel traps does not
justify a user-scheduler, which is all the more true as processors are
providing faster and faster kernel trap execution.

*Thread local data maintenance*
M:N has to maintain thread-specific data that the kernel already provides
for kernel threads, such as TLS data and the error number (errno). Providing
the same feature for user threads is not straightforward because, for
example, the error number is set on system call failure and is maintained by
the kernel. User-level support degrades system performance and increases
system complexity.
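The thread-local point can be illustrated with Python's contextvars, which is exactly the kind of user-level replacement for kernel TLS that an M:N runtime has to provide; a minimal sketch (not tied to any particular M:N library):

```python
import asyncio
import contextvars

# Kernel TLS gives each kernel thread its own errno for free; user-level
# tasks sharing one kernel thread must rebuild that, e.g. with contextvars.
task_err = contextvars.ContextVar("task_err", default=0)

async def worker(code: int) -> int:
    task_err.set(code)          # set this task's private "error number"
    await asyncio.sleep(0)      # yield; another task runs on the same thread
    return task_err.get()       # still our own value, not the other task's

async def main() -> list:
    return await asyncio.gather(worker(7), worker(42))

print(asyncio.run(main()))  # [7, 42]
```

Each task keeps its own slot even though both run on one kernel thread, which is the behavior errno-style per-thread state needs.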

*System info oblivious*
The kernel scheduler is close to the underlying platform and architecture
and can take advantage of their features. This is difficult for a user
thread library because it is a layer at user level. User threads are
second-order entities in the system. If a kernel thread uses a GDT slot for
TLS data, a user thread can perhaps only use an LDT slot for its TLS data.
With increasingly more support available from new processors for
threading/scheduling (Hyper-Threading, NUMA, many-core), this second-order
nature seriously limits the ability of M:N threading.

On Sun, Mar 12, 2017 at 1:05 AM, Avi Kivity  wrote:

> btw, for an example of how user-level tasks can be scheduled in a way that
> cannot be done with kernel threads, see this pair of blog posts:
>
>
>   http://www.scylladb.com/2016/04/14/io-scheduler-1/
>
>   http://www.scylladb.com/2016/04/29/io-scheduler-2/
>
>
> There's simply no way to get this kind of control when you rely on the
> kernel for scheduling and page cache management.  As a result you have to
> overprovision your node and then you mostly underutilize it.
>
> On 03/12/2017 10:23 AM, Avi Kivity wrote:
>
>
>
> On 03/12/2017 12:19 AM, Kant Kodali wrote:
>
> My response is inline.
>
> On Sat, Mar 11, 2017 at 1:43 PM, Avi Kivity  wrote:
>
>> There are several issues at play here.
>>
>> First, a database runs a large number of concurrent operations, each of
>> which only consumes a small amount of CPU. The high concurrency is needed to
>> hide latency: disk latency, or the latency of contacting a remote node.
>>
>
> *Ok so you are talking about hiding I/O latency.  If all these I/Os are
> non-blocking system calls, then a thread per core and a callback mechanism
> should suffice, shouldn't it?*
>
>
>
> Scylla uses a mix of user-level threads and callbacks. Most of the code
> uses callbacks (fronted by a future/promise API). SSTable writers
> (memtable flush, compaction) use a user-level thread (internally
> implemented using callbacks).  The important bit is multiplexing many
> concurrent operations onto a single kernel thread.
>
>
> This means that the scheduler will need to switch contexts very often. A
>> kernel thread scheduler knows very little about the application, so it has
>> to switch a lot of context.  A user level scheduler is tightly bound to the
>> application, so it can perform the switching faster.
>>
>
> *sure but this applies in the other direction as well. A user level scheduler
> has no idea about kernel level scheduler either.  There is literally no
> coordination between kernel level scheduler and user level scheduler in
> linux or any major OS. It may be possible with OS's that support scheduler
> activation(LWP's) and upcall mechanism. *
>
>
> There is no need for coordination, because the kernel scheduler has no
> scheduling decisions to make.  With one thread per core, bound to its core,
> the kernel scheduler can't make the wrong decision because it has just one
> choice.
>
>
> *Even then it is hard to say if it is all worth it (The research shows
> performance may not outweigh 

Re: scylladb

2017-03-12 Thread Avi Kivity
btw, for an example of how user-level tasks can be scheduled in a way 
that cannot be done with kernel threads, see this pair of blog posts:



  http://www.scylladb.com/2016/04/14/io-scheduler-1/

  http://www.scylladb.com/2016/04/29/io-scheduler-2/


There's simply no way to get this kind of control when you rely on the 
kernel for scheduling and page cache management.  As a result you have 
to overprovision your node and then you mostly underutilize it.



On 03/12/2017 10:23 AM, Avi Kivity wrote:




On 03/12/2017 12:19 AM, Kant Kodali wrote:

My response is inline.

On Sat, Mar 11, 2017 at 1:43 PM, Avi Kivity > wrote:


There are several issues at play here.

First, a database runs a large number of concurrent operations,
each of which only consumes a small amount of CPU. The high
concurrency is needed to hide latency: disk latency, or the latency
of contacting a remote node.

*Ok so you are talking about hiding I/O latency. If all these I/Os are 
non-blocking system calls, then a thread per core and a callback 
mechanism should suffice, shouldn't it?*


Scylla uses a mix of user-level threads and callbacks. Most of the 
code uses callbacks (fronted by a future/promise API). SSTable 
writers  (memtable flush, compaction) use a user-level thread 
(internally implemented using callbacks).  The important bit is 
multiplexing many concurrent operations onto a single kernel thread.
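The multiplexing described here can be sketched in miniature with any event loop; a Python asyncio analogue (illustrative only, Scylla's actual implementation is C++/Seastar):

```python
import asyncio
import threading

async def op(i: int, seen: set) -> None:
    # Simulated I/O wait; while one op is parked, the loop runs the others.
    await asyncio.sleep(0.01)
    seen.add(threading.get_ident())  # record which kernel thread ran us

async def main() -> set:
    seen: set = set()
    # A thousand concurrent operations, all multiplexed by one event loop.
    await asyncio.gather(*(op(i, seen) for i in range(1000)))
    return seen

print(len(asyncio.run(main())))  # 1 -- every operation ran on one kernel thread
```

The wall-clock time is about one sleep interval, not a thousand, because the operations overlap while sharing a single kernel thread.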




This means that the scheduler will need to switch contexts very
often. A kernel thread scheduler knows very little about the
application, so it has to switch a lot of context.  A user level
scheduler is tightly bound to the application, so it can perform
the switching faster.


*sure but this applies in the other direction as well. A user level 
scheduler has no idea about kernel level scheduler either.  There is 
literally no coordination between kernel level scheduler and user 
level scheduler in linux or any major OS. It may be possible with 
OS's that support scheduler activation(LWP's) and upcall mechanism. *


There is no need for coordination, because the kernel scheduler has no 
scheduling decisions to make.  With one thread per core, bound to its 
core, the kernel scheduler can't make the wrong decision because it 
has just one choice.



*Even then it is hard to say if it is all worth it (the research 
shows performance may not outweigh the complexity). Golang's problem is 
exactly this: if one creates 1000 goroutines/green threads where each 
of them makes a blocking system call, it would create 1000 
kernel threads underneath, because it has no way to know that the 
kernel thread is blocked (no upcall).*


All of the significant system calls we issue are through the main 
thread, either asynchronous or non-blocking.


*And in the non-blocking case I still don't see a significant 
performance difference compared to a few kernel threads with a callback mechanism.*


We do.

*If you are saying user-level scheduling is the future (perhaps I 
would just let the researchers argue about it), as of today that is 
not the case, else languages would have had it natively instead of 
using third-party frameworks or libraries.*


User-level scheduling is great for high performance I/O intensive 
applications like databases and file systems.  It's not a general 
solution, and it involves a lot of effort to set up the 
infrastructure. However, for our use case, it was worth it.



There are also implications on the concurrency primitives in use
(locks etc.) -- they will be much faster for the user-level
scheduler, because they cooperate with the scheduler.  For
example, no atomic read-modify-write instructions need to be
executed.
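The point about cooperative primitives can be demonstrated with a lock in a single-threaded event loop; a sketch (Python analogue, not Seastar code):

```python
import asyncio

counter = 0

async def bump(lock: asyncio.Lock) -> None:
    global counter
    async with lock:            # cooperative lock: acquiring just checks a
        current = counter       # flag and maybe parks the task; no atomic
        await asyncio.sleep(0)  # read-modify-write is needed, since only
        counter = current + 1   # one kernel thread ever runs these tasks

async def main() -> int:
    global counter
    counter = 0
    lock = asyncio.Lock()
    await asyncio.gather(*(bump(lock) for _ in range(100)))
    return counter

print(asyncio.run(main()))  # 100: mutual exclusion without kernel futexes
```

Without the lock, the deliberate yield between read and write would lose increments; with it, the count is exact, and acquiring it never leaves user space.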


 Second, how many (kernel) threads should you run? *This is a question 
one will always have. If there are 10K user-level threads that map 
to only one kernel thread, they cannot exploit parallelism. So there is 
no single right answer, but a thread per core is a reasonable/good 
choice.*


Only if you can multiplex many operations on top of each of those 
threads.  Otherwise, the CPUs end up underutilized.



If you run too few threads, then you will not be able to saturate
the CPU resources.  This is a common problem with Cassandra --
it's very hard to get it to consume all of the CPU power on even
a moderately large machine. On the other hand, if you have too
many threads, you will see latency rise very quickly, because
kernel scheduling granularity is on the order of milliseconds. 
User-level scheduling, because it leaves control in the hands of
the application, allows you to both saturate the CPU and maintain
low latency.


*For my workload, and probably others I have seen, Cassandra has 
always been CPU bound.*






Yes, but does it consume 100% of all of the cores on your machine?  
Cassandra generally doesn't (on a larger machine), and when you 
profile it, you see it 

Re: Row cache tuning

2017-03-12 Thread preetika tyagi
Thanks, Matija! That was insightful.

I don't really have a use case in particular; however, what I'm trying to
do is figure out how Cassandra's performance can be improved by using
different caching mechanisms, such as row cache, key cache, partition
summary etc. Of course, it will also heavily depend on the type of workload
but I'm trying to gain more understanding of what's available in the
Cassandra framework.

Also, I read somewhere that either row cache or key cache can be turned on
for a specific table, not both. Based on your comment, I guess the
combination of page cache and key cache is used widely for tuning the
performance.
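For reference, per-table caching is configured through the table's `caching` option in CQL; a sketch (the table name and option values are illustrative, so check the documentation for your Cassandra version):

```sql
ALTER TABLE mykeyspace.mytable
  WITH caching = {'keys': 'ALL', 'rows_per_partition': '100'};
```

`rows_per_partition` controls the row cache contribution ('NONE' disables it), while `keys` controls the key cache.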

Thanks,
Preetika

On Sat, Mar 11, 2017 at 2:01 PM, Matija Gobec  wrote:

> Hi,
>
> In 99% of use cases Cassandra's row cache is not something you should look
> into. Leveraging page cache yields good results and if accounted for can
> provide you with performance increase on read side.
> I'm not a fan of a default row cache implementation and its invalidation
> mechanism on updates so you really need to be careful when and how you use
> it. It isn't so much about configuration as it is about your use case. Maybe
> explain what are you trying to solve with row cache and people can get into
> discussion with more context.
>
> Regards,
> Matija
>
> On Sat, Mar 11, 2017 at 9:15 PM, preetika tyagi 
> wrote:
>
>> Hi,
>>
>> I'm new to Cassandra and trying to get a better understanding on how the
>> row cache can be tuned to optimize the performance.
>>
>> I came across this article: https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsConfiguringCaches.html
>>
>> And it suggests not to even touch row cache unless read workload is > 95%
>> and mostly rely on machine's default cache mechanism which comes with OS.
>>
>> The default row cache size is 0 in cassandra.yaml file so the row cache
>> won't be utilized at all.
>>
>> Therefore, I'm wondering how exactly I can decide whether to tweak the row
>> cache if needed. Are there any good pointers one can provide on this?
>>
>> Thanks,
>> Preetika
>>
>
>


Re: scylladb

2017-03-12 Thread Avi Kivity



On 03/12/2017 12:19 AM, Kant Kodali wrote:

My response is inline.

On Sat, Mar 11, 2017 at 1:43 PM, Avi Kivity > wrote:


There are several issues at play here.

First, a database runs a large number of concurrent operations,
each of which only consumes a small amount of CPU. The high
concurrency is needed to hide latency: disk latency, or the latency
of contacting a remote node.

*Ok so you are talking about hiding I/O latency.  If all these I/Os are 
non-blocking system calls, then a thread per core and a callback 
mechanism should suffice, shouldn't it?*


Scylla uses a mix of user-level threads and callbacks. Most of the code 
uses callbacks (fronted by a future/promise API). SSTable writers  
(memtable flush, compaction) use a user-level thread (internally 
implemented using callbacks).  The important bit is multiplexing many 
concurrent operations onto a single kernel thread.




This means that the scheduler will need to switch contexts very
often. A kernel thread scheduler knows very little about the
application, so it has to switch a lot of context.  A user level
scheduler is tightly bound to the application, so it can perform
the switching faster.


*sure but this applies in the other direction as well. A user level 
scheduler has no idea about kernel level scheduler either.  There is 
literally no coordination between kernel level scheduler and user 
level scheduler in linux or any major OS. It may be possible with OS's 
that support scheduler activation(LWP's) and upcall mechanism. *


There is no need for coordination, because the kernel scheduler has no 
scheduling decisions to make.  With one thread per core, bound to its 
core, the kernel scheduler can't make the wrong decision because it has 
just one choice.



*Even then it is hard to say if it is all worth it (the research shows 
performance may not outweigh the complexity). Golang's problem is 
exactly this: if one creates 1000 goroutines/green threads where each 
of them makes a blocking system call, it would create 1000 
kernel threads underneath, because it has no way to know that the 
kernel thread is blocked (no upcall).*


All of the significant system calls we issue are through the main 
thread, either asynchronous or non-blocking.


*And in the non-blocking case I still don't see a significant 
performance difference compared to a few kernel threads with a callback mechanism.*


We do.

*If you are saying user-level scheduling is the future (perhaps I 
would just let the researchers argue about it), as of today that is 
not the case, else languages would have had it natively instead of 
using third-party frameworks or libraries.*


User-level scheduling is great for high performance I/O intensive 
applications like databases and file systems.  It's not a general 
solution, and it involves a lot of effort to set up the infrastructure. 
However, for our use case, it was worth it.



There are also implications on the concurrency primitives in use
(locks etc.) -- they will be much faster for the user-level
scheduler, because they cooperate with the scheduler.  For
example, no atomic read-modify-write instructions need to be executed.


 Second, how many (kernel) threads should you run? *This is a question 
one will always have. If there are 10K user-level threads that map to 
only one kernel thread, they cannot exploit parallelism. So there is 
no single right answer, but a thread per core is a reasonable/good choice.*


Only if you can multiplex many operations on top of each of those 
threads.  Otherwise, the CPUs end up underutilized.



If you run too few threads, then you will not be able to saturate
the CPU resources.  This is a common problem with Cassandra --
it's very hard to get it to consume all of the CPU power on even a
moderately large machine. On the other hand, if you have too many
threads, you will see latency rise very quickly, because kernel
scheduling granularity is on the order of milliseconds. 
User-level scheduling, because it leaves control in the hands of
the application, allows you to both saturate the CPU and maintain


*For my workload, and probably others I have seen, Cassandra has 
always been CPU bound.*






Yes, but does it consume 100% of all of the cores on your machine? 
Cassandra generally doesn't (on a larger machine), and when you profile 
it, you see it spending much of its time in atomic operations, or 
parking/unparking threads -- fighting with itself. It doesn't scale 
within the machine.  Scylla will happily utilize all of the cores that 
it is assigned (all of them by default in most configurations), and the 
bigger the machine you give it, the happier it will be.



There are other factors, like NUMA-friendliness, but in the end it
all boils down to efficiency and control.

None of this is new btw, it's pretty common in the storage world.

  

Re: A Single Dropped Node Fails Entire Read Queries

2017-03-12 Thread Shalom Sagges
Hi Michael,

If a node suddenly fails, and there are other replicas that can still
satisfy the consistency level, shouldn't the request succeed regardless of
the failed node?

Thanks!





Shalom Sagges
DBA
T: +972-74-700-4035
 
 We Create Meaningful Connections



On Fri, Mar 10, 2017 at 6:25 PM, Michael Shuler 
wrote:

> I may be mistaken on the exact configuration option for the timeout
> you're hitting, but I believe this may be the general
> `request_timeout_in_ms: 1` in conf/cassandra.yaml.
>
> A reasonable timeout for a "node down" discovery/processing is needed to
> prevent random flapping of nodes with a super short timeout interval.
> Applications should also retry on a host unavailable exception like
> this, because in the long run, this should be expected from time to time
> for network partitions, node failure, maintenance cycles, etc.
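The retry advice above can be sketched as a generic wrapper; everything here (`UnavailableError`, `with_retries`) is a hypothetical stand-in, not the driver's built-in retry policy:

```python
import time

class UnavailableError(Exception):
    """Stand-in for a driver's 'not enough live replicas' exception."""

def with_retries(fn, attempts: int = 3, backoff_s: float = 0.0):
    # Retry transient unavailability a few times with optional exponential
    # backoff; a real application would also cap total time and log failures.
    for i in range(attempts):
        try:
            return fn()
        except UnavailableError:
            if i == attempts - 1:
                raise
            time.sleep(backoff_s * (2 ** i))

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise UnavailableError()  # first two attempts hit the dead node
    return "ok"

print(with_retries(flaky))  # ok (succeeds on the third attempt)
```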
>
> --
> Kind regards,
> Michael
>
> On 03/10/2017 04:07 AM, Shalom Sagges wrote:
> > Hi daniel,
> >
> > I don't think that's a network issue, because ~10 seconds after the node
> > stopped, the queries were successful again without any timeout issues.
> >
> > Thanks!
> >
> >
> >
> >
> >
> > On Fri, Mar 10, 2017 at 12:01 PM, Daniel Hölbling-Inzko
> >  > > wrote:
> >
> > Could there be network issues in connecting between the nodes? If
> > node A gets to be the query coordinator but can't reach B, and C is
> > obviously down, it won't get a quorum.
> >
> > Greetings
> >
> > Shalom Sagges  > > wrote on Fri, 10 March 2017 at
> 10:55:
> >
> > @Ryan, my keyspace replication settings are as follows:
> > CREATE KEYSPACE mykeyspace WITH replication = {'class':
> > 'NetworkTopologyStrategy', 'DC1': '3', 'DC2': '3', 'DC3': '3'}
> >  AND durable_writes = true;
> >
> > CREATE TABLE mykeyspace.test (
> > column1 text,
> > column2 text,
> > column3 text,
> > PRIMARY KEY (column1, column2));
> >
> > The query is *select * from mykeyspace.test where
> > column1='x';*
> >
> > @Daniel, the replication factor is 3. That's why I don't
> > understand why I get these timeouts when only one node drops.
> >
> > Also, when I enabled tracing, I got the following error:
> > *Unable to fetch query trace: ('Unable to complete the operation
> > against any hosts', {: Unavailable('Error
> > from server: code=1000 [Unavailable exception] message="Cannot
> > achieve consistency level LOCAL_QUORUM"
> > info={\'required_replicas\': 2, \'alive_replicas\': 1,
> > \'consistency\': \'LOCAL_QUORUM\'}',)})*
> >
> > But nodetool status shows that only 1 replica was down:
> > --  Address  Load   Tokens   OwnsHost ID
> >   Rack
> > DN  x.x.x.235  134.32 MB  256  ?
> > c0920d11-08da-4f18-a7f3-dbfb8c155b19  RAC1
> > UN  x.x.x.236  134.02 MB  256  ?
> > 2cc0a27b-b1e4-461f-a3d2-186d3d82ff3d  RAC1
> > UN  x.x.x.237  134.34 MB  256  ?
> > 5b2162aa-8803-4b54-88a9-ff2e70b3d830  RAC1
> >
> >
> > I tried to run the same scenario on all 3 nodes, and only the
> > 3rd node didn't fail the query when I dropped it. The nodes were
> > installed and configured with Puppet so the configuration is the
> > same on all 3 nodes.
> >
> >
> > Thanks!
> >
> >
> >
> > On Fri, Mar 10, 2017 at 10:25 AM, Daniel Hölbling-Inzko
> >  > > wrote:
> >
> > The LOCAL_QUORUM works on the available replicas in the DC.
> > So if your replication factor is 2 and you have 10 nodes you
> > can still only lose 1. With a replication factor of 3 you
> > can lose one node and still satisfy the query.
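The arithmetic behind this can be made explicit; a small sketch:

```python
def quorum(rf: int) -> int:
    # A quorum is a majority of the replicas (per data center for LOCAL_QUORUM).
    return rf // 2 + 1

rf = 3
print(quorum(rf))       # 2: two replicas must respond
print(rf - quorum(rf))  # 1: one replica can be down and the query still succeeds
```

This matches the error in the thread: with RF=3, LOCAL_QUORUM needs `required_replicas: 2`, so a single down node should not by itself cause the Unavailable exception.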
> > Ryan Svihla > wrote on
> > Thu, 9 March 2017 at 18:09:
> >
> > what are your keyspace replication settings and what's your
> > query?
> >
> > On Thu, Mar 9, 2017 at 9:32 AM, Shalom Sagges
> > >
> > wrote:
> >
> > Hi