Re: Java GC pauses, reality check

2016-11-26 Thread Graham Sanderson
It was removed in the 3.0.x line, but not in the 3.x line (post 9472) as far as 
I can tell. It looks to be available in 3.11 and in the 3.x branches.

> On Nov 26, 2016, at 1:17 PM, Oleksandr Shulgin <oleksandr.shul...@zalando.de> 
> wrote:
> 
> On Nov 26, 2016 20:04, "Graham Sanderson" <gra...@vast.com 
> <mailto:gra...@vast.com>> wrote:
> Not AFAIK; https://issues.apache.org/jira/browse/CASSANDRA-9472 
> <https://issues.apache.org/jira/browse/CASSANDRA-9472> is marked as resolved 
> in 3.4, though we are not running it so I can’t say much about it.
> 
> But I was referring to https://issues.apache.org/jira/browse/CASSANDRA-11039 
> <https://issues.apache.org/jira/browse/CASSANDRA-11039> which removed it 
> again in 3.10 and 3.0.10.
> 
> --
> Alex
> 
> It looks like Zing is no longer priced per core, which was a showstopper for 
> us; it is now priced per server, which may affect others differently.
> 
> Ironically, running 2.1.x with off heap memtables, we had some of our JVMs 
> running for over a year, which made us hit 
> https://issues.apache.org/jira/browse/CASSANDRA-10969 when we restarted 
> some nodes for other reasons.
> 
>> On Nov 26, 2016, at 12:07 AM, Oleksandr Shulgin 
>> <oleksandr.shul...@zalando.de <mailto:oleksandr.shul...@zalando.de>> wrote:
>> 
>> On Nov 25, 2016 23:47, "Graham Sanderson" <gra...@vast.com 
>> <mailto:gra...@vast.com>> wrote:
>> If you are seeing 25-30 second GC pauses then (unless you are very badly 
>> configured) you are seeing full GC under CMS (though G1 may have similar problems).
>> 
>> With CMS eventual fragmentation causing promotion failure is inevitable 
>> (unless you cycle your nodes before it happens). Either your heap has way 
>> too big an old gen, or too small a young gen (and you need pretty hefty 
>> boxes to be able to run with a large young gen - in the say 4-8G range - 
>> without young collections taking too long).
>> 
>> Depending on your C* version I would highly recommend off heap memtables. 
>> With those we were able to considerably reduce our heap sizes, despite 
>> having large throughput on a smallish number of nodes.
>> 
>> Aren't offheap memtables discontinued in the most recent releases of 3.0 and 
>> 3.x for a good reason? I thought using them could lead to segfaults?
>> 
>> --
>> Alex
>> 
>> I recommend reading this if you use CMS 
>> http://blog.ragozin.info/2011/10/java-cg-hotspots-cms-and-heap.html 
>> <http://blog.ragozin.info/2011/10/java-cg-hotspots-cms-and-heap.html>, and 
>> also note that if you see a lot of objects of size 131074 in promotion 
>> failures then memtables are the problem - you can try and flush them sooner, 
>> but moving them off heap works better I think.
>> 
>>> On Nov 25, 2016, at 4:38 PM, Kant Kodali <k...@peernova.com 
>>> <mailto:k...@peernova.com>> wrote:
>>> 
>>> +1 Chris Lohfink response
>>> 
>>> I would also restate the following sentence "java GC pauses are pretty much 
>>> a fact of life" to "Any GC based system pauses are pretty much a fact of 
>>> life".
>>> 
>>> I would be more than happy to see if someone can counter prove.
>>> 
>>> 
>>> 
>>> On Fri, Nov 25, 2016 at 1:41 PM, Chris Lohfink <clohfin...@gmail.com 
>>> <mailto:clohfin...@gmail.com>> wrote:
>>> No tuning will eliminate gcs.
>>> 
>>> 20-30 seconds is horrific and out of the ordinary. Most likely implementing 
>>> antipatterns and/or poorly configured. Sub 1s is realistic but with some 
>>> workloads still may require some tuning to maintain. Some workloads are 
>>> very unfriendly to GCs though (ie heavy tombstones, very wide partitions).
>>> 
>>> Chris
>>> 
>>> On Fri, Nov 25, 2016 at 3:25 PM, S Ahmed <sahmed1...@gmail.com 
>>> <mailto:sahmed1...@gmail.com>> wrote:
>>> Hello!
>>> 
>>> From what I understand java GC pauses are pretty much a fact of life, but 
>>> you can tune the jvm to reduce the likelihood of the frequency and length 
>>> of GC pauses.
>>> 
>>> When using Cassandra, how frequent or long have these pauses been known to be?  
>>> Even with tuning, is it safe to assume they cannot be eliminated?
>>> 
>>> Would a 20-30 second pause be something out of the ordinary?
>>> 
>>> Thanks.
>>> 
>>> 
>> 
>> 
> 
> 





Re: Java GC pauses, reality check

2016-11-26 Thread Graham Sanderson
Not AFAIK; https://issues.apache.org/jira/browse/CASSANDRA-9472 
<https://issues.apache.org/jira/browse/CASSANDRA-9472> is marked as resolved in 
3.4, though we are not running it so I can’t say much about it.

It looks like Zing is no longer priced per core, which was a showstopper for 
us; it is now priced per server, which may affect others differently.

Ironically, running 2.1.x with off heap memtables, we had some of our JVMs 
running for over a year, which made us hit 
https://issues.apache.org/jira/browse/CASSANDRA-10969 when we restarted some 
nodes for other reasons.

> On Nov 26, 2016, at 12:07 AM, Oleksandr Shulgin 
> <oleksandr.shul...@zalando.de> wrote:
> 
> On Nov 25, 2016 23:47, "Graham Sanderson" <gra...@vast.com 
> <mailto:gra...@vast.com>> wrote:
> If you are seeing 25-30 second GC pauses then (unless you are very badly 
> configured) you are seeing full GC under CMS (though G1 may have similar problems).
> 
> With CMS eventual fragmentation causing promotion failure is inevitable 
> (unless you cycle your nodes before it happens). Either your heap has way too 
> big an old gen, or too small a young gen (and you need pretty hefty 
> boxes to be able to run with a large young gen - in the say 4-8G range - 
> without young collections taking too long).
> 
> Depending on your C* version I would highly recommend off heap memtables. 
> With those we were able to considerably reduce our heap sizes, despite having 
> large throughput on a smallish number of nodes.
> 
> Aren't offheap memtables discontinued in the most recent releases of 3.0 and 
> 3.x for a good reason? I thought using them could lead to segfaults?
> 
> --
> Alex
> 
> I recommend reading this if you use CMS 
> http://blog.ragozin.info/2011/10/java-cg-hotspots-cms-and-heap.html 
> <http://blog.ragozin.info/2011/10/java-cg-hotspots-cms-and-heap.html>, and 
> also note that if you see a lot of objects of size 131074 in promotion 
> failures then memtables are the problem - you can try and flush them sooner, 
> but moving them off heap works better I think.
> 
>> On Nov 25, 2016, at 4:38 PM, Kant Kodali <k...@peernova.com 
>> <mailto:k...@peernova.com>> wrote:
>> 
>> +1 Chris Lohfink response
>> 
>> I would also restate the following sentence "java GC pauses are pretty much 
>> a fact of life" to "Any GC based system pauses are pretty much a fact of 
>> life".
>> 
>> I would be more than happy to see if someone can counter prove.
>> 
>> 
>> 
>> On Fri, Nov 25, 2016 at 1:41 PM, Chris Lohfink <clohfin...@gmail.com 
>> <mailto:clohfin...@gmail.com>> wrote:
>> No tuning will eliminate gcs.
>> 
>> 20-30 seconds is horrific and out of the ordinary. Most likely implementing 
>> antipatterns and/or poorly configured. Sub 1s is realistic but with some 
>> workloads still may require some tuning to maintain. Some workloads are very 
>> unfriendly to GCs though (ie heavy tombstones, very wide partitions).
>> 
>> Chris
>> 
>> On Fri, Nov 25, 2016 at 3:25 PM, S Ahmed <sahmed1...@gmail.com 
>> <mailto:sahmed1...@gmail.com>> wrote:
>> Hello!
>> 
>> From what I understand java GC pauses are pretty much a fact of life, but 
>> you can tune the jvm to reduce the likelihood of the frequency and length of 
>> GC pauses.
>> 
>> When using Cassandra, how frequent or long have these pauses been known to be?  
>> Even with tuning, is it safe to assume they cannot be eliminated?
>> 
>> Would a 20-30 second pause be something out of the ordinary?
>> 
>> Thanks.
>> 
>> 
> 
> 





Re: Java GC pauses, reality check

2016-11-25 Thread Graham Sanderson
If you are seeing 25-30 second GC pauses then (unless you are very badly 
configured) you are seeing full GC under CMS (though G1 may have similar problems).

With CMS eventual fragmentation causing promotion failure is inevitable (unless 
you cycle your nodes before it happens). Either your heap has way too big an 
old gen, or too small a young gen (and you need pretty hefty boxes to be 
able to run with a large young gen - in the say 4-8G range - without young 
collections taking too long).
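
For illustration only, that kind of sizing lives in cassandra-env.sh; the numbers 
below are just an example of the shape of it, not a recommendation - they have to 
fit your hardware and workload:

    MAX_HEAP_SIZE="8G"     # total heap - keeps the old gen from growing huge
    HEAP_NEWSIZE="2G"      # young gen - a larger one needs beefier boxes to keep ParNew pauses short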

Depending on your C* version I would highly recommend off heap memtables. With 
those we were able to considerably reduce our heap sizes, despite having large 
throughput on a smallish number of nodes.
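
On 2.1 that is a one-line change in cassandra.yaml (offheap_buffers is the halfway 
option; this is just a sketch of the setting - check the docs for your exact version):

    memtable_allocation_type: offheap_objects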

I recommend reading this if you use CMS 
http://blog.ragozin.info/2011/10/java-cg-hotspots-cms-and-heap.html 
, and also 
note that if you see a lot of objects of size 131074 in promotion failures then 
memtables are the problem - you can try and flush them sooner, but moving them 
off heap works better I think.
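
If you do want to try flushing sooner instead, the knobs are also in cassandra.yaml - 
roughly like this (values are illustrative only):

    memtable_heap_space_in_mb: 1024      # cap the on-heap memtable space
    memtable_cleanup_threshold: 0.3      # flush the largest memtable earlier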

> On Nov 25, 2016, at 4:38 PM, Kant Kodali  wrote:
> 
> +1 Chris Lohfink response
> 
> I would also restate the following sentence "java GC pauses are pretty much a 
> fact of life" to "Any GC based system pauses are pretty much a fact of life".
> 
> I would be more than happy to see if someone can counter prove.
> 
> 
> 
> On Fri, Nov 25, 2016 at 1:41 PM, Chris Lohfink  > wrote:
> No tuning will eliminate gcs.
> 
> 20-30 seconds is horrific and out of the ordinary. Most likely implementing 
> antipatterns and/or poorly configured. Sub 1s is realistic but with some 
> workloads still may require some tuning to maintain. Some workloads are very 
> unfriendly to GCs though (ie heavy tombstones, very wide partitions).
> 
> Chris
> 
> On Fri, Nov 25, 2016 at 3:25 PM, S Ahmed  > wrote:
> Hello!
> 
> From what I understand java GC pauses are pretty much a fact of life, but you 
> can tune the jvm to reduce the likelihood of the frequency and length of GC 
> pauses.
> 
> When using Cassandra, how frequent or long have these pauses been known to be?  
> Even with tuning, is it safe to assume they cannot be eliminated?
> 
> Would a 20-30 second pause be something out of the ordinary?
> 
> Thanks.
> 
> 





Re: Do partition keys create skinny or wide rows?

2016-10-08 Thread Graham Sanderson
No, the employees would end up in arbitrary partitions, and querying them would 
be inefficient (impossible? - I am a few versions back on C* so don’t know if ALLOW 
FILTERING even works for this).

I would be tempted to use organization_id only, or organization_id plus maybe a 
few shard bits derived from the employee_id (if you are worried about huge orgs), 
to make the partition key - but it really depends on what other queries you will 
be making; see the sketch below.
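
As a sketch of the sharding idea (table and column names here are made up for 
illustration, not from the original post):

    CREATE TABLE employees_by_org (
        organization_id uuid,
        shard int,            -- e.g. hash(employee_id) % 16
        employee_id uuid,
        name text,
        PRIMARY KEY ((organization_id, shard), employee_id)
    );

A query for a whole organization then fans out over the 16 shard values, which keeps 
any single partition from growing without bound.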
> On Oct 8, 2016, at 11:19 PM, Ali Akhtar <ali.rac...@gmail.com> wrote:
> 
> In the case of PRIMARY KEY((organization_id, employee_id)), could I still do 
> a query like Select ... where organization_id = x, to get all employees in a 
> particular organization?
> 
> And, this will put all those employees in the same node, right?
> 
> On Sun, Oct 9, 2016 at 9:17 AM, Graham Sanderson <gra...@vast.com 
> <mailto:gra...@vast.com>> wrote:
> Nomenclature is tricky, but PRIMARY KEY((organization_id, employee_id)) will 
> make organization_id, employee_id the partition key which equates roughly to 
> your latter sentence (I’m not sure about the 4 billion limit - that may be 
> the new actual limit, but probably not a good idea).
> 
>> On Oct 8, 2016, at 8:35 PM, Ali Akhtar <ali.rac...@gmail.com 
>> <mailto:ali.rac...@gmail.com>> wrote:
>> 
>> the last '4 billion rows' should say '4 billion columns / cells'
>> 
>> On Sun, Oct 9, 2016 at 6:34 AM, Ali Akhtar <ali.rac...@gmail.com 
>> <mailto:ali.rac...@gmail.com>> wrote:
>> Say I have the following primary key:
>> PRIMARY KEY((organization_id, employee_id))
>> 
>> Will this create 1 row whose primary key is the organization id, but it has 
>> a 4 billion column / cell limit?
>> 
>> Or will this create 1 row for each employee in the same organization, so if 
>> i have 5 employees, they will each have their own 5 rows, and each of those 
>> 5 rows will have their own 4 billion rows?
>> 
>> Thank you.
>> 
> 
> 





Re: Do partition keys create skinny or wide rows?

2016-10-08 Thread Graham Sanderson
Nomenclature is tricky, but PRIMARY KEY((organization_id, employee_id)) will 
make organization_id, employee_id the partition key which equates roughly to 
your latter sentence (I’m not sure about the 4 billion limit - that may be the 
new actual limit, but probably not a good idea).
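
To make the nomenclature concrete, here are the two forms as illustrative CQL 
fragments (not taken from the original post) - the placement of the parentheses is 
what changes the layout:

    -- composite partition key: one (small) partition per (organization_id, employee_id) pair
    PRIMARY KEY ((organization_id, employee_id))

    -- partition key organization_id, clustering column employee_id:
    -- one wide partition per organization, one CQL row per employee
    PRIMARY KEY ((organization_id), employee_id)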

> On Oct 8, 2016, at 8:35 PM, Ali Akhtar  wrote:
> 
> the last '4 billion rows' should say '4 billion columns / cells'
> 
> On Sun, Oct 9, 2016 at 6:34 AM, Ali Akhtar  > wrote:
> Say I have the following primary key:
> PRIMARY KEY((organization_id, employee_id))
> 
> Will this create 1 row whose primary key is the organization id, but it has a 
> 4 billion column / cell limit?
> 
> Or will this create 1 row for each employee in the same organization, so if i 
> have 5 employees, they will each have their own 5 rows, and each of those 5 
> rows will have their own 4 billion rows?
> 
> Thank you.
> 





Re: JVM safepoints, mmap, and slow disks

2016-10-08 Thread Graham Sanderson
I was using the term “touch” loosely to hopefully mean pre-fetch, though I 
suspect (I think Intel has been de-emphasizing it) you can still do a sensible 
prefetch instruction in native code. Even if not, you are still better off blocking 
in JNI code - I haven’t looked at the link to see if the correct barriers are 
enforced by the sun.misc.Unsafe method.

I do suspect that you’ll see up to about 5-10% sys call overhead if you hit 
pread.
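
For reference, a rough sketch of the plain-Java version of "touching" the pages 
ahead of time (class and file names are made up; this still faults on a Java thread, 
so it only illustrates the idea - it does not avoid the time-to-safepoint problem 
the way the JNI/prefetch variant being discussed would):

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class MmapPretouch {
        private static final int PAGE = 4096;

        // Read one byte per 4K page so the kernel faults the range into the page cache.
        static long pretouch(MappedByteBuffer buf) {
            long sink = 0;
            for (int i = 0; i < buf.capacity(); i += PAGE) {
                sink += buf.get(i);   // the read itself triggers the page fault if needed
            }
            return sink;              // returned so the loop can't be optimized away
        }

        public static void main(String[] args) throws Exception {
            try (RandomAccessFile raf = new RandomAccessFile(args[0], "r");
                 FileChannel ch = raf.getChannel()) {
                MappedByteBuffer buf =
                    ch.map(FileChannel.MapMode.READ_ONLY, 0, Math.min(ch.size(), 64 << 20));
                System.out.println(pretouch(buf));
            }
        }
    }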

> On Oct 8, 2016, at 11:02 PM, Ariel Weisberg <ar...@weisberg.ws> wrote:
> 
> Hi,
> 
> This is starting to get into dev list territory.
> 
> Interesting idea to touch every 4K page you are going to read.
> 
> You could use this to minimize the cost.
> http://stackoverflow.com/questions/36298111/is-it-possible-to-use-sun-misc-unsafe-to-call-c-functions-without-jni/36309652#36309652
> 
> Maybe faster than doing buffered IO. It's a lot of cache and TLB misses
> with out prefetching though.
> 
> There is a system call to page the memory in which might be better for
> larger reads. Still no guarantee things stay cached though.
> 
> Ariel
> 
> 
> On Sat, Oct 8, 2016, at 08:21 PM, Graham Sanderson wrote:
>> I haven’t studied the read path that carefully, but there might be a spot at 
>> the C* level rather than JVM level where you could effectively do a JNI 
>> touch of the mmap region you’re going to need next.
>> 
>>> On Oct 8, 2016, at 7:17 PM, Graham Sanderson <gra...@vast.com> wrote:
>>> 
>>> We don’t use Azul’s Zing, but it does have the nice feature that all 
>>> threads don’t have to reach safepoints at the same time. That said we make 
>>> heavy use of Cassandra (with off heap memtables - not directly related but 
>>> allows us a lot more GC headroom) and SOLR where we switched to mmap 
>>> because it FAR out performed pread variants - in no cases have we noticed 
>>> long time to safe point (then again our IO is lightning fast).
>>> 
>>>> On Oct 8, 2016, at 1:20 PM, Jonathan Haddad <j...@jonhaddad.com> wrote:
>>>> 
>>>> Linux automatically uses free memory as cache.  It's not swap.
>>>> 
>>>> http://www.tldp.org/LDP/lki/lki-4.html
>>>> 
>>>> On Sat, Oct 8, 2016 at 11:12 AM Vladimir Yudovin <vla...@winguzone.com> 
>>>> wrote:
>>>>> Sorry, I don't quite catch something. What page (memory) cache can exist if 
>>>>> there is no swap file?
>>>>> Where are those pages written/read?
>>>>> 
>>>>> 
>>>>> Best regards, Vladimir Yudovin, 
>>>>> Winguzone [https://winguzone.com/?from=list] - Hosted Cloud Cassandra on 
>>>>> Azure and SoftLayer.
>>>>> Launch your cluster in minutes.
>>>>> 
>>>>>  On Sat, 08 Oct 2016 14:09:50 -0400 Ariel Weisberg <ar...@weisberg.ws> wrote  
>>>>>> Hi,
>>>>>> 
>>>>>> Nope I mean page cache. Linux doesn't call the cache it maintains using 
>>>>>> free memory a file cache. It uses free (and some of the time not so 
>>>>>> free!) memory to buffer writes and to cache recently written/read data.
>>>>>> 
>>>>>> http://www.tldp.org/LDP/lki/lki-4.html
>>>>>> 
>>>>>> When Linux decides it needs free memory it can either evict stuff from 
>>>>>> the page cache, flush dirty pages and then evict, or swap anonymous 
>>>>>> memory out. When you disable swap you only disable the last behavior.
>>>>>> 
>>>>>> Maybe we are talking at cross purposes? What I meant is that increasing 
>>>>>> the heap size to reduce GC frequency is a legitimate thing to do and it 
>>>>>> does have an impact on the performance of the page cache even if you 
>>>>>> have swap disabled?
>>>>>> 
>>>>>> Ariel
>>>>>> 
>>>>>> 
>>>>>> On Sat, Oct 8, 2016, at 01:54 PM, Vladimir Yudovin wrote:
>>>>>>>> Page cache is data pending flush to disk and data cached from disk.
>>>>>>> 
>>>>>>> Do you mean file cache?
>>>>>>> 
>>>>>>> 
>>>>>>> Best regards, Vladimir Yudovin, 
>>>>>>> Winguzone [https://winguzone.com/?from=list] - Hosted Cloud Cassandra 
>>>>>>> on Azure and SoftLayer.
>>>>>>> Launch your cluster in minutes.
>>>>>

Re: JVM safepoints, mmap, and slow disks

2016-10-08 Thread Graham Sanderson
I haven’t studied the read path that carefully, but there might be a spot at 
the C* level rather than JVM level where you could effectively do a JNI touch 
of the mmap region you’re going to need next.

> On Oct 8, 2016, at 7:17 PM, Graham Sanderson <gra...@vast.com> wrote:
> 
> We don’t use Azul’s Zing, but it does have the nice feature that all threads 
> don’t have to reach safepoints at the same time. That said we make heavy use 
> of Cassandra (with off heap memtables - not directly related but allows us a 
> lot more GC headroom) and SOLR where we switched to mmap because it FAR out 
> performed pread variants - in no cases have we noticed long time to safe 
> point (then again our IO is lightning fast).
> 
>> On Oct 8, 2016, at 1:20 PM, Jonathan Haddad <j...@jonhaddad.com 
>> <mailto:j...@jonhaddad.com>> wrote:
>> 
>> Linux automatically uses free memory as cache.  It's not swap.
>> 
>> http://www.tldp.org/LDP/lki/lki-4.html 
>> <http://www.tldp.org/LDP/lki/lki-4.html>
>> 
>> On Sat, Oct 8, 2016 at 11:12 AM Vladimir Yudovin <vla...@winguzone.com 
>> <mailto:vla...@winguzone.com>> wrote:
>> Sorry, I don't quite catch something. What page (memory) cache can exist if there 
>> is no swap file?
>> Where are those pages written/read?
>> 
>> 
>> Best regards, Vladimir Yudovin, 
>> Winguzone <https://winguzone.com/?from=list> - Hosted Cloud Cassandra on 
>> Azure and SoftLayer.
>> Launch your cluster in minutes.
>> 
>> 
>> 
>>  On Sat, 08 Oct 2016 14:09:50 -0400 Ariel Weisberg<ar...@weisberg.ws 
>> <mailto:ar...@weisberg.ws>> wrote  
>> Hi,
>> 
>> Nope I mean page cache. Linux doesn't call the cache it maintains using free 
>> memory a file cache. It uses free (and some of the time not so free!) memory 
>> to buffer writes and to cache recently written/read data.
>> 
>> http://www.tldp.org/LDP/lki/lki-4.html 
>> <http://www.tldp.org/LDP/lki/lki-4.html>
>> 
>> When Linux decides it needs free memory it can either evict stuff from the 
>> page cache, flush dirty pages and then evict, or swap anonymous memory out. 
>> When you disable swap you only disable the last behavior.
>> 
>> Maybe we are talking at cross purposes? What I meant is that increasing the 
>> heap size to reduce GC frequency is a legitimate thing to do and it does 
>> have an impact on the performance of the page cache even if you have swap 
>> disabled?
>> 
>> Ariel
>> 
>> 
>> On Sat, Oct 8, 2016, at 01:54 PM, Vladimir Yudovin wrote:
>> >Page cache is data pending flush to disk and data cached from disk.
>> 
>> Do you mean file cache?
>> 
>> 
>> Best regards, Vladimir Yudovin, 
>> Winguzone <https://winguzone.com/?from=list> - Hosted Cloud Cassandra on 
>> Azure and SoftLayer.
>> Launch your cluster in minutes.
>> 
>> 
>>  On Sat, 08 Oct 2016 13:40:19 -0400 Ariel Weisberg <ar...@weisberg.ws 
>> <mailto:ar...@weisberg.ws>> wrote  
>> Hi,
>> 
>> Page cache is in use even if you disable swap. Swap is anonymous memory, and 
>> whatever else the Linux kernel supports paging out. Page cache is data 
>> pending flush to disk and data cached from disk.
>> 
>> Given how bad the GC pauses are in C*, I think it's not the high pole in the 
>> tent - at least not until key things are off heap and C* can run with CMS and 
>> get 10 millisecond GCs all day long.
>> 
>> You can go through tuning and hardware selection to try to get more consistent 
>> IO pauses and remove outliers as you mention, and as a user I think this is 
>> your best bet. Generally it's either bad device or filesystem behavior if 
>> you get page faults taking more than 200 milliseconds O(G1 gc collection).
>> 
>> I think a JVM change to allow safe points around memory mapped file access 
>> is really unlikely although I agree it would be great. I think the best hack 
>> around it is to code up your memory mapped file access into JNI methods and 
>> find some way to get that to work. Right now if you want to create a safe 
>> point a JNI method is the way to do it. The problem is that JNI methods and 
>> POJOs don't get along well.
>> 
>> If you think about it the reason non-memory mapped IO works well is that 
>> it's all JNI methods so they don't impact time to safe point. I think there 
>> is a tradeoff between tolerance for outliers and performance.
>> 
>> I don't know the state of the non-memory mapped path and how reliable that 
>> is. If it were reliable and I couldn't tolerate the outliers I would use that.

Re: JVM safepoints, mmap, and slow disks

2016-10-08 Thread Graham Sanderson
We don’t use Azul’s Zing, but it does have the nice feature that all threads 
don’t have to reach safepoints at the same time. That said we make heavy use of 
Cassandra (with off heap memtables - not directly related but allows us a lot 
more GC headroom) and SOLR where we switched to mmap because it FAR out 
performed pread variants - in no cases have we noticed long time to safe point 
(then again our IO is lightning fast).
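
For completeness, the non-memory-mapped read path mentioned further down can be 
forced in cassandra.yaml - roughly like this (disk_access_mode is not in the shipped 
yaml by default, so treat the option name and values as an assumption to verify for 
your version):

    disk_access_mode: standard        # or mmap_index_only, instead of the default auto/mmap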

> On Oct 8, 2016, at 1:20 PM, Jonathan Haddad  wrote:
> 
> Linux automatically uses free memory as cache.  It's not swap.
> 
> http://www.tldp.org/LDP/lki/lki-4.html 
> 
> 
> On Sat, Oct 8, 2016 at 11:12 AM Vladimir Yudovin  > wrote:
> Sorry, I don't quite catch something. What page (memory) cache can exist if there 
> is no swap file?
> Where are those pages written/read?
> 
> 
> Best regards, Vladimir Yudovin, 
> Winguzone  - Hosted Cloud Cassandra on 
> Azure and SoftLayer.
> Launch your cluster in minutes.
> 
> 
> 
>  On Sat, 08 Oct 2016 14:09:50 -0400 Ariel Weisberg > wrote  
> Hi,
> 
> Nope I mean page cache. Linux doesn't call the cache it maintains using free 
> memory a file cache. It uses free (and some of the time not so free!) memory 
> to buffer writes and to cache recently written/read data.
> 
> http://www.tldp.org/LDP/lki/lki-4.html 
> 
> 
> When Linux decides it needs free memory it can either evict stuff from the 
> page cache, flush dirty pages and then evict, or swap anonymous memory out. 
> When you disable swap you only disable the last behavior.
> 
> Maybe we are talking at cross purposes? What I meant is that increasing the 
> heap size to reduce GC frequency is a legitimate thing to do and it does have 
> an impact on the performance of the page cache even if you have swap disabled?
> 
> Ariel
> 
> 
> On Sat, Oct 8, 2016, at 01:54 PM, Vladimir Yudovin wrote:
> >Page cache is data pending flush to disk and data cached from disk.
> 
> Do you mean file cache?
> 
> 
> Best regards, Vladimir Yudovin, 
> Winguzone  - Hosted Cloud Cassandra on 
> Azure and SoftLayer.
> Launch your cluster in minutes.
> 
> 
>  On Sat, 08 Oct 2016 13:40:19 -0400 Ariel Weisberg  > wrote  
> Hi,
> 
> Page cache is in use even if you disable swap. Swap is anonymous memory, and 
> whatever else the Linux kernel supports paging out. Page cache is data 
> pending flush to disk and data cached from disk.
> 
> Given how bad the GC pauses are in C*, I think it's not the high pole in the 
> tent - at least not until key things are off heap and C* can run with CMS and 
> get 10 millisecond GCs all day long.
> 
> You can go through tuning and hardware selection to try to get more consistent 
> IO pauses and remove outliers as you mention, and as a user I think this is 
> your best bet. Generally it's either bad device or filesystem behavior if you 
> get page faults taking more than 200 milliseconds O(G1 gc collection).
> 
> I think a JVM change to allow safe points around memory mapped file access is 
> really unlikely although I agree it would be great. I think the best hack 
> around it is to code up your memory mapped file access into JNI methods and 
> find some way to get that to work. Right now if you want to create a safe 
> point a JNI method is the way to do it. The problem is that JNI methods and 
> POJOs don't get along well.
> 
> If you think about it the reason non-memory mapped IO works well is that it's 
> all JNI methods so they don't impact time to safe point. I think there is a 
> tradeoff between tolerance for outliers and performance.
> 
> I don't know the state of the non-memory mapped path and how reliable that 
> is. If it were reliable and I couldn't tolerate the outliers I would use 
> that. I have to ask though, why are you not able to tolerate the outliers? If 
> you are reading and writing at quorum how is this impacting you?
> 
> Regards,
> Ariel
> 
> On Sat, Oct 8, 2016, at 12:54 AM, Vladimir Yudovin wrote:
> Hi Josh,
> 
> >Running with increased heap size would reduce GC frequency, at the cost of 
> >page cache.
> 
> Actually it's recommended to run C* without virtual memory (swap) enabled. So if 
> there is not enough memory the JVM fails instead of blocking.
> 
> Best regards, Vladimir Yudovin, 
> Winguzone  - Hosted Cloud Cassandra on 
> Azure and SoftLayer.
> Launch your cluster in minutes.
> 
> 
>  On Fri, 07 Oct 2016 21:06:24 -0400 Josh Snyder > wrote  
> Hello cassandra-users, 
> 
> I'm investigating an issue with JVMs taking a while to reach a safepoint. I'd 
> like the list's input on confirming my hypothesis and finding mitigations. 
> 
> My hypothesis is that slow block devices are 

Re: large system hint partition

2016-09-19 Thread Graham Sanderson
The reason for large partitions is that the partition key is just the uuid of 
the target node.

More recent versions (3.0, I believe) don't have this problem, since they write 
hints to the file system much like the commit log.

Sadly the large partitions make things worse exactly when you are hinting, hence 
presumably already under stress.
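
For reference, the 2.x hints table looks roughly like this (reconstructed from 
memory, so treat the exact columns as approximate) - the partition key being just 
the target node's uuid is why all hints for one dead node pile into a single 
partition:

    CREATE TABLE system.hints (
        target_id uuid,
        hint_id timeuuid,
        message_version int,
        mutation blob,
        PRIMARY KEY ((target_id), hint_id, message_version)
    );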

Sent from my iPhone

> On Sep 16, 2016, at 6:13 PM, Nicolas Douillet  
> wrote:
> 
> Hi Ezra, 
> 
> Do you have a dead node in your cluster?
> The coordinator stores a hint for dead replicas in the local 
> system.hints table when a node is dead or didn't respond to a write request.
> 
> --
> Nicolas
> 
> 
> 
>> On Sat, Sep 17, 2016 at 00:12, Ezra Stuetzel  wrote:
>> What would be the likely causes of large system hint partitions? Normally 
>> large partition warnings are for user defined tables which they are writing 
>> large partitions to. In this case, it appears C* is writing large partitions 
>> to the system.hints table. Gossip is not backed up.
>> 
>> version: C* 2.2.7
>> WARN  [MemtableFlushWriter:134] 2016-09-16 04:27:39,220 
>> BigTableWriter.java:184 - Writing large partition 
>> system/hints:7ce838aa-f30f-494a-8caa-d44d1440e48b (128181097 bytes)
>> 
>> 
>> 
>> Thanks,
>> 
>> Ezra


Re: Blog post on Cassandra's inner workings and performance - feedback welcome

2016-07-10 Thread Graham Sanderson
2) ”why more memory makes things worse” - I’d be interested to see you argue 
that - it really isn’t true with big boxes (but yes, off heap is good) - we run 
24 gig JVMs with 8g new gen and never see more than a second or so STW, and that 
is rare (but we do have a lot of -XX: options).

> On Jul 9, 2016, at 11:58 AM, daemeon reiydelle  wrote:
> 
> I saw this really useful post a few days ago. I found the organization and 
> presentation quite clear and helpful (I often struggle trying to do high 
> level comparisons of Hadoop and Cass). Thank you!
> 
> If there was sections I would like to see your clear thoughts appear within, 
> it would be around:
> (1) why networks need to be clean (the impact of "dirty"/erratic networks); 
> (2) the impact of java (off heap, stop the world garbage collection, why more 
> memory makes things worse);
> (3) table design decisions (read mostly, write mostly, mixed read/write, etc.)
> A really great writeup, thank you!
> 
> 
> 
> 
> 
> 
> ...
> 
> Daemeon C.M. Reiydelle
> USA (+1) 415.501.0198
> London (+44) (0) 20 8144 9872
> 
> On Fri, Jul 8, 2016 at 11:59 PM, Manuel Kiessling  > wrote:
> Yes, the joke's on me. It was a copy error, and I've since posted the 
> correct URL (journeymonitor.com 
> :4000/tutorials/2016/02/29/cassandra-inner-workings-and-how-this-relates-to-performance/).
>  
> 
> Substantial feedback regarding the actual post still very much welcome.
> 
> Regards,
> Manuel
> 
> Am 09.07.2016 um 03:32 schrieb daemeon reiydelle  >:
> 
>> Localhost is a special network address that never leaves the operating 
>> system. It only goes "half way" down the IP stack. Thanks for your efforts!
>> 
>> 
>> ...
>> 
>> Daemeon C.M. Reiydelle
>> USA (+1) 415.501.0198 
>> London (+44) (0) 20 8144 9872 
>> 
>> On Fri, Jul 8, 2016 at 5:53 PM, Joaquin Alzola > > wrote:
>> Hi Manuel
>> 
>>  
>> 
>> I think localhost will not work for people on the internet.
>> 
>>  
>> 
>> BR
>> 
>>  
>> 
>> Joaquin
>> 
>>  
>> 
>> From: kiessling.man...@gmail.com  
>> [mailto:kiessling.man...@gmail.com ] On 
>> Behalf Of Manuel Kiessling
>> Sent: 07 July 2016 14:12
>> To: user@cassandra.apache.org 
>> Subject: Blog post on Cassandra's inner workings and performance - feedback 
>> welcome
>> 
>>  
>> 
>> Hi all,
>> 
>> I'm currently in the process of understanding the inner workings of 
>> Cassandra with regards to network and local storage mechanisms and 
>> operations. In order to do so, I've written a blog post about it which is 
>> now in a "first final" version.
>> 
>> Any feedback, especially corrections regarding misunderstandings on my side, 
>> would be highly appreciated. The post really represents my very subjective 
>> view on how Cassandra works under the hood, which makes it prone to errors 
>> of course.
>> 
>> You can access the current version at 
>> http://localhost:4000/tutorials/2016/02/29/cassandra-inner-workings-and-how-this-relates-to-performance/
>>  
>> 
>>  
>> 
>> Thanks,
>> 
>> --
>> 
>>  Manuel
>> 
>> This email is confidential and may be subject to privilege. If you are not 
>> the intended recipient, please do not copy or disclose its content but 
>> contact the sender immediately upon receipt.
>> 
> 





Re: cassandra full gc too often

2015-12-31 Thread Graham Sanderson
If you are lucky that might mask the real issue, but I doubt it… that is an 
insane number of compaction tasks and indicative of another problem. I would 
check release notes of 2.0.6+, if I recall that was not a stable version and 
may have had leaks.

Aside from that, just FYI, if you use offheap_objects for memtables you can run 
CMS fine even up to largish (20-30G) heap sizes without long GC pauses (at 
least over months). That said, look for promotion failures of 131074 DWORDS. 
Not sure what your GC logging options are, but what you posted doesn’t say why 
a full GC happened, which would be totally tunable - but as I say I don’t think 
GC is actually your problem.
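
A quick way to confirm the compaction backlog independently of the heap histogram 
(standard nodetool commands, nothing version specific):

    nodetool compactionstats      # pending tasks should be near zero on a healthy node
    nodetool tpstats              # look at CompactionExecutor pending/blocked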

> On Dec 31, 2015, at 10:16 AM, Ipremyadav  wrote:
> 
> Simplest option is to use java 8 with G1 gc. 
> 
> On 31 Dec 2015, at 10:23 a.m., Shuo Chen  > wrote:
> 
>> I have a cassandra 2.0.6 cluster with four nodes as backup database. The 
>> only operation is posting data into db. Recently, the full gc of the nodes 
>> increases apparently and blocks cluster operation.
>> 
>> The load of each node is 10G. The heap is 8G each with default jvm memory 
>> settings. The cpu is 24 cores. The settings of cassandra.yaml are almost 
>> default.
>> 
>> For debugging, all the clients are disconnected; however the gc is still high.
>> 
>> GC happens so often that it appears every minute and blocks for nearly 30s.
>> 
>> How can I debug this situation?
>> 
>> The following is the snippets of gc log:
>> 
>> before full gc histogram
>> 
>> 
>> Total Free Space: 0
>> Max   Chunk Size: 0
>> Number of Blocks: 0
>> Tree  Height: 0
>> 1620.992: [GC 7377342K(8178944K), 0.1380260 secs]
>> Before GC:
>> Statistics for BinaryTreeDictionary:
>> 
>> Total Free Space: 0
>> Max   Chunk Size: 0
>> Number of Blocks: 0
>> Tree  Height: 0
>> Before GC:
>> Statistics for BinaryTreeDictionary:
>> 
>> Total Free Space: 0
>> Max   Chunk Size: 0
>> Number of Blocks: 0
>> Tree  Height: 0
>> --
>>1:  61030245 1952967840  java.util.concurrent.FutureTask
>>2:  61031398 1464753552  
>> java.util.concurrent.Executors$RunnableAdapter
>>3:  61030312 1464727488  
>> java.util.concurrent.LinkedBlockingQueue$Node
>>4:  61030244 1464725856  
>> org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionTask
>>5:214181   13386992  [B
>>6:2028909738720  java.nio.HeapByteBuffer
>>7: 376225994808  [C
>>8: 421635722096  
>>9: 421635407152  
>>   10:  41704648240  
>>   11:1000604002400  
>> org.apache.cassandra.io.sstable.IndexHelper$IndexInfo
>>   12:  41702860816  
>>   13:  36642702720  
>>   14:  48172701576  [J
>>   15:  25751056600  
>>   16: 37431 898344  java.lang.String
>>   17:  2585 627352  [Ljava.lang.Object;
>>   18: 17728 567296  
>> java.util.concurrent.ConcurrentHashMap$HashEntry
>>   19: 22111 530664  javax.management.ObjectName$Property
>>   20:  7595 463824  [S
>>   21:  4534 433560  java.lang.Class
>>   22: 11581 370592  java.util.HashMap$Entry
>> 
>> ...
>> Total 289942930 8399196504
>> 1658.022: [Full GC 8178943K->6241829K(8178944K), 27.8562520 secs]
>> CMS: Large block 0x0007f7da9588
>> 
>> After GC:
>> Statistics for BinaryTreeDictionary:
>> 
>> Total Free Space: 6335823
>> Max   Chunk Size: 6335823
>> Number of Blocks: 1
>> Av.  Block  Size: 6335823
>> Tree  Height: 1
>> After GC:
>> Statistics for BinaryTreeDictionary:
>> 
>> Total Free Space: 0
>> Max   Chunk Size: 0
>> Number of Blocks: 0
>> Tree  Height: 0
>> 
>> 
>> It seems related to objects of Futuretask.
>> 
>> -- 
>> 陈硕 Shuo Chen
>> chenatu2...@gmail.com 
>> chens...@whaty.com 




Re: Cassandra Tuning Issue

2015-12-06 Thread Graham Sanderson
What version of C* are you using; what JVM version - you showed a partial GC 
config but if that is still CMS (not G1) then you are going to have insane GC 
pauses... 

Depending on C* version: are you using on- or off-heap memtables, and what type?

Those are the sorts of issues related to fat nodes I'd be worried about - we 
run very nicely at 20G total heap and 8G new - the rest of our 128G memory is 
disk cache/mmap and all of the off heap stuff, so it doesn't go to waste.

That said I think Jack is probably on the right path with overloaded 
coordinators - though you'd still expect to see CPU usage unless your timeouts 
are too low for the load, in which case the coordinator would be getting no 
responses in time and quite possibly the other nodes are just dropping the 
mutations (since they don't get to them before they know the coordinator would 
have timed out) - I forget the command to check dropped mutations off the top 
of my head, but you can see it in opcenter.
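
(For the record, nodetool tpstats reports this - the tail of its output lists 
dropped counts per message type, including MUTATION, alongside the per-stage 
pending/blocked counts:

    nodetool tpstats
)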

If you have GC problems you certainly 
expect to see GC CPU usage, but depending on how long you run your tests it 
might take you a little while to run through a 40G new gen.

I'm personally not a fan of >32G (ish) heaps as you can't do compressed oops 
and also it is unrealistic for CMS ... The word is that G1 is now working ok 
with C*, especially on newer C* and JDK versions, but that said it takes quite a 
lot of throughput to require insane quantities of young gen... We are guessing 
that when we remove all our legacy thrift batch inserts we will need less - and 
as for 20G total, we actually don't need that much (we dropped from 24 when we 
moved memtables off heap, and believe we can drop further).

Sent from my iPhone

> On Dec 6, 2015, at 9:07 AM, Jack Krupansky  wrote:
> 
> What replication factor are you using? Even if your writes use CL.ONE, 
> Cassandra will be attempting writes to the replica nodes in the background.
> 
> Are your writes "token aware"? If not, the receiving node has the overhead of 
> forwarding the request to the node that owns the token for the primary key.
> 
> For the record, Cassandra is not designed and optimized for so-called "fat 
> nodes". The design focus is "commodity hardware" and "distributed cluster" 
> (typically a dozen or more nodes.)
> 
> That said, it would be good if we had a rule of thumb for how many 
> simultaneous requests a node can handle, both external requests and 
> inter-node traffic. I think there is an open Jira to enforce a limit on 
> inflight requests so that nodes don't overloaded and start failing in the 
> middle of writes as you seem to be seeing.
> 
> -- Jack Krupansky
> 
>> On Sun, Dec 6, 2015 at 9:29 AM, jerry  wrote:
>> Dear All,
>> 
>> Now I have a 4-node Cassandra cluster, and I want to know the highest 
>> performance of my Cassandra cluster. I wrote a Java client to batch insert 
>> data into ALL 4 Cassandra nodes. When I start fewer than 30 subthreads in my 
>> client application to insert data into cassandra, everything is ok, 
>> but when I start more than 80 or 100 subthreads in my client 
>> applications, there are too many timeout exceptions (such as: Cassandra 
>> timeout during write query at consistency ONE (1 replica were required but 
>> only 0 acknowledged the write)). And no matter how many subthreads I use, or 
>> even if I start multiple clients with multiple subthreads on different 
>> computers, the highest performance I can get is about 6 - 8 TPS. By the way, 
>> each row I insert into cassandra is about 130 Bytes.
>> My 4 nodes of Cassandra is :
>> CPU: 4*15
>> Memory: 512G
>> Disk: flash card (only one disk but better than SSD)
>> My cassandra configurations are:
>> MAX_HEAP_SIZE: 60G
>> NEW_HEAP_SIZE: 40G
>> 
>> When I insert data into my cassandra cluster, no node has 
>> reached a bottleneck such as CPU or Memory or Disk. Each of the three main 
>> hardware resources is idle. So I think maybe there is something wrong with my 
>> configuration of the cassandra cluster. Can somebody please help me with my 
>> Cassandra tuning? Thanks in advance!
> 


Re: Behavior difference between 2.0 and 2.1

2015-12-03 Thread Graham Sanderson
You didn’t specify which version of 2.0 you were on.

There were a number of inconsistencies with static columns fixed in 2.0.10

for example CASSANDRA-7490, and CASSANDRA-7455, but there were others, and the 
same bugs may have caused a bunch of other issues.

Whether a partition exists when it only has static columns depends very much on 
exactly how you insert the data (and indeed I believe this is a rare case where 
an UPDATE is not equivalent to an INSERT). The behavior you see does make sense 
though, in that it should be possible to insert static data only, and thus the 
partition key must exist (so it is entirely reasonable to create CQL rows 
which have no actual - i.e. all null - values). Taking it a step further, if you 
have TTL on all non-static (clustering and data) columns, you don’t 
(necessarily) want the static data to disappear when the other cells do - 
though you can achieve this with statement-wide TTLing on insertion of the 
static data.
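
Using the schema quoted below, a sketch of the distinction (values made up):

    -- static-only write: only the partition key and the static column are set
    INSERT INTO roll (id, data) VALUES (1, 'roll-metadata');

    -- 2.1 returns a single row with image = null for this partition;
    -- 2.0 (per the report below) returned no rows
    SELECT image FROM roll WHERE id = 1;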

> On Dec 3, 2015, at 6:31 PM, Robert Wille  wrote:
> 
> With this schema:
> 
> CREATE TABLE roll (
> id INT,
> image BIGINT,
> data VARCHAR static,
> PRIMARY KEY ((id), image)
> ) WITH gc_grace_seconds = 3456000 AND compaction = { 'class' : 
> 'LeveledCompactionStrategy', 'sstable_size_in_mb' : 160 };
> 
> if I run SELECT image FROM roll WHERE id = X on 2.0, where partition X has 
> only static data, no rows were returned. In 2.1.11, it returns one row with a 
> null value. Was this change in behavior intentional? Is there an option to 
> get the old behavior back? I potentially have broken code anywhere that I 
> access a table with a static column. Kind of a mess, and not the kind of 
> thing a person expects when upgrading.
> 
> Thanks
> 
> Robert
> 





Re: why cassandra max is 20000/s on a node ?

2015-11-05 Thread Graham Sanderson
Agreed too. It also matters what you are inserting… if you are inserting to the 
same (or small set of) partition key(s) you will be limited because writes to 
the same partition key on a single node are atomic and isolated.

> On Nov 5, 2015, at 8:49 PM, Venkatesh Arivazhagan  
> wrote:
> 
> I agree with Tyler! Have you tried increasing the client threads from 5 
> to a higher number?
> 
> On Nov 5, 2015 6:46 PM, "郝加来" > 
> wrote:
> Right,
> but we want a node's throughput to be above a million/s, so if the system has 
> fifty tables, a single table needs to achieve 20000/s.
>  
>  
> 郝加来
>  
> From: Eric Stevens 
> Date: 2015-11-05 23:56
> To: user@cassandra.apache.org 
> Subject: Re: why cassandra max is 20000/s on a node ?
> > 512G memory , 128core cpu
> 
> This seems dramatically oversized for a Cassandra node.  You'd do much better 
> to have a much larger cluster of much smaller nodes.
> 
> 
> On Thu, Nov 5, 2015 at 8:25 AM Jack Krupansky  > wrote:
> I don't know what current numbers are, but last year the idea of getting 1 
> million writes per second on a 96 node cluster was considered a reasonable 
> achievement. That would be roughly 10,000 writes per second per node and you 
> are getting twice that.
> 
> See:
> http://www.datastax.com/1-million-writes 
> 
> 
> Or this Google test which hit 1 million writes per second with 330 nodes, 
> which would be roughly 3,000 writes per second per node:
> http://googlecloudplatform.blogspot.com/2014/03/cassandra-hits-one-million-writes-per-second-on-google-compute-engine.html
>  
> 
> 
> So, is your question why your throughput is so good or are you disappointed 
> that it wasn't better?
> 
> Cassandra is designed for clusters with lots of nodes, so if you want to get 
> an accurate measure of per-node performance you need to test with a 
> reasonable number of nodes and then divide aggregate performance by the 
> number of nodes, not test a single node alone. In short, testing a single 
> node in isolation is not a recommended approach to testing Cassandra 
> performance.
> 
> 
> -- Jack Krupansky
> 
> On Thu, Nov 5, 2015 at 9:05 AM, 郝加来  > wrote:
> hi
> veryone
> I set up cassandra 2.2.3 on a node; the machine's environment is 
> openjdk-1.8.0, 512G memory, 128-core cpu, 3T ssd.
> The token num is 256 on the node; the program uses datastax driver 2.1.8 and 
> uses 5 threads to insert data into cassandra on the same machine; the data's 
> capacity is 6G and 1157000 lines.
>  
> why is the throughput 20000/s on the node ?
>  
> # Per-thread stack size.
> JVM_OPTS="$JVM_OPTS -Xss512k"
>  
> # Larger interned string table, for gossip's benefit (CASSANDRA-6410)
> JVM_OPTS="$JVM_OPTS -XX:StringTableSize=103"
>  
> # GC tuning options
> JVM_OPTS="$JVM_OPTS -XX:+CMSIncrementalMode"
> JVM_OPTS="$JVM_OPTS -XX:+DisableExplicitGC"
> JVM_OPTS="$JVM_OPTS -XX:+CMSConcurrentMTEnabled"
> JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
> JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
> JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
> JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=4" 
> JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=2"
> JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
> JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
> JVM_OPTS="$JVM_OPTS -XX:+UseTLAB"
> JVM_OPTS="$JVM_OPTS -XX:CompileCommandFile=$CASSANDRA_CONF/hotspot_compiler"
> JVM_OPTS="$JVM_OPTS -XX:CMSWaitDuration=6"
>  
> memtable_heap_space_in_mb: 1024
> memtable_offheap_space_in_mb: 10240
> memtable_cleanup_threshold: 0.55
> memtable_allocation_type: heap_buffers
>  
>  
>  
> That's all,
> thanks
>  
> 郝加来
>  
> Finance East China Business Unit
> 
> Neusoft Corporation
> Neusoft Software Park, 1000 Ziyue Road, Minhang District, Shanghai
> Postcode:200241
> Tel:(86 21) 33578591
> Fax:(86 21) 23025565-111
> Mobile:13764970711
> Email:ha...@neusoft.com 
> Http://www.neusoft.com 
>  
>  
>  
>  
>  
> 
> ---
> Confidentiality Notice: The information contained in this e-mail and any 
> accompanying attachment(s) 
> is intended only for the use of the intended recipient and may be 
> confidential and/or privileged of 
> Neusoft Corporation, its subsidiaries and/or its affiliates. If any reader of 
> this communication is 
> not the intended recipient, unauthorized use, forwarding, printing,  storing, 
> disclosure or copying 
> is strictly prohibited, and may be unlawful.If you have received this 
> communication in error,please 
> immediately notify the sender by return e-mail, and delete the original 
> 

Re: why cassandra max is 20000/s on a node ?

2015-11-05 Thread Graham Sanderson
Also it sounds like you are reading the data from a single file - the problem 
could easily be with your load tool.

Try (as someone suggested) using cassandra-stress; see the example below.
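
Something like this (the exact path and flags vary a bit between versions, so 
treat it as a sketch):

    tools/bin/cassandra-stress write n=1000000 cl=ONE -rate threads=100 -node 127.0.0.1

That gives you a baseline that is independent of your own loader.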

> On Nov 5, 2015, at 9:06 PM, Graham Sanderson <gra...@vast.com> wrote:
> 
> Agreed too. It also matters what you are inserting… if you are inserting to 
> the same (or small set of) partition key(s) you will be limited because 
> writes to the same partition key on a single node are atomic and isolated.
> 
>> On Nov 5, 2015, at 8:49 PM, Venkatesh Arivazhagan <venkey.a...@gmail.com 
>> <mailto:venkey.a...@gmail.com>> wrote:
>> 
>> I agree with Tyler! Have you tried increasing the client threads from 5 
>> to a higher number?
>> 
>> On Nov 5, 2015 6:46 PM, "郝加来" <ha...@neusoft.com <mailto:ha...@neusoft.com>> 
>> wrote:
>> Right,
>> but we want a node's throughput to be above a million/s, so if the system has 
>> fifty tables, a single table needs to achieve 20000/s.
>>  
>>  
>> 郝加来
>>  
>> From: Eric Stevens <mailto:migh...@gmail.com>
>> Date: 2015-11-05 23:56
>> To: user@cassandra.apache.org <mailto:user@cassandra.apache.org>
>> Subject: Re: why cassandra max is 20000/s on a node ?
>> > 512G memory , 128core cpu
>> 
>> This seems dramatically oversized for a Cassandra node.  You'd do much 
>> better to have a much larger cluster of much smaller nodes.
>> 
>> 
>> On Thu, Nov 5, 2015 at 8:25 AM Jack Krupansky <jack.krupan...@gmail.com 
>> <mailto:jack.krupan...@gmail.com>> wrote:
>> I don't know what current numbers are, but last year the idea of getting 1 
>> million writes per second on a 96 node cluster was considered a reasonable 
>> achievement. That would be roughly 10,000 writes per second per node and you 
>> are getting twice that.
>> 
>> See:
>> http://www.datastax.com/1-million-writes 
>> <http://www.datastax.com/1-million-writes>
>> 
>> Or this Google test which hit 1 million writes per second with 330 nodes, 
>> which would be roughly 3,000 writes per second per node:
>> http://googlecloudplatform.blogspot.com/2014/03/cassandra-hits-one-million-writes-per-second-on-google-compute-engine.html
>>  
>> <http://googlecloudplatform.blogspot.com/2014/03/cassandra-hits-one-million-writes-per-second-on-google-compute-engine.html>
>> 
>> So, is your question why your throughput is so good or are you disappointed 
>> that it wasn't better?
>> 
>> Cassandra is designed for clusters with lots of nodes, so if you want to get 
>> an accurate measure of per-node performance you need to test with a 
>> reasonable number of nodes and then divide aggregate performance by the 
>> number of nodes, not test a single node alone. In short, testing a single 
>> node in isolation is not a recommended approach to testing Cassandra 
>> performance.
>> 
>> 
>> -- Jack Krupansky
>> 
>> On Thu, Nov 5, 2015 at 9:05 AM, 郝加来 <ha...@neusoft.com 
>> <mailto:ha...@neusoft.com>> wrote:
>> hi
>> veryone
>> I set up cassandra 2.2.3 on a node; the machine's environment is 
>> openjdk-1.8.0, 512G memory, 128-core cpu, 3T ssd.
>> The token num is 256 on the node; the program uses datastax driver 2.1.8 and 
>> uses 5 threads to insert data into cassandra on the same machine; the data's 
>> capacity is 6G and 1157000 lines.
>>  
>> why is the throughput 20000/s on the node ?
>>  
>> # Per-thread stack size.
>> JVM_OPTS="$JVM_OPTS -Xss512k"
>>  
>> # Larger interned string table, for gossip's benefit (CASSANDRA-6410)
>> JVM_OPTS="$JVM_OPTS -XX:StringTableSize=103"
>>  
>> # GC tuning options
>> JVM_OPTS="$JVM_OPTS -XX:+CMSIncrementalMode"
>> JVM_OPTS="$JVM_OPTS -XX:+DisableExplicitGC"
>> JVM_OPTS="$JVM_OPTS -XX:+CMSConcurrentMTEnabled"
>> JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
>> JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
>> JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
>> JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=4" 
>> JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=2"
>> JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
>> JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
>> JVM_OPTS="$JVM_OPTS -XX:+UseTLAB"
>> JVM_OPTS="$JVM_OPTS -XX:CompileCommandFile=$CASSANDRA_CONF/hotspot_compiler"
>> JVM_OPTS="$JVM_OPTS -XX:CMSWaitDuration=6"
>>  
>> memtable_heap_space_in_mb: 1024
>

Re: compression cpu overhead

2015-11-03 Thread Graham Sanderson
On read or write?

https://issues.apache.org/jira/browse/CASSANDRA-7039 
 and friends in 2.2 
should make some difference, I didn’t immediately find perf numbers though.
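
If you want to measure it yourself, the cheapest experiment is usually to flip 
compression on one table and compare (2.x option names shown; they changed in 3.0 - 
and the keyspace/table names here are just placeholders):

    ALTER TABLE myks.mytable WITH compression = {'sstable_compression': 'LZ4Compressor'};
    ALTER TABLE myks.mytable WITH compression = {'sstable_compression': ''};   -- disables compression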

> On Nov 3, 2015, at 5:42 PM, Dan Kinder  wrote:
> 
> Hey all,
> 
> Just wondering if anyone has seen or done any benchmarking for the 
> actual CPU overhead added by various compression algorithms in Cassandra (at 
> least LZ4) vs no compression. Clearly this is going to be workload dependent 
> but even a rough gauge would be helpful (ex. "Turning on LZ4 compression 
> increases my CPU load by ~2x")
> 
> -dan





Re: Cassandra stalls and dropped messages not due to GC

2015-10-29 Thread Graham Sanderson
Only if you actually change cassandra.yaml (that was the change in 2.1.6 which 
is why it matters what version he upgraded from)

> On Oct 29, 2015, at 10:06 PM, Sebastian Estevez 
> <sebastian.este...@datastax.com> wrote:
> 
> The thing about the CASSANDRA-9504 theory is that it was solved in 2.1.6 and 
> Jeff's running 2.1.11.
> 
> @Jeff
> 
> How often does this happen? Can you watch ttop as soon as you notice 
> increased read/write latencies?
> 
> wget https://bintray.com/artifact/download/aragozin/generic/sjk-plus-0.3.6.jar
> java -jar sjk-plus-0.3.6.jar ttop -s localhost:7199 -n 30 -o CPU
> 
> This should at least tell you which Cassandra threads are causing high memory 
> allocations  and CPU consumption.
> 
> All the best,
> 
> Sebastián Estévez
> Solutions Architect | 954 905 8615 | sebastian.este...@datastax.com 
> 
> DataStax is the fastest, most scalable distributed database technology, 
> delivering Apache Cassandra to the world’s most innovative enterprises. 
> Datastax is built to be agile, always-on, and predictably scalable to any 
> size. With more than 500 customers in 45 countries, DataStax is the database 
> technology and transactional backbone of choice for the worlds most 
> innovative companies such as Netflix, Adobe, Intuit, and eBay. 
> 
> On Thu, Oct 29, 2015 at 9:36 PM, Graham Sanderson <gra...@vast.com 
> <mailto:gra...@vast.com>> wrote:
> you didn’t say what you upgraded from, but if it is 2.0.x, then look at 
> CASSANDRA-9504
> 
> If so and you use
> commitlog_sync: batch
> Then you probably want to set
> commitlog_sync_batch_window_in_ms: 1 (or 2)
> Note I’m only slightly convinced this is the cause because of your 
> READ_REPAIR issues (though if you are dropping a lot of MUTATIONS under load 
> and your machines are overloaded, you’d be doing more READ_REPAIR than usual 
> probably)
> 
>> On Oct 29, 2015, at 8:12 PM, Jeff Ferland <j...@tubularlabs.com 
>> <mailto:j...@tubularlabs.com>> wrote:
>> 
>> Using DSE 4.8.1 / 2.1.11.872, Java version 1.8.0_66
>> 
>> We upgraded our cluster this weekend and have been having issues with 
>> dropped mutations since then. Intensely investigating a single node and 
>> toying with settings has revealed that GC stalls don’t make up enough time 
>> to explain the 10 seconds of apparent stall that would cause a hangup.
>> 
>> tpstats output typically shows active threads in the low single digits and 
>> pending similar or 0. Before a failure, pending MutationStage will skyrocket 
>> into 5+ digits. System.log regularly shows the gossiper complaining, then 
>> slow log complaints, then logs dropped mutations.
>> 
>> For the entire minute of 00:55, the gc logging shows no single pause > .14 
>> seconds and most of them much smaller. Abbreviated GC log after switching to 
>> g1gc (problem also exhibited before G1GC):
>> 
>> 2015-10-30T00:55:00.550+: 6752.857: [GC pause (G1 Evacuation Pause) 
>> (young)
>> 2015-10-30T00:55:02.843+: 6755.150: [GC pause (GCLocker Initiated GC) 
>> (young)
>> 2015-10-30T00:55:05.241+: 6757.548: [GC pause (G1 Evacuation Pause) 
>> (young)
>> 2015-10-30T00:55:07.755+: 6760.062: [GC pause (G1 Evacuation Pause) 
>> (young)
>> 2015-10-30T00:55:10.532+: 6762.839: [GC pause (G1 Evacuation Pause) 
>> (young)
>> 2015-10-30T00:55:13.080+: 6765.387: [GC pause (G1 Evacuation Pause) 
>> (young)
>> 2015-10-30T00:55:15.914+: 6768.221: [GC pause (G1 Evacuation Pause) 
>> (young)
>> 2015-10-30T00:55:18.619+: 6770.926: [GC pause (GCLocker Initiated GC) 
>> (young)
>> 2015-10-30T00:55:23.270+: 6775.578: [GC pause (GCLocker Initiated GC) 
>> (young)
>> 2015-10-30T00:55:28.662+: 6780.969: [GC pause (GCLocker Initiated GC) 
>> (young)
>> 2015-10-30T00:55:33.326+: 6785.633: [GC pause (G1 Evacuation Pause) 
>> (young)
>> 2015-10-30T00:55:36.600+: 6788.907: [GC pause (G1 Evacuation Pause) 
>> (young)
>> 2015-10-30T00:55:40.050+: 6792.357: [GC pause (G1 Evacuation Pause) 
>> (young)
>> 2015-10-30T00:55:43.728+: 6796.0

Re: Cassandra stalls and dropped messages not due to GC

2015-10-29 Thread Graham Sanderson
you didn’t say what you upgraded from, but if it is 2.0.x, then look at 
CASSANDRA-9504

If so and you use
commitlog_sync: batch
Then you probably want to set
commitlog_sync_batch_window_in_ms: 1 (or 2)
Note I’m only slightly convinced this is the cause because of your READ_REPAIR 
issues (though if you are dropping a lot of MUTATIONS under load and your 
machines are overloaded, you’d be doing more READ_REPAIR than usual probably)

> On Oct 29, 2015, at 8:12 PM, Jeff Ferland  wrote:
> 
> Using DSE 4.8.1 / 2.1.11.872, Java version 1.8.0_66
> 
> We upgraded our cluster this weekend and have been having issues with dropped 
> mutations since then. Intensely investigating a single node and toying with 
> settings has revealed that GC stalls don’t make up enough time to explain the 
> 10 seconds of apparent stall that would cause a hangup.
> 
> tpstats output typically shows active threads in the low single digits and 
> pending similar or 0. Before a failure, pending MutationStage will skyrocket 
> into 5+ digits. System.log regularly shows the gossiper complaining, then 
> slow log complaints, then logs dropped mutations.
> 
> For the entire minute of 00:55, the gc logging shows no single pause > .14 
> seconds and most of them much smaller. Abbreviated GC log after switching to 
> g1gc (problem also exhibited before G1GC):
> 
> 2015-10-30T00:55:00.550+: 6752.857: [GC pause (G1 Evacuation Pause) 
> (young)
> 2015-10-30T00:55:02.843+: 6755.150: [GC pause (GCLocker Initiated GC) 
> (young)
> 2015-10-30T00:55:05.241+: 6757.548: [GC pause (G1 Evacuation Pause) 
> (young)
> 2015-10-30T00:55:07.755+: 6760.062: [GC pause (G1 Evacuation Pause) 
> (young)
> 2015-10-30T00:55:10.532+: 6762.839: [GC pause (G1 Evacuation Pause) 
> (young)
> 2015-10-30T00:55:13.080+: 6765.387: [GC pause (G1 Evacuation Pause) 
> (young)
> 2015-10-30T00:55:15.914+: 6768.221: [GC pause (G1 Evacuation Pause) 
> (young)
> 2015-10-30T00:55:18.619+: 6770.926: [GC pause (GCLocker Initiated GC) 
> (young)
> 2015-10-30T00:55:23.270+: 6775.578: [GC pause (GCLocker Initiated GC) 
> (young)
> 2015-10-30T00:55:28.662+: 6780.969: [GC pause (GCLocker Initiated GC) 
> (young)
> 2015-10-30T00:55:33.326+: 6785.633: [GC pause (G1 Evacuation Pause) 
> (young)
> 2015-10-30T00:55:36.600+: 6788.907: [GC pause (G1 Evacuation Pause) 
> (young)
> 2015-10-30T00:55:40.050+: 6792.357: [GC pause (G1 Evacuation Pause) 
> (young)
> 2015-10-30T00:55:43.728+: 6796.035: [GC pause (G1 Evacuation Pause) 
> (young)
> 2015-10-30T00:55:48.216+: 6800.523: [GC pause (G1 Evacuation Pause) 
> (young)
> 2015-10-30T00:55:53.621+: 6805.928: [GC pause (G1 Evacuation Pause) 
> (young)
> 2015-10-30T00:55:59.048+: 6811.355: [GC pause (GCLocker Initiated GC) 
> (young)
> 
> System log snippet of the pattern I’m seeing:
> 
> WARN  [GossipTasks:1] 2015-10-30 00:55:25,129  Gossiper.java:747 - Gossip 
> stage has 1 pending tasks; skipping status check (no nodes will be marked 
> down)
> INFO  [CompactionExecutor:210] 2015-10-30 00:55:26,006  
> CompactionTask.java:141 - Compacting 
> [SSTableReader(path='/mnt/cassandra/data/system/hints/system-hints-ka-8283-Data.db'),
>  
> SSTableReader(path='/mnt/cassandra/data/system/hints/system-hints-ka-8286-Data.db'),
>  
> SSTableReader(path='/mnt/cassandra/data/system/hints/system-hints-ka-8284-Data.db'),
>  
> SSTableReader(path='/mnt/cassandra/data/system/hints/system-hints-ka-8285-Data.db'),
>  
> SSTableReader(path='/mnt/cassandra/data/system/hints/system-hints-ka-8287-Data.db')]
> WARN  [GossipTasks:1] 2015-10-30 00:55:26,230  Gossiper.java:747 - Gossip 
> stage has 3 pending tasks; skipping status check (no nodes will be marked 
> down)
> WARN  [GossipTasks:1] 2015-10-30 00:55:27,330  Gossiper.java:747 - Gossip 
> stage has 5 pending tasks; skipping status check (no nodes will be marked 
> down)
> WARN  [GossipTasks:1] 2015-10-30 00:55:28,431  Gossiper.java:747 - Gossip 
> stage has 7 pending tasks; skipping status check (no nodes will be marked 
> down)
> WARN  [GossipTasks:1] 2015-10-30 00:55:29,531  Gossiper.java:747 - Gossip 
> stage has 10 pending tasks; skipping status check (no nodes will be marked 
> down)
> INFO  [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,448  
> CqlSlowLogWriter.java:151 - Recording statements with duration of 16042 in 
> slow log
> INFO  [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,451  
> CqlSlowLogWriter.java:151 - Recording statements with duration of 16047 in 
> slow log
> INFO  [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,453  
> CqlSlowLogWriter.java:151 - Recording statements with duration of 16018 in 
> slow log
> INFO  [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,454  
> CqlSlowLogWriter.java:151 - Recording statements with duration of 16042 in 
> slow log
> INFO  [CqlSlowLog-Writer-thread-0] 2015-10-30 00:55:32,455  
> CqlSlowLogWriter.java:151 - Recording statements with duration of 16024 in 
> slow 

Re: High cpu usage when the cluster is idle

2015-10-24 Thread Graham Sanderson
I would imagine you are running on fairly slow machines (given the CPU usage), 
but 2.0.12 and 2.1 use a fairly old version of the yammer/codahale metrics 
library.

It is waking up every 5 seconds, and updating Meters… there are a bunch of 
these Meters per table (embedded in Timers), so your large 1500 table count is 
basically most of the problem. 

AFAIK there is no way to turn the metrics off; they also power the JMX 
interfaces.
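If you want to confirm how much CPU those tick threads actually burn, here is a rough sketch over remote JMX; the default JMX port 7199 and the thread-name prefix (taken from the jstack output below) are assumptions:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class MeterTickCpu {
    public static void main(String[] args) throws Exception {
        // Assumes the default Cassandra JMX port 7199 on localhost.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
        try (JMXConnector jmx = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = jmx.getMBeanServerConnection();
            ThreadMXBean threads = ManagementFactory.newPlatformMXBeanProxy(
                    conn, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);
            for (long id : threads.getAllThreadIds()) {
                ThreadInfo info = threads.getThreadInfo(id);
                if (info != null && info.getThreadName().startsWith("metrics-meter-tick")) {
                    long cpuNanos = threads.getThreadCpuTime(id); // -1 if CPU timing is disabled
                    System.out.printf("%s cpu=%.1fs%n", info.getThreadName(), cpuNanos / 1e9);
                }
            }
        }
    }
}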

> On Oct 24, 2015, at 7:54 AM, Xu Zhongxing  wrote:
> 
> The cassandra version is 2.0.12.  We have 1500 tables in the cluster of 6 
> nodes, with a total 2.5 billion rows.
> 
> At 2015-10-24 20:52, "Xu Zhongxing" wrote:
> 
> 
> I saw an average 10% cpu usage on each node when the cassandra cluster has no 
> load at all.
> I checked which thread was using the cpu, and I got the following 2 metric 
> threads each occupying 5% cpu.
> 
> jstack output:  
> 
> "metrics-meter-tick-thread-2" daemon prio=10 tic=...
>   java.lang.Thread.State: WAITING (parking)
>   at sum.misc.Unsafe.park(Native Method)
>   -parking to wait for ...
>   at ... (LockSupport.java:186)
>   at ... (AbstractQueuedSynchronizer.java:2043)
> ...
> at .. (Thread.java:745)
> 
> The other thread is the same.
> 
> Can someone give some clue to this problem? Thank you.





Re: unusual GC log

2015-10-20 Thread Graham Sanderson
What version of C* are you running? Any special settings in cassandra.yaml? Are 
you running with stock GC settings in cassandra-env.sh? What JDK/OS?

> On Oct 19, 2015, at 11:40 PM, 曹志富  wrote:
> 
> INFO  [Service Thread] 2015-10-20 10:42:47,854 GCInspector.java:252 - ParNew 
> GC in 476ms.  CMS Old Gen: 4288526240 -> 4725514832; Par Eden Space: 
> 671088640 -> 0; 
> INFO  [Service Thread] 2015-10-20 10:42:50,870 GCInspector.java:252 - ParNew 
> GC in 423ms.  CMS Old Gen: 4725514832 -> 5114687560; Par Eden Space: 
> 671088640 -> 0; 
> INFO  [Service Thread] 2015-10-20 10:42:53,847 GCInspector.java:252 - ParNew 
> GC in 406ms.  CMS Old Gen: 5114688368 -> 5513119264; Par Eden Space: 
> 671088640 -> 0; 
> INFO  [Service Thread] 2015-10-20 10:42:57,118 GCInspector.java:252 - ParNew 
> GC in 421ms.  CMS Old Gen: 5513119264 -> 5926324736; Par Eden Space: 
> 671088640 -> 0; 
> INFO  [Service Thread] 2015-10-20 10:43:00,041 GCInspector.java:252 - ParNew 
> GC in 437ms.  CMS Old Gen: 5926324736 -> 6324793584; Par Eden Space: 
> 671088640 -> 0; 
> INFO  [Service Thread] 2015-10-20 10:43:03,029 GCInspector.java:252 - ParNew 
> GC in 429ms.  CMS Old Gen: 6324793584 -> 6693672608; Par Eden Space: 
> 671088640 -> 0; 
> INFO  [Service Thread] 2015-10-20 10:43:05,566 GCInspector.java:252 - ParNew 
> GC in 339ms.  CMS Old Gen: 6693672608 -> 6989128592; Par Eden Space: 
> 671088640 -> 0; 
> INFO  [Service Thread] 2015-10-20 10:43:08,431 GCInspector.java:252 - ParNew 
> GC in 421ms.  CMS Old Gen: 6266493464 -> 6662041272; Par Eden Space: 
> 671088640 -> 0; 
> INFO  [Service Thread] 2015-10-20 10:43:11,131 GCInspector.java:252 - 
> ConcurrentMarkSweep GC in 215ms.  CMS Old Gen: 5926324736 -> 4574418480; CMS 
> Perm Gen: 33751256 -> 33751192
> ; Par Eden Space: 7192 -> 611360336; 
> INFO  [Service Thread] 2015-10-20 10:43:11,848 GCInspector.java:252 - ParNew 
> GC in 511ms.  CMS Old Gen: 4574418480 -> 4996166672; Par Eden Space: 
> 671088640 -> 0; 
> INFO  [Service Thread] 2015-10-20 10:43:14,915 GCInspector.java:252 - ParNew 
> GC in 395ms.  CMS Old Gen: 4996167912 -> 5380926744; Par Eden Space: 
> 671088640 -> 0; 
> INFO  [Service Thread] 2015-10-20 10:43:18,335 GCInspector.java:252 - ParNew 
> GC in 432ms.  CMS Old Gen: 5380926744 -> 5811659120; Par Eden Space: 
> 671088640 -> 0; 
> INFO  [Service Thread] 2015-10-20 10:43:21,492 GCInspector.java:252 - ParNew 
> GC in 439ms.  CMS Old Gen: 5811659120 -> 6270861936; Par Eden Space: 
> 671088640 -> 0; 
> INFO  [Service Thread] 2015-10-20 10:43:24,698 GCInspector.java:252 - ParNew 
> GC in 490ms.  CMS Old Gen: 6270861936 -> 6668734208; Par Eden Space: 
> 671088640 -> 0; Par Survivor Sp
> ace: 83886080 -> 83886072
> INFO  [Service Thread] 2015-10-20 10:43:27,963 GCInspector.java:252 - ParNew 
> GC in 457ms.  CMS Old Gen: 6668734208 -> 7072885208; Par Eden Space: 
> 671088640 -> 0; Par Survivor Sp
> ace: 83886072 -> 83886080
> 
> after seconds node mark down.
> 
> My node config is: 8GB heap, NEW_HEAP size is 800MB
> 
> Node hardware is: 4 cores, 32GB RAM
> 
> --
> Ranger Tsao





BEWARE https://issues.apache.org/jira/browse/CASSANDRA-9504

2015-10-19 Thread Graham Sanderson
If you had Cassandra 2.0.x (possibly before) and upgraded to Cassandra 2.1, you 
may have had

commitlog_sync: batch
commitlog_sync_batch_window_in_ms: 25

in your cassandra.yaml

It turned out that this was pretty much broken in 2.0 (i.e. fsyncs just 
happened immediately), but fixed in 2.1, which meant that every mutation 
blocked its writer thread for 25ms; at 80 mutations/sec per writer thread 
you’d start DROPPING mutations if your write timeout is 2000ms.

This turns out to be a massive problem if you write fast, and the default 
commitlog_sync_batch_window_in_ms was changed to 2 ms in 2.1.6 as a way of 
addressing this (with some suggesting 1ms)
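A back-of-envelope sketch of where that 80 figure comes from; the timeout and window values are the defaults discussed here, nothing measured:

public class BatchWindowMath {
    public static void main(String[] args) {
        long writeTimeoutMs = 2000;       // write_request_timeout_in_ms default
        long[] batchWindowsMs = {25, 2};  // old default vs the 2.1.6+ suggested default
        for (long window : batchWindowsMs) {
            // Each mutation holds its writer thread for roughly one batch window,
            // so timeout / window is about how many queued mutations a writer
            // thread can ack before the rest start timing out and get dropped.
            System.out.printf("window=%dms -> ~%d mutations per writer thread before drops%n",
                    window, writeTimeoutMs / window);
        }
    }
}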

Neither of these changes got much fanfare except an eventual reference in 
CHANGES.TXT

With 2.1.9 if you aren’t doing periodic sync, then I think the new behavior is 
just to sync whenever the commit logs have a consistent/complete set of 
mutations ready.

Note this is hard to diagnose because CPU is idle and pretty much all latency 
metrics (except the overall coordinator write) do not count this time (and you 
probably weren’t noticing the 25ms write ACK time). It turned out for us that 
one of our nodes was getting more writes (> 20k mutations per second) which was 
about the magic number… anything shy of that and everything looked fine, but 
just by going slightly over, this node was dropping lots of mutations.








Re: BEWARE https://issues.apache.org/jira/browse/CASSANDRA-9504

2015-10-19 Thread Graham Sanderson
But basically if you were on 2.1.0 thru 2.1.5 you probably couldn’t know to 
change your config
If you were on 2.1.6 thru 2.1.8 you may not have noticed the NEWS.TXT change 
and changed your config
If you are on 2.1.9+ you are probably OK

if you are using periodic fsync then you don’t have an issue

> On Oct 19, 2015, at 11:37 AM, Graham Sanderson <gra...@vast.com> wrote:
> 
> - commitlog_sync_batch_window_in_ms behavior has changed from the
>   maximum time to wait between fsync to the minimum time.  We are 
>   working on making this more user-friendly (see CASSANDRA-9533) but in the
>   meantime, this means 2.1 needs a much smaller batch window to keep
>   writer threads from starving.  The suggested default is now 2ms.
> was added retroactively to NEWS.txt in 2.1.6 which is why it is not obvious
> 
>> On Oct 19, 2015, at 11:03 AM, Michael Shuler <mich...@pbandjelly.org 
>> <mailto:mich...@pbandjelly.org>> wrote:
>> 
>> On 10/19/2015 10:55 AM, Graham Sanderson wrote:
>>> If you had Cassandra 2.0.x (possibly before) and upgraded to Cassandra
>>> 2.1, you may have had
>>> 
>>> commitlog_sync: batch
>>> 
>>> commitlog_sync_batch_window_in_ms: 25
>>> 
>>> 
>>> in your cassandra.yaml
>>> 
>>> It turned out that this was pretty much broken in 2.0 (i.e. fsyncs just
>>> happened immediately), but fixed in 2.1, *which meant that every
>>> mutation blocked its writer thread for 25ms meaning at 80
>>> mutations/sec/writer thread you’d start DROPPING mutations if your write
>>> timeout is 2000ms.*
>>> 
>>> This turns out to be a massive problem if you write fast, and the
>>> default commitlog_sync_batch_window_in_ms was changed to 2 ms in 2.1.6
>>> as a way of addressing this (with some suggesting 1ms)
>>> 
>>> Neither of these changes got much fanfare except an eventual reference
>>> in CHANGES.TXT
>>> 
>>> With 2.1.9 if you aren’t doing periodic sync, then I think the new
>>> behavior is just to sync whenever the commit logs have a
>>> consistent/complete set of mutations ready.
>>> 
>>> Note this is hard to diagnose because CPU is idle and pretty much all
>>> latency metrics (except the overall coordinator write) do not count this
>>> time (and you probably weren’t noticing the 25ms write ACK time). It
>>> turned out for us that one of our nodes was getting more writes (> 20k
>>> mutations per second) which was about the magic number… anything shy of
>>> that and everything looked fine, but just by going slightly over, this
>>> node was dropping lots of mutations.
>> 
>> If you would be kind enough to submit a patch to JIRA for NEWS.txt (aligned 
>> with the right versions you're warning about) that includes the info 
>> upgrading users might need, that would be great!
>> 
>> -- 
>> Kind regards,
>> Michael
> 





Re: BEWARE https://issues.apache.org/jira/browse/CASSANDRA-9504

2015-10-19 Thread Graham Sanderson
- commitlog_sync_batch_window_in_ms behavior has changed from the
  maximum time to wait between fsync to the minimum time.  We are 
  working on making this more user-friendly (see CASSANDRA-9533) but in the
  meantime, this means 2.1 needs a much smaller batch window to keep
  writer threads from starving.  The suggested default is now 2ms.
was added retroactively to NEWS.txt in 2.1.6 which is why it is not obvious

> On Oct 19, 2015, at 11:03 AM, Michael Shuler <mich...@pbandjelly.org> wrote:
> 
> On 10/19/2015 10:55 AM, Graham Sanderson wrote:
>> If you had Cassandra 2.0.x (possibly before) and upgraded to Cassandra
>> 2.1, you may have had
>> 
>> commitlog_sync: batch
>> 
>> commitlog_sync_batch_window_in_ms: 25
>> 
>> 
>> in your cassandra.yaml
>> 
>> It turned out that this was pretty much broken in 2.0 (i.e. fsyncs just
>> happened immediately), but fixed in 2.1, *which meant that every
>> mutation blocked its writer thread for 25ms meaning at 80
>> mutations/sec/writer thread you’d start DROPPING mutations if your write
>> timeout is 2000ms.*
>> 
>> This turns out to be a massive problem if you write fast, and the
>> default commitlog_sync_batch_window_in_ms was changed to 2 ms in 2.1.6
>> as a way of addressing this (with some suggesting 1ms)
>> 
>> Neither of these changes got much fanfare except an eventual reference
>> in CHANGES.TXT
>> 
>> With 2.1.9 if you aren’t doing periodic sync, then I think the new
>> behavior is just to sync whenever the commit logs have a
>> consistent/complete set of mutations ready.
>> 
>> Note this is hard to diagnose because CPU is idle and pretty much all
>> latency metrics (except the overall coordinator write) do not count this
>> time (and you probably weren’t noticing the 25ms write ACK time). It
>> turned out for us that one of our nodes was getting more writes (> 20k
>> mutations per second) which was about the magic number… anything shy of
>> that and everything looked fine, but just by going slightly over, this
>> node was dropping lots of mutations.
> 
> If you would be kind enough to submit a patch to JIRA for NEWS.txt (aligned 
> with the right versions you're warning about) that includes the info 
> upgrading users might need, that would be great!
> 
> -- 
> Kind regards,
> Michael





Re: Realtime data and (C)AP

2015-10-11 Thread Graham Sanderson
Obviously QUORUM_OR_ONE is in general no better than ONE. However we hardly 
EVER fail back to ONE, and we are making a conscious choice. I’m okay with 
hiding it if it is too tempting, but for insert/append only workloads without 
deletes or TTL, it is a perfectly good trade off. Why not just use read ONE 
then, you say? Well, because we insert very, very fast and so drop some 
mutations… to that end when I say we hardly ever fail back to ONE, with 
hinting/read repair and speculative reads we can generally prove we have the 
right data at quorum (we like to know). Note that each partition key may have 
multiple values written over time, and generally the chances of them being read 
anywhere close to when they are written is small.

We don’t do this fallback to ONE for everything - it is entirely based on the 
use case.

P.S. I am not familiar with the DowngradingConsistencyRetryPolicy since we use 
our own (open source) Scala CQL driver, but we only downgrade on read not write 
which may be where some of your argument is coming from. i.e. our requirement 
is that we know whether the data stored is good or not, and if we choose to 
read at quorum it will be or it will fail. we always fail writes if they don’t 
achieve quorum.
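A minimal sketch of that per-read fallback with the stock DataStax Java driver; keyspace, table and key are invented, and the DowngradingConsistencyRetryPolicy mentioned elsewhere in this thread automates a similar idea:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;
import com.datastax.driver.core.exceptions.QueryExecutionException;

public class QuorumOrOneRead {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("my_keyspace")) {
            Statement read = new SimpleStatement(
                    "SELECT value FROM my_table WHERE id = ?", "some-key")
                    .setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
            ResultSet rs;
            try {
                rs = session.execute(read);            // prefer the consistent read
            } catch (QueryExecutionException e) {      // unavailable / timed out at quorum
                read.setConsistencyLevel(ConsistencyLevel.LOCAL_ONE);
                rs = session.execute(read);            // conscious per-read fallback
            }
            System.out.println(rs.one());
        }
    }
}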


> On Oct 11, 2015, at 4:04 PM, Ryan Svihla <r...@foundev.pro> wrote:
> 
> Downgrading Consistency Policy suffers from effectively being the downgraded 
> consistency policy aka CL one. I think it's helpful to remember that 
> Consistency Level is effectively a contract on your consistency, if you do 
> "quorum or one" you're basically CL ONE. Think of it this way, CL ONE usually 
> successfully writes to RF nodes, but you're only requiring one to have a 
> successful write; how is that any different from "quorum or one"? If you only 
> have one node up it'll be CL ONE, if you have two nodes up it'll be CL 
> QUORUM. 
> 
> This approach somehow accomplishes the worst of both worlds, with the speed 
> of QUORUM (since it has to fail to downgrade) and the consistency contract of 
> ONE; it really is a pretty terrible engineering tradeoff. Plus if you're ok 
> with ONE some of the time you're ok with ONE all the time.
> 
> For clarity, I think downgrading consistency policy should be deprecated, I 
> think it totally gets people thinking the wrong way about consistency level.
> 
> On Sun, Oct 11, 2015 at 11:48 AM, Eric Stevens <migh...@gmail.com 
> <mailto:migh...@gmail.com>> wrote:
> The DataStax Java driver is based on Netty and is non blocking; if you do any 
> CQL work you should look into it.  At ProtectWise we use it with high write 
> volumes from Scala/Akka with great success.  
> 
> We have a thin Scala wrapper around the Java driver that makes it act more 
> Scalaish (eg, Scala futures instead of Java futures, string contexts to 
> construct statements, and so on).  This has also let us do some other cool 
> things like integrate Zipkin tracing at a driver level, and add other utility 
> like token aware batches, and concurrent token aware batch selects.
> 
> On Sat, Oct 10, 2015 at 2:49 PM Graham Sanderson <gra...@vast.com 
> <mailto:gra...@vast.com>> wrote:
> Cool - yeah we are still on astyanax mode drivers and our own built from 
> scratch 100% non blocking Scala driver that we used in akka like environments
> 
> Sent from my iPhone
> 
> On Oct 10, 2015, at 12:12 AM, Steve Robenalt <sroben...@highwire.org 
> <mailto:sroben...@highwire.org>> wrote:
> 
>> Hi Graham,
>> 
>> I've used the Java driver's DowngradingConsistencyRetryPolicy for that in 
>> cases where it makes sense.
>> 
>> Ref: 
>> http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/policies/DowngradingConsistencyRetryPolicy.html
>>  
>> <http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/policies/DowngradingConsistencyRetryPolicy.html>
>> 
>> Steve
>> 
>> 
>> 
>> On Fri, Oct 9, 2015 at 6:06 PM, Graham Sanderson <gra...@vast.com 
>> <mailto:gra...@vast.com>> wrote:
>> Actually maybe I'll open a JIRA issue for a (local)quorum_or_one consistency 
>> level... It should be trivial to implement on server side with exist 
>> timeouts ... I'll need to check the CQL protocol to see if there is a good 
>> place to indicate you didn't reach quorum (in time)
>> 
>> Sent from my iPhone
>> 
>> On Oct 9, 2015, at 8:02 PM, Graham Sanderson <gra...@vast.com 
>> <mailto:gra...@vast.com>> wrote:
>> 
>>> Most of our writes are not user facing so local_quorum is good... We also 
>>> read at local_quorum because we prefer guaranteed consistency... But we 
>>> very quickly fall back to local_one in t

Re: Realtime data and (C)AP

2015-10-10 Thread Graham Sanderson
Cool - yeah we are still on astyanax mode drivers and our own built from 
scratch 100% non blocking Scala driver that we used in akka like environments

Sent from my iPhone

> On Oct 10, 2015, at 12:12 AM, Steve Robenalt <sroben...@highwire.org> wrote:
> 
> Hi Graham,
> 
> I've used the Java driver's DowngradingConsistencyRetryPolicy for that in 
> cases where it makes sense.
> 
> Ref: 
> http://docs.datastax.com/en/drivers/java/2.1/com/datastax/driver/core/policies/DowngradingConsistencyRetryPolicy.html
> 
> Steve
> 
> 
> 
>> On Fri, Oct 9, 2015 at 6:06 PM, Graham Sanderson <gra...@vast.com> wrote:
>> Actually maybe I'll open a JIRA issue for a (local)quorum_or_one consistency 
>> level... It should be trivial to implement on server side with existing 
>> timeouts ... I'll need to check the CQL protocol to see if there is a good 
>> place to indicate you didn't reach quorum (in time)
>> 
>> Sent from my iPhone
>> 
>>> On Oct 9, 2015, at 8:02 PM, Graham Sanderson <gra...@vast.com> wrote:
>>> 
>>> Most of our writes are not user facing so local_quorum is good... We also 
>>> read at local_quorum because we prefer guaranteed consistency... But we 
>>> very quickly fall back to local_one in the cases where some data fast is 
>>> better than a failure. Currently we do that on a per read basis but we 
>>> could I suppose detect a pattern or just look at the gossip to decide to go 
>>> en masse into a degraded read mode
>>> 
>>> Sent from my iPhone
>>> 
>>>> On Oct 9, 2015, at 5:39 PM, Steve Robenalt <sroben...@highwire.org> wrote:
>>>> 
>>>> Hi Brice,
>>>> 
>>>> I agree with your nit-picky comment, particularly with respect to the OP's 
>>>> emphasis, but there are many cases where read at ONE is sufficient and 
>>>> performance is "better enough" to justify the possibility of a wrong 
>>>> result. As with anything Cassandra, it's highly dependent on the nature of 
>>>> the workload.
>>>> 
>>>> Steve
>>>> 
>>>> 
>>>>> On Fri, Oct 9, 2015 at 12:36 PM, Brice Dutheil <brice.duth...@gmail.com> 
>>>>> wrote:
>>>>>> On Fri, Oct 9, 2015 at 2:27 AM, Steve Robenalt <sroben...@highwire.org> 
>>>>>> wrote:
>>>>>> 
>>>>>> In general, if you write at QUORUM and read at ONE (or LOCAL variants 
>>>>>> thereof if you have multiple data centers), your apps will work well 
>>>>>> despite the theoretical consistency issues.
>>>>> 
>>>>> Nit-picky comment : if consistency is something important then reading at 
>>>>> QUORUM is important. If read is ONE then the read operation may not see 
>>>>> important update. The safest option is QUORUM for both write and read. 
>>>>> Then depending on the business or feature the consistency may be tuned.
>>>>> 
>>>>> — Brice
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> -- 
>>>> Steve Robenalt 
>>>> Software Architect
>>>> sroben...@highwire.org 
>>>> (office/cell): 916-505-1785
>>>> 
>>>> HighWire Press, Inc.
>>>> 425 Broadway St, Redwood City, CA 94063
>>>> www.highwire.org
>>>> 
>>>> Technology for Scholarly Communication
> 
> 
> 
> -- 
> Steve Robenalt 
> Software Architect
> sroben...@highwire.org 
> (office/cell): 916-505-1785
> 
> HighWire Press, Inc.
> 425 Broadway St, Redwood City, CA 94063
> www.highwire.org
> 
> Technology for Scholarly Communication


Re: Realtime data and (C)AP

2015-10-09 Thread Graham Sanderson
Most of our writes are not user facing so local_quorum is good... We also read 
at local_quorum because we prefer guaranteed consistency... But we very quickly 
fall back to local_one in the cases where some data fast is better than a 
failure. Currently we do that on a per read basis but we could I suppose detect 
a pattern or just look at the gossip to decide to go en masse into a degraded 
read mode

Sent from my iPhone

> On Oct 9, 2015, at 5:39 PM, Steve Robenalt  wrote:
> 
> Hi Brice,
> 
> I agree with your nit-picky comment, particularly with respect to the OP's 
> emphasis, but there are many cases where read at ONE is sufficient and 
> performance is "better enough" to justify the possibility of a wrong result. 
> As with anything Cassandra, it's highly dependent on the nature of the 
> workload.
> 
> Steve
> 
> 
>> On Fri, Oct 9, 2015 at 12:36 PM, Brice Dutheil  
>> wrote:
>>> On Fri, Oct 9, 2015 at 2:27 AM, Steve Robenalt  
>>> wrote:
>>> 
>>> In general, if you write at QUORUM and read at ONE (or LOCAL variants 
>>> thereof if you have multiple data centers), your apps will work well 
>>> despite the theoretical consistency issues.
>> 
>> Nit-picky comment : if consistency is something important then reading at 
>> QUORUM is important. If read is ONE then the read operation may not see 
>> important update. The safest option is QUORUM for both write and read. Then 
>> depending on the business or feature the consistency may be tuned.
>> 
>> — Brice
>> 
> 
> 
> 
> -- 
> Steve Robenalt 
> Software Architect
> sroben...@highwire.org 
> (office/cell): 916-505-1785
> 
> HighWire Press, Inc.
> 425 Broadway St, Redwood City, CA 94063
> www.highwire.org
> 
> Technology for Scholarly Communication


Re: Realtime data and (C)AP

2015-10-09 Thread Graham Sanderson
Actually maybe I'll open a JIRA issue for a (local)quorum_or_one consistency 
level... It should be trivial to implement on server side with existing timeouts 
... I'll need to check the CQL protocol to see if there is a good place to 
indicate you didn't reach quorum (in time)

Sent from my iPhone

> On Oct 9, 2015, at 8:02 PM, Graham Sanderson <gra...@vast.com> wrote:
> 
> Most of our writes are not user facing so local_quorum is good... We also 
> read at local_quorum because we prefer guaranteed consistency... But we very 
> quickly fall back to local_one in the cases where some data fast is better 
> than a failure. Currently we do that on a per read basis but we could I 
> suppose detect a pattern or just look at the gossip to decide to go en masse 
> into a degraded read mode
> 
> Sent from my iPhone
> 
>> On Oct 9, 2015, at 5:39 PM, Steve Robenalt <sroben...@highwire.org> wrote:
>> 
>> Hi Brice,
>> 
>> I agree with your nit-picky comment, particularly with respect to the OP's 
>> emphasis, but there are many cases where read at ONE is sufficient and 
>> performance is "better enough" to justify the possibility of a wrong result. 
>> As with anything Cassandra, it's highly dependent on the nature of the 
>> workload.
>> 
>> Steve
>> 
>> 
>>> On Fri, Oct 9, 2015 at 12:36 PM, Brice Dutheil <brice.duth...@gmail.com> 
>>> wrote:
>>>> On Fri, Oct 9, 2015 at 2:27 AM, Steve Robenalt <sroben...@highwire.org> 
>>>> wrote:
>>>> 
>>>> In general, if you write at QUORUM and read at ONE (or LOCAL variants 
>>>> thereof if you have multiple data centers), your apps will work well 
>>>> despite the theoretical consistency issues.
>>> 
>>> Nit-picky comment : if consistency is something important then reading at 
>>> QUORUM is important. If read is ONE then the read operation may not see 
>>> important update. The safest option is QUORUM for both write and read. Then 
>>> depending on the business or feature the consistency may be tuned.
>>> 
>>> — Brice
>>> 
>> 
>> 
>> 
>> -- 
>> Steve Robenalt 
>> Software Architect
>> sroben...@highwire.org 
>> (office/cell): 916-505-1785
>> 
>> HighWire Press, Inc.
>> 425 Broadway St, Redwood City, CA 94063
>> www.highwire.org
>> 
>> Technology for Scholarly Communication


Re: addition of nodes with auth enabled on a datacenter causes existing nodes to loose their permissions

2015-10-01 Thread Graham Sanderson
You are seeing 

https://issues.apache.org/jira/browse/CASSANDRA-9519 


> On Oct 1, 2015, at 9:16 PM, K F  wrote:
> 
> Hi,
> 
> I have 3 DCs out of which in one of the DC, I added 20 nodes. All of the DCs 
> had auth enabled, it was functioning fine. But after addition of 20 nodes in 
> one of the DC, the permissions just got messed-up on the existing nodes. My 
> application started getting errors while querying using the user it normally 
> did. 
> 
> Finally, I logged onto one of the existing nodes that was operating fine and 
> issued the following cql query and it gave me the following error.
> 
> cqlsh:system_auth> select * from users;
> TSocket read 0 bytes
> 
> Upon investigation in system log I found the following exception, what does 
> this mean? Thanks.
> 
> 2015-10-02 02:03:10,229 [RPC-Thread:3] ERROR Message Unexpected throwable 
> while invoking!
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
> at java.util.TimSort.mergeHi(TimSort.java:868)
> at java.util.TimSort.mergeAt(TimSort.java:485)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:426)
> at java.util.TimSort.sort(TimSort.java:223)
> at java.util.TimSort.sort(TimSort.java:173)
> at java.util.Arrays.sort(Arrays.java:659)
> at java.util.Collections.sort(Collections.java:217)
> at 
> org.apache.cassandra.locator.AbstractEndpointSnitch.sortByProximity(AbstractEndpointSnitch.java:49)
> at 
> org.apache.cassandra.locator.DynamicEndpointSnitch.sortByProximityWithScore(DynamicEndpointSnitch.java:157)
> at 
> org.apache.cassandra.locator.DynamicEndpointSnitch.sortByProximityWithBadness(DynamicEndpointSnitch.java:186)
> at 
> org.apache.cassandra.locator.DynamicEndpointSnitch.sortByProximity(DynamicEndpointSnitch.java:151)
> at 
> org.apache.cassandra.service.StorageProxy.getLiveSortedEndpoints(StorageProxy.java:1483)
> at 
> org.apache.cassandra.service.StorageProxy.getRangeSlice(StorageProxy.java:1545)
> at 
> org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:258)
> at 
> org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:228)
> at 
> org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:64)
> at 
> org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:158)
> at 
> com.datastax.bdp.cassandra.cql3.DseQueryHandler$StatementExecution.execute(DseQueryHandler.java:448)
> at 
> com.datastax.bdp.cassandra.cql3.DseQueryHandler.executeOperationWithTiming(DseQueryHandler.java:190)
> at 
> com.datastax.bdp.cassandra.cql3.DseQueryHandler.executeOperationWithAuditLogging(DseQueryHandler.java:223)
> at 
> com.datastax.bdp.cassandra.cql3.DseQueryHandler.process(DseQueryHandler.java:103)
> at 
> org.apache.cassandra.thrift.CassandraServer.execute_cql3_query(CassandraServer.java:1958)
> at 
> com.datastax.bdp.server.DseServer.execute_cql3_query(DseServer.java:543)
> at 
> org.apache.cassandra.thrift.Cassandra$Processor$execute_cql3_query.getResult(Cassandra.java:4486)
> at 
> org.apache.cassandra.thrift.Cassandra$Processor$execute_cql3_query.getResult(Cassandra.java:4470)
> at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
> at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
> at com.thinkaurelius.thrift.Message.invoke(Message.java:314)
> at 
> com.thinkaurelius.thrift.Message$Invocation.execute(Message.java:90)
> at 
> com.thinkaurelius.thrift.TDisruptorServer$InvocationHandler.onEvent(TDisruptorServer.java:695)
> at 
> com.thinkaurelius.thrift.TDisruptorServer$InvocationHandler.onEvent(TDisruptorServer.java:689)
> at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:112)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
>   
> 





Re: Running Cassandra on Java 8 u60..

2015-09-27 Thread Graham Sanderson
IMHO G1 is still buggy on JDK8 (based solely on being subscribed to the gc-dev 
mailing list)… I think JDK9 will be the one.

> On Sep 25, 2015, at 7:14 PM, Stefano Ortolani  wrote:
> 
> I think those were referring to Java7 and G1GC (early versions were buggy).
> 
> Cheers,
> Stefano
> 
> 
> On Fri, Sep 25, 2015 at 5:08 PM, Kevin Burton  > wrote:
> Any issues with running Cassandra 2.0.16 on Java 8? I remember there is long 
> term advice on not changing the GC but not the underlying version of Java.
> 
> Thoughts?
> 
> -- 
> 
> We’re hiring if you know of any awesome Java Devops or Linux Operations 
> Engineers!
> 
> Founder/CEO Spinn3r.com 
> Location: San Francisco, CA
> blog: http://burtonator.wordpress.com 
> … or check out my Google+ profile 
> 
> 
> 
> 





Re: To batch or not to batch: A question for fast inserts

2015-09-27 Thread Graham Sanderson
We are about to prototype upgrading our batch inserts, so I’m really glad about 
this thread… we are able to saturate our dedicated network links from hadoop 
when inserting via thrift API (Astyanax) - at the time we wrote that code CQL 
wasn’t there.

Reasons to replace our current solution:

1) We don’t have GC problems, but the Thrift lack of streaming (or the C* 
decision not to have Quaid turn it on)… definitely means we allocate a lot more 
memory on the server than we should
2) Lack of token awareness.
3) Thrift is not given any love.

Tuning the batch size is key - apart from the performance sweet spot, we had a 
bug at first where we submitted batches by size not number of mutations - we 
store “key frame” and then deltas of our data, so we were batching up to about 
50k mutations in the latter case, which might be OK, but is horrible if one 
mutation fails!
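For anyone replacing a similar Thrift loader, a hedged Java-driver sketch of point 2 (token awareness) and the batch-size caveat above; keyspace, table and the batch size of 100 are invented:

import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.policies.RoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

public class TokenAwareDeltaLoader {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1")
                .withLoadBalancingPolicy(new TokenAwarePolicy(new RoundRobinPolicy()))
                .build();
             Session session = cluster.connect("my_keyspace")) {
            PreparedStatement insert = session.prepare(
                    "INSERT INTO deltas (key, seq, payload) VALUES (?, ?, ?)");
            // One small UNLOGGED batch per partition key: a single coordinator hop,
            // and a failed batch only costs 100 mutations, not 50k.
            BatchStatement batch = new BatchStatement(BatchStatement.Type.UNLOGGED);
            for (int seq = 0; seq < 100; seq++) {
                batch.add(insert.bind("frame-42", seq, "delta-" + seq));
            }
            session.execute(batch);
        }
    }
}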

> On Sep 27, 2015, at 11:45 AM, Gerard Maas  wrote:
> 
> Hi Eric, Ryan,
> 
> Thanks a lot for your insights. I got more than I hoped for in this 
> discussion.
> I'll further improve our code to include the replica-awareness and will 
> compare that to the previous tests.
> 
> That snipped of code is really helpful. Thanks.
> 
> I have not been in the list long enough to have read the discussion you 
> mention. And search delivers too many results. Any pointer to that? 
> What are the doom scenarios that unlogged batch would bring along? 
> 
> When we ported the results of our tests to our application logic, the 
> Cassandra cluster CPU load dropped 66% compared to the same ingest rate using 
> single statement async loading. We are certainly a step further in this area.
> 
> I'll drop our findings in a blog post. When I did my initial research, I 
> didn't find any material that would support unlogged batch as being an 
> alternative for performance insertion. The blogosphere seems to support async 
> as the only approach. As Ryan mentioned, that is case dependent and I think 
> devs should be exposed to the pro's and con's of every alternative to enable 
> them to evaluate the best approach for their particular scenario.
> 
> Thanks a ton!
> 
> Kind regards, Gerard.
> 
> On Fri, Sep 25, 2015 at 10:04 PM, Eric Stevens  > wrote:
> Yep, my approach is definitely naive to hotspotting.  If someone had that 
> trouble, they could exhaust the iterator out of getReplicas() and distribute 
> their writes more evenly (which might result in better statement 
> distribution, but wouldn't change the workload on the cluster).  In the end 
> they're going to get in trouble with hotspotting regardless of async single 
> statements or batches.  The single statement async code prefers the first 
> replica returned, so this logic is consistent with the default model.
> 
> > Lots of folks are still stuck on maximum utilization, ironically these same 
> > people tend to focus on using spindles for storage and so will ultimately 
> > end up having to throttle ingest to allow compaction to catch up
> 
> Yeah, otherwise known as cost sensitivity, with the unfortunate side effect of 
> making it easy to accidentally overwhelm a cluster as a new operator since 
> the warning signs look different than they do for most other data stores.  
> 
> Straying a bit far afield here, but I actually think it would be a nice 
> feature if by default Cassandra artificially throttled writes as compaction 
> starts getting behind as an early warning sign (a feature you could turn off 
> with a flag).  Cassandra does a great job of absorbing bursty writes, but 
> unfortunately that masks (for the new operator) the warning signs that your 
> sustained write rate is more than the cluster can handle.  Writes are still 
> fast so you assume the cluster is healthy, and by the time there's 
> backpressure to the client, you're already possibly past the point of simple 
> recovery (eg you no longer have enough excess IO to support bootstrapping new 
> nodes).  That would also actually free up some I/O to keep the cluster from 
> tipping over so hard.
> 
> On Fri, Sep 25, 2015 at 12:14 PM, Ryan Svihla  > wrote:
> 
> I think my main point is still, unlogged token aware batches are great, but 
> if you’re writes are large enough, they may actually hurt rather than help, 
> and likewise if your writes are too small, async only is likely only going to 
> hurt. I’d say the average user I’ve had to help (with my selection bias) has 
> individual writes already on the large size of optimal so batching frequently 
> hurts them. Also they tend not to do async in the first place.
> 
> In summary, batch or not is IMO the wrong area to focus, total write payload 
> sizing for your cluster is the factor to focus on and however you get there 
> is fantastic. more replies inline:
> 
>> On Sep 25, 2015, at 1:24 PM, Eric Stevens > > wrote:
>> 
>> > 

Re: High CPU usage on some of nodes

2015-09-11 Thread Graham Sanderson
again I haven’t read this thread from the beginning so I don’t know which node 
is which, but if nodes pause for longish GC, then other nodes will likely be 
saving hints (assuming you are writing at the time), then they will be 
delivered once the machines become responsive again. I’m just guessing though. 
Take a look at the hinting metrics.
> On Sep 11, 2015, at 2:45 PM, Roman Tkachenko <ro...@mailgunhq.com> wrote:
> 
> I have another datapoint from our monitoring system that shows huge outbound 
> network traffic increase for the affected boxes during these spikes:
> 
> 
> 
> Looking at inbound traffic, it is increased on nodes other than these 
> (purple, yellow and blue) so it does look like some kind of excessive 
> internode communication is going on between these 3 nodes and the rest of the 
> cluster.
> 
> What could these network spikes be a sign of?
> 
> 
> On Thu, Sep 10, 2015 at 12:00 PM, Graham Sanderson <gra...@vast.com 
> <mailto:gra...@vast.com>> wrote:
> Haven’t been following this thread, but we run beefy machines with 8gig new 
> gen, 12 gig old gen (down from 16g since moving memtables off heap, we can 
> probably go lower)…
> 
> Apart from making sure you have all the latest -XX: flags from 
> cassandra-env.sh (and MALLOC_ARENA_MAX), I personally would recommend running 
> latest 2.1.x with
> 
> memory_allocator: JEMallocAllocator
> memtable_allocation_type: offheap_objects
> 
> Some people will probably disagree, but it works great for us (rare long 
> pauses sub 2 secs), and if you’re seeing slow GC because of promotion failure 
> of objects 131074 dwords big, then I definitely suggest you give it a try.
> 
>> On Sep 10, 2015, at 1:43 PM, Robert Coli <rc...@eventbrite.com 
>> <mailto:rc...@eventbrite.com>> wrote:
>> 
>> On Thu, Sep 10, 2015 at 10:54 AM, Roman Tkachenko <ro...@mailgunhq.com 
>> <mailto:ro...@mailgunhq.com>> wrote: 
>> [5 second CMS GC] Is my best shot to play with JVM settings trying to tune 
>> garbage collection then?
>> 
>> Yep. As a minor note, if the machines are that beefy, they probably have a 
>> lot of RAM, you might wish to consider trying G1 GC and a larger heap.
>> 
>> =Rob
>> 
>>  
> 
> 





Re: High CPU usage on some of nodes

2015-09-10 Thread Graham Sanderson
Haven’t been following this thread, but we run beefy machines with 8gig new 
gen, 12 gig old gen (down from 16g since moving memtables off heap, we can 
probably go lower)…

Apart from making sure you have all the latest -XX: flags from cassandra-env.sh 
(and MALLOC_ARENA_MAX), I personally would recommend running latest 2.1.x with

memory_allocator: JEMallocAllocator
memtable_allocation_type: offheap_objects

Some people will probably disagree, but it works great for us (rare long pauses 
sub 2 secs), and if you’re seeing slow GC because of promotion failure of 
objects 131074 dwords big, then I definitely suggest you give it a try.

> On Sep 10, 2015, at 1:43 PM, Robert Coli  wrote:
> 
> On Thu, Sep 10, 2015 at 10:54 AM, Roman Tkachenko  > wrote: 
> [5 second CMS GC] Is my best shot to play with JVM settings trying to tune 
> garbage collection then?
> 
> Yep. As a minor note, if the machines are that beefy, they probably have a 
> lot of RAM, you might wish to consider trying G1 GC and a larger heap.
> 
> =Rob
> 
>  





Re: Slow performance because of used-up Waste in AtomicBTreeColumns

2015-07-23 Thread Graham Sanderson
Multiple writes to a single partition key are guaranteed to be atomic. 
Therefore there has to be some protection. 

First rule of thumb, don’t write at insanely high rates to the same partition 
key concurrently (you can probably avoid this, but hints as currently 
implemented suffer because the partition key is the node id - that will be 
fixed in 3; also OpsCenter does fast burst inserts of per node data)

The general strategy taken is one of optimistic concurrency; each thread makes 
its own sub-copy of the tree from the root to the inserted data, sharing 
existing nodes where possible. It then tries to CAS the new tree in place. The 
problem with very high concurrency is that a huge amount of work is done and 
memory allocated (if you are doing lots of writes to the same partition then 
the whole memtable may be one AtomicBTreeColum) only to have the CAS fail, and 
that thread to have to start over. 

Anyway, this CAS failing was giving effectively zero concurrency anyway, but 
extremely high CPU usage (wastage) while allocating 10s of gigabytes of garbage a 
second leading to GC issues also, so in 2.1 the AtomicBTreeColumn (which holds 
state for one partition in the memtable) was altered to estimate the amount of 
memory it was wasting over time, and flip to pessimistic locking if a threshold 
was exceeded. The decision was made not to make it flip back for simplicity, 
and that if you are writing data that fast, the memtable and hence 
AtomicBTrreeColumn won’t last long anyway

There is a DEBUG log level in Memtable that alerts you this is happening.
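Not Cassandra's actual code, just a toy sketch of the optimistic-then-pessimistic pattern described above; the waste threshold and the types are invented:

import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.UnaryOperator;

// Illustrative only: copy-on-write updates that count the work thrown away by
// failed CAS attempts and flip permanently to pessimistic locking once a waste
// threshold is exceeded.
public class OptimisticThenLocked<T> {
    private static final long WASTE_THRESHOLD = 10_000;
    private final AtomicReference<T> root;
    private final AtomicLong wastedUpdates = new AtomicLong();
    private volatile boolean pessimistic = false;

    public OptimisticThenLocked(T initial) { root = new AtomicReference<>(initial); }

    public void update(UnaryOperator<T> rebuild) {
        while (!pessimistic) {
            T current = root.get();
            T replacement = rebuild.apply(current);        // new tree sharing unchanged nodes
            if (root.compareAndSet(current, replacement)) return;
            // CAS lost the race: that copy is pure garbage ("waste");
            // too much of it and we lock from now on (one-way flip, as in 2.1).
            if (wastedUpdates.incrementAndGet() > WASTE_THRESHOLD) pessimistic = true;
        }
        synchronized (this) { root.set(rebuild.apply(root.get())); }   // pessimistic path
    }
}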

So the short answer is don’t do it - maybe the trigger is a bit too sensitive 
for your needs, but it’d be interesting to know how many inserts you are doing 
a second when going FAST, and then consider if that sounds like a lot if they 
are sorted by partition_key

The longer term answer, that Benedict suggested is having lazy writes under 
contention which would be applied by next un-contended write or repaired on 
read (or flush) - this was also a reason not to add a flag to turn on/off the 
new behavior, along with the fact that in testing we didn’t manage to make it 
perform worse, but did get it perform very much better. It also has no effect 
on un-contended writes.

 On Jul 23, 2015, at 5:55 AM, Petter. Andreas a.pet...@seeburger.de wrote:
 
 Hello everyone,
 
 we are experiencing performance issues with Cassandra overloading effects 
 (dropped mutations and node drop-outs) with the following workload:
 
 create table test (year bigint, spread bigint, time bigint, batchid bigint, 
 value set<text>, primary key ((year, spread), time, batchid))
 inserting data using an update statement (+ operator to merge the sets). 
 Data _is_being_ordered_ before the mutation is executed on the session. 
 Number of inserts range from 400k to a few millions.
 
 Originally we were using scalding/summingbird and thought the problem to be 
 in our Cassandra-storage-code. To test that i wrote a simple cascading-hadoop 
 job (not using BulkOutputFormat, but the Datastax driver). I was a little bit 
 surprised to still see Cassandra _overload_ (3 reducers/Hadoop-writers and 3 
 co-located Cassandra nodes, as well as a setup with 4/4 nodes). The internal 
 reason seems to be that many worker threads go into state BLOCKED in 
 AtomicBTreeColumns.addAllWithSizeDelta, because something called 
 waste is used up and Cassandra switches to pessimistic locking.
 
 However, i re-wrote the job using plain Hadoop-mapred (without cascading) but 
 using the same storage abstraction for writing and Cassandra 
 _did_not_overload_ and the job has the great write-performance i'm used to 
 (and threads are not going into state BLOCKED).  We're totally lost and 
 puzzled. 
 
 So i have a few questions:
 1. What is this waste used for? Is it a way of braking or load shedding? 
 Why is locking being used in AtomicBTreeColumns?
 2. Is it o.k. to order columns before inserts are being performed?
 3. What could be the reason that waste is being used-up in the cascading 
 job and not  in the plain Hadoop-job (sorting order?)?
 4. Is there any way to circumvent using up waste (except for scaling nodes, 
 which does not seem to be the answer, as the plain Hadoop job runs 
 Cassandra-friendly)?
 
 thanks in advance,
 regards,
 Andi
 
 
 
 
 

 
 
 

Re: Bulk loading performance

2015-07-13 Thread Graham Sanderson
Ironically in my experience the fastest ways to get data into C* are considered 
“anti-patterns” by most (but I have no problem saturating multiple gigabit 
network links if I really feel like inserting fast)

It’s been a while since I tried some of the newer approaches though (my fast 
load code is a few years old).

 On Jul 13, 2015, at 5:31 PM, David Haguenauer m...@kurokatta.org wrote:
 
 Hi,
 
 I have a use case wherein I receive a daily batch of data; it's about
 50M--100M records (a record is a list of integers, keyed by a
 UUID). The target is a 12-node cluster.
 
 Using a simple-minded approach (24 batched inserts in parallel, using
 the Ruby client), while the cluster is being read at a rate of about
 150k/s, I get about 15.5k insertions per second. This in itself is
 satisfactory, but the concern is that the large amount of writes
 causes the read latency to jump up during the insertion, and for a
 while after.
 
 I tried using sstableloader instead, and the overall throughput is
 similar (I spend 2/3 of the time preparing the SSTables, and 1/3
 actually pushing them to nodes), but I believe this still causes a
 hike in read latency (after the load is complete).
 
 Is there a set of best practices for this kind of workload? We would
 like to avoid interfering with reads as much as possible.
 
 I can of course post more information about our setup and requirements
 if this helps answering.
 
 -- 
 Thanks,
 David Haguenauer





Re: What are problems with schema disagreement

2015-07-02 Thread graham sanderson
What version of C* are you running? Some versions of 2.0.x might occasionally 
fail to propagate schema changes in a timely fashion (though they would fix 
themselves eventually - in the order of a few minutes)
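A quick way to watch the versions from a client while you wait (nodetool describecluster shows the same thing); the contact point below is made up:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class SchemaVersions {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("10.0.1.100").build();
             Session session = cluster.connect()) {
            Row local = session.execute("SELECT schema_version FROM system.local").one();
            System.out.println("local: " + local.getUUID("schema_version"));
            for (Row peer : session.execute("SELECT peer, schema_version FROM system.peers")) {
                System.out.println(peer.getInet("peer") + ": " + peer.getUUID("schema_version"));
            }
        }
    }
}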

 On Jul 2, 2015, at 9:37 PM, John Wong gokoproj...@gmail.com wrote:
 
 Hi.
 
 Here is a schema disagreement we encountered.
 Schema versions:
 b6467059-5897-3cc1-9ee2-73f31841b0b0: [10.0.1.100, 10.0.1.109]
 c8971b2d-0949-3584-aa87-0050a4149bbd: [10.0.1.55, 10.0.1.16, 
 10.0.1.77]
 c733920b-2a31-30f0-bca1-45a8c9130a2c: [10.0.1.221]
 
 We deployed an application which would send a schema update (DDL=auto). We 
 found this prod cluster had 3 schema difference. Other existing applications 
 were fine, so some people were curious what if we left this problem alone 
 until off hours.
 
 Is there any concerns with not resolve schema disagreement right away? FWIW 
 we went ahead and restarted 221 first, and continue with the rest of the 
 minors.
 
 Thanks.
 
 John
 





Re: How to measure disk space used by a keyspace?

2015-07-01 Thread graham sanderson
If you are pushing metric data to graphite, there is

org.apache.cassandra.metrics.keyspace.keyspace_name.LiveDiskSpaceUsed.value

… for each node; Easy enough to graph the sum across machines.

Metrics/JMX are tied together in C*, so there is an equivalent value exposed 
via JMX… I don’t know what it is called off the top of my head, but would be 
something similar to the above.
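A hedged sketch of reading that per-node value over JMX; the ObjectName below is an educated guess at the 2.x keyspace-level metric, so browse it in jconsole before relying on it, and the keyspace name and port are assumptions:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class KeyspaceDiskUsed {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
        try (JMXConnector jmx = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = jmx.getMBeanServerConnection();
            ObjectName name = new ObjectName(
                    "org.apache.cassandra.metrics:type=Keyspace,keyspace=my_keyspace,name=LiveDiskSpaceUsed");
            // Counter of live SSTable bytes on this node only; sum across nodes yourself.
            Object bytes = conn.getAttribute(name, "Count");
            System.out.println("LiveDiskSpaceUsed on this node: " + bytes + " bytes");
        }
    }
}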

 On Jul 1, 2015, at 9:28 AM, sean_r_dur...@homedepot.com wrote:
 
 That’s ok for a single node, but to answer the question, “how big is my table 
 across the cluster?” it would be much better if the cluster could provide an 
 answer.
  
 Sean Durity
  
 From: Jonathan Haddad [mailto:j...@jonhaddad.com mailto:j...@jonhaddad.com] 
 Sent: Monday, June 29, 2015 8:15 AM
 To: user
 Subject: Re: How to measure disk space used by a keyspace?
  
 If you're looking to measure actual disk space, I'd use the du command, 
 assuming you're on a linux: http://linuxconfig.org/du-1-manual-page 
 http://linuxconfig.org/du-1-manual-page
  
 On Mon, Jun 29, 2015 at 2:26 AM shahab shahab.mok...@gmail.com 
 mailto:shahab.mok...@gmail.com wrote:
 Hi,
  
 Probably this question has been already asked in the mailing list, but I 
 couldn't find it.
  
 The question is how to measure disk-space used by a keyspace, column family 
 wise, excluding snapshots?
  
 best,
 /Shahab
 
 





Re: Question about consistency in cassandra 2.0.9

2015-06-11 Thread graham sanderson
It looks (I’m guessing with entirely not enough info) that you only have two 
nodes in DC4, and are probably writing at QUORUM reading at LOCAL_ONE. But 
please specify your configuration
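For reference, the replica arithmetic behind those levels; the per-DC replication factors below are invented, the floor(RF/2)+1 rule is standard:

public class QuorumMath {
    public static void main(String[] args) {
        int[] rfPerDc = {3, 3, 3, 2};   // hypothetical RFs for DC1..DC4
        int totalRf = 0;
        for (int rf : rfPerDc) totalRf += rf;
        // QUORUM is computed over the sum of all replicas; LOCAL_* only over the local DC.
        System.out.println("QUORUM needs " + (totalRf / 2 + 1) + " of " + totalRf + " replicas");
        for (int dc = 0; dc < rfPerDc.length; dc++) {
            System.out.println("LOCAL_QUORUM in DC" + (dc + 1) + " needs "
                    + (rfPerDc[dc] / 2 + 1) + " of " + rfPerDc[dc]);
        }
    }
}
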
 On Jun 11, 2015, at 7:01 PM, K F kf200...@yahoo.com wrote:
 
 Hi,
 
 I am running a cassandra cluster with 4 dcs. Out of 4 dcs, I have 3 DCs 
 returning right data but 1 dc where I had 1 node down, the data didn't return 
 correct records.
 
 So, e.g. 
 DC1 - 386 records (single-token based DC)
 DC2 - 386 records (vnode based DC)
 DC3 - 386 records (vnode based DC)
 DC4 - 178 records (vnode based DC)
 
 In DC4 I had one node down due to hardware failures and was getting another 
 spare node bootstrapped. Then why would during bootstrapping process or when 
 1 node is down in DC4, I would get incorrect records.
 
 Thanks
 
 Regards,
 Ken





Re: Cassandra 2.2, 3.0, and beyond

2015-06-11 Thread graham sanderson
I think the point is that 2.2 will replace 2.1.x+ (i.e. the done/safe bits of 
3.0 are included in 2.2), so 2.2.x and 2.1.x are somewhat synonymous.

 On Jun 11, 2015, at 8:14 PM, Mohammed Guller moham...@glassbeam.com wrote:
 
 Considering that 2.1.6 was just released and it is the first “stable” release 
 ready for production in the 2.1 series, won’t it be too soon to EOL 2.1.x 
 when 3.0 comes out in September?
  
 Mohammed
  
 From: Jonathan Ellis [mailto:jbel...@gmail.com mailto:jbel...@gmail.com] 
 Sent: Thursday, June 11, 2015 10:14 AM
 To: user
 Subject: Re: Cassandra 2.2, 3.0, and beyond
  
 As soon as 8099 is done.
  
 On Thu, Jun 11, 2015 at 11:53 AM, Pierre Devops pierredev...@gmail.com 
 mailto:pierredev...@gmail.com wrote:
 Hi,
  
 3.x beta release date ?
  
 2015-06-11 16:21 GMT+02:00 Jonathan Ellis jbel...@gmail.com 
 mailto:jbel...@gmail.com:
 3.1 is EOL as soon as 3.3 (the next bug fix release) comes out.
  
 On Thu, Jun 11, 2015 at 4:10 AM, Stefan Podkowinski 
 stefan.podkowin...@1und1.de mailto:stefan.podkowin...@1und1.de wrote:
  We are also extending our backwards compatibility policy to cover all 3.x 
  releases: you will be able to upgrade seamlessly from 3.1 to 3.7, for 
  instance, including cross-version repair.
 
 What will be the EOL policy for releases after 3.0? Given your example, will 
 3.1 still see bugfixes at this point when I decide to upgrade to 3.7?
 
 
 
 -- 
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder, http://www.datastax.com http://www.datastax.com/
 @spyced
  
 
 
 
 -- 
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder, http://www.datastax.com http://www.datastax.com/
 @spyced





Re: Throttle Heavy Read / Write Loads

2015-06-05 Thread Graham Sanderson
Are you doing large batch inserts via thrift - you need to be careful there

Sent from my iPhone

 On Jun 4, 2015, at 11:37 PM, Anishek Agarwal anis...@gmail.com wrote:
 
 may be just increase the read and write timeouts at cassandra currently at 5 
 sec i think. i think the datastax java client driver provides ability to say 
 how many max requests per connection are to be sent, you can try and lower 
 that to limit excessive requests along with limiting the number of 
 connections a client can do. 
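 A client-side throttling sketch that is independent of driver pooling knobs: cap the number of in-flight async writes with a semaphore. Keyspace, table and the 256 limit are invented:

 import com.datastax.driver.core.Cluster;
 import com.datastax.driver.core.PreparedStatement;
 import com.datastax.driver.core.ResultSetFuture;
 import com.datastax.driver.core.Session;
 import java.util.concurrent.Semaphore;

 public class ThrottledWriter {
     public static void main(String[] args) throws InterruptedException {
         int maxInFlight = 256;
         Semaphore permits = new Semaphore(maxInFlight);
         try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
              Session session = cluster.connect("my_keyspace")) {
             PreparedStatement insert = session.prepare(
                     "INSERT INTO events (id, payload) VALUES (?, ?)");
             for (int i = 0; i < 1_000_000; i++) {
                 permits.acquire();                       // blocks once maxInFlight is reached
                 ResultSetFuture f = session.executeAsync(insert.bind("id-" + i, "payload-" + i));
                 f.addListener(permits::release, Runnable::run);  // release on success or failure
             }
             permits.acquire(maxInFlight);                // wait for the tail to finish
         }
     }
 }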
 
 just out of curiosity how long are GC pauses for you both ParNew and CMS and 
 at what intervals are you seeing the GC happening. I just recently spent time 
 to tune it and would be good to know if its working well.
 
 thanks
 anishek
 
 On Fri, Jun 5, 2015 at 12:03 AM, Anuj Wadehra anujw_2...@yahoo.co.in wrote:
 
 We are using Cassandra 2.0.14 with Hector as client ( will be gradually 
 moving to CQL Driver ). 
 
 Often we see that heavy read and write loads lead to Cassandra timeouts and 
 unpredictable results due to gc pauses and request timeouts. We need to know 
 the best way to throttle read and write load on Cassandra such that even if 
 heavy operations are slower they complete gracefully. This will also shield 
 us against misbehaving clients.
 
 I was thinking of limiting rpc connections via rpc_max_threads property and 
 implementing connection pool at client side. 
 
 I would appreciate if you could please share your suggestions on the above 
 mentioned approach or share any alternatives to the approach.
 
 Thanks
 Anuj Wadehra
 


Re: GC pauses affecting entire cluster.

2015-06-01 Thread graham sanderson
Yes native_objects is the way to go… you can tell if memtables are your problem 
because you’ll see promotion failures of objects sized 131074 dwords.

If your h/w is fast enough make your young gen as big as possible - we can 
collect 8G in sub second always, and this gives you your best chance of keeping 
transient objects (especially if you still have thrift clients) from leaking into 
the old gen. Moving to 2.1.x (and off heap memtables) from 2.0.x we have 
reduced our old gen down from 16gig to 12gig and will keep shrinking it, but 
have had no promotion failures yet, and it’s been several months.

Note we are running a patched 2.1.3, but 2.1.5 has the equivalent important 
bugs fixed (that might have given you memory issues)

 On Jun 1, 2015, at 3:00 PM, Carl Hu m...@carlhu.com wrote:
 
 Thank you for the suggestion. After analysis of your settings, the basic 
 hypothesis here is to promote very quickly to Old Gen because of a rapid 
 accumulation of heap usage due to memtables. We happen to be running on 2.1, 
 and I thought a more conservative approach that your (quite aggressive gc 
 settings) is to try the new memtable_allocation_type with offheap_objects and 
 see if the memtable pressure is relieved sufficiently such that the standard 
 gc settings can keep up.
 
 The experiment is in progress and I will report back with the results.
 
 On Mon, Jun 1, 2015 at 10:20 AM, Anuj Wadehra anujw_2...@yahoo.co.in 
 mailto:anujw_2...@yahoo.co.in wrote:
 We have write heavy workload and used to face promotion failures/long gc 
 pauses with Cassandra 2.0.x. I am not into code yet but I think that memtable 
 and compaction related objects have mid-life and write heavy workload is not 
 suitable for generation collection by default. So, we tuned JVM to make sure 
 that minimum objects are promoted to Old Gen and achieved great success in 
 that:
 MAX_HEAP_SIZE=12G
 HEAP_NEWSIZE=3G
 -XX:SurvivorRatio=2
 -XX:MaxTenuringThreshold=20
 -XX:CMSInitiatingOccupancyFraction=70
 JVM_OPTS=$JVM_OPTS -XX:ConcGCThreads=20
 JVM_OPTS=$JVM_OPTS -XX:+UnlockDiagnosticVMOptions
 JVM_OPTS=$JVM_OPTS -XX:+UseGCTaskAffinity
 JVM_OPTS=$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs
 JVM_OPTS=$JVM_OPTS -XX:ParGCCardsPerStrideChunk=32768
 JVM_OPTS=$JVM_OPTS -XX:+CMSScavengeBeforeRemark
 JVM_OPTS=$JVM_OPTS -XX:CMSMaxAbortablePrecleanTime=3
 JVM_OPTS=$JVM_OPTS -XX:CMSWaitDuration=2000
 JVM_OPTS=$JVM_OPTS -XX:+CMSEdenChunksRecordAlways
 JVM_OPTS=$JVM_OPTS -XX:+CMSParallelInitialMarkEnabled
 JVM_OPTS=$JVM_OPTS -XX:-UseBiasedLocking
 We also think that the default total_memtable_space_in_mb = 1/4 heap is too much 
 for write heavy loads. By default, young gen is also 1/4 of the heap. We reduced 
 the memtable space to 1000mb in order to make sure that memtable related objects don't 
 stay in memory for too long. Combining this with SurvivorRatio=2 and 
 MaxTenuringThreshold=20 did the job well. GC was very consistent. No Full GC 
 observed.
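 
 To spell out where those knobs live (a sketch only - in the stock 2.0.x 
 cassandra.yaml I believe the key is spelled memtable_total_space_in_mb, and the 
 values below are simply the ones quoted above):
 
 # cassandra.yaml
 memtable_total_space_in_mb: 1000
 
 # cassandra-env.sh
 MAX_HEAP_SIZE="12G"
 HEAP_NEWSIZE="3G"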
 
 Environment: 3 node cluster with each node having 24 cores, 64G RAM and SSDs in 
 RAID5.
 We are making around 12k writes/sec across 5 CFs (one with 4 secondary indexes) and 2300 
 reads/sec on each node of the 3 node cluster. 2 CFs have wide rows with max data 
 of around 100mb per row
 
 Yes. A node being marked down has a cascading effect. Within seconds all nodes in our 
 cluster are marked down. 
 
 Thanks
 Anuj Wadehra
 
 
 
 On Monday, 1 June 2015 7:12 PM, Carl Hu m...@carlhu.com 
 mailto:m...@carlhu.com wrote:
 
 
 We are running Cassandra version 2.1.5.469 on 15 nodes and are experiencing a 
 problem where the entire cluster slows down for 2.5 minutes when one node 
 experiences a 17 second stop-the-world gc. These gc's happen once every 2 
 hours. I did find a ticket that seems related to this: 
 https://issues.apache.org/jira/browse/CASSANDRA-3853, but Jonathan Ellis 
 has resolved this ticket. 
 
 We are running standard gc settings, but this ticket is not so much concerned 
 with the 17 second gc on a single node (after all, we have 14 others) as with 
 the cascading performance problem.
 
 We are running standard values of dynamic_snitch_badness_threshold (0.1) and 
 phi_convict_threshold (8). (These values are relevant for the dynamic snitch 
 routing requests away from the frozen node, or the failure detector marking 
 the node as 'down'.)
 
 We use the python client in default round robin mode, so all clients hit the 
 coordinators at all nodes in round robin. One theory is that since the 
 coordinator on all nodes must hit the frozen node at some point in the 17 
 seconds, each node's request queue fills up, and the entire cluster thus 
 freezes up. That would explain a 17 second freeze but would not explain the 
 2.5 minute slowdown (10x increase in request latency @P50).
 
 I'd love your thoughts. I've provided the GC chart here.
 
 Carl
 
 (attachment: GC chart - d2c95dce-0848-11e5-91f7-6b223349fc14.png)
 
 
 





Re: 10000+ CF support from Cassandra

2015-06-01 Thread graham sanderson
  I strongly advise against this approach.
 Jon, I think so too. But do you actually foresee any problems with this 
 approach?
 I can think of a few. [I want to evaluate if we can live with this problem]
Just to be clear, I’m not saying this is a great approach, I AM saying that it 
may be better than having 10000+ CFs, which was the original question (it 
really depends on the use case, which wasn’t well defined)… map size limit may 
be a problem, and then there is the CQL vs thrift question which could start a 
flame war; ideally CQL maps should give you the same flexibility as arbitrary 
thrift columns
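
To make that concrete, the rough shape I have in mind (illustrative CQL only - the 
keyspace, table and column names are invented, and this is a sketch rather than a 
recommendation):

CREATE TABLE shared.tenant_rows (
    tenant_id     text,
    logical_table text,
    row_key       text,
    fields        map<text,text>,   -- every value as text, one of the costs noted above
    PRIMARY KEY ((tenant_id, logical_table), row_key)
);

-- each top level field is independently updatable:
UPDATE shared.tenant_rows SET fields['status'] = '"active"'
WHERE tenant_id = 't42' AND logical_table = 'orders' AND row_key = 'o-1';

One physical table (or a handful) then carries all the logical tenant tables, at 
the price of losing per-table types, clustering choices and so on.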

 On Jun 1, 2015, at 9:44 PM, Jonathan Haddad j...@jonhaddad.com wrote:
 
  Sorry for this naive question but how important is this tuning? Can this 
  have a huge impact in production?
 
 Massive.  Here's a graph of when we did some JVM tuning at my previous 
 company: 
 
 http://33.media.tumblr.com/5d0efca7288dc969c1ac4fc3d36e0151/tumblr_inline_mzvj254quj1rd24f4.png
 
 About an order of magnitude difference in performance.
 
 Jon
 
 On Mon, Jun 1, 2015 at 7:20 PM Arun Chaitanya chaitan64a...@gmail.com 
 mailto:chaitan64a...@gmail.com wrote:
 Thanks Jon and Jack,
 
  I strongly advise against this approach.
 Jon, I think so too. But do you actually foresee any problems with this 
 approach?
 I can think of a few. [I want to evaluate if we can live with this problem]
 No more CQL. 
 No data types, everything needs to be a blob.
 Limited clustering Keys and default clustering order.
  First off, different workloads need different tuning.
 Sorry for this naive question but how important is this tuning? Can this have 
 a huge impact in production?
 
  You might want to consider a model where you have an application layer that 
  maps logical tenant tables into partition keys within a single large 
  Casandra table, or at least a relatively small number of  Cassandra tables. 
  It will depend on the typical size of your tenant tables - very small ones 
  would make sense within a single partition, while larger ones should have 
  separate partitions for a tenant's data. The key here is that tables are 
  expensive, but partitions are cheap and scale very well with Cassandra.
 We are actually trying a similar approach. But we don't want to expose this to 
 the application layer. We are attempting to hide this and provide an API.
 
  Finally, you said 10 clusters, but did you mean 10 nodes? You might want 
  to consider a model where you do indeed have multiple clusters, where each 
  handles a fraction of the tenants, since there is no need for separate 
  tenants to be on the same cluster.
 I meant 10 clusters. We want to split our tables across multiple clusters if 
 the above approach is not possible. [But it seems to be very costly]
 
 Thanks,
 
 
 
 
 
 
 
 On Fri, May 29, 2015 at 5:49 AM, Jack Krupansky jack.krupan...@gmail.com 
 mailto:jack.krupan...@gmail.com wrote:
 How big is each of the tables - are they all fairly small or fairly large? 
 Small as in no more than thousands of rows or large as in tens of millions or 
 hundreds of millions of rows?
 
 Small tables are not ideal for a Cassandra cluster since the rows would 
 be spread out across the nodes, even though it might make more sense for each 
 small table to be on a single node.
 
 You might want to consider a model where you have an application layer that 
 maps logical tenant tables into partition keys within a single large Casandra 
 table, or at least a relatively small number of Cassandra tables. It will 
 depend on the typical size of your tenant tables - very small ones would make 
 sense within a single partition, while larger ones should have separate 
 partitions for a tenant's data. The key here is that tables are expensive, 
 but partitions are cheap and scale very well with Cassandra.
 
 Finally, you said 10 clusters, but did you mean 10 nodes? You might want to 
 consider a model where you do indeed have multiple clusters, where each 
 handles a fraction of the tenants, since there is no need for separate 
 tenants to be on the same cluster.
 
 
 -- Jack Krupansky
 
 On Tue, May 26, 2015 at 11:32 PM, Arun Chaitanya chaitan64a...@gmail.com 
 mailto:chaitan64a...@gmail.com wrote:
 Good Day Everyone,
 
 I am very happy with the (almost) linear scalability offered by C*. We had a 
 lot of problems with RDBMS.
 
 But, I heard that C* has a limit on number of column families that can be 
 created in a single cluster.
 The reason being each CF stores 1-2 MB on the JVM heap.
 
 In our use case, we have about 10000+ CFs and we want to support multi-tenancy.
 (i.e. 10000 * no of tenants)
 
 We are new to C* and being from RDBMS background, I would like to understand 
 how to tackle this scenario from your advice.
 
 Our plan is to use Off-Heap memtable approach.
 http://www.datastax.com/dev/blog/off-heap-memtables-in-Cassandra-2-1 
 

Re: 10000+ CF support from Cassandra

2015-05-28 Thread Graham Sanderson
Depending on your use case and data types (for example if you can have a 
minimally nested JSON representation of the objects),
then you could go with a common map<string,string> representation where keys 
are top level object fields and values are valid JSON literals as strings; e.g. 
unquoted primitives, quoted strings, unquoted arrays or other objects.

Each top level field is then independently updatable - which may be beneficial 
(and allows you to trivially keep historical versions of objects, if that is a 
requirement)

If you are updating the object in its entirety on save then simply store the 
entire object in a single cql field, and denormalize any search fields you may 
need (which you kinda want to do anyway)

Sent from my iPhone

 On May 28, 2015, at 1:49 AM, Arun Chaitanya chaitan64a...@gmail.com wrote:
 
 Hello Jack,
 
  Column families? As opposed to tables? Are you using Thrift instead of 
  CQL3? You should be focusing on the latter, not the former.
 We have an ORM developed in our company, which maps each DTO to a column 
 family. So, we have many column families. We are using CQL3.
 
  But either way, the general guidance is that there is no absolute limit of 
  tables per se, but low hundreds is the recommended limit, regardless of 
  how many keyspaces they may be divided 
  between. More than that is an anti-pattern for Cassandra - maybe you can 
  make it work for your application, but it isn't recommended.
 You mean to say that most cassandra users don't have more than 200-300 column 
 families? Is this achieved through careful data modelling?
 
  A successful Cassandra deployment is critically dependent on careful data 
  modeling - who is responsible for modeling each of these tables, you and a 
  single, tightly-knit team with very common interests  and very specific 
  goals and SLAs or many different developers with different managers with 
  different goals such as SLAs?
 The latter.
 
  When you say multi-tenant, are you simply saying that each of your 
  organization's customers has their data segregated, or does each customer 
  have direct access to the cluster?
 Each organization's data is in the same cluster. No customer has direct 
 access to the cluster.
 
 Thanks,
 Arun
 
 On Wed, May 27, 2015 at 7:17 PM, Jack Krupansky jack.krupan...@gmail.com 
 wrote:
 Scalability of Cassandra refers primarily to number of rows and number of 
 nodes - to add more data, add more nodes.
 
 Column families? As opposed to tables? Are you using Thrift instead of CQL3? 
 You should be focusing on the latter, not the former.
 
 But either way, the general guidance is that there is no absolute limit of 
 tables per se, but low hundreds is the recommended limit, regardless of 
 how many keyspaces they may be divided between. More than that is 
 an anti-pattern for Cassandra - maybe you can make it work for your 
 application, but it isn't recommended.
 
 A successful Cassandra deployment is critically dependent on careful data 
 modeling - who is responsible for modeling each of these tables, you and a 
 single, tightly-knit team with very common interests and very specific goals 
 and SLAs or many different developers with different managers with different 
 goals such as SLAs?
 
 When you say multi-tenant, are you simply saying that each of your 
 organization's customers has their data segregated, or does each customer 
 have direct access to the cluster?
 
 
 
 
 
 -- Jack Krupansky
 
 On Tue, May 26, 2015 at 11:32 PM, Arun Chaitanya chaitan64a...@gmail.com 
 wrote:
 Good Day Everyone,
 
 I am very happy with the (almost) linear scalability offered by C*. We had 
 a lot of problems with RDBMS.
 
 But, I heard that C* has a limit on number of column families that can be 
 created in a single cluster.
 The reason being each CF stores 1-2 MB on the JVM heap.
 
 In our use case, we have about 10000+ CFs and we want to support 
 multi-tenancy.
 (i.e. 10000 * no of tenants)
 
 We are new to C* and being from RDBMS background, I would like to 
 understand how to tackle this scenario from your advice.
 
 Our plan is to use Off-Heap memtable approach.
 http://www.datastax.com/dev/blog/off-heap-memtables-in-Cassandra-2-1
 
 Each node in the cluster has following configuration
 16 GB machine (8GB Cassandra JVM + 2GB System + 6GB Off-Heap)
 IMO, this should be able to support 1000 CFs with no (or very little) impact on 
 performance and startup time.
 
 We tackle multi-tenancy using different keyspaces.(Solution I found on the 
 web)
 
 Using this approach we can have 10 clusters doing the job. (We actually are 
 worried about the cost)
 
 Can you please help us evaluate this strategy? I want to hear the community's 
 opinion on this.
 
 My major concerns being, 
 
 1. Is Off-Heap strategy safe and my assumption of 16 GB supporting 1000 CF 
 right?
 
 2. Can we use multiple keyspaces to solve multi-tenancy? IMO, the number of 
 column families increases even when we use multiple keyspaces.
 
 3. I 

Re: 10000+ CF support from Cassandra

2015-05-26 Thread graham sanderson
Are the CFs different, or all the same schema? Are you contractually obligated 
to actually separate data into separate CFs? It seems like you’d have a much 
simpler time if you could use part of the partition key to separate the data.

Note also, I don’t know what disks you are using, but disk cache can be pretty 
helpful, and you haven’t allowed for any in your machine sizing. Of course that 
depends on your stored data volume also.

Also hard to answer your questions without an idea of read/write load system 
wide, and indeed distribution across tenants.

 On May 26, 2015, at 10:32 PM, Arun Chaitanya chaitan64a...@gmail.com wrote:
 
 Good Day Everyone,
 
 I am very happy with the (almost) linear scalability offered by C*. We had a 
 lot of problems with RDBMS.
 
 But, I heard that C* has a limit on number of column families that can be 
 created in a single cluster.
 The reason being each CF stores 1-2 MB on the JVM heap.
 
 In our use case, we have about 10000+ CFs and we want to support multi-tenancy.
 (i.e. 10000 * no of tenants)
 
 We are new to C* and being from RDBMS background, I would like to understand 
 how to tackle this scenario from your advice.
 
 Our plan is to use Off-Heap memtable approach.
 http://www.datastax.com/dev/blog/off-heap-memtables-in-Cassandra-2-1
 
 Each node in the cluster has following configuration
 16 GB machine (8GB Cassandra JVM + 2GB System + 6GB Off-Heap)
 IMO, this should be able to support 1000 CFs with no (or very little) impact on 
 performance and startup time.
 
 We tackle multi-tenancy using different keyspaces.(Solution I found on the 
 web)
 
 Using this approach we can have 10 clusters doing the job. (We actually are 
 worried about the cost)
 
 Can you please help us evaluate this strategy? I want to hear the community's 
 opinion on this.
 
 My major concerns being, 
 
 1. Is Off-Heap strategy safe and my assumption of 16 GB supporting 1000 CF 
 right?
 
 2. Can we use multiple keyspaces to solve multi-tenancy? IMO, the number of 
 column families increases even when we use multiple keyspaces.
 
 3. I understand the complexity using multi-cluster for single application. 
 The code base will get tightly coupled with infrastructure. Is this the right 
 approach?
 
 Any suggestion is appreciated.
 
 Thanks,
 Arun





cassandra 2.2

2015-05-11 Thread graham sanderson
I think vast may have changed the release schedule of cassandra. I talk a lot 
with one of their key developers, and 3.0 was going to drop off heap memtables 
for several releases due to a rewrite of the storage engine to be more CQL 
friendly.

2.2 will take all of the improvements in 3.0 but not be delayed by that engine 
rewrite (and I think because of us will ship out of the box with off heap 
memtables - I gave them some pretty telling graphs about the differences 
between the two)

In any case, in the short term we will move to 2.1.5, which has all the things 
that we had to patch in 2.1.3 fixed (except for one incorrect metric) - so 
we will be running our patched version of 2.1.5, although that patch is only 
one line



DateTieredCompactionStrategy and static columns

2015-04-30 Thread graham sanderson
I have a potential use case I haven’t had a chance to prototype yet, which 
would normally be a good candidate for DTCS (i.e. data delivered in order and a 
fixed TTL), however with every write we’d also be updating some static cells 
(namely a few key/values in a static map<text,text> CQL column). There could 
also be explicit deletes of keys in the static map, though that’s not 100% 
necessary.

Since those columns don’t have a TTL, without reading through the code and/or 
trying it, I have no idea what effect this has on DTCS (perhaps it needs to use 
separate sstables for static columns). Has anyone tried this? If not, I 
eventually will and will report back.



Re: Uderstanding Read after update

2015-04-13 Thread Graham Sanderson
Yes, it will look in each sstable that, according to the bloom filter, may have 
data for that partition key, and use timestamps to figure out the latest 
version (or none, in the case of a newer tombstone) to return for each clustering key

Sent from my iPhone

 On Apr 12, 2015, at 11:18 PM, Anishek Agarwal anis...@gmail.com wrote:
 
 Thanks Tyler for the validations, 
 
 I have a follow up question. 
 
  One SSTable doesn't have precedence over another.  Instead, when the same 
 cell exists in both sstables, the one with the higher write timestamp wins.
 
 if my table has 5 (non partition key) columns and I update only 1 of them, then 
 the new SSTable should have only that entry, which means that if I query 
 everything for that partition key, Cassandra has to match the timestamp 
 per column for a partition key across SSTables to get me the data?
 
 
 On Fri, Apr 10, 2015 at 10:52 PM, Tyler Hobbs ty...@datastax.com wrote:
 
 
 SSTable level bloom filters have details as to what partition keys are in 
 that table. So to clear up my understanding, if I insert and then have an 
 update to the same row after some time (assuming both go to different 
 SSTables), then during read Cassandra will read data from both SSTables and 
 merge them in order of time series, with data in the second SSTable for the 
 row taking precedence over the first SSTable, and return the result?
 
 That's approximately correct.  The only part that's incorrect is how merging 
 works.  One SSTable doesn't have precedence over another.  Instead, when the 
 same cell exists in both sstables, the one with the higher write timestamp 
 wins.
  
 Does it mark the old column as a tombstone in the previous SSTable, or wait 
 for compaction to remove the old data?
 
 It just waits for compaction to remove the old data, there's no tombstone.
 
 
 when the data is in the memtable, does it also keep track of unique keys in that 
 memtable, so when it writes to disk it can use that to derive the right size 
 of bloom filter for that SSTable?
 
 
 That's correct, it knows the number of keys before the bloom filter is 
 created.
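 
 If it helps to see that reconciliation rule spelled out, here's a toy sketch (not 
 Cassandra's actual code, just the idea that the higher write timestamp wins per 
 cell, regardless of which sstable it came from):
 
 import java.util.HashMap;
 import java.util.Map;
 
 public class CellMergeSketch
 {
     // a cell value plus its write timestamp
     static final class Cell
     {
         final String value;
         final long timestamp;
         Cell(String value, long timestamp) { this.value = value; this.timestamp = timestamp; }
     }
 
     // merge the cells read for one partition/clustering key from two sstables:
     // for each column name, keep whichever cell carries the higher timestamp
     static Map<String, Cell> merge(Map<String, Cell> sstableA, Map<String, Cell> sstableB)
     {
         Map<String, Cell> result = new HashMap<>(sstableA);
         for (Map.Entry<String, Cell> e : sstableB.entrySet())
         {
             Cell existing = result.get(e.getKey());
             if (existing == null || e.getValue().timestamp > existing.timestamp)
                 result.put(e.getKey(), e.getValue());
         }
         return result;
     }
 }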
 
 -- 
 Tyler Hobbs
 DataStax
 


Re: Huge number of sstables after adding server to existing cluster

2015-04-04 Thread graham sanderson
I have not thought thru why adding a node would cause this behavior, but

https://issues.apache.org/jira/browse/CASSANDRA-8860
https://issues.apache.org/jira/browse/CASSANDRA-8635

related issues (which end up causing excessive numbers of sstables) - we 
saw many thousands per node for some tables

If you have a manageable number of tables, try setting coldReadsToOmit (a 
setting on SizeTieredCompactionStrategy) back to 0 which was the default in 
2.0.x
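
In CQL that is something like: ALTER TABLE my_ks.my_table WITH compaction = 
{ 'class' : 'SizeTieredCompactionStrategy', 'cold_reads_to_omit' : '0.0' }; 
(the table name is obviously a placeholder, and I believe cold_reads_to_omit is 
the sub-option spelling on 2.1, but double-check against your version.)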

Otherwise you could apply this patch (which reverts the default - and I doubt 
you had overridden it), but note that coldReadsToOmit is fixed in 2.1.4, so 
if you can do it just by changing table config then that is good.

diff --git a/src/java/org/apache/cassandra/db/compaction/SizeTieredCompactionStrategy.java b/src/java/org/apache/cassandra/db/compaction/SizeTieredCompactionStrategy.java
index fbd715c..cbb8c8b 100644
--- a/src/java/org/apache/cassandra/db/compaction/SizeTieredCompactionStrategy.java
+++ b/src/java/org/apache/cassandra/db/compaction/SizeTieredCompactionStrategy.java
@@ -118,7 +118,11 @@ public class SizeTieredCompactionStrategy extends AbstractCompactionStrategy
     static List<SSTableReader> filterColdSSTables(List<SSTableReader> sstables, double coldReadsToOmit, int minThreshold)
     {
         if (coldReadsToOmit == 0.0)
+        {
+            if (!sstables.isEmpty())
+                logger.debug("Skipping cold sstable filter for list sized {} containing {}", sstables.size(), sstables.get(0).getFilename());
             return sstables;
+        }
 
         // Sort the sstables by hotness (coldest-first). We first build a map because the hotness may change during the sort.
         final Map<SSTableReader, Double> hotnessSnapshot = getHotnessMap(sstables);
diff --git a/src/java/org/apache/cassandra/db/compaction/SizeTieredCompactionStrategyOptions.java b/src/java/org/apache/cassandra/db/compaction/SizeTieredCompactionStrategyOptions.java
index 84e7d61..c6c5f1b 100644
--- a/src/java/org/apache/cassandra/db/compaction/SizeTieredCompactionStrategyOptions.java
+++ b/src/java/org/apache/cassandra/db/compaction/SizeTieredCompactionStrategyOptions.java
@@ -26,7 +26,7 @@ public final class SizeTieredCompactionStrategyOptions
     protected static final long DEFAULT_MIN_SSTABLE_SIZE = 50L * 1024L * 1024L;
     protected static final double DEFAULT_BUCKET_LOW = 0.5;
     protected static final double DEFAULT_BUCKET_HIGH = 1.5;
-    protected static final double DEFAULT_COLD_READS_TO_OMIT = 0.05;
+    protected static final double DEFAULT_COLD_READS_TO_OMIT = 0.0;
     protected static final String MIN_SSTABLE_SIZE_KEY = "min_sstable_size";
     protected static final String BUCKET_LOW_KEY = "bucket_low";
     protected static final String BUCKET_HIGH_KEY = "bucket_high";




 On Apr 4, 2015, at 4:23 PM, Mantas Klasavičius mantas.klasavic...@gmail.com 
 wrote:
 
 Thanks a lot to all of you for your responses.
 
 I should mention we are running 2.1.3 and I have already set setcompactionthroughput 
 to 0.
 
 The nodetool enableautocompaction keyspace table command/bug is new to me; I will 
 definitely try this out and let you know.
 
 One more thing I want to clarify: did I understand correctly that 32 is the max 
 number of sstables for a normally operating cassandra node?
 
 
 Best regards
 Mantas
 
 On Sat, Apr 4, 2015 at 4:47 AM, graham sanderson gra...@vast.com 
 mailto:gra...@vast.com wrote:
 As does 2.1.3
 
 On Apr 3, 2015, at 5:36 PM, Robert Coli rc...@eventbrite.com 
 mailto:rc...@eventbrite.com wrote:
 
 On Fri, Apr 3, 2015 at 1:04 PM, Thomas Borg Salling tbsall...@tbsalling.dk 
 mailto:tbsall...@tbsalling.dk wrote:
 I agree with Pranay. I have experienced exactly the same on C* 2.1.2.
 
 2.1.2 had a serious bug which resulted in extra files, which is different 
 from the overall issue I am referring to.
 
 =Rob
  
 
 





Re: Huge number of sstables after adding server to existing cluster

2015-04-03 Thread graham sanderson
As does 2.1.3

 On Apr 3, 2015, at 5:36 PM, Robert Coli rc...@eventbrite.com wrote:
 
 On Fri, Apr 3, 2015 at 1:04 PM, Thomas Borg Salling tbsall...@tbsalling.dk 
 mailto:tbsall...@tbsalling.dk wrote:
 I agree with Pranay. I have experienced exactly the same on C* 2.1.2.
 
 2.1.2 had a serious bug which resulted in extra files, which is different 
 from the overall issue I am referring to.
 
 =Rob
  





Re: Astyanax Thrift Frame Size Hardcoded - Breaks Ring Describe

2015-04-03 Thread graham sanderson
It is very stable for us; we don’t use it in many cases (generally older stuff 
where it was the best choice), but I think it is a little harsh to write it off

 On Apr 3, 2015, at 1:55 PM, Robert Coli rc...@eventbrite.com wrote:
 
 On Fri, Apr 3, 2015 at 11:16 AM, Eric Stevens migh...@gmail.com 
 mailto:migh...@gmail.com wrote:
 Astyanax is no longer maintained, so I don't really expect that to go 
 anywhere, which is why I thought it might be a good idea to issue a general 
 warning.  This should hopefully be a helpful nudge for anyone still using 
 Astyanax: it's time to find a new driver.
 
 I'm not contesting, but do you have a citation for this? If so, providing it 
 would strengthen your nudge. :D
 
 =Rob
  





Re: Disastrous profusion of SSTables

2015-03-26 Thread graham sanderson
you may be seeing

https://issues.apache.org/jira/browse/CASSANDRA-8860
https://issues.apache.org/jira/browse/CASSANDRA-8635

related issues (which end up causing excessive numbers of sstables)

we applied

diff --git a/src/java/org/apache/cassandra/db/compaction/SizeTieredCompactionStrategy.java b/src/java/org/apache/cassandra/db/compaction/SizeTieredCompactionStrategy.java
index fbd715c..cbb8c8b 100644
--- a/src/java/org/apache/cassandra/db/compaction/SizeTieredCompactionStrategy.java
+++ b/src/java/org/apache/cassandra/db/compaction/SizeTieredCompactionStrategy.java
@@ -118,7 +118,11 @@ public class SizeTieredCompactionStrategy extends AbstractCompactionStrategy
     static List<SSTableReader> filterColdSSTables(List<SSTableReader> sstables, double coldReadsToOmit, int minThreshold)
     {
         if (coldReadsToOmit == 0.0)
+        {
+            if (!sstables.isEmpty())
+                logger.debug("Skipping cold sstable filter for list sized {} containing {}", sstables.size(), sstables.get(0).getFilename());
             return sstables;
+        }
 
         // Sort the sstables by hotness (coldest-first). We first build a map because the hotness may change during the sort.
         final Map<SSTableReader, Double> hotnessSnapshot = getHotnessMap(sstables);
diff --git a/src/java/org/apache/cassandra/db/compaction/SizeTieredCompactionStrategyOptions.java b/src/java/org/apache/cassandra/db/compaction/SizeTieredCompactionStrategyOptions.java
index 84e7d61..c6c5f1b 100644
--- a/src/java/org/apache/cassandra/db/compaction/SizeTieredCompactionStrategyOptions.java
+++ b/src/java/org/apache/cassandra/db/compaction/SizeTieredCompactionStrategyOptions.java
@@ -26,7 +26,7 @@ public final class SizeTieredCompactionStrategyOptions
     protected static final long DEFAULT_MIN_SSTABLE_SIZE = 50L * 1024L * 1024L;
     protected static final double DEFAULT_BUCKET_LOW = 0.5;
     protected static final double DEFAULT_BUCKET_HIGH = 1.5;
-    protected static final double DEFAULT_COLD_READS_TO_OMIT = 0.05;
+    protected static final double DEFAULT_COLD_READS_TO_OMIT = 0.0;
     protected static final String MIN_SSTABLE_SIZE_KEY = "min_sstable_size";
     protected static final String BUCKET_LOW_KEY = "bucket_low";
     protected static final String BUCKET_HIGH_KEY = "bucket_high";

to our 2.1.3, though the entire coldReadsToOmit option is removed in 2.1.4

Note you don’t have to patch the code, you can set the value on each table (we 
just have a lot of tables, including dynamically generated ones) - basically try setting 
coldReadsToOmit back to 0, which was the default in 2.0.x

 On Mar 26, 2015, at 3:56 AM, Anishek Agarwal anis...@gmail.com wrote:
 
 Are you frequently updating the same rows? What is the memtable flush size? Can 
 you post the table create query here please?
 
 On Thu, Mar 26, 2015 at 1:21 PM, Dave Galbraith david92galbra...@gmail.com 
 mailto:david92galbra...@gmail.com wrote:
 Hey! So I'm running Cassandra 2.1.2 and using the 
 SizeTieredCompactionStrategy. I'm doing about 3k writes/sec on a single node. 
 My read performance is terrible, all my queries just time out. So I do 
 nodetool cfstats:
 
 Read Count: 42071
 Read Latency: 67.47804242827601 ms.
 Write Count: 131964300
 Write Latency: 0.011721604274792501 ms.
 Pending Flushes: 0
 Table: metrics16513
 SSTable count: 641
 Space used (live): 6366740812
 Space used (total): 6366740812
 Space used by snapshots (total): 0
 SSTable Compression Ratio: 0.25272488401992765
 Memtable cell count: 0
 Memtable data size: 0
 Memtable switch count: 1016
 Local read count: 42071
 Local read latency: 67.479 ms
 Local write count: 131964300
 Local write latency: 0.012 ms
 Pending flushes: 0
 Bloom filter false positives: 994
 Bloom filter false ratio: 0.0
 Bloom filter space used: 37840376
 Compacted partition minimum bytes: 104
 Compacted partition maximum bytes: 24601
 Compacted partition mean bytes: 255
 Average live cells per slice (last five minutes): 111.67243951154147
 Maximum live cells per slice (last five minutes): 1588.0
 Average tombstones per slice (last five minutes): 0.0
 Maximum tombstones per slice (last five minutes): 0.0
 
 and nodetool cfhistograms:
 
 Percentile  SSTables   Write Latency   Read Latency    Partition Size   Cell Count
                        (micros)        (micros)        (bytes)
 50%         46.00      6.99            154844.95       149              1
 75%         430.00     8.53            3518837.53      179              1
 95%         430.00     11.32           7252897.25      215              2
 98%         430.00     15.54           22103886.34   

Re: What are the reasons for holding off on 2.1.x at this point?

2015-03-09 Thread graham sanderson
2.1.3 has a few memory leaks/issues, resource management race conditions.

That is horribly vague, however looking at some of the fixes in 2.1.4 I’d be 
tempted to wait on that.

2.1.3 is fine for testing though.

 On Mar 9, 2015, at 6:42 PM, Jacob Rhoden jacob.rho...@me.com wrote:
 
 I notice some of the discussion about rolling back and avoiding upgrading. I 
 wonder if people can elaborate on their pain points? 
 
 We are in a situation where there are some use cases we wish to implement 
 that appear to be much simpler to implement using indexed sets. So it has me 
 wondering about what the cons would be of jumping into 2.1.3, instead of 
 having to code around the limits of 2.0.x, and then re-write the features 
 once we can use 2.1.3. (Ideally we want to get these use cases into prod 
 within the next 4 weeks)
 
 Thanks,
 Jacob





Re: Upgrade from 2.0.9 to 2.1.3

2015-03-06 Thread graham sanderson
Note for anyone who accidentally or otherwise ends up with 2.1.3 in a situation 
they cannot downgrade, feel free to look at 

https://github.com/vast-engineering/cassandra/tree/vast-cassandra-2.1.3

We sometimes make custom versions incorporating as many of the important patches 
we need as we reasonably can, so that we can run a newer C* environment successfully.

Obviously use at your own risk, blah blah… basically the install procedure would be 
to replace the main cassandra jar on a 2.1.3 node while it is down.

 On Mar 6, 2015, at 3:15 PM, Robert Coli rc...@eventbrite.com wrote:
 
 On Fri, Mar 6, 2015 at 6:25 AM, graham sanderson gra...@vast.com 
 mailto:gra...@vast.com wrote:
 I would definitely wait for at least 2.1.4
 
 +1
 
 https://engineering.eventbrite.com/what-version-of-cassandra-should-i-run/
 
 =Rob
  





Re: best practices for time-series data with massive amounts of records

2015-03-06 Thread graham sanderson
Note that using static column(s) for the “head” value, with trailing TTLed 
values behind it, is something we’re considering. Note this is especially nice if 
your head state includes, say, a map which is updated by small deltas (individual 
keys)

We have not yet studied the effect of static columns on, say, DTCS
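
For what it's worth, the rough shape we have in mind is (illustrative CQL, all 
names invented):

CREATE TABLE ts.series (
    key   text,
    ts    timeuuid,
    head  map<text,text> static,   -- current state, updated per key, no TTL
    delta map<text,text>,          -- per-event values, written USING TTL
    PRIMARY KEY (key, ts)
) WITH CLUSTERING ORDER BY (ts DESC);

with the open question being how the un-TTLed static cells interact with DTCS and 
sstable expiry.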

 On Mar 6, 2015, at 4:42 PM, Clint Kelly clint.ke...@gmail.com wrote:
 
 Hi all,
 
 Thanks for the responses, this was very helpful.
 
 I don't know yet what the distribution of clicks and users will be, but I 
 expect to see a few users with an enormous amount of interactions and most 
 users having very few.  The idea of doing some additional manual 
 partitioning, and then maintaining another table that contains the head 
 partition for each user makes sense, although it would add additional latency 
 when we want to get say the most recent 1000 interactions for a given user 
 (which is something that we have to do sometimes for applications with tight 
 SLAs).
 
 FWIW I doubt that any users will have so many interactions that they exceed 
 what we could reasonably put in a row, but I wanted to have a strategy to 
 deal with this.
 
 Having a nice design pattern in Cassandra for maintaining a row with the 
 N-most-recent interactions would also solve this reasonably well, but I don't 
 know of any way to implement that without running batch jobs that 
 periodically clean out data (which might be okay).
 
 Best regards,
 Clint
 
 
 
 
 On Tue, Mar 3, 2015 at 8:10 AM, mck m...@apache.org 
 mailto:m...@apache.org wrote:
 
  Here partition is a random digit from 0 to (N*M)
  where N=nodes in cluster, and M=arbitrary number.
 
 
 Hopefully it was obvious, but here (unless you've got hot partitions),
 you don't need N.
 ~mck
 





Re: Upgrade from 2.0.9 to 2.1.3

2015-03-06 Thread graham sanderson
I would definitely wait for at least 2.1.4

 On Mar 6, 2015, at 8:13 AM, Fredrik Larsson Stigbäck 
 fredrik.l.stigb...@sitevision.se wrote:
 
 So no upgradeSSTables are required?
 /Fredrik
 
 6 mar 2015 kl. 15:11 skrev Carlos Rolo r...@pythian.com 
 mailto:r...@pythian.com:
 
 I would not recommend an upgrade to 2.1.x for now. Do you have any specific 
 reason to upgrade?
 
 For upgrading from 2.0.9 you can just do a direct upgrade.
 
 Regards,
 
 Carlos Juzarte Rolo
 Cassandra Consultant
  
 Pythian - Love your data
 
 rolo@pythian | Twitter: cjrolo | Linkedin: linkedin.com/in/carlosjuzarterolo
 Tel: 1649
 www.pythian.com
 On Fri, Mar 6, 2015 at 3:03 PM, Fredrik Larsson Stigbäck 
 fredrik.l.stigb...@sitevision.se mailto:fredrik.l.stigb...@sitevision.se 
 wrote:
 What’s the recommended way of upgrading from 2.0.9 to 2.1.3?
 Is upgradeSSTables required?
 According to 
 http://www.datastax.com/documentation/upgrade/doc/upgrade/cassandra/upgradeC_c.html
 it should be possible to just start up on 2.1.3 directly after 2.0.9.
 
 Regards
 Fredrik
 
 
 
 --
 
 
 
 
 
 





Re: OOM and high SSTables count

2015-03-04 Thread graham sanderson
We can confirm a problem on 2.1.3 (sadly our beta sstable state obviously did 
not match our production ones in some critical way)

We have about 20k sstables on each of 6 nodes right now; actually a quick 
glance shows 15k of those are from OpsCenter, which may have something to do 
with beta/production mismatch

I will look into the open OOM JIRA issue against 2.1.3 - we may be being penalized 
for our heavy use of JBOD (x7 per node)

It also looks like 2.1.3 is leaking memory, though it eventually recovers via 
GCInspector causing a complete memtable flush.

 On Mar 4, 2015, at 12:31 PM, daemeon reiydelle daeme...@gmail.com wrote:
 
 Are you finding a correlation between the shards on the OOM DC1 nodes and the 
 OOM DC2 nodes? Does your monitoring tool indicate that the DC1 nodes are 
 using significantly more CPU (and memory) than the nodes that are NOT 
 failing? I am leading you down the path to suspect that your sharding is 
 giving you hot spots. Also are you using vnodes?
 
 Patrick
 
 On Wed, Mar 4, 2015 at 9:26 AM, Jan cne...@yahoo.com 
 mailto:cne...@yahoo.com wrote:
 HI Roni; 
 
 You mentioned: 
 DC1 servers have 32GB of RAM and 10GB of HEAP. DC2 machines have 16GB of RAM 
 and 5GB HEAP.
 
 Best practices would be to:
 a)  have a consistent type of node across both DCs (CPUs, Memory, Heap and Disk)
 b)  increase the heap on DC2 servers to at least 8GB for the C* heap 
 
 The leveled compaction issue is not addressed by this. 
 hope this helps
 
 Jan/
 
 
 
 
 On Wednesday, March 4, 2015 8:41 AM, Roni Balthazar ronibaltha...@gmail.com 
 mailto:ronibaltha...@gmail.com wrote:
 
 
 Hi there,
 
 We are running C* 2.1.3 cluster with 2 DataCenters: DC1: 30 Servers /
 DC2 - 10 Servers.
 DC1 servers have 32GB of RAM and 10GB of HEAP. DC2 machines have 16GB
 of RAM and 5GB HEAP.
 DC1 nodes have about 1.4TB of data and DC2 nodes 2.3TB.
 DC2 is used only for backup purposes. There are no reads on DC2.
 Every writes and reads are on DC1 using LOCAL_ONE and the RF DC1: 2 and DC2: 
 1.
 All keyspaces have STCS (Average 20~30 SSTables count each table on
 both DCs) except one that is using LCS (DC1: Avg 4K~7K SSTables / DC2:
 Avg 3K~14K SSTables).
 
 Basically we are running into 2 problems:
 
 1) High SSTables count on keyspace using LCS (This KS has 500GB~600GB
 of data on each DC1 node).
 2) There are 2 servers on DC1 and 4 servers in DC2 that went down with
 the OOM error message below:
 
 ERROR [SharedPool-Worker-111] 2015-03-04 05:03:26,394
 JVMStabilityInspector.java:94 - JVM state determined to be unstable.
 Exiting forcefully due to:
 java.lang.OutOfMemoryError: Java heap space
 at 
 org.apache.cassandra.db.composites.CompoundSparseCellNameType.copyAndMakeWith(CompoundSparseCellNameType.java:186)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at 
 org.apache.cassandra.db.composites.AbstractCompoundCellNameType$CompositeDeserializer.readNext(AbstractCompoundCellNameType.java:286)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at 
 org.apache.cassandra.db.AtomDeserializer.readNext(AtomDeserializer.java:104)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at 
 org.apache.cassandra.db.columniterator.IndexedSliceReader$IndexedBlockFetcher.getNextBlock(IndexedSliceReader.java:426)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at 
 org.apache.cassandra.db.columniterator.IndexedSliceReader$IndexedBlockFetcher.fetchMoreData(IndexedSliceReader.java:350)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at 
 org.apache.cassandra.db.columniterator.IndexedSliceReader.computeNext(IndexedSliceReader.java:142)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at 
 org.apache.cassandra.db.columniterator.IndexedSliceReader.computeNext(IndexedSliceReader.java:44)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at 
 com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
 ~[guava-16.0.jar:na]
 at 
 com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
 ~[guava-16.0.jar:na]
 at 
 org.apache.cassandra.db.columniterator.SSTableSliceIterator.hasNext(SSTableSliceIterator.java:82)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at 
 org.apache.cassandra.db.filter.QueryFilter$2.getNext(QueryFilter.java:172)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at 
 org.apache.cassandra.db.filter.QueryFilter$2.hasNext(QueryFilter.java:155)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at 
 org.apache.cassandra.utils.MergeIterator$Candidate.advance(MergeIterator.java:146)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at 
 org.apache.cassandra.utils.MergeIterator$ManyToOne.advance(MergeIterator.java:125)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at 
 org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:99)
 ~[apache-cassandra-2.1.3.jar:2.1.3]
 at 
 com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
 ~[guava-16.0.jar:na]
 at 
 com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
 

Re: Fastest way to map/parallel read all values in a table?

2015-02-09 Thread graham sanderson
Depending on whether you have deletes/updates, if this is an ad-hoc thing, you 
might want to just read the sstables directly.

 On Feb 9, 2015, at 12:56 PM, Kevin Burton bur...@spinn3r.com wrote:
 
 I had considered using spark for this but:
 
 1.  we tried to deploy spark only to find out that it was missing a number of 
 key things we need.  
 
 2.  our app needs to shut down to release threads and resources.  Spark 
 doesn’t have support for this, so all the workers would have stale threads 
 leaking afterwards.  Though I guess if I can get workers to fork then I 
 should be ok.
 
 3.  Spark SQL actually returned invalid data to our queries… so that was kind 
 of a red flag and a non-starter
 
 On Mon, Feb 9, 2015 at 2:24 AM, Marcelo Valle (BLOOMBERG/ LONDON) 
 mvallemil...@bloomberg.net mailto:mvallemil...@bloomberg.net wrote:
 Just for the record, I was doing the exact same thing in an internal 
 application at the start-up I used to work for. We had the need to write 
 custom code to process in parallel all rows of a column family. Normally we 
 would use Spark for the job, but in our case the logic was a little more 
 complicated, so we wrote custom code. 
 
 What we did was to run N processes on M machines (N cores in each), each one 
 processing tasks. The tasks were created by splitting the range -2^63 to 
 2^63 - 1 into N*M*10 tasks. Even if data was not completely evenly distributed 
 across the tasks, no machines were idle, as when some task was completed another 
 one was taken from the task pool.
 
 It was fast enough for us, but I am interested in knowing if there is a 
 better way of doing it.
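 
 For the token splitting itself, a minimal sketch (assuming Murmur3Partitioner, so 
 the token space is -2^63 .. 2^63 - 1; the table/column names a worker would query 
 are placeholders):
 
 import java.math.BigInteger;
 import java.util.ArrayList;
 import java.util.List;
 
 public class TokenRangeSplitter
 {
     // split the full Murmur3 token range into `splits` contiguous [start, end) ranges
     static List<long[]> split(int splits)
     {
         BigInteger min = BigInteger.valueOf(Long.MIN_VALUE);
         BigInteger max = BigInteger.valueOf(Long.MAX_VALUE);
         BigInteger width = max.subtract(min).divide(BigInteger.valueOf(splits));
         List<long[]> ranges = new ArrayList<>();
         for (int i = 0; i < splits; i++)
         {
             BigInteger start = min.add(width.multiply(BigInteger.valueOf(i)));
             BigInteger end = (i == splits - 1) ? max : start.add(width);
             ranges.add(new long[] { start.longValue(), end.longValue() });
         }
         return ranges;
     }
 
     public static void main(String[] args)
     {
         // each range becomes one task; a worker would then run something like
         //   SELECT * FROM ks.mytable
         //   WHERE token(primaryKey) >= ? AND token(primaryKey) < ?
         // (using <= for the last range so Long.MAX_VALUE is included)
         for (long[] r : split(8))
             System.out.printf("task range: %d .. %d%n", r[0], r[1]);
     }
 }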
 
 For your specific case, here is a tool we have released as open source which can 
 be useful for simpler tests: https://github.com/s1mbi0se/cql_record_processor
 
 Also, I guess you probably know that, but I would consider using Spark for 
 doing this.
 
 Best regards,
 Marcelo.
 
 From: user@cassandra.apache.org mailto:user@cassandra.apache.org 
 Subject: Re:Fastest way to map/parallel read all values in a table?
 What’s the fastest way to map/parallel read all values in a table?
 
 Kind of like a mini map only job.
 
 I’m doing this to compute stats across our entire corpus.
 
 What I did to begin with was use token() and then spit it into the number of 
 splits I needed.
 
 So I just took the total key range space which is -2^63 to 2^63 - 1 and broke 
 it into N parts.
 
 Then the queries come back as:
 
 select * from mytable where token(primaryKey) >= x and token(primaryKey) < y
 
 From reading on this list I thought this was the correct way to handle this 
 problem.
 
 However, I’m seeing horrible performance doing this.  After about 1% it just 
 flat out locks up.
 
 Could it be that I need to randomize the token order so that it’s not 
 contiguous?  Maybe it’s all mapping on the first box to begin with.
 
 
 
 -- 
 
 Founder/CEO Spinn3r.com http://spinn3r.com/
 Location: San Francisco, CA
 blog: http://burtonator.wordpress.com http://burtonator.wordpress.com/
 … or check out my Google+ profile 
 https://plus.google.com/102718274791889610666/posts
  http://spinn3r.com/
 
 
 
 -- 
 
 Founder/CEO Spinn3r.com http://spinn3r.com/
 Location: San Francisco, CA
 blog: http://burtonator.wordpress.com http://burtonator.wordpress.com/
 … or check out my Google+ profile 
 https://plus.google.com/102718274791889610666/posts
  http://spinn3r.com/
 





Re: No schema agreement from live replicas?

2015-02-03 Thread graham sanderson
What version of C* are you using? You could be seeing 
https://issues.apache.org/jira/browse/CASSANDRA-7734, which I think affects 
2.0.7 through 2.0.10

 On Feb 3, 2015, at 9:47 AM, Clint Kelly clint.ke...@gmail.com wrote:
 
 FWIW increasing the threshold for withMaxSchemaAgreementWaitSeconds to
 30sec was enough to fix my problem---I would like to understand
 whether the cluster has some kind of configuration problem that made
 doing so necessary, however.
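 
 For reference, the knob in question plus an explicit check looks roughly like this 
 (a sketch - the contact point and table are placeholders, and I believe the driver 
 also exposes Metadata.checkSchemaAgreement(), but verify that against your driver 
 version):
 
 import com.datastax.driver.core.Cluster;
 import com.datastax.driver.core.Session;
 
 public class SchemaAgreementExample
 {
     public static void main(String[] args)
     {
         Cluster cluster = Cluster.builder()
                 .addContactPoint("10.0.0.1")
                 .withMaxSchemaAgreementWaitSeconds(30)   // default is 10s
                 .build();
         Session session = cluster.connect();
 
         session.execute("CREATE TABLE IF NOT EXISTS ks.schema_hash (id text PRIMARY KEY, hash text)");
 
         // optionally block/retry until all live nodes report the same schema
         // version before writing to the newly created table
         if (!cluster.getMetadata().checkSchemaAgreement())
             System.err.println("schema still not in agreement, retry before writing");
 
         cluster.close();
     }
 }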
 
 Thanks!
 
 On Tue, Feb 3, 2015 at 7:44 AM, Clint Kelly clint.ke...@gmail.com wrote:
 Hi all,
 
 I have an application that uses the Java driver to create a table and then
 immediately write to it.  I see the following warning in my logs:
 
 [10.241.17.134] out: 15/02/03 09:32:24 WARN
 com.datastax.driver.core.Cluster: No schema agreement from live replicas
 after 10 s. The schema may not be up to date on some nodes.
 
 ...this seems to happen after creating a table, and the schema not being up
 to date leads to errors when trying to write to the new tables:
 
 [10.241.17.134] out: Exception in thread main
 com.datastax.driver.core.exceptions.InvalidQueryException: unconfigured
 columnfamily schema_hash
 
 Any suggestions on what to do about this (other than increasing
 withMaxSchemaAgreementWaitSeconds)?  This is only a three-node test
 cluster.  I have not gotten this warning before, even on much bigger
 clusters.
 
 Best regards,
 Clint





Re: Versioning in cassandra while indexing ?

2015-01-21 Thread graham sanderson
I believe you can use “USING TIMESTAMP XXX” with your inserts which will set 
the actual cell write times to the timestamp you provide. Then at least on read 
you’ll get the “latest” value… you may or may not incur an actual write of the 
old data to disk, but either way it’ll get cleaned up for you.
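
i.e. something along these lines (a sketch - the table, columns and values are made 
up, and the timestamp is whatever 'updatedAt' you carry, expressed in microseconds):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class UpsertWithClientTimestamp
{
    public static void main(String[] args)
    {
        Cluster cluster = Cluster.builder().addContactPoint("10.0.0.1").build();
        Session session = cluster.connect();

        long updatedAtMicros = 1421800000000000L; // the payload's updatedAt, in microseconds

        // whichever insert carries the highest timestamp wins on read,
        // regardless of arrival order, so no read-before-write is needed
        session.execute(
            "INSERT INTO app.payloads (id, body) VALUES ('p-1', '...') " +
            "USING TIMESTAMP " + updatedAtMicros);

        cluster.close();
    }
}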

 On Jan 21, 2015, at 1:54 AM, Pandian R pandian4m...@gmail.com wrote:
 
 Hi,
 
 I just wanted to know if there is any kind of versioning system in cassandra 
 while indexing new data (like the one we have for ElasticSearch, for example). 
 
 For example, I have a series of payloads each coming with an id and 
 'updatedAt' timestamp. I just want to maintain the latest state of any 
 payload for all the ids, i.e., index the data only if the current payload has 
 a greater 'updatedAt' than the previously stored timestamp. I can do this with 
 one additional self-lookup, but is there a way to achieve this without the 
 overhead of an additional lookup?
 
 Thanks !
 
 -- 
 Regards,
 Pandian





Re: Startup failure (Core dump) in Solaris 11 + JDK 1.8.0

2015-01-13 Thread graham sanderson
This might well be

https://issues.apache.org/jira/browse/CASSANDRA-8325

try the latest patch for that if you can.

 On Jan 13, 2015, at 4:50 AM, Bernardino Mota bernardino.m...@inovaworks.com 
 wrote:
 
 Hi,
 
 Yes, with JDK 1.7 it works, but only in 32-bit mode. It seems the problem is 
 with the 64-bit versions of JDK 8 and 7. I didn't try other, older versions.
 
 Unfortunately with 32 bits I'm more limited in the memory I can make available 
 for the JVM...
 
 Looking around the Web, there are others complaining about the same problem for a 
 while, but so far I haven't found a solution.
 
 It's interesting that many redirect the problem to the JVM (on Solaris). 
 I think that waiting for a possible JVM update that might or might not resolve this 
 is not a solution. 
 As a kind of request :-) it would be great if some change in the Cassandra 
 source code could resolve this. 
 
 
 
  
 
 
 On 01/12/2015 04:05 PM, Asit KAUSHIK wrote:
 Probably a bad answer, but I was able to run on JDK 1.7. So if possible you can 
 downgrade your JDK version and try. I hit the same block on RedHat Enterprise...
 
 On Jan 12, 2015 9:31 PM, Bernardino Mota bernardino.m...@inovaworks.com 
 mailto:bernardino.m...@inovaworks.com wrote:
 Hi all,
 
 I'm trying to install Cassandra 2.1.2 in Solaris 11 but I'm getting a core 
 dump at startup.
 
 Any help is appreciated, since I can't change the operating system...
 
 My setup is:
 - Solaris 11
 - JDK build 1.8.0_25-b17 
 
 
 The error:
 
 appserver02:/opt/apache-cassandra-2.1.2/bin$ ./cassandra
 appserver02:/opt/apache-cassandra-2.1.2/bin$ CompilerOracle: inline 
 org/apache/cassandra/db/AbstractNativeCell.compareTo 
 (Lorg/apache/cassandra/db/composites/Composite;)I
 CompilerOracle: inline 
 org/apache/cassandra/db/composites/AbstractSimpleCellNameType.compareUnsigned(Lorg/apache/cassandra/db/composites/Composite;Lorg/apache/cassandra/db/composites/Composite;)I
 CompilerOracle: inline org/apache/cassandra/utils/ByteBufferUtil.compare 
 (Ljava/nio/ByteBuffer;[B)I
 CompilerOracle: inline org/apache/cassandra/utils/ByteBufferUtil.compare 
 ([BLjava/nio/ByteBuffer;)I
 CompilerOracle: inline 
 org/apache/cassandra/utils/ByteBufferUtil.compareUnsigned 
 (Ljava/nio/ByteBuffer;Ljava/nio/ByteBuffer;)I
 CompilerOracle: inline 
 org/apache/cassandra/utils/FastByteOperations$UnsafeOperations.compareTo 
 (Ljava/lang/Object;JILjava/lang/Object;JI)I
 CompilerOracle: inline 
 org/apache/cassandra/utils/FastByteOperations$UnsafeOperations.compareTo 
 (Ljava/lang/Object;JILjava/nio/ByteBuffer;)I
 CompilerOracle: inline 
 org/apache/cassandra/utils/FastByteOperations$UnsafeOperations.compareTo 
 (Ljava/nio/ByteBuffer;Ljava/nio/ByteBuffer;)I
 INFO  14:08:07 Hostname: appserver02.local
 INFO  14:08:07 Loading settings from 
 file:/opt/apache-cassandra-2.1.2/conf/cassandra.yaml 
 INFO  14:08:08 Node configuration:[authenticator=AllowAllAuthenticator; 
 authorizer=AllowAllAuthorizer; auto_snapshot=true; 
 batch_size_warn_threshold_in_kb=5; batchlog_replay_throttle_in_kb=1024; 
 cas_contention_timeout_in_ms=1000; client_encryption_options=REDACTED; 
 cluster_name=Test Cluster; column_index_size_in_kb=64; 
 commit_failure_policy=stop; commitlog_segment_size_in_mb=32; 
 commitlog_sync=periodic; commitlog_sync_period_in_ms=1; 
 compaction_throughput_mb_per_sec=16; concurrent_counter_writes=32; 
 concurrent_reads=32; concurrent_writes=32; counter_cache_save_period=7200; 
 counter_cache_size_in_mb=null; counter_write_request_timeout_in_ms=5000; 
 cross_node_timeout=false; disk_failure_policy=stop; 
 dynamic_snitch_badness_threshold=0.1; 
 dynamic_snitch_reset_interval_in_ms=60; 
 dynamic_snitch_update_interval_in_ms=100; endpoint_snitch=SimpleSnitch; 
 hinted_handoff_enabled=true; hinted_handoff_throttle_in_kb=1024; 
 incremental_backups=false; index_summary_capacity_in_mb=null; 
 index_summary_resize_interval_in_minutes=60; inter_dc_tcp_nodelay=false; 
 internode_compression=all; key_cache_save_period=14400; 
 key_cache_size_in_mb=null; listen_address=localhost; 
 max_hint_window_in_ms=1080; max_hints_delivery_threads=2; 
 memtable_allocation_type=heap_buffers; native_transport_port=9042; 
 num_tokens=256; partitioner=org.apache.cassandra.dht.Murmur3Partitioner; 
 permissions_validity_in_ms=2000; range_request_timeout_in_ms=1; 
 read_request_timeout_in_ms=5000; 
 request_scheduler=org.apache.cassandra.scheduler.NoScheduler; 
 request_timeout_in_ms=1; row_cache_save_period=0; 
 row_cache_size_in_mb=0; rpc_address=localhost; rpc_keepalive=true; 
 rpc_port=9160; rpc_server_type=sync; 
 seed_provider=[{class_name=org.apache.cassandra.locator.SimpleSeedProvider, 
 parameters=[{seeds=127.0.0.1}]}]; server_encryption_options=REDACTED; 
 snapshot_before_compaction=false; ssl_storage_port=7001; 
 sstable_preemptive_open_interval_in_mb=50; start_native_transport=true; 
 start_rpc=true; storage_port=7000; thrift_framed_transport_size_in_mb=15; 
 

Re: Error when dropping keyspaces; One row required, 0 found

2014-12-02 Thread graham sanderson
I don’t know what it is but I also saw “empty” keyspaces via CQL while 
migrating an existing test cluster from 2.0.9  to 2.1.0 (final release bits 
prior to labelling). Since I was doing this manually (and had cqlsh problems 
due to python change) I figured it might have been me.

My observation was the same - this wasn’t actual data corruption but some 
nodes’ in memory state was incorrect - I didn’t investigate the problem too 
much as it soon went away, however it seemed to happen when there was schema 
disagreement in the cluster. Fixing that and restarting the nodes solved it.

 On Dec 1, 2014, at 7:24 PM, Mark Greene green...@gmail.com wrote:
 
 I'm running Cassandra 2.1.0.
 
 I was attempting to drop two keyspaces via cqlsh and encountered an error in 
 the CLI as well as the appearance of losing all my keyspaces. Below is the 
 output from my cqlsh session:
 
 
 
 
 $ cqlsh
 Connected to Production Cluster at 127.0.0.1:9042.
 [cqlsh 5.0.1 | Cassandra 2.1.0 | CQL spec 3.2.0 | Native protocol v3]
 Use HELP for help.
 cqlsh desc keyspaces;
 
 contacts_index  contacts_testing  contacts  system  OpsCenter  system_traces
 
 
 cqlsh drop keyspace contacts_index;
 
 cqlsh drop keyspace contacts;
 ErrorMessage code= [Server error] message=java.lang.RuntimeException: 
 java.util.concurrent.ExecutionException: java.lang.NullPointerException
 
 cqlsh drop keyspace contacts;
 ErrorMessage code= [Server error] message=java.lang.RuntimeException: 
 java.util.concurrent.ExecutionException: java.lang.IllegalStateException: One 
 row required, 0 found
 
 cqlsh desc keyspaces;
 
 empty   -- OH SHIT
 
 --
 
 After it appeared that I had lost all my keyspaces, I looked at the 
 system.log and found this: (full log attached)
 
 ERROR [MigrationStage:1] 2014-12-01 23:54:05,622 CassandraDaemon.java:166 - 
 Exception in thread Thread[MigrationStage:1,5,main]
 java.lang.IllegalStateException: One row required, 0 found
 at 
 org.apache.cassandra.cql3.UntypedResultSet$FromResultSet.one(UntypedResultSet.java:78)
  ~[apache-cassandra-2.1.0.jar:2.1.0]
 at org.apache.cassandra.config.KSMetaData.fromSchema(KSMetaData.java:275) 
 ~[apache-cassandra-2.1.0.jar:2.1.0]
 at org.apache.cassandra.db.DefsTables.mergeKeyspaces(DefsTables.java:230) 
 ~[apache-cassandra-2.1.0.jar:2.1.0]
 at 
 org.apache.cassandra.db.DefsTables.mergeSchemaInternal(DefsTables.java:186) 
 ~[apache-cassandra-2.1.0.jar:2.1.0]
 at org.apache.cassandra.db.DefsTables.mergeSchema(DefsTables.java:164) 
 ~[apache-cassandra-2.1.0.jar:2.1.0]
 at 
 org.apache.cassandra.db.DefinitionsUpdateVerbHandler$1.runMayThrow(DefinitionsUpdateVerbHandler.java:49)
  ~[apache-cassandra-2.1.0.jar:2.1.0]
 at 
 org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) 
 ~[apache-cassandra-2.1.0.jar:2.1.0]
 at 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) 
 ~[na:1.7.0_65]
 at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[na:1.7.0_65]
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  ~[na:1.7.0_65]
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  [na:1.7.0_65]
 at java.lang.Thread.run(Thread.java:745) [na:1.7.0_65]
 
 At this point I wasn't sure quite what to do about this and did a rolling 
 restart of the entire ring. After which, the keyspaces that were not 
 attempted to be deleted returned when running 'desc keyspaces' and my 
 intended keyspaces to be deleted had been removed as expected. 
 
 Strangely enough, because we run OpsCenter, we lost the dashboards we had 
 configured. Not a total deal breaker, but concerning that data loss occurred 
 here assuming it's related.
 
 
 Anyone run into something like this before?
 system.log





Re: Nodes get stuck in crazy GC loop after some time, leading to timeouts

2014-11-28 Thread graham sanderson
Your GC settings would be helpful, though you can roughly guesstimate by eyeballing 
(assuming settings are the same across all 4 images).

Bursty load can be a big cause of old gen fragmentation (as small working set 
objects tend to get spilled (promoted) along with memtable slabs which aren’t 
flushed quickly enough). That said, empty fragmentation holes wouldn’t show up 
as “used” in your graph, and it clearly looks like you are above your 
CMSInitiatingOccupancyFraction and CMS is running continuously, so they probably 
aren’t the issue here.

Other than trying a slightly larger heap to give you more head room, I’d also 
suggest from eyeballing that you have probably let the JVM pick its own new gen 
size, and I’d suggest it is too small. What to set it to really depends on your 
workload, but you could try something in the 0.5gig range unless that makes 
your young gen pauses too long. In that case (or indeed anyway) make sure you 
also have the latest GC settings (e.g. -XX:+CMSParallelInitialMarkEnabled 
-XX:+CMSEdenChunksRecordAlways) on newer JVMs (to help the young gc pauses)

 On Nov 28, 2014, at 2:55 PM, Paulo Ricardo Motta Gomes 
 paulo.mo...@chaordicsystems.com wrote:
 
 Hello,
 
 This is a recurrent behavior of JVM GC in Cassandra that I never completely 
 understood: when a node is UP for many days (or even months), or receives a 
 very high load spike (3x-5x normal load), CMS GC pauses start becoming very 
 frequent and slow, causing periodic timeouts in Cassandra. Trying to run GC 
 manually doesn't free up memory. The only solution when a node reaches this 
 state is to restart the node.
 
 We restart the whole cluster every 1 or 2 months, to avoid machines getting 
 into this crazy state. We tried tuning GC size and parameters, different 
 cassandra versions (1.1, 1.2, 2.0), but this behavior keeps happening. More 
 recently, during black friday, we received about 5x our normal load, and some 
 machines started presenting this behavior. Once again, we restart the nodes 
 an the GC behaves normal again.
 
 I'm attaching a few pictures comparing the heap of healthy and sick 
nodes: http://imgur.com/a/Tcr3w
 
 You can clearly notice some memory is actually reclaimed during GC in healthy 
 nodes, while in sick machines very little memory is reclaimed. Also, since GC 
 is executed more frequently in sick machines, it uses about 2x more CPU than 
 non-sick nodes.
 
 Have you ever observed this behavior in your cluster? Could this be related 
 to heap fragmentation? Would using the G1 collector help in this case? Any GC 
 tuning or monitoring advice to troubleshoot this issue?
 
 Any advice or pointers will be kindly appreciated.
 
 Cheers,
 
 -- 
 Paulo Motta
 
 Chaordic | Platform
 www.chaordic.com.br http://www.chaordic.com.br/
 +55 48 3232.3200





Re: Nodes get stuck in crazy GC loop after some time, leading to timeouts

2014-11-28 Thread graham sanderson
I should note that the young gen size is just a tuning suggestion, not directly 
related to your problem at hand.

You might want to make sure you don’t have issues with key/row cache.

Also, I’m assuming that your extra load isn’t hitting tables that you wouldn’t 
normally be hitting.

 On Nov 28, 2014, at 6:54 PM, graham sanderson gra...@vast.com wrote:
 
 Your GC settings would be helpful, though you can see guesstimate by 
 eyeballing (assuming settings are the same across all 4 images)
 
 Bursty load can be a big cause of old gen fragmentation (as small working set 
 objects tends to get spilled (promoted) along with memtable slabs which 
 aren’t flushed quickly enough). That said, empty fragmentation holes wouldn’t 
 show up as “used” in your graph, and that clearly looks like you are above 
your CMSInitiatingOccupancyFraction and CMS is running continuously, so they 
 probably aren’t the issue here.
 
 Other than trying a slightly larger heap to give you more head room, I’d also 
 suggest from eyeballing that you have probably let the JVM pick its own new 
 gen size, and I’d suggest it is too small. What to set it to really depends 
 on your workload, but you could try something in the 0.5gig range unless that 
 makes your young gen pauses too long. In that case (or indeed anyway) make 
 sure you also have the latest GC settings (e.g. 
 -XX:+CMSParallelInitialMarkEnabled -XX:+CMSEdenChunksRecordAlways) on newer 
 JVMs (to help the young gc pauses)
 
 On Nov 28, 2014, at 2:55 PM, Paulo Ricardo Motta Gomes 
 paulo.mo...@chaordicsystems.com mailto:paulo.mo...@chaordicsystems.com 
 wrote:
 
 Hello,
 
 This is a recurrent behavior of JVM GC in Cassandra that I never completely 
 understood: when a node is UP for many days (or even months), or receives a 
 very high load spike (3x-5x normal load), CMS GC pauses start becoming very 
 frequent and slow, causing periodic timeouts in Cassandra. Trying to run GC 
 manually doesn't free up memory. The only solution when a node reaches this 
 state is to restart the node.
 
 We restart the whole cluster every 1 or 2 months, to avoid machines getting 
 into this crazy state. We tried tuning GC size and parameters, different 
 cassandra versions (1.1, 1.2, 2.0), but this behavior keeps happening. More 
 recently, during black friday, we received about 5x our normal load, and 
some machines started presenting this behavior. Once again, we restarted the 
nodes and the GC behaved normally again.
 
 I'm attaching a few pictures comparing the heap of healthy and sick 
nodes: http://imgur.com/a/Tcr3w
 
 You can clearly notice some memory is actually reclaimed during GC in 
 healthy nodes, while in sick machines very little memory is reclaimed. Also, 
 since GC is executed more frequently in sick machines, it uses about 2x more 
 CPU than non-sick nodes.
 
 Have you ever observed this behavior in your cluster? Could this be related 
 to heap fragmentation? Would using the G1 collector help in this case? Any 
 GC tuning or monitoring advice to troubleshoot this issue?
 
 Any advice or pointers will be kindly appreciated.
 
 Cheers,
 
 -- 
 Paulo Motta
 
 Chaordic | Platform
 www.chaordic.com.br http://www.chaordic.com.br/
 +55 48 3232.3200
 





Re: Trying to build Cassandra for FreeBSD 10.1

2014-11-17 Thread graham sanderson
The only thing I can see from looking at the exception - I didn’t disassemble 
the code from hex - is that the “peer” value in the RefCountedMemory object is 
probably 0

Given that Unsafe.allocateMemory should not return 0 even on allocation failure 
(which should throw OOM) - though you should add a log statement to the Memory 
class to check that - I’d suggest logging to see if anyone is calling 
SSTableReader.releaseSummary, which could set the peer to 0

 On Nov 17, 2014, at 7:30 PM, Michael Shuler mich...@pbandjelly.org wrote:
 
 On 11/17/2014 07:19 PM, William Arbaugh wrote:
 I've successfully built 2.1.2 for FreeBSD, but the JVM crashes upon start-up.
 
 Here's the snippet from the top of the log file (attached)
 
 #
 # A fatal error has been detected by the Java Runtime Environment:
 #
 #  SIGSEGV (0xb) at pc=0x000802422655, pid=76732, tid=34384931840
 #
 
 Any hints on how to get 2.1.2 working on FreeBSD is appreciated.
 
 
 Not a very helpful hint, other than the fact you are not alone  :)
 
 https://issues.apache.org/jira/browse/CASSANDRA-8325
 
 -- 
 Michael





Re: What actually causing java.lang.OutOfMemoryError: unable to create new native thread

2014-11-10 Thread graham sanderson
First question: are you running a 32-bit or 64-bit JVM? On 32-bit you can easily 
run out of virtual address space for thread stacks.
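To make the arithmetic concrete, here is a small stand-alone sketch (mine, not 
from the original poster) that estimates how much virtual address space thread 
stacks alone would consume; on a 32-bit JVM the whole process has to fit in 
roughly 2-4 GB of address space, so thousands of threads at even 128k each 
(plus heap, code cache and native allocations) can exhaust it:

    import java.lang.management.ManagementFactory;

    public class ThreadStackEstimate {
        public static void main(String[] args) {
            // JVM data model: "32" or "64"
            String model = System.getProperty("sun.arch.data.model");
            int liveThreads = ManagementFactory.getThreadMXBean().getThreadCount();

            long stackBytes = 128 * 1024;      // assumes -Xss128k as in the post below
            long addressSpaceForStacks = (long) liveThreads * stackBytes;

            System.out.printf("JVM data model: %s-bit, live threads: %d%n", model, liveThreads);
            System.out.printf("Approx. address space used by thread stacks: %.1f MB%n",
                    addressSpaceForStacks / (1024.0 * 1024.0));
        }
    }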

 On Nov 10, 2014, at 8:25 AM, Jason Wee peich...@gmail.com wrote:
 
 Hello people, below is an extraction from cassandra system log.
 
 ERROR [Thread-273] 2012-04-10 16:33:18,328 AbstractCassandraDaemon.java (line 
 139) Fatal exception in thread Thread[Thread-273,5,main]
 java.lang.OutOfMemoryError: unable to create new native thread
 at java.lang.Thread.start0(Native Method)
 at java.lang.Thread.start(Thread.java:640)
 at 
 java.util.concurrent.ThreadPoolExecutor.addIfUnderMaximumPoolSize(ThreadPoolExecutor.java:727)
 at 
 java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:657)
 at 
 org.apache.cassandra.thrift.CustomTThreadPoolServer.serve(CustomTThreadPoolServer.java:104)
 at 
 org.apache.cassandra.thrift.CassandraDaemon$ThriftServer.run(CassandraDaemon.java:214)
 
 I investigated the call down to the Java native call: 
 http://hg.openjdk.java.net/jdk7/jdk7/hotspot/file/tip/src/share/vm/prims/jvm.cpp#l2698
 
   if (native_thread->osthread() == NULL) {
     // No one should hold a reference to the 'native_thread'.
     delete native_thread;
     if (JvmtiExport::should_post_resource_exhausted()) {
       JvmtiExport::post_resource_exhausted(
         JVMTI_RESOURCE_EXHAUSTED_OOM_ERROR | JVMTI_RESOURCE_EXHAUSTED_THREADS,
         "unable to create new native thread");
     }
     THROW_MSG(vmSymbols::java_lang_OutOfMemoryError(),
               "unable to create new native thread");
   }
 
 Question: is that out-of-memory error due to native OS memory or Java heap? 
 Stack size passed to the JVM is -Xss128k. Operating system limits: max user 
 processes 26, open files capped at 65536.
 
 Can any java/cpp expert pin point what JVMTI_RESOURCE_EXHAUSTED_OOM_ERROR and 
  JVMTI_RESOURCE_EXHAUSTED_THREADS means too?
 
 Thank you.
 
 Jason





Re: Why is one query 10 times slower than the other?

2014-11-05 Thread graham sanderson
In your “lookup_code” example “type” is not a clustering column, it is the 
partition key, and hence the first query only hits one partition.
The second query is a range slice across all possible keys, so the sub-ranges 
are farmed out to the nodes holding the data.
You are likely at CL_ONE, so it only needs response from one node for each 
sub-range… I guess it has decided (based on the snitch) that it is not 
unreasonable to share the query across the two nodes 
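
For what it's worth, a small DataStax Java driver sketch (driver 2.x assumed; 
keyspace/table names taken from this thread) showing the difference: the first 
statement is restricted by the partition key and can be served by a single 
replica at CL_ONE, while the second is an unrestricted range scan that gets 
split across the ring regardless of consistency level:

    import com.datastax.driver.core.*;

    public class PartitionVsRangeScan {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("mykeyspace");

            // Single-partition read: routed to a replica for the token of 'mylist'
            Statement byPartition = new SimpleStatement(
                    "SELECT * FROM lookup_code WHERE type = 'mylist'")
                    .setConsistencyLevel(ConsistencyLevel.ONE);

            // Unrestricted read: a range slice, farmed out across the cluster in sub-ranges
            Statement fullScan = new SimpleStatement("SELECT * FROM organisation")
                    .setConsistencyLevel(ConsistencyLevel.ONE);

            System.out.println(session.execute(byPartition).all().size() + " lookup rows");
            System.out.println(session.execute(fullScan).all().size() + " organisation rows");

            cluster.close();
        }
    }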

 On Nov 5, 2014, at 10:41 PM, Jacob Rhoden jacob.rho...@me.com wrote:
 
 Hi Guys,
 
 I have two cassandra 2.0.5 nodes, RF=2. When I do a:
 
 select * from table1 where clustercolumn=‘something'
 
 The trace indicates that it only needs to talk to one node, which I would 
 have expected. However when I do a:
 
 select * from table2
 
 Which is a small table with only has 20 rows in it, should be fully 
 replicated, and should be a much quicker query, trace indicates that 
 cassandra is talking to both nodes. This adds a 200ms to the query results, 
 and is not necessary for my application (this table might have an amendment 
 once per year if that), theres no real need to check both nodes for 
 consistency.
 
 At this point I’ve not altered anything to do with consistency level. Does 
 this mean that cassandra attempts to guess/infer what consistency level you 
 need depending on if your query includes a filter on a particular key or 
 clustering key?
 
 Thanks,
 Jacob
 
 
 CREATE KEYSPACE mykeyspace WITH replication = { 'class': 'SimpleStrategy', 
 'replication_factor': ‘2' };
 
 CREATE TABLE organisation (uuid uuid, name text, url text, PRIMARY KEY (uuid))
 
 CREATE TABLE lookup_code (type text, code text, name text, PRIMARY KEY 
 ((type), code)) 
 
 
 select * from lookup_code where type=‘mylist':
 
  activity                                                                   | timestamp    | source       | source_elapsed
 ----------------------------------------------------------------------------+--------------+--------------+----------------
  execute_cql3_query                                                         | 04:20:15,319 | 74.50.54.123 |              0
  Parsing select * from lookup_code where type='research_area' LIMIT 1;      | 04:20:15,319 | 74.50.54.123 |             64
  Preparing statement                                                        | 04:20:15,320 | 74.50.54.123 |            204
  Executing single-partition query on lookup_code                            | 04:20:15,320 | 74.50.54.123 |            849
  Acquiring sstable references                                               | 04:20:15,320 | 74.50.54.123 |            870
  Merging memtable tombstones                                                | 04:20:15,320 | 74.50.54.123 |            894
  Skipped 0/0 non-slice-intersecting sstables, included 0 due to tombstones  | 04:20:15,320 | 74.50.54.123 |            958
  Merging data from memtables and 0 sstables                                 | 04:20:15,320 | 74.50.54.123 |            976
  Read 168 live and 0 tombstoned cells                                       | 04:20:15,321 | 74.50.54.123 |           1412
  Request complete                                                           | 04:20:15,321 | 74.50.54.123 |           2043
 
 
 select * from organisation:
 
  activity                                                                   | timestamp    | source       | source_elapsed
 ----------------------------------------------------------------------------+--------------+--------------+----------------
  execute_cql3_query                                                         | 04:21:03,641 | 74.50.54.123 |              0
  Parsing select * from organisation LIMIT 1;                                | 04:21:03,641 | 74.50.54.123 |             68
  Preparing statement                                                        | 04:21:03,641 | 74.50.54.123 |            174
  Determining replicas to query                                              | 04:21:03,642 | 74.50.54.123 |            307
  Enqueuing request to /72.249.82.85                                         | 04:21:03,642 | 74.50.54.123 |           1034
  Sending message to /72.249.82.85                                           | 04:21:03,643 | 74.50.54.123 |           1402
  Message received from /74.50.54.123                                        | 04:21:03,644 | 72.249.82.85 |             47
  Executing seq scan across 0 sstables for [min(-9223372036854775808), min(-9223372036854775808)] | 04:21:03,644 | 72.249.82.85 | 461
  Read 1 live and 0 tombstoned cells                                         | 04:21:03,644 | 72.249.82.85 |            560
 

Re: Client-side compression, cassandra or both?

2014-11-03 Thread graham sanderson
I wouldn’t do both.
Unless a little server CPU or disk space is an issue (and you’d have to measure 
it - I imagine it is probably not significant, since as you say C* has more 
context, and hopefully most things can compress “0, “ repeated many times), I 
wouldn’t bother to compress yourself. Compression across the wire is good of 
course (client-side CPU is a wash, and server CPU we already mentioned anyway).

On a side note, perhaps your object model should address the redundancy, though 
of course this is perhaps equivalent to the complexity of doing client side 
compression, IDK.

We do have one table where we keep compressed blobs, but that is because those 
are natural from an application perspective, and so we just turn off C* table 
compression for those (there isn’t much other data there).

Note, I haven’t been tracking it recently, but certainly in the past the 
compression code path on the C* had to do more data copies, but this is not 
likely significant unless your case is special. I believe this has been/will be 
improved in 2.1 or 3.
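
If you do want wire compression between client and node (DuyHai's point 2 
below), a minimal DataStax Java driver sketch (driver 2.x assumed; remember to 
put the lz4 or snappy jar on the client classpath):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ProtocolOptions;

    public class CompressedClient {
        public static void main(String[] args) {
            // Transport-level compression only; table data on disk is still compressed
            // (LZ4 by default) independently of this setting.
            Cluster cluster = Cluster.builder()
                    .addContactPoint("127.0.0.1")
                    .withCompression(ProtocolOptions.Compression.LZ4)
                    .build();

            System.out.println("Connected with compression: "
                    + cluster.getConfiguration().getProtocolOptions().getCompression());
            cluster.close();
        }
    }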

 On Nov 3, 2014, at 9:40 AM, DuyHai Doan doanduy...@gmail.com wrote:
 
 Hello Robin
 
  You have many options for compression in C*:
 
 1) Serialized in bytes instead of JSON, to save a lot of space due to String 
 encoding. Of course the data will be opaque and not human readable
 
 2) Activate client-node data compression. In this case, do not forget to ship 
 LZ4 or SNAPPY dependency on the client side. 
 
 On the server-side, data compression is active by default using LZ4 when 
 you're creating a new table so there is pretty much nothing to do.
 
 It's up to you to consider whether the compression ratio difference between 
 Gzip and LZ4 makes it worth relying on C* compression.
 
 
 Regards
 
 
 On Mon, Nov 3, 2014 at 3:51 PM, Robin Verlangen ro...@us2.nl 
 mailto:ro...@us2.nl wrote:
 Hi there,
 
 We're working on a project which is going to store a lot of JSON objects in 
 Cassandra. A large piece of this (90%) consists of an array of integers, of 
 which in a lot of cases there are a bunch of zeroes. 
 
 The average JSON is 4KB in size, and once GZIPped (default compression) just 
 under 100 bytes. 
 
 My question is, should we compress client-side (literally converting JSON 
 string to compressed gzip bytes), let Cassandra do the work, or do both?
 
 From my point of view I think Cassandra would be better, as it could compress 
 beyond a single value, using large blocks within a row / SSTable.
 
 Thank you in advance for your help.
 
 Best regards, 
 
 Robin Verlangen
 Chief Data Architect
 
 W http://www.robinverlangen.nl http://www.robinverlangen.nl/
 E ro...@us2.nl mailto:ro...@us2.nl
 
  http://goo.gl/Lt7BC
 What is CloudPelican? http://goo.gl/HkB3D
 
 Disclaimer: The information contained in this message and attachments is 
 intended solely for the attention and use of the named addressee and may be 
 confidential. If you are not the intended recipient, you are reminded that 
 the information remains the property of the sender. You must not use, 
 disclose, distribute, copy, print or rely on this e-mail. If you have 
 received this message in error, please contact the sender immediately and 
 irrevocably delete this message and any copies.
 





Re: Intermittent long application pauses on nodes

2014-10-31 Thread graham sanderson
I have to admit that I haven’t tried the SafepointTimeout (I just noticed that 
it was actually a production VM option in the JVM code, after my initial 
suggestions below for debugging without it).

There doesn’t seem to be an obvious bug in SafepointTimeout, though I may not 
be looking at the same version of the JVM source. That said there IS some code 
executed after waiting (up to the timeout) for threads to reach safe point 
before it actually records the sync time, so maybe that code is slow for some 
reason. If you were running with TraceSafepoint you’d see this (but that 
requires a debug build of the JVM) because there is an extra trace statement 
here.
 
I’d either try some suggestions below (though if what I said above is the case 
they may not help you as much since the other threads may be well behaved, 
though they may still give you some insight), but firstly, for sure, I’d try to 
see if SafepointTimeout is working at all by setting a ridiculously low timeout 
delay and seeing if you catch anything.
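
To check whether the mechanism fires at all, something like this in 
cassandra-env.sh (a deliberately absurd delay, for a test node only, not 
production) should make it print offending threads on virtually every safepoint 
if it works:

    JVM_OPTS="$JVM_OPTS -XX:+SafepointTimeout"
    JVM_OPTS="$JVM_OPTS -XX:SafepointTimeoutDelay=5"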

 On Oct 31, 2014, at 2:42 PM, Dan van Kley dvank...@salesforce.com wrote:
 
 Well I tried the SafepointTimeout option, but unfortunately it seems like the 
 long safepoint syncs don't actually trigger the SafepointTimeout mechanism, 
 so we didn't get any logs on it. It's possible I'm just doing it wrong, I 
 used the following options:
 
 JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions -XX:LogFile=/var/log/cassandra/stdout.log -XX:+LogVMOutput"
 JVM_OPTS="$JVM_OPTS -XX:+PrintSafepointStatistics"
 JVM_OPTS="$JVM_OPTS -XX:PrintSafepointStatisticsCount=1"
 JVM_OPTS="$JVM_OPTS -XX:SafepointTimeoutDelay=4000"
 JVM_OPTS="$JVM_OPTS -XX:+SafepointTimeout"
 
 and saw the safepoint logging as usual in that stdout.log file, but no 
 timeout logging in either that file or the GC log when safepoint syncs 
 exceeded the timeout. It also seems possible that SafepointTimeout doesn't 
 work on long syncs, see 
 http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2013-April/006945.html.
 
 That being the case, any other ideas or suggestions would be appreciated. 
 Thanks!
 
 On Mon, Oct 27, 2014 at 9:44 AM, Dan van Kley dvank...@salesforce.com 
 mailto:dvank...@salesforce.com wrote:
 Excellent, thanks for the tips, Graham. I'll give SafepointTimeout a try and 
 see if that gives us anything to act on.
 
 On Fri, Oct 24, 2014 at 3:52 PM, graham sanderson gra...@vast.com 
 mailto:gra...@vast.com wrote:
 And -XX:SafepointTimeoutDelay=xxx
 
to set how long before it dumps output (defaults to 10000 ms, I believe)…
 
 Note it doesn’t actually timeout by default, it just prints the problematic 
 threads after that time and keeps on waiting
 
 On Oct 24, 2014, at 2:44 PM, graham sanderson gra...@vast.com 
 mailto:gra...@vast.com wrote:
 
 Actually - there is 
 
 -XX:+SafepointTimeout
 
 which will print out offending threads (assuming you reach a 10 second 
 pause)…
 
 That is probably your best bet.
 
 On Oct 24, 2014, at 2:38 PM, graham sanderson gra...@vast.com 
 mailto:gra...@vast.com wrote:
 
 This certainly sounds like a JVM bug.
 
 We are running C* 2.0.9 on pretty high end machines with pretty large 
 heaps, and don’t seem to have seen this (note we are on 7u67, so that might 
 be an interesting data point, though since the old thread predated that 
 probably not)
 
 1) From the app/java side, I’d obviously see if you can identify anything 
 which always coincides with this - repair, compaction etc
 2) From the VM side (given that this as Benedict mentioned) some threads 
 are taking a long time to rendezvous at the safe point, and it is probably 
 not application threads, I’d look what GC threads, compiler threads etc 
 might be doing. As mentioned it shouldn’t be anything to do with operations 
 which run at a safe point anyway (e.g. scavenge)
 a) So look at what CMS is doing at the time and see if you can correlate
 b) Check Oracle for related bugs - didn’t obviously see any, but there 
 have been some complaints related to compilation and safe points
 c) Add any compilation tracing you can
 d) Kind of important here - see if you can figure out via dtrace, 
 system tap, gdb or whatever, what the threads are doing when this happens. 
 Sadly it doesn’t look like you can figure out when this is happening (until 
 afterwards) unless you have access to a debug JVM build (and can turn on 
 -XX:+TraceSafepoint and look for a safe point start without a corresponding 
 update within a time period) - if you don’t have access to that, I guess 
 you could try and get a dump every 2-3 seconds (you should catch a 9 second 
 pause eventually!)
 
 On Oct 24, 2014, at 12:35 PM, Dan van Kley dvank...@salesforce.com 
 mailto:dvank...@salesforce.com wrote:
 
 I'm also curious to know if this was ever resolved or if there's any other 
 recommended steps to take to continue to track it down. I'm seeing the 
 same issue in our

Re: Intermittent long application pauses on nodes

2014-10-24 Thread graham sanderson
Actually - there is 

-XX:+SafepointTimeout

which will print out offending threads (assuming you reach a 10 second pause)…

That is probably your best bet.

 On Oct 24, 2014, at 2:38 PM, graham sanderson gra...@vast.com wrote:
 
 This certainly sounds like a JVM bug.
 
 We are running C* 2.0.9 on pretty high end machines with pretty large heaps, 
 and don’t seem to have seen this (note we are on 7u67, so that might be an 
 interesting data point, though since the old thread predated that probably 
 not)
 
 1) From the app/java side, I’d obviously see if you can identify anything 
 which always coincides with this - repair, compaction etc
 2) From the VM side (given that this as Benedict mentioned) some threads are 
 taking a long time to rendezvous at the safe point, and it is probably not 
 application threads, I’d look what GC threads, compiler threads etc might be 
 doing. As mentioned it shouldn’t be anything to do with operations which run 
 at a safe point anyway (e.g. scavenge)
   a) So look at what CMS is doing at the time and see if you can correlate
   b) Check Oracle for related bugs - didn’t obviously see any, but there 
 have been some complaints related to compilation and safe points
   c) Add any compilation tracing you can
   d) Kind of important here - see if you can figure out via dtrace, 
 system tap, gdb or whatever, what the threads are doing when this happens. 
 Sadly it doesn’t look like you can figure out when this is happening (until 
 afterwards) unless you have access to a debug JVM build (and can turn on 
 -XX:+TraceSafepoint and look for a safe point start without a corresponding 
 update within a time period) - if you don’t have access to that, I guess you 
 could try and get a dump every 2-3 seconds (you should catch a 9 second pause 
 eventually!)
 
 On Oct 24, 2014, at 12:35 PM, Dan van Kley dvank...@salesforce.com 
 mailto:dvank...@salesforce.com wrote:
 
 I'm also curious to know if this was ever resolved or if there's any other 
 recommended steps to take to continue to track it down. I'm seeing the same 
 issue in our production cluster, which is running Cassandra 2.0.10 and JVM 
 1.7u71, using the CMS collector. Just as described above, the issue is long 
“Total time for which application threads were stopped” pauses that are not 
 a direct result of GC pauses (ParNew, initial mark or remark). When I 
 enabled the safepoint logging I saw the same result, long sync pause times 
 with short spin and block times, usually with the RevokeBias description. 
 We're seeing pause times sometimes in excess of 10 seconds, so it's a pretty 
 debilitating issue. Our machines are not swapping (or even close to it) or 
 having other load issues when these pauses occur. Any ideas would be very 
 appreciated. Thanks!
 





Re: Intermittent long application pauses on nodes

2014-10-24 Thread graham sanderson
And -XX:SafepointTimeoutDelay=xxx

to set how long before it dumps output (defaults to 10000 ms, I believe)…

Note it doesn’t actually timeout by default, it just prints the problematic 
threads after that time and keeps on waiting

 On Oct 24, 2014, at 2:44 PM, graham sanderson gra...@vast.com wrote:
 
 Actually - there is 
 
 -XX:+SafepointTimeout
 
 which will print out offending threads (assuming you reach a 10 second pause)…
 
 That is probably your best bet.
 
 On Oct 24, 2014, at 2:38 PM, graham sanderson gra...@vast.com 
 mailto:gra...@vast.com wrote:
 
 This certainly sounds like a JVM bug.
 
 We are running C* 2.0.9 on pretty high end machines with pretty large heaps, 
 and don’t seem to have seen this (note we are on 7u67, so that might be an 
 interesting data point, though since the old thread predated that probably 
 not)
 
 1) From the app/java side, I’d obviously see if you can identify anything 
 which always coincides with this - repair, compaction etc
 2) From the VM side (given that this as Benedict mentioned) some threads are 
 taking a long time to rendezvous at the safe point, and it is probably not 
 application threads, I’d look what GC threads, compiler threads etc might be 
 doing. As mentioned it shouldn’t be anything to do with operations which run 
 at a safe point anyway (e.g. scavenge)
  a) So look at what CMS is doing at the time and see if you can correlate
  b) Check Oracle for related bugs - didn’t obviously see any, but there 
 have been some complaints related to compilation and safe points
  c) Add any compilation tracing you can
  d) Kind of important here - see if you can figure out via dtrace, 
 system tap, gdb or whatever, what the threads are doing when this happens. 
 Sadly it doesn’t look like you can figure out when this is happening (until 
 afterwards) unless you have access to a debug JVM build (and can turn on 
 -XX:+TraceSafepoint and look for a safe point start without a corresponding 
 update within a time period) - if you don’t have access to that, I guess you 
 could try and get a dump every 2-3 seconds (you should catch a 9 second 
 pause eventually!)
 
 On Oct 24, 2014, at 12:35 PM, Dan van Kley dvank...@salesforce.com 
 mailto:dvank...@salesforce.com wrote:
 
 I'm also curious to know if this was ever resolved or if there's any other 
 recommended steps to take to continue to track it down. I'm seeing the same 
 issue in our production cluster, which is running Cassandra 2.0.10 and JVM 
 1.7u71, using the CMS collector. Just as described above, the issue is long 
 “Total time for which application threads were stopped” pauses that are not 
 a direct result of GC pauses (ParNew, initial mark or remark). When I 
 enabled the safepoint logging I saw the same result, long sync pause 
 times with short spin and block times, usually with the RevokeBias 
 description. We're seeing pause times sometimes in excess of 10 seconds, so 
 it's a pretty debilitating issue. Our machines are not swapping (or even 
 close to it) or having other load issues when these pauses occur. Any ideas 
 would be very appreciated. Thanks!
 
 





Re: LOCAL_* consistency levels

2014-10-14 Thread graham sanderson
There were some versions of C* that didn’t allow you to use LOCAL_* with a 
single-DC NetworkTopologyStrategy, or with SimpleStrategy.

https://issues.apache.org/jira/browse/CASSANDRA-6238 I think

You should use a NetworkTopologyStrategy with one DC for now.
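
A hedged sketch of what that looks like through the Java driver (the keyspace 
name is invented, and 'datacenter1' is the SimpleSnitch default DC name; 
substitute whatever your snitch reports):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class SingleDcNts {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect();

            // NetworkTopologyStrategy with a single DC places replicas much like
            // SimpleStrategy would, but LOCAL_ONE / LOCAL_QUORUM work against it.
            session.execute(
                "CREATE KEYSPACE IF NOT EXISTS myks WITH replication = "
                + "{'class': 'NetworkTopologyStrategy', 'datacenter1': 2}");

            cluster.close();
        }
    }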

On Oct 14, 2014, at 7:39 AM, Robert Wille rwi...@fold3.com wrote:

 I’m wondering if there’s a best practice for an annoyance I’ve come across.
 
 Currently all my environments (dev, staging and live) have a single DC. In 
 the future my live environment will most likely have a second DC. When that 
 happens, I’ll want to use LOCAL_* consistency levels. However, if I write my 
 code with LOCAL_* consistency levels, an exception is thrown. I’ve forgotten 
 the exact verbiage, but it’s something about having a NetworkTopologyStrategy 
 that doesn’t support local consistency levels. I don’t really want to change 
 all my queries when I have a second DC, nor do I want to check my environment 
 for every query. Is there a nice way to use LOCAL_* consistency levels and 
 have Cassandra do the appropriate thing when there is a single DC?
 
 Thanks in advance
 
 Robert
 





Re: describe tables… and vertical formatting?

2014-10-14 Thread graham sanderson
Ha oops - typo on my part

On Oct 14, 2014, at 10:55 AM, Tyler Hobbs ty...@datastax.com wrote:

 You want this:
 
 select keyspace_name, columnfamily_name from system.schema_columnfamilies;
 
 On Sun, Oct 12, 2014 at 5:16 PM, Kevin Burton bur...@spinn3r.com wrote:
 huh.  That sort of works.  The problem now is that there are multiple entries 
 per table...
 
 On Sun, Oct 12, 2014 at 10:39 AM, graham sanderson gra...@vast.com wrote:
 select keyspace_name, columnfamily_name from system.schema_columns;
 ?
 
 On Oct 12, 2014, at 10:29 AM, Kevin Burton bur...@spinn3r.com wrote:
 
 It seems annoying that I can’t get “describe tables” to vertical.  
 
 maybe there’s some option I’m missing?
 
 Kevin
 
 -- 
 
 Founder/CEO Spinn3r.com
 Location: San Francisco, CA
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 
 
 
 
 
 
 -- 
 
 Founder/CEO Spinn3r.com
 Location: San Francisco, CA
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 
 
 
 
 
 -- 
 Tyler Hobbs
 DataStax





Re: describe tables… and vertical formatting?

2014-10-12 Thread graham sanderson
select keyspace_name, columnfamily_name from system.schema_columns;
?

On Oct 12, 2014, at 10:29 AM, Kevin Burton bur...@spinn3r.com wrote:

 It seems annoying that I can’t get “describe tables” to vertical.  
 
 maybe there’s some option I’m missing?
 
 Kevin
 
 -- 
 
 Founder/CEO Spinn3r.com
 Location: San Francisco, CA
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 
 





Re: Bitmaps

2014-10-06 Thread graham sanderson
You certainly have plenty of freedom to trade off size vs access granularity 
using multiple blobs. It really depends on how mutable the data is, how you 
intend to read it, whether it is highly sparse and or highly dense (in which 
case you perhaps don’t need to store every bit) etc.

On Oct 6, 2014, at 3:56 PM, DuyHai Doan doanduy...@gmail.com wrote:

 Isn't there a video of Ooyala at some past Cassandra Summit demonstrating 
 usage of Cassandra for text search using trigrams? AFAIK they were storing a 
 kind of bitmap to perform OR and AND operations on trigrams
 
 On Mon, Oct 6, 2014 at 10:53 PM, Russell Bradberry rbradbe...@gmail.com 
 wrote:
 I highly recommend against storing data structures like this in C*. That 
 really isn't it's sweet spot.  For instance, if you were to use the blob type 
 which will give you the smallest size, you are still looking at a cell size 
 of (90,000,000/8/1024) = 10,986 or over 10MB in size, which is prohibitively 
 large.
 
 Additionally, there is no way to modify the bitmap in place, you would have 
 to read the entire structure out and write it back in.
 
 You could store one bit per cell, but that would essentially defeat the 
 purpose of the bitmap's compact size. 
 
 On Mon, Oct 6, 2014 at 4:46 PM, Eduardo Cusa 
 eduardo.c...@usmediaconsulting.com wrote:
 Hi Guys, what data type recommend to store bitmaps?
 I am planning to store maps of 90,000,000 length and then query by key.
 
 Example:
 
 key : 22_ES
 bitmap : 10101101010111010101011
 
 
 
 Thanks 
 Eduardo
 
 
 
 





Re: best practice for waiting for schema changes to propagate

2014-09-30 Thread graham sanderson
Also be aware of https://issues.apache.org/jira/browse/CASSANDRA-7734 if you 
are using C* 2.0.6+ (2.0.6 introduced a change that can sometimes causes 
initial schema propagation not to happen, introducing potentially long delays 
until some other code path repairs it later)
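
Building on Ben's suggestion below to compare schema versions, a minimal polling 
sketch with the DataStax Java driver (2.x assumed); it simply queries 
system.local and system.peers until every reachable node reports the same 
schema_version or a timeout is hit. Note it only reflects what the coordinator 
you are connected to believes about its peers via gossip:

    import com.datastax.driver.core.*;
    import java.util.HashSet;
    import java.util.Set;
    import java.util.UUID;

    public class WaitForSchemaAgreement {
        static boolean waitForAgreement(Session session, long timeoutMs) throws InterruptedException {
            long deadline = System.currentTimeMillis() + timeoutMs;
            while (System.currentTimeMillis() < deadline) {
                Set<UUID> versions = new HashSet<UUID>();
                versions.add(session.execute(
                        "SELECT schema_version FROM system.local").one().getUUID("schema_version"));
                for (Row peer : session.execute("SELECT schema_version FROM system.peers")) {
                    UUID v = peer.getUUID("schema_version");
                    if (v != null) versions.add(v);   // null for peers not yet heard from
                }
                if (versions.size() == 1)
                    return true;                      // all reachable nodes agree
                Thread.sleep(500);
            }
            return false;
        }

        public static void main(String[] args) throws InterruptedException {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect();
            System.out.println("schema agreement: " + waitForAgreement(session, 30000));
            cluster.close();
        }
    }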

On Sep 30, 2014, at 1:54 AM, Ben Bromhead b...@instaclustr.com wrote:

 The system.peers table is a copy of some gossip info the node has 
 stored, including the schema version. You should query this and wait until 
 all schema versions have converged.
 
 http://www.datastax.com/documentation/cql/3.0/cql/cql_using/use_sys_tab_cluster_t.html
 
 http://www.datastax.com/dev/blog/the-data-dictionary-in-cassandra-1-2
 
 As ensuring that the driver keeps talking to the node you made the schema 
 change on I would ask the drivers specific mailing list / IRC:
 
 MAILING LIST: 
 https://groups.google.com/a/lists.datastax.com/forum/#!forum/java-driver-user
 IRC: #datastax-drivers on irc.freenode.net
 
 
 On 30 September 2014 10:16, Clint Kelly clint.ke...@gmail.com wrote:
 Hi all,
 
 I often have problems with code that I write that uses the DataStax Java 
 driver to create / modify a keyspace or table and then soon after reads the 
 metadata for the keyspace to verify that whatever changes I made the keyspace 
 or table are complete.
 
 As an example, I may create a table called `myTableName` and then very soon 
 after do something like:
 
 assert(session
   .getCluster()
   .getMetaData()
   .getKeyspace(myKeyspaceName)
   .getTable(myTableName) != null)
 
 I assume this fails sometimes because the default round-robin load balancing 
 policy for the Java driver will send my create-table request to one node and 
 the metadata read to another, and because it takes some time for the table 
 creation to propagate across all of the nodes in my cluster.
 
 What is the best way to deal with this problem?  Is there a standard way to 
 wait for schema changes to propagate?
 
 Best regards,
 Clint
 
 
 
 -- 
 Ben Bromhead
 
 Instaclustr | www.instaclustr.com | @instaclustr | +61 415 936 359
 





Re: Unable to query with token range.. unable to make long from ‘...'

2014-09-28 Thread graham sanderson
It is expecting a 64-bit value … the Murmur3 partitioner uses 64-bit long tokens… 
where did you get your 128-bit value from, and what partitioner are you using?

On Sep 28, 2014, at 1:39 PM, Kevin Burton bur...@spinn3r.com wrote:

 I’m trying to query an entire table in parallel by splitting it up in token 
 ranges.
 
 However, it’s not working because I get this:
 
 cqlsh:blogindex  select token(hashcode), hashcode from source where 
 token(hashcode) = 0 and token(hashcode) = 
 17014118346046923173168730371588410572 limit 10;
 Bad Request: unable to make long from '17014118346046923173168730371588410572'
 
 … so I’m trying to figure out what’s going on here.
 
 Is there some magic I have to use to force the string representation of the 
 128 bit long into a token pointer?
 
 -- 
 
 Founder/CEO Spinn3r.com
 Location: San Francisco, CA
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 
 





Re: Unable to query with token range.. unable to make long from ‘...'

2014-09-28 Thread graham sanderson
Looks like you are looking at old docs (pre-Murmur3 partitioner). The latest 
are here (I don’t think it has changed in 2.1 from 2.0.x):

http://www.datastax.com/documentation/cassandra/2.1/cassandra/configuration/configGenTokens_c.html

Murmur3 is definitely 64 bits
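
For reference, a rough sketch of splitting the full Murmur3 token range 
(Long.MIN_VALUE to Long.MAX_VALUE) into contiguous sub-ranges and generating 
the corresponding CQL; the table/column names follow Kevin's example below and 
the split count is arbitrary — each generated query can then be run in parallel:

    import java.math.BigInteger;

    public class TokenRangeSplitter {
        public static void main(String[] args) {
            BigInteger min = BigInteger.valueOf(Long.MIN_VALUE);
            BigInteger max = BigInteger.valueOf(Long.MAX_VALUE);
            int splits = 8;   // arbitrary; one query per sub-range

            BigInteger span = max.subtract(min);
            BigInteger step = span.divide(BigInteger.valueOf(splits));

            BigInteger start = min;
            for (int i = 0; i < splits; i++) {
                // last range ends exactly at max so rounding doesn't lose the tail
                BigInteger end = (i == splits - 1) ? max : start.add(step);
                String lowerOp = (i == 0) ? ">=" : ">";   // include the minimum token once
                System.out.printf(
                    "SELECT token(hashcode), hashcode FROM source WHERE token(hashcode) %s %s AND token(hashcode) <= %s;%n",
                    lowerOp, start, end);
                start = end;
            }
        }
    }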

On Sep 28, 2014, at 5:55 PM, Kevin Burton bur...@spinn3r.com wrote:

 Hm.. is it 64 bits or 128 bits?
 
 I’m using Murmur3Partitioner
 
 … 
 
 I can’t find any documentation on it (as usual.. ha)
 
 This says:
 
 http://www.datastax.com/docs/1.1/initialize/token_generation
 
  The tokens assigned to your nodes need to be distributed throughout the 
  entire possible range of tokens (0 to 2^127 - 1)
 
 so it would need to be 2^63 -1 or 2^127-1
 
 
 
 On Sun, Sep 28, 2014 at 1:19 PM, graham sanderson gra...@vast.com wrote:
 It is expecting a 64-bit value … the Murmur3 partitioner uses 64-bit long tokens… 
 where did you get your 128-bit value from, and what partitioner are you using?
 
 On Sep 28, 2014, at 1:39 PM, Kevin Burton bur...@spinn3r.com wrote:
 
 I’m trying to query an entire table in parallel by splitting it up in token 
 ranges.
 
 However, it’s not working because I get this:
 
 cqlsh:blogindex  select token(hashcode), hashcode from source where 
 token(hashcode) = 0 and token(hashcode) = 
 17014118346046923173168730371588410572 limit 10;
 Bad Request: unable to make long from 
 '17014118346046923173168730371588410572'
 
 … so I’m trying to figure out what’s going on here.
 
 Is there some magic I have to use to force the string representation of the 
 128 bit long into a token pointer?
 
 -- 
 
 Founder/CEO Spinn3r.com
 Location: San Francisco, CA
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 
 
 
 
 
 
 -- 
 
 Founder/CEO Spinn3r.com
 Location: San Francisco, CA
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 
 





Re: java.lang.OutOfMemoryError: unable to create new native thread

2014-09-17 Thread graham sanderson
Are you running on a 32 bit JVM?

On Sep 17, 2014, at 9:43 AM, Yatong Zhang bluefl...@gmail.com wrote:

 Hi there,
 
 I am using leveled compaction strategy and have many sstable files. The error 
 was during the startup, so any idea about this?
  
 ERROR [FlushWriter:4] 2014-09-17 22:36:59,383 CassandraDaemon.java (line 199) 
 Exception in thread Thread[FlushWriter:4,5,main]
 java.lang.OutOfMemoryError: unable to create new native thread
 at java.lang.Thread.start0(Native Method)
 at java.lang.Thread.start(Thread.java:693)
 at 
 java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:949)
 at 
 java.util.concurrent.ThreadPoolExecutor.processWorkerExit(ThreadPoolExecutor.java:1017)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1163)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:724)
 ERROR [FlushWriter:2] 2014-09-17 22:36:59,472 CassandraDaemon.java (line 199) 
 Exception in thread Thread[FlushWriter:2,5,main]
 FSReadError in 
 /data5/cass/system/compactions_in_progress/system-compactions_in_progress-jb-23-Index.db
 at 
 org.apache.cassandra.io.util.MmappedSegmentedFile$Builder.createSegments(MmappedSegmentedFile.java:200)
 at 
 org.apache.cassandra.io.util.MmappedSegmentedFile$Builder.complete(MmappedSegmentedFile.java:168)
 at 
 org.apache.cassandra.io.sstable.SSTableWriter.closeAndOpenReader(SSTableWriter.java:334)
 at 
 org.apache.cassandra.io.sstable.SSTableWriter.closeAndOpenReader(SSTableWriter.java:324)
 at 
 org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:394)
 at 
 org.apache.cassandra.db.Memtable$FlushRunnable.runWith(Memtable.java:342)
 at 
 org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48)
 at 
 org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:724)
 Caused by: java.io.IOException: Map failed
 at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:849)
 at 
 org.apache.cassandra.io.util.MmappedSegmentedFile$Builder.createSegments(MmappedSegmentedFile.java:192)
 ... 10 more
 Caused by: java.lang.OutOfMemoryError: Map failed
 at sun.nio.ch.FileChannelImpl.map0(Native Method)
 at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:846)
 ... 11 more
 





Re: Storage: upsert vs. delete + insert

2014-09-10 Thread graham sanderson
A delete inserts a tombstone, which is likely smaller than the original record 
(though it still (currently) carries the overhead of the full key/column name). 
The data for the insert after a delete would be identical to the data if you 
had just inserted/updated.

No real benefit I can think of for doing the delete first.

On Sep 10, 2014, at 2:25 PM, olek.stas...@gmail.com wrote:

 I think so.
 this is how i see it:
 on the very beginning you have such line in datafile:
 {key: [col_name, col_value, date_of_last_change]} //something similar,
 i don't remember now
 
 after delete you're adding line:
 {key:[col_name, last_col_value, date_of_delete, 'd']} //this d
 indicates that field is deleted
 after insert the following line is added:
 {key: [col_name, col_value, date_of_insert]}
 so delete and then insert generates 2 lines in datafile.
 
 after pure insert (upsert in fact) you will have only one line
 {key: [col_name, col_value, date_of_insert]}
 So, summarizing, in second scenario you have only one line, in first: two.
 I hope my post is correct ;)
 regards,
 Olek
 
 2014-09-10 18:56 GMT+02:00 Michal Budzyn michalbud...@gmail.com:
 Would the factor before compaction be always 2 ?
 
 On Wed, Sep 10, 2014 at 6:38 PM, olek.stas...@gmail.com
 olek.stas...@gmail.com wrote:
 
 IMHO, delete then insert will take two times more disk space than
 single insert. But after compaction the difference will disappear.
 This was true in version prior to 2.0, but it should still work this
 way. But maybe someone will correct me, if i'm wrong.
 Cheers,
 Olek
 
 2014-09-10 18:30 GMT+02:00 Michal Budzyn michalbud...@gmail.com:
 One insert would be much better e.g. for performance and network
 latency.
 I wanted to know if there is a significant difference (apart from
 additional
 commit log entry) in the used storage between these 2 use cases.
 
 
 





Re: Storage: upsert vs. delete + insert

2014-09-10 Thread graham sanderson
agreed

On Sep 10, 2014, at 3:27 PM, olek.stas...@gmail.com wrote:

 You're right, there is no data in tombstone, only a column name. So
 there is only small overhead of disk size after delete. But i must
 agree with post above, it's pointless in deleting prior to inserting.
 Moreover, it needs one op more to compute resulting row.
 cheers,
 Olek
 
 2014-09-10 22:18 GMT+02:00 graham sanderson gra...@vast.com:
 delete inserts a tombstone which is likely smaller than the original record 
 (though still (currently) has overhead of cost for full key/column name
 the data for the insert after a delete would be identical to the data if you 
 just inserted/updated
 
 no real benefit I can think of for doing the delete first.
 
 On Sep 10, 2014, at 2:25 PM, olek.stas...@gmail.com wrote:
 
 I think so.
 this is how i see it:
 on the very beginning you have such line in datafile:
 {key: [col_name, col_value, date_of_last_change]} //something similar,
 i don't remember now
 
 after delete you're adding line:
 {key:[col_name, last_col_value, date_of_delete, 'd']} //this d
 indicates that field is deleted
 after insert the following line is added:
 {key: [col_name, col_value, date_of_insert]}
 so delete and then insert generates 2 lines in datafile.
 
 after pure insert (upsert in fact) you will have only one line
 {key: [col_name, col_value, date_of_insert]}
 So, summarizing, in second scenario you have only one line, in first: two.
 I hope my post is correct ;)
 regards,
 Olek
 
 2014-09-10 18:56 GMT+02:00 Michal Budzyn michalbud...@gmail.com:
 Would the factor before compaction be always 2 ?
 
 On Wed, Sep 10, 2014 at 6:38 PM, olek.stas...@gmail.com
 olek.stas...@gmail.com wrote:
 
 IMHO, delete then insert will take two times more disk space then
 single insert. But after compaction the difference will disappear.
 This was true in version prior to 2.0, but it should still work this
 way. But maybe someone will correct me, if i'm wrong.
 Cheers,
 Olek
 
 2014-09-10 18:30 GMT+02:00 Michal Budzyn michalbud...@gmail.com:
 One insert would be much better e.g. for performance and network
 latency.
 I wanted to know if there is a significant difference (apart from
 additional
 commit log entry) in the used storage between these 2 use cases.
 
 
 
 





Re: update static column using partition key

2014-09-07 Thread graham sanderson
Presumably you meant unread_ids to be a static column (it isn’t in your table 
definition)

On Sep 7, 2014, at 10:14 AM, tommaso barbugli tbarbu...@gmail.com wrote:

 Hi,
 I am trying to use a couple of static columns; I am using cassandra 2.0.7 and 
 when I try to set a value using the partition key only, I get a primary key 
 incomplete error.
 
 Here is the schema and the query with the error I get from cqlsh
 
 CREATE TABLE shard75 (
   group_id ascii,
   event_id timeuuid,
   group ascii,
    unread_ids set<timeuuid>,
    unseen_ids set<timeuuid>,
   PRIMARY KEY (group_id, event_id)
 ) WITH CLUSTERING ORDER BY (event_id DESC) AND
   bloom_filter_fp_chance=0.10 AND
   caching='KEYS_ONLY' AND
   comment='' AND
   dclocal_read_repair_chance=0.00 AND
   gc_grace_seconds=864000 AND
   index_interval=128 AND
   read_repair_chance=0.10 AND
   populate_io_cache_on_flush='false' AND
   default_time_to_live=0 AND
   speculative_retry='99.0PERCENTILE' AND
   memtable_flush_period_in_ms=0 AND
   compaction={'sstable_size_in_mb': '64', 'tombstone_threshold': '0.2', 
 'class': 'LeveledCompactionStrategy'} AND
   compression={'sstable_compression': 'LZ4Compressor'};
  
  
 UPDATE shard75 set unread_ids = { } WHERE group_id = 'asd';
  
 Bad Request: Missing mandatory PRIMARY KEY part event_id
 
 
 Am I doing something unsupported here? I am trying to follow the examples in 
 the release docs (from cassandra 2.0.6)
 
 Thank you,
 Tommaso





Re: update static column using partition key

2014-09-07 Thread graham sanderson
Note also (though you are likely not hitting them) there were a bunch of static 
column related edge cases fixed in 2.0.10

On Sep 7, 2014, at 1:18 PM, graham sanderson gra...@vast.com wrote:

 Presumably you meant unread_ids to be a static column (it isn’t in your table 
 definition)
 
 On Sep 7, 2014, at 10:14 AM, tommaso barbugli tbarbu...@gmail.com wrote:
 
 Hi,
 I am trying to use a couple of static columns; I am using cassandra 2.0.7 
 and when I try to set a value using the partition key only, I get a primary 
 key incomplete error.
 
 Here is the schema and the query with the error I get from cqlsh
 
 CREATE TABLE shard75 (
   group_id ascii,
   event_id timeuuid,
   group ascii,
    unread_ids set<timeuuid>,
    unseen_ids set<timeuuid>,
   PRIMARY KEY (group_id, event_id)
 ) WITH CLUSTERING ORDER BY (event_id DESC) AND
   bloom_filter_fp_chance=0.10 AND
   caching='KEYS_ONLY' AND
   comment='' AND
   dclocal_read_repair_chance=0.00 AND
   gc_grace_seconds=864000 AND
   index_interval=128 AND
   read_repair_chance=0.10 AND
   populate_io_cache_on_flush='false' AND
   default_time_to_live=0 AND
   speculative_retry='99.0PERCENTILE' AND
   memtable_flush_period_in_ms=0 AND
   compaction={'sstable_size_in_mb': '64', 'tombstone_threshold': '0.2', 
 'class': 'LeveledCompactionStrategy'} AND
   compression={'sstable_compression': 'LZ4Compressor'};
  
  
 UPDATE shard75 set unread_ids = { } WHERE group_id = 'asd';
  
 Bad Request: Missing mandatory PRIMARY KEY part event_id
 
 
 Am I doing something unsupported here? I am trying to follow the examples in 
 the release docs (from cassandra 2.0.6)
 
 Thank you,
 Tommaso
 





Re: OOM(Java heap space) on start-up during commit log replaying

2014-08-12 Thread graham sanderson
Agreed, we need more details; and just start by increasing the heap, because 
that may well solve the problem.

I have just observed (which makes sense when you think about it) while testing 
fix for https://issues.apache.org/jira/browse/CASSANDRA-7546, that if you are 
replaying a commit log which has a high level of updates for the same partition 
key, you can hit that issue - excess memory allocation under high contention 
for the same partition key - (this might not cause OOM but will certainly 
massively tax GC and it sounds like you don’t have a lot/any headroom).
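
As a first step, a cassandra-env.sh sketch of the "just increase the heap" 
suggestion; the sizes are placeholders, not recommendations, and depend entirely 
on the box:

    MAX_HEAP_SIZE="4G"
    HEAP_NEWSIZE="400M"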

On Aug 12, 2014, at 12:31 PM, Robert Coli rc...@eventbrite.com wrote:

 
 On Tue, Aug 12, 2014 at 9:34 AM, jivko donev jivko_...@yahoo.com wrote:
 We have a node with commit log directory ~4G. During start-up of the node on 
 commit log replaying the used heap space is constantly growing ending with 
 OOM error. 
 
 The heap size and new heap size properties are - 1G and 256M. We are using 
 the default settings for commitlog_sync, commitlog_sync_period_in_ms and 
 commitlog_segment_size_in_mb.
 
 What version of Cassandra?
 
 1G is tiny for cassandra heap. There is a direct relationship between the 
 data in the commitlog and memtables and in the heap. You almost certainly 
 need more heap or less commitlog.
 
 =Rob
   





Re: Strange slow schema agreement on 2.0.9 ... anyone seen this? - knowsVersion may get stuck as false?

2014-08-10 Thread graham sanderson
We saw this problem again today, so it certainly seems reasonable that it was 
introduced by upgrade from 2.0.5 to 2.0.9 (we hadn’t seen it ever before that)
I think this must be related to 
https://issues.apache.org/jira/browse/CASSANDRA-6695 or 
https://issues.apache.org/jira/browse/CASSANDRA-6700 which were both 
implemented in 2.0.6
The reason I think it is a problem with choosing not to do the schema push, is 
a “trace” of a manual table create on some nodes (where the problem occurs) 
does not send messages to some other nodes, whereas if the table creation is 
done from another node it may send messages to all nodes.

Not quite sure exactly what is/might be going on; seems like it could be a 
race of some kind (note we have ALWAYS been on 2.0.x in this environment, so it 
isn’t an issue with 1.x) that leaves the affected node with incorrect state 
about the other node’s version
I’m going to add logging in that code path (gossip still seems to indicate that 
everything is up - it is certainly possible that earlier a node appeared to be 
down due to GC, but it seems whatever state this causes does not resolve itself 
later - i.e. even though the schema change is eventually propagated, future 
schema changes have the same problem). Note whatever the conditions are, it 
seems to be a one way thing, i.e. A skips push to B, but then B happily pulls 
from A.

Other than schema changes, nothing else seemed to be affected (if nodes thought 
several other nodes were down, we’d likely see LOCAL_QUORUM operations fail)… 
this again points to the new “getRawVersion” change which is only used by the 
schema push/pull (“getVersion” assumes current version if no version info is 
known)… so there must be some sequence of event that causes node A to 
(permanently(-ish?)) lose version information for node B. 

On Aug 8, 2014, at 5:06 PM, graham sanderson gra...@vast.com wrote:

 Actually I think it is a different issue (or a freak issue)… the invocation 
 in InternalResponseStage is part of the “schema pull” mechanism this ticket 
 relates to, and in my case this is actually repairing (thank you) the schema 
 disagreement because as a result of it eventually being noticed by gossip. 
 For whatever reason, the “schema push” mechanism got broken for some nodes. 
 Strange as I say since this push code looks for live nodes according to 
 gossip and all nodes were up according to gossip info at the time. So, sadly 
 the new debug logging in the pull path won’t help… if it happens again, I’ll 
 have some more context to dig deeper, before just getting in and fixing the 
 problem by restarting the nodes which I did today.
 
 On Aug 8, 2014, at 4:37 PM, graham sanderson gra...@vast.com wrote:
 
 Ok thanks - I guess I can at least enable the debug logging added for that 
 issue to see if it is deliberately choosing not to pull the schema… no repro 
 case, but it may happen again!
 
 On Aug 8, 2014, at 4:21 PM, Robert Coli rc...@eventbrite.com wrote:
 
 On Fri, Aug 8, 2014 at 1:45 PM, graham sanderson gra...@vast.com wrote:
 We have some data that is partitioned in tables created periodically (once 
 a day). This morning, this automated process timed out because the schema 
 did not reach agreement quickly enough after we created a new empty table.
 
 I have seen this on 1.2.16, but it was supposed to be fixed in 1.2.18 and 
 2.0.7.
 
 https://issues.apache.org/jira/browse/CASSANDRA-6971
 
 If you can repro on 2.0.9, I would file a JIRA with repro steps and link it 
 on a reply to this thread.
 
 =Rob 
 
 





Strange slow schema agreement on 2.0.9 ... anyone seen this?

2014-08-08 Thread graham sanderson
We recently upgraded C* from 2.0.5 to 2.0.9

We have some data that is partitioned in tables created periodically (once a 
day). This morning, this automated process timed out because the schema did not 
reach agreement quickly enough after we created a new empty table.

I was able to reproduce this manually via CQLSH. when I created the table, and 
ran a nodetool describecluster, it showed 3 nodes on the old schema and 3 nodes 
on the new schema instantly (or as quick as I could run the nodetool 
describecluster). It took almost exactly a minute for the other nodes to switch.

The nodes weren’t busy, machines were healthy network was healthy, JVMs were 
healthy - nodetool status, gossipinfo and OpsCenter all looked happy. We never 
saw this issue in beta on 2.0.9 or anywhere on 2.0.5, and yesterday on 2.0.9 
after the upgrade it worked correctly.

The only clue I have is that for this case, the nodes which were slow to update 
called DefsTables.mergeSchema from InternalResponseStage not MigrationStage 
(which is what it is called on as I test it now).
Looking at the logs, these InternalResponseStage invocations happened eerily 
close (within a second) to exactly a minute later.

Having discovered nothing else wrong, I restarted one of the “slow” nodes, and 
the problem went away (for that node). So now the cluster has been rolling 
restarted, and is proceeding fine.

Anyways, I will dig a little deeper as to why (when all nodes think each other 
are up) the migration verb might not get executed (there were no errors in any 
logs)… mostly wondering if this rings a bell with anyone



Re: Delete By Partition Key Implementation

2014-08-08 Thread graham sanderson
A deletion of an entire row is a single row tombstone, and yes there are range 
tombstones for marking deletion of a range of columns also

On Aug 8, 2014, at 2:17 PM, Kevin Burton bur...@spinn3r.com wrote:

 This is a good question.. I'd love to find out the answer.  Seems like a 
 tombstone with prefixes for the keys would work well.
 
 Also, can't any key prefixes work in theory?
 
 
 On Thu, Aug 7, 2014 at 8:33 AM, DuyHai Doan doanduy...@gmail.com wrote:
 Hello all
 
  Usually, when using DELETE in CQL3 on some fields, C* creates tombstone 
 columns for those fields.
 
  Now if I delete a whole PARTITION (delete from MyTable where 
 partitionKey=...), what will C* do ? Will it create as many tombstones as 
 there are physical columns on this partition or will it just mark this 
 partition as deleted (Row Key deletion marker) ?
 
  On a side note, if I insert a bunch of physical columns in one partition 
 with the SAME ttl value, after a while they will appear as expired, would C* 
 need to scan the whole partition on disk to see which columns to expire or 
 could it see that the whole partition is indeed expired thanks to meta data/ 
 Partition key cache kept in memory ?  I was thinking about the estimate 
 histograms for TTL but I don't know in detail how it work
 
  Regards
 
  Duy Hai  DOAN
 
 
 
 
 -- 
 
 Founder/CEO Spinn3r.com
 Location: San Francisco, CA
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 
 





Re: Strange slow schema agreement on 2.0.9 ... anyone seen this?

2014-08-08 Thread graham sanderson
Ok thanks - I guess I can at least enable the debug logging added for that 
issue to see if it is deliberately choosing not to pull the schema… no repro 
case, but it may happen again!

On Aug 8, 2014, at 4:21 PM, Robert Coli rc...@eventbrite.com wrote:

 On Fri, Aug 8, 2014 at 1:45 PM, graham sanderson gra...@vast.com wrote:
 We have some data that is partitioned in tables created periodically (once a 
 day). This morning, this automated process timed out because the schema did 
 not reach agreement quickly enough after we created a new empty table.
 
 I have seen this on 1.2.16, but it was supposed to be fixed in 1.2.18 and 
 2.0.7.
 
 https://issues.apache.org/jira/browse/CASSANDRA-6971
 
 If you can repro on 2.0.9, I would file a JIRA with repro steps and link it 
 on a reply to this thread.
 
 =Rob 





Re: Strange slow schema agreement on 2.0.9 ... anyone seen this?

2014-08-08 Thread graham sanderson
Actually I think it is a different issue (or a freak issue)… the invocation in 
InternalResponseStage is part of the “schema pull” mechanism this ticket 
relates to, and in my case this is actually repairing (thank you) the schema 
disagreement because as a result of it eventually being noticed by gossip. For 
whatever reason, the “schema push” mechanism got broken for some nodes. Strange 
as I say since this push code looks for live nodes according to gossip and all 
nodes were up according to gossip info at the time. So, sadly the new debug 
logging in the pull path won’t help… if it happens again, I’ll have some more 
context to dig deeper, before just getting in and fixing the problem by 
restarting the nodes which I did today.

On Aug 8, 2014, at 4:37 PM, graham sanderson gra...@vast.com wrote:

 Ok thanks - I guess I can at least enable the debug logging added for that 
 issue to see if it is deliberately choosing not to pull the schema… no repro 
 case, but it may happen again!
 
 On Aug 8, 2014, at 4:21 PM, Robert Coli rc...@eventbrite.com wrote:
 
 On Fri, Aug 8, 2014 at 1:45 PM, graham sanderson gra...@vast.com wrote:
 We have some data that is partitioned in tables created periodically (once a 
 day). This morning, this automated process timed out because the schema did 
 not reach agreement quickly enough after we created a new empty table.
 
 I have seen this on 1.2.16, but it was supposed to be fixed in 1.2.18 and 
 2.0.7.
 
 https://issues.apache.org/jira/browse/CASSANDRA-6971
 
 If you can repro on 2.0.9, I would file a JIRA with repro steps and link it 
 on a reply to this thread.
 
 =Rob 
 





Re: Is per-table memory overhead due to SSTables or tables?

2014-08-08 Thread graham sanderson
See https://issues.apache.org/jira/browse/CASSANDRA-5935

2.1 has a radically different implementation that sidesteps this (with off-heap 
memtables), but if you really want lots of tables now you can do so as a 
trade-off against GC behavior.

The problem is not SSTables per se, but more potentially one memtable per CF 
(and with the slab allocator that can/does cost 1MB); I am not familiar enough with 
the code to know when you would have 1 memtable vs 0 memtables for a CF that 
isn’t currently actively used.

Note also https://issues.apache.org/jira/browse/CASSANDRA-6602 and friends; 
there is definitely a need for efficient discarding of old data in event 
streams.


On Aug 8, 2014, at 2:29 PM, Kevin Burton bur...@spinn3r.com wrote:

 The conventional wisdom says that it's ideal to keep the number of tables in 
 the low hundreds with Cassandra, as each table can use 1MB or 
 so of heap.  So if you have 1000 tables you'd have 1GB of heap used (which is 
 no fun).
 
 But is this an issue with the tables themselves or the SSTables?
 
 I think the root of this is the SSTables as all the arena overhead will be 
 for the SSTables too and more SSTables means more overhead.
 
 So by adding more tables, you end up with more SSTables which means more heap 
 memory.
 
 If I'm correct, then this means that Cassandra could benefit from table 
 partitioning, whereby you route all values in a specific region to a specific 
 set of tables.
 
 So if you were storing log data, you could store it in hourly, or daily 
 partitions, but view the table as one logical unit.
 
 The benefit here is that you could easily just drop the oldest data.  So if 
 you need to clean up data, you wouldn't have to drop the whole table, just a 
 day's worth of the data. 
 
 And since that day is just one SSTable on disk, the drop would be easy.. no 
 tombstones, just delete the whole SSTable.
 
 
 
 -- 
 
 Founder/CEO Spinn3r.com
 Location: San Francisco, CA
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 
 





Re: Is per-table memory overhead due to SSTables or tables?

2014-08-08 Thread graham sanderson
google ;-)

On Aug 8, 2014, at 7:33 PM, Kevin Burton bur...@spinn3r.com wrote:

 hm.. as a side note, it's amazing how much cassandra information is locked up 
 in JIRAs… wonder if there's a way to automatically identify the JIRAs with 
 important information.
 
 
 On Fri, Aug 8, 2014 at 5:14 PM, graham sanderson gra...@vast.com wrote:
 See https://issues.apache.org/jira/browse/CASSANDRA-5935
 
 2.1 has a radically different implementation that sidesteps this (with off-heap 
 memtables), but if you really want lots of tables now you can do so as a 
 trade-off against GC behavior.
 
 The problem is not SSTables per se, but more potentially one memtable per CF 
 (and with the slab allocator that can/does cost 1MB); I am not familiar enough 
 with the code to know when you would have 1 memtable vs 0 memtables for a CF 
 that isn’t currently actively used.
 
 Note also https://issues.apache.org/jira/browse/CASSANDRA-6602 and friends; 
 there is definitely a need for efficient discarding of old data in event 
 streams.
 
 
 On Aug 8, 2014, at 2:29 PM, Kevin Burton bur...@spinn3r.com wrote:
 
 The conventional wisdom says that it's ideal to keep the number of tables in 
 the low hundreds with Cassandra, as each table can use 1MB 
 or so of heap.  So if you have 1000 tables you'd have 1GB of heap used 
 (which is no fun).
 
 But is this an issue with the tables themselves or the SSTables?
 
 I think the root of this is the SSTables as all the arena overhead will be 
 for the SSTables too and more SSTables means more overhead.
 
 So by adding more tables, you end up with more SSTables which means more 
 heap memory.
 
 If I'm correct, then this means that Cassandra could benefit from table 
 partitioning, whereby you route all values in a specific region to a specific 
 set of tables.
 
 So if you were storing log data, you could store it in hourly, or daily 
 partitions, but view the table as one logical unit.
 
 The benefit here is that you could easily just drop the oldest data.  So if 
 you need to clean up data, you wouldn't have to drop the whole table, just a 
 day's worth of the data. 
 
 And since that day is just one SSTable on disk, the drop would be easy.. no 
 tombstones, just delete the whole SSTable.
 
 
 
 -- 
 
 Founder/CEO Spinn3r.com
 Location: San Francisco, CA
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 
 
 
 
 
 
 -- 
 
 Founder/CEO Spinn3r.com
 Location: San Francisco, CA
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 
 





Re: Does SELECT … IN () use parallel dispatch?

2014-07-25 Thread Graham Sanderson
Of course the driver in question is allowed to be smarter and can do so if you 
use a ? parameter for a list or even individual elements.

I'm not sure which, if any, drivers currently do this, but we plan to combine this 
with token-aware routing in our Scala driver in the future.

Sent from my iPhone

 On Jul 25, 2014, at 1:14 PM, DuyHai Doan doanduy...@gmail.com wrote:
 
  Nope. Select ... IN() sends one request to a coordinator. This coordinator 
  dispatches the request to 50 nodes, as in your example, and waits for 50 
  responses before sending back the final result. As you can guess, this 
  approach is not optimal since the global request latency is bound to the 
  slowest of the 50 nodes.
 
   On the other hand, if you use the async feature of the native protocol, your 
  client will issue 50 requests in parallel and the answers arrive as soon as 
  they are fetched from the different nodes.
 
   Clearly the only advantage of using the IN() clause is ease of querying. I would 
  advise using IN() only when you have a few values, not 50.
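
  To make the async alternative concrete, here is a minimal sketch of issuing the 
  50 single-partition reads in parallel with the DataStax Java driver instead of 
  one big IN(); keyspace, table and column names are illustrative:

      import com.datastax.driver.core.Cluster;
      import com.datastax.driver.core.PreparedStatement;
      import com.datastax.driver.core.ResultSetFuture;
      import com.datastax.driver.core.Session;

      import java.util.ArrayList;
      import java.util.List;

      public class ParallelReads {
          public static void main(String[] args) {
              Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
              Session session = cluster.connect("my_ks");

              // One prepared single-partition query, reused for every key.
              PreparedStatement ps = session.prepare("SELECT * FROM my_table WHERE id = ?");

              // Fire all 50 reads at once; each can be routed to a replica of its
              // own partition, unlike a single IN() which funnels everything
              // through one coordinator.
              List<ResultSetFuture> futures = new ArrayList<ResultSetFuture>();
              for (long id = 0; id < 50; id++) {
                  futures.add(session.executeAsync(ps.bind(id)));
              }

              // Consume results as they arrive; overall latency is no longer
              // bound to the single slowest of the 50 nodes.
              for (ResultSetFuture f : futures) {
                  f.getUninterruptibly();  // ... iterate rows here ...
              }
              cluster.close();
          }
      }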
 
 
 On Fri, Jul 25, 2014 at 8:08 PM, Kevin Burton bur...@spinn3r.com wrote:
 Say I have about 50 primary keys I need to fetch.
 
 I'd like to use parallel dispatch.  So that if I have 50 hosts, and each has 
 record, I can read from all 50 at once.
 
 I assume cassandra does the right thing here ?  I believe it does… at least 
 from reading the docs but it's still a bit unclear.
 
 Kevin
 
 -- 
 Founder/CEO Spinn3r.com
 Location: San Francisco, CA
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 
 


Re: All writes fail with ONE consistency level when adding second node to cluster?

2014-07-23 Thread graham sanderson
Hey now; it is GREAT for a 100% write-only use case ;-)

On Jul 23, 2014, at 12:15 PM, Robert Coli rc...@eventbrite.com wrote:

 On Tue, Jul 22, 2014 at 7:46 PM, Andrew redmu...@gmail.com wrote:
 ONE means write to one replica (in addition to the original).  If you want to 
 write to any of them, use ANY.  Is that the right understanding?
 
 This has come up a few times, so let me be unambiguous about when to use 
 CL.ANY :
 
 NEVER EVER USE CL.ANY. IT ALMOST CERTAINLY SHOULD NOT EVEN EXIST.
 
 IF YOU THINK YOU NEED TO USE IT, YOU ARE ALMOST CERTAINLY WRONG.
 
 ;D
 
 =Rob
 





Re: All writes fail with ONE consistency level when adding second node to cluster?

2014-07-23 Thread graham sanderson
I was being a little tongue in cheek!

On Jul 23, 2014, at 3:20 PM, Jack Krupansky j...@basetechnology.com wrote:

 Granted, for “normal” apps it is unlikely to be appropriate but...
  
 From an old post by Jonathan:
 ---
 Extreme write availability
  
 For applications that want Cassandra to accept writes even when all the 
 normal replicas are down (so even ConsistencyLevel.ONE cannot be satisfied), 
 Cassandra provides ConsistencyLevel.ANY. ConsistencyLevel.ANY guarantees that 
 the write is durable and will be readable once an appropriate replica target 
 becomes available and receives the hint replay.
 ---
 See:
 http://www.datastax.com/dev/blog/understanding-hinted-handoff
  
 I can think of a couple of use cases: sensor data where the devices are 
 streaming frequently, so losing a reading is not a big deal because another 
 reading is coming soon anyway, and a Twitter firehose where you are after a 
 robust sample rather than absolute consistency. Minimizing network latency 
 may be a bigger deal than whether immediate queries can see the data.
  
 And as the description notes, hinted handoff will eventually propagate the 
 data (unless it times out and drops the hint.)
  
 -- Jack Krupansky
  
 From: Robert Coli
 Sent: Wednesday, July 23, 2014 1:15 PM
 To: user@cassandra.apache.org
 Cc: Kevin Burton
 Subject: Re: All writes fail with ONE consistency level when adding second 
 node to cluster?
  
 On Tue, Jul 22, 2014 at 7:46 PM, Andrew redmu...@gmail.com wrote:
  
 ONE means write to one replica (in addition to the original).  If you want to 
 write to any of them, use ANY.  Is that the right understanding?
  
  
 This has come up a few times, so let me be unambiguous about when to use 
 CL.ANY :
  
 NEVER EVER USE CL.ANY. IT ALMOST CERTAINLY SHOULD NOT EVEN EXIST.
  
 IF YOU THINK YOU NEED TO USE IT, YOU ARE ALMOST CERTAINLY WRONG.
  
 ;D
  
 =Rob
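
For what it's worth, the consistency level under discussion is set per statement 
in the DataStax Java driver; a minimal sketch (table and column names are 
illustrative, not from this thread):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ConsistencyLevel;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;
    import com.datastax.driver.core.Statement;

    public class WriteConsistencyLevels {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("my_ks");

            Statement insert = new SimpleStatement(
                    "INSERT INTO readings (sensor_id, value) VALUES (42, 1.5)");

            // Normal case: require one replica to acknowledge the write.
            insert.setConsistencyLevel(ConsistencyLevel.ONE);
            session.execute(insert);

            // CL.ANY: accept the write even if no replica is up, as long as a
            // hint can be stored. Per the discussion above, think very hard
            // before using this.
            insert.setConsistencyLevel(ConsistencyLevel.ANY);
            session.execute(insert);

            cluster.close();
        }
    }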
  





Re: All writes fail with ONE consistency level when adding second node to cluster?

2014-07-22 Thread graham sanderson
I assumed you must have now switched to ANY, which you probably didn’t want to 
do, and likely won’t help (and very few people use ANY, which may explain the 
lack of Google hits; plus this particular “Cassandra timeout during write query 
at consistency” error message comes from the DataStax CQL Java driver, not C* 
itself).

In any case… my original response was just to explain to you that your 
understanding of what ONE means in general was correct, and this 
incorrect-looking error message was a weird case while adding a node.

I have no idea what is going on with your bootstrapping node; others may be able 
to help, but in the meanwhile I’d look for errors in the server log and google 
those, and/or google for instructions on how to add nodes to a Cassandra cluster 
on whatever version you are running.

On Jul 22, 2014, at 10:47 PM, Kevin Burton bur...@spinn3r.com wrote:

 and there are literally zero google hits on the query: Cassandra timeout 
 during write query at consistency ANY (2 replica were required but only 1 
 acknowledged the write)
 
 .. so I imagine I'm the first to find this bug!  Aren't I lucky!
 
 
 On Tue, Jul 22, 2014 at 8:46 PM, Kevin Burton bur...@spinn3r.com wrote:
 Yeah.. that's fascinating … so now I get something that's even worse:
 
 Cassandra timeout during write query at consistency ANY (2 replica were 
 required but only 1 acknowledged the write)
 
 … the issue is that the new cassandra node has all its ports closed.
 
 Only the storage port is open.
 
 So obviously writes are going to fail to it.
 
 … is this by design?  Perhaps it's not going to open the ports until the node 
 joins the ring?  It's currently joining …
 
 so… basically, my entire cluster is offline during this join?
 
  I assume this is either a bug or some weird state based on growing from 1-2 
 nodes?
 
 frustrating :-(
 
 
 On Tue, Jul 22, 2014 at 8:13 PM, graham sanderson gra...@vast.com wrote:
  Incorrect, ONE does not refer to the number of “other” nodes, it just refers 
  to the number of nodes, so ONE under normal circumstances would only require 
  one node to acknowledge the write.
 
 The confusing error message you are getting is related to 
 https://issues.apache.org/jira/browse/CASSANDRA-833… Kevin you are correct in 
 that normally that error message would make no sense.
 
 I don’t have much experience adding/removing nodes, but I think what is 
  happening is that your new node is in the middle of taking over ownership of a 
  token range - while that happens C* is trying to write to both the old owner 
 (your original node), AND (hence the 2 not 1 in the error message) the new 
 owner (the new node) so that once the bootstrapping of the new node is 
  complete, it is immediately safe to delete the (no longer owned) data from 
  the old node. For whatever reason the write to the new node is timing out, 
 causing the exception, and the error message is exposing the “2” which 
 happens to be how many C* thinks it is waiting for at the time (i.e. how many 
 it should be waiting for based on the consistency level (1) plus this extra 
 node).
 
 
 On Jul 22, 2014, at 9:46 PM, Andrew redmu...@gmail.com wrote:
 
 ONE means write to one replica (in addition to the original).  If you want 
 to write to any of them, use ANY.  Is that the right understanding?
 
 http://www.datastax.com/docs/1.0/dml/data_consistency
 
 Andrew
 
 On July 22, 2014 at 7:43:43 PM, Kevin Burton (bur...@spinn3r.com) wrote:
 
 I'm super confused by this.. and disturbed that this was my failure 
 scenario :-(
 
 I had one cassandra node for the alpha of my app… and now we're moving into 
 beta… which means three replicas.
 
 So I added the second node… but my app immediately broke with:
 
 Cassandra timeout during write query at consistency ONE (2 replica were 
 required but only 1 acknowledged the write)
 
 … but that makes no sense… if I'm at ONE and I have one acknowledged write, 
 why does it matter that the second one hasn't ack'd yet…
 
 ?
 
 --
 
 Founder/CEO Spinn3r.com
 Location: San Francisco, CA
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 
 
 
 
 
 -- 
 
 Founder/CEO Spinn3r.com
 Location: San Francisco, CA
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 
 
 
 
 
 -- 
 
 Founder/CEO Spinn3r.com
 Location: San Francisco, CA
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 
 





Re: ghost table is breaking compactions and won't go away… even during a drop.

2014-07-16 Thread graham sanderson
Known issue deleting and recreating a CF with the same name, fixed in 2.1 
(manifests in lots of ways)

https://issues.apache.org/jira/browse/CASSANDRA-5202

On Jul 16, 2014, at 8:53 PM, Kevin Burton bur...@spinn3r.com wrote:

 looks like a restart of cassandra and a nodetool compact fixed this… 
 
 
 On Wed, Jul 16, 2014 at 6:45 PM, Kevin Burton bur...@spinn3r.com wrote:
 this is really troubling…
 
 I have a ghost table.  I dropped it.. but it's not going away.  
 
 (Cassandra 2.0.8 btw)
 
 I ran a 'drop table' on it.. then a 'describe tables' shows that it's not 
 there.  
 
 However, when I recreated it, with a new schema, all operations on it failed.
 
 Looking at why… it seems that cassandra had some old SSTables that I imagine 
 are no longer being used but are now in an inconsistent state?
 
 This is popping up in the system.log:
 
 Caused by: java.lang.RuntimeException: java.io.FileNotFoundException: 
 /d0/cassandra/data/blogindex/content_idx_source_hashcode/blogindex-content_idx_source_hashcode-jb-1447-Data.db
  (No such file or directory)
 
  so I think what happened… is that the original drop table failed, and then 
 left things in an inconsistent state.
 
 I tried a nodetool repair and a nodetool compact… those fail on the same 
 java.io.FileNotFoundException … I moved the directories out of the way, same 
 failure issue.
 
 … any advice on resolving this?
 
 -- 
 
 Founder/CEO Spinn3r.com
 Location: San Francisco, CA
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 
 
 
 
 
 -- 
 
 Founder/CEO Spinn3r.com
 Location: San Francisco, CA
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 
 





Re: Write Inconsistency to update a row

2014-07-03 Thread graham sanderson
What is your keyspace replication_factor?

What consistency level are you reading/writing with?

Does the data show up eventually?

I’m assuming you don’t have any errors (timeouts etc) on the write side.

On Jul 3, 2014, at 7:55 AM, Sávio S. Teles de Oliveira 
savio.te...@cuia.com.br wrote:

 I have two Cassandra 2.0.5 servers running with some datas inserted, where 
 each row have one empty column. When the client send a lot of update commands 
 to fill this column in each row, some lines update their content, but some 
 lines remain with the empty column.
 
 Using one server, this never happens!
 
 Any suggestions?
 
 Tks.
 -- 
 Atenciosamente,
 Sávio S. Teles de Oliveira
 voice: +55 62 9136 6996
 http://br.linkedin.com/in/savioteles
 Mestrando em Ciências da Computação - UFG 
 Arquiteto de Software
 CUIA Internet Brasil
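
If it turns out the keyspace has replication_factor > 1 and the client reads and 
writes at a low consistency level, one quick way to rule eventual consistency in 
or out is to run both sides at QUORUM. A minimal sketch, assuming the DataStax 
Java driver (table and column names are illustrative):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ConsistencyLevel;
    import com.datastax.driver.core.PreparedStatement;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;

    public class QuorumReadWrite {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("my_ks");

            // With replication_factor = 2, QUORUM means both replicas, so a
            // QUORUM read is guaranteed to see a successful QUORUM write.
            PreparedStatement update = session.prepare(
                    "UPDATE my_table SET col = ? WHERE id = ?");
            session.execute(update.bind("filled", 1L)
                                  .setConsistencyLevel(ConsistencyLevel.QUORUM));

            PreparedStatement select = session.prepare(
                    "SELECT col FROM my_table WHERE id = ?");
            Row row = session.execute(select.bind(1L)
                                            .setConsistencyLevel(ConsistencyLevel.QUORUM))
                             .one();
            System.out.println(row.getString("col"));

            cluster.close();
        }
    }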





Re: Dynamic Columns in Cassandra 2.X

2014-06-13 Thread graham sanderson
My 2 cents…

A motivation for CQL3 AFAIK was to make Cassandra more familiar to SQL users. 
This is a valid goal, and works well in many cases.
Equally there are use cases (that some might find ugly) where Cassandra is 
chosen explicitly because of the sorts of things you can do at the thrift 
level, which aren’t (currently) exposed via CQL3

To Robert’s point earlier - “Rational people should presume that Thrift support 
must eventually disappear”… he is probably right (though frankly I’d rather the 
non-blocking thrift version was added instead). However if we do get rid of the 
thrift interface, then it needs to be at a time that CQLn is capable of 
expressing all the things you could do via the thrift API. Note, I need to go 
look and see if the non-blocking thrift version also requires materializing the 
entire thrift object in memory.

On Jun 13, 2014, at 4:55 PM, DuyHai Doan doanduy...@gmail.com wrote:

 There are always pros and cons with a query language.
 
 But as far as I can see, the advantages of Thrift over CQL3 are:
 
  1) Thrift requires a little bit less decoding server-side (a difference 
 around 10% in CPU usage).
 
  2) Thrift uses more compact storage, because CQL3 needs to add extra marker 
 columns to guarantee the existence of the primary key. It is worse when you use 
 clustering columns, because each distinct clustering group has a 
 related marker column.
 
  That being said, point 1) is not really an issue since most of the time 
 nodes are more I/O bound than CPU bound. Only in extreme cases, where you have 
 an incredible read rate with data that fits entirely in memory, might you 
 notice the difference.
 
  For point 2), this is a small trade-off for having access to a query language 
 and being able to do slice queries using the WHERE clause. Some like it, 
 others hate it; it's just a question of taste.  Please note that the waste 
 in disk space is somewhat mitigated by compression.
 
  Long story short, I think Thrift may have appropriate usage, but only in very 
 few use cases. Recently a lot of improvements and features have been added to 
 CQL3, so it should be considered the first choice for most users; if 
 they fall into those few use cases, then switch back to Thrift.
 
 My 2 cents
 
 
 
 
 
 
 On Fri, Jun 13, 2014 at 11:43 PM, Peter Lin wool...@gmail.com wrote:
 
  With a text-based query approach like CQL, you lose the type with dynamic 
  columns. Yes, we're storing it as bytes, but it is simpler and easier with 
  Thrift to do these types of things.
 
  I like CQL3 and what it does, but text-based query languages make certain 
  dynamic schema use cases painful. Having used and built ORMs, I find they are poorly 
  suited to dynamic schemas. If you've never had to write an ORM to handle 
  dynamic user-defined schemas at runtime, it's tough to see where the problems 
  arise and how that makes life painful.
 
 Just to be clear, I'm not saying don't use CQL3 or CQL3 is bad. I'm 
 saying CQL3 is good for certain kinds of use cases and Thrift is good at 
 certain use cases. People need to look at what and how they're storing data 
 and do what makes the most sense to them. Slavishly following CQL3 doesn't 
 make any sense to me.
  
 
 
 On Fri, Jun 13, 2014 at 5:30 PM, DuyHai Doan doanduy...@gmail.com wrote:
 the validation type is set to bytes, and my code is type safe, so it knows 
 which serializers to use. Those dynamic columns are driven off the types in 
 Java.  -- Correct. However, you are still bound by the column comparator 
 type which should be fixed (unless again you set it to bytes, in this case 
 you loose the ordering and sorting feature)
 
  Basically what you are doing is telling Cassandra to save data in the cells 
 as raw bytes, the serialization is taken care client side using the 
 appropriate serializer. This is perfectly a valid strategy.
 
   But how is it different from using CQL3, setting the value to blob 
  (equivalent to bytes), and taking care of the serialization client-side as well? 
  You can even imagine saving the value in JSON format and setting the type to text.
 
  Really, I don't see why CQL3 cannot achieve the scenario you describe.
 
  For the record, when you create a table in CQL3 as follow:
 
  CREATE TABLE user (
  id bigint PRIMARY KEY,
  firstname text,
  lastname text,
  last_connection timestamp,
  );
 
  C* will create a column family with validation type = bytes to accommodate 
 the timestamp and text types for the firstname, lastname and last_connection 
 columns. Basically the CQL3 engine is doing the serialization server-side for 
 you
 
  
 
 
 
 
 On Fri, Jun 13, 2014 at 11:19 PM, Peter Lin wool...@gmail.com wrote:
 
 the validation type is set to bytes, and my code is type safe, so it knows 
 which serializers to use. Those dynamic columns are driven off the types in 
 Java.
 
 Having said that, CQL3 does have a new custom type feature, but the 
 documentation is basically 

Re: Dynamic Columns in Cassandra 2.X

2014-06-13 Thread graham sanderson
Note, as I mentioned mid-post, thrift also supports async nowadays (there was a 
recent discussion on the cassandra dev list and the choice was not to move to it).

I think the binary protocol is the way forward; CQL3 needs some new features, 
or there need to be some other types of requests you can make over the binary 
protocol

On Jun 13, 2014, at 5:51 PM, Peter Lin wool...@gmail.com wrote:

 
 Without a doubt there are nice features of CQL3, like notifications and async. I 
 want to see CQL3 mature and handle all the use cases that Thrift handles 
 easily today. It's to everyone's benefit to work together and improve CQL3.
 
  Another benefit of Thrift drivers today is being able to use an object API with 
  generics. For tool builders, this is especially useful. Not everyone wants to 
 write tools, but I do so it matters to me.
 
 
 
 
 On Fri, Jun 13, 2014 at 6:39 PM, Laing, Michael michael.la...@nytimes.com 
 wrote:
 Just to add 2 more cents... :)
 
 The CQL3 protocol is asynchronous. This can provide a substantial throughput 
 increase, according to my benchmarking, when one uses non-blocking techniques.
 
 It is also peer-to-peer. Hence the server can generate events to send to the 
 client, e.g. schema changes - in general, 'triggers' become possible.
 
 ml
 
 
 On Fri, Jun 13, 2014 at 6:21 PM, graham sanderson gra...@vast.com wrote:
 My 2 cents…
 
 A motivation for CQL3 AFAIK was to make Cassandra more familiar to SQL users. 
 This is a valid goal, and works well in many cases.
 Equally there are use cases (that some might find ugly) where Cassandra is 
 chosen explicitly because of the sorts of things you can do at the thrift 
 level, which aren’t (currently) exposed via CQL3
 
  To Robert’s point earlier - “Rational people should presume that Thrift 
  support must eventually disappear”… he is probably right (though frankly I’d 
 rather the non-blocking thrift version was added instead). However if we do 
 get rid of the thrift interface, then it needs to be at a time that CQLn is 
 capable of expressing all the things you could do via the thrift API. Note, I 
 need to go look and see if the non-blocking thrift version also requires 
 materializing the entire thrift object in memory.
 
 On Jun 13, 2014, at 4:55 PM, DuyHai Doan doanduy...@gmail.com wrote:
 
  There are always pros and cons with a query language.
 
  But as far as I can see, the advantages of Thrift over CQL3 are:
 
   1) Thrift requires a little bit less decoding server-side (a difference 
  around 10% in CPU usage).
 
   2) Thrift uses more compact storage, because CQL3 needs to add extra 
  marker columns to guarantee the existence of the primary key. It is worse 
  when you use clustering columns, because each distinct clustering group 
  has a related marker column.
 
   That being said, point 1) is not really an issue since most of the time 
  nodes are more I/O bound than CPU bound. Only in extreme cases, where you 
  have an incredible read rate with data that fits entirely in memory, might 
  you notice the difference.
 
   For point 2), this is a small trade-off for having access to a query language 
  and being able to do slice queries using the WHERE clause. Some like it, 
  others hate it; it's just a question of taste.  Please note that the waste 
  in disk space is somewhat mitigated by compression.
 
   Long story short, I think Thrift may have appropriate usage, but only in very 
  few use cases. Recently a lot of improvements and features have been added to 
  CQL3, so it should be considered the first choice for most users; and 
  if they fall into those few use cases, then switch back to Thrift.
 
 My 2 cents
 
 
 
 
 
 
 On Fri, Jun 13, 2014 at 11:43 PM, Peter Lin wool...@gmail.com wrote:
 
  With a text-based query approach like CQL, you lose the type with dynamic 
  columns. Yes, we're storing it as bytes, but it is simpler and easier with 
  Thrift to do these types of things.
 
  I like CQL3 and what it does, but text-based query languages make certain 
  dynamic schema use cases painful. Having used and built ORMs, I find they are 
  poorly suited to dynamic schemas. If you've never had to write an ORM to 
  handle dynamic user-defined schemas at runtime, it's tough to see where the 
  problems arise and how that makes life painful.
 
 Just to be clear, I'm not saying don't use CQL3 or CQL3 is bad. I'm 
 saying CQL3 is good for certain kinds of use cases and Thrift is good at 
 certain use cases. People need to look at what and how they're storing data 
 and do what makes the most sense to them. Slavishly following CQL3 doesn't 
 make any sense to me.
  
 
 
 On Fri, Jun 13, 2014 at 5:30 PM, DuyHai Doan doanduy...@gmail.com wrote:
 the validation type is set to bytes, and my code is type safe, so it knows 
 which serializers to use. Those dynamic columns are driven off the types in 
 Java.  -- Correct. However, you are still bound by the column comparator 
 type which should be fixed (unless again you set it to bytes, in this case

Re: Pattern to store maps of maps...

2014-06-13 Thread graham sanderson
My personal opinion is that unless you are doing map operations on a CQL3 map 
and always intend to read the whole thing (you don’t have any choice 
today), don’t use one at all - use a blob of whatever variety makes sense (e.g. 
JSON, Avro, Protobuf, etc.)

On Jun 13, 2014, at 7:17 PM, Kevin Burton bur...@spinn3r.com wrote:

 So the cassandra map support in CQL is nice but it's got me wanting deeper 
 nesting.
 
 For example { foo: { bar: hello } }
 
 … but that's not possible with CQL.
 
 Of course… one solution is something like avro, and then store your entire 
 record as a blob.
 
 I guess that's not TOO bad but that means all my data is somewhat opaque to 
 cqlsh.
 
 What are my options here?  What are you guys doing to work around this 
 problem?
 
 -- 
 
 Founder/CEO Spinn3r.com
 Location: San Francisco, CA
 Skype: burtonator
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 
 War is peace. Freedom is slavery. Ignorance is strength. Corporations are 
 people.
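
A minimal sketch of the blob approach suggested above: serialize the nested 
structure client-side (JSON via Jackson here, purely as an illustration; Avro or 
Protobuf would work the same way) and store it in a blob column. Table and 
column names are illustrative:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.PreparedStatement;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;
    import com.fasterxml.jackson.databind.ObjectMapper;

    import java.nio.ByteBuffer;
    import java.util.HashMap;
    import java.util.Map;

    public class NestedMapAsBlob {
        public static void main(String[] args) throws Exception {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("my_ks");
            ObjectMapper mapper = new ObjectMapper();

            // { foo: { bar: "hello" } } -- arbitrarily nested, unlike a CQL3 map.
            Map<String, Object> inner = new HashMap<String, Object>();
            inner.put("bar", "hello");
            Map<String, Object> doc = new HashMap<String, Object>();
            doc.put("foo", inner);

            PreparedStatement insert = session.prepare(
                    "INSERT INTO documents (id, body) VALUES (?, ?)");
            session.execute(insert.bind(42L, ByteBuffer.wrap(mapper.writeValueAsBytes(doc))));

            // Reading it back: the blob is opaque to cqlsh, but trivial to decode here.
            Row row = session.execute("SELECT body FROM documents WHERE id = 42").one();
            ByteBuffer buf = row.getBytes("body");
            byte[] bytes = new byte[buf.remaining()];
            buf.duplicate().get(bytes);
            System.out.println(mapper.readValue(bytes, Map.class));

            cluster.close();
        }
    }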
 





Re: Efficient bulk range deletions without compactions by dropping SSTables.

2014-05-16 Thread graham sanderson
Just a few data points from our experience

One of our use cases involves storing a periodic full base state for millions 
of records, then fairly frequent delta updates to subsets of the records in 
between. C* is great for this because we can read the whole row (or up to the 
clustering key/column marking “now” as perceived by the client) and munge the 
base + deltas together in the client.

To keep rows small (and for recovery), we start over in a new CF whenever we 
start a new base state

The upshot is that we have pretty much the same scenario as Jeremy is describing

For this use case we are also using Astyanax (but C* 2.0.5)

We have not come across many of the schema problems you mention (which is 
likely attributable to some changes in the 2.0.x line); however, one thing to 
note is that Astyanax itself seems to be very picky about unresolved schema 
changes. We found that we had to do the schema changes via a CQL “create table” 
(we can still use Astyanax for that) rather than creating it via old style 
thrift CF creation


On May 13, 2014, at 9:42 AM, Jeremy Powell jeremym.pow...@gmail.com wrote:

 Hi Kevin,
 
 C* version: 1.2.xx
 Astyanax: 1.56.xx
 
 We basically do this same thing in one of our production clusters, but rather 
 than dropping SSTables, we drop Column Families. We time-bucket our CFs, and 
 when a CF has passed some time threshold (metadata or embedded in CF name), 
 it is dropped. This means there is a home-grown system that is doing the 
 bookkeeping/maintenance rather than relying on C*s inner workings. It is 
 unfortunate that we have to maintain a system which maintains CFs, but we've 
 been in a pretty good state for the last 12 months using this method. 
 
 Some caveats:
 
 By default, C* makes snapshots of your data when a table is dropped. You can 
 leave that and have something else clear up the snapshots, or if you're less 
 paranoid, set auto_snapshot: false in the cassandra.yaml file.
 
 Cassandra does not handle 'quick' schema changes very well, and we found that 
 only one node should be used for these changes. When adding or removing 
  column families, we have a single, property-defined C* node that is 
 designated as the schema node. After making a schema change, we had to throw 
 in an artificial delay to ensure that the schema change propagated through 
 the cluster before making the next schema change. And of course, relying on a 
 single node being up for schema changes is less than ideal, so handling fail 
 over to a new node is important.
 
 The final, and hardest problem, is that C* can't really handle schema changes 
 while a node is being bootstrapped (new nodes, replacing a dead node). If a 
 column family is dropped, but the new node has not yet received that data 
 from its replica, the node will fail to bootstrap when it finally begins to 
 receive that data - there is no column family for the data to be written to, 
  so that node will be stuck in the joining state, and its system keyspace 
 needs to be wiped and re-synced to attempt to get back to a happy state. This 
 unfortunately means we have to stop schema changes when a node needs to be 
 replaced, but we have this flow down pretty well.
 
 Hope this helps,
 Jeremy Powell
 
 
 On Mon, May 12, 2014 at 5:53 PM, Kevin Burton bur...@spinn3r.com wrote:
  We have a log-only data structure… everything is appended and nothing is ever 
 updated.
 
 We should be totally fine with having lots of SSTables sitting on disk 
 because even if we did a major compaction the data would still look the same.
 
 By 'lots' I mean maybe 1000 max.  Maybe 1GB each.
 
 However, I would like a way to delete older data.
 
 One way to solve this could be to just drop an entire SSTable if all the 
 records inside have tombstones.
 
 Is this possible, to just drop a specific SSTable?  
 
 -- 
 
 Founder/CEO Spinn3r.com
 Location: San Francisco, CA
 Skype: burtonator
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 
 War is peace. Freedom is slavery. Ignorance is strength. Corporations are 
 people.
 
 





Re: Question about READS in a multi DC environment.

2014-05-15 Thread graham sanderson
Yeah, but all the requests for data/digest are sent at the same time… responses 
that aren’t “needed” to complete the request are dealt with asynchronously 
(possibly causing repair). 

In the original trace (which is confusing because I don’t think the clocks are 
in sync)… I don’t see anything that makes me believe it is blocking for all 3 
responses - It actually does reads on all 3 nodes even if only digests are 
required
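
If the table really does have read_repair_chance = 1.0, the setting being 
discussed can be adjusted per table; a minimal sketch keeping only the DC-local 
variant so LOCAL_ONE reads stop touching remote data centers for read repair 
(the exact values are a policy choice, not a recommendation):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class ReadRepairSettings {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect();

            // read_repair_chance applies across all DCs; dclocal_read_repair_chance
            // restricts the extra replica reads to the coordinator's own DC.
            session.execute("ALTER TABLE kairosdb.data_points " +
                    "WITH read_repair_chance = 0.0 " +
                    "AND dclocal_read_repair_chance = 0.1");

            cluster.close();
        }
    }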

On May 12, 2014, at 12:37 AM, DuyHai Doan doanduy...@gmail.com wrote:

 Isn't read repair supposed to be done asynchronously in the background?
 
 
 On Mon, May 12, 2014 at 2:07 AM, graham sanderson gra...@vast.com wrote:
 You have a read_repair_chance of 1.0 which is probably why your query is 
 hitting all data centers.
 
 On May 11, 2014, at 3:44 PM, Mark Farnan devm...@petrolink.com wrote:
 
  I'm trying to understand READ load in Cassandra across a multi-datacenter 
  cluster (specifically why it seems to be hitting more than one DC), and 
  hope someone can help.
 
  From what I'm seeing here, a READ with consistency LOCAL_ONE seems to 
  be hitting all 3 datacenters, rather than just the one I'm connected to. 
  I see 'Read 101 live and 0 tombstoned cells' from EACH of the 3 DCs in 
  the trace, which seems wrong.
  I have tried every consistency level, same result. This is also the same 
  from my C# code via the DataStax driver (where I first noticed the issue).
 
  Can someone please shed some light on what is occurring? Specifically, I 
  don't want a query on one DC going anywhere near the other 2 as a rule, as 
  in production these DCs will be across slower links.
 
 
  Query: (NOTE: While this uses a kairosdb table, I'm just playing with 
  queries against it as it has 100k columns in this key for testing).
 
  cqlsh:kairosdb consistency local_one
  Consistency level set to LOCAL_ONE.
 
  cqlsh:kairosdb select * from data_points where key = 
  0x6d61726c796e2e746573742e74656d7034000145b514a400726f6f6d3d6f6963653a
   limit 1000;
 
  ... Some return data  rows listed here which I've removed 
 
  CassandraQuery.txt
  Query Response Trace:
 
  activity                                                                   | timestamp    | source         | source_elapsed
  ---------------------------------------------------------------------------+--------------+----------------+---------------
  execute_cql3_query                                                         | 07:18:12,692 | 192.168.25.111 |              0
  Message received from /192.168.25.111                                      | 07:18:00,706 | 192.168.25.131 |             50
  Executing single-partition query on data_points                            | 07:18:00,707 | 192.168.25.131 |            760
  Acquiring sstable references                                               | 07:18:00,707 | 192.168.25.131 |            814
  Merging memtable tombstones                                                | 07:18:00,707 | 192.168.25.131 |            924
  Bloom filter allows skipping sstable 191                                   | 07:18:00,707 | 192.168.25.131 |           1050
  Bloom filter allows skipping sstable 190                                   | 07:18:00,707 | 192.168.25.131 |           1166
  Key cache hit for sstable 189                                              | 07:18:00,707 | 192.168.25.131 |           1275
  Seeking to partition beginning in data file                                | 07:18:00,707 | 192.168.25.131 |           1293
  Skipped 0/3 non-slice-intersecting sstables, included 0 due to tombstones  | 07:18:00,708 | 192.168.25.131 |           2173
  Merging data from memtables and 1 sstables                                 | 07:18:00,708 | 192.168.25.131 |           2195
  Read 1001 live and 0 tombstoned cells                                      | 07:18:00,709 | 192.168.25.131 |           3259
  Enqueuing response to /192.168.25.111                                      | 07… (trace truncated in the original)
