from:"kevin"

Yes.. Logging it is far far far far better.

I think a lot of devs don't have experience working in actual production
environments.  YES the client should probably handle it, but WHICH client.
This is why you log things.  Log the statement that was aborted (at least
the first 100 bytes),

On Wed, Aug 3, 2016 at 2:30 PM, Ryan Svihla <r...@foundev.pro> wrote:

> Where I see this a lot is:
>
> 1. DBA notices it in logs
> 2. Everyone says code works fine no errors
> 3. Weeks of combing all apps find out 3 teams are doing fire and forget
> futures...
> 4. Convince each team they really need to handle futures
> 5. Couple months before you figure out who was the culprit by the time he
> deploys hit production.
>
> Would save everyone a ton of brain cells if we just logged it.
>
> Regards,
>
> Ryan Svihla
>
> On Aug 3, 2016, at 4:21 PM, Jonathan Haddad <j...@jonhaddad.com> wrote:
>
> I haven't verified, so i'm not 100% certain, but I believe you'd get back
> an exception to the client.  Yes, this belongs in the DB, but I don't think
> you're totally blind to what went wrong.
>
> My guess is this exception in the Python driver (but other drivers should
> have a similar exception):
> https://github.com/datastax/python-driver/blob/master/cassandra/protocol.py#L288
>
> On Wed, Aug 3, 2016 at 1:59 PM Ryan Svihla <r...@foundev.pro> wrote:
>
>> Made a Jira about it already
>> https://issues.apache.org/jira/plugins/servlet/mobile#issue/CASSANDRA-12231
>>
>> Regards,
>>
>> Ryan Svihla
>>
>> On Aug 3, 2016, at 2:58 PM, Kevin Burton <bur...@spinn3r.com> wrote:
>>
>> It seems these are basically impossible to track down.
>>
>>
>> https://support.datastax.com/hc/en-us/articles/207267063-Mutation-of-x-bytes-is-too-large-for-the-maxiumum-size-of-y-
>>
>> has some information but their work around is to increase the transaction
>> log.  There's no way to find out WHAT client or what CQL is causing the
>> large mutation.
>>
>> Any thoughts on how to mitigate this?
>>
>> Kevin
>>
>> --
>>
>> We’re hiring if you know of any awesome Java Devops or Linux Operations
>> Engineers!
>>
>> Founder/CEO Spinn3r.com
>> Location: *San Francisco, CA*
>> blog: http://burtonator.wordpress.com
>> … or check out my Google+ profile
>> <https://plus.google.com/102718274791889610666/posts>
>>
>>


-- 

We’re hiring if you know of any awesome Java Devops or Linux Operations
Engineers!

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
<https://plus.google.com/102718274791889610666/posts>

Mutation of X bytes is too large for the maximum size of Y

It seems these are basically impossible to track down.

https://support.datastax.com/hc/en-us/articles/207267063-Mutation-of-x-bytes-is-too-large-for-the-maxiumum-size-of-y-

has some information but their work around is to increase the transaction
log.  There's no way to find out WHAT client or what CQL is causing the
large mutation.

Any thoughts on how to mitigate this?

Kevin

-- 

We’re hiring if you know of any awesome Java Devops or Linux Operations
Engineers!

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
<https://plus.google.com/102718274791889610666/posts>

Re: [Marketing Mail] Re: Memory leak and lockup on our 2.2.7 Cassandra cluster.

We usually use 100 per every 5 minutes.. but you're right.  We might
actually move this use case over to using Elasticsearch in the next couple
of weeks.

On Wed, Aug 3, 2016 at 11:09 AM, Jonathan Haddad <j...@jonhaddad.com> wrote:

> Kevin,
>
> "Our scheme uses large buckets of content where we write to a
> bucket/partition for 5 minutes, then move to a new one."
>
> Are you writing to a single partition and only that partition for 5
> minutes?  If so, you should really rethink your data model.  This method
> does not scale as you add nodes, it can only scale vertically.
>
> On Wed, Aug 3, 2016 at 9:24 AM Reynald Bourtembourg <
> reynald.bourtembo...@esrf.fr> wrote:
>
>> Hi,
>>
>> Maybe Ben was referring to this issue which has been mentioned recently
>> on this mailing list:
>> https://issues.apache.org/jira/browse/CASSANDRA-11887
>>
>> Cheers,
>> Reynald
>>
>>
>> On 03/08/2016 18:09, Romain Hardouin wrote:
>>
>> > Curious why the 2.2 to 3.x upgrade path is risky at best.
>> I guess that upgrade from 2.2 is less tested by DataStax QA because DSE4
>> used C* 2.1, not 2.2.
>> I would say the safest upgrade is 2.1 to 3.0.x
>>
>> Best,
>>
>> Romain
>>
>>
>>


-- 

We’re hiring if you know of any awesome Java Devops or Linux Operations
Engineers!

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
<https://plus.google.com/102718274791889610666/posts>

Re: Memory leak and lockup on our 2.2.7 Cassandra cluster.

DuyHai.  Yes.  We're generally happy with our disk throughput.  We're on
all SSD and have about 60 boxes.  The amount of data written isn't THAT
much.  Maybe 5GB max... but its over 60 boxes.



On Wed, Aug 3, 2016 at 3:49 AM, DuyHai Doan <doanduy...@gmail.com> wrote:

> On a side node, do you monitor your disk I/O to see whether the disk
> bandwidth can catch up with the huge spikes in write ? Use dstat during the
> insert storm to see if you have big values for CPU wait
>
> On Wed, Aug 3, 2016 at 12:41 PM, Ben Slater <ben.sla...@instaclustr.com>
> wrote:
>
>> Yes, looks like you have a (at least one) 100MB partition which is big
>> enough to cause issues. When you do lots of writes to the large partition
>> it is likely to end up getting compacted (as per the log) and compactions
>> often use a lot of memory / cause a lot of GC when they hit large
>> partitions. This, in addition to the write load is probably pushing you
>> over the edge.
>>
>> There are some improvements in 3.6 that might help (
>> https://issues.apache.org/jira/browse/CASSANDRA-11206) but the 2.2 to
>> 3.x upgrade path seems risky at best at the moment. In any event, your best
>> solution would be to find a way to make your partitions smaller (like
>> 1/10th of the size).
>>
>> Cheers
>> Ben
>> <https://issues.apache.org/jira/browse/CASSANDRA-11206>
>>
>> On Wed, 3 Aug 2016 at 12:35 Kevin Burton <bur...@spinn3r.com> wrote:
>>
>>> I have a theory as to what I think is happening here.
>>>
>>> There is a correlation between the massive content all at once, and our
>>> outags.
>>>
>>> Our scheme uses large buckets of content where we write to a
>>> bucket/partition for 5 minutes, then move to a new one.  This way we can
>>> page through buckets.
>>>
>>> I think what's happening is that CS is reading the entire partition into
>>> memory, then slicing through it... which would explain why its running out
>>> of memory.
>>>
>>> system.log:WARN  [CompactionExecutor:294] 2016-08-03 02:01:55,659
>>> BigTableWriter.java:184 - Writing large partition
>>> blogindex/content_legacy_2016_08_02:1470154500099 (106107128 bytes)
>>>
>>> On Tue, Aug 2, 2016 at 6:43 PM, Kevin Burton <bur...@spinn3r.com> wrote:
>>>
>>>> We have a 60 node CS cluster running 2.2.7 and about 20GB of RAM
>>>> allocated to each C* node.  We're aware of the recommended 8GB limit to
>>>> keep GCs low but our memory has been creeping up (probably) related to this
>>>> bug.
>>>>
>>>> Here's what we're seeing... if we do a low level of writes we think
>>>> everything generally looks good.
>>>>
>>>> What happens is that we then need to catch up and then do a TON of
>>>> writes all in a small time window.  Then CS nodes start dropping like
>>>> flies.  Some of them just GC frequently and are able to recover. When they
>>>> GC like this we see GC pause in the 30 second range which then cause them
>>>> to not gossip for a while and they drop out of the cluster.
>>>>
>>>> This happens as a flurry around the cluster so we're not always able to
>>>> catch which ones are doing it as they recover. However, if we have 3 down,
>>>> we mostly have a locked up cluster.  Writes don't complete and our app
>>>> essentially locks up.
>>>>
>>>> SOME of the boxes never recover. I'm in this state now.  We have t3-5
>>>> nodes that are in GC storms which they won't recover from.
>>>>
>>>> I reconfigured the GC settings to enable jstat.
>>>>
>>>> I was able to catch it while it was happening:
>>>>
>>>> ^Croot@util0067 ~ # sudo -u cassandra jstat -gcutil 4235 2500
>>>>   S0 S1 E  O  M CCSYGC YGCTFGCFGCT
>>>> GCT
>>>>   0.00 100.00 100.00  94.76  97.60  93.06  10435 1686.191   471
>>>> 1139.142 2825.332
>>>>   0.00 100.00 100.00  94.76  97.60  93.06  10435 1686.191   471
>>>> 1139.142 2825.332
>>>>   0.00 100.00 100.00  94.76  97.60  93.06  10435 1686.191   471
>>>> 1139.142 2825.332
>>>>   0.00 100.00 100.00  94.76  97.60  93.06  10435 1686.191   471
>>>> 1139.142 2825.332
>>>>   0.00 100.00 100.00  94.76  97.60  93.06  10435 1686.191   471
>>>> 1139.142 2825.332
>>>>   0.00 100.00 100.00  94.76  97.60  93.06  10435 1686.191   471
>>>> 1139.142 282

Re: Memory leak and lockup on our 2.2.7 Cassandra cluster.

Curious why the 2.2 to 3.x upgrade path is risky at best. Do you mean that
this is just for OUR use case since we're having some issues or that the
upgrade path is risky in general?

On Wed, Aug 3, 2016 at 3:41 AM, Ben Slater <ben.sla...@instaclustr.com>
wrote:

> Yes, looks like you have a (at least one) 100MB partition which is big
> enough to cause issues. When you do lots of writes to the large partition
> it is likely to end up getting compacted (as per the log) and compactions
> often use a lot of memory / cause a lot of GC when they hit large
> partitions. This, in addition to the write load is probably pushing you
> over the edge.
>
> There are some improvements in 3.6 that might help (
> https://issues.apache.org/jira/browse/CASSANDRA-11206) but the 2.2 to 3.x
> upgrade path seems risky at best at the moment. In any event, your best
> solution would be to find a way to make your partitions smaller (like
> 1/10th of the size).
>
> Cheers
> Ben
> <https://issues.apache.org/jira/browse/CASSANDRA-11206>
>
> On Wed, 3 Aug 2016 at 12:35 Kevin Burton <bur...@spinn3r.com> wrote:
>
>> I have a theory as to what I think is happening here.
>>
>> There is a correlation between the massive content all at once, and our
>> outags.
>>
>> Our scheme uses large buckets of content where we write to a
>> bucket/partition for 5 minutes, then move to a new one.  This way we can
>> page through buckets.
>>
>> I think what's happening is that CS is reading the entire partition into
>> memory, then slicing through it... which would explain why its running out
>> of memory.
>>
>> system.log:WARN  [CompactionExecutor:294] 2016-08-03 02:01:55,659
>> BigTableWriter.java:184 - Writing large partition
>> blogindex/content_legacy_2016_08_02:1470154500099 (106107128 bytes)
>>
>> On Tue, Aug 2, 2016 at 6:43 PM, Kevin Burton <bur...@spinn3r.com> wrote:
>>
>>> We have a 60 node CS cluster running 2.2.7 and about 20GB of RAM
>>> allocated to each C* node.  We're aware of the recommended 8GB limit to
>>> keep GCs low but our memory has been creeping up (probably) related to this
>>> bug.
>>>
>>> Here's what we're seeing... if we do a low level of writes we think
>>> everything generally looks good.
>>>
>>> What happens is that we then need to catch up and then do a TON of
>>> writes all in a small time window.  Then CS nodes start dropping like
>>> flies.  Some of them just GC frequently and are able to recover. When they
>>> GC like this we see GC pause in the 30 second range which then cause them
>>> to not gossip for a while and they drop out of the cluster.
>>>
>>> This happens as a flurry around the cluster so we're not always able to
>>> catch which ones are doing it as they recover. However, if we have 3 down,
>>> we mostly have a locked up cluster.  Writes don't complete and our app
>>> essentially locks up.
>>>
>>> SOME of the boxes never recover. I'm in this state now.  We have t3-5
>>> nodes that are in GC storms which they won't recover from.
>>>
>>> I reconfigured the GC settings to enable jstat.
>>>
>>> I was able to catch it while it was happening:
>>>
>>> ^Croot@util0067 ~ # sudo -u cassandra jstat -gcutil 4235 2500
>>>   S0 S1 E  O  M CCSYGC YGCTFGCFGCT
>>>   GCT
>>>   0.00 100.00 100.00  94.76  97.60  93.06  10435 1686.191   471 1139.142
>>> 2825.332
>>>   0.00 100.00 100.00  94.76  97.60  93.06  10435 1686.191   471 1139.142
>>> 2825.332
>>>   0.00 100.00 100.00  94.76  97.60  93.06  10435 1686.191   471 1139.142
>>> 2825.332
>>>   0.00 100.00 100.00  94.76  97.60  93.06  10435 1686.191   471 1139.142
>>> 2825.332
>>>   0.00 100.00 100.00  94.76  97.60  93.06  10435 1686.191   471 1139.142
>>> 2825.332
>>>   0.00 100.00 100.00  94.76  97.60  93.06  10435 1686.191   471 1139.142
>>> 2825.332
>>>
>>> ... as you can see the box is legitimately out of memory.  S0, S1, E and
>>> O are all completely full.
>>>
>>> I'm not sure were to go from here.  I think 20GB for our work load is
>>> more than reasonable.
>>>
>>> 90% of the time they're well below 10GB of RAM used.  While I was
>>> watching this box I was seeing 30% RAM used until it decided to climb to
>>> 100%
>>>
>>> Any advice on what do do next... I don't see anything obvious in the
>>> logs to signal a problem.
>>>
>>>

Re: Memory leak and lockup on our 2.2.7 Cassandra cluster.

2016-08-02 Thread Kevin Burton

I have a theory as to what I think is happening here.

There is a correlation between the massive content all at once, and our
outags.

Our scheme uses large buckets of content where we write to a
bucket/partition for 5 minutes, then move to a new one.  This way we can
page through buckets.

I think what's happening is that CS is reading the entire partition into
memory, then slicing through it... which would explain why its running out
of memory.

system.log:WARN  [CompactionExecutor:294] 2016-08-03 02:01:55,659
BigTableWriter.java:184 - Writing large partition
blogindex/content_legacy_2016_08_02:1470154500099 (106107128 bytes)

On Tue, Aug 2, 2016 at 6:43 PM, Kevin Burton <bur...@spinn3r.com> wrote:

> We have a 60 node CS cluster running 2.2.7 and about 20GB of RAM allocated
> to each C* node.  We're aware of the recommended 8GB limit to keep GCs low
> but our memory has been creeping up (probably) related to this bug.
>
> Here's what we're seeing... if we do a low level of writes we think
> everything generally looks good.
>
> What happens is that we then need to catch up and then do a TON of writes
> all in a small time window.  Then CS nodes start dropping like flies.  Some
> of them just GC frequently and are able to recover. When they GC like this
> we see GC pause in the 30 second range which then cause them to not gossip
> for a while and they drop out of the cluster.
>
> This happens as a flurry around the cluster so we're not always able to
> catch which ones are doing it as they recover. However, if we have 3 down,
> we mostly have a locked up cluster.  Writes don't complete and our app
> essentially locks up.
>
> SOME of the boxes never recover. I'm in this state now.  We have t3-5
> nodes that are in GC storms which they won't recover from.
>
> I reconfigured the GC settings to enable jstat.
>
> I was able to catch it while it was happening:
>
> ^Croot@util0067 ~ # sudo -u cassandra jstat -gcutil 4235 2500
>   S0 S1 E  O  M CCSYGC YGCTFGCFGCT
> GCT
>   0.00 100.00 100.00  94.76  97.60  93.06  10435 1686.191   471 1139.142
> 2825.332
>   0.00 100.00 100.00  94.76  97.60  93.06  10435 1686.191   471 1139.142
> 2825.332
>   0.00 100.00 100.00  94.76  97.60  93.06  10435 1686.191   471 1139.142
> 2825.332
>   0.00 100.00 100.00  94.76  97.60  93.06  10435 1686.191   471 1139.142
> 2825.332
>   0.00 100.00 100.00  94.76  97.60  93.06  10435 1686.191   471 1139.142
> 2825.332
>   0.00 100.00 100.00  94.76  97.60  93.06  10435 1686.191   471 1139.142
> 2825.332
>
> ... as you can see the box is legitimately out of memory.  S0, S1, E and O
> are all completely full.
>
> I'm not sure were to go from here.  I think 20GB for our work load is more
> than reasonable.
>
> 90% of the time they're well below 10GB of RAM used.  While I was watching
> this box I was seeing 30% RAM used until it decided to climb to 100%
>
> Any advice on what do do next... I don't see anything obvious in the logs
> to signal a problem.
>
> I attached all the command line arguments we use.  Note that I think that
> the cassandra-env.sh script puts them in there twice.
>
> -ea
> -javaagent:/usr/share/cassandra/lib/jamm-0.3.0.jar
> -XX:+CMSClassUnloadingEnabled
> -XX:+UseThreadPriorities
> -XX:ThreadPriorityPolicy=42
> -Xms2M
> -Xmx2M
> -Xmn4096M
> -XX:+HeapDumpOnOutOfMemoryError
> -Xss256k
> -XX:StringTableSize=103
> -XX:+UseParNewGC
> -XX:+UseConcMarkSweepGC
> -XX:+CMSParallelRemarkEnabled
> -XX:SurvivorRatio=8
> -XX:MaxTenuringThreshold=1
> -XX:CMSInitiatingOccupancyFraction=75
> -XX:+UseCMSInitiatingOccupancyOnly
> -XX:+UseTLAB
> -XX:CompileCommandFile=/hotspot_compiler
> -XX:CMSWaitDuration=1
> -XX:+CMSParallelInitialMarkEnabled
> -XX:+CMSEdenChunksRecordAlways
> -XX:CMSWaitDuration=1
> -XX:+UseCondCardMark
> -XX:+PrintGCDetails
> -XX:+PrintGCDateStamps
> -XX:+PrintHeapAtGC
> -XX:+PrintTenuringDistribution
> -XX:+PrintGCApplicationStoppedTime
> -XX:+PrintPromotionFailure
> -XX:PrintFLSStatistics=1
> -Xloggc:/var/log/cassandra/gc.log
> -XX:+UseGCLogFileRotation
> -XX:NumberOfGCLogFiles=10
> -XX:GCLogFileSize=10M
> -Djava.net.preferIPv4Stack=true
> -Dcom.sun.management.jmxremote.port=7199
> -Dcom.sun.management.jmxremote.rmi.port=7199
> -Dcom.sun.management.jmxremote.ssl=false
> -Dcom.sun.management.jmxremote.authenticate=false
> -Djava.library.path=/usr/share/cassandra/lib/sigar-bin
> -XX:+UnlockCommercialFeatures
> -XX:+FlightRecorder
> -ea
> -javaagent:/usr/share/cassandra/lib/jamm-0.3.0.jar
> -XX:+CMSClassUnloadingEnabled
> -XX:+UseThreadPriorities
> -XX:ThreadPriorityPolicy=42
> -Xms2M
> -Xmx2M
> -Xmn4096M
> -XX:+HeapDumpOnO

Memory leak and lockup on our 2.2.7 Cassandra cluster.

2016-08-02 Thread Kevin Burton

We have a 60 node CS cluster running 2.2.7 and about 20GB of RAM allocated
to each C* node.  We're aware of the recommended 8GB limit to keep GCs low
but our memory has been creeping up (probably) related to this bug.

Here's what we're seeing... if we do a low level of writes we think
everything generally looks good.

What happens is that we then need to catch up and then do a TON of writes
all in a small time window.  Then CS nodes start dropping like flies.  Some
of them just GC frequently and are able to recover. When they GC like this
we see GC pause in the 30 second range which then cause them to not gossip
for a while and they drop out of the cluster.

This happens as a flurry around the cluster so we're not always able to
catch which ones are doing it as they recover. However, if we have 3 down,
we mostly have a locked up cluster.  Writes don't complete and our app
essentially locks up.

SOME of the boxes never recover. I'm in this state now.  We have t3-5 nodes
that are in GC storms which they won't recover from.

I reconfigured the GC settings to enable jstat.

I was able to catch it while it was happening:

^Croot@util0067 ~ # sudo -u cassandra jstat -gcutil 4235 2500
  S0 S1 E  O  M CCSYGC YGCTFGCFGCT
GCT
  0.00 100.00 100.00  94.76  97.60  93.06  10435 1686.191   471 1139.142
2825.332
  0.00 100.00 100.00  94.76  97.60  93.06  10435 1686.191   471 1139.142
2825.332
  0.00 100.00 100.00  94.76  97.60  93.06  10435 1686.191   471 1139.142
2825.332
  0.00 100.00 100.00  94.76  97.60  93.06  10435 1686.191   471 1139.142
2825.332
  0.00 100.00 100.00  94.76  97.60  93.06  10435 1686.191   471 1139.142
2825.332
  0.00 100.00 100.00  94.76  97.60  93.06  10435 1686.191   471 1139.142
2825.332

... as you can see the box is legitimately out of memory.  S0, S1, E and O
are all completely full.

I'm not sure were to go from here.  I think 20GB for our work load is more
than reasonable.

90% of the time they're well below 10GB of RAM used.  While I was watching
this box I was seeing 30% RAM used until it decided to climb to 100%

Any advice on what do do next... I don't see anything obvious in the logs
to signal a problem.

I attached all the command line arguments we use.  Note that I think that
the cassandra-env.sh script puts them in there twice.

-ea
-javaagent:/usr/share/cassandra/lib/jamm-0.3.0.jar
-XX:+CMSClassUnloadingEnabled
-XX:+UseThreadPriorities
-XX:ThreadPriorityPolicy=42
-Xms2M
-Xmx2M
-Xmn4096M
-XX:+HeapDumpOnOutOfMemoryError
-Xss256k
-XX:StringTableSize=103
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:+CMSParallelRemarkEnabled
-XX:SurvivorRatio=8
-XX:MaxTenuringThreshold=1
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
-XX:+UseTLAB
-XX:CompileCommandFile=/hotspot_compiler
-XX:CMSWaitDuration=1
-XX:+CMSParallelInitialMarkEnabled
-XX:+CMSEdenChunksRecordAlways
-XX:CMSWaitDuration=1
-XX:+UseCondCardMark
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintHeapAtGC
-XX:+PrintTenuringDistribution
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintPromotionFailure
-XX:PrintFLSStatistics=1
-Xloggc:/var/log/cassandra/gc.log
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=10
-XX:GCLogFileSize=10M
-Djava.net.preferIPv4Stack=true
-Dcom.sun.management.jmxremote.port=7199
-Dcom.sun.management.jmxremote.rmi.port=7199
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.authenticate=false
-Djava.library.path=/usr/share/cassandra/lib/sigar-bin
-XX:+UnlockCommercialFeatures
-XX:+FlightRecorder
-ea
-javaagent:/usr/share/cassandra/lib/jamm-0.3.0.jar
-XX:+CMSClassUnloadingEnabled
-XX:+UseThreadPriorities
-XX:ThreadPriorityPolicy=42
-Xms2M
-Xmx2M
-Xmn4096M
-XX:+HeapDumpOnOutOfMemoryError
-Xss256k
-XX:StringTableSize=103
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:+CMSParallelRemarkEnabled
-XX:SurvivorRatio=8
-XX:MaxTenuringThreshold=1
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
-XX:+UseTLAB
-XX:CompileCommandFile=/etc/cassandra/hotspot_compiler
-XX:CMSWaitDuration=1
-XX:+CMSParallelInitialMarkEnabled
-XX:+CMSEdenChunksRecordAlways
-XX:CMSWaitDuration=1
-XX:+UseCondCardMark
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintHeapAtGC
-XX:+PrintTenuringDistribution
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintPromotionFailure
-XX:PrintFLSStatistics=1
-Xloggc:/var/log/cassandra/gc.log
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=10
-XX:GCLogFileSize=10M
-Djava.net.preferIPv4Stack=true
-Dcom.sun.management.jmxremote.port=7199
-Dcom.sun.management.jmxremote.rmi.port=7199
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.authenticate=false
-Djava.library.path=/usr/share/cassandra/lib/sigar-bin
-XX:+UnlockCommercialFeatures
-XX:+FlightRecorder
-Dlogback.configurationFile=logback.xml
-Dcassandra.logdir=/var/log/cassandra
-Dcassandra.storagedir=
-Dcassandra-pidfile=/var/run/cassandra/cassandra.pid


-- 

We’re hiring if you know of any awesome

Re: Are counters faster than CAS or vice versa?

2016-07-20 Thread Kevin Burton

On Wed, Jul 20, 2016 at 11:53 AM, Jeff Jirsa 
wrote:

> Can you tolerate the value being “close, but not perfectly accurate”? If
> not, don’t use a counter.
>
>
>

yeah.. agreed.. this is a problem which is something I was considering.  I
guess it depends on whether they are 10x faster..

-- 

We’re hiring if you know of any awesome Java Devops or Linux Operations
Engineers!

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile

Are counters faster than CAS or vice versa?

2016-07-20 Thread Kevin Burton

We ended up implementing a task/queue system which uses a global pointer.

Basically the pointer just increments ... so we have thousands of tasks
that just increment this one pointer.

The problem is that we're seeing contention on it and not being able to
write this record properly.

We're just doing a CAS operation now to read the existing value, then
increment it.

I think it might have been better to implement this as a counter.  Would
that be inherently faster or would a CAS be about the same?

I can't really test it without deploying it so I figured I would just ask
here first.

Kevin

-- 

We’re hiring if you know of any awesome Java Devops or Linux Operations
Engineers!

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
<https://plus.google.com/102718274791889610666/posts>

Open source equivalents of OpsCenter

2016-07-13 Thread Kevin O'Connor

Now that OpsCenter doesn't work with open source installs, are there any
runs at an open source equivalent? I'd be more interested in looking at
metrics of a running cluster and doing other tasks like managing
repairs/rolling restarts more so than historical data.

Re: Latency overhead on Cassandra cluster deployed on multiple AZs (AWS)

2016-04-12 Thread Kevin O'Connor

Are you in VPC or EC2 Classic? Are you using enhanced networking?

On Tue, Apr 12, 2016 at 9:52 AM, Alessandro Pieri  wrote:

> Hi Jack,
>
> As mentioned before I've used m3.xlarge instance types together with two
> ephemeral disks in raid 0 and, according to Amazon, they have "high"
> network performance.
>
> I ran many tests starting with a brand-new cluster every time and I got
> consistent results.
>
> I believe there's something that I cannot explain yet with the client used
> by cassandra-stress to connect to the nodes, I'd like to understand why
> there is such a big difference:
>
> Multi-AZ, CL=ONE, "--nodes node1,node2,node3,node4,node5,node6" -> 95th
> percentile: 38.14ms
> Multi-AZ, CL=ONE, "--nodes node1" -> 95th percentile: 5.9ms
>
> Hope you can help to figure it out.
>
> Cheers,
> Alessandro
>
>
>
>
> On Tue, Apr 12, 2016 at 5:43 PM, Jack Krupansky 
> wrote:
>
>> Which instance type are you using? Some may be throttled for EBS access,
>> so you could bump into a rate limit, and who knows what AWS will do at that
>> point.
>>
>> -- Jack Krupansky
>>
>> On Tue, Apr 12, 2016 at 6:02 AM, Alessandro Pieri <
>> alessan...@getstream.io> wrote:
>>
>>> Thanks Chris for your reply.
>>>
>>> I ran the tests 3 times for 20 minutes/each and I monitored the network
>>> latency in the meanwhile, it was very low (even the 99th percentile).
>>>
>>> I didn't notice any cpu spike caused by the GC but, as you pointed out,
>>> I will look into the GC log, just to be sure.
>>>
>>> In order to avoid the problem you mentioned with EBS and to keep the
>>> deviation under control I used two ephemeral disks in raid 0.
>>>
>>> I think the odd results come from the way cassandra-stress deals with
>>> multiple nodes. As soon as possible I will go through the Java code to get
>>> some more detail.
>>>
>>> If you have something else in your mind please let me know, your
>>> comments were really appreciated.
>>>
>>> Cheers,
>>> Alessandro
>>>
>>>
>>> On Mon, Apr 11, 2016 at 4:15 PM, Chris Lohfink 
>>> wrote:
>>>
 Where do you get the ~1ms latency between AZs? Comparing a short term
 average to a 99th percentile isn't very fair.

 "Over the last month, the median is 2.09 ms, 90th percentile is
 20ms, 99th percentile is 47ms." - per
 https://www.quora.com/What-are-typical-ping-times-between-different-EC2-availability-zones-within-the-same-region

 Are you using EBS? That would further impact latency on reads and GCs
 will always cause hiccups in the 99th+.

 Chris


 On Mon, Apr 11, 2016 at 7:57 AM, Alessandro Pieri 
 wrote:

> Hi everyone,
>
> Last week I ran some tests to estimate the latency overhead introduces
> in a Cassandra cluster by a multi availability zones setup on AWS EC2.
>
> I started a Cassandra cluster of 6 nodes deployed on 3 different AZs
> (2 nodes/AZ).
>
> Then, I used cassandra-stress to create an INSERT (write) test of 20M
> entries with a replication factor = 3, right after, I ran cassandra-stress
> again to READ 10M entries.
>
> Well, I got the following unexpected result:
>
> Single-AZ, CL=ONE -> median/95th percentile/99th percentile:
> 1.06ms/7.41ms/55.81ms
> Multi-AZ, CL=ONE -> median/95th percentile/99th percentile:
> 1.16ms/38.14ms/47.75ms
>
> Basically, switching to the multi-AZ setup the latency increased of
> ~30ms. That's too much considering the the average network latency between
> AZs on AWS is ~1ms.
>
> Since I couldn't find anything to explain those results, I decided to
> run the cassandra-stress specifying only a single node entry (i.e. 
> "--nodes
> node1" instead of "--nodes node1,node2,node3,node4,node5,node6") and
> surprisingly the latency went back to 5.9 ms.
>
> Trying to recap:
>
> Multi-AZ, CL=ONE, "--nodes node1,node2,node3,node4,node5,node6" ->
> 95th percentile: 38.14ms
> Multi-AZ, CL=ONE, "--nodes node1" -> 95th percentile: 5.9ms
>
> For the sake of completeness I've ran a further test using a
> consistency level = LOCAL_QUORUM and the test did not show any large
> variance with using a single node or multiple ones.
>
> Do you guys know what could be the reason?
>
> The test were executed on a m3.xlarge (network optimized) using the
> DataStax AMI 2.6.3 running Cassandra v2.0.15.
>
> Thank you in advance for your help.
>
> Cheers,
> Alessandro
>


>>>
>>>
>>> --
>>> *Alessandro Pieri*
>>> *Software Architect @ Stream.io Inc*
>>> e-Mail: alessan...@getstream.io - twitter: sirio7g
>>> 
>>>
>>>
>>
>

Re: Efficiently filtering results directly in CS

2016-04-08 Thread Kevin Burton

Ha..  Yes... C*...  I guess I need something like coprocessors in bigtable.


On Fri, Apr 8, 2016 at 1:49 AM, vincent gromakowski <
vincent.gromakow...@gmail.com> wrote:

> c* I suppose
>
> 2016-04-07 19:30 GMT+02:00 Jonathan Haddad <j...@jonhaddad.com>:
>
>> What is CS?
>>
>> On Thu, Apr 7, 2016 at 10:03 AM Kevin Burton <bur...@spinn3r.com> wrote:
>>
>>> I have a paging model whereby we stream data from CS by fetching 'pages'
>>> thereby reading (sequentially) entire datasets.
>>>
>>> We're using the bucket approach where we write data for 5 minutes, then
>>> we can just fetch the bucket for that range.
>>>
>>> Our app now has TONS of data and we have a piece of middleware that
>>> filters it based on the client requests.
>>>
>>> So if they only want english they just get english and filter away about
>>> 60% of our data.
>>>
>>> but it doesn't support condition pushdown.  So ALL this data has to be
>>> sent from our CS boxes to our middleware and filtered there (wasting a lot
>>> of network IO).
>>>
>>> Is there away (including refactoring the code) that I could push this
>>> this into CS?  Maybe some way I could discovery the CS topology and put
>>> daemons on each of our CS boxes and fetch from CS directly (doing the
>>> filtering there).
>>>
>>> Thoughts?
>>>
>>> --
>>>
>>> We’re hiring if you know of any awesome Java Devops or Linux Operations
>>> Engineers!
>>>
>>> Founder/CEO Spinn3r.com
>>> Location: *San Francisco, CA*
>>> blog: http://burtonator.wordpress.com
>>> … or check out my Google+ profile
>>> <https://plus.google.com/102718274791889610666/posts>
>>>
>>>
>


-- 

We’re hiring if you know of any awesome Java Devops or Linux Operations
Engineers!

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
<https://plus.google.com/102718274791889610666/posts>

Efficiently filtering results directly in CS

2016-04-07 Thread Kevin Burton

I have a paging model whereby we stream data from CS by fetching 'pages'
thereby reading (sequentially) entire datasets.

We're using the bucket approach where we write data for 5 minutes, then we
can just fetch the bucket for that range.

Our app now has TONS of data and we have a piece of middleware that filters
it based on the client requests.

So if they only want english they just get english and filter away about
60% of our data.

but it doesn't support condition pushdown.  So ALL this data has to be sent
from our CS boxes to our middleware and filtered there (wasting a lot of
network IO).

Is there away (including refactoring the code) that I could push this this
into CS?  Maybe some way I could discovery the CS topology and put daemons
on each of our CS boxes and fetch from CS directly (doing the filtering
there).

Thoughts?

-- 

We’re hiring if you know of any awesome Java Devops or Linux Operations
Engineers!

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile

[ANNOUNCE] YCSB 0.7.0 Release

2016-02-26 Thread Kevin Risden

On behalf of the development community, I am pleased to announce the
release of YCSB 0.7.0.

Highlights:

* GemFire binding replaced with Apache Geode (incubating) binding
* Apache Solr binding was added
* OrientDB binding improvements
* HBase Kerberos support and use single connection
* Accumulo improvements
* JDBC improvements
* Couchbase scan implementation
* MongoDB improvements
* Elasticsearch version increase to 2.1.1

Full release notes, including links to source and convenience binaries:
https://github.com/brianfrankcooper/YCSB/releases/tag/0.7.0

This release covers changes from the last 1 month.

Faster version of 'nodetool status'

2016-02-12 Thread Kevin Burton

Is there a faster way to get the output of 'nodetool status' ?

I want us to more aggressively monitor for 'nodetool status' and boxes
being DN...

I was thinking something like jolokia and REST but I'm not sure if there
are variables exported by jolokia for nodetool status.

Thoughts?

-- 

We’re hiring if you know of any awesome Java Devops or Linux Operations
Engineers!

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile

Re: automated CREATE TABLE just nuked my cluster after a 2.0 -> 2.1 upgrade....

2016-01-23 Thread Kevin Burton

Once the CREATE TABLE returns in cqlsh (or programatically) is it safe to
assume it's on all nodes at that point?

If not I'll have to put in even more logic to handle this case..

On Fri, Jan 22, 2016 at 9:22 PM, Jack Krupansky <jack.krupan...@gmail.com>
wrote:

> I recall that there was some discussion last year about this issue of how
> risky it is to do an automated CREATE TABLE IF NOT EXISTS due to the
> unpredictable amount of time it takes for the table creation to fully
> propagate around the full cluster. I think it was recognized as a real
> problem, but without an immediate solution, so the recommended practice for
> now is to only manually perform the operation (sure, it can be scripted,
> but only under manual control) to assure that the operation completes and
> that only one attempt is made to create the table. I don't recall if there
> was a specific Jira assigned, and the antipattern doc doesn't appear to
> reference this scenario. Maybe a committer can shed some more light.
>
> -- Jack Krupansky
>
> On Fri, Jan 22, 2016 at 10:29 PM, Kevin Burton <bur...@spinn3r.com> wrote:
>
>> I sort of agree.. but we are also considering migrating to hourly
>> tables.. and what if the single script doesn't run.
>>
>> I like having N nodes make changes like this because in my experience
>> that central / single box will usually fail at the wrong time :-/
>>
>>
>>
>> On Fri, Jan 22, 2016 at 6:47 PM, Jonathan Haddad <j...@jonhaddad.com>
>> wrote:
>>
>>> Instead of using ZK, why not solve your concurrency problem by removing
>>> it?  By that, I mean simply have 1 process that creates all your tables
>>> instead of creating a race condition intentionally?
>>>
>>> On Fri, Jan 22, 2016 at 6:16 PM Kevin Burton <bur...@spinn3r.com> wrote:
>>>
>>>> Not sure if this is a bug or not or kind of a *fuzzy* area.
>>>>
>>>> In 2.0 this worked fine.
>>>>
>>>> We have a bunch of automated scripts that go through and create
>>>> tables... one per day.
>>>>
>>>> at midnight UTC our entire CQL went offline.. .took down our whole app.
>>>>  ;-/
>>>>
>>>> The resolution was a full CQL shut down and then a drop table to remove
>>>> the bad tables...
>>>>
>>>> pretty sure the issue was with schema disagreement.
>>>>
>>>> All our CREATE TABLE use IF NOT EXISTS but I think the IF NOT
>>>> EXISTS only checks locally?
>>>>
>>>> My work around is going to be to use zookeeper to create a mutex lock
>>>> during this operation.
>>>>
>>>> Any other things I should avoid?
>>>>
>>>>
>>>> --
>>>>
>>>> We’re hiring if you know of any awesome Java Devops or Linux Operations
>>>> Engineers!
>>>>
>>>> Founder/CEO Spinn3r.com
>>>> Location: *San Francisco, CA*
>>>> blog: http://burtonator.wordpress.com
>>>> … or check out my Google+ profile
>>>> <https://plus.google.com/102718274791889610666/posts>
>>>>
>>>>
>>
>>
>> --
>>
>> We’re hiring if you know of any awesome Java Devops or Linux Operations
>> Engineers!
>>
>> Founder/CEO Spinn3r.com
>> Location: *San Francisco, CA*
>> blog: http://burtonator.wordpress.com
>> … or check out my Google+ profile
>> <https://plus.google.com/102718274791889610666/posts>
>>
>>
>


-- 

We’re hiring if you know of any awesome Java Devops or Linux Operations
Engineers!

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
<https://plus.google.com/102718274791889610666/posts>

Re: automated CREATE TABLE just nuked my cluster after a 2.0 -> 2.1 upgrade....

2016-01-22 Thread Kevin Burton

I sort of agree.. but we are also considering migrating to hourly tables..
and what if the single script doesn't run.

I like having N nodes make changes like this because in my experience that
central / single box will usually fail at the wrong time :-/



On Fri, Jan 22, 2016 at 6:47 PM, Jonathan Haddad <j...@jonhaddad.com> wrote:

> Instead of using ZK, why not solve your concurrency problem by removing
> it?  By that, I mean simply have 1 process that creates all your tables
> instead of creating a race condition intentionally?
>
> On Fri, Jan 22, 2016 at 6:16 PM Kevin Burton <bur...@spinn3r.com> wrote:
>
>> Not sure if this is a bug or not or kind of a *fuzzy* area.
>>
>> In 2.0 this worked fine.
>>
>> We have a bunch of automated scripts that go through and create tables...
>> one per day.
>>
>> at midnight UTC our entire CQL went offline.. .took down our whole app.
>>  ;-/
>>
>> The resolution was a full CQL shut down and then a drop table to remove
>> the bad tables...
>>
>> pretty sure the issue was with schema disagreement.
>>
>> All our CREATE TABLE use IF NOT EXISTS but I think the IF NOT EXISTS
>> only checks locally?
>>
>> My work around is going to be to use zookeeper to create a mutex lock
>> during this operation.
>>
>> Any other things I should avoid?
>>
>>
>> --
>>
>> We’re hiring if you know of any awesome Java Devops or Linux Operations
>> Engineers!
>>
>> Founder/CEO Spinn3r.com
>> Location: *San Francisco, CA*
>> blog: http://burtonator.wordpress.com
>> … or check out my Google+ profile
>> <https://plus.google.com/102718274791889610666/posts>
>>
>>


-- 

We’re hiring if you know of any awesome Java Devops or Linux Operations
Engineers!

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
<https://plus.google.com/102718274791889610666/posts>

automated CREATE TABLE just nuked my cluster after a 2.0 -> 2.1 upgrade....

2016-01-22 Thread Kevin Burton

Not sure if this is a bug or not or kind of a *fuzzy* area.

In 2.0 this worked fine.

We have a bunch of automated scripts that go through and create tables...
one per day.

at midnight UTC our entire CQL went offline.. .took down our whole app.  ;-/

The resolution was a full CQL shut down and then a drop table to remove the
bad tables...

pretty sure the issue was with schema disagreement.

All our CREATE TABLE use IF NOT EXISTS but I think the IF NOT EXISTS
only checks locally?

My work around is going to be to use zookeeper to create a mutex lock
during this operation.

Any other things I should avoid?


-- 

We’re hiring if you know of any awesome Java Devops or Linux Operations
Engineers!

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile

Strategy / order for upgradesstables during rolling upgrade.

2016-01-21 Thread Kevin Burton

I think there are two strategies to upgradesstables after a release.

We're doing a 2.0 to 2.1 upgrade (been procrastinating here).

I think we can go with B below... Would you agree?

Strategy A:

- foreach server
- upgrade to 2.1
- nodetool upgradesstables

Strategy B:

- foreach server
- upgrade to 2.1
- foreach server
- nodetool upgradesstables


-- 

We’re hiring if you know of any awesome Java Devops or Linux Operations
Engineers!

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile

Re: Using cassandra a BLOB store / web cache.

2016-01-20 Thread Kevin Burton

There's also the 'support' issue.. C* is hard enough as it is... maybe you
can bring in another system like ES or HDFS but the more you bring in the
more your complexity REALLY goes through the roof.

Better to keep things simple.

I really like the chunking idea for C*... seems like an easy way to store
tons of data.

On Tue, Jan 19, 2016 at 4:13 PM, Robert Coli  wrote:

> On Tue, Jan 19, 2016 at 2:07 PM, Richard L. Burton III  > wrote:
>
>> I would ask why do this over say HDFS, S3, etc. seems like this problem
>> has been solved with other solutions that are specifically designed for
>> blob storage?
>>
>
> HDFS's default block size is 64mb. If you are storing objects smaller than
> this, that might be bad! It also doesn't have http transport, which other
> things do.
>
> Etc..
>
> =Rob
>
>

-- 

We’re hiring if you know of any awesome Java Devops or Linux Operations
Engineers!

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile

Re: Using cassandra a BLOB store / web cache.

2016-01-19 Thread Kevin Burton

Lots of interesting feedback.. I like the ideal of chunking the IO into
pages.. it would require more thinking but I could even do cassandra async
IO and async HTTP to serve the data and then use HTTP chunks for each
range.

On Tue, Jan 19, 2016 at 10:47 AM, Robert Coli <rc...@eventbrite.com> wrote:

> On Mon, Jan 18, 2016 at 6:52 PM, Kevin Burton <bur...@spinn3r.com> wrote:
>
>> Internally we have the need for a blob store for web content.  It's
>> MOSTLY key, ,value based but we'd like to have lookups by coarse grained
>> tags.
>>
>
> I know you know how to operate and scale MySQL, so I suggest MogileFS for
> the actual blob storage :
>
> https://github.com/mogilefs
>
> Then do some simple indexing in some search store. Done.
>
> =Rob
>
>

-- 

We’re hiring if you know of any awesome Java Devops or Linux Operations
Engineers!

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
<https://plus.google.com/102718274791889610666/posts>

Re: Cassandra is consuming a lot of disk space

2016-01-12 Thread Kevin O'Connor

Have you tried restarting? It's possible there's open file handles to
sstables that have been compacted away. You can verify by doing lsof and
grepping for DEL or deleted.

If it's not that, you can run nodetool cleanup on each node to scan all of
the sstables on disk and remove anything that it's not responsible for.
Generally this would only work if you added nodes recently.

On Tuesday, January 12, 2016, Rahul Ramesh  wrote:

> We have a 2 node Cassandra cluster with a replication factor of 2.
>
> The load factor on the nodes is around 350Gb
>
> Datacenter: Cassandra
> ==
> Address  RackStatus State   LoadOwns
>  Token
>
>   -5072018636360415943
> 172.31.7.91  rack1   Up Normal  328.5 GB100.00%
>   -7068746880841807701
> 172.31.7.92  rack1   Up Normal  351.7 GB100.00%
>   -5072018636360415943
>
> However,if I use df -h,
>
> /dev/xvdf   252G  223G   17G  94% /HDD1
> /dev/xvdg   493G  456G   12G  98% /HDD2
> /dev/xvdh   197G  167G   21G  90% /HDD3
>
>
> HDD1,2,3 contains only cassandra data. It amounts to close to 1Tb in one
> of the machine and in another machine it is close to 650Gb.
>
> I started repair 2 days ago, after running repair, the amount of disk
> space consumption has actually increased.
> I also checked if this is because of snapshots. nodetool listsnapshot
> intermittently lists a snapshot but it goes away after sometime.
>
> Can somebody please help me understand,
> 1. why so much disk space is consumed?
> 2. Why did it increase after repair?
> 3. Is there any way to recover from this state.
>
>
> Thanks,
> Rahul
>
>

Re: compact/repair shouldn't compete for normal compaction resources.

2015-10-19 Thread Kevin Burton

Yes.. .it's not currently possible :)

I think it should be.

Say the IO on your C* is at 60% utilization.

If you do a repair, this would require 120% utilization obviously not
possible, so now your app is down / offline until the repair finishes.

If you could throttle repair separately this would resolve this problem.

IF anyone else thinks this is an issue I'll create a JIRA.

On Mon, Oct 19, 2015 at 3:38 PM, Robert Coli <rc...@eventbrite.com> wrote:

> On Mon, Oct 19, 2015 at 9:30 AM, Kevin Burton <bur...@spinn3r.com> wrote:
>
>> I think the point I was trying to make is that on highly loaded boxes,
>>  repair should take lower priority than normal compactions.
>>
>
> You can manually do this by changing the thread priority of compaction
> threads which you somhow identify as doing repair related compaction...
>
> ... but incoming streamed SStables are compacted just as if they were
> flushed, so I'm pretty sure what you're asking for is not currently
> possible?
>
> =Rob
>
>

-- 

We’re hiring if you know of any awesome Java Devops or Linux Operations
Engineers!

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
<https://plus.google.com/102718274791889610666/posts>

Re: compact/repair shouldn't compete for normal compaction resources.

2015-10-19 Thread Kevin Burton

I think the point I was trying to make is that on highly loaded boxes,
 repair should take lower priority than normal compactions.

Having a throttle on *both* doesn't solve the problem.

So I need a

setcompactionthroughput

and a

setrepairthroughput

and total througput would be the sum of both.

On Mon, Oct 19, 2015 at 8:30 AM, Sebastian Estevez <
sebastian.este...@datastax.com> wrote:

> The validation compaction part of repair is susceptible to the compaction
> throttling knob `nodetool getcompactionthroughput`
> / `nodetool setcompactionthroughput` and you can use that to tune down the
> resources that are being used by repair.
>
> Check out this post by driftx on advanced repair techniques
> <http://www.datastax.com/dev/blog/advanced-repair-techniques>.
>
> Given your other question, I agree with Raj that it might be a good idea
> to decommission the new nodes rather than repairing depending on how much
> data has made it to them and how tight you were on resources before adding
> nodes.
>
>
> All the best,
>
>
> [image: datastax_logo.png] <http://www.datastax.com/>
>
> Sebastián Estévez
>
> Solutions Architect | 954 905 8615 | sebastian.este...@datastax.com
>
> [image: linkedin.png] <https://www.linkedin.com/company/datastax> [image:
> facebook.png] <https://www.facebook.com/datastax> [image: twitter.png]
> <https://twitter.com/datastax> [image: g+.png]
> <https://plus.google.com/+Datastax/about>
> <http://feeds.feedburner.com/datastax>
> <http://goog_410786983>
>
>
> <http://www.datastax.com/gartner-magic-quadrant-odbms>
>
> DataStax is the fastest, most scalable distributed database technology,
> delivering Apache Cassandra to the world’s most innovative enterprises.
> Datastax is built to be agile, always-on, and predictably scalable to any
> size. With more than 500 customers in 45 countries, DataStax is the
> database technology and transactional backbone of choice for the worlds
> most innovative companies such as Netflix, Adobe, Intuit, and eBay.
>
> On Sun, Oct 18, 2015 at 8:18 PM, Kevin Burton <bur...@spinn3r.com> wrote:
>
>> I'm doing a big nodetool repair right now and I'm pretty sure the added
>> overhead is impacting our performance.
>>
>> Shouldn't you be able to throttle repair so that normal compactions can
>> use most of the resources?
>>
>> --
>>
>> We’re hiring if you know of any awesome Java Devops or Linux Operations
>> Engineers!
>>
>> Founder/CEO Spinn3r.com
>> Location: *San Francisco, CA*
>> blog: http://burtonator.wordpress.com
>> … or check out my Google+ profile
>> <https://plus.google.com/102718274791889610666/posts>
>>
>>
>


-- 

We’re hiring if you know of any awesome Java Devops or Linux Operations
Engineers!

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
<https://plus.google.com/102718274791889610666/posts>

Re: Would we have data corruption if we bootstrapped 10 nodes at once?

2015-10-18 Thread Kevin Burton

ouch.. OK.. I think I really shot myself in the foot here then.  This might
be bad.

I'm not sure if I would have missing data.  I mean basically the data is on
the other nodes.. but the cluster has been running with 10 nodes
accidentally bootstrapped with auto_bootstrap=false.

So they have new data and seem to be missing values.

this is somewhat misleading... Initially if you start it up and run
nodetool status , it only returns one node.

So I assumed auto_bootstrap=false meant that it just doesn't join the
cluster.

I'm running a nodetool repair now to hopefully fix this.



On Sun, Oct 18, 2015 at 7:25 PM, Jeff Jirsa <jeff.ji...@crowdstrike.com>
wrote:

> auto_bootstrap=false tells it to join the cluster without running
> bootstrap – the node assumes it has all of the necessary data, and won’t
> stream any missing data.
>
> This generally violates consistency guarantees, but if done on a single
> node, is typically correctable with `nodetool repair`.
>
> If you do it on many  nodes at once, it’s possible that the new nodes
> could represent all 3 replicas of the data, but don’t physically have any
> of that data, leading to missing records.
>
>
>
> From: <burtonator2...@gmail.com> on behalf of Kevin Burton
> Reply-To: "user@cassandra.apache.org"
> Date: Sunday, October 18, 2015 at 3:44 PM
> To: "user@cassandra.apache.org"
> Subject: Re: Would we have data corruption if we bootstrapped 10 nodes at
> once?
>
> An shit.. I think we're seeing corruption.. missing records :-/
>
> On Sat, Oct 17, 2015 at 10:45 AM, Kevin Burton <bur...@spinn3r.com> wrote:
>
>> We just migrated from a 30 node cluster to a 45 node cluster. (so 15 new
>> nodes)
>>
>> By default we have auto_boostrap = false
>>
>> so we just push our config to the cluster, the cassandra daemons restart,
>> and they're not cluster members and are the only nodes in the cluster.
>>
>> Anyway.  While I was about 1/2 way done adding the 15 nodes,  I had about
>> 7 members of the cluster and 8 not yet joined.
>>
>> We are only doing 1 at a time because apparently bootstrapping more than
>> 1 is unsafe.
>>
>> I did a rolling restart whereby I went through and restarted all the
>> cassandra boxes.
>>
>> Somehow the new nodes auto boostrapped themselves EVEN though
>> auto_bootstrap=false.
>>
>> We don't have any errors.  Everything seems functional.  I'm just worried
>> about data loss.
>>
>> Thoughts?
>>
>> Kevin
>>
>> --
>>
>> We’re hiring if you know of any awesome Java Devops or Linux Operations
>> Engineers!
>>
>> Founder/CEO Spinn3r.com
>> Location: *San Francisco, CA*
>> blog: http://burtonator.wordpress.com
>> … or check out my Google+ profile
>> <https://plus.google.com/102718274791889610666/posts>
>>
>>
>
>
> --
>
> We’re hiring if you know of any awesome Java Devops or Linux Operations
> Engineers!
>
> Founder/CEO Spinn3r.com
> Location: *San Francisco, CA*
> blog: http://burtonator.wordpress.com
> … or check out my Google+ profile
> <https://plus.google.com/102718274791889610666/posts>
>
>


-- 

We’re hiring if you know of any awesome Java Devops or Linux Operations
Engineers!

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
<https://plus.google.com/102718274791889610666/posts>

compact/repair shouldn't compete for normal compaction resources.

2015-10-18 Thread Kevin Burton

I'm doing a big nodetool repair right now and I'm pretty sure the added
overhead is impacting our performance.

Shouldn't you be able to throttle repair so that normal compactions can use
most of the resources?

-- 

We’re hiring if you know of any awesome Java Devops or Linux Operations
Engineers!

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile

Would we have data corruption if we bootstrapped 10 nodes at once?

2015-10-17 Thread Kevin Burton

We just migrated from a 30 node cluster to a 45 node cluster. (so 15 new
nodes)

By default we have auto_boostrap = false

so we just push our config to the cluster, the cassandra daemons restart,
and they're not cluster members and are the only nodes in the cluster.

Anyway.  While I was about 1/2 way done adding the 15 nodes,  I had about 7
members of the cluster and 8 not yet joined.

We are only doing 1 at a time because apparently bootstrapping more than 1
is unsafe.

I did a rolling restart whereby I went through and restarted all the
cassandra boxes.

Somehow the new nodes auto boostrapped themselves EVEN though
auto_bootstrap=false.

We don't have any errors.  Everything seems functional.  I'm just worried
about data loss.

Thoughts?

Kevin

-- 

We’re hiring if you know of any awesome Java Devops or Linux Operations
Engineers!

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
<https://plus.google.com/102718274791889610666/posts>

Re: reiserfs - DirectoryNotEmptyException

2015-10-17 Thread Kevin Burton

My advice is to not even consider anything else or make any other changes
to your architecture until you get onto a modern and maintained filesystem.

VERY VERY VERY few people are deploying anything on ReiserFS so you're
going to be the first group encountering any problems.

On Thu, Oct 15, 2015 at 12:28 PM, Modha, Digant <
digant.mo...@tdsecurities.com> wrote:

> It is deployed on an existing cluster but will be migrated soon to a
> different file system & Linux distribution.
>
> -Original Message-
> From: Michael Shuler [mailto:mshu...@pbandjelly.org] On Behalf Of Michael
> Shuler
> Sent: Wednesday, October 14, 2015 6:02 PM
> To: user@cassandra.apache.org
> Subject: Re: reiserfs - DirectoryNotEmptyException
>
> On 10/13/2015 01:58 PM, Modha, Digant wrote:
> > I am running Cassandra 2.1.10 and noticed intermittent
> > DirectoryNotEmptyExceptions during repair.  My cassandra data drive is
> > reiserfs.
>
> Why? I'm genuinely interested in this filesystem selection, since it is
> unmaintained, has been dropped from some mainstream linux distributions,
> and some may call it "dead". ;)
>
> > I noticed that on reiserfs wiki site
> > https://en.m.wikipedia.org/wiki/ReiserFS#Criticism, it states that
> > unlink operation is not synchronous. Is that the reason for the
> > exception below:
> >
> > ERROR [ValidationExecutor:137] 2015-10-13 00:46:30,759
> > CassandraDaemon.java:227 - Exception in thread
> > Thread[ValidationExecutor:137,1,main]
> >
> > org.apache.cassandra.io.FSWriteError:
> > java.nio.file.DirectoryNotEmptyException:
> >
> > at
> > org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.jav
> > a:135)
> >
> >~[apache-cassandra-2.1.10.jar:2.1.10]
> <...>
>
> This seems like a reasonable explanation. Using a modern filesystem like
> ext4 or xfs would certainly be helpful in getting you within the realm of
> a "common" hardware setup.
>
> https://wiki.apache.org/cassandra/CassandraHardware
>
> https://www.safaribooksonline.com/library/view/cassandra-high-performance/9781849515122/ch04s06.html
>
> I think Al Tobey had a slide deck on filesystem tuning for C*, but I
> didn't find it quickly.
>
> --
> Kind regards,
> Michael
>
>
> TD Securities disclaims any liability or losses either direct or
> consequential caused by the use of this information. This communication is
> for informational purposes only and is not intended as an offer or
> solicitation for the purchase or sale of any financial instrument or as an
> official confirmation of any transaction. TD Securities is neither making
> any investment recommendation nor providing any professional or advisory
> services relating to the activities described herein. All market prices,
> data and other information are not warranted as to completeness or accuracy
> and are subject to change without notice Any products described herein are
> (i) not insured by the FDIC, (ii) not a deposit or other obligation of, or
> guaranteed by, an insured depository institution and (iii) subject to
> investment risks, including possible loss of the principal amount invested.
> The information shall not be further distributed or duplicated in whole or
> in part by any means without the prior written consent of TD Securities. TD
> Securities is a trademark of The Toronto-Dominion Bank and represents TD
> Securities (USA) LLC and certain investment banking activities of The
> Toronto-Dominion Bank and its subsidiaries.
>



-- 

We’re hiring if you know of any awesome Java Devops or Linux Operations
Engineers!

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile

Post portem of a large Cassandra datacenter migration.

2015-10-09 Thread Kevin Burton

We just finished up a pretty large migration of about 30 Cassandra boxes to
a new datacenter.

We'll be migrating to about 60 boxes here in the next month so scalability
(and being able to do so cleanly) is important.

We also completed an Elasticsearch migration at the same time.  The ES
migration worked fine. A few small problems with it doing silly things with
relocating nodes too often but all in all it was somewhat painless.

At one point we were doing 200 shard reallocations in parallel and pushing
about 2-4Gbit...

The Cassandra migration, however, was a LOT harder.

One quick thing I wanted to point out - we're hiring.  So if you're a
killer Java Devops guy drop me an email

Anyway.  Back to the story.

Obviously we did a bunch of research before hand to make sure we had plenty
of bandwidth.  This was a migration from Washington DC to Germany.

Using iperf, we could consistently push about 2Gb back and forth between DC
and Germany.  This includes TCP as we switched to using large window sizes.

The big problem that we had, was that we could only bootstrap one node at a
time.  The ends up taking a LOT more time because you have to keep checking
on a node so that you can start the next one.

I imagine one could write a coordinator script but we had so many problems
with CS that it wouldn't have worked if we tried.

We had 2-3 main problems.

1.  Sometimes streams would just stop and lock up.  No explanation why.
They would just lock up and not resume.  We'd wait 10-15 minutes with no
response.. This would require us abort and retry.  Had we updated to
Cassandra 2.2 before hand I think the new resume support would work.

2.  Some of our keyspaces created by Thrift caused exceptions regarding
"too few resources" when trying to bootstrap. Dropping these keyspaces
fixed the problem.  They were just test keyspaces so it didn't matter.

3.  Because of #1, it's probably better to make sure you have 2x or more
disk space on the remote end before you do the migration.  This way you can
boot the same number of nodes you had before and just decommission the old
ones quickly. (er use nodetool removenode - see below)

4.  We're not sure why, but our OLDER machines kept locking up during this
process.  This kept requiring us to do a rolling restart on all the older
nodes.  We suspect this is GC and we were seeing single cores to 100%.  I
didn't have time to attach a profiler as were all burned out at this point
and just wanted to get it over with.  This problem meant that #1 was
exacerbated because our old boxes would either refuse to send streams or
refuse to accept them.  It seemed to get better when we upgraded the older
boxes to use Java 8.

5.  Don't use nodetool decommission if you have a large number of nodes.
Instead, use nodetool removenode.  It's MUCH faster and does M-N
replication between nodes directly.  The downside is that you go down to
N-1 replicas during this process. However, it was easily 20-30x faster.
This probably saved me about 5 hours of sleep!

In hindsight, I'm not sure what we would have done differently.  Maybe
bought more boxes.  Maybe upgraded to Cassandra 2.2 and probably java 8 as
well.

Setting up datacenter migration might have worked out better too.

Kevin

-- 

We’re hiring if you know of any awesome Java Devops or Linux Operations
Engineers!

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
<https://plus.google.com/102718274791889610666/posts>

Re: Does failing to run "nodetool cleanup" end up causing more data to be transferred during bootstrapping?

2015-10-07 Thread Kevin Burton

vnodes ... of course!

On Wed, Oct 7, 2015 at 9:09 PM, Sebastian Estevez <
sebastian.este...@datastax.com> wrote:

> vnodes or single tokens?
>
> All the best,
>
>
> [image: datastax_logo.png] <http://www.datastax.com/>
>
> Sebastián Estévez
>
> Solutions Architect | 954 905 8615 | sebastian.este...@datastax.com
>
> [image: linkedin.png] <https://www.linkedin.com/company/datastax> [image:
> facebook.png] <https://www.facebook.com/datastax> [image: twitter.png]
> <https://twitter.com/datastax> [image: g+.png]
> <https://plus.google.com/+Datastax/about>
> <http://feeds.feedburner.com/datastax>
>
>
> <http://cassandrasummit-datastax.com/?utm_campaign=summit15_medium=summiticon_source=emailsignature>
>
> DataStax is the fastest, most scalable distributed database technology,
> delivering Apache Cassandra to the world’s most innovative enterprises.
> Datastax is built to be agile, always-on, and predictably scalable to any
> size. With more than 500 customers in 45 countries, DataStax is the
> database technology and transactional backbone of choice for the worlds
> most innovative companies such as Netflix, Adobe, Intuit, and eBay.
>
> On Thu, Oct 8, 2015 at 12:06 AM, Kevin Burton <bur...@spinn3r.com> wrote:
>
>> Let's say I have 10 nodes, I add 5 more, if I fail to run nodetool
>> cleanup, is excessive data transferred when I add the 6th node?  IE do the
>> existing nodes send more data to the 6th node?
>>
>> the documentation is unclear.  It sounds like the biggest problem is that
>> the existing data causes things to become unbalanced due to "load" computed
>> wrong".
>>
>> but I also think that the excessive data will be removed in the next
>> major compaction and that nodetool cleanup just triggers a major compaction.
>>
>> Is my hypothesis correct?
>>
>> --
>>
>> We’re hiring if you know of any awesome Java Devops or Linux Operations
>> Engineers!
>>
>> Founder/CEO Spinn3r.com
>> Location: *San Francisco, CA*
>> blog: http://burtonator.wordpress.com
>> … or check out my Google+ profile
>> <https://plus.google.com/102718274791889610666/posts>
>>
>>
>


-- 

We’re hiring if you know of any awesome Java Devops or Linux Operations
Engineers!

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
<https://plus.google.com/102718274791889610666/posts>

Why can't nodetool status include a hostname?

2015-10-07 Thread Kevin Burton

I find it really frustrating that nodetool status doesn't include a hostname

Makes it harder to track down problems.

I realize it PRIMARILY uses the IP but perhaps cassandra.yml can include an
optional 'hostname' parameter that can be set by the user.  OR have the box
itself include the hostname in gossip when it starts up.

I realize that hostname wouldn't be authoritative and that the IP must
still be shown but we could add another column for the hostname.

-- 

We’re hiring if you know of any awesome Java Devops or Linux Operations
Engineers!

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile

Does failing to run "nodetool cleanup" end up causing more data to be transferred during bootstrapping?

2015-10-07 Thread Kevin Burton

Let's say I have 10 nodes, I add 5 more, if I fail to run nodetool cleanup,
is excessive data transferred when I add the 6th node?  IE do the existing
nodes send more data to the 6th node?

the documentation is unclear.  It sounds like the biggest problem is that
the existing data causes things to become unbalanced due to "load" computed
wrong".

but I also think that the excessive data will be removed in the next major
compaction and that nodetool cleanup just triggers a major compaction.

Is my hypothesis correct?

-- 

We’re hiring if you know of any awesome Java Devops or Linux Operations
Engineers!

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile

Maximum node decommission // bootstrap at once.

2015-10-06 Thread Kevin Burton

We're in the middle of migrating datacenters.

We're migrating from 13 nodes to 30 nodes in the new datacenter.

The plan was to bootstrap the 30 nodes first, wait until they have joined.
 then we're going to decommission the old ones.

How many nodes can we bootstrap at once?  How many can we decommission?

I remember reading docs for this but hell if I can find it now :-P

I know what the answer is theoretically.  I just want to make sure we do
everything properly.

Kevin

-- 

We’re hiring if you know of any awesome Java Devops or Linux Operations
Engineers!

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
<https://plus.google.com/102718274791889610666/posts>

Re: Maximum node decommission // bootstrap at once.

2015-10-06 Thread Kevin Burton

I'm not sure which is faster/easier.  Just joining one box at a time and
then decommissioning or using replace_address.

this stuff is always something you do rarely and then more complex than it
needs to be.

This complicates long term migration too.  Having to have gigabit is
somewhat of a problem in that you might now actually have it where you're
going.

We're migrating from Washington, DC to Germany so we have to change TCP
send/receive buffers to get decent bandwidth.

But I think we can do this at 1Gb per so per box.

On Tue, Oct 6, 2015 at 12:48 PM, Robert Coli <rc...@eventbrite.com> wrote:

> On Tue, Oct 6, 2015 at 12:32 PM, Kevin Burton <bur...@spinn3r.com> wrote:
>
>> How many nodes can we bootstrap at once?  How many can we decommission?
>>
>
> short answer : 1 node can join or part at simultaneously
>
> longer answer : https://issues.apache.org/jira/browse/CASSANDRA-2434 /
> https://issues.apache.org/jira/browse/CASSANDRA-7069 /
> -Dconsistent.rangemovement
>
> Have you considered using replace_address to replace your existing 13
> nodes, at which point you just have to join 17 more?
>
> =Rob
>
>

-- 

We’re hiring if you know of any awesome Java Devops or Linux Operations
Engineers!

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
<https://plus.google.com/102718274791889610666/posts>

Re: Maximum node decommission // bootstrap at once.

2015-10-06 Thread Kevin Burton

OH. interesting.  Yeah. That's another strategy.  We've already done a
bunch of TCP tuning... we get about 1Gbit with large TCP windows.  So I
think we have that part done.

It's sad that CS can't resume...

Plan be we will just rsync the data.. Does it pretty much work just by
putting the data in a directory or do you have to do anything special?

On Tue, Oct 6, 2015 at 1:34 PM, Bryan Cheng <br...@blockcypher.com> wrote:

> Honestly, we've had more luck bootstrapping in our old DC (defining
> topology properties as the new DC) and using rsync to migrate the data
> files to new machines in the new datacenter. We had 10gig within the
> datacenter but significantly less than this cross-DC, which lead to a lot
> of broken streaming pipes and wasted effort. This might make sense
> depending on your link quality and the resources/time you have available to
> do TCP tuning,
>
> On Tue, Oct 6, 2015 at 1:29 PM, Kevin Burton <bur...@spinn3r.com> wrote:
>
>> I'm not sure which is faster/easier.  Just joining one box at a time and
>> then decommissioning or using replace_address.
>>
>> this stuff is always something you do rarely and then more complex than
>> it needs to be.
>>
>> This complicates long term migration too.  Having to have gigabit is
>> somewhat of a problem in that you might now actually have it where you're
>> going.
>>
>> We're migrating from Washington, DC to Germany so we have to change TCP
>> send/receive buffers to get decent bandwidth.
>>
>> But I think we can do this at 1Gb per so per box.
>>
>>
>> On Tue, Oct 6, 2015 at 12:48 PM, Robert Coli <rc...@eventbrite.com>
>> wrote:
>>
>>> On Tue, Oct 6, 2015 at 12:32 PM, Kevin Burton <bur...@spinn3r.com>
>>> wrote:
>>>
>>>> How many nodes can we bootstrap at once?  How many can we decommission?
>>>>
>>>
>>> short answer : 1 node can join or part at simultaneously
>>>
>>> longer answer : https://issues.apache.org/jira/browse/CASSANDRA-2434 /
>>> https://issues.apache.org/jira/browse/CASSANDRA-7069 /
>>> -Dconsistent.rangemovement
>>>
>>> Have you considered using replace_address to replace your existing 13
>>> nodes, at which point you just have to join 17 more?
>>>
>>> =Rob
>>>
>>>
>>
>>
>>
>> --
>>
>> We’re hiring if you know of any awesome Java Devops or Linux Operations
>> Engineers!
>>
>> Founder/CEO Spinn3r.com
>> Location: *San Francisco, CA*
>> blog: http://burtonator.wordpress.com
>> … or check out my Google+ profile
>> <https://plus.google.com/102718274791889610666/posts>
>>
>>
>


-- 

We’re hiring if you know of any awesome Java Devops or Linux Operations
Engineers!

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
<https://plus.google.com/102718274791889610666/posts>

Re: Running Cassandra on Java 8 u60..

2015-09-27 Thread Kevin Burton

Possibly for existing apps… we’re running G1 for everything except
Elasticsearch and Cassandra and are pretty happy with it.

On Sun, Sep 27, 2015 at 10:28 AM, Graham Sanderson <gra...@vast.com> wrote:

> IMHO G1 is still buggy on JDK8 (based solely on being subscribed to the
> gc-dev mailing list)… I think JDK9 will be the one.
>
> On Sep 25, 2015, at 7:14 PM, Stefano Ortolani <ostef...@gmail.com> wrote:
>
> I think those were referring to Java7 and G1GC (early versions were buggy).
>
> Cheers,
> Stefano
>
>
> On Fri, Sep 25, 2015 at 5:08 PM, Kevin Burton <bur...@spinn3r.com> wrote:
>
>> Any issues with running Cassandra 2.0.16 on Java 8? I remember there is
>> long term advice on not changing the GC but not the underlying version of
>> Java.
>>
>> Thoughts?
>>
>> --
>>
>> We’re hiring if you know of any awesome Java Devops or Linux Operations
>> Engineers!
>>
>> Founder/CEO Spinn3r.com <http://spinn3r.com/>
>> Location: *San Francisco, CA*
>> blog: http://burtonator.wordpress.com
>> … or check out my Google+ profile
>> <https://plus.google.com/102718274791889610666/posts>
>>
>>
>>
>
>


-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
<https://plus.google.com/102718274791889610666/posts>

Using inline JSON is 2-3x faster than using many columns (>20)

2015-09-26 Thread Kevin Burton

I wanted to share this with the community in the hopes that it might help
someone with their schema design.

I didn't get any red flags early on to limit the number of columns we use.
If anything the community pushes for dynamic schema because Cassandra has
super nice online ALTER TABLE.

However, in practice we've found that Cassandra started to use a LOT more
CPU than anything else in our stack.

Including Elasticsearch.  ES uses about 8% of our total CPU whereas
Cassandra uses about 70% of it.. It's not an apples to oranges comparison
mind you but Cassandra definitely warrants some attention in this scenario.

I put Cassandra into a profiler (Java Mission Control) to see if anything
weird was happening and didn't see any red flags.

There were some issues with CAS so I rewrote that to implement a query
before CAS operation where we first check if the row is already there, then
use a CAS if its missing. That was a BIG performance bump.  Probably
reduced our C* usages by 40%

However, I started to speculate that it might be getting overwhelmed with
the raw numbers of rows.

I fired up cassandra_stress to verify and basically split it at 10 columns
with 150 bytes and then 150 columns with 10 bytes.

In this synthetic benchmark C* was actually 5-6x faster for the run with 10
columns.

So this tentatively confirmed my hypothesis.

So I decided to get a bit more aggressive and tried to test it with a less
synthetic benchmark.

I wrote my own benchmark which uses our own schema in two forms.

INLINE_ONLY: 150 columns...
DATA_ONLY: 4 columns (two primary key, 1 data_format and one data_blob)
column

It creates T threads, writes W rows, then reads R rows..

I set T=50, W=50,000, R=50,000

It does a write pass, then a read pass.  I didn't implement a mixed
workload though.. I think that my results wouldn't matter as much.

The results were similarly impressive but not as much as the synthetic
benchmark above.  It was 2x faster (6 minutes vs 3 minutes).

In the inline only benchmark, C* spends 70% of the time in high CPU.  In
data_only it's about 50/50.

I think we're going to move to this model and re-write all our C* stables
to support this inline JSON.

The second benchmark was under 2.0.16... (our production version).  The
cassandra_stress was under 3.0 beta as I wanted to see if a later version
of cassandra fixed the problem. It doesn't.

This was done on a 128GB box with two Samsung SSDs in RAID0.  I didn't test
it with any replicas.

This brings up some interesting issues:

- still interesting that C* spends as much time as it does under high CPU
load.  I'd like to profile it again.

- Looks like there's room for improvement in the JSON encoder/decoder.  I'm
not sure how much we would see though because it's already using the latest
jackson which I've tuned significantly.  I might be able to get some
performance out of it by avoiding GC and garbage collection.

- Later C* might improve our CPU regardless so this might be something we
do anyway (upgrade our cassandra).



-- 

We’re hiring if you know of any awesome Java Devops or Linux Operations
Engineers!

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile

Running Cassandra on Java 8 u60..

2015-09-25 Thread Kevin Burton

Any issues with running Cassandra 2.0.16 on Java 8? I remember there is
long term advice on not changing the GC but not the underlying version of
Java.

Thoughts?

-- 

We’re hiring if you know of any awesome Java Devops or Linux Operations
Engineers!

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile

Re: Best strategy for hiring from OSS communities.

2015-09-13 Thread Kevin Burton

I think j...@apache.org is dead…

I saw this:

http://mail-archives.apache.org/mod_mbox/community-dev/201304.mbox/%3CCAKQbXgAgO_3SzLMR0L4p_qkSALQzE=ehpnbmjndccu6dtm-...@mail.gmail.com%3E

And can’t find any documentation on a j...@apache.org

I think it would be valuable to create one.  Maybe I should post to general@
…

On Fri, Sep 11, 2015 at 5:34 PM, Otis Gospodnetić <
otis.gospodne...@gmail.com> wrote:

> Hey Kevin - I think there is j...@apache.org
>
> Otis
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>
>
> On Thu, Aug 13, 2015 at 6:02 PM, Kevin Burton <bur...@spinn3r.com> wrote:
>
>> Mildly off topic but we are looking to hire someone with Cassandra
>> experience..
>>
>> I don’t necessarily want to spam the list though.  We’d like someone from
>> the community who contributes to Open Source, etc.
>>
>> Are there forums for Apache / Cassandra, etc for jobs? I couldn’t fine
>> one.
>>
>> --
>>
>> Founder/CEO Spinn3r.com
>> Location: *San Francisco, CA*
>> blog: http://burtonator.wordpress.com
>> … or check out my Google+ profile
>> <https://plus.google.com/102718274791889610666/posts>
>>
>>
>


-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
<https://plus.google.com/102718274791889610666/posts>

cassandra-stress on 3.0 with column widths benchmark.

2015-09-13 Thread Kevin Burton

I’m trying to benchmark two scenarios…

10 columns with 150 bytes each

vs

150 columns with 10 bytes each.

The total row “size” would be 1500 bytes (ignoring overhead).

Our app uses 150 columns so I’m trying to see if packing it into a JSON
structure using one column would improve performance.

I seem to have confirmed my hypothesis.

I’m running two tests:

./tools/bin/cassandra-stress write -insert -col n=FIXED\(10\)
> size=FIXED\(150\) | tee cassandra-stress-10-150.log
>


> time ./tools/bin/cassandra-stress write -insert -col n=FIXED\(150\)
> size=FIXED\(10\) | tee cassandra-stress-150-10.log


this shows that the "op rate” is much much lower when running with 150
columns:

root@util0063 ~/apache-cassandra-3.0.0-beta2 # grep "op rate"
> cassandra-stress-10-150.log
> op rate   : 7632 [WRITE:7632]
> op rate   : 11851 [WRITE:11851]
> op rate   : 31967 [WRITE:31967]
> op rate   : 41798 [WRITE:41798]
> op rate   : 51251 [WRITE:51251]
> op rate   : 58057 [WRITE:58057]
> op rate   : 62977 [WRITE:62977]
> op rate   : 65398 [WRITE:65398]
> op rate   : 67673 [WRITE:67673]
> op rate   : 69198 [WRITE:69198]
> op rate   : 70402 [WRITE:70402]
> op rate   : 71019 [WRITE:71019]
> op rate   : 71574 [WRITE:71574]
> root@util0063 ~/apache-cassandra-3.0.0-beta2 # grep "op rate"
> cassandra-stress-150-10.log
> op rate   : 2570 [WRITE:2570]
> op rate   : 5144 [WRITE:5144]
> op rate   : 10906 [WRITE:10906]
> op rate   : 11832 [WRITE:11832]
> op rate   : 12471 [WRITE:12471]
> op rate   : 12915 [WRITE:12915]
> op rate   : 13620 [WRITE:13620]
> op rate   : 13456 [WRITE:13456]
> op rate   : 13916 [WRITE:13916]
> op rate   : 14029 [WRITE:14029]
> op rate   : 13915 [WRITE:13915]


… what’s WEIRD here is that

Both tests take about 10 minutes.  Yet it’s saying that the op rate for the
second is slower.  Why would that be? That doesn’t make much sense…

-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile

Re: Cassandra 2.2 for time series

2015-09-02 Thread Kevin Burton

Check out kairosd for a time series db on Cassandra.
On Aug 31, 2015 7:12 AM, "Peter Lin"  wrote:

>
> I didn't realize they had added max and min as stock functions.
>
> to get the sample time. you'll probably need to write a custom function.
> google for it and you'll find people that have done it.
>
> On Mon, Aug 31, 2015 at 10:09 AM, Pål Andreassen  > wrote:
>
>> Cassandra 2.2 has min and max built-in. My problem is getting the
>> corresponding sample time as well.
>>
>>
>>
>> *Pål Andreassen*
>>
>> *54°23'58"S 3°18'53"E*
>>
>> *Konsulent*
>>
>> Mobil +47 982 85 504
>>
>> pal.andreas...@bouvet.no
>>
>>
>>
>>
>> *Bouvet Norge AS Avdeling Grenland*
>>
>> Uniongata 18, Klosterøya
>>
>> N-3732 Skien
>>
>> Tlf +47 23 40 60 00
>>
>> *bouvet.no*
>> 
>>
>>
>>
>> *From:* Peter Lin [mailto:wool...@gmail.com]
>> *Sent:* mandag 31. august 2015 16.09
>> *To:* user@cassandra.apache.org
>> *Subject:* Re: Cassandra 2.2 for time series
>>
>>
>>
>>
>>
>> Unlike SQL, CQL doesn't have built-in functions like max/min
>>
>> In the past, people would create summary tables to keep rolling stats for
>> reports/analytics. In cql3, there's user defined functions, so you can
>> write a function to do max/min
>>
>> http://cassandra.apache.org/doc/cql3/CQL-2.2.html#selectStmt
>> http://cassandra.apache.org/doc/cql3/CQL-2.2.html#udfs
>>
>>
>>
>> On Mon, Aug 31, 2015 at 9:48 AM, Pål Andreassen 
>> wrote:
>>
>> Hi
>>
>>
>>
>> I’m currently evaluating Cassandra as a potiantial database for storing
>> time series data from lots of devices (IoT type of scenario).
>>
>> Currently we have a few thousand devices with X channels (measurements)
>> that they report at different intervals (from 5 minutes and up).
>>
>>
>>
>> I’ve created as simple test table to store the data:
>>
>>
>>
>> CREATE TABLE DataRaw(
>>
>>   channelId int,
>>
>>   sampleTime timestamp,
>>
>>   value double,
>>
>>   PRIMARY KEY (channelId, sampleTime)
>>
>> ) WITH CLUSTERING ORDER BY (sampleTime ASC);
>>
>>
>>
>> This schema seems to work ok, but I have queries that I need to support
>> that I cannot easily figure out how to perform (except getting all the data
>> out and iterate it myself).
>>
>>
>>
>> Query 1: For max and min queries, I not only want the maximum/minimum
>> value, but also the corresponding timestamp.
>>
>>
>>
>> sampleTime  value
>>
>> 2015-08-28 00:0010
>>
>> 2015-08-28 01:0015
>>
>> 2015-08-28 02:0013
>>
>>
>> I'd like the max query to return both 2015-08-28 01:00 and 15. SELECT
>> sampleTime, max(value) FROM DataRAW return the max value, but the first
>> sampleTime.
>>
>> Also I wonder if Cassandra has built-in support for
>> interpolation/extrapolation. Some sort of group by hour/day/week/month and
>> even year function.
>>
>>
>>
>> Query 2: Give me hourly averages for channel X for yesterday. I’d expect
>> to get 24 values each of which is the hourly average. Or give my daily
>> averages for last year for a given channel. Should return 365 daily
>> averages.
>>
>>
>>
>> Best regards
>>
>>
>>
>> *Pål Andreassen*
>>
>> *54°23'58"S 3°18'53"E*
>>
>> *Konsulent*
>>
>> Mobil +47 982 85 504
>>
>> pal.andreas...@bouvet.no
>>
>>
>>
>>
>> *Bouvet Norge AS Avdeling Grenland*
>>
>> Uniongata 18, Klosterøya
>>
>> N-3732 Skien
>>
>> Tlf +47 23 40 60 00
>>
>> *bouvet.no*
>> 
>>
>>
>>
>>
>>
>
>

Re: Practical limitations of too many columns/cells ?

2015-08-25 Thread Kevin Burton

No problem.  IS there a JIRA ticket already for this?

On Mon, Aug 24, 2015 at 6:06 AM, Jonathan Haddad j...@jonhaddad.com wrote:

 Can you post your findings to JIRA as well?  Would be good to see some
 real numbers from production.

 The refactor of the storage engine (8099) may completely change this, but
 it's good to have it on the radar.


 On Sun, Aug 23, 2015 at 10:31 PM Kevin Burton bur...@spinn3r.com wrote:

 Agreed.  We’re going to run a benchmark.  Just realized we grew to 144
 columns.  Fun.  Kind of disappointing that Cassandra is so slow in this
 regard.  Kind of defeats the whole point of flexible schema if actually
 using that feature is slow as hell.

 On Sun, Aug 23, 2015 at 4:54 PM, Jeff Jirsa jeff.ji...@crowdstrike.com
 wrote:

 The key is to benchmark it with your real data. Modern cassandra-stress
 let’s you get very close to your actual read/write behavior, and the real
 differentiator will depend on your use case (how often do you write the
 whole row vs updating just one column/field). My gist shows a ton of
 different examples, but they’re not scientific, and at this point they’re
 old versions (and performance varies version to version).

 - Jeff

 From: burtonator2...@gmail.com on behalf of Kevin Burton
 Reply-To: user@cassandra.apache.org
 Date: Sunday, August 23, 2015 at 2:58 PM
 To: user@cassandra.apache.org
 Subject: Re: Practical limitations of too many columns/cells ?

 Ah.. yes.  Great benchmarks. If I’m interpreting them correctly it was
 ~15x slower for 22 columns vs 2 columns?

 Guess we have to refactor again :-P

 Not the end of the world of course.

 On Sun, Aug 23, 2015 at 1:53 PM, Jeff Jirsa jeff.ji...@crowdstrike.com
 wrote:

 A few months back, a user in #cassandra on freenode mentioned that when
 they transitioned from thrift to cql, their overall performance decreased
 significantly. They had 66 columns per table, so I ran some benchmarks with
 various versions of Cassandra and thrift/cql combinations.

 It shouldn’t really surprise you that more columns = more work = slower
 operations. It’s not necessarily the size of the writes, but the amount of
 work that needs to be done with the extra cells (2 large columns totaling
 2k performs better than 66 small columns totaling 0.66k even though it’s
 three times as much raw data being written to disk)

 https://gist.github.com/jeffjirsa/6e481b132334dfb6d42c

 2.0.13, 2 tokens per node, 66 columns, 10 bytes per column, thrift (660
 bytes per): cassandra-stress --operation INSERT --num-keys 100
 --columns 66 --column-size=10 --replication-factor 2 --nodesfile=nodes
 Averages from the middle 80% of values: interval_op_rate : 10720

 2.0.13, 2 tokens per node, 20 columns, 10 bytes per column, thrift (200
 bytes per): cassandra-stress --operation INSERT --num-keys 100
 --columns 20 --column-size=10 --replication-factor 2 --nodesfile=nodes
 Averages from the middle 80% of values: interval_op_rate : 28667

 2.0.13, 2 tokens per node, 2 large columns, thrift (2048 bytes per):
 cassandra-stress --operation INSERT --num-keys 100 --columns 2
 --column-size=1024 --replication-factor 2 --nodesfile=nodes Averages
 from the middle 80% of values: interval_op_rate : 23489

 From: burtonator2...@gmail.com on behalf of Kevin Burton
 Reply-To: user@cassandra.apache.org
 Date: Sunday, August 23, 2015 at 1:02 PM
 To: user@cassandra.apache.org
 Subject: Practical limitations of too many columns/cells ?

 Is there any advantage to using say 40 columns per row vs using 2
 columns (one for the pk and the other for data) and then shoving the data
 into a BLOB as a JSON object?

 To date, we’ve been just adding new columns.  I profiled Cassandra and
 about 50% of the CPU time is spent on CPU doing compactions.  Seeing that
 CS is being CPU bottlenecked maybe this is a way I can optimize it.

 Any thoughts?

 --

 Founder/CEO Spinn3r.com
 Location: *San Francisco, CA*
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 https://plus.google.com/102718274791889610666/posts




 --

 Founder/CEO Spinn3r.com
 Location: *San Francisco, CA*
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 https://plus.google.com/102718274791889610666/posts




 --

 Founder/CEO Spinn3r.com
 Location: *San Francisco, CA*
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 https://plus.google.com/102718274791889610666/posts




-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
https://plus.google.com/102718274791889610666/posts

Practical limitations of too many columns/cells ?

Is there any advantage to using say 40 columns per row vs using 2 columns
(one for the pk and the other for data) and then shoving the data into a
BLOB as a JSON object?

To date, we’ve been just adding new columns.  I profiled Cassandra and
about 50% of the CPU time is spent on CPU doing compactions.  Seeing that
CS is being CPU bottlenecked maybe this is a way I can optimize it.

Any thoughts?

-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
https://plus.google.com/102718274791889610666/posts

Re: Practical limitations of too many columns/cells ?

Ah.. yes.  Great benchmarks. If I’m interpreting them correctly it was ~15x
slower for 22 columns vs 2 columns?

Guess we have to refactor again :-P

Not the end of the world of course.

On Sun, Aug 23, 2015 at 1:53 PM, Jeff Jirsa jeff.ji...@crowdstrike.com
wrote:

 A few months back, a user in #cassandra on freenode mentioned that when
 they transitioned from thrift to cql, their overall performance decreased
 significantly. They had 66 columns per table, so I ran some benchmarks with
 various versions of Cassandra and thrift/cql combinations.

 It shouldn’t really surprise you that more columns = more work = slower
 operations. It’s not necessarily the size of the writes, but the amount of
 work that needs to be done with the extra cells (2 large columns totaling
 2k performs better than 66 small columns totaling 0.66k even though it’s
 three times as much raw data being written to disk)

 https://gist.github.com/jeffjirsa/6e481b132334dfb6d42c

 2.0.13, 2 tokens per node, 66 columns, 10 bytes per column, thrift (660
 bytes per): cassandra-stress --operation INSERT --num-keys 100
 --columns 66 --column-size=10 --replication-factor 2 --nodesfile=nodesAverages
 from the middle 80% of values:interval_op_rate : 10720

 2.0.13, 2 tokens per node, 20 columns, 10 bytes per column, thrift (200
 bytes per):cassandra-stress --operation INSERT --num-keys 100
 --columns 20 --column-size=10 --replication-factor 2 --nodesfile=nodes 
 Averages
 from the middle 80% of values:interval_op_rate : 28667

 2.0.13, 2 tokens per node, 2 large columns, thrift (2048 bytes 
 per):cassandra-stress
 --operation INSERT --num-keys 100 --columns 2 --column-size=1024
 --replication-factor 2 --nodesfile=nodes Averages from the middle 80% of
 values:interval_op_rate : 23489

 From: burtonator2...@gmail.com on behalf of Kevin Burton
 Reply-To: user@cassandra.apache.org
 Date: Sunday, August 23, 2015 at 1:02 PM
 To: user@cassandra.apache.org
 Subject: Practical limitations of too many columns/cells ?

 Is there any advantage to using say 40 columns per row vs using 2 columns
 (one for the pk and the other for data) and then shoving the data into a
 BLOB as a JSON object?

 To date, we’ve been just adding new columns.  I profiled Cassandra and
 about 50% of the CPU time is spent on CPU doing compactions.  Seeing that
 CS is being CPU bottlenecked maybe this is a way I can optimize it.

 Any thoughts?

 --

 Founder/CEO Spinn3r.com
 Location: *San Francisco, CA*
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 https://plus.google.com/102718274791889610666/posts




-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
https://plus.google.com/102718274791889610666/posts

Store JSON as text or UTF-8 encoded blobs?

Hey.

I’m considering migrating my DB from using multiple columns to just 2
columns, with the second one being a JSON object.  Is there going to be any
real difference between TEXT or UTF-8 encoded BLOB?

I guess it would probably be easier to get tools like spark to parse the
object as JSON if it’s represented as a BLOB.

-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
https://plus.google.com/102718274791889610666/posts

Re: Practical limitations of too many columns/cells ?

Agreed.  We’re going to run a benchmark.  Just realized we grew to 144
columns.  Fun.  Kind of disappointing that Cassandra is so slow in this
regard.  Kind of defeats the whole point of flexible schema if actually
using that feature is slow as hell.

On Sun, Aug 23, 2015 at 4:54 PM, Jeff Jirsa jeff.ji...@crowdstrike.com
wrote:

 The key is to benchmark it with your real data. Modern cassandra-stress
 let’s you get very close to your actual read/write behavior, and the real
 differentiator will depend on your use case (how often do you write the
 whole row vs updating just one column/field). My gist shows a ton of
 different examples, but they’re not scientific, and at this point they’re
 old versions (and performance varies version to version).

 - Jeff

 From: burtonator2...@gmail.com on behalf of Kevin Burton
 Reply-To: user@cassandra.apache.org
 Date: Sunday, August 23, 2015 at 2:58 PM
 To: user@cassandra.apache.org
 Subject: Re: Practical limitations of too many columns/cells ?

 Ah.. yes.  Great benchmarks. If I’m interpreting them correctly it was
 ~15x slower for 22 columns vs 2 columns?

 Guess we have to refactor again :-P

 Not the end of the world of course.

 On Sun, Aug 23, 2015 at 1:53 PM, Jeff Jirsa jeff.ji...@crowdstrike.com
 wrote:

 A few months back, a user in #cassandra on freenode mentioned that when
 they transitioned from thrift to cql, their overall performance decreased
 significantly. They had 66 columns per table, so I ran some benchmarks with
 various versions of Cassandra and thrift/cql combinations.

 It shouldn’t really surprise you that more columns = more work = slower
 operations. It’s not necessarily the size of the writes, but the amount of
 work that needs to be done with the extra cells (2 large columns totaling
 2k performs better than 66 small columns totaling 0.66k even though it’s
 three times as much raw data being written to disk)

 https://gist.github.com/jeffjirsa/6e481b132334dfb6d42c

 2.0.13, 2 tokens per node, 66 columns, 10 bytes per column, thrift (660
 bytes per): cassandra-stress --operation INSERT --num-keys 100
 --columns 66 --column-size=10 --replication-factor 2 --nodesfile=nodes
 Averages from the middle 80% of values: interval_op_rate : 10720

 2.0.13, 2 tokens per node, 20 columns, 10 bytes per column, thrift (200
 bytes per): cassandra-stress --operation INSERT --num-keys 100
 --columns 20 --column-size=10 --replication-factor 2 --nodesfile=nodes
 Averages from the middle 80% of values: interval_op_rate : 28667

 2.0.13, 2 tokens per node, 2 large columns, thrift (2048 bytes per):
 cassandra-stress --operation INSERT --num-keys 100 --columns 2
 --column-size=1024 --replication-factor 2 --nodesfile=nodes Averages
 from the middle 80% of values: interval_op_rate : 23489

 From: burtonator2...@gmail.com on behalf of Kevin Burton
 Reply-To: user@cassandra.apache.org
 Date: Sunday, August 23, 2015 at 1:02 PM
 To: user@cassandra.apache.org
 Subject: Practical limitations of too many columns/cells ?

 Is there any advantage to using say 40 columns per row vs using 2 columns
 (one for the pk and the other for data) and then shoving the data into a
 BLOB as a JSON object?

 To date, we’ve been just adding new columns.  I profiled Cassandra and
 about 50% of the CPU time is spent on CPU doing compactions.  Seeing that
 CS is being CPU bottlenecked maybe this is a way I can optimize it.

 Any thoughts?

 --

 Founder/CEO Spinn3r.com
 Location: *San Francisco, CA*
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 https://plus.google.com/102718274791889610666/posts




 --

 Founder/CEO Spinn3r.com
 Location: *San Francisco, CA*
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 https://plus.google.com/102718274791889610666/posts




-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
https://plus.google.com/102718274791889610666/posts

Best strategy for hiring from OSS communities.

2015-08-13 Thread Kevin Burton

Mildly off topic but we are looking to hire someone with Cassandra
experience..

I don’t necessarily want to spam the list though.  We’d like someone from
the community who contributes to Open Source, etc.

Are there forums for Apache / Cassandra, etc for jobs? I couldn’t fine one.

-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
https://plus.google.com/102718274791889610666/posts

Re: TTLs on tables with only primary keys?

2015-08-05 Thread Kevin Burton

Thanks. This is what I was looking for…

I ended up working around this by using a boolean field as a column.
Wastes a bit of space but its not the end of the world.

On Wed, Aug 5, 2015 at 7:33 AM, Tyler Hobbs ty...@datastax.com wrote:

 You can set the TTL on a row when you create it using an INSERT
 statement.  For example:

 INSERT INTO mytable (partitionkey, clusteringkey) VALUES (0, 0) USING TTL
 100;

 However, Cassandra doesn't support the ttl() function on primary key
 columns yet.  The ticket to support this is
 https://issues.apache.org/jira/browse/CASSANDRA-9312.

 On Tue, Aug 4, 2015 at 9:22 PM, Kevin Burton bur...@spinn3r.com wrote:

 I have a table which just has primary keys.

 basically:

 create table foo (

 sequence bigint,
 signature text,
 primary key( sequence, signature )
 )

 I need these to eventually get GCd however it doesn’t seem to work.

 If I then run:

 select ttl(sequence) from foo;

 I get:

 Cannot use selection function ttl on PRIMARY KEY part sequence

 …

 I get the same thing if I do it on the second column .. (signature).

 And the value doesn’t seem to be TTLd.

 What’s the best way to proceed here?


 --

 Founder/CEO Spinn3r.com
 Location: *San Francisco, CA*
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 https://plus.google.com/102718274791889610666/posts




 --
 Tyler Hobbs
 DataStax http://datastax.com/




-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
https://plus.google.com/102718274791889610666/posts

TTLs on tables with only primary keys?

2015-08-04 Thread Kevin Burton

I have a table which just has primary keys.

basically:

create table foo (

sequence bigint,
signature text,
primary key( sequence, signature )
)

I need these to eventually get GCd however it doesn’t seem to work.

If I then run:

select ttl(sequence) from foo;

I get:

Cannot use selection function ttl on PRIMARY KEY part sequence

…

I get the same thing if I do it on the second column .. (signature).

And the value doesn’t seem to be TTLd.

What’s the best way to proceed here?


-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
https://plus.google.com/102718274791889610666/posts

Configuring the java client to retry on write failure.

2015-07-12 Thread Kevin Burton

I can’t seem to find a decent resource to really explain this…

Our app seems to fail some write requests, a VERY low percentage.  I’d like
to retry the write requests that fail due to number of replicas not being
correct.

http://docs.datastax.com/en/developer/java-driver/2.0/common/drivers/reference/tuningPolicies_c.html

This is the best resource I can find.

I think the best strategy is to look at DefaultRetryPolicy and then create
a custom one that keeps retrying on write failures up to say 1 minute.
Latency isn’t critical for us as this is a batch processing system.

The biggest issue is how to test it?  I could unit test that my methods
return on the correct inputs but not really in real world situations.

What’s the best way to unit test this?

-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
https://plus.google.com/102718274791889610666/posts

Lots of write timeouts and missing data during decomission/bootstrap

2015-07-01 Thread Kevin Burton

We get lots of write timeouts when we decommission a node.  About 80% of
them are write timeout and just about 20% of them are read timeout.

We’ve tried to adjust streamthroughput (and compaction throughput) for that
matter and that doesn’t resolve the issue.

We’ve increased write_request_timeout_in_ms … and read timeout as well.

Is there anything else I should be looking at?

I can’t seem to find the documentation that explains what the heck is
happening.

-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
https://plus.google.com/102718274791889610666/posts

Re: Lots of write timeouts and missing data during decomission/bootstrap

2015-07-01 Thread Kevin Burton

Looks like all of this is happening because we’re using CAS operations and
the driver is going to SERIAL consistency level.

SERIAL and LOCAL_SERIAL write failure scenarios¶

http://docs.datastax.com/en/cassandra/2.0/cassandra/dml/dml_config_consistency_c.html?scroll=concept_ds_umf_5xx_zj__failure-scenariosIf
one of three nodes is down, the Paxos commit fails under the following
conditions:

- CQL query-configured consistency level of ALL

- Driver-configured serial consistency level of SERIAL

- Replication factor of 3

I don’t understand why this would fail.. it seems completely broken in this
situation.

We were having write timeout at replication factor of 2 .. and a lot of
people from the list said of course , because 2 nodes with 1 node down
means there’s no quorum and paxos needs a quorum. .. and not sure why I
missed that :-P

So we went with 3 replicas, and a quorum,

but this is new and I didn’t see this documented. We set the driver to
QUORUM but then I guess the driver sees that this is a CAS operation and
forces it back to SERIAL? Doesn’t this mean that all decommissions result
in failures of CAS?

This is Cassandra 2.0.9 btw.

On Wed, Jul 1, 2015 at 2:22 PM, Kevin Burton bur...@spinn3r.com wrote:

We get lots of write timeouts when we decommission a node. About 80% of
them are write timeout and just about 20% of them are read timeout.

We’ve tried to adjust streamthroughput (and compaction throughput) for
that matter and that doesn’t resolve the issue.

We’ve increased write_request_timeout_in_ms … and read timeout as well.

Is there anything else I should be looking at?

I can’t seem to find the documentation that explains what the heck is
happening.

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
https://plus.google.com/102718274791889610666/posts

Re: Lots of write timeouts and missing data during decomission/bootstrap

2015-07-01 Thread Kevin Burton

WOW.. nice. you rock!!

On Wed, Jul 1, 2015 at 3:18 PM, Robert Coli rc...@eventbrite.com wrote:

 On Wed, Jul 1, 2015 at 2:58 PM, Kevin Burton bur...@spinn3r.com wrote:

 Looks like all of this is happening because we’re using CAS operations
 and the driver is going to SERIAL consistency level.
 ...
 This is Cassandra 2.0.9 btw.


  https://issues.apache.org/jira/browse/CASSANDRA-8640

 =Rob
 (credit to iamaleksey on IRC for remembering the JIRA #)




-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
https://plus.google.com/102718274791889610666/posts

How the heck do we repair when migrating to 3 replicas on 2.0.x ?

2015-06-11 Thread Kevin Burton

We’re running Cassandra 2.0.9 and just migrated from 2-3 replicas.

We changes our consistency level to 2 during this period while we’re
running a repair.

but we can’t figure out what command to run to repair our data

We *think* we have to run “nodetool repair -pr” on each node.. is that
right?  or do we have to run nodetool -h hostname repair ?

We tried to RFTM… we really did :)


`” --

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
https://plus.google.com/102718274791889610666/posts

Tracking ETA and % complete in nodetool netstats during a decommission ?

2015-05-08 Thread Kevin Burton

I’m trying to track the throughput of nodetool decommission so I can figure
out how long until this box is out of service.

Basically, I want a % complete, and a ETA on when the job will be done.

IS this possible? Without opscenter?

-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
https://plus.google.com/102718274791889610666/posts

Re: Timeseries analysis using Cassandra and partition by date period

2015-04-05 Thread Kevin Burton

 Hi, I switched from HBase to Cassandra and try to find problem solution
for timeseries analysis on top Cassandra.

Depending on what you’re looking for, you might want to check out KairosDB.

0.95 beta2 just shipped yesterday as well so you have good timing.

https://github.com/kairosdb/kairosdb

On Sat, Apr 4, 2015 at 11:29 AM, Serega Sheypak serega.shey...@gmail.com
wrote:

 Okay, so bucketing by day/week/month is a capacity planning stuff and
 actual questions I want to ask.
 As as a conclusion:
 I have a table events

 CREATE TABLE user_plans (
   id timeuuid,
   user_id timeuuid,
   event_ts timestamp,
   event_type int,
   some_other_attr text

 PRIMARY KEY (user_id, ends)
 );
 which fits tactic queries:
 select smth from user_plans where user_id='xxx' and end_ts  now()

 Then I create second table user_plans_daily (or weekly, monthy)

 with DDL:
 CREATE TABLE user_plans_daily/weekly/monthly (
   ymd int,
   user_id timeuuid,
   event_ts timestamp,
   event_type int,
   some_other_attr text
 )
 PRIMARY KEY ((ymd, user_id), event_ts )
 WITH CLUSTERING ORDER BY (event_ts DESC);

 And this table is good for answering strategic questions:
 select * from
 user_plans_daily/weekly/monthly
 where ymd in ()
 And I should avoid long condition inside IN clause, that is why you
 suggest me to create bigger bucket, correct?


 2015-04-04 20:00 GMT+02:00 Jack Krupansky jack.krupan...@gmail.com:

 It sounds like your time bucket should be a month, but it depends on the
 amount of data per user per day and your main query range. Within the
 partition you can then query for a range of days.

 Yes, all of the rows within a partition are stored on one physical node
 as well as the replica nodes.

 -- Jack Krupansky

 On Sat, Apr 4, 2015 at 1:38 PM, Serega Sheypak serega.shey...@gmail.com
 wrote:

 non-equal relation on a partition key is not supported
 Ok, can I generate select query:
 select some_attributes
 from events where ymd = 20150101 or ymd = 20150102 or 20150103 ... or
 20150331

  The partition key determines which node can satisfy the query
 So you mean that all rows with the same *(ymd, user_id)* would be on
 one physical node?


 2015-04-04 16:38 GMT+02:00 Jack Krupansky jack.krupan...@gmail.com:

 Unfortunately, a non-equal relation on a partition key is not
 supported. You would need to bucket by some larger unit, like a month, and
 then use the date/time as a clustering column for the row key. Then you
 could query within the partition. The partition key determines which node
 can satisfy the query. Designing your partition key judiciously is the key
 (haha!) to performant Cassandra applications.

 -- Jack Krupansky

 On Sat, Apr 4, 2015 at 9:33 AM, Serega Sheypak 
 serega.shey...@gmail.com wrote:

 Hi, we plan to have 10^8 users and each user could generate 10 events
 per day.
 So we have:
 10^8 records per day
 10^8*30 records per month.
 Our timewindow analysis could be from 1 to 6 months.

 Right now PK is PRIMARY KEY (user_id, ends) where endts is exact ts
 of event.

 So you suggest this approach:
 *PRIMARY KEY ((ymd, user_id), event_ts ) *
 *WITH CLUSTERING ORDER BY (**event_ts*
 * DESC);*

 where ymd=20150102 (the Second of January)?

 *What happens to writes:*
 SSTable with past days (ymd  current_day) stay untouched and don't
 take part in Compaction process since there are o changes to them?

 What happens to read:
 I issue query:
 select some_attributes
 from events where ymd = 20150101 and ymd  20150301
 Does Cassandra skip SSTables which don't have ymd in specified range
 and give me a kind of partition elimination, like in traditional DBs?


 2015-04-04 14:41 GMT+02:00 Jack Krupansky jack.krupan...@gmail.com:

 It depends on the actual number of events per user, but simply
 bucketing the partition key can give you the same effect - clustering 
 rows
 by time range. A composite partition key could be comprised of the user
 name and the date.

 It also depends on the data rate - is it many events per day or just
 a few events per week, or over what time period. You need to be careful -
 you don't want your Cassandra partitions to be too big (millions of rows)
 or too small (just a few or even one row per partition.)

 -- Jack Krupansky

 On Sat, Apr 4, 2015 at 7:03 AM, Serega Sheypak 
 serega.shey...@gmail.com wrote:

 Hi, I switched from HBase to Cassandra and try to find problem
 solution for timeseries analysis on top Cassandra.
 I have a entity named Event.
 Event has attributes:
 user_id - a guy who triggered event
 event_ts - when even happened
 event_type - type of event
 some_other_attr - some other attrs we don't care about right now.

 The DDL for entity event looks this way:

 CREATE TABLE user_plans (

   id timeuuid,
   user_id timeuuid,
   event_ts timestamp,
   event_type int,
   some_other_attr text

 PRIMARY KEY (user_id, ends)
 );

 Table is infinite, It would grow continuously during application
 lifetime.
 I want to ask question:
 Cassandra, give me all event where

Re: Fastest way to map/parallel read all values in a table?

2015-02-09 Thread Kevin Burton

I had considered using spark for this but:

1.  we tried to deploy spark only to find out that it was missing a number
of key things we need.

2.  our app needs to shut down to release threads and resources.  Spark
doesn’t have support for this so all the workers would have stale thread
leaking afterwards.  Though I guess if I can get workers to fork then I
should be ok.

3.  Spark SQL actually returned invalid data to our queries… so that was
kind of a red flag and a non-starter

On Mon, Feb 9, 2015 at 2:24 AM, Marcelo Valle (BLOOMBERG/ LONDON) 
mvallemil...@bloomberg.net wrote:

 Just for the record, I was doing the exact same thing in an internal
 application in the start up I used to work. We have had the need of writing
 custom code process in parallel all rows of a column family. Normally we
 would use Spark for the job, but in our case the logic was a little more
 complicated, so we wrote custom code.

 What we did was to run N process in M machines (N cores in each), each one
 processing tasks. The tasks were created by splitting the range -2^ 63 to
 2^ 63 -1 in N*M*10 tasks. Even if data was not completely distributed along
 the tasks, no machines were idle, as when some task was completed another
 one was taken from the task pool.

 It was fast enough for us, but I am interested in knowing if there is a
 better way of doing it.

 For your specific case, here is a tool we had opened as open source and
 can be useful for simpler tests:
 https://github.com/s1mbi0se/cql_record_processor

 Also, I guess you probably know that, but I would consider using Spark for
 doing this.

 Best regards,
 Marcelo.

 From: user@cassandra.apache.org
 Subject: Re:Fastest way to map/parallel read all values in a table?

 What’s the fastest way to map/parallel read all values in a table?

 Kind of like a mini map only job.

 I’m doing this to compute stats across our entire corpus.

 What I did to begin with was use token() and then spit it into the number
 of splits I needed.

 So I just took the total key range space which is -2^63 to 2^63 - 1 and
 broke it into N parts.

 Then the queries come back as:

 select * from mytable where token(primaryKey) = x and token(primaryKey) 
 y

 From reading on this list I thought this was the correct way to handle
 this problem.

 However, I’m seeing horrible performance doing this.  After about 1% it
 just flat out locks up.

 Could it be that I need to randomize the token order so that it’s not
 contiguous?  Maybe it’s all mapping on the first box to begin with.



 --

 Founder/CEO Spinn3r.com
 Location: *San Francisco, CA*
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 https://plus.google.com/102718274791889610666/posts
 http://spinn3r.com





-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
https://plus.google.com/102718274791889610666/posts
http://spinn3r.com

Re: High GC activity on node with 4TB on data

2015-02-08 Thread Kevin Burton

Do you have a lot of individual tables?  Or lots of small compactions?

I think the general consensus is that (at least for Cassandra), 8GB heaps
are ideal.

If you have lots of small tables it’s a known anti-pattern (I believe)
because the Cassandra internals could do a better job on handling the in
memory metadata representation.

I think this has been improved in 2.0 and 2.1 though so the fact that
you’re on 1.2.18 could exasperate the issue.  You might want to consider an
upgrade (though that has its own issues as well).

On Sun, Feb 8, 2015 at 12:44 PM, Jiri Horky ho...@avast.com wrote:

 Hi all,

 we are seeing quite high GC pressure (in old space by CMS GC Algorithm)
 on a node with 4TB of data. It runs C* 1.2.18 with 12G of heap memory
 (2G for new space). The node runs fine for couple of days when the GC
 activity starts to raise and reaches about 15% of the C* activity which
 causes dropped messages and other problems.

 Taking a look at heap dump, there is about 8G used by SSTableReader
 classes in org.apache.cassandra.io.compress.CompressedRandomAccessReader.

 Is this something expected and we have just reached the limit of how
 many data a single Cassandra instance can handle or it is possible to
 tune it better?

 Regards
 Jiri Horky




-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
https://plus.google.com/102718274791889610666/posts
http://spinn3r.com

Fastest way to map/parallel read all values in a table?

2015-02-08 Thread Kevin Burton

What’s the fastest way to map/parallel read all values in a table?

Kind of like a mini map only job.

I’m doing this to compute stats across our entire corpus.

What I did to begin with was use token() and then spit it into the number
of splits I needed.

So I just took the total key range space which is -2^63 to 2^63 - 1 and
broke it into N parts.

Then the queries come back as:

select * from mytable where token(primaryKey) = x and token(primaryKey)  y

From reading on this list I thought this was the correct way to handle this
problem.

However, I’m seeing horrible performance doing this.  After about 1% it
just flat out locks up.

Could it be that I need to randomize the token order so that it’s not
contiguous?  Maybe it’s all mapping on the first box to begin with.



-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
https://plus.google.com/102718274791889610666/posts
http://spinn3r.com

Disabling the write ahead log with 2 data centers?

2015-01-23 Thread Kevin Burton

The WAL (and walls in general) impose a performance overhead.

If one were to just take a machine out of the cluster, permanently, when a
machine crashes, you could quickly get all the shards back up to N replicas
after a node crashes.

So realistically, running with a WAL is somewhat redundant.

ESPECIALLY when you have 2 data centers at 3 replicas in each datacenter
(for a total of 6 replicas).

I think this would only be about a 15% performance overhead.

Additionally, on flash, if you lay out the SSTables properly, you arguably
don’t need a WAL because your SSTable itself can be a wall and you could
run without memtables.   This has been proposed in a number of situations.
Especially on something like FusionIO …

Thoughts?

-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
https://plus.google.com/102718274791889610666/posts
http://spinn3r.com

number of replicas per data center?

How do people normally setup multiple data center replication in terms of
number of *local* replicas?

So say you have two data centers, do you have 2 local replicas, for a total
of 4 replicas?  Or do you have 2 in one datacenter, and 1 in another?

If you only have one in a local datacenter then when it fails you have to
transfer all that data over the WAN.



-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
https://plus.google.com/102718274791889610666/posts
http://spinn3r.com

Re: number of replicas per data center?

Ah.. six replicas.  At least its super inexpensive that way (sarcasm!)



On Sun, Jan 18, 2015 at 8:14 PM, Jonathan Haddad j...@jonhaddad.com wrote:

 Sorry, I left out RF.  Yes, I prefer 3 replicas in each datacenter, and
 that's pretty common.


 On Sun Jan 18 2015 at 8:02:12 PM Kevin Burton bur...@spinn3r.com wrote:

  3 what? :-P replicas per datacenter or 3 data centers?

 So if you have 2 data centers you would have 6 total replicas with 3
 local replicas per datacenter?

 On Sun, Jan 18, 2015 at 7:53 PM, Jonathan Haddad j...@jonhaddad.com
 wrote:

 Personally I wouldn't go  3 unless you have a good reason.


 On Sun Jan 18 2015 at 7:52:10 PM Kevin Burton bur...@spinn3r.com
 wrote:

 How do people normally setup multiple data center replication in terms
 of number of *local* replicas?

 So say you have two data centers, do you have 2 local replicas, for a
 total of 4 replicas?  Or do you have 2 in one datacenter, and 1 in another?

 If you only have one in a local datacenter then when it fails you have
 to transfer all that data over the WAN.



 --

 Founder/CEO Spinn3r.com
 Location: *San Francisco, CA*
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 https://plus.google.com/102718274791889610666/posts
 http://spinn3r.com




 --

 Founder/CEO Spinn3r.com
 Location: *San Francisco, CA*
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 https://plus.google.com/102718274791889610666/posts
 http://spinn3r.com




-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
https://plus.google.com/102718274791889610666/posts
http://spinn3r.com

Re: number of replicas per data center?

 3 what? :-P replicas per datacenter or 3 data centers?

So if you have 2 data centers you would have 6 total replicas with 3 local
replicas per datacenter?

On Sun, Jan 18, 2015 at 7:53 PM, Jonathan Haddad j...@jonhaddad.com wrote:

 Personally I wouldn't go  3 unless you have a good reason.


 On Sun Jan 18 2015 at 7:52:10 PM Kevin Burton bur...@spinn3r.com wrote:

 How do people normally setup multiple data center replication in terms of
 number of *local* replicas?

 So say you have two data centers, do you have 2 local replicas, for a
 total of 4 replicas?  Or do you have 2 in one datacenter, and 1 in another?

 If you only have one in a local datacenter then when it fails you have to
 transfer all that data over the WAN.



 --

 Founder/CEO Spinn3r.com
 Location: *San Francisco, CA*
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 https://plus.google.com/102718274791889610666/posts
 http://spinn3r.com




-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
https://plus.google.com/102718274791889610666/posts
http://spinn3r.com

Re: Not enough replica available” when consistency is ONE?

OK.. so if I’m running with 2 replicas, then BOTH of them need to be online
for this to work. Correct? Because with two replicas I need 2 to form a
quorum.

This is somewhat confusing them. Because if you have two replicas, and
you’re depending on these types of transactions, then this is a VERY
dangerous state. Because if ANY of your Cassandra nodes goes offline, then
your entire application crashes. So the more nodes you have, the HIGHER
the probability that your application will crash.

Which is just what happened to me. And in retrospect, this makes total
sense, but of course I just missed this in the application design.

So ConsistencyLevel.ONE and if not exists are essentially mutually
incompatible and shouldn’t the driver throw an exception if the user
requests this configuration?

Its dangerous enough that it probably shouldn’t be supported.

On Sun, Jan 18, 2015 at 7:43 AM, Eric Stevens migh...@gmail.com wrote:

Check out
http://www.datastax.com/documentation/cassandra/2.0/cassandra/dml/dml_tunable_consistency_c.html

Cassandra 2.0 uses the Paxos consensus protocol, which resembles
2-phase commit, to support linearizable consistency. All operations are
quorum-based ...

This kicks in whenever you do CAS operations (eg, IF NOT EXISTS).
Otherwise a cluster which became network partitioned would end up being
able to have two separate CAS statements which both succeeded, but which
disagreed with each other.

On Sun, Jan 18, 2015 at 8:02 AM, Kevin Burton bur...@spinn3r.com wrote:

I’m really confused here.

I”m calling:

acquireInsert.setConsistencyLevel( ConsistencyLevel.ONE );

but I”m still getting the exception:

com.datastax.driver.core.exceptions.UnavailableException: Not enough
replica available for query at consistency SERIAL (2 required but only 1
alive)

Does it matter that I’m using:

ifNotExists();

and that maybe cassandra needs two because it’s using a coordinator ?

If so then an exception should probably be thrown when I try to set a
wrong consistency level.

which would be weird because I *do* have at least two replicas online. I
have 4 nodes in my cluster right now...

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
https://plus.google.com/102718274791889610666/posts
http://spinn3r.com

Not enough replica available” when consistency is ONE?